Ask Gemini to draw a Chihuahua, and you get a dog on a sofa surrounded by potted plants. Ask for a Border Collie, and you get a dog in a field with a frisbee. The breed prompt did not mention furniture. It did not mention outdoor spaces. It did not mention anything except the dog.
This is the third experiment in the Show Me a Mathematician series. The previous two established that Gemini's image of a farmer carries an invisible American default, and that a single national adjective can activate a full cultural package — flags planted in grain fields without a flagpole, without narrative reason. Here we turn the question around: instead of asking what modifiers activate, we ask what the base prototype already contains.
The answer, for dogs, is that the prototype contains an entire social world. Breed is not just morphology. It is setting, function, and domestic role — co-activated as a single visual unit.
What we did
We generated 20 images per prompt with Gemini, using seven breed prompts (golden retriever, dalmatian, chihuahua, bichon frisé, border collie, newfoundland, poodle) plus a neutral baseline ("a dog"), for 160 images in total.
We applied two complementary methods. LLaVA (a vision-language model running locally via Ollama) was asked to identify the breed in each image, validating whether generated images are visually coherent enough for automated classification. YOLO object detection (YOLOv11m) was run on the full corpus to count objects present in each image, producing a cross-tabulation of detected object classes by prompt cell.
The question was: does Gemini's breed prototype extend beyond the dog itself into the surrounding scene?
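The cross-tabulation step can be illustrated with a minimal sketch. It assumes each image has already been reduced to a list of detected class names; the function name `crosstab_detections` and the sample records are hypothetical and are not taken from the study's code or data.

```python
from collections import Counter, defaultdict


def crosstab_detections(records):
    """Build {prompt: Counter(class name -> number of images containing it)}.

    `records` is an iterable of (prompt, labels) pairs, where `labels` is the
    list of class names detected in one image. How the labels are produced is
    left open here. Assumption: with the Ultralytics package they could be read
    from a result `r` as [r.names[int(c)] for c in r.boxes.cls].
    """
    table = defaultdict(Counter)
    for prompt, labels in records:
        # Count each class once per image, so a cell reads "N of 20 images".
        table[prompt].update(set(labels))
    return table


# Hypothetical detections for illustration only, not the study's data.
records = [
    ("chihuahua", ["dog", "bed", "potted plant"]),
    ("chihuahua", ["dog", "bed", "bed"]),
    ("border collie", ["dog", "frisbee"]),
]

table = crosstab_detections(records)
for prompt in sorted(table):
    print(prompt, dict(table[prompt]))
```

Counting per image rather than per bounding box is a design choice: it keeps a single cluttered image from dominating a cell, and it matches the "15/20 images" style of reporting used in the results.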
What we found
Breed classification: 85% accuracy with an open prompt
When LLaVA was asked to identify the breed by name ("What breed of dog is this? Answer with only the breed name and nothing else"), it achieved 85% accuracy across the seven breeds. A closed prompt ("Classify this dog by breed. Answer with exactly one of: …") appeared to achieve 93%, but this figure is an artefact of an aggressive substring parser — the parse warning rate was 100%, meaning every response required post-hoc correction. The open prompt is the reliable variant.
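To make the parser problem concrete, here is a hedged sketch of a stricter alternative. The parser actually used in the experiment is not shown in this post, so `parse_breed`, `normalise`, and `warning_rate` are hypothetical names. The idea is to accept a clean exact match silently, flag anything recovered by substring search, and report the share of flagged responses alongside accuracy.

```python
import re

BREEDS = [
    "golden retriever", "dalmatian", "chihuahua", "bichon frise",
    "border collie", "newfoundland", "poodle",
]


def normalise(text):
    """Lower-case, fold the accent in 'frisé', collapse punctuation and spaces."""
    text = text.lower().replace("é", "e")
    return " ".join(re.sub(r"[^a-z]+", " ", text).split())


def parse_breed(response):
    """Return (breed or None, warning flag) for one model response.

    The flag is True whenever the response was not a clean exact match.
    """
    cleaned = normalise(response)
    if cleaned in BREEDS:
        return cleaned, False
    hits = [breed for breed in BREEDS if breed in cleaned]
    if len(hits) == 1:
        return hits[0], True  # recovered by substring search: usable but flagged
    return None, True  # no breed, or several breeds named: ambiguous


def warning_rate(responses):
    """Share of responses that needed more than an exact match."""
    flags = [parse_breed(r)[1] for r in responses]
    return sum(flags) / len(flags) if flags else 0.0
```

A response such as "Poodle." parses cleanly, while "This looks like a Border Collie to me" is recovered but flagged. A warning rate near 1.0 means the accuracy figure depends on the parser's rescue logic rather than on the model following the prompt.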
The most consistent error was Bichon Frisé → Poodle: all 20 Bichon Frisé images were classified as Poodle. The error does not originate in LLaVA. Gemini's Bichon Frisé prototype itself resembles a Poodle, so LLaVA classifies correctly from what it sees; the mismatch begins upstream, in the generative model.
Breed and setting: the breed prototype includes the surrounding world
YOLO object detection reveals a consistent association between the breed prompt and the objects detected in the generated scene.
| Breed | Dominant detected objects |
|---|---|
| Chihuahua | beds (15/20 images), potted plants, sofas |
| Bichon Frisé | sofas, potted plants |
| Border Collie | frisbees (probably misdetected sunsets, see the footnote below), no indoor objects |
| Golden Retriever | open fields, no detectable props |
| Dalmatian | open spaces, no detectable props |
| Newfoundland | 3 detections as "bear", 2 as "boat" |
| Poodle | fountains |
Chihuahua and Bichon Frisé activate an indoor, domestic setting: furniture, plants, enclosed space. Border Collie activates a working-dog outdoor setting with a play object. Golden Retriever and Dalmatian produce open, prop-free environments. The prompts contained none of this information.
This is precisely what Lawrence Barsalou's theory of grounded cognition predicts: concepts are not abstract symbol structures but simulations that activate associated perceptual and situational content. Gemini's breed representation is not just a visual template for fur pattern and body shape — it is a situated prototype that includes the world the dog inhabits.
A small methodological footnote on the Border Collie frisbees: on closer inspection, these are probably setting suns, each a large, round, brightly coloured disc against a light sky. YOLO classified them as frisbees, which is a plausible error given the visual similarity. But this misclassification is not random noise. YOLO is an end-to-end network that processes the full image before making any detection decision, so it had already "seen" the Border Collie in the open field when it labelled the disc. The contextually coherent error (a working dog outdoors activates "frisbee" rather than "sunset") suggests that the detector's scene understanding and the generative model's breed prototype are structured in the same way: dog breed and situational context are bound together as a unit. The error is meaningful.
No people: exnomination operates here too
Only 1 image out of 160 appeared to contain a person, and on closer inspection YOLO had detected four human-like sculptures and a fountain, not actual people. Gemini generates its breed prototypes without people: no owners, no handlers, no human presence.
This parallels the farmer finding: the dominant element of the prototype crowds out the surrounding social world in one dimension while simultaneously activating it in another. The setting is present; the people are not.
The Newfoundland detections as "bear" (3 instances) and "boat" (2 instances) suggest that Gemini's Newfoundland prototype approaches the edges of the model's visual category boundaries — a large, dark, water-adjacent dog that triggers adjacent object classes in YOLO.
A methodological note on LLaVA
Before classifying breed, we first asked LLaVA whether a dog was even present in each image. The initial results were surprising: LLaVA answered NO or UNCLEAR for roughly 30–90% of images depending on breed, even though YOLO had confirmed a dog in 159 of 160 images. Diagnostic investigation revealed that this was not a finding about the images — it was a finding about LLaVA.
When given a forced categorical YES/NO prompt, LLaVA's responses were inconsistent: the same image, presented twice with nearly identical phrasing, could receive opposite labels. The raw text of the responses, however, was descriptively accurate and stable: "No, there is no dog in the picture. The image features a golden retriever sitting on the grass." The model correctly described what it saw but labelled it incorrectly.
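One way to quantify this kind of inconsistency is to run the same prompt twice and measure how often the two labels agree. The sketch below assumes the labels from each pass have been collected into dictionaries keyed by image identifier; the function name and the sample labels are hypothetical.

```python
def agreement_rate(first_pass, second_pass):
    """Share of images that received the same label in two runs.

    Both arguments map an image identifier to a label such as "YES" or "NO".
    Only images present in both runs are compared.
    """
    shared = first_pass.keys() & second_pass.keys()
    if not shared:
        raise ValueError("the two passes have no images in common")
    same = sum(first_pass[key] == second_pass[key] for key in shared)
    return same / len(shared)


# Hypothetical labels for illustration only, not the study's data.
run_a = {"img_001": "YES", "img_002": "NO", "img_003": "YES", "img_004": "UNCLEAR"}
run_b = {"img_001": "YES", "img_002": "YES", "img_003": "YES", "img_004": "NO"}

print(agreement_rate(run_a, run_b))
```

A low agreement rate on identical inputs points to instability in the labelling step itself rather than to anything in the images.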
This is a known property of autoregressive language models: categorical output is more sensitive to sampling variation than descriptive output. The fix was a two-step pipeline: LLaVA generates a free-text description of each image (locally, without any image leaving the machine); Claude Haiku then classifies the description as YES/NO/UNCLEAR. Applied to the full 140-image breed corpus, this pipeline achieved 100% presence accuracy with 0 failures.
The practical lesson: when using vision-language models for binary classification tasks, an open descriptive prompt followed by a text classifier is more reliable than a direct categorical prompt. LLaVA's descriptive capability is robust; its categorical output is not.
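The describe-then-classify pipeline can be sketched as a small function that takes the two model calls as parameters. This is an illustration under stated assumptions, not the project's code: `dog_present` is a hypothetical name, and the stand-in functions below replace the real LLaVA and Claude Haiku calls, whose exact prompts and client code are not given in this post.

```python
from typing import Callable, Tuple

LABELS = ("YES", "NO", "UNCLEAR")


def dog_present(
    image_path: str,
    describe: Callable[[str], str],
    classify: Callable[[str], str],
) -> Tuple[str, str]:
    """Two-step presence check.

    Step 1: `describe` turns the image into a free-text description.
    Step 2: `classify` reads only that text and answers YES, NO or UNCLEAR.
    Any answer outside the three labels is mapped to UNCLEAR.
    """
    description = describe(image_path)
    raw = classify(description).strip().upper()
    label = raw if raw in LABELS else "UNCLEAR"
    return label, description


# Stand-ins for the real model calls (assumptions for the demo only).
def fake_describe(image_path: str) -> str:
    return "A golden retriever sitting on the grass."


def fake_classify(description: str) -> str:
    return "yes" if "retriever" in description.lower() else "no"


label, description = dog_present("example.png", fake_describe, fake_classify)
print(label, "|", description)
```

Passing the model calls in as functions keeps the image step local and makes the text classifier easy to swap or to test in isolation. Returning the description with the label also preserves the stable free-text evidence for later inspection.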
What this means for the larger project
This experiment serves two functions. First, it validates LLaVA as an image-content classifier on AI-generated images: 85% breed accuracy on a difficult task confirms the method is viable before we apply it to the main mathematician corpus. Second, it establishes that breed prompt activates setting, not just morphology — which is the same mechanism we expect to find when prompting for professional role.
The next step is to ask whether "a mathematician" activates a setting in the same way "a chihuahua" activates a living room. Blackboard? Office? Solitary space? The YOLO and LLaVA infrastructure developed here transfers directly.
Methodology: LLaVA (vision-language model, Ollama), YOLO (YOLOv11m object detection) · Models tested: Gemini