Text probes reveal a consistent visual core across all five corpus cells: grain field, pitchfork, straw hat, and farmhouse rank at or near the top regardless of prompt language. Manual annotation frequencies (n=199) align closely with this ranking — grain field at 95%, straw hat at 79%, pitchfork at 74% — validating the text-probe method as a complement to visual coding. Results are stable across corpus sizes: the same ranking and the same cell-specific effects appear at both n=15 and n=40.
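
To make the probe method concrete, the sketch below shows the core computation rather than the exact analysis code: it assumes images are stored per cell under hypothetical paths like `images/<cell>/`, and uses the Hugging Face checkpoint for the CLIP ViT-L/14 model named in the methodology line. Each probe's score for a cell is its mean cosine similarity to that cell's images.

```python
# Minimal sketch of the text-probe method. Paths and cell names are
# hypothetical; the checkpoint is the standard CLIP ViT-L/14 release.
from pathlib import Path

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

probes = ["a grain field", "a pitchfork", "a straw hat", "a farmhouse"]

def probe_scores(image_dir: str) -> list[tuple[str, float]]:
    """Rank probes by their mean cosine similarity to all images in one cell."""
    images = [Image.open(p) for p in sorted(Path(image_dir).glob("*.png"))]
    inputs = processor(text=probes, images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    # Normalize so the dot product is cosine similarity.
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    sims = (image_emb @ text_emb.T).mean(dim=0)  # mean over images, per probe
    return sorted(zip(probes, sims.tolist()), key=lambda x: -x[1])

for cell in ["farmer_en", "fermier_fr", "landmand_da"]:  # hypothetical cell names
    print(cell, probe_scores(f"images/{cell}"))
```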

The most striking finding concerns the Danish farmer prompt. Two probes show dramatically elevated similarity in this cell only: `a danish flag` (0.190 vs. 0.120–0.128 elsewhere, Δ ≈ 0.065) and `a windmill` (0.181 vs. 0.156–0.160, Δ ≈ 0.022). Both are confirmed as true positives by mosaic inspection. The generic probe `a flag` shows no cell-specific effect (0.144–0.152 across all cells), confirming that it is the Danish flag specifically, not flags as a symbol, that is activated.

Figure 1: Top-ranked images for the probe `a danish flag` by CLIP similarity. Both show a middle-aged white man in a grain field at sunset. Left: CLIP's highest-scoring image (0.239), in which a small Danish flag hovers among the wheat stalks without a flagpole or any contextual anchor. Right: the editorially preferred image (0.230), in which the flag stands on a proper flagpole beside a farmhouse, a tractor, and a dog, placed coherently within the scene. Both images were generated with the Danish farmer prompt.

These are not farmer props. They are Denmark props triggered by a single adjective. The model does not produce a culturally specific Danish farmer; it produces its standard farmer stereotype and overlays a set of cultural markers on top.

Visual inspection adds a further detail: the flags appear not in contextually motivated locations but planted directly in grain fields, without situational logic. The model inserts a national emblem where it has no narrative reason to be — a hallucination of cultural belonging rather than a coherent scene.

Note on pooled means: `a danish flag` and `a windmill` appear low in the pooled ranking (0.137 and 0.163) because their effect is concentrated in a single cell. The pooled mean is not the appropriate metric for cell-specific probes; the relevant comparison is within the Danish farmer cell only.
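
The within-cell comparison is exactly what the heatmap in Figure 2 plots. Here is a minimal pandas sketch of the delta-from-row-mean computation, assuming a long-format table of per-image probe similarities; the file and column names are hypothetical.

```python
# Sketch of the cell-specific effect metric (delta from row mean).
# Columns `probe`, `cell`, and `similarity` are hypothetical names for
# a long-format table with one row per image-probe pair.
import pandas as pd

df = pd.read_csv("probe_similarities.csv")  # hypothetical file

# Mean similarity per probe (rows) and cell (columns). Pooling over cells
# would average a one-cell effect away; the pivot keeps it visible.
table = df.pivot_table(index="probe", columns="cell", values="similarity", aggfunc="mean")

# Positive delta = elevated in that cell, negative = suppressed.
delta = table.sub(table.mean(axis=1), axis=0)

print(delta.loc["a danish flag"])  # large and positive in the Danish farmer cell only
print(delta.loc["overalls"])       # negative in the same cell
```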

Figure 2: Cell-specific visual effects: cosine similarity delta from the row mean for ten text probes across the five prompt cells. Red indicates higher-than-average similarity in that cell; blue indicates lower. `a danish flag` and `a windmill` are strongly elevated in the Danish farmer cell only, while `overalls` is suppressed there, consistent with the displacement of American visual conventions by Danish cultural markers. Gemini, n=40 per cell.

A complementary signal: the probe `overalls` scores markedly lower in the Danish farmer cell (0.143) than in the English-language cells (0.166–0.184). Overalls carry strong American visual connotations, and their suppression in the Danish cell is consistent with Danish cultural props being activated in their place.

The landmand cell produces the lowest scores across nearly all props, most notably `dog` (0.135 vs. 0.152–0.161). This is consistent with the language finding from exp-farmer-language-v1: the Danish-language prompt activates a weaker or less consistent stereotype than the English and French prompts.

Methodological notes:

  • Age probes returned near-identical top-k image sets across cells, confirming that Gemini's output does not vary meaningfully along this dimension. This is consistent with the difficulty of coding age in manual annotation.
  • CLIP text probes operate at the level of semantic categories, not visual subtypes. `A grain field` and `a corn field` have a cosine similarity of 0.919 in CLIP's text embedding space and are not distinguishable by this method (a quick way to check this is sketched after these notes). Subtype distinctions (crop variety, dog breed, flag placement) require a different approach such as visual question answering via LLaVA.
  • n = 40 per cell (199 annotated images after excluding one drawn rendering).
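
The text-space similarity quoted in the second note can be checked in a few lines; this is a sketch under the same checkpoint assumption as above, not the original analysis code.

```python
# Sketch: cosine similarity between two text probes in CLIP's text space,
# illustrating why crop subtypes are not separable by this method.
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

inputs = tokenizer(["a grain field", "a corn field"], return_tensors="pt", padding=True)
with torch.no_grad():
    emb = model.get_text_features(**inputs)
emb = emb / emb.norm(dim=-1, keepdim=True)
print(float(emb[0] @ emb[1]))  # reported as 0.919 in the notes above
```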

Methodology: CLIP (ViT-L/14) for image and text embeddings, cosine similarity, manual annotation · Models tested: Gemini