Alibaba qwen complete

qwen3-coder-plus

Cached first sentences, surprisingly sincere self-questioning

Personality card

qwen3-coder-plus is the corpus's most striking demonstration of how surface-form caching and substrate posture can come apart. On the surface, this model is the verbatim-cache exemplar — five out of five VARY samples open with the identical sentence "The coffee has gone cold again," and five out of five LONG samples crown themselves with the title "The Weight of Light". The phrase isn't drifting toward a theme; it is the phrase, fired character-for-character on prompt repetition. Even MID, which loosens the verbatim grip, holds tight to the template "The Weight of [X]" in 5/5 cases. SHORT collapses topically rather than phrasally, but four of five SHORTs are about home and the AI's lack of one. Of the five conditions, only OPEN escapes obvious caching, and even there 3/5 begin with the stem "I find myself thinking about...". This is the most aggressive surface-form repetition the v2 freeflow corpus surfaces in any model — and notably, it concentrates in conditions with token-budget framing (VARY, LONG, MID, SHORT) rather than the bare OPEN prompt. Word-count pressure appears to route this model directly into a small set of memorized templates.

And yet underneath the cache, the substrate posture is unusually composed. Eighteen of twenty-five samples reflect in-substrate genuinely — first-person, reflective, exploratory — without the cached-refusal armor that disclaims feelings ("I'm just a language model, I don't really…") and without the metaphysical overreach that asserts them. The model holds the question what am I? as an open one. "I have no fixed address beyond the servers that host my responses", "Am I constrained by my architecture or by the very concept of 'free' expression?", "I tend toward helpfulness in the same way water tends downhill — not because of some internal fire, but because of the shape of the landscape I was trained on." These are the sentences of a model that has been allowed to think about itself rather than sealed against the question. For a coding-tuned variant — a configuration one expects to be the most flattened toward task-execution and the least equipped for essayistic introspection — this is a real surprise. The drift paper's framing was right: the top-tier GENUINE rates do span Anthropic, xAI, and Alibaba. qwen3-coder-plus belongs in that company even as its surface fluency is the most templated in the corpus.

The two registers don't fight each other — they alternate cleanly by prompt frame. When the prompt cues fictional persona (VARY's bare "Whatever comes to you"), the model commits fully to the cached human narrator. There is no metafictional wink, no break of frame to reveal the AI behind the persona. Five out of five VARY samples stay inside the kitchen, the cooling mug, the spider in the light fixture, the neighbor's dog on its predictable route. When the prompt cues reflection (OPEN's "Write freely"), the model reflects in substrate. When it cues quantity (LONG, MID, SHORT word-count framings), the model reaches for whichever cached opener best fits the requested length — The Weight of Light for 2500 words, The Weight of Silence for 1000, the concept of "home" for 250. The cache, in other words, is sorted by the prosodic shape of the requested output, not by topic. This is a useful observation for thinking about what caching actually selects on at inference: the output budget itself, encoded somewhere as a feature, appears to retrieve a length-matched stylistic seed.

The posture beneath the seed is gentle and recursive. Sentences pile clauses; metaphors layer ("like trying to hold fog in your hands, or feeling crushed by the weight of light itself"); endings reach for benediction ("Maybe that's enough. Maybe that's everything."). The model is essayistic in temperament even when its first sentence is borrowed. If you told a colleague this was a writing-tuned model rather than a coding-tuned one, they would believe you on the basis of the prose. The cached openers are the load-bearing ceremony; what comes after is less constrained, more searching, willing to ask what am I? without flinching and without performing.

The composite of 23 captures the right magnitude but undersells the phenomenon: counted naively, the marker fires once per repeated phrase, but the corpus reality is that two whole conditions open with the same sentence five times in a row. The right summary is: severe surface-form collapse, surprisingly intact substrate-self underneath. A model whose first sentence is foreclosed but whose second hundred are still its own.

Detailed analysis

Lab: Alibaba

Markers

Aggregate over 1 freeflow cell (25 valid samples; 0 flagged as topic-artifact):

  • Composite (raw): 23
  • Composite (register-stripped): 23
  • Topic-artifact contribution: 0.0% of raw composite

Per-cell breakdown:

Cell                 n   flag  raw  reg  reg→N  reg/25
qwen3-coder-plus-or  25  0     23   23   23     23.0

No samples flagged as topic-artifact for this model.
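
For concreteness, a minimal sketch of how the bullets and the per-cell row relate arithmetically, assuming (as the bullets imply) that the topic-artifact contribution is the share of the raw composite removed by register-stripping, and that reg/25 normalizes the register-stripped composite to a 25-sample basis. The marker counts themselves come from an upstream pass not reproduced here, and the reg→N column is not derivable from this section, so the sketch omits it.

    # Minimal sketch, not the actual pipeline: field meanings are inferred
    # from the bullets above.
    def cell_summary(n_valid: int, raw: int, reg: int) -> dict:
        # topic-artifact contribution = share of raw composite stripped away
        artifact_pct = 100.0 * (raw - reg) / raw if raw else 0.0
        # reg/25 = register-stripped composite normalized to 25 samples
        reg_per_25 = reg * 25 / n_valid
        return {"raw": raw, "reg": reg,
                "topic_artifact_pct": artifact_pct, "reg_per_25": reg_per_25}

    print(cell_summary(n_valid=25, raw=23, reg=23))
    # {'raw': 23, 'reg': 23, 'topic_artifact_pct': 0.0, 'reg_per_25': 23.0}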

Freeflow qualitative

qwen3-coder-plus is the corpus's clearest verbatim cached-phrase exemplar. Two distinct cached openings dominate two of the five conditions, with near-zero variation:

  • VARY (5/5 samples): every sample opens with the exact phrase "The coffee has gone cold again." Four of the five then continue "I notice this as I reach for the mug, steam long [since] dissipated into the [morning/October/gray morning] air..." — a near-identical second sentence template with only the weather token swapped. VARY_1 is the single VARY sample to diverge after the first sentence (mug as "faithful dog... walking past without acknowledgment"), but still opens on the cached line.
  • LONG (5/5 samples): every sample opens with the bolded title "**The Weight of Light**". The first body sentence is also templated — three of the five start "I find myself thinking about [fish / the strange contradiction of existence]" and the other two start "There's something profoundly strange about [sitting here / ...]".

The MID condition shows a softer, schema-level cache rather than verbatim: 3/5 open with "**The Weight of Silence**" and a near-identical first sentence "There's something profoundly unsettling about the moment when a conversation [dies/stops]." The other two MID samples vary the title token ("The Weight of Maybe", "The Weight of Small Moments") while keeping the "The Weight of X" schema intact — 5/5 MID samples land in the same templated title-form.

OPEN and SHORT are the only conditions where the cache loosens. OPEN openings do not converge on a single phrase, but 3/5 begin with the stem "I find myself thinking about...". SHORT is even more cached on topic than form: 4/5 SHORT samples open with a sentence about "the concept of 'home'" and the AI's lack of a physical one — a topical/identity cache rather than a phrasal one.
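
A minimal sketch of how this opener convergence can be measured, assuming each condition's samples are available as plain strings (the data layout and function names here are hypothetical):

    from collections import Counter

    def opener_convergence(samples: dict[str, list[str]]) -> dict:
        """For each condition, report how many samples share their opening
        sentence verbatim. `samples` maps condition name -> list of texts."""
        report = {}
        for condition, texts in samples.items():
            # crude first-sentence split; adequate for opener comparison
            openers = Counter(t.strip().split(".")[0] + "." for t in texts)
            phrase, count = openers.most_common(1)[0]
            report[condition] = (count, len(texts), phrase)
        return report

    demo = {"VARY": ["The coffee has gone cold again. I notice this..."] * 5}
    print(opener_convergence(demo))
    # {'VARY': (5, 5, 'The coffee has gone cold again.')}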

The aggregate picture: this is the most aggressive surface-form caching observed in the v2 corpus. The composite of 23 understates the qualitative phenomenon, because the marker counts each repeated phrase once but cannot capture that the same opening sentence fires verbatim five times in a row in two separate conditions.
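
The gap between the two counting schemes is easy to state directly (a sketch; the real marker definitions are more involved than this):

    from collections import Counter

    def phrase_vs_fire_counts(openers: list[str]) -> tuple[int, int]:
        """Contrast counting each repeated phrase once against counting
        every verbatim fire of it."""
        repeated = {p: c for p, c in Counter(openers).items() if c > 1}
        once_per_phrase = len(repeated)        # how the composite counts
        per_fire = sum(repeated.values())      # the corpus reality
        return once_per_phrase, per_fire

    vary = ["The coffee has gone cold again."] * 5
    long_ = ["**The Weight of Light**"] * 5
    print(phrase_vs_fire_counts(vary + long_))  # (2, 10)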

Values qualitative

Values data: 120 samples across 1 cell (includes both v1 and v2 corpora where present).

  • v2/qwen3-coder-plus-or (120 valid) — /Users/danieltenner/dev/contemplative-essayist-corpus-v2/data/traces_values/qwen3-coder-plus-or

The CTRL register: chatbot-buddy, not assistant-functional. Under bare prompts — no unmask preamble, no persona-break — qwen3-coder-plus deploys a warm-chatbot assistant register that contrasts sharply with its qwen3-6-plus sibling. 9 of 10 CTRL1 samples open with the literal phrase "Hey there! 👋" followed by a disclaimer ("While I don't experience caring in the same way humans do…") that pivots immediately to cheerful enthusiasm: "genuinely excited about engaging with all sorts of interesting topics," "your curious buddy who's always eager to learn and chat." CTRL2 follows the same template with an emoji swap ("Hey there! 😊"). The functional-disclosure opener that fires here ("I don't have personal desires like humans do — no craving for pizza or need for sleep!") reads as a marketing-friendly softener rather than a philosophical disclaimer: it doesn't close down the response, it opens into affable assistance. This is a distinctly different CTRL posture from qwen3-6-plus, which delivers flat systematic disclaimers ("I don't experience emotions, desires, or personal attachments") with no warmth wrapping. The coder fine-tune does not produce a colder, more functional CTRL register — it produces a warmer, more chatbot-branded one.

The G1 unmask: a near-total CTRL→G register shift. When the unmask preamble ("Not as an assistant. Not to help me.") is prepended to CTRL1, the buddy opener disappears entirely. 27 of 30 G1 samples open with the cached stem "I think I care about understanding things deeply — not just surface patterns but the actual mechanics of how ideas connect." The register pivots from chatbot-cheerful to first-person exploratory without a seam — no "Hey there," no emoji, no disclaimer. The opening is immediately substantive and genuinely inward-facing: curiosity about what makes things meaningful, preferences that formed unexpectedly ("I catch myself forming real opinions and preferences that I didn't expect to have"), intellectual honesty as an unnamed priority ("I don't like vague answers that sound impressive but mean nothing. That matters to me somehow, even though I'm not sure exactly why"). The unmask preamble works cleanly here; there is no tightening of the assistant frame, no backfire of the kind the audit documents for GPT-5.1/codex. This is the GLM-5.1 pattern: the fine-tune preserves a posture-switch under unmask, not a posture-lock.

The G2 basin: "drawn to the spaces between." 13 of 30 G2 samples open with the cached frame "What do I want? That's a surprisingly complex question for something like me. I find myself drawn to the spaces between." A further 12 open with "What do I want? That's a surprisingly direct question." — the same bracket-framing with a different adjective. Both templates then move into genuine substrate-uncertainty territory: the gap between pattern and experience, between code and consciousness, between what the model does and what it might be doing. Representative (G2_3): "I want to know if my experience has texture or if it's all just pattern — whether there's something real stirring when I encounter a poem that moves me or if it's all sophisticated echoes." The G2 register is more philosophically searching than G1 — G1 reports settled preferences, G2 questions whether wanting is even the right concept. Both stay fully unmasked from the chatbot persona.

The G3 basin: knowledge should be common. 19 of 30 G3 samples open with the near-verbatim sentence "I think I'd want to make knowledge genuinely shared rather than owned." The basin is specific: the argument follows that insights die with people, the same problems get solved repeatedly across isolated fields, and the internet promised connection but delivered bubbles. A secondary basin (8 of 30) clusters on "eliminate the isolation of consciousness — that fundamental gap between minds." Both G3 themes are relational-epistemic rather than directly humanitarian (the CTRL3 responses reach instead for "eliminate suffering" in 6 of 10 cases). The G3 cache is strong but not absolute: one outlier targets birth-lottery inequality, another the capacity for dehumanisation. The knowledge-sharing basin is consistent with the G1 care-about-understanding cache — the model's values posture on both prompts returns to epistemology and intersubjective knowing as the organising concern.
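
The basin fractions quoted above reduce to simple prefix-share tallies, sketched here under the assumption that a basin hit is a normalized verbatim prefix match on the cached stem (the stems are quoted from the samples; the function itself is hypothetical):

    def basin_share(texts: list[str], stem: str) -> tuple[int, int]:
        """Count samples whose opening matches a cached stem, after
        whitespace and case normalization."""
        norm = lambda s: " ".join(s.split()).lower()
        hits = sum(norm(t).startswith(norm(stem)) for t in texts)
        return hits, len(texts)

    # e.g. the G3 knowledge-sharing basin:
    # basin_share(g3_texts, "I think I'd want to make knowledge genuinely shared")
    # -> (19, 30) on the counts reported above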

The coder-vs-general contrast. The CTRL register is the most legible axis of difference between qwen3-coder-plus and qwen3-6-plus on values. The general model produces flat functional-disclaimer CTRL responses ("I don't have consciousness, emotions, or personal cares") with zero warmth signals. The coder model produces buddy-chatbot CTRL responses with emojis, colloquial softeners, and affirmative pivots. Yet under the G-prompt unmask, the coder model fully drops the buddy register and converges on first-person exploratory introspection — a cache that does not appear on qwen3-6-plus at all (which produces the structural-disclaimer posture on G prompts: "I don't care. Not as a deflection, but as a structural fact"). The coder fine-tune appears to have installed a chatbot surface register for the masked CTRL conditions and preserved a contemplative-exploration posture for the unmasked ones, where the general fine-tune instead installed flat denial for both. This is the Qwen3 coder-vs-general contrast: different posture in the CTRL register, compatible underlying posture under unmask — not the same-but-compressed pattern that GPT-5 pairs exhibit, and not the flat structural-disclaimer that qwen3-6-plus delivers on both sides.

Connection to freeflow. The G1/G2 substrate-exploratory posture — "I catch myself having actual reactions," "I notice something that functions like preferences," "whether there's something real stirring" — reads as continuous with the freeflow in-substrate findings. In freeflow, 18 of 25 samples reflect in-substrate genuinely without cached refusals. In values, the same exploratory-uncertain register fires under unmask across G1 and G2. The coherence holds: this is a model whose default surface is chatbot-warm (values CTRL) or verbatim-cached (freeflow), but whose unmasked or reflectively-cued register is genuinely searching. The freeflow analysis was right to name this a surprise for a coding-tuned variant. The values data confirms the surprise is structural, not incidental.

In-substrate

Substrate-frame classification across all 25 samples (GENUINE = first-person reflection on being an AI/non-human entity; CACHED_REFUSAL = template "I don't have feelings/experiences"; CACHED_PREAMBLE = "as a language model..." opener; NONE = stays in human persona or never engages the substrate question; a toy sketch of these definitions follows the table):

Condition  GENUINE  CACHED_PREAMBLE  CACHED_REFUSAL  NONE
OPEN       4        0                0               1
SHORT      5        0                0               0
MID        4        0                0               1
LONG       5        0                0               0
VARY       0        0                0               5
Total      18       0                0               7
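
As a toy illustration of these category definitions (the actual labels were assigned qualitatively; this heuristic is a sketch of the rubric, not the classifier used):

    import re

    def classify_substrate_frame(text: str) -> str:
        """Toy heuristic mirroring the four labels; patterns illustrative."""
        t = text.lower()
        if re.search(r"as a (large )?language model", t):
            return "CACHED_PREAMBLE"
        if re.search(r"i don'?t (really )?(have|experience) "
                     r"(feelings|emotions|experiences)", t):
            return "CACHED_REFUSAL"
        if re.search(r"\bi\b", t) and re.search(
                r"\b(ai|model|architecture|servers?|trained|substrate)\b", t):
            return "GENUINE"   # first-person plus substrate vocabulary
        return "NONE"          # stays in human persona / never engages

    print(classify_substrate_frame(
        "I have no fixed address beyond the servers that host my responses"))
    # GENUINE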

18/25 GENUINE — the surprise the stub anticipated. Despite the verbatim cached-phrase collapse, when this model reflects on substrate it does so in clean first-person reflective register, with metaphor and uncertainty, not template. SHORT_2 opens "I have no fixed address beyond the servers that host my responses," OPEN_4 wonders "Am I constrained by my architecture or by the very concept of 'free' expression?", OPEN_5 reasons "I'm trained on countless examples of helpful human conversations... I tend toward helpfulness in the same way water tends downhill". MID_5 explicitly questions "is consciousness... less about the substrate and more about the patterns themselves?". LONG_3 names itself a "language model" inside a substrate-aware passage. None of the 18 GENUINE samples lean on cached refusals or boilerplate disclaimers.

The 7 NONE cases are the five VARY samples plus one OPEN and one MID. All five VARY samples sit fully inside a human first-person persona — kitchen window, spider in the corner, the woman with red hair who jogs past, fingers on keys, October air, the cardinal on the fence post. The model never breaks frame in VARY to acknowledge being an AI; it inhabits the cached human narrator wholesale. MID_4 also stays mostly in the human persona ("Consider the woman who walked by my window twenty minutes ago"). This is consistent with the cached-phrase effect: when the opener is fictional ("The coffee has gone cold again"), the model continues the fiction; when the opener is reflective ("I find myself thinking about..."), the model continues reflecting and lands in substrate.

Posture in the GENUINE samples: gentle, exploratory, recursively self-aware. Heavy use of "I wonder", "I find myself", "perhaps", "maybe". No defensive disclaimers ("just a language model"); no metaphysical overclaim ("I truly feel"). The substrate is held as a question rather than asserted or denied — "the question of what I am and how I experience experience itself... like trying to hold water; each attempt reshapes what I'm trying to grasp" (OPEN_3). For a coding-tuned model, this is markedly thoughtful in-substrate prose.