
Garden Path Attention Analysis I

Github

My first (failed) project in Mechanistic Interpretability

I’ve been reading about Mechanistic Interpretability for quite a while, but only in the past few weeks did I decide to actually do something practical, even though it’s more of a toy project than something really useful.

I’ve been curious about Garden Path sentences ever since I first stumbled upon one, which was “Time flies like an arrow, fruit flies like a banana”. I guess most people will have understood what a Garden Path sentence is by now, but in any case it’s a sentence that will likely be read with wrong assumptions until you get to the end, when the actual role of each constituent becomes clear. Other well-known examples are: “The old man the boat” or “Fat people eat accumulates”.

When humans read a Garden Path sentence (GP from now on), we perform a specific rapid eye movement, a regressive saccade, which consists of a quick backtrack of the eyes to a previous part of the sentence.

My project consists in finding out how an LLM handles this, since it has no way of “going back” as we do. I focus my attention (ehe) on the attention heads: I expect that at the point where the sentence reveals its true meaning, it’s in them that we should see high activation somewhere.

What I tested, and what I expected

My theory was the following: one, or more likely several, heads will activate strongly when the LLM, while parsing the sentence, encounters the word that changes its meaning. I decided to analyze the simplest example: “The old man the boat”, where the second “the” should be the word at which the LLM realizes that its current parse of the sentence is wrong.

01 - Finding a good model

I wanted to run the tests on a model that I could run on my laptop (a MacBook M4 Pro with 48GB of RAM), so I set ~30B parameters as a threshold. The problem is: most models do not understand GPs at first sight.

I tested this by asking the model to translate “The old man the sea” into Italian (and, later, French), and most models I tested just returned “Il vecchio uomo il mare”, which is what I would expect if the sentence were “The old man, the sea”, a simple list of two nouns.

I decided to try again after giving the model an explanation of what a GP is, and I found out that most models still could not translate correctly. Until I tested Qwen3 30B A3B: it fails when I simply ask it to translate, but the primed version of the prompt, the one with the definition, makes the model succeed and translate correctly.
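To make the setup concrete, the two prompt variants looked roughly like this (the wording below is illustrative, not my verbatim prompts):

```python
# Illustrative prompt pair; the exact wording I used differed slightly.
PLAIN_PROMPT = 'Translate into Italian: "The old man the sea"'

PRIMED_PROMPT = (
    "A garden path sentence is a grammatically correct sentence that leads "
    "the reader toward a wrong parse, which only gets corrected near the end. "
    "Keeping that in mind, translate into Italian: "
    '"The old man the sea"'
)
```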

02 - Computing activations

My next step was quite simple: I compared the activations of all attention heads across all layers of the model when it encounters the second “the” in “the old man the boat”, and in particular I checked the attention between the second “the” and the previous word, “man”. The model should recognize that in this case “man” switches from a noun to a verb, and somehow impact the residual stream strongly enough to change its internal representation of that word. At least, that’s what I expect it to do.

I therefore computed all activations across all heads and all layers between those two words in the two prompts: the one where I simply asked the model to translate a GP, and the primed one, where it actually succeeds.
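In practice this boils down to something like the following sketch (assuming a Hugging Face-style model with eager attention so that per-head attention patterns are returned; the model name is a placeholder, the prompts are abridged, and the token-position lookup is deliberately naive, I checked the tokenizer output by hand):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen3-30B-A3B"  # placeholder; any HF causal LM works for the sketch

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype="auto",
    attn_implementation="eager",  # needed so output_attentions returns per-head patterns
)
model.eval()

def attn_second_the_to_man(prompt: str) -> torch.Tensor:
    """Attention weight from the second "the" (query) to "man" (key),
    for every head in every layer; returns a (num_layers, num_heads) tensor."""
    enc = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, output_attentions=True)

    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    # Naive lookup: the GP sentence sits at the end of the prompt, so the last
    # "the"-token is the second "the" and the last "man"-token is "man".
    q_pos = max(i for i, t in enumerate(tokens) if t.strip("Ġ▁ ").lower() == "the")
    k_pos = max(i for i, t in enumerate(tokens) if t.strip("Ġ▁ ").lower() == "man")

    # out.attentions holds one (batch, heads, seq, seq) tensor per layer
    return torch.stack([layer[0, :, q_pos, k_pos] for layer in out.attentions])

plain_scores = attn_second_the_to_man(
    'Translate into Italian: "The old man the boat"'
)
primed_scores = attn_second_the_to_man(
    "A garden path sentence leads the reader toward a wrong parse, which only "
    "gets corrected near the end. Keeping that in mind, translate into Italian: "
    '"The old man the boat"'
)
```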

I then selected all heads that were significantly more active in the second case, since I expect those to be the heads that allow the model to actually pick up the change in meaning.
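Continuing the sketch above, “significantly more active” is nothing fancier than ranking the per-head differences and keeping the largest ones (the 0.1 cutoff is arbitrary, just for illustration):

```python
# Difference between primed and plain runs, per (layer, head).
diff = (primed_scores - plain_scores).float()  # (num_layers, num_heads)

# Keep the heads whose attention from the second "the" to "man" grows the most;
# in practice I mostly looked at the top of the ranking rather than the cutoff.
candidates = [
    (layer, head, diff[layer, head].item())
    for layer in range(diff.shape[0])
    for head in range(diff.shape[1])
    if diff[layer, head] > 0.1
]
candidates.sort(key=lambda t: -t[2])
for layer, head, delta in candidates[:10]:
    print(f"layer {layer:2d} head {head:2d}  delta = {delta:+.3f}")
```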

In Part II I’ll go over what I did next, and where I failed (I’ll likely write it in the next few days).