
Duncan Parrish is an I’m Learning Mandarin contributor and coordinator of Arti Languages, a project using mobile games and cognitive science to enhance how we immerse in Chinese. In this post, he delves deep into what neuroscience research tells us about the way adults learn Mandarin.
“Let me tell you a story about my childhood in Shanghai…” said the rather distinguished old lady being interviewed. The reporter must have leaned forward, expecting some fascinating insight. I too leaned forward, but sadly that was because I was struggling to follow the video clip in Chinese.
“让我跟你讲讲我小时候在上海的故事 Ràng wǒ gēn nǐ jiǎng jiǎng wǒ xiǎoshíhòu zài shànghǎi de gùshì” she had said. I understood all the vocabulary (I learned later), but found it too fast to take in: did she say “let me something a story”? Crucially, I had missed the word “childhood” that was essential to following what she said next.
A year or two later, I happened to watch the same video clip, but my experience was completely different: I noticed she spoke clearly and extremely slowly – perhaps this was why the interview had been recommended to me earlier. There seemed to be all the time in the world to notice the word 小时候 (xiǎoshíhòu “childhood”) and follow the conversation easily.
What had changed? Certainly not this particular vocabulary or my understanding of basic grammar. The answer is that in the intervening time my brain had changed, and with it my entire experience of listening: this simple Mandarin now seemed clear and unhurried.
Understanding what causes such a change in terms of the neuroscience of language learning – how our brains change as we acquire language as an adult – tells us a lot about how to best learn a language, and also how we should expect our experience of listening and speaking to change over time. As we examine the different brain regions processing speech, I will also try to take note of the relevance of each stage to the learning of Chinese in particular.
Developments in Neuroscience
The neuroscience of language has come a long way in the last twenty years, but this isn’t widely discussed in the language learning community. Progress was long held back for two reasons: firstly, language exists only in humans, so it can only be studied using complex modern technology such as fMRI (functional Magnetic Resonance Imaging). As the old neuroscientist’s joke has it: the study of language suffers from a distinct shortage of talking mice.
Secondly, there is a high degree of variation between different people in the exact location of language functions, so determining exactly which brain regions are activated requires complex analysis. Both these difficulties have been increasingly overcome with new techniques, and so the old models of language in the brain have been supplanted by more sophisticated ideas.
The Old Model: Wernicke’s and Broca’s areas
The old model simply described a brain region called Wernicke’s area in the left temporal lobe (just above your left ear) that handled Speech comprehension (listening), while a region called Broca’s area, further forward, above and behind your left eye, controlled Speech production (speaking). This model – originally based on observations of brain damage in the 19th century – is still widely taught, but as there is no longer any agreement on exactly where these regions are and what they do, most specialists now avoid these terms.
The “Dual Stream” Model
Instead, sensory processing of both sound and vision is now often described in terms of a “dual stream” model. In vision, initial processing by the Primary Visual Cortex leads to activity in both a “ventral” stream (around the side and underneath the brain) and a “dorsal” stream (over the top of the brain). The visual ventral stream proceeds through the temporal lobe (an area long-known to be important in speech comprehension) and functions primarily as a system for identifying and categorising what you are looking at. The visual dorsal stream seems to handle how what you are looking at is moving, and how you might, for example, interact with it and touch it.
It seems that the auditory system also has “what” and “how” streams, similarly called the “ventral” and “dorsal” streams. The most important stream for Speech comprehension is the ventral stream, the “what” stream, identifying what we’re listening to: in this context, converting sound into understandable speech. The dorsal stream, by contrast, connects to motor areas of the brain, indicating how we could repeat or imitate the sounds we’ve just heard, though it also seems to have a variety of other functions.
The Primary Auditory Cortex
Both streams start in the Primary Auditory Cortex, at the top of the temporal lobe (near our ears, unsurprisingly) on either side of the brain. Different functions often appear on different sides of the brain (in neuroscience-speak, they are “lateralized”), and here the left side tends to specialise in analysing the length of sounds, and the right side their pitch. This applies to any sound, such as music, but these regions also rapidly distinguish what is human speech and what is not – to the extent that certain kinds of brain damage can prevent you from hearing words and only words (so-called “word deafness”), while other damage can impair your hearing of sounds other than words.
A recent experiment at the University of Delaware showed the importance of these areas to learning Chinese. Through fMRI analysis of volunteers learning Chinese for the first time, researchers were able to predict which students were likely to do best: those who had highly activated pitch regions, on the right side of the brain, for the first few weeks of listening to Chinese. This suggests they were paying attention to tones more than the rest of the cohort. After this period, the effect reduced, with most activity occurring around the left temporal lobe, in a standard language activation pattern. This may indicate that their brains eventually adapted to treat the pitch of the language they were listening to (ie Chinese tones) as an integral part of the meaning of the words they were hearing.
Language processing proceeds from the auditory cortex through a series of regions in the temporal lobe, the core of our “language network”. These areas are still under research, but the outlines of their function are becoming clearer in recent years, and the developing “Dual Stream” model sets out a number of distinct components.
The first component is the Phonological network, which picks out the exact sounds of the words being uttered, after which the ventral and dorsal streams split up. The ventral stream then continues through the second component, the Lexical interface, to determine the meaning of the word, and later Combinatorial networks, which start to analyse the sense of the phrase the words belong to.
Finally, after some further grammatical analysis, the outcome of what we’ve heard is presented to our Working Memory in the front of the brain, the first time we become “conscious” of what has been said and what it might mean.

The Phonological network
The Phonological network takes information from the Primary Auditory Cortex and determines what word-sounds we are listening to. This involves first determining where words start and end (“word segmentation”) then determining exactly what those words are (“word identification”). When we understand a language fluently, we experience words as clearly distinct from each other and pronounced relatively slowly. But when we listen to a language we don’t know, or are just beginning to learn, the sounds appear like a bewildering continuous stream of noise.
The same word can be said by different people in different ways. The word “right” can be missing the final “t” if spoken by a Cockney, but sound like “raaat” if pronounced with a Southern US accent. Similar problems confront the listener in Chinese: a “shi2” syllable might be pronounced “si2” in some accents for example. The phonological network has to allow for this, at high speeds of six or more syllables per second, accurately mapping each sound to its correct phoneme, automatically, before we’re even aware of it.
A developed phonological network can rely on many different cues to quickly determine, with near certainty, which syllable or word was pronounced. For example, it can use subtle acoustic cues, or even integrate visual information from the speaker’s moving lips. It can also use “predictive” processing from other parts of the Speech comprehension process to anticipate the words likely to come next, “pre-activating” the set of neurons that represents each word-sound.
To process any language quickly enough to follow a conversation, we need well-defined phonological sets of neurons that can accurately recognise each word as we hear it spoken – say, for example, the Chinese word 香蕉 (xiāngjiāo “banana”). This can only happen with lots of listening practice, which should be at the centre of any early language learning regime, training our phonological network to distinguish the different sounds and words of the language.
In particular with Chinese, tones are crucial, as we saw in the Delaware experiment above. It is even possible that learners who ignore tones at the outset are setting up a much-reduced phonological network for Chinese over the long-term, with only one neuron-pattern that can be activated for “xiang”, rather than separate encoded representations for “xiang1” (香 “fragrant”) and “xiang3” (想 “want”). But all is not lost! A learner at any stage can re-program their phonological network to separate these representations – with some work.
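If it helps to make “separate representations” concrete, here is a toy sketch in Python – purely my own analogy, not a model from the neuroscience – treating the phonological network as a lookup table in which the tone is part of the key:

```python
# Toy analogy only: a phonological "network" as a lookup table.
# Tone-aware version: each toned syllable gets its own well-defined entry.
tone_aware = {
    "xiang1": "香 (fragrant)",
    "xiang3": "想 (want)",
}

# Tone-blind version: both words collapse onto one ambiguous entry,
# leaving the ambiguity to be sorted out later in the pipeline.
tone_blind = {
    "xiang": ["香 (fragrant)", "想 (want)"],
}

print(tone_aware["xiang3"])  # resolved immediately
print(tone_blind["xiang"])   # two candidates still competing
```

The point of the analogy is simply that a learner who encodes tones from the start is building the first kind of lookup, while a learner who ignores them is building the second.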
As we’ll see next, establishing a well-defined phonological network is an important foundation for building a language network that can operate quickly enough to allow us to understand normal speed speech.
Lexical interface to Semantic hub
The words “right”, “write” and “rite” all sound the same, but have different meanings – and “right” has many meanings. So even though these words activate the same neuron cluster in the Phonological network (as they have identical sounds), when we hear the phrase “She’s right. You should write that. It’s your right.”, we aren’t confused, because the same sound is automatically mapped to different meanings by our Lexical interface: they “feel” like different words to us.
But what are they mapping to? The lexical interface connects word-sounds to a region at the tip of the temporal lobe called the ATL (the anterior temporal lobe). Recent theories suggest this may act as a kind of “Semantic hub” connecting knowledge and experience from all over the brain in a “hub and spoke” system, perhaps in conjunction with other brain regions like the Angular Gyrus.
For example, we know that bananas are a kind of fruit. This factual or “declarative” knowledge might be stored elsewhere in the brain’s cortex, but it is connected to a “banana-concept” set of neurons in the ATL that combine everything we know about bananas. Experience we have of what bananas look, smell and taste like might also be stored elsewhere, but form other connections to the ATL “banana-concept”.
The word-sound “banana” (literally the sound of the word in English) in our phonological network is connected by the Lexical interface to this “banana-concept” in the ATL. It’s just another spoke in the hub-and-spoke system, but it’s the connection that activates the concept of a banana in our brains when someone says the word “banana”. Remember the visual ventral stream? That also connects to the ATL, helping us activate “what” we’re looking at: the very same “banana-concept” when we see a banana, for example.
As we saw above, to process Chinese quickly enough to follow a conversation we need well-defined phonological sets of neurons that can accurately recognise each word as we hear it spoken, for example the word 香蕉 (xiāngjiāo “banana”). These “sound” neurons then need to be connected to our existing “banana-concept” in the ATL, which was there long before we began to learn Chinese. Now our understanding of “香蕉” is not just a theoretical, flashcard-like fact, perhaps stored elsewhere in the brain, but an automatic, almost instantaneous understanding of the sound of the word, without conscious thought.
To a large extent, learning a language means developing thousands of new phonological patterns (the sounds of words) and then connecting these to concepts (which already exist, as we already understand the world) in the semantic hub in the ATL, so we can automatically understand these words without translating.
But these connections are not all one-to-one. There are homophones in every language (like “right”/“write”) but they are particularly common in Chinese, even when we take tones into account. So “shi4”, for example, must connect to many possible semantic concepts in the ATL. Which concept is activated depends on the context of the word, so our Lexical interface needs extensive training to get this right: when “shi4” means 市 (activating the idea of “city”) or 事 (activating the idea of “thing” or “matter”) or many other possibilities.
The lexical interface can only be trained to do this by listening to (or reading) large amounts of Chinese, so that it “learns” how sounds and meanings fit together, functionally and statistically, in Chinese: this cannot be done by memorising more vocab or grammar rules.
Of course, these one-to-many connections in the lexical interface are made far more complex if the phonological network above it is not “well-defined” for Chinese because the learner has yet to sufficiently distinguish tones!
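To make the one-to-many idea concrete, here is a toy sketch in Python – my own illustration, with invented context cues, not a description of how the brain actually computes this – of a single sound pointing at several candidate meanings, with the surrounding words tipping the balance:

```python
# Toy illustration of context-dependent disambiguation.
# The candidate meanings are real words from the post; the "context cues"
# are invented purely for the example.
candidates = {
    "shi4": {
        "市 (city)":   {"shang4hai3", "bei3jing1", "zhong1xin1"},
        "事 (matter)": {"zhe4jian4", "mei2", "ma2fan"},
        "是 (to be)":  {"wo3", "ta1", "bu2"},
    },
}

def guess_meaning(syllable, context):
    """Pick the candidate whose typical neighbours overlap most with
    the syllables actually heard around it."""
    scores = {
        meaning: len(cues & set(context))
        for meaning, cues in candidates[syllable].items()
    }
    return max(scores, key=scores.get)

print(guess_meaning("shi4", ["zhe4jian4", "shi4", "hen3", "ma2fan"]))  # -> 事 (matter)
```

A real lexical interface is, of course, vastly more subtle, but the statistical flavour is the same: which meaning “wins” depends on which neighbours the sound has most often appeared with.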
The Combinatorial network
The “Dual Stream” model and the associated Hub-and-Spoke semantic model are both still being refined, studied and challenged, though evidence in support of them continues to grow in recent years.
The “Dual Stream” model hypothesises additional networks in the temporal lobe called Combinatorial networks (a term often used in Artificial neural networks) which would help integrate the meanings of phrases and sentences, using both grammatical and semantic information. Experiments show that these regions are sensitive to whether sentences are grammatically and/or meaningfully well-constructed, demonstrating that more processing is occurring here than simply identifying the meanings of words.
Further processes occur outside the temporal lobe too, in the frontal lobe and parietal lobe, including the Dorsal Stream, emphasising how much wider, more complex and interconnected the brain’s language network is than our original conceptions of Wernicke’s Area.
And finally, Working Memory
Every time you are paying attention to something, or manipulating or analysing something consciously, you are using your Working Memory. It is where all the pre-processing of speech we’ve discussed so far is finally brought together.
It is not possible to listen to and understand speech without consciously paying some attention. So, as one would expect, brain regions associated with the Working Memory (the “seat” of attention) activate towards the end of each spoken phrase we listen to, as the Working Memory attempts to make full “sense” of the products of the processes and networks we’ve discussed above.
The most widely-used model for Working Memory, the multi-component model, describes Working Memory as having three components controlled by a Central Executive system: the Phonological Loop (in practice largely devoted to interpreting speech), the Visuo-Spatial Sketchpad and the Episodic Buffer. These components can operate fairly independently, which explains why we are able to perform actions that require attention on vision and space (like driving or doing the dishes, using the Visuo-Spatial Sketchpad) at the same time as we also attentively listen or talk (using elements of the Phonological Loop). It’s also worth noting how important interpreting speech is to the Working Memory.
A further crucial feature of the Working Memory model is limited capacity: there are only ever a certain number of “slots” (often counted as “7 plus-or-minus 2”) to hold information that our Working Memory will then attend to: in the case of listening, interpreting the full meaning of the phrase we’ve just heard.
Bearing this in mind, we can piece together how changes in cognitive processing can result in the experience of listening to Chinese I described at the start of the post.
Smarter processing makes everything seem easy
Early on in language learning, when we have an undeveloped Phonological Network, we can only sometimes make out the words that have been spoken. Our Working Memory’s slots quickly fill up with poorly understood sounds, often lacking any activated meaning, which it then slowly tries to decode while the conversation moves rapidly on.
As the Phonological Network and Lexical Interface develop, the Working Memory has far better material to work with: clearly activated and defined sounds and meanings. But the number of words arriving at normal conversational speed will still overwhelm it, leading to my original experience of catching a few words of the sentence – “let me something a story” – but losing the rest.
Our example sentence has 16 characters – Chinese is often spoken at around 4 to 5 syllables per second, so this is a phrase that might take 3-4 seconds to say – which means there may be 15 to 20 of these phrases to process every minute! Handling this word by word is very inefficient.
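As a quick sanity check on those numbers, using only the rates quoted above:

```python
# Rough arithmetic only, using the figures mentioned above.
syllables = 16
slow, fast = 4, 5                   # syllables per second
longest = syllables / slow          # 4.0 seconds per phrase
shortest = syllables / fast         # 3.2 seconds per phrase
print(60 / longest, 60 / shortest)  # 15.0 to ~18.75 phrases per minute
```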
Only as all the components of the language network become better and better trained in Chinese phonetics, meanings, word frequency, sentence patterns and more can we start to perform the trick that overcomes the limitations of Working Memory capacity: “chunking”.
Rather than individual words occupying each precious slot of Working Memory, well-known patterns of words can be processed instead: 让我 (ràng wǒ “let me”), 跟你讲讲 (gēn nǐ jiǎng jiǎng “talk with you a bit”), 我小时候 (wǒ xiǎoshíhòu “my childhood”), 在上海的 (zài shànghǎi de “in Shanghai…”), 故事 (gùshì “story”).
As later components have already used these patterns to decode the grammatical structure and meaning of each of these chunks, the Working Memory has far less to do to interpret the full meaning of the phrase. Attention can now be paid to the speaker, how she sounds, her story, anticipating where the conversation might go, and so on. Her speech is handled so effortlessly and swiftly that it comes to seem slow, simple, unhurried.
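A toy way to picture the difference – my own illustration, reusing the “7 plus-or-minus 2” figure from above rather than any formal cognitive model – is to compare how many items each strategy asks Working Memory to hold:

```python
# Toy comparison: the same sentence held as raw syllables vs as chunks.
WM_SLOTS = 7  # "7 plus-or-minus 2"

syllables = ["rang4", "wo3", "gen1", "ni3", "jiang3", "jiang3", "wo3",
             "xiao3", "shi2", "hou4", "zai4", "shang4", "hai3", "de",
             "gu4", "shi4"]                                  # 16 items

chunks = ["让我 (let me)", "跟你讲讲 (talk with you a bit)",
          "我小时候 (my childhood)", "在上海的 (in Shanghai)",
          "故事 (story)"]                                     # 5 items

print(f"{len(syllables)} syllables: {'overflows' if len(syllables) > WM_SLOTS else 'fits'}")
print(f"{len(chunks)} chunks: {'overflows' if len(chunks) > WM_SLOTS else 'fits'}")
```

Sixteen raw syllables blow past the available slots; five well-rehearsed chunks fit with room to spare for attending to the speaker herself.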
Perhaps the greatest lesson a beginner can take from this is: things will become easier – as long as you keep at it. Every graded story you read, every simple video you watch is building networks and training them in statistics and patterns that will, eventually, make understanding Chinese seem much easier than it does now. Just keep listening, and your brain may well do the rest.
*Arti Languages are developing a new way to immerse in Chinese called Artificial Immersion – a completely new kind of Listening that aims to build your language network as fast and strongly as possible. You can watch Duncan demonstrate how it works below, and download their initial experimental game for iOS or Android on the website here.