
Karl Baker is a Mandarin learner, app developer, and regular contributor to I’m Learning Mandarin. He is working on developing the Chinese sentence-mining app, Mandarin Mosaic, which is currently in beta mode.
Over the past year, I’ve visited China twice, staying for around a month and a half with my Chinese wife and her family. At points, I felt like I achieved a truly immersive language-learning experience.
I used sentence mining to improve my Mandarin, capturing sentences I heard from my wife and her family and writing them down for later study.
Collecting whole sentences in this way is extremely effective for Mandarin learning. But it can also be a slog.
So to streamline this process, I began developing an app called Mandarin Mosaic, which lets me easily create and drill my own packs of sentences and track which words I know. Think Anki, but specialised for Chinese, with better usability and progress tracking.
I wanted to create several packs of sentences on particular themes that I, and other learners, would want to study: for example, a pack focused on travel and tourism in China, a pack focused on small talk with native speakers, and several packs focused on HSK vocabulary.
I soon realised this would amount to thousands of sentences and would take a huge amount of time to generate if relying on humans alone. So I turned to ChatGPT for help.
There was only one issue.
If the sentences contained grammar errors, they would be learned incorrectly. I needed to be sure these ChatGPT-generated sentences didn’t contain any grammatical errors and were natural sounding.

The Experiment
First, I used ChatGPT (GPT-3.5) to generate over 1,000 sentences using the following three prompts:
1. Generate me some sentences in Chinese on [Topic]
2. Generate a Chinese sentence using the word [Word]
3. 使用单词“[Word]”生成一个句子 (the same prompt in Chinese: “Generate a sentence using the word [Word]”)

Next, I added the sentences to the Mandarin Mosaic app and assembled a team of native Chinese speakers to review them.
Each reviewer was instructed to report any instance where a sentence contained grammatical errors or seemed unnatural. The reviewers also provided correct versions of the sentences they’d flagged as problematic.
Crucially, the app would occasionally give a sentence to a reviewer that had already been reviewed and corrected by someone else. However, to remove the risk of bias, reviewers would not be given any indication that this was the case.
There were two reasons for including re-reviews. Firstly, to reduce the chance of mistakes slipping through, and secondly to obtain subjective feedback rather than just objective feedback.
Objective errors are issues with a sentence that everyone agrees are errors – for example, a glaringly wrong word choice or a severe grammar mistake.
Subjective errors are reported issues with a sentence that not everyone agrees on. For example, one reviewer may feel that a word choice is too formal or too colloquial, when it is really a matter of style or taste.
The Results
Of the 1,214 sentences reviewed, 96 (7.9%) were reported as wrong on the first pass and corrected by native speakers.
698 sentences were reviewed at least twice with 31 (4.4%) being corrected on the second pass.
357 were reviewed at least three times with 6 (1.7%) being corrected on the third pass.
Two reviews caught significantly more errors than one (P = 0.009), and three reviews caught significantly more than two (P = 0.035).
Four or more reviews did not yield a statistically significant benefit.
The error rate drops below 2% by the third pass, showing that re-reviewing is highly effective.
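A pass-to-pass comparison like this can be sketched with a two-proportion z-test on the reported counts. This is only an illustration of how such a comparison might be run; the original analysis may have used a different test, so the resulting P value need not match the figures above exactly.

```python
from math import erfc, sqrt

def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> float:
    """Two-sided z-test for the difference between two proportions x1/n1 and x2/n2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)                    # pooled proportion under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return erfc(abs(z) / sqrt(2))                     # two-sided p-value

# First pass: 96 of 1,214 corrected; second pass: 31 of 698 corrected.
p = two_proportion_z_test(96, 1214, 31, 698)
print(f"p = {p:.4f}")
```

On these counts the test agrees that the drop from the first to the second pass is significant at the 5% level.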
Overall, a total of 10.6% of sentences were reported as wrong by native speakers*.
I was also interested in the nature of the corrections themselves.
Did the reviewers change just one or two words, or did they restructure the sentence entirely?
The former would indicate that ChatGPT mostly gets Chinese sentence structure right with the occasional poor word choice, while the latter would indicate severe grammar errors.

Something that can help us analyse this is a metric called the Levenshtein Distance (LD): the minimum number of single-character insertions, deletions, or substitutions needed to turn the original sentence into the corrected one.
We can define an LD of 1 or 2 as a minor correction, since only one or two characters were changed. An LD of 3 or 4 would be moderate, and an LD of 5 or more would indicate wide-scale changes to the sentence.
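The Levenshtein Distance is straightforward to compute with dynamic programming. A minimal sketch, which works directly on Chinese text since Python treats each character as one unit of the string (the example sentences below are my own, not from the experiment's data):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute (free if characters match)
            ))
        prev = curr
    return prev[-1]

# One character substituted (茶 -> 水), so LD = 1: a "minor" correction.
print(levenshtein("我想喝茶", "我想喝水"))
```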
Among the first-pass reviews, 51% of the corrections had an LD of 1 or 2 indicating only minor word substitutions. 24% had a moderate LD of 3 or 4, and 25% had an LD of >= 5 indicating large changes to the sentence.
Among the second pass reviews, 80% had an LD of 1 or 2. 10% had an LD of 3 or 4, and 10% had an LD of >= 5. This shows that the reviewers were reliably able to spot large-scale grammar errors on the first pass, with the second pass being mostly minor word substitutions.
In the third pass reviews, all of the 6 corrections had an LD of 1 or 2 indicating only minor corrections.
Conclusions
My experiment showed that, most of the time, ChatGPT is a good source of Chinese sentences for sentence mining, with 89.4% of all sentences passing the native speaker reviews.
This means that when you ask ChatGPT to generate a Mandarin sentence for you, you can be reasonably confident that it won’t contain any major errors.
However, roughly one in ten sentences were unnatural or incorrect. Although further analysis showed these mistakes were mostly minor, this is significantly worse than what I’d expect from a native tutor.
Going forward, I will continue using ChatGPT to generate sentences for my app. These will all still need to be reviewed by humans, but this is more efficient than having humans create every new sentence from scratch.
For my own day-to-day learning, I remain cautious about using ChatGPT. I usually have no way of knowing whether the sentences are natural or not without getting a native speaker to review them. This makes it difficult to trust it as a reliable source of input.
Hopefully, ChatGPT will continue to improve in the future with newer versions becoming available for free. But for now, I’ll stick to recruiting a native tutor and chatting with my wife’s family for my daily dose of Mandarin practice!
How about you?
Do you trust ChatGPT? Did the results of my experiment change your mind? Let me know in the comments!
*Note that the sum of the percentages reported at each review pass (7.9% + 4.4% + 1.7% = 14%) does not equal the 10.6% figure here. The discrepancy arises because some sentences were reported more than once, i.e. a reviewer made a correction, and a different reviewer then corrected that correction on a later pass.

