Nov 15, 2024
LEARNING A LANGUAGE WITH REAL CONTENT
Practicing a language and enjoying real, interesting content often seem at odds with each other. We're tackling this problem by building a language context around you.
Think of language context as your personal linguistic database.
It keeps track of the words you know, the ones you want to learn, and those you might not recognize yet. It also gauges which parts of the language you understand and where you might need a little more practice.
In short, the better the context we build, the more personalized and comprehensible the content we can create for you.
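To make the idea concrete, here is a minimal sketch of what such a language context might look like as a data structure. The class name, fields, and methods are illustrative assumptions, not our actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class LanguageContext:
    """Hypothetical per-learner context: tracks word status and
    rough per-grammar-area proficiency (area -> score in 0..1)."""
    known: set = field(default_factory=set)
    learning: set = field(default_factory=set)
    grammar_scores: dict = field(default_factory=dict)

    def status(self, lemma: str) -> str:
        # Classify a word as known, currently being learned, or unknown.
        if lemma in self.known:
            return "known"
        if lemma in self.learning:
            return "learning"
        return "unknown"

    def mark_seen(self, lemma: str, understood: bool) -> None:
        # Promote words the learner understood; queue the rest for practice.
        if understood:
            self.learning.discard(lemma)
            self.known.add(lemma)
        else:
            self.learning.add(lemma)

ctx = LanguageContext()
ctx.mark_seen("nadar", understood=False)
ctx.status("nadar")  # -> "learning"
```

Everything downstream (choosing which words to gloss, which topics are readable) can then be driven by `status` lookups against this store.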
Building a language context would be a relatively easy task if it weren’t for one tricky human language invention: conjugation.
You see, your language context needs to understand that when you look up ‘swimming,’ ‘swam,’ and ‘swim,’ they’re all just different forms of the same word.
Conversely, it needs to know that ‘fuera’ in Spanish can mean ‘outside’ as an adverb, but it’s not the same as ‘fuera,’ the conjugated form of the verb ‘ir’ (to go).
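One way to picture the problem: a lemma lookup can't be keyed on the surface form alone, it also needs the part of speech. A toy table (hand-written here purely for illustration) makes the ambiguity visible:

```python
# Lemma lookup keyed by (surface form, part of speech).
# 'fuera' maps to different lemmas depending on its role in the sentence.
LEMMAS = {
    ("swimming", "VERB"): "swim",
    ("swam", "VERB"): "swim",
    ("swim", "VERB"): "swim",
    ("fuera", "ADV"): "fuera",  # 'outside'
    ("fuera", "VERB"): "ir",    # conjugated form of 'ir'
}

def lemma_of(form: str, pos: str) -> str:
    # Fall back to the form itself when we have no entry.
    return LEMMAS.get((form, pos), form)

lemma_of("swam", "VERB")    # -> "swim"
lemma_of("fuera", "VERB")   # -> "ir"
```

A real lemmatizer has to infer that part-of-speech distinction from context, which is exactly where things get hard.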
So how did we tackle this problem?
The process of converting words to their base form is called lemmatization.
Two years ago, I thought this was a solved problem, but I was wrong. Lemmatization is still far from perfect, especially for non-English languages.
Initially, after writing some topics, we used Python NLP libraries like spaCy to convert words to lemmas. However, these tools only achieved about an 80% success rate for Spanish, so we had to correct many of the results by hand. We repeated this process for nearly 100 topics we wrote.
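Measuring that success rate is straightforward once you have hand-corrected gold lemmas: compare the lemmatizer's output token by token. A sketch (the data and the helper name are stand-ins, not our production pipeline):

```python
def success_rate(predicted: list[str], gold: list[str]) -> float:
    """Fraction of tokens whose predicted lemma matches the manual correction."""
    assert len(predicted) == len(gold)
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

predicted = ["swim", "swim", "go", "be"]  # lemmatizer output
gold      = ["swim", "swim", "go", "go"]  # after manual correction
success_rate(predicted, gold)  # -> 0.75 on this toy sample
```

Each corrected topic both improves the score we report and, more importantly, grows the pool of gold data for the next step.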
Armed with high-quality training data tailored to our use case, we fine-tuned an LLM on it, raising our success rate from ~80% with spaCy alone to about ~95%.
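The training data for a setup like this is typically a file of per-token examples: the sentence for context, the word in question, and the gold lemma. The field names and JSONL shape below are an illustrative assumption, not our actual format:

```python
import json

# Each manually corrected token becomes one training example:
# (sentence, word) -> lemma, serialized as one JSON object per line.
examples = [
    {"sentence": "Ella nadaba cada mañana.", "word": "nadaba", "lemma": "nadar"},
    {"sentence": "Espera fuera, por favor.", "word": "fuera", "lemma": "fuera"},
]

lines = [json.dumps(ex, ensure_ascii=False) for ex in examples]
training_file = "\n".join(lines)
```

Keeping the sentence in each example is what lets the model disambiguate cases like ‘fuera,’ where the correct lemma depends on context rather than on the word form.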
For language learning, that still wasn’t cutting it.
In our TCM app (Topics Content Management, see attached screenshot), which we use to craft our topics, we added a feature to manually correct words to their proper lemma. Then, we generated more training data and fed it back into the LLM. The success rate shot up to about 98%.
This felt like a good number for a small startup.