English’s stranglehold on NLP has been broken
I saw a Reddit post the other day where a number of folks in NLP shared how GPT is affecting their work. There's a decent amount of panic out there–LLMs are an extremely disruptive technology, and the world's in the middle of a gold rush to figure out where this technology can make the most impact. Traditional approaches to NLP just aren't producing the kind of results we're seeing from GPT.
While there are some tasks LLMs don't perform well on today¹, they're getting better at an alarmingly fast rate. ChatGPT was already a big step up from GPT-3, and the newly released GPT-4 is another big step up from ChatGPT. OpenAI's "scale is all you need" approach to machine learning isn't showing any signs of slowing down soon, and it's unclear when–or if–they're going to hit the limits of their methodology.
There's going to be a lot of pain in the short term. A lot of NLP projects and research topics have been made obsolete by OpenAI's work. Looking past that short-term pain, however, reveals a promising future for NLP researchers and startups.
English dominates NLP
Up until now, English has had the NLP field in a stranglehold. Virtually all research and products are built around the English language, and (almost) all other languages have remained untouched. It's easy to see why this is the case:
- English is the most spoken language in the world by a comfortable margin. Mandarin has more native speakers, but English pulls far ahead once you count everyone who speaks it as a second or third (or beyond) language.
- English is completely and utterly dominant on the Internet. While Mandarin can beat English in terms of native speakers and put up a decent fight on total speaker numbers, there is an order of magnitude more English on the Internet than there is Mandarin.
- English is the current lingua franca for international business, diplomacy, science, and technology.
English is the closest thing we've ever had to a universal language. The value of being able to process this language at scale is immense. The ROI from working on other languages doesn't come close–and consequently, we've seen innovation in NLP center on English².
This overrepresentation of English isn't a good thing. The morphology and syntax of English are highly unusual, and linguists have described English as one of the "weirdest" languages in the world. NLP strategies which work well on English don't always translate well to other languages.
English's dominance is also self-reinforcing. There's a rich ecosystem of off-the-shelf tools, tutorials, and datasets that you can start using today for your own research or product development. You can work on anything from trying to eke out a small part-of-speech tagging improvement all the way up to a digital assistant–the world is your oyster. Most other languages don't have this infrastructure available, which means that for newcomers to the field, English is the obvious place to start.
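To make that concrete, here's a minimal sketch of how low the barrier to entry is for English, assuming spaCy and its small pretrained English model are installed (the example sentence is just an illustration):

```python
# Off-the-shelf English NLP in a handful of lines, assuming:
#   pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")  # pretrained English pipeline, one line to load
doc = nlp("GPT has upended traditional approaches to NLP.")

# Part-of-speech tags and lemmas come for free from the pretrained model.
for token in doc:
    print(token.text, token.pos_, token.lemma_)
```

For the vast majority of the world's languages, there is no equivalent one-line `spacy.load(...)` to reach for.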
GPT changes the game
The value of further English-language NLP research is now suspect. OpenAI has figured out that throwing millions of dollars at English NLP solves the problem. There's just so much text out there that brute-forcing your way to a useful model works out quite well. Even if OpenAI were to stop development of new GPT generations today, it would be a long time before more traditional NLP approaches caught up.
What hope does your university or startup have of competing? Are you willing to bet your career on beating OpenAI? Do you really think your prompts are creative enough to create a competitive advantage for your business? Not only is GPT way better than the previous state of the art, but using it to build products is dead easy. Nontechnical users are achieving incredible things with prompt engineering, to the point where it's probably safe to declare English NLP a commodity.
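"Dead easy" really does mean a few lines of code. As a rough sketch (using the openai Python package's pre-1.0 interface; the model name and prompt are purely illustrative), a task that once needed a labeled dataset and a trained classifier becomes a single API call:

```python
# Sentiment classification as a single prompt, sketched with the pre-1.0
# openai package. Model name and prompt are illustrative, not prescriptive.
import openai

openai.api_key = "sk-..."  # your API key here

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": "Classify the sentiment of this review as positive, "
                   "negative, or neutral: 'The battery died after two days.'",
    }],
)

print(response["choices"][0]["message"]["content"])
```

No training data, no feature engineering, no model selection–just a prompt.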
But English isn't the only language in the world.
In fact, there are about 7,000 languages spoken around the globe, and many of them have never been touched by an NLP researcher. A large share of these languages are endangered, and building out digital infrastructure for them is vital to preserving these tongues into the future.
Researching these languages is important for a variety of reasons. For one, it's important to understand the context behind the proliferation of English: indigenous people all over the world had English beaten into them through colonialism, and English's dominance on the Internet is a long-term consequence of the advantages accrued through that history. Doubling down on English-language research while ignoring endangered languages is, in some way, perpetuating that history.
More practically, digital technology is really important in today's world. In the same way that poor web performance excludes people in developing nations, a lack of language tooling for non-English tongues excludes a significant proportion of the world from engaging with technology and interacting with others in the digital town square. Translators, grammar checkers, and language-tutoring apps are all important for improving our world's interconnectivity.
And finally, there is guaranteed to be valuable ancient knowledge out there waiting to be extracted. Only this year did we unravel the mystery of Roman concrete's self-healing properties, and there are undoubtedly similarly valuable insights buried in ancient tomes waiting to be uncovered.
While GPT isn't totally useless in the domain of endangered languages–it seems to do an okay job of translating and tagging Māori language text–OpenAI's approach to training models is fundamentally reliant on having access to vast amounts of text.
The amount of text available for English is staggering: from January 2020 to December 2022, the Coronavirus Corpus collected 1.5 billion words of text. 1.5 billion words is a lot–it dwarfs the amount of data available for many endangered languages–and that's from a roughly three-year window, restricted to a single topic.
While GPT demonstrates shockingly good zero-shot task-transfer learning performance, I think its transferability to endangered languages will be limited by the small datasets available. There's an opportunity here for more classical NLP approaches to continue to develop.
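What does "classical" look like on a shoestring of data? As a toy sketch (the tiny tagged corpus below is an invented placeholder, not real field data), NLTK's backoff taggers can be trained on a handful of hand-annotated sentences:

```python
# A toy sketch of a data-frugal, classical approach: an n-gram tagger trained
# on a tiny hand-tagged corpus with NLTK. The sentences and tags below are
# invented placeholders standing in for a small field-collected dataset.
import nltk

train_sents = [
    [("te", "DET"), ("whare", "NOUN"), ("nui", "ADJ")],
    [("te", "DET"), ("tangata", "NOUN")],
]

# Back off from bigrams to unigrams to a default tag, so the tagger degrades
# gracefully on words it has never seen before.
default = nltk.DefaultTagger("NOUN")
unigram = nltk.UnigramTagger(train_sents, backoff=default)
bigram = nltk.BigramTagger(train_sents, backoff=unigram)

print(bigram.tag(["te", "whare", "nui"]))
```

Nothing here competes with GPT on English, but techniques like this–along with transfer learning from related languages–don't need a billion-word corpus to be useful.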
Overrepresentation of English in NLP has been a disaster for the world. The world's rich tapestry of languages has been steadily unravelling, threatening to cut hundreds of millions of people off from their history and culture. We've reinforced social disadvantages and risked losing access to valuable ancient knowledge.
GPT may just be the trigger needed to expand the NLP field.
1. Check out this arXiv paper: ChatGPT is not all you need
2. Chinese has also seen some play, and languages which are closely related to English have also seen some research–particularly in the context of transfer learning. These efforts are very small in comparison to English, however.