DeepMind: Why is AI so good at language? It’s something in the language itself

ali mohamed | 28 May 2022

Could statistical properties of language, such as word frequency and polysemy, explain why a neural network can suddenly solve tasks for which it was not specifically trained, a capability known as 'few-shot learning'? DeepMind says yes.

Tiernan Ray for ZDNet

How come a program like OpenAI’s GPT-3 neural network can answer multiple choice questions or write a poem in a certain style, even though it was never programmed for those specific tasks?

It could be because human language has statistical properties that lead a neural network to expect the unexpected, according to new research from DeepMind, Google’s AI unit.

Natural language, from the standpoint of statistics, has qualities that are 'non-uniform'. Some words can stand for multiple things, a property known as 'polysemy': the word 'bank' can mean a place where you put money or a rising mound of earth. And words that sound the same can stand for different things, known as homophones, such as "here" and "hear."

Those qualities of language are the focus of an article published this month on arXiv, “Data Distributional Properties Drive Emergent Few-Shot Learning in Transformers,” by DeepMind scientists Stephanie CY Chan, Adam Santoro, Andrew K. Lampinen, Jane X. Wang, Aaditya Singh, Pierre H. Richemond, Jay McClelland and Felix Hill.

Also: What is GPT-3? Everything your business needs to know about OpenAI’s groundbreaking AI language program

The authors started by asking how programs like GPT-3 can solve tasks where they are given all kinds of questions for which they are not explicitly trained, which is known as ‘few-shot learning’.

For example, GPT-3 can answer multiple-choice questions without ever being explicitly trained on that question format, simply because a human user types a sample multiple-choice question-and-answer pair into the prompt.
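A minimal sketch of what such a prompt looks like in practice (the questions and the commented-out API call are illustrative, not taken from the article; any completion-style language-model API works the same way):

```python
# Illustrative few-shot prompt for a multiple-choice task.
# The model is never trained on this format; the worked example
# in the prompt alone establishes the pattern to continue.
prompt = """\
Q: Which planet is closest to the Sun?
(a) Venus (b) Mercury (c) Mars
A: (b)

Q: Which gas do plants absorb for photosynthesis?
(a) Carbon dioxide (b) Oxygen (c) Nitrogen
A:"""

# A hypothetical completion API would then continue the prompt:
# response = client.complete(prompt, max_tokens=5)
```

The model's only clue about the task is the single solved example above the query; that is what makes the behavior "few-shot."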

“Large transformer-based language models are capable of few-shot learning (aka in-context learning), without being explicitly trained for it,” they write, referring to Google’s wildly popular Transformer neural network that is the foundation of GPT-3 and of Google’s BERT language model.

As they explain, “We hypothesized that specific distributional properties of natural language might drive this emergent phenomenon.”

The authors speculate that such large language models behave like a different kind of machine-learning program, known as a meta-learning program. Meta-learning programs, which DeepMind has explored in recent years, work by modeling data patterns that span different data sets. Such programs are trained to model not a single data distribution, but a distribution of data sets, as explained in previous research by team member Adam Santoro.
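The core difference can be sketched in a few lines: where ordinary supervised training samples from one fixed data set, a meta-learner draws a fresh task, with fresh label meanings, at every step. This toy sampler is an assumption-laden illustration of that idea, not code from the paper:

```python
import random

def sample_episode(class_pool, n_support=5):
    """Meta-learning draws from a *distribution of data sets*: each
    episode picks two classes and a fresh label mapping, so no single
    fixed mapping can be memorized across episodes."""
    classes = random.sample(class_pool, k=2)
    labels = [0, 1]
    random.shuffle(labels)  # label meanings change every episode
    mapping = dict(zip(classes, labels))
    # Support set: labeled examples the learner sees in-context.
    support = [(c, mapping[c]) for c in classes for _ in range(n_support)]
    query = random.choice(classes)
    return support, query, mapping[query]

# Toy pool of classes standing in for a family of data sets:
pool = ["glyph_A", "glyph_B", "glyph_C", "glyph_D"]
support, query, answer = sample_episode(pool)
```

Because the mapping is resampled each time, the only winning strategy is to infer the current mapping from the support set, which is exactly the skill few-shot learning requires.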

Also: OpenAI’s massive GPT-3 hints at the limits of language modeling for AI

The key here is the idea of different data sets. Each of the non-uniformities the authors suspect, such as polysemy, or the 'long tail' of language (the fact that speech contains many words used with relatively little frequency), behaves like a separate data distribution.

In fact, they write, language falls somewhere between supervised training data, with its regular patterns, and meta-learning over many different data sets:

As with supervised training, items (words) do recur and item-label assignments (e.g. word meanings) are somewhat fixed. At the same time, the long-tail distribution ensures the existence of many rare words that appear only occasionally in context windows, but can be bursty (appear multiple times) within a context window. We can also think of synonyms, homonyms, and polysemy as weaker versions of the completely unfixed item-label assignments used in few-shot meta-training, where the assignments change with each episode.
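The long-tail and burstiness properties the quote describes can be mimicked with a simple Zipf-like sampler; this sketch (my construction, assuming a standard rank-frequency power law, not the paper's exact generator) shows why rare words are seldom seen overall yet can repeat within one context window:

```python
import random

def zipfian_weights(n, alpha=1.0):
    """Zipf-like rank-frequency weights: a handful of very common
    items plus a long tail of rare ones, as in natural language."""
    return [1.0 / (rank ** alpha) for rank in range(1, n + 1)]

vocab = [f"word_{i}" for i in range(1000)]
weights = zipfian_weights(len(vocab))

# Draw a "context window": common words recur constantly, while a
# rare word is unlikely to appear at all, but once drawn it can
# burst (appear multiple times) within the same window.
window = random.choices(vocab, weights=weights, k=64)
```

Sampling with replacement from a skewed distribution is what produces bursts: the chance of a rare word appearing twice in one window is tiny globally but non-negligible once it has appeared at all.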

To test the hypothesis, Chan and colleagues take a surprising approach: They don’t actually work with language tasks. Instead, they train a Transformer neural network to solve a visual task called Omniglot, which was introduced in 2016 by NYU, Carnegie Mellon and MIT scientists. Omniglot challenges a program to assign the correct classification label to 1,623 handwritten character glyphs.


In their work, Chan et al. turn the labeled Omniglot challenge into a few-shot task by randomly shuffling the labels of the glyphs, so that the neural network must learn anew in each "episode":

Unlike during training, where the labels were fixed across all sequences, the labels for these two image classes were randomly reassigned for each sequence […] Because the labels are randomly reassigned for each sequence, the model must use the context in the current sequence to make a label prediction for the query image (a 2-way classification problem). Unless otherwise noted, few-shot learning was always evaluated on held-out image classes that were never seen in training.
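The evaluation episode described in the quote can be sketched as follows. This is a minimal reconstruction under stated assumptions (2-way classification, one shot per class, held-out class IDs chosen arbitrarily here), not the authors' actual code:

```python
import random

def few_shot_episode(class_ids, shots=1):
    """Build one 2-way evaluation episode: pick two held-out classes,
    randomly reassign the labels 0/1, then pose a query. Because the
    assignment changes every episode, the model can only succeed by
    reading the labeled examples in the current context."""
    a, b = random.sample(class_ids, k=2)
    assignment = {a: 0, b: 1}
    if random.random() < 0.5:       # random label reassignment
        assignment = {a: 1, b: 0}
    context = [(c, assignment[c]) for c in (a, b) for _ in range(shots)]
    random.shuffle(context)
    query_class = random.choice((a, b))
    return context, query_class, assignment[query_class]

# Classes never seen in training (illustrative ID range):
holdout = list(range(1523, 1623))
context, query, target = few_shot_episode(holdout)
```

A model that merely memorized training-time label assignments scores at chance here; only in-context inference from the shuffled examples yields above-chance accuracy.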

In this way, the authors manipulate visual data, the glyphs, to capture the non-uniform qualities of language. “During training, we situate the Omniglot images and labels in arrays with different language-inspired distribution properties,” they write. For example, they gradually increase the number of class labels that can be assigned to a particular glyph to approximate the quality of polysemy.

“In evaluation, we then assess whether these properties give rise to few-shot learning ability.”

What they found is that as they multiplied the number of labels for a given glyph, the neural network got better at few-shot learning. “We find that increasing this ‘polysemy factor’ (the number of labels assigned to each word) also increases few-shot learning,” as Chan and colleagues put it.
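A sketch of the "polysemy factor" manipulation, as I understand it from the article (the label-numbering scheme below is my own assumption): each image class is given several candidate labels, and each training sequence samples one of them, so the same class can mean different things in different contexts.

```python
import random

def assign_labels(n_classes, polysemy_factor):
    """Give every class `polysemy_factor` candidate labels, disjoint
    across classes. During training, each sequence samples one label
    from the class's set, mimicking a polysemous word whose meaning
    depends on context."""
    return {c: [c * polysemy_factor + k for k in range(polysemy_factor)]
            for c in range(n_classes)}

label_sets = assign_labels(n_classes=100, polysemy_factor=3)
label_for_this_sequence = random.choice(label_sets[42])
```

Raising the polysemy factor makes any single fixed class-to-label mapping less reliable, pushing the model toward inferring the mapping from context, which is the few-shot behavior being measured.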

“In other words, by making the generalization problem harder, few-shot learning emerged more strongly.”

At the same time, there is something about the specific structure of the Transformer network that helps it learn few-shot, Chan and colleagues find. They test “a vanilla recurrent neural network,” they write, and find that such a network never achieves few-shot ability.

“Transformers show a significantly greater capacity for few-shot learning than recurrent models.”

The authors conclude that both the properties of the data, such as the long tail of language, and the nature of the neural network, such as the Transformer structure, matter. It’s not one or the other, but both.

The authors list a number of avenues to explore in the future. One is the connection to human cognition, since infants demonstrate what appears to be few-shot learning.

For example, babies quickly learn the statistical properties of language. Could those distributional properties help infants learn rapidly, perhaps serving as useful pre-training for later learning? And could comparable non-uniform distributions in other domains of experience, such as vision, also play a role in this development?

It should be clear that the current work is not a language test at all. Rather, it aims to mimic the supposed statistical properties of language by recreating non-uniformities in visual data, the Omniglot images.

The authors do not explain whether that translation from one modality to another has any effect on the meaning of their work. Instead, they write that they expect to extend their work to more aspects of language.

“The above results suggest exciting lines of future research,” they write, including: “How do these data distributional properties interact with reinforcement learning versus supervised losses? How might the results differ in experiments replicating other aspects of language and language modeling, e.g., using symbolic inputs, training on next-token or masked-token prediction, and letting the meaning of words determine their contexts?”

