In new research, an AI model was trained to learn words and concepts through the eyes and ears of a single child, using headcam video recordings made from when the child was six months old until their second birthday.
Researchers showed that the artificial intelligence (AI) model could learn a substantial number of words and concepts using limited slices of what the child experienced. Even though the video captured only one percent of the child's waking hours, they said that was enough for genuine language learning.
"By using AI models to study the real language-learning problem faced by children, we can address classic debates about what ingredients children need to learn words - whether they need language-specific biases, innate knowledge, or just associative learning to get going," said Brenden Lake, an assistant professor in NYU's Center for Data Science and Department of Psychology and senior author of the study published in the journal Science.
To develop the model, the researchers first analysed a child's learning process captured on first-person video - via a light, head-mounted camera - on a weekly basis from six months through 25 months of age.
The team observed that the more than 60 hours of footage contained roughly a quarter of a million word instances - the number of words communicated, many of them repeatedly - that were linked with video frames of what the child saw as those words were spoken.
The footage also included a wide range of different activities across development, including mealtimes, reading books, and the child playing, the team said.
The researchers then trained a multimodal neural network with two separate modules - one that took in single frames of the video and another that took in the transcribed form of the speech directed at the child.
These modules were combined and trained using an algorithm called contrastive learning, which aims to learn by making associations in the input data, they said.
For instance, they explained, when a parent said something in the child's view, it was likely that some of the words used referred to something the child could see, which meant that comprehension was instilled by linking visual and linguistic cues.
"This provides the model a clue as to which words should be associated with which objects," said Wai Keen Vong, a research scientist at NYU's Center for Data Science.
"Combining these cues is what enables contrastive learning to gradually determine which words belong with which visuals and to capture the learning of a child's first words," said Vong.
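As a rough illustration of the idea, the sketch below computes a contrastive loss over a small batch of paired frame and utterance embeddings. It is not the researchers' code: the symmetric, softmax-based form of the loss, the `temperature` value, the function names and the toy two-dimensional embeddings are all assumptions made for the example. The loss is low when each frame is most similar to the utterance it was recorded with, and high when the pairs are mismatched.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def log_softmax_at(scores, index):
    """Log-probability that a softmax over `scores` assigns to `index`."""
    top = max(scores)  # subtract the maximum for numerical stability
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return scores[index] - log_z


def contrastive_loss(frame_embs, utterance_embs, temperature=0.1):
    """Symmetric contrastive loss (an assumed form, for illustration).

    frame_embs[i] and utterance_embs[i] are treated as a matching pair;
    every other combination in the batch is treated as a mismatch.
    """
    frames = [normalize(f) for f in frame_embs]
    utterances = [normalize(u) for u in utterance_embs]
    n = len(frames)
    # Temperature-scaled cosine similarity of every frame with every utterance.
    sim = [[dot(f, u) / temperature for u in utterances] for f in frames]
    loss = 0.0
    for i in range(n):
        frame_to_utterances = sim[i]
        utterance_to_frames = [sim[j][i] for j in range(n)]
        loss -= log_softmax_at(frame_to_utterances, i)
        loss -= log_softmax_at(utterance_to_frames, i)
    return loss / (2 * n)


# Toy embeddings (hypothetical): two frames and two utterances.
frames = [[1.0, 0.0], [0.0, 1.0]]
matched = [[1.0, 0.0], [0.0, 1.0]]      # utterance i goes with frame i
mismatched = [[0.0, 1.0], [1.0, 0.0]]   # the pairs are swapped

print(contrastive_loss(frames, matched) < contrastive_loss(frames, mismatched))
```

In training, a loss of this kind would be minimised by adjusting the two modules so that co-occurring frames and words end up close together in the shared embedding space.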
After training the model, the team tested it by presenting it with a target word and an array of four different image options and asking it to select the image that matched the word.
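A test of this kind can be sketched in a few lines. The example below assumes that the trained model yields an embedding for the target word and for each of the four candidate images, and that the model's "choice" is the image whose embedding has the highest cosine similarity to the word. That scoring rule, the function names and the toy vectors are assumptions for illustration, not details taken from the study. With four options, guessing at random would be correct about 25 per cent of the time.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def choose_image(word_emb, image_embs):
    """Return the index of the image most similar to the target word."""
    scores = [cosine(word_emb, img) for img in image_embs]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(trials):
    """Fraction of trials in which the chosen image is the correct one.

    Each trial is (word_embedding, four_image_embeddings, correct_index).
    """
    correct = sum(
        1
        for word_emb, image_embs, target in trials
        if choose_image(word_emb, image_embs) == target
    )
    return correct / len(trials)


# Two hypothetical trials with made-up three-dimensional embeddings.
trials = [
    ([1.0, 0.0, 0.0],
     [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.1, 0.1, 0.9]],
     0),
    ([0.0, 1.0, 0.0],
     [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.2, 0.9, 0.1], [0.5, 0.0, 0.5]],
     2),
]

print(accuracy(trials))
```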
The model was able to learn a "substantial" number of the words and concepts present in the child's everyday experience, the researchers said.
Further, the model was able to generalise some of the words it learned to visual instances different from those it saw in its training data.
This, the researchers said, reflected an aspect of generalisation also seen in children when they are studied in the lab.