Determining how to caption an image properly can be tricky for humans. Choosing the focal point of a photo and describing it in a few concise words takes a lot of thought. For artificial intelligence (AI), doing so has long been a nearly impossible task.
Most people have, at some point, come across an automatically generated caption that reads more like robotic gibberish than a description of the photo.
Fortunately, Microsoft is pioneering innovations that have drastically improved AI's ability to caption images. The company claims that its latest system is better at doing so than humans. Whether that claim holds up in the real world remains to be seen.
Nonetheless, Microsoft’s innovations will help make the internet a better place for visually impaired users and sighted individuals alike.
Back in 2016, Google claimed that its AI systems could caption images with 94 percent accuracy. Microsoft’s latest system pushes the boundary even further. The company says that the new model is twice as accurate as the one it has been using since 2015.
Eric Boyd of Microsoft’s Azure AI division says, “[Image captioning] is one of the hardest problems in AI. It represents not only understanding the objects in a scene, but how they’re interacting, and how to describe them.”
The Microsoft team used a new approach to train its AI model. Rather than using full captions, the team paired each training image with a set of keywords. This made it easier for the model to learn how images and words interact. The end result is better, more accurate captions.
Azure AI Cognitive Services chief technical officer Xuedong Huang says, “This visual vocabulary pre-training essentially is the education needed to train the system; we are trying to educate this motor memory.”
Typically, AI breakthroughs take years to become consumer-facing products. That won’t be the case for Microsoft’s new image captioning model. The company is rolling it out as part of Azure’s Cognitive Services. This will give any developer the ability to integrate the tool into their apps.
Meanwhile, Microsoft is including the tool in Seeing AI, its app for visually impaired users.
In time, the model will be integrated into some of Microsoft’s Office products. It will arrive in PowerPoint for the web, Windows, and Mac later this year as well as in Word and Outlook on desktop platforms.
Better image captioning isn’t just something that’s nice to have, though. It has a significant positive impact both on people with sight and on people who are visually impaired. The applications for the latter are more obvious. Screen readers in particular benefit from better captions, as they can relay more accurate descriptions to blind users. Better captions also make it easier for these users to navigate the web.
The implications for sighted users are exciting for different reasons. For instance, better captions make it possible to find images in search engines more quickly. They also make designing a more accessible internet far more intuitive.
It will be interesting to see how Microsoft’s new AI image captioning tools work in the real world as they start to launch throughout the remainder of the year.