Google has made an eerily accurate image captioning algorithm available to developers via its open source software library TensorFlow. The ‘Show and Tell’ captioning system is built around the latest version of Google’s ‘Inception’ image classification model, which serves as its vision component.
While the 2014 Inception V1 was able to recognise image elements (such as an object, a structure, or a type of animal) 89.6% of the time when tested on the ImageNet 2012 image classification benchmark, the current Inception V3 achieves 93.9% accuracy in correctly identifying the content of a photographic image.
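For a rough sense of what the vision model does on its own, here is a minimal sketch (not Google's released code) that runs an ImageNet-pretrained Inception V3 classifier on a photo via the tf.keras convenience API. It assumes a recent TensorFlow 2 install, and the image file name is hypothetical.

```python
# A minimal sketch of classifying one image with an ImageNet-pretrained Inception V3.
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.inception_v3 import (
    InceptionV3, preprocess_input, decode_predictions)

model = InceptionV3(weights="imagenet")  # the vision model, with its classification head

def classify(path, top=5):
    # Inception V3 expects 299x299 RGB inputs
    img = tf.keras.utils.load_img(path, target_size=(299, 299))
    x = preprocess_input(np.expand_dims(tf.keras.utils.img_to_array(img), 0))
    preds = model.predict(x)
    # Returns (class id, label, probability) tuples for the top guesses
    return decode_predictions(preds, top=top)[0]

print(classify("dog_with_frisbee.jpg"))  # hypothetical image path
```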
The system can identify not only what is in the photo, but also how those items relate to one another. “An image classification model will tell you that a dog, grass and a frisbee are in the image, but a natural description should also tell you the color of the grass and how the dog relates to the frisbee,” explains Google.
Producing these richer descriptions required fine-tuning both the vision and language components of the captioning system. That meant defining contextual information for recognised objects, such as colours, locations, how the objects are positioned relative to one another, and any actions taking place. Think of it like teaching a child, “See Spot. See Spot run.” First the basic element is recognised, then the detail is layered on.
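In the underlying research, this pairing takes the form of a convolutional image encoder feeding an LSTM language model that emits the caption one word at a time, with the encoded image supplied as the first step of the sequence. The sketch below is a simplified Keras illustration of that structure, not Google's released implementation; the layer sizes and vocabulary size are assumptions.

```python
# Simplified sketch of the Show and Tell structure: a CNN encodes the photo,
# and an LSTM generates the caption word by word. Sizes are illustrative only.
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM = 10000, 512, 512  # assumed, not Google's values

# Vision component: Inception V3 with its classification head removed
cnn = tf.keras.applications.InceptionV3(weights="imagenet",
                                        include_top=False, pooling="avg")
cnn.trainable = False

image_in = layers.Input(shape=(299, 299, 3))
image_embedding = layers.Dense(EMBED_DIM)(cnn(image_in))     # project image features

caption_in = layers.Input(shape=(None,), dtype="int32")      # caption tokens seen so far
word_embeddings = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(caption_in)

# The encoded image is fed to the LSTM as the first "word" of the sequence
img_step = layers.Reshape((1, EMBED_DIM))(image_embedding)
seq = layers.Concatenate(axis=1)([img_step, word_embeddings])
hidden = layers.LSTM(HIDDEN_DIM, return_sequences=True)(seq)
hidden = layers.Cropping1D((1, 0))(hidden)                    # drop the image step's output

# At each remaining step, predict a distribution over the next caption word
next_word = layers.TimeDistributed(
    layers.Dense(VOCAB_SIZE, activation="softmax"))(hidden)

captioner = Model([image_in, caption_in], next_word)
captioner.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```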
During training, the system was shown hundreds of thousands of human-captioned images so that it could learn the language patterns and conventions of captioning. As a result, it can repurpose captions it has already seen when it encounters visually similar, uncaptioned images.
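In a setup like the sketch above, each human-captioned image would be arranged so the model learns to predict every next word of the caption given the photo and the words before it. The fragment below illustrates that layout for a single pair; the token ids and the start/end markers are made-up assumptions.

```python
# Illustrative only: one (image, caption) training pair for the sketch above.
import numpy as np

# "<start> a dog chases a frisbee <end>" as made-up token ids
caption_ids = np.array([[1, 42, 7, 96, 7, 350, 2]])
decoder_input = caption_ids[:, :-1]    # words the model sees at each step
decoder_target = caption_ids[:, 1:]    # the next word it must learn to predict

# image_batch would be the matching photo as a (1, 299, 299, 3) pixel array, e.g.:
# captioner.fit([image_batch, decoder_input], decoder_target, ...)
```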
A complete research paper, blog posts on the process, and the system itself are all available online. One caveat: the released system comes untrained, meaning it cannot caption your holiday snaps until you have put it through a full course of captioned photographs, so get uploading.
digitalrev.com, 2016-10-07 03:00