Posted by Chao Chen, Software Engineer, Google Research
One of the greatest challenges faced by users who are visually impaired is identifying packaged foods, both in a grocery store and also in their kitchen cupboard at home. This is because many foods share the same packaging, such as boxes, tins, bottles and jars, and only differ in the text and imagery printed on the label. However, the ubiquity of smart mobile devices provides an opportunity to address such challenges using machine learning (ML).
In recent years, there have been significant improvements in the accuracy of on-device neural networks for various perception tasks. When coupled with the increased computing power in modern smartphones, it is now possible for many vision tasks to yield high performance while running entirely on a mobile device. The development of on-device models such as MnasNet and MobileNets (based on resource-aware architecture search) in combination with on-device indexing allows one to run a full computer vision system, such as labeled product recognition, entirely on-device, in real time.
Leveraging developments such as these, we recently released Lookout, an Android app that uses computer vision to make the physical world more accessible for users who are visually impaired. When the user aims their smartphone camera at the product, Lookout identifies it and speaks aloud the brand name and product size. To accomplish this, Lookout includes a supermarket product detection and recognition model with an on-device product index, along with MediaPipe object tracking and an optical character recognition model. The resulting architecture is efficient enough to run in real-time entirely on-device.
A completely on-device system has the benefit of being low latency and with no reliance on network connectivity. However, this means that for a product recognition system to be truly useful to the users, it must have a on-device database with good product coverage. These requirements drive the design of the datasets used by Lookout, which consist of two million popular products chosen dynamically according to the user’s geographic location.
Product recognition using computer vision has traditionally been solved using local image features extracted by, for example, the SIFT algorithm. These non ML-based approaches provide fairly reliable matching but are storage intensive per index image (typically ranging from 10KB to 40KB per image) and are less robust to poor lighting and blur in images. Additionally, the local nature of these descriptors means that it typically does not capture more global aspects of the product’s appearance.
An alternative approach that has a number of advantages would be to use ML and run an optical character recognition (OCR) system over the query image and database images to extract the text present on the product packaging. The text on the query image can be matched to the database using N-Grams to be robust to OCR errors such as spelling mistakes, misrecognitions, failed recognition of words on product packaging. N-Grams can also allow for partial match between query document and index document using measures such as Jaccard similarity coefficient, as opposed to requiring an exact match. However, with OCR, the index document size can grow very large since one would need to store N-Grams for product packaging text along with other signals like TF-IDF. Furthermore, the reliability of the matches is a concern with the OCR+N-Gram approach since it can easily over trigger in situations where there are a lot of common words present on the packaging of two different products.
In contrast to both the SIFT and OCR+N-Gram methods, our neural network-based approach, which generates a global descriptor (i.e., an embedding) for each image, requires only 64 bytes, significantly reducing the storage requirements from the 10-40KB per image needed for each SIFT feature index entry, or the few KBs per image for the less reliable OCR+N-gram approach. With fewer bytes consumed for each index image, more products can be included as a part of the index, yielding more complete product coverage and a better overall user experience.
The Lookout system consists of a frame cache, frame selector, detector, object tracker, embedder, index searcher, OCR, scorer and result presenter.
|Product recognition pipeline internal architecture.|
- Frame cache
The frame cache manages the lifecycle of the input camera frames in the pipeline. It efficiently delivers the data, including YUV/RGB/gray images, as requested by the other model components and manages the data life cycle to avoid duplicated conversions for the same camera frame requested by multiple components.
- Frame selector
When a user points the camera viewfinder towards a product, a lightweight IMU-based frame selector is run as a prefiltering stage. It selects the frames that best match a certain quality criterion (e.g., balanced image quality and latency) from the continuously incoming image stream, based on the jitter as measured by the angular rotation rate (deg/sec). This approach minimizes energy consumption by selectively processing only the high quality image frames and skipping the blurry frames.
Each selected frame is then passed to a product detector model, which proposes regions of interest(a.k.a. Detection bounding boxes) in the frames. The detector model architecture is a single-shot detector with an MnasNet backbone that strikes a balance between high quality and low latency.
- Object tracker
MediaPipe Box tracking is used to track the detected box in real-time, and plays an important role in filling the gap between the detection of different objects and reducing the detection frequency, thus reducing energy consumption. The object tracker also maintains an object map in which each object is assigned a unique object ID during runtime, which are later used by the result presenter to differentiate between objects and to avoid repeating the announcement of a single object. For each detection result, the tracker either registers a new object in the map or updates an existing object with the detection bounding box, using the Intersection over Union (IoU) between existing object bounding boxes with the detection result.
The regions of interest (ROIs) from the detector are sent to the embedder model, which then computes a 64-dimension embedding. The embedder model is initially trained from a large classification model (i.e., the teacher model, based on NASNet), which spans tens of thousands of classes. An embedding layer is added in the model to project the input image into an ‘embedding space’, i.e., a vector space where two points being close means that the images they represent are visually similar (e.g., two images show the same product). Analyzing only the embeddings ensures that the model is flexible and does not need to be retrained every time it is to be expanded to new products. However, because the teacher model is too large to be used directly on-device, the embeddings it generates are used to train a smaller, mobile-friendly student model that learns to map the input images to the same points in the embedding space as the teacher network. Finally, we apply principal component analysis (PCA) to reduce the dimensionality of the embedding vectors from 256 to 64, streamlining the embeddings for storing on-device.
- Index searcher
The index searcher performs KNN search over a pre-built, compatible ScaNN index using a query embedding. As a result, it returns the top-ranked index documents containing their metadata, such as product names, packaging size, etc. To reduce the index lookup latency, all embeddings are k-means clustered into clusters. At query time, the relevant clusters of data are loaded in memory for the actual distance computation. To reduce the index size without sacrificing quality, we use product quantization at indexing time.
OCR is executed on the ROI for each camera frame in order to extract additional information, such as packet size, product flavor variant, etc. Whereas traditional solutions used the OCR result for index searching, here we only use it for scoring. A proper scoring algorithm informed by the OCR text assists the scorer (below) in determining the correct result and improves the precision, especially in the case where multiple products have similar packages.
The scorer takes the input from the embeddings (with index results) and the OCR module and scores each of the previously retrieved index documents (embeddings and metadata retrieved via the index searcher). The top result after scoring is used as the final recognition from the system.
- Result presenter
Result presenter takes in all the results above, and surfaces the results to users by speaking the product name via text-to-speech service.
|Early experiments with on-device product recognition in a Swiss supermarket.|
The on-device system outlined here can be used to enable a spectrum of new in-store experiences, including the display of detailed product information (nutritional facts, allergens, etc.), customer ratings, product comparisons, smart shopping lists, price tracking, and more. We are excited to explore some of these future applications, while continuing research into advancing the quality and robustness of the underlying on-device models.
The work described here was authored by Abhanshu Sharma, Chao Chen, Lukas Mach, Matt Sharifi, Matteo Agosti, Sasa Petrovic and Tom Binder. This work wouldn’t have been possible without the support and help we received from Alec Go, Alessandro Bissacco, Cédric Deltheil, Eunyoung Kim, Haoran Qi, Jeff Gilbert and Mingxing Tan.