How machine learning is transforming video scanning technology

Machine learning is enabling video intelligence

With video content growing more popular by the day, the demand for making it searchable is growing too. The overall task sounds simple: create machine-readable semantic metadata for videos that can then be analyzed with text mining techniques. In practice it is very challenging. It requires processing video content at scale, and the preferred approach of breaking videos down into still frames (that is, images) brings challenges of its own. The biggest is that processing 30 frames per second is resource-intensive, which calls for better approaches.
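To put the 30-frames-per-second figure in perspective, here is a rough back-of-the-envelope sketch in Python. The function name `frames_to_process` and the idea of sampling one frame per second are assumptions made for illustration only; they are not taken from any particular product.

```python
def frames_to_process(duration_sec: float, fps: float = 30.0,
                      sample_every_sec: float = 0.0) -> int:
    """Count the frames to analyze.

    sample_every_sec <= 0 means every frame is analyzed; otherwise one
    frame is taken per sampling interval (a hypothetical shortcut).
    """
    if sample_every_sec <= 0:
        return int(duration_sec * fps)
    return int(duration_sec // sample_every_sec)


one_hour = 60 * 60  # seconds
print(frames_to_process(one_hour))                        # 108000
print(frames_to_process(one_hour, sample_every_sec=1.0))  # 3600
```

Even this crude sampling cuts the workload for an hour of footage by a factor of 30, at the cost of possibly missing objects that appear only briefly.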

Thankfully, advances in research on convolutional neural networks (CNNs) have helped significantly. This machine learning technique promises dramatic improvements in areas such as computer vision, speech recognition, and natural language processing. You have probably heard of the broader field it belongs to by its more layperson-friendly name: “Deep Learning.”

How will the tech help?
Apart from video surveillance intelligence, the technology can open the floodgates in other video-related areas as well. To begin with, in digital advertising, the software can identify the optimum place in a video for an ad. Companies can already pay to position their ads next to videos of a given type or on a given subject. Another segment that will be strongly affected is video search. So far, video search has been limited to keyword tagging. With video intelligence APIs populating object data within video streams, video search will transform.

How is video content searched for objects?
There are a number of approaches to searching for objects in video content using detection:

  1. Run an object detector on individual frames, check for nearby detections in adjacent frames, and discard those that lack continuity.
  2. Run an algorithm to estimate object motion between frames (methods include optical flow, phase correlation, and pyramid block matching), then aggregate multiple frames that capture the motion of the same object. This technique, however, is limited by the actual motion of the objects in the frames.
  3. Recent research has combined the two approaches above: the algorithm switches between object detection and motion estimation, using both data streams to reinforce the accuracy of the model.
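
The first approach in the list can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the `Detection` tuple layout, the `iou` and `filter_by_continuity` names, and the 0.5 overlap threshold are all assumptions made for the example, and the detections are presumed to come from some per-frame object detector that is not shown.

```python
from typing import List, Tuple

# A detection is (label, x1, y1, x2, y2); this layout is assumed for the sketch.
Detection = Tuple[str, float, float, float, float]


def iou(a: Detection, b: Detection) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[1], b[1]), max(a[2], b[2])
    ix2, iy2 = min(a[3], b[3]), min(a[4], b[4])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[3] - a[1]) * (a[4] - a[2])
    area_b = (b[3] - b[1]) * (b[4] - b[2])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def has_match(det: Detection, frame: List[Detection], threshold: float) -> bool:
    """True if the frame holds a same-label box overlapping det enough."""
    return any(o[0] == det[0] and iou(det, o) >= threshold for o in frame)


def filter_by_continuity(frames: List[List[Detection]],
                         threshold: float = 0.5) -> List[List[Detection]]:
    """Keep a detection only if an adjacent frame contains a nearby match."""
    kept = []
    for i, frame in enumerate(frames):
        neighbours = []
        if i > 0:
            neighbours.append(frames[i - 1])
        if i < len(frames) - 1:
            neighbours.append(frames[i + 1])
        kept.append([d for d in frame
                     if any(has_match(d, n, threshold) for n in neighbours)])
    return kept


frames = [
    [("car", 10, 10, 50, 50)],
    [("car", 12, 10, 52, 50), ("dog", 200, 200, 220, 220)],  # dog appears once
    [("car", 14, 10, 54, 50)],
]
for frame in filter_by_continuity(frames):
    print(frame)
```

In this toy input the car is detected in roughly the same place in all three frames and survives, while the dog appears in a single frame with no neighbour and is discarded as a likely false positive.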

What are some of the companies and startups working in this field?

Clarifai, a startup, offers a service that uses deep learning to understand video. Its video processing software can identify objects and scenes in video, at a speed much faster than the human eye, and provides a timeline so you can jump to the place where a certain element appears. Its API returns tags that classify objects contained in images and video, along with probability/confidence scores.
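
To make the idea of a tag timeline concrete, here is a small Python sketch of how per-frame tags with confidence scores could be merged into time ranges. It is not Clarifai's API or response format: the `(timestamp, {tag: confidence})` layout, the `build_timeline` name, the 0.8 confidence cut-off, and the one-second gap tolerance are hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical input: (timestamp in seconds, {tag: confidence}) per analyzed frame.
FrameTags = Tuple[float, Dict[str, float]]
Segment = Tuple[float, float]


def build_timeline(frames: List[FrameTags],
                   min_confidence: float = 0.8,
                   max_gap: float = 1.0) -> Dict[str, List[Segment]]:
    """Merge confident per-frame tags into (start, end) segments per tag."""
    timeline: Dict[str, List[Segment]] = {}
    for ts, tags in sorted(frames, key=lambda f: f[0]):
        for tag, confidence in tags.items():
            if confidence < min_confidence:
                continue  # ignore low-confidence tags
            segments = timeline.setdefault(tag, [])
            if segments and ts - segments[-1][1] <= max_gap:
                # Close enough to the previous sighting: extend that segment.
                segments[-1] = (segments[-1][0], ts)
            else:
                segments.append((ts, ts))
    return timeline


tagged_frames = [
    (0.0, {"dog": 0.95}),
    (1.0, {"dog": 0.90, "car": 0.50}),
    (2.0, {"car": 0.85}),
    (5.0, {"dog": 0.99}),
]
print(build_timeline(tagged_frames))
```

With a structure like this, a search for "dog" can jump straight to the segments where the tag was seen with high confidence, rather than scanning the whole video.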

Dextro, which was recently acquired by Taser, is another startup that uses proprietary algorithms to make videos searchable and discoverable using video, audio, and motion detection signals. Taser, a well-known name in non-lethal weapons manufacturing, hopes to make it easier for law enforcement to scan hours of footage, using keywords to zero in on relevant images, such as when searching for a gun or a specific car. Given that the tool is now in the hands of law enforcement agencies, an upgrade to facial recognition in the future is a distinct possibility. Another company, 3VR, is already operating in this segment of video surveillance and intelligence using machine learning techniques.

The IBM Watson Visual Recognition Service uses CNN-based machine learning algorithms to analyze and understand the content of images and video frames. IBM Watson uses CNNs as semantic classifiers that can recognize settings, objects, events, and other visual entities. The IBM Watson Visual Recognition API allows images to be uploaded to the system programmatically and then analyzed against the labels specified. The API returns a list of tags/labels for the objects/concepts recognized within the image, along with confidence scores.

Recently, Google unveiled the Google Cloud Video Intelligence API, which leverages deep-learning models and frameworks such as TensorFlow to search and discover elements within video content. Google has built this API for large media organizations and consumer technology companies that want to build their media catalogs or find easy ways to manage crowd-sourced content.

A group of scientists at MIT is attempting something different from the above: using machine learning to understand sound from unlabeled videos. Their approach capitalizes on the natural synchronization between vision and sound to learn an acoustic representation from unlabeled video. Essentially, given a video, the model recognizes objects and scenes from sound only.

While many believe that the computer vision and video intelligence market might become a winner-takes-all segment, especially after Google's entry, I think the advent of novel technologies and machine learning methods will not allow any player room to relax, irrespective of its stature.

Anubhav, a data scientist, writes about new developments and future trends in the machine learning and data analytics domain.