The world is drowning in digital data. Interestingly, visual content makes up a majority of this data. Market research in the US shows that visuals made up over 70% of the digital content of many companies in 2019, and about half of the companies surveyed identify visual media as a key factor in their marketing strategies. Similarly, in journalism, media houses have an overflow of images and photographs that have to be sorted before they can be used. In 2019, the New York Times Magazine had over 500,000 photographs in its newsroom; journalists sifted through this huge number of visuals and selected 116 images for publication.
Computer vision solutions have become a popular way to tackle these huge archives of visual content. With advances in computer vision technology, huge volumes of visual data can now be processed, analysed and tagged in a matter of seconds.
Computer vision, which started as a summer project for undergraduates at MIT in 1966, has become a revolutionary technology that finds application in several fields in 2020.
Funnily enough, the computer vision project initiated at MIT was supposed to be done and dusted in that one summer. However, it has been more than 50 years since then, and scientists all over the world are still working to understand and apply the technology surrounding computer vision.
What is computer vision?
So let’s break it down: what is computer vision as we know it today? In a Stanford University publication called Computer Vision: Foundations and Applications, Olivier Moindrot describes computer vision as “building algorithms that can understand the content of images and use it for other applications”. In other words, computer vision gives machines the ability to “see”. It is a field of computer science that enables machines to extract information from visual data (images or videos), essentially doing the same tasks as the human visual system.
Interestingly, computer vision can be thought of as an amalgamation of a number of disciplines, most notably computer science and neuroscience. Computer science contributes the algorithms that analyse pixel data and perform massive computations on it. Neuroscience tries to understand, from a physiological point of view, how our brains are able to perceive our visual world. One of the most popular and effective glues connecting these two fields is machine learning, a set of techniques that encode the act of learning (and eventually understanding) in computer algorithms. Current learning techniques are based on massive artificial neural networks, where each artificial neuron is a mathematical function that serves as a rough caricature of our physical neurons. Think about how humans managed to fly aeroplanes by drawing on the natural principles of flight used by birds. Our hope is to similarly model computer vision by means of artificial neurons which draw on the principles of how the neurons in our brains work, and consequently impart the magic of sight to computers.
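To make the idea of an artificial neuron as "a mathematical function" concrete, here is a minimal sketch in Python. It assumes one common formulation (a weighted sum of the inputs followed by a sigmoid activation); the function names and the example numbers are illustrative only, and real networks use many variations on this theme.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of the inputs, then an activation."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(weighted_sum)


# Hypothetical inputs and weights, chosen only for illustration.
activation = neuron([0.5, -1.0], [0.8, 0.2], 0.1)
```

A neural network is, roughly speaking, a very large number of such functions wired together, with the weights and biases being the numbers that get learned.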
“In a nutshell, computer vision is an interdisciplinary field that aims at giving computers the ability to see. It is engineering at its best, and often theoretically sound ideas don't work well in practice. And the things that work well often have little to no theoretical grounds, but rather have to be found through extensive experimentation.”
- Dominic Ruefenacht, Senior computer vision scientist, Mobius Labs
To understand the inner workings of this technology, we’ve tried to explain the building blocks of computer vision below.
How does modern computer vision work?
A large portion of modern* computer vision falls under the overarching umbrella of Machine Learning and Artificial Intelligence. The underlying principle in enabling computers to see is extracting data from visual content, processing that data, and subsequently analysing and identifying the contents of the image. One of the techniques used to recognise such data is the Convolutional Neural Network (CNN). The mechanism behind a CNN is a complicated one; to understand it, you must first understand what Convolution and Neural Networks are. We’ve tried to explain both using a very relatable metaphor: cooking**!
Imagine you’re at your favourite restaurant with your friends, eating that one dish that makes you float in seventh heaven. You decide you want more of that heavenly experience and decide to recreate the exact dish at home in your own kitchen. The first attempt is always the trickiest: you somehow manage to get a hold of the recipe, assemble all the ingredients, saute, cook, simmer and serve. You ask your friends to act as the Masterchef critics and present them your dish for a tasting. Your friends are impressed, but the dish is not quite like the delicacy you tasted at the restaurant. It lacks ‘something’. Your friends ask you to add a bit more of that spice mix, maybe a little less salt.
So, you get back to work again. This time you alter the quantity of the ingredients slightly (keeping in mind your friends’ recommendations), so as to more accurately recreate the dish that the restaurant served you. You then repeat the next steps again: saute, cook, simmer and serve. A second tasting session happens and now the dish tastes almost like the original, but maybe you used too much chilli. If you’re motivated enough for the challenge, you repeat the same process all over again: take the feedback from your friends, alter ingredient ratios, saute-cook-simmer-serve. You keep doing this till your friends give you the golden pin for having achieved the precise taste of that delectable dish to the best of your abilities.
This process of making food in successive stages is equivalent to the feed forward mechanism of a neural network!
The working of a Neural Network involves a number of steps. First, there is an input stage where you provide all the data. Following the cooking metaphor, this data is all the ingredients you used to recreate your favourite dish. In the next stage, a transformation takes place: the input data is taken, and a number of mathematical operations are performed on it; this would be the first sauteing of the ingredients. A number of such processes take place in stages, which correspond to the cooking, simmering and the like. Finally, you have the output: the first attempt at the completed recipe. The machine now has a rough idea of what the input image looks like. It compares this output with the desired result (for example, the correct label for the image) and measures the difference. That difference is then fed back through the network to adjust its internal parameters, and the same steps are repeated. The goal is to bring the output (the final dish) as close as possible to the desired result, in the very same way that you try to make your home-cooked dish match the original restaurant dish as precisely as possible.
Of course, with visual content, this process becomes extremely complex because the data is huge. The input for a culinary dish consists of a few vegetables, a cheese or two, and some spices; for an image, the input data consists of the millions of pixels that make up the image.
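The loop of cooking, tasting and adjusting can be sketched in a few lines of Python. This is a deliberately tiny illustration, not how production networks are built: it assumes a single linear neuron, plain gradient descent on the squared error, and a made-up dataset whose targets follow the rule 2*a - 1*b + 0.5. All names and numbers here are hypothetical.

```python
def forward(weights, bias, inputs):
    """Feed-forward step: turn the inputs (ingredients) into an output (dish)."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def train(samples, learning_rate=0.1, epochs=500):
    """Repeat forward pass -> compare with target -> adjust the parameters."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            prediction = forward(weights, bias, inputs)
            error = prediction - target  # the "feedback from the tasters"
            # Nudge every parameter in the direction that shrinks the error.
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias -= learning_rate * error
    return weights, bias


# Made-up training data: target = 2*a - 1*b + 0.5 (an assumption for the demo).
samples = [
    ((0.0, 0.0), 0.5),
    ((1.0, 0.0), 2.5),
    ((0.0, 1.0), -0.5),
    ((1.0, 1.0), 1.5),
]
weights, bias = train(samples)
```

After enough rounds, the learned weights and bias end up very close to 2, -1 and 0.5, the "recipe" that generated the data. Real networks stack many such stages and use backpropagation to pass the error back through all of them.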
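To get a feel for the scale involved, the short sketch below counts the raw numbers in a single image. It assumes an uncompressed colour image with three channels (red, green, blue) per pixel; the Full HD resolution is just an example.

```python
def pixel_values(width, height, channels=3):
    """Number of raw numeric values in an uncompressed image."""
    return width * height * channels


# A single Full HD (1920 x 1080) colour image, used here only as an example.
full_hd_values = pixel_values(1920, 1080)  # 6220800 values
```

That is over six million numbers for one picture, before the network has performed a single operation on them.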
Now that we have explained the workings of a simple Neural Network, the next step is to understand what a Convolution is. Simply put, a Convolution (denoted by the symbol ‘*’) is a mathematical transformation of data. Convolution provides a way of 'multiplying together' two 2-dimensional sets of numbers. In the Convolutional Neural Networks used for computer vision, a 2-dimensional image is convolved with a much smaller 2-dimensional set of numbers (often called a filter or kernel) that slides across the image. The input data (pixels) undergoes this transformation (the Convolution) before it proceeds to the next steps of the Neural Network.
In simple terms, a CNN has the ability to learn the important features of an image in order to recognise it. This is similar to how human vision works. For instance, if you give a computer a thousand images of, let’s say, a tomato, a CNN can extract the relevant information related to “tomato-ness”. As a result, no human has to sit down and hand-craft a description of “tomato-ness” for the computer; CNNs learn this from the data provided. However, for this to work, a huge data-set has to be provided to the machine. If one is available, you can train a classifier to identify whether or not there is a tomato in the image; you can also do object detection, which will tell you where in the image the tomato is, or put boxes around the ten tomatoes present in a specific image. Here the “tomato” is just an example; CNNs work in the same way with all other concepts as well.
These are things that we, as humans, do subconsciously. But for a machine to be able to do the same, this complex mechanism has to work in the background.
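As a rough illustration of this 'multiplying together', the sketch below slides a small 2 x 2 grid of numbers (a kernel) over a tiny 4 x 4 "image" and sums the element-wise products at each position. Two assumptions are worth flagging: the kernel is not flipped, which is the convention most deep learning libraries follow (strictly speaking a cross-correlation), and the kernel values are hand-picked here, whereas a CNN learns them from data.

```python
def convolve2d_valid(image, kernel):
    """Slide the kernel over the image; no padding, stride of 1, no flip."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(kernel_h):
                for dj in range(kernel_w):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# Toy image: dark on the left (0), bright on the right (1).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked kernel that responds to a dark-to-bright vertical edge.
kernel = [
    [-1, 1],
    [-1, 1],
]
edges = convolve2d_valid(image, kernel)
# edges == [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The output is large only in the middle column, exactly where the image changes from dark to bright. In this sense a kernel acts as a small feature detector, and a CNN combines many of them.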
Timeline of the development of computer vision
The development of computer vision technology has been a lengthy and complex process. Now that we are somewhat familiar with the nitty-gritty of how computer vision actually works, we thought we should map out the major milestones that we think have led us to where we are with this technology today:
Due to its rapid advancement, computer vision technology has been adopted by a number of sectors in today’s market.
Computer vision: Applications
Computer vision eases the way in which visual content is sorted and analysed, and can be applied in many fields:
Press and Broadcasting: Computer vision is redefining the way in which journalists and broadcasters work with visual media. By taking over the cumbersome processes of sifting through tons of visual data, computer vision helps them focus on the human side of stories and thus enables them to deliver more relevant journalistic content.
Stock Photography: Computer vision not only detects the content of an image, but can also grade and tag it. The aesthetic score of an image can be determined in a matter of seconds using computer vision, which helps marketing, advertising or editorial departments select the most pleasing photographs to use.
Digital Marketing: Computer vision uses deep learning techniques to analyse, filter and enhance visual content depending on parameters set by the user. The software can filter out low-quality images, refine images using social trends and audience sentiment, scrutinize thousands of video clips to provide relevant recommendations, and even flag/block inappropriate content. The technology can also be trained to match influencers with brands in order to develop and grow new client bases. Creative agencies also benefit from computer vision: the technology digitally analyses massive volumes of photography and visual content to provide brand-friendly imagery, which dramatically reduces the many hours of manual searching for suitable materials.
Mobius Labs in computer vision: Superhuman Vision for every application
Computer vision has made major leaps in the last few years. We at Mobius are going a step further: it’s no longer just computer vision, it’s Superhuman Vision.
Superhuman Vision means instantly transforming thousands of images and videos into insight, experience, and advantage. Our technology supercharges visual searches by understanding everything about an image or a video, and automatically creating metadata for it.
Superhuman Vision also means giving edge devices (like laptops and smartphones) the ability to instantly understand and analyze images, video footage, and real-life scenarios.
We strive to create cutting-edge technology that empowers anyone to interact with media in groundbreaking ways.
This is computer vision that non-techies can train using very little data, deploy to edge devices, and use to help their businesses seize markets where there are images or videos to analyze.
Optimised for the toughest computer vision challenges
Unlike those before us, we are ready for the world’s most difficult visual problems:
- When there is very little training data available
- When the data is noisy
- When the use case is very private
- When performance cannot be compromised
- When complex solutions are to be run on the edge
- When videos and images are processed together
- When there are no AI or data science experts involved
We at Mobius are attempting to make a mark in this technology and compete against tech giants that have thousands of researchers and scientists working for them. Our compact science team is mindful of our resources, and is developing well-performing and efficient architectures. Compared to those at bigger companies, our scientists have less data available to train machines to see, which pushes them to develop ways in which our computer vision solutions can work efficiently even with small data-sets. We root ourselves in thorough scientific research and ensure that our computer vision technology prospers in an ethical manner.
We believe that Vision is a power. Our goal is to add Superhuman Vision to any application, device or process, thus giving you an unassailable competitive advantage.
*It is worth noting that there are many computer vision systems that use classical geometry or mathematical techniques and do not require any machine learning. For example, calibrating the views from two cameras involves purely geometric and photometric operations.
**Cooking analogy of Data Learning has been adapted from here.