Visual content has been a part of our lives from the very beginning. Thousands of years later, how far have we come in analysing images and videos? Read on to find out.
Sight and visuals are probably among the earliest and most basic forms of human interaction. Visual communication, in the form of astonishing rock and cave paintings, can be traced back more than 40,000 years. These visuals are a testament from our ancestors that images can convey deep meaning, both intellectual and visceral, often encouraging us to pause and parse what they mean. In fact, visual communication even precedes the use of ideograms, which eventually led to the development of alphabets as we know them today.
This legacy of interaction through visuals has only grown with time. In the modern age of Instagram and TikTok, some of our most emotional and impactful stories are still transmitted via still as well as moving images. However, what sets the last decade or so apart from the rest of human history is that we are living in a paradoxical era: we produce far more visual content than we consume. To put it bluntly, we are drowning in visuals. A telling example is what the editors of the New York Times Magazine faced in 2019: journalists had to sift through over 500,000 photos and narrow them down to the 116 pictures that would finally be shown to their readers.
Humans have always aspired to impart human or superhuman qualities to the machines and devices that they build. The first tools we built long ago were weapons and agricultural tools that amplified our mechanical and kinetic abilities. As our civilizations evolved and our mathematical and computational knowledge grew, we aspired to make superhuman versions of the various aspects of our intelligence. This aspiration was probably the origin of the desire to make computers see. Let us not forget that the very genesis of the field of computer vision was an underestimation of colossal proportions. Computer vision research started out as a summer project at MIT, and was supposed to be solved within that same year! This was definitely not the case: even after thousands of PhD theses have been written on the subject, we still occasionally struggle to achieve its goal. For a long time we struggled to identify even simple objects reliably, but this has changed drastically in the last couple of years.
One of the core observations of computing, if not the core one, is Ada Lovelace's profound realization that if we encode ideas as mathematical symbols whose interactions can be represented by a set of operations (a.k.a. algorithms), we can use computation to understand or synthesize them further. For instance, a visual represented as a set of pixels is nothing more than a set of numbers. The ability to operate on these numbers gives rise to machine vision, editing, and the like. The core spirit that she articulated during her work was, "I want to find the calculus of the nervous system". This is something that still motivates countless AI researchers across the globe. In fact, the most powerful server that our researchers use to train our models, a machine with 16 GPUs, is called Ada.
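To make the "pixels are just numbers" idea concrete, here is a minimal sketch in plain Python. The tiny 3x3 grayscale image and the helper names `invert` and `brighten` are made up for illustration; real images are far larger and are usually handled with dedicated libraries, but the principle is the same: editing an image means doing arithmetic on its numbers.

```python
# A tiny 3x3 grayscale "image": each pixel is an intensity
# from 0 (black) to 255 (white). The values are illustrative only.
image = [
    [0, 128, 255],
    [64, 128, 192],
    [255, 128, 0],
]

def invert(img):
    """Return the photographic negative: each pixel p becomes 255 - p."""
    return [[255 - p for p in row] for row in img]

def brighten(img, amount):
    """Add a constant to every pixel, clamping results to the 0..255 range."""
    return [[min(255, max(0, p + amount)) for p in row] for row in img]

negative = invert(image)
brighter = brighten(image, 100)

for row in negative:
    print(row)
```

Inverting twice gives back the original image, which is a small reminder that these edits are ordinary, well-behaved operations on numbers.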
Pursuing this core human instinct and drive, we at Mobius Labs aspire to impart the magic of superhuman vision to machines. This is a faculty which, if imparted to machines, can discover and explore the trillions of visuals that exist in our current world and the quadrillions of visuals we are yet to produce in the upcoming years. As a viewer you might be deeply touched by images that tell the human stories defining the unfortunate pandemic we are living through. A superhuman visual system should be able to identify, from a collection of millions of visuals, the images that match the meaning in your mind, and then present you with the most relevant and poignant content that you might otherwise have missed. Finding the optimal recipe for this is what we are working hard on in computer vision. We are specifically focussing on how algorithms can best help us explore and uncover the core story and meaning embedded in visuals.
At the core of most computer vision methods is the art and science of arriving at optimal feature representations. Usually a feature representation is a set of numbers that summarises a visual idea. For example, our brain is able to figure out what a cat looks like, irrespective of the breed of the cat, where the cat is located, and what cute or mischievous action it is up to.
However, if you look at the pixel values of these images, they vary drastically from image to image. An optimal feature representation should be invariant to this variability and allow us to arrive at the right answer. Some of the most powerful models that are able to compute these feature representations, given a visual input, are deep neural networks (often convolutional neural networks for visual data). These networks, when trained with a large amount of training data containing pairs of images and labels (i.e. someone telling the network explicitly that this image is a cat), are able to map an unseen image of a cat to a feature representation that captures "cat-ness". This idea is what has powered the resurgence of computer vision in the past few years.
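The gap between raw pixels and an invariant feature can be illustrated with a toy sketch in plain Python. Note the assumptions: this uses a hand-crafted intensity histogram as a stand-in feature, not a learned representation from a deep network, and the helper names (`make_image`, `pixel_distance`, `histogram_feature`) are hypothetical. The same white square is placed in two different corners of an image; the pixels disagree, but the feature does not.

```python
def make_image(top, left, size=8):
    """Build a size x size black image with a 2x2 white square at (top, left)."""
    img = [[0] * size for _ in range(size)]
    for r in range(top, top + 2):
        for c in range(left, left + 2):
            img[r][c] = 255
    return img

def pixel_distance(a, b):
    """Count how many pixel positions differ between two equally sized images."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def histogram_feature(img, bins=4):
    """Summarise an image as the number of pixels in each intensity bin."""
    counts = [0] * bins
    for row in img:
        for p in row:
            counts[min(bins - 1, p * bins // 256)] += 1
    return counts

square_top_left = make_image(0, 0)
square_bottom_right = make_image(6, 6)

# The raw pixels differ, but the histogram ignores where the square is.
print(pixel_distance(square_top_left, square_bottom_right))
print(histogram_feature(square_top_left))
print(histogram_feature(square_bottom_right))
```

A histogram is invariant to position only because it throws away all spatial information, so it could not tell a square from a line with the same number of white pixels. Learned representations aim to be invariant to the variations that do not matter while keeping the ones that do.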
And this is merely scratching the surface. We are already on the path to making some wonderful discoveries, and the next few years will be golden years for computer vision.
Seeking answers with Computer Vision
We strive to answer a number of questions at Mobius Labs:
- What is an optimal feature representation for various acts of visual perception? For example: what might be a good feature representation to understand human expressions?
- Humans are amazing learning machines. We learn new things with hardly any supervision, and from visual signals. The example can be as simple as a child learning how to put on their shoes: you don’t need to dictate each and every step of the process to them; they watch and observe their parents, sometimes make mistakes (left shoe on right foot!), but in the end learn how to do it by themselves. Can machines similarly learn with very little or no supervision from potentially noisy data?
- How can we make our models efficient and lightweight, so that they can handle billions of visuals without consuming too much power?
- How can we make our technology accessible to our clients, so that they can build applications which can solve their previously unsolved problems, and unlock tremendous business value?
- What is the next generation of computer vision applications that can leverage the technology we are building?
It is an ambitious journey, and we constantly encounter false starts and dead ends, both within our company and among the wider community of computer vision enthusiasts, researchers, and scientists. Even so, the growth of the field, especially in the last decade or so, has been nothing short of remarkable. With the availability of data and faster computers, machine learning techniques have started to work surprisingly well. The bottom line is that the thrill of finding a solution that advances towards the end goal gives us such an enthralling high and such unbelievable energy that the low moments in the journey of computer vision technology are quickly forgotten.
Our company is rooted in scientific principles: probing the unknown with honesty and courage. A core element of what makes us tick as a community is that once a discovery or a personal revelation is made, we openly share it with the world. This lets the idea have a life of its own, regardless of whether it is short-lived or ends up being relevant across generations. This dynamic is what makes science move forward and enables us to prosper as a company. This blog is our attempt to share with the rest of the community the deeply personal and, from our perspective, meaningful work that we do here at Mobius Labs.