Automatic Facial Expression Analysis and Customized Expression Tagging

Introduction

Facial expressions are one of the most important forms of non-verbal communication between humans. As Alan Fridlund, a psychology professor at the University of California, Santa Barbara, puts it: “our faces are ways we direct the trajectory of a social interaction”. Given their importance in our day-to-day lives, it stands to reason that humans can benefit from a system that automatically recognizes facial expressions. In this post, we describe how we tackled this challenging task at Mobius Labs.

While the terms are often used interchangeably, it is important to highlight that facial expressions and emotions are not the same thing. Emotions are things that we feel, caused by signals travelling along neural pathways in our brains. Facial expressions refer to the positions and motions of the muscles beneath the skin of a face. They can be an indicator of the emotion we are feeling, but they do not always reflect it. For example, we can be smiling while being in an emotional state that is far from happy.


Rather than training a system that recognizes a limited set of emotions, we decided to train one that distinguishes facial expressions.

For the above reasons, we decided to train a system that can distinguish facial expressions, rather than teaching a machine to classify a limited set of “emotions”. As we will see, this design choice allows us to easily add new facial expression tags to the system and to distinguish facial expressions at a very fine-grained level.

It is worth highlighting that the ability to distinguish facial expressions goes beyond classifying them into a set of emotion — or rather facial expression — tags. Other use cases include searching for a specific expression by providing a sample face, or photo gallery summarisation, where for each person in the gallery we want to show a range of facial expressions, as shown below.


In this example, around half of the faces would be tagged as “happy”, and the system would have a hard time telling apart different levels of happiness, so it would probably only show one “happy face” in the summary. By teaching a system what similar facial expressions look like, i.e. by assigning them a distance that reflects their similarity, we are able to differentiate much more fine-grained levels of facial expressions, and hence show a range of happiness in the facial expression summary.

Distance Between Two Facial Expressions

More formally, our goal is to learn a feature embedding network F(x) which takes as input a face image x and outputs a facial expression embedding e, such that the distance between two embeddings coming from faces with similar expressions is small, whereas the distance is large for faces with different expressions. The figure below gives an illustrative example:

In this example, the distance of the facial expression embedding between the image on the left and the one in the middle should be (much) smaller than the one between the right image and the other two.

Let us now have a look at how we trained the facial expression embedding.

Training the Facial Expression Embedding

In the following, we give an overview of the key ingredients of how we trained the facial expression embedding: the dataset, the architecture, and the loss function.

Dataset

In order to train such a facial expression embedding (FEE), one has to provide a large number of annotated samples so that the CNN can learn which features to extract. Luckily, there is an excellent facial expression dataset out there called Facial Expression Comparison (FEC). In this dataset, we are given triplets of face images (like the ones above), where at least six human annotators had to select the face whose facial expression is most dissimilar to the other two faces of the triplet. The images of these triplets were carefully sampled from an internal emotion dataset that contains 30 different emotions; see [1] for more details.

A note on the dataset size: the original training dataset contains around 130K faces, which are combined into a total of 360K triplets where at least 60% of the raters agreed on the annotation. At the time we downloaded the dataset, only around 80% of it was still accessible.

Architecture

As with other tasks that focus on faces, the first step of our processing pipeline is to detect the faces in an image. After this, we use a facial landmark detector (in our case, RetinaNet [2]) to extract the locations of the eyes, the nose tip, and the two corners of the mouth. Much more sophisticated landmark extractors that provide over 60 landmarks exist, but we found this one to be sufficient, as we only use the landmarks to align the face to a reference coordinate system; essentially, we undo the rotation, scale the face such that the inter-ocular distance (distance between the eyes) is 70px, and crop the face to 224x224 pixels.


The input to our facial expression embedding network is an aligned face, which we obtain by extracting facial landmarks (green dots) and then applying an affine warp to reference coordinates.
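To make the alignment step concrete, here is a minimal sketch using OpenCV. The 70px inter-ocular distance and the 224x224 crop are the values mentioned above; the remaining reference landmark positions are illustrative assumptions, not the exact values used in our pipeline.

```python
import cv2
import numpy as np

# Hypothetical reference positions (in a 224x224 crop) for the five landmarks:
# left eye, right eye, nose tip, left mouth corner, right mouth corner.
# Only the 70px inter-ocular distance and the crop size come from the post;
# the vertical placement is an assumption for illustration.
REFERENCE_LANDMARKS = np.float32([
    [77, 90],    # left eye
    [147, 90],   # right eye (147 - 77 = 70px inter-ocular distance)
    [112, 130],  # nose tip
    [87, 165],   # left mouth corner
    [137, 165],  # right mouth corner
])


def align_face(image: np.ndarray, landmarks: np.ndarray) -> np.ndarray:
    """Warp a face so its landmarks match the reference layout, then crop to 224x224.

    `landmarks` is a (5, 2) array of (x, y) points from the landmark detector.
    """
    # Estimate a similarity transform (rotation + uniform scale + translation)
    # that maps the detected landmarks onto the reference landmarks.
    matrix, _ = cv2.estimateAffinePartial2D(
        landmarks.astype(np.float32), REFERENCE_LANDMARKS, method=cv2.LMEDS
    )
    return cv2.warpAffine(image, matrix, (224, 224), flags=cv2.INTER_LINEAR)
```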

This aligned face is then fed into a convolutional neural network (CNN), which extracts a feature vector. We played around with a variety of CNN architectures and found that the recently proposed EfficientNet [3] performs best as the backbone architecture.

Proposed FEENet, which uses EfficientNet-B0 as backbone, followed by two fully-connected (FC) layers, as well as L2-normalization.

In the case of EfficientNet-B0, the output is a 1280-dimensional feature vector. This feature vector is then passed through two fully-connected layers (FC), reducing the dimensions down to just 16. Lastly, we apply L2-normalization to obtain the facial feature embedding.
The resulting FEENet architecture is very simple and lightweight, requiring just 4.7M parameters.
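For illustration, here is a minimal PyTorch sketch of such an architecture, using the timm library for the EfficientNet-B0 backbone. The hidden width of the first FC layer and the ReLU between the two FC layers are assumptions; only the 1280-dimensional backbone output and the 16-dimensional, L2-normalized embedding are stated above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import timm  # assumption: timm provides the EfficientNet-B0 backbone


class FEENet(nn.Module):
    """EfficientNet-B0 backbone, two fully-connected layers, and L2-normalization."""

    def __init__(self, embedding_dim: int = 16, hidden_dim: int = 512):
        super().__init__()
        # num_classes=0 makes timm return the 1280-d pooled feature vector.
        self.backbone = timm.create_model(
            "efficientnet_b0", pretrained=True, num_classes=0
        )
        self.fc1 = nn.Linear(1280, hidden_dim)  # hidden_dim is an assumption
        self.fc2 = nn.Linear(hidden_dim, embedding_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = self.backbone(x)              # (batch, 1280)
        features = F.relu(self.fc1(features))    # first FC layer
        embedding = self.fc2(features)           # (batch, 16)
        return F.normalize(embedding, p=2, dim=1)  # unit-length embedding


# Usage: embeddings for a batch of aligned 224x224 face crops.
model = FEENet()
faces = torch.randn(4, 3, 224, 224)
embeddings = model(faces)  # shape (4, 16), each row has unit L2 norm
```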

Loss Function

We train the facial expression embedding network using a triplet loss function L(a, p, n) [4], which encourages the distance between the two more similar facial expressions (denoted as anchor a and positive p) to be smaller than the distance of these two to the third facial expression of the triplet (denoted as negative n). We can write the triplet loss function as follows:

L(a, p, n) = max(0, ‖F(a) − F(p)‖² − ‖F(a) − F(n)‖² + θ) + max(0, ‖F(a) − F(p)‖² − ‖F(p) − F(n)‖² + θ)

where F(x) is a deep neural network which in our case outputs the facial expression embedding vector, and θ is the margin (which we empirically set to 0.2).
Illustration of the goal of training with the triplet loss, which allows us to train an embedding where the embedding distance between more visually similar facial expressions is small.

The figure above illustrates the main idea behind the triplet loss, where we use the following simplified notation for the distance:

d(x, y) = ‖F(x) − F(y)‖²

On the left, we see a possible situation for the distances of one particular triplet (there are hundreds of thousands of such triplets in the training set) before training, where the distances between the anchor a, the positive p, and the negative n are all very similar. Using the above triplet loss function, we can ensure that the distance between the anchor and the positive, denoted as d(a, p), becomes much smaller than d(a, n) and d(p, n).
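A minimal PyTorch sketch of this two-term triplet loss, applied to the (L2-normalized) embeddings of a batch of annotated triplets; this is a sketch of the loss as described here, not our actual training code.

```python
import torch


def fec_triplet_loss(anchor, positive, negative, margin: float = 0.2):
    """Two-term triplet loss: the anchor-positive distance should be smaller
    than both the anchor-negative and the positive-negative distances.

    All inputs are (batch, 16) L2-normalized embeddings.
    """
    d_ap = (anchor - positive).pow(2).sum(dim=1)    # squared distance a-p
    d_an = (anchor - negative).pow(2).sum(dim=1)    # squared distance a-n
    d_pn = (positive - negative).pow(2).sum(dim=1)  # squared distance p-n

    loss = torch.relu(d_ap - d_an + margin) + torch.relu(d_ap - d_pn + margin)
    return loss.mean()
```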


How Well Does it Perform?

This section shows quantitative results for the architecture presented in the Architecture section. As mentioned earlier, at the time we downloaded the dataset, only around 80% of the FEC dataset was still available; this has to be taken into account when comparing our results with those of FECNet [1], which had access to the whole dataset for training.

The chart below shows the triplet prediction accuracy (i.e., the percentage of triplets where the distance between the anchor and the positive is smallest) of the proposed FEENet and compares it to FECNet from Google AI [1]; in addition, we show the average performance of the human annotators who created the ground truth labels for the FEC dataset.

Triplet prediction accuracy for Google AI’s FECNet and the proposed FEENet. On the right, we further show the average accuracy of the human annotators of the FEC dataset.
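For reference, a small sketch of how the triplet prediction accuracy can be computed from the embeddings of the annotated triplets (the function below is illustrative and uses our own naming):

```python
import torch


def triplet_prediction_accuracy(anchor, positive, negative) -> float:
    """Fraction of triplets where the anchor-positive distance is the smallest
    of the three pairwise distances, i.e. the model agrees with the annotators."""
    d_ap = (anchor - positive).pow(2).sum(dim=1)
    d_an = (anchor - negative).pow(2).sum(dim=1)
    d_pn = (positive - negative).pow(2).sum(dim=1)
    correct = (d_ap < d_an) & (d_ap < d_pn)
    return correct.float().mean().item()
```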

Somewhat surprisingly, the proposed FEENet has a 2.7% improvement in triplet prediction accuracy over FECNet [1], despite the simpler architecture (4.7M versus 7M parameters).

The proposed FEENet model achieves an accuracy very close to that of human annotators.

To put these results in context, it is worth highlighting that the annotators who created the ground truth annotations for the FEC dataset have an average triplet prediction accuracy of 86.2%. In other words, the model we trained almost reaches human performance on this challenging task.

Applications

Now to the most fun part — Applications. The list below is by no means complete, but should give the reader an idea of the versatility of the facial expression embeddings. If you have a specific application in mind that is not listed below, please visit our website (linked at the end of this article) to get in touch with us — we are always happy and excited to test out new things.

Finding the Most Similar Facial Expressions (Search)

Perhaps the most obvious application of the facial expression embeddings is to use them to find similar facial expressions in a database of images. This is particularly useful as often, facial expressions are difficult to describe in words; instead, one simply provides an image containing a face with the desired expression.
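Since the embeddings are short (16-dimensional), L2-normalized vectors, such a search boils down to a nearest-neighbour lookup in the embedding space. A minimal numpy sketch (for large databases one would typically use an approximate nearest-neighbour library instead):

```python
import numpy as np


def search_similar_expressions(query: np.ndarray, database: np.ndarray, top_k: int = 10):
    """Return indices of the `top_k` faces whose expression embeddings are
    closest (in Euclidean distance) to the query embedding.

    query:    (16,) embedding of the query face
    database: (N, 16) embeddings of the faces in the database
    """
    distances = np.linalg.norm(database - query[None, :], axis=1)
    return np.argsort(distances)[:top_k]
```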

Below are a few examples. For each row, the leftmost face is the “query” face, and the others are the best matches from the FEC validation set (22K faces).

In each row, the leftmost face is the "query" face. The other faces are the Top 10 results for the query, selected out of the 22,000 images of the FEC validation set. Bonus: it also works remarkably well with cartoons/illustrations, as shown in the last row.

Instead of searching “just” for the best matching facial expressions in the whole database, one can also first narrow down the database to a specific person, and then search for matching facial expressions with images of a different person. In the example below, we collected 130 images of Donald Trump and extracted their facial expression embeddings. We then searched with images of Angela Merkel to find the best matching facial expressions. Note that due to the limited size of the search database, we only show the Top 3 matches here.

Some examples of matching facial expressions of different people. First column: Query face, Columns 2-4: Top 3 closest matches.

Facial Expression Summary

The facial expression embeddings can also be used to create facial expression summaries of specific people. That is, given a set of photos of the same person, one can find the dominant facial expressions. One way of creating such summaries is to use some form of clustering in the facial expression embedding space. For the examples below, we selected around 150 samples per celebrity, used simple K-Means with K=8 clusters, and show for each cluster the face that is closest to the cluster center.

Facial expression summaries for Donald Trump, Angela Merkel, Serena Williams, and Roger Federer.
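A minimal scikit-learn sketch of this summarization step, given the expression embeddings of one person's photos (K=8 as in the examples above):

```python
import numpy as np
from sklearn.cluster import KMeans


def expression_summary(embeddings: np.ndarray, n_clusters: int = 8):
    """Cluster a person's expression embeddings and return, for each cluster,
    the index of the face closest to the cluster center.

    embeddings: (N, 16) expression embeddings of one person's photos
    """
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)
    summary_indices = []
    for center in kmeans.cluster_centers_:
        distances = np.linalg.norm(embeddings - center[None, :], axis=1)
        summary_indices.append(int(np.argmin(distances)))
    return summary_indices
```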

Facial Expression Tags

The learned embeddings also lend themselves to classifying faces into different types of facial expression tags. In order to get a sense of what the learned features can encode, we ran an off-the-shelf K-Means clustering algorithm on all the faces in the FEC validation dataset (around 22,000 samples). We played around with the number of clusters K and found K=100 to give good results.

Let us have a look at some of the clusters the facial expression model is able to distinguish. One of our favourite examples is how it is able to pick up different levels of happiness. Below we show the 15 faces from the FEC validation dataset that are closest to different cluster centers (as obtained using K-Means with K=100 clusters):

Neutral
Very light smile
Light smile
Smile with teeth
Big smile with teeth
Big smile with mouth open
Ecstatic smile

Note how all the above facial expressions would most likely be given the tag “happy” — with the learned embedding, we are able to go into much more fine-grained levels of happiness, which should make everyone happy.

The facial expression embedding is able to encode fine nuances, which allows us to assign very fine-grained tags to the clusters.

Adding New Facial Expression Tags (Customization)

As mentioned earlier, the proposed system makes it very easy to add new tags. All that has to be done is to provide at least one face along with the tag(s) that describe the facial expression. This is drastically different from existing technologies, which are trained on a limited set of predefined emotions (typically fewer than ten) and cannot take on new (emotion) tags without retraining the whole architecture on thousands of sample faces that show the desired emotion.

Adding a new facial expression tag is as easy as providing an image of a face that conveys that tag.
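As a rough sketch of how such customized tagging can be built on top of the embeddings: each tag is represented by the embedding(s) of its reference face(s), and a new face receives the tags of its closest reference if the embedding distance is below a threshold. The class below is illustrative, and the threshold value is an assumption.

```python
import numpy as np


class ExpressionTagger:
    """Assign custom facial expression tags via the nearest reference embedding."""

    def __init__(self, distance_threshold: float = 0.6):
        # Threshold on the embedding distance; the value 0.6 is an assumption.
        self.distance_threshold = distance_threshold
        self.reference_embeddings = []  # list of (16,) arrays
        self.reference_tags = []        # list of tag lists

    def add_tag(self, embedding: np.ndarray, tags: list) -> None:
        """Register one reference face (its embedding) together with its tags."""
        self.reference_embeddings.append(embedding)
        self.reference_tags.append(tags)

    def tag(self, embedding: np.ndarray) -> list:
        """Return the tags of the closest reference face, if it is close enough."""
        if not self.reference_embeddings:
            return []
        references = np.stack(self.reference_embeddings)
        distances = np.linalg.norm(references - embedding[None, :], axis=1)
        best = int(np.argmin(distances))
        if distances[best] < self.distance_threshold:
            return self.reference_tags[best]
        return []
```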

As an example, let us focus on facial expressions where people have their mouth open. In the figure below, each row shows on the left a face containing the facial expression of interest, and to its right the closest neighbours of that face, which would hence be assigned the same tags.

Screaming: Eyes open, frowning
Screaming, venting: Eyes closed, teeth showing
Crying (baby): Eyes closed, no teeth showing ;)
Shocked: Eyes open, no teeth showing, not frowning
Ecstatic: Squinting, teeth showing, smiling

In the first row for example, one might provide a face, along with the tag ‘screaming’. In addition, one might want to add more tags that describe some face features, such as ‘eyes open’ and ‘frowning’ in this case.

Overall, the figure above reinforces the fact that the embedding is able to encode fine nuances, which allows us to assign very detailed facial expression tags to the clusters.
