Anonymized Synthetic Face Datasets for Training ML Models

Published on
August 11, 2020
Thor Turrecha
EVP of SaaS

Computer vision continues to advance, and as the C-suite discovers what the technology can do, many are investing in it. One global forecast expects the computer vision market to grow from $11 billion to $19 billion by 2025.

However, since the earliest AI developments, training machine and deep learning models has remained a challenge in scaling computer vision, because adequate datasets are hard to come by, especially for facial recognition.

Training our CV model

When we started building and training our vision AI model around this time last year, we also grappled with data. With object recognition as our focus, gathering physical items and pictures of items to train our model with was a huge setback.

So we trained the unconventional way – with 3D-rendered objects and environments. At this point, we’re seeing a lot of benefits, including the ability to create more variations of a single object, and do it at speed.

With the COVID-19 pandemic, we started to shift our efforts to facial recognition, particularly anonymous audience measurement, to aid in complying with safety, cleanliness, and social distancing protocols.

On top of that, it also helps collect and analyse demographic and behavioral data, which is a retailer’s gold mine. It’s the same technology that facilitates Amazon Go’s Just Walk Out Shopping, and it’s more important than ever!

There are billions of faces online

Now, we have run into the same challenge. Collecting and labelling enormous amounts of real data, specifically for face-related learning, is “laborious, expensive, and error-prone.”

An ML model’s ability to “see” and “act” on tasks concerning human faces depends largely on the quality and size of the training data, and we believe such data is scarce.

While there are datasets available online, with billions of faces scraped from multiple sources like YouTube, faces can still be difficult for the machine to comprehend.

There are factors like facial expression and color filters, among many others, that hinder the success of deep learning. Beyond these factors, privacy is a separate issue when using real data.

Face data augmentation

That being said, we’re currently exploring synthetic face datasets with our Vision AI solution, viana. In addition to data that’s readily available, we can create more face variations to train our model with. The process is called face data augmentation, and it seeks to address the lack of training data.

Learning can be improved by transforming the face with “state-of-the-art techniques” (geometric, photometric, component, and attribute), varying what real data often covers poorly, such as:

  • Lighting conditions
  • Facial expressions
  • Age
  • Pose or angle
  • Hairstyle
  • Makeup
  • Accessories (glasses, scarves)
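As a rough illustration of the two simplest categories above, here is a minimal sketch of a geometric transformation (a horizontal flip) and a photometric one (a brightness change) using NumPy. It is not viana's actual pipeline: the function names, the brightness factors, and the random array standing in for a face crop are all assumptions made for the example. Component and attribute transformations such as hairstyle, makeup, or age typically need generative models and are out of scope here.

```python
import numpy as np


def horizontal_flip(image):
    """Geometric transformation: mirror the image left to right."""
    return image[:, ::-1, :]


def adjust_brightness(image, factor):
    """Photometric transformation: scale pixel intensities.

    Values are clipped to the valid 8-bit range so bright pixels saturate
    instead of wrapping around.
    """
    scaled = image.astype(np.float32) * factor
    return np.clip(scaled, 0, 255).astype(np.uint8)


def augment(image, brightness_factors=(0.6, 1.0, 1.4)):
    """Return the original and flipped image at each brightness factor.

    The default factors are illustrative assumptions, not tuned values.
    """
    variants = []
    for base in (image, horizontal_flip(image)):
        for factor in brightness_factors:
            variants.append(adjust_brightness(base, factor))
    return variants


# Hypothetical stand-in for a real face crop: a random 64x64 RGB image.
rng = np.random.default_rng(0)
face = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

variants = augment(face)
print(len(variants))  # 2 orientations x 3 brightness levels = 6
```

Even these two simple transformations turn one image into six training samples. In practice the same idea extends to rotations, crops, and color shifts, with the caveat that each transformation should keep the face recognizable as the same identity.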

A study that reviewed face data augmentation explored the challenges and opportunities, and found that although “challenges such as identity preservation still exist, many remarkable achievements have been made.”


When training an ML model, whether for object or facial recognition, both the quantity and quality of datasets matter. Synthetic data and face data augmentation can improve facial recognition performance.

As there are both pros and cons to this approach, our team has a lot to learn from it. We will work to build solutions that help improve daily life, most especially in navigating the new normal.

Our journey of exploration towards the depths of computer vision continues…
