Anonymized Synthetic Face Datasets for Training ML Models

Penned by meldCX’s EVP of SaaS, Thor Turrecha.

Computer vision continues to advance, and as the C-suite discovers what the technology can do, many are investing in it. One global forecast even projects that the computer vision market will grow from $11 billion to $19 billion by 2025.

However, since the earliest AI developments, training machine learning and deep learning models has remained a challenge in scaling computer vision, because adequate datasets are hard to come by, especially for facial recognition.

Training our CV model 

When we started building and training our vision AI model around this time last year, we also grappled with data. With object recognition as our focus, gathering physical items and pictures of items to train our model with was a huge setback.

So we trained the unconventional way – with 3D-rendered objects and environments. At this point, we’re seeing a lot of benefits, including the ability to create more variations of a single object, and do it at speed.

With the COVID-19 pandemic, we started to shift our efforts to facial recognition, particularly anonymous audience measurement, to aid in complying with safety, cleanliness, and social distancing protocols.

On top of that, it also helps collect and analyse demographic and behavioral data, which is a retailer’s gold mine. It’s the same technology that facilitates Amazon Go’s Just Walk Out Shopping, and it’s more important than ever!

There are billions of faces online

Now we have run into the same challenge. Collecting and labelling enormous amounts of real data, specifically for face-related learning, is “laborious, expensive, and error-prone.”

An ML model’s ability to “see” and “act” on tasks concerning human faces largely depends on the quality and size of the training data, and we believe that good training data is scarce.

While there are datasets available online, with billions of faces scraped from multiple sources like YouTube, faces can still be difficult for a machine to comprehend.

Factors like facial expression and color filters, among many others, hinder the success of deep learning. Beyond these factors, using real data also raises a separate issue: privacy.

Face data augmentation

That being said, we’re currently exploring synthetic face datasets with our Vision AI solution, viana. In addition to using data that’s readily available, we can create more face variations to train our model with. The process is called face data augmentation, and it seeks to address the shortage of training data.

Learning can be improved by transforming the face with “state-of-the-art techniques” (geometric, photometric, component, and attribute) to vary factors that real data often covers poorly, such as:

  • Lighting conditions
  • Facial expressions
  • Age
  • Pose or angle
  • Hairstyle
  • Makeup
  • Accessories (glasses, scarves)
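
As a rough illustration of the two simplest categories above, the sketch below applies one geometric transformation (a horizontal flip) and one photometric transformation (a lighting change) to a face image stored as a NumPy array. It is a minimal sketch under stated assumptions, not a description of how viana is built: the function names, the parameter ranges, and the random stand-in image are all hypothetical. Component and attribute transformations such as hairstyle, makeup, or age typically rely on generative models and are not covered here.

```python
import numpy as np


def horizontal_flip(image):
    """Geometric transform: mirror the face left to right."""
    return image[:, ::-1, :]


def adjust_lighting(image, gain=1.0, bias=0.0):
    """Photometric transform: scale and shift intensities, then clip to 0..255."""
    out = image.astype(np.float32) * gain + bias
    return np.clip(out, 0, 255).astype(np.uint8)


def augment(image, rng):
    """Return one randomly transformed copy of an H x W x 3 uint8 face image."""
    out = image
    if rng.random() < 0.5:  # flip roughly half of the samples
        out = horizontal_flip(out)
    gain = rng.uniform(0.7, 1.3)     # hypothetical contrast range
    bias = rng.uniform(-20.0, 20.0)  # hypothetical brightness range
    return adjust_lighting(out, gain, bias)


# Stand-in for a real face crop (assumption: 64 x 64 RGB, uint8).
rng = np.random.default_rng(seed=0)
face = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

# Create several variations of the same face for training.
variations = [augment(face, rng) for _ in range(5)]
```

Each call to `augment` yields a new variation of the same face, which is the basic idea behind expanding a small dataset without collecting more real images.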

 
A study that reviewed face data augmentation explored its challenges and opportunities, and found that while “challenges such as identity preservation still exist, many remarkable achievements have been made.”

Conclusion

When training an ML model, whether for object or facial recognition, both the quantity and quality of datasets matter. Using synthetic data and a face data augmentation approach can improve facial recognition performance.

Because this approach comes with both pros and cons, our team has a lot to learn from it. We will work to build solutions that help improve daily life, especially in navigating the new normal.

Our journey of exploration towards the depths of computer vision continues…