Why AI needs a constant diet of synthetic data


Artificial intelligence (AI) may be eating up the world as we know it, but experts say AI itself is also starving and needs to change its diet. One company says synthetic data is the answer.

“Data is food for AI, but AI today is undernourished,” said Kevin McNamara, CEO and co-founder of synthetic data platform provider Parallel Domain, which just raised $30 million in a Series B round led by March Capital. “That’s why things are growing slowly. But if we can better feed that AI, the models will grow faster and in a healthier way. Synthetic data is like food for training the AI.”

Research has shown that around 90% of AI and machine learning (ML) implementations fail. A Datagen report from earlier this year noted that many failures are due to a lack of training data. It found that 99% of computer vision professionals say they have had an ML project canceled specifically because of insufficient data. Even projects that are not canceled outright experience significant delays that knock them off track, according to 100% of respondents.

In that vein, Gartner predicts that synthetic data will increasingly be used as a supplement for AI and ML training purposes. The research giant projects that by 2024 synthetic data will be used to accelerate 60% of AI projects.

Synthetic data is generated by machine learning algorithms that train on real data to learn its behavior patterns, then create simulated data that retains the statistical properties of the original data set. The resulting data replicates real-world circumstances but, unlike standard anonymized data sets, is not vulnerable to the same flaws as real data.
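To make the idea of "retaining the statistical properties" concrete, here is a minimal sketch in Python. It is an illustration only: instead of a trained ML model it fits a simple Gaussian to a small set of numbers, and the "real" measurements, the function names and the pedestrian-crossing framing are all assumptions made for this example, not anything taken from Parallel Domain or Datagen.

```python
import random
import statistics


def fit_gaussian(samples):
    """Estimate the mean and sample standard deviation of a numeric column."""
    return statistics.mean(samples), statistics.stdev(samples)


def generate_synthetic(samples, n, seed=0):
    """Draw n synthetic values from a Gaussian fitted to the real samples."""
    mu, sigma = fit_gaussian(samples)
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Hypothetical "real" measurements, e.g. pedestrian crossing times in seconds.
real_crossing_times = [11.2, 9.8, 12.5, 10.1, 13.0, 9.5, 11.7, 10.9]

# Generate far more synthetic values than were ever collected.
synthetic = generate_synthetic(real_crossing_times, n=10_000)

real_mu, real_sigma = fit_gaussian(real_crossing_times)
syn_mu, syn_sigma = fit_gaussian(synthetic)
print(f"real:      mean={real_mu:.2f}, stdev={real_sigma:.2f}")
print(f"synthetic: mean={syn_mu:.2f}, stdev={syn_sigma:.2f}")
```

The synthetic values are new numbers, none copied from the original set, yet their mean and spread track the real data closely. Production systems replace the Gaussian with far richer generative models, but the principle is the same.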

Taking AI out of the ‘Stone Age’

It may seem unusual to hear that a technology as advanced as AI is stuck in something of a “Stone Age,” but that’s what McNamara sees, and without the adoption of synthetic data, it will continue to be that way, he says.

“Right now, AI development is similar to computer programming in the 1960s or 1970s, when people used punch card programming, a manual, labor-intensive process,” he said. “Well, the world has finally moved away from this and into digital programming. We want to do that for AI development.”

The three biggest bottlenecks keeping AI in the Stone Age are as follows, according to McNamara:

  1. Real-world data collection, which is not always feasible. Even for something like jaywalking, which happens quite often in cities around the world, gathering the millions of examples needed to train an algorithm quickly becomes unachievable for companies.
  2. Labeling, which often requires thousands of hours of human time and can be inaccurate because, well, humans make mistakes.
  3. Iterating on the data once it’s labeled, which requires you to adjust sensor settings and so on, and then apply the result to start training your AI.

“That whole process is very slow,” McNamara said. “If you can change those things really quickly, you can actually figure out better setups and better ways to build your AI in the first place.”

Enter stage right: synthetic data

Parallel Domain works by generating map-based virtual worlds, which it calls “digital cousins” of real-world settings and geographies. These worlds can be altered and manipulated to, for example, have more jaywalking or rain, to help with autonomous vehicle training.

A sample of Parallel Domain synthetic data showing a map view of its virtual world capabilities.

Because the worlds are digital cousins and not digital twins, customization can simulate the data that is sometimes harder to obtain but essential for training, and that companies would normally have to go out and collect themselves. The platform lets users tailor it to their needs via an API, so they can move or manipulate factors the way they want. This speeds up the AI training process and removes time and labor hurdles.

The company claims that within hours it can provide training data sets that are ready for use by its customers, which include the Toyota Research Institute, Google, Continental and Woven Planet.

“Customers can go into the simulated world and make things happen or pull data from that world,” McNamara said. “We have knobs for different types of asset categories and scenarios that could occur, as well as ways for clients to wire in their own logic for what they see, where they see it, and how those things behave.”
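What "knobs" and client-supplied logic might look like can be sketched in a few lines of Python. Everything below is hypothetical: the class names, the parameters and the hook mechanism are invented for illustration and do not describe Parallel Domain's actual API.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ScenarioConfig:
    """Hypothetical knobs for a simulated street scene (not a real vendor API)."""
    num_pedestrians: int = 20
    jaywalking_rate: float = 0.05  # probability that a given pedestrian jaywalks
    rain_intensity: float = 0.0    # 0.0 = dry, 1.0 = heavy rain
    seed: int = 0


@dataclass
class Pedestrian:
    ident: int
    jaywalks: bool


@dataclass
class Scene:
    rain_intensity: float
    pedestrians: List[Pedestrian]


# Client-supplied logic: receives each pedestrian and may change its behavior.
BehaviorHook = Callable[[Pedestrian], Pedestrian]


def generate_scene(config: ScenarioConfig,
                   hook: Optional[BehaviorHook] = None) -> Scene:
    """Build one scene from the knobs, applying the optional client hook."""
    rng = random.Random(config.seed)
    pedestrians = []
    for i in range(config.num_pedestrians):
        ped = Pedestrian(ident=i, jaywalks=rng.random() < config.jaywalking_rate)
        if hook is not None:
            ped = hook(ped)
        pedestrians.append(ped)
    return Scene(rain_intensity=config.rain_intensity, pedestrians=pedestrians)


def every_fifth_jaywalks(ped: Pedestrian) -> Pedestrian:
    """Example custom logic: force every fifth pedestrian to jaywalk."""
    if ped.ident % 5 == 0:
        ped.jaywalks = True
    return ped


def count_jaywalkers(scene: Scene) -> int:
    return sum(p.jaywalks for p in scene.pedestrians)


baseline = generate_scene(ScenarioConfig())
stressed = generate_scene(ScenarioConfig(jaywalking_rate=0.5, rain_intensity=0.8))
custom = generate_scene(ScenarioConfig(jaywalking_rate=0.0), hook=every_fifth_jaywalks)
print(count_jaywalkers(baseline), count_jaywalkers(stressed), count_jaywalkers(custom))
```

Turning one dial produces a rainier scene with more jaywalkers, and the hook lets a client override behavior entirely. That is the kind of rare-event data that is hard to collect on real streets.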

Clients then need a way to pull data from that world in a configuration that matches their own setup, he explained.

“Our sensor setup tools and tag setup tools allow us to replicate the exact camera setup or the exact lidar, radar, and tagging setup that a customer would see,” he said.
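A rough sketch of the idea, again in Python: a customer describes its sensor rig and label types, and the simulated world emits frames in that shape. The class names, fields and sensor values here are assumptions chosen for illustration, not the company's real tooling.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Camera:
    name: str
    width: int
    height: int
    fov_deg: float


@dataclass(frozen=True)
class Lidar:
    name: str
    channels: int
    range_m: float


@dataclass
class SensorRig:
    """Hypothetical description of a customer's vehicle sensor setup."""
    cameras: List[Camera] = field(default_factory=list)
    lidars: List[Lidar] = field(default_factory=list)
    label_types: List[str] = field(default_factory=list)


def describe_frame(rig: SensorRig) -> Dict[str, dict]:
    """Return the layout of one exported frame: what each sensor would emit.

    In this sketch, labels are attached to camera outputs only.
    """
    frame: Dict[str, dict] = {}
    for cam in rig.cameras:
        frame[cam.name] = {
            "kind": "image",
            "shape": (cam.height, cam.width, 3),  # rows, columns, RGB channels
            "fov_deg": cam.fov_deg,
            "labels": list(rig.label_types),
        }
    for lidar in rig.lidars:
        frame[lidar.name] = {
            "kind": "point_cloud",
            "channels": lidar.channels,
            "max_range_m": lidar.range_m,
        }
    return frame


rig = SensorRig(
    cameras=[Camera("front_camera", width=1920, height=1080, fov_deg=90.0)],
    lidars=[Lidar("roof_lidar", channels=64, range_m=120.0)],
    label_types=["bounding_box_2d", "semantic_segmentation"],
)
frame = describe_frame(rig)
for sensor_name, spec in frame.items():
    print(sensor_name, spec)
```

Because the rig is just data, changing a camera's resolution or swapping a lidar model means editing a few fields and regenerating, rather than refitting a vehicle and driving the route again.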

Synthetic data, generative AI

Synthetic data is useful not only for training AI and ML models; it can also help generative AI, an already rapidly growing use of the technology, develop even faster.

Parallel Domain is eyeing the field as the company enters 2023 with fresh capital. It hopes to multiply the data available to train generative AI so that it can become an even more powerful tool for content creation. Its R&D team is focused on the variety and detail of the synthetic data simulations it can provide.

“I’m excited about generative AI in our space,” McNamara said. “We are not here to create an artistic interpretation of the world. We are here to create a digital cousin of the world. I think generative AI is really powerful at looking at image samples from around the world, then extracting them and creating interesting examples and novel information within the synthetic data. So generative AI will be a big part of the technological advances we invest in next year.”

The value of synthetic data is not limited to AI. Given the vast amount of data required to create realistic virtual environments, it’s also the only practical approach to moving the metaverse forward.

Parallel Domain is part of the fast-growing synthetic data startup sector, which Crunchbase previously reported is receiving a large amount of funding. Datagen, Gretel AI, and Mostly AI are some of its competitors that have also raised several million in the past year.

VentureBeat’s mission is to be a digital public square for technical decision makers to gain insights into transformative business technology and transact.
