Venture Deep Dives: Synthetic Data
Learn how synthetic data is changing the $387 billion AI industry

In this series, we home in on a specific element of venture capital with one of our investing experts.
Today I want to recommend our Deep Dive on synthetic data presented by Lakeshore Ventures Managing Partner Justin Strausbaugh. Justin has invested in companies across the U.S., Europe, and Israel from Seed stage to Series D and has been a board observer for 10 companies. Before venture capital, Justin traded derivatives in the global markets for eight years and then co-founded a financial markets education and research business.
Watch Justin’s Deep Dive into Synthetic Data
Justin’s extensive deep dive examines the rapid adoption of AI across the $387 billion industry and the startups producing the computer-generated datasets used to train machine learning algorithms. The field has immense potential to reshape not just the tech industry, but virtually every other market and vertical.
Want to learn more?
View all our available funds and secure data rooms, or schedule an intro call.
New to AV?
Sign up and access exclusive venture content.
Contact [email protected] for additional information. To see additional risk factors and investment considerations, visit av-funds.com/disclosures.
Transcript
Speaker 1:
Welcome to Venture Deep Dives, a weekly feature where our investing professionals highlight sectors and startups in venture capital that solve big problems through innovative and disruptive business models. In each episode, we choose a single topic to explore with a member of our investment team. I’m your host, Peter McEwen, EVP of Community at Alumni Ventures. Today I’m highlighting our exploration of synthetic data for artificial intelligence, or AI, presented by Lakeshore Ventures Managing Partner Justin Strausbaugh. Prior to venture capital, Justin traded derivatives in the global markets and co-founded a financial markets education and research business. Artificial intelligence is a complex space that encompasses many different technologies. AI is used in applications ranging from GPS navigation to autonomous vehicle guidance systems. Unlike other algorithms, machine learning AIs require large sets of training data in order to produce accurate predictions. This can be anything from vast libraries of digital images to terabytes of raw meteorological data.
Justin’s deep dive examines the rapid adoption of AI and the startups that are utilizing computer-generated data sets used to train these types of machine learning algorithms. Remember, you don’t have to be a venture capitalist to have venture capital in your portfolio. If you want to learn more about investing in exciting venture sectors like these through our firm, please visit AV VC. Thanks for listening, and please enjoy Justin’s deep dive into synthetic data.
Speaker 2:
Hi everyone. Thanks for joining us today. My name is Justin Strausbaugh. I’m the managing partner of Lakeshore Ventures, our UChicago Alumni Fund. Today’s deep dive topic will be synthetic data and its importance to artificial intelligence. The reality is AI has a data problem. Anytime someone creates an algorithm, it needs training data in order to learn, and it’s not practical to either get all that data from real-world experiences or have humans manually label the data to teach a machine something a human already knows. In order to unlock some of the interesting use cases for AI, like autonomous vehicles and drug discovery, many new startups are popping up to provide synthetically derived data that can be instrumental in training today’s AI algorithms. We believe that synthetic data will be an important factor in realizing the potential of AI in the coming years.
The goal today is to help everyone understand the synthetic data opportunity and share what we’re seeing in the market. We’ll dive into some of the factors that could make it a success and highlight some of the prominent and headline-grabbing companies that are emerging.
Machine learning—or to be pedantic, supervised learning—is a form of artificial intelligence that’s used in numerous technologies, including natural language processing, computer vision, robotics, and autonomous vehicles. The “machine” in machine learning is essentially a prediction algorithm that ingests large swaths of data called training data. The goal of these algorithms is to produce an accurate prediction of output for any given input.
So for example, if I wanted a computer vision algorithm to be able to detect a car, I would need to train that algorithm on thousands of photos of cars in order to ensure its ability to accurately predict the existence of a car. In order to train an ML algorithm, I need relevant training data—and a lot of it. This data also must be cleaned, annotated, and organized before it can be effectively utilized. The reason is that a supervised learning algorithm cannot learn what you do not explicitly teach it.
Therefore, there’s no room for subtlety and nuance. If you want the algorithm to recognize an arbitrary car, it needs to be trained on a highly accurate and extensive dataset. To emphasize just how important training data is to the accuracy of a machine learning model, it’s estimated that 80% of the time spent building these models goes to the labeling and management of training data.
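To make that concrete, here is a minimal supervised learning sketch in Python. It uses scikit-learn and its bundled handwritten-digit images purely as an illustration (neither is mentioned in the talk): a model is fit on labeled examples and then asked to predict labels for images it has never seen.

    # A minimal supervised learning sketch: the "machine" is a prediction
    # algorithm fit on labeled training data, then asked to predict labels
    # for inputs it has never seen. scikit-learn's bundled digit images are
    # a stand-in for the labeled car photos in the example above.
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_digits(return_X_y=True)        # images as pixel arrays, plus human-provided labels
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = LogisticRegression(max_iter=5000)  # the prediction algorithm
    model.fit(X_train, y_train)                # learns only from the labeled examples it is shown

    print("accuracy on unseen images:", model.score(X_test, y_test))

The point of the sketch is simply that the model’s accuracy is bounded by the quantity and quality of the labeled examples it is given, which is exactly the bottleneck described above.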
Some startups like Scale AI have had great success meeting this need in the AI market. These companies effectively work as outsourced training data managers. They clean, structure, label, and package large data sets through API calls and click farms that employ thousands of workers, often in low-cost labor geographies. Companies send these click farms their raw data sets, which could be millions of photos of houses, for example, and these companies go through each photo and accurately label the houses so that the end users (the software engineers) can train their machine learning algorithms effectively.
As you might imagine, the manual labeling of these massive data sets is extraordinarily time-consuming and resource-draining. And Andreessen Horowitz, one of the most well-known VCs in the world, believes that this is one of the largest barriers to the adoption of artificial intelligence.
The main problems revolve around three key points. First, whether you like it or not, increasingly restrictive data privacy laws in the EU and the US make training data harder to obtain. There’s reason to believe that countries like China, which take a different approach to data privacy and control, will have an advantage in collecting data. Second, the sheer amount of data that’s required can be overwhelming and hard to source. Think about driverless cars trying to travel millions of miles and experience every edge case before going into production. Lastly, there’s an inherent bias when humans are labeling data sets. In many instances, it can be hard to establish the ground truth because the subjectivity of the human labeler often creeps in. We see this a lot with disinformation and fake news detection use cases.
Not surprisingly, these issues result in faulty AI projects. Roughly 85% of them deliver incorrect outcomes, much of that related to bias in the data. Not to mention, the labeling and maintenance of training data is directly impacting the P&L of AI companies simply looking to build great products. Unfortunately, in many use cases, the process of manually labeling data sets is simply untenable from a time, financial, and resource perspective for companies.
Now that we’ve talked a bit about the problem, let’s spend a few minutes talking about a potential solution to the AI training data bottleneck. That solution is often referred to as synthetic data. Synthetic data is computer-generated images and data sets that can substitute for real data. These images and data sets have approximately the same statistical and mathematical properties as the real-world data they mirror, but don’t come with the manual processing and collection problems of real-world data sets that we talked about earlier.
Developers can apply their machine learning algorithms to these virtual data sets in the same way they would real-world data sets, but with a great reduction in time and cost. Synthetic data sets come with fully formed inputs and outputs. The data is typically free of any regulatory and compliance issues as well as human bias, and the data can also be heavily customized to fit the customer or developer’s needs. We believe there’s a ubiquitous need for this type of training data across nearly every vertical, and in fact, industry experts like Gartner believe that synthetic data will be the primary tool used to train machine learning algorithms in the future.
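As a toy illustration of that idea, the sketch below fits the statistics of a stand-in “real” dataset and then samples synthetic records that mirror them. A simple multivariate Gaussian is used for brevity, and the column names and numbers are invented for the example; commercial synthetic data products rely on far more sophisticated generative models.

    # Toy illustration of synthetic tabular data: estimate the statistical
    # properties of a stand-in "real" dataset, then sample new records that
    # mirror them without copying any real row.
    import numpy as np

    rng = np.random.default_rng(42)

    # Pretend this is sensitive real-world data: customer age and account balance.
    age = rng.normal(40, 12, size=5_000)
    balance = 200 * age + rng.normal(0, 3_000, size=5_000)
    real = np.column_stack([age, balance])

    # "Fit" the generator: the mean vector and covariance matrix of the real data.
    mean, cov = real.mean(axis=0), np.cov(real, rowvar=False)

    # Sample synthetic records with approximately the same statistics.
    synthetic = rng.multivariate_normal(mean, cov, size=5_000)

    print("real means:     ", real.mean(axis=0))
    print("synthetic means:", synthetic.mean(axis=0))
    print("real correlation:     ", np.corrcoef(real, rowvar=False)[0, 1])
    print("synthetic correlation:", np.corrcoef(synthetic, rowvar=False)[0, 1])

In practice, vendors layer privacy guarantees and much more expressive models on top, but the fit-then-sample pattern is the same.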
Since we believe there’s a universal need for synthetic data in the world of artificial intelligence, I wanted to call out a few industries on the next two slides that could especially benefit from more efficient dataset procurement and synthetic data. Those industries are autonomous vehicles, retail, and financial services.
In the world of autonomous vehicles, synthetic data will be used to model every scenario imaginable in order to create exhaustive virtual worlds for the vehicles to train on. As you can imagine, waiting for a car to experience a crash before learning how to avoid an unexpected object is simply unacceptable. With these rich synthetic data sets to train on, the computer vision in autonomous vehicles will be better able to recognize erratic vehicle and pedestrian behaviors and react accordingly.
Synthetic data could also be applied to retail. Retailers could use synthetic data to train models on shopper behavior while still respecting user privacy, ultimately leading to the implementation of automatic checkout systems and better inventory management. Lastly, financial services firms have strict guidelines in place for protecting customer information, but need access to data sets to help them with use cases like fraud and money laundering detection. So synthetic data is a natural fit for modeling out potential scenarios when it comes to risk management, pricing, and credit decisions.
To extend our example to a few more verticals, here are some additional use cases for synthetic data. In the healthcare space, synthetic data can create digital twin data sets that merge and anonymize disparate data typically locked behind red tape in highly regulated healthcare organizations. If doctors, scientists, and engineers had access to simulated patient data, they could expedite AI applications such as reading X-rays, predicting disease progression, and assessing drug efficacy across different patient populations. The healthcare applications seem vast.
In our current crazy world of supply chain disruptions, AI models could use synthetic data to introduce black swan events to companies’ planning algorithms in order to better anticipate unforeseen challenges and plan for remediation. And lastly, telecom companies are increasingly using digital twins and synthetic data to analyze variations in placement, equipment, and protocols for new 5G deployments.
We’ve talked about the problem with AI data sets, the potential solution, and some use cases, but what is the data saying about adoption and utilization? On the slide, you’ll see a few key statistics to help us understand where things are headed for this space.
First off, more than half of enterprises accelerated their adoption of AI plans because of COVID, and 86% say AI is becoming mainstream in their company. Secondly, according to Gartner, by 2024, 60% of the data used for development of AI and analytics projects will be synthetically generated. And lastly, the broader data annotation space is expected to grow at a compound annual growth rate of greater than 30% through 2028. These statistics point to a customer base with a growing interest in using synthetic datasets both now and in the future.
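For a sense of what a 30%+ compound annual growth rate implies, here is a quick back-of-the-envelope calculation. Only the growth rate comes from the statistic above; the 1.0x starting value and the 2022 base year are illustrative placeholders.

    # Compounding at 30% per year from an arbitrary 1.0x of today's market size.
    rate = 0.30
    size = 1.0
    for year in range(2022, 2029):
        print(f"{year}: {size:.2f}x")
        size *= 1 + rate

At that pace, the market grows nearly fivefold between 2022 and 2028.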
While predicting the future is always tough, we’d like to highlight a few things that we believe and that we’ll be looking for when it comes to potential opportunities and challenges for synthetic data startups.
In terms of success factors, any synthetic dataset should provide a stronger value proposition than real-world data, and a lot of that will come down to its ability to circumvent stringent regulatory red tape and depict black swan events. And of course, don’t forget cost and time savings, which should come naturally given the reduced need to manually label data, as we talked about earlier.
In terms of opportunities, we want to find startups that have the technical chops to create highly customizable data sets and that can show a high degree of success in at least one vertical before moving on to others. And lastly, we’ll be avoiding less pragmatic and unproven use cases such as unsupervised learning or unfocused horizontal applications.
Plus, this is a category in which to be price and valuation sensitive, since synthetic data startups typically shouldn’t command the same valuation multiples as SaaS businesses, given their lower-margin nature.
And if we extend our discussion of opportunities and challenges, there are a few things we’ll be looking for. If this is a hit, we’ll see enterprises continue to invest in AI at a rapid pace, and traditional data labeling services will remain cumbersome and challenging to scale. Regulatory headwinds around data privacy will continue to grow, making it challenging to obtain and use real-world data. Lastly, the desire from end customers for advanced technology such as autonomous vehicles and robotics will continue to grow, which we definitely expect.

If it’s a miss, consumer privacy laws would have to weaken, at least in the more open economies, making it much easier to obtain real-world data without fear of repercussions. Next, we would need to see some prominent synthetic data companies underperform and prove in market that these are unscalable businesses. Lastly, if overall funding enthusiasm for AI wanes, especially in a recession, there could be limited capital for these startups to grow.
As we were researching this space, we came across a number of interesting early-stage companies, some of whom have already raised impressive rounds. These companies span different verticals and different AI end use cases.
To start, you have AiFi, which produces synthetic data for shopper behavior at retail stores. The company is currently operational in 26 stores, and its primary use case is assisting retailers with auto-checkout operations. They recently raised a $65 million Series B round.
Another company we found is called Gretel, which is creating anonymized synthetic data sets for developers. The two main featured use cases have been genomics and financial data, which makes sense given the strict data compliance requirements in those verticals. Gretel also recently raised a large Series B.
Another one is Mostly AI, which creates synthetic customer data for the insurance and finance sectors. Their data can generate an unlimited number of realistic digital customers whose behaviors mimic real-world behaviors.
Moving to the earlier stage, we found a number of seed-stage companies also building synthetic data platforms. Anyverse is targeting the autonomous driving industry and creates pixel-accurate data that mimics exactly what sensors would see. Hazy is targeting fraud and money laundering detection as its initial use case and is already used by blue-chip customers such as Accenture, BMW, PwC, and RBS. Finally, Lexset recently raised a $4 million seed round. The company is industry agnostic, but its initial use cases have been in 3D simulation of indoor and outdoor residential spaces.
It’s worth noting that we found over 70 companies in the synthetic data space, so this is just a small sampling. It seems like more and more synthetic data companies are popping up every day.
So to summarize, there’s a critical bottleneck in AI development coming from the lack of access to cost-effective training data, and we believe that synthetic data is the answer. Also, there’s an increasing emphasis from enterprises on deploying AI, and the majority of data used for the development of these AI use cases will be synthetic in nature in the future. It’s still early days for this sector, as most of these companies have not raised more than a Series B. We’ll be watching consumer privacy laws and enterprise adoption of AI as we search for promising startups in the field.
Thanks for your time today, and please let us know if you have any questions.