Image by Malachi Witt from Pixabay
I had the chance to listen to a talk that Ivan Lee, Founder and CEO of Datasaur, gave during October 2024’s AI Summit here at the Computer History Museum in Mountain View, CA. Datasaur is an AWS partner and marketplace seller.
One of the things Lee and company have done for AWS customers is guide those who need to scale up their data labeling efforts cost-effectively while still maintaining high quality.
“Generative AI models,” he and partner solutions architect Kruthi Jaysimha Rao wrote in an AWS post, “automatically create large synthetic (yet realistic) datasets to address the lack-of-data problem…. While generative AI shows promise for scalably synthesizing labeled data, there remain significant challenges to address before reliably replacing human effort.”
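To make that idea concrete, here's a minimal sketch of what LLM-driven synthetic labeling can look like, assuming an OpenAI-compatible Python client. The model name, prompt, and sentiment labels are my illustrative placeholders, not a description of Datasaur's actual pipeline.

```python
# Minimal sketch: have a generative model invent labeled training
# examples. The model name, labels, and prompt are all assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LABELS = ["positive", "negative", "neutral"]  # hypothetical label set

def synthesize_example(label: str) -> dict:
    """Ask the model to write one example that fits the given label."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat-capable model would do here
        messages=[{
            "role": "user",
            "content": (
                f"Write one realistic customer-review sentence whose "
                f"sentiment is clearly '{label}'. Reply with the sentence only."
            ),
        }],
    )
    return {"text": response.choices[0].message.content.strip(),
            "label": label}

# Build a small, balanced synthetic dataset (10 examples per label).
dataset = [synthesize_example(lbl) for lbl in LABELS for _ in range(10)]
```

The catch, per the quote above, is that you still need a way to verify the generated examples are realistic before trusting them to replace human-labeled data.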
The perils versus the promise of synthetic data
Synthetic data is an increasingly thorny issue, one that has content providers and even tech Q&A sites such as Stack Overflow worried. Much depends on how the data is sourced, how it is synthesized, and how it's used. And even then, there's the prospect of synthetic data poisoning the large language model (LLM) well, which depends on genuine data to stay clean.
It's good to take a long-term, comprehensive perspective when considering such alternatives. The more the emergent AI landscape grows in size and complexity, the more valuable the guidance of companies like Datasaur becomes.
It’s essential to push the edge of the automation envelope, but at the same time evaluate the risks of doing so. Risks change and new risks emerge as new technologies become part of the mix.
The “best model”?
Datasaur helps organizations that are struggling to manage voluminous repositories of unstructured data. The US Federal Bureau of Investigation (FBI) is among its customers. Consider how many documents the FBI, a 35,000-person agency that serves as a primary investigative arm of the 115,000-employee Department of Justice, stores. That's a huge natural language processing (NLP) mountain to climb.
Lee said that people often ask him what the “best model” is. The answer is, it depends. Datasaur assesses more than 200 different learning models, considering how suitable each might be for a particular purpose, for a particular kind of use case.
Taking an additional step can open up new possibilities for cost savings. For example, a client might well benefit most from shrinking the footprint of the model used in production. Datasaur offers an LLM distillation service, taking an LLM and distilling it to a small language model (SLM) footprint within 48 hours.
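Under the hood, distillation generally means training a small student model to mimic a large teacher's output distribution. Here's a minimal sketch of the standard distillation loss (the classic Hinton et al. formulation), assuming PyTorch; it illustrates the general technique, not Datasaur's proprietary 48-hour service.

```python
# Minimal sketch of the classic knowledge-distillation loss: the small
# student learns to match the teacher's softened output distribution.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between temperature-softened teacher and student."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2

# In a training loop, the frozen teacher labels each batch and the
# student is optimized to match those soft labels:
#   with torch.no_grad():
#       teacher_logits = teacher(batch)
#   loss = distillation_loss(student(batch), teacher_logits)
```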
To me, this sounds like a clever solution to a complex matching exercise that actually requires an iterative approach to planning. Clients often don't have a good handle on what their requirements are to begin with, and a good sense of the actual cost burden might encourage them to revisit those requirements.
The implicit question that motivates a client to consider such a service might be, “Do I really need to pay for this whole LLM? What problem am I really trying to address? Which part of the LLM would I really be using the most? Can’t I just subscribe to that by itself?”
Cable TV subscribers waited decades for this kind of à la carte selection, and had to cut the cord to get it.
Other considerations
On a tight budget? It's worth mentioning that the variety of open source models continues to grow. In late October 2024, I counted 61 different open LLMs that are licensed for commercial use on GitHub's running list.
As of August 2024, DataCamp ranked Llama 3.1, BLOOM, BERT, Falcon 180B and Meta’s OPT-175B in the top five of its list of the top eight open source LLMs.
During his talk, Lee walked the audience through some basic ROI considerations that are worth noting, especially considering how many estimates and unknowns you'll have to ponder (a back-of-the-envelope sketch follows this list):
- What are the cost savings you’re expecting?
- How much revenue will using the model help you generate?
- What will the build, implementation, and operations and maintenance costs be?
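Here's a back-of-the-envelope sketch that ties those three questions together; every dollar figure is a placeholder assumption, not a number from Lee's talk.

```python
# Back-of-the-envelope ROI: net benefit over the planning horizon
# divided by total cost. All inputs are placeholder assumptions.
def simple_roi(annual_cost_savings: float,
               annual_revenue_lift: float,
               build_cost: float,
               annual_ops_cost: float,
               years: int = 3) -> float:
    benefit = (annual_cost_savings + annual_revenue_lift) * years
    cost = build_cost + annual_ops_cost * years
    return (benefit - cost) / cost

# Example: $200k/yr savings, $150k/yr revenue lift, $300k to build,
# $100k/yr to operate, over a three-year horizon -> 75% ROI.
print(f"ROI: {simple_roi(200_000, 150_000, 300_000, 100_000):.0%}")
```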
Even more fraught with uncertainty and potential peril are these other considerations on the Datasaur list that I’m elaborating on:
- Timeline: Enterprise planning cycles are long and slow. The model choices and assumptions you make will change by the time the cycle’s over.
- Security: My friend Dave Duggal, CEO of EnterpriseWeb, shared this advice in a LinkedIn post a couple of months ago: "The vulnerabilities of GenAI should have been obvious, but folks got caught up in the hype and best practices went out the window (once again). LLMs or small models need to be isolated and contained, like any new component. LLMs are black boxes. They should never be the center of architecture, they are just another element." (See the containment sketch after this list.)
- Data integrations: Lee mentioned Databricks and Snowflake, and I’d just say that if you’re limiting yourself to those, you don’t really understand the full data integration challenge enterprises face or how wide-ranging data transformation at enterprise scale will have to be to take proactive advantage of AI.
- Data management: Lee did allude to “ground truth” datasets and the need to keep and manage these, as they can be reusable. You can find more on the bigger picture of transformed data management in my Data Science Central posts, such as this one: https://www.datasciencecentral.com/help-a-friendly-influential-executive-build-a-foundation-for-a-data-empire-on-a-budget/
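On Duggal's containment point: here's a minimal sketch, my own assumption rather than anything EnterpriseWeb or Datasaur prescribes, of treating the LLM as just another component whose raw output is validated before the rest of the system sees it. The call_llm stub and the label set are hypothetical.

```python
# Minimal containment sketch: the LLM is isolated behind one function,
# and downstream code only ever receives validated, typed results.
from dataclasses import dataclass

ALLOWED_LABELS = {"approve", "reject", "escalate"}  # hypothetical labels

@dataclass
class Classification:
    label: str
    note: str

def call_llm(prompt: str) -> str:
    """Stand-in for whatever isolated model endpoint you actually deploy."""
    return "approve"  # canned response so the sketch runs end to end

def classify(document: str) -> Classification:
    raw = call_llm(f"Classify as approve/reject/escalate:\n{document}")
    label = raw.strip().lower()
    if label not in ALLOWED_LABELS:
        # Contain the black box: never pass unvalidated free text downstream.
        return Classification("escalate", f"unparseable model output: {raw!r}")
    return Classification(label, "validated")

print(classify("Example document text."))
```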
Making good choices about models isn't going to get any easier. I appreciated the calm, methodical nature of the Datasaur approach. It's not sufficient for all needs, but it is a great starting point.