A model on its own is typically not enough. It needs data, which arrives in a very specific format, and that format has to match what the model will see at inference or prediction time. In reality, though, data changes all the time, and sometimes even the data formats change. So you typically need to put something in front of the model that makes sure incoming data still fits the template, header, or schema the model saw when it was trained. That validation layer is a very particular artifact of MLOps.
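As a concrete illustration, here is a minimal sketch of such a guard, assuming the inference payload arrives as a pandas DataFrame and that the training-time schema was saved alongside the model; the column names and dtypes below are made up for the example:

```python
# Minimal sketch: validate an incoming inference payload against the schema
# the model saw at training time (hypothetical feature set).
import pandas as pd

TRAINING_SCHEMA = {
    "user_id": "int64",
    "session_length_sec": "float64",
    "country_code": "object",
}

def validate_payload(df: pd.DataFrame) -> pd.DataFrame:
    """Fail loudly if the inference data has drifted away from the training schema."""
    missing = set(TRAINING_SCHEMA) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")

    extra = set(df.columns) - set(TRAINING_SCHEMA)
    if extra:
        # Dropping unknown columns instead of rejecting them is a policy choice.
        df = df.drop(columns=sorted(extra))

    for column, dtype in TRAINING_SCHEMA.items():
        if str(df[column].dtype) != dtype:
            # Attempt a safe cast; raise if the data no longer fits the schema.
            df[column] = df[column].astype(dtype)

    # Return the columns in the exact order the model expects.
    return df[list(TRAINING_SCHEMA)]
```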
The only way to effectively productionize any machine learning project at scale is with MLOps tools and principles. But some problems are, by their very nature, universally challenging for MLOps engineering teams, and those challenges are driving the rise of foundation model operations (FMOps) and large language model operations (LLMOps), a subset of FMOps.
Models keep getting larger and larger, particularly generative AI models and large language models. “They are so large that you need specialized infrastructure, and you sometimes need to come up with new creative ways so you can ensure that responses are delivered in a timely fashion,” says Dr. Ingo Mierswa, an industry-veteran computer scientist and founder of Altair RapidMiner. “And you need to start understanding if you can maybe sacrifice some precision of your models to actually reduce the memory footprint.”
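One common way to trade precision for memory is post-training quantization. The sketch below applies PyTorch’s dynamic quantization to a toy stand-in model; the layer sizes are illustrative, and whether the accuracy loss is acceptable depends entirely on the task:

```python
# Sketch: shrink a model's memory footprint by quantizing Linear weights to
# int8 with PyTorch's post-training dynamic quantization.
import os
import torch
import torch.nn as nn

# A toy float32 model standing in for a real network.
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
)

# Weights become int8; activations are quantized on the fly at inference time.
# Weight memory drops roughly 4x, at the cost of some numeric precision.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def size_mb(m: nn.Module, path: str = "tmp_weights.pt") -> float:
    """Rough on-disk size of the model's weights, in megabytes."""
    torch.save(m.state_dict(), path)
    return os.path.getsize(path) / 1e6

print(f"fp32: {size_mb(model):.1f} MB, int8: {size_mb(quantized):.1f} MB")
```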
All of that is very new. “This problem didn’t exist about 20 years ago when I started in this field,” Mierswa reflected during our call. By then, the field had already worked through all kinds of memory issues, and this simply was not a problem. Now, thanks to generative AI and LLMs, the problem is back: because of how much data is being generated, we are building resource-intensive models that consumer-grade hardware is not sufficient to work with. And if you need to use specialized hardware, he implied, it also means the engineer needs specialty skills to work with that hardware, make it scalable, and reduce the memory footprint of the models.
Consider the GPT-based family of applications, most of which are chat-based (text-to-text): you type something, and a stream of tokens or text comes back. “One of the reasons behind that is the inference time of GPT is very slow, on the order of several seconds. And for deployment, the challenge that engineering teams face when deploying large language models for search applications, recommendation applications, or ad applications is that the latency requirements are on the order of several milliseconds, not seconds,” says Raghavan Muthuregunathan, Senior Engineering Manager at LinkedIn, who leads Typeahead and whole-page optimization for LinkedIn Search.
And how are big tech engineering teams trying to solve that? Through knowledge distillation and fine-tuning, where engineers deploy a very small, fine-tuned model for a specific task on a single GPU. “This helps decrease the inference time from several seconds to just a few hundred milliseconds,” Muthuregunathan explained. “In fact, there is a technique called lookahead decoding.” Reducing LLM inference time is still a very active area of research.
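For readers unfamiliar with distillation, here is a minimal sketch of the core training step: a small student model learns to match the softened output distribution of a large teacher. The temperature and loss weighting are illustrative defaults, not anyone’s production settings:

```python
# Sketch of a knowledge-distillation loss: blend a soft-target term (match the
# teacher's temperature-softened distribution) with the ordinary hard-label loss.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    # Soft targets: KL divergence between softened student and teacher outputs.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: standard cross-entropy against the task labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

During training, the teacher runs in inference mode to produce `teacher_logits`; only the student’s parameters are updated, and only the small student is shipped to serving.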
Google has a very limited preview of its AI-powered search experience, where you can ask a question such as ‘tell me about Pier 57 in NYC’ and watch the results load within seconds. It does take a few seconds, and that delay is largely due to the slow inference time of a large language model. That is also why, if you type a query like ‘Donald Trump’ on Google, they won’t provide AI-generated results immediately: they know it would take more time for that specific query, and users probably don’t need the answer instantly. The user’s intent is most likely to navigate to a specific web page rather than to consume LLM-summarized content. So they’ve introduced a ‘generate’ button. If you choose to, you can wait those several seconds, and then the results will be generated. “The way people are circumventing this latency issue is through a product experience rather than relying on advanced AI or infrastructure techniques,” Muthuregunathan explained. People who click the button have opted in to waiting a few seconds, whereas an unrequested multi-second delay after every query would not make for a good user experience.
“And everyone is trying to do streaming of LLMs to circumvent this latency limitation, instead of making a single call to the LLM and getting the whole response at once,” he mentioned. “Why? Because every application is becoming more of a streaming application; that’s part of why most of these applications are chat applications instead of search engine applications.”
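A minimal sketch of that streaming pattern, using the OpenAI Python client as one example backend; the model name and prompt are placeholders, and any provider with a streaming endpoint follows the same shape:

```python
# Sketch: stream tokens to the user as they are generated instead of waiting
# for the entire completion to finish.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Tell me about Pier 57 in NYC."}],
    stream=True,          # request incremental chunks instead of one response
)

# Each chunk carries a small delta of text; flushing it immediately gives the
# user feedback within the first moments rather than after the full
# multi-second generation.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```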
This is imperative because reducing the memory footprint is exceedingly difficult. We don’t have good hardware for vision-related tasks, and GPU availability for production use cases remains a challenge today. But there are sparks of progress. Nvidia’s founder and CEO, Jensen Huang, announced the company’s new AI chip, the H200, at the last AWS re:Invent, where it will be made available to AWS customers. Google is introducing Cloud TPU v5p and its AI Hypercomputer architecture for running deep learning workloads. OpenAI is joining the arms race by developing its own chipset. And Tesla? They’re definitely forging ahead.
The extent to which these efforts will deliver on their seemingly great promise of overcoming computational limitations is still up in the air. But because of these limitations, it continues to be common among engineering teams to add more boxes, given that AI workloads rely heavily on extremely high-performance computing nodes.