Illustration by Dirk Wouters on Pixabay
I had the opportunity to attend Enterprise Agility University’s prompt engineering course in April. The course provided a helpful agility lens through which to view efforts to make large language models (LLMs) useful. In the current enterprise context, an LLM is like a herd of unruly sheep: you need plenty of herding dogs and fencing to manage it.
The bigger the herd, the more dogs and fencing you need. The EAU prompt engineering textbook provided with the course listed 18 categories of prompt engineering techniques. Many of these are oriented toward reasoning and knowledge creation.
Which raises the question: Shouldn’t most reasoning be part of the main data input, rather than delivered ad hoc through the prompt? With so much ad hoc input, users are really just doing a glorified form of data entry. I point this out not as a criticism of EAU’s efforts, but as an evident shortcoming of LLMs in general.
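To make the contrast concrete, here’s a minimal sketch, not taken from the EAU course, of the difference between re-keying reasoning and facts into every prompt and pointing the model at a data layer that already carries them. The business question, facts and graph name are all invented for illustration:

```python
# A minimal sketch (not from the EAU course) contrasting two ways of supplying context.
# The facts, dates and the 'supplier_contracts' graph name are all made up; no LLM is called here.

question = "Which supplier contracts are up for renewal next quarter?"

# 1. Ad hoc prompting: the user re-keys the reasoning and the facts into every prompt.
ad_hoc_prompt = f"""
You are a contracts analyst. Think step by step.
Facts: Acme's contract ends 2025-07-31; Globex's contract ends 2026-01-15.
'Next quarter' means July through September 2025.
Question: {question}
"""

# 2. Data-level context: the facts and their meaning live in the data layer,
#    so the prompt only has to point at them.
data_layer_prompt = f"""
Answer using the 'supplier_contracts' graph, where each contract carries an
explicit end date and renewal term.
Question: {question}
"""
```

The first style is what so many of those 18 technique categories amount to: the user smuggling knowledge into the model one prompt at a time.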
The main question, given such shortcomings, is how to infuse more intelligence into LLMs. The obvious answer from a data layer perspective is to start with explicitly intelligent data, because otherwise we clearly confront a garbage-in, garbage-out scenario with any kind of statistical machine learning.
We’re simply having to deal with a lot of unnecessary stupidity from LLMs and their agents at the human interface because the inputs aren’t explicitly contextualized: they’re either not explicitly relevant, or they’re ambiguous.
For enterprise use, the outputs aren’t useful often enough, because the Retrieval Augmented Generation (RAG) approach that feeds them falls short. That’s not to mention what’s been scraped off the web and labeled as model training data.
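For readers who haven’t looked under the hood, here’s a deliberately naive RAG sketch, with toy document chunks and a toy relevance score of my own invention, that shows how retrieval can hinge on incidental word overlap rather than on meaning:

```python
# A deliberately naive RAG sketch: score each chunk against the query, pick the
# "closest" one, and stuff it into a prompt. The chunks and the scoring function
# are toy examples, not anyone's production pipeline.

def keyword_overlap(query: str, chunk: str) -> int:
    """Toy relevance score: the number of shared lowercase words."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

chunks = [
    "Berth 4 crane maintenance scheduled for Q3.",
    "NO2 readings at berth 4 exceeded the daily threshold twice in May.",
    "The port welcomed a record number of cruise passengers.",
]

query = "Which berth had air quality problems in May?"
best_chunk = max(chunks, key=lambda c: keyword_overlap(query, c))

prompt = f"Context: {best_chunk}\nQuestion: {query}\nAnswer:"
print(prompt)
```

The retriever happens to surface the right chunk here, but only because “berth” and “in” appear in both strings; nothing in the pipeline knows that NO2 is an air pollutant, which is exactly the kind of guessing the rest of this post is about.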
Why is truly, explicitly contextualized data so important to the success of LLMs? We don’t want machines to guess any more than they absolutely have to. In the current piles of scraped, labeled, compressed and tensored data that are used as LLM inputs, machines have to guess which context is meant (from the input) and which context is needed (for the output). Vector embeddings as an adjunct aren’t enough to solve the contextualization problem, because they are themselves lossy.
As I pointed out in a previous post (https://www.datasciencecentral.com/data-management-implications-of-the-ai-act/), intelligent data is data that describes itself so that machines don’t have to guess what the data means. With machines, as we all know, you have to be Captain Obvious. We train machines with data. What that data says therefore has to be obvious to the machines.
For valid, useful machine interpretations of data, the meaning of the data has to be explicit. For the meaning to be explicit, we need to eliminate ambiguity. Eliminating ambiguity implies a much more complete and articulated contextualization of the data than the “context” generally referred to in LLMs. And yet, companies can give a huge boost to the creation of this richer context with the help of a thoughtful approach that augments logically connected knowledge graph development with machine learning.
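Here’s a minimal sketch of what self-describing data can look like, using the open source rdflib library and a made-up example.org vocabulary. The point is that the type, unit, location and time of a reading are explicit, so a machine doesn’t have to guess:

```python
# A minimal sketch, using rdflib and an invented example.org vocabulary, of
# "data that describes itself". Nothing here is from a real port dataset.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/port/")

# An ambiguous row a machine would have to guess about: what is 42.0? Which "Portsmouth"?
ambiguous_row = {"site": "Portsmouth", "value": 42.0}

# The same fact expressed as self-describing triples.
g = Graph()
g.bind("ex", EX)
reading = EX["reading/001"]
g.add((reading, RDF.type, EX.NO2Reading))
g.add((reading, EX.observedAt, EX["site/PortsmouthPort_Berth4"]))
g.add((reading, EX.valueMicrogramsPerCubicMetre, Literal(42.0, datatype=XSD.double)))
g.add((reading, EX.timestamp, Literal("2024-05-14T09:00:00Z", datatype=XSD.dateTime)))

print(g.serialize(format="turtle"))
```

Machine learning can help propose candidate entities and relationships at scale, but the logically connected graph is what removes the guesswork.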
Data fitness: A digital twins example
Once you’ve mastered the ability to create intelligent data, data can take on a much larger role in enterprise transformation. Data becomes the driver. Thus the term “data driven”.
What data as the driver really implies in a fully digitized world is that continuous improvement happens by using the data for both prediction and control. In a business context, the activity of the physical world, past, present and future, is mirrored in the form of “digital twins”. Each of these twins models a different activity.
Among many other things, a twin predicts the future behavior of the activity it mirrors. Those predictions, also in data form, then help to optimize that activity.
More broadly, a twin paints the picture of each activity in motion and allows interaction with other twins, even across organizations.
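Here is a toy sketch of the prediction-and-control idea, emphatically not any vendor’s implementation: a twin mirrors an activity as time-stamped observations and emits a forecast that is itself data another process can act on. The activity name and numbers are invented:

```python
# A toy digital twin: it mirrors an activity as time-stamped observations,
# makes a naive forecast, and exposes that forecast as data a control or
# scheduling process could consume. Purely illustrative.
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class ActivityTwin:
    name: str
    observations: list = field(default_factory=list)  # (timestamp, value) pairs

    def observe(self, timestamp: str, value: float) -> None:
        """Mirror one measurement from the physical activity."""
        self.observations.append((timestamp, value))

    def predict_next(self, window: int = 3) -> float:
        """Naive forecast: the mean of the last few observations."""
        recent = [value for _, value in self.observations[-window:]]
        return mean(recent)

berth_traffic = ActivityTwin("berth_4_truck_movements")
for ts, v in [("08:00", 14), ("09:00", 18), ("10:00", 21)]:
    berth_traffic.observe(ts, v)

# The prediction is itself data that a downstream process (or another twin) can act on.
print(berth_traffic.predict_next())  # roughly 17.67
```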
Consider the example of intelligent IoT system provider Iotics’ work with Portsmouth Ports in the UK, part of the Sea Change project. Iotics helped the port authority install networked sensor nodes at various places across the geography in order to monitor air quality, as part of a compliance effort to reduce pollution.
Port areas must deal with a heavy concentration of pollutants, because they’re where multiple forms of transportation come together to move goods from sea to land, land to sea, and inland over land. Both workers and local residents suffer from pollution exposure.
Iotics’ solution blends intelligent digital twins and agents to capture, integrate and share the information from the port network’s various sensor nodes. Each node’s twin includes a knowledge subgraph. Software agents manage the messaging from these subgraphs and make it possible for network subscribers to obtain the specific, time- and location-stamped measurements most relevant to them.
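The pattern might look something like the following sketch. This is my own illustration of publish-and-subscribe routing, not Iotics’ API, and every name and field in it is invented:

```python
# A sketch of the subscription pattern described above, not Iotics' API: each
# sensor node's twin publishes time- and location-stamped readings, and a simple
# agent forwards each reading only to the subscribers whose declared interests it matches.
from dataclasses import dataclass

@dataclass(frozen=True)
class Reading:
    node_id: str
    pollutant: str   # e.g. "NO2", "PM2.5"
    value: float     # micrograms per cubic metre
    timestamp: str   # ISO 8601
    location: str    # berth or zone identifier

@dataclass
class Subscription:
    subscriber: str
    pollutants: set
    locations: set

def route(reading: Reading, subscriptions: list) -> list:
    """Return the subscribers that should receive this reading."""
    return [
        s.subscriber
        for s in subscriptions
        if reading.pollutant in s.pollutants and reading.location in s.locations
    ]

subs = [
    Subscription("harbour_master", {"NO2", "PM2.5"}, {"berth_2", "berth_4"}),
    Subscription("shipping_co", {"NO2"}, {"berth_4"}),
]
r = Reading("node_17", "NO2", 38.5, "2024-05-14T09:00:00Z", "berth_4")
print(route(r, subs))  # ['harbour_master', 'shipping_co']
```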
For instance, a shipping company can review the measurements and determine its own fleet’s pollution footprint, including where, when and which emitted pollutants are of most concern, which ship is the most problematic, and so on. Armed with this information, the firm can tackle the problem of reducing pollution levels, and can later demonstrate its success in doing so.
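A hedged sketch of the kind of roll-up such a firm might run over its slice of the shared measurements follows; the readings and their attribution to particular ships are entirely made up for illustration:

```python
# A toy roll-up of pollution readings attributed to a fleet's port calls.
# The ships, pollutants and values are invented for illustration.
from collections import defaultdict

# (ship, pollutant, micrograms_per_m3) tuples already attributed to the fleet.
fleet_readings = [
    ("MV Solent", "NO2", 41.0),
    ("MV Solent", "NO2", 44.5),
    ("MV Wight",  "NO2", 22.0),
    ("MV Wight",  "SO2", 12.5),
]

totals = defaultdict(float)
for ship, pollutant, value in fleet_readings:
    totals[(ship, pollutant)] += value

worst = max(totals, key=totals.get)
print(worst, totals[worst])  # ('MV Solent', 'NO2') 85.5
```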
Data fitness broadens agility initiative potential
Much of the advantage of the Iotics solution at Portsmouth Ports has to do with superior system design, advanced data architecture, close partner collaboration and thoughtful, standards-based technology selection. The right semantic knowledge graph technology, implemented in the right way, makes it possible to maximize the impact of data collection efforts on port-wide transformation. This kind of data layer transformation lifts all boats.