Many are ground-breaking innovations that make LLMs much faster and less prone to hallucinations. They reduce the cost, latency, and amount of computing resources (GPU, training time) by several orders of magnitude. Some of them improve security, making your LLM more attractive to corporate clients. I introduced a few of these features in my previous article, “New Trends in LLM Architecture”, here. Now I offer a comprehensive list, based on the most recent developments.
1. Fast search
To match prompt components (say, embeddings) to the corresponding entities in the backend tables built from the corpus, you need good search technology. In general, you won’t find an exact match. The solution is to use approximate nearest neighbor (ANN) search, together with smart encoding of the embedding vectors. See how it works, here. Use a caching mechanism to handle common prompts, to further speed up real-time processing.
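To make the idea concrete, here is a minimal sketch, assuming a random-projection (LSH-style) scheme as the approximate nearest neighbor search and a plain dictionary as the prompt cache. All names (lsh_key, ann_search, and so on) and parameters are illustrative, not taken from the system described in this article.

```python
import numpy as np
from collections import defaultdict

# Hedged sketch: LSH-style approximate nearest neighbor search over embedding
# vectors, plus a cache for frequently seen prompts. Dimensions and the number
# of hyperplanes are arbitrary choices for illustration.
rng = np.random.default_rng(0)
dim, n_planes = 64, 12
planes = rng.normal(size=(n_planes, dim))   # random hyperplanes used for hashing

def lsh_key(vec):
    """Encode a vector as a short binary signature (its LSH bucket)."""
    return tuple((planes @ vec > 0).astype(int))

buckets = defaultdict(list)   # bucket signature -> list of (token, embedding)

def index(token, vec):
    buckets[lsh_key(vec)].append((token, vec))

def ann_search(vec, top_k=5):
    """Search only the matching bucket instead of the full table."""
    candidates = buckets.get(lsh_key(vec), [])
    scored = [(tok, float(vec @ v) / (np.linalg.norm(vec) * np.linalg.norm(v) + 1e-9))
              for tok, v in candidates]
    return sorted(scored, key=lambda x: -x[1])[:top_k]

cache = {}   # prompt -> cached results, to skip the search for common prompts

def lookup(prompt, vec):
    if prompt not in cache:
        cache[prompt] = ann_search(vec)
    return cache[prompt]
```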
2. Sparse databases
While vector and graph databases are popular in this context, they may not be the best solution. If you have two million tokens, you may have on the order of a trillion potential token pairs. In practice, most tokens are connected to a small number of related tokens, typically fewer than 1,000. Thus, the network or graph structure is very sparse, with fewer than a billion active connections. A far cry from a trillion! Hash tables are very good at handling this type of structure.
In my case, I use nested hash tables, a format similar to JSON, that is, similar to the way the input source (HTML pages) is typically encoded. A nested hash is a key-value table where the value is itself a key-value table. The key in the root hash is typically a word, possibly consisting of multiple tokens. The keys in the child hash may be categories, agents, or URLs associated with the parent key, while the values are weights indicating the strength of the association between a child key (say, a category) and the parent key. See examples here.
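Below is a minimal sketch of such a nested hash, with made-up keys and weights for illustration; the update function shows one way association weights could be accumulated during crawling.

```python
# Hedged sketch of a nested hash backend table. Root keys are (multi-)tokens;
# child keys are categories; values are association weights. Data is made up.
backend_categories = {
    "gradient descent": {
        "optimization": 0.85,
        "deep learning": 0.60,
        "numerical analysis": 0.25,
    },
    "san francisco": {
        "geography": 0.90,
        "travel": 0.40,
    },
}

def update_weight(table, token, category, increment=1.0):
    """Increment the association weight between a token and a category."""
    table.setdefault(token, {})
    table[token][category] = table[token].get(category, 0.0) + increment

update_weight(backend_categories, "gradient descent", "optimization")
```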
3. Contextual tokens
In standard LLMs, tokens are tiny elements of text, usually parts of a word. In my multi-LLM system, they are full words, and even combinations of multiple words. This is also the case in other architectures, such as Llama. They are referred to as multi-tokens. When a multi-token consists of non-adjacent words found in the same text entity (a paragraph, for instance), I call it a contextual token. Likewise, pairs consisting of non-adjacent tokens are called contextual pairs. When dealing with contextual pairs and tokens, you need to be careful to avoid generating a very large number of mostly irrelevant combinations. Otherwise, you face token explosion.
Note that a term such as “San Francisco” is a single token. It may exist alongside other single tokens such as “San” and “Francisco”.
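The sketch below shows one way to generate contextual pairs within a paragraph while keeping the number of combinations under control; the stopword list and the window cap are illustrative choices, not the actual filters used in my system.

```python
from itertools import combinations

stopwords = {"the", "a", "of", "and", "to", "in"}   # illustrative list
max_gap = 20   # ignore pairs of tokens that are too far apart

def contextual_pairs(paragraph_tokens):
    """Pairs of non-adjacent tokens found in the same paragraph."""
    kept = [(i, t) for i, t in enumerate(paragraph_tokens) if t not in stopwords]
    pairs = set()
    for (i, t1), (j, t2) in combinations(kept, 2):
        if 1 < j - i <= max_gap:   # non-adjacent, but within the window
            pairs.add(tuple(sorted((t1, t2))))
    return pairs

print(contextual_pairs("stochastic gradient descent minimizes the loss function".split()))
```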
4. Adaptive loss function
The goal of many deep neural networks (DNNs) is to minimize a loss function, usually via stochastic gradient descent. This is also true for LLMs that use transformers. The loss function is a proxy for the evaluation metric that measures the quality of your output. In supervised-learning LLMs (for instance, those performing supervised classification), you may use the evaluation metric itself as the loss function, to get better results. One of the best evaluation metrics is the full multivariate Kolmogorov-Smirnov distance (KS), see here, with a Python library here.
But it is extremely hard to design an algorithm that makes billions of atomic changes to KS extremely fast, a requirement in all DNNs, since it happens each time you update a weight. A workaround is to use an adaptive loss function that slowly converges to the KS distance over many epochs. I did not succeed at that, but I was able to build one that converges to the multivariate Hellinger distance, a discrete alternative that is asymptotically equivalent to the continuous KS.
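For reference, here is a minimal sketch of the discrete Hellinger distance between two binned multivariate samples; it illustrates the target metric only, not the adaptive mechanism that converges to it.

```python
import numpy as np

def hellinger(p_counts, q_counts):
    """Discrete Hellinger distance; inputs map a bin (a tuple) to a count."""
    bins = set(p_counts) | set(q_counts)
    p_total, q_total = sum(p_counts.values()), sum(q_counts.values())
    s = sum((np.sqrt(p_counts.get(b, 0) / p_total) -
             np.sqrt(q_counts.get(b, 0) / q_total)) ** 2 for b in bins)
    return float(np.sqrt(s / 2.0))

# Example: two small 2-D samples binned on the same grid (made-up counts)
p = {(0, 0): 10, (0, 1): 5, (1, 1): 5}
q = {(0, 0): 8, (0, 1): 7, (1, 0): 2, (1, 1): 3}
print(hellinger(p, q))
```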
5. From one trillion parameters to less than 5
By parameter, here I mean the weight between two connected neurons in a deep neural network. How can you possibly replace one trillion parameters with fewer than 5, and yet get better results, faster? The idea is to use parametric weights. In this case, you update the many weights with a simple formula relying on a handful of explainable parameters, as opposed to neural network activation functions updating billions of black-box parameters (the weights themselves) over and over. I illustrate this in Figure 1. The example comes from my recent book, available here.
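Here is a minimal sketch of the principle, assuming an illustrative PMI-like formula with three parameters; it is not the formula from the book, only an example of how a handful of explainable parameters can drive all the weights.

```python
# Hedged sketch of parametric weights: instead of storing and training one weight
# per token pair, each weight is computed on the fly from pair and token counts,
# using a formula governed by a few explainable parameters (p1, p2, p3).
params = {"p1": 1.0, "p2": 0.5, "p3": 0.5}

def weight(token_a, token_b, pair_count, token_count, p=params):
    """Association weight between two tokens, driven by 3 parameters."""
    n_ab = pair_count.get((token_a, token_b), 0)
    n_a = token_count.get(token_a, 1)
    n_b = token_count.get(token_b, 1)
    return p["p1"] * n_ab / (n_a ** p["p2"] * n_b ** p["p3"])

# Tuning the 3 parameters, for instance via grid search against an evaluation
# metric, replaces gradient updates on billions of individual weights.
```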
6. Agentic LLMs
An agent detects the intent of a user within a prompt and helps deliver results that meet the intent in question. For instance, a user may be looking for definitions, case studies, sample code, solutions to a problem, examples, datasets, images, or PDFs related to a specific topic, or links and references. The task of the agent is to automatically detect the intent and guide the search accordingly. Alternatively, the LLM may feature two prompt boxes: one for the standard query, and one allowing the user to choose an agent from a pre-built list.
Either way, you need a mechanism to retrieve the most relevant information in the backend tables. Our approach is as follows. We first classify each text entity (say, a web page, PDF document, or paragraph) prior to building the backend tables. More specifically, we assign one or multiple agent labels to each text entity, each with its own score or probability indicating relevance. Then, in addition to our standard backend tables (categories, URLs, tags, embeddings, and so on), we build an agent table with the same structure: a nested hash. The parent key is a multi-token as usual, and the value is also a hash table, where each child key is an agent label. The value attached to an agent label is the list of text entities matching the agent in question, each with its own relevance score.
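The following sketch shows the shape of such an agent table and a simple intent-driven lookup; agent labels, document IDs, and scores are made-up examples.

```python
# Hedged sketch of an agent backend table (a nested hash) and retrieval from it.
agent_table = {
    "logistic regression": {
        "definition":  [("doc_12", 0.9), ("doc_47", 0.6)],   # (text entity, relevance)
        "sample_code": [("doc_88", 0.8)],
        "dataset":     [("doc_03", 0.7)],
    },
}

def retrieve(multi_token, agent_label, top_k=3):
    """Return the most relevant text entities for a token, filtered by agent."""
    entities = agent_table.get(multi_token, {}).get(agent_label, [])
    return sorted(entities, key=lambda x: -x[1])[:top_k]

print(retrieve("logistic regression", "sample_code"))   # -> [('doc_88', 0.8)]
```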
7. Data augmentation via dictionaries
When designing an LLM system serving professional users, it is critical to use top-quality input sources. Not only to get high-quality content, but also to leverage its embedded structure (breadcrumbs, taxonomy, knowledge graph). This allows you to create contextual backend tables, as opposed to adding a knowledge graph as a frontend layer on top. However, some input sources may be too small, for instance if they are specialized or if your LLM consists of multiple sub-LLMs, like a mixture of experts.
To augment your corpus, you can use dictionaries (synonyms, abbreviations), indexes, glossaries, or even books. You can also leverage user prompts: they help you identify what is missing in your corpus, leading to corpus improvements or alternate taxonomies. Augmentation is not limited to text: taxonomy and knowledge graph augmentation can be done by importing external taxonomies. All of this is eventually added to your backend tables. When returning results for a user prompt, you can mark each item either as internal (coming from the original corpus) or external (coming from augmentation). This feature increases the security of your system, especially for enterprise LLMs.
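As a minimal sketch, the snippet below merges an external synonym dictionary into a backend table and tags each entry as internal or external; the data and field names are illustrative.

```python
# Hedged sketch: augment a backend table with a synonym dictionary, marking each
# entry with its origin so results can later be flagged as internal or external.
backend = {
    "support vector machine": {"categories": {"classification": 0.8}, "source": "internal"},
}
synonyms = {"svm": "support vector machine", "support vector machines": "support vector machine"}

def augment_with_synonyms(backend, synonyms):
    for alias, canonical in synonyms.items():
        if alias not in backend and canonical in backend:
            entry = dict(backend[canonical])   # reuse the canonical entry
            entry["source"] = "external"       # mark as coming from augmentation
            backend[alias] = entry
    return backend

augment_with_synonyms(backend, synonyms)
print(backend["svm"]["source"])   # -> external
```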
8. Contextual tables
In most LLMs, the core table is the embeddings. Not in our systems: in addition to embeddings, we have category, tag, related-item, and various other contextual backend tables. They play a more critical role than the embeddings. It is more efficient to build them as backend tables during smart crawling, as opposed to reconstructing them post-creation as frontend elements.
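To illustrate, here is a small sketch of how the same multi-token can key several contextual backend tables at once; the contents are made up.

```python
# Hedged sketch: one multi-token, several contextual backend tables, all built
# during crawling. Embeddings are just one table among many.
multi_token = "random forest"

backend_tables = {
    "categories":    {multi_token: {"ensemble methods": 0.9, "classification": 0.7}},
    "tags":          {multi_token: {"bagging": 0.8, "decision tree": 0.8}},
    "related_items": {multi_token: {"gradient boosting": 0.6}},
    "urls":          {multi_token: {"https://example.com/random-forest": 0.9}},
    "embeddings":    {multi_token: [0.12, -0.07, 0.31]},
}
```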
9. LLM router
Good input sources usually have their own taxonomy, with categories and multiple levels of subcategories, sometimes with subcategories having multiple parent categories. You can replicate the same structure in your LLM, using multiple sub-LLMs, one per top category. It is possible to cover the entire body of human knowledge with 2,000 sub-LLMs, each with fewer than 200,000 multi-tokens. The benefit is much faster processing and more relevant results served to the user.
To achieve this, you need an LLM router. It identifies prompt elements and retrieves the relevant information from the most appropriate sub-LLMs. Each one has its own set of backend tables, hyperparameters, stopword list, and so on. There may be overlap between different sub-LLMs. Fine-tuning can be done locally, initially for each sub-LLM separately, or globally. You may also allow the user to choose a sub-LLM, via a sub-LLM prompt box, in addition to the standard agent and query prompt boxes.
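The sketch below shows a bare-bones router that scores the prompt against each sub-LLM's weighted vocabulary and dispatches to the best matches; sub-LLM names and vocabularies are illustrative.

```python
# Hedged sketch of an LLM router based on weighted vocabulary overlap.
sub_llm_vocab = {
    "statistics": {"regression": 2.0, "variance": 1.5, "sampling": 1.0},
    "nlp":        {"token": 2.0, "embedding": 2.0, "corpus": 1.0},
    "databases":  {"index": 1.5, "query": 1.5, "schema": 1.0},
}

def route(prompt, top_n=2):
    """Return the sub-LLMs most relevant to the prompt."""
    tokens = prompt.lower().split()
    scores = {name: sum(vocab.get(t, 0.0) for t in tokens)
              for name, vocab in sub_llm_vocab.items()}
    ranked = sorted(scores.items(), key=lambda x: -x[1])
    return [name for name, score in ranked[:top_n] if score > 0]

print(route("how to compute an embedding for each token in a corpus"))   # -> ['nlp']
```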
10. Smart crawling
Libraries such as BeautifulSoup allow you to easily parse crawled content such as HTML pages. However, on their own, they do not retrieve the embedded structure present in any good repository. The purpose of smart crawling is to extract structure elements (categories and so on) while crawling, to add them to your contextual backend tables. It requires just a few lines of ad-hoc Python code, depending on your input source, and the result is dramatic. You end up with a well-structured system from the ground up, eliminating the need for prompt engineering.
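As a minimal sketch, assuming hypothetical CSS classes in the source pages, the snippet below extracts a breadcrumb trail and category tags along with the text, ready to feed the contextual backend tables.

```python
from bs4 import BeautifulSoup

# Hedged sketch of smart crawling: extract structure elements (breadcrumb,
# tags) in addition to the text. The CSS classes are hypothetical; adapt the
# selectors to your own input source.
html = """
<nav class="breadcrumb"><a>Home</a><a>Machine Learning</a><a>Clustering</a></nav>
<div class="tags"><span>k-means</span><span>unsupervised</span></div>
<p>K-means partitions observations into k clusters...</p>
"""

soup = BeautifulSoup(html, "html.parser")
breadcrumb = [a.get_text() for a in soup.select("nav.breadcrumb a")]
tags = [s.get_text() for s in soup.select("div.tags span")]
text = " ".join(p.get_text() for p in soup.find_all("p"))

backend_entry = {
    "categories": breadcrumb[1:],   # skip the "Home" root
    "tags": tags,
    "text": text,
}
print(backend_entry)
```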
The next 20 features will be discussed in my upcoming articles.
About the Author
Vincent Granville is a pioneering GenAI scientist and machine learning expert, co-founder of Data Science Central (acquired by a publicly traded company in 2020), Chief AI Scientist at MLTechniques.com and GenAItechLab.com, former VC-funded executive, author (Elsevier) and patent owner — one related to LLM. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. Follow Vincent on LinkedIn.