How to understand the core concepts of AI, LLMs and RAG!

If you find some of the different terminology used for Large Language Models (LLMs) and AI confusing, you are not alone!

This is the first in a series of articles about AI, LLMs and Retrieval Augmented Generation (RAG), where we aim to explain, clearly and succinctly, some of the key terminology you might be hearing about. We hope you find these posts helpful!


What are foundation models?


A foundation model is an AI model trained on huge amounts of data (documents, audio, images, text…). It is trained to ‘generate’ the next word, which is how it ‘learns’ the language. It can then be specialised and fine-tuned for a wide variety of applications and tasks, at which point it is no longer a foundation model!


What are LLMs?


An LLM is an umbrella term covering both foundation models and the specialised models derived from them.

For example:

In the case of Llama, the foundation model is not directly usable but serves as the base for all the subsequent specialised models: Llama Instruct is a question-answering model and Code Llama is a coding assistant.

All three models are LLMs.

What are the benefits and challenges of a foundation model?

In terms of benefits: 


Flexibility and adaptability

Foundation models are flexible and adaptable as they can be fine-tuned for a wide range of tasks, saving time and resources compared to building new models from scratch for each specific task.

Cost efficient

While foundation models are costly to build, once you have one you can adapt it to new tasks as many times as you want.

Accessibility

Open source foundation models improve accessibility: smaller companies with limited computational resources can leverage these models to create innovative AI applications. (Note that many closed models are not accessible in this way!)

(Note – with open source foundation models, almost anyone can use the model, access the source code and customise it, which in theory improves accessibility, transparency and so on. Meta’s Llama 2 is an open source foundation model; ChatGPT is not open source.)

As for the challenges: 

Bias

Foundation models are trained on large and diverse data sets; any biases present in that data will be mirrored in the model’s outputs.

Security and privacy

The huge amounts of data needed to train a foundation model naturally raise security and privacy concerns. The data should be kept secure and handled responsibly.

Lack of transparency

Foundation models can be a ‘black box’. The issue with training data has already been highlighted. In addition, it is important to understand how a foundation model generates its outputs in order to identify potential errors or bias. This is a hot topic with ongoing empirical studies.

LLMs – Generative AI is not Sci-fi!


Lingua Custodia was delighted to co-host this event with Cosmian, a company specialising in cybersecurity, at Le Village by CA Paris.


What are LLMs?

Gaëtan Caillaut’s presentation for Lingua Custodia focused on Large Language Models (LLMs) and aimed to ‘demystify’ the engineering and science behind them. He highlighted that LLMs are a type of AI program able to recognise and generate text. These models are trained on large sets of data, which allows them to learn the probability of the next word, based on the context of the word or phrase.
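The ‘probability of the next word’ idea can be illustrated with a deliberately tiny sketch: a bigram model that simply counts which word follows which in a toy corpus. The corpus and function names here are purely illustrative, and a real LLM uses a neural network trained on billions of words rather than raw counts, but the underlying question it answers is the same.

```python
from collections import Counter, defaultdict

# Toy corpus (hypothetical); a real LLM trains on billions of words.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count which word follows which: the simplest possible form of
# "predict the next word from context".
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def next_word_probs(word):
    """Probability distribution over the next word, given the current one."""
    counts = following[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_probs("the"))  # "cat" is the most likely word after "the"
```

A real model conditions on a whole context window rather than a single word, but generation works the same way: sample a likely next word, append it, repeat.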

What are the limitations of LLMs?

The limitations of LLMs were also discussed. The quality of the generated text is very dependent on the underlying data, and there is also a risk that these models misinterpret the context of a word or phrase. An LLM hallucination occurs when the model generates text that is irrelevant or inconsistent with the input data. LLMs are also very expensive to run and complicated to train.

Retrieval Augmented Generation and RLHF for fine-tuning

He highlighted the benefit of RAG (Retrieval Augmented Generation), which references an external knowledge base to improve the accuracy and reliability of LLMs. RAG helps to enhance LLM capabilities and has the advantage of not requiring additional model training.
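A minimal sketch of the RAG idea, with a hypothetical three-document knowledge base and simple keyword overlap standing in for the vector-similarity search a real system would use:

```python
# Hypothetical mini knowledge base; a real system would use a vector store.
knowledge_base = [
    "Lingua Custodia specialises in language technologies for finance.",
    "RAG retrieves external documents to ground an LLM answer in facts.",
    "Paris is the capital of France.",
]

def retrieve(query, k=2):
    """Rank documents by word overlap with the query (a stand-in for embeddings)."""
    q_words = set(query.lower().split())
    scored = sorted(
        knowledge_base,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query):
    """Prepend the retrieved context so the LLM answers from the knowledge base."""
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("What does RAG retrieve?"))
```

The key point is that no model weights change: the external knowledge is injected into the prompt at query time, which is why RAG needs no additional training.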

RLHF (Reinforcement Learning from Human Feedback) is one of the most widely used fine-tuning approaches. It uses human feedback to make the model more efficient, logical and helpful.
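One concrete piece of the RLHF pipeline is the reward model, trained on human preference pairs: for the same prompt, annotators mark one response as ‘chosen’ and another as ‘rejected’. A minimal sketch of the ranking loss used for this step (the scores and function name below are illustrative, not taken from any specific implementation):

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry style ranking loss for reward-model training:
    the loss shrinks as the chosen response is scored above the rejected one."""
    margin = reward_chosen - reward_rejected
    return -math.log(1 / (1 + math.exp(-margin)))

# A reward model that already ranks the chosen answer higher has a low loss...
low = preference_loss(reward_chosen=2.0, reward_rejected=-1.0)
# ...while one that prefers the rejected answer is penalised heavily.
high = preference_loss(reward_chosen=-1.0, reward_rejected=2.0)
print(low < high)  # True
```

The trained reward model is then used as the feedback signal when the LLM itself is fine-tuned with reinforcement learning.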

Lingua Custodia’s Generative AI Multi-Document Analyser


Olivier Debeugny, Lingua Custodia’s CEO, then presented the multi-document data extraction technology, which uses RAG to optimise data extraction quality.

Please note that Lingua Custodia now has a new address in Paris, Le Village by CA Paris, at 55 Rue La Boétie, 75008. We are delighted with our new offices and thrilled to be part of this dynamic ecosystem, which prioritises supporting startups and SMEs.

How LLMs (Large Language Models) use long contexts

Large language models (LLMs) are capable of using large contexts, sometimes hundreds of thousands of tokens. OpenAI’s GPT-4 is capable of handling inputs of up to 32K tokens, while Anthropic’s Claude AI can handle 100K context tokens. This enables LLMs to process very large documents, which can be very useful for question answering or information retrieval.

A newly released paper by Stanford University examines the usage of context in large language models, particularly long contexts, for two key tasks: multi-document question answering and key-value retrieval. Their findings show that the best performance is typically achieved when relevant information occurs at the beginning or end of the input context. However, model performance declines significantly when relevant information must be accessed in the middle of a long context. This could be attributed to the way humans write, where the opening and concluding segments of a text tend to contain the most crucial information.
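The Stanford setup can be sketched as a ‘needle in a haystack’ test: place a key fact at different positions among distractor documents and check whether the model still finds it. The filler text and probe fact below are hypothetical stand-ins for the study’s actual data.

```python
# Sketch of a "needle in a haystack" position test (hypothetical data).
filler = [f"Unrelated filler document number {i}." for i in range(10)]
needle = "The access code is 4521."

def build_context(position):
    """Insert the key fact at a chosen position among distractor documents."""
    docs = filler[:position] + [needle] + filler[position:]
    return "\n".join(docs)

# The study's finding: a model asked "What is the access code?" tends to
# answer correctly when the needle sits at the start or end of the context,
# while accuracy drops noticeably when it sits in the middle.
for pos in (0, len(filler) // 2, len(filler)):
    prompt = build_context(pos) + "\n\nQuestion: What is the access code?"
    print(f"needle at position {pos}: context of {len(prompt)} characters")
```

Running such prompts against an LLM and scoring the answers at each position reproduces the U-shaped accuracy curve the paper reports.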

These findings show that one needs to be careful when using LLMs for search and information retrieval in long documents. Information found in the middle might be ignored by the LLM, and hence responses may be wrong or less accurate.

Lingua Custodia has over 10 years of experience in language technologies for financial document processing, and we are very aware of the importance of context for search and information retrieval, sentiment analysis, content summarisation and extraction. We continuously study the impact of context size in these language models.

Our expert team consists of scientists, engineers and developers, so we are well placed to create, customise and design secure LLMs which are perfectly tailored to meet your business needs.