Practical Guide to Prompt Compression: Essential Optimization for RAG-Based Applications

Phillip Peng
Jan 16, 2024

Are you looking to reduce costs by 10x to 20x while still maintaining the accuracy of your RAG-based applications? Prompt compression is an indispensable strategy for achieving this. It’s not just a cost-saving measure; it’s a necessity for efficient and effective application performance.

1. Introduction

In the evolving world of natural language processing, efficiency and precision are paramount. This tutorial delves into the practical application of prompt compression in RAG (Retrieval-Augmented Generation) based applications, highlighting its role in performance enhancement. LLMLingua, developed by Microsoft Research, is a standout tool in this space.

2. What are Prompt Compression and LLMLingua?

Prompt compression is a technique for making the context passed to a language model more concise and focused. This is particularly beneficial in RAG-based applications, where efficient information retrieval and processing are crucial.

LLMLingua, a Microsoft Research project, applies prompt compression to make RAG pipelines more efficient. It uses a smaller language model, such as GPT-2-small or LLaMA-7B, to identify and remove non-critical tokens from the prompt, preserving the information the larger LLM needs to produce an accurate response.
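Before wiring LLMLingua into a full RAG pipeline, it helps to see it in isolation. The following is a minimal sketch based on the library's README; the example contexts are placeholders, and the return fields (compressed_prompt, origin_tokens, compressed_tokens) follow the documented interface:

from llmlingua import PromptCompressor

# Downloads the default small compressor model (a LLaMA-7B variant) on first use
llm_lingua = PromptCompressor()

contexts = [
    "...long retrieved passage 1...",
    "...long retrieved passage 2...",
]

result = llm_lingua.compress_prompt(
    contexts,
    instruction="Answer the question using the context.",
    question="When was the first case of Covid-19 reported?",
    target_token=200,
)
print(result["compressed_prompt"])
print(result["origin_tokens"], "->", result["compressed_tokens"])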

3. Using LLMLingua to Improve RAG-Based Applications

In this tutorial, we will show how to integrate LLMLingua into a RAG-based application, demonstrating how it can reduce prompt length while improving processing performance and reducing latency.

3.1. Installation and Setup

To get started, we need to install the following packages:

  • openai: The official Python client for OpenAI’s API, required to access the GPT models.
  • llmlingua: Microsoft’s prompt compression library, the core of this tutorial.
  • accelerate: Hugging Face’s library for efficient model loading and execution, helpful when running the compressor model locally.
  • llama_index: Provides the retrieval-augmented generation plumbing (loading, indexing, retrieval).
  • optimum and auto-gptq: Part of Hugging Face’s ecosystem; needed to load the GPTQ-quantized Llama-2 model used by the compressor.
!pip install openai -q
!pip install llmlingua -q
!pip install accelerate -q
!pip install llama_index -q
!pip install optimum auto-gptq -q

3.2. Configure the LLM

For the RAG-based application, we will use GPT-4 as the LLM. First, set your OpenAI API key:

import openai
openai.api_key = "your_api_key_here"
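Hardcoding keys in notebooks is convenient but easy to leak. A safer pattern is to read the key from an environment variable (OPENAI_API_KEY is the conventional name):

import os
import openai

# Read the key from the environment instead of hardcoding it
openai.api_key = os.environ["OPENAI_API_KEY"]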

3.3. Set Up the Retriever with LlamaIndex

For efficient data retrieval, we’ll use llama_index:

from llama_index import VectorStoreIndex, download_loader

# Load the Wikipedia article that will serve as the knowledge base
WikipediaReader = download_loader("WikipediaReader")
loader = WikipediaReader()
documents = loader.load_data(pages=['COVID-19'])

# Index the documents and keep the top 3 chunks per query
index = VectorStoreIndex.from_documents(documents)
retriever = index.as_retriever(similarity_top_k=3)

The Wikipedia page on COVID-19 is loaded using the WikipediaReader loader within llama_index and serves as the knowledge base for our RAG-based application. After indexing the data, we use llama_index to run semantic search across the knowledge base, keeping the top three chunks from the results (similarity_top_k=3).

3.4. Question-Based Semantic Search

The retriever searches the knowledge base and returns the top three document chunks as context for the question "When was the first case of Covid-19 reported?"

question = "When was the first case of Covid-19 reported?"
retrieved_nodes = retriever.retrieve(question)
original_contexts = "\n\n".join([n.get_content() for n in retrieved_nodes])
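Before compressing anything, it is worth a quick look at what the retriever returned. A small inspection sketch (retrieve() returns llama_index NodeWithScore objects, which expose the chunk text and its similarity score):

# Each retrieved node carries a text chunk and its similarity score
for node in retrieved_nodes:
    print(f"score={node.score:.3f}, length={len(node.get_content())} chars")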

The following is the first portion of the original context’s content:

The US intelligence community has mixed views on the issue, but overall agrees with the scientific consensus that the virus was not developed as a biological weapon and is unlikely to have been genetically engineered. There is no evidence SARS-CoV-2 existed in any laboratory prior to the pandemic. The first confirmed human infections were in Wuhan. A study of the first 41 cases of confirmed COVID‑19, published in January 2020 in The Lancet, reported the earliest date of onset of symptoms as 1 December 2019. Official publications from the WHO reported the earliest onset of symptoms as 8 December 2019. Human-to-human transmission was confirmed by the WHO and Chinese authorities by 20 January 2020. According to official Chinese sources, these were mostly linked to the Huanan Seafood Wholesale Market, which also sold live animals. In May 2020, George Gao, the director of the CDC, said animal samples collected from the seafood market had tested negative for the virus, indicating that the market was the site of an early superspreading event, but that it was not the site of the initial outbreak. Traces of the virus have been found in wastewater samples that were collected in Milan and Turin, Italy, on 18 December 2019. By December 2019, the spread of infection was almost entirely driven by human-to-human transmission. The number of COVID-19 cases in Hubei gradually increased, reaching sixty by 20 December, and at least 266 by 31 December. On 24 December, Wuhan Central Hospital sent a bronchoalveolar lavage fluid (BAL) sample from an unresolved clinical case to sequencing company Vision Medicals. On 27 and 28 December, Vision Medicals informed the Wuhan Central Hospital and the Chinese CDC of the results of the test, showing a new coronavirus.

3.5. Set Up the Prompt Compressor

We’ll now compress the retrieved context using LLMLingua’s prompt compression, which llama_index exposes as the LongLLMLinguaPostprocessor.

# Set up the LLMLingua postprocessor
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.response_synthesizers import CompactAndRefine
from llama_index.indices.postprocessor import LongLLMLinguaPostprocessor

node_postprocessor = LongLLMLinguaPostprocessor(
    instruction_str="Given the context, please answer the final question",
    model_name="TheBloke/Llama-2-7b-Chat-GPTQ",  # small quantized model does the compressing
    model_config={"revision": "main"},
    target_token=300,                 # token budget for the compressed context
    rank_method="longllmlingua",      # question-aware coarse-to-fine ranking
    additional_compress_kwargs={
        "condition_compare": True,
        "condition_in_question": "after",
        "context_budget": "+100",     # allow up to 100 tokens over the target
        "reorder_context": "sort",    # enable document reordering
        "dynamic_context_compression_ratio": 0.4,
    },
)
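Two of these settings deserve a brief note. rank_method="longllmlingua" enables the question-aware, coarse-to-fine ranking introduced in the LongLLMLingua paper, and reorder_context="sort" places the highest-ranked documents first, which helps counteract the "lost in the middle" effect, where LLMs tend to overlook content buried in the middle of a long context.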

from llama_index.indices.query.schema import QueryBundle

new_retrieved_nodes = node_postprocessor.postprocess_nodes(
    retrieved_nodes, query_bundle=QueryBundle(query_str=question)
)
compressed_contexts = "\n\n".join([n.get_content() for n in new_retrieved_nodes])

After prompt compression, the context contains 378 tokens. This is because we set target_token=300 and context_budget="+100", which allows the compressor to exceed the target budget by up to 100 tokens. The LLMLingua prompt compressor uses iterative token-level compression to ensure that critical information is retained while non-critical material is removed. The compressed context is as follows:

is be natural throughillo infection. A joint-stud in early 202 the Peoples China the World that desc from a coronavirus wild and anary wild are several theories about where index origin andations the origin thegoing. to in2 Science transmission through events in November19 and likely to trade the Huan market in the of Wuhan (i,). Douts conclus mostly centered the spver. phylog estimatedSCoV29. analysis that the haveating Guong beforehanMost scient sp human throughosis similar to theARSV-V outbreaks, and consistent pandem in human history. According to Intergovernmentanel on Climate social andcos destruction wildlife trade increased the likelihood of suchoonotic sp made the support of the European Union climate increased theelihood of the pandemiccing distribution batAvailable evidence suggests that the SAR-CoV-2 was originallyboured byats spread humans multiple times infected wild animals at theuanan Seaod Market in in December of and some members of intelligence the virus may been unally leaked from a such as thehan US intelligence community mixed issue overall agre with the scientific that was not developed as a bi and unlikely to have beenered. There is noARV existed inatory todemic.The in study of the first41 cases confirmed COVID9,0 in Theet, the earliest date of onset ofoms as 1 December09 publications from the W reported the earliestset of symptoms as 8 December 2019. Human-to-human transmission was confirmed by the WHO and Chinese authorities by 20 January 2020. According to official Chinese sources, these were mostly linked to the Huanan Seafood Wholesale Market, which also sold live animals.

3.6. Define Q&A Prompt Template

from llama_index.llms import OpenAI
from llama_index.prompts import PromptTemplate

def get_response(context_str, query_str, model="gpt-4-1106-preview"):
    llm = OpenAI(model=model)

    template = (
        "Given the provided context information below: \n"
        "---------------------\n"
        "{context_str}"
        "\n---------------------\n"
        "please answer the question: {query_str}\n"
    )
    qa_template = PromptTemplate(template)

    # Create a text prompt (for the completion API)
    prompt = qa_template.format(context_str=context_str, query_str=query_str)

    response = llm.complete(prompt)
    return response

Above is the prompt template for querying the LLM through llama_index in our RAG-based application. Now let’s compare the responses generated from the original context and from the compressed context.

response1 = get_response(context_str=original_contexts, query_str=question)
print(response1)

response2 = get_response(context_str=compressed_contexts, query_str=question)
print(response2)

Response1 is as follows:

The first case of COVID-19 was reported in a study published in January 2020 in The Lancet, which indicated the earliest date of onset of symptoms as 1 December 2019. This case was confirmed in Wuhan, China.

Response2 is as follows:

The provided context suggests that the earliest date of onset of symptoms for the first confirmed cases of COVID-19 was December 1, 2019. However, it also mentions that publications from the WHO reported the earliest onset of symptoms as December 8, 2019. Therefore, the first reported case could be associated with the onset of symptoms around early December 2019.

Comparing the two responses, we can see that the compressed context retains the information needed to answer the question correctly. In this case, the compression ratio is close to 10x, which we can verify:

original_tokens = node_postprocessor._llm_lingua.get_token_length(original_contexts)
compressed_tokens = node_postprocessor._llm_lingua.get_token_length(compressed_contexts)

print("Original Tokens:", original_tokens)
print("Compressed Tokens:", compressed_tokens)
print("Compressed Ratio:", f"{original_tokens/(compressed_tokens + 1e-5):.2f}x")

Original Tokens: 3512
Compressed Tokens: 378
Compressed Ratio: 9.29x
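To tie this back to the cost claim in the introduction: input-token cost scales linearly with prompt length, so a 9.29x reduction in context tokens translates almost directly into a 9.29x reduction in per-query context cost. A rough illustration, assuming a hypothetical input price of $0.01 per 1K tokens (actual pricing varies by model):

price_per_1k = 0.01  # hypothetical input price in USD per 1K tokens
print(f"Original context:   ${3512 / 1000 * price_per_1k:.4f} per query")
print(f"Compressed context: ${378 / 1000 * price_per_1k:.4f} per query")
# -> roughly $0.0351 vs $0.0038 per query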

4. Conclusion and Implications

By integrating prompt compression, particularly through technologies like LLMLingua, developers can build more efficient, responsive, and cost-effective AI systems, ready to meet the complex demands of modern data processing in the digital age. Its ability to reduce latency and improve system responsiveness makes it invaluable in real-time interactions and data-intensive tasks.

5. Further Reading and Resources

Learn more about LLMLingua and its applications by visiting the LLMLingua website and reading the research paper by Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, and Yuqing Yang, which delves into its approach and applications in depth.

As natural language processing continues to evolve, tools like LLMLingua, born from cutting-edge research, pave the way for more sophisticated and user-friendly interactions with AI systems. By embracing such technologies, developers and researchers can build more efficient, responsive, and cost-effective solutions, responding to the ever-growing demands of data processing in the digital age.
