Vector DB Retrieval: Why Chunking Matters

To chunk or not to chunk

What is chunking?

Chunking is the process of dividing a large document into smaller segments, referred to as “chunks.” This technique is essential in the Retrieval-Augmented Generation (RAG) ecosystem for several reasons, the most critical being the context window limitations of Large Language Models (LLMs). For extraction or reasoning tasks, the contents of documents are provided to an LLM through a “prompt.”

The total volume of text that can be processed by an LLM is constrained by its context size, which means that submitting an entire large document at once is typically not possible. Therefore, it is important to send only the contextually relevant sections of the document. This requires chunking the document first to effectively identify the relevant parts.

How large is the context size or window for an LLM?

Every large language model has a property known as its context size, or context window. For a model like GPT-4, the context size defines the maximum number of tokens (words, sub-words, and punctuation) that the model can process at once as input.

This characteristic is often referred to as the model’s “window” or “attention span.” For instance, the GPT-3.5 Turbo model has a context size of 4,096 tokens, allowing it to take into account up to 4,096 tokens of text for processing or generating responses. OpenAI has a reference page that outlines the context window sizes for its different models. It’s also important to keep in mind that users are typically charged based on the number of tokens processed by an LLM, which can affect costs. Furthermore, the processing speed (latency) of LLMs is expressed in tokens per second, which is another aspect to consider.

Note: Around 1,000 English words is about 750 tokens, give or take.

Let’s assume you have 2 documents:

  1. Document A with 2,000 words (approximately 1,500 tokens)
  2. Document B with 15,000 words (approximately 11,250 tokens)

Document A fits comfortably within the 4,096-token limit of the GPT-3.5 Turbo model, allowing us to submit the entire document along with a prompt for the model to process in a single call.

However, Document B, containing 11,250 tokens, exceeds this limit, which is where chunking becomes necessary. While Document A can be processed without chunking, Document B requires us to establish an appropriate chunk size for effective management.
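
As a back-of-the-envelope check, this decision can be sketched in a few lines of Python. The sketch below uses the same rough heuristic of ~750 tokens per 1,000 words; the 500-token allowance for the prompt and the model’s answer is an arbitrary assumption for illustration.

```python
# Rough heuristic used above: ~750 tokens per 1,000 English words.
TOKENS_PER_WORD = 0.75

def estimate_tokens(text: str) -> int:
    """Approximate the token count from the word count."""
    return int(len(text.split()) * TOKENS_PER_WORD)

def needs_chunking(text: str, context_size: int = 4096, prompt_budget: int = 500) -> bool:
    """True if the document, plus room for the prompt and the answer,
    would not fit in the model's context window."""
    return estimate_tokens(text) + prompt_budget > context_size

# Document A: ~2,000 words -> ~1,500 tokens -> fits in GPT-3.5 Turbo's 4,096-token window.
# Document B: ~15,000 words -> ~11,250 tokens -> must be chunked.
doc_a = "word " * 2_000
doc_b = "word " * 15_000
print(needs_chunking(doc_a))  # False
print(needs_chunking(doc_b))  # True
```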

Pros and Cons of Chunking

No Chunking

Benefits:
1. The full context of the document is transferred to the LLM.
2. Almost always provides the best quality / most accurate results.
3. No vector DB is involved.
4. No embedding is involved.
5. No retrieval engine is involved (no need to worry about embedding quality or chunk size strategies).

Drawbacks:
1. Does not work for large documents.
2. Higher cost, since the entire document is sent to the LLM for every field extraction.
3. Higher latency (slower result generation), since the entire document is sent to the LLM irrespective of the field being extracted.

With Chunking

Benefits:
1. Information can be extracted from very large documents that won’t fit into the LLM’s context as a whole.
2. Lower cost, since only small chunks of the document are sent to the LLM for a single prompt.
3. Lower latency (faster result generation), since only part of the document is sent to the LLM.

Drawbacks:
1. A vector DB is required.
2. Embeddings are required.
3. The quality of retrieval (selecting the right chunks to send to the LLM) depends on many factors:
   a) Selection of chunk size
   b) Selection of overlap
   c) Retrieval strategy used
   d) Quality of embeddings
   e) Information density / distribution within the document
4. Iterative experimentation is required to arrive at the above settings.

Your Choice

If your documents are smaller than the context size of the LLM (Document A)

Choose not to chunk if your document’s text content fits into the context size of the selected LLM: this provides the best results. However, if you have many extractions to make from each document, the cost can increase significantly, since the entire document is sent with every prompt.

If your documents are larger than the context size of the LLM (Document B)

Chunking is necessary. There is no other option.

Popular LLMs and their limits on document sizes

In our experience, dense documents average about 400 words per page, and most documents have significantly fewer. For our calculations, let us assume documents with 500 words per page.

Note: 500 words per page is at the high end of the scale. Real-world documents, especially averaged over multiple pages, will usually contain far fewer.

LLM Model       | Context Size   | No Chunking (max pages) | Requires Chunking if the document is...
Llama           | 2,048 (2K)     | 5 pages                 | > 5 pages
Llama 2         | 4,096 (4K)     | 10 pages                | > 10 pages
GPT 3.5 Turbo   | 4,096 (4K)     | 10 pages                | > 10 pages
GPT 4           | 8,192 (8K)     | 20 pages                | > 20 pages
Mistral 7B      | 32,768 (32K)   | 80 pages                | > 80 pages
GPT 4 Turbo     | 131,072 (128K) | 320 pages               | > 320 pages
Gemini 1.5 Pro  | 131,072 (128K) | 320 pages               | > 320 pages
Claude 3 Sonnet | 204,800 (200K) | 500 pages               | > 500 pages

*As of June 2024
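
The “max pages” column above follows from the same word-to-token heuristic. Here is a minimal sketch of the derivation, assuming the 500 words per page figure used above (the table rounds the results down further, leaving headroom for the prompt and the model’s response):

```python
TOKENS_PER_WORD = 0.75
WORDS_PER_PAGE = 500                                 # deliberately high-end assumption
TOKENS_PER_PAGE = WORDS_PER_PAGE * TOKENS_PER_WORD   # 375 tokens per page

context_sizes = {
    "Llama": 2_048,
    "Llama 2": 4_096,
    "GPT 3.5 Turbo": 4_096,
    "GPT 4": 8_192,
    "Mistral 7B": 32_768,
    "GPT 4 Turbo": 131_072,
    "Gemini 1.5 Pro": 131_072,
    "Claude 3 Sonnet": 204_800,
}

for model, ctx in context_sizes.items():
    max_pages = int(ctx // TOKENS_PER_PAGE)
    print(f"{model}: ~{max_pages} pages before chunking is required")
```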

Cost Considerations for a no-chunking strategy

Embracing a no-chunking strategy is very attractive, especially since current leading models can process over 100 pages at once. However, you must also factor in the cost incurred during the extraction process.

For our calculations, we will assume that each page contains approximately 400 words, which we view as a standard measure for information-dense documents.


LLM Model      | Approx. cost per page
GPT 4          | $0.0100
GPT 4 Turbo    | $0.0030
Gemini 1.5 Pro | $0.0021

** Pricing info as of 3rd May 2024

Also note that the table above shows the price per page for a single prompt. In real-world use cases, you might need to send the same page to the LLM multiple times when several pieces of information need to be extracted.
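
To see where the per-page figures come from, here is the arithmetic as a short sketch. The per-1,000-token input prices below are assumed example values chosen to be roughly consistent with the table; always check current vendor pricing.

```python
TOKENS_PER_WORD = 0.75
WORDS_PER_PAGE = 400                                 # dense-page assumption used above
TOKENS_PER_PAGE = WORDS_PER_PAGE * TOKENS_PER_WORD   # ~300 tokens per page

# Assumed example input prices per 1,000 tokens (illustrative only).
price_per_1k_input_tokens = {
    "GPT 4": 0.03,
    "GPT 4 Turbo": 0.01,
    "Gemini 1.5 Pro": 0.007,
}

for model, price in price_per_1k_input_tokens.items():
    cost_per_page = TOKENS_PER_PAGE / 1000 * price
    print(f"{model}: ~${cost_per_page:.4f} per page")
# GPT 4: ~$0.0090, GPT 4 Turbo: ~$0.0030, Gemini 1.5 Pro: ~$0.0021
```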

Choosing chunk size and overlap

Choosing the right chunking size and overlap is crucial for optimising performance and retrieval quality.

Chunk Size

Each chunk of text is typically “embedded” and added to a vector database for retrieval. Embedding is the process of converting the information in a chunk into a “vector” that represents its content. The size of each chunk matters because it affects the density of information captured by each embedding: a chunk that is too small may not contain enough information, while one that is too large can dilute the information that gets embedded.

Overlap

Overlap between chunks ensures that the information at the start and end of each chunk is not contextually isolated. An overlap helps create a seamless join between the retrieved pieces of information.
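
As a concrete illustration, a minimal fixed-size chunker with overlap might look like the sketch below. It splits on words for simplicity; a production pipeline would typically split on tokens, sentences, or document structure instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 30) -> list[str]:
    """Split text into word-based chunks of `chunk_size` words, where
    consecutive chunks share `overlap` words of context."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# Example: a 1,000-word document with 200-word chunks and a 30-word overlap.
doc = "word " * 1_000
print(len(chunk_text(doc)))  # 6 chunks
```

Each of these chunks would then be embedded and stored in the vector DB for retrieval.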

Determining Chunk Size

Context size of the LLM: Consider the maximum context window of your LLM (e.g. 4,096 tokens for GPT-3.5 Turbo). When choosing smaller chunk sizes, make sure each chunk still contains enough content to provide meaningful context to the LLM.

Content type: Depending on the type of text (e.g., technical documents, conversational transcripts), different chunk sizes may be optimal. Texts dense with information-rich content might require smaller chunks: if there is “too much” information in each chunk, the vector DB might not return the optimal chunks for a given query during retrieval.

Retrieval requirements: Some retrieval strategies retrieve the “top-k” chunks and pass them to the LLM. For a k value of 3, the vector DB retrieves the 3 most relevant chunks to pass to the LLM, and all 3 chunks (plus the prompt) must fit in the LLM’s context window simultaneously.
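
A quick sanity check for this constraint can be sketched as follows; the chunk, prompt, and answer token counts are example values.

```python
def fits_in_context(top_k: int, chunk_tokens: int, prompt_tokens: int,
                    answer_tokens: int, context_size: int) -> bool:
    """Check that the retrieved chunks, the prompt, and room for the answer
    all fit in the model's context window at the same time."""
    return top_k * chunk_tokens + prompt_tokens + answer_tokens <= context_size

# k=3 chunks of ~800 tokens, a 300-token prompt, and 500 tokens reserved for
# the answer, against GPT-3.5 Turbo's 4,096-token window.
print(fits_in_context(3, 800, 300, 500, 4096))   # True  (3,200 of 4,096 tokens)
print(fits_in_context(3, 1500, 300, 500, 4096))  # False (5,300 tokens needed)
```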

Deciding on Overlap Size

The overlap should be large enough to maintain context between chunks; typically, an overlap of 10-20% of the chunk size is a good starting point. For example, with a 512-token chunk size, an overlap of roughly 50 to 100 tokens is reasonable. Too much overlap dilutes the differentiation between consecutive chunks, so find a balance based on your specific application’s needs.

Experiment and Adjust

Start with a strategy derived from the guidelines above, then test your chunking and overlap choices in practical scenarios. Examine how these selections affect retrieval quality and the correctness of the LLM’s outputs, and refine the chunk sizes and overlaps as needed. This may take several rounds of adjustment to arrive at the most effective size and overlap for a particular use case.
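
One way to structure that iteration is a simple grid search over candidate chunk sizes and overlaps. In the sketch below, evaluate_retrieval is a hypothetical placeholder you would replace with your own scoring logic (for example, the fraction of test queries whose answer chunk appears in the top-k results).

```python
import random
from itertools import product

random.seed(0)

def evaluate_retrieval(chunk_size: int, overlap: int) -> float:
    """Placeholder for illustration: in practice, re-index the corpus with
    these settings and score retrieval quality on a labelled query set."""
    return random.random()  # dummy score so the sketch runs end to end

chunk_sizes = [256, 512, 1024]        # candidate chunk sizes, in tokens
overlap_ratios = [0.10, 0.15, 0.20]   # 10-20% of the chunk size

results = []
for size, ratio in product(chunk_sizes, overlap_ratios):
    overlap = int(size * ratio)
    results.append((evaluate_retrieval(size, overlap), size, overlap))

best_score, best_size, best_overlap = max(results)
print(f"Best so far: chunk_size={best_size}, overlap={best_overlap}, score={best_score:.2f}")
```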

Resource Constraints

Large chunks and large overlaps put extra stress on compute resources. Make sure your settings are sustainable given your cost and response-latency requirements. If your application requires very low latency, you might need to optimize for fast retrieval and LLM evaluation at the potential cost of retrieval quality.
