Semantic chunking in practice

Boudhayan Dev
10 min read · Jan 3, 2024

Recap

What a year 2023 has been!
Busy, happening, and most importantly, filled with breakthroughs in the software development space.

One such breakthrough, the availability of enterprise-ready LLMs (large language models), has taken the industry by storm and is expected to make even greater inroads in 2024. LLMs will become mainstream and will be infused into every aspect of software and product development. From business AI to gaming, no frontier will remain untouched by the effects of LLMs.

Good or bad? Only time will tell. The arms race is on and there is no looking back.

LLMs, though, are a very interesting topic from a technical perspective. I have been busy trying to keep myself updated on the workings of LLMs. As a non-data scientist, I find it both intimidating and mind-blowing. The capabilities of different LLMs — multimodal/unimodal, etc., various prompting techniques — ReAct, CoT (chain of thought), etc., RAG (retrieval-augmented generation), fine-tuning… the list goes on. I have been fortunate enough to get a chance to work on these topics. They are vast and continuously evolving. No two roadblocks are going to be the same, and that’s precisely what keeps the LLM journey so exciting.

This blog is going to be about one such roadblock — chunking. Or more precisely, semantic chunking of the sections of a document, so that the data fed to the LLM (as part of prompts) is more precise and meaningful for the context of the prompt.

Let’s begin.

Problem

Consider that you are given a document that contains text and images, arranged into several sections (possibly nested), just like a book, novel, or architecture doc.

In other words, any structured document that you might have come across.

Now, the ask is to split the document and extract the sections, i.e. retain the semantic meaning of the chunks. Before you come up with a solution, here are a few additional (real-life) restrictions —

  1. As outlined, the chunks have to map to sections, i.e. you cannot randomly create chunks based on page numbers, word/character counts, etc.
  2. The document does not have an index/table of contents.
  3. Do not assume the size of the sections. Some of them are a few paragraphs long, while some could run into pages.
  4. Do not assume the start/end positions of the sections in the document. They may or may not start/end at the beginning/end of a page. They could very well start in the middle of a page.
  5. Do not assume the font size/style of the document. We need to support any and all kinds of documents.
  6. Needless to say, the document is larger than the context size (the max tokens that an LLM prompt can take) of the LLM that you can use (will you? :P) for this problem.

For example, consider a page from a document that shows the following structure —

  1. Prologue
  2. Methodology

Note: I have ignored Approach 1 and Approach 2 under Methodology, as their relative font height is smaller than that of Methodology, indicating that they are child sections. However, if you were to identify the structure of the sections as follows, it would also be right. Ex -

  1. Prologue
  2. Methodology
  3. Approach 1
  4. Approach 2

The level of nestedness that you want to identify is left up to you. You can choose to club child sections together as one major section (as in example 1), or you could present them as separate sections (as in example 2). What is important is the identification of these sections in the first place.

Example 1 -

Example 2 -

So, given the above restrictions, how do you go about extracting the sections?

Note: It is not expected to achieve 100% accuracy, as the pool of candidate documents is huge.

Design

As you’d have guessed, the solution is not very straightforward. At least, not something that can be solved with a deterministic algorithm. We are talking about autonomous detection and extraction of sections from a document. As if that were not difficult enough, we also intend the solution to work for any kind of document, i.e. any font, style of writing, formatting of content, etc.

Difficult? Sure. But not impossible.
Definitely not, in the era of LLMs.

Before we attempt the solution, here are a few observations.

Observation 1 -> For any well-structured document, the section sizes will follow a pattern, i.e. level 1 sections will have the same font size and style, say 2X; the next level of sections may have a smaller size — say X; and so on. But they will be consistent across the document.

Observation 2 -> Section headers/titles are typically evenly distributed across the document. They are not clustered at the beginning or only at the end of the document. So throughout the document, there is a good chance of encountering a section title/header.

Observation 3 -> Section titles/headers are few compared to the paragraphs or non-section text in the document.

Observation 4 -> Section titles/headers are generally bigger (in font size) than the surrounding text and have fewer characters than the surrounding blocks of text.

That’s it!
As long as we can identify the positions of the section titles/headers in the document, we can extract the blocks of text between two subsequent sections and meet our objective.

To identify a line as a possible section title/header, we are going to make use of an LLM (any LLM should work; GPT-3.5 was used for this blog).

What better than an LLM to make the above deductions for a given document!

So let us now look at the implementation.

Solution

Let us start by adding the dependencies. We are going to use Java. To parse the document, we will use the Apache PDFBox library.

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.27</version>
</dependency>

Now, the default PDF reader class in the PDFBox library will not provide the metadata that we are interested in, i.e. the font size, the total characters in a line, and the consecutive lines that share identical metadata. To record this metadata (why? later) as we pass through the document, we will have to create a custom text parser.

Let us create a model that will capture the font metadata.

import lombok.Data;

// Lombok's @Data generates the getters and setters used by the parser below.
@Data
public class DocumentLineItem {
    private float positionX;
    private float positionY;
    private String content;
    private int fontHeight;
    private int totalCharacters;

    @Override
    public String toString() {
        return this.content;
    }
}

Now, let us implement the text parser.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TextParser extends PDFTextStripper {

    private static final Logger oLogger = LoggerFactory.getLogger(TextParser.class);
    private final List<DocumentLineItem> lineItems;
    private final int pageStart;
    private final int pageEnd;
    private final PDDocument docu;
    private final float pageHeight;
    private int pageCounter;

    public TextParser(PDDocument document) throws IOException {
        this.pageStart = 1;
        this.pageEnd = document.getNumberOfPages();
        this.docu = document;
        this.pageHeight = this.docu.getPage(0).getMediaBox().getHeight();
        this.pageCounter = 0;
        this.lineItems = new ArrayList<>();
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
        super.writeString(text, textPositions);
        this.createLineItems(text, textPositions);
    }

    private void createLineItems(String lineText, List<TextPosition> textPositions) {
        DocumentLineItem documentLineItem = new DocumentLineItem();
        documentLineItem.setPositionX(textPositions.get(0).getXDirAdj());
        // Offset Y by the pages already processed, so positions are absolute across the document.
        documentLineItem.setPositionY(this.pageCounter * this.pageHeight + textPositions.get(0).getYDirAdj());
        documentLineItem.setContent(lineText);
        documentLineItem.setFontHeight((int) textPositions.get(0).getHeight());
        documentLineItem.setTotalCharacters(lineText.length());
        this.lineItems.add(documentLineItem);
    }

    // ParseException here is an application-specific exception, not java.text.ParseException.
    public List<DocumentLineItem> getLineItems() throws ParseException {
        try {
            for (int i = this.pageStart; i <= this.pageEnd; i++) {
                this.setStartPage(i);
                this.setEndPage(i);
                this.getText(this.docu); // triggers writeString() for every line on the page
                this.pageCounter++;
            }
            return this.lineItems;
        } catch (Exception e) {
            oLogger.error(e.getMessage());
            throw new ParseException("Error reading text line items");
        }
    }
}

The text parser exposes a getLineItems() method; every call to getText() makes PDFBox invoke writeString() internally for each line it reads, which is where we record the metadata. The result is a list of DocumentLineItem objects containing each line’s text, font height, total characters, and position in the document.
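A minimal usage sketch (the file name document.pdf is just an illustration; you will also need java.io.File and the usual exception handling):

// Minimal usage sketch: load a PDF and dump each line's font height and content.
try (PDDocument document = PDDocument.load(new File("document.pdf"))) {
    TextParser parser = new TextParser(document);
    for (DocumentLineItem lineItem : parser.getLineItems()) {
        System.out.println(lineItem.getFontHeight() + " -> " + lineItem);
    }
}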

The next step is to use this parser to read the PDF document. Once the list of DocumentLineItem is created, we will feed the info to the LLM to derive the font heights that are possible candidates for section headers.

To use the LLM, the following information is required -

  1. The list of all font heights in the document.
  2. The pattern of font distribution across the entire document. For this distribution, we also need to pass the total characters along with each font height.

From this pattern, the LLM can deduce the possible section header font sizes.

The list of unique fonts and the pattern of font variation across the document can be found using the following code.


/**
 * Returns the unique set of font heights found in the document, in decreasing order.
 */
private List<Integer> getListOfFonts(List<DocumentLineItem> contentLineItems) {
    Set<Integer> fontHeights = new TreeSet<>(Comparator.reverseOrder());
    for (DocumentLineItem lineItem : contentLineItems) {
        fontHeights.add(lineItem.getFontHeight());
    }
    return new ArrayList<>(fontHeights);
}


/**
 * Constructs the variation of fonts across the document.
 * It builds a compact list of font metadata that is fed to the LLM:
 * the document's style is preserved while the actual content is dropped.
 */
private String getFontMetadata(List<DocumentLineItem> contentLineItems) throws ParseException {
    // Guard against an empty document up front.
    if (contentLineItems == null || contentLineItems.isEmpty()) {
        throw new ParseException("Error : No document found for section title detection");
    }
    try {
        StringBuilder sb = new StringBuilder();
        int currentHeight = contentLineItems.get(0).getFontHeight();
        int totalCharacters = contentLineItems.get(0).getTotalCharacters();
        int lineCount = 1;
        for (int i = 1; i < contentLineItems.size(); i++) {
            DocumentLineItem lineItem = contentLineItems.get(i);
            if (lineItem.getFontHeight() != currentHeight) {
                // Font height changed: flush the metadata of the previous run of identical lines.
                sb.append("Metadata(fontHeight=").append(currentHeight).append(", lineCount=")
                        .append(lineCount).append(", totalCharacters=").append(totalCharacters).append(")\n");
                currentHeight = lineItem.getFontHeight();
                totalCharacters = lineItem.getTotalCharacters();
                lineCount = 1;
            } else {
                totalCharacters += lineItem.getTotalCharacters();
                lineCount += 1;
            }
        }
        // Flush the final run.
        sb.append("Metadata(fontHeight=").append(currentHeight).append(", lineCount=")
                .append(lineCount).append(", totalCharacters=").append(totalCharacters).append(")\n");
        return sb.toString();
    } catch (Exception e) {
        oLogger.error("Error constructing font metadata for document - {}", e.getMessage());
        throw new ParseException("Error constructing font metadata");
    }
}

Now, once you have the font list and the font distribution pattern of the document, all that is left is to construct a good prompt for the LLM to deduce the possible font sizes.

Following is an example prompt that you can use to deduce the font sizes that are possible candidates for section headers.

OBJECTIVE -

You are an AI assistant that is an expert at reading documents.
You will be provided the metadata information about all the lines that are contained in a document.
The objective is to figure out the section headers from the metadata.

Section Headers:
Section headers are basically topic headers that form the starting point of different sections.

***********************************************************

You will be provided with a list of metadata in the following structure:

Metadata(fontHeight=6, lineCount=3, totalCharacters=24)
Metadata(fontHeight=5, lineCount=1, totalCharacters=1)
Metadata(fontHeight=8, lineCount=1, totalCharacters=4)
Metadata(fontHeight=6, lineCount=1, totalCharacters=103)

and a list of unique font heights in decreasing order, which indicates all the font sizes that are available in the document:

[20, 18, 9, 6, 4....]


where,

fontHeight = height of the given line
lineCount = number of consecutive lines that have the same fontHeight
totalCharacters = total characters that these consecutive lines contain.

***********************************************************

RULES :

Generally, section headers/titles have the following properties -
1. They are bigger in size than the surrounding text.
2. They are few in number compared to the total text that the document contains.
3. The total characters of such lines are significantly lower compared to the surrounding lines.
4. They are generally evenly spread across the document and not clustered at the beginning or end of the document.


***********************************************************

OUTPUT RESPONSE :


I want you to utilize the metadata of the fonts and the list of all available font sizes to give me the possible section titles (there can be sub-sections as well).
I want the response as a comma-separated list of font sizes, i.e.

Ex - [20, 18 ..]

In the above, 20 and 18 are the font sizes from the metadata list that you consider to be section headers.
1. Do not include any explanation or code.
2. The response should be a comma-separated list of font sizes that are possible section headers.


Here is the List of all font sizes :

<<INSERT LIST OF ALL FONTS IN THE DOCUMENT>>

Here is the list of font metadata for the whole document:

<<INSERT FONT PATTERN METADATA FOR THE DOCUMENT>>

With the above prompt, you will receive a list of font heights that the LLM considers to be section headers. We must use a very low temperature value for this exercise to achieve a high degree of accuracy.
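Wiring it together could look like the following sketch. PROMPT_TEMPLATE (holding the prompt above) and sendToLlm() are hypothetical names standing in for whichever template mechanism and LLM client you use:

// Hypothetical glue code: PROMPT_TEMPLATE and sendToLlm() are placeholders for your own setup.
String prompt = PROMPT_TEMPLATE
        .replace("<<INSERT LIST OF ALL FONTS IN THE DOCUMENT>>", getListOfFonts(lineItems).toString())
        .replace("<<INSERT FONT PATTERN METADATA FOR THE DOCUMENT>>", getFontMetadata(lineItems));
String response = sendToLlm(prompt, 0.0); // near-zero temperature for deterministic output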

The response from the LLM can be used as a basis to formulate possible section headers and therefore sections from the document.
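Parsing the reply is straightforward; here is a small sketch (the helper name is illustrative, and it needs java.util.HashSet, java.util.Set, and java.util.regex imports):

// Illustrative helper: pull the font heights out of a "[9, 8]"-style LLM reply.
private Set<Integer> parseHeaderFonts(String llmResponse) {
    Set<Integer> headerFonts = new HashSet<>();
    Matcher matcher = Pattern.compile("\\d+").matcher(llmResponse);
    while (matcher.find()) {
        headerFonts.add(Integer.parseInt(matcher.group()));
    }
    return headerFonts;
}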

For the document used as an example (see the Problem section), the LLM rightly identified font sizes 9 and 8 as section headers, which correspond to Prologue (9), Methodology (9), Approach 1 (8), and Approach 2 (8).

Now that we know the section header font sizes, we can extract the section content easily!
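A minimal extraction sketch, assuming headerFonts is the set parsed from the LLM response (all names here are illustrative):

// Sketch: split the line items into sections, keyed by the header line that starts each one.
private Map<String, StringBuilder> extractSections(List<DocumentLineItem> lineItems, Set<Integer> headerFonts) {
    Map<String, StringBuilder> sections = new LinkedHashMap<>();
    String currentSection = "PREAMBLE"; // any text before the first detected header
    sections.put(currentSection, new StringBuilder());
    for (DocumentLineItem lineItem : lineItems) {
        if (headerFonts.contains(lineItem.getFontHeight())) {
            // A header line starts a new section; duplicate titles get merged into one entry.
            currentSection = lineItem.getContent().trim();
            sections.putIfAbsent(currentSection, new StringBuilder());
        } else {
            sections.get(currentSection).append(lineItem.getContent()).append('\n');
        }
    }
    return sections;
}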

Here, let me show you the results.

1. Printing out the font heights of each line of the document with their content, in the notation FONT_HEIGHT -> CONTENT.

Keep an eye on the font sizes 9 and 8.
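An illustrative excerpt of that printout (the header lines match the example document; the body font height of 6 and the elided paragraph text are placeholders):

9 -> Prologue
6 -> <paragraph text>
6 -> <paragraph text>
9 -> Methodology
8 -> Approach 1
6 -> <paragraph text>
8 -> Approach 2
6 -> <paragraph text>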

2. Now we find out the list of all unique font sizes in the document as well as the font pattern distribution for the whole document (using the getListOfFonts() and getFontMetadata() methods above).

Following is the output for the document.
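Illustratively (the exact counts depend on your document), it looks like this:

[9, 8, 6]

Metadata(fontHeight=9, lineCount=1, totalCharacters=8)
Metadata(fontHeight=6, lineCount=12, totalCharacters=940)
Metadata(fontHeight=9, lineCount=1, totalCharacters=11)
Metadata(fontHeight=8, lineCount=1, totalCharacters=10)
...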

As you can see from the pattern, the font height and the total characters for each run of successive identical lines (same font height) are retained. As a result, the overall font pattern distribution of the document is preserved, which is used in the next stage to deduce the section headers.

3. Now that we have the above information, we use the LLM prompt to filter the pattern distribution data and extract the font sizes that could be probable section headers.
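For this document, the LLM’s reply boils down to:

[9, 8]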

As you can see from the response, font sizes 9 and 8 are identified as possible section headers based on the pattern distribution. If you recall, font sizes 9 and 8 correspond to the section headers highlighted in example 2 of the Problem section.

Conclusion

I hope this blog gave you an idea of how to leverage an LLM to autonomously identify sections in a document. This can help you prepare meaningful splits of your data for RAG-based use cases, or prepare vector database contents for embeddings. The accuracy of the solution can be improved by adding more metadata, such as the font name, style, etc., to the pattern distribution in the prompt.
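For instance, PDFBox exposes the font behind each TextPosition, so createLineItems() could additionally record the font name (fontName here is a hypothetical new field on DocumentLineItem):

// Sketch: also capture the font name inside createLineItems().
// Bold/italic variants often show up in the name, which is a useful header signal.
documentLineItem.setFontName(textPositions.get(0).getFont().getName());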

Warm Regards!


Boudhayan Dev

I am a full-stack software developer currently associated with SAP Labs. You can check out my work and connect with me on GH — https://github.com/boudhayan-dev