Welcome back, intrepid data explorer! So far, we’ve learned how to set up LangExtract, define schemas, and extract structured information from various texts. But what happens when your text isn’t a neat paragraph or a short email, but an entire legal contract, a research paper, or a lengthy financial report? These documents often exceed the “attention span” of even the most powerful Large Language Models (LLMs).

In this chapter, we’ll dive into one of LangExtract’s most powerful features: intelligent chunking strategies. You’ll discover why traditional LLM approaches struggle with long texts, how LangExtract elegantly solves this problem through automatic chunking and multi-pass processing, and how you can fine-tune these strategies to ensure accurate and complete data extraction from even the most verbose documents.

By the end of this chapter, you’ll not only understand the theory behind processing long documents but also gain hands-on experience in configuring LangExtract to handle them like a pro. Ready to conquer those monster texts? Let’s go!

The LLM Context Window Problem: Why Length Matters

Before we jump into solutions, let’s understand the challenge. LLMs, for all their intelligence, have a fundamental limitation: their context window. Think of it like a very focused, but small, working memory. An LLM can only process and “understand” a certain number of tokens (words or sub-word units) at any given time.

If you feed an LLM a document longer than its context window, it simply can’t “see” or process the entire text simultaneously. It’s like asking a person to summarize a 500-page book after only reading the first 50 pages – crucial information from later sections would be completely missed.
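To get intuition for the numbers, a quick back-of-the-envelope token estimate helps. This sketch uses the common rough heuristic of ~4 characters per token for English text; real tokenizers vary by model:

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Very rough token estimate: ~4 characters per token in English text."""
    return max(1, len(text) // chars_per_token)

# A "500-page book" at roughly 2,000 characters per page:
book_chars = 500 * 2_000
print(estimate_tokens("x" * book_chars))  # ~250,000 tokens -- beyond many context windows
```

Even generous context windows of 100k+ tokens can be exhausted by a single long contract or report, which is exactly the scenario this chapter addresses.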

This limitation leads to several problems for data extraction:

  • Truncation: Important details might be cut off if they fall outside the context window.
  • Loss of Context: The LLM might miss relationships or dependencies between different parts of the document.
  • Reduced Accuracy: Without a full understanding, extraction quality suffers.
  • Increased Cost: Inefficient handling of long texts (for example, repeated retries after truncated responses) can lead to more LLM calls and higher costs.

So, how do we get around this? We break the problem down!

Introducing Chunking: Breaking Down the Beast

Chunking is the process of dividing a long document into smaller, more manageable segments or “chunks.” Each chunk is small enough to fit within an LLM’s context window. This allows the LLM to process the entire document piece by piece.

But simply splitting text into arbitrary chunks isn’t enough. Imagine cutting a sentence in half – the meaning would be lost! This is where smart chunking strategies come in. LangExtract doesn’t just cut; it tries to split documents intelligently, preserving semantic coherence wherever possible.
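To see why arbitrary splits are risky, here is a deliberately naive character-based chunker. This is an illustration only, not how LangExtract splits text:

```python
def naive_chunks(text: str, size: int) -> list[str]:
    """Split text into fixed-size character chunks, ignoring sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]

text = "The deadline moved to January 31. David will draft the job descriptions."
for chunk in naive_chunks(text, size=40):
    print(repr(chunk))
# The first chunk ends just after "David", stranding the name from
# "will draft ..." -- so neither chunk alone tells you who owns the task.
```

A boundary like this is precisely where a dumb splitter loses the assignee of an action item, which motivates the smarter strategies below.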

LangExtract’s Intelligent Chunking & Multi-Pass Approach

LangExtract is designed from the ground up to handle long documents. It orchestrates a sophisticated workflow that often involves:

  1. Initial Chunking: The input document is automatically divided into smaller chunks. LangExtract attempts to do this intelligently, often respecting paragraph breaks or other structural elements to keep related information together.
  2. Parallel Processing (First Pass): Each chunk is sent to the LLM independently. The LLM then extracts information based on your defined schema from that specific chunk. This can happen in parallel, speeding up the process.
  3. Aggregation & Refinement (Second Pass or More): Once all chunks have been processed, LangExtract doesn’t just combine the results. It often performs a second pass where it aggregates the extracted information, resolves inconsistencies, and fills in any missing details by re-evaluating the combined context or specific problematic chunks. This multi-pass approach is crucial for high-quality extraction from complex documents.

This orchestration is a significant advantage over simply prompting an LLM with manually chunked text, as LangExtract handles the complexity of managing context across chunks and synthesizing results.
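The chunk / parallel-extract / aggregate pattern described above can be sketched in plain Python. This is a conceptual illustration only, not LangExtract's actual internals; `extract_from_chunk` is a stub standing in for a real schema-guided LLM call:

```python
from concurrent.futures import ThreadPoolExecutor

def extract_from_chunk(chunk: str) -> list[str]:
    """Stub 'LLM call': pulls action-item lines out of one chunk."""
    return [line.strip() for line in chunk.splitlines()
            if line.strip().startswith("Action Item:")]

def extract_document(text: str, chunk_size: int = 200) -> list[str]:
    # 1. Initial chunking: split on paragraph boundaries where possible.
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > chunk_size:
            chunks.append(current)
            current = ""
        current += para + "\n\n"
    if current:
        chunks.append(current)
    # 2. First pass: process chunks in parallel.
    with ThreadPoolExecutor() as pool:
        per_chunk = list(pool.map(extract_from_chunk, chunks))
    # 3. Aggregation: merge results and drop duplicates introduced by overlap.
    seen, merged = set(), []
    for items in per_chunk:
        for item in items:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged
```

The real system layers schema validation and multi-pass refinement on top, but the shape of the workflow is the same: split respecting structure, fan out, then synthesize.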

Controlling Chunking: Parameters at Your Fingertips

While LangExtract’s default chunking is often excellent, you have the power to fine-tune it. The key parameters you’ll often interact with are chunk_size and chunk_overlap.

  • chunk_size: This parameter sets the maximum length of each chunk, typically measured in tokens. A larger chunk_size means fewer chunks (and fewer LLM calls), but risks exceeding the LLM’s context window. A smaller chunk_size creates more chunks, potentially increasing processing time and cost, but reduces the risk of context overflow.
  • chunk_overlap: When splitting a document, chunk_overlap specifies how many tokens from the end of one chunk should be included at the beginning of the next chunk. This is incredibly important! Overlap helps maintain continuity and ensures that context isn’t lost at the chunk boundaries, which is a common problem in fixed-size chunking.
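A minimal sketch of fixed-size chunking with overlap makes the mechanics concrete. Characters stand in for tokens here, and LangExtract's real splitter is more sophisticated:

```python
def chunk_with_overlap(text: str, size: int, overlap: int) -> list[str]:
    """Fixed-size chunks where each chunk repeats the last `overlap`
    characters of the previous one (characters stand in for tokens)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

text = "Due Date: 2026-01-05. The discussion then shifted to risks."
print(chunk_with_overlap(text, size=30, overlap=10))
```

Because each chunk repeats the tail of the previous one, a date or name that falls at a boundary still appears whole in at least one chunk, provided the overlap is longer than the entity itself.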

Let’s see this in action!

Step-by-Step Implementation: Extracting from a “Long” Document

For our example, we’ll simulate a slightly longer document – a fictional meeting transcript – and try to extract structured information from it. We’ll then experiment with chunk_size and chunk_overlap.

First, let’s set up our environment and define a simple schema for meeting notes.

# Assuming you have LangExtract and your LLM provider set up as in Chapter 2 & 3
import langextract as lx
import os
from pydantic import BaseModel, Field
from typing import List, Optional

# Ensure your LLM provider is configured (e.g., Google Generative AI)
# For this example, we'll use a placeholder, but in a real scenario,
# you'd configure it like:
# lx.set_llm_provider(lx.GoogleGenerativeAI(api_key=os.getenv("GEMINI_API_KEY")))
# For demonstration, we'll assume a provider is set.
# If you haven't set it up, please refer to Chapter 3.

# 1. Define our extraction schema
class MeetingActionItem(BaseModel):
    """An action item assigned during a meeting."""
    task: str = Field(description="Description of the task to be completed.")
    assignee: str = Field(description="The person responsible for the task.")
    due_date: Optional[str] = Field(None, description="The date by which the task should be completed, if specified.")

class MeetingSummary(BaseModel):
    """Summary of key decisions and action items from a meeting."""
    meeting_title: str = Field(description="The title or main topic of the meeting.")
    date: str = Field(description="The date the meeting took place.")
    attendees: List[str] = Field(description="List of attendees present at the meeting.")
    key_decisions: List[str] = Field(description="List of major decisions made during the meeting.")
    action_items: List[MeetingActionItem] = Field(description="List of action items with assignees and due dates.")

# 2. Prepare our "long" document (simulated)
long_meeting_transcript = """
Meeting Minutes: Project Alpha Kick-off
Date: 2025-12-15
Attendees: Alice Smith, Bob Johnson, Carol White, David Green

The meeting began at 10:00 AM. Alice welcomed everyone and outlined the agenda.
We discussed the overall project goals, focusing on market penetration and user feedback integration.
Bob presented the initial technical architecture, highlighting the use of microservices and a cloud-native approach.
Carol raised concerns about the timeline for phase 1, suggesting we might need more resources for the backend development.
After some debate, it was decided that the Phase 1 deadline would remain January 31, 2026, but with an accelerated hiring push for two senior backend engineers.

Action Item: Draft job descriptions for senior backend engineers.
Assignee: David Green
Due Date: 2025-12-18

Next, we moved onto user interface (UI) design. David showcased some early wireframes.
Alice emphasized the need for a mobile-first approach and accessibility considerations.
Bob suggested integrating a new analytics tool to track user engagement from day one.
Decision: Proceed with mobile-first UI design, prioritize accessibility.

Action Item: Research and propose analytics tools.
Assignee: Bob Johnson
Due Date: 2026-01-05

The discussion then shifted to potential risks. Carol pointed out a dependency on an external API that has historically been unreliable.
Alice proposed building a robust retry mechanism and fallback strategy.
Decision: Implement robust error handling for external API dependency.

Action Item: Develop a detailed fallback strategy for the external API.
Assignee: Carol White
Due Date: 2026-01-10

Meeting adjourned at 11:30 AM. Next meeting scheduled for 2026-01-08.
"""

print("--- Starting Extraction with Default Chunking ---")
# 3. Perform extraction with default chunking
# LangExtract will automatically determine the best chunking strategy
# based on the LLM's context window and the input text length.
try:
    result_default = lx.extract(
        text_or_document=long_meeting_transcript,
        schema=MeetingSummary,
        name="Meeting Summary Extraction"
    )
    print("\nExtracted with default chunking:")
    print(result_default.parsed_output.model_dump_json(indent=2))
except Exception as e:
    print(f"Error during default extraction: {e}")

Explanation:

  1. We import langextract and pydantic for schema definition.
  2. We define two Pydantic models: MeetingActionItem and MeetingSummary. Notice how MeetingSummary contains a list of MeetingActionItem objects, demonstrating nested schema extraction.
  3. We create a long_meeting_transcript string. In a real application, this would come from a file or database.
  4. We call lx.extract(), passing our long text and the MeetingSummary schema. Crucially, we don’t specify any chunking parameters yet. LangExtract is smart enough to detect the length and apply its default, intelligent chunking strategy.

Run this code. You should see a well-structured JSON output containing the meeting title, date, attendees, decisions, and all action items, even though the text is longer than a single prompt might handle effectively.
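Independently of any LLM call, you can sanity-check that the schema itself accepts the shape of output you expect. This uses only Pydantic; the payload below is hand-written sample data, not real model output:

```python
from typing import List, Optional
from pydantic import BaseModel

class MeetingActionItem(BaseModel):
    task: str
    assignee: str
    due_date: Optional[str] = None

class MeetingSummary(BaseModel):
    meeting_title: str
    date: str
    attendees: List[str]
    key_decisions: List[str]
    action_items: List[MeetingActionItem]

# Hand-written payload mimicking the shape we expect back from extraction.
payload = {
    "meeting_title": "Project Alpha Kick-off",
    "date": "2025-12-15",
    "attendees": ["Alice Smith", "Bob Johnson", "Carol White", "David Green"],
    "key_decisions": ["Phase 1 deadline remains January 31, 2026"],
    "action_items": [
        {"task": "Draft job descriptions for senior backend engineers",
         "assignee": "David Green", "due_date": "2025-12-18"},
    ],
}
summary = MeetingSummary.model_validate(payload)
print(summary.action_items[0].assignee)
```

Validating a known-good payload up front separates "my schema is wrong" failures from "the extraction missed something" failures, which saves debugging time later.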

Customizing Chunking Parameters

Now, let’s take control. What if we want to ensure smaller chunks or increase overlap? We can pass chunk_size and chunk_overlap directly to the lx.extract function.

# ... (previous code for imports, schema, and long_meeting_transcript) ...

print("\n--- Starting Extraction with Custom Chunking (Smaller Chunks, More Overlap) ---")
# 4. Perform extraction with custom chunking parameters
# Let's imagine our LLM has a very small context window, or we want
# to be extra careful about context at boundaries.
try:
    result_custom = lx.extract(
        text_or_document=long_meeting_transcript,
        schema=MeetingSummary,
        name="Meeting Summary Extraction with Custom Chunks",
        # Custom chunking parameters
        chunk_size=100,      # Max 100 tokens per chunk
        chunk_overlap=20     # 20 tokens overlap between chunks
    )
    print("\nExtracted with custom chunking:")
    print(result_custom.parsed_output.model_dump_json(indent=2))
except Exception as e:
    print(f"Error during custom extraction: {e}")

print("\n--- Starting Extraction with Chunking Disabled (Not Recommended for Long Texts) ---")
# 5. You can also explicitly disable chunking, but be warned:
# This will likely fail or return incomplete results for truly long documents
# if they exceed the LLM's context window.
try:
    result_no_chunking = lx.extract(
        text_or_document=long_meeting_transcript,
        schema=MeetingSummary,
        name="Meeting Summary Extraction without Chunking",
        chunking_enabled=False # Explicitly disable chunking
    )
    print("\nExtracted without chunking (might be incomplete/fail for very long texts):")
    print(result_no_chunking.parsed_output.model_dump_json(indent=2))
except Exception as e:
    print(f"Error during no-chunking extraction: {e}")
    print("This error is expected if the text exceeds the LLM's context window.")

Explanation:

  1. We perform another extraction, this time adding chunk_size=100 and chunk_overlap=20. This tells LangExtract to break the document into chunks of approximately 100 tokens, with each chunk overlapping the next by 20 tokens.
  2. We also show an example of chunking_enabled=False. While useful for very short texts where chunking overhead isn’t needed, for anything substantial, this will likely cause truncation or an error because the text will exceed the LLM’s context window. This demonstrates why LangExtract’s default behavior is so valuable.

By running this, you’ll observe that even with custom chunking, the results are consistent, demonstrating LangExtract’s robust aggregation capabilities. The “no chunking” attempt might fail or return an incomplete summary if the long_meeting_transcript is truly long enough to exceed your LLM’s context window.

Mini-Challenge: Optimizing for a Specific Detail

You’ve seen how LangExtract handles long documents. Now it’s your turn to experiment!

Challenge: Imagine you have a very long legal document, and you’re specifically interested in extracting all clauses related to “liability” and their corresponding section numbers. You suspect that some liability clauses might be short and scattered, meaning a large chunk_size could miss context, or a small chunk_overlap could split a clause.

  1. Create a new Pydantic schema called LiabilityClause with fields for section_number (string) and clause_text (string).
  2. Generate a mock “legal document” string that is significantly longer than our meeting transcript (e.g., 500-1000 words). Include at least 3-4 liability clauses, some short, some longer, and scatter them throughout the text. Make sure one clause spans a potential chunk boundary if you were to use a small chunk_size.
  3. Perform an extraction using LangExtract with this new schema.
  4. Experiment with chunk_size and chunk_overlap:
    • First, try with a chunk_size of 80 and chunk_overlap of 10.
    • Then, try with a chunk_size of 150 and chunk_overlap of 30.
    • Finally, try with a chunk_size of 50 and chunk_overlap of 25.
  5. Observe and compare the results. Did any configuration miss a clause or split one awkwardly? Which settings seemed to perform best for your mock document and schema?

Hint: Pay close attention to the clause_text for completeness and ensure all section_numbers are captured. If a clause seems truncated, it might indicate an issue with your chunk_size or chunk_overlap.

Common Pitfalls & Troubleshooting

Working with long documents and chunking can introduce new challenges. Here are a few common pitfalls:

  1. Too Small chunk_size: While it ensures all chunks fit, excessively small chunks can break up semantically related sentences or paragraphs, leading to a loss of local context. The LLM might then struggle to extract complete information, as it doesn’t “see” enough surrounding text to understand the full meaning.
    • Troubleshooting: If your extracted fields are consistently incomplete or fragmented, try increasing chunk_size slightly.
  2. Insufficient chunk_overlap: If chunk_overlap is too small or zero, critical information might be split exactly at a boundary. For example, a key decision or an action item might start in one chunk and end in the next, making it difficult for the LLM to process it entirely in either chunk.
    • Troubleshooting: If you notice missing information or truncated values, increase chunk_overlap. A good rule of thumb is 10-20% of chunk_size, but this can vary.
  3. Ignoring Multi-Pass Aggregation Issues: Sometimes, even with good chunking, the final aggregated result might have inconsistencies. This usually points to ambiguities in your schema or the prompt instructions, or potential issues with the LLM’s ability to synthesize. LangExtract’s multi-pass system tries to mitigate this, but it’s not foolproof.
    • Troubleshooting: Refine your schema’s descriptions and examples (from previous chapters) to be even more explicit. Review the raw LLM outputs if LangExtract provides access (e.g., in debug mode) to see what individual chunks yielded.
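The 10-20% rule of thumb from pitfall 2 can be captured in a tiny helper. This is purely illustrative; `suggested_overlap` is not a LangExtract function:

```python
def suggested_overlap(chunk_size: int, fraction: float = 0.15) -> int:
    """Suggest a chunk_overlap of roughly 10-20% of chunk_size (15% default)."""
    if not 0.10 <= fraction <= 0.20:
        raise ValueError("fraction outside the 10-20% rule of thumb")
    return max(1, round(chunk_size * fraction))

print(suggested_overlap(100))       # 15
print(suggested_overlap(150, 0.2))  # 30
```

Treat the output as a starting point, not a law: documents with long, boundary-spanning entities (legal clauses, multi-line addresses) often need overlap at the high end of that range or beyond.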

Remember, finding the optimal chunking strategy often involves experimentation, especially with highly domain-specific or unusually structured documents.

Summary

Congratulations! You’ve successfully navigated the complexities of processing long documents with LangExtract. Here are the key takeaways from this chapter:

  • LLMs have Context Window Limitations: They can only process a finite amount of text at once.
  • Chunking is the Solution: Breaking documents into smaller, manageable pieces allows LLMs to process entire long texts.
  • LangExtract Automates Intelligent Chunking: It goes beyond simple splitting, using sophisticated strategies and multi-pass processing to ensure accurate and complete extraction.
  • Control with chunk_size and chunk_overlap: These parameters give you fine-grained control over how documents are segmented, allowing you to optimize for specific document types and extraction goals.
  • Experimentation is Key: The best chunking strategy often requires testing different parameters to find the sweet spot for your specific use case.

You’re now equipped to handle virtually any document size, unlocking even more powerful data extraction possibilities. In the next chapter, we’ll shift our focus to refining our extraction process by exploring interactive visualization and advanced debugging techniques to ensure our extractions are not only complete but also correct.

