Welcome back, future data architects! In our previous chapters, we laid the groundwork for understanding LangExtract, setting up our environment, and performing basic extractions. You’ve seen how powerful Large Language Models (LLMs) can be when guided by a structured schema.

In this chapter, we’re going to put all that knowledge to the test with a practical, high-value project: extracting key information from legal contracts. Legal documents are notoriously complex, filled with jargon, and often lengthy, making them a perfect challenge for LangExtract’s capabilities. By the end of this chapter, you’ll have built a system to automatically pull out crucial details like parties involved, effective dates, and contract values from sample legal text. This isn’t just about coding; it’s about building confidence in tackling real-world, complex data extraction problems.

Extracting information from legal contracts requires both precision and robustness. A single missed detail or an incorrect interpretation can have significant consequences. LangExtract, with its schema-driven approach and LLM orchestration, is well-suited for this task.

Before we jump into code, let’s think about what kind of information is typically important in a contract:

  • Parties: Who are the entities entering into the agreement? (e.g., “The Company,” “The Client”)
  • Effective Date: When does the contract officially begin?
  • Contract Value: If applicable, what is the monetary amount associated with the agreement?
  • Governing Law: Which jurisdiction’s laws apply to the contract?
  • Specific Clauses: Details about termination, intellectual property, confidentiality, etc.

Our goal is to define a schema that captures these elements accurately.

The Power of Pydantic for Schema Definition

As we learned, LangExtract heavily leverages Pydantic for defining the structure of the data we want to extract. Pydantic allows us to define Python classes with type hints, which it then uses to validate data and, in LangExtract’s case, to instruct the LLM on the desired output format.

For legal documents, Pydantic’s ability to add descriptions to fields becomes incredibly valuable. These descriptions act as explicit instructions for the LLM, guiding it to extract precisely what we intend, even from ambiguous legal phrasing.
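
To make the role of field descriptions concrete, the sketch below prints the JSON Schema that Pydantic v2 generates for a single field. The ClauseExample class is a toy model invented for this illustration. How LangExtract turns a schema into a prompt is not shown in this chapter, so treat this only as a view of the metadata that a description contributes.

```python
import json
from typing import Optional

from pydantic import BaseModel, Field


class ClauseExample(BaseModel):
    """Toy schema, used only to show where a field description ends up."""

    governing_law: Optional[str] = Field(
        default=None,
        description="The jurisdiction whose laws govern the contract.",
    )


# Pydantic v2 exposes the generated schema as a plain dict.
schema = ClauseExample.model_json_schema()
field_schema = schema["properties"]["governing_law"]

# The description travels with the field, next to its type information.
print(json.dumps(field_schema, indent=2))
```

Because the description is part of the schema itself, rewording it is usually the cheapest way to change what a schema-driven tool asks for.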

The LangExtract Workflow for Complex Documents

Let’s visualize the process we’ll follow:

flowchart TD
    A[Legal Contract Text] -->|Input Raw Document| B{Define Pydantic Schema}
    B -->|Instruct LLM on Desired Output| C[LangExtract's LLM Orchestration]
    C -->|Extract Structured Data| D[Raw Extracted JSON]
    D -->|Validate against Schema| E{Pydantic Validation}
    E -->|Structured Python Object| F[Review & Refine]
    F -->|Output Final Data| G[Actionable Insights]

This diagram illustrates how our raw legal text passes through LangExtract, guided by our Pydantic schema, to produce validated, structured data. The “Review & Refine” step is particularly critical for legal use cases, where accuracy is paramount.

Let’s get our hands dirty and start building!

Step 1: Setting Up Your Environment (Quick Recap)

First, make sure langextract and pydantic are installed. To install them, or to upgrade to the latest stable releases (as of early 2026), run:

pip install --upgrade langextract "pydantic~=2.0"

Note: langextract is an actively developed library, so class and parameter names can change between releases. If a call in this chapter does not match the version you installed, the official GitHub repository is the authoritative reference. Pydantic 2.x is the current stable release line and is significantly faster than 1.x.

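Next, make sure your LLM provider’s API key is configured. For this example, we’ll assume you’re using a Google model (like Gemini Pro) and have your GOOGLE_API_KEY set as an environment variable. The LangExtract README documents LANGEXTRACT_API_KEY as the variable the library reads, so set that one too if authentication fails.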

# On Linux/macOS
export GOOGLE_API_KEY="YOUR_API_KEY_HERE"

# On Windows (Command Prompt): no quotes, or they become part of the value
set GOOGLE_API_KEY=YOUR_API_KEY_HERE

# On Windows (PowerShell)
$env:GOOGLE_API_KEY="YOUR_API_KEY_HERE"

Replace "YOUR_API_KEY_HERE" with your actual key.
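
A missing or malformed key usually only shows up later as an authentication error, so it can help to check the variable before making any API call. The helper below is a hypothetical convenience function, not part of LangExtract. It only inspects the environment and reports the first problem it finds.

```python
import os
from typing import Mapping, Optional


def api_key_problem(name: str = "GOOGLE_API_KEY",
                    env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Describe what is wrong with the key stored under `name`, or return None."""
    env = os.environ if env is None else env
    value = env.get(name, "")
    if not value.strip():
        return f"{name} is not set"
    if value == "YOUR_API_KEY_HERE":
        return f"{name} still holds the placeholder text"
    # Some shells store surrounding quotes as part of the value.
    if value[0] in "\"'" or value[-1] in "\"'":
        return f"{name} contains literal quote characters"
    return None


# Demonstrated with fake environments so the snippet runs without a real key.
print(api_key_problem(env={"GOOGLE_API_KEY": "abc123"}))
print(api_key_problem(env={"GOOGLE_API_KEY": '"abc123"'}))
```

Called with no arguments, the function reads the real environment, so a script can stop early with a readable message instead of a stack trace from the API client.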

Step 2: Defining the Contract Schema

Now, let’s define the Pydantic model that will guide our extraction. We’ll specify the types and add clear descriptions for each field.

Create a new Python file, say contract_extractor.py, and add the following:

# contract_extractor.py
from pydantic import BaseModel, Field
from typing import List, Optional

class LegalContractDetails(BaseModel):
    """
    Schema for extracting key details from a legal contract.
    """
    contract_id: str = Field(
        description="A unique identifier for the contract, typically a reference number or code."
    )
    parties: List[str] = Field(
        description="A list of the names of all parties involved in the contract."
    )
    effective_date: str = Field(
        description="The date when the contract officially comes into effect, in YYYY-MM-DD format if possible."
    )
    contract_value: Optional[str] = Field(
        default=None,
        description="The total monetary value or consideration specified in the contract, including currency."
    )
    governing_law: Optional[str] = Field(
        default=None,
        description="The jurisdiction whose laws govern the contract, e.g., 'State of California' or 'England and Wales'."
    )

Let’s break down this code:

  • from pydantic import BaseModel, Field: We import the necessary components from Pydantic. BaseModel is the base class for our schema, and Field allows us to add metadata like descriptions and default values.
  • from typing import List, Optional: These standard type hints declare that parties is a list of strings and that contract_value and governing_law may be None when the contract does not state them.
  • class LegalContractDetails(BaseModel):: This declares our Pydantic schema class.
  • contract_id: str = Field(...): This defines a required field contract_id of type str. The description parameter is crucial here, giving the LLM explicit instructions on what to look for.
  • parties: List[str] = Field(...): This defines a field parties that expects a list of strings.
  • effective_date: str = Field(...): Another required string field for the date. We explicitly ask for a YYYY-MM-DD format to help standardize the output.
  • contract_value: Optional[str] = Field(default=None, ...): Optional[str] allows the value to be None, and default=None is what makes the field non-required, so validation still passes when the contract states no value. (In Pydantic v2, Optional[str] without a default is still a required field.) We also provide a clear description for the LLM.
  • governing_law: Optional[str] = Field(default=None, ...): Similar to contract_value, this is an optional field with a clear description.
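
The difference between required and optional fields is easiest to see by validating two small JSON payloads directly with Pydantic, without any LLM involved. MiniContract below is a cut-down stand-in for LegalContractDetails, defined here only so the snippet is self-contained.

```python
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError


class MiniContract(BaseModel):
    """Cut-down version of the chapter's schema, for illustration only."""

    contract_id: str = Field(description="Reference number of the contract.")
    parties: List[str] = Field(description="Names of all parties.")
    contract_value: Optional[str] = Field(default=None, description="Total value.")


# contract_value is omitted: the default applies and validation passes.
ok = MiniContract.model_validate_json(
    '{"contract_id": "TI-GS-2025-001", "parties": ["Tech Innovations Inc."]}'
)
print(ok.contract_value)

# contract_id is omitted: a required field is missing, so validation fails.
try:
    MiniContract.model_validate_json('{"parties": ["Tech Innovations Inc."]}')
    missing = []
except ValidationError as exc:
    missing = [err["loc"][0] for err in exc.errors() if err["type"] == "missing"]
print(missing)
```

Marking a field as required is therefore a statement about the documents you expect: a contract without that detail will fail validation instead of producing a partial record.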

Step 3: Preparing the Sample Contract Text

Now, let’s create a simplified, simulated legal contract snippet. Remember, in a real scenario, this would be the content of a PDF, Word document, or a scanned image that has been OCR’d into text.

Add the following text to your contract_extractor.py file, after the schema definition:

# contract_extractor.py (continued)

sample_contract_text = """
CONTRACT AGREEMENT

This Contract Agreement ("Agreement") is made and entered into as of 2025-10-26 (the "Effective Date"),
by and between Tech Innovations Inc., a company registered in Delaware ("The Company"),
and Global Solutions LLC, a company registered in New York ("The Client").

WHEREAS, The Company desires to provide software development services to The Client, and The Client desires
to procure such services from The Company;

NOW, THEREFORE, in consideration of the mutual covenants and agreements hereinafter set forth, the parties hereto agree as follows:

1.  **Services.** The Company shall provide custom software development services as detailed in Schedule A.
2.  **Compensation.** The Client shall pay The Company a total sum of $150,000 (One Hundred Fifty Thousand US Dollars)
    for the services rendered under this Agreement. Payment terms are net 30 days.
3.  **Governing Law.** This Agreement shall be governed by and construed in accordance with the laws of the State of Delaware,
    without regard to its conflict of laws principles.
4.  **Contract ID.** The unique identifier for this agreement is TI-GS-2025-001.
"""

This text contains all the information our schema is looking for, presented in a typical legal document style.

Step 4: Performing the Extraction

Finally, let’s use langextract to extract the structured data from our sample contract text.

Add this to the end of your contract_extractor.py file:

# contract_extractor.py (continued)
import langextract as lx

if __name__ == "__main__":
    print("Attempting to extract legal contract details...")
    try:
        # Initialize LangExtract with your chosen LLM.
        # 'gemini-pro' is a good general-purpose model from Google.
        extractor = lx.Extractor(model_name="gemini-pro")

        # Perform the extraction
        result = extractor.extract(
            text=sample_contract_text,
            schema=LegalContractDetails,
            max_workers=1 # For short texts, 1 worker is sufficient
        )

        # Print the extracted data
        if result.parsed_object:
            print("\nExtraction Successful!")
            print(result.parsed_object.model_dump_json(indent=2)) # Use model_dump_json for Pydantic v2
        else:
            print("\nExtraction failed or returned no data.")
            if result.errors:
                print("Errors encountered:", result.errors)
            if result.raw_response:
                print("Raw LLM response (partial):", result.raw_response[:500]) # Print first 500 chars

    except Exception as e:
        print(f"\nAn error occurred during extraction: {e}")
        print("Please ensure your GOOGLE_API_KEY is set and valid, and you have network access.")

Explanation of the new code:

  • import langextract as lx: Imports the LangExtract library.
  • if __name__ == "__main__":: Ensures the extraction code runs only when the script is executed directly.
  • extractor = lx.Extractor(model_name="gemini-pro"): This creates an instance of the Extractor class. We specify model_name="gemini-pro" to use Google’s Gemini Pro model. LangExtract automatically uses your GOOGLE_API_KEY environment variable.
  • result = extractor.extract(...): This is the core function call.
    • text=sample_contract_text: Our input document.
    • schema=LegalContractDetails: The Pydantic schema we defined. LangExtract will instruct the LLM to output data conforming to this schema.
    • max_workers=1: For short texts, a single worker is fine. For very long documents, a higher value (e.g., 10) lets chunks be processed in parallel, which speeds up extraction at the cost of more concurrent API requests.
  • result.parsed_object.model_dump_json(indent=2): If the extraction is successful, result.parsed_object will contain an instance of our LegalContractDetails Pydantic model. model_dump_json() (for Pydantic v2) converts this object into a nicely formatted JSON string.
  • Error Handling: We include a try-except block to catch potential API errors or issues with the extraction process, providing helpful messages.
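
To give a feel for what processing chunks in parallel means, here is a minimal standard-library sketch of the general idea. It is not LangExtract's implementation: split_into_chunks and process_chunks are hypothetical helpers, and the built-in len stands in for an LLM call so the snippet runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def split_into_chunks(text: str, max_chars: int) -> List[str]:
    """Group paragraphs into chunks of at most max_chars characters where possible."""
    chunks: List[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow, so close the current chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def process_chunks(chunks: List[str], worker: Callable[[str], object],
                   max_workers: int = 1) -> List[object]:
    """Apply worker to every chunk concurrently, keeping results in chunk order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, chunks))


document = "Clause one.\n\nClause two is longer.\n\nClause three."
chunks = split_into_chunks(document, max_chars=30)
print(process_chunks(chunks, worker=len, max_workers=2))
```

With a real model behind worker, each chunk becomes one request, so raising max_workers shortens wall-clock time only until the provider's rate limits become the bottleneck.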

Now, run your script from the terminal:

python contract_extractor.py

You should see output similar to this (actual content may vary slightly due to LLM non-determinism):

Extraction Successful!
{
  "contract_id": "TI-GS-2025-001",
  "parties": [
    "Tech Innovations Inc.",
    "Global Solutions LLC"
  ],
  "effective_date": "2025-10-26",
  "contract_value": "$150,000 (One Hundred Fifty Thousand US Dollars)",
  "governing_law": "State of Delaware"
}

Congratulations! You’ve successfully extracted structured data from a simulated legal contract using LangExtract and Pydantic.
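
Because the schema stores effective_date and contract_value as strings, downstream code often needs typed values. The helpers below are a hedged post-processing sketch using only the standard library. The function names and the list of accepted date layouts are assumptions made for this example, and currency detection is deliberately left out.

```python
import re
from datetime import date, datetime
from decimal import Decimal
from typing import Optional


def parse_amount(value: str) -> Optional[Decimal]:
    """Return the first number in a string such as '$150,000 (One Hundred ...)'."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", value)
    if match is None:
        return None
    # Drop thousands separators before converting.
    return Decimal(match.group(0).replace(",", ""))


def parse_effective_date(value: str) -> Optional[date]:
    """Try a few common date layouts; return None if none of them fits."""
    for layout in ("%Y-%m-%d", "%B %d, %Y", "%d %B %Y"):
        try:
            return datetime.strptime(value.strip(), layout).date()
        except ValueError:
            continue
    return None


print(parse_amount("$150,000 (One Hundred Fifty Thousand US Dollars)"))
print(parse_effective_date("2025-10-26"))
print(parse_effective_date("October 26, 2025"))
```

Returning None for anything unrecognized keeps the original string as the source of truth and leaves unusual values for a human to review.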

Step 5: Reviewing and Refining with Interactive Visualization (Brief Mention)

For more complex or longer documents, simply printing the JSON isn’t enough. LangExtract offers powerful interactive visualization tools to help you review the extraction and debug issues. While beyond the scope of this simple example, remember that result.visualize() can be called to launch a local web interface where you can see:

  • The original text, with extracted entities highlighted.
  • Which chunks of text contributed to which extracted fields.
  • The raw LLM responses.

This tool is invaluable for understanding why an LLM extracted certain information or failed to extract others, allowing you to refine your schema or prompt instructions.
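
When a graphical view is not available, a plain-text approximation of the highlighting idea is still useful for a quick review. The sketch below is a hypothetical stand-in rather than LangExtract's visualizer: it marks the first verbatim occurrence of each extracted value with [[ ]] and skips any value it cannot find.

```python
from typing import List, Tuple


def find_spans(text: str, values: List[str]) -> List[Tuple[int, int]]:
    """Return sorted (start, end) offsets of the first occurrence of each value."""
    spans = []
    for value in values:
        if not value:
            continue
        start = text.find(value)
        if start != -1:
            spans.append((start, start + len(value)))
    return sorted(spans)


def highlight(text: str, values: List[str]) -> str:
    """Wrap each located value in [[ ]] markers, skipping overlapping spans."""
    pieces = []
    cursor = 0
    for start, end in find_spans(text, values):
        if start < cursor:
            continue  # overlaps a span that was already emitted
        pieces.append(text[cursor:start])
        pieces.append(f"[[{text[start:end]}]]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


snippet = "by and between Tech Innovations Inc. and Global Solutions LLC as of 2025-10-26"
print(highlight(snippet, ["Tech Innovations Inc.", "Global Solutions LLC", "2025-10-26"]))
```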

Mini-Challenge: Expanding Our Contract Schema

You’ve done a fantastic job with the initial extraction! Now, let’s make it a bit more complex.

Challenge: Imagine our legal team also needs to know the term of the contract – how long it’s valid for.

  1. Modify the LegalContractDetails schema: Add a new Optional[str] field called contract_term. Give it a clear description that explains what “contract term” means (e.g., “The duration for which the contract is valid, e.g., ‘1 year’ or ‘until December 31, 2026’”).
  2. Update the sample_contract_text: Add a new clause to the contract text that specifies a contract term, for example: “5. Term. This Agreement shall commence on the Effective Date and continue for a period of one (1) year.”
  3. Re-run the extraction: Observe if LangExtract successfully identifies and extracts the new contract_term.

Hint: Pay close attention to the description you provide for the contract_term field in your Pydantic schema. Clear instructions lead to better extraction!

Solution hint: Make sure your new clause in sample_contract_text clearly states the duration. For the schema, define contract_term: Optional[str] = Field(default=None, description="...").

Common Pitfalls & Troubleshooting

Working with LLMs for extraction, especially in sensitive domains like legal, can present a few challenges.

  1. Schema Mismatch or Missing Data:

    • Problem: The LLM returns an empty field, returns incorrect data, or fails to conform to your schema.
    • Solution:
      • Refine Field descriptions: Make your descriptions in the Pydantic schema as explicit and unambiguous as possible. Think about how you’d explain it to a human.
      • Check text quality: Is the information actually present in the input text? Is it clear enough for an LLM to understand?
      • Add examples (Advanced): For very tricky fields, you can sometimes include examples directly in the prompt or use LangExtract’s advanced features for few-shot prompting, though for simple cases, schema descriptions are usually sufficient.
      • Use result.visualize(): This is your best friend for debugging. It shows which parts of the source text fed each extracted field, alongside the raw LLM responses.
  2. API Key or Network Issues:

    • Problem: The script fails with connection errors or authentication failures.
    • Solution: Double-check that your GOOGLE_API_KEY (or equivalent for your chosen LLM) environment variable is correctly set and hasn’t expired. Ensure you have a stable internet connection.
  3. LLM Hallucinations or Inaccuracies (Critical for Legal):

    • Problem: The LLM confidently extracts information that is not present in the document, or extracts incorrect details. This is particularly dangerous in legal contexts.
    • Solution:
      • Human-in-the-Loop: For high-stakes applications, always involve human review of extracted legal data. LangExtract is an accelerator, not a fully autonomous legal agent.
      • Grounding (Advanced): LangExtract has features for “grounding,” which means tracing the extracted information back to its source in the original document. This helps verify accuracy.
      • Prompt Engineering: Experiment with your schema descriptions and potentially add overall instructions to the Extractor to emphasize factual accuracy and adherence to the document.
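
A simple first line of defense against hallucinated values is to check that each extracted string actually occurs in the source text. The sketch below is a minimal, library-independent version of that idea. The ungrounded_fields helper is hypothetical and does not use LangExtract's own grounding features. It compares text after collapsing whitespace and case, so line wraps in the contract do not cause false alarms.

```python
import re
from typing import Dict, List, Union

FieldValue = Union[str, List[str], None]


def normalize(text: str) -> str:
    """Collapse runs of whitespace and lowercase, so line wraps do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def ungrounded_fields(source: str, extracted: Dict[str, FieldValue]) -> List[str]:
    """List the fields whose value (or any list item) is absent from source."""
    haystack = normalize(source)
    flagged = []
    for name, value in extracted.items():
        if value is None:
            continue  # nothing was extracted, so nothing to verify
        items = value if isinstance(value, list) else [value]
        if any(normalize(item) not in haystack for item in items):
            flagged.append(name)
    return flagged


source = """This Agreement shall be governed by the laws of the State of
Delaware. The Client shall pay The Company a total sum of $150,000."""
extracted = {
    "governing_law": "State of Delaware",
    "contract_value": "$150,000",
    "contract_term": "one (1) year",  # not in the source, so it should be flagged
    "parties": ["The Client", "The Company"],
}
print(ungrounded_fields(source, extracted))
```

A flagged field is not necessarily wrong: a date that the model reformatted to YYYY-MM-DD will not match the contract's wording verbatim. Treat the result as a list of items for human review, not as an automatic rejection.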

Summary

In this chapter, you’ve taken a significant step forward, applying LangExtract to a real-world project: extracting structured information from legal contracts.

Here are the key takeaways:

  • Schema is King: A well-defined Pydantic schema with clear Field descriptions is crucial for precise extraction from complex documents.
  • Practical Application: LangExtract shines in high-value scenarios like legal document processing, turning unstructured text into actionable data.
  • Incremental Building: We built our solution step-by-step, from schema definition to execution, explaining each part.
  • Debugging Tools: result.visualize() is an essential tool for understanding and refining extraction results, especially for complex texts.
  • Accuracy is Paramount: For legal data, always prioritize accuracy, leveraging human review and advanced grounding techniques when necessary.

You’re now equipped to tackle more intricate extraction tasks. In the next chapter, we’ll delve deeper into handling very long documents, exploring advanced chunking strategies and multi-pass extraction to maintain accuracy and efficiency.

References

  • LangExtract GitHub Repository: The official source for the library, including documentation and examples. https://github.com/google/langextract
  • Pydantic Documentation (v2): Comprehensive guide to defining data schemas in Python. https://docs.pydantic.dev/latest/
  • Google AI Studio Documentation: Information on obtaining API keys and using Google’s Gemini models. https://ai.google.dev/
  • Towards Data Science - Extracting Structured Data with LangExtract: An article discussing LangExtract’s workflow and capabilities. https://towardsdatascience.com/extracting-structured-data-with-langextract-a-deep-dive-into-llm-orchestrated-workflows/

This page is AI-assisted and reviewed. It references official documentation and recognized resources where relevant.