Welcome back, future data alchemists! In the previous chapter, we got LangExtract up and running and connected to our chosen Large Language Model (LLM) provider. That’s a huge step! Now, it’s time to get down to the real magic: telling LangExtract exactly what kind of information we want to pull out of unstructured text.
This chapter is all about defining your “extraction task” and creating a “schema” – essentially, a blueprint for the structured data you expect to receive. This is arguably the most crucial part of using LangExtract effectively. Without a clear schema, an LLM might give you inconsistent, incomplete, or even hallucinated results. With a well-defined schema, you guide the LLM to focus its powerful understanding on precisely what you need, making your extractions reliable and robust.
Ready to sculpt your data? Let’s dive in!
What is an Extraction Schema? Why Does it Matter?
Imagine you’re trying to build a robot that can read a recipe and tell you the ingredients, cooking time, and serving size. If you just tell the robot, “Read this recipe and tell me stuff,” it might give you a poetic description of the dish, or perhaps just a list of steps without distinguishing ingredients.
An extraction schema is like giving your robot a detailed form to fill out:
- Dish Name: [Text Field]
- Ingredients: [List of Text Fields]
- Cooking Time: [Number] [Unit: minutes/hours]
- Serving Size: [Number] [Unit: people/portions]
This “form” is your schema. It tells LangExtract (and by extension, the underlying LLM) the exact structure, data types, and names for each piece of information you want to extract.
Why is this so important for LLMs?
- Precision: It forces the LLM to be precise. Instead of vague answers, it must fit its output into your predefined slots.
- Consistency: Every extraction, whether from a product review or a legal document, will follow the same output structure, making it easy to process programmatically.
- Validation: The schema often comes with built-in validation (like ensuring a rating is a number), catching errors early.
- Reduced Hallucination: By giving the LLM a clear target, you significantly reduce its tendency to invent information that isn’t present in the text.
LangExtract leverages Pydantic models for defining these schemas. If you’re new to Pydantic, don’t worry – we’ll cover the essentials. It’s a fantastic Python library for data validation and settings management using Python type annotations.
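To get a feel for what Pydantic brings to the table on its own, here is a minimal sketch (model and field names are illustrative, not from LangExtract) showing how a `BaseModel` both coerces compatible values and rejects incompatible ones:

```python
from pydantic import BaseModel, ValidationError

class Recipe(BaseModel):
    dish_name: str
    cooking_time_minutes: int

# Pydantic coerces compatible values: the string "45" becomes the int 45.
recipe = Recipe(dish_name="Pancakes", cooking_time_minutes="45")
print(recipe.cooking_time_minutes)  # 45

# Incompatible values are rejected with a clear error instead of passing silently.
try:
    Recipe(dish_name="Pancakes", cooking_time_minutes="a while")
except ValidationError:
    print("caught a value that is not an integer")
```

This validate-or-fail behavior is exactly what makes Pydantic a good fit for checking LLM output.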
In short, LangExtract acts as the bridge: it takes your raw text and, with the help of your schema blueprint, transforms it into neatly structured data.
Defining Your First Schema with Pydantic
Our goal is to extract information from a simple piece of text. Let’s start with a classic example: a product review. We want to extract the reviewer’s name, their rating, and the actual review comment.
First, ensure you have Pydantic installed. It’s usually a dependency of langextract, but it’s good to confirm:
```bash
pip install "pydantic>=2.0.0"
```
Now, let’s define our schema.
Step 3.1: Importing Pydantic and BaseModel
In your Python script (or interactive session), you’ll typically start by importing BaseModel from pydantic. This is the fundamental class that all your data models will inherit from.
```python
# In your main_extraction.py file or similar
from pydantic import BaseModel, Field  # We'll use Field a bit later
```

- `BaseModel`: This is the core class from Pydantic. When you create a class that inherits from `BaseModel`, Pydantic automatically adds powerful data validation and parsing capabilities.
- `Field`: This is a Pydantic utility function you can use to add extra information, validation, or default values to your model fields. We'll explore it soon!
Step 3.2: Creating Your First Pydantic Model
Let’s define a schema for a ProductReview. We’ll need a reviewer’s name (a string), a rating (an integer), and the comment (also a string).
```python
# Add this below your imports
class ProductReview(BaseModel):
    reviewer_name: str
    rating: int
    comment: str
```

- `class ProductReview(BaseModel):`: We're defining a new class `ProductReview` that inherits from `BaseModel`. This tells Pydantic to treat this class as a data schema.
- `reviewer_name: str`: This defines a field named `reviewer_name` that Pydantic expects to be a string (`str`).
- `rating: int`: A field `rating` that Pydantic expects to be an integer (`int`).
- `comment: str`: A field `comment` that Pydantic expects to be a string (`str`).
It’s that simple! Pydantic uses Python’s native type hints to define the structure and types of your data.
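You can see the model in action before any LLM is involved by instantiating it directly (this sketch assumes Pydantic v2, whose `model_dump()` method returns a plain dict):

```python
from pydantic import BaseModel

class ProductReview(BaseModel):
    reviewer_name: str
    rating: int
    comment: str

review = ProductReview(
    reviewer_name="Sarah L.",
    rating=5,
    comment="Absolutely love this new coffee maker!",
)
# model_dump() converts the validated model back into a plain dict.
print(review.model_dump())
# {'reviewer_name': 'Sarah L.', 'rating': 5, 'comment': 'Absolutely love this new coffee maker!'}
```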
Step 3.3: Adding Descriptions for Better LLM Guidance
While the types are clear to Pydantic, LLMs benefit immensely from human-readable descriptions. You can add these as docstrings to your class or, more granularly, using the Field function. Let’s use Field for clarity and better control.
```python
# Modify your ProductReview class
from pydantic import BaseModel, Field

class ProductReview(BaseModel):
    reviewer_name: str = Field(description="The name of the person who wrote the review.")
    rating: int = Field(description="The star rating given by the reviewer, on a scale of 1 to 5.")
    comment: str = Field(description="The actual text content of the review.")
```

- `Field(description="...")`: By assigning `Field` to a model attribute, you can provide a `description` argument. LangExtract passes these descriptions to the LLM, giving it more context about what each field represents. This significantly improves extraction accuracy.
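You can see where those descriptions end up by inspecting the JSON Schema that Pydantic v2 generates via `model_json_schema()`; a machine-readable summary like this is the kind of thing a tool can relay to an LLM (how LangExtract does so internally is beyond this sketch):

```python
from pydantic import BaseModel, Field

class ProductReview(BaseModel):
    reviewer_name: str = Field(description="The name of the person who wrote the review.")
    rating: int = Field(description="The star rating given by the reviewer, on a scale of 1 to 5.")
    comment: str = Field(description="The actual text content of the review.")

# Each field's description is embedded in the generated JSON Schema.
schema = ProductReview.model_json_schema()
print(schema["properties"]["rating"]["description"])
# The star rating given by the reviewer, on a scale of 1 to 5.
```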
Step 3.4: Using Your Schema with LangExtract
Now, let’s put it all together. We’ll use the LangExtract instance we set up in the previous chapter and our new ProductReview schema.
```python
# Assuming you have your LLM client set up from Chapter 2
# For example, if you're using OpenAI:
import os

import langextract as lx
from pydantic import BaseModel, Field

# --- Your Pydantic Schema ---
class ProductReview(BaseModel):
    reviewer_name: str = Field(description="The name of the person who wrote the review.")
    rating: int = Field(description="The star rating given by the reviewer, on a scale of 1 to 5.")
    comment: str = Field(description="The actual text content of the review.")

# --- Text to Extract From ---
review_text = """
"Absolutely love this new coffee maker! It brews quickly and the coffee tastes fantastic.
I'd give it a solid 5 stars. Highly recommend. - Sarah L."
"""

# --- Initialize LangExtract (using your preferred provider from Chapter 2) ---
# Example for OpenAI:
# You'd typically set OPENAI_API_KEY as an environment variable
# os.environ["OPENAI_API_KEY"] = "sk-..."  # DON'T hardcode in production!
extractor = lx.LangExtract(llm_provider="openai", model_name="gpt-3.5-turbo")  # Or your chosen model

# --- Perform the Extraction ---
print("Extracting review details...")
extracted_data = extractor.extract(text_or_document=review_text, schema=ProductReview)

# --- Print the Results ---
if extracted_data:
    print("\nExtraction Successful!")
    print(f"Reviewer Name: {extracted_data.reviewer_name}")
    print(f"Rating: {extracted_data.rating} stars")
    print(f"Comment: {extracted_data.comment}")
    print(f"Type of extracted_data: {type(extracted_data)}")
else:
    print("\nExtraction failed or returned no data.")
```

- `review_text`: This is our unstructured input.
- `extractor = lx.LangExtract(...)`: We re-initialize our LangExtract instance, specifying the LLM provider and model. Remember to configure your API key securely, preferably via environment variables.
- `extracted_data = extractor.extract(text_or_document=review_text, schema=ProductReview)`: This is the core call!
  - `text_or_document`: The text we want to process.
  - `schema`: The Pydantic model (`ProductReview`) we just defined. LangExtract uses this to instruct the LLM on the desired output format.
- The output `extracted_data` will be an instance of our `ProductReview` Pydantic model, making it easy to access the extracted fields using dot notation (`extracted_data.reviewer_name`).
Go ahead and run this code! You should see the LLM parse the review text and return a ProductReview object with Sarah L., 5, and the comment correctly extracted.
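Conceptually, the final step of that pipeline is just Pydantic validating the LLM's JSON reply against your model. You can simulate that step yourself, no API key required, with a hand-written reply (a standalone sketch, not LangExtract internals):

```python
from pydantic import BaseModel, Field

class ProductReview(BaseModel):
    reviewer_name: str = Field(description="The name of the person who wrote the review.")
    rating: int = Field(description="The star rating given by the reviewer, on a scale of 1 to 5.")
    comment: str = Field(description="The actual text content of the review.")

# Pretend this JSON string came back from the LLM.
llm_reply = '{"reviewer_name": "Sarah L.", "rating": 5, "comment": "Highly recommend."}'

# model_validate_json (Pydantic v2) parses and validates in one step;
# malformed or mistyped replies would raise a ValidationError here.
extracted = ProductReview.model_validate_json(llm_reply)
print(extracted.reviewer_name, extracted.rating)  # Sarah L. 5
```

If the reply is missing a required field or has the wrong type, validation fails loudly, which is exactly the safety net you want around LLM output.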
Step 3.5: Handling Optional Fields and Nested Structures
Not all information is always present in every document, and sometimes you need more complex data structures. Pydantic handles this elegantly.
Let’s say some reviews might include a purchase_date, but it’s not always there. Also, we want to separate the reviewer_name into first_name and last_name.
```python
# Add this new schema below your existing one
from typing import Optional  # For optional fields

class ReviewerInfo(BaseModel):
    first_name: str = Field(description="The first name of the reviewer.")
    last_name: str = Field(description="The last name of the reviewer.")

class DetailedProductReview(BaseModel):
    reviewer: ReviewerInfo = Field(description="Information about the reviewer.")
    rating: int = Field(description="The star rating given by the reviewer, on a scale of 1 to 5.")
    comment: str = Field(description="The actual text content of the review.")
    purchase_date: Optional[str] = Field(None, description="The date the product was purchased, if mentioned.")
```

- `from typing import Optional`: We import `Optional` from Python's `typing` module.
- `ReviewerInfo(BaseModel)`: We've created a nested Pydantic model. This allows us to group related fields logically.
- `reviewer: ReviewerInfo`: The `DetailedProductReview` now includes a field `reviewer` which itself is an instance of `ReviewerInfo`.
- `purchase_date: Optional[str] = Field(None, ...)`: This declares `purchase_date` as an `Optional` string. If the LLM cannot find a purchase date in the text, it will return `None` for this field instead of raising an error. The `Field(None, ...)` explicitly sets the default value to `None`.
Let’s test this with a new text:
```python
# --- New Text to Extract From ---
detailed_review_text = """
"This blender is a game-changer! Super powerful and surprisingly quiet.
I bought it on 2025-11-15 and it's been perfect ever since. I'm giving it a 4-star rating because
the lid is a bit tricky to clean. - John Doe"
"""

print("\nExtracting detailed review details...")
extracted_detailed_data = extractor.extract(text_or_document=detailed_review_text, schema=DetailedProductReview)

if extracted_detailed_data:
    print("\nDetailed Extraction Successful!")
    print(f"Reviewer First Name: {extracted_detailed_data.reviewer.first_name}")
    print(f"Reviewer Last Name: {extracted_detailed_data.reviewer.last_name}")
    print(f"Rating: {extracted_detailed_data.rating} stars")
    print(f"Comment: {extracted_detailed_data.comment}")
    print(f"Purchase Date: {extracted_detailed_data.purchase_date}")
else:
    print("\nDetailed extraction failed or returned no data.")
```
Notice how we access the nested fields: extracted_detailed_data.reviewer.first_name. This mirrors the structure of your Pydantic model, making the extracted data highly intuitive to work with.
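One thing the example above doesn't show is what happens when the optional field is genuinely absent. Because `purchase_date` defaults to `None`, the model validates cleanly without it. Here is a standalone sketch (trimmed to the relevant fields; the `SlimReview` name is illustrative):

```python
from typing import Optional
from pydantic import BaseModel, Field

class ReviewerInfo(BaseModel):
    first_name: str
    last_name: str

class SlimReview(BaseModel):
    reviewer: ReviewerInfo
    rating: int
    purchase_date: Optional[str] = Field(None, description="The purchase date, if mentioned.")

# No purchase_date supplied: validation still succeeds and the field is None.
# Pydantic also coerces the plain dict into a ReviewerInfo instance for us.
review = SlimReview(reviewer={"first_name": "John", "last_name": "Doe"}, rating=4)
print(review.reviewer.last_name, review.purchase_date)  # Doe None
```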
Mini-Challenge: Extracting Event Details
You’ve learned the basics of defining schemas for single extractions. Now, it’s your turn!
Challenge: Imagine you have a short event announcement. Your goal is to extract the event’s title, date, time, location, and an organizer name. The organizer might not always be present, so make sure your schema can handle that.
Here’s the text:
"Annual Tech Summit 2026 - Join us for a day of innovation!
Date: February 10, 2026
Time: 9:00 AM - 5:00 PM PST
Venue: Virtual Event Platform
Organized by: FutureTech Inc."
Your Task:
- Define a Pydantic `BaseModel` called `EventDetails`.
- Include fields for `title` (str), `date` (str), `time` (str), `location` (str), and `organizer` (Optional[str]).
- Add clear `Field` descriptions for each attribute.
- Use your `extractor` instance to extract the information from the provided `event_announcement_text`.
- Print the extracted details in a readable format.
Hint: Remember to import `Optional` from `typing` for the `organizer` field. For the `purchase_date` field in the previous example, we used `Field(None, description="...")`. This is a robust way to handle optional fields in Pydantic V2.
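If you want to sanity-check your schema before wiring it up to the extractor, here is one possible `EventDetails` definition (a reference answer for the schema part only; the field descriptions are just suggestions):

```python
from typing import Optional
from pydantic import BaseModel, Field

class EventDetails(BaseModel):
    title: str = Field(description="The name of the event.")
    date: str = Field(description="The exact date of the event, including year.")
    time: str = Field(description="The start and end times, including timezone if given.")
    location: str = Field(description="The venue or platform where the event takes place.")
    organizer: Optional[str] = Field(None, description="The organizing company or person, if mentioned.")

# organizer has a default, so it is absent from the schema's required list.
print(EventDetails.model_json_schema()["required"])
```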
Common Pitfalls & Troubleshooting
Even with clear schemas, you might encounter issues. Here are a few common pitfalls:
- Schema Too Vague or Ambiguous:
  - Problem: If your field descriptions are unclear (e.g., `data: str` without further context), the LLM might struggle to interpret what you want.
  - Solution: Always provide precise and unambiguous `description` arguments to `Field`. Explain what the field represents and how it should be extracted from the text. For example, instead of "Date", use "The exact date of the event, including year."
- Missing `Optional` for Non-Guaranteed Fields:
  - Problem: If a field is defined as `field_name: str`, but the information isn't present in the text, the LLM might hallucinate a value, or LangExtract might raise a validation error because it couldn't find a non-optional field.
  - Solution: Use `Optional[str]` (or `Optional[int]`, etc.) for any piece of information that might not always be present in the source text. Remember `Field(None, description="...")` for a default of `None` plus a description.
- Incorrect Pydantic Types:
  - Problem: Expecting an `int` when the text might contain "ten" or "three point five," or expecting a `str` when you actually want a structured `List[str]`.
  - Solution: Choose your Pydantic types carefully. If you expect a list of items, define `List[str]`. If a number might be fractional, use `float` instead of `int`. If the LLM returns "five" and you expect an `int`, LangExtract/Pydantic will try to parse it, but it's best to guide the LLM with clear descriptions if you expect numerical representations.
- LLM Hallucinating Data:
  - Problem: The LLM might invent information that wasn't in the original text to fill a schema field.
  - Solution: While `Field` descriptions help, LangExtract's core design (especially with features like source grounding, which we'll explore later) aims to mitigate this. For now, ensure your descriptions emphasize extraction from the text rather than inference. Review the output carefully. If hallucination is persistent, try simplifying your schema or providing more explicit negative examples in advanced prompting (also a topic for later).
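The type pitfalls are easy to reproduce in isolation. Here is a quick sketch of Pydantic's coercion behavior with numbers and lists (model and field names are illustrative):

```python
from typing import List
from pydantic import BaseModel, ValidationError

class Ingredients(BaseModel):
    items: List[str]
    total_weight_grams: float  # use float, not int, when values may be fractional

# Numeric strings are coerced: "350.5" becomes the float 350.5.
ok = Ingredients(items=["flour", "sugar"], total_weight_grams="350.5")
print(ok.total_weight_grams)  # 350.5

# But a comma-separated string is NOT silently split into a list.
try:
    Ingredients(items="flour, sugar", total_weight_grams=350.5)
except ValidationError:
    print("a plain string does not satisfy List[str]")
```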
Summary
You’ve made incredible progress in this chapter! Here’s a quick recap of what we covered:
- Understanding Schemas: We learned that extraction schemas are blueprints for the structured data we want to extract, guiding the LLM for precision and consistency.
- Pydantic for Schema Definition: LangExtract uses Pydantic models to define these schemas, leveraging Python’s type hints.
- Basic Schema Creation: You created your first `ProductReview` Pydantic model with `str` and `int` fields.
- Enhancing with `Field`: We saw how `Field(description="...")` significantly improves LLM accuracy by providing rich context.
- Handling Complexity: You learned to use `Optional` for fields that might not always be present and to create nested Pydantic models for more complex data structures.
- Practical Application: You successfully used your schema with LangExtract's `extractor.extract()` method.
- Troubleshooting: We discussed common pitfalls like vague schemas, missing `Optional` types, incorrect Pydantic types, and LLM hallucination.
In the next chapter, we’ll expand on these concepts, exploring more advanced schema features, handling lists of extractions, and diving into how LangExtract processes longer documents through chunking and multi-pass extraction. Get ready to tackle even bigger data challenges!
References
- LangExtract GitHub Repository
- Pydantic V2 Documentation
- Pydantic Field documentation
- Python `typing` module (for `Optional`)
This page is AI-assisted and reviewed. It references official documentation and recognized resources where relevant.