Welcome back, future data alchemists! In the previous chapter, we got LangExtract up and running and connected to our chosen Large Language Model (LLM) provider. That’s a huge step! Now, it’s time to get down to the real magic: telling LangExtract exactly what kind of information we want to pull out of unstructured text.
This chapter is all about defining your “extraction task” and creating a “schema” – essentially, a blueprint for the structured data you expect to receive. This is arguably the most crucial part of using LangExtract effectively. Without a clear schema, an LLM might give you inconsistent, incomplete, or even hallucinated results. With a well-defined schema, you guide the LLM to focus its powerful understanding on precisely what you need, making your extractions reliable and robust.
Ready to sculpt your data? Let’s dive in!
What is an Extraction Schema? Why Does it Matter?
Imagine you’re trying to build a robot that can read a recipe and tell you the ingredients, cooking time, and serving size. If you just tell the robot, “Read this recipe and tell me stuff,” it might give you a poetic description of the dish, or perhaps just a list of steps without distinguishing ingredients.
An extraction schema is like giving your robot a detailed form to fill out:
- Dish Name: [Text Field]
- Ingredients: [List of Text Fields]
- Cooking Time: [Number] [Unit: minutes/hours]
- Serving Size: [Number] [Unit: people/portions]
This “form” is your schema. It tells LangExtract (and by extension, the underlying LLM) the exact structure, data types, and names for each piece of information you want to extract.
Why is this so important for LLMs?
- Precision: It forces the LLM to be precise. Instead of vague answers, it must fit its output into your predefined slots.
- Consistency: Every extraction, whether from a product review or a legal document, will follow the same output structure, making it easy to process programmatically.
- Validation: The schema often comes with built-in validation (like ensuring a rating is a number), catching errors early.
- Reduced Hallucination: By giving the LLM a clear target, you significantly reduce its tendency to invent information that isn’t present in the text.
LangExtract leverages Pydantic models for defining these schemas. If you’re new to Pydantic, don’t worry – we’ll cover the essentials. It’s a fantastic Python library for data validation and settings management using Python type annotations.
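To get a feel for what Pydantic brings to the table on its own, here is a minimal sketch (model and field names are illustrative, not from LangExtract) showing how a `BaseModel` both coerces compatible values and rejects incompatible ones:

```python
from pydantic import BaseModel, ValidationError

class Recipe(BaseModel):
    dish_name: str
    cooking_time_minutes: int

# Pydantic coerces compatible values: the string "45" becomes the int 45.
recipe = Recipe(dish_name="Pancakes", cooking_time_minutes="45")
print(recipe.cooking_time_minutes)  # 45

# Incompatible values are rejected with a clear error instead of passing silently.
try:
    Recipe(dish_name="Pancakes", cooking_time_minutes="a while")
except ValidationError:
    print("caught a value that is not an integer")
```

This validate-or-fail behavior is exactly what makes Pydantic a good fit for checking LLM output.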
In short, LangExtract acts as the bridge: it takes your raw text and, with the help of your schema blueprint, transforms it into neatly structured data.
Defining Your First Schema with Pydantic
Our goal is to extract information from a simple piece of text. Let’s start with a classic example: a product review. We want to extract the reviewer’s name, their rating, and the actual review comment.
First, ensure you have Pydantic installed. It’s usually a dependency of langextract, but it’s good to confirm:
```bash
pip install "pydantic>=2.0.0"
```
Now, let’s define our schema.
Step 3.1: Importing Pydantic and BaseModel
In your Python script (or interactive session), you’ll typically start by importing BaseModel from pydantic. This is the fundamental class that all your data models will inherit from.
```python
# In your main_extraction.py file or similar
from pydantic import BaseModel, Field  # We'll use Field a bit later
```

- `BaseModel`: This is the core class from Pydantic. When you create a class that inherits from `BaseModel`, Pydantic automatically adds powerful data validation and parsing capabilities.
- `Field`: This is a Pydantic utility function you can use to add extra information, validation, or default values to your model fields. We'll explore it soon!
Step 3.2: Creating Your First Pydantic Model
Let’s define a schema for a ProductReview. We’ll need a reviewer’s name (a string), a rating (an integer), and the comment (also a string).
```python
# Add this below your imports
class ProductReview(BaseModel):
    reviewer_name: str
    rating: int
    comment: str
```

- `class ProductReview(BaseModel):`: We're defining a new class `ProductReview` that inherits from `BaseModel`. This tells Pydantic to treat this class as a data schema.
- `reviewer_name: str`: This defines a field named `reviewer_name` that Pydantic expects to be a string (`str`).
- `rating: int`: A field `rating` that Pydantic expects to be an integer (`int`).
- `comment: str`: A field `comment` that Pydantic expects to be a string (`str`).
It’s that simple! Pydantic uses Python’s native type hints to define the structure and types of your data.
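You can see the model in action before any LLM is involved by instantiating it directly (this sketch assumes Pydantic v2, whose `model_dump()` method returns a plain dict):

```python
from pydantic import BaseModel

class ProductReview(BaseModel):
    reviewer_name: str
    rating: int
    comment: str

review = ProductReview(
    reviewer_name="Sarah L.",
    rating=5,
    comment="Absolutely love this new coffee maker!",
)
# model_dump() converts the validated model back into a plain dict.
print(review.model_dump())
# {'reviewer_name': 'Sarah L.', 'rating': 5, 'comment': 'Absolutely love this new coffee maker!'}
```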
Step 3.3: Adding Descriptions for Better LLM Guidance
While the types are clear to Pydantic, LLMs benefit immensely from human-readable descriptions. You can add these as docstrings to your class or, more granularly, using the Field function. Let’s use Field for clarity and better control.
```python
# Modify your ProductReview class
from pydantic import BaseModel, Field

class ProductReview(BaseModel):
    reviewer_name: str = Field(description="The name of the person who wrote the review.")
    rating: int = Field(description="The star rating given by the reviewer, on a scale of 1 to 5.")
    comment: str = Field(description="The actual text content of the review.")
```

- `Field(description="...")`: By assigning `Field` to a model attribute, you can provide a `description` argument. LangExtract passes these descriptions to the LLM, giving it more context about what each field represents. This significantly improves extraction accuracy.
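You can see where those descriptions end up by inspecting the JSON Schema that Pydantic v2 generates via `model_json_schema()`; a machine-readable summary like this is the kind of thing a tool can relay to an LLM (how LangExtract does so internally is beyond this sketch):

```python
from pydantic import BaseModel, Field

class ProductReview(BaseModel):
    reviewer_name: str = Field(description="The name of the person who wrote the review.")
    rating: int = Field(description="The star rating given by the reviewer, on a scale of 1 to 5.")
    comment: str = Field(description="The actual text content of the review.")

# Each field's description is embedded in the generated JSON Schema.
schema = ProductReview.model_json_schema()
print(schema["properties"]["rating"]["description"])
# The star rating given by the reviewer, on a scale of 1 to 5.
```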
Step 3.4: Using Your Schema with LangExtract
Now, let’s put it all together. We’ll use the LangExtract instance we set up in the previous chapter and our new ProductReview schema.
```python
# Assuming you have your LLM client set up from Chapter 2
# For example, if you're using OpenAI:
import os

import langextract as lx
from pydantic import BaseModel, Field

# --- Your Pydantic Schema ---
class ProductReview(BaseModel):
    reviewer_name: str = Field(description="The name of the person who wrote the review.")
    rating: int = Field(description="The star rating given by the reviewer, on a scale of 1 to 5.")
    comment: str = Field(description="The actual text content of the review.")

# --- Text to Extract From ---
review_text = """
"Absolutely love this new coffee maker! It brews quickly and the coffee tastes fantastic.
I'd give it a solid 5 stars. Highly recommend. - Sarah L."
"""

# --- Initialize LangExtract (using your preferred provider from Chapter 2) ---
# Example for OpenAI:
# You'd typically set OPENAI_API_KEY as an environment variable
# os.environ["OPENAI_API_KEY"] = "sk-..."  # DON'T hardcode in production!
extractor = lx.LangExtract(llm_provider="openai", model_name="gpt-3.5-turbo")  # Or your chosen model

# --- Perform the Extraction ---
print("Extracting review details...")
extracted_data = extractor.extract(text_or_document=review_text, schema=ProductReview)

# --- Print the Results ---
if extracted_data:
    print("\nExtraction Successful!")
    print(f"Reviewer Name: {extracted_data.reviewer_name}")
    print(f"Rating: {extracted_data.rating} stars")
    print(f"Comment: {extracted_data.comment}")
    print(f"Type of extracted_data: {type(extracted_data)}")
else:
    print("\nExtraction failed or returned no data.")
```

- `review_text`: This is our unstructured input.
- `extractor = lx.LangExtract(...)`: We re-initialize our LangExtract instance, specifying the LLM provider and model. Remember to configure your API key securely, preferably via environment variables.
- `extracted_data = extractor.extract(text_or_document=review_text, schema=ProductReview)`: This is the core call!
  - `text_or_document`: The text we want to process.
  - `schema`: The Pydantic model (`ProductReview`) we just defined. LangExtract uses this to instruct the LLM on the desired output format.
- The output `extracted_data` will be an instance of our `ProductReview` Pydantic model, making it easy to access the extracted fields using dot notation (`extracted_data.reviewer_name`).
Go ahead and run this code! You should see the LLM parse the review text and return a ProductReview object with Sarah L., 5, and the comment correctly extracted.
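Conceptually, the final step of that pipeline is just Pydantic validating the LLM's JSON reply against your model. You can simulate that step yourself, no API key required, with a hand-written reply (a standalone sketch, not LangExtract internals):

```python
from pydantic import BaseModel, Field

class ProductReview(BaseModel):
    reviewer_name: str = Field(description="The name of the person who wrote the review.")
    rating: int = Field(description="The star rating given by the reviewer, on a scale of 1 to 5.")
    comment: str = Field(description="The actual text content of the review.")

# Pretend this JSON string came back from the LLM.
llm_reply = '{"reviewer_name": "Sarah L.", "rating": 5, "comment": "Highly recommend."}'

# model_validate_json (Pydantic v2) parses and validates in one step;
# malformed or mistyped replies would raise a ValidationError here.
extracted = ProductReview.model_validate_json(llm_reply)
print(extracted.reviewer_name, extracted.rating)  # Sarah L. 5
```

If the reply is missing a required field or has the wrong type, validation fails loudly, which is exactly the safety net you want around LLM output.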
Step 3.5: Handling Optional Fields and Nested Structures
Not all information is always present in every document, and sometimes you need more complex data structures. Pydantic handles this elegantly.
Let’s say some reviews might include a purchase_date, but it’s not always there. Also, we want to separate the reviewer_name into first_name and last_name.
```python
# Add this new schema below your existing one
from typing import Optional  # For optional fields

class ReviewerInfo(BaseModel):
    first_name: str = Field(description="The first name of the reviewer.")
    last_name: str = Field(description="The last name of the reviewer.")

class DetailedProductReview(BaseModel):
    reviewer: ReviewerInfo = Field(description="Information about the reviewer.")
    rating: int = Field(description="The star rating given by the reviewer, on a scale of 1 to 5.")
    comment: str = Field(description="The actual text content of the review.")
    purchase_date: Optional[str] = Field(None, description="The date the product was purchased, if mentioned.")
```

- `from typing import Optional`: We import `Optional` from Python's `typing` module.
- `ReviewerInfo(BaseModel)`: We've created a nested Pydantic model. This allows us to group related fields logically.
- `reviewer: ReviewerInfo`: The `DetailedProductReview` now includes a field `reviewer` which itself is an instance of `ReviewerInfo`.
- `purchase_date: Optional[str] = Field(None, ...)`: This declares `purchase_date` as an `Optional` string. If the LLM cannot find a purchase date in the text, it will return `None` for this field instead of raising an error. The `Field(None, ...)` explicitly sets the default value to `None`.
Let’s test this with a new text:
```python
# --- New Text to Extract From ---
detailed_review_text = """
"This blender is a game-changer! Super powerful and surprisingly quiet.
I bought it on 2025-11-15 and it's been perfect ever since. I'm giving it a 4-star rating because
the lid is a bit tricky to clean. - John Doe"
"""

print("\nExtracting detailed review details...")
extracted_detailed_data = extractor.extract(text_or_document=detailed_review_text, schema=DetailedProductReview)

if extracted_detailed_data:
    print("\nDetailed Extraction Successful!")
    print(f"Reviewer First Name: {extracted_detailed_data.reviewer.first_name}")
    print(f"Reviewer Last Name: {extracted_detailed_data.reviewer.last_name}")
    print(f"Rating: {extracted_detailed_data.rating} stars")
    print(f"Comment: {extracted_detailed_data.comment}")
    print(f"Purchase Date: {extracted_detailed_data.purchase_date}")
else:
    print("\nDetailed extraction failed or returned no data.")
```
Notice how we access the nested fields: extracted_detailed_data.reviewer.first_name. This mirrors the structure of your Pydantic model, making the extracted data highly intuitive to work with.
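One thing the example above doesn't show is what happens when the optional field is genuinely absent. Because `purchase_date` defaults to `None`, the model validates cleanly without it. Here is a standalone sketch (trimmed to the relevant fields; the `SlimReview` name is illustrative):

```python
from typing import Optional
from pydantic import BaseModel, Field

class ReviewerInfo(BaseModel):
    first_name: str
    last_name: str

class SlimReview(BaseModel):
    reviewer: ReviewerInfo
    rating: int
    purchase_date: Optional[str] = Field(None, description="The purchase date, if mentioned.")

# No purchase_date supplied: validation still succeeds and the field is None.
# Pydantic also coerces the plain dict into a ReviewerInfo instance for us.
review = SlimReview(reviewer={"first_name": "John", "last_name": "Doe"}, rating=4)
print(review.reviewer.last_name, review.purchase_date)  # Doe None
```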
Mini-Challenge: Extracting Event Details
You’ve learned the basics of defining schemas for single extractions. Now, it’s your turn!
Challenge: Imagine you have a short event announcement. Your goal is to extract the event’s title, date, time, location, and an organizer name. The organizer might not always be present, so make sure your schema can handle that.
Here’s the text:
"Annual Tech Summit 2026 - Join us for a day of innovation!
Date: February 10, 2026
Time: 9:00 AM - 5:00 PM PST
Venue: Virtual Event Platform
Organized by: FutureTech Inc."
Your Task:
- Define a Pydantic `BaseModel` called `EventDetails`.
- Include fields for `title` (str), `date` (str), `time` (str), `location` (str), and `organizer` (Optional[str]).
- Add clear `Field` descriptions for each attribute.
- Use your `extractor` instance to extract the information from the provided `event_announcement_text`.
- Print the extracted details in a readable format.
Hint: Remember to import `Optional` from `typing` for the `organizer` field. For the `purchase_date` field in the previous example, we used `Field(None, description="...")`. This is a robust way to handle optional fields in Pydantic V2.
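If you want to sanity-check your schema before wiring it up to the extractor, here is one possible `EventDetails` definition (a reference answer for the schema part only; the field descriptions are just suggestions):

```python
from typing import Optional
from pydantic import BaseModel, Field

class EventDetails(BaseModel):
    title: str = Field(description="The name of the event.")
    date: str = Field(description="The exact date of the event, including year.")
    time: str = Field(description="The start and end times, including timezone if given.")
    location: str = Field(description="The venue or platform where the event takes place.")
    organizer: Optional[str] = Field(None, description="The organizing company or person, if mentioned.")

# organizer has a default, so it is absent from the schema's required list.
print(EventDetails.model_json_schema()["required"])
```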
Common Pitfalls & Troubleshooting
Even with clear schemas, you might encounter issues. Here are a few common pitfalls:
- Schema Too Vague or Ambiguous:
  - Problem: If your field descriptions are unclear (e.g., `data: str` without further context), the LLM might struggle to interpret what you want.
  - Solution: Always provide precise and unambiguous `description` arguments to `Field`. Explain what the field represents and how it should be extracted from the text. For example, instead of "Date", use "The exact date of the event, including year."
- Missing `Optional` for Non-Guaranteed Fields:
  - Problem: If a field is defined as `field_name: str`, but the information isn't present in the text, the LLM might hallucinate a value, or LangExtract might raise a validation error because it couldn't find a non-optional field.
  - Solution: Use `Optional[str]` (or `Optional[int]`, etc.) for any piece of information that might not always be present in the source text. Remember `Field(None, description="...")` for a default of `None` plus a description.
- Incorrect Pydantic Types:
  - Problem: Expecting an `int` when the text might contain "ten" or "three point five," or expecting a `str` when you actually want a structured `List[str]`.
  - Solution: Choose your Pydantic types carefully. If you expect a list of items, define `List[str]`. If a number might be fractional, use `float` instead of `int`. If the LLM returns "five" and you expect an `int`, LangExtract/Pydantic will try to parse it, but it's best to guide the LLM with clear descriptions if you expect numerical representations.
- LLM Hallucinating Data:
  - Problem: The LLM might invent information that wasn't in the original text to fill a schema field.
  - Solution: While `Field` descriptions help, LangExtract's core design (especially with features like source grounding, which we'll explore later) aims to mitigate this. For now, ensure your descriptions emphasize extraction from the text rather than inference. Review the output carefully. If hallucination is persistent, try simplifying your schema or providing more explicit negative examples in advanced prompting (also a topic for later).
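The type pitfalls are easy to reproduce in isolation. Here is a quick sketch of Pydantic's coercion behavior with numbers and lists (model and field names are illustrative):

```python
from typing import List
from pydantic import BaseModel, ValidationError

class Ingredients(BaseModel):
    items: List[str]
    total_weight_grams: float  # use float, not int, when values may be fractional

# Numeric strings are coerced: "350.5" becomes the float 350.5.
ok = Ingredients(items=["flour", "sugar"], total_weight_grams="350.5")
print(ok.total_weight_grams)  # 350.5

# But a comma-separated string is NOT silently split into a list.
try:
    Ingredients(items="flour, sugar", total_weight_grams=350.5)
except ValidationError:
    print("a plain string does not satisfy List[str]")
```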
Summary
You’ve made incredible progress in this chapter! Here’s a quick recap of what we covered:
- Understanding Schemas: We learned that extraction schemas are blueprints for the structured data we want to extract, guiding the LLM for precision and consistency.
- Pydantic for Schema Definition: LangExtract uses Pydantic models to define these schemas, leveraging Python’s type hints.
- Basic Schema Creation: You created your first `ProductReview` Pydantic model with `str` and `int` fields.
- Enhancing with `Field`: We saw how `Field(description="...")` significantly improves LLM accuracy by providing rich context.
- Handling Complexity: You learned to use `Optional` for fields that might not always be present and to create nested Pydantic models for more complex data structures.
- Practical Application: You successfully used your schema with LangExtract's `extractor.extract()` method.
- Troubleshooting: We discussed common pitfalls like vague schemas, missing `Optional` types, incorrect Pydantic types, and LLM hallucination.
In the next chapter, we’ll expand on these concepts, exploring more advanced schema features, handling lists of extractions, and diving into how LangExtract processes longer documents through chunking and multi-pass extraction. Get ready to tackle even bigger data challenges!
References
- LangExtract GitHub Repository
- Pydantic V2 Documentation
- Pydantic Field documentation
- Python `typing` module (for `Optional`)
This page is AI-assisted and reviewed. It references official documentation and recognized resources where relevant.