Welcome back, intrepid data explorer! In our previous chapters, you learned the foundational steps of setting up LangExtract, connecting it to an LLM, and crafting basic schemas to pull simple pieces of information from text. You’ve seen how powerful even simple extraction can be.
But what if the information you need isn’t just a single name or a simple description? What if you need to extract a list of items, each with its own set of properties, or deeply nested structures like an address with street, city, and zip code? This is where the true power of LangExtract’s schema definition shines!
In this chapter, we’re going to level up your schema design skills. We’ll explore how to define richer data types beyond just plain text, such as numbers, booleans, and dates. More excitingly, you’ll learn to create nested schemas and extract lists of objects, allowing you to capture complex, hierarchical, and repetitive information from your documents with precision. By the end, you’ll be able to design schemas for even the most intricate data extraction challenges, preparing you for real-world document processing.
Ready to sculpt your data with even finer detail? Let’s dive in!
Core Concepts: Sculpting Your Data with Advanced Schemas
Remember how we defined schemas using Python dictionaries, mapping keys to simple str types? That was just the beginning! LangExtract, leveraging the underlying LLM’s understanding, can infer and enforce a wide array of data types, ensuring your extracted data is not just present, but also correctly formatted and validated.
Beyond Basic Strings: Richer Data Types
Why settle for just str when your data has more specific forms? LangExtract allows you to specify common Python types directly in your schema, guiding the LLM to extract values that conform to these types. This is crucial for data integrity and downstream processing.
Here are some fundamental types you can use:
- `str`: For text such as names and descriptions (our default so far).
- `int`: For whole numbers (e.g., quantities, ages).
- `float`: For decimal numbers (e.g., prices, measurements).
- `bool`: For true/false values (e.g., "Is active?", "Has discount?").
- `list`: For extracting multiple items of the same type (we'll cover lists of objects shortly!).
- enum: For a fixed set of predefined choices (e.g., a `status` field limited to `"pending"`, `"approved"`, or `"rejected"`).
Using these types ensures that if LangExtract extracts “twenty” for an int field, it will attempt to convert it to 20, or flag it if it cannot.
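To make that convert-or-flag behavior concrete, here is a minimal sketch of the kind of post-extraction type check you could run yourself in plain Python. It is illustrative only; LangExtract's own validation may behave differently.

```python
def validate_field(value, expected_type):
    """Return the value coerced to expected_type, or None if it can't conform."""
    if isinstance(value, expected_type):
        return value
    try:
        # e.g. int("20") -> 20, float("2.5") -> 2.5
        return expected_type(value)
    except (TypeError, ValueError):
        # e.g. int("twenty") fails -> flag the field as missing
        return None

print(validate_field("20", int))      # -> 20
print(validate_field("twenty", int))  # -> None
print(validate_field("2.5", float))   # -> 2.5
```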
Nested Schemas: Extracting Hierarchical Data
Many real-world entities aren’t flat. A person has a name, but also an address, and that address itself has a street, city, state, and zip code. This is where nested schemas come in. You can define a schema that contains other schemas, forming a hierarchical structure.
Think of it like building a set of Russian nesting dolls, where each doll (schema) contains smaller dolls (sub-schemas) that represent more granular details.
Analogy: Imagine you’re describing a car.
- The top-level schema is for the `Car`.
- Inside `Car`, you might have an `Engine` schema with `horsepower`, `cylinders`, and `fuel_type`.
- You might also have a `Tires` schema with `brand`, `size`, and `pressure`.
LangExtract handles this by allowing you to define a dictionary as the type for a field, where that dictionary is another schema.
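Concretely, the car analogy above translates into plain Python dictionaries. This sketch uses illustrative field names, not output from any real extraction:

```python
# Sub-schemas are plain dictionaries; one dictionary nests inside another.
engine_schema = {
    "horsepower": int,
    "cylinders": int,
    "fuel_type": str,
}

tires_schema = {
    "brand": str,
    "size": str,
    "pressure": float,
}

car_schema = {
    "make": str,
    "model": str,
    "engine": engine_schema,  # nested schema used as the field's type
    "tires": tires_schema,
}

print(car_schema["engine"]["horsepower"])  # -> <class 'int'>
```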
Handling Optional Fields
Sometimes, a piece of information might not always be present in every document. If you define a field as mandatory and the LLM cannot find it, the extraction might fail or return None (depending on the exact LLM and LangExtract configuration). To gracefully handle missing data, you can mark fields as optional.
In Python, we often use typing.Optional or Union[Type, None] to signify optional values. LangExtract schemas can use Optional[Type] from the typing module to indicate that a field is not strictly required. If the LLM doesn’t find the information for an optional field, it will simply omit it or set it to None without causing an error.
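As a minimal sketch, marking a field optional is just a matter of wrapping its type (the field names here are illustrative):

```python
from typing import Optional

# 'middle_name' may be absent from the source text; Optional[str] signals
# that None or omission is acceptable rather than an error.
person_schema = {
    "first_name": str,
    "last_name": str,
    "middle_name": Optional[str],  # not strictly required
}
```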
Enums for Categorical Data
When a field can only take one of a few predefined values (e.g., “status” can be “draft”, “published”, or “archived”), using an enum is perfect. Enums prevent the LLM from hallucinating arbitrary values and ensure consistency in your extracted data.
You define an enum by providing a list of possible string values. LangExtract will then instruct the LLM to choose only from these options.
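A `Literal` type also carries its allowed values at runtime, which you can reuse for your own validation. A small sketch, independent of LangExtract:

```python
from typing import Literal, get_args

Status = Literal["draft", "published", "archived"]

# get_args recovers the allowed values -- handy for checking extracted data.
allowed = get_args(Status)
print(allowed)  # -> ('draft', 'published', 'archived')

def is_valid_status(value: str) -> bool:
    return value in allowed

print(is_valid_status("published"))  # -> True
print(is_valid_status("in_review"))  # -> False
```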
Lists of Objects: Extracting Multiple Entities
This is perhaps one of the most powerful features for document processing. Imagine extracting all the line items from an invoice, all the attendees from a meeting report, or all the authors from a research paper. These are all lists of objects, where each object in the list conforms to a specific sub-schema.
To achieve this, you define a sub-schema for a single item, and then specify that a field’s type is list[YourSubSchema]. LangExtract will then prompt the LLM to identify and extract all instances of that item, each structured according to its schema.
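For instance, an invoice's line items might be sketched like this. The field names are illustrative; note that the built-in `list[...]` generic is used, because `typing.List[...]` rejects a dictionary argument at runtime:

```python
# A sub-schema for one invoice line item, then a list of such items.
line_item_schema = {
    "description": str,
    "quantity": int,
    "unit_price": float,
}

invoice_schema = {
    "invoice_number": str,
    # built-in list[...] accepts a dict sub-schema (typing.List would raise)
    "line_items": list[line_item_schema],
}

print(invoice_schema["line_items"].__origin__)
```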
Step-by-Step Implementation: Building a Rich Product Schema
Let’s put these concepts into practice. We’ll imagine we’re extracting data from product descriptions, which often contain diverse information.
First, ensure you have LangExtract installed and your LLM provider configured as we did in Chapters 2 and 3.
# Make sure you have LangExtract installed (latest stable as of 2026-01-05)
# pip install langextract
# And your LLM provider configured, e.g., for Google Generative AI
# pip install google-generativeai
import langextract as lx
import os
from typing import Optional, List # We'll need these for advanced types
# For Google Generative AI (e.g., Gemini Pro)
# Make sure to set your API key as an environment variable or replace 'os.getenv("GOOGLE_API_KEY")'
# For example: export GOOGLE_API_KEY="YOUR_API_KEY"
try:
    llm_provider = lx.GoogleGenerativeAI(api_key=os.getenv("GOOGLE_API_KEY"))
    print("Google Generative AI provider configured.")
except Exception as e:
    print(f"Error configuring Google Generative AI: {e}")
    print("Please ensure GOOGLE_API_KEY is set and 'google-generativeai' is installed.")
    llm_provider = None  # Set to None if configuration fails
Explanation:
- We import `langextract` as `lx`, and `os` to access environment variables.
- Crucially, we import `Optional` and `List` from the `typing` module. These are standard Python type hints that LangExtract understands for defining complex schemas.
- We reconfigure the `llm_provider` using `lx.GoogleGenerativeAI`, assuming you have your `GOOGLE_API_KEY` set up. Wrapping this in try/except is a robust way to handle the LLM connection.
Now, let’s define a sample product description.
product_description = """
Introducing the "Quantum Leap Widget Pro" - a revolutionary device designed for tech enthusiasts.
It boasts a powerful 2.5 GHz Octa-core processor and 16 GB of RAM, ensuring silky-smooth performance.
The widget features a stunning 6.7-inch AMOLED display and a durable aluminum casing.
It's currently available in Midnight Black and Arctic White.
Launch Date: 2025-11-15.
Price: $799.99.
Special Offer: Includes a free protective case (Value: $29.99).
Customer reviews highlight its "blazing speed" and "intuitive interface."
Warranty: 2 years.
"""
Step 1: Basic Types, Optional Fields, and Enums
Let’s start by extracting some basic information, including an optional field and an enum for color.
if llm_provider:  # Only proceed if LLM provider is configured
    print("\n--- Step 1: Basic Types, Optional Fields, and Enums ---")

    # Define the schema using a dictionary
    product_schema_step1 = {
        "name": str,
        "processor_speed_ghz": float,  # Expect a decimal number
        "ram_gb": int,  # Expect a whole number
        "display_size_inches": Optional[float],  # Display size might not always be mentioned
        "available_colors": List[str],  # A list of string colors
        "launch_date": str,  # For now, let's keep it a string; we'll refine later
        "price": float,
        "has_special_offer": bool,  # Is there a special offer? True/False
        "warranty_years": Optional[int]  # Warranty might be missing
    }

    print("\nExtracting with product_schema_step1...")
    result_step1 = lx.extract(
        text_or_document=product_description,
        schema=product_schema_step1,
        llm_provider=llm_provider
    )

    print("\nExtracted Data (Step 1):")
    print(result_step1)
    # print(result_step1.json(indent=2))  # If you prefer JSON output for readability
Explanation:
- `product_schema_step1`: We define a Python dictionary where keys are the field names and values are their expected types.
- `processor_speed_ghz: float`, `ram_gb: int`, `price: float`: We're explicitly telling LangExtract to expect specific numerical types.
- `display_size_inches: Optional[float]`: This field might not always be present. If the LLM can't find it, it won't cause an error; the field might be `None` or omitted.
- `available_colors: List[str]`: This tells LangExtract to expect multiple colors, extracted as a list of strings.
- `has_special_offer: bool`: This guides the LLM to look for an indication of a special offer and return `True` or `False`.
- `warranty_years: Optional[int]`: Another optional field, expecting an integer.
Observe the output. LangExtract has done a great job of converting “2.5 GHz” to 2.5, “16 GB” to 16, and identifying the colors and the presence of a special offer.
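The exact return type depends on your LangExtract version, but assuming the result behaves like a plain dictionary, downstream use might look like this sketch. The values below are hypothetical stand-ins, not actual output:

```python
# Hypothetical result shaped like product_schema_step1 -- illustrative only.
result = {
    "name": "Quantum Leap Widget Pro",
    "processor_speed_ghz": 2.5,
    "ram_gb": 16,
    "display_size_inches": 6.7,
    "available_colors": ["Midnight Black", "Arctic White"],
    "launch_date": "2025-11-15",
    "price": 799.99,
    "has_special_offer": True,
    "warranty_years": 2,
}

# Because the fields are properly typed, they plug straight into logic and math.
price_with_tax = round(result["price"] * 1.19, 2)  # hypothetical 19% tax rate
print(price_with_tax)                    # -> 951.99
print(len(result["available_colors"]))   # -> 2
```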
Step 2: Introducing Nested Schemas
Now, let’s make our schema more structured. A product often has a specifications section and maybe reviews. We can define these as nested objects.
if llm_provider:  # Only proceed if LLM provider is configured
    print("\n--- Step 2: Introducing Nested Schemas ---")

    # Define a sub-schema for Specifications
    specifications_schema = {
        "processor_speed_ghz": float,
        "ram_gb": int,
        "display_size_inches": Optional[float]
    }

    # Define a sub-schema for a single Customer Review
    customer_review_schema = {
        "aspect": str,  # e.g., "blazing speed"
        "sentiment": str  # e.g., "positive" or "negative"
    }

    # Integrate these sub-schemas into the main product schema
    product_schema_step2 = {
        "name": str,
        "specifications": specifications_schema,  # Nested schema!
        "available_colors": List[str],
        "launch_date": str,
        "price": float,
        "has_special_offer": bool,
        "warranty_years": Optional[int],
        # Built-in list[...] is used here: typing.List rejects a dict argument.
        "customer_reviews_summary": list[customer_review_schema]  # List of nested schemas!
    }

    print("\nExtracting with product_schema_step2...")
    result_step2 = lx.extract(
        text_or_document=product_description,
        schema=product_schema_step2,
        llm_provider=llm_provider
    )

    print("\nExtracted Data (Step 2):")
    print(result_step2)
Explanation:
- `specifications_schema`: A new dictionary defining the structure for product specifications.
- `customer_review_schema`: Another dictionary describing a single review.
- `specifications: specifications_schema`: In the main `product_schema_step2`, we assign our `specifications_schema` directly as the type for the `specifications` field. This tells LangExtract to extract an object conforming to `specifications_schema` here.
- `customer_reviews_summary`: typed as a list of `customer_review_schema` objects. This is a powerful combination! It instructs LangExtract to find multiple customer reviews and structure each one according to the `customer_review_schema`.
Notice how the output is now much more organized, with specifications and customer_reviews_summary as nested objects and a list of objects, respectively. This is getting closer to real-world data structures!
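Assuming, as before, that the result behaves like a plain dictionary, the nested fields can be consumed like this (the values are hypothetical stand-ins):

```python
# Hypothetical nested result shaped like product_schema_step2 -- illustrative only.
result = {
    "name": "Quantum Leap Widget Pro",
    "specifications": {
        "processor_speed_ghz": 2.5,
        "ram_gb": 16,
        "display_size_inches": 6.7,
    },
    "customer_reviews_summary": [
        {"aspect": "blazing speed", "sentiment": "positive"},
        {"aspect": "intuitive interface", "sentiment": "positive"},
    ],
}

# Nested objects are reached with chained keys; lists of objects with iteration.
print(result["specifications"]["ram_gb"])  # -> 16
positive = [r["aspect"] for r in result["customer_reviews_summary"]
            if r["sentiment"] == "positive"]
print(positive)  # -> ['blazing speed', 'intuitive interface']
```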
Step 3: Refinements with datetime and enum
While `str` works for `launch_date`, it's better to get a proper `date` object for date manipulation. Also, let's add an enum for product category.
from datetime import date # Import date type for schema
from typing import Literal # For defining enums with fixed strings
if llm_provider:  # Only proceed if LLM provider is configured
    print("\n--- Step 3: Refinements with datetime and enum ---")

    # Product category enum
    ProductCategory = Literal["electronics", "apparel", "home_goods", "software"]

    # Updated Specifications schema (no change needed here for this step)
    specifications_schema_final = {
        "processor_speed_ghz": float,
        "ram_gb": int,
        "display_size_inches": Optional[float]
    }

    # Updated Customer Review schema (no change needed here for this step)
    customer_review_schema_final = {
        "aspect": str,
        "sentiment": str
    }

    # Integrate these sub-schemas into the main product schema
    product_schema_final = {
        "name": str,
        "category": ProductCategory,  # Using our custom enum!
        "specifications": specifications_schema_final,
        "available_colors": List[str],
        "launch_date": date,  # Now expecting a date object!
        "price": float,
        "has_special_offer": bool,
        "warranty_years": Optional[int],
        # Built-in list[...] is used here: typing.List rejects a dict argument.
        "customer_reviews_summary": list[customer_review_schema_final]
    }

    print("\nExtracting with product_schema_final...")
    result_final = lx.extract(
        text_or_document=product_description,
        schema=product_schema_final,
        llm_provider=llm_provider
    )

    print("\nExtracted Data (Final Schema):")
    print(result_final)
    # print(result_final.json(indent=2))  # For pretty printing
Explanation:
- `from datetime import date`: We import `date` from the `datetime` module. LangExtract can intelligently parse common date formats into Python `date` objects.
- `ProductCategory = Literal["electronics", "apparel", ...]`: We define a `Literal` type, which acts as an enum. This tells the LLM that the `category` field must be one of these exact strings. If the LLM cannot confidently assign a category from the text, it might return `None` or fall back to default behavior, depending on its capabilities.
- `launch_date: date`: We set the type to `date`. LangExtract will attempt to convert "2025-11-15" into a `datetime.date` object.
- `category: ProductCategory`: This field is constrained to our predefined enum values. In this case, the "Quantum Leap Widget Pro" clearly falls under "electronics".
Now, your extracted data is not only structured but also typed precisely, making it immediately usable for further analysis, database storage, or application logic.
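For example, because `launch_date` now comes back as a real `date` object rather than a string, standard `datetime` arithmetic applies directly. A small sketch, assuming "2025-11-15" parses to `date(2025, 11, 15)`:

```python
from datetime import date

launch_date = date(2025, 11, 15)  # what "2025-11-15" should parse to

# date objects support attribute access, subtraction, and formatting directly.
print(launch_date.year)                        # -> 2025
print((launch_date - date(2025, 11, 1)).days)  # -> 14
print(launch_date.isoformat())                 # -> 2025-11-15
```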
Visualizing the Schema Structure (Optional)
Sometimes, especially with complex nested schemas, it helps to visualize the structure. While LangExtract doesn’t have a built-in schema visualizer, we can represent it using Mermaid.js. This helps clarify the relationships between fields and nested objects.
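A possible diagram for `product_schema_final` is sketched below. The node IDs are illustrative; bracketed type names are spelled out in words because Mermaid reserves square brackets inside node labels:

```mermaid
graph TD
    A[Product] --> B[name: str]
    A --> C[category: ProductCategory]
    A --> D[specifications: Specifications]
    A --> E[available_colors: List of str]
    A --> F[launch_date: date]
    A --> G[price: float]
    A --> H[has_special_offer: bool]
    A --> I[warranty_years: Optional int]
    A --> J[customer_reviews_summary: List of CustomerReview]
    D --> D1[processor_speed_ghz: float]
    D --> D2[ram_gb: int]
    D --> D3[display_size_inches: Optional float]
    J --> J1[aspect: str]
    J --> J2[sentiment: str]
```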
Explanation:
This Mermaid graph TD (top-down) diagram visually represents our product_schema_final.
- `A[Product]` is the top-level entity.
- Arrows (`-->`) indicate that the `Product` contains various fields.
- `D[specifications: Specifications]` shows a nested object.
- `J[customer_reviews_summary: List[CustomerReview]]` indicates a list of nested objects.
- `ProductCategory` is an enum, represented as a simple field for simplicity in this diagram.
This visual aid helps in understanding how complex data structures are broken down and extracted.
Mini-Challenge: Extracting Event Details
You’ve learned about advanced data types, nested schemas, and lists of objects. Now, it’s your turn to apply these concepts!
Challenge: You are given a short announcement about an upcoming tech conference. Your task is to define a LangExtract schema that extracts the following information:
- Conference Name (`str`)
- Main Host Organization (`str`)
- Start Date (`date`)
- End Date (`date`)
- Location (a nested object with `city: str`, `country: str`, and `venue_name: Optional[str]`)
- Key Speakers (a list of objects, where each speaker object has `name: str` and `topic: str`)
- Ticket Price (`float`)
- Is Virtual (`bool`): whether the conference offers a virtual attendance option.
Here’s the text:
event_text = """
Announcing "FutureTech Summit 2026"! Hosted by Global Innovations Inc., this premier event
will run from 2026-03-10 to 2026-03-12. It's set to take place in Berlin, Germany, at the
historic "TechHub Arena". Our lineup includes Dr. Anya Sharma discussing "AI Ethics in Practice"
and Prof. Ben Carter on "Quantum Computing's Next Frontier." Tickets are priced at $1250.00.
Virtual attendance options are fully supported.
"""
Hint:
- Remember to import `date` from `datetime`, and `Optional` and `List` from `typing`.
- Define your nested `Location` and `Speaker` schemas first, then integrate them into the main conference schema.
- Pay attention to the expected data types for each field.
Take a moment, try to build the schema and run the extraction yourself. What do you observe about the output?
Click for Solution (Optional)
from datetime import date
from typing import Optional, List
# Define the nested schemas first
location_schema = {
    "city": str,
    "country": str,
    "venue_name": Optional[str]
}

speaker_schema = {
    "name": str,
    "topic": str
}

# Define the main conference schema
conference_schema = {
    "conference_name": str,
    "host_organization": str,
    "start_date": date,
    "end_date": date,
    "location": location_schema,  # Nested object
    # Built-in list[...] is used here: typing.List rejects a dict argument.
    "key_speakers": list[speaker_schema],  # List of nested objects
    "ticket_price": float,
    "is_virtual": bool
}
# The text to extract from
event_text = """
Announcing "FutureTech Summit 2026"! Hosted by Global Innovations Inc., this premier event
will run from 2026-03-10 to 2026-03-12. It's set to take place in Berlin, Germany, at the
historic "TechHub Arena". Our lineup includes Dr. Anya Sharma discussing "AI Ethics in Practice"
and Prof. Ben Carter on "Quantum Computing's Next Frontier." Tickets are priced at $1250.00.
Virtual attendance options are fully supported.
"""
if llm_provider:
    print("\n--- Mini-Challenge Solution ---")
    challenge_result = lx.extract(
        text_or_document=event_text,
        schema=conference_schema,
        llm_provider=llm_provider
    )
    print("Extracted Conference Data:")
    print(challenge_result)
Common Pitfalls & Troubleshooting
Even with powerful tools like LangExtract, complex schema design can introduce a few common hiccups.
- Type Mismatches: If you define a field as `int` but the LLM extracts text like "not applicable," LangExtract will try to convert it and might raise an error or return `None` (depending on the specific LLM and its error handling).
  - Solution: Use `Optional[Type]` for fields that might be missing or non-conformant. If a field must be a certain type, ensure the prompt is clear or the text unambiguously contains that type of data.
- Overly Ambitious Schemas: Defining a schema that's too deep, too broad, or requests too many items in a list can sometimes overwhelm the LLM, leading to incomplete or incorrect extractions.
  - Solution: Start simple, then incrementally add complexity. Test your schema frequently. If an LLM struggles with a very complex schema, consider breaking the extraction into multiple passes (which we'll cover in a later chapter) or simplifying your schema.
- Ambiguous Instructions: While LangExtract abstracts much of the prompting, if your schema field names are vague (e.g., `item` instead of `product_item_details`), the LLM might not understand what to extract.
  - Solution: Use descriptive field names in your schema. Sometimes adding a description to a field can help (a feature of some schema definition libraries), but clear names are usually sufficient.
- Missing `typing` Imports: For `Optional`, `List`, and `Literal`, you must import them from the `typing` module. Forgetting this will lead to Python errors before LangExtract even runs.
  - Solution: Double-check your imports at the top of your script: `from typing import Optional, List, Literal`.
Summary
Congratulations! You’ve successfully navigated the exciting world of advanced schema design with LangExtract. Here’s a quick recap of what you’ve mastered:
- Richer Data Types: You can now specify `int`, `float`, `bool`, `date`, and `Literal` (for enums) in your schemas, ensuring your extracted data is not just present but also correctly typed.
- Optional Fields: You learned how to gracefully handle missing information using `Optional[Type]`, preventing errors and making your extractions more robust.
- Nested Schemas: You can define complex, hierarchical data structures by embedding one schema within another, perfect for entities with sub-components like addresses or specifications.
- Lists of Objects: You discovered how to extract multiple, similar entities (e.g., multiple speakers, multiple product features) using a list of sub-schemas, transforming unstructured text into structured collections.
- Visualizing Schemas: You saw how Mermaid.js diagrams can help you understand and communicate complex schema structures.
By combining these techniques, you can design highly effective schemas that precisely capture the nuanced information buried within your documents. You’re now equipped to tackle a vast array of structured data extraction challenges!
In the next chapter, we’ll explore how LangExtract handles very long documents, introducing concepts like chunking and multi-pass extraction to overcome LLM context window limitations. Get ready to process entire reports and contracts!
References
- LangExtract GitHub Repository: https://github.com/google/langextract
- Python `typing` Module Documentation: https://docs.python.org/3/library/typing.html
- Python `datetime` Module Documentation: https://docs.python.org/3/library/datetime.html
- Mermaid.js Diagram Syntax Reference: https://mermaid.js.org/intro/syntax-reference.html
- Google Generative AI Python SDK (for LLM setup): https://developers.google.com/gemini/docs/reference/python
This page is AI-assisted and reviewed. It references official documentation and recognized resources where relevant.