Introduction to the LangExtract API
Welcome back, intrepid data explorer! In our previous chapters, we laid the groundwork for using LangExtract by setting up your environment and understanding how to define extraction tasks using schemas. Now, it’s time to get to the heart of the matter: the LangExtract API itself.
This chapter will guide you through the core functions that empower you to perform structured information extraction. We’ll focus primarily on the star of the show: the langextract.extract() function. You’ll learn how to use its various parameters to precisely control your extraction tasks, from specifying your input text to selecting the underlying Large Language Model (LLM) and fine-tuning performance.
Understanding these API functions and their parameters is crucial for building robust, efficient, and accurate extraction workflows. By the end of this chapter, you’ll feel confident in orchestrating LangExtract to perform complex data extraction with clarity and control.
Prerequisites
Before we dive in, make sure you’ve:
- Successfully installed LangExtract (Chapter 3).
- Configured at least one LLM provider (e.g., Google AI Studio, OpenAI) and its API key (Chapter 4).
- Understood how to define an `ExtractionSchema` using Pydantic (Chapter 5).
Ready? Let’s unlock the power of LangExtract!
The langextract.extract() Function: Your Extraction Command Center
At the core of LangExtract is a single, powerful function: langextract.extract(). This function serves as your primary interface for initiating an information extraction task. It takes your raw text or document, applies your defined schema, and uses an LLM to produce structured data.
Let’s break down its most important parameters. Think of them as the dials and levers you’ll use to steer your extraction process.
Figure 7.1: Overview of the langextract.extract() function and its key inputs.
1. text_or_document: What You Want to Extract From
This is the most fundamental parameter. It’s where you provide the raw data that LangExtract will process.
- What it is: The source text or document content from which you want to extract information.
- Why it’s important: Without input, there’s nothing to extract!
- How it functions:
  - It can be a simple `str` containing your text.
  - For more complex scenarios (like PDFs or scanned documents), LangExtract can integrate with document loaders, but for now, we’ll focus on string input.
2. schema: What You Want to Extract
You’re already familiar with this one! The schema parameter tells LangExtract exactly what kind of information to look for and how to structure it.
- What it is: An instance of `langextract.schemas.ExtractionSchema`, which wraps your Pydantic model definition.
- Why it’s important: This is your blueprint. It defines the target structure and types of the extracted data.
- How it functions: LangExtract uses the schema’s `description` and the Pydantic model’s field names and types to craft effective prompts for the underlying LLM.
3. llm_model: Choosing Your Brain
LangExtract is LLM-agnostic, meaning it can work with various LLM providers. The llm_model parameter allows you to specify which LLM to use for a particular extraction task.
- What it is: A string identifier for the LLM model you wish to use. This typically corresponds to a model name provided by your configured LLM provider (e.g., “gemini-pro”, “gpt-4”, “claude-3-opus-20240229”).
- Why it’s important: Different LLMs have varying capabilities, costs, and performance characteristics. Choosing the right one for your task can significantly impact accuracy and efficiency.
- How it functions: LangExtract uses this identifier to route the extraction request to the appropriate LLM provider and model, assuming you’ve configured that provider in your environment (as we did in Chapter 4). If not specified, LangExtract might use a default configured model or raise an error if no default is set.
4. config: Fine-Tuning the Extraction Process
The config parameter is your gateway to advanced control over how LangExtract processes documents, especially longer ones. It accepts an instance of langextract.config.ExtractionConfig.
- What it is: An object that holds various settings for the extraction process, such as chunking strategies, parallel processing, and retry logic.
- Why it’s important: For large documents, a single LLM call is often insufficient or impossible due to context window limits. `config` allows you to break down the problem intelligently.
- How it functions:
  - `chunk_size`: Defines the maximum number of characters (or tokens, depending on the LLM) in each piece of text sent to the LLM.
  - `chunk_overlap`: Specifies how much overlap there should be between consecutive chunks. This helps maintain context across chunk boundaries.
  - `max_workers`: Controls the number of parallel LLM calls LangExtract can make, speeding up processing for multi-chunk documents.
  - `max_retries`: How many times to retry an LLM call if it fails.
  - `temperature`: (If exposed by LangExtract for the underlying LLM) Controls the “creativity” or randomness of the LLM’s output. Lower values make it more deterministic.
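To build intuition for how `chunk_size` and `chunk_overlap` interact, here is a minimal character-based chunker in plain Python. It illustrates the concept only; LangExtract’s own chunking may work on token or sentence boundaries rather than raw characters:

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows with overlap.

    Illustrative sketch only, not LangExtract's internal implementation.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new chunk starts chunk_size - chunk_overlap characters after the last one.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# 500 characters of varied text, chunked with the values used later in Step 7.
text = "".join(str(i % 10) for i in range(500))
chunks = chunk_text(text, chunk_size=200, chunk_overlap=50)

print(len(chunks))                        # 4: chunk starts at 0, 150, 300, 450
print(chunks[0][-50:] == chunks[1][:50])  # True: neighbouring chunks share 50 chars
```

Note that the step between chunk starts is `chunk_size - chunk_overlap`, so a larger overlap means more (and more redundant) chunks, and therefore more LLM calls.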
5. visualize: Seeing is Believing
LangExtract offers powerful interactive visualization tools to help you understand and debug your extraction results.
- What it is: A boolean flag (`True` or `False`) that, when set to `True`, enables an interactive visualization of the extraction process.
- Why it’s important: This feature helps you see exactly which parts of the text contributed to which extracted fields, identify errors, and iterate on your schemas or prompts quickly.
- How it functions: When `visualize=True`, LangExtract will typically return a visualization object or launch a local web server displaying the results, highlighting the extracted entities and their source spans in the original text. This is an invaluable tool for debugging and refining your extraction tasks.
Step-by-Step Implementation: Using extract()
Let’s put these parameters into practice. We’ll start simple and then gradually add more control.
First, ensure you have your environment set up and an LLM configured. For this example, we’ll assume you’ve configured a Google Gemini Pro model and can refer to it as "gemini-pro".
```python
# Step 1: Import necessary libraries
import langextract as lx
from langextract.schemas import ExtractionSchema
from langextract.config import ExtractionConfig
from pydantic import BaseModel

# Step 2: Define your Pydantic model for the desired output structure
class CompanyContact(BaseModel):
    company_name: str
    contact_person: str | None = None
    email: str | None = None
    phone: str | None = None
    role: str | None = None

# Step 3: Create an ExtractionSchema instance
# This is our blueprint for what to extract.
company_schema = ExtractionSchema(
    name="Company Contact Information",
    description="Extract the company name, a primary contact person, their role, email, and phone number.",
    output_model=CompanyContact,
)

# Step 4: Prepare your input text
text_data_short = """
Acme Corp. is pleased to announce a new partnership. For inquiries, please contact
our CEO, Jane Doe, at [email protected] or call +1 (555) 123-4567.
"""

# Step 5: Perform a basic extraction using only text and schema
print("--- Basic Extraction ---")
try:
    # We'll use a placeholder for the LLM model name.
    # In a real scenario, replace "your-configured-llm-model" with your actual
    # model name, e.g., "gemini-pro".
    basic_result = lx.extract(
        text_or_document=text_data_short,
        schema=company_schema,
        llm_model="your-configured-llm-model",  # IMPORTANT: Replace with your actual configured LLM model name
    )
    print(basic_result.parsed_output)
except Exception as e:
    print(f"An error occurred during basic extraction: {e}")
    print("Please ensure your LLM model is correctly configured and the name matches.")
```
Explanation:
- We import `langextract` as `lx`, along with `ExtractionSchema`, `ExtractionConfig`, and Pydantic’s `BaseModel`.
- We define `CompanyContact`, a Pydantic model, specifying the fields we want to extract: `company_name`, `contact_person`, `email`, `phone`, and `role`. Notice the `| None = None` for optional fields, which is good practice.
- An `ExtractionSchema` instance, `company_schema`, is created, linking our `CompanyContact` model with a descriptive name and description. This description is vital for the LLM’s understanding.
- `text_data_short` holds the input text for our extraction.
- Finally, `lx.extract()` is called. We pass our `text_data_short` and `company_schema`. Crucially, we also explicitly specify `llm_model`. Remember to replace `"your-configured-llm-model"` with the actual identifier of an LLM you have configured (e.g., `"gemini-pro"`, `"gpt-4"`). The `.parsed_output` attribute of the result object gives us the structured Pydantic model instance.
Adding Advanced Configuration (config)
Now, let’s imagine we have a much longer document. We’ll use ExtractionConfig to manage how LangExtract handles it.
```python
# Step 6: Prepare a longer input text (simulated)
text_data_long = """
This is the first part of a very long report about various companies.
Today, we focus on Innovate Solutions Inc. Their lead engineer, Dr. Sarah Lee,
can be reached at [email protected]. Dr. Lee is a pioneer in AI.
Her direct line is +1 (800) 555-0101.
Later in the report, we discuss Global Dynamics. You can find their CEO,
Mr. John Smith, at [email protected]. His office number is
+1 (800) 555-0102. Mr. Smith often speaks at industry conferences.
This document contains many details, requiring careful chunking to ensure
all relevant information is processed by the LLM without exceeding context windows.
""" * 5  # Repeat to make it artificially long

# Step 7: Create an ExtractionConfig instance
# We'll set a small chunk size for demonstration, and allow parallel processing.
extraction_config = ExtractionConfig(
    chunk_size=200,    # Break text into chunks of 200 characters
    chunk_overlap=50,  # 50 characters overlap between chunks to maintain context
    max_workers=2,     # Process up to 2 chunks in parallel (if your LLM provider supports it)
)

# Step 8: Perform extraction with advanced configuration
print("\n--- Extraction with Config (Chunking) ---")
try:
    config_result = lx.extract(
        text_or_document=text_data_long,
        schema=company_schema,
        llm_model="your-configured-llm-model",  # IMPORTANT: Replace
        config=extraction_config,
    )
    # LangExtract handles aggregation of results from multiple chunks.
    # The 'parsed_output' will contain a list if multiple entities match the schema across chunks.
    print(config_result.parsed_output)
except Exception as e:
    print(f"An error occurred during config extraction: {e}")
    print("Please ensure your LLM model is correctly configured and the name matches.")
```
Explanation:
- We create `text_data_long` by repeating a paragraph multiple times to simulate a longer document.
- An `ExtractionConfig` object, `extraction_config`, is created. We set `chunk_size` to `200` characters (very small for demonstration) and `chunk_overlap` to `50`. We also enable `max_workers=2` to show how parallel processing can be configured.
- The `lx.extract()` call now includes the `config` parameter, passing our `extraction_config` object. LangExtract will automatically break `text_data_long` into chunks, send them to the LLM, and then intelligently merge the results. For schemas that define a single entity (like our `CompanyContact`), if multiple instances are found across chunks, LangExtract will return a list of parsed outputs.
Enabling Interactive Visualization
To see how LangExtract processed the information, especially when debugging, the visualize=True parameter is incredibly useful.
```python
# Step 9: Perform extraction with visualization enabled
print("\n--- Extraction with Visualization ---")

# Using the shorter text for a clearer visualization example
try:
    # Note: The exact behavior of 'visualize=True' might depend on your environment
    # and LangExtract version. It often launches a browser tab or returns a special object.
    visual_result = lx.extract(
        text_or_document=text_data_short,
        schema=company_schema,
        llm_model="your-configured-llm-model",  # IMPORTANT: Replace
        visualize=True,
    )
    print("Visualization enabled. Check your browser or the returned object for interactive results.")

    # In a real application, you might need to call a method on visual_result
    # to explicitly open the visualization, e.g., visual_result.display()
    # For now, we'll just print its type to show it's not a simple Pydantic model.
    print(f"Type of visual_result: {type(visual_result)}")

    # The actual parsed output is usually still accessible, e.g., visual_result.parsed_output
    if hasattr(visual_result, 'parsed_output'):
        print(f"Parsed output from visual_result: {visual_result.parsed_output}")
except Exception as e:
    print(f"An error occurred during visualization extraction: {e}")
    print("Ensure your LLM model is configured and any necessary display libraries are installed.")
```
Explanation:
- We call `lx.extract()` again, this time setting `visualize=True`.
- When `visualize` is enabled, LangExtract returns a special result object that contains the parsed data and the information needed to render an interactive view. The exact way this visualization is displayed (e.g., automatically opening a browser tab, requiring a method call) can vary, so the print statement indicates what to expect. This is a powerful debugging tool to verify source grounding.
Mini-Challenge: Customize Your Extraction
You’ve seen the core parameters in action. Now it’s your turn to experiment!
Challenge:
Take the `text_data_long` from our previous example.
- Modify the `ExtractionConfig` to use a `chunk_size` of `150` and a `chunk_overlap` of `75`.
- If you have multiple LLM models configured (e.g., a faster, cheaper one for drafts and a more powerful one for final passes), try switching the `llm_model` parameter to a different model identifier. If not, just stick with your primary model.
- Run the extraction and observe the `parsed_output`. Does changing the chunking parameters affect the results, especially if information spans across chunk boundaries?
Hint: Remember to replace `"your-configured-llm-model"` with your actual LLM model identifier. Pay close attention to the `ExtractionConfig` parameters.
What to observe/learn:
- How subtle changes in `chunk_size` and `chunk_overlap` can influence the LLM’s ability to capture complete entities that might be split across chunks.
- The impact of different LLM models on extraction quality (if you were able to switch models).
Common Pitfalls & Troubleshooting
“LLM Model Not Found/Configured” Error:
- Pitfall: You specified an `llm_model` name that either isn’t configured in your environment variables or isn’t recognized by LangExtract for your setup.
- Troubleshooting: Double-check your environment variables (e.g., `GOOGLE_API_KEY`, `OPENAI_API_KEY`) and ensure they are loaded. Verify the `llm_model` string matches an available model from your configured provider (e.g., `"gemini-pro"` for Google, `"gpt-3.5-turbo"` for OpenAI). Refer back to Chapter 4 for LLM provider setup.
Incomplete or Incorrect Extraction Results:
- Pitfall: The LLM isn’t extracting all the information you expect, or it’s making mistakes.
- Troubleshooting:
  - Schema Review: Is your `ExtractionSchema`’s `description` clear and precise? Are your Pydantic model’s field names self-explanatory? Ambiguous descriptions lead to poor results.
  - Input Text Quality: Is the information actually present in the `text_or_document`?
  - Chunking Issues: For long documents, if `chunk_size` is too small or `chunk_overlap` is insufficient, crucial context might be lost between chunks, leading to missed extractions. Use `visualize=True` to debug chunking boundaries.
  - LLM Choice: Some LLMs are better at complex extraction tasks than others. Consider using a more powerful model if accuracy is paramount.
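To see concretely why insufficient overlap loses entities, here is a plain-Python demonstration using a naive character chunker (an illustration of the failure mode, not LangExtract’s implementation):

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Naive character chunker, for demonstration only."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "For inquiries contact our CEO, Jane Doe, at the main office."
name = "Jane Doe"

# With no overlap, "Jane Doe" straddles the 32-character boundary: no single
# chunk contains the full name, so a per-chunk extraction would miss it.
no_overlap = chunk_text(text, chunk_size=32, chunk_overlap=0)
print(any(name in chunk for chunk in no_overlap))    # False

# A modest overlap lets at least one chunk see the complete entity.
with_overlap = chunk_text(text, chunk_size=32, chunk_overlap=12)
print(any(name in chunk for chunk in with_overlap))  # True
```

This is exactly the symptom to look for when entities near chunk boundaries come back empty or truncated: the fix is usually a larger `chunk_overlap`, not a different schema.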
Performance Issues (Slow Extraction):
- Pitfall: Your extraction is taking a long time, especially for large documents.
- Troubleshooting:
  - `max_workers`: Increase the `max_workers` parameter in `ExtractionConfig` to allow more parallel LLM calls. Be mindful of rate limits from your LLM provider.
  - `chunk_size`: While smaller chunks can sometimes improve accuracy by keeping context tight, very small chunks mean more LLM calls. Experiment with larger `chunk_size` values if your LLM’s context window allows it.
  - LLM Latency: Some LLMs are inherently slower than others. Consider a faster, potentially cheaper, model for initial passes or less critical extractions.
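Conceptually, `max_workers` bounds how many chunk requests are in flight at once. The plain-Python sketch below mimics that idea with a thread pool and a stand-in function for the LLM call; it is an analogy for the parallelism setting, not LangExtract’s internals:

```python
from concurrent.futures import ThreadPoolExecutor

def fake_llm_call(chunk: str) -> str:
    """Stand-in for a network call to an LLM; just uppercases the chunk."""
    return chunk.upper()

chunks = ["alpha", "beta", "gamma", "delta"]

# At most two calls run concurrently, analogous to max_workers=2.
# pool.map preserves input order regardless of which call finishes first.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(fake_llm_call, chunks))

print(results)  # ['ALPHA', 'BETA', 'GAMMA', 'DELTA']
```

Raising the worker count only helps while you stay under your provider’s rate limits; past that point, extra workers just turn into retries.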
Summary
Congratulations! You’ve navigated the core of the LangExtract API. Here are the key takeaways from this chapter:
- `langextract.extract()` is your primary function for initiating all extraction tasks.
- `text_or_document` provides the raw input data.
- `schema` (an `ExtractionSchema` instance) defines the structure and type of information to be extracted.
- `llm_model` allows you to select a specific LLM from your configured providers for the task.
- `config` (an `ExtractionConfig` instance) offers granular control over document processing, including `chunk_size`, `chunk_overlap`, and `max_workers` for parallelization.
- `visualize=True` enables interactive debugging and understanding of how extractions are grounded in the source text.
- Understanding these parameters is essential for building efficient, accurate, and robust information extraction pipelines.
In the next chapter, we’ll dive deeper into handling long documents, exploring advanced chunking strategies and multi-pass extraction to tackle even the most challenging documents with LangExtract. Get ready to master complex document processing!
References
- LangExtract GitHub Repository
- LangExtract Community Providers Documentation
- Pydantic Documentation
- Towards Data Science: Extracting Structured Data with LangExtract