Introduction to the LangExtract API

Welcome back, intrepid data explorer! In our previous chapters, we laid the groundwork for using LangExtract by setting up your environment and understanding how to define extraction tasks using schemas. Now, it’s time to get to the heart of the matter: the LangExtract API itself.

This chapter will guide you through the core functions that empower you to perform structured information extraction. We’ll focus primarily on the star of the show: the langextract.extract() function. You’ll learn how to use its various parameters to precisely control your extraction tasks, from specifying your input text to selecting the underlying Large Language Model (LLM) and fine-tuning performance.

Understanding these API functions and their parameters is crucial for building robust, efficient, and accurate extraction workflows. By the end of this chapter, you’ll feel confident in orchestrating LangExtract to perform complex data extraction with clarity and control.

Prerequisites

Before we dive in, make sure you’ve:

  • Successfully installed LangExtract (Chapter 3).
  • Configured at least one LLM provider (e.g., Google AI Studio, OpenAI) and its API key (Chapter 4).
  • Understood how to define an ExtractionSchema using Pydantic (Chapter 5).

Ready? Let’s unlock the power of LangExtract!

The langextract.extract() Function: Your Extraction Command Center

At the core of LangExtract is a single, powerful function: langextract.extract(). This function serves as your primary interface for initiating an information extraction task. It takes your raw text or document, applies your defined schema, and uses an LLM to produce structured data.

Let’s break down its most important parameters. Think of them as the dials and levers you’ll use to steer your extraction process.

flowchart TD
    A["Your Input Text/Document"] -->|"text_or_document"| B("langextract.extract()")
    C["Your Extraction Schema"] -->|"schema"| B
    D["LLM Model Selection"] -->|"llm_model"| B
    E["Advanced Configuration"] -->|"config"| B
    F["Enable Visualization"] -->|"visualize"| B
    B --> G{"Extraction Result"}

Figure 7.1: Overview of the langextract.extract() function and its key inputs.

1. text_or_document: What You Want to Extract From

This is the most fundamental parameter. It’s where you provide the raw data that LangExtract will process.

  • What it is: The source text or document content from which you want to extract information.
  • Why it’s important: Without input, there’s nothing to extract!
  • How it functions:
    • It can be a simple str containing your text.
    • For more complex scenarios (like PDFs or scanned documents), LangExtract can integrate with document loaders, but for now, we’ll focus on string input.

2. schema: What You Want to Extract

You’re already familiar with this one! The schema parameter tells LangExtract exactly what kind of information to look for and how to structure it.

  • What it is: An instance of langextract.schemas.ExtractionSchema, which wraps your Pydantic model definition.
  • Why it’s important: This is your blueprint. It defines the target structure and types of the extracted data.
  • How it functions: LangExtract uses the schema’s description and the Pydantic model’s field names and types to craft effective prompts for the underlying LLM.
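To make this concrete, here is a rough, hypothetical sketch of how field names and type hints can be rendered into prompt instructions. LangExtract's real prompt template is internal and will differ; a plain annotated class is used here (instead of Pydantic) so the sketch needs no extra packages, and `schema_to_prompt` is an illustrative helper, not a LangExtract function.

```python
from typing import Optional, get_type_hints

# Hypothetical illustration: turning a schema's field names and types into
# prompt instructions for an LLM. LangExtract's actual template will differ.
class CompanyContact:
    company_name: str
    contact_person: Optional[str]
    email: Optional[str]

def schema_to_prompt(description: str, model: type) -> str:
    lines = [f"Task: {description}", "Return JSON with these fields:"]
    for name, hint in get_type_hints(model).items():
        lines.append(f"- {name}: {hint}")
    return "\n".join(lines)

prompt = schema_to_prompt(
    "Extract the company name, a primary contact person, and their email.",
    CompanyContact,
)
print(prompt)
```

The takeaway: both the schema's description and the field names themselves carry signal, which is why clear, self-explanatory naming pays off.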

3. llm_model: Choosing Your Brain

LangExtract is LLM-agnostic, meaning it can work with various LLM providers. The llm_model parameter allows you to specify which LLM to use for a particular extraction task.

  • What it is: A string identifier for the LLM model you wish to use. This typically corresponds to a model name provided by your configured LLM provider (e.g., “gemini-pro”, “gpt-4”, “claude-3-opus-20240229”).
  • Why it’s important: Different LLMs have varying capabilities, costs, and performance characteristics. Choosing the right one for your task can significantly impact accuracy and efficiency.
  • How it functions: LangExtract uses this identifier to route the extraction request to the appropriate LLM provider and model, assuming you’ve configured that provider in your environment (as we did in Chapter 4). If not specified, LangExtract might use a default configured model or raise an error if no default is set.
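As a mental model of that routing step, the sketch below maps model-name prefixes to providers. This is illustrative only; LangExtract's actual resolution logic is internal, and `PROVIDER_PREFIXES` and `resolve_provider` are hypothetical names.

```python
# Hypothetical sketch of prefix-based provider routing; not LangExtract code.
PROVIDER_PREFIXES = {
    "gemini": "google",
    "gpt": "openai",
    "claude": "anthropic",
}

def resolve_provider(llm_model: str) -> str:
    for prefix, provider in PROVIDER_PREFIXES.items():
        if llm_model.startswith(prefix):
            return provider
    raise ValueError(
        f"No provider configured for model '{llm_model}'. "
        "Check the model name and your provider setup (see Chapter 4)."
    )

print(resolve_provider("gemini-pro"))              # google
print(resolve_provider("claude-3-opus-20240229"))  # anthropic
```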

4. config: Fine-Tuning the Extraction Process

The config parameter is your gateway to advanced control over how LangExtract processes documents, especially longer ones. It accepts an instance of langextract.config.ExtractionConfig.

  • What it is: An object that holds various settings for the extraction process, such as chunking strategies, parallel processing, and retry logic.
  • Why it’s important: For large documents, a single LLM call is often insufficient or impossible due to context window limits. config allows you to break down the problem intelligently.
  • How it functions:
    • chunk_size: Defines the maximum number of characters (or tokens, depending on the LLM) in each piece of text sent to the LLM.
    • chunk_overlap: Specifies how much overlap there should be between consecutive chunks. This helps maintain context across chunk boundaries.
    • max_workers: Controls the number of parallel LLM calls LangExtract can make, speeding up processing for multi-chunk documents.
    • max_retries: How many times to retry an LLM call if it fails.
    • temperature: (If exposed by LangExtract for the underlying LLM) Controls the “creativity” or randomness of the LLM’s output. Lower values make it more deterministic.
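To build intuition for how chunk_size and chunk_overlap interact, here is a small character-based sliding-window sketch. This is not LangExtract's internal chunker, just an illustration of the idea.

```python
# Illustration of character-based chunking with overlap. NOT LangExtract's
# internal chunker -- just the sliding-window idea behind chunk_size/chunk_overlap.
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new chunk's start advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks

chunks = chunk_text("x" * 500, chunk_size=200, chunk_overlap=50)
print(len(chunks))  # 3 chunks, starting at characters 0, 150, and 300
```

Note how each chunk repeats the last 50 characters of the previous one; that repeated context is what lets an entity straddling a boundary still be seen whole by the LLM.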

5. visualize: Seeing is Believing

LangExtract offers powerful interactive visualization tools to help you understand and debug your extraction results.

  • What it is: A boolean flag (True or False) that, when set to True, enables an interactive visualization of the extraction process.
  • Why it’s important: This feature helps you see exactly which parts of the text contributed to which extracted fields, identify errors, and iterate on your schemas or prompts quickly.
  • How it functions: When visualize=True, LangExtract will typically return a visualization object or launch a local web server displaying the results, highlighting the extracted entities and their source spans in the original text. This is an invaluable tool for debugging and refining your extraction tasks.
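As a rough mental model of what such a view contains, the sketch below (illustrative only, not a LangExtract API) wraps each extracted entity's character span with its field label directly in the source text:

```python
# Illustration only: render extracted entities and their character spans
# inline, similar in spirit to the grounding view visualize=True provides.
def highlight(text: str, spans: list[tuple[int, int, str]]) -> str:
    out, last = [], 0
    for start, end, label in sorted(spans):
        out.append(text[last:start])                 # text before the span
        out.append(f"[{text[start:end]}]({label})")  # the grounded entity
        last = end
    out.append(text[last:])                          # text after the last span
    return "".join(out)

text = "Contact Jane Doe at [email protected]."
spans = [(8, 16, "contact_person"), (20, 37, "email")]
print(highlight(text, spans))
# Contact [Jane Doe](contact_person) at [[email protected]](email).
```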

Step-by-Step Implementation: Using extract()

Let’s put these parameters into practice. We’ll start simple and then gradually add more control.

First, ensure you have your environment set up and an LLM configured. For this example, we’ll assume you’ve configured a Google Gemini Pro model and can refer to it as "gemini-pro".

# Step 1: Import necessary libraries
import langextract as lx
from langextract.schemas import ExtractionSchema
from langextract.config import ExtractionConfig
from pydantic import BaseModel

# Step 2: Define your Pydantic model for the desired output structure
class CompanyContact(BaseModel):
    company_name: str
    contact_person: str | None = None
    email: str | None = None
    phone: str | None = None
    role: str | None = None

# Step 3: Create an ExtractionSchema instance
# This is our blueprint for what to extract.
company_schema = ExtractionSchema(
    name="Company Contact Information",
    description="Extract the company name, a primary contact person, their role, email, and phone number.",
    output_model=CompanyContact
)

# Step 4: Prepare your input text
text_data_short = """
Acme Corp. is pleased to announce a new partnership. For inquiries, please contact
our CEO, Jane Doe, at [email protected] or call +1 (555) 123-4567.
"""

# Step 5: Perform a basic extraction using only text and schema
print("--- Basic Extraction ---")
try:
    # We'll use a placeholder for the LLM model name.
    # In a real scenario, replace "your-configured-llm-model" with your actual model name, e.g., "gemini-pro"
    basic_result = lx.extract(
        text_or_document=text_data_short,
        schema=company_schema,
        llm_model="your-configured-llm-model" # IMPORTANT: Replace with your actual configured LLM model name
    )
    print(basic_result.parsed_output)
except Exception as e:
    print(f"An error occurred during basic extraction: {e}")
    print("Please ensure your LLM model is correctly configured and the name matches.")

Explanation:

  1. We import langextract as lx, ExtractionSchema, ExtractionConfig, and BaseModel from pydantic.
  2. We define CompanyContact, a Pydantic model, specifying the fields we want to extract: company_name, contact_person, email, phone, and role. Notice the | None = None annotations for optional fields, which is good practice (the str | None syntax requires Python 3.10+; on older versions, use Optional[str] from typing).
  3. An ExtractionSchema instance, company_schema, is created, linking our CompanyContact model with a descriptive name and description. This description is vital for the LLM’s understanding.
  4. text_data_short holds the input text for our extraction.
  5. Finally, lx.extract() is called. We pass our text_data_short and company_schema. Crucially, we also explicitly specify llm_model. Remember to replace "your-configured-llm-model" with the actual identifier of an LLM you have configured (e.g., "gemini-pro", "gpt-4"). The .parsed_output attribute of the result object gives us the structured Pydantic model instance.

Adding Advanced Configuration (config)

Now, let’s imagine we have a much longer document. We’ll use ExtractionConfig to manage how LangExtract handles it.

# Step 6: Prepare a longer input text (simulated)
text_data_long = """
This is the first part of a very long report about various companies.
Today, we focus on Innovate Solutions Inc. Their lead engineer, Dr. Sarah Lee,
can be reached at [email protected]. Dr. Lee is a pioneer in AI.
Her direct line is +1 (800) 555-0101.

Later in the report, we discuss Global Dynamics. You can find their CEO,
Mr. John Smith, at [email protected]. His office number is
+1 (800) 555-0102. Mr. Smith often speaks at industry conferences.

This document contains many details, requiring careful chunking to ensure
all relevant information is processed by the LLM without exceeding context windows.
""" * 5 # Repeat to make it artificially long

# Step 7: Create an ExtractionConfig instance
# We'll set a small chunk size for demonstration, and allow parallel processing.
extraction_config = ExtractionConfig(
    chunk_size=200,      # Break text into chunks of 200 characters
    chunk_overlap=50,    # 50 characters overlap between chunks to maintain context
    max_workers=2        # Process up to 2 chunks in parallel (if your LLM provider supports it)
)

# Step 8: Perform extraction with advanced configuration
print("\n--- Extraction with Config (Chunking) ---")
try:
    config_result = lx.extract(
        text_or_document=text_data_long,
        schema=company_schema,
        llm_model="your-configured-llm-model", # IMPORTANT: Replace
        config=extraction_config
    )
    # LangExtract handles aggregation of results from multiple chunks.
    # The 'parsed_output' will contain a list if multiple entities match the schema across chunks.
    print(config_result.parsed_output)
except Exception as e:
    print(f"An error occurred during config extraction: {e}")
    print("Please ensure your LLM model is correctly configured and the name matches.")

Explanation:

  1. We create text_data_long by repeating a paragraph multiple times to simulate a longer document.
  2. An ExtractionConfig object, extraction_config, is created. We set chunk_size to 200 characters (very small for demonstration) and chunk_overlap to 50. We also enable max_workers=2 to show how parallel processing can be configured.
  3. The lx.extract() call now includes the config parameter, passing our extraction_config object. LangExtract will automatically break text_data_long into chunks, send them to the LLM, and merge the results. Even though our CompanyContact schema defines a single entity, LangExtract returns a list of parsed outputs when multiple instances are found across chunks.
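One way to picture that merge step: per-chunk results are deduplicated, so an entity that appears in two adjacent chunks because of chunk_overlap survives only once. The sketch below uses plain dicts and is illustrative, not LangExtract's actual algorithm.

```python
# Illustrative sketch (not LangExtract's actual algorithm) of merging
# per-chunk results while dropping duplicates caused by chunk overlap.
def merge_chunk_results(per_chunk: list[list[dict]]) -> list[dict]:
    seen = set()
    merged = []
    for chunk_results in per_chunk:
        for record in chunk_results:
            key = tuple(sorted(record.items()))  # hashable identity for a record
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged

per_chunk = [
    [{"company_name": "Innovate Solutions Inc.", "email": "[email protected]"}],
    [{"company_name": "Innovate Solutions Inc.", "email": "[email protected]"},  # duplicate from overlap
     {"company_name": "Global Dynamics", "email": "[email protected]"}],
]
print(len(merge_chunk_results(per_chunk)))  # 2 unique contacts
```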

Enabling Interactive Visualization

To see how LangExtract processed the information, especially when debugging, the visualize=True parameter is incredibly useful.

# Step 9: Perform extraction with visualization enabled
print("\n--- Extraction with Visualization ---")
# Using the shorter text for a clearer visualization example
try:
    # Note: The exact behavior of 'visualize=True' might depend on your environment
    # and LangExtract version. It often launches a browser tab or returns a special object.
    visual_result = lx.extract(
        text_or_document=text_data_short,
        schema=company_schema,
        llm_model="your-configured-llm-model", # IMPORTANT: Replace
        visualize=True
    )
    print("Visualization enabled. Check your browser or the returned object for interactive results.")
    # In a real application, you might need to call a method on visual_result
    # to explicitly open the visualization, e.g., visual_result.display()
    # For now, we'll just print its type to show it's not a simple Pydantic model.
    print(f"Type of visual_result: {type(visual_result)}")
    # The actual parsed output is usually still accessible, e.g., visual_result.parsed_output
    if hasattr(visual_result, 'parsed_output'):
        print(f"Parsed output from visual_result: {visual_result.parsed_output}")

except Exception as e:
    print(f"An error occurred during visualization extraction: {e}")
    print("Ensure your LLM model is configured and any necessary display libraries are installed.")

Explanation:

  1. We call lx.extract() again, this time setting visualize=True.
  2. When visualize is enabled, LangExtract returns a special result object that contains the parsed data and the information needed to render an interactive view. The exact way this visualization is displayed (e.g., automatically opening a browser tab, requiring a method call) can vary, so the print statement indicates what to expect. This is a powerful debugging tool to verify source grounding.

Mini-Challenge: Customize Your Extraction

You’ve seen the core parameters in action. Now it’s your turn to experiment!

Challenge: Take the text_data_long from our previous example.

  1. Modify the ExtractionConfig to use a chunk_size of 150 and a chunk_overlap of 75.
  2. If you have multiple LLM models configured (e.g., a faster, cheaper one for drafts and a more powerful one for final passes), try switching the llm_model parameter to a different model identifier. If not, just stick with your primary model.
  3. Run the extraction and observe the parsed_output. Does changing the chunking parameters affect the results, especially if information spans across chunk boundaries?

Hint: Remember to replace "your-configured-llm-model" with your actual LLM model identifier. Pay close attention to the ExtractionConfig parameters.

What to observe/learn:

  • How subtle changes in chunk_size and chunk_overlap can influence the LLM’s ability to capture complete entities that might be split across chunks.
  • The impact of different LLM models on extraction quality (if you were able to switch models).

Common Pitfalls & Troubleshooting

  1. “LLM Model Not Found/Configured” Error:

    • Pitfall: You specified an llm_model name that either isn’t configured in your environment variables or isn’t recognized by LangExtract for your setup.
    • Troubleshooting: Double-check your environment variables (e.g., GOOGLE_API_KEY, OPENAI_API_KEY) and ensure they are loaded. Verify the llm_model string matches an available model from your configured provider (e.g., "gemini-pro" for Google, "gpt-3.5-turbo" for OpenAI). Refer back to Chapter 4 for LLM provider setup.
  2. Incomplete or Incorrect Extraction Results:

    • Pitfall: The LLM isn’t extracting all the information you expect, or it’s making mistakes.
    • Troubleshooting:
      • Schema Review: Is your ExtractionSchema’s description clear and precise? Are your Pydantic model’s field names self-explanatory? Ambiguous descriptions lead to poor results.
      • Input Text Quality: Is the information actually present in the text_or_document?
      • Chunking Issues: For long documents, if chunk_size is too small or chunk_overlap is insufficient, crucial context might be lost between chunks, leading to missed extractions. Use visualize=True to debug chunking boundaries.
      • LLM Choice: Some LLMs are better at complex extraction tasks than others. Consider using a more powerful model if accuracy is paramount.
  3. Performance Issues (Slow Extraction):

    • Pitfall: Your extraction is taking a long time, especially for large documents.
    • Troubleshooting:
      • max_workers: Increase the max_workers parameter in ExtractionConfig to allow more parallel LLM calls. Be mindful of rate limits from your LLM provider.
      • chunk_size: While smaller chunks can sometimes improve accuracy by keeping context tight, very small chunks mean more LLM calls. Experiment with larger chunk_size values if your LLM’s context window allows it.
      • LLM Latency: Some LLMs are inherently slower than others. Consider a faster, potentially cheaper, model for initial passes or less critical extractions.
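For the first pitfall, a small pre-flight check lets you fail fast before any LLM call is made. The helper below is an illustration, not part of LangExtract; the environment-variable names follow common provider conventions.

```python
import os

# Illustration, not a LangExtract API: verify the API key for your chosen
# provider exists before calling lx.extract(). Env-var names follow common
# provider conventions.
REQUIRED_KEY = {
    "google": "GOOGLE_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}

def check_api_key(provider: str, env=os.environ) -> bool:
    var = REQUIRED_KEY.get(provider)
    if var is None:
        raise ValueError(f"Unknown provider: {provider}")
    return bool(env.get(var))

# Passing an explicit dict makes the check easy to test:
print(check_api_key("openai", {"OPENAI_API_KEY": "sk-..."}))  # True
print(check_api_key("google", {}))                            # False
```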

Summary

Congratulations! You’ve navigated the core of the LangExtract API. Here are the key takeaways from this chapter:

  • langextract.extract() is your primary function for initiating all extraction tasks.
  • text_or_document provides the raw input data.
  • schema (an ExtractionSchema instance) defines the structure and type of information to be extracted.
  • llm_model allows you to select a specific LLM from your configured providers for the task.
  • config (an ExtractionConfig instance) offers granular control over document processing, including chunk_size, chunk_overlap, and max_workers for parallelization.
  • visualize=True enables interactive debugging and understanding of how extractions are grounded in the source text.
  • Understanding these parameters is essential for building efficient, accurate, and robust information extraction pipelines.

In the next chapter, we’ll dive deeper into handling long documents, exploring advanced chunking strategies and multi-pass extraction to tackle even the most challenging documents with LangExtract. Get ready to master complex document processing!
