Introduction: Turning Product Text into Gold
Welcome back, future data wizard! In our journey so far, you’ve mastered the fundamentals of LangExtract, understood how to set up your LLM provider, and crafted basic extraction schemas. Now, it’s time to put that knowledge to the test with a real-world, highly practical project: extracting structured data from e-commerce product listings.
Imagine you’re building a tool to compare prices across different online stores, or perhaps enriching your own product catalog with information scraped from various sources. The raw data often comes as messy, unstructured text – a product name, a description paragraph, a list of features, all jumbled together. Our goal in this chapter is to transform this chaotic text into clean, structured data like product names, prices, descriptions, and key features, using LangExtract’s powerful LLM-orchestrated capabilities. This project will solidify your understanding of schema design, prompt engineering, and handling common data extraction challenges.
Before we dive in, ensure you have your Python environment ready and your preferred LLM provider (like Google’s Gemma/Gemini, OpenAI, or Anthropic) configured as we covered in previous chapters. We’ll be building on that foundation to create a robust extraction pipeline for e-commerce data.
Core Concepts: The E-commerce Extraction Blueprint
Extracting data from e-commerce listings requires a thoughtful approach. We’re not just pulling out random words; we’re looking for specific pieces of information that adhere to a predefined structure.
Understanding E-commerce Product Data
E-commerce product listings can vary widely, but typically contain:
- Product Name: The main identifier.
- Price: A numerical value, often with a currency.
- Description: A paragraph or two detailing the product’s benefits and features.
- Key Features/Specifications: Bullet points or short phrases.
- Brand: The manufacturer’s name.
- Availability: In stock, out of stock, limited quantity.
The challenge is that these pieces of information are rarely in the same place or format from one listing to the next. LangExtract, powered by a Large Language Model (LLM), excels at exactly this kind of semantic understanding and extraction.
Designing the Extraction Schema with Pydantic
The heart of any LangExtract project is the schema. This defines what data you want to extract and how it should be structured. For e-commerce products, we’ll use Pydantic to define a Product model. Pydantic is a fantastic library that provides data validation and settings management using Python type hints, making it perfect for defining our target data structure.
The extraction pipeline has a simple shape: your raw text flows into the LangExtract Extractor, which leverages both your chosen LLM and the Pydantic schema to produce clean, structured output.
The Role of Prompt Engineering
While the Pydantic schema tells LangExtract what to extract, prompt engineering (through instructions) tells the LLM how to interpret the text and where to find the information. You’ll often start with a basic schema and then refine your instructions as you encounter variations in the input text, guiding the LLM to better accuracy.
Step-by-Step Implementation: Building Our Extractor
Let’s get our hands dirty and start coding!
Step 1: Set Up Your Environment
First, ensure you have the necessary libraries installed. As of early 2026, langextract is still actively maintained, and pydantic v2 is the recommended stable version.
```bash
pip install "langextract>=0.1.0" "pydantic>=2.0" "google-generativeai>=0.3.0"  # or your chosen LLM client, e.g. 'openai'
```
We’ll use google-generativeai for this example, but feel free to substitute with your preferred LLM client as shown in previous chapters.
Step 2: Define Your Product Schema
Let’s define our Product Pydantic model. This model will specify the fields we want to extract from each product listing.
Create a new Python file, say ecommerce_extractor.py, and add the following:
```python
from pydantic import BaseModel, Field
from typing import List, Optional

# Our desired schema for an e-commerce product
class Product(BaseModel):
    """Represents a product listing with key details."""

    name: str = Field(description="The full name of the product.")
    brand: Optional[str] = Field(None, description="The brand or manufacturer of the product, if specified.")
    price: float = Field(description="The numerical price of the product.")
    currency: str = Field(description="The currency of the price (e.g., USD, EUR).")
    description: str = Field(description="A detailed description of the product.")
    features: List[str] = Field(
        default_factory=list,
        description="A list of key features or specifications of the product, extracted as short phrases."
    )
    availability: str = Field(description="The current availability status (e.g., 'In Stock', 'Out of Stock', 'Limited').")

print("Product schema defined successfully!")
```
Explanation:
- We import `BaseModel`, `Field`, `List`, and `Optional` from `pydantic` and `typing`.
- `Product` inherits from `BaseModel`, making it a Pydantic model.
- Each attribute (`name`, `price`, etc.) is defined with a type hint (e.g., `str`, `float`, `List[str]`).
- `Field()` lets us attach metadata such as a `description`, which is incredibly useful: LangExtract (and the underlying LLM) can use these descriptions to better understand what each field represents.
- `Optional[str]` indicates a field may be `None` if the information isn’t present.
- `default_factory=list` for `features` ensures it defaults to an empty list if no features are found.
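Before wiring in an LLM, it helps to see Pydantic’s validation working on its own. The standalone sketch below (using an abridged copy of the schema; the sample values are invented) validates a plain dict and shows how a non-numeric price is rejected:

```python
from typing import List, Optional
from pydantic import BaseModel, Field, ValidationError

class Product(BaseModel):
    """Abridged version of the Product schema from Step 2."""
    name: str
    price: float
    currency: str
    features: List[str] = Field(default_factory=list)
    brand: Optional[str] = None

# A well-formed payload: Pydantic coerces the string "199.99" to a float.
ok = Product.model_validate({
    "name": "Acme Smartwatch X1",
    "price": "199.99",
    "currency": "USD",
})
print(ok.price)  # 199.99

# A malformed payload: "two hundred dollars" cannot become a float.
try:
    Product.model_validate({"name": "X", "price": "two hundred dollars", "currency": "USD"})
except ValidationError:
    print("Validation failed as expected")
```

This same validation step is what catches malformed LLM output later, so descriptive field types pay off twice.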
Step 3: Prepare Sample Data
Now, let’s create a few realistic (but simplified) e-commerce product listing texts.
Add these sample texts to your ecommerce_extractor.py file:
```python
# ... (previous code for Product schema) ...

product_listing_1 = """
**Acme Smartwatch X1** - Track your fitness and stay connected.
Price: $199.99 USD. This sleek smartwatch features a vibrant AMOLED display,
heart rate monitoring, GPS, and up to 7 days of battery life.
Water-resistant design. Limited stock available!
"""

product_listing_2 = """
**SuperJuice Blender Pro** by VitaMix.
Blend smoothies, soups, and more with 1500W of power.
Only €149.00. Durable stainless steel blades. 2-year warranty.
Currently In Stock.
"""

product_listing_3 = """
**Eco-Friendly Bamboo Toothbrush Set** (4-pack).
Sustainable dental care. Description: Made from 100% natural bamboo, these toothbrushes
are biodegradable and gentle on your gums. Features: Soft bristles, ergonomic design,
travel-friendly. Only $12.50 USD. Out of Stock until next week.
"""

print("Sample product listings prepared!")
```
Step 4: Instantiate LangExtract and Perform Extraction
Next, we’ll set up our LangExtract Extractor and try to extract data from one of our listings. Remember to replace "YOUR_GEMINI_API_KEY" with your actual API key or configure your environment variable.
Continue adding to ecommerce_extractor.py:
```python
# ... (previous code for Product schema and sample listings) ...
import os

import langextract as lx
from google.generativeai import GenerativeModel

# Configure your LLM client. Store the API key securely -- for example,
# set the GOOGLE_API_KEY environment variable, or call
# google.generativeai.configure(api_key=...). Never hard-code keys in
# source you might commit. This example assumes the key is already set
# via the GOOGLE_API_KEY environment variable.

# Initialize the Gemini model
llm_model = GenerativeModel("gemini-pro")  # or "gemini-1.5-pro", etc.
llm_provider = lx.GoogleGenerativeAI(model=llm_model)

# Create the LangExtract Extractor instance,
# passing our Pydantic Product model as the schema.
extractor = lx.Extractor(
    schema=Product,
    llm=llm_provider,
    instructions=(
        "Extract all relevant product information from the given text. "
        "Pay close attention to price, currency, and availability status."
    ),
)

print("\n--- Extracting from Product Listing 1 ---")
try:
    result_1 = extractor.extract(text=product_listing_1)
    if result_1.extracted_data:
        print("Extraction successful!")
        # model_dump_json is the Pydantic v2 serialization method
        print(result_1.extracted_data.model_dump_json(indent=2))
    else:
        print("No data extracted.")
        print(f"Errors: {result_1.errors}")
except Exception as e:
    print(f"An error occurred during extraction: {e}")

print("\n--- Extracting from Product Listing 2 ---")
try:
    result_2 = extractor.extract(text=product_listing_2)
    if result_2.extracted_data:
        print("Extraction successful!")
        print(result_2.extracted_data.model_dump_json(indent=2))
    else:
        print("No data extracted.")
        print(f"Errors: {result_2.errors}")
except Exception as e:
    print(f"An error occurred during extraction: {e}")
```
Explanation:
- We import `os`, `langextract` as `lx`, and `GenerativeModel` from `google.generativeai`.
- We initialize `GenerativeModel("gemini-pro")`. Model names change over time, so check Google’s documentation for the latest available options.
- `llm_provider = lx.GoogleGenerativeAI(model=llm_model)` wraps our LLM client for LangExtract.
- `extractor = lx.Extractor(...)` creates our extractor; we pass `schema=Product` and our `llm_provider`.
- The `instructions` string is crucial: it tells the LLM what kind of information to look for and how to interpret it.
- `extractor.extract(text=...)` performs the extraction; `result.extracted_data` holds the Pydantic model instance if successful.
- `model_dump_json(indent=2)` pretty-prints Pydantic v2 models.
- We also check `result.errors` to see if any issues occurred during validation or extraction.
Run this script with `python ecommerce_extractor.py` and observe the output. You should see structured JSON for each product!
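If you want to preview the target output shape without making an API call, you can construct a `Product` by hand and serialize it. This standalone sketch uses an abridged copy of the schema, with values transcribed from listing 1 by hand (not produced by the LLM):

```python
from typing import List, Optional
from pydantic import BaseModel, Field

class Product(BaseModel):
    """Abridged schema from Step 2 (description omitted for brevity)."""
    name: str
    brand: Optional[str] = None
    price: float
    currency: str
    features: List[str] = Field(default_factory=list)
    availability: str

sample = Product(
    name="Acme Smartwatch X1",
    price=199.99,
    currency="USD",
    features=["AMOLED display", "heart rate monitoring", "GPS", "7-day battery life"],
    availability="Limited",
)
# Pretty-printed JSON, the same shape the extractor should return
print(sample.model_dump_json(indent=2))
```

Comparing this hand-built JSON against the extractor’s actual output is a quick sanity check for your schema design.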
Step 5: Iteration and Refinement (Prompt Engineering in Action)
What if the initial extraction isn’t perfect? Perhaps availability is sometimes missed, or features aren’t always a clean list. This is where iteration and refining your instructions come in.
Let’s say for product_listing_3, the features might come out as a single string instead of a list, or brand might be missed. We can enhance our instructions.
Modify the extractor instantiation slightly:
```python
# ... (previous code) ...
extractor = lx.Extractor(
    schema=Product,
    llm=llm_provider,
    # Enhanced instructions
    instructions="""
    Extract all product details from the text.
    - Product Name should be concise.
    - Identify the Brand if explicitly mentioned.
    - Price and Currency must be accurately extracted.
    - Description should capture the main selling points.
    - Features should be a list of distinct, short phrases.
    - Availability must be one of: 'In Stock', 'Out of Stock', 'Limited', or a similar clear status.
    If an optional field is truly missing from the text, leave it blank rather than guessing.
    """,
)

print("\n--- Extracting from Product Listing 3 with Refined Instructions ---")
try:
    result_3 = extractor.extract(text=product_listing_3)
    if result_3.extracted_data:
        print("Extraction successful!")
        print(result_3.extracted_data.model_dump_json(indent=2))
    else:
        print("No data extracted.")
        print(f"Errors: {result_3.errors}")
except Exception as e:
    print(f"An error occurred during extraction: {e}")
```
Explanation:
We’ve made the instructions more explicit, guiding the LLM on how to handle each field, especially features and availability. This is a common pattern: start broad, then get specific as you identify areas for improvement.
Step 6: Handling Multiple Listings (Batch Processing)
LangExtract is designed for efficiency. You can process multiple texts in a batch.
Add this section to ecommerce_extractor.py:
```python
# ... (previous code) ...
all_listings = [product_listing_1, product_listing_2, product_listing_3]

print("\n--- Extracting from All Listings in Batch ---")
try:
    # The .extract method can take a list of texts
    batch_results = extractor.extract(text=all_listings)
    for i, result in enumerate(batch_results):
        print(f"\n--- Result for Listing {i+1} ---")
        if result.extracted_data:
            print(result.extracted_data.model_dump_json(indent=2))
        else:
            print(f"No data extracted for listing {i+1}.")
            print(f"Errors: {result.errors}")
except Exception as e:
    print(f"An error occurred during batch extraction: {e}")
```
Explanation:
The extract() method can directly accept a list of strings, returning a list of ExtractionResult objects. This is much more efficient than looping and calling extract() for each item individually, as LangExtract can optimize API calls.
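Once batch results are in hand, a common next step is persisting them. The sketch below assumes each successful result yields a `Product` instance (as in the code above) and writes one JSON object per line, the JSONL format most data pipelines expect. It is self-contained, so it uses an abridged schema and hand-made stand-in values rather than real extraction results:

```python
import json
from typing import List, Optional
from pydantic import BaseModel, Field

class Product(BaseModel):
    """Abridged schema from Step 2."""
    name: str
    price: float
    currency: str
    features: List[str] = Field(default_factory=list)

# Stand-ins for result.extracted_data values from a batch run;
# None represents a listing whose extraction failed.
extracted: List[Optional[Product]] = [
    Product(name="Acme Smartwatch X1", price=199.99, currency="USD"),
    None,
    Product(name="SuperJuice Blender Pro", price=149.00, currency="EUR"),
]

with open("products.jsonl", "w", encoding="utf-8") as f:
    for product in extracted:
        if product is not None:
            f.write(product.model_dump_json() + "\n")

print(f"Wrote {sum(p is not None for p in extracted)} products to products.jsonl")
```

JSONL keeps each record independent, so a single malformed extraction never corrupts the whole file.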
Step 7: Handling Long Descriptions (Chunking)
What if a product description is extremely long, exceeding the LLM’s context window? LangExtract intelligently handles this internally. It employs “smart chunking strategies” to break down large documents into smaller, manageable pieces, processes them (potentially in parallel), and then orchestrates the recombination of results. This means you generally don’t have to manually chunk your input text; LangExtract takes care of it.
For very large documents, you might consider parameters like max_workers in the Extractor if you’re dealing with a local or self-hosted LLM setup that can benefit from parallel processing of chunks. However, for most cloud-based LLMs, LangExtract’s internal management is sufficient.
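LangExtract’s chunking happens internally, but the underlying idea is worth seeing once. The sketch below is a conceptual illustration only (not LangExtract’s actual implementation): split a long text into word windows that overlap, so a fact straddling a boundary still appears whole in at least one chunk:

```python
from typing import List

def chunk_text(text: str, max_words: int = 100, overlap: int = 20) -> List[str]:
    """Split text into windows of at most max_words, each sharing
    `overlap` words with the previous window."""
    words = text.split()
    if len(words) <= max_words:
        return [" ".join(words)]
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

# 250 numbered words -> 3 overlapping chunks
long_description = " ".join(f"w{i}" for i in range(250))
chunks = chunk_text(long_description, max_words=100, overlap=20)
print(len(chunks))  # 3
```

A real pipeline would also merge the per-chunk extractions back together, which is exactly the orchestration LangExtract handles for you.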
Mini-Challenge: Enhance Your Product Extractor!
You’ve done a great job building the foundation. Now, let’s make it even better!
Challenge:
Extend your Product schema and instructions to extract two new pieces of information:
- `rating`: An optional `float` representing the average customer rating (e.g., 4.5 out of 5). If no rating is found, it should be `None`.
- `num_reviews`: An optional `int` representing the total number of customer reviews. If not found, it should be `None`.
Then, add a new sample product listing that includes this information, and try to extract everything!
Hint:
- Remember to add `rating: Optional[float]` and `num_reviews: Optional[int]` to your `Product` Pydantic model.
- Update your `instructions` string in the `Extractor` to explicitly tell the LLM to look for “customer rating” and “number of reviews”.
- Create a new `product_listing_4` string with example rating and review data.
What to Observe/Learn:
- How easily you can extend your schema to capture new data points.
- The importance of clear `instructions` in guiding the LLM to identify new fields.
- How `Optional` fields behave when data is present or absent.
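If you want to check just the schema half of the challenge before touching the extractor, the new fields might look like this (a sketch of one possible solution; the sample values are invented):

```python
from typing import Optional
from pydantic import BaseModel, Field

class Product(BaseModel):
    """Only the challenge-relevant fields shown; merge into your full schema."""
    name: str
    rating: Optional[float] = Field(None, description="Average customer rating, e.g. 4.5 out of 5.")
    num_reviews: Optional[int] = Field(None, description="Total number of customer reviews.")

with_rating = Product(name="Acme Smartwatch X1", rating=4.5, num_reviews=1243)
without_rating = Product(name="Mystery Gadget")
print(with_rating.rating, without_rating.rating)  # 4.5 None
```

Because both fields default to `None`, listings without review data validate cleanly instead of raising errors.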
Common Pitfalls & Troubleshooting
Even with LangExtract’s intelligence, you might encounter issues. Here are a few common ones:
Schema Mismatch / Validation Errors:
- Problem: The LLM extracts data that doesn’t match your Pydantic type (e.g., extracts “two hundred dollars” instead of `200.00` for a `float` field).
- Solution:
  - Refine `instructions`: Be very specific about the format you expect, e.g. “Price should be a numerical value only.”
  - Use Pydantic `Field` descriptions: The descriptions you add to `Field()` are passed to the LLM. Make them helpful!
  - Add examples (few-shot): For complex cases, include `examples` in your `Extractor` instantiation, providing a few input-output pairs. This teaches the LLM the desired format directly.
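Another defensive option (a plain Pydantic sketch, not a LangExtract feature) is to normalize common price formats yourself with a `field_validator`, so strings like `"$199.99"` survive validation instead of failing it:

```python
from pydantic import BaseModel, field_validator

class PricedItem(BaseModel):
    """Hypothetical mini-model demonstrating price normalization."""
    price: float

    @field_validator("price", mode="before")
    @classmethod
    def strip_currency_symbols(cls, v):
        # Accept values like "$199.99" or "€149,00" by stripping leading
        # currency symbols and normalizing the decimal separator.
        if isinstance(v, str):
            cleaned = v.strip().lstrip("$€£").replace(",", ".")
            return float(cleaned)
        return v

print(PricedItem(price="$199.99").price)  # 199.99
```

Note the comma handling is deliberately naive: it would mangle thousands separators like `"1,299.99"`, so a production validator needs locale-aware parsing.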
Incomplete Extraction:
- Problem: Some fields are consistently `None` or empty, even though the information is present in the text.
- Solution:
  - Strengthen `instructions`: Emphasize the importance of the missing field, e.g. “It is CRITICAL to extract the availability status.”
  - Check text variations: Does the text use different phrasing for the same concept (e.g., “in stock,” “available,” “ready to ship”)? Update instructions to cover these.
Hallucinations / Incorrect Data:
- Problem: The LLM invents data for a field or extracts completely wrong information.
- Solution:
  - Specificity in `instructions`: Guide the LLM to extract only what’s present, e.g. “Do not infer or invent information. If a detail is not explicitly stated, leave it blank.”
  - Grounding: For more advanced scenarios, LangExtract supports “grounding,” which ties each extraction back to a specific span of the source text. While beyond this introductory chapter, it’s a powerful feature for ensuring accuracy.
LLM API Rate Limits:
- Problem: You hit the maximum number of requests or tokens per minute for your LLM provider.
- Solution:
  - Batching: Process multiple documents at once using `extractor.extract([text1, text2, ...])`; LangExtract will manage API calls more efficiently.
  - Implement retry logic: Use a library like `tenacity` to automatically retry failed API calls with exponential backoff.
  - Monitor usage: Keep an eye on your LLM provider’s dashboard and consider increasing your quota if needed.
Summary
Congratulations! You’ve just completed a practical LangExtract project, extracting structured data from e-commerce product listings. Here are the key takeaways:
- Schema-Driven Extraction: Pydantic models are fundamental for defining the target structure and providing clear guidance to the LLM.
- Iterative Prompt Engineering: Crafting effective `instructions` is an art. Start simple, observe results, and refine your prompts to improve accuracy and completeness.
- Batch Processing Power: LangExtract efficiently handles lists of texts, making it suitable for processing large datasets.
- Built-in Intelligence: LangExtract’s internal chunking strategies simplify handling long documents, abstracting away LLM context window limitations.
- Troubleshooting is Key: Be prepared to debug and refine your setup as you encounter real-world data variations and LLM behaviors.
This project has demonstrated the power of LangExtract in transforming unstructured text into valuable, usable data. In the next chapters, we’ll explore even more advanced techniques, error handling strategies, and how to integrate LangExtract into larger data pipelines. Keep experimenting and building!
References
- LangExtract GitHub Repository: https://github.com/google/langextract
- Pydantic V2 Documentation: https://docs.pydantic.dev/latest/
- Google AI for Developers (Gemini API Documentation): https://ai.google.dev/
- Mermaid.js Documentation: https://mermaid.js.org/intro/syntax-reference.html
This page is AI-assisted and reviewed. It references official documentation and recognized resources where relevant.