Introduction: Turning Product Text into Gold
Welcome back, future data wizard! In our journey so far, you’ve mastered the fundamentals of LangExtract, understood how to set up your LLM provider, and crafted basic extraction schemas. Now, it’s time to put that knowledge to the test with a real-world, highly practical project: extracting structured data from e-commerce product listings.
Imagine you’re building a tool to compare prices across different online stores, or perhaps enriching your own product catalog with information scraped from various sources. The raw data often comes as messy, unstructured text – a product name, a description paragraph, a list of features, all jumbled together. Our goal in this chapter is to transform this chaotic text into clean, structured data like product names, prices, descriptions, and key features, using LangExtract’s powerful LLM-orchestrated capabilities. This project will solidify your understanding of schema design, prompt engineering, and handling common data extraction challenges.
Before we dive in, ensure you have your Python environment ready and your preferred LLM provider (like Google’s Gemma/Gemini, OpenAI, or Anthropic) configured as we covered in previous chapters. We’ll be building on that foundation to create a robust extraction pipeline for e-commerce data.
Core Concepts: The E-commerce Extraction Blueprint
Extracting data from e-commerce listings requires a thoughtful approach. We’re not just pulling out random words; we’re looking for specific pieces of information that adhere to a predefined structure.
Understanding E-commerce Product Data
E-commerce product listings can vary widely, but typically contain:
- Product Name: The main identifier.
- Price: A numerical value, often with a currency.
- Description: A paragraph or two detailing the product’s benefits and features.
- Key Features/Specifications: Bullet points or short phrases.
- Brand: The manufacturer’s name.
- Availability: In stock, out of stock, limited quantity.
The challenge is that these pieces of information are rarely in the same place or format from one listing to the next. LangExtract, powered by a Large Language Model (LLM), excels at exactly this kind of semantic understanding and extraction.
Designing the Extraction Schema with Pydantic
The heart of any LangExtract project is the schema. This defines what data you want to extract and how it should be structured. For e-commerce products, we’ll use Pydantic to define a Product model. Pydantic is a fantastic library that provides data validation and settings management using Python type hints, making it perfect for defining our target data structure.
The extraction pipeline has a simple shape: your raw text flows into the LangExtract Extractor, which leverages both your chosen LLM and the Pydantic schema to produce clean, structured output.
The Role of Prompt Engineering
While the Pydantic schema tells LangExtract what to extract, prompt engineering (through instructions) tells the LLM how to interpret the text and where to find the information. You’ll often start with a basic schema and then refine your instructions as you encounter variations in the input text, guiding the LLM to better accuracy.
Step-by-Step Implementation: Building Our Extractor
Let’s get our hands dirty and start coding!
Step 1: Set Up Your Environment
First, ensure you have the necessary libraries installed. As of early 2026, langextract is still actively maintained, and pydantic v2 is the recommended stable version.
```bash
pip install "langextract>=0.1.0" "pydantic>=2.0" "google-generativeai>=0.3.0"  # or your chosen LLM client, e.g. 'openai'
```
We’ll use google-generativeai for this example, but feel free to substitute with your preferred LLM client as shown in previous chapters.
Step 2: Define Your Product Schema
Let’s define our Product Pydantic model. This model will specify the fields we want to extract from each product listing.
Create a new Python file, say ecommerce_extractor.py, and add the following:
```python
from pydantic import BaseModel, Field
from typing import List, Optional

# Our desired schema for an e-commerce product
class Product(BaseModel):
    """Represents a product listing with key details."""

    name: str = Field(description="The full name of the product.")
    brand: Optional[str] = Field(None, description="The brand or manufacturer of the product, if specified.")
    price: float = Field(description="The numerical price of the product.")
    currency: str = Field(description="The currency of the price (e.g., USD, EUR).")
    description: str = Field(description="A detailed description of the product.")
    features: List[str] = Field(
        default_factory=list,
        description="A list of key features or specifications of the product, extracted as short phrases."
    )
    availability: str = Field(description="The current availability status (e.g., 'In Stock', 'Out of Stock', 'Limited').")

print("Product schema defined successfully!")
```
Explanation:
- We import `BaseModel`, `Field`, `List`, and `Optional` from `pydantic` and `typing`.
- `Product` inherits from `BaseModel`, making it a Pydantic model.
- Each attribute (`name`, `price`, etc.) is defined with a type hint (e.g., `str`, `float`, `List[str]`).
- `Field()` lets us attach metadata such as a `description`, which is incredibly useful: LangExtract (and the underlying LLM) can use these descriptions to better understand what each field represents.
- `Optional[str]` indicates a field may be `None` if the information isn’t present.
- `default_factory=list` for `features` ensures it defaults to an empty list if no features are found.
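Before wiring in an LLM, it helps to see Pydantic’s validation working on its own. The standalone sketch below (using an abridged copy of the schema; the sample values are invented) validates a plain dict and shows how a non-numeric price is rejected:

```python
from typing import List, Optional
from pydantic import BaseModel, Field, ValidationError

class Product(BaseModel):
    """Abridged version of the Product schema from Step 2."""
    name: str
    price: float
    currency: str
    features: List[str] = Field(default_factory=list)
    brand: Optional[str] = None

# A well-formed payload: Pydantic coerces the string "199.99" to a float.
ok = Product.model_validate({
    "name": "Acme Smartwatch X1",
    "price": "199.99",
    "currency": "USD",
})
print(ok.price)  # 199.99

# A malformed payload: "two hundred dollars" cannot become a float.
try:
    Product.model_validate({"name": "X", "price": "two hundred dollars", "currency": "USD"})
except ValidationError:
    print("Validation failed as expected")
```

This same validation step is what catches malformed LLM output later, so descriptive field types pay off twice.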
Step 3: Prepare Sample Data
Now, let’s create a few realistic (but simplified) e-commerce product listing texts.
Add these sample texts to your ecommerce_extractor.py file:
```python
# ... (previous code for Product schema) ...

product_listing_1 = """
**Acme Smartwatch X1** - Track your fitness and stay connected.
Price: $199.99 USD. This sleek smartwatch features a vibrant AMOLED display,
heart rate monitoring, GPS, and up to 7 days of battery life.
Water-resistant design. Limited stock available!
"""

product_listing_2 = """
**SuperJuice Blender Pro** by VitaMix.
Blend smoothies, soups, and more with 1500W of power.
Only €149.00. Durable stainless steel blades. 2-year warranty.
Currently In Stock.
"""

product_listing_3 = """
**Eco-Friendly Bamboo Toothbrush Set** (4-pack).
Sustainable dental care. Description: Made from 100% natural bamboo, these toothbrushes
are biodegradable and gentle on your gums. Features: Soft bristles, ergonomic design,
travel-friendly. Only $12.50 USD. Out of Stock until next week.
"""

print("Sample product listings prepared!")
```
Step 4: Instantiate LangExtract and Perform Extraction
Next, we’ll set up our LangExtract Extractor and try to extract data from one of our listings. Remember to replace "YOUR_GEMINI_API_KEY" with your actual API key or configure your environment variable.
Continue adding to ecommerce_extractor.py:
```python
# ... (previous code for Product schema and sample listings) ...
import os

import langextract as lx
from google.generativeai import GenerativeModel

# Configure your LLM client. Store the API key securely -- for example,
# set the GOOGLE_API_KEY environment variable, or call
# google.generativeai.configure(api_key=...). Never hard-code keys in
# source you might commit. This example assumes the key is already set
# via the GOOGLE_API_KEY environment variable.

# Initialize the Gemini model
llm_model = GenerativeModel("gemini-pro")  # or "gemini-1.5-pro", etc.
llm_provider = lx.GoogleGenerativeAI(model=llm_model)

# Create the LangExtract Extractor instance,
# passing our Pydantic Product model as the schema.
extractor = lx.Extractor(
    schema=Product,
    llm=llm_provider,
    instructions=(
        "Extract all relevant product information from the given text. "
        "Pay close attention to price, currency, and availability status."
    ),
)

print("\n--- Extracting from Product Listing 1 ---")
try:
    result_1 = extractor.extract(text=product_listing_1)
    if result_1.extracted_data:
        print("Extraction successful!")
        # model_dump_json is the Pydantic v2 serialization method
        print(result_1.extracted_data.model_dump_json(indent=2))
    else:
        print("No data extracted.")
        print(f"Errors: {result_1.errors}")
except Exception as e:
    print(f"An error occurred during extraction: {e}")

print("\n--- Extracting from Product Listing 2 ---")
try:
    result_2 = extractor.extract(text=product_listing_2)
    if result_2.extracted_data:
        print("Extraction successful!")
        print(result_2.extracted_data.model_dump_json(indent=2))
    else:
        print("No data extracted.")
        print(f"Errors: {result_2.errors}")
except Exception as e:
    print(f"An error occurred during extraction: {e}")
```
Explanation:
- We import `os`, `langextract` as `lx`, and `GenerativeModel` from `google.generativeai`.
- We initialize `GenerativeModel("gemini-pro")`. Model names change over time, so check Google’s documentation for the latest available options.
- `llm_provider = lx.GoogleGenerativeAI(model=llm_model)` wraps our LLM client for LangExtract.
- `extractor = lx.Extractor(...)` creates our extractor; we pass `schema=Product` and our `llm_provider`.
- The `instructions` string is crucial: it tells the LLM what kind of information to look for and how to interpret it.
- `extractor.extract(text=...)` performs the extraction; `result.extracted_data` holds the Pydantic model instance if successful.
- `model_dump_json(indent=2)` pretty-prints Pydantic v2 models.
- We also check `result.errors` to see if any issues occurred during validation or extraction.
Run this script with `python ecommerce_extractor.py` and observe the output. You should see structured JSON for each product!
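If you want to preview the target output shape without making an API call, you can construct a `Product` by hand and serialize it. This standalone sketch uses an abridged copy of the schema, with values transcribed from listing 1 by hand (not produced by the LLM):

```python
from typing import List, Optional
from pydantic import BaseModel, Field

class Product(BaseModel):
    """Abridged schema from Step 2 (description omitted for brevity)."""
    name: str
    brand: Optional[str] = None
    price: float
    currency: str
    features: List[str] = Field(default_factory=list)
    availability: str

sample = Product(
    name="Acme Smartwatch X1",
    price=199.99,
    currency="USD",
    features=["AMOLED display", "heart rate monitoring", "GPS", "7-day battery life"],
    availability="Limited",
)
# Pretty-printed JSON, the same shape the extractor should return
print(sample.model_dump_json(indent=2))
```

Comparing this hand-built JSON against the extractor’s actual output is a quick sanity check for your schema design.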
Step 5: Iteration and Refinement (Prompt Engineering in Action)
What if the initial extraction isn’t perfect? Perhaps availability is sometimes missed, or features aren’t always a clean list. This is where iteration and refining your instructions come in.
Let’s say for product_listing_3, the features might come out as a single string instead of a list, or brand might be missed. We can enhance our instructions.
Modify the extractor instantiation slightly:
```python
# ... (previous code) ...
extractor = lx.Extractor(
    schema=Product,
    llm=llm_provider,
    # Enhanced instructions
    instructions="""
    Extract all product details from the text.
    - Product Name should be concise.
    - Identify the Brand if explicitly mentioned.
    - Price and Currency must be accurately extracted.
    - Description should capture the main selling points.
    - Features should be a list of distinct, short phrases.
    - Availability must be one of: 'In Stock', 'Out of Stock', 'Limited', or a similar clear status.
    If an optional field is truly missing from the text, leave it blank rather than guessing.
    """,
)

print("\n--- Extracting from Product Listing 3 with Refined Instructions ---")
try:
    result_3 = extractor.extract(text=product_listing_3)
    if result_3.extracted_data:
        print("Extraction successful!")
        print(result_3.extracted_data.model_dump_json(indent=2))
    else:
        print("No data extracted.")
        print(f"Errors: {result_3.errors}")
except Exception as e:
    print(f"An error occurred during extraction: {e}")
```
Explanation:
We’ve made the instructions more explicit, guiding the LLM on how to handle each field, especially features and availability. This is a common pattern: start broad, then get specific as you identify areas for improvement.
Step 6: Handling Multiple Listings (Batch Processing)
LangExtract is designed for efficiency. You can process multiple texts in a batch.
Add this section to ecommerce_extractor.py:
```python
# ... (previous code) ...
all_listings = [product_listing_1, product_listing_2, product_listing_3]

print("\n--- Extracting from All Listings in Batch ---")
try:
    # The .extract method can take a list of texts
    batch_results = extractor.extract(text=all_listings)
    for i, result in enumerate(batch_results):
        print(f"\n--- Result for Listing {i+1} ---")
        if result.extracted_data:
            print(result.extracted_data.model_dump_json(indent=2))
        else:
            print(f"No data extracted for listing {i+1}.")
            print(f"Errors: {result.errors}")
except Exception as e:
    print(f"An error occurred during batch extraction: {e}")
```
Explanation:
The extract() method can directly accept a list of strings, returning a list of ExtractionResult objects. This is much more efficient than looping and calling extract() for each item individually, as LangExtract can optimize API calls.
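Once batch results are in hand, a common next step is persisting them. The sketch below assumes each successful result yields a `Product` instance (as in the code above) and writes one JSON object per line, the JSONL format most data pipelines expect. It is self-contained, so it uses an abridged schema and hand-made stand-in values rather than real extraction results:

```python
import json
from typing import List, Optional
from pydantic import BaseModel, Field

class Product(BaseModel):
    """Abridged schema from Step 2."""
    name: str
    price: float
    currency: str
    features: List[str] = Field(default_factory=list)

# Stand-ins for result.extracted_data values from a batch run;
# None represents a listing whose extraction failed.
extracted: List[Optional[Product]] = [
    Product(name="Acme Smartwatch X1", price=199.99, currency="USD"),
    None,
    Product(name="SuperJuice Blender Pro", price=149.00, currency="EUR"),
]

with open("products.jsonl", "w", encoding="utf-8") as f:
    for product in extracted:
        if product is not None:
            f.write(product.model_dump_json() + "\n")

print(f"Wrote {sum(p is not None for p in extracted)} products to products.jsonl")
```

JSONL keeps each record independent, so a single malformed extraction never corrupts the whole file.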
Step 7: Handling Long Descriptions (Chunking)
What if a product description is extremely long, exceeding the LLM’s context window? LangExtract intelligently handles this internally. It employs “smart chunking strategies” to break down large documents into smaller, manageable pieces, processes them (potentially in parallel), and then orchestrates the recombination of results. This means you generally don’t have to manually chunk your input text; LangExtract takes care of it.
For very large documents, you might consider parameters like max_workers in the Extractor if you’re dealing with a local or self-hosted LLM setup that can benefit from parallel processing of chunks. However, for most cloud-based LLMs, LangExtract’s internal management is sufficient.
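LangExtract’s chunking happens internally, but the underlying idea is worth seeing once. The sketch below is a conceptual illustration only (not LangExtract’s actual implementation): split a long text into word windows that overlap, so a fact straddling a boundary still appears whole in at least one chunk:

```python
from typing import List

def chunk_text(text: str, max_words: int = 100, overlap: int = 20) -> List[str]:
    """Split text into windows of at most max_words, each sharing
    `overlap` words with the previous window."""
    words = text.split()
    if len(words) <= max_words:
        return [" ".join(words)]
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

# 250 numbered words -> 3 overlapping chunks
long_description = " ".join(f"w{i}" for i in range(250))
chunks = chunk_text(long_description, max_words=100, overlap=20)
print(len(chunks))  # 3
```

A real pipeline would also merge the per-chunk extractions back together, which is exactly the orchestration LangExtract handles for you.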
Mini-Challenge: Enhance Your Product Extractor!
You’ve done a great job building the foundation. Now, let’s make it even better!
Challenge:
Extend your Product schema and instructions to extract two new pieces of information:
- `rating`: An optional `float` representing the average customer rating (e.g., 4.5 out of 5). If no rating is found, it should be `None`.
- `num_reviews`: An optional `int` representing the total number of customer reviews. If not found, it should be `None`.
Then, add a new sample product listing that includes this information, and try to extract everything!
Hint:
- Remember to add `rating: Optional[float]` and `num_reviews: Optional[int]` to your `Product` Pydantic model.
- Update your `instructions` string in the `Extractor` to explicitly tell the LLM to look for “customer rating” and “number of reviews”.
- Create a new `product_listing_4` string with example rating and review data.
What to Observe/Learn:
- How easily you can extend your schema to capture new data points.
- The importance of clear `instructions` in guiding the LLM to identify new fields.
- How `Optional` fields behave when data is present or absent.
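If you want to check just the schema half of the challenge before touching the extractor, the new fields might look like this (a sketch of one possible solution; the sample values are invented):

```python
from typing import Optional
from pydantic import BaseModel, Field

class Product(BaseModel):
    """Only the challenge-relevant fields shown; merge into your full schema."""
    name: str
    rating: Optional[float] = Field(None, description="Average customer rating, e.g. 4.5 out of 5.")
    num_reviews: Optional[int] = Field(None, description="Total number of customer reviews.")

with_rating = Product(name="Acme Smartwatch X1", rating=4.5, num_reviews=1243)
without_rating = Product(name="Mystery Gadget")
print(with_rating.rating, without_rating.rating)  # 4.5 None
```

Because both fields default to `None`, listings without review data validate cleanly instead of raising errors.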
Common Pitfalls & Troubleshooting
Even with LangExtract’s intelligence, you might encounter issues. Here are a few common ones:
Schema Mismatch / Validation Errors:
- Problem: The LLM extracts data that doesn’t match your Pydantic type (e.g., extracts “two hundred dollars” instead of `200.00` for a `float` field).
- Solution:
  - Refine `instructions`: Be very specific about the format you expect, e.g. “Price should be a numerical value only.”
  - Use Pydantic `Field` descriptions: The descriptions you add to `Field()` are passed to the LLM. Make them helpful!
  - Add examples (few-shot): For complex cases, include `examples` in your `Extractor` instantiation, providing a few input-output pairs. This teaches the LLM the desired format directly.
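Another defensive option (a plain Pydantic sketch, not a LangExtract feature) is to normalize common price formats yourself with a `field_validator`, so strings like `"$199.99"` survive validation instead of failing it:

```python
from pydantic import BaseModel, field_validator

class PricedItem(BaseModel):
    """Hypothetical mini-model demonstrating price normalization."""
    price: float

    @field_validator("price", mode="before")
    @classmethod
    def strip_currency_symbols(cls, v):
        # Accept values like "$199.99" or "€149,00" by stripping leading
        # currency symbols and normalizing the decimal separator.
        if isinstance(v, str):
            cleaned = v.strip().lstrip("$€£").replace(",", ".")
            return float(cleaned)
        return v

print(PricedItem(price="$199.99").price)  # 199.99
```

Note the comma handling is deliberately naive: it would mangle thousands separators like `"1,299.99"`, so a production validator needs locale-aware parsing.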
Incomplete Extraction:
- Problem: Some fields are consistently `None` or empty, even though the information is present in the text.
- Solution:
  - Strengthen `instructions`: Emphasize the importance of the missing field, e.g. “It is CRITICAL to extract the availability status.”
  - Check text variations: Does the text use different phrasing for the same concept (e.g., “in stock,” “available,” “ready to ship”)? Update instructions to cover these.
Hallucinations / Incorrect Data:
- Problem: The LLM invents data for a field or extracts completely wrong information.
- Solution:
  - Specificity in `instructions`: Guide the LLM to extract only what’s present, e.g. “Do not infer or invent information. If a detail is not explicitly stated, leave it blank.”
  - Grounding: For more advanced scenarios, LangExtract supports “grounding,” which ties each extraction back to a specific span of the source text. While beyond this introductory chapter, it’s a powerful feature for ensuring accuracy.
LLM API Rate Limits:
- Problem: You hit the maximum number of requests or tokens per minute for your LLM provider.
- Solution:
  - Batching: Process multiple documents at once using `extractor.extract([text1, text2, ...])`; LangExtract will manage API calls more efficiently.
  - Implement retry logic: Use a library like `tenacity` to automatically retry failed API calls with exponential backoff.
  - Monitor usage: Keep an eye on your LLM provider’s dashboard and consider increasing your quota if needed.
Summary
Congratulations! You’ve just completed a practical LangExtract project, extracting structured data from e-commerce product listings. Here are the key takeaways:
- Schema-Driven Extraction: Pydantic models are fundamental for defining the target structure and providing clear guidance to the LLM.
- Iterative Prompt Engineering: Crafting effective `instructions` is an art. Start simple, observe results, and refine your prompts to improve accuracy and completeness.
- Batch Processing Power: LangExtract efficiently handles lists of texts, making it suitable for processing large datasets.
- Built-in Intelligence: LangExtract’s internal chunking strategies simplify handling long documents, abstracting away LLM context window limitations.
- Troubleshooting is Key: Be prepared to debug and refine your setup as you encounter real-world data variations and LLM behaviors.
This project has demonstrated the power of LangExtract in transforming unstructured text into valuable, usable data. In the next chapters, we’ll explore even more advanced techniques, error handling strategies, and how to integrate LangExtract into larger data pipelines. Keep experimenting and building!
References
- LangExtract GitHub Repository: https://github.com/google/langextract
- Pydantic V2 Documentation: https://docs.pydantic.dev/latest/
- Google AI for Developers (Gemini API Documentation): https://ai.google.dev/
- Mermaid.js Documentation: https://mermaid.js.org/intro/syntax-reference.html
This page is AI-assisted and reviewed. It references official documentation and recognized resources where relevant.