Welcome back, aspiring data extractor! In Chapter 1, you successfully set up your development environment and installed LangExtract. That’s a fantastic first step! But right now, LangExtract is like a powerful car without an engine. It has the structure, but it can’t do anything until we give it the “brain” – a Large Language Model (LLM).
In this chapter, we’re going to connect LangExtract to a real LLM provider. This is where the magic happens! You’ll learn how to securely manage your API keys, configure LangExtract to use different LLM services (like Google’s Gemini or OpenAI’s GPT models), and understand why these steps are absolutely crucial for your extraction tasks. By the end of this chapter, LangExtract will be ready to tap into the intelligence of cutting-edge AI models, setting the stage for some truly amazing data extraction.
The Brain Behind the Extraction: LLM Providers
Think of LangExtract as a sophisticated orchestrator. It doesn’t contain the massive intelligence of a Large Language Model itself. Instead, it acts as a skilled conductor, sending your text and instructions to external LLM services (the “providers”) and then interpreting their responses. These LLM providers host powerful AI models that can understand, reason, and generate human-like text, making them perfect for structured data extraction.
Why Do We Need API Keys? Your Access Pass to AI Power!
When you use an LLM provider like Google or OpenAI, you’re accessing their cloud-based services. To ensure security, track usage, and manage billing, these providers require an API Key. An API Key is essentially a secret token that authenticates your requests, proving that you have permission to use their services.
Why is it a secret? Just like your house key, you wouldn’t leave your API key lying around for anyone to find! If someone gets hold of your API key, they could use your account, potentially incurring significant costs or misusing the AI services under your name. Therefore, managing your API keys securely is paramount.
Environment Variables: The Gold Standard for Secrets
So, how do we use API keys in our code without exposing them directly in our scripts (which would be a massive security risk if you ever shared your code online)? The answer is environment variables.
Environment variables are dynamic named values that can affect the way running processes behave on a computer. They live outside your code, making them a secure and flexible way to store sensitive information like API keys. LangExtract, like many modern Python libraries, is designed to look for these keys in your environment.
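To see what this looks like in practice, here's a short, self-contained Python snippet (the variable names are just for demonstration) showing how environment variables are read, and what happens when one isn't set:

```python
import os

# Simulate a variable being set (normally your shell or a .env file does this).
os.environ["DEMO_API_KEY"] = "sk-demo-123"

# os.getenv returns the value if the variable exists...
print(os.getenv("DEMO_API_KEY"))            # sk-demo-123

# ...and None (or a default you supply) if it doesn't -- no KeyError raised.
print(os.getenv("MISSING_KEY", "not set"))  # not set
```

Because `os.getenv` returns `None` instead of raising, your scripts can check for a missing key and fail with a friendly message rather than a traceback.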
Here’s a simplified view of the connection process: your Python script loads the API key from the environment, LangExtract attaches it to each request it sends to the LLM provider, and the provider sends back the model’s response for LangExtract to interpret.
Step-by-Step Implementation: Getting Connected
Let’s get hands-on and connect LangExtract to an LLM provider. We’ll focus on using environment variables for security and demonstrate with both Google’s Gemini and OpenAI’s GPT models, as these are common choices as of early 2026.
Step 1: Install python-dotenv
First, if you haven’t already, install the python-dotenv library. This handy package makes it super easy to load environment variables from a .env file into your Python application.
```bash
pip install python-dotenv==1.0.1
```
(Version 1.0.1 is a stable, widely used version as of 2026-01-05. Always check for the absolute latest if you encounter issues, but this is a reliable starting point.)
Step 2: Obtain Your LLM Provider API Key
You’ll need an API key from your chosen LLM provider.
For Google Gemini:
- Visit the Google AI Studio or Google Cloud Console.
- Create a new project or select an existing one.
- Navigate to the “API keys” section (often under “Credentials” or “API & Services”).
- Generate a new API key. It will usually start with `AIza...`.
- Official Documentation: Google Cloud API Keys (search for “API Keys” on cloud.google.com if the direct link changes).
For OpenAI GPT:
- Go to the OpenAI platform website (platform.openai.com).
- Log in or sign up.
- Navigate to “API keys” under your user settings.
- Create a new secret key. It will usually start with `sk-...`.
- Official Documentation: OpenAI API Keys
Important: Copy your API key immediately after creation, as some platforms (like OpenAI) only show it once.
Step 3: Create a .env File
In the root directory of your LangExtract project (the same folder where your Python scripts will live), create a new file named .env.
Open this .env file and add your API key(s) like this:
```
# .env file content
GOOGLE_API_KEY="YOUR_GOOGLE_GEMINI_API_KEY_HERE"
OPENAI_API_KEY="YOUR_OPENAI_GPT_API_KEY_HERE"
```
Replace "YOUR_GOOGLE_GEMINI_API_KEY_HERE" and "YOUR_OPENAI_GPT_API_KEY_HERE" with your actual keys! If you’re only using one provider, you only need to include that provider’s key.
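One more safety step worth taking right away: since your `.env` file now holds real secrets, make sure it never gets committed to version control. If you use git, add it to your project’s `.gitignore`:

```
# .gitignore
.env
```

This is a standard convention, and it’s exactly why we keep keys in `.env` instead of hard-coding them in scripts you might share.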
Step 4: Load Environment Variables and Initialize LangExtract
Now, let’s write some Python code to load these environment variables and prepare LangExtract. Create a new Python file, for example, app.py.
```python
# app.py
from dotenv import load_dotenv
import os

import langextract as lx  # confirms the install; we'll use it in the next chapter

# --- Step 1: Load environment variables from the .env file ---
# load_dotenv() looks for a .env file in the current directory and loads
# any key-value pairs found there into this process's environment.
load_dotenv()
print("Environment variables loaded.")

# --- Step 2: Access API keys from environment variables ---
google_api_key = os.getenv("GOOGLE_API_KEY")
openai_api_key = os.getenv("OPENAI_API_KEY")

if not google_api_key and not openai_api_key:
    print("WARNING: No LLM API keys found in environment variables. Extraction will fail.")
else:
    if google_api_key:
        print("Google API Key found.")
    if openai_api_key:
        print("OpenAI API Key found.")

# --- Step 3: LangExtract picks up the keys automatically ---
# LangExtract is designed to be provider-agnostic: there is no global init()
# call for providers. Instead, you name the model you want when you call
# lx.extract(), and LangExtract looks up the corresponding API key in your
# environment (GOOGLE_API_KEY for Google models, OPENAI_API_KEY for OpenAI
# models). The actual extraction call is covered in the next chapter, so for
# now we just confirm that our keys are loaded.

print("\nLangExtract is now ready to use LLM providers based on available API keys.")
print("Proceed to the next chapter to define your first extraction task!")
```
Explanation of the code:

- `from dotenv import load_dotenv`: Imports the function we need to load our `.env` file.
- `import os`: Imports the `os` module, which allows us to interact with the operating system’s environment variables.
- `load_dotenv()`: This is the magic line! It reads your `.env` file and makes its contents available as environment variables within your Python script.
- `os.getenv("VARIABLE_NAME")`: This function retrieves the value of an environment variable. We use it to fetch our `GOOGLE_API_KEY` and `OPENAI_API_KEY`.
- The `if`/`else` block simply checks if the keys were successfully loaded and prints a helpful message.
- Important Note on LangExtract Initialization: Unlike some libraries that require an explicit `init()` call for a provider, LangExtract is designed to be highly flexible. It often determines which provider to use based on the model name you provide during the `lx.extract()` call (e.g., a Gemini model name implies Google, a GPT model name implies OpenAI). It then automatically uses the corresponding API key found in your environment variables. This simplifies the setup process!
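If you’re curious what `load_dotenv()` is doing for you behind the scenes, here’s a rough stdlib-only sketch. This is a deliberate simplification (the real `python-dotenv` also handles comments after values, multiline values, and variable interpolation), and `load_env_file` is just an illustrative name:

```python
import os
import tempfile

def load_env_file(path):
    """Minimal .env loader: one KEY=VALUE per line, '#' comment lines ignored."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            # Strip whitespace and surrounding double quotes from the value.
            os.environ[key.strip()] = value.strip().strip('"')

# Self-contained demo: write a throwaway .env file, then load it.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, ".env")
    with open(path, "w") as f:
        f.write('# demo file\nDEMO_KEY="demo-value"\n')
    load_env_file(path)

print(os.getenv("DEMO_KEY"))  # demo-value
```

Understanding this also explains the troubleshooting advice later in the chapter: if the file isn’t where the loader looks, or a variable name is misspelled, nothing lands in `os.environ`.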
Run this script from your terminal:
```bash
python app.py
```
You should see output similar to this (depending on which keys you set):
```
Environment variables loaded.
Google API Key found.
OpenAI API Key found.

LangExtract is now ready to use LLM providers based on available API keys.
Proceed to the next chapter to define your first extraction task!
```
If you only set one key, you’d only see that one reported. If you forgot to set any, you’d see the warning.
Mini-Challenge: Connect to a Specific Provider
Let’s ensure you’ve got the hang of connecting.
Challenge: Modify your app.py file to explicitly check for only the OPENAI_API_KEY and print a success message if found, otherwise print a message indicating it’s missing. Remove the Google API key check for this challenge.
Hint: You’ll only need to change the if/else block in app.py to focus on openai_api_key.
What to observe/learn: This exercise reinforces how os.getenv() works and how you can target specific environment variables. It also helps you confirm that your API key for OpenAI (or your chosen provider) is correctly loaded.
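If you want to check your answer afterwards, here is one way to structure the check, written as a small function so the logic is easy to verify with plain dicts (`check_openai_key` is just an illustrative name, not part of LangExtract):

```python
import os

def check_openai_key(env=None):
    """Report whether OPENAI_API_KEY is present in the given mapping."""
    env = os.environ if env is None else env
    if env.get("OPENAI_API_KEY"):
        return "OpenAI API Key found."
    return "OPENAI_API_KEY is missing; check your .env file."

# Passing explicit dicts makes the two branches easy to exercise:
print(check_openai_key({"OPENAI_API_KEY": "sk-demo"}))  # OpenAI API Key found.
print(check_openai_key({}))  # OPENAI_API_KEY is missing; check your .env file.

# In app.py you'd simply call it with no argument to read the real environment:
print(check_openai_key())
```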
Common Pitfalls & Troubleshooting
Connecting to external services often comes with a few common hiccups. Here’s how to debug them:
“WARNING: No LLM API keys found…” or `None` from `os.getenv()`:
- Cause: The `.env` file is missing, not in the correct directory (it should be in the same folder you run your Python script from), or the variable name in `.env` doesn’t match what you’re looking for (e.g., `GOOGLE_API_KEY` vs. `GOOGLE_APIKEY`).
- Fix: Double-check the `.env` file’s name and location. Verify variable names are exact (`GOOGLE_API_KEY`, `OPENAI_API_KEY`). Ensure there are no leading/trailing spaces around the variable name or value in the `.env` file. Also, make sure you’ve installed `python-dotenv`.

`KeyError` or `AuthenticationError` (when you start using `lx.extract` later):
- Cause: Your API key is incorrect, expired, revoked, or has insufficient permissions for the model you’re trying to use.
- Fix: Go back to your LLM provider’s console (Google AI Studio/Cloud, OpenAI Platform) and regenerate a new API key. Update your `.env` file with the new key. Also, ensure you have sufficient credits or a valid subscription with the provider.
Network Issues:
- Cause: Your internet connection is down, or there’s a temporary issue with the LLM provider’s service.
- Fix: Check your internet connection. You can also visit the provider’s status page (e.g., Google Cloud Status, OpenAI Status) to see if there are any ongoing outages.
Summary
Phew! You’ve successfully laid the groundwork for powerful AI-driven extraction. Let’s recap what we’ve learned:
- LLM Providers are the brains: LangExtract leverages external LLM services (like Google Gemini and OpenAI GPT) for its intelligence.
- API Keys are your access: These secret tokens authenticate your requests to LLM providers.
- Environment Variables are for security: Storing API keys in `.env` files and loading them with `python-dotenv` is the secure and recommended practice.
- LangExtract’s flexible setup: It automatically uses the appropriate API key from your environment variables based on the LLM model you specify during the `lx.extract()` call.
You’re now ready to move beyond just connecting. In the next chapter, we’ll dive into the exciting world of defining your extraction schema – telling LangExtract exactly what kind of structured data you want it to pull from your unstructured text!
References
- LangExtract GitHub Repository
- Python-dotenv Documentation
- Google Cloud API Keys Documentation
- OpenAI API Keys Documentation
This page is AI-assisted and reviewed. It references official documentation and recognized resources where relevant.