Introduction to Asynchronous Operations

Welcome back, future AI architect! In our journey with any-llm, we’ve learned how to connect to various LLM providers and get intelligent responses. So far, our interactions have been synchronous, meaning one operation completes entirely before the next one begins. While this is straightforward, it’s not always the most efficient, especially when dealing with tasks that involve waiting.

Think about ordering coffee. If you order, then wait for your coffee to be made, then order a pastry, then wait for that to be ready, that’s synchronous. What if you could order both at once, and while the coffee is brewing, the barista starts preparing your pastry? That’s closer to asynchronous!

In this chapter, we’ll dive into the exciting world of asynchronous programming with Python’s asyncio and how any-llm leverages it. This is crucial for building high-performance, responsive applications that interact with large language models, especially when you need to make multiple LLM calls concurrently. By the end, you’ll understand why and how to make your any-llm interactions non-blocking and super efficient!

Before we begin, make sure you’re comfortable with the basics of any-llm from previous chapters, including setting up providers and making basic completion calls. A foundational understanding of Python functions and variables will also be helpful.

Core Concepts: Why Go Asynchronous?

Why should we bother with async and await? The main reason is efficiency, particularly when your program spends a lot of time “waiting.” LLM API calls, like many network requests, are inherently I/O-bound tasks. This means your program sends a request to a server (e.g., OpenAI, Mistral, Ollama) and then waits for that server to process the request and send back a response. During this waiting period, your program isn’t doing much useful work.

Synchronous vs. Asynchronous

Let’s visualize the difference:

flowchart TD
    subgraph Synchronous Execution
        A[Request LLM 1] --> B{Wait for LLM 1}
        B --> C[Process LLM 1 Response]
        C --> D[Request LLM 2]
        D --> E{Wait for LLM 2}
        E --> F[Process LLM 2 Response]
        F --> G[Done]
    end
    subgraph Asynchronous Execution
        H[Initiate LLM 1 Request]
        I[Initiate LLM 2 Request]
        H & I --> J{Wait for ALL LLM Responses}
        J --> K[Process LLM 1 & 2 Responses]
        K --> L[Done]
    end

As you can see, in synchronous execution, each request blocks the program until its response arrives. In asynchronous execution, we can initiate multiple requests and then await their completion concurrently. This doesn’t mean they run in parallel on separate CPU cores (that’s multi-threading/multi-processing), but rather that Python can switch to doing other useful work (like initiating another request) while waiting for an I/O operation to complete. This is often called “concurrent” execution.

Python’s asyncio Basics

Python’s standard library provides asyncio for writing concurrent code using the async/await syntax. This allows a single-threaded program to manage multiple I/O-bound tasks efficiently.

  • async def: This keyword defines a coroutine. A coroutine is a special kind of function that can be paused and resumed. It doesn’t run immediately when called; instead, it returns a coroutine object.
  • await: This keyword can only be used inside an async def function. When await is used before an awaitable (like another coroutine or an I/O operation), it tells Python: “Hey, I’m going to wait here for this task to finish. While I’m waiting, you can go run other async tasks!”
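As a quick sanity check on the "returns a coroutine object" point above, here is a tiny self-contained sketch (the greet function is invented for illustration):

```python
import asyncio

async def greet() -> str:
    return "hello"

coro = greet()                # nothing has run yet: this is a coroutine object
print(type(coro).__name__)    # coroutine
result = asyncio.run(coro)    # the event loop actually drives it to completion
print(result)                 # hello
```

Notice that `print` inside `greet` would not fire until `asyncio.run` (or an `await`) drives the coroutine.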

Let’s look at a simple asyncio example:

import asyncio
import time

# Define an asynchronous function (coroutine)
async def say_hello(delay, message):
    print(f"[{time.time():.2f}] Starting: {message}")
    await asyncio.sleep(delay) # Simulate an I/O bound task (like an LLM call)
    print(f"[{time.time():.2f}] Finished: {message}")
    return f"Result for {message}"

# This is the entry point for running async code
async def main():
    print(f"[{time.time():.2f}] Main function started synchronously.")

    # Call coroutines - they don't run immediately, they return coroutine objects
    task1 = say_hello(2, "Task 1")
    task2 = say_hello(1, "Task 2")

    # Await them concurrently using asyncio.gather
    # This means both tasks start, and we wait for both to complete
    results = await asyncio.gather(task1, task2)

    print(f"[{time.time():.2f}] All tasks finished. Results: {results}")

# Run the main asynchronous function
if __name__ == "__main__":
    # Python 3.7+ uses asyncio.run() to execute the top-level async function
    asyncio.run(main())

Explanation:

  1. We define say_hello as an async def function. Inside it, await asyncio.sleep(delay) simulates a network call where the program waits for delay seconds.
  2. The main function is also an async def function.
  3. task1 = say_hello(2, "Task 1") and task2 = say_hello(1, "Task 2") don’t execute the say_hello function immediately. Instead, they create coroutine objects.
  4. await asyncio.gather(task1, task2) is where the magic happens. It schedules both task1 and task2 to run concurrently. Python will start task1, then task2, and whenever one hits an await (like asyncio.sleep), it can switch to the other.
  5. asyncio.run(main()) is the standard way to run the top-level async function in a Python script.

If you run this code, you’ll notice that “Task 1” and “Task 2” start almost at the same time, and the total execution time is closer to the longest task (2 seconds) rather than the sum of both (3 seconds). This is the power of asyncio!

Step-by-Step Implementation: Asynchronous any-llm Calls

Now, let’s apply this to any-llm. The any-llm library is designed with asyncio in mind, offering asynchronous versions of its core functions. Typically, the main completion or chat interface will have async methods or directly support await.

1. Setting Up Your Environment (Revisit)

First, ensure you have any-llm installed with the providers you intend to use. For this example, let’s assume we’re using mistral and ollama.

pip install 'any-llm-sdk[mistral,ollama]'

Remember to set your API keys as environment variables. For local models like Ollama, ensure it’s running.

export MISTRAL_API_KEY="YOUR_MISTRAL_API_KEY"
# For Ollama, no specific API key is needed, just ensure the server is running.

2. Making a Single Asynchronous Call

any-llm’s completion function can often be awaited directly if you’re within an async context.

Let’s create a new Python file, say async_llm_example.py:

import asyncio
from any_llm import completion
from any_llm.schemas import CompletionRequest

# Ensure environment variables are set for the provider you want to use
# For Mistral: MISTRAL_API_KEY
# For Ollama: No API key, but ensure Ollama server is running
# For OpenAI: OPENAI_API_KEY
# ...and so on for other providers.

async def get_llm_response_async(prompt: str, provider: str) -> str:
    """
    Makes an asynchronous call to any-llm for a completion.
    """
    print(f"[{asyncio.current_task().get_name()}] Requesting from {provider} with prompt: '{prompt[:30]}...'")

    try:
        # The completion function from any_llm is designed to be awaitable
        response = await completion(
            CompletionRequest(
                prompt=prompt,
                model="mistral/mistral-large-latest" if provider == "mistral" else "ollama/llama3", # Example models
                temperature=0.7,
                max_tokens=50
            ),
            provider=provider
        )
        print(f"[{asyncio.current_task().get_name()}] Received response from {provider}.")
        return response.choices[0].text.strip()
    except Exception as e:
        print(f"[{asyncio.current_task().get_name()}] Error from {provider}: {e}")
        return f"Error getting response from {provider}"

async def main():
    print("Starting asynchronous LLM calls...")

    # A single asynchronous call
    prompt1 = "Explain the concept of quantum entanglement in a sentence."
    response1 = await get_llm_response_async(prompt1, "mistral")
    print(f"\nMistral Response: {response1}")

    print("\nAll asynchronous calls completed.")

if __name__ == "__main__":
    asyncio.run(main())

Explanation:

  1. We import asyncio, along with completion and CompletionRequest from any_llm.
  2. get_llm_response_async is an async def function. Inside it, await completion(...) pauses the execution of this coroutine until the LLM provider responds, allowing other async tasks to run.
  3. We use asyncio.current_task().get_name() for clearer output, showing which task is printing.
  4. In main(), we call get_llm_response_async with await to get a single response.
  5. asyncio.run(main()) kicks off our asyncio event loop.

Run this script. You’ll see the output for the single request. This is the foundation.

3. Making Multiple Concurrent Asynchronous Calls

Now, let’s make multiple calls concurrently using asyncio.gather(). This is where the performance benefits truly shine.

Modify your async_llm_example.py main function to look like this:

# ... (imports and get_llm_response_async function remain the same) ...

async def main():
    print("Starting multiple concurrent asynchronous LLM calls...")

    prompts = [
        ("Explain the concept of quantum entanglement in a sentence.", "mistral"),
        ("Write a short, encouraging haiku about learning Python.", "ollama"),
        ("What is the capital of France?", "mistral"),
        ("Generate a random number between 1 and 100.", "ollama")
    ]

    # Create a list of coroutine objects
    tasks = [
        get_llm_response_async(prompt, provider)
        for prompt, provider in prompts
    ]

    # Run all tasks concurrently and await their results
    # asyncio.gather preserves the order of results corresponding to the tasks list
    responses = await asyncio.gather(*tasks)

    print("\n--- All concurrent calls completed ---")
    for i, (prompt, provider) in enumerate(prompts):
        print(f"Prompt {i+1} ({provider}): '{prompt[:30]}...'")
        print(f"Response {i+1}: {responses[i]}\n")

if __name__ == "__main__":
    asyncio.run(main())

Explanation:

  1. We define a list of prompts, each paired with a provider.
  2. We then create a tasks list, where each element is a coroutine object returned by calling get_llm_response_async. Remember, these functions don’t run yet.
  3. await asyncio.gather(*tasks) is the key. It takes all the coroutine objects in tasks, schedules them to run concurrently, and waits until all of them have completed. The * unpacks the list of tasks into individual arguments for gather.
  4. The responses list will contain the results from each coroutine, in the same order as the tasks list.

Run this updated script. You should observe that the “Starting” messages for different prompts appear very quickly, and the “Finished” messages arrive as responses come back, likely overlapping significantly. The total time taken will be closer to the longest individual LLM call, rather than the sum of all of them, demonstrating the power of concurrency!
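To see the timing effect without any API keys at all, here is a self-contained sketch in which asyncio.sleep stands in for network latency (the fake_llm_call helper is invented for illustration, not part of any-llm):

```python
import asyncio
import time

# Stand-in for an LLM call: each "request" just sleeps for its latency.
async def fake_llm_call(name: str, latency: float) -> str:
    await asyncio.sleep(latency)
    return f"{name} done"

async def run_concurrently():
    start = time.perf_counter()
    # All three "requests" are in flight at once.
    results = await asyncio.gather(
        fake_llm_call("call-1", 0.3),
        fake_llm_call("call-2", 0.2),
        fake_llm_call("call-3", 0.1),
    )
    elapsed = time.perf_counter() - start
    return results, elapsed

results, elapsed = asyncio.run(run_concurrently())
print(results)                      # order matches the order passed to gather
print(f"elapsed: {elapsed:.2f}s")   # close to the longest latency (0.3s), not the sum (0.6s)
```

Run sequentially (one `await` after another), the same three calls would take roughly 0.6 seconds instead.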

4. Handling Timeouts

LLM providers can sometimes be slow or unresponsive. It’s good practice to add timeouts to your asynchronous calls to prevent your application from hanging indefinitely. asyncio.wait_for is perfect for this.

Let’s wrap our completion call with asyncio.wait_for.

Modify get_llm_response_async slightly:

# ... (imports remain the same) ...

async def get_llm_response_async(prompt: str, provider: str, timeout_seconds: int = 15) -> str:
    """
    Makes an asynchronous call to any-llm for a completion with a timeout.
    """
    task_name = asyncio.current_task().get_name()
    print(f"[{task_name}] Requesting from {provider} with prompt: '{prompt[:30]}...' (Timeout: {timeout_seconds}s)")

    try:
        response_task = completion(
            CompletionRequest(
                prompt=prompt,
                model="mistral/mistral-large-latest" if provider == "mistral" else "ollama/llama3",
                temperature=0.7,
                max_tokens=50
            ),
            provider=provider
        )
        # Await the response with a timeout
        response = await asyncio.wait_for(response_task, timeout=timeout_seconds)

        print(f"[{task_name}] Received response from {provider}.")
        return response.choices[0].text.strip()
    except asyncio.TimeoutError:
        print(f"[{task_name}] Timeout occurred after {timeout_seconds} seconds for {provider}.")
        return f"Error: Request to {provider} timed out."
    except Exception as e:
        print(f"[{task_name}] Error from {provider}: {e}")
        return f"Error getting response from {provider}"

# ... (main function remains the same, but you can add timeout_seconds to your calls if desired) ...

Explanation:

  1. We added a timeout_seconds parameter to get_llm_response_async.
  2. response_task = completion(...) now creates the coroutine object for the any-llm call without immediately awaiting it.
  3. await asyncio.wait_for(response_task, timeout=timeout_seconds) then awaits this specific coroutine, but it will raise an asyncio.TimeoutError if it doesn’t complete within the specified time.
  4. We added a try...except asyncio.TimeoutError block to gracefully handle these timeouts.

This is a critical best practice for robust production systems!
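The same pattern can be exercised standalone, with asyncio.sleep standing in for a slow provider (slow_operation and call_with_timeout are illustrative names, not any-llm APIs):

```python
import asyncio

async def slow_operation(delay: float) -> str:
    await asyncio.sleep(delay)    # stands in for a slow provider call
    return "finished"

async def call_with_timeout(delay: float, timeout: float) -> str:
    try:
        # wait_for cancels the inner coroutine if the timeout expires
        return await asyncio.wait_for(slow_operation(delay), timeout=timeout)
    except asyncio.TimeoutError:
        return "timed out"

fast = asyncio.run(call_with_timeout(0.05, 1.0))   # completes well within the timeout
slow = asyncio.run(call_with_timeout(1.0, 0.05))   # exceeds the timeout
print(fast, slow)                                  # finished timed out
```

Note that on timeout, asyncio.wait_for also cancels the underlying coroutine, so the abandoned request doesn't keep running in the background.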

Mini-Challenge: Asynchronous Provider Comparison

Let’s put your new async skills to the test!

Challenge: Write an async Python script that performs the following:

  1. Define a single, challenging prompt (e.g., “Summarize the history of artificial intelligence in 100 words.”).
  2. Make concurrent any-llm calls to at least three different providers (e.g., mistral, ollama, openai if you have keys for all).
  3. Each call should have a reasonable timeout (e.g., 20 seconds).
  4. Print the response from each provider, clearly indicating which provider gave which response. Also, note if any provider timed out.
  5. Observe the execution time. Does it feel faster than doing them sequentially?

Hint:

  • You’ll need asyncio.gather() for concurrent execution.
  • Remember to set up environment variables for all your chosen providers.
  • You can reuse the get_llm_response_async function we just modified!

What to Observe/Learn:

  • How efficiently asyncio can handle multiple I/O-bound tasks.
  • The potential differences in response times and quality across various LLM providers for the same prompt when called concurrently.
  • The importance of timeouts in managing external API dependencies.

Common Pitfalls & Troubleshooting

Working with asyncio can sometimes be tricky. Here are some common issues and how to resolve them:

  1. Forgetting await: This is probably the most common mistake. If you call an async def function without awaiting it, you’re not actually running the coroutine; you’re just creating a coroutine object.

    • Symptom: Your async function doesn’t seem to do anything, the output appears instantly without waiting, and Python emits RuntimeWarning: coroutine '...' was never awaited.
    • Fix: Ensure you await all coroutine calls within an async def function (e.g., response = await get_llm_response_async(...)).
  2. Mixing async and sync incorrectly: You cannot directly await an async def function from a regular def function. Similarly, you shouldn’t call blocking sync code directly inside a busy async loop, as it will block the entire event loop.

    • Symptom: SyntaxError: 'await' outside async function, a RuntimeWarning: coroutine '...' was never awaited, or an async program that becomes unresponsive while the blocking code runs.
    • Fix:
      • Use asyncio.run() as the entry point for your top-level async function.
      • If you must run blocking code in an async context, use await asyncio.to_thread(blocking_function, *args).
      • If you need to call async code from a sync context (e.g., in a web framework that’s not fully async), you might use asyncio.run() or asyncio.get_event_loop().run_until_complete() cautiously, but it’s generally better to make the entire call stack asynchronous.
  3. Incorrect asyncio.run() usage: asyncio.run() creates a fresh event loop each time it is called. Call it only from synchronous code at the top level of your application; calling it from code that is already running inside an event loop raises an error.

    • Symptom: RuntimeError: asyncio.run() cannot be called from a running event loop.
    • Fix: Structure your application so that a single asyncio.run(main_async_function()) call starts everything. Inside main_async_function, you then await all other coroutines.
  4. Unhandled Exceptions in asyncio.gather: If one of the coroutines passed to asyncio.gather raises an exception, by default, gather will immediately raise that exception, potentially leaving other tasks incomplete.

    • Symptom: Your application crashes if one LLM call fails, even if others might succeed.
    • Fix: Use the return_exceptions=True parameter in asyncio.gather. This will make gather return the exception objects themselves instead of raising them, allowing you to process results and errors for all tasks.
    # Example for return_exceptions=True
    results_or_exceptions = await asyncio.gather(*tasks, return_exceptions=True)
    for res in results_or_exceptions:
        if isinstance(res, Exception):
            print(f"Task failed with: {res}")
        else:
            print(f"Task succeeded with: {res}")
    
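For pitfall 2’s asyncio.to_thread fix, here is a minimal runnable sketch (blocking_work is a hypothetical stand-in for a synchronous SDK call; asyncio.to_thread requires Python 3.9+):

```python
import asyncio
import time

# Hypothetical stand-in for a blocking, synchronous SDK call.
def blocking_work(n: int) -> int:
    time.sleep(0.1)   # a blocking sleep like this would stall the event loop
    return n * 2

async def main() -> list:
    # asyncio.to_thread runs each blocking call in a worker thread,
    # leaving the event loop free to schedule other tasks meanwhile.
    return await asyncio.gather(
        asyncio.to_thread(blocking_work, 1),
        asyncio.to_thread(blocking_work, 2),
    )

results = asyncio.run(main())
print(results)   # [2, 4]
```

Because both calls run in threads concurrently, the total time is roughly 0.1 seconds rather than 0.2.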

Summary

Phew! You’ve just unlocked a superpower for building responsive and efficient AI applications. Here’s what we covered:

  • Synchronous vs. Asynchronous: Understood the fundamental difference and why asynchronous execution is vital for I/O-bound tasks like LLM API calls.
  • Python asyncio Core: Learned about async def for defining coroutines and await for pausing execution to allow other tasks to run.
  • any-llm Asynchronous API: Saw how any-llm integrates seamlessly with asyncio, allowing you to await its core functions like completion.
  • Concurrent Calls with asyncio.gather: Mastered making multiple LLM calls simultaneously, significantly speeding up applications that require many interactions.
  • Robustness with asyncio.wait_for: Learned to implement timeouts to prevent indefinite hangs, making your applications more resilient.
  • Common Pitfalls: Explored typical mistakes and how to troubleshoot them effectively.

By leveraging asynchronous operations, you can build any-llm powered applications that are faster, more scalable, and provide a much better user experience. In the next chapter, we’ll delve deeper into performance tuning and advanced integration patterns, building on these asynchronous foundations. Get ready to optimize!
