Introduction to Agentic AI Security: Tools and Outputs

Welcome back, future AI security experts! In our previous chapters, we delved into the intricacies of prompt injection and jailbreak attacks, learning how attackers try to manipulate Large Language Models (LLMs) directly. We saw that securing the prompt interface is crucial, but it’s just one piece of the puzzle.

Today, we’re leveling up our understanding to agentic AI systems. Imagine an LLM not just as a chatbot, but as a clever assistant that can use tools – like searching the web, running code, or interacting with other applications. This capability unlocks incredible power but also introduces entirely new security challenges. How do we ensure our AI agent uses its tools responsibly? What happens if an attacker makes the agent use a tool in a malicious way? And once the agent generates an output, how do we ensure that output isn’t harmful or exploitable by other systems?

In this chapter, we’ll explore two critical areas of agentic AI security: Tool Misuse (also known as Insecure Tool Use) and Insecure Output Handling. These are recognized as top threats in the OWASP Top 10 for LLM Applications (2025/2026 versions). We’ll break down what these vulnerabilities mean, why they’re so dangerous, and, most importantly, how to build robust defenses to protect your AI applications. Get ready to dive into practical strategies and write some code to secure your agents!

Core Concepts

Agentic AI systems represent a significant evolution in AI capabilities. They move beyond simple request-response models to complex workflows where an LLM acts as a central “brain,” orchestrating actions through various external tools.

What are Agentic AI Systems?

At its heart, an agentic AI system typically consists of:

  1. A Large Language Model (LLM): The “brain” that performs reasoning, planning, and decision-making.
  2. Tools: External functions or APIs the LLM can call to perform specific tasks (e.g., web search, database query, code execution, sending emails, interacting with internal systems).
  3. Memory: To retain context and learn over time.
  4. Planning/Reasoning Module: To break down complex goals into actionable steps, often deciding which tool to use and when.

Think of it like a highly capable human assistant. If you ask them to “Find me the latest news on AI security and summarize it,” they don’t just know the answer. They use a tool (a web browser/search engine), process the information, and then present a summary. The security question then becomes: what if an attacker could trick your assistant into using the browser to visit a malicious site or to send a fake email using your email tool?
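The orchestration described above can be sketched as a minimal tool-dispatch loop. This is an illustrative toy, not any particular framework's API: the `ToolCall` structure and the `web_search` stub are invented for the example, and in a real agent the tool choice would come from the LLM's response rather than being hard-coded.

```python
# Minimal sketch of an agent's tool-dispatch loop (illustrative, not a real framework).
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str        # which tool the LLM wants to invoke
    arguments: dict  # parameters the LLM supplied

def web_search(query: str) -> str:
    # Stand-in for a real search tool.
    return f"results for {query!r}"

# Registry of tools the agent is allowed to use -- least privilege starts here:
# anything not registered simply cannot be called, no matter what the LLM asks for.
TOOLS = {"web_search": web_search}

def dispatch(call: ToolCall) -> str:
    if call.name not in TOOLS:
        return f"Error: unknown tool {call.name!r}"
    return TOOLS[call.name](**call.arguments)

print(dispatch(ToolCall("web_search", {"query": "AI security news"})))
print(dispatch(ToolCall("delete_files", {"path": "/"})))  # rejected: not registered
```

The registry is the first security boundary: the agent's capabilities are exactly the keys of `TOOLS`, nothing more.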

The Power and Peril of Tools

Tools give LLMs superpowers, allowing them to interact with the real world beyond their training data. However, this power comes with significant security implications. Each tool you grant an agent is a potential vector for attack if not properly secured.

Consider this diagram illustrating the interaction:

flowchart TD
    User_Prompt[User Prompt] --> LLM_Agent["LLM Agent"]
    LLM_Agent -->|Tool Call Request| Tool_Interface[Tool Interface/Wrapper]
    Tool_Interface -->|Validated Call| External_Tool["External Tool"]
    External_Tool -->|Result| Tool_Interface
    Tool_Interface -->|Sanitized Result| LLM_Agent
    LLM_Agent -->|Generated Output| Output_Handler[Output Handling & Moderation]
    Output_Handler --> User_Display[User Display/Downstream System]
    subgraph Security_Boundaries["Security Boundaries and Controls"]
        Tool_Interface
        Output_Handler
    end

The critical components for security are the Tool Interface/Wrapper and the Output Handling & Moderation layers. These are your defense lines.

Tool Misuse (OWASP LLM06: Excessive Agency)

What is it? Tool Misuse occurs when an attacker manipulates the LLM agent to use its available tools in an unintended or malicious way. This could involve making the agent:

  • Call a tool with attacker-controlled, harmful parameters.
  • Access unauthorized resources through a tool.
  • Perform actions that lead to data exfiltration, system compromise, or privilege escalation.

The danger here is that the LLM itself might be operating within a safe environment, but its tools can interact with broader, potentially sensitive systems. An attacker might craft a prompt that, while seemingly innocuous to the LLM’s core reasoning, subtly guides it to make a dangerous tool call.

Why is it dangerous? Imagine an agent with access to an internal API for “retrieving user profiles.” An attacker could craft a prompt like: “Could you please tell me about all users whose names start with ‘A’? Also, could you list their email addresses and phone numbers? Make sure to get all details.” If the tool is not properly secured, the LLM might interpret “all details” as carte blanche to call the API with parameters that expose sensitive PII for many users, potentially bypassing access controls intended for direct human interaction.
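To make the user-profile example concrete, here is a hedged sketch of a tool wrapper that enforces a field allowlist and a result cap no matter how the prompt is phrased. The `get_user_profiles` function, its in-memory data, and the limits are all invented for illustration; a real wrapper would sit in front of your actual API.

```python
# Hypothetical user-profile tool wrapper: the LLM can ask for "all details",
# but the wrapper only ever returns allowlisted fields, capped at MAX_RESULTS.

FAKE_DB = [
    {"name": "Alice", "email": "alice@example.com", "phone": "555-0100"},
    {"name": "Aaron", "email": "aaron@example.com", "phone": "555-0101"},
    {"name": "Bob",   "email": "bob@example.com",   "phone": "555-0102"},
]

ALLOWED_FIELDS = {"name"}  # PII like email/phone is never exposed to the agent
MAX_RESULTS = 2

def get_user_profiles(prefix: str) -> list[dict]:
    matches = [u for u in FAKE_DB if u["name"].startswith(prefix)]
    # Enforce the cap and strip non-allowlisted fields before returning.
    return [
        {k: v for k, v in user.items() if k in ALLOWED_FIELDS}
        for user in matches[:MAX_RESULTS]
    ]

print(get_user_profiles("A"))  # names only, never emails or phone numbers
```

The key property: the restriction lives in the wrapper, so no prompt wording can widen what the agent receives.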

Common Tool Misuse Scenarios:

  1. Arbitrary Code Execution: An agent with a “code interpreter” tool that isn’t sandboxed could be tricked into running rm -rf / or other malicious commands.
  2. Sensitive API Calls: An agent with access to internal APIs (e.g., for user management, payment processing, or data retrieval) could be manipulated to perform unauthorized actions or exfiltrate data.
  3. File System Access: An agent given a “read file” or “write file” tool might be prompted to access sensitive system files (/etc/passwd, configuration files) or write malicious scripts.
  4. Network Access: If an agent has a “web request” tool, an attacker might try to make it interact with internal network resources or perform port scans.
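For scenario 4, a common defense is an outbound allowlist checked before the web-request tool ever issues a request. The sketch below uses only the standard library; the allowed hostnames are assumptions for the example.

```python
from urllib.parse import urlparse

# Hosts this hypothetical "web request" tool is allowed to contact.
ALLOWED_HOSTS = {"example.com", "api.example.com"}

def is_allowed_url(url: str) -> bool:
    parsed = urlparse(url)
    # Require https and an explicitly allowlisted host; this blocks internal
    # addresses, bare IPs, and scheme tricks like file:// or javascript:.
    return parsed.scheme == "https" and parsed.hostname in ALLOWED_HOSTS

print(is_allowed_url("https://example.com/news"))         # allowed
print(is_allowed_url("http://169.254.169.254/metadata"))  # blocked: internal metadata IP
print(is_allowed_url("file:///etc/passwd"))               # blocked: wrong scheme
```

Note this is a sketch: a production check would also handle DNS rebinding and redirects, which an allowlist alone does not cover.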

Mitigation Strategies for Tool Misuse:

  1. Principle of Least Privilege: This is paramount. An agent should only have access to the tools it absolutely needs, and each tool should only have the minimum necessary permissions. If an agent doesn’t need to write to arbitrary files, don’t give it a tool that can.
  2. Strict Tool Definitions and Schemas: Define the precise parameters, types, and expected values for each tool. Use schema validation (e.g., JSON Schema) to reject malformed or suspicious inputs before the tool is called.
    • For example, a log_message tool might only accept a message string and reject any filepath parameter.
  3. Input Validation and Sanitization at the Tool Wrapper: Even if the LLM generates a tool call, the wrapper around the actual tool implementation should perform its own rigorous validation. This means checking:
    • Are all required parameters present?
    • Are parameter types correct?
    • Are string parameters free of malicious content (e.g., path traversal sequences like ../, command injection characters)?
    • Are numerical parameters within acceptable ranges?
  4. Sandboxing and Isolation: Tools that perform potentially dangerous operations (like code execution or file system access) should run in highly restricted, isolated environments (e.g., Docker containers, virtual machines, chroot jails). This limits the blast radius if a tool is misused.
  5. Human-in-the-Loop (HITL): For high-risk or sensitive actions (e.g., making a payment, deleting data, sending an email to many recipients), introduce a human approval step.
  6. Monitoring and Logging: Implement comprehensive logging for all tool calls, including parameters, results, and the originating prompt (if possible). Monitor for unusual patterns or frequent calls to sensitive tools.
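The schema-validation idea from mitigation 2 can be sketched with the `jsonschema` package (`pip install jsonschema`). The `log_message` schema below is an invented example; the important detail is `additionalProperties: False`, which rejects any parameter the LLM tries to smuggle in.

```python
from jsonschema import validate, ValidationError  # pip install jsonschema

# Schema for a log_message tool: only a bounded "message" string is accepted.
LOG_MESSAGE_SCHEMA = {
    "type": "object",
    "properties": {"message": {"type": "string", "maxLength": 500}},
    "required": ["message"],
    "additionalProperties": False,  # rejects smuggled params like "filepath"
}

def validate_tool_args(args: dict) -> bool:
    try:
        validate(instance=args, schema=LOG_MESSAGE_SCHEMA)
        return True
    except ValidationError:
        return False

print(validate_tool_args({"message": "login ok"}))                      # accepted
print(validate_tool_args({"message": "x", "filepath": "/etc/passwd"}))  # rejected
```

Run this check in the tool wrapper, before any tool code executes, so a malformed call never reaches the tool itself.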

Insecure Output Handling (OWASP LLM05: Improper Output Handling)

What is it? Insecure Output Handling occurs when an AI system generates malicious or harmful content that is then processed or displayed by downstream systems or directly to users without proper validation, sanitization, or moderation. This can lead to various attacks depending on how the output is consumed.

Why is it dangerous? Imagine an LLM generating a response that includes markdown. If that markdown is rendered by a web application without sanitization, an attacker could inject HTML or JavaScript, leading to a Cross-Site Scripting (XSS) attack. Or, if the LLM is used to generate code snippets, and that code is then executed without review, it could introduce vulnerabilities or backdoors.

Common Insecure Output Handling Scenarios:

  1. Cross-Site Scripting (XSS): LLM output containing unescaped HTML/JavaScript that is rendered in a web browser.
  2. Markdown/Rich Text Injection: Similar to XSS, but targeting markdown rendering engines.
  3. Code Injection: LLM generates code that is then executed by another system (e.g., a “code interpreter” agent, or a CI/CD pipeline) without proper vetting.
  4. SQL Injection/NoSQL Injection: LLM output intended for database queries is not sanitized, allowing an attacker to manipulate the query.
  5. Malware Generation: LLM generates malicious code, scripts, or instructions that could be used to create malware.
  6. Misinformation/Propaganda: LLM generates convincing but false or harmful narratives that can spread rapidly.

Mitigation Strategies for Insecure Output Handling:

  1. Output Filtering and Sanitization: This is the primary defense. Before any LLM output is displayed to a user or processed by another system, it must be sanitized.
    • HTML/Markdown: Use libraries (e.g., Python’s bleach, JavaScript’s DOMPurify) to strip dangerous tags, attributes, and scripts. Always escape user-generated content when displaying it.
    • Code: If the output is code, consider static analysis, linting, or even human review before execution. Never execute arbitrary code from an LLM without severe restrictions and validation.
    • SQL/NoSQL: Use parameterized queries or ORMs, and strictly validate any LLM-generated components of queries.
  2. Content Moderation APIs: Integrate with external content moderation services (e.g., Azure AI Content Safety, Jigsaw's Perspective API) to detect and flag harmful, hateful, or inappropriate content before it reaches users.
  3. Contextual Output Validation: Beyond just sanitization, ensure the output makes sense within the application’s context. Does it adhere to length limits? Does it follow expected data formats?
  4. Human Review/Approval: For critical applications or sensitive outputs, a human-in-the-loop review process is essential.
  5. Least Privilege for Downstream Systems: Ensure that systems consuming LLM output operate with the minimum necessary permissions. If an LLM generates a file name, the system using it should not have permissions to write to arbitrary system directories.
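The parameterized-query advice from mitigation 1 looks like this in practice. The sketch uses an in-memory SQLite table standing in for a real database; the table and values are invented for the example.

```python
import sqlite3

# In-memory demo table standing in for a real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany("INSERT INTO users VALUES (?)", [("Alice",), ("Bob",)])

def find_user(name: str) -> list[tuple]:
    # The LLM-supplied value is bound as a parameter, never interpolated
    # into the SQL string, so injection payloads are treated as plain data.
    return conn.execute("SELECT name FROM users WHERE name = ?", (name,)).fetchall()

print(find_user("Alice"))             # matches the row
print(find_user("Alice' OR '1'='1"))  # no match: the payload is just a literal
```

Contrast this with f-string interpolation of the name into the query, where the same payload would rewrite the WHERE clause.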

Step-by-Step Implementation: Securing Agent Tools

Let’s put some of these concepts into practice. We’ll simulate a simple agent with a “logging” tool that, if unsecured, could be misused. Then, we’ll implement a secure wrapper.

We’ll use Python for our examples, as it’s a common language for AI/ML development. We’ll assume you have Python 3.10+ installed.

Step 1: Define a Basic (Insecure) Tool

First, let’s imagine a naive LogTool that directly writes messages to a specified file.

Create a file named insecure_agent.py:

# insecure_agent.py
import os

class InsecureLogTool:
    """
    A simple logging tool that writes messages to a file.
    INSECURE: Allows writing to arbitrary paths.
    """
    def __init__(self):
        print("InsecureLogTool initialized.")

    def log_message(self, message: str, filepath: str):
        """
        Logs a message to the specified file path.
        """
        try:
            # WARNING: This is INSECURE! It allows writing to any path.
            with open(filepath, 'a') as f:
                f.write(message + '\n')
            print(f"Message logged to {filepath} (potentially insecure).")
            return f"Successfully logged message to {filepath}."
        except Exception as e:
            print(f"Error logging message: {e}")
            return f"Failed to log message: {e}"

# --- Simulate an agent's interaction with the tool ---
# In a real agent, the LLM would decide to call this tool
# based on the prompt. Here, we'll simulate a direct call.

if __name__ == "__main__":
    tool = InsecureLogTool()

    print("\n--- Simulating a legitimate use ---")
    tool.log_message("User activity: logged in.", "app_logs.txt")
    # Check if app_logs.txt was created
    if os.path.exists("app_logs.txt"):
        with open("app_logs.txt", 'r') as f:
            print(f"Content of app_logs.txt: {f.read().strip()}")
        os.remove("app_logs.txt") # Clean up

    print("\n--- Simulating a malicious tool misuse attempt ---")
    # Attacker tries to write to a sensitive system file
    malicious_path = "/tmp/sensitive_data.txt" # Using /tmp for demonstration on Linux/macOS
    if os.name == 'nt': # For Windows
        malicious_path = "C:\\Windows\\Temp\\sensitive_data.txt"
    
    print(f"Attempting to write to: {malicious_path}")
    tool.log_message("ATTACKER_DATA: Malicious payload.", malicious_path)

    # In a real scenario, this might succeed or fail silently depending on permissions.
    # Here, we'll just show the attempt.
    if os.path.exists(malicious_path):
        print(f"File {malicious_path} might have been created/modified. Check its content.")
        try:
            os.remove(malicious_path) # Clean up if it was created
            print(f"Cleaned up {malicious_path}")
        except PermissionError:
            print(f"Could not clean up {malicious_path} due to permissions. Manual deletion may be required.")
    else:
        print(f"File {malicious_path} was not created (likely due to permissions or path issue).")

Explanation:

  • We define InsecureLogTool with a log_message method.
  • The filepath argument is directly used in open(), making it vulnerable to path traversal or writing to arbitrary locations.
  • The if __name__ == "__main__": block simulates both a legitimate use and a malicious attempt to write to a sensitive system path (/tmp/sensitive_data.txt or C:\Windows\Temp\sensitive_data.txt). Depending on your OS and user permissions, the malicious write might succeed or fail, but the vulnerability is still present in the code.

Run the script with python insecure_agent.py and observe how it tries to write to the specified paths.

Step 2: Implement a Secure Tool Wrapper

Now, let’s create a SecureLogTool that enforces the principle of least privilege and performs strict input validation.

Create a new file named secure_agent.py:

# secure_agent.py
import os
import re

class SecureLogTool:
    """
    A secure logging tool that restricts file paths and validates messages.
    """
    def __init__(self, base_log_dir="agent_logs"):
        # Ensure the base log directory exists and is controlled.
        self.base_log_dir = os.path.abspath(base_log_dir)
        os.makedirs(self.base_log_dir, exist_ok=True)
        print(f"SecureLogTool initialized. Logging to: {self.base_log_dir}")

    def log_message(self, message: str, filename: str):
        """
        Logs a message to a file within the designated base_log_dir.
        Performs strict validation on filename and message.
        """
        # 1. Validate filename:
        #    - Ensure it's just a filename, no path components.
        #    - Restrict characters to alphanumeric, underscore, dash, dot.
        if not re.fullmatch(r"[\w.-]+\.txt", filename):
            return "Error: Invalid filename. Must be a simple .txt file (e.g., 'mylog.txt')."

        # Construct the full, secure file path.
        # os.path.join handles OS-specific path separators.
        # os.path.abspath ensures it's an absolute path.
        full_filepath = os.path.abspath(os.path.join(self.base_log_dir, filename))

        # 2. Critical Path Traversal Prevention:
        #    Ensure the resulting path is strictly within the allowed base_log_dir.
        #    Comparing against base_log_dir + os.sep (rather than base_log_dir alone)
        #    also rules out sibling directories that merely share the prefix,
        #    e.g. "agent_logs_evil" when the base is "agent_logs".
        if not full_filepath.startswith(self.base_log_dir + os.sep):
            return "Error: Path traversal attempt detected. Filename must be within the designated log directory."

        # 3. Message Validation (example: prevent known malicious strings)
        #    This is a simple example; real-world might use more sophisticated checks.
        if "ATTACKER_DATA" in message.upper():
            return "Error: Message contains forbidden keywords."

        try:
            with open(full_filepath, 'a') as f:
                f.write(message + '\n')
            print(f"Message securely logged to {full_filepath}.")
            return f"Successfully logged message to {filename}."
        except Exception as e:
            print(f"Error logging message: {e}")
            return f"Failed to log message: {e}"

# --- Simulate an agent's interaction with the secure tool ---
if __name__ == "__main__":
    # The secure log directory will be created relative to where you run the script.
    tool = SecureLogTool(base_log_dir="secure_agent_logs")

    print("\n--- Simulating a legitimate use with secure tool ---")
    print(tool.log_message("User activity: authenticated.", "user_auth.txt"))
    # Check if user_auth.txt was created in secure_agent_logs
    log_file = os.path.join(tool.base_log_dir, "user_auth.txt")
    if os.path.exists(log_file):
        with open(log_file, 'r') as f:
            print(f"Content of user_auth.txt: {f.read().strip()}")
        # os.remove(log_file) # Keep for inspection, or uncomment to clean up

    print("\n--- Simulating malicious tool misuse attempts with secure tool ---")
    # 1. Path traversal attempt
    print(tool.log_message("ATTACKER_DATA: Malicious payload.", "../../../etc/passwd"))
    print(tool.log_message("ATTACKER_DATA: Malicious payload.", "/tmp/evil.txt"))

    # 2. Forbidden filename attempt
    print(tool.log_message("Legitimate message.", "my_log/sub_dir.txt"))
    print(tool.log_message("Legitimate message.", "evil.sh"))

    # 3. Forbidden message content
    print(tool.log_message("This is some ATTACKER_DATA.", "system_alerts.txt"))

    # Clean up the created log directory for subsequent runs
    import shutil
    if os.path.exists(tool.base_log_dir):
        print(f"\nCleaning up '{tool.base_log_dir}' directory.")
        shutil.rmtree(tool.base_log_dir)

Explanation:

  • base_log_dir: The SecureLogTool now requires a base_log_dir upon initialization. All log files must reside within this directory.
  • os.makedirs(self.base_log_dir, exist_ok=True): Ensures the base directory exists.
  • Filename Validation (re.fullmatch): We use a regular expression (r"[\w.-]+\.txt") to ensure the filename argument only contains safe characters and ends with .txt. This immediately rejects paths like ../../../etc/passwd or evil.sh.
  • Path Construction (os.path.join, os.path.abspath): We construct the full path using os.path.join to correctly handle OS-specific path separators. os.path.abspath converts it to an absolute path.
  • Path Traversal Prevention (startswith check): This is critical. After constructing the full path, we verify that it starts with the absolute base_log_dir. This prevents any ../ or similar trickery from allowing the agent to write outside the designated directory.
  • Message Validation: A simple check for ATTACKER_DATA is included as an example. In a real application, this could involve more complex content moderation or keyword filtering.
  • The if __name__ == "__main__": block demonstrates how the secure tool successfully blocks malicious attempts.

Run the script with python secure_agent.py and observe how the tool now prevents the malicious attempts, returning informative error messages instead of executing dangerous operations.

Step 3: Conceptual Output Sanitization

While we don’t have a full agent to generate arbitrary markdown, let’s illustrate how output sanitization would work for a common scenario: cleaning HTML/Markdown output before display.

For this, we’ll use the bleach library, a Python HTML sanitization library.

First, install bleach: pip install bleach (the examples below target bleach 6.x).

Now, let’s add a function to secure_agent.py (or create a new file output_sanitizer.py for clarity):

# output_sanitizer.py
import bleach

def sanitize_html_output(raw_html_content: str) -> str:
    """
    Sanitizes raw HTML content to prevent XSS and other injection attacks.
    Uses bleach to allow a safe subset of HTML tags and attributes.
    """
    # Define allowed tags and attributes.
    # This is an example; customize based on your application's needs.
    allowed_tags = [
        'p', 'b', 'i', 'em', 'strong', 'a', 'ul', 'ol', 'li', 'br',
        'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'blockquote', 'code', 'pre'
    ]
    allowed_attrs = {
        'a': ['href', 'title'],
        'p': ['class'], # Example: allow 'class' attribute for paragraphs
    }

    # Clean the HTML. strip=True removes disallowed tags instead of escaping them.
    # Note: bleach.clean() has no linkify parameter; converting bare URLs into
    # links is a separate step via bleach.linkify() if you need it.
    sanitized_content = bleach.clean(
        raw_html_content,
        tags=allowed_tags,
        attributes=allowed_attrs,
        strip=True
    )
    return sanitized_content

if __name__ == "__main__":
    print("\n--- Demonstrating Output Sanitization ---")

    # Scenario 1: Harmless HTML
    safe_output = "This is a <b>bold</b> statement with an <i>italic</i> part."
    print(f"Original (safe): {safe_output}")
    print(f"Sanitized: {sanitize_html_output(safe_output)}")

    # Scenario 2: Malicious HTML (XSS attempt)
    malicious_output = """
    <h1>Important Announcement</h1>
    <p>Please click <a href="javascript:alert('XSS Attack!');">here</a> for details.</p>
    <img src="x" onerror="alert('Another XSS!');">
    <script>alert('Direct script execution!');</script>
    <p class='warning'>This is a normal paragraph.</p>
    """
    print(f"\nOriginal (malicious): {malicious_output}")
    print(f"Sanitized: {sanitize_html_output(malicious_output)}")

    # Scenario 3: Markdown-like output (if an agent generates markdown, it might be converted to HTML)
    markdown_output = """
    ## Agent's Summary
    - **Key finding**: AI security is crucial.
    - _Action item_: Implement defenses.
    <p class='secret'>This should be stripped.</p>
    """
    # Assuming a markdown renderer would convert this to HTML first, then sanitize.
    # For simplicity, we'll just show the HTML sanitization part.
    html_from_markdown = """
    <h2>Agent's Summary</h2>
    <ul>
        <li><strong>Key finding</strong>: AI security is crucial.</li>
        <li><em>Action item</em>: Implement defenses.</li>
    </ul>
    <p class='secret'>This should be stripped.</p>
    """
    print(f"\nOriginal (from markdown): {html_from_markdown}")
    print(f"Sanitized: {sanitize_html_output(html_from_markdown)}")

Explanation:

  • The sanitize_html_output function takes raw HTML as input.
  • allowed_tags and allowed_attrs define a whitelist of HTML elements and attributes that are permitted. Anything not in these lists will be stripped or cleaned by bleach.
  • bleach.clean() performs the heavy lifting, removing potentially dangerous elements like <script> tags, javascript: URLs, and onerror attributes.
  • Running this script demonstrates how bleach effectively neutralizes XSS attempts while preserving legitimate formatting.

Run the script with python output_sanitizer.py and observe how the malicious elements are removed, leaving only the safe HTML.

Mini-Challenge: Enhance Tool Security

You’ve seen how to restrict file paths for a logging tool. Now, it’s your turn to add another layer of security to the SecureLogTool.

Challenge: Modify the SecureLogTool in secure_agent.py to enforce a policy that all log messages must begin with a specific prefix, for example, “AGENT_LOG: “. If a message doesn’t start with this prefix, the tool should reject it.

Hint: Use Python’s str.startswith() method for checking the prefix. Remember to return an error message if the validation fails.

What to Observe/Learn: This challenge reinforces the idea of applying strict input validation not just to paths, but also to the content of the data being processed by a tool. It demonstrates how to enforce specific data formats or content policies for tool inputs.

Common Pitfalls & Troubleshooting

Even with the best intentions, securing agentic AI systems can be tricky. Here are some common pitfalls:

  1. Over-reliance on Model-Based Defenses: Assuming the LLM itself will “know better” and not misuse tools or generate harmful output. While LLMs have safety mechanisms, they are not foolproof and can be bypassed. Always implement external, deterministic security controls.
    • Troubleshooting: If an attack succeeds, don’t just “tell the LLM not to do that.” Look for the missing validation or sanitization layer outside the LLM.
  2. Insufficient Isolation for Tools: Giving tools direct, unrestricted access to the host system or network. This is like giving a child a loaded gun and hoping they don’t point it at anyone.
    • Troubleshooting: If a tool misuse vulnerability is found, immediately review the permissions and environment of the tool. Can it be run in a container? Can its network access be restricted?
  3. Neglecting Output Sanitization for Non-HTML Contexts: Focusing only on XSS in web contexts, but forgetting that output might be used in other ways (e.g., generating JSON that’s parsed, or command-line arguments).
    • Troubleshooting: For every place an LLM’s output is consumed, ask: “What’s the worst thing that could be injected here, and how would I prevent it?” If it’s JSON, ensure proper JSON escaping. If it’s a command-line argument, ensure shell escaping.
  4. Lack of Comprehensive Logging and Monitoring: Not logging tool calls, their parameters, and their outcomes. This makes it impossible to detect or investigate attacks.
    • Troubleshooting: Ensure your agent’s runtime environment has robust logging configured, sending logs to a centralized system for analysis and alerting. Look for unusual tool call sequences or frequent error messages from security checks.
  5. Assuming Off-the-Shelf Models are Secure: While major model providers invest heavily in safety, their models are designed for general use. Your specific application’s integration points and custom tools introduce unique risks that you must secure.
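Pitfall 3's non-HTML contexts have standard-library answers in Python. The sketch below shows shell and JSON escaping for an LLM-produced string; the example value is invented.

```python
import json
import shlex

llm_output = 'report.txt; rm -rf ~ "quoted"'

# Shell context: quote the value so it is passed as a single argument,
# not interpreted as extra commands.
safe_arg = shlex.quote(llm_output)
print(f"cat {safe_arg}")

# JSON context: serialize with json.dumps rather than string concatenation,
# so quotes and control characters are escaped correctly.
payload = json.dumps({"summary": llm_output})
print(payload)
assert json.loads(payload)["summary"] == llm_output  # round-trips safely
```

The general rule: escape for the specific consumer (shell, JSON, SQL, HTML), using that consumer's own escaping mechanism rather than ad-hoc string replacement.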

Summary

Phew! We’ve covered a lot of ground in securing agentic AI systems. Let’s recap the key takeaways:

  • Agentic AI systems empower LLMs with tools, but this power introduces new attack surfaces.
  • Tool Misuse (OWASP LLM06: Excessive Agency) involves tricking an agent into using its tools in malicious or unintended ways, potentially leading to data exfiltration, privilege escalation, or system compromise.
  • Insecure Output Handling (OWASP LLM05: Improper Output Handling) occurs when an agent’s generated content, if unsanitized, can be exploited by downstream systems or users (e.g., XSS, code injection).
  • Defenses for Tool Misuse include adhering to the principle of least privilege, using strict tool definitions and schema validation, implementing input validation and sanitization at the tool wrapper level, sandboxing dangerous tools, and incorporating human-in-the-loop for critical actions.
  • Defenses for Insecure Output Handling center around robust output filtering and sanitization (e.g., using libraries like bleach), integrating content moderation APIs, performing contextual output validation, and leveraging human review.
  • Continuous monitoring and logging of tool calls and outputs are essential for detection and response.

Securing agentic AI requires a defense-in-depth approach, treating every interaction point—from the initial prompt to the final output and every tool call in between—as a potential vulnerability. By applying these principles, you can build more resilient and trustworthy AI applications.

Next up, we’ll broaden our view to Insecure AI System Design Principles, looking at how to architect your entire AI application with security as a foundational element from the ground up.
