Welcome to Chapter 3 of our journey to build a robust Mermaid analyzer! In the previous chapter, we successfully laid the foundation with our lexer, transforming raw Mermaid code into a stream of meaningful tokens. This chapter marks a significant leap forward as we tackle the next crucial phase of our compiler pipeline: parsing. Here, we will take the tokens generated by the lexer and construct a strongly typed Abstract Syntax Tree (AST).
The AST is the heart of our analyzer. It’s a hierarchical representation of the input Mermaid code’s structure, abstracted away from surface-level syntax details such as exact punctuation and whitespace. Think of it as the “understanding” layer of our tool. Without a well-defined, accurately built AST, our subsequent validation, fixing, and formatting steps would be impossible or highly error-prone. By the end of this chapter, you will have a Parser module capable of consuming a token stream and producing a structured AST for various Mermaid diagram types, along with robust error reporting for syntax issues.
1. Planning & Design
Building a parser is a complex task that requires careful planning. Our parser follows a recursive descent approach: a top-down parsing technique in which a set of mutually recursive procedures processes the token stream. Each procedure typically corresponds to a non-terminal in the grammar and tries to match a sequence of tokens according to its production rules.
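To make the idea concrete before we dive into the real implementation, here is a freestanding sketch (independent of our crate's types, with a made-up `Tok` enum) of a recursive descent routine for a toy grammar rule `edge := Ident Arrow Ident`:

```rust
// Freestanding illustration of recursive descent (not our real parser):
// each parse_* function corresponds to one grammar rule and consumes
// exactly the tokens that rule's production allows.
#[derive(Debug, PartialEq, Clone)]
enum Tok {
    Ident(String),
    Arrow,
    Eof,
}

struct Toy {
    tokens: Vec<Tok>,
    pos: usize,
}

impl Toy {
    /// Returns the current token and advances, yielding Eof past the end.
    fn bump(&mut self) -> Tok {
        let t = self.tokens.get(self.pos).cloned().unwrap_or(Tok::Eof);
        self.pos += 1;
        t
    }

    // Grammar rule: edge := Ident Arrow Ident
    fn parse_edge(&mut self) -> Result<(String, String), String> {
        let source = match self.bump() {
            Tok::Ident(name) => name,
            other => return Err(format!("expected identifier, found {:?}", other)),
        };
        match self.bump() {
            Tok::Arrow => {}
            other => return Err(format!("expected arrow, found {:?}", other)),
        }
        let target = match self.bump() {
            Tok::Ident(name) => name,
            other => return Err(format!("expected identifier, found {:?}", other)),
        };
        Ok((source, target))
    }
}
```

Our real `Parser` follows the same pattern, just with richer tokens, locations, and diagnostics instead of `String` errors.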
1.1. Parser Architecture Overview
The Parser will interact with our Lexer to consume tokens and our Diagnostic system to report errors. It will construct the AST, which will reside in its own module.
1.2. Abstract Syntax Tree (AST) Structure
The AST needs to accurately represent the various components of a Mermaid diagram. Given our requirement to support flowcharts, sequence diagrams, and class diagrams, our AST must be flexible. We’ll start with a core structure that can be extended.
Consider a simple flowchart:
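For example (any minimal flowchart would do; this one is assumed for the discussion that follows):

```mermaid
graph TD
    A[Start] --> B(Process)
```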
The AST for this would involve a Diagram containing a DiagramType (Flowchart) and a list of Statements. Each Statement could be a Node or an Edge. A node has an ID, an optional label, and a shape; an edge connects two nodes and carries an optional label, an arrow type, and a line type.
Here’s a conceptual diagram of our AST structure for a basic flowchart:
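Since we can't render an image here, a text sketch will do. For a flowchart such as `graph TD` with `A[Start] --> B(Process)`, the tree looks roughly like this (names match the structures we define in this chapter):

```
Diagram (Flowchart, TopToBottom)
├── Node { id: "A", label: "Start",   shape: Rectangle }
├── Node { id: "B", label: "Process", shape: RoundedRectangle }
└── Edge { source: "A", target: "B", arrow_type: ArrowHead, line_type: Solid }
```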
We need to capture not just the content but also the exact source location of each element for precise diagnostics.
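As a reminder, `Location` comes from Chapter 2's diagnostics module. Throughout this chapter it is constructed with four arguments (e.g., `Location::new(0, 0, 0, 0)` in the parser below); a plausible shape, with illustrative field names that may differ from your actual Chapter 2 definition, is:

```rust
// Illustrative only -- the real Location lives in src/diagnostics (Chapter 2).
// The four-argument constructor used in this chapter suggests a byte span
// plus a line/column pair; adjust to match your actual definition.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Location {
    pub start: usize,  // byte offset of the first character of the span
    pub end: usize,    // byte offset one past the last character
    pub line: usize,   // line number where the span begins
    pub column: usize, // column number where the span begins
}

impl Location {
    pub fn new(start: usize, end: usize, line: usize, column: usize) -> Self {
        Location { start, end, line, column }
    }
}
```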
1.3. File Structure
We’ll organize our parser components within a dedicated src/parser module:
src/
├── lib.rs
├── lexer/
│ └── ... (from Chapter 2)
├── diagnostics/
│ └── ... (from Chapter 2)
├── parser/
│ ├── mod.rs
│ ├── ast.rs # Defines the Abstract Syntax Tree data structures
│ └── parser.rs # Contains the Parser implementation
└── main.rs
2. Step-by-Step Implementation
2.1. Setup and Configuration
First, let’s create the necessary files and update our lib.rs to include the new module.
Action: Create src/parser/mod.rs, src/parser/ast.rs, and src/parser/parser.rs.
File: src/parser/mod.rs
// src/parser/mod.rs
pub mod ast;
pub mod parser;
#[cfg(test)]
mod parser_tests;
pub use self::parser::Parser;
pub use self::ast::{
Diagram, DiagramType, FlowchartDirection, Statement, Node, Edge, Subgraph, Directive,
ArrowType, LineType, NodeShape, EdgeDirection,
};
Explanation: This mod.rs acts as the public interface for the parser module, making its sub-modules and key types easily accessible.
Action: Update src/lib.rs to include the new parser module.
File: src/lib.rs
// src/lib.rs
pub mod lexer;
pub mod diagnostics;
pub mod parser; // Add this line
// Re-export key types for easier access
pub use lexer::{Lexer, Token, TokenType, Location};
pub use diagnostics::{Diagnostic, DiagnosticLevel, DiagnosticCode, DiagnosticEmitter};
pub use parser::{
Parser, Diagram, DiagramType, Statement, Node, Edge, Subgraph, ArrowType, LineType, NodeShape,
EdgeDirection,
};
pub use parser::ast::{Directive, FlowchartDirection};
Explanation: We’re making the parser module public and re-exporting its main types. This allows other parts of our application (like main.rs or future modules) to easily import Parser or Diagram without needing to specify parser::parser::Parser or parser::ast::Diagram.
2.2. Core Implementation - Defining the AST
Now, let’s define the core data structures for our Abstract Syntax Tree in src/parser/ast.rs. We’ll start with the most common elements, focusing on flowcharts, and then extend. Remember to include Location from our diagnostics module to track source positions.
Action: Define the AST structures in src/parser/ast.rs.
File: src/parser/ast.rs
// src/parser/ast.rs
use crate::{Location, TokenType}; // Import Location and TokenType from root crate
/// Represents the top-level structure of a Mermaid diagram.
#[derive(Debug, PartialEq, Eq, Clone)]
pub struct Diagram {
pub diagram_type: DiagramType,
pub statements: Vec<Statement>,
pub location: Location, // Location of the 'graph' or 'flowchart' keyword
}
/// Enumerates the supported types of Mermaid diagrams.
#[derive(Debug, PartialEq, Eq, Clone)]
pub enum DiagramType {
Flowchart(Option<FlowchartDirection>), // e.g., graph TD, flowchart LR
Sequence,
Class,
// Add other diagram types as needed (e.g., Gantt, State, Git, etc.)
}
/// Represents the direction of a flowchart.
#[derive(Debug, PartialEq, Eq, Clone)]
pub enum FlowchartDirection {
TopToBottom, // TD
BottomToTop, // BT
LeftToRight, // LR
RightToLeft, // RL
}
/// Represents a single statement within a Mermaid diagram.
#[derive(Debug, PartialEq, Eq, Clone)]
pub enum Statement {
Node(Node),
Edge(Edge),
Subgraph(Subgraph),
Directive(Directive), // e.g., %%{init: ...}
Comment(String, Location), // e.g., %% This is a comment
EmptyLine(Location), // For tracking formatting, useful for formatter
// Potentially add other statement types like Class members, Sequence messages, etc.
}
/// Represents a node in the diagram.
#[derive(Debug, PartialEq, Eq, Clone)]
pub struct Node {
pub id: String,
pub label: Option<String>, // Optional label for nodes like A[Label]
pub shape: NodeShape,
pub location: Location, // Location of the node ID
pub label_location: Option<Location>, // Location of the label, if present
}
/// Enumerates the possible shapes a node can have.
#[derive(Debug, PartialEq, Eq, Clone)]
pub enum NodeShape {
Rectangle, // A[Text]
RoundedRectangle, // A(Text)
Stadium, // A([Text])
Subroutine, // A[[Text]]
Cylinder, // A[(Text)]
Circle, // A((Text))
Asymmetric, // A>Text]
Rhombus, // A{Text}
Hexagon, // A{{Text}}
Parallelogram, // A[/Text/]
ParallelogramAlt, // A[\Text\]
Trapezoid, // A[/Text\]
TrapezoidAlt, // A[\Text/]
DoubleCircle, // A(((Text)))
// Add other shapes as per Mermaid specification
Default, // Implicit rectangle if no shape specified (e.g., A)
}
/// Represents an edge (connection) between two nodes.
#[derive(Debug, PartialEq, Eq, Clone)]
pub struct Edge {
pub source: String,
pub target: String,
pub label: Option<String>,
pub arrow_type: ArrowType,
pub line_type: LineType,
pub location: Location, // Location of the first token of the edge (source ID)
pub arrow_location: Location, // Location of the arrow/line tokens
pub label_location: Option<Location>, // Location of the label, if present
}
/// Enumerates the types of arrows that can be used in an edge.
#[derive(Debug, PartialEq, Eq, Clone)]
pub enum ArrowType {
None, // ---
ArrowHead, // -->
Cross, // --x
Open, // --o
Bidirectional, // <-->
// Add other arrow types (e.g., ===>, --+>, etc.)
}
/// Enumerates the types of lines that can be used in an edge.
#[derive(Debug, PartialEq, Eq, Clone)]
pub enum LineType {
Solid, // ---
Dotted, // -.->
Thick, // ===
Invisible, // ~~~
// Add other line types
}
/// Represents a subgraph in the diagram.
#[derive(Debug, PartialEq, Eq, Clone)]
pub struct Subgraph {
pub id: String,
pub label: Option<String>, // Subgraph label (e.g., subgraph "My Subgraph")
pub statements: Vec<Statement>,
pub location: Location, // Location of the 'subgraph' keyword
pub label_location: Option<Location>, // Location of the label, if present
}
/// Represents a directive, like `%%{init: ...}`.
#[derive(Debug, PartialEq, Eq, Clone)]
pub struct Directive {
pub content: String, // The full content of the directive
pub location: Location,
}
// Helper to determine node shape from token type
impl NodeShape {
pub fn from_token_type(token_type: &TokenType) -> Option<Self> {
match token_type {
TokenType::NodeOpenBracket => Some(NodeShape::Rectangle),
TokenType::NodeOpenParen => Some(NodeShape::RoundedRectangle),
TokenType::NodeOpenSquareBracketParen => Some(NodeShape::Stadium),
TokenType::NodeOpenDoubleSquareBracket => Some(NodeShape::Subroutine),
TokenType::NodeOpenParenSquareBracket => Some(NodeShape::Cylinder),
TokenType::NodeOpenDoubleParen => Some(NodeShape::Circle),
TokenType::NodeOpenGt => Some(NodeShape::Asymmetric),
TokenType::NodeOpenBrace => Some(NodeShape::Rhombus),
TokenType::NodeOpenDoubleBrace => Some(NodeShape::Hexagon),
TokenType::NodeOpenSlash => Some(NodeShape::Parallelogram),
TokenType::NodeOpenBackslash => Some(NodeShape::ParallelogramAlt),
TokenType::NodeOpenSlashBackslash => Some(NodeShape::Trapezoid),
TokenType::NodeOpenBackslashSlash => Some(NodeShape::TrapezoidAlt),
TokenType::NodeOpenTripleParen => Some(NodeShape::DoubleCircle),
_ => None,
}
}
pub fn get_closing_token_type(&self) -> TokenType {
match self {
NodeShape::Rectangle => TokenType::NodeCloseBracket,
NodeShape::RoundedRectangle => TokenType::NodeCloseParen,
NodeShape::Stadium => TokenType::NodeCloseSquareBracketParen,
NodeShape::Subroutine => TokenType::NodeCloseDoubleSquareBracket,
NodeShape::Cylinder => TokenType::NodeCloseParenSquareBracket,
NodeShape::Circle => TokenType::NodeCloseDoubleParen,
NodeShape::Asymmetric => TokenType::NodeCloseBracket, // Asymmetric closes with ]
NodeShape::Rhombus => TokenType::NodeCloseBrace,
NodeShape::Hexagon => TokenType::NodeCloseDoubleBrace,
NodeShape::Parallelogram => TokenType::NodeCloseSlash,
NodeShape::ParallelogramAlt => TokenType::NodeCloseBackslash,
NodeShape::Trapezoid => TokenType::NodeCloseBackslash, // Trapezoid closes with \]
NodeShape::TrapezoidAlt => TokenType::NodeCloseSlash, // TrapezoidAlt closes with /]
NodeShape::DoubleCircle => TokenType::NodeCloseTripleParen,
NodeShape::Default => TokenType::Identifier, // No explicit closing for default
}
}
}
/// Represents the overall direction of an edge's arrow. For now the parser
/// derives `ArrowType` and `LineType` separately, so this enum is not yet
/// consumed anywhere; it is defined (and exported) so the API can later grow
/// to represent complete arrow tokens if we need more complex structures.
#[derive(Debug, PartialEq, Eq, Clone)]
pub enum EdgeDirection {
Forward,  // e.g., A --> B
Backward, // e.g., A <-- B
Both,     // e.g., A <--> B
}
Explanation:
- `Diagram`: The root of our AST, holding the diagram type and a vector of statements.
- `DiagramType`: An enum to distinguish between different Mermaid diagram kinds. We start with `Flowchart`, `Sequence`, and `Class`.
- `FlowchartDirection`: Specific to flowcharts (TD, LR, etc.).
- `Statement`: An enum representing the different top-level constructs in a diagram (nodes, edges, subgraphs, directives, comments). This is crucial for handling the mixed nature of Mermaid syntax.
- `Node`: Represents a single node with its ID, optional label, and `NodeShape`.
- `NodeShape`: An enum covering all standard Mermaid node shapes. Includes helper methods to map `TokenType` to `NodeShape` and get the corresponding closing token.
- `Edge`: Connects two nodes, with an optional label, `ArrowType`, and `LineType`.
- `ArrowType` and `LineType`: Enums to capture the various arrowheads and line styles.
- `Subgraph`: Represents nested blocks of statements.
- `Directive`: For `%%{init: ...}` style directives.
- `Location`: Crucially, every major AST element stores its `Location` to enable precise error reporting and future formatting.
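The delimiter-to-shape mapping is the part most worth internalizing: the opening token both selects the shape and determines which closing token the parser must then demand. In miniature (a freestanding sketch with a pared-down shape enum, not our crate's `NodeShape`):

```rust
// Freestanding miniature of the NodeShape helpers: the opening delimiter
// picks the shape, and each shape knows the closing delimiter to expect.
#[derive(Debug, PartialEq)]
enum MiniShape {
    Rectangle,        // A[Text]
    RoundedRectangle, // A(Text)
    Rhombus,          // A{Text}
}

impl MiniShape {
    /// Maps an opening delimiter to a shape, if it is one we recognize.
    fn from_open(delim: &str) -> Option<MiniShape> {
        match delim {
            "[" => Some(MiniShape::Rectangle),
            "(" => Some(MiniShape::RoundedRectangle),
            "{" => Some(MiniShape::Rhombus),
            _ => None,
        }
    }

    /// The closing delimiter the parser must expect for this shape.
    fn closing(&self) -> &'static str {
        match self {
            MiniShape::Rectangle => "]",
            MiniShape::RoundedRectangle => ")",
            MiniShape::Rhombus => "}",
        }
    }
}
```

`parse_node` below uses exactly this pattern, via `NodeShape::from_token_type` and `get_closing_token_type`, over `TokenType`s instead of strings.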
2.3. Core Implementation - The Parser
Now, we’ll implement the Parser struct and its methods in src/parser/parser.rs. This will be an incremental process, starting with the basic structure and parsing the overall diagram type, then moving to statements, nodes, and edges.
Action: Define the Parser struct and its basic methods.
File: src/parser/parser.rs
// src/parser/parser.rs
use crate::{
diagnostics::{Diagnostic, DiagnosticCode, DiagnosticEmitter, DiagnosticLevel},
lexer::{Lexer, Token, TokenType},
parser::ast::*, // Import all AST types
Location,
};
use std::iter::Peekable;
use std::vec::IntoIter;
/// The Parser transforms a stream of tokens into an Abstract Syntax Tree (AST).
pub struct Parser<'a> {
lexer: Peekable<IntoIter<Token>>,
current_token: Token,
peek_token: Token,
diagnostics: DiagnosticEmitter<'a>,
source_code: &'a str,
}
impl<'a> Parser<'a> {
/// Creates a new Parser instance.
/// It consumes the lexer's token stream and initializes the current and peek tokens.
pub fn new(lexer: Lexer, source_code: &'a str, diagnostics: DiagnosticEmitter<'a>) -> Self {
let mut token_iter = lexer.into_iter().peekable();
// Prime the current and peek tokens
let current_token = token_iter.next().unwrap_or_else(|| Token {
token_type: TokenType::Eof,
literal: String::new(),
location: Location::new(0, 0, 0, 0),
});
let peek_token = token_iter.next().unwrap_or_else(|| Token {
token_type: TokenType::Eof,
literal: String::new(),
location: Location::new(0, 0, 0, 0),
});
Parser {
lexer: token_iter,
current_token,
peek_token,
diagnostics,
source_code,
}
}
/// Advances the parser's token stream by one.
fn advance_token(&mut self) {
self.current_token = self.peek_token.clone();
self.peek_token = self.lexer.next().unwrap_or_else(|| Token {
token_type: TokenType::Eof,
literal: String::new(),
location: Location::new(self.source_code.len(), self.source_code.len(), 0, 0),
});
}
/// Checks if the current token matches the expected type.
fn current_token_is(&self, token_type: TokenType) -> bool {
self.current_token.token_type == token_type
}
/// Checks if the peek token matches the expected type.
fn peek_token_is(&self, token_type: TokenType) -> bool {
self.peek_token.token_type == token_type
}
/// Consumes the current token if it matches the expected type, otherwise reports an error.
/// Returns the consumed token or an error diagnostic.
fn expect_token(&mut self, expected: TokenType) -> Result<Token, Diagnostic> {
if self.current_token_is(expected) {
let token = self.current_token.clone();
self.advance_token();
Ok(token)
} else {
let diagnostic = Diagnostic::new(
DiagnosticCode::UnexpectedToken,
format!(
"Expected token {:?}, but found {:?}",
expected, self.current_token.token_type
),
self.current_token.location,
DiagnosticLevel::Error,
)
.with_help(format!(
    "Ensure that a {:?} token is used here.",
    expected
));
self.diagnostics.emit(diagnostic.clone());
Err(diagnostic)
}
}
/// Consumes the current token if it matches any of the expected types, otherwise reports an error.
fn expect_any_token(&mut self, expected_types: &[TokenType]) -> Result<Token, Diagnostic> {
if expected_types.contains(&self.current_token.token_type) {
let token = self.current_token.clone();
self.advance_token();
Ok(token)
} else {
let expected_str = expected_types
.iter()
.map(|t| format!("{:?}", t))
.collect::<Vec<_>>()
.join(" or ");
let diagnostic = Diagnostic::new(
DiagnosticCode::UnexpectedToken,
format!(
"Expected one of {}, but found {:?}",
expected_str, self.current_token.token_type
),
self.current_token.location,
DiagnosticLevel::Error,
)
.with_help(format!(
"Ensure that one of the expected tokens is used here."
));
self.diagnostics.emit(diagnostic.clone());
Err(diagnostic)
}
}
/// Parses the entire Mermaid diagram. This is the entry point for parsing.
pub fn parse(&mut self) -> Result<Diagram, Vec<Diagnostic>> {
let mut collected_diagnostics = Vec::new();
// First, parse the diagram type (e.g., 'graph', 'flowchart', 'sequence')
let diagram_type_token = match self.expect_any_token(&[
TokenType::KeywordGraph,
TokenType::KeywordFlowchart,
TokenType::KeywordSequence,
TokenType::KeywordClass,
// Add other diagram type keywords here
]) {
Ok(t) => t,
Err(diag) => {
collected_diagnostics.push(diag);
// Attempt to recover by assuming a default diagram type or skipping until a known keyword
// For now, we'll just return the error. Robust error recovery comes later.
return Err(collected_diagnostics);
}
};
let diagram_type_location = diagram_type_token.location;
let diagram_type = match self.parse_diagram_type(&diagram_type_token) {
Ok(dt) => dt,
Err(diag) => {
collected_diagnostics.push(diag);
// Recovery: assume Flowchart(None)
DiagramType::Flowchart(None)
}
};
let mut statements = Vec::new();
while !self.current_token_is(TokenType::Eof) {
match self.parse_statement() {
Ok(Some(stmt)) => statements.push(stmt),
Ok(None) => { /* No statement parsed, possibly a skipped invalid token */ }
Err(diag) => {
collected_diagnostics.push(diag);
// Error recovery: skip tokens until a likely statement start or EOF
self.synchronize_parser();
}
}
}
if collected_diagnostics.is_empty() {
Ok(Diagram {
diagram_type,
statements,
location: diagram_type_location,
})
} else {
Err(collected_diagnostics)
}
}
/// Parses the specific diagram type and its associated properties (e.g., direction for flowchart).
/// Parses the specific diagram type and its associated properties (e.g., direction for flowchart).
fn parse_diagram_type(&mut self, token: &Token) -> Result<DiagramType, Diagnostic> {
    match token.token_type {
        TokenType::KeywordGraph | TokenType::KeywordFlowchart => {
            // The keyword itself was already consumed by `expect_any_token`
            // in `parse()`, so an optional direction (TD, BT, LR, RL) is now
            // the *current* token.
            let direction = match self.current_token.token_type {
                TokenType::KeywordTD => Some(FlowchartDirection::TopToBottom),
                TokenType::KeywordBT => Some(FlowchartDirection::BottomToTop),
                TokenType::KeywordLR => Some(FlowchartDirection::LeftToRight),
                TokenType::KeywordRL => Some(FlowchartDirection::RightToLeft),
                _ => None,
            };
            match direction {
                Some(dir) => {
                    self.advance_token(); // Consume the direction token
                    Ok(DiagramType::Flowchart(Some(dir)))
                }
                // No explicit direction: flowchart/graph defaults to TD.
                None => Ok(DiagramType::Flowchart(Some(FlowchartDirection::TopToBottom))),
            }
        }
        TokenType::KeywordSequence => Ok(DiagramType::Sequence),
        TokenType::KeywordClass => Ok(DiagramType::Class),
        _ => {
            // Unreachable if `expect_any_token` filtered correctly, but
            // report an error as a fallback.
            Err(Diagnostic::new(
                DiagnosticCode::InvalidDiagramType,
                format!("Unknown diagram type: {:?}", token.literal),
                token.location,
                DiagnosticLevel::Error,
            ))
        }
    }
}
/// Parses a single statement (Node, Edge, Subgraph, Directive, Comment, or EmptyLine).
/// Returns `Ok(Some(Statement))` if a statement was successfully parsed.
/// Returns `Ok(None)` if the current token doesn't start a valid statement but isn't an error
/// (e.g., whitespace tokens that should be ignored at this level).
/// Returns `Err(Diagnostic)` if a syntax error prevents parsing a statement.
fn parse_statement(&mut self) -> Result<Option<Statement>, Diagnostic> {
// Skip whitespace and comments at the statement level for now.
// The lexer already handles most whitespace, but empty lines or comments
// between statements might be passed through.
while self.current_token_is(TokenType::Whitespace) || self.current_token_is(TokenType::Newline) {
self.advance_token();
}
if self.current_token_is(TokenType::Eof) {
return Ok(None);
}
if self.current_token_is(TokenType::DirectiveStart) {
return self.parse_directive().map(Some);
}
if self.current_token_is(TokenType::Comment) {
let comment_token = self.current_token.clone();
self.advance_token();
return Ok(Some(Statement::Comment(comment_token.literal, comment_token.location)));
}
// A statement typically starts with an identifier (for a node or source of an edge)
// or a keyword like 'subgraph'.
if self.current_token_is(TokenType::Identifier) {
let id_token = self.current_token.clone();
if self.peek_token_is(TokenType::Arrow)
|| self.peek_token_is(TokenType::DottedArrow)
|| self.peek_token_is(TokenType::ThickArrow)
|| self.peek_token_is(TokenType::InvisibleArrow)
|| self.peek_token_is(TokenType::BidirectionalArrow)
|| self.peek_token_is(TokenType::EdgeOpenParen) // A---B style
|| self.peek_token_is(TokenType::EdgeOpenBracket) // A--label--B style
{
// This looks like an edge
return self.parse_edge().map(Some);
} else {
// This looks like a node
return self.parse_node().map(Some);
}
} else if self.current_token_is(TokenType::KeywordSubgraph) {
return self.parse_subgraph().map(Some);
}
// If we reach here, it's an unexpected token at the start of a statement.
let diagnostic = Diagnostic::new(
DiagnosticCode::UnexpectedToken,
format!(
"Unexpected token '{:?}' at the start of a statement.",
self.current_token.token_type
),
self.current_token.location,
DiagnosticLevel::Error,
)
.with_help("Expected a node ID, 'subgraph' keyword, a directive, or a comment.");
self.diagnostics.emit(diagnostic.clone());
// Error recovery: advance token to try and find a valid statement start
self.advance_token();
Err(diagnostic)
}
/// Parses a node definition (e.g., `A[Start]`, `B(Process)`, `C`).
fn parse_node(&mut self) -> Result<Node, Diagnostic> {
let id_token = self.expect_token(TokenType::Identifier)?;
let id = id_token.literal.clone();
let mut label: Option<String> = None;
let mut node_shape = NodeShape::Default;
let mut label_location: Option<Location> = None;
// Check for node shape and label
if let Some(shape) = NodeShape::from_token_type(&self.current_token.token_type) {
let open_bracket_token = self.current_token.clone();
self.advance_token(); // Consume opening bracket/paren
// Node label can be a quoted string or unquoted identifier-like text
if self.current_token_is(TokenType::StringLiteral) {
label = Some(self.current_token.literal.clone());
label_location = Some(self.current_token.location);
self.advance_token(); // Consume string literal
} else if self.current_token_is(TokenType::Identifier) {
// Allow identifiers as labels if not quoted (Mermaid often does this)
label = Some(self.current_token.literal.clone());
label_location = Some(self.current_token.location);
self.advance_token(); // Consume identifier as label
} else {
// Empty label is allowed (e.g., A[])
label = Some("".to_string());
label_location = Some(self.current_token.location); // Location of the empty string
}
// Expect closing bracket/paren
let expected_closing_type = shape.get_closing_token_type();
self.expect_token(expected_closing_type)?;
node_shape = shape;
}
Ok(Node {
id,
label,
shape: node_shape,
location: id_token.location,
label_location,
})
}
/// Parses an edge definition (e.g., `A --> B`, `B -- "label" --> C`).
fn parse_edge(&mut self) -> Result<Edge, Diagnostic> {
let source_token = self.expect_token(TokenType::Identifier)?;
let source_id = source_token.literal.clone();
let edge_start_location = source_token.location; // Start location of the edge
let mut label: Option<String> = None;
let mut arrow_type = ArrowType::None; // Default to no arrow
let mut line_type = LineType::Solid; // Default to solid line
let mut arrow_location = self.current_token.location; // Default, will update
// Handle optional label between source and arrow (e.g., A -- label -- B)
if self.current_token_is(TokenType::EdgeOpenParen) || self.current_token_is(TokenType::EdgeOpenBracket) {
let open_token = self.current_token.clone();
self.advance_token(); // Consume '(' or '['
if self.current_token_is(TokenType::StringLiteral) {
label = Some(self.current_token.literal.clone());
self.advance_token(); // Consume label
} else if self.current_token_is(TokenType::Identifier) {
label = Some(self.current_token.literal.clone());
self.advance_token(); // Consume label
} else {
// Empty label
}
// Expect closing token
let expected_close = match open_token.token_type {
TokenType::EdgeOpenParen => TokenType::EdgeCloseParen,
TokenType::EdgeOpenBracket => TokenType::EdgeCloseBracket,
_ => unreachable!(), // Should be caught by the if condition
};
self.expect_token(expected_close)?;
}
// Parse arrow/line type
let arrow_line_token = self.expect_any_token(&[
TokenType::Arrow,
TokenType::DottedArrow,
TokenType::ThickArrow,
TokenType::InvisibleArrow,
TokenType::BidirectionalArrow,
// Add more specific arrow types as they are tokenized
])?;
arrow_location = arrow_line_token.location;
match arrow_line_token.token_type {
TokenType::Arrow => { // e.g., -->, ---
if arrow_line_token.literal.ends_with('>') {
arrow_type = ArrowType::ArrowHead;
} else if arrow_line_token.literal.ends_with('x') { // --x
arrow_type = ArrowType::Cross;
} else if arrow_line_token.literal.ends_with('o') { // --o
arrow_type = ArrowType::Open;
}
line_type = LineType::Solid; // Default for '---' and '-->'
}
TokenType::DottedArrow => { // e.g., -.->
arrow_type = ArrowType::ArrowHead;
line_type = LineType::Dotted;
}
TokenType::ThickArrow => { // e.g., ===>, ===
arrow_type = ArrowType::ArrowHead;
line_type = LineType::Thick;
}
TokenType::InvisibleArrow => { // e.g., ~~~
arrow_type = ArrowType::None;
line_type = LineType::Invisible;
}
TokenType::BidirectionalArrow => { // e.g., <-->
arrow_type = ArrowType::Bidirectional;
line_type = LineType::Solid;
}
_ => { /* Already handled by expect_any_token */ }
}
// Handle an optional label after the arrow (e.g., A --> "label" B). A bare
// identifier only counts as a label when another identifier (the target)
// follows it; otherwise it is the target itself, as in `A --> B`.
if self.current_token_is(TokenType::StringLiteral) {
    label = Some(self.current_token.literal.clone());
    self.advance_token(); // Consume label
} else if self.current_token_is(TokenType::Identifier)
    && self.peek_token_is(TokenType::Identifier)
{
    label = Some(self.current_token.literal.clone());
    self.advance_token(); // Consume label
}
let target_token = self.expect_token(TokenType::Identifier)?;
let target_id = target_token.literal.clone();
Ok(Edge {
source: source_id,
target: target_id,
label,
arrow_type,
line_type,
location: edge_start_location,
arrow_location,
label_location: None, // TODO: Accurately track label location
})
}
/// Parses a subgraph definition.
fn parse_subgraph(&mut self) -> Result<Subgraph, Diagnostic> {
let subgraph_keyword_token = self.expect_token(TokenType::KeywordSubgraph)?;
let subgraph_location = subgraph_keyword_token.location;
let id_token = self.expect_token(TokenType::Identifier)?;
let id = id_token.literal.clone();
let mut label: Option<String> = None;
let mut label_location: Option<Location> = None;
// Optional subgraph label (e.g., subgraph MySubgraph ["My Label"])
if self.current_token_is(TokenType::NodeOpenBracket) {
self.advance_token(); // Consume '['
if self.current_token_is(TokenType::StringLiteral) {
label = Some(self.current_token.literal.clone());
label_location = Some(self.current_token.location);
self.advance_token(); // Consume string literal
} else if self.current_token_is(TokenType::Identifier) {
// Allow unquoted identifiers as labels
label = Some(self.current_token.literal.clone());
label_location = Some(self.current_token.location);
self.advance_token(); // Consume identifier
}
self.expect_token(TokenType::NodeCloseBracket)?; // Consume ']'
}
let mut statements = Vec::new();
// Subgraph body can be enclosed in { } or just indented lines
// For now, let's assume it's followed by statements until 'end' or EOF
// A more robust parser would check for indentation or explicit `{}`
while !self.current_token_is(TokenType::KeywordEnd) && !self.current_token_is(TokenType::Eof) {
match self.parse_statement() {
Ok(Some(stmt)) => statements.push(stmt),
Ok(None) => { /* Continue */ },
Err(diag) => {
self.diagnostics.emit(diag);
self.synchronize_parser();
}
}
}
self.expect_token(TokenType::KeywordEnd)?; // Consume 'end' keyword
Ok(Subgraph {
id,
label,
statements,
location: subgraph_location,
label_location,
})
}
/// Parses a directive (e.g., `%%{init: { ... } }%%`).
fn parse_directive(&mut self) -> Result<Directive, Diagnostic> {
let directive_start_token = self.expect_token(TokenType::DirectiveStart)?;
// The lexer captures the entire directive body as the literal of the
// DirectiveStart token, so no separate closing token needs to be consumed.
let content = directive_start_token.literal.clone();
Ok(Directive {
content,
location: directive_start_token.location,
})
}
/// Implements basic error recovery by advancing tokens until a likely synchronization point.
fn synchronize_parser(&mut self) {
loop {
if self.current_token_is(TokenType::Eof) {
return;
}
match self.current_token.token_type {
    TokenType::KeywordGraph
    | TokenType::KeywordFlowchart
    | TokenType::KeywordSequence
    | TokenType::KeywordClass
    | TokenType::KeywordSubgraph
    | TokenType::KeywordEnd
    | TokenType::Identifier => {
        // These tokens usually start a new statement (or, for 'end',
        // close a subgraph), so they are safe points to resume parsing.
        return;
    }
_ => {
// Skip over current token
self.advance_token();
}
}
}
}
}
Explanation:
- `Parser` struct: Holds a `Peekable` iterator over `Token`s, the `current_token`, `peek_token`, a `DiagnosticEmitter`, and a reference to the `source_code` for context.
- `new()`: Constructor that initializes the parser by taking a `Lexer` and priming the `current_token` and `peek_token`.
- `advance_token()`: Moves the `peek_token` into `current_token` and fetches the next token into `peek_token`.
- `current_token_is()`, `peek_token_is()`: Helper methods for checking token types.
- `expect_token()`: Core parsing primitive. It attempts to consume the `current_token` if its type matches `expected`. If not, it emits an `Error` diagnostic and returns an `Err`. This is crucial for strict syntax validation.
- `expect_any_token()`: Similar to `expect_token` but checks against multiple expected token types.
- `parse()`: The main entry point. It orchestrates the parsing process:
  - First, it parses the `DiagramType` (e.g., `graph TD`).
  - Then, it iteratively calls `parse_statement()` until `Eof` is reached.
  - It collects all diagnostics and returns either a `Diagram` or a `Vec<Diagnostic>`.
- `parse_diagram_type()`: Handles the initial `graph`, `flowchart`, `sequence`, `class` keywords and their optional directions for flowcharts.
- `parse_statement()`: A dispatcher function that identifies what kind of statement the current tokens represent (node, edge, subgraph, directive, or comment). It includes basic error recovery by skipping whitespace and comments.
- `parse_node()`: Parses node definitions, handling IDs, optional labels, and various node shapes.
- `parse_edge()`: Parses edge definitions, including source/target IDs, optional labels (before or after the arrow), and different arrow/line types. This is a complex part due to the many arrow variations in Mermaid.
- `parse_subgraph()`: Handles `subgraph` blocks, parsing the ID, optional label, and nested statements until an `end` keyword.
- `parse_directive()`: Parses `%%{...}%%` directives, assuming the lexer captures the full content.
- `synchronize_parser()`: A simple error recovery mechanism. When a parsing error occurs, it advances the `current_token` until it finds a token that likely marks the beginning of a new valid statement (like an `Identifier` or a keyword). This prevents a single error from cascading into many irrelevant errors.
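The synchronization strategy (often called "panic mode" recovery) is simple enough to demonstrate in isolation: after an error, discard tokens until one that plausibly begins a new statement, so a single mistake doesn't cascade into dozens of spurious diagnostics. A toy version over characters, with `;` standing in as a hypothetical statement boundary:

```rust
// Toy panic-mode recovery: after an error at `pos`, skip forward until a
// token that can begin a new statement (approximated here by ';' acting as
// a statement separator), then resume parsing just after it.
fn synchronize(tokens: &[char], mut pos: usize) -> usize {
    while pos < tokens.len() && tokens[pos] != ';' {
        pos += 1; // discard tokens belonging to the broken statement
    }
    if pos < tokens.len() {
        pos += 1; // step past the separator so parsing resumes on fresh input
    }
    pos
}
```

`synchronize_parser` differs only in its choice of synchronization points: instead of a separator it stops *at* tokens that can start a statement (keywords and identifiers) rather than stepping past them.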
2.4. Testing This Component
To ensure our parser is working correctly, we need to write unit tests. These tests will feed a stream of tokens (generated by our lexer) into the parser and assert that the resulting AST matches our expectations.
Action: Create src/parser/parser_tests.rs, declare it in src/parser/mod.rs (e.g., `#[cfg(test)] mod parser_tests;`), and add basic tests.
File: src/parser/parser_tests.rs
// src/parser/parser_tests.rs
#[cfg(test)]
mod tests {
use crate::parser::ast::*;
use crate::parser::Parser;
use crate::lexer::Lexer;
use crate::diagnostics::DiagnosticEmitter;
use crate::Location;
// Helper function to create a parser from source code
fn setup_parser<'a>(source: &'a str, emitter: DiagnosticEmitter<'a>) -> Parser<'a> {
let lexer = Lexer::new(source);
Parser::new(lexer, source, emitter)
}
#[test]
fn test_parse_simple_flowchart() {
let source = "graph TD\n A[Start] --> B(Process)";
let emitter = DiagnosticEmitter::new(source);
let mut parser = setup_parser(source, emitter.clone());
let result = parser.parse();
assert!(result.is_ok(), "Parsing failed with errors: {:?}", emitter.get_diagnostics());
let diagram = result.unwrap();
assert_eq!(diagram.diagram_type, DiagramType::Flowchart(Some(FlowchartDirection::TopToBottom)));
assert_eq!(diagram.statements.len(), 2); // One node, one edge
// Verify Node A
if let Statement::Node(node_a) = &diagram.statements[0] {
assert_eq!(node_a.id, "A");
assert_eq!(node_a.label, Some("Start".to_string()));
assert_eq!(node_a.shape, NodeShape::Rectangle);
} else {
panic!("Expected Node statement, found {:?}", diagram.statements[0]);
}
// Verify Edge A --> B
if let Statement::Edge(edge_ab) = &diagram.statements[1] {
assert_eq!(edge_ab.source, "A");
assert_eq!(edge_ab.target, "B");
assert_eq!(edge_ab.label, None);
assert_eq!(edge_ab.arrow_type, ArrowType::ArrowHead);
assert_eq!(edge_ab.line_type, LineType::Solid);
} else {
panic!("Expected Edge statement, found {:?}", diagram.statements[1]);
}
assert!(emitter.get_diagnostics().is_empty(), "Unexpected diagnostics: {:?}", emitter.get_diagnostics());
}
#[test]
fn test_parse_flowchart_with_direction_lr() {
let source = "flowchart LR\n Node1 --> Node2";
let emitter = DiagnosticEmitter::new(source);
let mut parser = setup_parser(source, emitter.clone());
let result = parser.parse();
assert!(result.is_ok(), "Parsing failed with errors: {:?}", emitter.get_diagnostics());
let diagram = result.unwrap();
assert_eq!(diagram.diagram_type, DiagramType::Flowchart(Some(FlowchartDirection::LeftToRight)));
assert!(emitter.get_diagnostics().is_empty());
}
#[test]
fn test_parse_node_shapes_and_labels() {
let source = "graph TD\n A[Rect]\n B(Rounded)\n C>Asym]\n D{Rhombus}";
let emitter = DiagnosticEmitter::new(source);
let mut parser = setup_parser(source, emitter.clone());
let result = parser.parse().unwrap();
assert_eq!(result.statements.len(), 4);
if let Statement::Node(a) = &result.statements[0] { assert_eq!(a.shape, NodeShape::Rectangle); assert_eq!(a.label, Some("Rect".to_string())); }
if let Statement::Node(b) = &result.statements[1] { assert_eq!(b.shape, NodeShape::RoundedRectangle); assert_eq!(b.label, Some("Rounded".to_string())); }
if let Statement::Node(c) = &result.statements[2] { assert_eq!(c.shape, NodeShape::Asymmetric); assert_eq!(c.label, Some("Asym".to_string())); }
if let Statement::Node(d) = &result.statements[3] { assert_eq!(d.shape, NodeShape::Rhombus); assert_eq!(d.label, Some("Rhombus".to_string())); }
assert!(emitter.get_diagnostics().is_empty());
}
#[test]
fn test_parse_edge_labels_and_types() {
let source = "graph TD\n A -- Label A --> B\n B --o C\n C ===> D\n D -.-> E\n E ~~~ F";
let emitter = DiagnosticEmitter::new(source);
let mut parser = setup_parser(source, emitter.clone());
let result = parser.parse().unwrap();
assert_eq!(result.statements.len(), 5);
if let Statement::Edge(e) = &result.statements[0] { assert_eq!(e.label, Some("Label A".to_string())); assert_eq!(e.arrow_type, ArrowType::ArrowHead); }
if let Statement::Edge(e) = &result.statements[1] { assert_eq!(e.label, None); assert_eq!(e.arrow_type, ArrowType::Open); }
if let Statement::Edge(e) = &result.statements[2] { assert_eq!(e.label, None); assert_eq!(e.arrow_type, ArrowType::ArrowHead); assert_eq!(e.line_type, LineType::Thick); }
if let Statement::Edge(e) = &result.statements[3] { assert_eq!(e.label, None); assert_eq!(e.arrow_type, ArrowType::ArrowHead); assert_eq!(e.line_type, LineType::Dotted); }
if let Statement::Edge(e) = &result.statements[4] { assert_eq!(e.label, None); assert_eq!(e.arrow_type, ArrowType::None); assert_eq!(e.line_type, LineType::Invisible); }
assert!(emitter.get_diagnostics().is_empty());
}
#[test]
fn test_parse_subgraph() {
let source = r#"
graph TD
subgraph MySubgraph ["A Subgraph"]
A --> B
end
B --> C
"#;
let emitter = DiagnosticEmitter::new(source);
let mut parser = setup_parser(source, emitter.clone());
let result = parser.parse().unwrap();
assert_eq!(result.statements.len(), 2);
if let Statement::Subgraph(subgraph) = &result.statements[0] {
assert_eq!(subgraph.id, "MySubgraph");
assert_eq!(subgraph.label, Some("A Subgraph".to_string()));
assert_eq!(subgraph.statements.len(), 1);
if let Statement::Edge(edge) = &subgraph.statements[0] {
assert_eq!(edge.source, "A");
assert_eq!(edge.target, "B");
} else { panic!("Expected edge in subgraph"); }
} else { panic!("Expected subgraph"); }
if let Statement::Edge(edge) = &result.statements[1] {
assert_eq!(edge.source, "B");
assert_eq!(edge.target, "C");
} else { panic!("Expected edge outside subgraph"); }
assert!(emitter.get_diagnostics().is_empty());
}
#[test]
fn test_parse_directive() {
let source = "%%{init: { 'theme': 'forest' } }%%\ngraph TD\nA --> B";
let emitter = DiagnosticEmitter::new(source);
let mut parser = setup_parser(source, emitter.clone());
let result = parser.parse().unwrap();
assert_eq!(result.statements.len(), 2);
if let Statement::Directive(directive) = &result.statements[0] {
assert_eq!(directive.content, "{init: { 'theme': 'forest' } }"); // Lexer should capture content without %%{}%%
} else { panic!("Expected directive"); }
assert!(emitter.get_diagnostics().is_empty());
}
#[test]
fn test_parse_syntax_error_missing_diagram_type() {
let source = "A --> B"; // Missing 'graph TD'
let emitter = DiagnosticEmitter::new(source);
let mut parser = setup_parser(source, emitter.clone());
let result = parser.parse();
assert!(result.is_err());
let diagnostics = emitter.get_diagnostics();
assert_eq!(diagnostics.len(), 1);
assert_eq!(diagnostics[0].code, DiagnosticCode::UnexpectedToken);
assert!(diagnostics[0].message.contains("Expected one of KeywordGraph or KeywordFlowchart or KeywordSequence or KeywordClass"));
}
#[test]
fn test_parse_syntax_error_unmatched_bracket() {
let source = "graph TD\n A[Start --> B"; // Missing closing bracket
let emitter = DiagnosticEmitter::new(source);
let mut parser = setup_parser(source, emitter.clone());
let result = parser.parse();
assert!(result.is_err()); // Should return an error
let diagnostics = emitter.get_diagnostics();
assert_eq!(diagnostics.len(), 1);
assert_eq!(diagnostics[0].code, DiagnosticCode::UnexpectedToken);
assert!(diagnostics[0].message.contains("Expected token NodeCloseBracket"));
}
#[test]
fn test_parse_empty_input() {
let source = "";
let emitter = DiagnosticEmitter::new(source);
let mut parser = setup_parser(source, emitter.clone());
let result = parser.parse();
assert!(result.is_err()); // Empty input is invalid as it lacks a diagram type
let diagnostics = emitter.get_diagnostics();
assert_eq!(diagnostics.len(), 1);
assert_eq!(diagnostics[0].code, DiagnosticCode::UnexpectedToken);
}
}
Action: Add `mod parser_tests;` to `src/parser/mod.rs` (inside a `#[cfg(test)]` block).
File: `src/parser/mod.rs`
```rust
// src/parser/mod.rs
pub mod ast;
pub mod parser;

pub use self::parser::Parser;
pub use self::ast::{
    Diagram, DiagramType, Statement, Node, Edge, Subgraph, ArrowType, LineType, NodeShape,
    FlowchartDirection,
};

#[cfg(test)]
mod parser_tests; // Add this line
```
Explanation:
- The `parser_tests.rs` file contains various unit tests that cover successful parsing of different Mermaid constructs (simple flowcharts, nodes, edges, subgraphs, directives) as well as tests for expected syntax errors.
- The `setup_parser` helper streamlines test setup.
- We use `DiagnosticEmitter` to capture diagnostics during parsing and assert their presence or absence.
- `assert!(result.is_ok())` and `assert!(result.is_err())` check whether parsing succeeded or failed as expected.
- Assertions on the `Diagram` and `Statement` types verify the correctness of the generated AST.
To run these tests, navigate to your project root and execute `cargo test`. All tests should pass, indicating that our parser correctly constructs the AST for valid Mermaid input and reports errors for invalid input.
3. Production Considerations
For a production-ready parser, several factors beyond basic functionality are critical:
- Robust Error Recovery: Our current `synchronize_parser` is a basic starting point. A production-grade parser would employ more sophisticated error recovery techniques (e.g., panic-mode recovery, phrase-level recovery) to minimize cascading errors and allow parsing to continue as far as possible, providing a comprehensive list of issues rather than stopping at the first one. This is key for a linter/formatter.
- Performance Optimization:
  - Zero-Copy Parsing: For very large Mermaid files, minimizing string allocations (`.clone()`) during parsing is crucial. This can involve using string slices (`&str`) that reference the original source code where possible, or employing arenas for AST node allocation. Our current implementation does some cloning for `literal` values; a future optimization would be to reference the `source_code` directly.
  - Efficient Lookahead: `Peekable` is generally efficient, but for very deep lookaheads, custom token buffering might be considered. However, recursive descent usually only needs one or two tokens of lookahead, so `Peekable` is often sufficient.
- Memory Usage: The AST itself can consume significant memory for very large diagrams. Consider techniques like interning strings (using `Rc<str>` or a dedicated string table) for repeated node IDs or labels to reduce the memory footprint.
- Security (DoS Prevention): Malformed input should never cause the parser to panic or enter an infinite loop, which could be exploited for denial of service. Our `expect_token` and `synchronize_parser` help prevent this by always advancing tokens and reporting errors. Fuzz testing (covered in a later chapter) is essential here.
- Logging and Metrics: Integrate logging to trace parser behavior, especially during error recovery or for performance bottlenecks. Metrics could track parsing time, AST size, and error rates.
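As a concrete illustration of the interning idea mentioned above, here is a minimal sketch. The `Interner` type and its API are hypothetical, not part of this chapter's code; the point is that repeated node IDs share one allocation via `Rc<str>`:

```rust
use std::collections::HashSet;
use std::rc::Rc;

// Hypothetical minimal string interner: repeated node IDs or labels share
// a single heap allocation instead of one String per AST occurrence.
#[derive(Default)]
struct Interner {
    table: HashSet<Rc<str>>,
}

impl Interner {
    fn intern(&mut self, s: &str) -> Rc<str> {
        // HashSet::get can borrow the Rc<str> key as &str (via Borrow),
        // so lookups on already-interned strings allocate nothing.
        if let Some(existing) = self.table.get(s) {
            return Rc::clone(existing);
        }
        let rc: Rc<str> = Rc::from(s);
        self.table.insert(Rc::clone(&rc));
        rc
    }
}

fn main() {
    let mut interner = Interner::default();
    let a1 = interner.intern("A");
    let a2 = interner.intern("A");
    // Both handles point at the same allocation.
    assert!(Rc::ptr_eq(&a1, &a2));
    assert_eq!(interner.table.len(), 1);
    println!("interned {} unique string(s)", interner.table.len());
}
```

Storing `Rc<str>` in AST nodes keeps cloning cheap (a reference-count bump) while the source text itself is shared.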
4. Code Review Checkpoint
At this point, we have successfully implemented the core parsing logic and the foundational Abstract Syntax Tree for our Mermaid analyzer.
Summary of what was built:
src/parser/ast.rs: Defined the strongly typed AST structures (Diagram,DiagramType,Statement,Node,Edge,Subgraph,Directive,NodeShape,ArrowType,LineType,FlowchartDirection) to represent Mermaid diagrams. Each AST node includesLocationinformation.src/parser/parser.rs: Implemented theParserstruct using a recursive descent approach. It consumes tokens from theLexerand constructs the AST. It includes methods for token lookahead, consumption, error reporting viaDiagnosticEmitter, and basic error recovery.src/parser/parser_tests.rs: Added unit tests to validate the parser’s functionality for various valid and invalid Mermaid inputs, ensuring correct AST generation and error reporting.
Files created/modified:
src/lib.rs(modified to includeparsermodule and re-exports)src/parser/mod.rs(new module, defines public interface)src/parser/ast.rs(new file, defines AST data structures)src/parser/parser.rs(new file, implements the parser logic)src/parser/parser_tests.rs(new file, contains unit tests for the parser)
How it integrates:
The Parser now seamlessly integrates with the Lexer (from Chapter 2) by taking its Token stream as input. It also leverages the DiagnosticEmitter (from Chapter 2) to report parsing errors, enriching our tool’s feedback mechanism. The output of the parser, the Diagram AST, is now ready to be consumed by the next stage of our pipeline: the validator.
5. Common Issues & Solutions
Developers working on parsers often encounter similar challenges. Here are a few common ones and how to address them:
Issue: Infinite Loops in Recursive Descent.
- Problem: A parsing function might call itself, or another function that eventually calls it back, without consuming any tokens. This often happens with optional or repetitive grammar rules.
- Example: `parse_statement` calls `parse_node`, but `parse_node` fails to consume a token and returns, leading `parse_statement` to call it again with the same token.
- Solution:
  - Always consume tokens: Ensure that every parsing function, especially those that can recurse, consumes at least one token when it successfully matches a part of the grammar.
  - Clear termination conditions: For loops (e.g., `while !self.current_token_is(TokenType::Eof)`), ensure the loop condition will eventually become false.
  - Error recovery: When `expect_token` fails, the parser should advance past the problematic token (`self.advance_token()`) or use a synchronization strategy to avoid re-evaluating the same problematic token. Our `synchronize_parser` helps here.
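One cheap way to turn a would-be infinite loop into a loud failure is to assert progress on every iteration of the statement loop. A hedged sketch, with a plain index and string tokens standing in for the parser's real cursor and token stream:

```rust
// Each iteration of the statement loop must consume at least one token;
// if it does not, we bail out with an error instead of spinning forever.
fn parse_statements(tokens: &[&str]) -> Result<usize, String> {
    let mut pos = 0;
    let mut statements = 0;
    while pos < tokens.len() && tokens[pos] != "EOF" {
        let before = pos;
        // A real parser would dispatch to parse_node / parse_edge here;
        // this sketch treats every token as a one-token "statement".
        pos += 1;
        statements += 1;
        // Progress guard: a grammar bug that consumes nothing is detected,
        // not looped on.
        if pos == before {
            return Err(format!("parser made no progress at token {before}"));
        }
    }
    Ok(statements)
}

fn main() {
    assert_eq!(parse_statements(&["A", "-->", "B", "EOF"]), Ok(3));
    println!("parsed without stalling");
}
```

In debug builds the guard costs one comparison per statement; it can also be a `debug_assert!` if the overhead matters in release builds.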
Issue: Off-by-One Errors in Token Consumption/Lookahead.
- Problem: The parser might consume too many or too few tokens, leading to incorrect parsing or missed errors. For example, `peek_token` is checked but `current_token` isn't advanced correctly.
- Example: For `A -- "label" --> B`, if the parser consumes `A` and `-->` but skips `"label"`, the AST will be incorrect.
- Solution:
  - Double-check `advance_token()` calls: Ensure `advance_token()` is called precisely when a token is matched and consumed.
  - Unit tests for each rule: Write specific tests for each grammar rule (node, edge, subgraph) to verify token consumption.
  - Trace parsing: Temporarily add `println!` statements in `advance_token()` and `expect_token()` to see the token stream being processed.
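The peek/advance discipline is easiest to audit when the cursor has exactly two operations: one that never consumes and one that consumes exactly one token. A minimal sketch (the `Cursor` type is illustrative, not this chapter's actual `Parser` API):

```rust
// Minimal token cursor: peek() never consumes; advance() consumes exactly
// one token. Each grammar rule can then be audited for exactly one
// advance() per matched token.
struct Cursor<'a> {
    tokens: &'a [&'a str],
    pos: usize,
}

impl<'a> Cursor<'a> {
    fn peek(&self) -> Option<&'a str> {
        self.tokens.get(self.pos).copied()
    }

    fn advance(&mut self) -> Option<&'a str> {
        let tok = self.peek();
        if tok.is_some() {
            self.pos += 1;
        }
        tok
    }
}

fn main() {
    let mut cursor = Cursor { tokens: &["A", "-->", "B"], pos: 0 };
    assert_eq!(cursor.peek(), Some("A"));    // lookahead does not move the cursor
    assert_eq!(cursor.peek(), Some("A"));    // ...no matter how often it is called
    assert_eq!(cursor.advance(), Some("A")); // consumption moves it exactly once
    assert_eq!(cursor.peek(), Some("-->"));
    println!("cursor at position {}", cursor.pos);
}
```

Routing all token access through two such methods makes off-by-one bugs show up as a single misplaced `advance()` call rather than scattered index arithmetic.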
Issue: Ambiguity in Grammar Rules.
- Problem: Sometimes the same sequence of tokens can be interpreted in multiple ways by different parsing rules. This is less common with hand-written recursive descent if the grammar is LL(1) (parsable with one token of lookahead), but it can arise in complex grammars.
- Example: Is `A` a node ID or the start of an edge? Compare `A --> B` with `A[Label]`.
- Solution:
  - Prioritization: In recursive descent, the order of `if`/`else if` statements in functions like `parse_statement` implies prioritization. For instance, checking for an edge first (`if self.peek_token_is(TokenType::Arrow)`) before defaulting to a node (`else if self.current_token_is(TokenType::Identifier)`) resolves this.
  - More lookahead: If one-token lookahead isn't enough, `peek_token` (two-token lookahead) helps. If even more is needed, a custom buffer might be required, but it adds complexity.
  - Grammar analysis: Referencing the official Mermaid grammar meticulously helps identify and resolve ambiguities.
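The prioritization point can be sketched as a tiny classifier: given the current token and one token of lookahead, the edge interpretation is checked before the node fallback. Token names here are illustrative strings, not the real `TokenType`:

```rust
#[derive(Debug, PartialEq)]
enum StatementKind {
    Edge,
    Node,
}

// One-token-lookahead dispatch: the edge case is tested first, so
// `A --> B` classifies as an edge even though `A` alone would be a node.
fn classify(current: &str, peek: Option<&str>) -> Option<StatementKind> {
    if current != "Identifier" {
        return None; // not a statement this sketch knows about
    }
    match peek {
        Some("Arrow") => Some(StatementKind::Edge), // edge wins on lookahead
        _ => Some(StatementKind::Node),             // otherwise a standalone node
    }
}

fn main() {
    assert_eq!(classify("Identifier", Some("Arrow")), Some(StatementKind::Edge));
    assert_eq!(classify("Identifier", Some("NodeOpenBracket")), Some(StatementKind::Node));
    assert_eq!(classify("KeywordEnd", None), None);
    println!("dispatch order resolves the ambiguity");
}
```

Because the arrow check comes first, the shared prefix `A` never has to be re-parsed; the ambiguity is resolved before any rule commits.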
6. Testing & Verification
To verify the work of this chapter, you should run the unit tests we’ve created and consider adding more “golden tests.”
Run Unit Tests:
- Navigate to your project root in the terminal.
- Execute `cargo test`.
- All tests in `src/parser/parser_tests.rs` should pass. This confirms that the parser correctly builds the AST for a range of valid Mermaid inputs and reports the expected errors for invalid ones.
Golden Tests (Input to Expected Output):
- Beyond unit tests, a powerful technique is "golden testing" or "snapshot testing." For the parser, this means:
  - Have a collection of `.mmd` (Mermaid) files.
  - For each `.mmd` file, have a corresponding `.ast.txt` file that contains the `Debug` representation of the expected AST.
  - The test would:
    - Read the `.mmd` file.
    - Lex it.
    - Parse it, getting the `Diagram` AST.
    - Compare the `format!("{:?}", diagram)` output with the content of the `.ast.txt` file.
- This ensures that any changes to the parser don't inadvertently alter the AST structure for previously working inputs. Golden testing will be integrated into our comprehensive testing strategy in a later chapter.
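The steps above can be sketched as a small helper. Note that `check_golden` and the closure it takes are hypothetical names; the closure stands in for the real lex-then-parse pipeline producing `format!("{:?}", diagram)`:

```rust
use std::fs;
use std::path::Path;

// Compare a parse result's Debug output against the stored .ast.txt
// snapshot sitting next to the .mmd file. `parse_to_debug` stands in
// for the real lex + parse pipeline from this chapter.
fn check_golden(mmd_path: &Path, parse_to_debug: impl Fn(&str) -> String) -> Result<(), String> {
    let source = fs::read_to_string(mmd_path).map_err(|e| e.to_string())?;
    let actual = parse_to_debug(&source);
    let expected = fs::read_to_string(mmd_path.with_extension("ast.txt"))
        .map_err(|e| e.to_string())?;
    if actual.trim() == expected.trim() {
        Ok(())
    } else {
        Err(format!("snapshot mismatch for {}", mmd_path.display()))
    }
}

fn main() {
    // Set up a throwaway .mmd / .ast.txt pair in the temp directory.
    let dir = std::env::temp_dir();
    let mmd = dir.join("golden_sample.mmd");
    fs::write(&mmd, "graph TD\nA --> B").unwrap();
    fs::write(dir.join("golden_sample.ast.txt"), "FAKE-AST").unwrap();

    // A stand-in "parser" returning a fixed snapshot string.
    assert!(check_golden(&mmd, |_src| "FAKE-AST".to_string()).is_ok());
    assert!(check_golden(&mmd, |_src| "DIFFERENT".to_string()).is_err());
    println!("golden comparison works");
}
```

A real harness would iterate over a `tests/golden/` directory and report every mismatching file, not just the first; crates such as `insta` automate this pattern, but the hand-rolled version above shows the mechanics.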
Manual Inspection:
- For a few complex Mermaid diagrams, you can temporarily add `println!("{:?}", diagram)` after parsing in `main.rs` (or in a test) to visually inspect the generated AST and ensure it accurately reflects the input structure.
At this stage, you should be confident that your tool can now transform raw Mermaid code into a structured, machine-readable Abstract Syntax Tree, laying the groundwork for all subsequent analysis and transformation tasks.
7. Summary & Next Steps
Congratulations! In this chapter, you’ve taken a significant step in building our production-grade Mermaid analyzer:
- We designed and implemented the Abstract Syntax Tree (AST), a robust, strongly typed representation of Mermaid diagrams, capturing their structure and semantic information.
- We developed the Parser module, which consumes the token stream from our lexer and constructs this AST using a recursive descent approach.
- We incorporated error reporting using our `DiagnosticEmitter` and implemented basic error recovery to make the parser resilient to malformed input.
- We wrote unit tests to ensure the correctness and reliability of our parser, covering various valid and invalid Mermaid syntax examples.
The generated AST is now the authoritative representation of the Mermaid diagram. It’s clean, structured, and ready for deeper analysis.
In the next chapter (Chapter 4: Implementing the Semantic Validator), we will leverage this AST to perform strict validation. We’ll identify not only syntax errors (which the parser already catches) but also semantic errors such as duplicate node IDs, undefined references in edges, illegal nesting, and other logical inconsistencies that cannot be caught during the lexical or parsing phases. This will bring us closer to a truly robust and reliable Mermaid linter.