Welcome to Chapter 3 of our journey to build a robust Mermaid analyzer! In the previous chapter, we successfully laid the foundation with our lexer, transforming raw Mermaid code into a stream of meaningful tokens. This chapter marks a significant leap forward as we tackle the next crucial phase of our compiler pipeline: parsing. Here, we will take the tokens generated by the lexer and construct a strongly typed Abstract Syntax Tree (AST).
The AST is the heart of our analyzer. It’s a hierarchical representation of the input Mermaid code’s structure, abstracted away from surface-level syntax details such as exact punctuation and whitespace. Think of it as the “understanding” layer of our tool. Without a well-defined, accurately built AST, our subsequent validation, fixing, and formatting steps would be impossible or highly error-prone. By the end of this chapter, you will have a Parser module capable of consuming a token stream and producing a structured AST for various Mermaid diagram types, along with robust error reporting for syntax issues.
1. Planning & Design
Building a parser is a complex task that requires careful planning. Our parser follows a recursive descent approach: a top-down parsing technique in which a set of mutually recursive procedures processes the token stream. Each procedure typically corresponds to a non-terminal in the grammar and tries to match a sequence of tokens according to its production rules.
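To make the idea concrete before we dive into the real implementation, here is a freestanding sketch (independent of our crate's types, with a made-up `Tok` enum) of a recursive descent routine for a toy grammar rule `edge := Ident Arrow Ident`:

```rust
// Freestanding illustration of recursive descent (not our real parser):
// each parse_* function corresponds to one grammar rule and consumes
// exactly the tokens that rule's production allows.
#[derive(Debug, PartialEq, Clone)]
enum Tok {
    Ident(String),
    Arrow,
    Eof,
}

struct Toy {
    tokens: Vec<Tok>,
    pos: usize,
}

impl Toy {
    /// Returns the current token and advances, yielding Eof past the end.
    fn bump(&mut self) -> Tok {
        let t = self.tokens.get(self.pos).cloned().unwrap_or(Tok::Eof);
        self.pos += 1;
        t
    }

    // Grammar rule: edge := Ident Arrow Ident
    fn parse_edge(&mut self) -> Result<(String, String), String> {
        let source = match self.bump() {
            Tok::Ident(name) => name,
            other => return Err(format!("expected identifier, found {:?}", other)),
        };
        match self.bump() {
            Tok::Arrow => {}
            other => return Err(format!("expected arrow, found {:?}", other)),
        }
        let target = match self.bump() {
            Tok::Ident(name) => name,
            other => return Err(format!("expected identifier, found {:?}", other)),
        };
        Ok((source, target))
    }
}
```

Our real `Parser` follows the same pattern, just with richer tokens, locations, and diagnostics instead of `String` errors.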
1.1. Parser Architecture Overview
The Parser will interact with our Lexer to consume tokens and our Diagnostic system to report errors. It will construct the AST, which will reside in its own module.
1.2. Abstract Syntax Tree (AST) Structure
The AST needs to accurately represent the various components of a Mermaid diagram. Given our requirement to support flowcharts, sequence diagrams, and class diagrams, our AST must be flexible. We’ll start with a core structure that can be extended.
Consider a simple flowchart:
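For example (any minimal flowchart would do; this one is assumed for the discussion that follows):

```mermaid
graph TD
    A[Start] --> B(Process)
```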
The AST for this would involve a Diagram containing a DiagramType (Flowchart) and a list of Statements. Each Statement could be a Node or an Edge. A node has an ID, an optional label, and a shape; an edge connects two nodes and carries an optional label, an arrow type, and a line type.
Here’s a conceptual diagram of our AST structure for a basic flowchart:
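Since we can't render an image here, a text sketch will do. For a flowchart such as `graph TD` with `A[Start] --> B(Process)`, the tree looks roughly like this (names match the structures we define in this chapter):

```
Diagram (Flowchart, TopToBottom)
├── Node { id: "A", label: "Start",   shape: Rectangle }
├── Node { id: "B", label: "Process", shape: RoundedRectangle }
└── Edge { source: "A", target: "B", arrow_type: ArrowHead, line_type: Solid }
```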
We need to capture not just the content but also the exact source location of each element for precise diagnostics.
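As a reminder, `Location` comes from Chapter 2's diagnostics module. Throughout this chapter it is constructed with four arguments (e.g., `Location::new(0, 0, 0, 0)` in the parser below); a plausible shape, with illustrative field names that may differ from your actual Chapter 2 definition, is:

```rust
// Illustrative only -- the real Location lives in src/diagnostics (Chapter 2).
// The four-argument constructor used in this chapter suggests a byte span
// plus a line/column pair; adjust to match your actual definition.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Location {
    pub start: usize,  // byte offset of the first character of the span
    pub end: usize,    // byte offset one past the last character
    pub line: usize,   // line number where the span begins
    pub column: usize, // column number where the span begins
}

impl Location {
    pub fn new(start: usize, end: usize, line: usize, column: usize) -> Self {
        Location { start, end, line, column }
    }
}
```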
1.3. File Structure
We’ll organize our parser components within a dedicated src/parser module:
src/
├── lib.rs
├── lexer/
│ └── ... (from Chapter 2)
├── diagnostics/
│ └── ... (from Chapter 2)
├── parser/
│ ├── mod.rs
│ ├── ast.rs # Defines the Abstract Syntax Tree data structures
│ └── parser.rs # Contains the Parser implementation
└── main.rs
2. Step-by-Step Implementation
2.1. Setup and Configuration
First, let’s create the necessary files and update our lib.rs to include the new module.
Action: Create src/parser/mod.rs, src/parser/ast.rs, and src/parser/parser.rs.
File: src/parser/mod.rs
// src/parser/mod.rs
pub mod ast;
pub mod parser;
#[cfg(test)]
mod parser_tests;
pub use self::parser::Parser;
pub use self::ast::{
Diagram, DiagramType, FlowchartDirection, Statement, Node, Edge, Subgraph, Directive,
ArrowType, LineType, NodeShape, EdgeDirection,
};
Explanation: This mod.rs acts as the public interface for the parser module, making its sub-modules and key types easily accessible.
Action: Update src/lib.rs to include the new parser module.
File: src/lib.rs
// src/lib.rs
pub mod lexer;
pub mod diagnostics;
pub mod parser; // Add this line
// Re-export key types for easier access
pub use lexer::{Lexer, Token, TokenType, Location};
pub use diagnostics::{Diagnostic, DiagnosticLevel, DiagnosticCode, DiagnosticEmitter};
pub use parser::{
Parser, Diagram, DiagramType, Statement, Node, Edge, Subgraph, ArrowType, LineType, NodeShape,
EdgeDirection,
};
pub use parser::ast::{Directive, FlowchartDirection};
Explanation: We’re making the parser module public and re-exporting its main types. This allows other parts of our application (like main.rs or future modules) to easily import Parser or Diagram without needing to specify parser::parser::Parser or parser::ast::Diagram.
2.2. Core Implementation - Defining the AST
Now, let’s define the core data structures for our Abstract Syntax Tree in src/parser/ast.rs. We’ll start with the most common elements, focusing on flowcharts, and then extend. Remember to include Location from our diagnostics module to track source positions.
Action: Define the AST structures in src/parser/ast.rs.
File: src/parser/ast.rs
// src/parser/ast.rs
use crate::{Location, TokenType}; // Import Location and TokenType from root crate
/// Represents the top-level structure of a Mermaid diagram.
#[derive(Debug, PartialEq, Eq, Clone)]
pub struct Diagram {
pub diagram_type: DiagramType,
pub statements: Vec<Statement>,
pub location: Location, // Location of the 'graph' or 'flowchart' keyword
}
/// Enumerates the supported types of Mermaid diagrams.
#[derive(Debug, PartialEq, Eq, Clone)]
pub enum DiagramType {
Flowchart(Option<FlowchartDirection>), // e.g., graph TD, flowchart LR
Sequence,
Class,
// Add other diagram types as needed (e.g., Gantt, State, Git, etc.)
}
/// Represents the direction of a flowchart.
#[derive(Debug, PartialEq, Eq, Clone)]
pub enum FlowchartDirection {
TopToBottom, // TD
BottomToTop, // BT
LeftToRight, // LR
RightToLeft, // RL
}
/// Represents a single statement within a Mermaid diagram.
#[derive(Debug, PartialEq, Eq, Clone)]
pub enum Statement {
Node(Node),
Edge(Edge),
Subgraph(Subgraph),
Directive(Directive), // e.g., %%{init: ...}
Comment(String, Location), // e.g., %% This is a comment
EmptyLine(Location), // For tracking formatting, useful for formatter
// Potentially add other statement types like Class members, Sequence messages, etc.
}
/// Represents a node in the diagram.
#[derive(Debug, PartialEq, Eq, Clone)]
pub struct Node {
pub id: String,
pub label: Option<String>, // Optional label for nodes like A[Label]
pub shape: NodeShape,
pub location: Location, // Location of the node ID
pub label_location: Option<Location>, // Location of the label, if present
}
/// Enumerates the possible shapes a node can have.
#[derive(Debug, PartialEq, Eq, Clone)]
pub enum NodeShape {
Rectangle, // A[Text]
RoundedRectangle, // A(Text)
Stadium, // A([Text])
Subroutine, // A[[Text]]
Cylinder, // A[(Text)]
Circle, // A((Text))
Asymmetric, // A>Text]
Rhombus, // A{Text}
Hexagon, // A{{Text}}
Parallelogram, // A[/Text/]
ParallelogramAlt, // A[\Text\]
Trapezoid, // A[/Text\]
TrapezoidAlt, // A[\Text/]
DoubleCircle, // A(((Text)))
// Add other shapes as per Mermaid specification
Default, // Implicit rectangle if no shape specified (e.g., A)
}
/// Represents an edge (connection) between two nodes.
#[derive(Debug, PartialEq, Eq, Clone)]
pub struct Edge {
pub source: String,
pub target: String,
pub label: Option<String>,
pub arrow_type: ArrowType,
pub line_type: LineType,
pub location: Location, // Location of the first token of the edge (source ID)
pub arrow_location: Location, // Location of the arrow/line tokens
pub label_location: Option<Location>, // Location of the label, if present
}
/// Enumerates the types of arrows that can be used in an edge.
#[derive(Debug, PartialEq, Eq, Clone)]
pub enum ArrowType {
None, // ---
ArrowHead, // -->
Cross, // --x
Open, // --o
Bidirectional, // <-->
// Add other arrow types (e.g., ===>, --+>, etc.)
}
/// Enumerates the types of lines that can be used in an edge.
#[derive(Debug, PartialEq, Eq, Clone)]
pub enum LineType {
Solid, // ---
Dotted, // -.->
Thick, // ===
Invisible, // ~~~
// Add other line types
}
/// Represents a subgraph in the diagram.
#[derive(Debug, PartialEq, Eq, Clone)]
pub struct Subgraph {
pub id: String,
pub label: Option<String>, // Subgraph label (e.g., subgraph "My Subgraph")
pub statements: Vec<Statement>,
pub location: Location, // Location of the 'subgraph' keyword
pub label_location: Option<Location>, // Location of the label, if present
}
/// Represents a directive, like `%%{init: ...}`.
#[derive(Debug, PartialEq, Eq, Clone)]
pub struct Directive {
pub content: String, // The full content of the directive
pub location: Location,
}
// Helper to determine node shape from token type
impl NodeShape {
pub fn from_token_type(token_type: &TokenType) -> Option<Self> {
match token_type {
TokenType::NodeOpenBracket => Some(NodeShape::Rectangle),
TokenType::NodeOpenParen => Some(NodeShape::RoundedRectangle),
TokenType::NodeOpenSquareBracketParen => Some(NodeShape::Stadium),
TokenType::NodeOpenDoubleSquareBracket => Some(NodeShape::Subroutine),
TokenType::NodeOpenParenSquareBracket => Some(NodeShape::Cylinder),
TokenType::NodeOpenDoubleParen => Some(NodeShape::Circle),
TokenType::NodeOpenGt => Some(NodeShape::Asymmetric),
TokenType::NodeOpenBrace => Some(NodeShape::Rhombus),
TokenType::NodeOpenDoubleBrace => Some(NodeShape::Hexagon),
TokenType::NodeOpenSlash => Some(NodeShape::Parallelogram),
TokenType::NodeOpenBackslash => Some(NodeShape::ParallelogramAlt),
TokenType::NodeOpenSlashBackslash => Some(NodeShape::Trapezoid),
TokenType::NodeOpenBackslashSlash => Some(NodeShape::TrapezoidAlt),
TokenType::NodeOpenTripleParen => Some(NodeShape::DoubleCircle),
_ => None,
}
}
pub fn get_closing_token_type(&self) -> TokenType {
match self {
NodeShape::Rectangle => TokenType::NodeCloseBracket,
NodeShape::RoundedRectangle => TokenType::NodeCloseParen,
NodeShape::Stadium => TokenType::NodeCloseSquareBracketParen,
NodeShape::Subroutine => TokenType::NodeCloseDoubleSquareBracket,
NodeShape::Cylinder => TokenType::NodeCloseParenSquareBracket,
NodeShape::Circle => TokenType::NodeCloseDoubleParen,
NodeShape::Asymmetric => TokenType::NodeCloseBracket, // Asymmetric closes with ]
NodeShape::Rhombus => TokenType::NodeCloseBrace,
NodeShape::Hexagon => TokenType::NodeCloseDoubleBrace,
NodeShape::Parallelogram => TokenType::NodeCloseSlash,
NodeShape::ParallelogramAlt => TokenType::NodeCloseBackslash,
NodeShape::Trapezoid => TokenType::NodeCloseBackslash, // Trapezoid closes with \]
NodeShape::TrapezoidAlt => TokenType::NodeCloseSlash, // TrapezoidAlt closes with /]
NodeShape::DoubleCircle => TokenType::NodeCloseTripleParen,
NodeShape::Default => TokenType::Identifier, // No explicit closing for default
}
}
}
/// Represents the overall direction of an edge's arrow. For now the parser
/// derives `ArrowType` and `LineType` separately, so this enum is not yet
/// consumed anywhere; it is defined (and exported) so the API can later grow
/// to represent complete arrow tokens if we need more complex structures.
#[derive(Debug, PartialEq, Eq, Clone)]
pub enum EdgeDirection {
Forward,  // e.g., A --> B
Backward, // e.g., A <-- B
Both,     // e.g., A <--> B
}
Explanation:
- `Diagram`: The root of our AST, holding the diagram type and a vector of statements.
- `DiagramType`: An enum to distinguish between different Mermaid diagram kinds. We start with `Flowchart`, `Sequence`, and `Class`.
- `FlowchartDirection`: Specific to flowcharts (TD, LR, etc.).
- `Statement`: An enum representing the different top-level constructs in a diagram (nodes, edges, subgraphs, directives, comments). This is crucial for handling the mixed nature of Mermaid syntax.
- `Node`: Represents a single node with its ID, optional label, and `NodeShape`.
- `NodeShape`: An enum covering all standard Mermaid node shapes. Includes helper methods to map `TokenType` to `NodeShape` and get the corresponding closing token.
- `Edge`: Connects two nodes, with an optional label, `ArrowType`, and `LineType`.
- `ArrowType` and `LineType`: Enums to capture the various arrowheads and line styles.
- `Subgraph`: Represents nested blocks of statements.
- `Directive`: For `%%{init: ...}` style directives.
- `Location`: Crucially, every major AST element stores its `Location` to enable precise error reporting and future formatting.
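The delimiter-to-shape mapping is the part most worth internalizing: the opening token both selects the shape and determines which closing token the parser must then demand. In miniature (a freestanding sketch with a pared-down shape enum, not our crate's `NodeShape`):

```rust
// Freestanding miniature of the NodeShape helpers: the opening delimiter
// picks the shape, and each shape knows the closing delimiter to expect.
#[derive(Debug, PartialEq)]
enum MiniShape {
    Rectangle,        // A[Text]
    RoundedRectangle, // A(Text)
    Rhombus,          // A{Text}
}

impl MiniShape {
    /// Maps an opening delimiter to a shape, if it is one we recognize.
    fn from_open(delim: &str) -> Option<MiniShape> {
        match delim {
            "[" => Some(MiniShape::Rectangle),
            "(" => Some(MiniShape::RoundedRectangle),
            "{" => Some(MiniShape::Rhombus),
            _ => None,
        }
    }

    /// The closing delimiter the parser must expect for this shape.
    fn closing(&self) -> &'static str {
        match self {
            MiniShape::Rectangle => "]",
            MiniShape::RoundedRectangle => ")",
            MiniShape::Rhombus => "}",
        }
    }
}
```

`parse_node` below uses exactly this pattern, via `NodeShape::from_token_type` and `get_closing_token_type`, over `TokenType`s instead of strings.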
2.3. Core Implementation - The Parser
Now, we’ll implement the Parser struct and its methods in src/parser/parser.rs. This will be an incremental process, starting with the basic structure and parsing the overall diagram type, then moving to statements, nodes, and edges.
Action: Define the Parser struct and its basic methods.
File: src/parser/parser.rs
// src/parser/parser.rs
use crate::{
diagnostics::{Diagnostic, DiagnosticCode, DiagnosticEmitter, DiagnosticLevel},
lexer::{Lexer, Token, TokenType},
parser::ast::*, // Import all AST types
Location,
};
use std::iter::Peekable;
use std::vec::IntoIter;
/// The Parser transforms a stream of tokens into an Abstract Syntax Tree (AST).
pub struct Parser<'a> {
lexer: Peekable<IntoIter<Token>>,
current_token: Token,
peek_token: Token,
diagnostics: DiagnosticEmitter<'a>,
source_code: &'a str,
}
impl<'a> Parser<'a> {
/// Creates a new Parser instance.
/// It consumes the lexer's token stream and initializes the current and peek tokens.
pub fn new(lexer: Lexer, source_code: &'a str, diagnostics: DiagnosticEmitter<'a>) -> Self {
let mut token_iter = lexer.into_iter().peekable();
// Prime the current and peek tokens
let current_token = token_iter.next().unwrap_or_else(|| Token {
token_type: TokenType::Eof,
literal: String::new(),
location: Location::new(0, 0, 0, 0),
});
let peek_token = token_iter.next().unwrap_or_else(|| Token {
token_type: TokenType::Eof,
literal: String::new(),
location: Location::new(0, 0, 0, 0),
});
Parser {
lexer: token_iter,
current_token,
peek_token,
diagnostics,
source_code,
}
}
/// Advances the parser's token stream by one.
fn advance_token(&mut self) {
self.current_token = self.peek_token.clone();
self.peek_token = self.lexer.next().unwrap_or_else(|| Token {
token_type: TokenType::Eof,
literal: String::new(),
location: Location::new(self.source_code.len(), self.source_code.len(), 0, 0),
});
}
/// Checks if the current token matches the expected type.
fn current_token_is(&self, token_type: TokenType) -> bool {
self.current_token.token_type == token_type
}
/// Checks if the peek token matches the expected type.
fn peek_token_is(&self, token_type: TokenType) -> bool {
self.peek_token.token_type == token_type
}
/// Consumes the current token if it matches the expected type, otherwise reports an error.
/// Returns the consumed token or an error diagnostic.
fn expect_token(&mut self, expected: TokenType) -> Result<Token, Diagnostic> {
if self.current_token_is(expected) {
let token = self.current_token.clone();
self.advance_token();
Ok(token)
} else {
let diagnostic = Diagnostic::new(
DiagnosticCode::UnexpectedToken,
format!(
"Expected token {:?}, but found {:?}",
expected, self.current_token.token_type
),
self.current_token.location,
DiagnosticLevel::Error,
)
.with_help(format!(
    "Ensure that a {:?} token is used here.",
    expected
));
self.diagnostics.emit(diagnostic.clone());
Err(diagnostic)
}
}
/// Consumes the current token if it matches any of the expected types, otherwise reports an error.
fn expect_any_token(&mut self, expected_types: &[TokenType]) -> Result<Token, Diagnostic> {
if expected_types.contains(&self.current_token.token_type) {
let token = self.current_token.clone();
self.advance_token();
Ok(token)
} else {
let expected_str = expected_types
.iter()
.map(|t| format!("{:?}", t))
.collect::<Vec<_>>()
.join(" or ");
let diagnostic = Diagnostic::new(
DiagnosticCode::UnexpectedToken,
format!(
"Expected one of {}, but found {:?}",
expected_str, self.current_token.token_type
),
self.current_token.location,
DiagnosticLevel::Error,
)
.with_help(format!(
"Ensure that one of the expected tokens is used here."
));
self.diagnostics.emit(diagnostic.clone());
Err(diagnostic)
}
}
/// Parses the entire Mermaid diagram. This is the entry point for parsing.
pub fn parse(&mut self) -> Result<Diagram, Vec<Diagnostic>> {
let mut collected_diagnostics = Vec::new();
// First, parse the diagram type (e.g., 'graph', 'flowchart', 'sequence')
let diagram_type_token = match self.expect_any_token(&[
TokenType::KeywordGraph,
TokenType::KeywordFlowchart,
TokenType::KeywordSequence,
TokenType::KeywordClass,
// Add other diagram type keywords here
]) {
Ok(t) => t,
Err(diag) => {
collected_diagnostics.push(diag);
// Attempt to recover by assuming a default diagram type or skipping until a known keyword
// For now, we'll just return the error. Robust error recovery comes later.
return Err(collected_diagnostics);
}
};
let diagram_type_location = diagram_type_token.location;
let diagram_type = match self.parse_diagram_type(&diagram_type_token) {
Ok(dt) => dt,
Err(diag) => {
collected_diagnostics.push(diag);
// Recovery: assume Flowchart(None)
DiagramType::Flowchart(None)
}
};
let mut statements = Vec::new();
while !self.current_token_is(TokenType::Eof) {
match self.parse_statement() {
Ok(Some(stmt)) => statements.push(stmt),
Ok(None) => { /* No statement parsed, possibly a skipped invalid token */ }
Err(diag) => {
collected_diagnostics.push(diag);
// Error recovery: skip tokens until a likely statement start or EOF
self.synchronize_parser();
}
}
}
if collected_diagnostics.is_empty() {
Ok(Diagram {
diagram_type,
statements,
location: diagram_type_location,
})
} else {
Err(collected_diagnostics)
}
}
/// Parses the specific diagram type and its associated properties (e.g., direction for flowchart).
/// Parses the specific diagram type and its associated properties (e.g., direction for flowchart).
fn parse_diagram_type(&mut self, token: &Token) -> Result<DiagramType, Diagnostic> {
    match token.token_type {
        TokenType::KeywordGraph | TokenType::KeywordFlowchart => {
            // The keyword itself was already consumed by `expect_any_token`
            // in `parse()`, so an optional direction (TD, BT, LR, RL) is now
            // the *current* token.
            let direction = match self.current_token.token_type {
                TokenType::KeywordTD => Some(FlowchartDirection::TopToBottom),
                TokenType::KeywordBT => Some(FlowchartDirection::BottomToTop),
                TokenType::KeywordLR => Some(FlowchartDirection::LeftToRight),
                TokenType::KeywordRL => Some(FlowchartDirection::RightToLeft),
                _ => None,
            };
            match direction {
                Some(dir) => {
                    self.advance_token(); // Consume the direction token
                    Ok(DiagramType::Flowchart(Some(dir)))
                }
                // No explicit direction: flowchart/graph defaults to TD.
                None => Ok(DiagramType::Flowchart(Some(FlowchartDirection::TopToBottom))),
            }
        }
        TokenType::KeywordSequence => Ok(DiagramType::Sequence),
        TokenType::KeywordClass => Ok(DiagramType::Class),
        _ => {
            // Unreachable if `expect_any_token` filtered correctly, but
            // report an error as a fallback.
            Err(Diagnostic::new(
                DiagnosticCode::InvalidDiagramType,
                format!("Unknown diagram type: {:?}", token.literal),
                token.location,
                DiagnosticLevel::Error,
            ))
        }
    }
}
/// Parses a single statement (Node, Edge, Subgraph, Directive, Comment, or EmptyLine).
/// Returns `Ok(Some(Statement))` if a statement was successfully parsed.
/// Returns `Ok(None)` if the current token doesn't start a valid statement but isn't an error
/// (e.g., whitespace tokens that should be ignored at this level).
/// Returns `Err(Diagnostic)` if a syntax error prevents parsing a statement.
fn parse_statement(&mut self) -> Result<Option<Statement>, Diagnostic> {
// Skip whitespace and comments at the statement level for now.
// The lexer already handles most whitespace, but empty lines or comments
// between statements might be passed through.
while self.current_token_is(TokenType::Whitespace) || self.current_token_is(TokenType::Newline) {
self.advance_token();
}
if self.current_token_is(TokenType::Eof) {
return Ok(None);
}
if self.current_token_is(TokenType::DirectiveStart) {
return self.parse_directive().map(Some);
}
if self.current_token_is(TokenType::Comment) {
let comment_token = self.current_token.clone();
self.advance_token();
return Ok(Some(Statement::Comment(comment_token.literal, comment_token.location)));
}
// A statement typically starts with an identifier (for a node or source of an edge)
// or a keyword like 'subgraph'.
if self.current_token_is(TokenType::Identifier) {
let id_token = self.current_token.clone();
if self.peek_token_is(TokenType::Arrow)
|| self.peek_token_is(TokenType::DottedArrow)
|| self.peek_token_is(TokenType::ThickArrow)
|| self.peek_token_is(TokenType::InvisibleArrow)
|| self.peek_token_is(TokenType::BidirectionalArrow)
|| self.peek_token_is(TokenType::EdgeOpenParen) // A---B style
|| self.peek_token_is(TokenType::EdgeOpenBracket) // A--label--B style
{
// This looks like an edge
return self.parse_edge().map(Some);
} else {
// This looks like a node
return self.parse_node().map(Some);
}
} else if self.current_token_is(TokenType::KeywordSubgraph) {
return self.parse_subgraph().map(Some);
}
// If we reach here, it's an unexpected token at the start of a statement.
let diagnostic = Diagnostic::new(
DiagnosticCode::UnexpectedToken,
format!(
"Unexpected token '{:?}' at the start of a statement.",
self.current_token.token_type
),
self.current_token.location,
DiagnosticLevel::Error,
)
.with_help("Expected a node ID, 'subgraph' keyword, a directive, or a comment.");
self.diagnostics.emit(diagnostic.clone());
// Error recovery: advance token to try and find a valid statement start
self.advance_token();
Err(diagnostic)
}
/// Parses a node definition (e.g., `A[Start]`, `B(Process)`, `C`).
fn parse_node(&mut self) -> Result<Node, Diagnostic> {
let id_token = self.expect_token(TokenType::Identifier)?;
let id = id_token.literal.clone();
let mut label: Option<String> = None;
let mut node_shape = NodeShape::Default;
let mut label_location: Option<Location> = None;
// Check for node shape and label
if let Some(shape) = NodeShape::from_token_type(&self.current_token.token_type) {
let open_bracket_token = self.current_token.clone();
self.advance_token(); // Consume opening bracket/paren
// Node label can be a quoted string or unquoted identifier-like text
if self.current_token_is(TokenType::StringLiteral) {
label = Some(self.current_token.literal.clone());
label_location = Some(self.current_token.location);
self.advance_token(); // Consume string literal
} else if self.current_token_is(TokenType::Identifier) {
// Allow identifiers as labels if not quoted (Mermaid often does this)
label = Some(self.current_token.literal.clone());
label_location = Some(self.current_token.location);
self.advance_token(); // Consume identifier as label
} else {
// Empty label is allowed (e.g., A[])
label = Some("".to_string());
label_location = Some(self.current_token.location); // Location of the empty string
}
// Expect closing bracket/paren
let expected_closing_type = shape.get_closing_token_type();
self.expect_token(expected_closing_type)?;
node_shape = shape;
}
Ok(Node {
id,
label,
shape: node_shape,
location: id_token.location,
label_location,
})
}
/// Parses an edge definition (e.g., `A --> B`, `B -- "label" --> C`).
fn parse_edge(&mut self) -> Result<Edge, Diagnostic> {
let source_token = self.expect_token(TokenType::Identifier)?;
let source_id = source_token.literal.clone();
let edge_start_location = source_token.location; // Start location of the edge
let mut label: Option<String> = None;
let mut arrow_type = ArrowType::None; // Default to no arrow
let mut line_type = LineType::Solid; // Default to solid line
let mut arrow_location = self.current_token.location; // Default, will update
// Handle optional label between source and arrow (e.g., A -- label -- B)
if self.current_token_is(TokenType::EdgeOpenParen) || self.current_token_is(TokenType::EdgeOpenBracket) {
let open_token = self.current_token.clone();
self.advance_token(); // Consume '(' or '['
if self.current_token_is(TokenType::StringLiteral) {
label = Some(self.current_token.literal.clone());
self.advance_token(); // Consume label
} else if self.current_token_is(TokenType::Identifier) {
label = Some(self.current_token.literal.clone());
self.advance_token(); // Consume label
} else {
// Empty label
}
// Expect closing token
let expected_close = match open_token.token_type {
TokenType::EdgeOpenParen => TokenType::EdgeCloseParen,
TokenType::EdgeOpenBracket => TokenType::EdgeCloseBracket,
_ => unreachable!(), // Should be caught by the if condition
};
self.expect_token(expected_close)?;
}
// Parse arrow/line type
let arrow_line_token = self.expect_any_token(&[
TokenType::Arrow,
TokenType::DottedArrow,
TokenType::ThickArrow,
TokenType::InvisibleArrow,
TokenType::BidirectionalArrow,
// Add more specific arrow types as they are tokenized
])?;
arrow_location = arrow_line_token.location;
match arrow_line_token.token_type {
TokenType::Arrow => { // e.g., -->, ---
if arrow_line_token.literal.ends_with('>') {
arrow_type = ArrowType::ArrowHead;
} else if arrow_line_token.literal.ends_with('x') { // --x
arrow_type = ArrowType::Cross;
} else if arrow_line_token.literal.ends_with('o') { // --o
arrow_type = ArrowType::Open;
}
line_type = LineType::Solid; // Default for '---' and '-->'
}
TokenType::DottedArrow => { // e.g., -.->
arrow_type = ArrowType::ArrowHead;
line_type = LineType::Dotted;
}
TokenType::ThickArrow => { // e.g., ===>, ===
arrow_type = ArrowType::ArrowHead;
line_type = LineType::Thick;
}
TokenType::InvisibleArrow => { // e.g., ~~~
arrow_type = ArrowType::None;
line_type = LineType::Invisible;
}
TokenType::BidirectionalArrow => { // e.g., <-->
arrow_type = ArrowType::Bidirectional;
line_type = LineType::Solid;
}
_ => { /* Already handled by expect_any_token */ }
}
// Handle an optional label after the arrow (e.g., A --> "label" B). A bare
// identifier only counts as a label when another identifier (the target)
// follows it; otherwise it is the target itself, as in `A --> B`.
if self.current_token_is(TokenType::StringLiteral) {
    label = Some(self.current_token.literal.clone());
    self.advance_token(); // Consume label
} else if self.current_token_is(TokenType::Identifier)
    && self.peek_token_is(TokenType::Identifier)
{
    label = Some(self.current_token.literal.clone());
    self.advance_token(); // Consume label
}
let target_token = self.expect_token(TokenType::Identifier)?;
let target_id = target_token.literal.clone();
Ok(Edge {
source: source_id,
target: target_id,
label,
arrow_type,
line_type,
location: edge_start_location,
arrow_location,
label_location: None, // TODO: Accurately track label location
})
}
/// Parses a subgraph definition.
fn parse_subgraph(&mut self) -> Result<Subgraph, Diagnostic> {
let subgraph_keyword_token = self.expect_token(TokenType::KeywordSubgraph)?;
let subgraph_location = subgraph_keyword_token.location;
let id_token = self.expect_token(TokenType::Identifier)?;
let id = id_token.literal.clone();
let mut label: Option<String> = None;
let mut label_location: Option<Location> = None;
// Optional subgraph label (e.g., subgraph MySubgraph ["My Label"])
if self.current_token_is(TokenType::NodeOpenBracket) {
self.advance_token(); // Consume '['
if self.current_token_is(TokenType::StringLiteral) {
label = Some(self.current_token.literal.clone());
label_location = Some(self.current_token.location);
self.advance_token(); // Consume string literal
} else if self.current_token_is(TokenType::Identifier) {
// Allow unquoted identifiers as labels
label = Some(self.current_token.literal.clone());
label_location = Some(self.current_token.location);
self.advance_token(); // Consume identifier
}
self.expect_token(TokenType::NodeCloseBracket)?; // Consume ']'
}
let mut statements = Vec::new();
// Subgraph body can be enclosed in { } or just indented lines
// For now, let's assume it's followed by statements until 'end' or EOF
// A more robust parser would check for indentation or explicit `{}`
while !self.current_token_is(TokenType::KeywordEnd) && !self.current_token_is(TokenType::Eof) {
match self.parse_statement() {
Ok(Some(stmt)) => statements.push(stmt),
Ok(None) => { /* Continue */ },
Err(diag) => {
self.diagnostics.emit(diag);
self.synchronize_parser();
}
}
}
self.expect_token(TokenType::KeywordEnd)?; // Consume 'end' keyword
Ok(Subgraph {
id,
label,
statements,
location: subgraph_location,
label_location,
})
}
/// Parses a directive (e.g., `%%{init: { ... } }%%`).
fn parse_directive(&mut self) -> Result<Directive, Diagnostic> {
let directive_start_token = self.expect_token(TokenType::DirectiveStart)?;
// The lexer captures the entire directive body as the literal of the
// DirectiveStart token, so no separate closing token needs to be consumed.
let content = directive_start_token.literal.clone();
Ok(Directive {
content,
location: directive_start_token.location,
})
}
/// Implements basic error recovery by advancing tokens until a likely synchronization point.
fn synchronize_parser(&mut self) {
loop {
if self.current_token_is(TokenType::Eof) {
return;
}
match self.current_token.token_type {
    TokenType::KeywordGraph
    | TokenType::KeywordFlowchart
    | TokenType::KeywordSequence
    | TokenType::KeywordClass
    | TokenType::KeywordSubgraph
    | TokenType::KeywordEnd
    | TokenType::Identifier => {
        // These tokens usually start a new statement (or, for 'end',
        // close a subgraph), so they are safe points to resume parsing.
        return;
    }
_ => {
// Skip over current token
self.advance_token();
}
}
}
}
}
Explanation:
- `Parser` struct: Holds a `Peekable` iterator over `Token`s, the `current_token`, `peek_token`, a `DiagnosticEmitter`, and a reference to the `source_code` for context.
- `new()`: Constructor that initializes the parser by taking a `Lexer` and priming the `current_token` and `peek_token`.
- `advance_token()`: Moves the `peek_token` into `current_token` and fetches the next token into `peek_token`.
- `current_token_is()`, `peek_token_is()`: Helper methods for checking token types.
- `expect_token()`: Core parsing primitive. It attempts to consume the `current_token` if its type matches `expected`. If not, it emits an `Error` diagnostic and returns an `Err`. This is crucial for strict syntax validation.
- `expect_any_token()`: Similar to `expect_token` but checks against multiple expected token types.
- `parse()`: The main entry point. It orchestrates the parsing process:
  - First, it parses the `DiagramType` (e.g., `graph TD`).
  - Then, it iteratively calls `parse_statement()` until `Eof` is reached.
  - It collects all diagnostics and returns either a `Diagram` or a `Vec<Diagnostic>`.
- `parse_diagram_type()`: Handles the initial `graph`, `flowchart`, `sequence`, `class` keywords and their optional directions for flowcharts.
- `parse_statement()`: A dispatcher function that identifies what kind of statement the current tokens represent (node, edge, subgraph, directive, or comment). It includes basic error recovery by skipping whitespace and comments.
- `parse_node()`: Parses node definitions, handling IDs, optional labels, and various node shapes.
- `parse_edge()`: Parses edge definitions, including source/target IDs, optional labels (before or after the arrow), and different arrow/line types. This is a complex part due to the many arrow variations in Mermaid.
- `parse_subgraph()`: Handles `subgraph` blocks, parsing the ID, optional label, and nested statements until an `end` keyword.
- `parse_directive()`: Parses `%%{...}%%` directives, assuming the lexer captures the full content.
- `synchronize_parser()`: A simple error recovery mechanism. When a parsing error occurs, it advances the `current_token` until it finds a token that likely marks the beginning of a new valid statement (like an `Identifier` or a keyword). This prevents a single error from cascading into many irrelevant errors.
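The synchronization strategy (often called "panic mode" recovery) is simple enough to demonstrate in isolation: after an error, discard tokens until one that plausibly begins a new statement, so a single mistake doesn't cascade into dozens of spurious diagnostics. A toy version over characters, with `;` standing in as a hypothetical statement boundary:

```rust
// Toy panic-mode recovery: after an error at `pos`, skip forward until a
// token that can begin a new statement (approximated here by ';' acting as
// a statement separator), then resume parsing just after it.
fn synchronize(tokens: &[char], mut pos: usize) -> usize {
    while pos < tokens.len() && tokens[pos] != ';' {
        pos += 1; // discard tokens belonging to the broken statement
    }
    if pos < tokens.len() {
        pos += 1; // step past the separator so parsing resumes on fresh input
    }
    pos
}
```

`synchronize_parser` differs only in its choice of synchronization points: instead of a separator it stops *at* tokens that can start a statement (keywords and identifiers) rather than stepping past them.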
2.4. Testing This Component
To ensure our parser is working correctly, we need to write unit tests. These tests will feed a stream of tokens (generated by our lexer) into the parser and assert that the resulting AST matches our expectations.
Action: Create src/parser/parser_tests.rs, declare it in src/parser/mod.rs (e.g., `#[cfg(test)] mod parser_tests;`), and add basic tests.
File: src/parser/parser_tests.rs
// src/parser/parser_tests.rs
#[cfg(test)]
mod tests {
use crate::parser::ast::*;
use crate::parser::Parser;
use crate::lexer::Lexer;
use crate::diagnostics::DiagnosticEmitter;
use crate::Location;
// Helper function to create a parser from source code
fn setup_parser<'a>(source: &'a str, emitter: DiagnosticEmitter<'a>) -> Parser<'a> {
let lexer = Lexer::new(source);
Parser::new(lexer, source, emitter)
}
#[test]
fn test_parse_simple_flowchart() {
let source = "graph TD\n A[Start] --> B(Process)";
let emitter = DiagnosticEmitter::new(source);
let mut parser = setup_parser(source, emitter.clone());
let result = parser.parse();
assert!(result.is_ok(), "Parsing failed with errors: {:?}", emitter.get_diagnostics());
let diagram = result.unwrap();
assert_eq!(diagram.diagram_type, DiagramType::Flowchart(Some(FlowchartDirection::TopToBottom)));
assert_eq!(diagram.statements.len(), 2); // One node, one edge
// Verify Node A
if let Statement::Node(node_a) = &diagram.statements[0] {
assert_eq!(node_a.id, "A");
assert_eq!(node_a.label, Some("Start".to_string()));
assert_eq!(node_a.shape, NodeShape::Rectangle);
} else {
panic!("Expected Node statement, found {:?}", diagram.statements[0]);
}
// Verify Edge A --> B
if let Statement::Edge(edge_ab) = &diagram.statements[1] {
assert_eq!(edge_ab.source, "A");
assert_eq!(edge_ab.target, "B");
assert_eq!(edge_ab.label, None);
assert_eq!(edge_ab.arrow_type, ArrowType::ArrowHead);
assert_eq!(edge_ab.line_type, LineType::Solid);
} else {
panic!("Expected Edge statement, found {:?}", diagram.statements[1]);
}
assert!(emitter.get_diagnostics().is_empty(), "Unexpected diagnostics: {:?}", emitter.get_diagnostics());
}
#[test]
fn test_parse_flowchart_with_direction_lr() {
let source = "flowchart LR\n Node1 --> Node2";
let emitter = DiagnosticEmitter::new(source);
let mut parser = setup_parser(source, emitter.clone());
let result = parser.parse();
assert!(result.is_ok(), "Parsing failed with errors: {:?}", emitter.get_diagnostics());
let diagram = result.unwrap();
assert_eq!(diagram.diagram_type, DiagramType::Flowchart(Some(FlowchartDirection::LeftToRight)));
assert!(emitter.get_diagnostics().is_empty());
}
#[test]
fn test_parse_node_shapes_and_labels() {
let source = "graph TD\n A[Rect]\n B(Rounded)\n C>Asym]\n D{Rhombus}";
let emitter = DiagnosticEmitter::new(source);
let mut parser = setup_parser(source, emitter.clone());
let result = parser.parse().unwrap();
assert_eq!(result.statements.len(), 4);
if let Statement::Node(a) = &result.statements[0] { assert_eq!(a.shape, NodeShape::Rectangle); assert_eq!(a.label, Some("Rect".to_string())); }
if let Statement::Node(b) = &result.statements[1] { assert_eq!(b.shape, NodeShape::RoundedRectangle); assert_eq!(b.label, Some("Rounded".to_string())); }
if let Statement::Node(c) = &result.statements[2] { assert_eq!(c.shape, NodeShape::Asymmetric); assert_eq!(c.label, Some("Asym".to_string())); }
if let Statement::Node(d) = &result.statements[3] { assert_eq!(d.shape, NodeShape::Rhombus); assert_eq!(d.label, Some("Rhombus".to_string())); }
assert!(emitter.get_diagnostics().is_empty());
}
#[test]
fn test_parse_edge_labels_and_types() {
let source = "graph TD\n A -- Label A --> B\n B --o C\n C ===> D\n D -.-> E\n E ~~~ F";
let emitter = DiagnosticEmitter::new(source);
let mut parser = setup_parser(source, emitter.clone());
let result = parser.parse().unwrap();
assert_eq!(result.statements.len(), 5);
if let Statement::Edge(e) = &result.statements[0] { assert_eq!(e.label, Some("Label A".to_string())); assert_eq!(e.arrow_type, ArrowType::ArrowHead); }
if let Statement::Edge(e) = &result.statements[1] { assert_eq!(e.label, None); assert_eq!(e.arrow_type, ArrowType::Open); }
if let Statement::Edge(e) = &result.statements[2] { assert_eq!(e.label, None); assert_eq!(e.arrow_type, ArrowType::ArrowHead); assert_eq!(e.line_type, LineType::Thick); }
if let Statement::Edge(e) = &result.statements[3] { assert_eq!(e.label, None); assert_eq!(e.arrow_type, ArrowType::ArrowHead); assert_eq!(e.line_type, LineType::Dotted); }
if let Statement::Edge(e) = &result.statements[4] { assert_eq!(e.label, None); assert_eq!(e.arrow_type, ArrowType::None); assert_eq!(e.line_type, LineType::Invisible); }
assert!(emitter.get_diagnostics().is_empty());
}
#[test]
fn test_parse_subgraph() {
let source = r#"
graph TD
subgraph MySubgraph ["A Subgraph"]
A --> B
end
B --> C
"#;
let emitter = DiagnosticEmitter::new(source);
let mut parser = setup_parser(source, emitter.clone());
let result = parser.parse().unwrap();
assert_eq!(result.statements.len(), 2);
if let Statement::Subgraph(subgraph) = &result.statements[0] {
assert_eq!(subgraph.id, "MySubgraph");
assert_eq!(subgraph.label, Some("A Subgraph".to_string()));
assert_eq!(subgraph.statements.len(), 1);
if let Statement::Edge(edge) = &subgraph.statements[0] {
assert_eq!(edge.source, "A");
assert_eq!(edge.target, "B");
} else { panic!("Expected edge in subgraph"); }
} else { panic!("Expected subgraph"); }
if let Statement::Edge(edge) = &result.statements[1] {
assert_eq!(edge.source, "B");
assert_eq!(edge.target, "C");
} else { panic!("Expected edge outside subgraph"); }
assert!(emitter.get_diagnostics().is_empty());
}
#[test]
fn test_parse_directive() {
let source = "%%{init: { 'theme': 'forest' } }%%\ngraph TD\nA --> B";
let emitter = DiagnosticEmitter::new(source);
let mut parser = setup_parser(source, emitter.clone());
let result = parser.parse().unwrap();
assert_eq!(result.statements.len(), 2);
if let Statement::Directive(directive) = &result.statements[0] {
assert_eq!(directive.content, "{init: { 'theme': 'forest' } }"); // Lexer should capture content without %%{}%%
} else { panic!("Expected directive"); }
assert!(emitter.get_diagnostics().is_empty());
}
#[test]
fn test_parse_syntax_error_missing_diagram_type() {
let source = "A --> B"; // Missing 'graph TD'
let emitter = DiagnosticEmitter::new(source);
let mut parser = setup_parser(source, emitter.clone());
let result = parser.parse();
assert!(result.is_err());
let diagnostics = emitter.get_diagnostics();
assert_eq!(diagnostics.len(), 1);
assert_eq!(diagnostics[0].code, DiagnosticCode::UnexpectedToken);
assert!(diagnostics[0].message.contains("Expected one of KeywordGraph or KeywordFlowchart or KeywordSequence or KeywordClass"));
}
#[test]
fn test_parse_syntax_error_unmatched_bracket() {
let source = "graph TD\n A[Start --> B"; // Missing closing bracket
let emitter = DiagnosticEmitter::new(source);
let mut parser = setup_parser(source, emitter.clone());
let result = parser.parse();
assert!(result.is_err()); // Should return an error
let diagnostics = emitter.get_diagnostics();
assert_eq!(diagnostics.len(), 1);
assert_eq!(diagnostics[0].code, DiagnosticCode::UnexpectedToken);
assert!(diagnostics[0].message.contains("Expected token NodeCloseBracket"));
}
#[test]
fn test_parse_empty_input() {
let source = "";
let emitter = DiagnosticEmitter::new(source);
let mut parser = setup_parser(source, emitter.clone());
let result = parser.parse();
assert!(result.is_err()); // Empty input is invalid as it lacks a diagram type
let diagnostics = emitter.get_diagnostics();
assert_eq!(diagnostics.len(), 1);
assert_eq!(diagnostics[0].code, DiagnosticCode::UnexpectedToken);
}
}
Action: Add `mod parser_tests;` to `src/parser/mod.rs` (inside a `#[cfg(test)]` block).
File: `src/parser/mod.rs`
```rust
// src/parser/mod.rs
pub mod ast;
pub mod parser;

pub use self::parser::Parser;
pub use self::ast::{
    Diagram, DiagramType, Statement, Node, Edge, Subgraph, ArrowType, LineType, NodeShape,
    FlowchartDirection,
};

#[cfg(test)]
mod parser_tests; // Add this line
```
Explanation:
- The `parser_tests.rs` file contains various unit tests that cover successful parsing of different Mermaid constructs (simple flowcharts, nodes, edges, subgraphs, directives) as well as tests for expected syntax errors.
- The `setup_parser` helper streamlines test setup.
- We use `DiagnosticEmitter` to capture diagnostics during parsing and assert their presence or absence.
- `assert!(result.is_ok())` and `assert!(result.is_err())` check whether parsing succeeded or failed as expected.
- Assertions on the `Diagram` and `Statement` types verify the correctness of the generated AST.
To run these tests, navigate to your project root and execute `cargo test`. All tests should pass, indicating that our parser correctly constructs the AST for valid Mermaid input and reports errors for invalid input.
3. Production Considerations
For a production-ready parser, several factors beyond basic functionality are critical:
- Robust Error Recovery: Our current `synchronize_parser` is a basic starting point. A production-grade parser would employ more sophisticated error recovery techniques (e.g., panic-mode recovery, phrase-level recovery) to minimize cascading errors and allow parsing to continue as far as possible, providing a comprehensive list of issues rather than stopping at the first one. This is key for a linter/formatter.
- Performance Optimization:
  - Zero-Copy Parsing: For very large Mermaid files, minimizing string allocations (`.clone()`) during parsing is crucial. This can involve using string slices (`&str`) that reference the original source code where possible, or employing arenas for AST node allocation. Our current implementation does some cloning for `literal` values; a future optimization would be to reference the `source_code` directly.
  - Efficient Lookahead: `Peekable` is generally efficient, but for very deep lookaheads, custom token buffering might be considered. However, recursive descent usually only needs one or two tokens of lookahead, so `Peekable` is often sufficient.
- Memory Usage: The AST itself can consume significant memory for very large diagrams. Consider techniques like interning strings (using `Rc<str>` or a dedicated string table) for repeated node IDs or labels to reduce the memory footprint.
- Security (DoS Prevention): Malformed input should never cause the parser to panic or enter an infinite loop, which could be exploited for denial of service. Our `expect_token` and `synchronize_parser` help prevent this by always advancing tokens and reporting errors. Fuzz testing (covered in a later chapter) is essential here.
- Logging and Metrics: Integrate logging to trace parser behavior, especially during error recovery or for performance bottlenecks. Metrics could track parsing time, AST size, and error rates.
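As a concrete illustration of the interning idea mentioned above, here is a minimal sketch. The `Interner` type and its API are hypothetical, not part of this chapter's code; the point is that repeated node IDs share one allocation via `Rc<str>`:

```rust
use std::collections::HashSet;
use std::rc::Rc;

// Hypothetical minimal string interner: repeated node IDs or labels share
// a single heap allocation instead of one String per AST occurrence.
#[derive(Default)]
struct Interner {
    table: HashSet<Rc<str>>,
}

impl Interner {
    fn intern(&mut self, s: &str) -> Rc<str> {
        // HashSet::get can borrow the Rc<str> key as &str (via Borrow),
        // so lookups on already-interned strings allocate nothing.
        if let Some(existing) = self.table.get(s) {
            return Rc::clone(existing);
        }
        let rc: Rc<str> = Rc::from(s);
        self.table.insert(Rc::clone(&rc));
        rc
    }
}

fn main() {
    let mut interner = Interner::default();
    let a1 = interner.intern("A");
    let a2 = interner.intern("A");
    // Both handles point at the same allocation.
    assert!(Rc::ptr_eq(&a1, &a2));
    assert_eq!(interner.table.len(), 1);
    println!("interned {} unique string(s)", interner.table.len());
}
```

Storing `Rc<str>` in AST nodes keeps cloning cheap (a reference-count bump) while the source text itself is shared.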
4. Code Review Checkpoint
At this point, we have successfully implemented the core parsing logic and the foundational Abstract Syntax Tree for our Mermaid analyzer.
Summary of what was built:
src/parser/ast.rs: Defined the strongly typed AST structures (Diagram,DiagramType,Statement,Node,Edge,Subgraph,Directive,NodeShape,ArrowType,LineType,FlowchartDirection) to represent Mermaid diagrams. Each AST node includesLocationinformation.src/parser/parser.rs: Implemented theParserstruct using a recursive descent approach. It consumes tokens from theLexerand constructs the AST. It includes methods for token lookahead, consumption, error reporting viaDiagnosticEmitter, and basic error recovery.src/parser/parser_tests.rs: Added unit tests to validate the parser’s functionality for various valid and invalid Mermaid inputs, ensuring correct AST generation and error reporting.
Files created/modified:
src/lib.rs(modified to includeparsermodule and re-exports)src/parser/mod.rs(new module, defines public interface)src/parser/ast.rs(new file, defines AST data structures)src/parser/parser.rs(new file, implements the parser logic)src/parser/parser_tests.rs(new file, contains unit tests for the parser)
How it integrates:
The Parser now seamlessly integrates with the Lexer (from Chapter 2) by taking its Token stream as input. It also leverages the DiagnosticEmitter (from Chapter 2) to report parsing errors, enriching our tool’s feedback mechanism. The output of the parser, the Diagram AST, is now ready to be consumed by the next stage of our pipeline: the validator.
5. Common Issues & Solutions
Developers working on parsers often encounter similar challenges. Here are a few common ones and how to address them:
Issue: Infinite Loops in Recursive Descent.
- Problem: A parsing function might call itself, or another function that eventually calls it back, without consuming any tokens. This often happens with optional or repetitive grammar rules.
- Example: `parse_statement` calls `parse_node`, but `parse_node` fails to consume a token and returns, leading `parse_statement` to call it again with the same token.
- Solution:
  - Always consume tokens: Ensure that every parsing function, especially those that can recurse, consumes at least one token when it successfully matches a part of the grammar.
  - Clear termination conditions: For loops (e.g., `while !self.current_token_is(TokenType::Eof)`), ensure the loop condition will eventually become false.
  - Error recovery: When `expect_token` fails, the parser should advance past the problematic token (`self.advance_token()`) or use a synchronization strategy to avoid re-evaluating the same problematic token. Our `synchronize_parser` helps here.
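One cheap way to turn a would-be infinite loop into a loud failure is to assert progress on every iteration of the statement loop. A hedged sketch, with a plain index and string tokens standing in for the parser's real cursor and token stream:

```rust
// Each iteration of the statement loop must consume at least one token;
// if it does not, we bail out with an error instead of spinning forever.
fn parse_statements(tokens: &[&str]) -> Result<usize, String> {
    let mut pos = 0;
    let mut statements = 0;
    while pos < tokens.len() && tokens[pos] != "EOF" {
        let before = pos;
        // A real parser would dispatch to parse_node / parse_edge here;
        // this sketch treats every token as a one-token "statement".
        pos += 1;
        statements += 1;
        // Progress guard: a grammar bug that consumes nothing is detected,
        // not looped on.
        if pos == before {
            return Err(format!("parser made no progress at token {before}"));
        }
    }
    Ok(statements)
}

fn main() {
    assert_eq!(parse_statements(&["A", "-->", "B", "EOF"]), Ok(3));
    println!("parsed without stalling");
}
```

In debug builds the guard costs one comparison per statement; it can also be a `debug_assert!` if the overhead matters in release builds.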
Issue: Off-by-One Errors in Token Consumption/Lookahead.
- Problem: The parser might consume too many or too few tokens, leading to incorrect parsing or missed errors. For example, `peek_token` is checked but `current_token` isn't advanced correctly.
- Example: For `A -- "label" --> B`, if the parser consumes `A` and `-->` but skips `"label"`, the AST will be incorrect.
- Solution:
  - Double-check `advance_token()` calls: Ensure `advance_token()` is called precisely when a token is matched and consumed.
  - Unit tests for each rule: Write specific tests for each grammar rule (node, edge, subgraph) to verify token consumption.
  - Trace parsing: Temporarily add `println!` statements in `advance_token()` and `expect_token()` to see the token stream being processed.
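The peek/advance discipline is easiest to audit when the cursor has exactly two operations: one that never consumes and one that consumes exactly one token. A minimal sketch (the `Cursor` type is illustrative, not this chapter's actual `Parser` API):

```rust
// Minimal token cursor: peek() never consumes; advance() consumes exactly
// one token. Each grammar rule can then be audited for exactly one
// advance() per matched token.
struct Cursor<'a> {
    tokens: &'a [&'a str],
    pos: usize,
}

impl<'a> Cursor<'a> {
    fn peek(&self) -> Option<&'a str> {
        self.tokens.get(self.pos).copied()
    }

    fn advance(&mut self) -> Option<&'a str> {
        let tok = self.peek();
        if tok.is_some() {
            self.pos += 1;
        }
        tok
    }
}

fn main() {
    let mut cursor = Cursor { tokens: &["A", "-->", "B"], pos: 0 };
    assert_eq!(cursor.peek(), Some("A"));    // lookahead does not move the cursor
    assert_eq!(cursor.peek(), Some("A"));    // ...no matter how often it is called
    assert_eq!(cursor.advance(), Some("A")); // consumption moves it exactly once
    assert_eq!(cursor.peek(), Some("-->"));
    println!("cursor at position {}", cursor.pos);
}
```

Routing all token access through two such methods makes off-by-one bugs show up as a single misplaced `advance()` call rather than scattered index arithmetic.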
Issue: Ambiguity in Grammar Rules.
- Problem: Sometimes the same sequence of tokens can be interpreted in multiple ways by different parsing rules. This is less common with hand-written recursive descent if the grammar is LL(1) (parsable with one token of lookahead), but it can arise in complex grammars.
- Example: Is `A` a node ID or the start of an edge? Compare `A --> B` with `A[Label]`.
- Solution:
  - Prioritization: In recursive descent, the order of `if`/`else if` statements in functions like `parse_statement` implies prioritization. For instance, checking for an edge first (`if self.peek_token_is(TokenType::Arrow)`) before defaulting to a node (`else if self.current_token_is(TokenType::Identifier)`) resolves this.
  - More lookahead: If one-token lookahead isn't enough, `peek_token` (two-token lookahead) helps. If even more is needed, a custom buffer might be required, but it adds complexity.
  - Grammar analysis: Referencing the official Mermaid grammar meticulously helps identify and resolve ambiguities.
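The prioritization point can be sketched as a tiny classifier: given the current token and one token of lookahead, the edge interpretation is checked before the node fallback. Token names here are illustrative strings, not the real `TokenType`:

```rust
#[derive(Debug, PartialEq)]
enum StatementKind {
    Edge,
    Node,
}

// One-token-lookahead dispatch: the edge case is tested first, so
// `A --> B` classifies as an edge even though `A` alone would be a node.
fn classify(current: &str, peek: Option<&str>) -> Option<StatementKind> {
    if current != "Identifier" {
        return None; // not a statement this sketch knows about
    }
    match peek {
        Some("Arrow") => Some(StatementKind::Edge), // edge wins on lookahead
        _ => Some(StatementKind::Node),             // otherwise a standalone node
    }
}

fn main() {
    assert_eq!(classify("Identifier", Some("Arrow")), Some(StatementKind::Edge));
    assert_eq!(classify("Identifier", Some("NodeOpenBracket")), Some(StatementKind::Node));
    assert_eq!(classify("KeywordEnd", None), None);
    println!("dispatch order resolves the ambiguity");
}
```

Because the arrow check comes first, the shared prefix `A` never has to be re-parsed; the ambiguity is resolved before any rule commits.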
6. Testing & Verification
To verify the work of this chapter, you should run the unit tests we’ve created and consider adding more “golden tests.”
Run Unit Tests:
- Navigate to your project root in the terminal.
- Execute `cargo test`.
- All tests in `src/parser/parser_tests.rs` should pass. This confirms that the parser correctly builds the AST for a range of valid Mermaid inputs and reports the expected errors for invalid ones.
Golden Tests (Input to Expected Output):
- Beyond unit tests, a powerful technique is "golden testing" or "snapshot testing." For the parser, this means:
  - Have a collection of `.mmd` (Mermaid) files.
  - For each `.mmd` file, have a corresponding `.ast.txt` file that contains the `Debug` representation of the expected AST.
  - The test would:
    - Read the `.mmd` file.
    - Lex it.
    - Parse it, getting the `Diagram` AST.
    - Compare the `format!("{:?}", diagram)` output with the content of the `.ast.txt` file.
- This ensures that any changes to the parser don't inadvertently alter the AST structure for previously working inputs. Golden testing will be integrated into our comprehensive testing strategy in a later chapter.
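The steps above can be sketched as a small helper. Note that `check_golden` and the closure it takes are hypothetical names; the closure stands in for the real lex-then-parse pipeline producing `format!("{:?}", diagram)`:

```rust
use std::fs;
use std::path::Path;

// Compare a parse result's Debug output against the stored .ast.txt
// snapshot sitting next to the .mmd file. `parse_to_debug` stands in
// for the real lex + parse pipeline from this chapter.
fn check_golden(mmd_path: &Path, parse_to_debug: impl Fn(&str) -> String) -> Result<(), String> {
    let source = fs::read_to_string(mmd_path).map_err(|e| e.to_string())?;
    let actual = parse_to_debug(&source);
    let expected = fs::read_to_string(mmd_path.with_extension("ast.txt"))
        .map_err(|e| e.to_string())?;
    if actual.trim() == expected.trim() {
        Ok(())
    } else {
        Err(format!("snapshot mismatch for {}", mmd_path.display()))
    }
}

fn main() {
    // Set up a throwaway .mmd / .ast.txt pair in the temp directory.
    let dir = std::env::temp_dir();
    let mmd = dir.join("golden_sample.mmd");
    fs::write(&mmd, "graph TD\nA --> B").unwrap();
    fs::write(dir.join("golden_sample.ast.txt"), "FAKE-AST").unwrap();

    // A stand-in "parser" returning a fixed snapshot string.
    assert!(check_golden(&mmd, |_src| "FAKE-AST".to_string()).is_ok());
    assert!(check_golden(&mmd, |_src| "DIFFERENT".to_string()).is_err());
    println!("golden comparison works");
}
```

A real harness would iterate over a `tests/golden/` directory and report every mismatching file, not just the first; crates such as `insta` automate this pattern, but the hand-rolled version above shows the mechanics.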
Manual Inspection:
- For a few complex Mermaid diagrams, you can temporarily add `println!("{:?}", diagram)` after parsing in `main.rs` (or in a test) to visually inspect the generated AST and ensure it accurately reflects the input structure.
At this stage, you should be confident that your tool can now transform raw Mermaid code into a structured, machine-readable Abstract Syntax Tree, laying the groundwork for all subsequent analysis and transformation tasks.
7. Summary & Next Steps
Congratulations! In this chapter, you’ve taken a significant step in building our production-grade Mermaid analyzer:
- We designed and implemented the Abstract Syntax Tree (AST), a robust, strongly typed representation of Mermaid diagrams, capturing their structure and semantic information.
- We developed the Parser module, which consumes the token stream from our lexer and constructs this AST using a recursive descent approach.
- We incorporated error reporting using our `DiagnosticEmitter` and implemented basic error recovery to make the parser resilient to malformed input.
- We wrote unit tests to ensure the correctness and reliability of our parser, covering various valid and invalid Mermaid syntax examples.
The generated AST is now the authoritative representation of the Mermaid diagram. It’s clean, structured, and ready for deeper analysis.
In the next chapter (Chapter 4: Implementing the Semantic Validator), we will leverage this AST to perform strict validation. We’ll identify not only syntax errors (which the parser already catches) but also semantic errors such as duplicate node IDs, undefined references in edges, illegal nesting, and other logical inconsistencies that cannot be caught during the lexical or parsing phases. This will bring us closer to a truly robust and reliable Mermaid linter.