When you write a regex like (a|b)*abb, you are not writing a program. You are describing a language: the set of all strings that match. Turning that description into something a computer can actually run requires a pipeline that goes through four different representations. This post walks through each one.
Type a pattern to follow along. Everything below updates live.
# Tokenization
Before the pattern can be understood structurally, it has to be broken into tokens: individual symbols, quantifiers, parentheses, and operators. Your pattern (a|b)*abb
The regular expression is broken down into tokens that represent the basic building blocks. Each token has a specific meaning and will be used to build the abstract syntax tree.
Left parenthesis groups expressions together and affects operator precedence.
A literal character matches exactly itself in the input string.
The union operator (|) matches either the expression on its left or right.
A literal character matches exactly itself in the input string.
Right parenthesis closes a group and restores the previous precedence.
The Kleene star (*) matches zero or more occurrences of the preceding element.
A literal character matches exactly itself in the input string.
A literal character matches exactly itself in the input string.
A literal character matches exactly itself in the input string.
# Parsing
The token list is fed into a recursive descent parser that builds an Abstract Syntax Tree (AST). The tree encodes precedence and grouping: concatenation binds tighter than alternation (|), and quantifiers bind tightest of all.
The tokens are parsed into an Abstract Syntax Tree (AST) that represents the hierarchical structure of the regular expression. Click on any node to see detailed information about it.
# NFA with Lambda Transitions
Thompson’s construction algorithm converts the AST into an NFA-Lambda: a non-deterministic finite automaton that allows lambda (empty/epsilon) transitions, which are moves between states without consuming any input character.
Your pattern produces an NFA-Lambda with
The start state is highlighted in blue and accepting states have orange double circles. You can click any state to highlight it and its transitions.
# NFA
Lambda transitions are convenient to construct but inconvenient to execute. The next step eliminates them by computing the lambda-closure of every state: the set of states reachable from it via lambda transitions alone.
Compute Λ-closure of every state — all states reachable via Λ-transitions alone:
| State | Λ-closure |
|---|---|
| 0 (start) | {0, 1, 2, 4, 7} |
| 1 | {1, 2, 4} |
| 2 | {2} |
| 3 | {1, 2, 3, 4, 6, 7} |
| 4 | {4} |
| 5 | {1, 2, 4, 5, 6, 7} |
| 6 | {1, 2, 4, 6, 7} |
| 7 | {7} |
| 8 | {8} |
| 9 | {9} |
| 10 (accept) | {10} |
Build new transitions: δ₁(q, a) = Λ(δ(Λ(q), a)) — from state q, expand via Λ-closure, follow a-transitions, then expand again. Then remove all Λ-transitions.
After removing lambdas, the NFA has
# DFA
An NFA can be in multiple states at once, which makes it hard to implement efficiently. The subset construction algorithm converts it to a DFA where each state represents a set of NFA states. The number of possible DFA states is exponential in theory (2^n for n NFA states), but in practice most are unreachable.
Your NFA’s
State Mapping
When converting from an NFA to a DFA, each DFA state represents a set of NFA states.
You can click on a row to highlight the corresponding states in the automaton.
| DFA State | NFA States |
|---|---|
| 0 | 0 |
| 1 | 1, 2, 4, 5, 6, 7 |
| 2 | 1, 2, 3, 4, 6, 7, 8 |
| 3 | 1, 2, 4, 5, 6, 7, 9 |
| 4 | 1, 2, 4, 5, 6, 7, 10 |
# State-Minimized DFA
The DFA may contain redundant states: pairs that behave identically for every possible input string. The table-filling algorithm finds and merges them. Two states are equivalent if and only if they always accept or reject together for every suffix.
The distinguishability matrix below shows which state pairs are marked (distinguishable). Unmarked pairs are equivalent and get merged.
Distinguishability Matrix
| 0 | 1 | 2 | 3 | 4 | |
|---|---|---|---|---|---|
| 0 | |||||
| 1 | |||||
| 2 | x | x | |||
| 3 | x | x | x | ||
| 4 | x | x | x | x |
Each cell compares a pair of DFA states. Marked cells indicate pairs that can be distinguished during minimization.
After minimization, the DFA reduces to
State Minimization Groups
To minimize the states, we group equivalent states together.
You can click on a row to highlight the corresponding states in the automaton.
| Final State | Merged States |
|---|---|
| 0 | 0, 1 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
The minimized DFA is the smallest possible machine for this language. It is what most regex engines ultimately run against input text.