How a Regex Becomes a Finite Automaton

March 15, 20263 min read
computer-scienceinteractive

Table Of Contents

  1. Tokenization
  2. Parsing
  3. NFA with Lambda Transitions
  4. NFA
  5. DFA
  6. State-Minimized DFA

When you write a regex like (a|b)*abb, you are not writing a program. You are describing a language: the set of all strings that match. Turning that description into something a computer can actually run requires a pipeline that goes through four different representations. This post walks through each one.

Type a pattern to follow along. Everything below updates live.

# Tokenization

Before the pattern can be understood structurally, it has to be broken into tokens: individual symbols, quantifiers, parentheses, and operators. Your pattern (a|b)*abb produces 9 tokens.

The regular expression is broken down into tokens that represent the basic building blocks. Each token has a specific meaning and will be used to build the abstract syntax tree.

1
Left ParenthesisPosition 0

Left parenthesis groups expressions together and affects operator precedence.

2
LiteralPosition 1
Value: "a"

A literal character matches exactly itself in the input string.

3
Union (|)Position 2

The union operator (|) matches either the expression on its left or right.

4
LiteralPosition 3
Value: "b"

A literal character matches exactly itself in the input string.

5
Right ParenthesisPosition 4

Right parenthesis closes a group and restores the previous precedence.

6
Kleene Star (*)Position 5

The Kleene star (*) matches zero or more occurrences of the preceding element.

7
LiteralPosition 6
Value: "a"

A literal character matches exactly itself in the input string.

8
LiteralPosition 7
Value: "b"

A literal character matches exactly itself in the input string.

9
LiteralPosition 8
Value: "b"

A literal character matches exactly itself in the input string.

# Parsing

The token list is fed into a recursive descent parser that builds an Abstract Syntax Tree (AST). The tree encodes precedence and grouping: concatenation binds tighter than alternation (|), and quantifiers bind tightest of all.

The tokens are parsed into an Abstract Syntax Tree (AST) that represents the hierarchical structure of the regular expression. Click on any node to see detailed information about it.

# NFA with Lambda Transitions

Thompson’s construction algorithm converts the AST into an NFA-Lambda: a non-deterministic finite automaton that allows lambda (empty/epsilon) transitions, which are moves between states without consuming any input character.

Your pattern produces an NFA-Lambda with 11 states.

Loading...Rendering automaton...

The start state is highlighted in blue and accepting states have orange double circles. You can click any state to highlight it and its transitions.

# NFA

Lambda transitions are convenient to construct but inconvenient to execute. The next step eliminates them by computing the lambda-closure of every state: the set of states reachable from it via lambda transitions alone.

Compute Λ-closure of every state — all states reachable via Λ-transitions alone:

StateΛ-closure
0 (start){0, 1, 2, 4, 7}
1{1, 2, 4}
2{2}
3{1, 2, 3, 4, 6, 7}
4{4}
5{1, 2, 4, 5, 6, 7}
6{1, 2, 4, 6, 7}
7{7}
8{8}
9{9}
10 (accept){10}

Build new transitions: δ₁(q, a) = Λ(δ(Λ(q), a)) — from state q, expand via Λ-closure, follow a-transitions, then expand again. Then remove all Λ-transitions.

Λ(q₀) = {0, 1, 2, 4, 7} contains no accepting state, so the accepting states remain unchanged.

After removing lambdas, the NFA has 11 states.

Loading...Rendering automaton...

# DFA

An NFA can be in multiple states at once, which makes it hard to implement efficiently. The subset construction algorithm converts it to a DFA where each state represents a set of NFA states. The number of possible DFA states is exponential in theory (2^n for n NFA states), but in practice most are unreachable.

Your NFA’s 11 states produce a DFA with 5 states. The mapping below shows which NFA states each DFA state corresponds to — hover a state to highlight it across both diagrams.

Loading...Rendering automaton...

State Mapping

When converting from an NFA to a DFA, each DFA state represents a set of NFA states.

You can click on a row to highlight the corresponding states in the automaton.

DFA StateNFA States
00
11, 2, 4, 5, 6, 7
21, 2, 3, 4, 6, 7, 8
31, 2, 4, 5, 6, 7, 9
41, 2, 4, 5, 6, 7, 10

# State-Minimized DFA

The DFA may contain redundant states: pairs that behave identically for every possible input string. The table-filling algorithm finds and merges them. Two states are equivalent if and only if they always accept or reject together for every suffix.

The distinguishability matrix below shows which state pairs are marked (distinguishable). Unmarked pairs are equivalent and get merged.

Distinguishability Matrix

01234
0
1
2xx
3xxx
4xxxx

Each cell compares a pair of DFA states. Marked cells indicate pairs that can be distinguished during minimization.

After minimization, the DFA reduces to 4 states.

Loading...Rendering automaton...

State Minimization Groups

To minimize the states, we group equivalent states together.

You can click on a row to highlight the corresponding states in the automaton.

Final StateMerged States
00, 1
12
23
34

The minimized DFA is the smallest possible machine for this language. It is what most regex engines ultimately run against input text.