RL Quantum Circuit Routing

Deep Reinforcement Learning for Quantum Circuit Transpilation

A D3QN+PER agent that learns to route quantum circuits on hardware topologies, minimizing SWAP gate overhead compared to IBM's SABRE compiler.

Ratio 0.969 (Beats SABRE by 3.1%) · Benchmark: 42% Wins / 19% Losses over 150 circuits · 100% Completion · 3 Hardware Topologies

Results · Benchmark · Architecture · How It Works · Getting Started · Training · Documentation

Overview

Quantum computers can only execute two-qubit gates between physically adjacent qubits. Real quantum circuits contain gates between arbitrary qubit pairs. Quantum circuit routing (transpilation) solves this by inserting SWAP gates to move qubits into adjacent positions before each gate can execute. Fewer SWAPs = shorter circuits = less decoherence = higher fidelity results.

The industry-standard solution is SABRE (Li et al., 2019), a heuristic search algorithm used by IBM's Qiskit compiler. SABRE is fast and produces good results, but as a greedy heuristic, it can miss globally optimal routing strategies.

Finding the minimum number of SWAPs for a given circuit is NP-hard (Siraichi et al., 2018) — no known polynomial-time algorithm can compute the optimal solution. For small instances (5-10 qubits), exact methods like ILP/SAT solvers or A* search (Botea et al., 2018) can find optima, but for real hardware sizes these are intractable. Both SABRE and our agent produce approximate solutions.

This project trains a deep RL agent to learn routing strategies end-to-end, directly from the circuit structure and hardware topology. The agent:

Observes the hardware graph, current qubit mapping, pending gate demands, and routing progress via a 5-channel spatial state
Selects SWAP operations on hardware edges using a Dueling CNN policy
Learns from millions of routing episodes via Double DQN with Prioritized Experience Replay
Generalizes across multiple hardware topologies with a single network

    Quantum Circuit (logical)          Hardware Topology (physical)
    +-------------------+              +-------------------------+
    | q0 --X--          |              |    0 --- 1 --- 2        |
    |      |            |              |    |           |        |
    | q1 --X-- X--      |     SWAP     |    3 --- 4 --- 5        |
    |          |        |  --------->  |    |           |        |
    | q2 ---- X-- X--   |  insertion   |    6 --- 7 --- 8        |
    |             |     |              |   (grid_3x3: 9q, 12e)   |
    | q3 -------- X--   |              +-------------------------+
    +-------------------+
    Gates between non-adjacent          Agent inserts SWAPs to make
    qubits can't execute directly       all gates executable

Results

Agent vs SABRE: Routing Comparison

The agent (blue) and SABRE (orange) route the same quantum circuit on heavy_hex_19. Each frame shows one SWAP operation, with gates executing automatically when qubits become adjacent.

Run 029 / fine-tuned model (best checkpoint, ep17k): Agent uses 176 SWAPs vs SABRE's 200 — ratio 0.880 on this circuit

Best Result: Run 029 — Fine-Tuned (heavy_hex_19)

Metric	Value
Best Swap Ratio	0.969 (beats SABRE by 3.1%)
Completion Rate	100% (all circuits fully routed)
Agent SWAPs (avg)	180.2
SABRE SWAPs (avg)	186.0
Method	Fine-tune Run 019 best checkpoint @ LR=1e-5

Fine-tuning loads the best checkpoint from a full training run and refines it with a low learning rate — only 20k episodes / 4.5h wall time.

Agent vs SABRE eval comparison — fine-tuned model

Fine-tuned model evaluation (ep17k, ratio 0.969): per-circuit SWAP count comparison between agent (blue) and SABRE (orange)

Best Multi-Topology: Run 026 (3 topologies, 100k episodes)

A single network learns to route circuits across 3 different hardware topologies simultaneously:

Topology	Qubits	Edges	Swap Ratio	vs SABRE
linear_5	5	4	0.890	Beats SABRE by 11%
grid_3x3	9	12	1.008	Matches SABRE
heavy_hex_19	19	20	1.050	5% more SWAPs
Overall	—	—	0.996	Beats SABRE

Progression Across Experiments

Version	Best Run	Heavy Hex Ratio	Key Breakthrough
V1	—	0% completion	Q-value collapse
V2	Run 7	1.080	5-channel state, soft targets, eps=0.02
V3	Run 15	1.014	60k episodes, standard net
V4	Run 19	0.991	80k episodes — first to beat SABRE
V5	Run 24	0.987	100k episodes — more training helps
V6	Run 29	0.969	Fine-tuning from best checkpoint — beats SABRE by 3.1%

Training Dashboards

Run 029 fine-tune (20k episodes): Starting from Run 019's best weights at ratio ~0.99, the low-LR refinement pushes down to 0.969.

Run 019 Training Dashboard (80k, original SABRE-beater)

Run 019 (80k episodes): The characteristic "phase transition" at ep8k-12k is where the agent suddenly learns to complete circuits. Best ratio 0.991 at ep63k.

Run 024 Training Dashboard (100k, brute force)

Run 024 (100k episodes): Extended training pushes best ratio to 0.987 at ep77k. Confirms more training continues to help.

Run 023 Training Dashboard (Curriculum + Cosine LR)

Run 023: Three curriculum phases visible — depth 5 (ep0-9k), depth 10 (ep9k-21k), depth 20 (ep21k+). Most sample-efficient path to beating SABRE from scratch.

Additional Routing GIFs

Run 024 (best checkpoint, ep77k): Agent uses 164 SWAPs vs SABRE's 193 — ratio 0.850

Run 019 (best checkpoint, ep63k): Agent uses 189 SWAPs vs SABRE's 213 — ratio 0.887

Benchmark

Full Benchmark: Agent vs SABRE on 150 Circuits

We evaluate the best agent (Run 029, fine-tuned) on two independent test suites:

Generated Suite (110 circuits): QFT, VQE (linear/circular/full), GHZ, Bernstein-Vazirani, Quantum Volume, Structured (CNOT rings), and Random circuits at 4-19 qubits
QASMBench Suite (40 circuits): Real-world algorithms from the QASMBench repository — Shor's, Grover's, QAOA, QEC, HHL, arithmetic, and more (filtered to ≤19 qubits with ≥1 two-qubit gate)

Full benchmark dashboard — Generated + QASMBench

Full benchmark dashboard: per-category swap ratios, agent vs SABRE scatter, ratio distributions, and scaling trends across both test suites.

Overall Results

Suite	Circuits	Mean Ratio	Agent Wins	Ties	SABRE Wins
Generated	110 (89 with SWAPs)	0.989	37 (42%)	34 (38%)	18 (20%)
QASMBench	40 (24 with SWAPs)	0.687	11 (46%)	10 (42%)	3 (12%)
Combined	150 (113 with SWAPs)	0.925	48 (42%)	44 (39%)	21 (19%)

The agent wins more than twice as often as it loses across both suites.

Per-Category Highlights

Generated circuits — the agent is strongest on dense, structured algorithms:

Category	Ratio	Standout
QFT	0.890	qft_4: 3 vs 5 SWAPs (0.600)
VQE (full)	0.909	vqe_full_12q_1r: 55 vs 64 (0.859)
Random (depth-20)	0.988	random_d20_s16: 163 vs 189 (0.862)

QASMBench — the agent dominates on real-world arithmetic/factoring circuits:

Circuit	Agent SWAPs	SABRE SWAPs	Ratio
square_root_n18	6	543	0.011
multiplier_n15	15	111	0.135
shor_n5	3	8	0.375
seca_n11	10	25	0.400
qft_n18	133	136	0.978

Agent vs SABRE SWAP counts across all 113 completed circuits. Points below the diagonal = agent wins. The agent consistently matches or beats SABRE, with dramatic wins on structured QASMBench circuits.

Suite Comparison (bar chart)

Suite comparison — Generated vs QASMBench

Side-by-side comparison of Generated and QASMBench suites: mean ratio, win rate, and loss rate.

The benchmark script (scripts/benchmark_test.py) generates 15 figures per run (6 per suite + 3 combined), detailed per-circuit JSON/Markdown results, and category-level statistics. Run with:

sbatch benchmark.slurm <checkpoint>                    # Both suites (default)
sbatch benchmark.slurm <checkpoint> --no-qasmbench     # Generated only

Full results: outputs/tests/benchmark_002/benchmark_results.md

Architecture

+----------------------------------------------------------------------+
|                                                                      |
|   State: 5-channel N x N tensor                                      |
|   +-------+-------+----------+-----------+------------+              |
|   | Ch 0  | Ch 1  |  Ch 2    |   Ch 3    |   Ch 4     |              |
|   |Adjacen|Mapping|Gate      |Front-layer|Stagnation  |              |
|   |-cy    |Perm.  |Demand    |Distance   |Signal      |              |
|   |(27x27)|(27x27)|(27x27)   |(27x27)    |(27x27)     |              |
|   +---+---+---+---+----+-----+-----+-----+------+-----+              |
|       +-------+--------+-----------+------------+                    |
|                        |                                             |
|                        v                                             |
|   +--------------------------------------+                           |
|   |        CNN Feature Extractor         |                           |
|   |  Conv2d(5->32, 3x3) -> ReLU         |                           |
|   |  Conv2d(32->64, 3x3) -> ReLU        |                           |
|   |  Conv2d(64->32, 3x3) -> ReLU        |                           |
|   |  Flatten -> 32 x 27 x 27 = 23,328   |                           |
|   +------------------+-------------------+                           |
|                      |                                               |
|           +----------+----------+                                    |
|           v                     v                                    |
|   +------------+    +----------------+                               |
|   | Value      |    | Advantage      |                               |
|   | Stream     |    | Stream         |                               |
|   | 23328->256 |    | 23328->256     |                               |
|   | ->ReLU->1  |    | ->ReLU->20     |  (20 edges = 20 actions)      |
|   +-----+------+    +------+---------+                               |
|         |                  |                                         |
|         +------+-----------+                                         |
|                v                                                     |
|   Q(s,a) = V(s) + A(s,a) - mean(A)                                  |
|                                                                      |
|   ~6M parameters (standard) | ~24M parameters (bignet)               |
|                                                                      |
+----------------------------------------------------------------------+

Key Design Decisions

Decision	Rationale	Reference
Dueling DQN	Separates state value from action advantage — learns "how good is this state" independently of "which SWAP is best here"	Wang et al., 2016
Double DQN	Eliminates overestimation bias in Q-learning — online net selects actions, target net evaluates them	van Hasselt et al., 2016
Prioritized Experience Replay	Samples high-TD-error transitions more often — focuses learning on surprising/difficult states	Schaul et al., 2016
CNN on N x N state	Hardware topology and qubit mapping have spatial structure — convolutions capture local connectivity patterns	—
5-channel state	Encodes topology (Ch0), current mapping (Ch1), future gate demand (Ch2), immediate routing targets (Ch3), and stuck-ness signal (Ch4)	Pozzi et al., 2022 (Ch3 distance shaping)
Soft target updates	Polyak averaging (tau=0.005) gives smoother target evolution vs hard copies — prevents Q-value oscillation	Lillicrap et al., 2016
Curriculum learning	Train on easy circuits (depth 5) first, progress to hard ones (depth 20) — gives ~10k episode head start	Bengio et al., 2009

How It Works

The Routing Problem

Given a quantum circuit with two-qubit gates and a hardware coupling graph, find a sequence of SWAP operations that makes all gates executable on adjacent qubits, minimizing the total number of SWAPs inserted.

This is NP-hard in general (Siraichi et al., 2018). SABRE (Li et al., 2019) solves it heuristically using bidirectional greedy search with a lookahead cost function. Our agent learns a policy directly from experience.

Environment (MDP)

Each episode:

Reset: Pick a topology, generate a random depth-20 circuit, set a random initial qubit mapping
Step: Agent selects a SWAP (edge in the hardware graph). The SWAP is applied, then all now-routable gates execute automatically
Terminate: When all gates are executed (success, +5 bonus) or after max_steps (timeout, -10 penalty)

Reward per step:

r = -1                                    # step cost (encourages efficiency)
  + gates_executed x gate_reward           # +1 per gate routed (progress signal)
  + distance_coeff x delta_distance        # Pozzi-style distance shaping
  + repetition_penalty (if same SWAP)      # -0.5 prevents swap-undo loops
  + completion_bonus (if done)             # +5 for completing all gates
  + timeout_penalty (if truncated)         # -10 for running out of steps

Training Algorithm

D3QN + PER (Double Dueling DQN with Prioritized Experience Replay):

Collect transitions via epsilon-greedy exploration (epsilon: 1.0 -> 0.02)
Store in prioritized replay buffer (SumTree, capacity 300-500k)
Sample mini-batches weighted by TD error magnitude
Compute Double DQN targets: y = r + gamma^n * Q_target(s', argmax Q_online(s'))
Minimize Huber loss weighted by importance-sampling corrections
Soft-update target network every 500 steps (tau = 0.005)

Multi-Topology Generalization

A single agent can learn routing across multiple hardware topologies simultaneously:

State observations are padded to a fixed N x N size (N=27 covers all topologies)
Actions beyond the current topology's edge count are masked
Weighted topology sampling ensures the hardest topology (heavy_hex) gets sufficient training time

Getting Started

Prerequisites

Python 3.10+
PyTorch 2.0+ with CUDA
Qiskit 1.0.2

Installation

git clone <repo-url>
cd rl-quantum-circuit-routing
python -m venv .venv
source .venv/bin/activate
pip install torch --index-url https://download.pytorch.org/whl/cu118
pip install qiskit==1.0.2 gymnasium==0.29.1 networkx==3.2.1 \
            numpy==1.26.4 matplotlib==3.8.3 imageio tqdm

Quick Start

Train with a preset:

# Sanity check (~5 min)
python main.py train --preset linear5

# Primary heavy_hex training (~16h on GPU)
python main.py train --preset heavy_hex --episodes 60000 --device cuda

# Multi-topology generalization
python main.py train --preset multi --episodes 80000 --device cuda

Train with a custom config:

python main.py train --config configs/run14_heavy_hex_60k.json --device cuda

Evaluate a checkpoint:

python main.py evaluate \
    --checkpoint outputs/run_019/checkpoints/checkpoint_final.pt \
    --episodes 100 \
    --save-trajectories \
    --output-dir eval_results

SLURM Cluster

# Using config files
sbatch experiment.slurm configs/run14_heavy_hex_60k.json

# With CLI overrides
sbatch experiment.slurm heavy_hex --episodes 80000 --seed 123

The SLURM script automatically: trains the agent, evaluates the final checkpoint, and generates visualizations. Outputs land in outputs/<SLURM_JOB_ID>/ (or outputs/run_NNN/ for local runs).

Fine-tuning from a checkpoint:

sbatch experiment.slurm configs/run29_finetune_run019.json \
    --finetune outputs/run_019/checkpoints/checkpoint_ep64000.pt

Training

Hyperparameters (Best Configuration)

Parameter	Value	Notes
Algorithm	D3QN + PER	Double Dueling DQN
Conv channels	[32, 64, 32]	Standard net (~6M params)
Dueling hidden	256
Learning rate	1e-4	Cosine annealing to 1e-5 recommended
Discount (gamma)	0.99
Batch size	128
Buffer capacity	300,000-500,000
PER alpha	0.6	Priority exponent
PER beta	0.4 -> 1.0	IS weight annealing
Epsilon	1.0 -> 0.02	Linear decay over 4-6M steps
Target update	tau=0.005 every 500 steps	Soft Polyak
Gradient clip	10.0	Max gradient norm
Episodes	60,000-100,000	More is better (curve not flat)
Curriculum	depth 5 -> 10 -> 20	Optional, saves ~25% training time

Compute Requirements

Resource	Requirement
GPU	Any CUDA GPU (RTX 3090 recommended)
Training time	~16h for 60k episodes on heavy_hex
GPU memory	~2-4 GB
Evaluation	~5 min per 100 episodes

Output Structure

outputs/run_NNN/
├── config.json                 # Full configuration snapshot
├── quicksum.md                 # One-line experiment summary
├── logs/
│   ├── episodes.jsonl          # Per-episode: reward, swaps, completion
│   ├── train_steps.jsonl       # Per-step: loss, Q-values, TD error, LR
│   └── evaluations.jsonl       # Per-eval: swap ratio, completion rate
├── checkpoints/
│   ├── checkpoint_ep10000.pt   # Periodic checkpoints
│   └── checkpoint_final.pt     # Final model
├── figures/
│   ├── training_curves.png     # 6-panel training dashboard
│   ├── eval_comparison_*.png   # Agent vs SABRE bar charts
│   └── routing_*.gif           # Animated routing trajectories
├── eval/
│   ├── eval_ep*.json           # Evaluation results per checkpoint
│   └── random_eval.json        # Final evaluation
└── results_summary.json        # Aggregate results for quick comparison

Project Structure

rl-quantum-circuit-routing/
├── src/
│   ├── environment.py          # Gymnasium env: 5-channel state, SWAP actions, reward
│   ├── networks.py             # DuelingCNN: Conv2d stack + V/A streams
│   ├── dqn_agent.py            # D3QN agent: PER, n-step, epsilon, LR scheduling
│   ├── replay_buffer.py        # SumTree + PrioritizedReplayBuffer
│   ├── config.py               # TrainConfig dataclass + presets
│   ├── train.py                # Main training loop with curriculum support
│   ├── evaluate.py             # Evaluation + SABRE comparison
│   ├── visualize.py            # Training dashboard + routing GIFs
│   ├── circuit_utils.py        # Quantum circuit DAG, front layer, coupling maps
│   └── explore.py              # Interactive exploration utilities
├── configs/                    # JSON configs for each experiment run
├── assets/                     # Figures and GIFs for README
├── outputs/                    # All experiment outputs (auto-numbered, gitignored)
├── main.py                     # CLI entry point (train/evaluate/visualize)
├── experiment.slurm            # SLURM job script (train + eval + viz)
├── benchmark.slurm             # SLURM benchmark script (agent vs SABRE on 150 circuits)
├── scripts/
│   └── benchmark_test.py       # Full benchmark: generated + QASMBench suites
├── ARCHITECTURE.md             # Detailed technical documentation
├── outputs.md                  # Full experiment log and analysis
└── README.md

Documentation

Document	Description
ARCHITECTURE.md	Complete technical reference (~1200 lines): MDP formulation with math, network architecture, D3QN+PER algorithm, training pipeline, multi-topology support, reward design, experimental findings, and paper references
outputs.md	Full experiment log: every run's config, results, detailed analysis, and cross-run comparison. Covers V1-V6 findings (31 runs), leaderboard, and ablation studies

References

SABRE: Li, G., Ding, Y., & Xie, Y. (2019). Tackling the Qubit Mapping Problem for NISQ-Era Quantum Devices. ASPLOS.
Qubit Routing Problem: Cowtan, A., et al. (2019). On the Qubit Routing Problem. TQC.
NP-hardness: Siraichi, M. Y., et al. (2018). Qubit allocation. CGO.
Optimal Routing: Botea, A., et al. (2018). On the complexity of quantum circuit compilation. SOCS.
Token Swapping: Miltzow, T., et al. (2016). Approximation and Hardness of Token Swapping. ESA.
Dueling DQN: Wang, Z., et al. (2016). Dueling Network Architectures for Deep Reinforcement Learning. ICML.
Double DQN: van Hasselt, H., Guez, A., & Silver, D. (2016). Deep Reinforcement Learning with Double Q-learning. AAAI.
Prioritized Experience Replay: Schaul, T., et al. (2016). Prioritized Experience Replay. ICLR.
RL for Routing: Pozzi, M. G., et al. (2022). Using Reinforcement Learning to Perform Qubit Routing in Quantum Compilers. ACM Computing Surveys.
AlphaRouter: Fosel, T., et al. (2021). Quantum circuit optimization with deep reinforcement learning. arXiv:2103.07585.
Curriculum Learning: Bengio, Y., et al. (2009). Curriculum Learning. ICML.

License

This project is released under the MIT License.

Built with PyTorch · Qiskit · Gymnasium

Name		Name	Last commit message	Last commit date
Latest commit History 90 Commits
assets		assets
benchmarks		benchmarks
configs		configs
docs		docs
notebooks		notebooks
outputs		outputs
scripts		scripts
src		src
.gitignore		.gitignore
ARCHITECTURE.md		ARCHITECTURE.md
COLLABORATION.md		COLLABORATION.md
README.md		README.md
benchmark.slurm		benchmark.slurm
experiment.slurm		experiment.slurm
main.py		main.py
outputs.md		outputs.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

RL Quantum Circuit Routing

Deep Reinforcement Learning for Quantum Circuit Transpilation

Overview

Results

Agent vs SABRE: Routing Comparison

Best Result: Run 029 — Fine-Tuned (heavy_hex_19)

Best Multi-Topology: Run 026 (3 topologies, 100k episodes)

Progression Across Experiments

Training Dashboards

Benchmark

Full Benchmark: Agent vs SABRE on 150 Circuits

Overall Results

Per-Category Highlights

Architecture

Key Design Decisions

How It Works

The Routing Problem

Environment (MDP)

Training Algorithm

Multi-Topology Generalization

Getting Started

Prerequisites

Installation

Quick Start

SLURM Cluster

Training

Hyperparameters (Best Configuration)

Compute Requirements

Output Structure

Project Structure

Documentation

References

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages