A D3QN+PER agent that learns to route quantum circuits on hardware topologies, minimizing SWAP gate overhead compared to IBM's SABRE compiler.
Ratio 0.969 (Beats SABRE by 3.1%) · Benchmark: 42% Wins / 19% Losses over 150 circuits · 100% Completion · 3 Hardware Topologies
Results · Benchmark · Architecture · How It Works · Getting Started · Training · Documentation
Quantum computers can only execute two-qubit gates between physically adjacent qubits. Real quantum circuits contain gates between arbitrary qubit pairs. Quantum circuit routing (transpilation) solves this by inserting SWAP gates to move qubits into adjacent positions before each gate can execute. Fewer SWAPs = shorter circuits = less decoherence = higher fidelity results.
The industry-standard solution is SABRE (Li et al., 2019), a heuristic search algorithm used by IBM's Qiskit compiler. SABRE is fast and produces good results, but as a greedy heuristic, it can miss globally optimal routing strategies.
Finding the minimum number of SWAPs for a given circuit is NP-hard (Siraichi et al., 2018) — no known polynomial-time algorithm can compute the optimal solution. For small instances (5-10 qubits), exact methods like ILP/SAT solvers or A* search (Botea et al., 2018) can find optima, but for real hardware sizes these are intractable. Both SABRE and our agent produce approximate solutions.
This project trains a deep RL agent to learn routing strategies end-to-end, directly from the circuit structure and hardware topology. The agent:
- Observes the hardware graph, current qubit mapping, pending gate demands, and routing progress via a 5-channel spatial state
- Selects SWAP operations on hardware edges using a Dueling CNN policy
- Learns from millions of routing episodes via Double DQN with Prioritized Experience Replay
- Generalizes across multiple hardware topologies with a single network
Quantum Circuit (logical) Hardware Topology (physical)
+-------------------+ +-------------------------+
| q0 --X-- | | 0 --- 1 --- 2 |
| | | | | | |
| q1 --X-- X-- | SWAP | 3 --- 4 --- 5 |
| | | ---------> | | | |
| q2 ---- X-- X-- | insertion | 6 --- 7 --- 8 |
| | | | (grid_3x3: 9q, 12e) |
| q3 -------- X-- | +-------------------------+
+-------------------+
Gates between non-adjacent Agent inserts SWAPs to make
qubits can't execute directly all gates executable
The agent (blue) and SABRE (orange) route the same quantum circuit on heavy_hex_19. Each frame shows one SWAP operation, with gates executing automatically when qubits become adjacent.
Run 029 / fine-tuned model (best checkpoint, ep17k): Agent uses 176 SWAPs vs SABRE's 200 — ratio 0.880 on this circuit
| Metric | Value |
|---|---|
| Best Swap Ratio | 0.969 (beats SABRE by 3.1%) |
| Completion Rate | 100% (all circuits fully routed) |
| Agent SWAPs (avg) | 180.2 |
| SABRE SWAPs (avg) | 186.0 |
| Method | Fine-tune Run 019 best checkpoint @ LR=1e-5 |
Fine-tuning loads the best checkpoint from a full training run and refines it with a low learning rate — only 20k episodes / 4.5h wall time.
Fine-tuned model evaluation (ep17k, ratio 0.969): per-circuit SWAP count comparison between agent (blue) and SABRE (orange)
A single network learns to route circuits across 3 different hardware topologies simultaneously:
| Topology | Qubits | Edges | Swap Ratio | vs SABRE |
|---|---|---|---|---|
| linear_5 | 5 | 4 | 0.890 | Beats SABRE by 11% |
| grid_3x3 | 9 | 12 | 1.008 | Matches SABRE |
| heavy_hex_19 | 19 | 20 | 1.050 | 5% more SWAPs |
| Overall | — | — | 0.996 | Beats SABRE |
| Version | Best Run | Heavy Hex Ratio | Key Breakthrough |
|---|---|---|---|
| V1 | — | 0% completion | Q-value collapse |
| V2 | Run 7 | 1.080 | 5-channel state, soft targets, eps=0.02 |
| V3 | Run 15 | 1.014 | 60k episodes, standard net |
| V4 | Run 19 | 0.991 | 80k episodes — first to beat SABRE |
| V5 | Run 24 | 0.987 | 100k episodes — more training helps |
| V6 | Run 29 | 0.969 | Fine-tuning from best checkpoint — beats SABRE by 3.1% |
Run 029 fine-tune (20k episodes): Starting from Run 019's best weights at ratio ~0.99, the low-LR refinement pushes down to 0.969.
Run 019 Training Dashboard (80k, original SABRE-beater)
Run 024 Training Dashboard (100k, brute force)
Run 023 Training Dashboard (Curriculum + Cosine LR)
Additional Routing GIFs
We evaluate the best agent (Run 029, fine-tuned) on two independent test suites:
- Generated Suite (110 circuits): QFT, VQE (linear/circular/full), GHZ, Bernstein-Vazirani, Quantum Volume, Structured (CNOT rings), and Random circuits at 4-19 qubits
- QASMBench Suite (40 circuits): Real-world algorithms from the QASMBench repository — Shor's, Grover's, QAOA, QEC, HHL, arithmetic, and more (filtered to ≤19 qubits with ≥1 two-qubit gate)
Full benchmark dashboard: per-category swap ratios, agent vs SABRE scatter, ratio distributions, and scaling trends across both test suites.
| Suite | Circuits | Mean Ratio | Agent Wins | Ties | SABRE Wins |
|---|---|---|---|---|---|
| Generated | 110 (89 with SWAPs) | 0.989 | 37 (42%) | 34 (38%) | 18 (20%) |
| QASMBench | 40 (24 with SWAPs) | 0.687 | 11 (46%) | 10 (42%) | 3 (12%) |
| Combined | 150 (113 with SWAPs) | 0.925 | 48 (42%) | 44 (39%) | 21 (19%) |
The agent wins more than twice as often as it loses across both suites.
Generated circuits — the agent is strongest on dense, structured algorithms:
| Category | Ratio | Standout |
|---|---|---|
| QFT | 0.890 | qft_4: 3 vs 5 SWAPs (0.600) |
| VQE (full) | 0.909 | vqe_full_12q_1r: 55 vs 64 (0.859) |
| Random (depth-20) | 0.988 | random_d20_s16: 163 vs 189 (0.862) |
QASMBench — the agent dominates on real-world arithmetic/factoring circuits:
| Circuit | Agent SWAPs | SABRE SWAPs | Ratio |
|---|---|---|---|
| square_root_n18 | 6 | 543 | 0.011 |
| multiplier_n15 | 15 | 111 | 0.135 |
| shor_n5 | 3 | 8 | 0.375 |
| seca_n11 | 10 | 25 | 0.400 |
| qft_n18 | 133 | 136 | 0.978 |
Agent vs SABRE SWAP counts across all 113 completed circuits. Points below the diagonal = agent wins. The agent consistently matches or beats SABRE, with dramatic wins on structured QASMBench circuits.
Suite Comparison (bar chart)
The benchmark script (scripts/benchmark_test.py) generates 15 figures per run (6 per suite + 3 combined), detailed per-circuit JSON/Markdown results, and category-level statistics. Run with:
sbatch benchmark.slurm <checkpoint> # Both suites (default)
sbatch benchmark.slurm <checkpoint> --no-qasmbench # Generated onlyFull results: outputs/tests/benchmark_002/benchmark_results.md
+----------------------------------------------------------------------+
| |
| State: 5-channel N x N tensor |
| +-------+-------+----------+-----------+------------+ |
| | Ch 0 | Ch 1 | Ch 2 | Ch 3 | Ch 4 | |
| |Adjacen|Mapping|Gate |Front-layer|Stagnation | |
| |-cy |Perm. |Demand |Distance |Signal | |
| |(27x27)|(27x27)|(27x27) |(27x27) |(27x27) | |
| +---+---+---+---+----+-----+-----+-----+------+-----+ |
| +-------+--------+-----------+------------+ |
| | |
| v |
| +--------------------------------------+ |
| | CNN Feature Extractor | |
| | Conv2d(5->32, 3x3) -> ReLU | |
| | Conv2d(32->64, 3x3) -> ReLU | |
| | Conv2d(64->32, 3x3) -> ReLU | |
| | Flatten -> 32 x 27 x 27 = 23,328 | |
| +------------------+-------------------+ |
| | |
| +----------+----------+ |
| v v |
| +------------+ +----------------+ |
| | Value | | Advantage | |
| | Stream | | Stream | |
| | 23328->256 | | 23328->256 | |
| | ->ReLU->1 | | ->ReLU->20 | (20 edges = 20 actions) |
| +-----+------+ +------+---------+ |
| | | |
| +------+-----------+ |
| v |
| Q(s,a) = V(s) + A(s,a) - mean(A) |
| |
| ~6M parameters (standard) | ~24M parameters (bignet) |
| |
+----------------------------------------------------------------------+
| Decision | Rationale | Reference |
|---|---|---|
| Dueling DQN | Separates state value from action advantage — learns "how good is this state" independently of "which SWAP is best here" | Wang et al., 2016 |
| Double DQN | Eliminates overestimation bias in Q-learning — online net selects actions, target net evaluates them | van Hasselt et al., 2016 |
| Prioritized Experience Replay | Samples high-TD-error transitions more often — focuses learning on surprising/difficult states | Schaul et al., 2016 |
| CNN on N x N state | Hardware topology and qubit mapping have spatial structure — convolutions capture local connectivity patterns | — |
| 5-channel state | Encodes topology (Ch0), current mapping (Ch1), future gate demand (Ch2), immediate routing targets (Ch3), and stuck-ness signal (Ch4) | Pozzi et al., 2022 (Ch3 distance shaping) |
| Soft target updates | Polyak averaging (tau=0.005) gives smoother target evolution vs hard copies — prevents Q-value oscillation | Lillicrap et al., 2016 |
| Curriculum learning | Train on easy circuits (depth 5) first, progress to hard ones (depth 20) — gives ~10k episode head start | Bengio et al., 2009 |
Given a quantum circuit with two-qubit gates and a hardware coupling graph, find a sequence of SWAP operations that makes all gates executable on adjacent qubits, minimizing the total number of SWAPs inserted.
This is NP-hard in general (Siraichi et al., 2018). SABRE (Li et al., 2019) solves it heuristically using bidirectional greedy search with a lookahead cost function. Our agent learns a policy directly from experience.
Each episode:
- Reset: Pick a topology, generate a random depth-20 circuit, set a random initial qubit mapping
- Step: Agent selects a SWAP (edge in the hardware graph). The SWAP is applied, then all now-routable gates execute automatically
- Terminate: When all gates are executed (success, +5 bonus) or after max_steps (timeout, -10 penalty)
Reward per step:
r = -1 # step cost (encourages efficiency)
+ gates_executed x gate_reward # +1 per gate routed (progress signal)
+ distance_coeff x delta_distance # Pozzi-style distance shaping
+ repetition_penalty (if same SWAP) # -0.5 prevents swap-undo loops
+ completion_bonus (if done) # +5 for completing all gates
+ timeout_penalty (if truncated) # -10 for running out of steps
D3QN + PER (Double Dueling DQN with Prioritized Experience Replay):
- Collect transitions via epsilon-greedy exploration (epsilon: 1.0 -> 0.02)
- Store in prioritized replay buffer (SumTree, capacity 300-500k)
- Sample mini-batches weighted by TD error magnitude
- Compute Double DQN targets:
y = r + gamma^n * Q_target(s', argmax Q_online(s')) - Minimize Huber loss weighted by importance-sampling corrections
- Soft-update target network every 500 steps (tau = 0.005)
A single agent can learn routing across multiple hardware topologies simultaneously:
- State observations are padded to a fixed N x N size (N=27 covers all topologies)
- Actions beyond the current topology's edge count are masked
- Weighted topology sampling ensures the hardest topology (heavy_hex) gets sufficient training time
- Python 3.10+
- PyTorch 2.0+ with CUDA
- Qiskit 1.0.2
git clone <repo-url>
cd rl-quantum-circuit-routing
python -m venv .venv
source .venv/bin/activate
pip install torch --index-url https://download.pytorch.org/whl/cu118
pip install qiskit==1.0.2 gymnasium==0.29.1 networkx==3.2.1 \
numpy==1.26.4 matplotlib==3.8.3 imageio tqdmTrain with a preset:
# Sanity check (~5 min)
python main.py train --preset linear5
# Primary heavy_hex training (~16h on GPU)
python main.py train --preset heavy_hex --episodes 60000 --device cuda
# Multi-topology generalization
python main.py train --preset multi --episodes 80000 --device cudaTrain with a custom config:
python main.py train --config configs/run14_heavy_hex_60k.json --device cudaEvaluate a checkpoint:
python main.py evaluate \
--checkpoint outputs/run_019/checkpoints/checkpoint_final.pt \
--episodes 100 \
--save-trajectories \
--output-dir eval_results# Using config files
sbatch experiment.slurm configs/run14_heavy_hex_60k.json
# With CLI overrides
sbatch experiment.slurm heavy_hex --episodes 80000 --seed 123The SLURM script automatically: trains the agent, evaluates the final checkpoint, and generates visualizations. Outputs land in outputs/<SLURM_JOB_ID>/ (or outputs/run_NNN/ for local runs).
Fine-tuning from a checkpoint:
sbatch experiment.slurm configs/run29_finetune_run019.json \
--finetune outputs/run_019/checkpoints/checkpoint_ep64000.pt| Parameter | Value | Notes |
|---|---|---|
| Algorithm | D3QN + PER | Double Dueling DQN |
| Conv channels | [32, 64, 32] | Standard net (~6M params) |
| Dueling hidden | 256 | |
| Learning rate | 1e-4 | Cosine annealing to 1e-5 recommended |
| Discount (gamma) | 0.99 | |
| Batch size | 128 | |
| Buffer capacity | 300,000-500,000 | |
| PER alpha | 0.6 | Priority exponent |
| PER beta | 0.4 -> 1.0 | IS weight annealing |
| Epsilon | 1.0 -> 0.02 | Linear decay over 4-6M steps |
| Target update | tau=0.005 every 500 steps | Soft Polyak |
| Gradient clip | 10.0 | Max gradient norm |
| Episodes | 60,000-100,000 | More is better (curve not flat) |
| Curriculum | depth 5 -> 10 -> 20 | Optional, saves ~25% training time |
| Resource | Requirement |
|---|---|
| GPU | Any CUDA GPU (RTX 3090 recommended) |
| Training time | ~16h for 60k episodes on heavy_hex |
| GPU memory | ~2-4 GB |
| Evaluation | ~5 min per 100 episodes |
outputs/run_NNN/
├── config.json # Full configuration snapshot
├── quicksum.md # One-line experiment summary
├── logs/
│ ├── episodes.jsonl # Per-episode: reward, swaps, completion
│ ├── train_steps.jsonl # Per-step: loss, Q-values, TD error, LR
│ └── evaluations.jsonl # Per-eval: swap ratio, completion rate
├── checkpoints/
│ ├── checkpoint_ep10000.pt # Periodic checkpoints
│ └── checkpoint_final.pt # Final model
├── figures/
│ ├── training_curves.png # 6-panel training dashboard
│ ├── eval_comparison_*.png # Agent vs SABRE bar charts
│ └── routing_*.gif # Animated routing trajectories
├── eval/
│ ├── eval_ep*.json # Evaluation results per checkpoint
│ └── random_eval.json # Final evaluation
└── results_summary.json # Aggregate results for quick comparison
rl-quantum-circuit-routing/
├── src/
│ ├── environment.py # Gymnasium env: 5-channel state, SWAP actions, reward
│ ├── networks.py # DuelingCNN: Conv2d stack + V/A streams
│ ├── dqn_agent.py # D3QN agent: PER, n-step, epsilon, LR scheduling
│ ├── replay_buffer.py # SumTree + PrioritizedReplayBuffer
│ ├── config.py # TrainConfig dataclass + presets
│ ├── train.py # Main training loop with curriculum support
│ ├── evaluate.py # Evaluation + SABRE comparison
│ ├── visualize.py # Training dashboard + routing GIFs
│ ├── circuit_utils.py # Quantum circuit DAG, front layer, coupling maps
│ └── explore.py # Interactive exploration utilities
├── configs/ # JSON configs for each experiment run
├── assets/ # Figures and GIFs for README
├── outputs/ # All experiment outputs (auto-numbered, gitignored)
├── main.py # CLI entry point (train/evaluate/visualize)
├── experiment.slurm # SLURM job script (train + eval + viz)
├── benchmark.slurm # SLURM benchmark script (agent vs SABRE on 150 circuits)
├── scripts/
│ └── benchmark_test.py # Full benchmark: generated + QASMBench suites
├── ARCHITECTURE.md # Detailed technical documentation
├── outputs.md # Full experiment log and analysis
└── README.md
| Document | Description |
|---|---|
| ARCHITECTURE.md | Complete technical reference (~1200 lines): MDP formulation with math, network architecture, D3QN+PER algorithm, training pipeline, multi-topology support, reward design, experimental findings, and paper references |
| outputs.md | Full experiment log: every run's config, results, detailed analysis, and cross-run comparison. Covers V1-V6 findings (31 runs), leaderboard, and ablation studies |
- SABRE: Li, G., Ding, Y., & Xie, Y. (2019). Tackling the Qubit Mapping Problem for NISQ-Era Quantum Devices. ASPLOS.
- Qubit Routing Problem: Cowtan, A., et al. (2019). On the Qubit Routing Problem. TQC.
- NP-hardness: Siraichi, M. Y., et al. (2018). Qubit allocation. CGO.
- Optimal Routing: Botea, A., et al. (2018). On the complexity of quantum circuit compilation. SOCS.
- Token Swapping: Miltzow, T., et al. (2016). Approximation and Hardness of Token Swapping. ESA.
- Dueling DQN: Wang, Z., et al. (2016). Dueling Network Architectures for Deep Reinforcement Learning. ICML.
- Double DQN: van Hasselt, H., Guez, A., & Silver, D. (2016). Deep Reinforcement Learning with Double Q-learning. AAAI.
- Prioritized Experience Replay: Schaul, T., et al. (2016). Prioritized Experience Replay. ICLR.
- RL for Routing: Pozzi, M. G., et al. (2022). Using Reinforcement Learning to Perform Qubit Routing in Quantum Compilers. ACM Computing Surveys.
- AlphaRouter: Fosel, T., et al. (2021). Quantum circuit optimization with deep reinforcement learning. arXiv:2103.07585.
- Curriculum Learning: Bengio, Y., et al. (2009). Curriculum Learning. ICML.
This project is released under the MIT License.





