perf(lmp): GPU-accelerated edge tensor construction for DPA4/SeZM by SchrodingersCattt · Pull Request #5596 · deepmodeling/deepmd-kit

SchrodingersCattt · 2026-06-27T13:00:59Z

Summary

Replace CPU-side createEdgeTensors() with createEdgeTensorsGPU() that builds edge topology on GPU using torch tensor ops (index_select, gather, nonzero).

Motivation

DPA4/SeZM .pt2 models suffer 3–6× throughput degradation in LAMMPS vs Python/ASE at large atom counts (see #5595). Profiling shows the dominant bottleneck is createEdgeTensors() in commonPT.h, which iterates O(N×nnei) neighbor pairs in a serial C++ loop on CPU, then copies 4 tensors to GPU via separate .clone().to(device) calls.

At 65K atoms with ~120 neighbors (~7.8M edges), this takes ~60–80ms — 3.2× the model inference itself (measured as LAMMPS Pair time minus pure AOTI runner time).

The Python inference path already avoids this by using edge_schema_from_extended() which runs equivalent logic as pure torch tensor ops on GPU.

Changes

`commonPT.h`

Added createEdgeTensorsGPU(): takes padded nlist tensor + coord + mapping (all on GPU) and produces edge_index + edge_index_ext using the same gather/scatter/nonzero pattern as the Python edge_schema_from_extended().
Supports both fold_to_local (single-rank) and extended (multi-rank) index spaces.
Supports optional with_geometry=true for the non-LAMMPS path (produces edge_vec + edge_mask + dummy edges).
Original CPU createEdgeTensors() is preserved (not removed) for backward compatibility.

`DeepPotPTExpt.cc`

LAMMPS path (ago==0): Replaced createEdgeTensors(jagged_nlist, cpu_coord, ...) with:
1. createNlistTensor() → padded tensor (cheap CPU memcpy)
2. .to(device) → single H2D transfer
3. createEdgeTensorsGPU() → GPU parallel edge construction
Non-LAMMPS path: Same replacement with with_geometry=true.
compactEdgeTensors() (ago>0 reuse path) is unchanged — already runs on GPU.

Expected Impact

Phase	Before	After
Edge construction (ago==0)	~60-80ms CPU serial	~1-5ms GPU parallel
H2D transfers	4 separate .clone().to(device)	1 padded nlist transfer
compactEdgeTensors (ago>0)	unchanged (GPU)	unchanged (GPU)

Estimated LAMMPS DPA4 throughput improvement at 65K atoms: ~50-70% (from ~130K to ~200K+ atom·step/s).

Closes #5595. Related: #5574.

Summary by CodeRabbit

New Features
- Enabled GPU-based edge/topology tensor construction for edge-driven neighbor-list workflows.
- When geometry is enabled, generated edge tensors now include per-edge vectors, with padded dummy edges and a mask to keep tensor shapes consistent.
Performance
- Reduced CPU-side tensor preparation by padding neighbor lists and building edge topology directly on the selected device for supported edge input paths.

coderabbitai · 2026-06-27T13:02:46Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

A GPU edge-tensor constructor was added, and both DeepPotPTExpt::compute edge-input paths now build padded neighbor tensors on-device before creating edge tensors.

Changes

GPU edge topology construction

Layer / File(s)	Summary
GPU edge constructor `source/api_cc/include/commonPT.h`	`createEdgeTensorsDevice` builds `edge_index` and `edge_index_ext` from padded neighbor-list tensors, filters invalid/coincident edges, folds ghost neighbors when requested, and appends dummy edges plus `edge_mask` when geometry is enabled.
GPU edge wiring `source/api_cc/src/DeepPotPTExpt.cc`	Both `lower_input_is_edge_` compute paths build padded neighbor tensors on-device with `createNlistTensor` and call `createEdgeTensorsDevice`; the extended-coordinate path enables geometry output and the main path passes `fold_to_local` from `use_with_comm`.

Sequence Diagram(s)

sequenceDiagram
  participant DeepPotPTExpt
  participant createNlistTensor
  participant createEdgeTensorsDevice

  DeepPotPTExpt->>createNlistTensor: build padded nlist_tensor on device
  DeepPotPTExpt->>createEdgeTensorsDevice: pass padded_nlist, coord_Tensor, mapping_tensor
  createEdgeTensorsDevice-->>DeepPotPTExpt: edge_tensors

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

#5595: Replaces CPU-side edge construction in DeepPotPTExpt::compute with a GPU path, matching the issue’s performance target.
#5574: Also describes moving DeePMD’s C++/PTExpt edge construction toward a device-resident pipeline.

Suggested labels

LAMMPS

Suggested reviewers

iProzd
njzjz-bot

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly describes the main change: GPU-based edge tensor construction for LAMMPS/DPA4/SeZM.
Linked Issues check	✅ Passed	The PR moves edge construction onto the GPU and preserves both fold_to_local and geometry-aware paths, matching `#5595`'s requirements.
Out of Scope Changes check	✅ Passed	The changes stay focused on GPU edge tensor construction and the related call sites, with no obvious unrelated additions.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@source/api_cc/include/commonPT.h`:
- Around line 387-397: The edge filtering in commonPT’s src_actual handling is
still using src_local to enforce the local-owner range, which incorrectly
removes extended-index neighbors when fold_to_local is false. Update the
edge_keep mask in this block to gate the range check on fold_to_local, so the
src_local bounds are only applied for the folded/local path and the extended
path preserves neighbor_safe-based ghost nodes in multi-rank with-comm mode.
- Around line 363-364: The dst_actual tensor is being promoted to floating point
by the division, which breaks later indexing operations. Update the dst_actual
construction in commonPT.h to use integer arithmetic so it stays an integer
tensor, and keep the surrounding logic that feeds index_select unchanged.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 5d7b13f9-0524-46e6-a39d-49da601759fe

📥 Commits

Reviewing files that changed from the base of the PR and between dbdc9a3 and 47b7d06.

📒 Files selected for processing (2)

source/api_cc/include/commonPT.h
source/api_cc/src/DeepPotPTExpt.cc

…MPS inference Replace CPU-side createEdgeTensors() with createEdgeTensorsGPU() that uses torch tensor ops (index_select, gather, nonzero) to build edge_index and edge_index_ext entirely on the GPU. The CPU version iterates O(N*nnei) neighbor pairs in a serial C++ loop, builds 4 separate std::vectors, then copies each to GPU via .clone().to(device). At 65K atoms with ~120 neighbors, this takes ~60-80ms and accounts for 3-6x throughput degradation vs the Python inference path which uses GPU-native edge construction (vesin.torch / edge_schema_from_extended). The GPU version: 1. Converts the jagged LAMMPS nlist to a padded tensor (createNlistTensor, cheap CPU memcpy) 2. Transfers the padded [1, nloc, nnei] tensor to GPU (single H2D) 3. Builds edge topology using gather/scatter/nonzero on GPU (parallel) For the LAMMPS cached path (ago==0), only topology is built on GPU; compactEdgeTensors() still handles per-step geometry recomputation (already on GPU, unchanged). Closes deepmodeling#5595. Related: deepmodeling#5574.

The function uses standard torch tensor ops and works on any device (CPU or CUDA), so 'GPU' in the name was misleading.

for more information, see https://pre-commit.ci

codecov · 2026-06-27T14:38:28Z

Codecov Report

❌ Patch coverage is 0% with 65 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.95%. Comparing base (73de44b) to head (8fd79d0).

Files with missing lines	Patch %	Lines
source/api_cc/include/commonPT.h	0.00%	49 Missing ⚠️
source/api_cc/src/DeepPotPTExpt.cc	0.00%	16 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5596      +/-   ##
==========================================
- Coverage   81.98%   81.95%   -0.04%     
==========================================
  Files         959      959              
  Lines      105430   105484      +54     
  Branches     4071     4072       +1     
==========================================
+ Hits        86442    86451       +9     
- Misses      17518    17572      +54     
+ Partials     1470     1461       -9

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

njzjz-bot

Thanks for pushing the edge construction onto tensor ops — this is a good direction, and the CI signal is encouraging. I found one correctness regression in the LAMMPS edge path: the new device helper assumes neighbor-list row i is centered on atom i, but the previous CPU path explicitly passed nlist_data.ilist as row_centers after shuffle_exclude_empty(). That matters when LAMMPS/real-atom filtering leaves rows in a compacted/order-preserved layout rather than identity center order. Please preserve the row-center mapping in the device path before merging.

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

njzjz-bot · 2026-06-29T21:41:50Z

+      auto padded_nlist = createNlistTensor(nlist_data.jlist, nnei)
+                              .to(torch::kInt64)
+                              .to(device);
+      const auto edge_tensors = createEdgeTensorsDevice(


This call loses the row_centers information that the old CPU path passed via &nlist_data.ilist. After copy_from_nlist() + shuffle_exclude_empty(), the jlist rows are not guaranteed to have row_index == center_atom; the existing unit test CreateEdgeTensorsUsesRowCenters documents that case. createEdgeTensorsDevice() currently derives dst_actual from the row number, so it can build edges to the wrong destination atom and compute the wrong edge vectors whenever nlist_data.ilist is not identity.

Could you pass a padded/1-D center tensor (from nlist_data.ilist) into the device helper and use it for dst_actual, matching the old row_centers behavior? It would also be worth adding a device-helper test mirroring CreateEdgeTensorsUsesRowCenters.

github-advanced-security

CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

…sors

Address review from njzjz-bot: after shuffle_exclude_empty(), LAMMPS nlist rows may not have row_index == center_atom. The old CPU path passed nlist_data.ilist as row_centers; the GPU path was deriving dst_actual from the row number, building edges to wrong destinations. Add optional row_centers tensor parameter to createEdgeTensorsDevice. The LAMMPS call site now passes nlist_data.ilist as a device tensor.

for more information, see https://pre-commit.ci

OutisLi

I will make another PR related to this problem

dosubot Bot added the enhancement label Jun 27, 2026

github-actions Bot added the C++ label Jun 27, 2026

coderabbitai Bot reviewed Jun 27, 2026

View reviewed changes

Comment thread source/api_cc/include/commonPT.h Outdated

Comment thread source/api_cc/include/commonPT.h Outdated

SchrodingersCattt force-pushed the feat/gpu-edge-tensors branch from 47b7d06 to 3fba28c Compare June 27, 2026 13:12

SchrodingersCattt force-pushed the feat/gpu-edge-tensors branch from 6325194 to 73e90d7 Compare June 27, 2026 13:16

refactor: rename createEdgeTensorsGPU → createEdgeTensorsDevice

df59e21

The function uses standard torch tensor ops and works on any device (CPU or CUDA), so 'GPU' in the name was misleading.

SchrodingersCattt force-pushed the feat/gpu-edge-tensors branch from 3e3baf6 to df59e21 Compare June 27, 2026 13:30

[pre-commit.ci] auto fixes from pre-commit.com hooks

b347c1a

for more information, see https://pre-commit.ci

SchrodingersCattt added 2 commits June 27, 2026 23:30

Merge branch 'master' into feat/gpu-edge-tensors

62e2a75

Merge branch 'master' into feat/gpu-edge-tensors

2e9322f

njzjz-bot suggested changes Jun 29, 2026

View reviewed changes

github-actions Bot added Python Core CUDA ROCM OP labels Jun 30, 2026

github-advanced-security AI found potential problems Jun 30, 2026

View reviewed changes

SchrodingersCattt and others added 2 commits June 30, 2026 01:09

Merge remote-tracking branch 'upstream/master' into feat/gpu-edge-ten…

8197a43

…sors

SchrodingersCattt force-pushed the feat/gpu-edge-tensors branch from d74bfae to a2ef9ec Compare June 30, 2026 01:12

[pre-commit.ci] auto fixes from pre-commit.com hooks

8fd79d0

for more information, see https://pre-commit.ci

SchrodingersCattt requested a review from njzjz-bot June 30, 2026 01:15

OutisLi reviewed Jun 30, 2026

View reviewed changes

OutisLi marked this pull request as draft June 30, 2026 01:34

SchrodingersCattt closed this Jun 30, 2026

Uh oh!

Conversation

SchrodingersCattt commented Jun 27, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation

Changes

commonPT.h

DeepPotPTExpt.cc

Expected Impact

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related issues

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

codecov Bot commented Jun 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

njzjz-bot left a comment

Choose a reason for hiding this comment

Uh oh!

njzjz-bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

github-advanced-security AI left a comment

Choose a reason for hiding this comment

Uh oh!

OutisLi left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

SchrodingersCattt commented Jun 27, 2026 •

edited by coderabbitai Bot

Loading

`commonPT.h`

`DeepPotPTExpt.cc`

coderabbitai Bot commented Jun 27, 2026 •

edited

Loading

codecov Bot commented Jun 27, 2026 •

edited

Loading