Skip to content

perf(lmp): GPU-accelerated edge tensor construction for DPA4/SeZM#5596

Closed
SchrodingersCattt wants to merge 8 commits into
deepmodeling:masterfrom
SchrodingersCattt:feat/gpu-edge-tensors
Closed

perf(lmp): GPU-accelerated edge tensor construction for DPA4/SeZM#5596
SchrodingersCattt wants to merge 8 commits into
deepmodeling:masterfrom
SchrodingersCattt:feat/gpu-edge-tensors

Conversation

@SchrodingersCattt

@SchrodingersCattt SchrodingersCattt commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

Summary

Replace CPU-side createEdgeTensors() with createEdgeTensorsGPU() that builds edge topology on GPU using torch tensor ops (index_select, gather, nonzero).

Motivation

DPA4/SeZM .pt2 models suffer 3–6× throughput degradation in LAMMPS vs Python/ASE at large atom counts (see #5595). Profiling shows the dominant bottleneck is createEdgeTensors() in commonPT.h, which iterates O(N×nnei) neighbor pairs in a serial C++ loop on CPU, then copies 4 tensors to GPU via separate .clone().to(device) calls.

At 65K atoms with ~120 neighbors (~7.8M edges), this takes ~60–80ms — 3.2× the model inference itself (measured as LAMMPS Pair time minus pure AOTI runner time).

The Python inference path already avoids this by using edge_schema_from_extended() which runs equivalent logic as pure torch tensor ops on GPU.

Changes

commonPT.h

  • Added createEdgeTensorsGPU(): takes padded nlist tensor + coord + mapping (all on GPU) and produces edge_index + edge_index_ext using the same gather/scatter/nonzero pattern as the Python edge_schema_from_extended().
  • Supports both fold_to_local (single-rank) and extended (multi-rank) index spaces.
  • Supports optional with_geometry=true for the non-LAMMPS path (produces edge_vec + edge_mask + dummy edges).
  • Original CPU createEdgeTensors() is preserved (not removed) for backward compatibility.

DeepPotPTExpt.cc

  • LAMMPS path (ago==0): Replaced createEdgeTensors(jagged_nlist, cpu_coord, ...) with:
    1. createNlistTensor() → padded tensor (cheap CPU memcpy)
    2. .to(device) → single H2D transfer
    3. createEdgeTensorsGPU() → GPU parallel edge construction
  • Non-LAMMPS path: Same replacement with with_geometry=true.
  • compactEdgeTensors() (ago>0 reuse path) is unchanged — already runs on GPU.

Expected Impact

Phase Before After
Edge construction (ago==0) ~60-80ms CPU serial ~1-5ms GPU parallel
H2D transfers 4 separate .clone().to(device) 1 padded nlist transfer
compactEdgeTensors (ago>0) unchanged (GPU) unchanged (GPU)

Estimated LAMMPS DPA4 throughput improvement at 65K atoms: ~50-70% (from ~130K to ~200K+ atom·step/s).

Closes #5595. Related: #5574.

Summary by CodeRabbit

  • New Features

    • Enabled GPU-based edge/topology tensor construction for edge-driven neighbor-list workflows.
    • When geometry is enabled, generated edge tensors now include per-edge vectors, with padded dummy edges and a mask to keep tensor shapes consistent.
  • Performance

    • Reduced CPU-side tensor preparation by padding neighbor lists and building edge topology directly on the selected device for supported edge input paths.

@dosubot dosubot Bot added the enhancement label Jun 27, 2026
@coderabbitai

coderabbitai Bot commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

A GPU edge-tensor constructor was added, and both DeepPotPTExpt::compute edge-input paths now build padded neighbor tensors on-device before creating edge tensors.

Changes

GPU edge topology construction

Layer / File(s) Summary
GPU edge constructor
source/api_cc/include/commonPT.h
createEdgeTensorsDevice builds edge_index and edge_index_ext from padded neighbor-list tensors, filters invalid/coincident edges, folds ghost neighbors when requested, and appends dummy edges plus edge_mask when geometry is enabled.
GPU edge wiring
source/api_cc/src/DeepPotPTExpt.cc
Both lower_input_is_edge_ compute paths build padded neighbor tensors on-device with createNlistTensor and call createEdgeTensorsDevice; the extended-coordinate path enables geometry output and the main path passes fold_to_local from use_with_comm.

Sequence Diagram(s)

sequenceDiagram
  participant DeepPotPTExpt
  participant createNlistTensor
  participant createEdgeTensorsDevice

  DeepPotPTExpt->>createNlistTensor: build padded nlist_tensor on device
  DeepPotPTExpt->>createEdgeTensorsDevice: pass padded_nlist, coord_Tensor, mapping_tensor
  createEdgeTensorsDevice-->>DeepPotPTExpt: edge_tensors
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

  • #5595: Replaces CPU-side edge construction in DeepPotPTExpt::compute with a GPU path, matching the issue’s performance target.
  • #5574: Also describes moving DeePMD’s C++/PTExpt edge construction toward a device-resident pipeline.

Suggested labels

LAMMPS

Suggested reviewers

  • iProzd
  • njzjz-bot
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly describes the main change: GPU-based edge tensor construction for LAMMPS/DPA4/SeZM.
Linked Issues check ✅ Passed The PR moves edge construction onto the GPU and preserves both fold_to_local and geometry-aware paths, matching #5595's requirements.
Out of Scope Changes check ✅ Passed The changes stay focused on GPU edge tensor construction and the related call sites, with no obvious unrelated additions.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@github-actions github-actions Bot added the C++ label Jun 27, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@source/api_cc/include/commonPT.h`:
- Around line 387-397: The edge filtering in commonPT’s src_actual handling is
still using src_local to enforce the local-owner range, which incorrectly
removes extended-index neighbors when fold_to_local is false. Update the
edge_keep mask in this block to gate the range check on fold_to_local, so the
src_local bounds are only applied for the folded/local path and the extended
path preserves neighbor_safe-based ghost nodes in multi-rank with-comm mode.
- Around line 363-364: The dst_actual tensor is being promoted to floating point
by the division, which breaks later indexing operations. Update the dst_actual
construction in commonPT.h to use integer arithmetic so it stays an integer
tensor, and keep the surrounding logic that feeds index_select unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 5d7b13f9-0524-46e6-a39d-49da601759fe

📥 Commits

Reviewing files that changed from the base of the PR and between dbdc9a3 and 47b7d06.

📒 Files selected for processing (2)
  • source/api_cc/include/commonPT.h
  • source/api_cc/src/DeepPotPTExpt.cc

Comment thread source/api_cc/include/commonPT.h Outdated
Comment thread source/api_cc/include/commonPT.h Outdated
…MPS inference

Replace CPU-side createEdgeTensors() with createEdgeTensorsGPU() that uses
torch tensor ops (index_select, gather, nonzero) to build edge_index and
edge_index_ext entirely on the GPU.

The CPU version iterates O(N*nnei) neighbor pairs in a serial C++ loop,
builds 4 separate std::vectors, then copies each to GPU via .clone().to(device).
At 65K atoms with ~120 neighbors, this takes ~60-80ms and accounts for 3-6x
throughput degradation vs the Python inference path which uses GPU-native
edge construction (vesin.torch / edge_schema_from_extended).

The GPU version:
1. Converts the jagged LAMMPS nlist to a padded tensor (createNlistTensor,
   cheap CPU memcpy)
2. Transfers the padded [1, nloc, nnei] tensor to GPU (single H2D)
3. Builds edge topology using gather/scatter/nonzero on GPU (parallel)

For the LAMMPS cached path (ago==0), only topology is built on GPU;
compactEdgeTensors() still handles per-step geometry recomputation (already
on GPU, unchanged).

Closes deepmodeling#5595. Related: deepmodeling#5574.
The function uses standard torch tensor ops and works on any device
(CPU or CUDA), so 'GPU' in the name was misleading.
@codecov

codecov Bot commented Jun 27, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 0% with 65 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.95%. Comparing base (73de44b) to head (8fd79d0).

Files with missing lines Patch % Lines
source/api_cc/include/commonPT.h 0.00% 49 Missing ⚠️
source/api_cc/src/DeepPotPTExpt.cc 0.00% 16 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5596      +/-   ##
==========================================
- Coverage   81.98%   81.95%   -0.04%     
==========================================
  Files         959      959              
  Lines      105430   105484      +54     
  Branches     4071     4072       +1     
==========================================
+ Hits        86442    86451       +9     
- Misses      17518    17572      +54     
+ Partials     1470     1461       -9     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@njzjz-bot njzjz-bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pushing the edge construction onto tensor ops — this is a good direction, and the CI signal is encouraging. I found one correctness regression in the LAMMPS edge path: the new device helper assumes neighbor-list row i is centered on atom i, but the previous CPU path explicitly passed nlist_data.ilist as row_centers after shuffle_exclude_empty(). That matters when LAMMPS/real-atom filtering leaves rows in a compacted/order-preserved layout rather than identity center order. Please preserve the row-center mapping in the device path before merging.

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

auto padded_nlist = createNlistTensor(nlist_data.jlist, nnei)
.to(torch::kInt64)
.to(device);
const auto edge_tensors = createEdgeTensorsDevice(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This call loses the row_centers information that the old CPU path passed via &nlist_data.ilist. After copy_from_nlist() + shuffle_exclude_empty(), the jlist rows are not guaranteed to have row_index == center_atom; the existing unit test CreateEdgeTensorsUsesRowCenters documents that case. createEdgeTensorsDevice() currently derives dst_actual from the row number, so it can build edges to the wrong destination atom and compute the wrong edge vectors whenever nlist_data.ilist is not identity.

Could you pass a padded/1-D center tensor (from nlist_data.ilist) into the device helper and use it for dst_actual, matching the old row_centers behavior? It would also be worth adding a device-helper test mirroring CreateEdgeTensorsUsesRowCenters.

@github-advanced-security github-advanced-security AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

SchrodingersCattt and others added 2 commits June 30, 2026 01:09
Address review from njzjz-bot: after shuffle_exclude_empty(), LAMMPS
nlist rows may not have row_index == center_atom.  The old CPU path
passed nlist_data.ilist as row_centers; the GPU path was deriving
dst_actual from the row number, building edges to wrong destinations.

Add optional row_centers tensor parameter to createEdgeTensorsDevice.
The LAMMPS call site now passes nlist_data.ilist as a device tensor.

@OutisLi OutisLi left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will make another PR related to this problem

@OutisLi OutisLi marked this pull request as draft June 30, 2026 01:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

perf: DPA4 .pt2 LAMMPS (C++ AOTI) inference 3-6x slower than Python (ASE) due to CPU-side edge construction and data marshalling

4 participants