perf(lmp): GPU-accelerated edge tensor construction for DPA4/SeZM#5596
perf(lmp): GPU-accelerated edge tensor construction for DPA4/SeZM#5596SchrodingersCattt wants to merge 8 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughA GPU edge-tensor constructor was added, and both ChangesGPU edge topology construction
Sequence Diagram(s)sequenceDiagram
participant DeepPotPTExpt
participant createNlistTensor
participant createEdgeTensorsDevice
DeepPotPTExpt->>createNlistTensor: build padded nlist_tensor on device
DeepPotPTExpt->>createEdgeTensorsDevice: pass padded_nlist, coord_Tensor, mapping_tensor
createEdgeTensorsDevice-->>DeepPotPTExpt: edge_tensors
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@source/api_cc/include/commonPT.h`:
- Around line 387-397: The edge filtering in commonPT’s src_actual handling is
still using src_local to enforce the local-owner range, which incorrectly
removes extended-index neighbors when fold_to_local is false. Update the
edge_keep mask in this block to gate the range check on fold_to_local, so the
src_local bounds are only applied for the folded/local path and the extended
path preserves neighbor_safe-based ghost nodes in multi-rank with-comm mode.
- Around line 363-364: The dst_actual tensor is being promoted to floating point
by the division, which breaks later indexing operations. Update the dst_actual
construction in commonPT.h to use integer arithmetic so it stays an integer
tensor, and keep the surrounding logic that feeds index_select unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 5d7b13f9-0524-46e6-a39d-49da601759fe
📒 Files selected for processing (2)
source/api_cc/include/commonPT.hsource/api_cc/src/DeepPotPTExpt.cc
47b7d06 to
3fba28c
Compare
…MPS inference Replace CPU-side createEdgeTensors() with createEdgeTensorsGPU() that uses torch tensor ops (index_select, gather, nonzero) to build edge_index and edge_index_ext entirely on the GPU. The CPU version iterates O(N*nnei) neighbor pairs in a serial C++ loop, builds 4 separate std::vectors, then copies each to GPU via .clone().to(device). At 65K atoms with ~120 neighbors, this takes ~60-80ms and accounts for 3-6x throughput degradation vs the Python inference path which uses GPU-native edge construction (vesin.torch / edge_schema_from_extended). The GPU version: 1. Converts the jagged LAMMPS nlist to a padded tensor (createNlistTensor, cheap CPU memcpy) 2. Transfers the padded [1, nloc, nnei] tensor to GPU (single H2D) 3. Builds edge topology using gather/scatter/nonzero on GPU (parallel) For the LAMMPS cached path (ago==0), only topology is built on GPU; compactEdgeTensors() still handles per-step geometry recomputation (already on GPU, unchanged). Closes deepmodeling#5595. Related: deepmodeling#5574.
6325194 to
73e90d7
Compare
The function uses standard torch tensor ops and works on any device (CPU or CUDA), so 'GPU' in the name was misleading.
3e3baf6 to
df59e21
Compare
for more information, see https://pre-commit.ci
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #5596 +/- ##
==========================================
- Coverage 81.98% 81.95% -0.04%
==========================================
Files 959 959
Lines 105430 105484 +54
Branches 4071 4072 +1
==========================================
+ Hits 86442 86451 +9
- Misses 17518 17572 +54
+ Partials 1470 1461 -9 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
njzjz-bot
left a comment
There was a problem hiding this comment.
Thanks for pushing the edge construction onto tensor ops — this is a good direction, and the CI signal is encouraging. I found one correctness regression in the LAMMPS edge path: the new device helper assumes neighbor-list row i is centered on atom i, but the previous CPU path explicitly passed nlist_data.ilist as row_centers after shuffle_exclude_empty(). That matters when LAMMPS/real-atom filtering leaves rows in a compacted/order-preserved layout rather than identity center order. Please preserve the row-center mapping in the device path before merging.
Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)
| auto padded_nlist = createNlistTensor(nlist_data.jlist, nnei) | ||
| .to(torch::kInt64) | ||
| .to(device); | ||
| const auto edge_tensors = createEdgeTensorsDevice( |
There was a problem hiding this comment.
This call loses the row_centers information that the old CPU path passed via &nlist_data.ilist. After copy_from_nlist() + shuffle_exclude_empty(), the jlist rows are not guaranteed to have row_index == center_atom; the existing unit test CreateEdgeTensorsUsesRowCenters documents that case. createEdgeTensorsDevice() currently derives dst_actual from the row number, so it can build edges to the wrong destination atom and compute the wrong edge vectors whenever nlist_data.ilist is not identity.
Could you pass a padded/1-D center tensor (from nlist_data.ilist) into the device helper and use it for dst_actual, matching the old row_centers behavior? It would also be worth adding a device-helper test mirroring CreateEdgeTensorsUsesRowCenters.
There was a problem hiding this comment.
CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
Address review from njzjz-bot: after shuffle_exclude_empty(), LAMMPS nlist rows may not have row_index == center_atom. The old CPU path passed nlist_data.ilist as row_centers; the GPU path was deriving dst_actual from the row number, building edges to wrong destinations. Add optional row_centers tensor parameter to createEdgeTensorsDevice. The LAMMPS call site now passes nlist_data.ilist as a device tensor.
d74bfae to
a2ef9ec
Compare
for more information, see https://pre-commit.ci
OutisLi
left a comment
There was a problem hiding this comment.
I will make another PR related to this problem
Summary
Replace CPU-side
createEdgeTensors()withcreateEdgeTensorsGPU()that builds edge topology on GPU using torch tensor ops (index_select,gather,nonzero).Motivation
DPA4/SeZM
.pt2models suffer 3–6× throughput degradation in LAMMPS vs Python/ASE at large atom counts (see #5595). Profiling shows the dominant bottleneck iscreateEdgeTensors()incommonPT.h, which iterates O(N×nnei) neighbor pairs in a serial C++ loop on CPU, then copies 4 tensors to GPU via separate.clone().to(device)calls.At 65K atoms with ~120 neighbors (~7.8M edges), this takes ~60–80ms — 3.2× the model inference itself (measured as LAMMPS Pair time minus pure AOTI runner time).
The Python inference path already avoids this by using
edge_schema_from_extended()which runs equivalent logic as pure torch tensor ops on GPU.Changes
commonPT.hcreateEdgeTensorsGPU(): takes padded nlist tensor + coord + mapping (all on GPU) and producesedge_index+edge_index_extusing the same gather/scatter/nonzero pattern as the Pythonedge_schema_from_extended().fold_to_local(single-rank) and extended (multi-rank) index spaces.with_geometry=truefor the non-LAMMPS path (produces edge_vec + edge_mask + dummy edges).createEdgeTensors()is preserved (not removed) for backward compatibility.DeepPotPTExpt.cccreateEdgeTensors(jagged_nlist, cpu_coord, ...)with:createNlistTensor()→ padded tensor (cheap CPU memcpy).to(device)→ single H2D transfercreateEdgeTensorsGPU()→ GPU parallel edge constructionwith_geometry=true.compactEdgeTensors()(ago>0 reuse path) is unchanged — already runs on GPU.Expected Impact
Estimated LAMMPS DPA4 throughput improvement at 65K atoms: ~50-70% (from ~130K to ~200K+ atom·step/s).
Closes #5595. Related: #5574.
Summary by CodeRabbit
New Features
Performance