Feat: Activation Checkpointing by Chamberlain0w0 · Pull Request #154 · InfiniTensor/InfiniTrain

Chamberlain0w0 · 2026-05-21T02:49:58Z

No description provided.

chen2021673 · 2026-06-09T03:04:34Z

@@ -1,12 +1,15 @@
 #include "infini_train/include/nn/modules/transformer/transformer.h"


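If recompute is later extended with other features (such as selective recompute), it will no longer make sense to keep everything in transformer.cc; at that point the logic can be split out into activation_recompute.cc. No change is needed in this PR.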

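Agreed, I hadn't fully thought this part through. The main reason is that Megatron's implementation also folds much of its recomputation logic into the transformer model layer. We can discuss this further later.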

Chamberlain0w0 · 2026-06-25T07:32:52Z

+// Used by non-reentrant checkpoint recomputation so downstream SetupContext
+// calls see the same needs_input_grad_ pattern as the original forward,
+// without wiring the recompute graph into the engine.
+class PropagateRequiresGradGuard {


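In essence, this needs fine-grained control over 2 × 2 = 4 cases (with_grad / no_grad × building the graph with reference counting / only propagating requires_grad).

Alternatively, instead of implementing a separate guard, the existing GradGuard could get an overloaded constructor that takes a parameter (for example `bool record_context = false`), so that GradGuard itself covers all four cases.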
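The single-guard alternative described above could look roughly like the following. This is only a sketch under stated assumptions: `GradMode`, `CurrentGradMode`, and `DemoNestedGuards` are hypothetical names, the `GradGuard` here is a stand-in rather than InfiniTrain's actual class, and the reading of `record_context` (true = build the graph with reference counting, false = only propagate requires_grad) is a guess at the intended meaning.

```cpp
#include <cassert>

// Hypothetical per-thread autograd mode. The two flags span the 2 x 2 cases:
// grad on/off, and "build the graph" vs. "only propagate requires_grad".
struct GradMode {
    bool grad_enabled = true;
    bool record_context = true;  // assumed meaning: true = build graph with ref counting
};

inline GradMode &CurrentGradMode() {
    thread_local GradMode mode;
    return mode;
}

// Stand-in RAII guard: sets both flags on construction and restores the
// previous mode on destruction, so guards can be nested safely.
class GradGuard {
public:
    explicit GradGuard(bool grad_enabled, bool record_context = false)
        : saved_(CurrentGradMode()) {
        CurrentGradMode().grad_enabled = grad_enabled;
        CurrentGradMode().record_context = record_context;
    }
    ~GradGuard() { CurrentGradMode() = saved_; }

    GradGuard(const GradGuard &) = delete;
    GradGuard &operator=(const GradGuard &) = delete;

private:
    GradMode saved_;
};

// Usage: a no_grad region containing a recompute region that propagates
// requires_grad without building the graph.
inline void DemoNestedGuards() {
    GradGuard outer(/*grad_enabled=*/false);
    {
        GradGuard inner(/*grad_enabled=*/true, /*record_context=*/false);
        assert(CurrentGradMode().grad_enabled && !CurrentGradMode().record_context);
    }
    // Leaving the inner scope restores the outer no_grad mode.
    assert(!CurrentGradMode().grad_enabled);
}
```

Keeping both flags in one saved struct means a single guard type restores the full mode on scope exit, which is the main attraction over maintaining two separate guard classes.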

Chamberlain0w0 · 2026-06-25T07:43:28Z

+        try {
+            forward_fn(detached_inputs);
+        } catch (const StopRecomputeError &) {
+            // Early-stop: expected when all needed tensors are recomputed.


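This only uses exception syntax and control flow to implement early-stop semantics; it does not actually throw an exception that would abort the program. torch takes the same approach.

I haven't come up with a more elegant way to handle this for now.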
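To illustrate the early-stop pattern discussed here, the sketch below unwinds a forward pass with a sentinel exception once enough values have been recomputed. It is a toy model, not the PR's code: `RecomputeUntil` and the integer "tensors" are hypothetical, and only the `StopRecomputeError` name is borrowed from the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <functional>
#include <vector>

// Sentinel exception used purely for control flow (name taken from the diff;
// this definition is an assumption).
struct StopRecomputeError : std::exception {
    const char *what() const noexcept override { return "recompute early stop"; }
};

// Hypothetical helper: runs `steps` in order as a stand-in for forward_fn and
// stops as soon as `needed` values have been recomputed.
inline std::vector<int> RecomputeUntil(const std::vector<std::function<int()>> &steps,
                                       std::size_t needed) {
    assert(needed <= steps.size());
    std::vector<int> recomputed;
    auto forward_fn = [&] {
        for (const auto &step : steps) {
            if (recomputed.size() == needed) {
                throw StopRecomputeError{};  // everything needed is available
            }
            recomputed.push_back(step());
        }
    };
    try {
        forward_fn();
    } catch (const StopRecomputeError &) {
        // Expected: the remaining steps are skipped, nothing is reported as an error.
    }
    return recomputed;
}
```

Because the catch site names only the sentinel type, any genuine error raised inside the forward pass still propagates to the caller.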

Chamberlain0w0 force-pushed the feat/activation_checkpointing branch from d87926e to ba8db8f Compare June 3, 2026 08:45

Chamberlain0w0 changed the title ~~[WIP] Feat: Activation Checkpointing~~ Feat: Activation Checkpointing Jun 5, 2026

Chamberlain0w0 requested a review from chen2021673 June 5, 2026 02:14

chen2021673 reviewed Jun 9, 2026

Comment thread infini_train/src/utils/checkpoint.cc Outdated

chen2021673 reviewed Jun 9, 2026

Comment thread infini_train/include/autocast.h

chen2021673 reviewed Jun 9, 2026

Comment thread infini_train/src/nn/modules/module.cc Outdated

Chamberlain0w0 force-pushed the feat/activation_checkpointing branch from 9cdf73c to a38a4d3 Compare June 11, 2026 01:21

Chamberlain0w0 added 3 commits June 25, 2026 14:10

feat: init ac

640927a

refactor: cleanup & add comments

ab0c27d

fix: resolve comments

e594d9c

Chamberlain0w0 force-pushed the feat/activation_checkpointing branch from a38a4d3 to e594d9c Compare June 25, 2026 06:47

Chamberlain0w0 commented Jun 25, 2026

Chamberlain0w0 wants to merge 3 commits into master from feat/activation_checkpointing

2 participants
