TensorProbe (code name: kj600) is a LLM pretrain debugger with model's torch module , optimizer status, collective communication tensor collection and aggregation. It also supports rule-based alerts.
fault injection for pytorch by torch_distpatch
Contributions last year: 355
Max continuous contributions: 10
Recent contributions: 1
Commits, issues, and pull requests will appear on your contribution graph. Only when the email address used for the commits in local configuration is associated with your GitOSC account, the commits' contribution will be counted.