Open Source Repository
🔗 GitHub: https://github.com/RLsys-Foundation/TritonForge
Contributors
Jin Pan, Xiang Long, Chengxing Xie, Kexun Zhang, Haoran Wang, Junrong Lin, Yuzhen Zhou, Jiajun Li, Yang Wang, Xiaodong Yu, Gowtham Ramesh, Yusheng Su, Zicheng Liu, Emad Barsoum
1. TL;DR
TritonForge is a Server-based RL training and evaluation closed-loop system designed for multi-turn Agent tasks, built on the slime (SGLang-native) + Megatron foundation. It focuses on Triton kernel generation with stable and scalable practices across both NVIDIA and AMD ecosystems. The design goal is to transform "the instability of multi-turn RL in real-world environments" into implementable, scalable, and maintainable system capabilities.
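Concretely, the closed loop can be pictured as rollout → evaluate → train. Below is a minimal Python sketch of that loop under assumptions of our own: all class and function names (`Sample`, `rollout`, `evaluate`, `closed_loop_step`) are hypothetical illustrations, not TritonForge's actual API.

```python
# Minimal sketch of a server-based RL closed loop for kernel generation.
# All names here are illustrative placeholders, not TritonForge's real API.
from dataclasses import dataclass

@dataclass
class Sample:
    prompt: str
    completion: str
    raw_reward: float

def rollout(prompt: str, n: int) -> list:
    """Ask the inference side (e.g., via an SGLang router) for n candidate kernels."""
    return [f"kernel_v{i} for {prompt}" for i in range(n)]  # placeholder completions

def evaluate(kernel: str) -> float:
    """Eval server stand-in: compile/run the kernel and return a scalar raw reward."""
    return float(len(kernel) % 2)  # placeholder correctness/performance score

def closed_loop_step(prompts, n=8):
    """One iteration: sample n candidates per prompt, score each, group results."""
    groups = []
    for p in prompts:
        completions = rollout(p, n)
        groups.append([Sample(p, c, evaluate(c)) for c in completions])
    return groups  # handed to the trainer for advantage computation and updates
```

The point of the decoupling is that `rollout` and `evaluate` are remote services behind stable interfaces, so the trainer, router, and evaluator can scale and fail independently.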
Regarding methodology and task design, we draw inspiration from Kevin (multi-turn RL for generating CUDA kernels) and KernelBench (kernel correctness and performance evaluation benchmark)—representing the multi-turn RL training paradigm and engineering evaluation standards, respectively.
- Architecture Philosophy: the Server-based design decouples training, routing, and evaluation; the SGLang Router natively supports multiple inference services and high concurrency; the Buffer operates on "groups", running multi-sample rollout (e.g., n=8) → filtering → normalization → padding under a unified raw_reward standard.
- Methodology Overview:
  - SFT cold start (KernelBook-style data; filtering of extreme-length samples to avoid OOM);
  - RL (primarily GRPO, with GSPO/TIS integrated for future side-by-side comparison);
  - Eval Server based on the KernelBench backend with engineering enhancements (subprocess isolation, timeout/fault classification, CUDA/Triton dual backends).
- Early Results (on a fine-tuned Qwen3-8B):
  - Single-turn @ AMD: 0.116 → 0.175, +5.9 percentage points (≈+50.9%)
  - Multi-turn @ NV: 0.24 → 0.36, +12.0 percentage points (+50.0%)
  - Single-turn @ NV: 0.102 → 0.223, +12.1 percentage points (≈+118.6%)
  - Multi-turn @ AMD: issue identified, fix in progress
- Open Source and Scalability: We have open-sourced the end-to-end Server-based framework and slime_plugins (single/multi-turn kernel generators, Buffer five-component hooks), using the slime + SGLang paradigm to facilitate future plans for integrating more algorithms (GRPO/GSPO/TIS/…), MoE models, and complete Agentic tool-calling workflows.
- Recommended Reading (Inspiration Sources):

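The Buffer's per-group pipeline mentioned above (sample n candidates per prompt → filter failures → normalize rewards within the group → pad to a fixed size) can be sketched as follows. This is a hypothetical, GRPO-style illustration, not TritonForge's actual Buffer code; `process_group` and its parameters are names of our own.

```python
# Hypothetical sketch of per-group reward processing (not TritonForge's code):
# filter out failed samples, normalize raw rewards within the group
# (GRPO-style: subtract mean, divide by std), then pad to a fixed group size.
import math

def process_group(raw_rewards, n_expected=8, pad_value=0.0):
    # Filter: drop failed samples, marked here with None (e.g., compile errors).
    kept = [r for r in raw_rewards if r is not None]
    if not kept:
        return [pad_value] * n_expected  # whole group failed: pad everything
    # Normalize within the group: zero mean, unit std across the n samples.
    mean = sum(kept) / len(kept)
    var = sum((r - mean) ** 2 for r in kept) / len(kept)
    std = math.sqrt(var) or 1.0  # avoid division by zero when all rewards tie
    advantages = [(r - mean) / std for r in kept]
    # Pad back to the fixed group size so downstream batches stay rectangular.
    advantages += [pad_value] * (n_expected - len(advantages))
    return advantages

advs = process_group([1.0, 0.0, None, 1.0], n_expected=4)
```

Normalizing within the group rather than across the batch is what makes the raw_reward standard uniform: each prompt's candidates are scored relative to each other, so prompts of very different difficulty still yield comparable advantage scales.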
2. Technical Choices
2.1 Why slime? (From verl → slime)
Where we started
We initially planned to build the full multi-turn RL pipeline on verl: