FRONTIER 01
Parallelism-Aware Training of Diffusion Language Models
TP
DP
High Novelty
Design tensor- and data-parallelism strategies that exploit the parallel token-generation structure of DLMs during both training and inference.
No published work co-designs TP/DP strategies for diffusion-based language model training. Mercury reports roughly 10x faster inference than comparable autoregressive models on H100s, but training parallelism is unexplored.
NeurIPS / ICML
FRONTIER 02
MoE Meets Diffusion LLMs: Expert Routing for Parallel Token Denoising
EP
TP
High Novelty
Design MoE architectures for DLMs where expert selection is conditioned on both token identity and denoising timestep, enabling timestep-aware expert specialization.
No published work combines MoE with diffusion-based text generation. Could yield the first trillion-parameter DLM trainable on current hardware.
ICLR / NeurIPS
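The timestep-conditioned routing above can be sketched in a few lines. This is a minimal illustration under assumptions of my own (the two weight matrices, a sinusoidal timestep embedding, and a softmax-renormalized top-k gate), not the design of any published DLM: it only shows how the same token representation can be steered to different experts at different denoising steps.

```python
# Hypothetical sketch: a timestep-conditioned MoE router for a diffusion LM.
# Expert logits depend on both the token representation and an embedding of
# the denoising timestep, so experts can specialize per timestep.
import numpy as np

rng = np.random.default_rng(0)
D, T_DIM, E, TOP_K = 16, 16, 8, 2  # hidden dim, timestep-emb dim, experts, top-k

W_tok = rng.normal(scale=0.1, size=(E, D))       # token-conditioned routing weights
W_time = rng.normal(scale=0.1, size=(E, T_DIM))  # timestep-conditioned weights

def timestep_embedding(t: int, dim: int) -> np.ndarray:
    """Standard sinusoidal embedding of the denoising timestep."""
    freqs = np.exp(-np.log(10000.0) * np.arange(dim // 2) / (dim // 2))
    ang = t * freqs
    return np.concatenate([np.sin(ang), np.cos(ang)])

def route(x: np.ndarray, t: int):
    """Return (expert_ids, gate_weights) for one token at denoising step t."""
    logits = W_tok @ x + W_time @ timestep_embedding(t, T_DIM)
    top = np.argsort(logits)[-TOP_K:][::-1]      # top-k expert ids
    w = np.exp(logits[top] - logits[top].max())
    return top, w / w.sum()                      # renormalized gate weights

x = rng.normal(size=D)
experts_early, w_early = route(x, t=900)  # early (noisy) denoising step
experts_late, w_late = route(x, t=10)     # late (nearly clean) step
```

The key design choice is that `W_time` gives the router a second input axis, so load-balancing losses would have to be computed per timestep bucket rather than globally.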
FRONTIER 03
Adaptive Context Parallelism with Dynamic Sparse Attention
CP
SP
High Novelty
Dynamically adjust CP degree per layer based on runtime attention sparsity patterns, allocating more GPUs to dense-attention layers and fewer to sparse ones.
Adaptive CP could save 30–50% communication for 1M+ token sequences. RingX (SC25) achieves 3.4x speedup but still uses uniform partitioning.
NeurIPS / ICML
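A minimal sketch of the per-layer CP-degree policy, assuming profiled attention density per layer is available. The density-to-degree mapping (scale by the GPU budget, round up to a power of two so ring groups stay regular) is an illustrative heuristic of my own, not a published rule.

```python
# Hypothetical sketch: choose a per-layer context-parallel (CP) degree from
# measured attention-sparsity statistics. Dense layers get more GPUs.
import math

def cp_degree(density: float, max_gpus: int = 8) -> int:
    """Map attention density (fraction of unpruned attention mass, in [0, 1])
    to a CP degree, rounded up to a power of two for regular ring groups."""
    target = max(1.0, density * max_gpus)
    return min(max_gpus, 2 ** math.ceil(math.log2(target)))

# Profiled per-layer density (illustrative numbers, not measurements).
layer_density = [0.95, 0.40, 0.10, 0.05, 0.70]
plan = [cp_degree(rho) for rho in layer_density]  # CP degree per layer
```

Under this toy profile the two near-fully-sparse layers run without CP at all, freeing those GPUs for the dense layers.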
FRONTIER 04
Speculative Decoding for Discrete Diffusion Models
DP
TP
High Novelty
Adapt speculative decoding to discrete diffusion LLMs: a small "draft denoiser" predicts token masks, a large "verifier denoiser" confirms or corrects across multiple GPUs.
Combines speculative decoding and diffusion-based generation, two of the most active threads in ML systems research. Could reduce the number of full-size denoising passes from 10–100 down to 3–5.
ICML / ICLR
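The draft-then-verify loop for masked discrete diffusion can be illustrated with a toy single step. Both "denoisers" below are stand-in functions and the accept rule (keep a position only when the verifier agrees with the draft, re-mask otherwise) is my assumption about how the protocol would work, not an established algorithm.

```python
# Hypothetical toy sketch: one speculative step for a masked discrete
# diffusion LM. A cheap draft model fills all masked slots; a large verifier
# confirms each fill, and disagreements are re-masked for the next step.
MASK = "<mask>"

def draft_denoiser(seq):
    """Cheap model: fills every masked slot with its top guess (toy table)."""
    guesses = {1: "cat", 3: "mat", 4: "sat"}
    return [guesses.get(i, tok) if tok == MASK else tok
            for i, tok in enumerate(seq)]

def verifier_denoiser(seq, masked_positions):
    """Large model: its own prediction for each originally-masked slot."""
    preds = {1: "cat", 3: "rug", 4: "sat"}  # disagrees with draft at position 3
    return {i: preds[i] for i in masked_positions}

def speculative_step(seq):
    masked = [i for i, t in enumerate(seq) if t == MASK]
    drafted = draft_denoiser(seq)
    verified = verifier_denoiser(drafted, masked)
    out, rejected = list(drafted), []
    for i in masked:
        if verified[i] != drafted[i]:
            out[i] = MASK  # re-mask: next denoising step retries this slot
            rejected.append(i)
    return out, rejected

seq = ["the", MASK, "on", MASK, MASK]
out, rejected = speculative_step(seq)  # two fills accepted, one re-masked
```

In a real system the verifier call would be sharded across GPUs (the DP/TP tags above), and only the re-masked positions would need further large-model denoising steps.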
FRONTIER 05
Carbon-Aware Dynamic Parallelism Reconfiguration
All 5D
High Novelty
Build a system that continuously monitors grid carbon intensity and dynamically reconfigures the parallelism strategy to minimize carbon footprint while maintaining throughput SLAs.
Training a single frontier model emits 300+ tonnes of CO₂. No work reconfigures parallelism dimensions based on energy signals. Highly interdisciplinary (systems + sustainability).
NeurIPS / ICLR (Green AI)
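The core control loop can be sketched as a constrained selection problem. The configuration table, power draws, and carbon-intensity feed below are illustrative assumptions; a real system would profile candidate configurations and poll a grid API.

```python
# Hypothetical sketch: pick the parallelism configuration with the lowest
# carbon per token, subject to a throughput SLA.
CONFIGS = [
    # (name, throughput tokens/s, cluster power draw kW) -- illustrative
    ("TP8-DP32", 5.0e6, 420.0),
    ("TP4-DP64", 4.2e6, 310.0),
    ("TP2-DP128", 3.1e6, 240.0),
]

def pick_config(carbon_intensity_g_per_kwh: float, min_throughput: float):
    """Return the SLA-feasible config with the lowest gCO2 per 1M tokens."""
    best, best_carbon = None, float("inf")
    for name, tput, power_kw in CONFIGS:
        if tput < min_throughput:
            continue  # violates the throughput SLA
        # kWh consumed per 1e6 tokens, times grid intensity -> grams CO2
        kwh_per_mtok = power_kw * (1e6 / tput) / 3600.0
        carbon = kwh_per_mtok * carbon_intensity_g_per_kwh
        if carbon < best_carbon:
            best, best_carbon = name, carbon
    return best, best_carbon

# With a 4.0e6 tokens/s SLA, the mid-power config is both feasible and greener.
choice, _ = pick_config(carbon_intensity_g_per_kwh=450.0, min_throughput=4.0e6)
```

The interesting systems problem is what this sketch omits: the cost of the reconfiguration itself (checkpoint, re-shard, warm up), which the controller would have to amortize against how long the carbon signal is expected to persist.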
FRONTIER 06
Heterogeneous Expert Parallelism: Right-Sizing Experts to GPUs
EP
DP
Introduce heterogeneous expert architectures where expert capacity is matched to the GPU it runs on — larger experts on faster GPUs, smaller on slower ones.
Could unlock MoE training for organizations that cannot afford homogeneous H100 clusters. No existing work adapts the model itself to hardware heterogeneity.
NeurIPS / MLSys
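The right-sizing rule can be sketched directly: scale each expert's FFN width by the relative peak throughput of its host GPU so all experts finish in roughly equal wall-clock time. The TFLOPS figures approximate published peak half-precision specs, and the base width and 128-multiple rounding are illustrative assumptions.

```python
# Hypothetical sketch: match expert capacity (FFN hidden dim) to the GPU
# hosting it, so heterogeneous experts have balanced step times.
GPUS = {"h100": 989.0, "a100": 312.0, "v100": 125.0}  # approx. peak TFLOPS

def expert_hidden_dims(placement, base_dim=4096, base_gpu="a100", multiple=128):
    """Scale base_dim by relative GPU speed, rounded to a hardware-friendly
    multiple so GEMM shapes stay efficient."""
    base = GPUS[base_gpu]
    dims = {}
    for expert_id, gpu in placement.items():
        raw = base_dim * GPUS[gpu] / base
        dims[expert_id] = max(multiple, multiple * round(raw / multiple))
    return dims

placement = {0: "h100", 1: "h100", 2: "a100", 3: "v100"}
dims = expert_hidden_dims(placement)  # bigger experts land on faster GPUs
```

Note the modeling question this raises for training: the router must learn that experts now differ in capacity, not just in specialization.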
FRONTIER 07
Distributed Parallel Reasoning with Speculative Verification
DP
PP
Distribute reasoning across multiple GPUs: a planner decomposes problems into sub-tasks, executor GPUs solve in parallel, a verifier speculatively checks solutions.
No work distributes reasoning across a GPU cluster. Reasoning models produce 10K–100K tokens per query — distributed inference is essential.
ICLR / NeurIPS
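The planner / parallel-executor / verifier split can be shown with a toy pipeline, using threads in place of executor GPUs. Summing integer ranges stands in for the sub-problems a reasoning model would solve, and the closed-form check stands in for speculative verification; all three roles are assumptions for illustration.

```python
# Hypothetical toy sketch of distributed parallel reasoning:
# planner decomposes -> executors solve in parallel -> verifier checks.
from concurrent.futures import ThreadPoolExecutor

def planner(n, workers):
    """Decompose 'sum 1..n' into per-worker sub-ranges."""
    step = n // workers
    return [(i * step + 1, n if i == workers - 1 else (i + 1) * step)
            for i in range(workers)]

def executor(task):
    lo, hi = task
    return sum(range(lo, hi + 1))  # one "executor GPU" solving a sub-task

def verifier(n, partials):
    """Accept if the recombined answer matches an independent cheap check."""
    return sum(partials) == n * (n + 1) // 2

tasks = planner(1000, workers=4)
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(executor, tasks))
accepted = verifier(1000, partials)
```

The systems challenge hiding behind the toy is load imbalance: real sub-tasks produce wildly different token counts, so the executor pool needs work stealing rather than a static split.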
FRONTIER 08
Zero-Bubble Pipeline Parallelism for Diffusion Model Training
PP
DP
High Novelty
Design pipeline schedules optimized for diffusion model training, exploiting their fundamentally different computation graph (denoising + score matching loss).
Better PP could reduce training cost of Stable Diffusion 3, FLUX, and Sora-scale models by 20–30%. Existing zero-bubble PP schedules assume autoregressive Transformer training.
ICML / NeurIPS
FRONTIER 09
Communication-Free Expert Parallelism via Learned Expert Compression
EP
TP
Train a lightweight autoencoder to compress token representations during expert dispatch, reducing all-to-all communication by 4–8x without quality loss.
All-to-all communication consumes 40%+ of MoE training time at scale. Could enable MoE training on clusters with slower interconnects (Ethernet instead of InfiniBand).
ICML / NeurIPS
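The compression idea can be sketched with a linear stand-in for the learned autoencoder: here the encoder/decoder pair is fit with an SVD (the optimal linear autoencoder), which is my substitution for the trained nonlinear compressor the card proposes. Dimensions and the low-rank activation model are illustrative assumptions.

```python
# Hypothetical sketch: compress token activations from d=1024 to k=128 floats
# before the expert all-to-all dispatch -- an 8x cut in communication volume.
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 1024, 128, 512

# Activations with low intrinsic rank, standing in for real expert inputs.
basis = rng.normal(size=(64, d))
acts = rng.normal(size=(n, 64)) @ basis

# "Train" the compressor: top-k right singular vectors form the optimal
# linear encoder/decoder pair for this batch.
_, _, vt = np.linalg.svd(acts, full_matrices=False)
enc = vt[:k].T  # d x k encoder, applied before dispatch
dec = vt[:k]    # k x d decoder, applied on the receiving rank

compressed = acts @ enc   # this is what actually crosses the all-to-all
restored = compressed @ dec
ratio = d / k                                          # communication reduction
rel_err = np.linalg.norm(restored - acts) / np.linalg.norm(acts)
```

Because the toy activations have rank 64 < k, reconstruction is essentially exact; the open research question is how much rank real MoE activations have, and whether gradients tolerate the lossy return path.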
FRONTIER 10
Ring Attention + Speculative Decoding for Million-Token Inference
CP
DP
Combine Ring Attention's distributed KV-cache with speculative decoding: draft tokens generated locally while the ring asynchronously prefetches KV-cache segments for verification.
Speculative decoding assumes KV-cache fits on one GPU — impossible for 1M+ contexts. Long-context inference is a major deployment challenge.
NeurIPS / ICLR
FRONTIER 11
Async RLHF with Disaggregated Expert-Parallel Reward Models
EP
PP
Use MoE-based reward models with expert parallelism, disaggregated from the policy model's GPU pool, enabling asynchronous reward scoring overlapped with training.
RLHF spends 80% of time on generation. MoE reward models could score different quality aspects (helpfulness, safety, style) via different experts on different GPUs.
ICML / NeurIPS
FRONTIER 12
Scaling Test-Time Training Layers via Context Parallelism
CP
SP
Scale TTT layers to million-token contexts using context parallelism with large-chunk gradient aggregation across GPUs.
TTT layers have linear complexity but existing methods achieve less than 5% GPU utilization. Combining TTT with Transformer parallelism strategies is unexplored.
ICLR / NeurIPS
FRONTIER 13
Topology-Aware Expert Placement for Disaggregated MoE Serving
EP
PP
Learn a topology-aware expert placement policy from production routing traces, using graph partitioning to minimize cross-node communication while maintaining load balance.
DeepSeek-V3's 256 experts create ~4.4B possible routing combinations. LMSYS reported that expert placement significantly impacts throughput on 96 H100s.
OSDI / NeurIPS
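A greedy co-location heuristic gives the flavor of the placement problem; it stands in for the graph partitioner the card proposes, and the routing trace and node sizes are toy assumptions. Experts frequently activated together by the same token are packed onto one node so their dispatch traffic stays intra-node.

```python
# Hypothetical sketch: place experts onto nodes by merging the heaviest
# co-activated pairs, then measure residual cross-node traffic.
from collections import Counter
from itertools import combinations

# Each trace entry: the experts one token was routed to (top-2 routing here).
traces = [(0, 1), (0, 1), (0, 2), (2, 3), (2, 3), (2, 3), (1, 3)]
co_activation = Counter()
for experts in traces:
    for a, b in combinations(sorted(experts), 2):
        co_activation[(a, b)] += 1

def place(num_experts, node_capacity):
    """Greedy: walk pairs by descending co-activation, co-locate when possible."""
    node_of, nodes = {}, []
    for (a, b), _ in co_activation.most_common():
        for e in (a, b):
            if e not in node_of:
                partner = b if e == a else a
                p_node = node_of.get(partner)
                if p_node is not None and len(nodes[p_node]) < node_capacity:
                    nodes[p_node].append(e)   # join partner's node
                    node_of[e] = p_node
                else:
                    node_of[e] = len(nodes)   # open a new node
                    nodes.append([e])
    for e in range(num_experts):              # experts never seen in a pair
        node_of.setdefault(e, -1)
    return node_of, nodes

node_of, nodes = place(num_experts=4, node_capacity=2)
cross = sum(n for (a, b), n in co_activation.items()
            if node_of[a] != node_of[b])      # tokens still crossing nodes
```

On this trace the greedy pass co-locates the {2,3} and {0,1} hot pairs, leaving only the two rarer cross-pair activations as inter-node traffic.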
FRONTIER 14
Energy-Aware Expert Routing in Mixture-of-Experts Models
EP
DP
High Novelty
Add an energy cost term to the MoE routing decision: tokens are routed considering both quality (affinity) and the real-time energy cost of the GPU hosting each expert.
No existing work incorporates energy signals into MoE routing. Could reduce inference energy by 20–40% with minimal quality degradation. Aligns with the Green AI movement.
NeurIPS / ICLR (Green AI)
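The routing change itself is one line: subtract a weighted energy penalty from the affinity logits before top-k selection. The affinities, per-GPU joules-per-token figures, and the lambda weight below are illustrative assumptions.

```python
# Hypothetical sketch: energy-aware top-k expert selection. Each expert's
# host GPU reports a live energy cost, which discounts that expert's score.
import numpy as np

affinity = np.array([2.0, 1.8, 0.5, 0.4])           # router logits, 4 experts
joules_per_token = np.array([1.2, 0.3, 0.8, 0.1])   # live cost per host GPU

def energy_aware_topk(affinity, energy, lam=1.0, k=2):
    """Select top-k experts by affinity minus a weighted energy penalty."""
    score = affinity - lam * energy
    return sorted(np.argsort(score)[-k:].tolist())

greedy = energy_aware_topk(affinity, joules_per_token, lam=0.0)  # quality-only
green = energy_aware_topk(affinity, joules_per_token, lam=2.0)   # energy-aware
```

With `lam=0` the router picks the two highest-affinity experts; with a nonzero lambda the power-hungry expert 0 is swapped for the efficient expert 3, which is exactly the quality-energy trade the card proposes to study.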
FRONTIER 15
Unified Parallelism Search: AutoML for 5D Parallelism Configuration
All 5D
High Novelty
Formulate 5D parallelism configuration as a combinatorial optimization problem solved with a learned cost model + search algorithm across the full TP/DP/PP/CP/EP space.
For a 256-GPU cluster, there are thousands of valid configurations. Could become the "NAS for parallelism," a foundational tool for the field. Alpa automated only intra-operator (TP/DP) and inter-operator (PP) parallelism, leaving CP and EP outside the search space.
NeurIPS / OSDI / MLSys
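The search-problem formulation can be sketched by brute force on a small cluster. The analytical cost model and the memory constraint below are toy assumptions; the card's proposal is precisely to replace such a hand-written model with one learned from profiling runs.

```python
# Hypothetical sketch: enumerate 5D parallelism configs (TP, PP, DP, CP, EP)
# for a 64-GPU cluster and pick the cheapest under a toy cost model.
from itertools import product

GPUS = 64
DIMS = [1, 2, 4, 8, 16, 32, 64]  # candidate degree per dimension

def cost(tp, pp, dp, cp, ep):
    """Toy cost: perfectly-split compute, plus per-dimension comm penalties
    and a pipeline-bubble term (all coefficients are made up)."""
    compute = 1.0 / (tp * pp * dp * cp * ep)
    comm = 0.05 * (tp - 1) + 0.02 * (cp - 1) + 0.03 * (ep - 1)
    bubble = 0.01 * (pp - 1)
    return compute + comm + bubble

valid = [c for c in product(DIMS, repeat=5)
         if c[0] * c[1] * c[2] * c[3] * c[4] == GPUS
         and c[0] * c[1] * c[4] >= 8]  # toy memory rule: shard weights 8-way
best = min(valid, key=lambda c: cost(*c))  # (tp, pp, dp, cp, ep)
```

Even this toy space has hundreds of feasible points, and the argmin shifts whenever a coefficient changes, which is why a learned cost model plus guided search (rather than exhaustive enumeration) is the interesting version of the problem at 256+ GPUs.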