Applications Open

GPU Engineers
4-Month Research Bootcamp

Master 5D Parallelism. Publish at NeurIPS, ICML, ICLR. Go from GPU fundamentals to a conference-ready research paper with 1:1 asynchronous mentorship on the Vizuara Platform.

12
Lectures
30+
Hours of Content
15
Research Frontiers
4
Months Total
1:1
Mentorship

Master the 5 Dimensions of Modern Parallelism

Every frontier AI model — GPT-5, DeepSeek-V3, Llama 4 — is trained using 5D Parallelism. This bootcamp teaches you to build, optimize, and research at the intersection of parallelism and modern AI.

🧩

TP

Tensor Parallelism

Split weight matrices across GPUs for intra-layer parallelism

📊

DP

Data Parallelism

Replicate model, split data batches across workers

🔄

PP

Pipeline Parallelism

Split model layers across stages for inter-layer parallelism

📝

CP/SP

Context / Sequence Parallelism

Split along sequence length for long-context processing

🤖

EP

Expert Parallelism

Distribute MoE experts across GPUs for sparse models
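Each of these dimensions is mechanical rather than mysterious. As a minimal illustration of the first one, tensor parallelism's column split can be simulated in plain NumPy, with array shards standing in for GPUs (all names here are illustrative, not from any framework):

```python
import numpy as np

# Toy linear layer y = x @ W, with W column-split across 2 simulated "GPUs".
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))   # activations, replicated on both "GPUs"
W = rng.standard_normal((8, 6))   # full weight matrix

# Column parallelism: each "GPU" holds half of W's output columns.
W0, W1 = W[:, :3], W[:, 3:]

# Each "GPU" computes its partial output independently, with no communication.
y0 = x @ W0
y1 = x @ W1

# An all-gather along the column axis reconstructs the full output.
y = np.concatenate([y0, y1], axis=1)

assert np.allclose(y, x @ W)      # matches the unsharded computation
```

The same split-compute-gather pattern, applied along different axes (batch, layers, sequence, experts), is what the other four dimensions generalize.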

Lecture Series: 30+ Hours of Teaching

12 lectures over 1.5 months, taught by Dr. Raj Dandekar, covering GPU programming fundamentals through cutting-edge parallelism strategies.

01

GPU Architecture & CUDA Fundamentals

Warps, memory hierarchy, SM occupancy, profiling with Nsight

02

Triton & Custom GPU Kernels

Writing high-performance GPU kernels with OpenAI Triton

03

NCCL & Distributed Communication

All-reduce, all-gather, ring topologies, NCCL internals

04

Data Parallelism & ZeRO

DDP, FSDP, ZeRO stages 1–3, gradient compression

05

Tensor Parallelism

Megatron-LM style column/row splitting, async TP

06

Pipeline Parallelism

GPipe, 1F1B, zero-bubble schedules, interleaving

07

Context & Sequence Parallelism

Ring Attention, Striped Attention, Ulysses for long contexts

08

Expert Parallelism & MoE

All-to-all routing, load balancing, DeepSeek-V3 architecture

09

5D Parallelism Integration

Combining TP+DP+PP+CP+EP, Megatron-LM 5D config

10

Diffusion Language Models

MDLM, Mercury, SEDD — parallel token generation at scale

11

Speculative Decoding & Inference

Draft/verify paradigm, distributed inference, KV-cache parallelism

12

Research Methods & Paper Writing

Experiment design, benchmarking, writing for NeurIPS/ICML
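Behind the NCCL collectives covered in lecture 03 sits a simple ring algorithm. A plain-Python simulation of ring all-reduce, with a list of arrays standing in for workers and no real communication, might look like this (the function and variable names are our own, not NCCL's):

```python
import numpy as np

def ring_allreduce(shards):
    """Simulate a NCCL-style ring all-reduce over `shards` (one array per worker).
    Phase 1 (reduce-scatter): after n-1 steps each worker holds one fully
    reduced chunk. Phase 2 (all-gather): n-1 more steps circulate those chunks."""
    n = len(shards)
    chunks = [np.array_split(s.astype(float), n) for s in shards]  # [worker][chunk]

    # Reduce-scatter: at each step, worker r sends chunk (r - step) to worker r+1,
    # which accumulates it. After n-1 steps, worker r owns reduced chunk (r+1) % n.
    for step in range(n - 1):
        for r in range(n):
            c = (r - step) % n
            chunks[(r + 1) % n][c] = chunks[(r + 1) % n][c] + chunks[r][c]

    # All-gather: circulate the fully reduced chunks around the ring.
    for step in range(n - 1):
        for r in range(n):
            c = (r + 1 - step) % n
            chunks[(r + 1) % n][c] = chunks[r][c]

    return [np.concatenate(c) for c in chunks]

# Usage: three "workers", each holding a local gradient vector.
grads = [np.arange(6.0) + r for r in range(3)]
out = ring_allreduce(grads)
assert all(np.allclose(o, sum(grads)) for o in out)
```

Each worker sends and receives only 1/n of the data per step, which is why the ring algorithm's bandwidth cost stays nearly constant as the worker count grows.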

Your Research Journey

Over 2.5 months, go from selecting a frontier research topic to submitting a conference-ready paper — with dedicated 1:1 mentorship at every step.

01

Research Topic Selection

Choose from 15 frontier research directions at the intersection of 5D parallelism and modern AI breakthroughs.

  • Each topic is a standalone 3–6 month research project
  • Topics marked High Novelty have no published prior work
  • Matched to your background and interests
02

Full Paper Writing Process

Guided through every stage of producing a publication-quality research paper.

  • Literature survey and related work positioning
  • Abstract, methodology, experiment design
  • Results analysis and paper formatting
  • Camera-ready preparation and submission
03

1:1 Asynchronous Mentorship

Dedicated asynchronous mentorship through the Vizuara Platform — no Zoom calls, no scheduling conflicts. Get detailed feedback on your own time.

  • Asynchronous feedback on the dedicated Vizuara Platform
  • Detailed reviewer-style comments on paper drafts
  • Guidance on experiment design and GPU profiling
  • Work at your own pace, get responses within 24–48 hours

5D Parallelism Meets Modern AI

Each topic combines a modern AI breakthrough with a parallelism dimension to yield a novel research direction suitable for top venues.

FRONTIER 01

Parallelism-Aware Training of Diffusion Language Models

TP DP High Novelty
Design tensor- and data-parallelism strategies that exploit the parallel token-generation structure of DLMs during both training and inference.

No published work co-designs TP/DP strategies for diffusion-based language model training. Mercury achieves 10x faster inference on H100s, but training parallelism is unexplored.

NeurIPS / ICML
FRONTIER 02

MoE Meets Diffusion LLMs: Expert Routing for Parallel Token Denoising

EP TP High Novelty
Design MoE architectures for DLMs where expert selection is conditioned on both token identity and denoising timestep, enabling timestep-aware expert specialization.

No published work combines MoE with diffusion-based text generation. Could yield the first trillion-parameter DLM trainable on current hardware.

ICLR / NeurIPS
FRONTIER 03

Adaptive Context Parallelism with Dynamic Sparse Attention

CP SP High Novelty
Dynamically adjust CP degree per layer based on runtime attention sparsity patterns, allocating more GPUs to dense-attention layers and fewer to sparse ones.

Adaptive CP could save 30–50% communication for 1M+ token sequences. RingX (SC25) achieves 3.4x speedup but still uses uniform partitioning.

NeurIPS / ICML
FRONTIER 04

Speculative Decoding for Discrete Diffusion Models

DP TP High Novelty
Adapt speculative decoding to discrete diffusion LLMs: a small "draft denoiser" predicts token masks, a large "verifier denoiser" confirms or corrects across multiple GPUs.

Combines two of the hottest topics in ML systems. Could reduce denoising from 10–100 steps down to 3–5.

ICML / ICLR
FRONTIER 05

Carbon-Aware Dynamic Parallelism Reconfiguration

All 5D High Novelty
Build a system that continuously monitors grid carbon intensity and dynamically reconfigures the parallelism strategy to minimize carbon footprint while maintaining throughput SLAs.

Training a single frontier model emits 300+ tonnes of CO₂. No work reconfigures parallelism dimensions based on energy signals. Highly interdisciplinary (systems + sustainability).

NeurIPS / ICLR (Green AI)
FRONTIER 06

Heterogeneous Expert Parallelism: Right-Sizing Experts to GPUs

EP DP
Introduce heterogeneous expert architectures where expert capacity is matched to the GPU it runs on — larger experts on faster GPUs, smaller on slower ones.

Could unlock MoE training for organizations that cannot afford homogeneous H100 clusters. No existing work adapts the model itself to hardware heterogeneity.

NeurIPS / MLSys
FRONTIER 07

Distributed Parallel Reasoning with Speculative Verification

DP PP
Distribute reasoning across multiple GPUs: a planner decomposes problems into sub-tasks, executor GPUs solve in parallel, a verifier speculatively checks solutions.

No work distributes reasoning across a GPU cluster. Reasoning models produce 10K–100K tokens per query — distributed inference is essential.

ICLR / NeurIPS
FRONTIER 08

Zero-Bubble Pipeline Parallelism for Diffusion Model Training

PP DP High Novelty
Design pipeline schedules optimized for diffusion model training, exploiting their fundamentally different computation graph (denoising + score matching loss).

Better PP could reduce training cost of Stable Diffusion 3, FLUX, and Sora-scale models by 20–30%. Zero-bubble PP assumes autoregressive models.

ICML / NeurIPS
FRONTIER 09

Communication-Free Expert Parallelism via Learned Expert Compression

EP TP
Train a lightweight autoencoder to compress token representations during expert dispatch, reducing all-to-all communication by 4–8x without quality loss.

All-to-all communication consumes 40%+ of MoE training time at scale. Could enable MoE training on clusters with slower interconnects (Ethernet instead of InfiniBand).

ICML / NeurIPS
FRONTIER 10

Ring Attention + Speculative Decoding for Million-Token Inference

CP DP
Combine Ring Attention's distributed KV-cache with speculative decoding: draft tokens generated locally while the ring asynchronously prefetches KV-cache segments for verification.

Speculative decoding assumes KV-cache fits on one GPU — impossible for 1M+ contexts. Long-context inference is a major deployment challenge.

NeurIPS / ICLR
FRONTIER 11

Async RLHF with Disaggregated Expert-Parallel Reward Models

EP PP
Use MoE-based reward models with expert parallelism, disaggregated from the policy model's GPU pool, enabling asynchronous reward scoring overlapped with training.

RLHF spends 80% of time on generation. MoE reward models could score different quality aspects (helpfulness, safety, style) via different experts on different GPUs.

ICML / NeurIPS
FRONTIER 12

Scaling Test-Time Training Layers via Context Parallelism

CP SP
Scale TTT layers to million-token contexts using context parallelism with large-chunk gradient aggregation across GPUs.

TTT layers have linear complexity but existing methods achieve less than 5% GPU utilization. Combining TTT with Transformer parallelism strategies is unexplored.

ICLR / NeurIPS
FRONTIER 13

Topology-Aware Expert Placement for Disaggregated MoE Serving

EP PP
Learn a topology-aware expert placement policy from production routing traces, using graph partitioning to minimize cross-node communication while maintaining load balance.

DeepSeek-V3's 256 experts create ~4.4B possible routing combinations. LMSYS reported that expert placement significantly affects throughput on 96 H100s.

OSDI / NeurIPS
FRONTIER 14

Energy-Aware Expert Routing in Mixture-of-Experts Models

EP DP High Novelty
Add an energy cost term to the MoE routing decision: tokens are routed considering both quality (affinity) and the real-time energy cost of the GPU hosting each expert.

No existing work incorporates energy signals into MoE routing. Could reduce inference energy by 20–40% with minimal quality degradation. Aligns with the Green AI movement.

NeurIPS / ICLR (Green AI)
FRONTIER 15

Unified Parallelism Search: AutoML for 5D Parallelism Configuration

All 5D High Novelty
Formulate 5D parallelism configuration as a combinatorial optimization problem solved with a learned cost model + search algorithm across the full TP/DP/PP/CP/EP space.

For a 256-GPU cluster, there are thousands of valid configurations. Could become the "NAS for parallelism" — a foundational tool for the field. Alpa only searched 2D (TP×DP).

NeurIPS / OSDI / MLSys
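To give a feel for frontier 15's search space, here is a toy enumeration of how a fixed GPU budget factorizes across the five parallelism axes. The function name and brute-force approach are purely illustrative; a real search would layer a cost model and pruning on top:

```python
from itertools import product

def valid_configs(n_gpus):
    """Enumerate (tp, dp, pp, cp, ep) tuples whose product equals n_gpus.
    Brute force over divisors; a real system would prune by memory limits,
    interconnect topology, and a learned communication-cost model."""
    divisors = [d for d in range(1, n_gpus + 1) if n_gpus % d == 0]
    return [
        cfg for cfg in product(divisors, repeat=5)
        if cfg[0] * cfg[1] * cfg[2] * cfg[3] * cfg[4] == n_gpus
    ]

configs = valid_configs(256)
assert all(tp * dp * pp * cp * ep == 256 for tp, dp, pp, cp, ep in configs)
print(len(configs))  # 495 ways to split 256 GPUs across five axes
```

Counting only the 5-way device split already yields 495 candidates for 256 GPUs; folding in micro-batch sizes, pipeline schedules, and sharding variants is what pushes the full configuration space into the thousands.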
📖
15 Research Frontiers in 5D Parallelism & Modern AI
Vizuara AI Labs • March 2026

The Research Book

A comprehensive guide covering all 15 frontier research topics. Each topic includes the core idea, why it's impactful, a suggested approach, key references, and target conferences.

15
Research Topics
50+
Paper References
18
Pages
Download Research Book (PDF)

Your Publication Timeline

Target the world's top ML and systems venues. Our bootcamp is structured to align with these deadlines.

Mar 26, 2026

SOSP 2026 (Abstract)

Premier systems conference

Apr 1, 2026

SOSP 2026 (Full Paper)

Full paper deadline

Apr 15, 2026

ASPLOS 2027 (Spring Cycle)

Architecture & systems

~May 7, 2026

NeurIPS 2026

Top ML venue

Apr–May 2026

ICML 2026 Workshops

Workshop papers • Great entry point

Aug–Sept 2026

NeurIPS 2026 Workshops

Workshop papers

Sept 9, 2026

ASPLOS 2027 (Fall Cycle)

Architecture & systems

~Oct 2026

MLSys 2027 / ICLR 2027

ML systems & representation learning

Learn from the Best

Dr. Raj Dandekar

MIT, PhD • Founder, Vizuara AI Labs

Dr. Raj Dandekar brings deep expertise in GPU systems, distributed computing, and AI research. With a background spanning MIT and IIT Madras, he has worked at the intersection of high-performance computing and machine learning.

At Vizuara AI Labs, he leads research bootcamps that have helped students publish at top-tier venues and build careers in GPU engineering and AI research.

MIT PhD • IIT Madras • Vizuara AI Labs • GPU Systems • Distributed Computing

Invest in Your Research Career

95,000

Everything you need to go from GPU fundamentals to a published research paper.

  • 12 lectures (30+ hours) on 5D Parallelism
  • Choose from 15 frontier research topics
  • Full paper writing process — abstract to submission
  • 1:1 asynchronous mentorship on the Vizuara Platform
  • Research Book: 15 Frontiers in 5D Parallelism (PDF)
  • Target: NeurIPS, ICML, ICLR, SOSP, ASPLOS, MLSys
  • Dedicated Vizuara Platform access
  • GPU profiling & experiment design guidance
Enroll Now