Applications Open

GPU Engineers
4-Month Research Bootcamp

Master 5D Parallelism. Publish at NeurIPS, ICML, ICLR. Go from GPU fundamentals to a conference-ready research paper with 1:1 asynchronous mentorship on the Vizuara Platform.

12
Lectures
30+
Hours of Content
15
Research Frontiers
4
Months Total
1:1
Mentorship

Master the 5 Dimensions of Modern Parallelism

Every frontier AI model — GPT-5, DeepSeek-V3, Llama 4 — is trained using 5D Parallelism. This bootcamp teaches you to build, optimize, and research at the intersection of parallelism and modern AI.

🧩

TP

Tensor Parallelism

Split weight matrices across GPUs for intra-layer parallelism

📊

DP

Data Parallelism

Replicate model, split data batches across workers

🔄

PP

Pipeline Parallelism

Split model layers across stages for inter-layer parallelism

📝

CP/SP

Context / Sequence Parallelism

Split along sequence length for long-context processing

🤖

EP

Expert Parallelism

Distribute MoE experts across GPUs for sparse models
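Each of these dimensions is mechanical rather than mysterious. As a minimal illustration of the first one, tensor parallelism's column split can be simulated in plain NumPy, with array shards standing in for GPUs (all names here are illustrative, not from any framework):

```python
import numpy as np

# Toy linear layer y = x @ W, with W column-split across 2 simulated "GPUs".
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))   # activations, replicated on both "GPUs"
W = rng.standard_normal((8, 6))   # full weight matrix

# Column parallelism: each "GPU" holds half of W's output columns.
W0, W1 = W[:, :3], W[:, 3:]

# Each "GPU" computes its partial output independently, with no communication.
y0 = x @ W0
y1 = x @ W1

# An all-gather along the column axis reconstructs the full output.
y = np.concatenate([y0, y1], axis=1)

assert np.allclose(y, x @ W)      # matches the unsharded computation
```

The same split-compute-gather pattern, applied along different axes (batch, layers, sequence, experts), is what the other four dimensions generalize.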

Lecture Series: 30+ Hours of Teaching

12 lectures over 1.5 months, taught by Dr. Raj Dandekar, covering GPU programming fundamentals through cutting-edge parallelism strategies.

01

GPU Architecture & CUDA Fundamentals

Warps, memory hierarchy, SM occupancy, profiling with Nsight

02

Triton & Custom GPU Kernels

Writing high-performance GPU kernels with OpenAI Triton

03

NCCL & Distributed Communication

All-reduce, all-gather, ring topologies, NCCL internals

04

Data Parallelism & ZeRO

DDP, FSDP, ZeRO stages 1–3, gradient compression

05

Tensor Parallelism

Megatron-LM style column/row splitting, async TP

06

Pipeline Parallelism

GPipe, 1F1B, zero-bubble schedules, interleaving

07

Context & Sequence Parallelism

Ring Attention, Striped Attention, Ulysses for long contexts

08

Expert Parallelism & MoE

All-to-all routing, load balancing, DeepSeek-V3 architecture

09

5D Parallelism Integration

Combining TP+DP+PP+CP+EP, Megatron-LM 5D config

10

Diffusion Language Models

MDLM, Mercury, SEDD — parallel token generation at scale

11

Speculative Decoding & Inference

Draft/verify paradigm, distributed inference, KV-cache parallelism

12

Research Methods & Paper Writing

Experiment design, benchmarking, writing for NeurIPS/ICML
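Behind the NCCL collectives covered in lecture 03 sits a simple ring algorithm. A plain-Python simulation of ring all-reduce, with a list of arrays standing in for workers and no real communication, might look like this (the function and variable names are our own, not NCCL's):

```python
import numpy as np

def ring_allreduce(shards):
    """Simulate a NCCL-style ring all-reduce over `shards` (one array per worker).
    Phase 1 (reduce-scatter): after n-1 steps each worker holds one fully
    reduced chunk. Phase 2 (all-gather): n-1 more steps circulate those chunks."""
    n = len(shards)
    chunks = [np.array_split(s.astype(float), n) for s in shards]  # [worker][chunk]

    # Reduce-scatter: at each step, worker r sends chunk (r - step) to worker r+1,
    # which accumulates it. After n-1 steps, worker r owns reduced chunk (r+1) % n.
    for step in range(n - 1):
        for r in range(n):
            c = (r - step) % n
            chunks[(r + 1) % n][c] = chunks[(r + 1) % n][c] + chunks[r][c]

    # All-gather: circulate the fully reduced chunks around the ring.
    for step in range(n - 1):
        for r in range(n):
            c = (r + 1 - step) % n
            chunks[(r + 1) % n][c] = chunks[r][c]

    return [np.concatenate(c) for c in chunks]

# Usage: three "workers", each holding a local gradient vector.
grads = [np.arange(6.0) + r for r in range(3)]
out = ring_allreduce(grads)
assert all(np.allclose(o, sum(grads)) for o in out)
```

Each worker sends and receives only 1/n of the data per step, which is why the ring algorithm's bandwidth cost stays nearly constant as the worker count grows.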

Your Research Journey

Over 2.5 months, go from selecting a frontier research topic to submitting a conference-ready paper — with dedicated 1:1 mentorship at every step.

01

Research Topic Selection

Choose from 15 frontier research directions at the intersection of 5D parallelism and modern AI breakthroughs.

  • Each topic is a standalone 3–6 month research project
  • Topics marked High Novelty have no published prior work
  • Matched to your background and interests
02

Full Paper Writing Process

Guided through every stage of producing a publication-quality research paper.

  • Literature survey and related work positioning
  • Abstract, methodology, experiment design
  • Results analysis and paper formatting
  • Camera-ready preparation and submission
03

1:1 Asynchronous Mentorship

Dedicated asynchronous mentorship through the Vizuara Platform — no Zoom calls, no scheduling conflicts. Get detailed feedback on your own time.

  • Asynchronous feedback on the dedicated Vizuara Platform
  • Detailed reviewer-style comments on paper drafts
  • Guidance on experiment design and GPU profiling
  • Work at your own pace, get responses within 24–48 hours

5D Parallelism Meets Modern AI

Each topic combines a modern AI breakthrough with a parallelism dimension to yield a novel research direction suitable for top venues.

FRONTIER 01

Parallelism-Aware Training of Diffusion Language Models

TP DP High Novelty
Design tensor- and data-parallelism strategies that exploit the parallel token-generation structure of DLMs during both training and inference.

No published work co-designs TP/DP strategies for diffusion-based language model training. Mercury achieves 10x faster inference on H100s, but training parallelism is unexplored.

NeurIPS / ICML
FRONTIER 02

MoE Meets Diffusion LLMs: Expert Routing for Parallel Token Denoising

EP TP High Novelty
Design MoE architectures for DLMs where expert selection is conditioned on both token identity and denoising timestep, enabling timestep-aware expert specialization.

No published work combines MoE with diffusion-based text generation. Could yield the first trillion-parameter DLM trainable on current hardware.

ICLR / NeurIPS
FRONTIER 03

Adaptive Context Parallelism with Dynamic Sparse Attention

CP SP High Novelty
Dynamically adjust CP degree per layer based on runtime attention sparsity patterns, allocating more GPUs to dense-attention layers and fewer to sparse ones.

Adaptive CP could save 30–50% communication for 1M+ token sequences. RingX (SC25) achieves 3.4x speedup but still uses uniform partitioning.

NeurIPS / ICML
FRONTIER 04

Speculative Decoding for Discrete Diffusion Models

DP TP High Novelty
Adapt speculative decoding to discrete diffusion LLMs: a small "draft denoiser" predicts token masks, a large "verifier denoiser" confirms or corrects across multiple GPUs.

Combines two of the hottest topics in ML systems. Could reduce denoising from 10–100 steps down to 3–5.

ICML / ICLR
FRONTIER 05

Carbon-Aware Dynamic Parallelism Reconfiguration

All 5D High Novelty
Build a system that continuously monitors grid carbon intensity and dynamically reconfigures the parallelism strategy to minimize carbon footprint while maintaining throughput SLAs.

Training a single frontier model emits 300+ tonnes of CO₂. No work reconfigures parallelism dimensions based on energy signals. Highly interdisciplinary (systems + sustainability).

NeurIPS / ICLR (Green AI)
FRONTIER 06

Heterogeneous Expert Parallelism: Right-Sizing Experts to GPUs

EP DP
Introduce heterogeneous expert architectures where expert capacity is matched to the GPU it runs on — larger experts on faster GPUs, smaller on slower ones.

Could unlock MoE training for organizations that cannot afford homogeneous H100 clusters. No existing work adapts the model itself to hardware heterogeneity.

NeurIPS / MLSys
FRONTIER 07

Distributed Parallel Reasoning with Speculative Verification

DP PP
Distribute reasoning across multiple GPUs: a planner decomposes problems into sub-tasks, executor GPUs solve in parallel, a verifier speculatively checks solutions.

No work distributes reasoning across a GPU cluster. Reasoning models produce 10K–100K tokens per query — distributed inference is essential.

ICLR / NeurIPS
FRONTIER 08

Zero-Bubble Pipeline Parallelism for Diffusion Model Training

PP DP High Novelty
Design pipeline schedules optimized for diffusion model training, exploiting their fundamentally different computation graph (denoising + score matching loss).

Better PP could reduce training cost of Stable Diffusion 3, FLUX, and Sora-scale models by 20–30%. Zero-bubble PP assumes autoregressive models.

ICML / NeurIPS
FRONTIER 09

Communication-Free Expert Parallelism via Learned Expert Compression

EP TP
Train a lightweight autoencoder to compress token representations during expert dispatch, reducing all-to-all communication by 4–8x without quality loss.

All-to-all communication consumes 40%+ of MoE training time at scale. Could enable MoE training on clusters with slower interconnects (Ethernet instead of InfiniBand).

ICML / NeurIPS
FRONTIER 10

Ring Attention + Speculative Decoding for Million-Token Inference

CP DP
Combine Ring Attention's distributed KV-cache with speculative decoding: draft tokens generated locally while the ring asynchronously prefetches KV-cache segments for verification.

Speculative decoding assumes KV-cache fits on one GPU — impossible for 1M+ contexts. Long-context inference is a major deployment challenge.

NeurIPS / ICLR
FRONTIER 11

Async RLHF with Disaggregated Expert-Parallel Reward Models

EP PP
Use MoE-based reward models with expert parallelism, disaggregated from the policy model's GPU pool, enabling asynchronous reward scoring overlapped with training.

RLHF spends 80% of time on generation. MoE reward models could score different quality aspects (helpfulness, safety, style) via different experts on different GPUs.

ICML / NeurIPS
FRONTIER 12

Scaling Test-Time Training Layers via Context Parallelism

CP SP
Scale TTT layers to million-token contexts using context parallelism with large-chunk gradient aggregation across GPUs.

TTT layers have linear complexity but existing methods achieve less than 5% GPU utilization. Combining TTT with Transformer parallelism strategies is unexplored.

ICLR / NeurIPS
FRONTIER 13

Topology-Aware Expert Placement for Disaggregated MoE Serving

EP PP
Learn a topology-aware expert placement policy from production routing traces, using graph partitioning to minimize cross-node communication while maintaining load balance.

DeepSeek-V3's 256 experts create ~4.4B possible routing combinations. LMSYS reported that expert placement significantly affects throughput on 96 H100s.

OSDI / NeurIPS
FRONTIER 14

Energy-Aware Expert Routing in Mixture-of-Experts Models

EP DP High Novelty
Add an energy cost term to the MoE routing decision: tokens are routed considering both quality (affinity) and the real-time energy cost of the GPU hosting each expert.

No existing work incorporates energy signals into MoE routing. Could reduce inference energy by 20–40% with minimal quality degradation. Aligns with the Green AI movement.

NeurIPS / ICLR (Green AI)
FRONTIER 15

Unified Parallelism Search: AutoML for 5D Parallelism Configuration

All 5D High Novelty
Formulate 5D parallelism configuration as a combinatorial optimization problem solved with a learned cost model + search algorithm across the full TP/DP/PP/CP/EP space.

For a 256-GPU cluster, there are thousands of valid configurations. Could become the "NAS for parallelism" — a foundational tool for the field. Alpa only searched 2D (TP×DP).

NeurIPS / OSDI / MLSys
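To give a feel for frontier 15's search space, here is a toy enumeration of how a fixed GPU budget factorizes across the five parallelism axes. The function name and brute-force approach are purely illustrative; a real search would layer a cost model and pruning on top:

```python
from itertools import product

def valid_configs(n_gpus):
    """Enumerate (tp, dp, pp, cp, ep) tuples whose product equals n_gpus.
    Brute force over divisors; a real system would prune by memory limits,
    interconnect topology, and a learned communication-cost model."""
    divisors = [d for d in range(1, n_gpus + 1) if n_gpus % d == 0]
    return [
        cfg for cfg in product(divisors, repeat=5)
        if cfg[0] * cfg[1] * cfg[2] * cfg[3] * cfg[4] == n_gpus
    ]

configs = valid_configs(256)
assert all(tp * dp * pp * cp * ep == 256 for tp, dp, pp, cp, ep in configs)
print(len(configs))  # 495 ways to split 256 GPUs across five axes
```

Counting only the 5-way device split already yields 495 candidates for 256 GPUs; folding in micro-batch sizes, pipeline schedules, and sharding variants is what pushes the full configuration space into the thousands.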
📖
15 Research Frontiers in 5D Parallelism & Modern AI
Vizuara AI Labs • March 2026

The Research Book

A comprehensive guide covering all 15 frontier research topics. Each topic includes the core idea, why it's impactful, a suggested approach, key references, and target conferences.

15
Research Topics
50+
Paper References
18
Pages
Download Research Book (PDF)

Your Publication Timeline

Target the world's top ML and systems venues. Our bootcamp is structured to align with these deadlines.

Mar 26, 2026

SOSP 2026 (Abstract)

Premier systems conference

Apr 1, 2026

SOSP 2026 (Full Paper)

Full paper deadline

Apr 15, 2026

ASPLOS 2027 (Spring Cycle)

Architecture & systems

~May 7, 2026

NeurIPS 2026

Top ML venue

Apr–May 2026

ICML 2026 Workshops

Workshop papers • Great entry point

Aug–Sept 2026

NeurIPS 2026 Workshops

Workshop papers

Sept 9, 2026

ASPLOS 2027 (Fall Cycle)

Architecture & systems

~Oct 2026

MLSys 2027 / ICLR 2027

ML systems & representation learning

Learn from the Best

Dr. Raj Dandekar

MIT, PhD • Founder, Vizuara AI Labs

Dr. Raj Dandekar brings deep expertise in GPU systems, distributed computing, and AI research. With a background spanning MIT and IIT Madras, he has worked at the intersection of high-performance computing and machine learning.

At Vizuara AI Labs, he leads research bootcamps that have helped students publish at top-tier venues and build careers in GPU engineering and AI research.

MIT PhD • IIT Madras • Vizuara AI Labs • GPU Systems • Distributed Computing

Invest in Your Research Career

95,000

Everything you need to go from GPU fundamentals to a published research paper.

  • 12 lectures (30+ hours) on 5D Parallelism
  • Choose from 15 frontier research topics
  • Full paper writing process — abstract to submission
  • 1:1 asynchronous mentorship on the Vizuara Platform
  • Research Book: 15 Frontiers in 5D Parallelism (PDF)
  • Target: NeurIPS, ICML, ICLR, SOSP, ASPLOS, MLSys
  • Dedicated Vizuara Platform access
  • GPU profiling & experiment design guidance
Enroll Now