DeepSeek Unveils FlashMLA, Kicking Off Open-Source Week

Chinese AI lab DeepSeek has rolled out FlashMLA, the first release of its recently announced "Open Source Week". FlashMLA is a groundbreaking Multi-head Latent Attention (MLA) decoding kernel optimized for variable-length sequences, and it is already running in production.

Designed specifically for NVIDIA's Hopper GPUs, such as the H100 and H800 series, FlashMLA posts impressive performance figures: on H800 SXM5 GPUs running CUDA 12.6, it reaches 3000 GB/s in memory-bound configurations, roughly 83% of theoretical memory bandwidth, and 580 TFLOPS in compute-bound configurations, roughly 91% of peak FLOPs. That translates to 2.3x faster inference for 175B-parameter LLMs compared to previous benchmarks, making it a game-changer for AI developers.
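For developers looking to try it, the repository's README sketches a decode-time call along the following lines. The snippet below is a hedged reconstruction, not canonical usage: the function names follow the README, while the tensor shapes (head dim 576, value dim 512) are assumptions borrowed from the repo's test script.

    # Sketch of FlashMLA's decode-time Python API, based on the repository's
    # README. Shapes (d=576, dv=512, block_size=64) are assumptions taken
    # from the repo's test script, not guaranteed by this article.
    import torch
    from flash_mla import get_mla_metadata, flash_mla_with_kvcache

    b, s_q, h_q, h_kv = 4, 1, 16, 1      # batch, query length, query/KV heads
    d, dv, block_size = 576, 512, 64     # MLA head dims; 64-token cache blocks
    max_seqlen = 1024

    # Per-sequence KV lengths (variable-length decoding: one entry per sequence).
    cache_seqlens = torch.full((b,), max_seqlen, dtype=torch.int32, device="cuda")

    # Paged BF16 KV cache plus the block table mapping sequences to their blocks.
    blocks_per_seq = max_seqlen // block_size
    block_table = torch.arange(b * blocks_per_seq, dtype=torch.int32,
                               device="cuda").view(b, blocks_per_seq)
    kvcache = torch.randn(b * blocks_per_seq, block_size, h_kv, d,
                          dtype=torch.bfloat16, device="cuda")
    q = torch.randn(b, s_q, h_q, d, dtype=torch.bfloat16, device="cuda")

    # Scheduling metadata is computed once per batch, then reused across layers.
    tile_scheduler_metadata, num_splits = get_mla_metadata(
        cache_seqlens, s_q * h_q // h_kv, h_kv)

    o, lse = flash_mla_with_kvcache(
        q, kvcache, block_table, cache_seqlens, dv,
        tile_scheduler_metadata, num_splits, causal=True)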

The kernel supports BF16 (Brain Float 16) precision and features a paged KV cache with a block size of 64, minimizing latency and maximizing throughput. DeepSeek engineered FlashMLA for immediate integration, drawing inspiration from projects such as FlashAttention 2 and 3 and NVIDIA's CUTLASS. Available on GitHub under permissive licensing, it is already racking up attention: within hours it garnered over 3,700 stars and 143 forks, with developers on X hailing its "game-changing optimization potential".
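To make the paged-cache idea concrete, here is a minimal, hypothetical sketch (not FlashMLA's own code) of the indirection a 64-token block size implies: a block table translates a token's logical position into a physical block and an offset within it, so cache blocks can live anywhere in GPU memory. This is the standard rationale for paged KV caches: fixed-size blocks keep allocation cheap and fragmentation low as sequence lengths vary.

    # Illustrative sketch of paged KV cache indexing; all names are hypothetical.
    BLOCK_SIZE = 64

    def locate(block_table: list[int], token_pos: int) -> tuple[int, int]:
        """Map a token's logical position to (physical block id, slot in block)."""
        logical_block = token_pos // BLOCK_SIZE   # which 64-token block
        offset = token_pos % BLOCK_SIZE           # position inside that block
        return block_table[logical_block], offset

    # A 200-token sequence needs 4 blocks; they can be scattered in memory,
    # and the block table supplies the indirection.
    block_table = [17, 3, 42, 8]                  # arbitrary physical block ids
    print(locate(block_table, 130))               # -> (42, 2)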

The company announced the release in a post on X: "Honored to share FlashMLA – our efficient MLA decoding kernel for Hopper GPUs, optimised for variable-length sequences and now in production."

This launch is the first of the five open-source repositories DeepSeek promised last week, with a new release planned each day through the week.
