China has found a workaround to NVIDIA’s downgraded AI accelerators, and it’s an impressive one. DeepSeek’s latest release reportedly delivers roughly eight times the industry-average TFLOPS on the Hopper H800 AI accelerators.
DeepSeek’s innovative FlashMLA is set to revolutionize how China’s AI industry harnesses the potential of NVIDIA’s modified Hopper GPUs.
China is taking significant strides in tech development and isn’t just resting on its laurels. A prime example is DeepSeek, which has found creative ways to get more out of existing hardware. Their recent breakthroughs are turning heads: they claim to have significantly boosted the performance of NVIDIA’s “cut-down” Hopper H800 GPUs. By fine-tuning memory usage and allocating resources more intelligently, DeepSeek has unlocked substantial gains.
In an exciting tweet, DeepSeek announced:
“🚀 Day 1 of #OpenSourceWeek: FlashMLA. Honored to share FlashMLA – our efficient MLA decoding kernel for Hopper GPUs, optimized for variable-length sequences and now in production. ✅ BF16 support ✅ Paged KV cache (block size 64) ⚡ 3000 GB/s memory-bound & 580 TFLOPS…”
DeepSeek is in the middle of its “Open Source Week,” publishing the underlying tech for the public on GitHub. FlashMLA, debuting as part of this event, is a “decoding kernel” built specifically for NVIDIA’s Hopper GPUs. Let’s break down what makes it so noteworthy.
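To get a feel for what a decoding kernel like this does, here is a minimal sketch of a single decode step against a paged KV cache. The function names `get_mla_metadata` and `flash_mla_with_kvcache` follow the usage example published in the FlashMLA repository; the shapes and sizes below are our own illustrative assumptions, and an actual Hopper GPU is required to run it:

```python
# Hedged sketch of one decoding step with FlashMLA, adapted from the usage
# example in the project's repository. Shapes and sizes are illustrative
# assumptions; a Hopper GPU and the flash_mla package are required.
import torch
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

num_layers, batch = 2, 4
h_q, h_kv = 128, 1                 # query heads vs. shared latent KV head
s_q, d, dv = 1, 576, 512           # one new token; K head dim 576, V head dim 512
block_size, max_blocks = 64, 16    # paged KV cache with 64-token blocks

# Each sequence currently holds 512 cached tokens, spread across blocks
# addressed through a per-sequence block table.
cache_seqlens = torch.full((batch,), 512, dtype=torch.int32, device="cuda")
block_table = torch.arange(batch * max_blocks, dtype=torch.int32,
                           device="cuda").view(batch, max_blocks)
kvcache = torch.randn(batch * max_blocks, block_size, h_kv, d,
                      dtype=torch.bfloat16, device="cuda")

# Scheduling metadata is computed once per step and reused across layers.
tile_scheduler_metadata, num_splits = get_mla_metadata(
    cache_seqlens, s_q * h_q // h_kv, h_kv)

for _ in range(num_layers):
    q = torch.randn(batch, s_q, h_q, d, dtype=torch.bfloat16, device="cuda")
    out, lse = flash_mla_with_kvcache(
        q, kvcache, block_table, cache_seqlens, dv,
        tile_scheduler_metadata, num_splits, causal=True)
```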
According to DeepSeek, FlashMLA reaches 580 TFLOPS for BF16 matrix computations on the H800 in compute-bound configurations, far beyond typical decoding throughput, and sustains up to 3000 GB/s of effective memory throughput in memory-bound ones, nearly double the H800’s rated peak. What’s truly remarkable is that these gains come purely from software optimization, not hardware modification.
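Whether a decoding step is limited by compute or by memory follows from simple roofline arithmetic. A back-of-the-envelope sketch, treating the two figures quoted above as given:

```python
# Roofline back-of-the-envelope using the figures quoted in this article.
# Both numbers are taken as assumptions here, not independently measured.
PEAK_TFLOPS = 580.0    # claimed BF16 compute roof, in TFLOPS
PEAK_BW_GBS = 3000.0   # claimed memory roof, in GB/s

# Arithmetic intensity (FLOPs per byte) at which the two roofs cross.
ridge_point = (PEAK_TFLOPS * 1e12) / (PEAK_BW_GBS * 1e9)
print(f"ridge point: {ridge_point:.0f} FLOPs/byte")  # ~193 FLOPs/byte

# Single-token decoding touches each KV-cache byte roughly once per step,
# so its arithmetic intensity sits far below the ridge point: the workload
# is memory-bound, which is why raising effective bandwidth pays off directly.
```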
Another widely shared post on social media highlighted:
“This is crazy.
-> Blazing fast: 580 TFLOPS on H800, ~8x industry avg (73.5 TFLOPS).
-> Memory wizardry: Hits 3000 GB/s, surpassing H800’s 1681 GB/s peak.”
DeepSeek employs “low-rank key-value compression,” which squeezes the attention KV cache into a compact latent form, speeding up processing and reportedly slashing memory use by up to 60%. Its dynamic block-based paging allocates cache memory in fixed-size blocks as demand grows, improving the handling of variable-length sequences and, by extension, overall performance. A minimal sketch of both ideas follows.
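The sketch below illustrates the two techniques in a toy setting: keys and values are compressed to a shared low-rank latent vector per token, and that latent cache is stored in fixed 64-token blocks addressed through a block table. Everything here (dimensions, class and method names) is our own assumption for illustration, not DeepSeek’s actual implementation:

```python
# Minimal sketch of low-rank KV compression plus block-based paging.
# Toy dimensions throughout; this shows the general technique, not
# DeepSeek's production kernel.
import torch

d_model, d_latent, block_size = 1024, 128, 64   # 8x smaller latent cache

# Low-rank compression: instead of caching full K and V (d_model each),
# cache one shared latent vector per token and expand it on demand.
down_proj = torch.nn.Linear(d_model, d_latent, bias=False)   # compress
up_k = torch.nn.Linear(d_latent, d_model, bias=False)        # recover K
up_v = torch.nn.Linear(d_latent, d_model, bias=False)        # recover V

class PagedLatentCache:
    """Latent KV cache stored in fixed-size blocks via a block table."""
    def __init__(self, num_blocks: int):
        self.blocks = torch.zeros(num_blocks, block_size, d_latent)
        self.free = list(range(num_blocks))
        self.tables: dict[int, list[int]] = {}   # seq_id -> block ids
        self.lengths: dict[int, int] = {}

    def append(self, seq_id: int, hidden: torch.Tensor) -> None:
        """Compress one token's hidden state and store it for seq_id."""
        n = self.lengths.get(seq_id, 0)
        if n % block_size == 0:                  # block full: grab a new one
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        blk = self.tables[seq_id][n // block_size]
        self.blocks[blk, n % block_size] = down_proj(hidden)
        self.lengths[seq_id] = n + 1

    def gather_kv(self, seq_id: int):
        """Re-expand the stored latents into full K and V for attention."""
        n = self.lengths[seq_id]
        latents = torch.cat([self.blocks[b] for b in self.tables[seq_id]])[:n]
        return up_k(latents), up_v(latents)

cache = PagedLatentCache(num_blocks=32)
for t in range(100):                             # simulate decoding 100 tokens
    cache.append(seq_id=0, hidden=torch.randn(d_model))
k, v = cache.gather_kv(seq_id=0)                 # shapes: (100, 1024) each
```

Because blocks are allocated only as a sequence grows, short and long sequences share the same pool without padding waste, which is the core of the paging idea.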
This development illustrates that AI computing is not just about hardware; software can move the needle just as far. Although FlashMLA’s published numbers focus on the cut-down H800, seeing it applied to the full-strength H100 in future releases is something to look forward to.