FlashMLA AI
FlashMLA is an efficient MLA decoding kernel optimized for NVIDIA Hopper GPUs, released by DeepSeek.
FlashMLA is an efficient Multi-head Latent Attention (MLA) decoding kernel optimized for the Hopper GPU architecture. It is deeply optimized for serving workloads with variable-length sequences, significantly improving decoding performance in large-model inference services.
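For concreteness, a minimal sketch of a single decode step is shown below, assuming the `flash_mla` package's documented entry points (`get_mla_metadata` and `flash_mla_with_kvcache`). All tensor shapes and values here are illustrative assumptions, not a verified configuration; the exact API may differ across versions of the repository.

```python
import torch
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

# Hypothetical decode-step shapes (illustrative, not from the FlashMLA docs):
b, s_q = 16, 1                 # batch size; one query token per step (decoding)
h_q, h_kv = 128, 1             # query heads; MLA uses a single latent KV head
d, dv = 576, 512               # QK head dim (incl. RoPE part) and value head dim
block_size, max_blocks = 64, 64  # paged KV-cache layout

q = torch.randn(b, s_q, h_q, d, dtype=torch.bfloat16, device="cuda")
kvcache = torch.randn(b * max_blocks, block_size, h_kv, d,
                      dtype=torch.bfloat16, device="cuda")
block_table = torch.arange(b * max_blocks, dtype=torch.int32,
                           device="cuda").view(b, max_blocks)
cache_seqlens = torch.full((b,), 1024, dtype=torch.int32, device="cuda")

# Precompute tile-scheduling metadata for the variable-length batch,
# then run the MLA decoding kernel against the paged KV cache.
tile_scheduler_metadata, num_splits = get_mla_metadata(
    cache_seqlens, s_q * h_q // h_kv, h_kv)
o, lse = flash_mla_with_kvcache(
    q, kvcache, block_table, cache_seqlens, dv,
    tile_scheduler_metadata, num_splits, causal=True)
```

The metadata step is what handles variable-length batches: sequences of different cached lengths are split into tiles so the kernel keeps the GPU busy regardless of how uneven the batch is.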