How to use FlashMLA

  • 1. Download the code from the FlashMLA GitHub repository.

  • 2. Install the package: python setup.py install

  • 3. Run the benchmark: python tests/test_flash_mla.py

  • Usage method

    from flash_mla import get_mla_metadata, flash_mla_with_kvcache

    tile_scheduler_metadata, num_splits = get_mla_metadata(
        cache_seqlens, s_q * h_q // h_kv, h_kv
    )

    for i in range(num_layers):
        ...
        o_i, lse_i = flash_mla_with_kvcache(
            q_i, kvcache_i, block_table, cache_seqlens, dv,
            tile_scheduler_metadata, num_splits, causal=True,
        )
        ...
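    The call above assumes the caller has already laid out its inputs in the shapes FlashMLA expects. As a rough, non-authoritative sketch, the setup below shows one way those inputs might be constructed; the batch size, head counts, head dimensions, 64-token block size, and bfloat16 dtype are illustrative assumptions, not values taken from this article.

    import torch

    # Illustrative sizes (assumed): one query token per step, 128 query heads,
    # 1 latent KV head, 576-dim keys/queries and 512-dim values.
    b, s_q, h_q, h_kv = 4, 1, 128, 1
    d, dv = 576, 512
    block_size, max_seqlen = 64, 1024

    # Per-request KV-cache lengths: variable-length batching, no padding needed.
    cache_seqlens = torch.tensor([333, 512, 1024, 767], dtype=torch.int32, device="cuda")

    # Paged KV cache: physical blocks of block_size tokens, mapped to each
    # request through a per-request block_table of block indices.
    num_blocks = b * max_seqlen // block_size
    kvcache = torch.randn(num_blocks, block_size, h_kv, d, dtype=torch.bfloat16, device="cuda")
    block_table = torch.arange(num_blocks, dtype=torch.int32, device="cuda").view(b, -1)

    # Query tensor for the current decode step.
    q = torch.randn(b, s_q, h_q, d, dtype=torch.bfloat16, device="cuda")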

    Features of FlashMLA

    1. Hopper Architecture Design

    FlashMLA is optimised for Hopper architecture GPUs, such as the H100 and H200, to provide higher computational efficiency and performance.
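    As a quick environment check (a usage sketch with PyTorch, not part of the FlashMLA API), one can verify that the visible GPU actually reports Hopper's compute capability before running the kernel:

    import torch

    # Hopper GPUs (H100, H200, H800) report CUDA compute capability 9.x.
    major, minor = torch.cuda.get_device_capability()
    if major != 9:
        raise RuntimeError(
            f"FlashMLA targets Hopper (sm_90) GPUs; found sm_{major}{minor} "
            f"on {torch.cuda.get_device_name()}"
        )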

    2. High-Performance Computing

    Using a paged KV cache, FlashMLA achieves up to 580 TFLOPS of compute throughput in compute-bound configurations and 3000 GB/s of memory bandwidth in memory-bound configurations on H800 GPUs, far exceeding traditional methods.
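    To see why decoding is memory-bound, a back-of-envelope calculation helps: each attention call must stream that layer's entire cached KV for every request through the GPU. The sketch below uses illustrative, assumed sequence lengths and head dimensions to estimate that traffic and the resulting lower bound on the call's run time at 3000 GB/s.

    import torch

    # Illustrative, assumed shapes: 1 latent KV head, 576-dim entries, bf16 (2 bytes).
    h_kv, d = 1, 576
    cache_seqlens = torch.tensor([4096, 8192, 2048, 6144])

    # Bytes of KV cache read by one attention call (one layer) across the batch.
    bytes_per_call = int(cache_seqlens.sum()) * h_kv * d * 2
    print(f"KV bytes read per call: {bytes_per_call / 1e6:.1f} MB")

    # At ~3000 GB/s, this traffic alone bounds the call's run time from below.
    print(f"memory-bound lower bound: {bytes_per_call / 3000e9 * 1e6:.1f} microseconds")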

    3. End-to-end latency optimisation

    By fusing the MLA decoding path into a single GPU kernel, FlashMLA reduces the number of CPU-GPU data transfers; in inference of 100-billion-parameter models, end-to-end latency is measured to be reduced by 40%.
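    Latency claims like this are straightforward to check on one's own workload. The sketch below is a generic micro-benchmark timed with CUDA events; the decode_step callable is a hypothetical placeholder (for example, a loop over layers that calls flash_mla_with_kvcache), and this is not the repository's own benchmark.

    import torch

    def time_decode_step(decode_step, iters=100):
        """Return the average latency of decode_step() in milliseconds."""
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        decode_step()                 # warm-up to exclude one-off setup costs
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            decode_step()
        end.record()
        torch.cuda.synchronize()      # wait for all timed kernels to finish
        return start.elapsed_time(end) / iters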

    4. NLP tasks

    FlashMLA is suitable for natural language processing tasks that require efficient decoding, such as machine translation, text generation, sentiment analysis, and question-answering systems. It is optimized for variable-length sequences and can significantly improve inference efficiency.

    5. Large Language Model (LLM) Inference

    FlashMLA is designed for large language model inference scenarios. By optimising the KV cache and parallel decoding mechanisms, it reduces hardware resource requirements and improves inference speed.

    6. Real-time interactive applications

    In applications that require fast responses, such as conversational AI, real-time translation, and content recommendation, FlashMLA can provide low-latency inference capabilities and improve the user experience.

    Frequently Asked Questions about DeepSeek FlashMLA

    What is FlashMLA and how does it differ from traditional AI models?

    On February 24, 2025, DeepSeek released FlashMLA, an efficient MLA decoding kernel designed specifically for Hopper GPUs. Unlike general-purpose attention kernels, it targets multi-head latent attention (MLA) decoding with a paged KV cache. The release marks DeepSeek's commitment to pushing the boundaries of AI performance amid surging demand for faster and more scalable AI models in industries such as healthcare, finance, and automation.