
Why treat LLM inference as a batch kernel when a dataflow compiler can pipe tiles through on-chip FIFOs and stream converters? StreamTensor is a compiler that lowers PyTorch LLM graphs (GPT-2, Llama, Qwen, Gemma) onto streaming dataflow accelerators. The system introduces an iterative tensor ("iTensor") type that encodes the tile/sequence order of each stream, which enables provably correct inter-kernel streaming and automatic insertion and sizing of DMA engines, FIFOs, and layout converters. On LLM decoding workloads, the research team reports latency as low as 0.64× of a GPU baseline and up to 1.99× higher energy efficiency.

https://arxiv.org/pdf/2509.13694

What does StreamTensor do?

StreamTensor compiles PyTorch graphs into stream-oriented dataflow designs in which intermediate tiles are forwarded to downstream kernels over on-chip FIFOs; with on-chip streaming and kernel fusion, off-chip DRAM round trips are largely avoided and DMAs are inserted only where needed. The compiler's central abstraction, the iterative tensor (iTensor), records iteration order, tiling, and layout, which makes inter-kernel stream compatibility explicit and lets layout converters be generated only when required (a minimal sketch of the idea follows below). The framework also searches hierarchically over tiling, fusion, and resource allocation, and uses a linear program to size FIFOs so as to avoid stalls and deadlocks while minimizing on-chip memory.
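To make the abstraction concrete, here is a minimal, hypothetical sketch of what an iTensor-like type might record and how a compiler could use it to decide between a plain FIFO connection and a converter. The class, field, and function names are illustrative assumptions, not StreamTensor's actual IR.

```python
# Illustrative only: a toy model of an "iterative tensor" carrying iteration
# order, tiling, and layout metadata, as described in the paragraph above.
from dataclasses import dataclass

@dataclass(frozen=True)
class ITensor:
    shape: tuple        # full tensor shape, e.g. (seq_len, hidden)
    tile: tuple         # tile shape streamed per transaction
    loop_order: tuple   # order in which tile indices are iterated
    layout: str         # element layout on the stream, e.g. "row_major"

def stream_compatible(producer: ITensor, consumer: ITensor) -> bool:
    """Producer tiles can be forwarded through an on-chip FIFO unchanged only
    if tiling, iteration order, and layout all agree; otherwise the compiler
    must insert a layout/format converter (or fall back to DRAM)."""
    return (producer.tile == consumer.tile
            and producer.loop_order == consumer.loop_order
            and producer.layout == consumer.layout)

# Example: a matmul emits (1, 64) row-major tiles and the next kernel expects
# the same, so a FIFO suffices and no DRAM round trip is needed.
a = ITensor(shape=(128, 768), tile=(1, 64), loop_order=("i", "j"), layout="row_major")
b = ITensor(shape=(128, 768), tile=(1, 64), loop_order=("i", "j"), layout="row_major")
print(stream_compatible(a, b))  # True -> connect via FIFO; False -> insert converter
```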


What is actually new?

  • Hierarchical DSE. The compiler explores three design spaces – (i) tiling, unrolling, vectorization, and loop permutation at the Linalg level, (ii) fusion under memory/resource constraints, and (iii) resource allocation and stream widths – with the goal of sustaining throughput under bandwidth limits.
  • End-to-end PyTorch → device flow. Models enter through Torch-MLIR, are lowered to MLIR Linalg, and then to a dataflow IR whose nodes become hardware kernels with explicit streams and host/runtime glue – no manual RTL authoring.
  • Iterative tensor (iTensor) type system. First-class tensor types capture iteration order, tiling, and affine layout maps. This makes stream order explicit, allows kernels to be fused safely, and lets minimal buffer/format converters be synthesized only when producer and consumer disagree.
  • Formal FIFO sizing. Inter-kernel buffers are sized with a linear-programming formulation that avoids stalls and deadlocks while minimizing on-chip memory (BRAM/URAM) usage; a toy version of such a formulation is sketched after this list.
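Below is a small, hedged sketch of what an LP-based FIFO-sizing step could look like. It is not the paper's exact formulation: the graph, latencies, and the "short reconvergent path must absorb the latency gap" constraint are illustrative assumptions, solved here with scipy.optimize.linprog.

```python
# Toy FIFO-sizing LP, in the spirit of (but not identical to) StreamTensor's
# formulation: minimize total on-chip buffer bytes subject to constraints that
# prevent stalls/deadlocks on a small fork-join dataflow graph.
from scipy.optimize import linprog

# Edges of a toy graph: A -> B -> D and A -> C -> D (fork at A, join at D).
edges = ["A->B", "B->D", "A->C", "C->D"]
width_bytes = {e: 2 for e in edges}          # stream element width (assumed)

lat_long_path = 512   # hypothetical fork-to-join latency via B (cycles)
lat_short_path = 64   # hypothetical fork-to-join latency via C (cycles)

# Objective: minimize sum(depth_e * width_e) over all FIFOs.
c = [width_bytes[e] for e in edges]

# Deadlock/stall-avoidance constraint: buffering on the short path must cover
# the latency gap: depth(A->C) + depth(C->D) >= lat_long_path - lat_short_path.
# linprog expects A_ub @ x <= b_ub, so the >= constraint is negated.
A_ub = [[0, 0, -1, -1]]
b_ub = [-(lat_long_path - lat_short_path)]

# Every FIFO needs at least depth 2 (double buffering).
bounds = [(2, None)] * len(edges)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
for e, d in zip(edges, res.x):
    print(f"{e}: depth = {d:.0f} elements")
```

The real compiler would derive the rate and latency parameters from the scheduled kernels and add resource constraints (BRAM/URAM capacity), but the shape of the problem – linear cost, linear no-stall constraints – is the same.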

Results

Latency: as low as 0.76× vs. prior FPGA LLM accelerators and 0.64× vs. a GPU baseline on GPT-2. Energy efficiency: up to 1.99× vs. an A100 on emerging LLMs (model dependent). Platform context: AMD Alveo U55C (16 GB HBM2 at 460 GB/s, PCIe Gen3×16 or dual Gen4×8, 2× QSFP28).


The useful contribution here is a PyTorch → Torch-MLIR → dataflow compiler that emits streaming kernels and host/runtime code for the AMD Alveo U55C, with the iterative tensor (iTensor) type and linear-programming-based FIFO sizing enabling safe inter-kernel streaming instead of DRAM round trips. On LLM decoding, the research team's benchmarks for GPT-2, Llama, Qwen, and Gemma show geometric-mean latency as low as 0.64× of a GPU baseline and energy efficiency up to 1.99×; the scope is limited to decoding workloads. The hardware context is explicit: the Alveo U55C provides 16 GB HBM2 at 460 GB/s, with dual QSFP28 and PCIe Gen3×16 or dual Gen4×8, which matches a streaming dataflow design.




Michal Sutter is a data science professional with a master’s degree in data science from the University of Padua. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels in transforming complex data sets into actionable insights.

