950 Basic Matmul TLA Example Readme

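catlass: This project is the operator template library for CANN, providing high-performance matrix multiplication and related fused operator template samples for NPUs. Project address: https://gitcode.com/cann/catlass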

Note: The community package does not currently support 950 capabilities. Stay tuned for a future supported version.

Code Organization

    ├── 43_ascend950_basic_matmul
    │   ├── CMakeLists.txt         # CMake build file
    │   ├── README.md
    │   └── basic_matmul_tla.cpp   # Main file

Usage Example

  • After obtaining the code, build the corresponding operator executable. See Template Library Quick Start. This case is a 950 operator, so `-DCATLASS_ARCH=3510` must be added during the build.
  • Run the operator.
    # Build the specified case
    bash scripts/build.sh 43_ascend950_basic_matmul -DCATLASS_ARCH=3510
    cd output/bin
    # Executable file name | matrix m axis | n axis | k axis | Device ID
    # Device ID is optional and defaults to 0
    ./43_ascend950_basic_matmul 256 512 1024 0

The execution result is as follows, indicating that the precision comparison succeeds.

Compare success.

Usage Notes

The DispatchPolicy MmadPingpong used by BasicMatmul by default supports the following template parameters:

| Template Parameter | Default Value | Parameter Description |
| --- | --- | --- |
| `ArchTag` | None | Specifies the architecture model |
| `enableUnitFlag` | `false` | Specifies whether to enable UnitFlag. It must be set to `false` when L0C multi-buffering is enabled |
| `useHF32` | `false` | Specifies whether to enable HF32. Only the float type is supported |
| `l0CStages` | 1 | Specifies the number of L0C buffers. Set it to 2 to enable L0C double buffering |
| `enableL1Resident` | `false` | Specifies whether to enable L1 residency |
| `l1AStages` | 2 | Number of buffers for loading matrix A on L1 |
| `l1BStages` | 2 | Number of buffers for loading matrix B on L1 |
| `l0AStages` | 2 | Number of buffers for loading matrix A on L0 |
| `l0BStages` | 2 | Number of buffers for loading matrix B on L0 |
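
As a rough illustration of how these parameters and their defaults fit together, the following sketch mirrors the table in a standalone template. `PingpongConfigSketch` and `ExampleArch` are hypothetical names invented for this example; they are not the catlass `MmadPingpong` type, and the parameter order is only assumed to follow the table. The one rule encoded is the one stated above: `enableUnitFlag` must be `false` when more than one L0C buffer is used.

```cpp
#include <cstdint>

// Hypothetical placeholder for an architecture tag (not a catlass type).
struct ExampleArch {};

// Hypothetical mirror of the parameter table; defaults follow the table.
template <class ArchTag_,
          bool ENABLE_UNIT_FLAG = false,
          bool USE_HF32 = false,
          uint32_t L0C_STAGES = 1,
          bool ENABLE_L1_RESIDENT = false,
          uint32_t L1A_STAGES = 2,
          uint32_t L1B_STAGES = 2,
          uint32_t L0A_STAGES = 2,
          uint32_t L0B_STAGES = 2>
struct PingpongConfigSketch {
    using ArchTag = ArchTag_;
    static constexpr bool enableUnitFlag = ENABLE_UNIT_FLAG;
    static constexpr bool useHF32 = USE_HF32;
    static constexpr uint32_t l0CStages = L0C_STAGES;
    static constexpr bool enableL1Resident = ENABLE_L1_RESIDENT;
    static constexpr uint32_t l1AStages = L1A_STAGES;
    static constexpr uint32_t l1BStages = L1B_STAGES;
    static constexpr uint32_t l0AStages = L0A_STAGES;
    static constexpr uint32_t l0BStages = L0B_STAGES;

    static_assert(L0C_STAGES >= 1, "at least one L0C buffer is required");
    // Rule from the table: UnitFlag is incompatible with L0C multi-buffering.
    static_assert(!(ENABLE_UNIT_FLAG && L0C_STAGES > 1),
                  "enableUnitFlag must be false when L0C multi-buffering is enabled");
};

// All defaults: single L0C buffer, double buffering on L1 and L0A/L0B.
using DefaultPolicySketch = PingpongConfigSketch<ExampleArch>;
// L0C double buffering; enableUnitFlag stays false as required.
using DoubleBufferedL0CSketch = PingpongConfigSketch<ExampleArch, false, false, 2>;
```

In this sketch, writing `PingpongConfigSketch<ExampleArch, true, false, 2>` fails to compile, which surfaces the constraint at build time rather than at run time.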

Assume the matrix shape is `M N K`, the tile size on L1 is `m1 n1 k1`, the number of tiles in the M direction is `mTiles = CeilDiv(M, m1)`, the number of tiles in the N direction is `nTiles = CeilDiv(N, n1)`, and the total number of tasks is `taskBlocks = mTiles * nTiles`. `enableL1Resident` can be enabled in the following two cases:

  1. `mTiles = 1`, `nTiles > CoreNum`, and `K < 2 * k1`. In this case, `l0CStages=2` can also be set (`enableUnitFlag` must be disabled). If there is not enough space for `l0CStages=2`, set `n1` to half of its original value.

  2. `nTiles = 1`, `mTiles > CoreNum`, and `K < 2 * k1`. In this case, `l0CStages=2` can also be set (`enableUnitFlag` must be disabled). If there is not enough space for `l0CStages=2`, set `m1` to half of its original value.
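
The two conditions above can be checked with a few lines of host-side arithmetic. The following is a minimal sketch, not code from the sample: the helper names `ComputeTiles` and `CanEnableL1Resident` are hypothetical, and only the formulas for `mTiles`, `nTiles`, `taskBlocks` and the two cases come from the text.

```cpp
#include <cassert>
#include <cstdint>

// Rounds a / b up to the next integer.
inline uint32_t CeilDiv(uint32_t a, uint32_t b) {
    assert(b > 0);
    return (a + b - 1) / b;
}

struct TileInfo {
    uint32_t mTiles;      // number of tiles in the M direction
    uint32_t nTiles;      // number of tiles in the N direction
    uint32_t taskBlocks;  // total number of tasks
};

// Hypothetical helper: applies the tile-count formulas from the text.
inline TileInfo ComputeTiles(uint32_t M, uint32_t N, uint32_t m1, uint32_t n1) {
    const uint32_t mTiles = CeilDiv(M, m1);
    const uint32_t nTiles = CeilDiv(N, n1);
    return TileInfo{mTiles, nTiles, mTiles * nTiles};
}

// Hypothetical helper: true when either of the two listed cases holds.
inline bool CanEnableL1Resident(uint32_t M, uint32_t N, uint32_t K,
                                uint32_t m1, uint32_t n1, uint32_t k1,
                                uint32_t coreNum) {
    const TileInfo t = ComputeTiles(M, N, m1, n1);
    const bool shortK = K < 2 * k1;
    const bool case1 = (t.mTiles == 1) && (t.nTiles > coreNum) && shortK;
    const bool case2 = (t.nTiles == 1) && (t.mTiles > coreNum) && shortK;
    return case1 || case2;
}
```

For example, with an assumed core count of 24 and an L1 tile of `128 256 256`, a shape of `128 8192 256` gives `mTiles = 1` and `nTiles = 32`, so case 1 applies. The shape `256 512 1024` from the usage example gives `mTiles = 2` and `nTiles = 2`, so neither case applies. The core count and tile sizes here are illustrative values only.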

BasicMatmul also supports DispatchPolicy MmadPreloadAsyncWithCallback, which supports the following template parameters:

| Template Parameter | Default Value | Parameter Description |
| --- | --- | --- |
| `ArchTag` | None | Specifies the architecture model |
| `preloadStages` | None | Specifies the number of preloads |
| `l1AStages` | 2 | Number of buffers for loading matrix A on L1 |
| `l1BStages` | 2 | Number of buffers for loading matrix B on L1 |
| `l0AStages` | 2 | Number of buffers for loading matrix A on L0 |
| `l0BStages` | 2 | Number of buffers for loading matrix B on L0 |
| `l0CStages` | 1 | Specifies the number of L0C buffers. Set it to 2 to enable L0C double buffering |
| `enableUnitFlag` | `false` | Specifies whether to enable UnitFlag. It must be set to `false` when L0C multi-buffering is enabled |
| `enableShuffleK` | `false` | Specifies whether to enable K-direction staggered reading |
| `useHF32` | `false` | Specifies whether to enable HF32. Only the float type is supported |
| `enableL1Resident` | `false` | Specifies whether to enable L1 residency |

Compared with `MmadPingpong`, `MmadPreloadAsyncWithCallback` has two additional template parameters. The first is `preloadStages`, which specifies the number of preloads and is usually set to 1. With a value of 1, the first loop only loads data and performs no Matmul computation. The second loop first loads its own data and then completes the Matmul computation of the previous loop, and so on. After the final loop ends, one additional Matmul computation is performed. The benefit is that the data required for the current Matmul computation was already moved in during the previous loop, so instructions can be issued earlier, which reduces the performance loss caused by instruction issue latency.
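
The ordering described above can be made concrete with a small host-side simulation. This is a sketch only: `PreloadSchedule` is a hypothetical function that records step names and does not call any catlass or device API. The behavior for `preloadStages = 1` follows the text; the generalization to larger values is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical simulation: returns the order of load and Matmul steps for
// `loops` iterations when loads run `preloadStages` iterations ahead.
inline std::vector<std::string> PreloadSchedule(uint32_t loops, uint32_t preloadStages) {
    std::vector<std::string> events;
    for (uint32_t i = 0; i < loops; ++i) {
        // Each loop first loads its own data.
        events.push_back("load " + std::to_string(i));
        // Then it computes the tile that was loaded preloadStages loops ago.
        if (i >= preloadStages) {
            events.push_back("mmad " + std::to_string(i - preloadStages));
        }
    }
    // After the final loop, the tiles that were loaded but not yet computed
    // still need their Matmul step.
    const uint32_t firstPending = loops > preloadStages ? loops - preloadStages : 0;
    for (uint32_t i = firstPending; i < loops; ++i) {
        events.push_back("mmad " + std::to_string(i));
    }
    return events;
}
```

Calling `PreloadSchedule(3, 1)` yields the sequence `load 0`, `load 1`, `mmad 0`, `load 2`, `mmad 1`, `mmad 2`: the first loop only loads, each later loop loads before computing the previous tile, and one extra Matmul step follows the final loop.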

The second parameter is `enableShuffleK`. It is mainly used to avoid the bandwidth loss caused by same-address access conflicts, and works by staggering the data read addresses of each core. This parameter does not need to be enabled on 950.
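
The text does not spell out how the read addresses are staggered. One plausible scheme, assumed here purely for illustration, is to rotate each core's starting K tile by its core index so that cores working at the same time begin at different K offsets. `KTileOrder` is a hypothetical helper and is not taken from the catlass sources.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch: returns the order in which a core visits the K tiles.
// With shuffling enabled, the start tile is rotated by the core index
// (an assumed scheme); otherwise every core starts at tile 0.
inline std::vector<uint32_t> KTileOrder(uint32_t coreIdx, uint32_t kTiles, bool enableShuffleK) {
    assert(kTiles > 0);
    const uint32_t start = enableShuffleK ? coreIdx % kTiles : 0;
    std::vector<uint32_t> order;
    order.reserve(kTiles);
    for (uint32_t i = 0; i < kTiles; ++i) {
        order.push_back((start + i) % kTiles);  // wrap around at the end of K
    }
    return order;
}
```

With four K tiles, core 0 visits tiles in the order 0, 1, 2, 3 and core 1 in the order 1, 2, 3, 0, so the two cores do not read the same K offset in the same step. Every core still visits every K tile exactly once, so the accumulated result is unchanged up to floating-point summation order.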

Compared with `MmadPingpong`, `MmadPreloadAsyncWithCallback` has more optimization points, but its logic is also more complex and it has higher Scalar overhead. Choose it according to the scenario, and evaluate it carefully for small shapes, where the extra Scalar overhead is more likely to matter.
