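### Another Impressive M1 Stat (That You Can Try in Your Browser)

Safari has experimental support for WebGPU via WSL kernels. I wrote a simple tuner that tries to optimize matrix multiplication. If you have Safari, you can try it [here](https://jott.live/html/webgpu_demo.html). (You'll need to enable WebGPU under Develop > Experimental Features.)

My **M1 MacBook Air achieves 900 GFLOPS** after a few seconds of tuning. My Intel MacBook Pro only reaches 100 GFLOPS with the same exhaustive search.

For context, MobileNet v3 Large (x1.0) is ~219 MFLOPs. Running at this performance, it could execute about 4,500 inferences **per second**. The base BERT model (12 layers) is 11.2 GFLOPs. At this performance, it could theoretically run inference about 90 times per second.

The basic idea is to tile memory accesses, vectorize, use `mad` instructions, and tune the threading and dispatch parameters. The result is a kernel that looks like this:

```
[numthreads(2, 8, 1)]
compute void main(constant float4[] A : register(u0),
                  constant float4[] B : register(u1),
                  device float4[] C : register(u2),
                  float3 threadID : SV_DispatchThreadID) {
  // Each thread computes a 4-row by 1-float4 tile of C.
  uint m = uint(threadID.x);
  uint n = uint(threadID.y);
  float4 result_0_0 = float4(0.0, 0.0, 0.0, 0.0);
  float4 result_1_0 = float4(0.0, 0.0, 0.0, 0.0);
  float4 result_2_0 = float4(0.0, 0.0, 0.0, 0.0);
  float4 result_3_0 = float4(0.0, 0.0, 0.0, 0.0);
  for (uint k = 0; k < 256; k++) {
    // Load four float4s from A (one per output row) and four rows of B.
    float4 a_0_0 = A[(m * 4 + 0) * 256 + (k * 1 + 0)];
    float4 a_1_0 = A[(m * 4 + 1) * 256 + (k * 1 + 0)];
    float4 a_2_0 = A[(m * 4 + 2) * 256 + (k * 1 + 0)];
    float4 a_3_0 = A[(m * 4 + 3) * 256 + (k * 1 + 0)];
    float4 b_0_0 = B[(k * 4 + 0) * 256 + (n * 1 + 0)];
    float4 b_0_1 = B[(k * 4 + 1) * 256 + (n * 1 + 0)];
    float4 b_0_2 = B[(k * 4 + 2) * 256 + (n * 1 + 0)];
    float4 b_0_3 = B[(k * 4 + 3) * 256 + (n * 1 + 0)];
    // Scalar-times-vector multiply-accumulates.
    result_0_0 += mul(a_0_0.x, b_0_0);
    result_1_0 += mul(a_1_0.x, b_0_0);
    result_2_0 += mul(a_2_0.x, b_0_0);
    result_3_0 += mul(a_3_0.x, b_0_0);
    result_0_0 += mul(a_0_0.y, b_0_1);
    result_1_0 += mul(a_1_0.y, b_0_1);
    result_2_0 += mul(a_2_0.y, b_0_1);
    result_3_0 += mul(a_3_0.y, b_0_1);
    result_0_0 += mul(a_0_0.z, b_0_2);
    result_1_0 += mul(a_1_0.z, b_0_2);
    result_2_0 += mul(a_2_0.z, b_0_2);
    result_3_0 += mul(a_3_0.z, b_0_2);
    result_0_0 += mul(a_0_0.w, b_0_3);
    result_1_0 += mul(a_1_0.w, b_0_3);
    result_2_0 += mul(a_2_0.w, b_0_3);
    result_3_0 += mul(a_3_0.w, b_0_3);
  }
  C[(m * 4 + 0) * 256 + (n * 1 + 0)] = result_0_0;
  C[(m * 4 + 1) * 256 + (n * 1 + 0)] = result_1_0;
  C[(m * 4 + 2) * 256 + (n * 1 + 0)] = result_2_0;
  C[(m * 4 + 3) * 256 + (n * 1 + 0)] = result_3_0;
}
```

Dispatch parameters: 128, 32, 1

More tuning could be done (e.g. tiling the `K` dimension more or adding more levels of tiling), but I'm happy with the results. Nearly 1 TFLOPS (50% of peak) in the browser is very exciting for such an accessible technology.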
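
To make the inference-rate arithmetic concrete, here is a minimal Python sketch. It assumes the best case, where a whole model runs at the matmul kernel's sustained throughput, so the numbers are upper bounds. The function name `inferences_per_second` is made up for this illustration; the model costs are the figures quoted in the post.

```python
def inferences_per_second(sustained_flops: float, flops_per_inference: float) -> float:
    """Upper bound on inference rate: sustained throughput / cost of one inference."""
    return sustained_flops / flops_per_inference

# Per-inference costs quoted in the post.
MOBILENET_V3_LARGE = 219e6  # ~219 MFLOPs
BERT_BASE = 11.2e9          # 11.2 GFLOPs

# Compare the measured 900 GFLOPS with the rounded "nearly 1 TFLOPS" figure.
for label, rate in (("900 GFLOPS", 900e9), ("1 TFLOPS", 1e12)):
    mobilenet = inferences_per_second(rate, MOBILENET_V3_LARGE)
    bert = inferences_per_second(rate, BERT_BASE)
    print(f"{label}: MobileNet v3 ~{mobilenet:.0f}/s, BERT base ~{bert:.0f}/s")
```

The quoted 4,500 and 90 inferences per second line up with the rounded 1 TFLOPS figure; plugging in 900 GFLOPS gives somewhat lower rates.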
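
The kernel's index arithmetic is easier to follow in a CPU reference. The sketch below is a pure-Python illustration, not the tuner's code: it mimics the same tiling (each "thread" `(m, n)` produces 4 rows by one `float4` column of `C`) on a small matrix and checks the result against a naive matrix multiply. The helper names (`pack_float4`, `unpack_float4`, `matmul_tiled`, `matmul_naive`) are made up for this example, and `stride` plays the role of the hard-coded `256` in the kernel.

```python
import random

def pack_float4(rows):
    """Split each row into 4-wide chunks, mimicking a row-major float4 buffer."""
    return [row[c:c + 4] for row in rows for c in range(0, len(row), 4)]

def unpack_float4(flat, size):
    """Inverse of pack_float4 for a size x size matrix."""
    stride = size // 4
    return [[x for vec in flat[r * stride:(r + 1) * stride] for x in vec]
            for r in range(size)]

def matmul_tiled(A, B, size):
    """Multiply two size x size matrices stored as flat float4 buffers."""
    stride = size // 4                            # float4s per row (256 in the kernel)
    C = [None] * (size * stride)
    for m in range(size // 4):                    # threadID.x: a block of 4 rows of C
        for n in range(stride):                   # threadID.y: one float4 column of C
            acc = [[0.0] * 4 for _ in range(4)]   # result_0_0 .. result_3_0
            for k in range(stride):
                a = [A[(m * 4 + i) * stride + k] for i in range(4)]
                b = [B[(k * 4 + j) * stride + n] for j in range(4)]
                for j in range(4):                # .x, .y, .z, .w of each a vector
                    for i in range(4):
                        for lane in range(4):     # scalar * float4, accumulated
                            acc[i][lane] += a[i][j] * b[j][lane]
            for i in range(4):
                C[(m * 4 + i) * stride + n] = acc[i]
    return C

def matmul_naive(X, Y):
    """Textbook triple loop, used only as a correctness reference."""
    size = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

random.seed(0)
SIZE = 8  # must be a multiple of 4; the kernel above corresponds to 1024
X = [[float(random.randint(-3, 3)) for _ in range(SIZE)] for _ in range(SIZE)]
Y = [[float(random.randint(-3, 3)) for _ in range(SIZE)] for _ in range(SIZE)]
tiled = unpack_float4(matmul_tiled(pack_float4(X), pack_float4(Y), SIZE), SIZE)
print("tiled result matches naive:", tiled == matmul_naive(X, Y))
```

With a size of 1024, the two outer loops cover 256 x 256 threads, which matches the dispatch of 128 x 32 groups of 2 x 8 threads each.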