
In Agner Fog's manual Optimizing software in C++, in section 9.10 "Cache contentions in large data structures", he describes a problem transposing a matrix when the matrix width is equal to something called the critical stride. In his tests, the cost of transposing a matrix in L1 is 40% greater when the width is equal to the critical stride. If the matrix is even larger and only fits in L2 the cost is 600%! This is summarized nicely in Table 9.1 of his text. This is essentially the same phenomenon observed in Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
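The effect can be reproduced with a plain out-of-place transpose; a minimal sketch (my own names, no timing harness) of the kind of loop involved:

```cpp
#include <cstddef>
#include <vector>

// Naive out-of-place transpose of an n x n matrix stored row-major.
// The column-wise writes into dst touch addresses n*sizeof(float) apart;
// when that stride is a multiple of the critical stride (e.g. n = 512 on
// many machines), the accesses all map to the same cache sets and evict
// each other, while n = 513 staggers them across sets.
void transpose(const std::vector<float>& src, std::vector<float>& dst,
               std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            dst[j * n + i] = src[i * n + j];
}
```

Timing this with n = 512 versus n = 513 is what produces the asymmetry described above.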

He later writes:

The reason why this effect is so much stronger for level-2 cache contentions than for level-1 cache contentions is that the level-2 cache cannot prefetch more than one line at a time.

So my questions are related to prefetching data.

Based on his comment, I infer that L1 can prefetch more than one cache line at a time. How many can it prefetch?

From what I understand, attempting to write code to prefetch data (e.g. with _mm_prefetch) is rarely helpful. The only example I have read of it helping is Prefetching Examples?, and it only gives an improvement of O(10%) (on some machines). Agner later explains this:
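For reference, a typical use of `_mm_prefetch` looks like the sketch below. The prefetch distance (16 cache lines' worth of floats here) is my own guess and would need per-machine tuning; for a simple sequential scan like this, the hardware prefetcher usually makes the hint redundant, which is why the gains tend to be small:

```cpp
#include <xmmintrin.h>  // _mm_prefetch (x86/x64 only)
#include <cstddef>
#include <vector>

// Sum an array while issuing explicit software prefetches some distance
// ahead of the current load. _MM_HINT_T0 requests the line into all cache
// levels. The distance of 256 floats (~16 cache lines) is an assumption.
float sum_with_prefetch(const std::vector<float>& a) {
    constexpr std::size_t kDist = 256;  // elements ahead of the load
    float s = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (i + kDist < a.size())
            _mm_prefetch(reinterpret_cast<const char*>(&a[i + kDist]),
                         _MM_HINT_T0);
        s += a[i];
    }
    return s;
}
```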

The reason is that modern processors prefetch data automatically thanks to out-of-order execution and advanced prediction mechanisms. Modern microprocessors are able to automatically prefetch data for regular access patterns containing multiple streams with different strides. Therefore, you don't have to prefetch data explicitly if data access can be arranged in regular patterns with fixed strides.

So how does the CPU decide which data to prefetch and are there ways to help the CPU make better choices for the prefetching (e.g. "regular patterns with fixed strides")?
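As an illustration of "regular patterns with fixed strides" (my example, under my reading of the quote): both loops below present the hardware with a fixed stride per load instruction, so both are prefetchable in principle.

```cpp
#include <cstddef>
#include <vector>

// Row-major sum of a row-major n x n matrix: the load advances by
// sizeof(float) each iteration - a stride-1 stream that the streaming
// prefetchers handle easily.
float sum_rows(const std::vector<float>& m, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            s += m[i * n + j];
    return s;
}

// Column-major sum: the inner-loop load advances by n*sizeof(float) each
// iteration - still a fixed stride, which an IP-based stride prefetcher
// can track as long as the stride stays under its limit (Intel documents
// 2 KB, i.e. n <= 512 floats here).
float sum_cols(const std::vector<float>& m, std::size_t n) {
    float s = 0.0f;
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            s += m[i * n + j];
    return s;
}
```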

Edit: Based on Leeor's comment let me add to my question and make it more interesting. Why does the critical stride have so much more of an effect on L2 compared to L1?

Edit: I tried to reproduce this using the code from Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513? I ran it with MSVC2013 64-bit release mode on a Xeon E5 1620 (Ivy Bridge), which has L1 32KB 8-way, L2 256 KB 8-way, and L3 10MB 20-way. The maximum matrix size that fits in L1 is about 90x90, about 256x256 for L2, and about 1619x1619 for L3.

Matrix Size  Average Time
64x64        0.004251 0.004472 0.004412 (three runs)
65x65        0.004422 0.004442 0.004632 (three runs)
128x128      0.0409
129x129      0.0169
256x256      0.219   //max L2 matrix size
257x257      0.0692
512x512      2.701
513x513      0.649
1024x1024    12.8
1025x1025    10.1

I don't see any performance penalty with L1, but L2 clearly has a critical stride problem, and maybe L3 does as well. I'm not sure yet why L1 does not show a problem. It's possible there is some other source of background overhead which dominates the L1 times.
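One common mitigation for the critical stride (not from the post above, just the standard trick that the 513x513 case performs implicitly) is to pad the row pitch so that consecutive rows no longer map to the same cache sets. A sketch, with names of my own:

```cpp
#include <cstddef>
#include <vector>

// Store a logical n x n matrix with a padded row pitch. Choosing a pitch
// that is not a multiple of the critical stride (n + 1 here, as an
// assumption) staggers the rows across cache sets, so column-wise walks
// no longer thrash a single set.
struct PaddedMatrix {
    std::size_t n, pitch;
    std::vector<float> data;
    explicit PaddedMatrix(std::size_t n_)
        : n(n_), pitch(n_ + 1), data(n_ * (n_ + 1)) {}
    float& at(std::size_t i, std::size_t j) { return data[i * pitch + j]; }
    const float& at(std::size_t i, std::size_t j) const {
        return data[i * pitch + j];
    }
};
```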

Answer:

This statement:

    the level-2 cache cannot prefetch more than one line at a time.

is incorrect.

Actually, L2 prefetchers are usually stronger and more aggressive than L1 prefetchers. It depends on the actual machine you use, but Intel's L2 prefetcher, for example, can trigger 2 prefetches for each request, while the L1 one is usually limited (there are several types of prefetches that could coexist in the L1, but they're likely to be competing on a more limited bandwidth than the L2 has at its disposal, so there will probably be fewer prefetches coming out of the L1).

The optimization guide, in Section 2.3.5.4 (Data Prefetching), counts the following prefetcher types:

    Two hardware prefetchers load data to the L1 DCache:
    - Data cache unit (DCU) prefetcher: This prefetcher, also known as the streaming prefetcher, is triggered by an ascending access to very recently loaded data. The processor assumes that this access is part of a streaming algorithm and automatically fetches the next line.
    - Instruction pointer (IP)-based stride prefetcher: This prefetcher keeps track of individual load instructions. If a load instruction is detected to have a regular stride, then a prefetch is sent to the next address which is the sum of the current address and the stride. This prefetcher can prefetch forward or backward and can detect strides of up to 2K bytes.
    Data Prefetch to the L2 and Last Level Cache:
     - Spatial Prefetcher: This prefetcher strives to complete every cache line fetched to the L2 cache with the pair line that completes it to a 128-byte aligned chunk.
     - Streamer: This prefetcher monitors read requests from the L1 cache for ascending and descending sequences of addresses. Monitored read requests include L1 DCache requests initiated by load and store operations and by the hardware prefetchers, and L1 ICache requests for code fetch. When a forward or backward stream of requests is detected, the anticipated cache lines are prefetched. Prefetched cache lines must be in the same 4K page.
    

And further on:

    ... The streamer may issue two prefetch requests on every L2 lookup. The streamer can run up to 20 lines ahead of the load request.
    

    Out of the above, only the IP-based stride prefetcher can handle strides of more than one cache line (the streaming ones can handle anything that uses consecutive cache lines, meaning up to a 64-byte stride, or actually up to 128 bytes if you don't mind some extra lines). To use that, make sure that loads/stores at a given instruction address perform strided accesses - that's usually already the case in loops iterating over arrays. Compiler loop unrolling may split a single stream into multiple streams with larger strides - that would work even better (the lookahead would be larger), unless you exceed the number of outstanding tracked IPs - again, that depends on the exact implementation.

    However, if your access pattern does consist of consecutive lines, the L2 streamer is much more effective than the L1, since it runs ahead faster.
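    The loop-unrolling point above can be sketched like this (my example, not the answerer's): unrolling by four turns one stride-1 stream into four distinct load instructions, each advancing 16 bytes per iteration - four fixed-stride streams for an IP-based prefetcher to track:

```cpp
#include <cstddef>
#include <vector>

// Manually unrolled sum: each of the four loads in the loop body is a
// distinct instruction with its own fixed 16-byte stride per iteration,
// so an IP-based stride prefetcher sees four independent strided streams
// (up to its limit on tracked instructions).
float sum_unrolled(const std::vector<float>& a) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= a.size(); i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < a.size(); ++i)  // leftover tail
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```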
