如何在 GCC 中以 32 字节边界对齐堆栈?堆栈、边界、字节、中以

由网友(龙之谷帅气的名字)分享简介:我正在为 Windows 64 位目标使用基于 GCC 4.6.1 的 MinGW64 构建.我正在玩新的英特尔 AVX 指令.我的命令行参数是 -march=corei7-avx -mtune=corei7-avx -mavx.I'm using MinGW64 build based on GCC 4.6.1 f...

我正在为 Windows 64 位目标使用基于 GCC 4.6.1 的 MinGW64 构建.我正在玩新的英特尔 AVX 指令.我的命令行参数是 -march=corei7-avx -mtune=corei7-avx -mavx.

I'm using MinGW64 build based on GCC 4.6.1 for Windows 64bit target. I'm playing around with the new Intel's AVX instructions. My command line arguments are -march=corei7-avx -mtune=corei7-avx -mavx.

但是在堆栈上分配局部变量时,我开始遇到分段错误错误.GCC使用对齐的移动VMOVAPSVMOVAPD来移动__m256__m256d,这些指令需要32字节结盟.但是,Windows 64 位的堆栈只有 16 字节对齐.

But I started running into segmentation fault errors when allocating local variables on the stack. GCC uses the aligned moves VMOVAPS and VMOVAPD to move __m256 and __m256d around, and these instructions require 32-byte alignment. However, the stack for Windows 64bit has only 16 byte alignment.

如何将 GCC 的堆栈对齐更改为 32 字节?

我曾尝试使用 -mstackrealign 但无济于事,因为它仅对齐 16 个字节.我也无法使 __attribute__((force_align_arg_pointer)) 工作,无论如何它都对齐到 16 个字节.我无法找到任何其他可以解决此问题的编译器选项.非常感谢任何帮助.

I have tried using -mstackrealign but to no avail, since that aligns only to 16 bytes. I couldn't make __attribute__((force_align_arg_pointer)) work either, it aligns to 16 bytes anyway. I haven't been able to find any other compiler options that would address this. Any help is greatly appreciated.

我尝试使用 -mpreferred-stack-boundary=5,但 GCC 表示此目标不支持 5.我没主意了.

I tried using -mpreferred-stack-boundary=5, but GCC says that 5 is not supported for this target. I'm out of ideas.

推荐答案

我一直在探索这个问题,提交了一份 GCC 错误报告,发现这是一个与 MinGW64 相关的问题.请参阅 GCC 错误#49001.显然,GCC 不支持 Windows 上的 32 字节堆栈对齐.这有效地防止了使用 256 位 AVX 指令.

I have been exploring the issue, filed a GCC bug report, and found out that this is a MinGW64 related problem. See GCC Bug#49001. Apparently, GCC doesn't support 32-byte stack alignment on Windows. This effectively prevents the use of 256-bit AVX instructions.

我研究了几种解决此问题的方法.最简单和最直接的解决方案是用未对齐的替代方案 VMOVUPS 等替换对齐的内存访问 VMOVAPS/PD/DQA.所以我昨晚学习了 Python(顺便说一句,这是非常好的工具)并使用以下脚本完成了这项工作输入 GCC 生成的汇编文件:

I investigated a couple ways how to deal with this issue. The simplest and bluntest solution is to replace of aligned memory accesses VMOVAPS/PD/DQA by unaligned alternatives VMOVUPS etc. So I learned Python last night (very nice tool, by the way) and pulled off the following script that does the job with an input assembler file produced by GCC:

import re
import fileinput
import sys

# fix aligned stack access
# replace aligned vmov* by unaligned vmov* with 32-byte aligned operands 
# see Intel's AVX programming guide, page 39
vmova = re.compile(r"s*?vmov(w+).*?(((%r.*?%ymm)|(%ymm.*?(%r))")
aligndict = {"aps" : "ups", "apd" : "upd", "dqa" : "dqu"};
for line in fileinput.FileInput(sys.argv[1:],inplace=1):
    m = vmova.match(line)
    if m and m.group(1) in aligndict:
        s = m.group(1)
        print line.replace("vmov"+s, "vmov"+aligndict[s]),
    else:
        print line,

这种方法非常安全且万无一失.尽管我在极少数情况下观察到了性能损失.当堆栈未对齐时,内存访问会跨越高速缓存行边界.幸运的是,代码的执行速度在大多数情况下与对齐访问一样快.我的建议:关键循环中的内联函数!

This approach is pretty safe and foolproof. Though I observed a performance penalty on rare occasions. When the stack is unaligned, the memory access crosses the cache line boundary. Fortunately, the code performs as fast as aligned accesses most of the time. My recommendation: inline functions in critical loops!

我还尝试使用另一个 Python 脚本修复每个函数序言中的堆栈分配,尝试始终将其对齐在 32 字节边界.这似乎适用于某些代码,但不适用于其他代码.我必须依靠 GCC 的善意,它会分配对齐的局部变量(相对于堆栈指针),它通常会这样做.情况并非总是如此,尤其是当由于需要在函数调用之前保存所有 ymm 寄存器而导致严重的寄存器溢出时.(所有 ymm 寄存器都是被调用者保存的).如果有兴趣,我可以发布脚本.

I also attempted to fix the stack allocation in every function prolog using another Python script, trying to align it always at the 32-byte boundary. This seems to work for some code, but not for other. I have to rely on the good will of GCC that it will allocate aligned local variables (with respect to the stack pointer), which it usually does. This is not always the case, especially when there is a serious register spilling due to the necessity to save all ymm register before a function call. (All ymm registers are callee-save). I can post the script if there's an interest.

最好的解决方案是修复 GCC MinGW64 版本.不幸的是,我不知道它的内部工作原理,上周才开始使用它.

The best solution would be to fix GCC MinGW64 build. Unfortunately, I have no knowledge of its internal workings, just started using it last week.

阅读全文

相关推荐

最新文章