Treelink Competition Talk

Treelink Model Prediction Algorithm
Competition Talk

鸣嵩

Decision Tree (1)
• The classic decision tree
o Decide from the weather whether it is a good day to play tennis

Decision Tree (2)
• Decide based on the numeric value of a variable

Decision Tree (3)
• Decide based on the values in a vector

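Decision Tree (4)
• The competition model contains 190 decision trees
o Treelink / decision tree forest
o The output is the sum of the results computed on all trees

• Each tree is a complete binary tree with 4 levels

[Figure: Tree 1, Tree 2, ..., Tree 190]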

One Prediction on a Single Decision Tree
• The input is the vector float px[0]~px[51]; the output is float pY[0]
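To make the model structure concrete, here is a minimal C sketch of one tree and of the forest sum. The slides only give the input and output types, the tree count, and the depth; the `Tree` struct, its field names, the heap-order layout (15 internal nodes, 16 leaves), and the "go right when the feature exceeds the threshold" rule are all illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

enum { NUM_FEATURES = 52, DEPTH = 4, NUM_INTERNAL = 15, NUM_LEAVES = 16 };

/* Hypothetical layout: one complete binary tree stored in heap order. */
typedef struct {
    int   feature[NUM_INTERNAL];   /* index into px[] tested at each node   */
    float threshold[NUM_INTERNAL]; /* go right when px[feature] > threshold */
    float leaf[NUM_LEAVES];        /* value returned at each leaf           */
} Tree;

/* Walk one tree from the root to a leaf: exactly DEPTH comparisons. */
float tree_predict(const Tree *t, const float *px)
{
    int node = 0;
    for (int level = 0; level < DEPTH; level++) {
        int go_right = px[t->feature[node]] > t->threshold[node];
        node = 2 * node + 1 + go_right;
    }
    assert(node >= NUM_INTERNAL && node < NUM_INTERNAL + NUM_LEAVES);
    return t->leaf[node - NUM_INTERNAL];
}

/* The model output pY[0] is the sum of the outputs of all trees. */
void forest_predict(const Tree *trees, size_t num_trees,
                    const float *px, float *pY)
{
    float sum = 0.0f;
    for (size_t i = 0; i < num_trees; i++)
        sum += tree_predict(&trees[i], px);
    pY[0] = sum;
}
```

With 190 trees, `forest_predict(trees, 190, px, pY)` performs 190 × 4 comparisons per input, which is why the cost of a single comparison dominates the rest of the talk.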

Problems in the Code (1)
• 0.019217 is a double
• 0.019217f is a float
• double vs. float
o double: 8 bytes, 64-bit register
o float: 4 bytes, 32-bit register

• Comparing a double with a float requires an extra type conversion
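The difference between the two literals can be checked directly in C. The sketch below is illustrative only; the function names are hypothetical, and the threshold is the one quoted on the slide.

```c
#include <assert.h>

/* Unsuffixed literal is a double: x is widened to double before comparing. */
int below_double(float x) { return x < 0.019217; }

/* The f suffix makes the literal a float: the comparison stays in single precision. */
int below_float(float x) { return x < 0.019217f; }

/* The literal's type, not the variable's, decides the size of the constant. */
void check_literal_sizes(void)
{
    assert(sizeof(0.019217)  == sizeof(double));
    assert(sizeof(0.019217f) == sizeof(float));
}
```

Compiling `below_double` typically produces a float-to-double conversion instruction before the compare, which `below_float` does not need.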

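Problems in the Code (2)
Disassembly, gcc 4.1

Notes
x_test[28] => 0x70(%rdi) => %xmm2 (float) => %xmm3 (double)
0.00706646 => 0x19345(%rip)

x86 instructions cannot carry a floating-point immediate; the constant is compiled into the code segment and loaded into a register when it is needed
Cost: a memory access, and the location of the data depends on the compiler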

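Improvement
• Use the 32-bit immediates available in the x86 instruction set
o mov, cmp, add, ...
o Convert floating-point values to integers

• Saves one memory access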
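The slides do not say how the floats were mapped to integers. The sketch below assumes a fixed-point mapping: features are assumed to lie in [0, 1) and are multiplied by a hypothetical `FIXED_SCALE`; the thresholds 0.25f and 262144 are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scale factor: features are assumed to lie in [0, 1). */
#define FIXED_SCALE 1048576.0f /* 2^20 */

/* Convert one float feature to a fixed-point integer (truncating). */
int32_t to_fixed(float x)
{
    assert(x >= 0.0f && x < 1.0f);
    return (int32_t)(x * FIXED_SCALE);
}

/* Float version: the threshold is a constant in memory that must be loaded. */
int node_test_float(const float *x) { return x[2] > 0.25f; }

/* Integer version: 262144 == 0.25 * 2^20 fits in the 32-bit immediate of cmp. */
int node_test_fixed(const int32_t *xi) { return xi[2] > 262144; }
```

Under this assumption the features are converted once per input and then reused by every tree. The quantization is not free: an input within one fixed-point step of a threshold may take a different branch than it would in the float version.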

Disassembly

Notes:
x_test[2] => 0x8(%rdi) => %r10d
1028025 => 0xfafb0 (32-bit immediate in the cmp instruction)

SIMD

Convert floats to integers in batches

GCC intrinsics
• #include <emmintrin.h>
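A possible shape for the batch conversion, using SSE2 intrinsics from `<emmintrin.h>`. This is a sketch under assumptions: the slides do not show the actual code, the conversion is assumed to be a fixed-point scaling, and the function name and `scale` parameter are hypothetical.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics (also provides the SSE float ops) */
#include <stddef.h>
#include <stdint.h>

/* Convert n floats to fixed-point int32 values, four lanes at a time.
 * scale is a hypothetical fixed-point factor; n must be a multiple of 4. */
void floats_to_fixed_sse2(const float *in, int32_t *out, size_t n, float scale)
{
    assert(n % 4 == 0);
    const __m128 vscale = _mm_set1_ps(scale);
    for (size_t i = 0; i < n; i += 4) {
        __m128  v = _mm_mul_ps(_mm_loadu_ps(in + i), vscale); /* 4 multiplies  */
        __m128i q = _mm_cvttps_epi32(v);           /* 4 truncating conversions */
        _mm_storeu_si128((__m128i *)(out + i), q);
    }
}
```

The 52 input features divide evenly into 13 groups of four, so no scalar tail loop is needed for this model.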

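Know Your Processor
Nehalem E5620
• Long pipeline, >= 15 stages
• x86 instructions are decoded into micro-ops, which then execute out of order
o Micro-ops waiting to execute sit in the Reservation Station
o Multiple ALUs execute concurrently and out of order
o The Reorder Buffer restores program order
o Instruction Retirement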

Pipeline
• Example: 4-stage and 8-stage pipelines

Front End
Fetch x86 instructions, 16 bytes per clock cycle

Decode x86 instructions into micro-ops (μops)

Micro-op (μop) buffer

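Out-of-Order Execution (1)
Register renaming: allocate temporary registers

Micro-ops enter the Reservation Station

Issue instructions

EU EU EU

Arithmetic operations; Load/Store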

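Out-of-Order Execution (2)
Results are written out in program order
The instruction takes effect: results are actually written to memory and to the physical registers
Results are stored in temporary registers

Instructions that depend on the data are triggered to execute

Results computed in the EUs; Load/Store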

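Quantifying Instruction Throughput
• Instruction fetch: 16 bytes/cycle
• Decoding x86 instructions into μops
o Simple instructions: 3 per cycle
o Complex instructions: 1 per cycle

• 6 ports in total from the Reservation Station to the EUs
o P0, P1, P5 go to the ALUs
o P2, P3, P4 go to the Load/Store units

• Instruction Retirement: 4 μops/cycle
• Length of the dependency chain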

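Instruction-Level Optimization
• Long pipeline, >= 15 stages
o A branch misprediction is expensive
• Reduce the branch misprediction rate
o Reduce or eliminate conditional branches
• Replace comparisons with bit operations
• Replace comparisons with the cmovg instruction
• Make full use of the Intel processor's out-of-order execution
o Avoid long dependency chains between instructions
o Avoid implicit dependencies between instructions, e.g. on eflags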
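As a generic illustration of the dependency-chain point (not code from the talk), compare two ways of summing an array. Both functions and their names are hypothetical; integer addition is used so the two results are guaranteed to be identical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One accumulator: every add has to wait for the previous add to finish. */
int32_t sum_single_chain(const int32_t *v, size_t n)
{
    assert(v != NULL || n == 0);
    int32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Two accumulators: two independent chains the core can overlap. */
int32_t sum_two_chains(const int32_t *v, size_t n)
{
    assert(v != NULL || n == 0);
    int32_t s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += v[i];     /* chain 0 */
        s1 += v[i + 1]; /* chain 1, independent of chain 0 */
    }
    if (i < n)          /* odd element left over */
        s0 += v[i];
    return s0 + s1;
}
```

An optimizing compiler may perform a similar transformation on its own, so whether the manual split helps has to be measured.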

Eliminating a Conditional Branch
• How can this if statement be eliminated?
    if (a < b) {
        r = c;
    } else {
        r = d;
    }

• Bit-operation version 1
    int mask = (a - b) >> 31;
    r = (mask & c) | (~mask & d);

• Bit-operation version 2
    int mask = (a - b) >> 31;
    r = d + (mask & (c - d));

• cmovg version
    r = (a < b) ? c : d;
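The variants above can be written as small C functions and checked against the branching version. This is a sketch with hypothetical function names; it assumes 32-bit operands, no overflow in `a - b`, and an arithmetic right shift on negative values (which gcc provides, although the C standard leaves it implementation-defined).

```c
#include <assert.h>
#include <stdint.h>

/* Reference version with a conditional branch. */
int32_t select_branch(int32_t a, int32_t b, int32_t c, int32_t d)
{
    if (a < b)
        return c;
    return d;
}

/* Version 1: mask is all ones when a < b, otherwise zero. */
int32_t select_mask_v1(int32_t a, int32_t b, int32_t c, int32_t d)
{
    /* Precondition: a - b must not overflow (checked only in debug builds). */
    assert((int64_t)a - b >= INT32_MIN && (int64_t)a - b <= INT32_MAX);
    int32_t mask = (a - b) >> 31;
    return (mask & c) | (~mask & d);
}

/* Version 2: one operation fewer. The parentheses matter, because +
 * binds tighter than & in C. Also assumes c - d does not overflow. */
int32_t select_mask_v2(int32_t a, int32_t b, int32_t c, int32_t d)
{
    int32_t mask = (a - b) >> 31;
    return d + (mask & (c - d));
}

/* Ternary form: a compiler may, but need not, emit a cmov for it. */
int32_t select_ternary(int32_t a, int32_t b, int32_t c, int32_t d)
{
    return (a < b) ? c : d;
}
```

Whether the ternary form becomes a cmov depends on the compiler and its flags, so the disassembly should be inspected rather than assumed.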

Do Not Overuse the CMOV Instruction
CMOV (and, more generically, any
"predicated instruction") tends to
be generally a bad idea on an
aggressively out-of-order CPU.

-- Linus Torvalds

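Optimization Results
Only the first two levels keep ordinary comparisons (branches), because branch prediction hits often there

Level 3 uses cmovg; advantage: fewer instructions

Level 4 replaces the comparison with bit operations, to take full advantage of the processor's out-of-order execution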
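The per-level strategy can be sketched in C as follows. The talk's trees appear to be compiled with hard-coded constants; this sketch instead stores a tree in arrays so it stays short. The `IntTree` layout, the function names, and the "go right when the feature exceeds the threshold" rule are assumptions, as are integer features, no overflow in the subtraction, and an arithmetic right shift (gcc behaviour).

```c
#include <assert.h>
#include <stdint.h>

enum { NUM_INTERNAL = 15, NUM_LEAVES = 16 };

/* Hypothetical layout: complete binary tree in heap order, integer features. */
typedef struct {
    int     feature[NUM_INTERNAL];
    int32_t threshold[NUM_INTERNAL];
    int32_t leaf[NUM_LEAVES];
} IntTree;

/* Reference: four ordinary comparisons. */
int32_t predict_plain(const IntTree *t, const int32_t *x)
{
    int node = 0;
    for (int level = 0; level < 4; level++)
        node = 2 * node + 1 + (x[t->feature[node]] > t->threshold[node]);
    return t->leaf[node - NUM_INTERNAL];
}

int32_t predict_hybrid(const IntTree *t, const int32_t *x)
{
    int node;
    /* Levels 1 and 2: real branches, left to the branch predictor. */
    if (x[t->feature[0]] > t->threshold[0]) {
        if (x[t->feature[2]] > t->threshold[2]) node = 6; else node = 5;
    } else {
        if (x[t->feature[1]] > t->threshold[1]) node = 4; else node = 3;
    }
    /* Level 3: ternary select, a candidate for cmov. */
    node = (x[t->feature[node]] > t->threshold[node]) ? 2 * node + 2
                                                      : 2 * node + 1;
    assert(node >= 7 && node <= 14);
    /* Level 4: bit mask instead of a comparison.
     * diff < 0 exactly when the feature exceeds the threshold. */
    int32_t diff = t->threshold[node] - x[t->feature[node]];
    int32_t mask = diff >> 31; /* all ones if going right, else zero */
    int32_t lo = t->leaf[2 * node + 1 - NUM_INTERNAL];
    int32_t hi = t->leaf[2 * node + 2 - NUM_INTERNAL];
    return lo + (mask & (hi - lo));
}
```

Both functions return the same leaf for the same input; only the instruction mix differs.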

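Optimization Results

Disassembly: 13 instructions per tree on average

Run time: 1,000,000 inputs × 190 trees in 0.44 s on an E5620 @ 2.40 GHz
Average per tree evaluation: (0.44 × 2.4 × 1,000,000,000) / (1,000,000 × 190) ≈ 5.56 clock cycles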

Quantitative Analysis
Instruction                   μops   P0/1/5 μops   P0/P1/P5 marks   P2 μops
mov 0x70(%rdi),%edx             6        3            x x x            3
lea -0xf540(%r9),%eax           1        1            1
sar $0x1f,%eax                  1        1            x x
sub $0xe76606,%edx              1        1            x x x
and $0x1d5c6,%eax               1        1            x x x
sar $0x1f,%edx                  1        1            x x
add $0x3cc1,%eax                1        1            x x x
and $0x6079,%edx                1        1            x x x
sub $0x50fd,%edx                1        1            x x x
cmpl $0xfe3f93,0xb8(%rdi)       1        1            x x x            1
cmovg %eax,%edx                 2        2            x x x
lea (%rdx,%rcx,1),%r8d          1        1            1
jg 0x403e18                     1        1            1
Total                          19       16                             4
Clock cycles required          4.75      5.3                           4
No μops go to P3 or P4.
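The last row of the table follows from three throughput limits stated earlier in the talk: 4 μops retired per cycle, 3 ALU ports (P0, P1, P5), and the single port (P2) used for loads here. A small C sketch of that arithmetic, with a hypothetical struct and function name:

```c
#include <assert.h>

/* Cycle counts implied by each resource; the largest one is the bottleneck. */
typedef struct {
    double retire_cycles; /* total uops / 4 retired per cycle  */
    double alu_cycles;    /* ALU uops / 3 ports (P0, P1, P5)   */
    double load_cycles;   /* load uops / 1 port (P2)           */
    double bound;         /* maximum of the three values above */
} ThroughputBound;

ThroughputBound throughput_bound(int total_uops, int alu_uops, int load_uops)
{
    assert(total_uops >= 0 && alu_uops >= 0 && load_uops >= 0);
    ThroughputBound b;
    b.retire_cycles = total_uops / 4.0;
    b.alu_cycles    = alu_uops / 3.0;
    b.load_cycles   = load_uops / 1.0;
    b.bound = b.retire_cycles;
    if (b.alu_cycles > b.bound)  b.bound = b.alu_cycles;
    if (b.load_cycles > b.bound) b.bound = b.load_cycles;
    return b;
}
```

For the totals in the table, `throughput_bound(19, 16, 4)` gives 4.75, about 5.33, and 4 cycles, so the ALU ports are the limiting resource.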

Quantitative Analysis: Conclusion
Theoretical value: 5.3 clock cycles
Measured value: 5.5 clock cycles
The two agree closely.
