This talk covers the design and practical experience of dynamic testing tools for C/C++ programs: AddressSanitizer, ThreadSanitizer, and MemorySanitizer. These tools find bugs such as use-after-free, out-of-bounds accesses to arrays and objects, data races in multithreaded programs, and uses of uninitialized memory.
Linux Performance Analysis: New Tools and Old Secrets (Brendan Gregg)
Talk for USENIX/LISA2014 by Brendan Gregg, Netflix. At Netflix performance is crucial, and we use many high to low level tools to analyze our stack in different ways. In this talk, I will introduce new system observability tools we are using at Netflix, which I've ported from my DTraceToolkit, and are intended for our Linux 3.2 cloud instances. These show that Linux can do more than you may think, by using creative hacks and workarounds with existing kernel features (ftrace, perf_events). While these are solving issues on current versions of Linux, I'll also briefly summarize the future in this space: eBPF, ktap, SystemTap, sysdig, etc.
Kernel address sanitizer (KASan) is a dynamic memory error detector for finding out-of-bounds and use-after-free bugs in the Linux kernel. It uses shadow memory to record whether each byte of memory is safe to access, and uses compile-time instrumentation to check shadow memory on each memory access. In this presentation Alexander Popov will describe the successful experience of porting KASan to a bare-metal hypervisor: the main steps, the pitfalls, and the ways to make KASan checks much more strict and multi-purpose.
This presentation was delivered at LinuxCon Japan 2016 by Alexander Popov
Compare Performance-power of Arm Cortex vs RISC-V for AI Applications, Oct 2021 (Deepak Shankar)
Abstract: In this webinar, we will show you how to construct, simulate, analyze, validate, and optimize an architecture model using pre-built components. We will compare micro and application benchmarks on system SoC models containing clusters of ARM Cortex A53, SiFive u74, ARM Cortex A77, and other vendor cores. The system will be built around custom switches, ingress/egress buffers, credit flow control, AI accelerators, NoC and AMBA AXI buses with multi-level caches, DDR4 DRAM, and DMA. The evaluation and optimization criteria will be task latency, dCache hit ratio, power consumed per task, and memory bandwidth. The parameters to be modified are bus topology, cache size, processor clock speed, custom arbiters, task thread allocation, and the processor pipeline.
Selection of cores is a combination of financial and technical bias. Technical comparison of processor cores requires an understanding of the workload, task partitioning, and cache-memory structure. A core must be evaluated in the context of the target application. To evaluate these selections, architecture simulation software must be fortified with a library of intellectual property covering power- and timing-accurate processor cores, a simulator running at 100 million events per second, peripherals, and all relevant traffic distributions.
Key Takeaways:
1. Validate architecture models using mathematical calculus and hardware traces
2. Construct custom policies and arbitrations, and configure processor cores
3. Select the right combination of statistics to detect bottlenecks and optimize the architecture
4. Identify the right mix of stochastic, transaction-level, cycle-accurate, and trace-based modeling to construct the model
Speaker Bio:
Alex Su is an FPGA solution architect at E-Elements Technology, Hsinchu, Taiwan. He has been an FPGA solution architect and Xilinx FPGA trainer for a number of years, supporting companies, research centers, and universities in China and Taiwan. Prior to that, Mr. Su worked at ARM Ltd for 5 years in technical support of Arm CPU and System IP. Alex has also worked on a variety of FPGA-based hardware emulation systems and has over ten years of experience as an ASIC/SoC design and verification engineer.
Deepak Shankar is the Founder of Mirabilis Design and has been involved in the architecture exploration of over 250 SoCs and processors. Mr. Shankar started Mirabilis Design because of a vacuum in the systems engineering and modeling space as industry focus shifted to network design and early software development. Deepak has published over 50 articles and presented at over 30 conferences in EDA, semiconductors, and embedded computing. Mr. Shankar has an MBA from UC Berkeley, an MS from Clemson University, and a BS from Coimbatore Institute of Technology, the latter two in Electronics and Communication.
Vectorized Processing in a Nutshell. (in Korean)
Presented by Hyoungjun Kim, Gruter CTO and Apache Tajo committer, at DeView 2014, Sep. 30, Seoul, Korea.
2010 CodeEngn Conference 04
From May 22-24, 2010, the CTF qualifiers for DEFCON 18, the world's largest hacker festival, were held. Having participated as part of the KAIST security club team GoN, I present an overall review of this DEFCON CTF qualifier, along with walkthroughs of problems from the Binary l33tness and Pwtent pwnables categories. (Our team advanced to the DEFCON 18 CTF finals, placing 6th among 529 teams in the qualifiers.)
http://codeengn.com/conference/04
CyberConnect2 began developing a multiplatform engine for the DirectX 11 generation in 2013; this session shares the problems that arose during development, framed by the differences from DirectX 9.
It is aimed at those who are new to, or interested in, DirectX 11 development. Please note that it does not cover the latest techniques such as tessellation or OIT.
2. SIMD.
• Single Instruction Multiple Data
• A kind of instruction set supported by the CPU.
• Processes multiple data elements with a single operation.
• SISD - Single Instruction Single Data
• SIMD - Single Instruction Multiple Data
• MISD - Multiple Instruction Single Data
• MIMD - Multiple Instruction Multiple Data
3. SIMD.
[Diagram] A SISD instruction takes one input datum and produces one output datum; a SIMD instruction takes multiple input data and produces multiple output data in a single operation.
5. How to Write SIMD Code.
• Assembly SIMD instructions
• Intrinsic functions
• Vector classes

SIMD instructions: xmm0 ~ xmm7
Intrinsic functions: __m128i, __m128, __m128d
Vector classes: Is16vec8, Is32vec4, ..., F32vec4, F64vec2
7. When to Use SIMD.
• Is it a critical part or a bottleneck of the project?
• Is it well suited to the SIMD structure?
• Will it actually improve performance?
• Determine whether the data is integer or floating point.
• Consider how many data elements fit in 128 bits.
• Consider development time and debugging/test time.
• Decide on the implementation tool.
• Compare SISD and SIMD performance with tests.
8. SSE.
• Streaming SIMD Extensions
• Eight 128-bit XMM registers.
• Introduced by Intel in 1999 with the Pentium III processor.
• Supports a wide range of operations: floating point, comparison logic, and more.

Packing size | byte | short | integer
Parallel operations | 16 | 8 | 4
Speedup over C code | 4~6x | 2x | 10~30%
11. SIMD Operation Types
Scalar addition: A3 | A2 | A1 | A0 + B3 | B2 | B1 | B0 = A3 | A2 | A1 | A0+B0
(only the lowest 32-bit lane is added; the upper lanes pass through unchanged)
Packed addition: A3 | A2 | A1 | A0 + B3 | B2 | B1 | B0 = A3+B3 | A2+B2 | A1+B1 | A0+B0
(all four 32-bit lanes are added in parallel)
12. Essential Assembly Instructions
• mov
• add / sub
• mul / div
• inc / dec
• shl / shr
• cmp / jp
13. Assembly Example.
__asm
{
pushad // save all general-purpose registers
mov eax, A
mov ebx, B
add eax, ebx // eax = A + B
mov C, eax // C = A + B
popad // restore registers
}
14. Assembly Example.
__asm
{
pushad
mov ebx, 15
mov eax, A
mul ebx // edx:eax = A * 15; the low 32 bits stay in eax
mov C, eax // C = A * 15
popad
}
15. Assembly Example.
__asm
{
pushad
mov eax, 17
cdq // sign-extend the 32-bit eax into the 64-bit edx:eax pair
// (convert doubleword to quadword)
mov ebx, A
div ebx // unsigned divide: eax = quotient, edx = remainder
mov B, eax // B = 17 / A
mov C, edx // C = 17 % A
popad
}
17. SIMD Instructions
• There are two parallel operation modes: integer and floating point.
• The operation behaves differently depending on the pack format.
• MMX supported only 64-bit parallel operations; SSE extended this to 128-bit parallel operations.
• The same code runs identically on any CPU that supports it.
• Debugging is difficult and readability is poor.
18. Naming Convention
P <SIMD_op> <suffix>

Suffix | Word | Size | Meaning
S | signed | - | The data type carries a +/- sign; this letter comes before the size letter.
U | unsigned | - | The data type is unsigned (non-negative values only).
B | Byte | 8 bit | Operates on sixteen 8-bit integers at once.
W | Word | 16 bit | Operates on eight 16-bit integers at once.
D | DoubleWord | 32 bit | Operates on four 32-bit integers at once.
Q | QuadWord | 64 bit | Operates on two 64-bit integers at once.
19. Aligned vs. Unaligned Memory
• Using aligned memory is faster.
• Aligned memory:
• __declspec( align(16) ) int array[100];
• Unaligned memory:
• int array[100];
20. MOVDQU
Move Unaligned Double Quadword
• Used to load an unaligned 128-bit value from memory into a 128-bit register.
[Diagram] 16 bytes of unaligned memory, holding eight 16-bit lanes (8..1), are loaded into the xmm0 register.
21. MOVDQA
Move Aligned Double Quadword
• Used to load a 16-byte-aligned 128-bit value from memory into a 128-bit register.
• Because the memory is aligned, the load is faster.
[Diagram] 16 bytes of aligned memory, holding eight 16-bit lanes (8..1), are loaded into the xmm0 register.
22. Logical Operations
(truth table, applied bit-wise across the 128-bit value)
SourceA | 1 | 1 | 0 | 0
SourceB | 1 | 0 | 1 | 0
PAND    | 1 | 0 | 0 | 0
POR     | 1 | 1 | 1 | 0
PXOR    | 0 | 1 | 1 | 0
PANDN   | 0 | 0 | 1 | 0
23. PADDD
(Packed Add)
• Performs parallel addition.
(four 32-bit lanes)
xmm0  |  4 |  3 | 2 | 1
xmm1  |  8 |  7 | 6 | 5
paddd | 12 | 10 | 8 | 6
32. Intrinsic Functions
• The code is easy to analyze.
• SIMD instructions are wrapped as inline functions, so performance is nearly identical to the raw SIMD instruction set.
• Scalar intrinsic functions may be slightly slower than assembly SIMD instructions.
• More convenient to write than raw SIMD instructions.
33. Naming Convention
_mm_<intrin_op>_<suffix>

Suffix | Data type
s | single-precision (32-bit) floating point
d | double-precision (64-bit) floating point
i128 | signed 128-bit integer
i64 | signed 64-bit integer
u64 | unsigned 64-bit integer
i32 | signed 32-bit integer
u32 | unsigned 32-bit integer
i16 | signed 16-bit integer
u16 | unsigned 16-bit integer
i8 | signed 8-bit integer
u8 | unsigned 8-bit integer
34. Intrinsic Functions
• Include the header file:
• #include <xmmintrin.h> // SSE
• #include <emmintrin.h> // SSE2
• #include <pmmintrin.h> // SSE3
• #include <smmintrin.h>, <nmmintrin.h> // SSE4
35. The __m128 Data Types
• Data types for SIMD operations; structs that map one-to-one onto the XMM registers.
__m128i | 32-bit integer | 32-bit integer | 32-bit integer | 32-bit integer
36. Intrinsic LOAD / STORE
__m128i r = _mm_load_si128(__m128i const *p)
• Loads eight 16-bit lanes p[0]..p[7] from 16-byte-aligned memory into r[0]..r[7].
void _mm_store_si128(__m128i *p, __m128i b)
• Stores lanes b[0]..b[7] into 16-byte-aligned memory p[0]..p[7].
37. Intrinsic Functions
short Source[8] = {1,2,3,4,5,6,7,8};
short Dest[8] = {0};
__m128i xmm0 = _mm_loadu_si128((__m128i*)Source);
__m128i xmm1 = xmm0;
_mm_storeu_si128((__m128i*)Dest, xmm1);
38. Intrinsic ADD / SUB
__m128i r = _mm_add_epi16(__m128i a, __m128i b)
• Lane-wise: r[i] = a[i] + b[i], for eight 16-bit lanes (i = 0..7).
__m128i r = _mm_sub_epi16(__m128i a, __m128i b)
• Lane-wise: r[i] = a[i] - b[i], for eight 16-bit lanes.
39. Intrinsic MUL
__m128i r = _mm_mullo_epi16(__m128i a, __m128i b)
• Lane-wise: r[i] = a[i] * b[i], keeping the low 16 bits of each product, for eight 16-bit lanes.
40. Intrinsic MAX / MIN
__m128i r = _mm_max_epi16(__m128i a, __m128i b)
• Lane-wise: r[i] = max(a[i], b[i]), for eight 16-bit lanes (i = 0..7).
__m128i r = _mm_min_epi16(__m128i a, __m128i b)
• Lane-wise: r[i] = min(a[i], b[i]), for eight 16-bit lanes.
41. Finding the Maximum
const short* pShort = pShortArray;
int nRemain = nSize % 8; // leftover elements that don't fill a full vector
short nMaxValue = 0;
short MaxValueArray[8] = {0};
__m128i XMMCurrentValue;
__m128i XMMMaxValue = _mm_setzero_si128(); // must be initialized before the loop
for(unsigned int Index = 0; Index + 8 <= nSize; Index += 8)
{
XMMCurrentValue = _mm_loadu_si128((__m128i*)(pShort + Index)); // load 16 bytes (8 shorts)
XMMMaxValue = _mm_max_epi16(XMMMaxValue, XMMCurrentValue); // keep the lane-wise maximum
}
43. Vector Classes
• A library that wraps the intrinsic data types and functions in classes.
• More intuitive and convenient to use than raw intrinsics.
• Built on operator overloading.
44. Vector Classes
• Naming convention:
– <type><signedness><bits>vec<elements>
Iu32vec4
A vector class holding four unsigned 32-bit integers.
Is16vec8
A vector class holding eight signed 16-bit short integers.
45. Vector Classes
class | signedness | pack data type | pack size (bits) | pack count | header file
I128vec1 | unspecified | __m128i | 128 | 1 | dvec.h
I64vec2 | unspecified | __int64 | 64 | 2 | dvec.h
Is64vec2 | signed | __int64 | 64 | 2 | dvec.h
Iu64vec2 | unsigned | __int64 | 64 | 2 | dvec.h
I32vec4 | unspecified | int | 32 | 4 | dvec.h
Is32vec4 | signed | int | 32 | 4 | dvec.h
Iu32vec4 | unsigned | int | 32 | 4 | dvec.h
I16vec8 | unspecified | short | 16 | 8 | dvec.h
Is16vec8 | signed | short | 16 | 8 | dvec.h
Iu16vec8 | unsigned | short | 16 | 8 | dvec.h
I8vec16 | unspecified | char | 8 | 16 | dvec.h
Is8vec16 | signed | char | 8 | 16 | dvec.h
Iu8vec16 | unsigned | char | 8 | 16 | dvec.h
46. Vector 클래스
__asm{
movaps xmm0, a
movaps xmm1, b
addps xmm0, xmm1
movaps c, xmm0
}
#include < xmmintrin.h >
__m128 a, b, c;
c = _mm_add_ps( a, b);
#include <fvec.h>
F32vec4 A, B, C
C = A + B;
47. Vector Class Load / Store
__declspec(align(16)) short A[8] = {1,2,3,4,5,6,7,8};
__declspec(align(16)) short R[8] = {0};
Is16vec8 Vector(1,2,3,4,5,6,7,8); // constructor takes lanes in reverse order, from 8 down
_mm_store_si128((__m128i *)R, Vector); // store with an intrinsic function
printf("Store : %d, %d, %d, %d, %d, %d, %d, %d\n"
,R[0],R[1],R[2],R[3],R[4],R[5],R[6],R[7]);
Is16vec8 *a = (Is16vec8 *)A; // read directly via a pointer cast
Is16vec8 *r = (Is16vec8 *)R;
*r = *a; // assignment operator
printf("Store : %d, %d, %d, %d, %d, %d, %d, %d\n"
,R[0],R[1],R[2],R[3],R[4],R[5],R[6],R[7]);
return 0;
48. Vector Class Arithmetic
Is16vec8 A;
Is16vec8 B;
Is16vec8 R;
R = A + B; // addition
R += A;
R = A - B; // subtraction
R -= A;
R = A * B; // multiplication
R *= A;
R = mul_high( A, B ); // high bits of each product
R = mul_add( A, B ); // multiply-add
49. Finding the Maximum
const short* pShort = pShortArray;
int nRemain = nSize % 8; // leftover elements that don't fill a full vector
short nMaxValue = 0;
short MaxValueArray[8] = {0};
Is16vec8* XMMCurrentValue;
Is16vec8 XMMMaxValue(0, 0, 0, 0, 0, 0, 0, 0); // must be initialized before the loop
for(unsigned int Index = 0; Index + 8 <= nSize; Index += 8)
{
XMMCurrentValue = (Is16vec8*)(pShort + Index);
XMMMaxValue = simd_max(XMMMaxValue, *XMMCurrentValue); // keep the lane-wise maximum
}
51. Wrap-up
- Using SSE can deliver a clear performance win over plain C/C++ code.
- In the multicore era, adopting SSE is a way of preparing for the future.
- Readability is still a problem.
- The biggest gains come from repetitive, computation-heavy sections.
- Data should be aligned to cache lines for best efficiency.