eBPF has 64-bit general-purpose registers, so 32-bit architectures normally need to model each of them with a register pair and generate extra instructions to manipulate the high 32 bits of the pair. Some of this overhead could be eliminated if the JIT compiler knew that only the low 32 bits of a register are of interest. This can be discovered through data-flow (DF) analysis techniques: either the classic iterative DF analysis, or a "path-sensitive" version based on the verifier's code-path walker.
In this talk, implementations of both versions of the DF analyzer will be presented. We will first see what a def-use-chain-based classic eBPF DF analyzer looks like, and consider the possibility of integrating it with the previously proposed eBPF control-flow-graph framework to make a stand-alone global eBPF DF analyzer that could potentially serve as a library. Then another, "path-sensitive" DF analyzer based on the existing verifier code-path walker will be presented. We will discuss how function calls, path pruning, and path switching affect the implementation. Finally, we will summarize the pros and cons of each, and see how each of them could be adapted to 64-bit and 32-bit architecture back-ends.
Also, eBPF has 32-bit sub-registers and associated ALU32 instructions; enabling them (-mattr=+alu32) in LLVM code generation lets the generated eBPF sequences carry more 32-bit information, which could make flow analysis easier. This will be briefly discussed in the talk as well.
There are many reasons to convert managed languages to native code: above all performance, but also protection against reverse engineering and support for hardware technologies or specific platforms. In this talk we will look at an example of building a C#-to-C++ converter and the nuances encountered in solving this problem.
This talk covers the design of and experience with dynamic testing tools for C/C++ programs — AddressSanitizer, ThreadSanitizer, and MemorySanitizer. These tools find bugs such as use-after-free, out-of-bounds accesses to arrays and objects, data races in multithreaded programs, and uses of uninitialized memory.
Presentation from DICE Coder's Day (2010 November) by Johan Torp:
This talk is about making object-oriented code more cache-friendly and how we can incrementally move towards parallelizable data-oriented designs. Filled with production code examples from Frostbite’s pathfinding implementation.
This is the SIMD programming slide deck from CS 240A, Winter 2016.
Single instruction, multiple data stream: SIMD, pronounced "sim-dee".
A SIMD computer applies a single instruction stream to multiple data streams — for example, one vector instruction can process several iterations of a loop at once.
Microprocessor architecture,
Organisation & operation of microcomputer systems.
Hardware and software interaction.
Programme and data storage.
Parallel interfacing and programmable ICs.
Serial interfacing, standards and protocols.
Analogue interfacing. Interrupts and DMA.
Microcontrollers and small embedded systems.
The CPU, memory and the operating system.
This slide deck focuses on the eBPF JIT compilation infrastructure and the important role it plays in the entire eBPF life cycle inside the Linux kernel. The kernel first performs a number of control-flow checks to reject vulnerable programs, and then JIT-compiles the eBPF program to either host or offload-target instructions, which boosts performance. However, there is little documentation on this topic, which this slide deck will dive into.
3. What is SIMD?
• The extreme optimization for C/C++.
• Pointers only.
• You have to define exactly how memory, registers, and the compiler behave; that is what lets you push the limit.
(Diagram: SIMD sits between the C/C++ level and the assembly level.)
4. C=A+B ?
float arr0[4] = { 1,2,3,4 };
float arr1[4] = { 5,6,7,8 };
float arr2[4] = { 0 };
(Diagram: vectors A and B are added element-wise to give C: C = A + B.)
Result: arr2 => { 6, 8, 10, 12 }
5. Why is SIMD fast?
for(int i=0;i<4;i++)
    arr2[i] = arr0[i] + arr1[i];
which is
for(int i=0;i<4;i++)
    *(arr2 + i) = *(arr0 + i) + *(arr1 + i);
Assume every instruction set gives 1 instruction and 1 cycle per operation. The annotated costs are 1 for loop setup, 1*4 for the loop control, (1+1)*4 for each of the three address computations, and 1*4 for each of the two memory accesses — about 37 cycles in total for the scalar version.
6. Why is SIMD fast?
float32x4_t a,b,c;
a = *(float32x4_t *)arr0;
b = *(float32x4_t *)arr1;
c = a + b;
*(float32x4_t *)arr2 = c;
4 cycles — about 9x faster.
15. Memory
• As we approach physical limits, CPU operation is not the bottleneck: CPU speed keeps improving every year, while the speed of memory transfers stays roughly constant.
• SIMD succeeds in cutting more than 4x the CPU cycles when data is processed in parallel.
• The remaining latency comes from moving data along Memory → L2 → L1 → register loads/stores.
17. L1 cache
• Arrays created inside a function live on the stack.
• When more registers are needed for holding data than exist, the excess is written back to the L1 cache via the stack pointer.
• Function arguments transfer data. (With partial -O3 optimization they are passed through registers, with no write-back.)
• A function call saves the current register data via the stack pointer, and reads the data back into registers when the call finishes.
• An interrupt or thread switch writes the current register data through to memory, depending on OS behavior.
18. I-Cache/D-Cache
• Instruction cache:
  – Holds the compiled CPU instructions (the function symbol's code size). The first execution of a function prefetches it; the second time it is in cache. In a computer-vision application, the first execution of a function has to be excluded when measuring efficiency.
• Data cache:
  – This is the L1 cache we usually mean; the established methodology is to prefetch data into L1.
19. Page Table
• A page is 4096 bytes.
• A cache line is 64 bytes.
• A page contains 64 cache lines.
• L2 cache: 5~10 MB.
• L1 cache: 512 KB~1 MB.
• L1 entries: 2-way or 4-way associative.
• An image: 320*240 or 640*480 bytes.
• Will there be heavy cache misses when memory usage exceeds the cache size?
20. Cache line
• 64 bytes = 16 floats
• 128 bits = 4 floats
(Diagram: a cache line split into two 64-bit address halves.)
21. Worldview
• In the SIMD world, if you want to reach the limit, a look-up table is usually not the optimal method: if the table is large, using vector registers instead will cut 4x the cycles and be even faster.
• In extreme optimization, once you keep loads/stores small, the effect is very obvious!
22. Known Methodology
int arr0[100] = {1,2,3…};                        // global: memory (DDR3)
void test1 (float *src,float *dst,int len)
{
    int arr1[100] = {1,2,3…};                    // stack: L1 cache
    int b = 4;                                    // constant: instruction stream
    int *arr2 = (int *)malloc(100*sizeof(int));  // heap: memory (DDR3)
    int c = len + b;
    …
}
24. The Compiler Is Not As Smart As You Think
void test0(float *src_dst,int len)
{
    float4 *src_dst_ptr = (float4 *)src_dst;
    float4 cc = *src_dst_ptr + *src_dst_ptr;
    *src_dst_ptr += cc;
    …
}
Three loads, one store.
25. Correct Writing
void test0(float *src_dst,int len)
{
    float4 *src_dst_ptr = (float4 *)src_dst;
    float4 val = *src_dst_ptr;
    float4 cc = val + val;
    *src_dst_ptr = cc + val;
    …
}
One load, one store.
26. Use Arrays Less, Use Pointer++ More
void test1(float *src,float *dst,int len)
{
    float4 *src_ptr = (float4 *)src;
    float4 *dst_ptr = (float4 *)dst;
    float4 reg0, reg1, …;
    for(int i=0;i<len;i+=4)
    {
        reg0 = *src_ptr++;
        reg1 = *src_ptr++;
        reg0 = reg0 + reg1;
        …
        *dst_ptr++ = reg0;
        *dst_ptr++ = reg1;
    }
}
• Not recommended:
void test2(float4 *src,float4 *dst,int len)
{
    int len_4 = len/4;
    float4 reg0, reg1, …;
    for(int i=0;i<len_4;i+=2)
    {
        reg0 = src[i] + src[i+1];
        …
        dst[i] = reg0;
        dst[i+1] = src[i+1];
    }
}
27. Single Source And Destination, To Avoid Cache Misses/Page Faults
void test1(float *src_dst, int len)
{
    float4 *src_dst_ptr = (float4 *)src_dst;
    float4 reg0, reg1, …;
    for(int i=0;i<len;i+=4)
    {
        reg0 = *src_dst_ptr++;
        reg1 = *src_dst_ptr++;
        reg0 = reg0 + reg1;
        …
        *src_dst_ptr++ = reg0;
        *src_dst_ptr++ = reg1;
    }
}
28. • A cache line is 64 bytes; use 16-byte address alignment.
• If a vector register load/store is not at an address that is a multiple of 16:
  – latency penalty;
  – depending on the CPU architecture, aligned/unaligned cases will almost always occur.
(Diagram: bytes from 0x0000 through 0x0020; 128-bit loads/stores should start at 16-byte boundaries.)
31. Registers
• Arm64
  – 32 vector registers
  – 32 scalar registers
• Arm32
  – 16 vector registers
  – 32 scalar registers
• Intel SSE
  – 16 vector registers
  – 16 scalar registers
• DSP
  – ? vector registers
  – ? scalar registers
You have to keep track of the number of registers in use (very important!!).
32. Why
• Under the premise of all-float data, 32 vector registers can provide:
  – 128 float slots to simulate an array (4*32);
  – extreme operation with no need to write back to memory, with the help of shuffle instructions.
• If you use more than 32 vector variables at the same time, the excess data is written back to L1, causing latency.
33. Why
float arr[4*32+4] = {…};
float4 *arr_ptr = (float4 *)arr;
float4 a0,a1,a2,a3,a4,a5,a6 … a32;
a0 = *arr_ptr++;
a1 = *arr_ptr++;
…
a32 = *arr_ptr++;
With more than 32 register variables live at the same time, the code incurs extra loads/stores during the operation and cannot be fully optimized.
34. About Registers
• A register has no data type in itself; the type is defined only by the instructions used at the assembly level.
• The variables holding data must not exceed the maximum number of registers on the CPU, but should utilize them fully.
• Vector registers:
  – make good use of shuffle instructions;
  – rearrange input/output data.
35. Act of Load/Store
float4 *src_ptr = (float4 *)src;
float4 *dst_ptr = (float4 *)dst;
reg128 reg0,reg1,reg2,reg3… reg31;
for(int i=0;i<640*480;i+=4) {
    reg0._float4 = *src_ptr++;    // read all at once
    reg1._float4 = *src_ptr++;
    reg2._float4 = *src_ptr++;
    …
    // main algorithm
    …
    *dst_ptr++ = reg0._float4;    // write all at once
    *dst_ptr++ = reg1._float4;
    *dst_ptr++ = reg2._float4;
    …
}
Register budget: 2 general-purpose registers for addressing (src_ptr, dst_ptr), 1 general-purpose register for the loop counter, and all 32 vector registers fully utilized.
36. Act of Function Call
void test1(float *src,float *dst,int len) {
    int a = len/4;
    int b = len%4;
    float4 aa = *(float4 *)src;
    float4 bb = *(float4 *)dst;
    float4 cc = aa + bb;
    int val = test2(src,dst,len);
    cc = aa + bb + cc;
    int c = (a+b+len)*val;
    …
}
Around the call to test2, all managed through the stack pointer: 2 general-purpose registers are written to L1 cache (producing loads/stores); 1 vector register is written to L1 cache (producing loads/stores); the src and dst addresses are read from L1 back into general-purpose registers; the registers are cleaned up and the arguments read from L1 into registers; after the call, the original data is restored from L1 into the vector/general-purpose registers.
37. Act of Function Arguments
void test3(float4 aa,float4 *bb,float4 &cc) {
    …
}
void test4(float a,float *b,float &c) {
    …
}
int main() {
    float4 aa = { 0,0,0,0 },bb={1,1,1,1},cc = {2,2,2,2};
    float a = 0,b=1,c=2;
    test3(aa,&bb,&cc);
    test4(a,&b,&c);
}
Under the premise of -O3: the arguments passed by address or by reference (bb, cc, b, c) each produce loads/stores; the arguments passed by value (aa, a) go directly through registers!
38. Key Points
• Call by address and call by reference both access the L1 cache; unless inlining succeeds, they must be slow.
• Reduce function usage, all the way to the end.
42. Act of Branch Instructions
• Compared with a normal comparison operation, a branchless SIMD comparison is more than 4x faster.
• There is no branch prediction involved: the CPU pipeline can never mispredict and flush; it just runs to the end (explosively fast).
43. Act of Shuffle
• The instruction sets are like the sea; find the best-fit shuffle.
  – That is the key point of extreme optimization of the mathematical model.
  – If you haven't written a shuffle, you can't say you have written SIMD.
51. Methodology of O3 Optimization
• Clang or GCC? ???
• New versions >>>>> old versions.
52. Latency && Throughput
• Consecutive loads or consecutive stores reduce latency.
• Specific pipeline rearrangement can reduce latency gaps.
  – VLIW, SLOT
• Register instructions carry dependency penalties.
  – Rearrange loads/stores yourself.
  – The compiler can also deal with dependency penalties.
53. About inline
• "Always inline" has a chance to fail; you should check the function symbol in the assembly.
  – In Clang, the number of lines of code is the key factor.
54. Vector Register Optimization
• Specific algorithms can NOT be optimized with SIMD by a contemporary compiler, because about 90% of algorithms need a lot of shuffle instructions; you have to work them out on paper yourself.
• The compiler only knows how to vectorize simple for-loop unrolling:
for(int i=0;i<64;i++)
{
    ….
}
The compiler says: I know how to do vectorization here.
55. Read Element and Write Back
reg128 reg0;
float4 a = {0,1,2,3};
reg0._float4 = a;
float2 val1 = reg0._float2[0];
reg0._float2[1] = val1;
float val0 = reg0._float[2];
reg0._float[3] = val0;
1. It depends on whether a lane insert/extract instruction is supported; if not, the access is written through L1 with loads/stores, the same as an array.
2. It depends on whether the compiler is smart or not!!
(Diagram: lanes 0 1 2 3 with per-lane read and write arrows.)
58. • Fix all of the algorithm's parameters.
  – Make them constant values.
• Remove the branch prediction; the code will be very large, but fast.
• Don't doubt it: the code easily runs past 4000 lines.
59. Conception of SIMD Optimization
FunctionA (Algorithm A) → FunctionB (Algorithm B) → FunctionC (Algorithm C) → FunctionEnd (Final Algorithm)
Developing each intermediate function takes one month — wasted time. The earlier code is of no use; you only need to develop the final algorithm, for one month.
61. About Data
• Large data has to:
  – satisfy a multiple of 4;
  – have a known maximum quantity.
• With input data rearrangement, you can fly in the sky.
• When a multiple of 4 is not met:
  – pad with zeros and still use SIMD;
  – or handle the tail with general-purpose registers at the end.
62. Data Rearrangement
Interleaved rows of { a b a b a b … } are rearranged into separate planes of { a a a a … } and { b b b b … }; likewise, rows of { a b c a b c … } become planes of { a a … }, { b b … }, and { c c … }.
65. SIMD
for(int i=0;i<height;i++) // top
{ … }
for(int i=0;i<height;i++)
{
    // left
    for(int j=0;j<width;j++) // middle
    { … }
    // right
}
for(int i=0;i<height;i++) // bottom
{ … }
66. To Cooperate With SIMD: Crazy, Unlimited Unrolling
The I-cache is enough (over 32 KB); if it is not enough, we'll talk about it then.
67. Conclusion
• SIMD is strongly linked to mathematics.
• It is an unknown field, with almost no courses on it.
• There is little material on the internet, and few people pull it off.
• Do you want to develop new algorithms? You can try it.