1. An Introduction to SequenceL
Auto-Parallelizing Programming Language and Toolset
www.texasmulticore.com
Brad Nemanich, PhD
Chief Technology Officer
2. Why is SequenceL Needed?
“The way the processor industry is going is
to add more and more cores, but nobody
knows how to program those things. I mean,
two, yeah; four, not really; eight, forget it.”
– Steve Jobs
© 2015 Texas Multicore Technologies, Inc.
All Rights Reserved
This shift now affects every software company,
large enterprise, and government agency that
develops software
4. Current (Manual) Approach to Multicore Programming
8 “Simple” Rules for Designing Threaded Applications
(0. Hire a team of “Parallel Ninjas”, PhD experts in computer architecture.)
1. Be sure you identify truly independent computations.
2. Implement concurrency at the highest level possible.
3. Plan early for scalability to take advantage of increasing numbers of cores.
4. Make use of thread-safe libraries wherever possible.
5. Use the right threading model.
6. Never assume a particular order of execution.
7. Use thread-local storage whenever possible; associate locks to specific data, if needed.
8. Don’t be afraid to change the algorithm for a better chance of concurrency.
“The significant problems we face cannot be solved using
the same level of thinking we used when we created them.”
– Albert Einstein
5. “Parallel Ninja” Approach Does Not Scale
How do you:
─ find them?
─ afford them?
─ retain them?
─ support rapid innovation?
─ ensure accuracy and correctness?
─ keep them current on platform technologies?
─ do this for all your software?
Einstein was right: there’s a much better way…
8. It’s Time to Change the Game (Again)
[Figure: timeline of rising programming abstraction, with year markers 1949, 1954, 1957, 1980, and 2014. Stages shown, from oldest to newest: Wiring (netlist); Machine Code; Assembly Language; HLL + Compiler (Fortran, COBOL, PL/I, Lisp, C, …); Object Oriented (Smalltalk, C++, Java, C#); and Functional, Auto-Parallelizing, layered on Object Oriented C++. The arrival of multicore is marked at 2004.]
9. SequenceL is a Game Changer
─ Faster performance; uses all cores, GPUs
─ 10X faster time to innovation/market
─ Get it right the first time
─ Quickly leverage new computing platforms
─ Built upon open industry standards; works with existing tools & methodologies
10. Customer Example: Industrial Control Networking
(WirelessHART, IEC 62591, IEEE 802.15.4)
New algorithm, developed for large, noisy industrial process control environments
─ Presented white paper to IEEE
─ Won an award
Asked TMT to implement for comparison purposes
─ Finished in SequenceL in 3 weeks
    ─ 10X faster performance and right the first time
─ Java finished by the inventors in 3 months
    ─ Had errors and much slower; used SequenceL code to debug Java
    ─ Another month getting code correct
    ─ A 5th month improving performance that still fell short
Bottom line
─ SL was finished in 15% of the time
─ SL was correct the first time
─ SL out-performed the Java code 1.5x-3.0x on a 2-core AMD APU
─ Robust and fast code, fast time to market
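The 15% figure is consistent with the timeline above (3 weeks for SequenceL against 5 months in total for Java) if one assumes roughly four working weeks per month. That conversion is an assumption, not something stated on the slide:

```latex
\frac{3\ \text{weeks}}{5\ \text{months} \times 4\ \text{weeks/month}} = \frac{3}{20} = 15\%
```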
11. Customer Example: Video Processing Using SequenceL
Goal: 30 Hz to keep up with the input video feed
Best performance (8-core x86 platform)
─ 58 Hz: SequenceL
─ 21 Hz: MATLAB (Interpreter)
─ 1.2 Hz: MATLAB (Coder/C-out)
Input video feed (e.g., Apache helicopter gyro camera)
Processed video (proprietary algorithms remove air turbulence, radiated heat, etc.)
15. What is SequenceL?
SequenceL is a…
High-Abstraction
Functional
Self-Parallelizing
…programming language and tool set
…designed to work in concert with other popular programming languages and tools
16. High-Abstraction, High Performance
Most common programming languages are imperative
─ Detailed sequence of commands for carrying out the computation; i.e., tell the computer both “what” to do and “how” to do it
─ Inherently sequential, written for classic von Neumann computers
─ e.g., C/C++, Java, C#, Python, Fortran
─ Some add explicit “directives” to manually enable low-level parallelism
SequenceL is declarative & functional – higher abstraction
─ Describe the desired output in terms of the input, as functions; i.e., tell the computer only “what” to do, with no need to think about parallelism
─ Abstracts away complex multicore and many-core platforms
Best analogy is the SQL database language
─ A programmer could write their own database procedures in low-level C
─ But the result would be error-prone and would not perform as well as Oracle or DB2
17. Drops Into Your Current Design Flow
Designed to work in concert with other programming languages, legacy code, and libraries
Additive: works with existing design flows, tools, and training
Builds upon open industry standards
18. Drops Into Your Current Design Flow
Adds a multicore “power tool” to the programmer’s toolbox
Complete add-on solution
─ IDE plug-ins, debugger, interpreter, auto-parallelizing compiler, runtime environment
Easy to modernize legacy applications
─ Parallel C++ output enables just a portion to be refactored in SequenceL and linked in
─ Uses vector (SIMD) processor instructions
─ Automatic OpenCL generation averts the need to learn and incorporate low-level CUDA or OpenCL code and associated scaffolding to exploit systems with (GP)GPUs
─ Often faster to refactor portions of code in SequenceL than to find and fix bugs in old code
20. The Problem With Directive-Based Programming
Example: 3-body problem
//P1
a1 = grav(P1, P2, m2) + grav(P1, P3, m3);
dv1 = a1*dt;
v1 = v1 + dv1;
dp1 = v1*dt;
//P2
a2 = grav(P2, P1, m1) + grav(P2, P3, m3);
dv2 = a2*dt;
v2 = v2 + dv2;
dp2 = v2*dt;
//P3
a3 = grav(P3, P2, m2) + grav(P3, P1, m1);
dv3 = a3*dt;
v3 = v3 + dv3;
dp3 = v3*dt;
Each body can be calculated at the same time, giving a 3x speedup in theory.
21. The Problem With Directive-Based Programming
Example: 3-body problem
#pragma omp parallel
#pragma omp single nowait
{
    #pragma omp task
    {
        a1 = grav(P1, P2, m2) + grav(P1, P3, m3);
        dv1 = a1*dt;
        v1 = v1 + dv1;
        dp1 = v1*dt;
    }
    #pragma omp task
    {
        a2 = grav(P2, P1, m1) + grav(P2, P3, m3);
        dv2 = a2*dt;
        v2 = v2 + dv2;
        dp2 = v2*dt;
    }
    #pragma omp task
    {
        a3 = grav(P3, P2, m2) + grav(P3, P1, m1);
        dv3 = a3*dt;
        v3 = v3 + dv3;
        dp3 = v3*dt;
    }
    #pragma omp taskwait
}
Using directive-based approaches like OpenMP, the burden is on the programmer to identify where the program can be safely parallelized.
The programmer then has to add the correct pragmas.
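For readers who want to compile the task-based version, the following is a minimal C++ sketch of the same pattern. The `Vec3` and `Body` types, the `advance` and `step` helpers, and the body of `grav` (Newtonian acceleration with G = 1) are all assumptions made for illustration; the slides do not define them. If OpenMP is not enabled at compile time, the pragmas are ignored and the three updates simply run one after another with the same result.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical 3-D vector type; the slides do not define one.
struct Vec3 { double x, y, z; };

inline Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline Vec3 operator*(Vec3 a, double s) { return {a.x * s, a.y * s, a.z * s}; }

// Assumed body for grav: acceleration of a body at p caused by a mass m
// located at q, with G = 1. It only reads its arguments.
inline Vec3 grav(Vec3 p, Vec3 q, double m) {
    const Vec3 d = q - p;
    const double r2 = d.x * d.x + d.y * d.y + d.z * d.z;
    assert(r2 > 0.0);  // the two bodies must not coincide
    return d * (m / (r2 * std::sqrt(r2)));
}

// Hypothetical container for one body's state.
struct Body { Vec3 pos; Vec3 vel; double mass; };

// Update the velocity of `self` for one time step and return its
// displacement. Only self.vel is written; positions and masses are read.
inline Vec3 advance(Body& self, const Body& o1, const Body& o2, double dt) {
    const Vec3 a = grav(self.pos, o1.pos, o1.mass) + grav(self.pos, o2.pos, o2.mass);
    self.vel = self.vel + a * dt;
    return self.vel * dt;
}

// One OpenMP task per body. Each task writes a different velocity and a
// different dp slot, and no task writes a position, so the tasks are
// independent of one another.
inline void step(Body& b1, Body& b2, Body& b3, double dt, Vec3 dp[3]) {
    #pragma omp parallel
    #pragma omp single nowait
    {
        #pragma omp task
        dp[0] = advance(b1, b2, b3, dt);
        #pragma omp task
        dp[1] = advance(b2, b1, b3, dt);
        #pragma omp task
        dp[2] = advance(b3, b2, b1, dt);
        #pragma omp taskwait
    }
}
```

Note that the independence argument in the comments had to be worked out by hand, which is exactly the burden described above.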
22. The Problem With Directive-Based Programming
Example: 3-body problem
But maybe you could
parallelize other things…
23. The Problem With Directive-Based Programming
Example: 3-body problem
#pragma omp parallel
#pragma omp single nowait
{
    #pragma omp task
    g1 = grav(P1, P2, m2);
    #pragma omp task
    g2 = grav(P1, P3, m3);
    #pragma omp task
    g3 = grav(P2, P1, m1);
    #pragma omp task
    g4 = grav(P2, P3, m3);
    #pragma omp task
    g5 = grav(P3, P2, m2);
    #pragma omp task
    g6 = grav(P3, P1, m1);
    #pragma omp taskwait
}
a1 = g1 + g2;
dv1 = a1*dt;
v1 = v1 + dv1;
dp1 = v1*dt;
a2 = g3 + g4;
dv2 = a2*dt;
v2 = v2 + dv2;
dp2 = v2*dt;
a3 = g5 + g6;
dv3 = a3*dt;
v3 = v3 + dv3;
dp3 = v3*dt;
But now you have to start re-arranging the code, moving further away from the original description of the algorithm.
Possible race conditions!
If the grav function modifies its inputs or calls non-thread-safe functions, there could be hard-to-detect race conditions, leading to incorrect results.
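To make the warning concrete, here is a small C++ sketch contrasting two hypothetical implementations of `grav`. Neither body comes from the slides: `grav_unsafe`, `grav_pure`, `g_scratch`, and the `Vec3` type are invented for illustration. Both return the same value when called serially; only the second is safe to call from concurrent tasks.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical 3-D vector type; the slides do not define one.
struct Vec3 { double x, y, z; };

// Shared scratch storage: the kind of hidden state the slide warns about.
static Vec3 g_scratch = {0.0, 0.0, 0.0};

// NOT safe to call from concurrent tasks: every call writes g_scratch, so
// two tasks can overwrite each other's intermediate value (a data race).
Vec3 grav_unsafe(const Vec3& p, const Vec3& q, double m) {
    g_scratch = {q.x - p.x, q.y - p.y, q.z - p.z};  // shared write
    const double r2 = g_scratch.x * g_scratch.x + g_scratch.y * g_scratch.y
                    + g_scratch.z * g_scratch.z;
    const double k = m / (r2 * std::sqrt(r2));
    return {g_scratch.x * k, g_scratch.y * k, g_scratch.z * k};
}

// Safe: reads only its const arguments and writes only local variables.
Vec3 grav_pure(const Vec3& p, const Vec3& q, double m) {
    const Vec3 d = {q.x - p.x, q.y - p.y, q.z - p.z};
    const double r2 = d.x * d.x + d.y * d.y + d.z * d.z;
    assert(r2 > 0.0);  // the two bodies must not coincide
    const double k = m / (r2 * std::sqrt(r2));
    return {d.x * k, d.y * k, d.z * k};
}
```

Nothing in the signature of `grav_unsafe` reveals the problem, which is why such races are hard to detect when pragmas are added by hand.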
24. SequenceL: Self-Parallelizes, Race-Free, Readable
Example: 3-body problem
threeBody(P1, m1, P2, m2, P3, m3, dt) :=
let
a1 := grav(P1, P2, m2) + grav(P1, P3, m3);
dv1 := a1*dt;
v1 := v1 + dv1;
dp1 := v1*dt;
a2 := grav(P2, P1, m1) + grav(P2, P3, m3);
dv2 := a2*dt;
v2 := v2 + dv2;
dp2 := v2*dt;
a3 := grav(P3, P2, m2) + grav(P3, P1, m1);
dv3 := a3*dt;
v3 := v3 + dv3;
dp3 := v3*dt;
in
[dp1, dp2, dp3];
With SequenceL the programmer does not add any parallel constructs or pragmas.
The program will self-parallelize if it is safe to do so (no race conditions).
Code clarity and intent remain, greatly improving correctness and quality.
Subsequent enhancements and innovations are rapid.
This ease of reading/writing is not by accident.
25. Ease of Reading/Writing SequenceL
Matrix Multiply:
─ The product of an m×p matrix A with a p×n matrix B is
an m×n matrix denoted AB whose entries are given by:
$(AB)_{ij} = \sum_{k=1}^{p} A_{ik} B_{kj}$
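The definition above maps directly onto a triple loop in an imperative language. The C++ sketch below is only an illustration of that mapping; it is not the Java or SequenceL code shown in the original slides, and the `Matrix` alias and `matmul` name are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major matrix type: a vector of rows.
using Matrix = std::vector<std::vector<double>>;

// Multiply an m x p matrix A by a p x n matrix B, giving an m x n matrix.
// Indices are 0-based here, so k runs over 0..p-1 rather than 1..p.
Matrix matmul(const Matrix& A, const Matrix& B) {
    const std::size_t m = A.size();
    const std::size_t p = B.size();
    const std::size_t n = (p == 0) ? 0 : B[0].size();
    Matrix C(m, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < m; ++i) {
        assert(A[i].size() == p);  // inner dimensions must agree
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < p; ++k) {
                sum += A[i][k] * B[k][j];  // A_ik * B_kj
            }
            C[i][j] = sum;
        }
    }
    return C;
}
```

Each entry of the result depends only on one row of A and one column of B, which is the independence an auto-parallelizing compiler can exploit. In the imperative version that fact is implicit in the loop structure and is not stated anywhere.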
26. Ease of Reading/Writing SequenceL
Matrix Multiply in Java:
$(AB)_{ij} = \sum_{k=1}^{p} A_{ik} B_{kj}$
27. Ease of Reading/Writing SequenceL
Matrix Multiply in SequenceL:
─ The product of an m×p matrix A with a p×n matrix B is
an m×n matrix denoted AB whose entries are given by:
$(AB)_{ij} = \sum_{k=1}^{p} A_{ik} B_{kj}$
- or -
29. Sample SequenceL Performance Speedups
[Chart: speedup (“Times Faster”, vertical axis, 0 to 12) versus “Number of Processor Cores” (horizontal axis, 0 to 16). Legend entries: Matrix Multiply, Game Of Life, 2D FFT, LU factorization, QuickSort, String Search, Barnes-Hut, n-Body, Matrix Inverse, Sparse Matrix, Compression, Adesk (DC), Adesk (LW), Matrix Multiply (blocking), Semblance, Speech filter, Perfect.]
30. To learn more:
Watch a short 3-part video tutorial at:
http://www.texasmulticoretechnologies.com/resources/videos/
Email sales@texasmulticore.com for a free 45-day trial
www.texasmulticore.com