The document discusses mapping streaming applications to multicore architectures. It proposes a 3-phase approach: 1) Coarsen the stream graph by fusing stateless pipelines to reduce communication and expose optimization opportunities. 2) Data parallelize stateless filters to occupy all cores while preserving task parallelism. 3) Software pipeline stateful filters to exploit pipeline parallelism. Evaluation shows the coarse-grained approach achieves good parallelism with low synchronization overhead.
3. Multicores Are Here!

[Chart: number of cores (log scale, 1 to 512) vs. year of introduction (1970 to 20??). Uniprocessors (4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium 2, Power4, PA-8800, Opteron, Yonah, PExtreme, Power6) stay at one core; multicores (Niagara, Broadcom 1480, Opteron 4P, Xeon MP, Xbox360, Tanglewood, Cell, Raza XLR, Cavium Octeon, Raw, Cisco CSR-1, Intel Tflops, Picochip PC102, Ambric AM2045) climb toward hundreds of cores.]

For uniprocessors, C was the common machine language:
• Portable
• High Performance
• Composable
• Malleable
• Maintainable
4. Multicores Are Here!

What is the common machine language for multicores?

[Same chart as the previous slide: number of cores vs. year of introduction for uniprocessors and multicores.]
5. Common Machine Languages

Uniprocessors, common properties:
• Single flow of control
• Single memory image
Uniprocessors, differences:
• ISA, functional units, register file
• Register allocation, instruction selection, instruction scheduling

Multicores, common properties:
• Multiple flows of control
• Multiple local memories
Multicores, differences:
• Number and capabilities of cores
• Communication model
• Synchronization model

von Neumann languages represent the common properties and abstract away the differences.

Need common machine language(s) for multicores.
6. Streaming as a Common Machine Language

[Stream graph of an FM radio: AtoD, FMDemod, Scatter, three band-pass branches (LPF1/HPF1, LPF2/HPF2, LPF3/HPF3), Gather, Adder, Speaker.]

• Regular and repeating computation
• Independent filters with explicit communication
  – Segregated address spaces and multiple program counters
• Natural expression of parallelism:
  – Producer / Consumer dependencies
  – Enables powerful, whole-program transformations
7. Types of Parallelism

[Diagram: Scatter, Task, Gather.]

Task Parallelism
– Parallelism explicit in algorithm
– Between filters without producer/consumer relationship

Data Parallelism
– Peel iterations of filter, place within scatter/gather pair (fission)
– Can't parallelize filters with state

Pipeline Parallelism
– Between producers and consumers
– Stateful filters can be parallelized
8. Types of Parallelism

[Diagram: one branch shows Scatter, Data Parallel, Gather; another shows Scatter, Pipeline, Gather; Task parallelism spans the branches.]

Task Parallelism
– Parallelism explicit in algorithm
– Between filters without producer/consumer relationship

Data Parallelism
– Between iterations of a stateless filter
– Place within scatter/gather pair (fission)
– Can't parallelize filters with state

Pipeline Parallelism
– Between producers and consumers
– Stateful filters can be parallelized
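The fission transformation named above can be sketched in a few lines of Python. This is an illustrative sketch of the concept only, not the StreamIt compiler's implementation: iterations of a stateless filter are dealt round-robin to N replicas (the scatter), run independently, and re-interleaved in order (the gather).

```python
# Sketch of fission: peel the iterations of a STATELESS filter across
# n_cores replicas, then re-gather the results in the original order.

def fiss(filter_fn, inputs, n_cores):
    """Run a stateless per-item filter with round-robin scatter/gather."""
    # Scatter: deal successive iterations round-robin to the replicas.
    shards = [inputs[i::n_cores] for i in range(n_cores)]
    # Each replica runs the same stateless work function on its share.
    results = [[filter_fn(x) for x in shard] for shard in shards]
    # Gather: re-interleave in round-robin order to restore the sequence.
    out = []
    for i in range(len(inputs)):
        out.append(results[i % n_cores][i // n_cores])
    return out

double = lambda x: 2 * x
assert fiss(double, [1, 2, 3, 4, 5], 4) == [2, 4, 6, 8, 10]
```

A stateful filter breaks this: if invocation k reads a variable written by invocation k-1, each replica would need the others' state, which is why only stateless filters are fissable.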
10. Problem Statement

Given:
– Stream graph with compute and communication estimates for each filter
– Computation and communication resources of the target machine

Find:
– Schedule of execution for the filters that best utilizes the available parallelism to fit the machine resources
11. Our 3-Phase Solution

Coarsen Granularity → Data Parallelize → Software Pipeline

1. Coarsen: Fuse stateless sections of the graph
2. Data Parallelize: Parallelize stateless filters
3. Software Pipeline: Parallelize stateful filters

Compile to a 16 core architecture:
– 11.2x mean throughput speedup over single core
12. Outline

• StreamIt language overview
• Mapping to multicores
  – Baseline techniques
  – Our 3-phase solution
13. The StreamIt Project

[Compiler flow: a StreamIt Program enters the Front-end, producing Annotated Java; this feeds a Simulator (Java Library) and Stream-Aware Optimizations; backends emit C/C++ (Uniprocessor), MPI-like C/C++ (Cluster), C per tile + msg code (Raw), and a Streaming X10 runtime (IBM X10).]

• Applications
  – DES and Serpent [PLDI 05]
  – MPEG-2 [IPDPS 06]
  – SAR, DSP benchmarks, JPEG, …
• Programmability
  – StreamIt Language (CC 02)
  – Teleport Messaging (PPOPP 05)
  – Programming Environment in Eclipse (P-PHEC 05)
• Domain Specific Optimizations
  – Linear Analysis and Optimization (PLDI 03)
  – Optimizations for bit streaming (PLDI 05)
  – Linear State Space Analysis (CASES 05)
• Architecture Specific Optimizations
  – Compiling for Communication-Exposed Architectures (ASPLOS 02)
  – Phased Scheduling (LCTES 03)
  – Cache Aware Optimization (LCTES 05)
  – Load-Balanced Rendering (Graphics Hardware 05)
14. Model of Computation

• Synchronous Dataflow [Lee '92]
  – Graph of autonomous filters
  – Communicate via FIFO channels
• Static I/O rates
  – Compiler decides on an order of execution (schedule)
  – Static estimation of computation

[Example graph: A/D, Band Pass, Duplicate, then four parallel Detect filters, each feeding an LED.]
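Because I/O rates are static, the compiler can solve the per-channel balance equation (producer firings × push rate = consumer firings × pop rate) for a steady-state schedule. The sketch below is my own illustration of that idea for a simple pipeline; the filter names follow the slide's graph, but the push/pop rates are invented for the example.

```python
# Sketch: compute a steady-state repetition vector for a synchronous
# dataflow pipeline from static push/pop rates. The balance equation
#   reps[up] * push[up] == reps[down] * pop[down]
# must hold on every channel so FIFO depths stay bounded.
from fractions import Fraction
from math import lcm

def repetition_vector(chain):
    """chain: list of (name, push, pop) filters connected in a pipeline."""
    rate = Fraction(1)                  # firings per firing of the first filter
    rates = {chain[0][0]: rate}
    for (up, push, _), (down, _, pop) in zip(chain, chain[1:]):
        rate = rate * push / pop        # enforce the balance equation
        rates[down] = rate
    # Scale to the smallest all-integer solution.
    scale = lcm(*(r.denominator for r in rates.values()))
    return {name: int(r * scale) for name, r in rates.items()}

# Invented rates: A/D pushes 2 samples per firing; the band-pass stage
# consumes 3 and pushes 1; the detector consumes 1.
chain = [("AtoD", 2, 0), ("BandPass", 1, 3), ("Detect", 0, 1)]
print(repetition_vector(chain))   # {'AtoD': 3, 'BandPass': 2, 'Detect': 2}
```

In the resulting steady state, A/D produces 3 × 2 = 6 items and BandPass consumes 2 × 3 = 6, so the channel is balanced and the schedule can repeat forever.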
15. Example StreamIt Filter

[Input tape holds items 0 through 11; FIR consumes them and produces output items 0, 1, …]

float→float filter FIR (int N, float[N] weights) {
  work push 1 pop 1 peek N {
    float result = 0;
    for (int i = 0; i < N; i++) {
      result += weights[i] ∗ peek(i);
    }
    pop();
    push(result);
  }
}

This filter is stateless.
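For readers unfamiliar with StreamIt's tape operations, here is a rough Python transliteration of the filter above (mine, not part of the talk): peek(i) becomes indexing into the input tape, pop() advances the tape by one, and push appends to the output.

```python
# Python sketch of the stateless FIR filter: each work invocation peeks
# N items, pushes one weighted sum, and pops one item. No value written
# in one invocation is read by the next, which is what makes the filter
# stateless and therefore fissable.

def fir_work(weights, tape):
    """Run the FIR work function until the tape has fewer than N items."""
    n, out = len(weights), []
    while len(tape) >= n:               # enough items to peek(0 .. N-1)
        out.append(sum(w * tape[i] for i, w in enumerate(weights)))  # push
        tape = tape[1:]                 # pop()
    return out

print(fir_work([0.5, 0.5], [1, 2, 3, 4]))   # [1.5, 2.5, 3.5]
```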
16. Example StreamIt Filter

[Input tape holds items 0 through 11; FIR consumes them and produces output items 0, 1, …]

float→float filter FIR (int N) {
  float[N] weights;
  work push 1 pop 1 peek N {
    float result = 0;
    weights = adaptChannel(weights);
    for (int i = 0; i < N; i++) {
      result += weights[i] ∗ peek(i);
    }
    pop();
    push(result);
  }
}

This filter is stateful: weights is rewritten on every invocation, so each execution depends on the previous one.
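A companion sketch (again mine, with a hypothetical stand-in for adaptChannel) makes the carried dependence visible: iteration k reads weights written by iteration k-1, so iterations can no longer be peeled onto different cores; only pipeline parallelism applies to this filter.

```python
def adapt_channel(weights):
    """Hypothetical stand-in for adaptChannel(): decay each weight."""
    return [w * 0.9 for w in weights]

def adaptive_fir_work(weights, tape):
    """Stateful FIR: the weights carried across iterations are filter state."""
    out = []
    while len(tape) >= len(weights):
        weights = adapt_channel(weights)   # carried dependence: state!
        out.append(sum(w * tape[i] for i, w in enumerate(weights)))
        tape = tape[1:]
    return out
```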
17. StreamIt Language Overview

• StreamIt is a novel language for streaming
  – Exposes parallelism and communication
  – Architecture independent
  – Modular and composable
    • Simple structures composed to create complex graphs
  – Malleable
    • Change program behavior with small modifications

[Constructs: filter; pipeline (each stage may be any StreamIt language construct); splitjoin (splitter, parallel computation, joiner); feedback loop (joiner, body, splitter).]
18. Outline

• StreamIt language overview
• Mapping to multicores
  – Baseline techniques
  – Our 3-phase solution
19. Baseline 1: Task Parallelism

[Stream graph: Splitter feeding two pipelines (BandPass, Compress, Process, Expand, BandStop), then Joiner and Adder.]

• Inherent task parallelism between two processing pipelines
• Task Parallel Model:
  – Only parallelize explicit task parallelism
  – Fork/join parallelism
• Execute this on a 2 core machine: ~2x speedup over single core
• What about 4, 16, 1024, … cores?
20. Evaluation: Task Parallelism

Raw microprocessor: 16 in-order, single-issue cores with D$ and I$; 16 memory banks, each bank with DMA; cycle-accurate simulator.

[Bar chart: throughput normalized to single-core StreamIt (scale 0 to 19) for BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar, and the Geometric Mean.]

Parallelism: Not matched to target!
Synchronization: Not matched to target!
21. Baseline 2: Fine-Grained Data Parallelism

[Stream graph: every filter (BandPass, Compress, Process, Expand, BandStop, Adder) replicated four ways inside its own splitter/joiner pair.]

• Each of the filters in the example is stateless
• Fine-grained Data Parallel Model:
  – Fiss each stateless filter N ways (N is number of cores)
  – Remove scatter/gather if possible
• We can introduce data parallelism
  – Example: 4 cores
• Each fission group occupies entire machine
23. Outline

• StreamIt language overview
• Mapping to multicores
  – Baseline techniques
  – Our 3-phase solution
24. Phase 1: Coarsen the Stream Graph

[Stream graph: Splitter feeding two pipelines (BandPass, Compress, Process, Expand, BandStop), then Joiner and Adder; the BandPass, BandStop, and Adder filters are marked as peeking.]

• Before data-parallelism is exploited
• Fuse stateless pipelines as much as possible without introducing state
  – Don't fuse stateless with stateful
  – Don't fuse a peeking filter with anything upstream
25. Phase 1: Coarsen the Stream Graph

[Stream graph after fusion: Splitter feeding two fused BandPass+Compress+Process+Expand filters, each followed by a BandStop, then Joiner and Adder.]

• Before data-parallelism is exploited
• Fuse stateless pipelines as much as possible without introducing state
  – Don't fuse stateless with stateful
  – Don't fuse a peeking filter with anything upstream
• Benefits:
  – Reduces global communication and synchronization
  – Exposes inter-node optimization opportunities
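The fusion step above can be sketched as simple function composition. This is a hedged illustration of the idea, not the StreamIt compiler's fusion pass, and the four stage functions are invented stand-ins: composing stateless per-item filters yields one coarse filter that is still stateless (so Phase 2 can fiss it), while the intermediate values now stay on one core instead of crossing the chip.

```python
# Sketch of Phase 1: fuse a pipeline of stateless per-item filters into
# one coarse filter via function composition. The fused unit introduces
# no state, so it remains data-parallelizable.

def fuse(*stages):
    """Fuse a pipeline of stateless per-item filters into one filter."""
    def fused(x):
        for stage in stages:            # whole pipeline runs locally
            x = stage(x)
        return x
    return fused

band_pass = lambda x: x + 1             # invented stand-ins for the filters
compress  = lambda x: x * 2
process   = lambda x: x - 3
expand    = lambda x: x * 10

coarse = fuse(band_pass, compress, process, expand)
print(coarse(5))    # ((5 + 1) * 2 - 3) * 10 = 90
```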
26. Phase 2: Data Parallelize

Data Parallelize for 4 cores: fiss 4 ways, to occupy entire chip.

[Stream graph: the Adder is fissed four ways inside a splitter/joiner pair; the two fused BandPass+Compress+Process+Expand pipelines and the BandStop filters are unchanged.]
27. Phase 2: Data Parallelize

Data Parallelize for 4 cores

[Stream graph: each fused BandPass+Compress+Process+Expand filter is fissed two ways inside its own splitter/joiner pair; the Adder is fissed four ways; the BandStop filters are unchanged.]

Task parallelism! Each fused filter does equal work.
Fiss each filter 2 times to occupy entire chip.
28. Phase 2: Data Parallelize

Data Parallelize for 4 cores

[Stream graph: the fused filters and the BandStop filters are each fissed two ways; the Adder is fissed four ways.]

• Task-conscious data parallelization
  – Preserve task parallelism
• Benefits:
  – Reduces global communication and synchronization

Task parallelism, each filter does equal work.
Fiss each filter 2 times to occupy entire chip.
29. Evaluation: Coarse-Grained Data Parallelism

[Bar chart: throughput normalized to single-core StreamIt (scale 0 to 19) comparing Task, Fine-Grained Data, and Coarse-Grained Task + Data mappings for BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar, and the Geometric Mean.]

Good Parallelism! Low Synchronization!
33. We Can Do Better!

[Diagram: the stream graph annotated with per-filter work estimates, including a heavyweight RectPolar filter, and its mapping onto cores over time.]

Target 4 core machine
34. Phase 3: Coarse-Grained Software Pipelining

[Diagram: a prologue schedule followed by a new steady state in which RectPolar instances from different iterations of the graph execute concurrently.]

• New steady-state is free of dependencies
• Schedule new steady-state using a greedy partitioning
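The prologue/steady-state structure can be illustrated with a toy two-stage simulation (my construction, not the paper's scheduler): after a prologue fills the pipeline, iteration i of the consumer runs in the same cycle as iteration i+1 of the producer, so even stateful filters on different cores execute concurrently in the steady state.

```python
# Sketch of coarse-grained software pipelining for a two-stage pipeline.
# Each simulated "cycle" runs both stages at once on data from different
# iterations of the stream graph.

def software_pipeline(stages, inputs):
    produce, consume = stages
    in_flight, outputs, cycles = None, [], 0
    for x in inputs + [None]:           # one extra cycle drains the pipeline
        cycles += 1
        if in_flight is not None:
            outputs.append(consume(in_flight))      # stage 2, iteration i
        in_flight = produce(x) if x is not None else None  # stage 1, i+1
    return outputs, cycles

out, cycles = software_pipeline((lambda x: x * 2, lambda x: x + 1),
                                [1, 2, 3])
print(out, cycles)   # [3, 5, 7] 4
```

Three items finish in 4 cycles instead of the 6 a purely sequential execution of the two stages would need; the first cycle is the prologue, and every later cycle keeps both stages busy.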
37. Generalizing to Other Multicores

• Architectural requirements:
  – Compiler controlled local memories with DMA
  – Efficient implementation of scatter/gather
• To port to other architectures, consider:
  – Local memory capacities
  – Communication to computation tradeoff
• Did not use processor-to-processor communication on Raw
38. Related Work

• Streaming languages:
  – Brook [Buck et al. '04]
  – StreamC/KernelC [Kapasi '03, Das et al. '06]
  – Cg [Mark et al. '03]
  – SPUR [Zhang et al. '05]
• Streaming for Multicores:
  – Brook [Liao et al. '06]
• Ptolemy [Lee '95]
• Explicit parallelism:
  – OpenMP, MPI, & HPF
39. Conclusions

• Streaming model naturally exposes task, data, and pipeline parallelism
• This parallelism must be exploited at the correct granularity and combined correctly

                  Task         Fine-Grained   Coarse-Grained   Coarse-Grained Task +
                               Data           Task + Data      Data + Software Pipeline
Parallelism       Not matched  Good           Good             Best
Synchronization   Not matched  High           Low              Lowest

• Good speedups across varied benchmark suite
• Algorithms should be applicable across multicores
Editor's Notes
Let's look at a simplified graph showing, on a log scale, the number of cores for some commodity microprocessors versus their date of introduction.
For 35 years, uniprocessor designs dominated the commodity microprocessor market.
But their end has come due to the limited scalability of their global, monolithic structures.
The major cpu vendors have stopped development of uniprocessor designs.
In the last 5 years, commodity processor designers have turned to multicore designs to continue performance scalability.
From 2 cores up to hundreds of cores.
During the age of the uniprocessor, programmers had it pretty easy. They could write a piece of code in C or another von Neumann language and have it be portable and scalable across the generations of uniprocessor designs.
Furthermore, von Neumann languages had all these other great benefits…
debuggable
We could say that C and the von Neumann languages were the common machine languages for uniprocessors.
They encapsulated the common properties but abstracted away the differences across uniprocessor architectures.
Now that multicore designs are becoming dominant, we want to ask ourselves what is the common machine language for them.
We would like to write a program once and have it be portable and scale with future generations of multicore designs.
Also, we want parallel programming to become as easy as sequential programming; thus, the common machine language should be composable, malleable, and maintainable, and the mapping burden should rest squarely on the compiler.
A common machine language also must not be too complex, so that it can attract typical programmers.
Memories may be shared; it does not matter, so long as this is not exposed in the language.
If we were to use a von Neumann language as a common machine language for a multicore, it would require heroic efforts from the compiler.
There are many research proposals for such a common machine language; one representation that will cover a large part of the application space is a streaming representation.
our research proposes streaming languages as the common machine language across multicore architectures
Specifically, in this talk we develop algorithms to enable portability and high-performance across multicores beginning from a high-level stream language.
our work focuses on compiler algorithms to break the abstraction layer between the language and the number of cores/capability of cores
Enable portability and high-performance across multicores
Streaming applications are becoming increasingly prevalent in general purpose processing; they are already a large part of desktop and embedded workloads.
outer
define stateless!
composable, malleable, debuggable (because it is deterministic)
Abundance of parallelism in streaming programs,
pipeline parallelism because these filters execute repeatedly, if they are mapped to different computing cores, we might be able to take advantage of pipeline parallelism, chains of producer and consumers
Each of the resulting duplicate products executes fewer times than the original; they are placed within a round-robin splitter…
a filter can be data-parallelized if it is stateless, meaning that the filter does not write to any variable that is read by a later iteration.
we have the nice properties that each type of parallelism can be naturally expressed in the stream graph
Abundance of parallelism in streaming programs,
Some streaming representations require that all filters are data parallel, we don’t have this requirement, the compiler discovers data parallel filters (see below)
no dynamic filter migration
each filter is mapped to a single core
concerned with throughput!
coarsen granularity to reduce communication overhead
Decreases global communication and synchronization
Enables internode optimizations on fused filter
data parallelize to parallel stateless components
to exploit pipeline parallelism we perform software pipelining to parallelize the remaining components
after a prologue schedule is executed, we can statically schedule the filters in any order in the steady state…
We employ a greedy heuristic to schedule the software-pipelined steady state
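The prologue/steady-state structure can be sketched as follows (plain Python; the filters A and B are hypothetical stand-ins for a two-filter pipeline). After the prologue fills the buffer, the steady state runs A on iteration i+1 and B on iteration i: they touch different iterations, so on a multicore they could be scheduled in any order or on different cores.

```python
# Sketch of software pipelining for a pipeline A -> B.

def A(i):
    return i * i          # producer: stand-in computation

def B(x):
    return x + 1          # consumer: stand-in computation

def run(n):
    out = []
    buf = A(0)            # prologue: one execution of A fills the buffer
    for i in range(1, n): # steady state
        nxt = A(i)        # independent of B(buf): could run in parallel
        out.append(B(buf))
        buf = nxt
    out.append(B(buf))    # epilogue: drain the last buffered item
    return out
```

The pipelined schedule produces exactly the unpipelined results B(A(i)) for each iteration i.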
over the last 5 years we have been developing….
Many possible legal schedules
frequency band detection, used in garage door openers, metal detectors
Highlight “filters”
Replace with filterbank
Every language has an underlying model of computation
StreamIt adopts SDF
Programs as graphs; nodes are filters (i.e., kernels), which represent autonomous/standalone computation kernels; edges represent data flow between kernels
The compiler orchestrates the execution of kernels: this is the schedule
As the earlier example showed, the schedule can affect locality: how often a kernel gets loaded/reloaded into cache
A lot of previous emphasis on minimizing data requirements between filters, but as we will show, in the presence of caches this isn't a good strategy
start off with the work function, atomic unit of execution
This is the work function; it is repeatedly executed by the filter
emphasize peek!
stateful versus stateless filters
Stateless filters can be data-parallelized!
Talk through filter example: what computation looks like in StreamIt
Highlight peek/pop/push rates, and work function
Parameterizable: takes argument N
Emphasize the code!
we can detect stateless filters using a simple program analysis
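A peeking, stateless filter of the kind described above can be sketched in Python (an N-tap FIR; the class and helper names are illustrative, not StreamIt syntax). work() pops 1 item and pushes 1 item but peeks N items, so N-1 items must be preserved on the input tape across invocations.

```python
from collections import deque

class FIR:
    """Sketch of a StreamIt-style peeking filter: an N-tap FIR."""

    def __init__(self, weights):
        self.weights = weights        # constant field: the filter is stateless

    def work(self, tape):
        # peek(i): read tape[i] without consuming it
        acc = sum(w * tape[i] for i, w in enumerate(self.weights))
        tape.popleft()                # pop rate = 1
        return acc                    # push rate = 1

def run(fir, items):
    tape, out = deque(items), []
    # fire the filter only when the peek rate is satisfied
    while len(tape) >= len(fir.weights):
        out.append(fir.work(tape))
    return out

# moving average of width 2, as a usage example
avg = FIR([0.5, 0.5])
```

Because work() writes no filter field, a simple program analysis can classify this filter as stateless and thus data-parallelizable.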
Mention that splitter can be duplicate or round-robin and joiner can be round robin
the streams of a pipeline/splitjoin do not have to be all the same
parameterized graphs
The StreamIt language is designed with productivity in mind
- natural for the program to represent computation
- expose what is traditionally hard for the compiler to discover: namely, parallelism and communication
The language is also designed to be modular/composable: important to software engineering, and also for correctness checking
Show stream constructs:
- filter: basic unit of computation
- pipeline: sequential/serial execution of streams that communicate data
> a stream can be a filter or any of the language stream constructs: modularity/reusability
- sj: explicit parallelism, scatter data with splitter, gather data with joiner
- also a feedback loop allows for cycles in the graph
The stream constructs are parameterizable: length of pipeline, width of sj
This gives rise to malleability: a small change in code leads to big changes in the program graph
Example on the next slide
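The composability of the constructs can be mimicked in plain Python (the encodings of pipeline and splitjoin below are illustrative only; they mirror the StreamIt construct names, not its syntax):

```python
# Sketch of composable stream constructs.

def pipeline(*stages):
    # serial execution: each stage's output feeds the next stage
    def run(items):
        for stage in stages:
            items = stage(items)
        return items
    return run

def splitjoin(*branches):
    # duplicate splitter: every branch sees the whole stream;
    # round-robin joiner: interleave one item from each branch
    def run(items):
        outs = [branch(items) for branch in branches]
        merged = []
        for group in zip(*outs):
            merged.extend(group)
        return merged
    return run

double = lambda xs: [2 * x for x in xs]
inc    = lambda xs: [x + 1 for x in xs]

# constructs nest: a splitjoin branch can itself be a pipeline
graph = pipeline(inc, splitjoin(double, pipeline(double, inc)))
```

A stream can be a filter or any nested construct, which is what makes the graphs modular and reusable.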
each pipeline filters a different frequency
scale top to 16 , and use the same scale for each, height etc
mention problems with task parallelism
not adequate as only source of parallelism
highlight bitonic sort, mention the problems with communication granularity
how large is filterbank? does the number match?
where direct communication is possible, we remove the scatter/gather, but we need scatter/gather between data-parallel filters with non-matching i/o rates
Think about something to say about filterbank!
either make the legend bigger or say what the legend is!
A filter that peeks performs computation on a sliding window, and items need to be preserved across invocations of the filter
Define fusion
Akin to inlining
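Fusion-as-inlining can be sketched as follows (the scale/offset filters are hypothetical stand-ins for two pipelined stateless filters):

```python
# Sketch of filter fusion: two pipelined filters are combined into
# one, eliminating the intermediate buffer between them.

def scale_work(x):
    return 2 * x

def offset_work(x):
    return x + 3

def unfused(items):
    # separate filters: the intermediate list models the buffer
    mid = [scale_work(x) for x in items]
    return [offset_work(x) for x in mid]

def fused(items):
    # the downstream work function is inlined after the upstream one,
    # so each item flows through both without touching a buffer
    return [offset_work(scale_work(x)) for x in items]
```

Besides removing communication, the fused body exposes both work functions to classical internode optimizations, just as inlining does for procedure calls.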
each pipeline filters a different frequency
remember that a streamit is hierarchical so a pipeline element can be a nested splitjoin, pipeline, or a filter
More detail into benefits
Naturally take advantage of task parallelism and avoid added synchronization imposed by filter fission
Significant amount of state…
annotated with relative load estimations for one execution of the stream graph
We don’t want to data parallelize the amplify filters because they perform such a small amount of work per steady state; if we data-parallelize, the scatter and gather will cost more than the parallelized computation
we don’t coarsen the stateful components because we would like the scheduler to have as much freedom as possible in scheduling small tasks.
Color coding the graph for easier reference
We can map this graph to a multicore, following the structure of the graph and taking advantage of data and task parallelism
And each execution of the graph would require 21 time units
But we can do better
because we are executing the stream graph repeatedly, we can unroll successive iterations…
compare to 9.5
put in all the bars for this, grey out other bars, the outline and the fill
explain the vocoder and the radar app and why they do so well
redo colors!
explain the other minor speedups
explain mpeg2decoder
comment on state, is it going to become more important mpeg4, h264?
Mention that our previous work utilized fine-grained processor-to-processor communication for hardware pipelining
Brook: only data parallelism; actors are required to be stateless (with support for reduction); focused on ILP and data parallelism; producer/consumer relationships are only exploited for memory-hierarchy optimizations
Das’s recent PACT paper describes scheduling techniques for Brook targeting Imagine. Traditional loop-scheduling techniques are leveraged, but only to parallelize memory access with a single compute kernel, to hide memory latencies between the stream register file and system memory, and to prevent spills from the stream register file. We apply traditional loop-scheduling techniques to parallelize stateful compute components (filters), and we target a multicore architecture. We exploit data parallelism at a coarser granularity, across cores, and we fuse compute components to match the granularity of the architecture
Cg: pipeline parallelism and data parallelism, but only for the 2 pipeline stages of a graphics processor
StreamIt has more emphasis on exploiting task and pipeline parallelism
These languages make it difficult to coarsen the granularity and software pipeline
robust framework for data parallelism that focuses on reductions (which we have not focused on); half of our benchmarks could be parallelized using MapReduce
Stress comparison to explicitly parallel languages: we get great speedup without programmer intervention, and the program is written in a portable manner that allows the compiler to produce efficient code.
Ptolemy is an environment for simulation, prototyping, and software synthesis for heterogeneous systems. It includes an SDF component, but the system focuses on simulation and modeling rather than on code generation for actual architectures.
Intel and AMD are pushing OpenMP and MPI to program their multicores. These systems graft parallel constructs onto C and Fortran. They are not composable and are hard to debug. They place the parallelization burden on the programmer, who is forced to make granularity, load-balancing, locality, and synchronization decisions through profiling the code. We move these decisions into the compiler and lower the bar for parallel programming, achieving portability and high performance.
Task parallelism is inadequate because the parallelism and synchronization are not matched to the target, forcing the programmer to intervene and create unportable code
Fine-grained data parallelism has good parallelism, but would overwhelm the communication mechanism of a multicore
Coarsening the granularity before data parallelism is exploited achieves great parallelization of stateless components
Finally, adding software pipelining allows us to parallelize stateful components and offers the best parallelism and the lowest synchronization because of the further opportunities for coarsening
conscious of the multicore’s communication substrate, we don’t want to overwhelm it
Our algorithms can remain largely unchanged across multicore architectures