PEARC19: Wrangling Rogues: A Case Study on Managing Experimental Post-Moore A...Jason Riedy
The Rogues Gallery is a new experimental testbed that is focused on tackling "rogue'' architectures for the Post-Moore era of computing. While some of these devices have roots in the embedded and high-performance computing spaces, managing current and emerging technologies provides a challenge for system administration that are not always foreseen in traditional data center environments.
We present an overview of the motivations and design of the initial Rogues Gallery testbed and cover some of the unique challenges that we have seen and foresee with upcoming hardware prototypes for future post-Moore research. Specifically, we cover the networking, identity management, scheduling of resources, and tools and sensor access aspects of the Rogues Gallery and techniques we have developed to manage these new platforms. We argue that current tools like the Slurm resource manager can support new rogues without major infrastructure changes.
PEARC19: Wrangling Rogues: A Case Study on Managing Experimental Post-Moore A...Jason Riedy
The Rogues Gallery is a new experimental testbed that is focused on tackling "rogue'' architectures for the Post-Moore era of computing. While some of these devices have roots in the embedded and high-performance computing spaces, managing current and emerging technologies provides a challenge for system administration that are not always foreseen in traditional data center environments.
We present an overview of the motivations and design of the initial Rogues Gallery testbed and cover some of the unique challenges that we have seen and foresee with upcoming hardware prototypes for future post-Moore research. Specifically, we cover the networking, identity management, scheduling of resources, and tools and sensor access aspects of the Rogues Gallery and techniques we have developed to manage these new platforms. We argue that current tools like the Slurm resource manager can support new rogues without major infrastructure changes.
In one classic sense a rogue is someone who goes their own way, who breaks away from the crowd. The CRNCH Rogues Gallery aims to support computer architecture rogues by being a physical and virtual space providing access to novel computing architectures. Researchers find applications, and architects discover what happens when their prototypes hit reality. Our goals are to help kick-start software ecosystems, train students in novel system evaluation and use, and provide rapid feedback to architects. By exposing students and researchers to this set of unique hardware, we foster cross-cutting discussions about hardware designs that will drive future performance improvements in computing long after the Moore’s Law era of “cheap transistors” ends. We provide a brief description of the current Rogues Gallery along with successes and research highlights over the last year.
OpenACC and Hackathons Monthly Highlights: April 2023OpenACC
Stay up-to-date on the latest news, research and resources. This month's edition covers the Open Hackathon Mentor Program, highlight from the recent UK National Hackathon, upcoming Open Hackathon and Bootcamp events, and more!
Opening Keynote Lecture
15th Annual ON*VECTOR International Photonics Workshop
Calit2’s Qualcomm Institute
University of California, San Diego
February 29, 2016
From Jisc's campus network engineering for data-intensive science workshop on 19 October 2016.
https://www.jisc.ac.uk/events/campus-network-engineering-for-data-intensive-science-workshop-19-oct-2016
Implementing AI: Running AI at the Edge: Adapting AI to available resource in...KTN
The Implementing AI: Running AI at the Edge, hosted by KTN and eFutures, is the second event of the Implementing AI webinar series.
To make products more intelligent, more responsive and to reduce the data generated, it is advantageous to run AI on the product itself, as opposed to in the cloud.
The focus of this webinar was the opportunities and challenges of moving the AI processing to “the Edge”. The webinar had four presentations from experts covering overviews of the opportunity, implementation techniques and case studies.
Find out more: https://ktn-uk.co.uk/news/just-launched-implementing-ai-webinar-series
AI-Driven Science and Engineering with the Global AI and Modeling Supercomput...Geoffrey Fox
Most things are dominated by Artificial Intelligence (AI). Technology Companies like Amazon, Google, Facebook, and Microsoft are AI First organizations.
Engineering achievement today is highlighted by the AI buried in a vehicle or machine. Industry (Manufacturing) 4.0 focusses on the AI-Driven future of the Industrial Internet of Things.
Software is eating the world.
We can describe much computer systems work as designing, building and using the Global AI and Modelling supercomputer which itself is autonomously tuned by AI. We suggest that this is not just a bunch of buzzwords but has profound significance and examine consequences of this for education and research.
Naively high-performance computing should be relevant for the AI supercomputer but somehow the corporate juggernaut is not making so much use of it. We discuss how to change this.
With the rise of fog and edge-computing as the basic paradigms for future communication standards such as 6G, new processing requirements are established. On the other hand, new security algorithms appear with the scaling of quantum technology, increasing the complexity of the cryptography applications for IoT devices with a tight standardization timeline. Finally, the integration of satellites as nodes for communication networks includes fault-tolerance and error-correction codes as design parameters.
FPGA-based soft-processors are supported by industry and space agencies as promising candidates to overcome all these challenges, due to their flexibility and power consumption compared to GPUs or multithreading CPUs with co-processors. To optimize these architectures to a wide range of scenarios, common methods, and arithmetic functions need to be integrated into the ISA. This talk will show some examples of the RISC-V EL2 core for both classical and post-quantum cryptography and error correction codes, reducing the latency of standardized solutions at a cost of a small cross-section increase keeping the behavior under radiation effects similar to the original core.
The rush to the edge and new applications around AI are causing a shift in design strategies toward the highest performance per watt, rather than the highest performance or lowest power.
An octa core processor with shared memory and message-passingeSAT Journals
Abstract This being the era of fast, high performance computing, there is the need of having efficient optimizations in the processor architecture and at the same time in memory hierarchy too. Each and every day, the advancement of applications in communication and multimedia systems are compelling to increase number of cores in the main processor viz., dual-core, quad-core, octa-core and so on. But, for enhancing the overall performance of multi processor chip, there are stringent requirements to improve inter-core synchronization. Thus, a MPSoC with 8-cores supporting both message-passing and shared-memory inter-core communication mechanisms is implemented on Virtex 5 LX110T FPGA. Each core is based on MIPS III (Microprocessor without interlocked pipelined stages) ISA, handling only integer type instructions and having six-stage pipeline with data hazard detection unit and forwarding logic. The eight processing cores and one central shared memory core are inter connected using 3x3 2-D mesh topology based Network-on-chip (NoC) with virtual channel router. The router is four stage pipelined supporting DOR X-Y routing algorithm and with round robin arbitration technique. For verification and functionality test of above fully synthesized multi core processor, matrix multiplication operation is mapped onto the above said. Partitioning and scheduling of multiple multiplications and addition for each element of resultant matrix has been done accordingly among eight cores to get maximum throughput. All the codes for processor design are written in Verilog HDL. Keywords: MPSoC, message-passing, shared memory, MIPS, ISA, wormhole router, network-on-chip, SIMD, data level parallelism, 2-D Mesh, virtual channel
In one classic sense a rogue is someone who goes their own way, who breaks away from the crowd. The CRNCH Rogues Gallery aims to support computer architecture rogues by being a physical and virtual space providing access to novel computing architectures. Researchers find applications, and architects discover what happens when their prototypes hit reality. Our goals are to help kick-start software ecosystems, train students in novel system evaluation and use, and provide rapid feedback to architects. By exposing students and researchers to this set of unique hardware, we foster cross-cutting discussions about hardware designs that will drive future performance improvements in computing long after the Moore’s Law era of “cheap transistors” ends. We provide a brief description of the current Rogues Gallery along with successes and research highlights over the last year.
OpenACC and Hackathons Monthly Highlights: April 2023OpenACC
Stay up-to-date on the latest news, research and resources. This month's edition covers the Open Hackathon Mentor Program, highlight from the recent UK National Hackathon, upcoming Open Hackathon and Bootcamp events, and more!
Opening Keynote Lecture
15th Annual ON*VECTOR International Photonics Workshop
Calit2’s Qualcomm Institute
University of California, San Diego
February 29, 2016
From Jisc's campus network engineering for data-intensive science workshop on 19 October 2016.
https://www.jisc.ac.uk/events/campus-network-engineering-for-data-intensive-science-workshop-19-oct-2016
Implementing AI: Running AI at the Edge: Adapting AI to available resource in...KTN
The Implementing AI: Running AI at the Edge, hosted by KTN and eFutures, is the second event of the Implementing AI webinar series.
To make products more intelligent, more responsive and to reduce the data generated, it is advantageous to run AI on the product itself, as opposed to in the cloud.
The focus of this webinar was the opportunities and challenges of moving the AI processing to “the Edge”. The webinar had four presentations from experts covering overviews of the opportunity, implementation techniques and case studies.
Find out more: https://ktn-uk.co.uk/news/just-launched-implementing-ai-webinar-series
AI-Driven Science and Engineering with the Global AI and Modeling Supercomput...Geoffrey Fox
Most things are dominated by Artificial Intelligence (AI). Technology Companies like Amazon, Google, Facebook, and Microsoft are AI First organizations.
Engineering achievement today is highlighted by the AI buried in a vehicle or machine. Industry (Manufacturing) 4.0 focusses on the AI-Driven future of the Industrial Internet of Things.
Software is eating the world.
We can describe much computer systems work as designing, building and using the Global AI and Modelling supercomputer which itself is autonomously tuned by AI. We suggest that this is not just a bunch of buzzwords but has profound significance and examine consequences of this for education and research.
Naively high-performance computing should be relevant for the AI supercomputer but somehow the corporate juggernaut is not making so much use of it. We discuss how to change this.
With the rise of fog and edge-computing as the basic paradigms for future communication standards such as 6G, new processing requirements are established. On the other hand, new security algorithms appear with the scaling of quantum technology, increasing the complexity of the cryptography applications for IoT devices with a tight standardization timeline. Finally, the integration of satellites as nodes for communication networks includes fault-tolerance and error-correction codes as design parameters.
FPGA-based soft-processors are supported by industry and space agencies as promising candidates to overcome all these challenges, due to their flexibility and power consumption compared to GPUs or multithreading CPUs with co-processors. To optimize these architectures to a wide range of scenarios, common methods, and arithmetic functions need to be integrated into the ISA. This talk will show some examples of the RISC-V EL2 core for both classical and post-quantum cryptography and error correction codes, reducing the latency of standardized solutions at a cost of a small cross-section increase keeping the behavior under radiation effects similar to the original core.
The rush to the edge and new applications around AI are causing a shift in design strategies toward the highest performance per watt, rather than the highest performance or lowest power.
An octa core processor with shared memory and message-passingeSAT Journals
Abstract This being the era of fast, high performance computing, there is the need of having efficient optimizations in the processor architecture and at the same time in memory hierarchy too. Each and every day, the advancement of applications in communication and multimedia systems are compelling to increase number of cores in the main processor viz., dual-core, quad-core, octa-core and so on. But, for enhancing the overall performance of multi processor chip, there are stringent requirements to improve inter-core synchronization. Thus, a MPSoC with 8-cores supporting both message-passing and shared-memory inter-core communication mechanisms is implemented on Virtex 5 LX110T FPGA. Each core is based on MIPS III (Microprocessor without interlocked pipelined stages) ISA, handling only integer type instructions and having six-stage pipeline with data hazard detection unit and forwarding logic. The eight processing cores and one central shared memory core are inter connected using 3x3 2-D mesh topology based Network-on-chip (NoC) with virtual channel router. The router is four stage pipelined supporting DOR X-Y routing algorithm and with round robin arbitration technique. For verification and functionality test of above fully synthesized multi core processor, matrix multiplication operation is mapped onto the above said. Partitioning and scheduling of multiple multiplications and addition for each element of resultant matrix has been done accordingly among eight cores to get maximum throughput. All the codes for processor design are written in Verilog HDL. Keywords: MPSoC, message-passing, shared memory, MIPS, ISA, wormhole router, network-on-chip, SIMD, data level parallelism, 2-D Mesh, virtual channel
Reproducible Linear Algebra from Application to ArchitectureJason Riedy
All computing must be parallel to take advantage of modern systems like multicore processors, GPUs, and distributed systems. Results that are not bit-wise reproducible introduce doubt on many levels. Sometimes that is appropriate. Reproducibility limitations occur because underlying libraries do not specify their reproducibility requirements. New advances in interfaces, algorithms, and architectures allow selecting among those requirements in the future. This talk covers many of the upcoming options and their trade-offs.
ICIAM 2019: Reproducible Linear Algebra from Application to ArchitectureJason Riedy
All computing must be parallel to take advantage of modern systems like multicore processors, GPUs, and distributed systems. Results that are not bit-wise reproducible introduce doubt on many levels. Sometimes that is appropriate. Reproducibility limitations occur because underlying libraries do not specify their reproducibility requirements. New advances in interfaces, algorithms, and architectures allow selecting among those requirements in the future. This talk covers many of the upcoming options and their trade-offs.
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph AnalysisJason Riedy
Applications in many areas analyze an ever-changing environment. On billion vertices graphs, providing snapshots imposes a large performance cost. We propose the first formal model for graph analysis running concurrently with streaming data updates. We consider an algorithm valid if its output is correct for the initial graph plus some implicit subset of concurrent changes. We show theoretical properties of the model, demonstrate the model on various algorithms, and extend it to updating results incrementally.
Augmented Arithmetic Operations Proposed for IEEE-754 2018Jason Riedy
Algorithms for extending arithmetic precision through compensated summation or arithmetics like double-double rely on operations commonly called twoSum and twoProduct. The current draft of the IEEE 754 standard specifies these operations under the names augmentedAddition and augmentedMultiplication. These operations were included after three decades of experience because of a motivating new use: bitwise reproducible arithmetic. Standardizing the operations provides a hardware acceleration target that can provide at least a 33% speed improvements in reproducible dot product, placing reproducible dot product almost within a factor of two of common dot product. This paper provides history and motivation for standardizing these operations. We also define the operations, explain the rationale for all the specific choices, and provide parameterized test cases for new boundary behaviors.
CRNCH Rogues Gallery: A Community Core for Novel Computing PlatformsJason Riedy
The Rogues Gallery is a new concept focused on developing our understanding of next-generation hardware with a focus on unorthodox and uncommon technologies. This project, initiated by Georgia Tech's Center for Research into Novel Computing Hierarchies (CRNCH), will acquire new and unique hardware (ie, the aforementioned "rogues") from vendors, research labs, and startups and make this hardware available to students, faculty, and industry collaborators within a managed data center environment. By exposing students and researchers to this set of unique hardware, we hope to foster cross-cutting discussions about hardware designs that will drive future performance improvements in computing long after the Moore's Law era of "cheap transistors" ends.
A New Algorithm Model for Massive-Scale Streaming Graph AnalysisJason Riedy
Applications in computer network security, social media analysis,and other areas rely on analyzing a changing environment. The data is rich in relationships and lends itself to graph analysis. Traditional static graph analysis cannot keep pace with network security applications analyzing nearly one million events per second and social networks like Facebook collecting 500 thousand comments per second. Streaming frameworks like STINGER support ingesting up three million of edge changes per second but there are few streaming analysis kernels that keep up with these rates. Here we present a new algorithm model for applying complex metrics to a changing graph. In this model, many more algorithms can be applied without having to stop the world.
High-Performance Analysis of Streaming Graphs Jason Riedy
Graph-structured data in social networks, finance, network security, and others not only are massive but also under continual change. These changes often are scattered across the graph. Stopping the world to run a single, static query is infeasible. Repeating complex global analyses on massive snapshots to capture only what has changed is inefficient. We discuss requirements for single-shot queries on changing graphs as well as recent high-performance algorithms that update rather than recompute results. These algorithms are incorporated into our software framework for streaming graph analysis, STINGER.
High-Performance Analysis of Streaming GraphsJason Riedy
Graph-structured data in social networks, finance, network security, and others not only are massive but also under continual change. These changes often are scattered across the graph. Stopping the world to run a single, static query is infeasible. Repeating complex global analyses on massive snapshots to capture only what has changed is inefficient. We discuss requirements for single-shot queries on changing graphs as well as recent high-performance algorithms that update rather than recompute results. These algorithms are incorporated into our software framework for streaming graph analysis, STING (Spatio-Temporal Interaction Networks and Graphs).
Algorithm for efficiently and accurately updating PageRank as the graph changes from a stream of updates. Also includes needs from the upcoming GraphBLAS to support high-performance streaming graph analysis.
Scalable and Efficient Algorithms for Analysis of Massive, Streaming GraphsJason Riedy
Graph-structured data in network security, social networks, finance, and other applications not only are massive but also under continual evolution. The changes often are scattered across the graph, permitting novel parallel and incremental analysis algorithms. We discuss analysis algorithms for streaming graph data to maintain both local and global metrics with low latency and high efficiency.
High-performance graph analysis is unlocking knowledge in computer security, bioinformatics, social networks, and many other data integration areas. Graphs provide a convenient abstraction for many data problems beyond linear algebra. Some problems map directly to linear algebra. Others, like community detection, look eerily similar to sparse linear algebra techniques. And then there are algorithms that strongly resist attempts at making them look like linear algebra. This talk will cover recent results with an emphasis on streaming graph problems where the graph changes and results need updated with minimal latency. We’ll also touch on issues of sensitivity and reliability where graph analysis needs to learn from numerical analysis and linear algebra.
Graph Analysis Trends and Opportunities -- CMG Performance and Capacity 2014Jason Riedy
High-performance graph analysis is unlocking knowledge in problems like anomaly detection in computer security, community structure in social networks, and many other data integration areas. While graphs provide a convenient abstraction, real-world problems' sparsity and lack of locality challenge current systems. This talk will cover current trends ranging from massive scales to low-power, low-latency systems and summarize opportunities and directions for graphs and computing systems.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs.This meetup was formerly Milvus Meetup, and is sponsored by Zilliz maintainers of Milvus.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Analysis insight about a Flyball dog competition team's performanceroli9797
Insight of my analysis about a Flyball dog competition team's last year performance. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
Adjusting OpenMP PageRank : SHORT REPORT / NOTESSubhajit Sahu
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take
advantage of a shared memory system with multiple CPUs, each with multiple cores, to
accelerate pagerank computation. If the NUMA architecture of the system is properly taken
into account with good vertex partitioning, the speedup can be significant. To take steps in
this direction, experiments are conducted to implement pagerank in OpenMP using two
different approaches, uniform and hybrid. The uniform approach runs all primitives required
for pagerank in OpenMP mode (with multiple threads). On the other hand, the hybrid
approach runs certain primitives in sequential mode (i.e., sumAt, multiply).
Learn SQL from basic queries to Advance queriesmanishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdfEnterprise Wired
In this guide, we'll explore the key considerations and features to look for when choosing a Trusted analytics platform that meets your organization's needs and delivers actionable intelligence you can trust.
Novel Architectures for Applications in Data Science and Beyond
1. Novel Architectures for Applications in
Data Science and Beyond
Jason Riedy, Jeffrey Young, Tom Conte
Center for Research into Novel Computing Hierarchies at Georgia Tech
1 March 2019
2. Apps: Massive+-scale data analysis
Cyber-security Identify anomalies, malicious actors
Health care Find outbreaks, population epidemiology, similar
patient association
Social networks Advertising, searching, grouping
Intelligence Decisions at scale, regulating markets, smart &
sustainable cities
Systems biology Understanding interactions, drug design
Power grid / Smart cities Disruptions, conservation, prediction
Irregular data access. Changing data.
Rogues Gallery — 1 Mar 2019 2/23
3. High-Performance Data Analysis (HPDA)
Novel applications:
• Data at scale and speed needs new ideas for
computing analysis.
• “Big data” platforms fare poorly v. a single thread
plus large SSD even for static data sets. (McSherry,
Isard, Murray. “Scalability! But at what COST?” HotOS
XV, 2015.)
• Many high-level codes are written and re-written to
answer one question: need flexibility.
• Some primitives may be tuned and re-used.
Rogues Gallery — 1 Mar 2019 3/23
4. HPDA and Architectures
• Current architectures are hitting limits on
manufacturing, heat dissipation, memory latency...
• New architecture proposals are difficult to evaluate
via simulation and modeling alone.
• What happens when novel prototypes hit reality?
• Architects/designers need rapid feedback on new
ideas.
• New ideas only become successful with a
community: software ecosystem, trained students.
Need bridges between apps and architects.
Rogues Gallery — 1 Mar 2019 4/23
5. Introducing the CRNCH Rogues Gallery
A physical & virtual space for hosting novel computing
architectures, systems, and accelerators.
Emu Chick FPGAs & HMC/HBM
FPAA
Amortize effort and cost of trying novel architectures.
Break the “but it’s too much work” barrier.
http://crnch.gatech.edu/rogues-gallery
Rogues Gallery — 1 Mar 2019 5/23
6. Emu Technology’s PGAS Architecture
1 nodelet
Gossamer
Core 1
Memory-Side Processor
Gossamer
Core 4
...
Migration Engine
RapidIODisk I/O
8 nodelets
per node
64 nodelets
per Chick
RapidIO
Stationary
Core
• Multithreaded multicore
• Memory-side “processor” for
operations in
narrow-channel DRAM
• Stationary core for OS
• Threads migrate in
hardware on reads!
• Optimize for weak locality
Rogues Gallery — 1 Mar 2019 6/23
7. 3D Stacked Memory and FPGAs
FPGAs enable flexible
“near-memory” research for both
hardware and programming
models:
• Micron/Pico EX700 & HMC
• Nallatech 385s (Arria 10)
• Intel Arria10 DevKit
• Nallatech 520N (Stratix 10)
• Xilinx MpSOC
Rogues Gallery — 1 Mar 2019 7/23
8. Neuromorphic Systems
• Field-Programmable Analog Array (FPAA) System-On
Chip, designed in the lab of Dr. Jennifer Hasler.
• Analog arrays can achieve unprecedented power and
size reductions.
FPAA “driven” by a RPi Neuromorphic Workshop,
27 Apr 2018
Rogues Gallery — 1 Mar 2019 8/23
9. Flexible Infrastructure
login /
notebook
rg-adm
Slurm Ctl
toolbox
(NFS)
Scheduling,
Tools, and
Admin
Key:
Schedulable Resource
Physical Resource
VM
USB device
User
Resources
fpaa-host
power-host
nvidia-tegra-N
nvidia-tegra-1
fpaa-dev
rg-db
Slurm DBD
emu-dev emu-chick
..Nfpga-dev-1
fpga-hmcfpga-intel
Powell, Riedy, Young, and Conte. “Wrangling Rogues: Managing
Experimental Post-Moore Architectures.”
https://arxiv.org/abs/1808.06334
• Available. Plans to
integrate with NSF
XSEDE.
• Scheduler being
deployed.
• Incorporates
Singularity and virtual
machines for
OS/library versioning.
• Fun tools: usbip for
FPAA... Analog inputs?
Rogues Gallery — 1 Mar 2019 9/23
10. Rogues Gallery Press
On The Next Platform, a partner of The Register:
https://www.nextplatform.com/2018/08/27/a-rogues-gallery-of-post-moores-law-options/
Rogues Gallery — 1 Mar 2019 10/23
11. Neuromorphic Workshop - April 2018
FPAA-focused workshop led by Dr.
Hasler introduced the wider
community to FPAA programming
for neuromorphic systems.
• Tutorial-style class
• Lightning talks from GT and
DoE researchers
• http://crnch.gatech.
edu/neuro-workshop18
Rogues Gallery — 1 Mar 2019 11/23
12. Rogues Gallery ASPLOS Tutorial
Programming Novel Architectures in the Post-Moore Era
with The Rogues Gallery
https://crnch-rg.gitlab.io/rg/asplos-2019/
• To be held at ASPLOS 2019 in Providence, RI.
• 14 April 2019, 8.30am – 12.00pm
• Architecture detailed descriptions
• Hands-on with the Emu Chick and (hopefully) the
FPAA
Rogues Gallery — 1 Mar 2019 12/23
13. Rogues Gallery VIP Team
Rogues Gallery VIP team has started in Spring 2019. This
course allows undergraduates to get research experience
working with software for novel architectures.
http://www.vip.gatech.edu/teams/
new-team-rogues-gallery
Rogues Gallery — 1 Mar 2019 13/23
17. Selected Results: BFS on a Dynamic Data Structure
15 16 17 18 19 20 21
scale
0
20
40
60
80
100
MTEPS
Emu single node - Cilk
Emu multi-node - Cilk
x86 Haswell - STINGER
x86 Haswell - Cilk
0
500
1000
1500
EdgeBandwidth(MB/s)
Note: Streaming data structure, not statically optimized. 3
3
Hein, Eswar, Abdurrahman Yasar, Prasanth Chatarasi, Li, Young, Conte, Ümit Çatalyürek, Vuduc, Riedy, Bora Uçar.
“Programming Strategies for Irregular Algorithms on the Emu Chick.” https://arxiv.org/abs/1901.02775
Rogues Gallery — 1 Mar 2019 17/23
18. Selected Results: Labeled Subgraph Alignment
1 2 4 8 16 32 64 128
Number of Threads
0
10
20
30
40
50
Speedup
Multi-BLK
Multi-HCB
Single-BLK
Single-HCB
gsaNA, the first parallel algorithm, strong scaling on DBLP
graph (2048 vertices)3
Rogues Gallery — 1 Mar 2019 18/23
19. Selected Results: FPGAs
Hadidi, Asgari, Young, Mudassar, Garg, Krishna, Kim. “Performance
Implications of NoCs on 3D-Stacked Memories: Insights from the
Hybrid Memory Cube (HMC),” ISPASS 2018
• Characterizations
with FPGA and Hybrid
Memory Cube show
latency/bandwidth
tradeoff.
• Other FPGA work is
focused on compilers,
HPC prototyping, and
sparse algorithms for
Intel and Xilinx FPGAS.
Rogues Gallery — 1 Mar 2019 19/23
20. Lessons Learned, Emu Chick i
• Finding appropriate metrics is difficult, e.g. for the
Emu:
• Comparing ASICs (e.g. x86) to FPGA-based prototypes
can be unfair either way.
• Fraction of peak bandwidth for the idealized
problem?
• SpMV: FLOP/s ∝ BW, level 2 sparse BLAS op.
• Graph500 BFS: TEPS ∝ BW
Rogues Gallery — 1 Mar 2019 20/23
21. Lessons Learned, Emu Chick ii
• Distilling observations on architecture ↔
programming model:
• Program data location for load (BW) balance.
• Remote memory operations v. migration exposes the
architecture.
• Migrations cost more than it appears. Computation?
• Stack spills/access can cause ping-ponging.
• How does HW support for top-down (Cilk-ish) affect
bottom-up (UPC/SHMEM) PGAS programming?
• Memory allocation similar to UPC, SHMEM
• UPC++ rpc_ff v. Emu thread migration?
Rogues Gallery — 1 Mar 2019 21/23
22. Acknowledgments
Fantastic students and colleagues:
• Srinivas Eswar (GT CSE)
• Dr. Eric Hein (GT ECE ⇒ Emu)
• Patrick Lavin (GT CSE)
• Dr. Jiajia Li (GT CSE ⇒ PNNL)
• Abdurrahman Yaşar (GT CSE)
• Dr. Ümit Çatalürek (GT CSE)
• Dr. Tom Conte (GT CS/ECE)
• Dr. Bora Uçar (ENS Lyon CNRS)
• Dr. Rich Vuduc (GT CSE)
• Dr. Jeffrey S. Young (GT CS)
Code (ideally will have links from crnch.gatech.edu):
• https://gitlab.com/crnch-rg (soon)
• https://github.com/ehein6/emu-microbench
Other testbeds:
• ORNL: ExCL
• PNNL: CENATE
• Argonne
• Sandia
• Berkeley: AQCT
• (others?)
Rogues Gallery — 1 Mar 2019 22/23
23. Rogues Gallery: Active and Growing
• Added FPGA resources, integrating FPAAs
• Tight development loop with Emu
• Active research projects and publications
• Community outreach and education underway
• One GT student has made the leap to industry
already...
CRNCH Rogues Gallery connects researchers and
students with novel architectures and architects with
upcoming applications.
Let us host / manage your neat stuff!
http://crnch.gatech.edu/rogues-gallery
Rogues Gallery — 1 Mar 2019 23/23