Roop Ganguly, Solution Architect
The End of Moore’s Law
[Figure: chart with a GHz axis (1.0 to 3.0), years 1970 to 2000, process nodes from 350 nm down to 65 nm, and a "Power Wall" annotation]
Gordon Moore
Implications for Big Data
Security Analytics
Risk Management
Behavioral Analytics
Natural Language Processing
AI/Deep Learning
Machine Learning
CPU-Bound Applications – A New Bottleneck
40Gb-100Gb Network
Now that faster networking and disk technologies have emerged, CPUs act like “stop signs” for computation
Node 1
Node 2
Node 3
Accelerators
Microprocessor and Cloud Vendors Respond
ASIC
GPU
FPGA
Data Scientists & Developers
Performance Team
Inhibitor: Programming Model Gap for Hardware Accelerators
Two wildly different skill sets
CPU GPU FPGA
Data Science Programming Model
BIG DATA PLATFORMS
Acceleration Programming Model
Programming Model Gap
Cross Platform
Cross Hardware
Intelligent, automatic computation routing
Zero code change
Introducing Bigstream
Hyper-acceleration Layer
Dataflow Adaptation Layer
Bigstream Dataflow
Bigstream Hypervisor
HYPER-ACCELERATION LAYER
BIG DATA PLATFORMS
CPU GPU FPGA
3X to 30X acceleration
Accelerated Spark Architecture with Bigstream
Business Intelligence Use Case
Business Intelligence Query
• Based on the Transaction Processing Performance Council Decision Support (TPC-DS) benchmark
• Spark/SQL query:
  SELECT i_item_id, avg(ss_quantity) agg1, avg(ss_list_price) agg2, avg(ss_coupon_amt) agg3
  FROM store_sales, customer_demographics, date_dim, item, promotion
  WHERE ss_sold_date_sk = d_date_sk AND ss_item_sk = i_item_sk
  AND…….
• Input: approximately 2 GB of Avro table data
• The software-accelerated and unaccelerated runs execute simultaneously on identical Amazon EMR clusters
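A side-by-side comparison like this comes down to one ratio: the unaccelerated wall-clock time divided by the accelerated one. The sketch below shows that arithmetic only. The `RunTiming` type, the `speedup` helper, and both timings are hypothetical illustrations, not figures from the demo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunTiming:
    """Wall-clock time for one run of the same query on one cluster."""
    label: str
    seconds: float


def speedup(baseline: RunTiming, accelerated: RunTiming) -> float:
    """Return how many times faster the accelerated run finished."""
    if baseline.seconds <= 0 or accelerated.seconds <= 0:
        raise ValueError("timings must be positive")
    return baseline.seconds / accelerated.seconds


# Hypothetical timings for illustration only; these are NOT measured results.
unaccelerated_run = RunTiming("unaccelerated", 120.0)
accelerated_run = RunTiming("accelerated", 40.0)

print(f"{speedup(unaccelerated_run, accelerated_run):.1f}x")
```

Running both clusters on the same input at the same time, as the slide describes, keeps the two timings comparable, so the ratio reflects the acceleration layer rather than differences in data or cluster size.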
Business Intelligence Use Case Demo
ETL Adtech Use Case
Adtech ETL/ML Data Pipeline
Spark Streaming
Spark Streaming
APPLICATION/WEB SERVERS
KAFKA
clicks
clicks, likes
impressions
USERS
Spark ML
RTB Systems
Distributed messaging system (tens of servers)
Distributed computation system (hundreds of servers)
Millions of users
ETL Use Case Demo
Announcement – Bigstream on AWS EMR
Setting the bootstrap script
Bigstream ON EMR
Add the Bigstream bootstrap URL and your cluster has hyper-acceleration
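For readers who create EMR clusters programmatically, a bootstrap script is passed as an entry in the `BootstrapActions` list of the cluster request (for example, in boto3's `run_job_flow`). The sketch below only builds that entry and makes no AWS call. The `bootstrap_action` helper, the action name, and the S3 path are placeholders assumed for illustration; the actual Bigstream bootstrap URL is not given in these slides.

```python
from typing import Optional


def bootstrap_action(name: str, script_url: str, args: Optional[list] = None) -> dict:
    """Build one entry for the BootstrapActions list of an EMR cluster request."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_url,
            "Args": list(args or []),
        },
    }


# Placeholder path: substitute the bootstrap URL supplied by the vendor.
BOOTSTRAP_URL = "s3://example-bucket/path/to/bootstrap.sh"

action = bootstrap_action("Install acceleration layer", BOOTSTRAP_URL)

# This fragment would be merged into the keyword arguments of a cluster request.
request_fragment = {"BootstrapActions": [action]}
print(request_fragment)
```

The same setting is available in the EMR console when the cluster is created, which is the path the slide refers to.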
Thank You

Demonstrating the Benefits of Hyper-Acceleration


Editor's Notes

  • #7, #8 At a very high level, our hyper-acceleration layer consists of three sub-layers: the Dataflow Adaptation Layer, Bigstream Dataflow, and the Bigstream Hypervisor.
  • #14 This is a typical streaming Big Data pipeline. As users browse the internet, our servers send different events through a messaging bus, such as a Kafka cluster. A Spark cluster reads the data from Kafka, cleans it up, and passes a statistical summary to a machine learning engine. You can see a similar pipeline architecture in IoT, security, and AdTech. A similar scenario can arise with batch data, which is not necessarily streamed. In the next slide, I will show a demo of a Spark cluster with the Bigstream hyper-acceleration software stack already installed. The demo shows the execution without and with acceleration.
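
The clean-then-summarize step described in note #14 can be sketched in plain Python. This is an in-memory stand-in under stated assumptions, not Spark or Kafka code: the event fields (`user`, `type`), the set of valid event types, and the per-user counts used as the "statistical summary" are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict

# Event types named on the pipeline slide; treated here as the only valid ones.
VALID_EVENTS = {"click", "like", "impression"}


def clean(events):
    """Drop malformed records: a missing user id or an unknown event type."""
    for event in events:
        if event.get("user") and event.get("type") in VALID_EVENTS:
            yield event


def summarize(events):
    """Count events per user and type; this summary would feed the ML stage."""
    counts = defaultdict(Counter)
    for event in events:
        counts[event["user"]][event["type"]] += 1
    return {user: dict(per_type) for user, per_type in counts.items()}


# Hypothetical raw events, standing in for messages read from the bus.
raw_events = [
    {"user": "u1", "type": "click"},
    {"user": "u1", "type": "impression"},
    {"user": None, "type": "click"},    # dropped: no user id
    {"user": "u2", "type": "scroll"},   # dropped: unknown event type
    {"user": "u1", "type": "click"},
    {"user": "u2", "type": "like"},
]

summary = summarize(clean(raw_events))
print(summary)
```

In the real pipeline each stage runs distributed across many servers, but the shape is the same: filter out bad records first, then aggregate before handing off to the model.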