GPUs in Big Data - StampedeCon 2014

At StampedeCon 2014, John Tran of NVIDIA presented "GPUs in Big Data." Modern graphics processing units (GPUs) are massively parallel general-purpose processors that are taking Big Data by storm. In terms of power efficiency, compute density, and scalability, it is clear now that commodity GPUs are the future of parallel computing. In this talk, we will cover diverse examples of how GPUs are revolutionizing Big Data in fields such as machine learning, databases, genomics, and other computational sciences.

Transcript

  • 1. GPUs in Big Data. John Tran | StampedeCon 2014, May 29, 2014, St. Louis, MO
  • 2. “If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?” —Seymour Cray
  • 3. Example CPU: Xeon E5-2687W. 2.27 B transistors; 8 cores, 16 threads @ 3.1 GHz; 0.35 SP TFLOPS; 0.17 DP TFLOPS; up to 256 GB DDR3 @ 1600 MHz; 51.2 GB/s memory BW; 150 W; 20 MB L3 cache; strong single-thread performance via branch prediction and out-of-order execution.
  • 4. Example GPU: Tesla K40. 7.1 B transistors; 2880 cores, 30,720 threads @ 745 MHz; 4.29 SP TFLOPS; 1.43 DP TFLOPS; 12 GB GDDR5 @ 3 GHz; 288 GB/s memory BW; 235 W; PCIe Gen3 x16 host link (12 GB/s).
  • 5. Math and memory peak throughput (chart). Xeon E5-2687W: 0.35 SP TFLOPS, 0.17 DP TFLOPS, 51.2 GB/s memory BW. Tesla K40: 4.29 SP TFLOPS, 1.43 DP TFLOPS, 288 GB/s memory BW.
  • 6. The Chickens Are Winning: parallel computing is no longer "the future"; if you are not parallel, you are already behind. GPUs win on performance, power, and cost, and each of those wins translates into $$.
  • 7. Where did these GPUs come from?
  • 8. OK, but what about computing?
  • 9. All Computing is Parallel Computing
  • 10. Parallel Computing CPU GPU
  • 11. The Basic Idea – Accelerated Computing. The compute-intensive functions of the application code are offloaded to the GPU through CUDA, while the rest of the sequential code continues to run on the CPU.
  • 12. Quick CUDA C example (http://developer.nvidia.com/cuda-toolkit)

    Standard C code:

        void saxpy(int n, float a, float *x, float *y)
        {
            for (int i = 0; i < n; ++i)
                y[i] = a*x[i] + y[i];
        }

        int N = 1<<20;
        // Perform SAXPY on 1M elements
        saxpy(N, 2.0f, x, y);

    Parallel CUDA C code:

        __global__ void saxpy(int n, float a, float *x, float *y)
        {
            int i = blockIdx.x*blockDim.x + threadIdx.x;
            if (i < n)
                y[i] = a*x[i] + y[i];
        }

        int N = 1<<20;
        // Copy the inputs into device buffers d_x and d_y (sizes are in bytes)
        cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
        // Perform SAXPY on 1M elements: 4096 blocks of 256 threads
        saxpy<<<4096,256>>>(N, 2.0f, d_x, d_y);
        // Copy the result back to the host
        cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
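    The slide snippet elides the host-side setup (allocating x and y, cudaMalloc of the device buffers, a correctness check, cleanup). A minimal, self-contained sketch of the same SAXPY example, assuming only the standard CUDA runtime API (the initial values and the error check are illustrative, not from the slide):

        #include <stdio.h>
        #include <stdlib.h>
        #include <math.h>
        #include <cuda_runtime.h>

        // Each thread handles one element of y = a*x + y.
        __global__ void saxpy(int n, float a, const float *x, float *y)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n)
                y[i] = a * x[i] + y[i];
        }

        int main(void)
        {
            int N = 1 << 20;                     // 1M elements
            size_t bytes = N * sizeof(float);

            // Host buffers with illustrative initial values
            float *x = (float *)malloc(bytes);
            float *y = (float *)malloc(bytes);
            for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

            // Device buffers
            float *d_x, *d_y;
            cudaMalloc(&d_x, bytes);
            cudaMalloc(&d_y, bytes);

            // Copy the inputs to the device
            cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
            cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

            // Launch enough 256-thread blocks to cover all N elements
            int threads = 256;
            int blocks  = (N + threads - 1) / threads;
            saxpy<<<blocks, threads>>>(N, 2.0f, d_x, d_y);

            // Copy the result back and spot-check it (expect 2*1 + 2 = 4)
            cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);
            float maxError = 0.0f;
            for (int i = 0; i < N; ++i)
                maxError = fmaxf(maxError, fabsf(y[i] - 4.0f));
            printf("Max error: %f\n", maxError);

            cudaFree(d_x); cudaFree(d_y);
            free(x); free(y);
            return 0;
        }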
  • 13. How else can you program it? Libraries: Thrust, cuBLAS, cuSPARSE, cuFFT, NPP, cuRAND. Directives: OpenACC. Languages: CUDA C, CUDA C++, Thrust, Python, Fortran, a C++ standard proposal, MATLAB, GPU.NET. Learn: "Get CUDA," Udacity, Coursera. (A Thrust sketch of the same SAXPY follows below.)
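    As one illustration of the library route, Thrust (which ships with the CUDA Toolkit) can express the same SAXPY without writing an explicit kernel. A minimal sketch, assuming the Thrust headers bundled with the toolkit (the fill values are illustrative):

        #include <thrust/device_vector.h>
        #include <thrust/transform.h>
        #include <thrust/functional.h>
        #include <iostream>

        int main(void)
        {
            const int   N = 1 << 20;
            const float a = 2.0f;

            // Constructing device_vectors allocates and fills GPU memory.
            thrust::device_vector<float> x(N, 1.0f);
            thrust::device_vector<float> y(N, 2.0f);

            // y = a*x + y, written with placeholders; Thrust generates
            // and launches the kernel for us.
            using namespace thrust::placeholders;
            thrust::transform(x.begin(), x.end(), y.begin(), y.begin(),
                              a * _1 + _2);

            // Reading y[0] copies one element back to the host.
            std::cout << "y[0] = " << y[0] << std::endl;  // expect 4
            return 0;
        }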
  • 14. How does this matter to Big Data?
  • 15. Shazam: 90 M monthly active users; 17 M tracks tagged per day; 27 M tracks in the database. "GPUs enable us to handle our tremendous processing needs at a substantial cost savings, delivering twice the performance per dollar compared to a CPU-based system." -Jason Titus, CTO, Shazam
  • 16. Deep Neural Networks for image classification
  • 17. Google datacenter: 1,000 CPU servers, 600 kW, $5,000,000. Stanford AI Lab: 3 GPU-accelerated servers, 3.6 kW, $21,000. (Deep learning with COTS HPC systems, A. Coates, B. Huval, T. Wang, D. Wu, A. Ng, B. Catanzaro, ICML 2013)
  • 18. Speech Recognition
  • 19. The Data-Scope at JHU: 5 PB of science data (in 2010). "The Data-Scope will allow us to mine out relationships among data that already exist but that we can’t yet handle and to sift discoveries from what seems like an overwhelming flow of information. New discoveries will definitely emerge this way. There are relationships and patterns that we just cannot fathom buried in that onslaught of data. Data-Scope will tease these out." – Alex Szalay, JHU
  • 20. HIV Capsid
  • 21. Beating-heart surgery: a patient stands to lose one point of IQ for every 10 minutes the heart is stopped, yet only ~2% of heart surgeons will operate on a beating heart. GPUs enable real-time motion compensation that virtually stops the beating heart for the surgeon. (Courtesy Laboratoire d’Informatique, de Robotique et de Microélectronique de Montpellier)
  • 22. NVBIO: NVIDIA's open-source, GPU-accelerated library for high-throughput genomics and sequence analysis.
  • 23. Final Thoughts: parallel computing is here; re-think parallel or get left behind. Scale up before scaling out: a single GPU delivers several orders of magnitude more parallelism, so ask whether you really need a cluster. GPUs are the most efficient solution for parallel problems, in both perf/$ and perf/Watt.
  • 24. All Computing is Parallel Computing
