Big Data and
High Performance Computing
Dr. Abzetdin ADAMOV
Center for Data Analytics Research (CeDAR)
School of IT & Engineering
ADA University
aadamov@ada.edu.az
Talk Outline
• Where Big Data Comes From
• Opportunities derived from Big Data
• Understanding the Big Data Problem
• Why Now?
• Hadoop Ecosystem
• Big Data Computing Solution
• Massive Parallelization
• Q&A
WHERE BIG DATA COMES FROM
Where were we?
Pope Benedict inauguration in 2005
Where are we now?
Pope Francis inauguration in 2013
Digital Universe Volume
• 2003 – 5 exabytes (EB) created since the beginning of civilization
• 2005 – 130 EB
• 2008 – 480,000 petabytes (PB)
• 2009 – 800,000 PB
• 2010 – 1,200,000 PB, or 1.2 zettabytes (ZB)
• 2012 – 2.7 ZB
• 2014 – ~6.2 ZB
• 2015 – ~10 ZB
• 2017 – ~16 ZB
• 2019 – ~30 ZB
• 2020 – 44 ZB (estimated)
Every day now we create as much information as we
did from the dawn of civilization up until 2003
Where Data Comes From
Data is produced by:
• People
• Social Media, Public Web, Smartphones, …
• Organizations (Employer)
• OLTP, OLAP, BI, …
• Machines
• IoT, Satellites, Vehicles, Science, …
Modern Data Sources
→ User-Generated Content (Web & Mobile)
• Twitter, Facebook, Snapchat, YouTube
• Clickstream, Ads, User Engagement
• Payments: PayPal, Venmo
→ Internet of Anything (IoAT)
• Wind Turbines, Oil Rigs, Cars
• Weather Stations, Smart Grids
• RFID Tags, Beacons, Wearables
Data Variety – Multiple Formats
• Structured: 5–10% of the entire data universe
• SQL (relational) databases
• Semi-Structured: 5–10%
• CSV, XML, JSON, email headers
• Unstructured: 80–90%
• books, journals, documents, metadata, log files, health records, audio, video, images, files, email messages, web pages, social media posts, word-processor documents, ...
OPPORTUNITIES FROM BIG DATA
Data Analytics is needed everywhere:
• Recommendation engines
• Smart meter monitoring
• Equipment monitoring
• Advertising analysis
• Life sciences research
• Fraud detection
• Healthcare outcomes
• Weather forecasting for business planning
• Oil & Gas exploration
• Social network analysis
• Churn analysis
• Traffic flow optimization
• IT infrastructure & Web App optimization
• Legal discovery and document archiving
• Intelligence gathering
• Location-based tracking & services
• Pricing analysis
• Personalized insurance
DATA is the NEW OIL! But do you have the capacity to refine it?
UNDERSTANDING THE BIG DATA PROBLEM
The “Big Data” Problem
Problem
→ A single machine cannot process, or even store, all the data!
Solution
→ Distribute the data over large clusters
Difficulty
→ How to split work across machines?
→ Moving data over the network is expensive
→ Must consider data & network locality
→ How to deal with failures?
→ How to deal with slow nodes?
Traditional Approach in Data Management
A 3 TB file cannot fit on a single 1 TB hard drive, so it is split into 1 TB chunks spread over several drives (STORAGE). For PROCESSING, the raw data is then moved from storage to a central processor, which writes back the processed data.
Addressing Data
• Standard hard drive data transfer speed: 60–100 MB/s
• Solid State Drive (SSD): 250–500 MB/s
• Hard drive capacity is growing RAPIDLY (4–60 TB)
• Online data keeps growing (doubling roughly every 18 months)
• Processing speed grows at a comparable rate (Moore's Law)
• Hard drive transfer speed stays relatively FLAT
Moving data IN and OUT of disk is the bottleneck (a quick calculation follows below).
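As a back-of-the-envelope illustration: the 3 TB file and the ~100 MB/s figure come from these slides, while the 30-drive comparison is an assumption added here to preview the distributed solution.

```python
# Rough numbers behind the disk bottleneck claim (illustrative only).
file_size_mb = 3 * 1_000_000        # a 3 TB file expressed in MB
disk_speed_mb_s = 100               # sequential read speed of one spinning disk

one_drive_s = file_size_mb / disk_speed_mb_s      # 30,000 s on a single drive
thirty_drives_s = one_drive_s / 30                # 1,000 s if 30 drives read in parallel

print(f"1 drive  : {one_drive_s / 3600:.1f} hours")      # ~8.3 hours
print(f"30 drives: {thirty_drives_s / 60:.1f} minutes")  # ~16.7 minutes
```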
WHY NOW?
Addressing Data – Digital Universe
[Chart: "Digital Universe Growth over time" – data growth in zettabytes, 1995–2018]
Addressing Data – Hard Disk Capacity
[Chart: "Hard Drive Capacity Growth over time" – capacity in gigabytes, 1991–2017]
Addressing Data – Storage Cost
[Chart: "Data Storage Cost per Gigabyte" – price per GB falls from roughly $1,200,000 (1980) and $100,000 (1985) to about $10,000 (1990), $800 (1995), $10 (2000), $1 (2005), $0.10 (2010), $0.003 (2015) and $0.002 (2020)]
Computation Power CPU and GPU
[Chart: "Computation Power CPU and GPU" – peak performance in GFLOPS for CPUs vs. GPUs, 2001–2018]
HADOOP ECOSYSTEM
Hadoop Ecosystem – Big Data Tech Stack
Layers, from bottom to top: STORAGE → DATA MANAGEMENT → PROCESSING → INTELLIGENCE / VISUALIZATION
Hadoop Core = Storage + Compute
The Hadoop Distributed File System (HDFS) pools the cluster's disks into a single storage layer, while Yet Another Resource Negotiator (YARN) manages the compute resources (CPU and RAM) and schedules work on them.
BIG DATA COMPUTING SOLUTION
Timeline of Computing Architecture
• Traditional Architecture (until ~2000): applications run on a single operating system installed directly on the HARDWARE.
• Virtualized Architecture (2000+): a HYPERVISOR runs on the hardware and hosts several operating systems, each with its own applications.
• Distributed Architecture (2010+): many machines, each with its own OS, are tied together by HADOOP (HDFS + YARN), and applications run across the whole cluster.
Distributed vs Traditional Computing
• Traditional Computing: the data sits in a central RDBMS or on SAN / NAS storage, and is moved over the network to the machine where the Function (the processing logic) runs.
• Distributed Computing: the data is spread across many nodes, and the Function is shipped to every node so that each one processes its local portion of the data.
Distributed Architecture of HDFS
The cluster is organized into racks, each with its own switch and a set of DataNodes: Rack 1 (DN1–DN4), Rack 2 (DN11–DN14), Rack 3 (DN21–DN24), Rack 4 (DN31–DN34).
Where to write file ADA.txt (blocks A, B, C, D) in HDFS? The CLIENT asks the NAMENODE, which chooses three DataNodes per block (replication factor 3) and spreads the copies across racks:
• A – DN32, DN11, DN14
• B – DN01, DN22, DN23
• C – DN12, DN02, DN04
• D – DN34, DN12, DN14
The client then streams each block directly to the selected DataNodes (see the sketch below).
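As a hedged illustration of this write path, the sketch below uses HDFS's WebHDFS REST interface from Python: the client first asks the NameNode where to write and is redirected to a DataNode chosen by the placement policy. The hostname, the user name, and the file path are placeholders; 9870 is the default NameNode HTTP port in Hadoop 3.

```python
# Minimal sketch of an HDFS write over WebHDFS (placeholder host/port/path).
import requests

NAMENODE = "http://namenode.example.com:9870"   # assumed NameNode address
path = "/user/ada/ADA.txt"

# 1. Ask the NameNode where to write: it answers with an HTTP 307 redirect
#    whose Location header points at a DataNode picked by the placement policy.
resp = requests.put(
    f"{NAMENODE}/webhdfs/v1{path}?op=CREATE&overwrite=true&user.name=hdfs",
    allow_redirects=False,
)
datanode_url = resp.headers["Location"]

# 2. Stream the file content to that DataNode; HDFS then splits it into blocks
#    and replicates each block (3 copies by default) across nodes and racks,
#    as in the placement table above.
with open("ADA.txt", "rb") as f:
    requests.put(datanode_url, data=f)
```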
MapReduce Architecture
Logical view of a MapReduce job (MAP → SHUFFLE → REDUCE):
1. The INPUT DATA is split, and each split is fed to a Map() task that emits intermediate [k1, v1] key–value pairs.
2. The shuffle phase sorts the pairs by key and merges them into [k1, [v1, v2, …, vN]] lists.
3. Reduce() tasks aggregate each key's list of values and write the OUTPUT DATA.
A word-count sketch in this style follows below.
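A minimal word-count sketch written for Hadoop Streaming, which lets any executable that reads stdin and writes stdout act as the Map() and Reduce() steps. The file names, HDFS paths, and submission command below are illustrative.

```python
# --- mapper.py : the Map() step --------------------------------------------
# Reads raw text lines from stdin and emits one "<word>\t1" pair per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# --- reducer.py : the Reduce() step ----------------------------------------
# The shuffle phase has already sorted the pairs by key, so all counts for
# a given word arrive on consecutive lines.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")

# Submitted (paths illustrative) roughly as:
#   hadoop jar hadoop-streaming-*.jar -files mapper.py,reducer.py \
#     -mapper "python3 mapper.py" -reducer "python3 reducer.py" \
#     -input /data/books -output /data/wordcount
```

Hadoop runs the mappers on the nodes that hold the input blocks and performs the sort-by-key shuffle between the two scripts, so the function moves to the data rather than the other way around.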
MASSIVE PARALLELIZATION
How are CPUs and GPUs different?
• CPU: several powerful cores, which run the application's general logic and serial functions.
• GPU: hundreds to thousands of simpler cores, to which the parallel and compute-intensive functions are offloaded.
Low Latency vs. High Throughput
• CPU: a few ALUs, a large CONTROL unit and L2 cache, backed by DRAM
• Optimized for low-latency access to cached datasets
• Control logic for out-of-order and speculative execution
• GPU: hundreds of ALUs, a small L2 cache, backed by its own DRAM
• Optimized for data-parallel throughput computation
• Architecture tolerant of memory latency
• More transistors dedicated to computation
Interaction between CPU and GPU
The CPU (a few ALUs, CONTROL, L2 cache) and the GPU (hundreds of ALUs, L2 cache) each have their own DRAM, so data must be moved between them explicitly:
1. Copy the input data from CPU memory to GPU memory;
2. Load the dedicated functions (kernels) onto the GPU and execute them;
3. Copy the computed results from GPU memory back to CPU memory.
A minimal sketch of this flow follows below.
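A minimal sketch of these three steps using Numba's CUDA bindings for Python (one of the language extensions mentioned on the next slide); the array size and the kernel itself are illustrative.

```python
# Explicit CPU <-> GPU data movement around a kernel launch (illustrative).
import numpy as np
from numba import cuda

@cuda.jit
def scale(data, factor):
    i = cuda.grid(1)                 # global thread index
    if i < data.size:
        data[i] *= factor

a = np.arange(1_000_000, dtype=np.float32)

d_a = cuda.to_device(a)              # 1. copy input data: CPU memory -> GPU memory

threads_per_block = 256
blocks = (a.size + threads_per_block - 1) // threads_per_block
scale[blocks, threads_per_block](d_a, 2.0)   # 2. run the kernel on the GPU

result = d_a.copy_to_host()          # 3. copy results: GPU memory -> CPU memory
```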
What is CUDA?
• A parallel programming model (including a memory model)
• Can utilize hundreds of CUDA cores and thousands of parallel threads
• Lets developers focus on parallel programming, abstracting away many low-level operations
• Supports heterogeneous systems that combine CPU + GPU
• Implemented as an extension of C++
• Bindings and extensions exist for C, C++, C#, Fortran, Java, Python, Ruby, etc.
CUDA Kernel
• A kernel is a small program (a set of computing instructions) written to run on the GPU;
• The GPU executes a kernel in thousands of parallel threads (see the sketch below);
• CUDA threads are:
• Lightweight
• Fast to switch
• Massively parallel
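For example, a SAXPY kernel sketched with Numba's CUDA Python bindings (an assumed binding choice; the same kernel in CUDA C++ would carry the __global__ qualifier). Every one of the launched threads runs the same small program on a different element.

```python
# One kernel, roughly a million lightweight threads (sizes are illustrative).
import numpy as np
from numba import cuda

@cuda.jit
def saxpy(a, x, y, out):
    i = cuda.grid(1)                 # which element this thread owns
    if i < out.size:                 # guard: the grid may be larger than the data
        out[i] = a * x[i] + y[i]

n = 1_000_000
x = np.random.rand(n).astype(np.float32)
y = np.random.rand(n).astype(np.float32)
out = np.zeros_like(x)

threads = 256
blocks = (n + threads - 1) // threads
# Numba copies the NumPy arrays to and from the GPU automatically here.
saxpy[blocks, threads](2.0, x, y, out)
```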
Logical Architecture vs. Physical Architecture
Logical units map onto physical hardware as follows:
• Threads are executed by CUDA cores;
• Thread Blocks are executed by Streaming Multiprocessors (SMs);
• The Grid (all blocks of a kernel launch) is executed by the GPU as a whole.
The indexing sketch below shows how a thread locates itself inside this hierarchy.
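A small sketch (again with Numba's CUDA bindings, an assumed choice) of how a thread computes its global index from its position in the block and the block's position in the grid; block and grid sizes are illustrative.

```python
# Thread / block / grid indexing (illustrative sizes).
import numpy as np
from numba import cuda

@cuda.jit
def write_global_index(out):
    tid = cuda.threadIdx.x           # thread's index inside its block (runs on a CUDA core)
    bid = cuda.blockIdx.x            # block's index inside the grid (a block runs on one SM)
    bdim = cuda.blockDim.x           # number of threads per block
    i = bid * bdim + tid             # global index; equivalent to cuda.grid(1)
    if i < out.size:
        out[i] = i

n = 10_000
out = np.zeros(n, dtype=np.int32)
threads_per_block = 128              # all threads of one block share one SM's resources
blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
write_global_index[blocks_per_grid, threads_per_block](out)
```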
BIG DOES NOT MEAN SLOW
SMALL DOES NOT MEAN WEAK
Information is the oil of the 21st century,
and Analytics is the Combustion Engine
Q & A ?
Dr. Abzetdin Adamov,
Email me at: aadamov@ada.edu.az
Follow me at: @
Link to me at: www.linkedin.com/in/adamov
Visit my blog at: aadamov.wordpress.com