Big Data and
High Performance Computing
Dr. Abzetdin ADAMOV
Center for Data Analytics Research (CeDAR)
School of IT & Engineering
ADA University
aadamov@ada.edu.az
Talk Outline
• Where Big Data Comes From
• Opportunities derived from Big Data
• Understanding the Big Data Problem
• Why Now?
• Hadoop Ecosystem
• Big Data Computing Solution
• Massive Parallelization
• Q&A
WHERE BIG DATA COMES FROM
Where were we?
Pope Benedict inauguration in 2005
Where are we now?
Pope Francis inauguration in 2013
Digital Universe Volume
• 2003 – 5 exabytes (EB) created since the beginning of civilization
• 2005 – 130 EB
• 2008 – 480,000 petabytes (PB)
• 2009 – 800,000 PB
• 2010 – 1,200,000 PB, or 1.2 zettabytes (ZB)
• 2012 – 2.7 ZB
• 2014 – ~6.2 ZB
• 2015 – ~10 ZB
• 2017 – ~16 ZB
• 2019 – ~30 ZB
• 2020 – 44 ZB (estimated)
Every day now we create as much information as we
did from the dawn of civilization up until 2003
Where Data Comes From
Data is produced by:
• People
• Social Media, Public Web, Smartphones, …
• Organizations (Employer)
• OLTP, OLAP, BI, …
• Machines
• IoT, Satellites, Vehicles, Science, …
Modern Data Sources
→ User-Generated Content (Web & Mobile)
• Twitter, Facebook, Snapchat, YouTube
• Clickstream, Ads, User Engagement
• Payments: PayPal, Venmo
→ Internet of Anything (IoAT)
• Wind Turbines, Oil Rigs, Cars
• Weather Stations, Smart Grids
• RFID Tags, Beacons, Wearables
Data Variety – Multiple Formats
• Structured: 5–10% of the entire data universe
• SQL (relational) databases
• Semi-Structured: 5–10%
• CSV, XML, JSON, email headers
• Unstructured: 80–90%
• books, journals, documents, metadata, log files, health records, audio, video, images, files, email messages, web pages, social media posts, word-processor documents, ...
OPPORTUNITIES FROM BIG DATA
Data Analytics is needed everywhere:
• Recommendation engines
• Smart meter monitoring
• Equipment monitoring
• Advertising analysis
• Life sciences research
• Fraud detection
• Healthcare outcomes
• Weather forecasting for business planning
• Oil & Gas exploration
• Social network analysis
• Churn analysis
• Traffic flow optimization
• IT infrastructure & Web App optimization
• Legal discovery and document archiving
• Intelligence gathering
• Location-based tracking & services
• Pricing analysis
• Personalized insurance
DATA is the NEW OIL! But do you have the capacity to refine it?
UNDERSTANDING THE BIG DATA PROBLEM
The “Big Data” Problem
Problem
→ A single machine cannot process, or even store, all the data!
Solution
→ Distribute the data over large clusters
Difficulty
→ How to split work across machines?
→ Moving data over the network is expensive
→ Must consider data & network locality
→ How to deal with failures?
→ How to deal with slow nodes?
Traditional Approach in Data Management
A 3 TB file cannot fit on a single 1 TB hard drive, so it is split into 1 TB chunks spread over several drives (STORAGE). For PROCESSING, the raw data is then moved from storage to a central processor, which writes back the processed data.
Addressing Data
• Standard hard drive data transfer speed: 60–100 MB/s
• Solid State Drive (SSD): 250–500 MB/s
• Hard drive capacity is growing RAPIDLY (4–60 TB)
• Online data keeps growing (doubling roughly every 18 months)
• Processing speed grows at a comparable rate (Moore's Law)
• Hard drive transfer speed stays relatively FLAT
Moving data IN and OUT of disk is the bottleneck (a quick calculation follows below).
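As a back-of-the-envelope illustration: the 3 TB file and the ~100 MB/s figure come from these slides, while the 30-drive comparison is an assumption added here to preview the distributed solution.

```python
# Rough numbers behind the disk bottleneck claim (illustrative only).
file_size_mb = 3 * 1_000_000        # a 3 TB file expressed in MB
disk_speed_mb_s = 100               # sequential read speed of one spinning disk

one_drive_s = file_size_mb / disk_speed_mb_s      # 30,000 s on a single drive
thirty_drives_s = one_drive_s / 30                # 1,000 s if 30 drives read in parallel

print(f"1 drive  : {one_drive_s / 3600:.1f} hours")      # ~8.3 hours
print(f"30 drives: {thirty_drives_s / 60:.1f} minutes")  # ~16.7 minutes
```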
WHY NOW?
Addressing Data – Digital Universe
[Chart: "Digital Universe Growth over time" – data growth in zettabytes, 1995–2018]
Addressing Data – Hard Disk Capacity
[Chart: "Hard Drive Capacity Growth over time" – capacity in gigabytes, 1991–2017]
Addressing Data – Storage Cost
[Chart: "Data Storage Cost per Gigabyte" – price per GB falls from roughly $1,200,000 (1980) and $100,000 (1985) to about $10,000 (1990), $800 (1995), $10 (2000), $1 (2005), $0.10 (2010), $0.003 (2015) and $0.002 (2020)]
Computation Power CPU and GPU
[Chart: "Computation Power CPU and GPU" – peak performance in GFLOPS for CPUs vs. GPUs, 2001–2018]
HADOOP ECOSYSTEM
Hadoop Ecosystem – Big Data Tech Stack
Layers, from bottom to top: STORAGE → DATA MANAGEMENT → PROCESSING → INTELLIGENCE / VISUALIZATION
Hadoop Core = Storage + Compute
The Hadoop Distributed File System (HDFS) pools the cluster's disks into a single storage layer, while Yet Another Resource Negotiator (YARN) manages the compute resources (CPU and RAM) and schedules work on them.
BIG DATA COMPUTING SOLUTION
Timeline of Computing Architecture
• Traditional Architecture (until ~2000): applications run on a single operating system installed directly on the HARDWARE.
• Virtualized Architecture (2000+): a HYPERVISOR runs on the hardware and hosts several operating systems, each with its own applications.
• Distributed Architecture (2010+): many machines, each with its own OS, are tied together by HADOOP (HDFS + YARN), and applications run across the whole cluster.
Distributed vs Traditional Computing
• Traditional Computing: the data sits in a central RDBMS or on SAN / NAS storage, and is moved over the network to the machine where the Function (the processing logic) runs.
• Distributed Computing: the data is spread across many nodes, and the Function is shipped to every node so that each one processes its local portion of the data.
Distributed Architecture of HDFS
The cluster is organized into racks, each with its own switch and a set of DataNodes: Rack 1 (DN1–DN4), Rack 2 (DN11–DN14), Rack 3 (DN21–DN24), Rack 4 (DN31–DN34).
Where to write file ADA.txt (blocks A, B, C, D) in HDFS? The CLIENT asks the NAMENODE, which chooses three DataNodes per block (replication factor 3) and spreads the copies across racks:
• A – DN32, DN11, DN14
• B – DN01, DN22, DN23
• C – DN12, DN02, DN04
• D – DN34, DN12, DN14
The client then streams each block directly to the selected DataNodes (see the sketch below).
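As a hedged illustration of this write path, the sketch below uses HDFS's WebHDFS REST interface from Python: the client first asks the NameNode where to write and is redirected to a DataNode chosen by the placement policy. The hostname, the user name, and the file path are placeholders; 9870 is the default NameNode HTTP port in Hadoop 3.

```python
# Minimal sketch of an HDFS write over WebHDFS (placeholder host/port/path).
import requests

NAMENODE = "http://namenode.example.com:9870"   # assumed NameNode address
path = "/user/ada/ADA.txt"

# 1. Ask the NameNode where to write: it answers with an HTTP 307 redirect
#    whose Location header points at a DataNode picked by the placement policy.
resp = requests.put(
    f"{NAMENODE}/webhdfs/v1{path}?op=CREATE&overwrite=true&user.name=hdfs",
    allow_redirects=False,
)
datanode_url = resp.headers["Location"]

# 2. Stream the file content to that DataNode; HDFS then splits it into blocks
#    and replicates each block (3 copies by default) across nodes and racks,
#    as in the placement table above.
with open("ADA.txt", "rb") as f:
    requests.put(datanode_url, data=f)
```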
MapReduce Architecture
Logical view of a MapReduce job (MAP → SHUFFLE → REDUCE):
1. The INPUT DATA is split, and each split is fed to a Map() task that emits intermediate [k1, v1] key–value pairs.
2. The shuffle phase sorts the pairs by key and merges them into [k1, [v1, v2, …, vN]] lists.
3. Reduce() tasks aggregate each key's list of values and write the OUTPUT DATA.
A word-count sketch in this style follows below.
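A minimal word-count sketch written for Hadoop Streaming, which lets any executable that reads stdin and writes stdout act as the Map() and Reduce() steps. The file names, HDFS paths, and submission command below are illustrative.

```python
# --- mapper.py : the Map() step --------------------------------------------
# Reads raw text lines from stdin and emits one "<word>\t1" pair per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# --- reducer.py : the Reduce() step ----------------------------------------
# The shuffle phase has already sorted the pairs by key, so all counts for
# a given word arrive on consecutive lines.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")

# Submitted (paths illustrative) roughly as:
#   hadoop jar hadoop-streaming-*.jar -files mapper.py,reducer.py \
#     -mapper "python3 mapper.py" -reducer "python3 reducer.py" \
#     -input /data/books -output /data/wordcount
```

Hadoop runs the mappers on the nodes that hold the input blocks and performs the sort-by-key shuffle between the two scripts, so the function moves to the data rather than the other way around.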
MASSIVE PARALLELIZATION
How are CPUs and GPUs different?
• CPU: several powerful cores, which run the application's general logic and serial functions.
• GPU: hundreds to thousands of simpler cores, to which the parallel and compute-intensive functions are offloaded.
Low Latency vs. High Throughput
• CPU: a few ALUs, a large CONTROL unit and L2 cache, backed by DRAM
• Optimized for low-latency access to cached datasets
• Control logic for out-of-order and speculative execution
• GPU: hundreds of ALUs, a small L2 cache, backed by its own DRAM
• Optimized for data-parallel throughput computation
• Architecture tolerant of memory latency
• More transistors dedicated to computation
Interaction between CPU and GPU
The CPU (a few ALUs, CONTROL, L2 cache) and the GPU (hundreds of ALUs, L2 cache) each have their own DRAM, so data must be moved between them explicitly:
1. Copy the input data from CPU memory to GPU memory;
2. Load the dedicated functions (kernels) onto the GPU and execute them;
3. Copy the computed results from GPU memory back to CPU memory.
A minimal sketch of this flow follows below.
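A minimal sketch of these three steps using Numba's CUDA bindings for Python (one of the language extensions mentioned on the next slide); the array size and the kernel itself are illustrative.

```python
# Explicit CPU <-> GPU data movement around a kernel launch (illustrative).
import numpy as np
from numba import cuda

@cuda.jit
def scale(data, factor):
    i = cuda.grid(1)                 # global thread index
    if i < data.size:
        data[i] *= factor

a = np.arange(1_000_000, dtype=np.float32)

d_a = cuda.to_device(a)              # 1. copy input data: CPU memory -> GPU memory

threads_per_block = 256
blocks = (a.size + threads_per_block - 1) // threads_per_block
scale[blocks, threads_per_block](d_a, 2.0)   # 2. run the kernel on the GPU

result = d_a.copy_to_host()          # 3. copy results: GPU memory -> CPU memory
```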
What is CUDA?
• A parallel programming model (including a memory model)
• Can utilize hundreds of CUDA cores and thousands of parallel threads
• Lets developers focus on parallel programming, abstracting away many low-level operations
• Supports heterogeneous systems that combine CPU + GPU
• Implemented as an extension of C++
• Bindings and extensions exist for C, C++, C#, Fortran, Java, Python, Ruby, etc.
CUDA Kernel
• A kernel is a small program (a set of computing instructions) written to run on the GPU;
• The GPU executes a kernel in thousands of parallel threads (see the sketch below);
• CUDA threads are:
• Lightweight
• Fast to switch
• Massively parallel
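For example, a SAXPY kernel sketched with Numba's CUDA Python bindings (an assumed binding choice; the same kernel in CUDA C++ would carry the __global__ qualifier). Every one of the launched threads runs the same small program on a different element.

```python
# One kernel, roughly a million lightweight threads (sizes are illustrative).
import numpy as np
from numba import cuda

@cuda.jit
def saxpy(a, x, y, out):
    i = cuda.grid(1)                 # which element this thread owns
    if i < out.size:                 # guard: the grid may be larger than the data
        out[i] = a * x[i] + y[i]

n = 1_000_000
x = np.random.rand(n).astype(np.float32)
y = np.random.rand(n).astype(np.float32)
out = np.zeros_like(x)

threads = 256
blocks = (n + threads - 1) // threads
# Numba copies the NumPy arrays to and from the GPU automatically here.
saxpy[blocks, threads](2.0, x, y, out)
```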
Logical Architecture vs. Physical Architecture
Logical units map onto physical hardware as follows:
• Threads are executed by CUDA cores;
• Thread Blocks are executed by Streaming Multiprocessors (SMs);
• The Grid (all blocks of a kernel launch) is executed by the GPU as a whole.
The indexing sketch below shows how a thread locates itself inside this hierarchy.
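A small sketch (again with Numba's CUDA bindings, an assumed choice) of how a thread computes its global index from its position in the block and the block's position in the grid; block and grid sizes are illustrative.

```python
# Thread / block / grid indexing (illustrative sizes).
import numpy as np
from numba import cuda

@cuda.jit
def write_global_index(out):
    tid = cuda.threadIdx.x           # thread's index inside its block (runs on a CUDA core)
    bid = cuda.blockIdx.x            # block's index inside the grid (a block runs on one SM)
    bdim = cuda.blockDim.x           # number of threads per block
    i = bid * bdim + tid             # global index; equivalent to cuda.grid(1)
    if i < out.size:
        out[i] = i

n = 10_000
out = np.zeros(n, dtype=np.int32)
threads_per_block = 128              # all threads of one block share one SM's resources
blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
write_global_index[blocks_per_grid, threads_per_block](out)
```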
BIG DOES NOT MEAN SLOW
SMALL DOES NOT MEAN WEAK
Information is the oil of the 21st century,
and Analytics is the Combustion Engine
Q & A ?
Dr. Abzetdin Adamov,
Email me at: aadamov@ada.edu.az
Follow me at: @
Link to me at: www.linkedin.com/in/adamov
Visit my blog at: aadamov.wordpress.com