Andrey Vykhodtsev
Andrey.Vykhodtsev@si.ibm.com
Agenda
• Massively Parallel Processing (MPP) concepts
• Overview of Hadoop Architecture
• Processing Engines
• MapReduce
• Spark
• Hive, Pig, BigSQL
• Hadoop distributions
• Stream processing
• Advanced analytics on Hadoop
Big Data
• An umbrella term that really means “analytics at scale on any kind of data”
• It is about:
○ Scalability
○ Cost reduction (per terabyte, or of infrastructure)
○ Variety of formats to analyze
○ New types of analytics
Use Cases
• Telco
○ Mediation
○ Geolocation / fencing
○ Call archival
○ Lawful intercept
○ …
• Banking
○ Counter-fraud
○ Regulatory compliance
○ Analyzing customer behavior
Definition
• Perform a set of coordinated computations in parallel (wiki def.)
• Grid computing
• Cluster computing
• Why? To make things faster
• Counting the buttons on everyone in a stadium takes one person 34 days*, or 1,000 people 50 minutes* (see the sketch below)
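The same idea scales down to one machine: a minimal Python sketch of splitting the counting job across worker processes (the data and numbers are purely illustrative):

from multiprocessing import Pool

def count_buttons(person_buttons):
    # Stand-in for the per-person work: return that person's button count
    return person_buttons

if __name__ == "__main__":
    # One illustrative button count per person in the stadium
    stadium = [i % 5 for i in range(100_000)]
    with Pool(processes=8) as pool:
        # Split the crowd across 8 workers, then aggregate the partial results
        total = sum(pool.map(count_buttons, stadium, chunksize=1_000))
    print(total)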
Types of systems
• Shared Memory (SMP)
○ Simple to implement data processing features
○ Expensive to scale
• Shared Disk clusters
○ Easier to implement the storage layer
○ Bottlenecked above the storage layer
○ Harder to scale
• Shared Nothing clusters
Types of systems (cont.)
• Shared Nothing Clusters
Types of systems (cont.)
• Relational Database Management Systems
○ SQL support
○ ACID (Atomicity, Consistency, Isolation, Durability)
○ All interfaces lower than SQL are hidden
○ Netezza
• General processing frameworks
○ Lower-level interfaces exposed
○ MPI
○ Hadoop
Notable systems
• MPP RDBMS: Teradata, ~1980
• Netezza, ~2000
• Hadoop, ~2006
NoSQL
• CAP Theorem
○ Consistency, Availability, Partition tolerance
○ Pick 2
• BASE in contrast to ACID
• Cloudant, HBase (see the sketch below)
• Different database genres
○ Graph
○ Document store
○ Columnar
○ Key-value
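As a small illustration of the key-value genre, a sketch using the happybase Python client for HBase (it assumes a running HBase Thrift server and an existing table; all names are hypothetical):

import happybase

# Connect to an HBase Thrift server (host is an assumption)
connection = happybase.Connection("localhost")
table = connection.table("users")  # hypothetical table with column family "info"

# Writes are key-value puts: a row key plus column-family:qualifier cells
table.put(b"user-42", {b"info:name": b"Ada", b"info:city": b"Ljubljana"})

# Reads are by key; there is no SQL layer in this genre of store
print(table.row(b"user-42"))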
Hadoop
• Distributed platform for thousands of nodes
• Data storage and computation framework
• Open source
• Runs on commodity hardware
• Flexible – everything is loosely coupled
Hadoop benefits
• Linear scalability
• Software resilience rather than expensive hardware
• “Schema on read”
• Parallelism
• Variety of tools
The Hadoop Filesystem (HDFS)
• Driving principles
○ Files are stored across the entire cluster
○ Programs are brought to the data, not the data to the program
• Distributed file system (DFS) stores blocks across the whole cluster
○ Blocks of a single file are distributed across the cluster
○ A given block is typically replicated as well for resiliency
• Just like a regular file system, the contents of a file are up to the application
• Unlike a regular file system, you can ask it “where does each block of my file live?” (see the sketch below)
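As a hedged illustration of that last point, the sketch below shells out to the standard hdfs fsck admin command (it assumes a configured Hadoop client on the PATH; the file path is hypothetical) to print where each block of a file lives:

import subprocess

# Ask HDFS to report the blocks of a file and their locations.
# `hdfs fsck <path> -files -blocks -locations` is a standard admin command.
report = subprocess.run(
    ["hdfs", "fsck", "/user/demo/data.txt", "-files", "-blocks", "-locations"],
    capture_output=True, text=True, check=True,
)
print(report.stdout)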
Hadoop Distributed File System (HDFS)
• Stores files in folders
• Nobody cares what’s in your files
• Chunks large files into blocks (~64 MB–2 GB)
• 3 replicas of each block (by default)
• Blocks are scattered all over the place
HDFS – Architecture
• Master / slave architecture
• Master: NameNode
○ manages the file system namespace and metadata (FsImage, EditLog)
○ regulates access to files by clients
• Slave: DataNode
○ many per cluster
○ manages storage attached to the nodes
○ periodically reports status to the NameNode
[Diagram: File1 consists of blocks a–d; the NameNode tracks the block map while each block is replicated across several DataNodes]
MapReduce
• Common pattern in data processing: apply a function, then aggregate
○ Identify words in each line of a document collection
○ For each word, return the sum of occurrences throughout the collection
• User simply writes two pieces of code: a “mapper” and a “reducer”
○ Mapper executes on every split of every file
○ Reducer consumes/aggregates mapper outputs
• The Hadoop MR framework takes care of the rest (resource allocation, scheduling, coordination, storage of the final result on DFS, …)
[Diagram: a logical file is divided into splits 1–3; each split is processed by a Map task somewhere on the cluster, and the Reduce phase aggregates the map outputs into the result]
Logical MapReduce Example: Word Count
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
Content of input documents:
  Hello World Bye World
  Hello IBM
Map 1 emits: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
Map 2 emits: <Hello, 1> <IBM, 1>
Reduce (final output): <Bye, 1> <IBM, 1> <Hello, 2> <World, 2>
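To make the mapper/reducer contract concrete, here is a minimal local simulation in plain Python; there is no cluster here, and the helper names are illustrative, but the data flow mirrors the pseudocode above:

import itertools

def mapper(document):
    # EmitIntermediate(word, 1) for each word in the document
    for word in document.split():
        yield (word, 1)

def reducer(word, counts):
    # Sum all counts emitted for one word
    return (word, sum(counts))

documents = ["Hello World Bye World", "Hello IBM"]
# Map phase: run the mapper over every input "split"
intermediate = sorted(kv for doc in documents for kv in mapper(doc))
# Shuffle + reduce: group pairs by word, then aggregate each group
results = [reducer(word, (count for _, count in group))
           for word, group in itertools.groupby(intermediate, key=lambda kv: kv[0])]
print(results)  # [('Bye', 1), ('Hello', 2), ('IBM', 1), ('World', 2)]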
Spark
• Spark brings two significant value-adds:
○ Bring to MapReduce the same added value that databases (and parallel databases) brought to query processing:
  – Let the app developer focus on the WHAT (they need to ask) and let the system figure out the HOW (it should be done)
  – Enable faster, higher-level application development through higher-level constructs and concepts (the RDD concept)
  – Let the system deal with performance (as part of the HOW): leveraging memory (buffer pools; caching RDDs in memory) and maintaining sets of dedicated worker processes ready to go (subagents in a DBMS, Executors in Spark)
  – Enable interactive processing (CLP, SQL*Plus, spark-shell, etc.)
○ Be one general-purpose engine for multiple types of workloads (SQL, streaming, machine learning, etc.)
Spark (cont.)
• Apache Spark is a fast, general-purpose, easy-to-use cluster computing system for large-scale data processing
• Fast
○ Leverages aggressively cached in-memory distributed computing and dedicated App Executor processes that stay up even when no jobs are running
○ Faster than MapReduce
• General purpose
○ Covers a wide range of workloads
○ Provides SQL, streaming, and complex analytics
• Flexible and easier to use than MapReduce
○ Spark is written in Scala, an object-oriented, functional programming language
○ Scala, Python, and Java APIs
○ Scala and Python interactive shells
○ Runs on Hadoop, Mesos, standalone, or in the cloud
[Chart: logistic regression runtime in Hadoop vs. Spark]
[Diagram: the Spark stack]
WordCount:

val wordCounts = sc.textFile("README.md")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)
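The same WordCount in Spark’s Python API, as a minimal sketch (it assumes an existing SparkContext named sc, as provided by the interactive shells mentioned above):

# PySpark equivalent of the Scala WordCount above
word_counts = (sc.textFile("README.md")
               .flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(word_counts.take(10))  # first few (word, count) pairs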
Pig
• Pig is a query language that runs MapReduce jobs
• Higher-level than MapReduce: write code in terms of GROUP BY, DISTINCT, FOREACH, FILTER, etc.
• Custom loaders and storage functions make this good glue

A = LOAD 'data.txt'
    AS (name:chararray, age:int, state:chararray);
B = GROUP A BY state;
C = FOREACH B GENERATE group, COUNT(A), AVG(A.age);
DUMP C;
Hive
• SQL engine on top of MapReduce
• Rapidly developed, lots of features
• Its query language, HiveQL, deviates from ANSI SQL
• Lacks a cost-based query optimizer, statistics, and many other features
• Not responsive enough for small jobs
BigSQL
• Data shared with the Hadoop ecosystem
• Comprehensive file format support
• Superior enablement of IBM and third-party software
• Modern MPP runtime
• Powerful SQL query rewriter
• Cost-based optimizer
• Optimized for concurrent user throughput
• Results not constrained by memory
• Distributed requests to multiple data sources within a single SQL statement
○ Main data sources supported: DB2 LUW, Teradata, Oracle, Netezza, Informix, SQL Server
• Advanced security/auditing
• Resource and workload management
• Self-tuning memory management
• Comprehensive monitoring
• Comprehensive SQL support
○ IBM SQL PL compatibility
○ Extensive analytic functions
A lot of buzzwords
• Ambari – web admin interface
• ZooKeeper – distributed object sync
• HBase – NoSQL key/value store
• Flume – buffered ingestion
• Sqoop – database import/export
• Oozie – workflow manager
• YARN – cluster resource manager
• Nagios/Ganglia – monitoring, metrics
Hadoop distributions
• Hortonworks HDP
• Cloudera
• IBM BigInsights
IBM BigInsights for Apache Hadoop
• IBM Open Platform with Apache Hadoop* (HDFS, YARN, MapReduce, Ambari, HBase, Hive, Oozie, Parquet, Parquet Format, Pig, Snappy, Solr, Spark, Sqoop, ZooKeeper, Open JDK, Knox, Slider)
• IBM BigInsights Analyst
○ Industry-standard SQL (Big SQL)
○ Spreadsheet-style tool (BigSheets)
• IBM BigInsights Data Scientist
○ Text analytics
○ Machine learning on Big R
○ Big R (R support)
• IBM BigInsights Enterprise Management
○ POSIX distributed filesystem
○ Multi-workload, multi-tenant scheduling
• Free Quick Start (non-production)
○ IBM Open Platform
○ BigInsights Analyst and Data Scientist features
○ Community support
Stream Processing Overview
• Analyzing data on the fly vs. storing it
○ Sometimes both have to be done
• Batch vs. stream processing
• Low latency needs special design considerations
• Processing is done on “windows” rather than on tables/dataframes (see the Spark Streaming sketch below)
• Engines differ by architecture, development tools, and latency
Apache Flume
• Agents can be installed on a variety of platforms
• Collectors buffer data and put it to HDFS
• Reliable
• Limited to micro-batch data collection
[Diagram: server agents forward events to a collector, which writes to HDFS]
Spark Streaming
• Micro-batch engine
• Reliable
• Integrated with Spark
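A minimal sketch of window-based processing in Spark Streaming’s Python API (it assumes an existing SparkContext sc and a text source on a hypothetical local socket):

from pyspark.streaming import StreamingContext

# 5-second micro-batches; windowed state requires a checkpoint directory
ssc = StreamingContext(sc, 5)
ssc.checkpoint("checkpoint")

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
          .map(lambda word: (word, 1))
          # Count over a sliding 30-second window, recomputed every 10 seconds
          .reduceByKeyAndWindow(lambda a, b: a + b, lambda a, b: a - b, 30, 10))
counts.pprint()

ssc.start()
ssc.awaitTermination()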
Apache Storm
• Twitter project, now in Apache
• Development in Java
• Bolts and spouts
• Guaranteed record delivery
InfoSphere Streams
• The highest-performing and most sophisticated streaming engine
• Easy-to-use IDE
• Declarative streaming language
• Parallel execution framework
• Many advanced toolkits
○ Video, audio, signal processing, finance, geospatial, integration, etc.
• Integrated with enterprise tools
Data Science Life: Two Main Tasks
1) Exploration: we don’t have any special attribute we want to predict; rather, we want to understand the structure present in the data. Are there clusters? Non-obvious relationships?
   - Also referred to as “unsupervised learning”
   - E.g., k-means clustering (see the sketch below)
   Use cases: understanding categories of customers, cross-selling opportunities, etc.
2) Prediction: the data contains a particular attribute (called the target attribute) and we want to learn how the target attribute depends on the other attributes.
   - Also referred to as “supervised learning”
   - E.g., support vector machines
   Use cases: building a model to predict customer churn, fraud, etc.
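As a hedged illustration of the exploration task, a minimal k-means run with scikit-learn on synthetic data (the data and all parameters are illustrative):

import numpy as np
from sklearn.cluster import KMeans

# Synthetic "customer" data: two numeric features per customer
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),   # one group of customers
               rng.normal(5, 1, (50, 2))])  # another group

# Ask k-means to find 2 clusters in the unlabeled data
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.cluster_centers_)  # discovered cluster centers
print(model.labels_[:10])      # cluster assignment per customer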
Data Science Life: Tools at Present
• SQL (42%)
• R (33%)
• Python (26%)
• Excel (25%)
• Java, Ruby, C++ (17%)
• SPSS, SAS (9%)
Data Science Life: Skillset of the Data Scientist
[Diagram: the data scientist sits at the intersection of the statistician, the software engineer, and the business analyst]
• Process automation, parallel computing, software development, database systems
• Mathematics background, analytic mindset
• Domain expertise, business focus, effective communication
CRISP-DM: Cross-Industry Standard Process for Data Mining
[Diagram: the typical data science workflow]
The Architect: What is Open Source R? What is CRAN?
R is a powerful programming language and environment for statistical computing and graphics.
R offers a rich analytics ecosystem:
• Full analytics life cycle
○ Data exploration
○ Statistical analysis
○ Modeling, machine learning, simulations
○ Visualization
• Highly extensible via user-submitted packages
○ Tap into an innovation pipeline contributed to by highly regarded statisticians
○ Currently 4,700+ statistical packages in the repository
○ Easily accessible via CRAN, the Comprehensive R Archive Network
• R is the fastest-growing data analysis software
○ Deeply knowledgeable and supportive analytics community
○ The most popular software used in data analysis competitions
○ Gaining speed in corporate, government, and academic settings
Big R Architecture
[Diagram: (1) R user interface, (2) scalable data processing, (3) scalable algorithms, alongside native R functions]
User Experience for Big R
• Connect to the BigInsights cluster
• Data frame proxy to a large data file
• Data transformation step
• Run scalable linear regression on the cluster
IBM SystemML
• Collection of distributed algorithms
• Currently embedded in Big R
• Contributed to Spark on 15.06.2015
SPSS on Hadoop
Python for data analysis
• IPython notebooks
• pandas / NumPy
• scikit-learn
• matplotlib
• Python Spark API
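For instance, a minimal pandas sketch of the same aggregation the Pig example computes earlier (the file path and column names are hypothetical):

import pandas as pd

# Hypothetical comma-separated file with name, age, state columns
df = pd.read_csv("data.txt", names=["name", "age", "state"])

# Per-state row count and average age, as in the Pig script
summary = df.groupby("state").agg(count=("name", "size"), avg_age=("age", "mean"))
print(summary)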
Want to learn more?
• Download the Quick Start offering
• Test drive the technologies
• Links all available from HadoopDev
– https://developer.ibm.com/hadoop/
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015