2. Workshop Goal
Let's talk about the 3Vs of Big Data. Hadoop is good for Volume and Variety.
But…
How about Velocity?
This is why we are sitting here…
4. Background Knowledge
• Linux operating system
• Basic Hadoop ecosystem knowledge
• Basic knowledge of SQL
• Java or Python programming experience
5. Terminology
• Hadoop: Open source big data platform
• HDFS: Hadoop Distributed File System
• MapReduce: Parallel computing framework on top of HDFS
• HBase: NoSQL database on top of Hadoop
• Impala: MPP SQL query engine on top of Hadoop
• Spark: In-memory cluster computing engine
• Hive: SQL-to-MapReduce translator
• Hive Metastore: Database that stores table schemas
• HiveQL: A SQL subset
6. Agenda
• What is Hadoop and what’s wrong with Hadoop in real-time?
• What is Impala?
• Hands-on Impala
• What is Spark?
• Hands-on Spark
• Spark and Impala work together
• Q & A
7. What is Hadoop?
Apache Hadoop is an open source platform for data storage and processing that is:
• Distributed
• Fault tolerant
• Scalable
CORE HADOOP SYSTEM COMPONENTS
• HDFS: A fault-tolerant, scalable clustered storage
• MapReduce: A distributed computing framework
Processing Complex Big Data
• Ask questions across structured and unstructured data
• Schema-less
Flexible for Storing and Mining Any Type of Data
• Scale-out architecture divides workloads across nodes
• Flexible file system eliminates ETL bottlenecks
Scales Economically
• Deploy on commodity hardware
• Open source platform
8. Limitations of MapReduce
• Batch oriented
• High latency
• Doesn’t fit all cases
• Only for developers
9. Pig and Hive
• MR is hard and only for developers
• High-level abstractions that convert declarative syntax to MR:
  – SQL: Hive
  – Dataflow language: Pig
• Built on top of MapReduce
10. Goals
• General-purpose SQL engine:
– Works for both analytics and transactional/single-row workloads.
– Supports queries that take from milliseconds to hours.
• Runs directly within Hadoop and:
– Reads widely used Hadoop file formats.
– Runs on same nodes that run Hadoop processes.
• High performance:
– C++ instead of Java
– Runtime code generation
– Completely new execution engine – No MapReduce
11. What is Impala?
• General-purpose SQL engine
• Real-time queries in Apache Hadoop
• Beta released in Oct. 2012
• GA in Apr. 2013
• Apache licensed
• Latest release: v1.4.2
12. Impala Overview
• Distributed service in the cluster: one Impala daemon (impalad) on each data node
• No single point of failure (SPOF)
• User submits a query via ODBC/JDBC, the CLI, or HUE to any of the daemons.
• The query is distributed to all nodes with data locality.
• Uses Hive's metadata interfaces and connects to the Hive metastore.
• Supported file formats:
  – Uncompressed/LZO-compressed text files
  – Sequence files and RCFile with Snappy/gzip compression; Avro
  – Parquet columnar format
13. Impala's SQL
• High compatibility with HiveQL
• SQL support (see the sketch below):
  – Essential SQL-92, minus correlated subqueries
  – INSERT INTO … SELECT …
  – Only equi-joins; no non-equi-joins, no cross products
  – ORDER BY requires LIMIT (no longer required after 1.4.2)
  – Limited DDL support
  – SQL-style authorization via Apache Sentry
  – UDFs and UDAFs are supported
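To make this concrete, here is a minimal sketch of issuing such statements from Python with the impyla client; the host and the orders/customers tables are hypothetical.

from impala.dbapi import connect

# Connect to any impalad (21050 is the default HiveServer2-protocol port).
conn = connect(host='impalad-host', port=21050)
cur = conn.cursor()

# Equi-join plus aggregation, ORDER BY with LIMIT:
cur.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM orders o JOIN customers c ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY total DESC
    LIMIT 10
""")
print(cur.fetchall())

# INSERT INTO ... SELECT ... is supported as well:
cur.execute("INSERT INTO orders_archive SELECT * FROM orders")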
14. Impala's SQL Limitations
• No custom file formats or SerDes
• Nothing beyond SQL (no buckets, samples, transforms, arrays, structs, maps, XPath, JSON)
• Only broadcast joins and partitioned hash joins are supported (smaller tables have to fit in the aggregate memory of all executing nodes)
15. Work with HBase
• Functionality highlights (see the sketch below):
  – Support for SELECT, INSERT INTO … SELECT …, and INSERT INTO … VALUES (…)
  – Predicates on rowkey columns are mapped into start/stop rows
  – Predicates on other columns are mapped into SingleColumnValueFilters
• BUT the mapping between HBase tables and metastore tables is patterned after Hive:
  – All data is stored as scalars and in ASCII.
  – The rowkey needs to be mapped into a single string column.
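A minimal sketch of how those predicate mappings look from the client side, assuming an HBase table already mapped into the metastore as a hypothetical 'events' table whose rowkey is exposed as the string column event_id:

from impala.dbapi import connect

conn = connect(host='impalad-host', port=21050)
cur = conn.cursor()

# The rowkey predicate becomes an HBase start/stop row scan;
# the predicate on 'status' becomes a SingleColumnValueFilter.
cur.execute("""
    SELECT event_id, status, payload
    FROM events
    WHERE event_id = 'evt-00042' AND status = 'ERROR'
""")
print(cur.fetchall())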
16. HBase in Roadmap
• Full support for UPDATE and DELETE.
• Storage of structured data to minimize storage and access overhead.
• Composite rowkey encoding mapped into an arbitrary number of table columns.
18. Impala's Architecture
[Diagram: three impalad nodes, each with a Query Planner, Query Coordinator, and Query Executor, co-located with an HDFS DataNode and HBase; shared services: Hive Metastore, HDFS NameNode, Statestore; a SQL client connects via ODBC.]
1. Request arrives at an impalad via ODBC.
2. Planner turns the request into collections of plan fragments.
3. Coordinator initiates execution on impalad(s) local to the data.
19. Impala's Architecture – Cont.
[Same diagram as above.]
4. Intermediate results are streamed between impalad(s).
5. Query results are streamed back to the client.
20. Metadata Handling
• Impala metadata comes from:
  – Hive's metastore: Logical metadata (table definitions, columns, CREATE TABLE parameters)
  – HDFS NameNode: Directory contents and block replica locations
  – HDFS DataNode: Block replicas' volume IDs
• Caches metadata: No synchronous metastore API calls during query execution
• Impala instances read metadata from the metastore at startup.
• The Catalog Service relays metadata when you run DDL or update metadata on one of the impalads.
21. Metadata Handling – Cont.
• REFRESH [<tbl>]: Reloads the metadata on all impalads (e.g., if you added new files via Hive); see the sketch below.
• INVALIDATE METADATA: Reloads metadata for all tables
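A minimal sketch of issuing these statements from Python with impyla after loading data through Hive; 'web_logs' is a hypothetical table name.

from impala.dbapi import connect

conn = connect(host='impalad-host', port=21050)
cur = conn.cursor()

# Pick up files newly added to one table via Hive:
cur.execute("REFRESH web_logs")

# Or reload metadata for all tables (heavier):
cur.execute("INVALIDATE METADATA")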
22. Comparing Impala to Dremel
• What is Dremel?
  – Columnar storage for data with nested structures
  – Distributed scalable aggregation on top of that
• Columnar storage in Hadoop: Parquet
  – Stores data in appropriate native/binary types
  – Can also store nested structures similar to Dremel's ColumnIO
• Distributed aggregation: Impala
• Impala plus Parquet: A superset of the published version of Dremel (which does not support joins)
23. Comparing Impala to Hive
• Hive: MapReduce as an execution engine
  – High-latency, low-throughput queries
  – Fault-tolerance model based on MapReduce's on-disk checkpointing; materializes all intermediate results
  – Java runtime allows for easy late binding of functionality: file formats and UDFs
  – Extensive layering imposes high runtime overhead
• Impala:
  – Direct, process-to-process data exchange
  – No fault tolerance
  – An execution engine designed for low runtime overhead
24. Impala and Hive
Shares everything client-facing:
• Metadata (table definitions)
• ODBC/JDBC drivers
• SQL syntax (Hive SQL)
• Flexible file formats (TEXT, RCFILE, PARQUET, AVRO, etc.)
• Machine pool
• GUI
• Resource management, data ingestion, and data store (HDFS, HBase; records)
But built for different purposes:
• Hive: SQL syntax with MapReduce as the compute framework; ideal for batch processing
• Impala: SQL syntax plus its own compute framework; a native MPP query engine ideal for interactive SQL
25. Typical Use Cases
• Data Warehouse Offload
• Ad-hoc Analytics
• Provide SQL interoperability to HBase
26. Hands-on Impala
• Query a file on HDFS with Impala
• Query a table on HBase with Impala
27. What is Spark?
• MapReduce Review…
• Apache Spark…
• How Spark Works…
• Fault Tolerance and Performance…
• Examples…
• Spark & More…
28. MapReduce: Good
The Good:
• Built-in fault tolerance
• Optimized IO path
• Scalable
• Developer focuses on Map/Reduce, not infrastructure
• Simple? API
29. MapReduce: Bad
The Bad:
• Optimized for disk IO
  – Does not leverage memory
  – Iterative algorithms go through the disk IO path again and again
• Primitive API
  – Developers have to build on a very simple abstraction
  – Key/Value in/out
  – Even basic things like joins require extensive code
• A common result is many files that must be combined appropriately
30. Apache Spark
• Originally developed in 2009 in UC Berkeley's AMPLab.
• Fully open sourced in 2010; now at the Apache Software Foundation.
31. Spark: Easy and Fast Big Data
• Easy to develop
  – Rich APIs in Java, Scala, and Python
  – Interactive shell
  – 2-5x less code (see the word-count sketch below)
• Fast to run
  – General execution graph
  – In-memory store
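As an illustration of how concise the API is, a minimal PySpark word count; the input path is a placeholder.

from operator import add
from pyspark import SparkContext

sc = SparkContext(appName="wordcount")
counts = (sc.textFile("hdfs:///path/to/input")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(add))
for word, n in counts.take(10):
    print(word, n)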
32. How Spark Works – SparkContext
[Diagram: the driver's SparkContext talks to the Cluster Master; each Spark Worker runs an Executor with a cache that runs tasks, co-located with an HDFS Data Node.]
sc = new SparkContext(...)
rdd = sc.textFile("hdfs://..")
rdd.filter(...)
rdd.cache()
rdd.count()
rdd.map(...)
33. How Spark Works – RDD
RDD (Resilient Distributed Dataset):
• Partitions of data
• Dependency between partitions
• Fault tolerant
• Controlled partitioning to optimize data placement
• Manipulated using a rich set of operators
Storage types (see the sketch below): MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, …
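A minimal sketch of selecting one of these storage types when caching an RDD in PySpark; the input path is a placeholder.

from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="rdd-storage")
rdd = sc.textFile("hdfs:///path/to/input")

# Keep partitions in memory, spilling to disk when they don't fit:
rdd.persist(StorageLevel.MEMORY_AND_DISK)
print(rdd.count())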
34. RDD
• Stands for Resilient Distributed Dataset
• Spark revolves around RDDs
• Fault-tolerant, read-only collection of elements that can be operated on in parallel
• Cached in memory
Reference: http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
35. RDD
• Read-only, partitioned collection of records
[Diagram: an RDD with 3 partitions (D1, D2, D3) flowing through successive operations and collapsing to a single Value.]
• Supports only coarse-grained operations (see the sketch below)
  – e.g., map and group-by transformations, reduce actions
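A minimal sketch of such coarse-grained operations in PySpark: each operation applies the same function to every record of every partition.

from pyspark import SparkContext

sc = SparkContext(appName="coarse-grained")
rdd = sc.parallelize(range(12), 3)        # 3 partitions, as in the diagram

squared = rdd.map(lambda x: x * x)        # transformation, per record
evens = squared.filter(lambda x: x % 2 == 0)
total = evens.reduce(lambda a, b: a + b)  # action: collapses to one Value
print(total)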
43. Actions
• Parallel operations (see the sketch below):
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, …
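Note that many of these (map, filter, join, …) are lazy transformations, while reduce, count, take, etc. are actions that trigger execution. A minimal PySpark sketch:

from pyspark import SparkContext

sc = SparkContext(appName="lazy-vs-eager")
rdd = sc.parallelize(["a", "bb", "ccc", "dd"])

# Transformations only build the lineage graph - nothing runs yet:
lengths = rdd.map(len).filter(lambda n: n > 1)

# Actions force the pipeline to execute:
print(lengths.count())   # -> 3
print(lengths.take(2))   # -> [2, 3]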
44. Stages
textFile → map → map → reduceByKey → collect
DAG (Directed Acyclic Graph): the job above is split into Stage 1 and Stage 2. Each stage is executed as a series of Tasks (one Task for each partition).
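A minimal PySpark sketch of that pipeline; Spark breaks the DAG at reduceByKey's shuffle, giving the two stages above. The input path and line format are placeholders.

from pyspark import SparkContext

sc = SparkContext(appName="stages")
pairs = (sc.textFile("hdfs:///path/to/input")   # Stage 1
           .map(lambda line: line.split(","))   # Stage 1
           .map(lambda f: (f[0], 1))            # Stage 1
           .reduceByKey(lambda a, b: a + b))    # shuffle -> Stage 2
print(pairs.collect())                          # action triggers the job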
45. Tasks
Task is the fundamental unit of execution in Spark.
[Diagram: on each core, every task runs the same pipeline: Fetch Input (from HDFS or an RDD) → Execute Task → Write Output (to HDFS, an RDD, or intermediate shuffle output).]
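Since each stage runs one task per partition, partitioning directly controls parallelism; a minimal sketch:

from pyspark import SparkContext

sc = SparkContext(appName="tasks")
rdd = sc.parallelize(range(1000), 4)  # 4 partitions -> 4 tasks per stage
print(rdd.getNumPartitions())         # -> 4
print(rdd.count())                    # runs as 4 parallel tasks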
47. Comparison
                  MapReduce           Impala                 Spark
Storage           HDFS                HDFS/HBase             HDFS
Scheduler         MapReduce job       Query plan             Computation graph
I/O               Disk                In-memory with cache   In-memory, cache and shared data
Fault tolerance   Duplication and     No fault tolerance     Hash partition and
                  disk I/O                                   auto reconstruction
Iterative         Bad                 Bad                    Good
Shared data       No                  No                     Yes
Streaming         No                  No                     Yes
49. Spark Streaming
• Takes the concept of RDDs and extends it to DStreams
  – Fault-tolerant like RDDs
  – Transformable like RDDs
• Adds new "rolling window" operations (see the sketch below)
  – Rolling averages, etc.
• But keeps everything else!
  – Regular Spark code works in Spark Streaming
  – Can still access HDFS data, etc.
• Example use cases:
  – "On-the-fly" ETL as data is ingested into Hadoop/HDFS
  – Detecting anomalous behavior and triggering alerts
  – Continuous reporting of summary metrics for incoming data
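A minimal sketch of a rolling-window count over a socket stream in PySpark; the host, port, and window/slide durations are placeholders.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-window")
ssc = StreamingContext(sc, 5)                      # 5-second batches
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint") # needed for windowing

lines = ssc.socketTextStream("localhost", 9999)
# Word counts over the last 60 seconds, recomputed every 10 seconds:
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKeyAndWindow(lambda a, b: a + b,  # add new batches
                                     lambda a, b: a - b,  # subtract old ones
                                     60, 10))
counts.pprint()

ssc.start()
ssc.awaitTermination()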
53. Spark SQL
• Spark SQL is one of Spark's components.
  – Executes SQL on Spark
  – Builds SchemaRDDs
• Optimizes the execution plan
• Uses existing Hive metastores, SerDes, and UDFs.
54. Unified Data Access
• Ability to load and query data from a variety of sources.
• SchemaRDDs provide a single interface that efficiently works with structured data, including Hive tables, Parquet files, and JSON.
Query and join different data sources:
sqlCtx.jsonFile("s3n://...").registerAsTable("json")
schema_rdd = sqlCtx.sql("""
  SELECT *
  FROM hiveTable
  JOIN json ...""")
55. Hands-on Spark
• Parse/transform a log on the fly with Spark Streaming
• Aggregate with Spark SQL (Top N)
• Output from Spark to HDFS
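A minimal sketch of the shape of this exercise, using the Spark 1.x-era API seen elsewhere in this deck (inferSchema/registerAsTable); the port, log line format, and output path are placeholders.

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="log-pipeline")
sqlCtx = SQLContext(sc)
ssc = StreamingContext(sc, 10)  # 10-second batches
batch = {"n": 0}                # driver-side batch counter

def top_n(rdd):
    if not rdd.take(1):
        return
    hits = rdd.map(lambda parts: {"url": parts[0], "bytes": int(parts[1])})
    sqlCtx.inferSchema(hits).registerAsTable("hits")
    top = sqlCtx.sql("""SELECT url, COUNT(*) AS n FROM hits
                        GROUP BY url ORDER BY n DESC LIMIT 10""")
    batch["n"] += 1
    top.map(lambda r: "%s\t%d" % (r.url, r.n)) \
       .saveAsTextFile("hdfs:///tmp/topn/batch-%05d" % batch["n"])

# Each line is assumed to look like: "<url> <bytes>"
lines = ssc.socketTextStream("localhost", 9999)
lines.map(lambda l: l.split()).foreachRDD(top_n)

ssc.start()
ssc.awaitTermination()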
56. Spark & Impala work together
[Diagram: multiple data streams feed Spark Streaming on every node; Spark Streaming, Spark, and Impala run side by side on each node, co-located with an HDFS DataNode (DN) and HBase RegionServer (RS).]
• Data stream: click stream, machine data, logs, network traffic, etc.
• On-the-fly processing (Spark Streaming): ETL, transformation, filtering; pattern matching & alerts
• Real-time analytics (Spark): machine learning (recommendation, clustering, …); iterative algorithms
• Near real-time query (Impala): ad-hoc query; reporting
• Long-term data store: batch processing; offline analytics; historical mining