2. Workshop Goal
Let's talk about the 3Vs of Big Data. Hadoop is good for Volume and Variety.
But…
How about Velocity?
This is why we are sitting here…
4. Background Knowledge
• Linux operating system
• Basic Hadoop ecosystem knowledge
• Basic knowledge of SQL
• Java or Python programming experience
5. Terminology
• Hadoop: Open source big data platform
• HDFS: Hadoop Distributed File System
• MapReduce: Parallel computing framework on top of HDFS
• HBase: NoSQL database on top of Hadoop
• Impala: MPP SQL query engine on top of Hadoop
• Spark: In-memory cluster computing engine
• Hive: SQL-to-MapReduce translator
• Hive Metastore: Database that stores table schemas
• HiveQL: A SQL subset
6. Agenda
• What is Hadoop and what’s wrong with Hadoop in real-time?
• What is Impala?
• Hands-on Impala
• What is Spark?
• Hands-on Spark
• Spark and Impala work together
• Q & A
7. What is Hadoop?
Apache Hadoop is an open source platform for data storage and processing that is:
• Distributed
• Fault tolerant
• Scalable
CORE HADOOP SYSTEM COMPONENTS
• HDFS: A fault-tolerant, scalable clustered storage
• MapReduce: A distributed computing framework
Processing Complex Big Data
• Ask questions across structured and unstructured data
• Schema-less
Flexible for Storing and Mining Any Type of Data
• Scale-out architecture divides workloads across nodes
• Flexible file system eliminates ETL bottlenecks
Scales Economically
• Deploy on commodity hardware
• Open source platform
8. Limitations of MapReduce
• Batch oriented
• High latency
• Doesn’t fit all cases
• Only for developers
9. Pig and Hive
• MR is hard and only for developers
• High-level abstractions that convert declarative syntax to MR:
  – SQL: Hive
  – Dataflow language: Pig
• Built on top of MapReduce
10. Goals
• General-purpose SQL engine:
– Works for both analytics and transactional/single-row workloads.
– Supports queries that take from milliseconds to hours.
• Runs directly within Hadoop and:
– Reads widely used Hadoop file formats.
– Runs on same nodes that run Hadoop processes.
• High performance:
– C++ instead of Java
– Runtime code generation
– Completely new execution engine – No MapReduce
11. What is Impala?
• General-purpose SQL engine
• Real-time queries in Apache Hadoop
• Beta released in Oct. 2012
• GA in Apr. 2013
• Apache licensed
• Latest release: v1.4.2
12. Impala Overview
• Distributed service in the cluster: one Impala daemon (impalad) on each data node
• No single point of failure (SPOF)
• User submits a query via ODBC/JDBC, the CLI, or HUE to any of the daemons.
• The query is distributed to all nodes with data locality.
• Uses Hive's metadata interfaces and connects to the Hive metastore.
• Supported file formats:
  – Uncompressed/LZO-compressed text files
  – Sequence files and RCFile with Snappy/gzip compression; Avro
  – Parquet columnar format
13. Impala's SQL
• High compatibility with HiveQL
• SQL support (see the sketch below):
  – Essential SQL-92, minus correlated subqueries
  – INSERT INTO … SELECT …
  – Only equi-joins; no non-equi-joins, no cross products
  – ORDER BY requires LIMIT (no longer required after 1.4.2)
  – Limited DDL support
  – SQL-style authorization via Apache Sentry
  – UDFs and UDAFs are supported
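To make this concrete, here is a minimal sketch of issuing such statements from Python with the impyla client; the host and the orders/customers tables are hypothetical.

from impala.dbapi import connect

# Connect to any impalad (21050 is the default HiveServer2-protocol port).
conn = connect(host='impalad-host', port=21050)
cur = conn.cursor()

# Equi-join plus aggregation, ORDER BY with LIMIT:
cur.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM orders o JOIN customers c ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY total DESC
    LIMIT 10
""")
print(cur.fetchall())

# INSERT INTO ... SELECT ... is supported as well:
cur.execute("INSERT INTO orders_archive SELECT * FROM orders")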
14. Impala's SQL Limitations
• No custom file formats or SerDes
• Nothing beyond SQL (no buckets, samples, transforms, arrays, structs, maps, XPath, JSON)
• Only broadcast joins and partitioned hash joins are supported (smaller tables have to fit in the aggregate memory of all executing nodes)
15. Work with HBase
• Functionality highlights (see the sketch below):
  – Support for SELECT, INSERT INTO … SELECT …, and INSERT INTO … VALUES (…)
  – Predicates on rowkey columns are mapped into start/stop rows
  – Predicates on other columns are mapped into SingleColumnValueFilters
• BUT the mapping between HBase tables and metastore tables is patterned after Hive:
  – All data is stored as scalars and in ASCII.
  – The rowkey needs to be mapped into a single string column.
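A minimal sketch of how those predicate mappings look from the client side, assuming an HBase table already mapped into the metastore as a hypothetical 'events' table whose rowkey is exposed as the string column event_id:

from impala.dbapi import connect

conn = connect(host='impalad-host', port=21050)
cur = conn.cursor()

# The rowkey predicate becomes an HBase start/stop row scan;
# the predicate on 'status' becomes a SingleColumnValueFilter.
cur.execute("""
    SELECT event_id, status, payload
    FROM events
    WHERE event_id = 'evt-00042' AND status = 'ERROR'
""")
print(cur.fetchall())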
16. HBase in Roadmap
• Full support for UPDATE and DELETE.
• Storage of structured data to minimize storage and access overhead.
• Composite rowkey encoding mapped into an arbitrary number of table columns.
18. Impala's Architecture
[Diagram: three impalad nodes, each with a Query Planner, Query Coordinator, and Query Executor, co-located with an HDFS DataNode and HBase; shared services: Hive Metastore, HDFS NameNode, Statestore; a SQL client connects via ODBC.]
1. Request arrives at an impalad via ODBC.
2. Planner turns the request into collections of plan fragments.
3. Coordinator initiates execution on impalad(s) local to the data.
19. Impala's Architecture – Cont.
[Same diagram as above.]
4. Intermediate results are streamed between impalad(s).
5. Query results are streamed back to the client.
20. Metadata Handling
• Impala metadata comes from:
  – Hive's metastore: Logical metadata (table definitions, columns, CREATE TABLE parameters)
  – HDFS NameNode: Directory contents and block replica locations
  – HDFS DataNode: Block replicas' volume IDs
• Caches metadata: No synchronous metastore API calls during query execution
• Impala instances read metadata from the metastore at startup.
• The Catalog Service relays metadata when you run DDL or update metadata on one of the impalads.
21. Metadata Handling – Cont.
• REFRESH [<tbl>]: Reloads the metadata on all impalads (e.g., if you added new files via Hive); see the sketch below.
• INVALIDATE METADATA: Reloads metadata for all tables
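A minimal sketch of issuing these statements from Python with impyla after loading data through Hive; 'web_logs' is a hypothetical table name.

from impala.dbapi import connect

conn = connect(host='impalad-host', port=21050)
cur = conn.cursor()

# Pick up files newly added to one table via Hive:
cur.execute("REFRESH web_logs")

# Or reload metadata for all tables (heavier):
cur.execute("INVALIDATE METADATA")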
22. Comparing Impala to Dremel
• What is Dremel?
  – Columnar storage for data with nested structures
  – Distributed scalable aggregation on top of that
• Columnar storage in Hadoop: Parquet
  – Stores data in appropriate native/binary types
  – Can also store nested structures similar to Dremel's ColumnIO
• Distributed aggregation: Impala
• Impala plus Parquet: A superset of the published version of Dremel (which does not support joins)
23. Comparing Impala to Hive
• Hive: MapReduce as an execution engine
  – High-latency, low-throughput queries
  – Fault-tolerance model based on MapReduce's on-disk checkpointing; materializes all intermediate results
  – Java runtime allows for easy late binding of functionality: file formats and UDFs
  – Extensive layering imposes high runtime overhead
• Impala:
  – Direct, process-to-process data exchange
  – No fault tolerance
  – An execution engine designed for low runtime overhead
24. Impala and Hive
Shares everything client-facing:
• Metadata (table definitions)
• ODBC/JDBC drivers
• SQL syntax (Hive SQL)
• Flexible file formats (TEXT, RCFILE, PARQUET, AVRO, etc.)
• Machine pool
• GUI
• Resource management, data ingestion, and data store (HDFS, HBase; records)
But built for different purposes:
• Hive: SQL syntax with MapReduce as the compute framework; ideal for batch processing
• Impala: SQL syntax plus its own compute framework; a native MPP query engine ideal for interactive SQL
25. Typical Use Cases
• Data Warehouse Offload
• Ad-hoc Analytics
• Provide SQL interoperability to HBase
26. Hands-on Impala
• Query a file on HDFS with Impala
• Query a table on HBase with Impala
27. What is Spark?
• MapReduce Review…
• Apache Spark…
• How Spark Works…
• Fault Tolerance and Performance…
• Examples…
• Spark & More…
28. MapReduce: Good
The Good:
• Built-in fault tolerance
• Optimized IO path
• Scalable
• Developer focuses on Map/Reduce, not infrastructure
• Simple? API
29. MapReduce: Bad
The Bad:
• Optimized for disk IO
  – Does not leverage memory
  – Iterative algorithms go through the disk IO path again and again
• Primitive API
  – Developers have to build on a very simple abstraction
  – Key/Value in/out
  – Even basic things like joins require extensive code
• A common result is many files that must be combined appropriately
30. Apache Spark
• Originally developed in 2009 in UC Berkeley's AMPLab.
• Fully open sourced in 2010; now at the Apache Software Foundation.
31. Spark: Easy and Fast Big Data
• Easy to develop
  – Rich APIs in Java, Scala, and Python
  – Interactive shell
  – 2-5x less code (see the word-count sketch below)
• Fast to run
  – General execution graph
  – In-memory store
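As an illustration of how concise the API is, a minimal PySpark word count; the input path is a placeholder.

from operator import add
from pyspark import SparkContext

sc = SparkContext(appName="wordcount")
counts = (sc.textFile("hdfs:///path/to/input")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(add))
for word, n in counts.take(10):
    print(word, n)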
32. How Spark Works – SparkContext
[Diagram: the driver's SparkContext talks to the Cluster Master; each Spark Worker runs an Executor with a cache that runs tasks, co-located with an HDFS Data Node.]
sc = new SparkContext(...)
rdd = sc.textFile("hdfs://..")
rdd.filter(...)
rdd.cache()
rdd.count()
rdd.map(...)
33. How Spark Works – RDD
RDD (Resilient Distributed Dataset):
• Partitions of data
• Dependency between partitions
• Fault tolerant
• Controlled partitioning to optimize data placement
• Manipulated using a rich set of operators
Storage types (see the sketch below): MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, …
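A minimal sketch of selecting one of these storage types when caching an RDD in PySpark; the input path is a placeholder.

from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="rdd-storage")
rdd = sc.textFile("hdfs:///path/to/input")

# Keep partitions in memory, spilling to disk when they don't fit:
rdd.persist(StorageLevel.MEMORY_AND_DISK)
print(rdd.count())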
34. RDD
• Stands for Resilient Distributed Dataset
• Spark revolves around RDDs
• Fault-tolerant, read-only collection of elements that can be operated on in parallel
• Cached in memory
Reference: http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
35. RDD
• Read-only, partitioned collection of records
[Diagram: an RDD with 3 partitions (D1, D2, D3) flowing through successive operations and collapsing to a single Value.]
• Supports only coarse-grained operations (see the sketch below)
  – e.g., map and group-by transformations, reduce actions
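A minimal sketch of such coarse-grained operations in PySpark: each operation applies the same function to every record of every partition.

from pyspark import SparkContext

sc = SparkContext(appName="coarse-grained")
rdd = sc.parallelize(range(12), 3)        # 3 partitions, as in the diagram

squared = rdd.map(lambda x: x * x)        # transformation, per record
evens = squared.filter(lambda x: x % 2 == 0)
total = evens.reduce(lambda a, b: a + b)  # action: collapses to one Value
print(total)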
43. Actions
• Parallel operations (see the sketch below):
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, …
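Note that many of these (map, filter, join, …) are lazy transformations, while reduce, count, take, etc. are actions that trigger execution. A minimal PySpark sketch:

from pyspark import SparkContext

sc = SparkContext(appName="lazy-vs-eager")
rdd = sc.parallelize(["a", "bb", "ccc", "dd"])

# Transformations only build the lineage graph - nothing runs yet:
lengths = rdd.map(len).filter(lambda n: n > 1)

# Actions force the pipeline to execute:
print(lengths.count())   # -> 3
print(lengths.take(2))   # -> [2, 3]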
44. Stages
textFile → map → map → reduceByKey → collect
DAG (Directed Acyclic Graph): the job above is split into Stage 1 and Stage 2. Each stage is executed as a series of Tasks (one Task for each partition).
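A minimal PySpark sketch of that pipeline; Spark breaks the DAG at reduceByKey's shuffle, giving the two stages above. The input path and line format are placeholders.

from pyspark import SparkContext

sc = SparkContext(appName="stages")
pairs = (sc.textFile("hdfs:///path/to/input")   # Stage 1
           .map(lambda line: line.split(","))   # Stage 1
           .map(lambda f: (f[0], 1))            # Stage 1
           .reduceByKey(lambda a, b: a + b))    # shuffle -> Stage 2
print(pairs.collect())                          # action triggers the job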
45. Tasks
Task is the fundamental unit of execution in Spark.
[Diagram: on each core, every task runs the same pipeline: Fetch Input (from HDFS or an RDD) → Execute Task → Write Output (to HDFS, an RDD, or intermediate shuffle output).]
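Since each stage runs one task per partition, partitioning directly controls parallelism; a minimal sketch:

from pyspark import SparkContext

sc = SparkContext(appName="tasks")
rdd = sc.parallelize(range(1000), 4)  # 4 partitions -> 4 tasks per stage
print(rdd.getNumPartitions())         # -> 4
print(rdd.count())                    # runs as 4 parallel tasks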
47. Comparison
                  MapReduce           Impala                 Spark
Storage           HDFS                HDFS/HBase             HDFS
Scheduler         MapReduce job       Query plan             Computation graph
I/O               Disk                In-memory with cache   In-memory, cache and shared data
Fault tolerance   Duplication and     No fault tolerance     Hash partition and
                  disk I/O                                   auto reconstruction
Iterative         Bad                 Bad                    Good
Shared data       No                  No                     Yes
Streaming         No                  No                     Yes
49. Spark Streaming
• Takes the concept of RDDs and extends it to DStreams
  – Fault-tolerant like RDDs
  – Transformable like RDDs
• Adds new "rolling window" operations (see the sketch below)
  – Rolling averages, etc.
• But keeps everything else!
  – Regular Spark code works in Spark Streaming
  – Can still access HDFS data, etc.
• Example use cases:
  – "On-the-fly" ETL as data is ingested into Hadoop/HDFS
  – Detecting anomalous behavior and triggering alerts
  – Continuous reporting of summary metrics for incoming data
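A minimal sketch of a rolling-window count over a socket stream in PySpark; the host, port, and window/slide durations are placeholders.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-window")
ssc = StreamingContext(sc, 5)                      # 5-second batches
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint") # needed for windowing

lines = ssc.socketTextStream("localhost", 9999)
# Word counts over the last 60 seconds, recomputed every 10 seconds:
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKeyAndWindow(lambda a, b: a + b,  # add new batches
                                     lambda a, b: a - b,  # subtract old ones
                                     60, 10))
counts.pprint()

ssc.start()
ssc.awaitTermination()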
53. Spark SQL
• Spark SQL is one of Spark's components.
  – Executes SQL on Spark
  – Builds SchemaRDDs
• Optimizes the execution plan
• Uses existing Hive metastores, SerDes, and UDFs.
54. Unified Data Access
• Ability to load and query data from a variety of sources.
• SchemaRDDs provide a single interface that efficiently works with structured data, including Hive tables, Parquet files, and JSON.
Query and join different data sources:
sqlCtx.jsonFile("s3n://...").registerAsTable("json")
schema_rdd = sqlCtx.sql("""
  SELECT *
  FROM hiveTable
  JOIN json ...""")
55. Hands-on Spark
• Parse/transform a log on the fly with Spark Streaming
• Aggregate with Spark SQL (Top N)
• Output from Spark to HDFS
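A minimal sketch of the shape of this exercise, using the Spark 1.x-era API seen elsewhere in this deck (inferSchema/registerAsTable); the port, log line format, and output path are placeholders.

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="log-pipeline")
sqlCtx = SQLContext(sc)
ssc = StreamingContext(sc, 10)  # 10-second batches
batch = {"n": 0}                # driver-side batch counter

def top_n(rdd):
    if not rdd.take(1):
        return
    hits = rdd.map(lambda parts: {"url": parts[0], "bytes": int(parts[1])})
    sqlCtx.inferSchema(hits).registerAsTable("hits")
    top = sqlCtx.sql("""SELECT url, COUNT(*) AS n FROM hits
                        GROUP BY url ORDER BY n DESC LIMIT 10""")
    batch["n"] += 1
    top.map(lambda r: "%s\t%d" % (r.url, r.n)) \
       .saveAsTextFile("hdfs:///tmp/topn/batch-%05d" % batch["n"])

# Each line is assumed to look like: "<url> <bytes>"
lines = ssc.socketTextStream("localhost", 9999)
lines.map(lambda l: l.split()).foreachRDD(top_n)

ssc.start()
ssc.awaitTermination()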
56. Spark & Impala work together
[Diagram: multiple data streams feed Spark Streaming on every node; Spark Streaming, Spark, and Impala run side by side on each node, co-located with an HDFS DataNode (DN) and HBase RegionServer (RS).]
• Data stream: click stream, machine data, logs, network traffic, etc.
• On-the-fly processing (Spark Streaming): ETL, transformation, filtering; pattern matching & alerts
• Real-time analytics (Spark): machine learning (recommendation, clustering, …); iterative algorithms
• Near real-time query (Impala): ad-hoc query; reporting
• Long-term data store: batch processing; offline analytics; historical mining