Mastering Impala and Spark 
A Workshop on Real-time Big Data Application Architecture 
Etu Chief Consultant 陳昭宇 
Oct 8, 2014
Workshop Goal 
Let’s talk about the 3Vs in Big Data. 
Hadoop is good for Volume and Variety. 
But… how about Velocity? 
This is why we are sitting here.
Target Audience 
• CTO 
• Architect 
• Software/Application Developer 
• IT
Background Knowledge 
• Linux operating system 
• Basic Hadoop ecosystem knowledge 
• Basic knowledge of SQL 
• Java or Python programming experience
Terminology 
• Hadoop: Open source big data platform 
• HDFS: Hadoop Distributed Filesystem 
• MapReduce: Parallel computing framework on top of 
HDFS 
• HBase: NoSQL database on top of Hadoop 
• Impala: MPP SQL query engine on top of Hadoop 
• Spark: In-memory cluster computing engine 
• Hive: SQL-to-MapReduce translator 
• Hive Metastore: Database that stores table schemas 
• HiveQL: A subset of SQL
Agenda 
• What is Hadoop and what’s wrong with Hadoop in real-time? 
• What is Impala? 
• Hands-on Impala 
• What is Spark? 
• Hands-on Spark 
• Spark and Impala work together 
• Q & A
What is Hadoop ? 
Apache Hadoop is an open source platform for data storage and processing that is: 
• Distributed 
• Fault tolerant 
• Scalable 
CORE HADOOP SYSTEM COMPONENTS 
• HDFS: A fault-tolerant, scalable clustered storage 
• MapReduce: A distributed computing framework 
Flexible for storing and mining any type of data 
• Ask questions across structured and unstructured data 
• Schema-less 
Processing Complex Big Data 
• Scale-out architecture divides workloads across nodes. 
• Flexible file system eliminates ETL bottlenecks. 
Scales Economically 
• Deploy on commodity hardware 
• Open source platform
Limitations of MapReduce 
• Batch oriented 
• High latency 
• Doesn’t fit all cases 
• Only for developers
Pig and Hive 
• MR is hard and only for developers 
• High level abstraction for converting declarative syntax to 
MR 
– SQL – Hive 
– Dataflow language - Pig 
• Built on top of MapReduce
Goals 
• General-purpose SQL engine: 
– Works for both analytics and transactional/single-row workloads. 
– Supports queries that take from milliseconds to hours. 
• Runs directly within Hadoop and: 
– Reads widely used Hadoop file formats. 
– Runs on same nodes that run Hadoop processes. 
• High performance: 
– C++ instead of Java 
– Runtime code generation 
– Completely new execution engine – No MapReduce
What is Impala 
• General-purpose SQL engine 
• Real-time queries in Apache Hadoop 
• Beta version released in Oct. 2012 
• Generally available (GA) since Apr. 2013 
• Apache licensed 
• Latest release v1.4.2
Impala Overview 
• Distributed service in cluster: One impala daemon on each 
data node 
• No SPOF 
• User submits query via ODBC/JDBC, CLI, or HUE to any of the 
daemons. 
• Query is distributed to all nodes with data locality. 
• Uses Hive’s metadata interfaces and connects to Hive 
metastore. 
• Supported file formats: 
– Uncompressed/lzo-compressed text files 
– Sequence files and RCFile with snappy/gzip, Avro 
– Parquet columnar format
Impala’s SQL 
• High compatibility with HiveQL 
• SQL support: 
– Essential SQL-92, minus correlated subqueries 
– INSERT INTO … SELECT … 
– Only equi-joins; no non-equi-joins, no cross products 
– Order By requires Limit (not required after 1.4.2) 
– Limited DDL support 
– SQL-style authorization via Apache Sentry 
– UDFs and UDAFs are supported
Impala’s SQL limitations 
• No custom file formats or SerDes 
• No "beyond SQL" features (buckets, samples, transforms, arrays, structs, maps, XPath, JSON) 
• Broadcast joins and partitioned hash joins supported 
(Smaller tables have to fit in the aggregate memory of all 
executing nodes)
Work with HBase 
• Functionality highlights: 
– Support for SELECT, INSERT INTO…SELECT…, and INSERT INTO … 
VALUES (…) 
– Predicates on rowkey columns are mapped into start/stop rows 
– Predicates on other columns are mapped into SingleColumnValueFilters 
• BUT the mapping between HBase tables and metastore tables is patterned after Hive: 
– All data is stored as scalars and in ASCII. 
– The rowkey needs to be mapped into a single string column.
HBase in Roadmap 
• Full support for UPDATE and DELETE. 
• Storage of structured data to minimize storage and access 
overhead. 
• Composite rowkey encoding mapped into an arbitrary 
number of table columns.
Impala’s Architecture 
[Diagram: each node runs a Query Planner, Query Coordinator, and Query Executor (impalad) alongside an HDFS DataNode and HBase; shared services are the Hive Metastore, HDFS NameNode, and Statestore; a SQL client connects via ODBC.] 
1. Request arrives via ODBC/JDBC/Beeswax/Shell.
Impala’s Architecture 
[Same architecture diagram as above.] 
2. Planner turns request into collections of plan fragments. 
3. Coordinator initiates execution on impalad(s) local to data.
Impala’s Architecture 
[Same architecture diagram as above.] 
4. Intermediate results are streamed between impalad(s). 
5. Query results are streamed back to client.
Metadata Handling 
• Impala metadata 
– Hive’s metastore: Logical metadata (table definitions, columns, 
CREATE TABLE parameters) 
– HDFS NameNode: Directory contents and block replica locations 
– HDFS DataNode: Block replicas’ volume IDs 
• Caches metadata: No synchronous metastore API calls 
during query execution 
• Impala instances read metadata from metastore at startup. 
• Catalog Service relays metadata when you run DDL or 
update metadata on one of the impalads.
Metadata Handling – Cont. 
• REFRESH [<tbl>]: Reloads metadata on all impalads (if you 
added new files via Hive) 
• INVALIDATE METADATA: Reloads metadata for all tables
Comparing Impala to Dremel 
• What is Dremel? 
– Columnar storage for data with nested structures 
– Distributed scalable aggregation on top of that 
• Columnar storage in Hadoop: Parquet 
– Store data in appropriate native/binary types 
– Can also store nested structures similar to Dremel’s ColumnIO 
• Distributed aggregation: Impala 
• Impala plus Parquet: A superset of the published version of Dremel (which does not support joins)
Comparing Impala to Hive 
• Hive: MapReduce as an execution engine 
– High latency, low throughput queries 
– Fault-tolerance model based on MapReduce’s on-disk checkpointing; materializes all intermediate results 
– Java runtime allows for easy late-binding of functionality: file formats 
and UDFs 
– Extensive layering imposes high runtime overhead 
• Impala: 
– Direct, process-to-process data exchange 
– No fault tolerance 
– An execution engine designed for low runtime overhead
Impala and Hive 
Shares Everything Client-Facing 
• Metadata (table definitions) 
• ODBC/JDBC drivers 
• SQL syntax (Hive SQL) 
• Flexible file formats 
• Machine pool 
• GUI 
But Built for Different Purposes 
• Hive: Runs on MapReduce and ideal for batch processing 
• Impala: Native MPP query engine ideal for interactive SQL 
[Diagram: both engines share data ingestion, resource management, and the data store (HDFS and HBase; TEXT, RCFILE, PARQUET, AVRO, etc., plus HBase records); Hive provides SQL syntax on the MapReduce compute framework, while Impala provides both SQL syntax and its own compute framework.]
Typical Use Cases 
• Data Warehouse Offload 
• Ad-hoc Analytics 
• Provide SQL interoperability to HBase
Hands-on Impala 
• Query a file on HDFS with Impala 
• Query a table on HBase with Impala
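For the hands-on part, here is a minimal sketch of what such a query could look like from application code, assuming the Hive JDBC driver is on the classpath (Impala speaks the HiveServer2 protocol), an impalad listens on the default port 21050 without security, and a table named web_logs already exists; the host, connection options, and table/column names are illustrative assumptions, not the workshop's lab material: 

import java.sql.DriverManager 

object ImpalaQueryExample { 
  def main(args: Array[String]): Unit = { 
    // Register the Hive JDBC driver and connect to a (hypothetical) impalad. 
    Class.forName("org.apache.hive.jdbc.HiveDriver") 
    val conn = DriverManager.getConnection("jdbc:hive2://impalad-host:21050/;auth=noSasl") 
    try { 
      val stmt = conn.createStatement() 
      // Query an HDFS- or HBase-backed table exactly as you would in impala-shell or HUE. 
      val rs = stmt.executeQuery( 
        "SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status ORDER BY hits DESC LIMIT 10") 
      while (rs.next()) { 
        println(s"${rs.getString(1)}\t${rs.getLong(2)}") 
      } 
    } finally { 
      conn.close() 
    } 
  } 
}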
What is Spark? 
• MapReduce Review… 
• Apache Spark… 
• How Spark Works… 
• Fault Tolerance and Performance… 
• Examples… 
• Spark & More…
MapReduce: Good 
The Good: 
•Built in fault tolerance 
•Optimized IO path 
•Scalable 
•Developer focuses on Map/Reduce, not infrastructure 
•Simple? API
MapReduce: Bad 
The Bad: 
•Optimized for disk IO 
– Does not leverage memory 
– Iterative algorithms go through disk IO path again and 
again 
•Primitive API 
– Developers have to build on a very simple abstraction 
– Key/Value in/out 
– Even basic things like join require extensive code 
• A common result is many files that must be combined appropriately
Apache Spark 
• Originally developed in 
2009 in UC Berkeley’s AMP 
Lab. 
• Fully open sourced in 2010 
– now at Apache Software 
Foundation.
Spark: Easy and Fast Big Data 
• Easy to Develop 
– Rich APIs in Java, Scala, 
Python 
– Interactive Shell 
• 2-5x less code 
• Fast to Run 
– General execution 
graph 
– In-memory store
How Spark Works – SparkContext 
[Diagram: the SparkContext in the driver talks to the Cluster Master; each Spark Worker runs an Executor with a cache and tasks, colocated with an HDFS Data Node.] 
sc = new SparkContext 
rdd = sc.textFile("hdfs://...") 
rdd.filter(...) 
rdd.cache() 
rdd.count() 
rdd.map(...)
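As a rough, runnable counterpart to the pseudocode above (Spark 1.x Scala API; the application name, HDFS path, and filter condition are placeholders): 

import org.apache.spark.{SparkConf, SparkContext} 

object SparkContextExample { 
  def main(args: Array[String]): Unit = { 
    val conf = new SparkConf().setAppName("SparkContextExample") 
    val sc = new SparkContext(conf) 

    val rdd = sc.textFile("hdfs://namenode:8020/data/input.txt") 
    val errors = rdd.filter(line => line.contains("ERROR")) 
    errors.cache()                               // keep the filtered RDD in executor memory 
    println("error lines: " + errors.count())    // count() is an action and triggers execution 

    sc.stop() 
  } 
}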
How Spark Works – RDD 
RDD (Resilient Distributed Dataset) 
sc = new SparkContext 
rdd = sc.textFile("hdfs://...") 
rdd.filter(...) 
rdd.cache() 
rdd.count() 
rdd.map(...) 
Storage types: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, … 
• Fault tolerant 
• Controlled partitioning to optimize data placement 
• Manipulated using a rich set of operators 
• Partitions of data 
• Dependency between partitions
RDD 
• Stands for Resilient Distributed Datasets 
• Spark revolves around RDDs 
• Fault-tolerant, read-only collection of elements that can be operated on in parallel 
• Cached in memory 
Reference: 
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
RDD 
• Read-only, partitioned collection of records 
[Diagram: one RDD shown as 3 partitions (D1, D2, D3)] 
• Supports only coarse-grained operations 
– e.g. map and group-by transformations, reduce actions 
[Diagram: transformations produce new partitioned RDDs; an action collapses the partitions into a value]
RDD Operations
RDD Operations - Expressive 
• Transformations 
– Creation of a new RDD dataset from an existing one: 
• map, filter, distinct, union, sample, groupByKey, join, 
reduce, etc… 
• Actions 
– Returns a value after running a computation: 
• collect, count, first, takeSample, foreach, etc… 
• Reference 
– http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-operations
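A small sketch of this distinction, assuming the Spark shell (where sc is already defined) and a hypothetical input path: transformations only record lineage, and nothing runs until an action is called. 

// Transformations are lazy: they build a lineage graph but do not execute yet. 
val lines = sc.textFile("hdfs://namenode:8020/data/input.txt")   // hypothetical path 
val words = lines.flatMap(line => line.split(" ")) 
val longWords = words.filter(word => word.length > 5).distinct() 

// Actions trigger execution and return a value to the driver. 
val total = longWords.count() 
val sample = longWords.take(10) 
sample.foreach(println)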
Word Count on Spark 
sparkContext.textFile(“hdfs://…”) RDD[String] 
textFile
Word Count on Spark 
sparkContext.textFile(“hdfs://…”) 
.map(line => line.split(" ")) 
RDD[String] 
RDD[List[String]] 
textFile map
Word Count on Spark 
sparkContext.textFile(“hdfs://…”) 
.map(line => line.split(" ")) 
.map(word => (word, 1)) 
RDD[String] 
RDD[List[String]] 
RDD[(String, Int)] 
textFile map map
Word Count on Spark 
sparkContext.textFile(“hdfs://…”) 
.map(line => line.split(" ")) 
.map(word => (word, 1)) 
.reduceByKey((a, b) => a+b) 
RDD[String] 
RDD[List[String]] 
RDD[(String, Int)] 
RDD[(String, Int)] 
textFile map map reduceByKey 
map
Word Count on Spark 
sparkContext.textFile(“hdfs://…”) 
.map(line => line.split(" ")) 
.map(word => (word, 1)) 
.reduceByKey((a, b) => a+b) 
.collect() 
RDD[String] 
RDD[List[String]] 
RDD[(String, Int)] 
RDD[(String, Int)] 
Array[(String, Int)] 
textFile map map reduceByKey 
map 
collect
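Assembled into a standalone program, the job could look like the sketch below (Spark 1.x). Two adjustments are made for a runnable version: flatMap replaces the .map over the split lines so that individual words, not whole lists, are counted, and the split delimiter is assumed to be a single space; the paths and app name are placeholders. 

import org.apache.spark.{SparkConf, SparkContext} 

object WordCount { 
  def main(args: Array[String]): Unit = { 
    val sc = new SparkContext(new SparkConf().setAppName("WordCount")) 

    val counts = sc.textFile("hdfs://namenode:8020/data/input.txt") 
      .flatMap(line => line.split(" "))      // RDD[String]: one element per word 
      .map(word => (word, 1))                // RDD[(String, Int)] 
      .reduceByKey((a, b) => a + b)          // RDD[(String, Int)]: count per word 

    counts.collect().foreach { case (word, n) => println(s"$word\t$n") } 
    // or: counts.saveAsTextFile("hdfs://namenode:8020/data/wordcount-output") 
    sc.stop() 
  } 
}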
Actions 
• Parallel Operations 
map reduce sample 
filter count take 
groupBy fold first 
sort reduceByKey partitionBy 
union groupByKey mapWith 
join cogroup pipe 
leftOuterJoin cross save 
rightOuterJoin zip ….
Stages 
textFile → map → map → reduceByKey → collect 
Stage 1 | Stage 2 
The job forms a DAG (Directed Acyclic Graph). Each stage is executed as a series of Tasks (one Task for each partition).
Tasks 
Task is the fundamental unit of execution in Spark. 
[Diagram: on each core, tasks pipeline through Fetch Input → Execute Task → Write Output; input comes from HDFS or an RDD, and output goes to HDFS, an RDD, or intermediate shuffle output.]
Spark Summary 
• SparkContext 
• Resilient Distributed Dataset 
• Parallel Operations 
• Shared Variables 
– Broadcast Variables – read-only 
– Accumulators
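A brief sketch of the two shared-variable kinds listed above, assuming Spark 1.x, an existing SparkContext sc, and a hypothetical RDD[String] named lines whose first whitespace-separated field is an HTTP status code: 

// Broadcast variable: read-only lookup table shipped once to each executor. 
val statusNames = sc.broadcast(Map(200 -> "OK", 404 -> "Not Found", 500 -> "Server Error")) 

// Accumulator: executors only add to it; the driver reads the total. 
val badLines = sc.accumulator(0) 

val labelled = lines.map { line => 
  val code = line.split(" ")(0).toInt 
  if (!statusNames.value.contains(code)) badLines += 1 
  (code, statusNames.value.getOrElse(code, "Unknown")) 
} 
labelled.count()   // an action must run before the accumulator holds anything 
println("unrecognized status codes: " + badLines.value)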
Comparison 
                 MapReduce                  Impala                   Spark 
Storage          HDFS                       HDFS/HBase               HDFS 
Scheduler        MapReduce Job              Query Plan               Computation Graph 
I/O              Disk                       In-memory with cache     In-memory, cache and shared data 
Fault Tolerance  Duplication and Disk I/O   No fault tolerance       Hash partition and auto reconstruction 
Iterative        Bad                        Bad                      Good 
Shared data      No                         No                       Yes 
Streaming        No                         No                       Yes
Hands-on Spark 
• Spark Shell 
• Word Count
Spark Streaming 
• Takes the concept of RDDs and extends it to DStreams 
– Fault-tolerant like RDDs 
– Transformable like RDDs 
• Adds new “rolling window” operations 
– Rolling averages, etc.. 
• But keeps everything else! 
– Regular Spark code works in Spark Streaming 
– Can still access HDFS data, etc. 
• Example use cases: 
– “On-the-fly” ETL as data is ingested into Hadoop/HDFS 
– Detecting anomalous behavior and triggering alerts 
– Continuous reporting of summary metrics for incoming data
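A small Spark Streaming sketch in that spirit, assuming Spark 1.x, 10-second micro-batches, and a hypothetical TCP text source; the host, port, and paths are placeholders: 

import org.apache.spark.SparkConf 
import org.apache.spark.streaming.{Seconds, StreamingContext} 
import org.apache.spark.streaming.StreamingContext._ 

object StreamingWordCount { 
  def main(args: Array[String]): Unit = { 
    val conf = new SparkConf().setAppName("StreamingWordCount") 
    val ssc = new StreamingContext(conf, Seconds(10))        // 10-second micro-batches 
    ssc.checkpoint("hdfs://namenode:8020/checkpoints")        // recommended for window/stateful operations 

    // Hypothetical source: raw lines arriving on a TCP socket. 
    val lines = ssc.socketTextStream("stream-host", 9999) 
    val counts = lines.flatMap(_.split(" ")) 
      .map(word => (word, 1)) 
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(10))  // rolling 60s window 

    counts.print() 
    ssc.start() 
    ssc.awaitTermination() 
  } 
}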
How Streaming Works
Micro-batching for on the fly ETL
Window-based Transformation
Spark SQL 
• Spark SQL is one of Spark’s 
components. 
– Executes SQL on Spark 
– Builds SchemaRDD 
• Optimizes execution plan 
• Uses existing Hive metastores, 
SerDes, and UDFs.
Unified Data Access 
• Ability to load and query 
data from a variety of 
sources. 
• SchemaRDDs provide a single interface that efficiently works with structured data, including Hive tables, Parquet files, and JSON. 
sqlCtx.jsonFile("s3n://...") 
  .registerAsTable("json") 
schema_rdd = sqlCtx.sql(""" 
  SELECT * 
  FROM hiveTable 
  JOIN json ...""") 
Query and join different data sources
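The same idea in Scala, as a sketch assuming Spark 1.1+ (SQLContext with SchemaRDDs); the file paths, table names, and join columns are hypothetical: 

import org.apache.spark.{SparkConf, SparkContext} 
import org.apache.spark.sql.SQLContext 

object UnifiedAccessExample { 
  def main(args: Array[String]): Unit = { 
    val sc = new SparkContext(new SparkConf().setAppName("UnifiedAccessExample")) 
    val sqlContext = new SQLContext(sc) 

    // Load two different sources as SchemaRDDs (paths are placeholders). 
    val users  = sqlContext.jsonFile("hdfs://namenode:8020/data/users.json") 
    val events = sqlContext.parquetFile("hdfs://namenode:8020/data/events.parquet") 
    users.registerTempTable("users") 
    events.registerTempTable("events") 

    // Join the JSON and Parquet sources with plain SQL. 
    val joined = sqlContext.sql( 
      "SELECT u.name, COUNT(*) AS cnt FROM users u JOIN events e ON u.id = e.user_id GROUP BY u.name") 
    joined.collect().foreach(println) 
  } 
}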
Hands-on Spark 
• Parse/transform log on the fly with Spark-Streaming 
• Aggregate with Spark SQL (Top N) 
• Output from Spark to HDFS
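One possible shape of this hands-on pipeline, sketched under assumptions: Spark 1.1+, an Apache-style access log arriving over a TCP socket, the request URL as the 7th whitespace-separated field, and hypothetical host, port, and output paths. This is illustrative, not the workshop's actual lab code. 

import org.apache.spark.SparkConf 
import org.apache.spark.sql.SQLContext 
import org.apache.spark.streaming.{Seconds, StreamingContext} 

object LogTopN { 
  case class Hit(ip: String, url: String) 

  def main(args: Array[String]): Unit = { 
    val conf = new SparkConf().setAppName("LogTopN") 
    val ssc = new StreamingContext(conf, Seconds(30)) 
    val sqlContext = new SQLContext(ssc.sparkContext) 
    import sqlContext.createSchemaRDD   // implicit RDD[case class] -> SchemaRDD 

    // Parse each raw log line on the fly (field positions are an assumption). 
    val hits = ssc.socketTextStream("stream-host", 9999) 
      .map(_.split(" ")) 
      .filter(_.length > 6) 
      .map(f => Hit(f(0), f(6))) 

    hits.foreachRDD { rdd => 
      if (rdd.take(1).nonEmpty) { 
        rdd.registerTempTable("hits") 
        // Top N aggregation with Spark SQL, then write the result out to HDFS. 
        val topN = sqlContext.sql( 
          "SELECT url, COUNT(*) AS cnt FROM hits GROUP BY url ORDER BY cnt DESC LIMIT 10") 
        topN.map(r => s"${r(0)}\t${r(1)}") 
          .saveAsTextFile("hdfs://namenode:8020/output/top-urls-" + System.currentTimeMillis()) 
      } 
    } 

    ssc.start() 
    ssc.awaitTermination() 
  } 
}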
Spark & Impala work together 
[Diagram: multiple data streams feed Spark Streaming on each node; Spark, Impala, an HDFS DataNode (DN), and an HBase RegionServer (RS) run together on every node.] 
Data Stream 
– Click stream 
– Machine data 
– Log 
– Network traffic 
– Etc. 
On-the-fly Processing 
– ETL, transformation, filter 
– Pattern matching & alert 
Real-time Analytics 
– Machine learning (recommendation, clustering, …) 
– Iterative algorithms 
Near Real-time Query 
– Ad-hoc query 
– Reporting 
Long term data store 
– Batch process 
– Offline analytics 
– Historical mining
Etu Makes Hadoop Easier 
A fully automated, high-performance, easy-to-manage software appliance 
• Automated bare-metal deployment 
• Performance optimization 
• Full-cluster management 
• The only local Hadoop professional services 
• Mainstream x86 commodity servers
ESA Software Stack 
• Cloudera Manager 
• Etu Manager 
• Etu value-added modules: security management, performance optimization, configuration sync, network management, monitoring and alerting, package management, HA management, rack awareness 
• CentOS operating system (64-bit)
Etu's Position and Value in the Hadoop Ecosystem 
Etu shields the complexity of deploying and operating the Hadoop platform, so customers can put their resources into what differentiates them: 
• Launch services quickly and capture the market 
• Applications and data are the enterprise's core value 
• Allocate resources to that core value and build competitive advantage 
Etu Services cover talent recruiting, team building, application development, data mining, architecture design, deployment and tuning, and operations and management. 
Offerings: Software Appliance, Etu Support, Etu Professional Services, Etu Consulting, Etu Training
Question and Discussion 
Thank you 
318, Rueiguang Rd., Taipei 114, Taiwan 
T: +886 2 7720 1888 
F: +886 2 8798 6069 
www.etusolution.com
