Spark is a general-purpose computational framework that provides more flexibility than MapReduce. It leverages distributed memory and uses directed acyclic graphs for data-parallel computations while retaining MapReduce properties like scalability, fault tolerance, and data locality. Cloudera has embraced Spark and is working to integrate it into its Hadoop ecosystem through projects like Hive on Spark and optimizations in Spark Core, MLlib, and Spark Streaming. Cloudera positions Spark as the future general-purpose framework for Hadoop, while other specialized frameworks may still be needed for tasks like SQL, search, and graphs.
In this talk, we present a comprehensive framework for assessing the correctness, stability, and performance of the Spark SQL engine. Apache Spark is one of the most actively developed open source projects, with more than 1200 contributors from all over the world. At this scale and pace of development, mistakes are bound to happen. To automatically identify correctness issues and performance regressions, we have built a testing pipeline that consists of two complementary stages: randomized testing and benchmarking.
Randomized query testing aims to extend the coverage of typical unit-test suites, while we use micro and application-like benchmarks to measure new features and make sure existing ones do not regress. We will discuss the various approaches we take, including random query generation, random data generation, random fault injection, and longevity stress tests. We will demonstrate the effectiveness of the framework by highlighting several correctness issues we have found through random query generation and critical performance regressions we were able to diagnose within hours thanks to our automated benchmarking tools.
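To make the random-query-generation idea concrete, here is a minimal sketch (not the actual Databricks harness): generate random filter predicates over a small table and cross-check the results of each query under two engine configurations, with whole-stage code generation enabled and disabled. The table, predicate shapes, and configuration pair are illustrative assumptions.

import random

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("random-query-fuzzing").getOrCreate()
spark.range(0, 1000).createOrReplaceTempView("t")

def random_predicate():
    # Random comparison over the single integer column of the test table.
    op = random.choice(["<", "<=", ">", ">=", "=", "!="])
    return "id {} {}".format(op, random.randint(0, 1000))

for _ in range(100):
    query = "SELECT id FROM t WHERE {} ORDER BY id".format(random_predicate())

    spark.conf.set("spark.sql.codegen.wholeStage", "true")
    with_codegen = [row.id for row in spark.sql(query).collect()]

    spark.conf.set("spark.sql.codegen.wholeStage", "false")
    without_codegen = [row.id for row in spark.sql(query).collect()]

    # Any mismatch points at a correctness bug in one of the code paths.
    assert with_codegen == without_codegen, "result mismatch for: " + query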
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic... (Databricks)
Physicists at CERN are increasingly turning to Spark to process large physics datasets in a distributed fashion, with the aim of reducing time-to-physics with increased interactivity. The physics data itself is stored in CERN’s mass storage system, EOS, and CERN’s IT department runs an on-premises private cloud based on OpenStack to provide on-demand compute resources to physicists. This presents both an opportunity and a challenge for the Big Data team at CERN to provide elastic, scalable, reliable Spark-as-a-service on OpenStack.
The talk focuses on the design choices made and challenges faced while developing Spark-as-a-service over Kubernetes on OpenStack to simplify provisioning, automate management, and minimize the operating burden of managing Spark clusters. In addition, the service tooling simplifies submitting applications on behalf of users, mounting user-specified ConfigMaps, copying application logs to S3 buckets for troubleshooting, performance analysis and accounting of Spark applications, and supporting stateful Spark Streaming applications. We will also share results from running large-scale sustained workloads over terabytes of physics data.
Integrating Existing C++ Libraries into PySpark with Esther Kundin (Databricks)
Bloomberg’s Machine Learning/Text Analysis team has developed many machine learning libraries for fast real-time sentiment analysis of incoming news stories. These models were developed using smaller training sets, implemented in C++ for minimal latency, and are currently running in production. To facilitate backtesting our production models across our full data set, we needed to be able to parallelize our workloads, while using the actual production code.
We also wanted to integrate the C++ code with PySpark and use it to run our models. In this talk, I will discuss some of the challenges we faced, decisions we made, and other options when dealing with integrating existing C++ code into a Spark system. The techniques we developed have been used successfully by our team multiple times and I am sure others will benefit from the gotchas that we were able to identify.
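One common integration route, sketched below and not necessarily the one Bloomberg chose, is to wrap the existing shared library with ctypes and call it from a PySpark UDF. The library name "libsentiment.so" and its score() function are hypothetical placeholders; in a cluster the .so would need to be shipped to executors (for example via --files) or preinstalled.

import ctypes

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("cpp-udf").getOrCreate()

def score_story(text):
    # Load the library inside the function so each executor process loads its own copy.
    lib = ctypes.CDLL("libsentiment.so")            # hypothetical C++ library
    lib.score.argtypes = [ctypes.c_char_p]          # assumed C signature: double score(const char*)
    lib.score.restype = ctypes.c_double
    return float(lib.score(text.encode("utf-8")))

score_udf = udf(score_story, DoubleType())

stories = spark.createDataFrame([("Shares rally on strong earnings",)], ["headline"])
stories.withColumn("sentiment", score_udf("headline")).show()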
700 Updatable Queries Per Second: Spark as a Real-Time Web Service (Evan Chan)
700 Updatable Queries Per Second: Spark as a Real-Time Web Service. Find out how to use Apache Spark with FiloDB for low-latency queries - something you never thought possible with Spark. Scale it down, not just scale it up!
Mobility insights at Swisscom - Understanding collective mobility in Switzerland (François Garillot)
Swisscom is the leading mobile-service provider in Switzerland, with a market share high enough to enable us to model and understand the collective mobility in every area of the country. To accomplish that, we built an urban planning tool that helps cities better manage their infrastructure based on data-driven insights, produced with Apache Spark, YARN, Kafka and a good dose of machine learning.
In this talk, we will explain how building such a tool involves mining a massive amount of raw data (1.5E9 records/day) to extract fine-grained mobility features from raw network traces. These features are obtained using different machine learning algorithms. For example, we built an algorithm that segments a trajectory into mobile and static periods and trained classifiers that enable us to distinguish between different means of transport. As we sketch the different algorithmic components, we will present our approach to continuously run and test them, which involves complex pipelines managed with Oozie and fuelled with ground truth data.
Finally, we will delve into the streaming part of our analytics and see how network events allow Swisscom to understand the characteristics of the flow of people on roads and paths of interest. This requires making a link between network coverage information and geographical positioning in the space of milliseconds and using Spark Streaming with libraries that were originally designed for batch processing. We will conclude on the advantages and pitfalls of Spark involved in running this kind of pipeline on a multi-tenant cluster. Audiences should come away from this talk with an overall picture of the use of Apache Spark and related components of its ecosystem in the field of trajectory mining.
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters (DataWorks Summit)
In recent releases, TensorFlow has been enhanced for distributed learning and HDFS access. Outside of the Google cloud, however, users still needed a dedicated cluster for TensorFlow applications. There are several community projects wiring TensorFlow onto Apache Spark clusters. Unfortunately, they support only synchronous distributed learning and don’t allow TensorFlow servers to communicate with each other directly.
In this talk, we will introduce a new framework, TensorFlowOnSpark, for scalable TensorFlow learning, which will be open sourced in Q1 2017. This new framework enables easy experimentation for algorithm designs, and supports scalable training & inferencing on Spark clusters. It supports all TensorFlow functionalities including synchronous & asynchronous learning, model & data parallelism, and TensorBoard. It provides architectural flexibility for data ingestion to TensorFlow and network protocols for server-to-server communication. With a few lines of code changes, an existing TensorFlow algorithm can be transformed into a scalable application.
Apache Spark on K8S Best Practice and Performance in the Cloud (Databricks)
As of Spark 2.3, Spark can run on clusters managed by Kubernetes. We will describe best practices for running Spark SQL on Kubernetes on Tencent Cloud, including how to deploy Kubernetes on a public cloud platform to maximize resource utilization and how to tune Spark configurations to take advantage of the Kubernetes resource manager for best performance. To evaluate performance, the TPC-DS benchmark will be used to analyze the performance impact of queries across configuration sets.
Speakers: Junjie Chen, Junping Du
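For orientation, here is a minimal configuration sketch for pointing a Spark application at a Kubernetes cluster (Spark 2.3+). The API server URL, container image, namespace, and service account are placeholders, not values from the talk, and the TPC-DS runs themselves are driven separately.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.example.com:6443")    # Kubernetes API server (placeholder)
    .appName("tpcds-on-k8s")
    .config("spark.kubernetes.container.image", "example/spark:2.4.0")    # placeholder image
    .config("spark.kubernetes.namespace", "spark-jobs")
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    .config("spark.executor.instances", "8")                # executors created as pods on the cluster
    .getOrCreate()
)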
Apache Spark 2.4 Bridges the Gap Between Big Data and Deep Learning (DataWorks Summit)
Big data and AI are joined at the hip: AI applications require massive amounts of training data to build state-of-the-art models. The problem is, big data frameworks like Apache Spark and distributed deep learning frameworks like TensorFlow don’t play well together due to the disparity between how big data jobs are executed and how deep learning jobs are executed.
Apache Spark 2.4 introduced a new scheduling primitive: barrier scheduling. Users can tell Spark whether a stage of the pipeline should run in MapReduce mode or barrier mode, which makes it easy to embed distributed deep learning training as a Spark stage and simplify the training workflow. In this talk, I will demonstrate step by step how to build a real-world pipeline that combines data processing with Spark and deep learning training with TensorFlow. I will also share best practices and hands-on experience to show the power of this new feature, and open up more discussion on this topic.
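A minimal sketch of the barrier primitive itself; the training body is a placeholder (a real job would launch a distributed TensorFlow worker in each task), and the cluster must have at least as many free slots as barrier tasks, since they are scheduled all-or-nothing.

from pyspark import BarrierTaskContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("barrier-demo").getOrCreate()

def train_partition(iterator):
    ctx = BarrierTaskContext.get()
    ctx.barrier()                                    # all tasks in the stage reach this point together
    workers = [info.address for info in ctx.getTaskInfos()]
    # ... launch one distributed-training worker per task across `workers` here ...
    yield (ctx.partitionId(), len(workers))

# Unlike a regular MapReduce-style stage, barrier mode launches all 4 tasks at once (or none).
rdd = spark.sparkContext.parallelize(range(8), 4)
print(rdd.barrier().mapPartitions(train_partition).collect())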
Speeding Up Spark with Data Compression on Xeon+FPGA with David Ojika (Databricks)
Data compression is a key aspect in big data processing frameworks, such as Apache Hadoop and Spark, because compression enables the size of the input, shuffle and output data to be reduced, thus potentially speeding up overall processing time by orders of magnitude, especially for large-scale systems. However, since many compression algorithms with good compression ratio are also very CPU-intensive, developers are often forced to use algorithms that are less CPU-intensive at the cost of reduced compression ratio.
In this session, you’ll learn about a field-programmable gate array (FPGA)-based approach for accelerating data compression in Spark. By opportunistically offloading compute-heavy compression tasks to the FPGA, the CPU is freed to perform other tasks, resulting in improved overall performance for end-user applications. In contrast to existing GPU methods for acceleration, this approach offers better performance and energy efficiency, which can translate to significant savings in power and cooling costs, especially for large datacenters. In addition, this implementation offers the benefit of reconfigurability, allowing the FPGA to be rapidly reprogrammed with a different algorithm to meet system or user requirements.
Using the Intel Xeon+FPGA platform, Ojika will share how they ported Swif (simplified workload-intuitive framework) to Spark, and the method used to enable an end-to-end, FPGA-aware Spark deployment. Swif is an in-house framework developed to democratize and simplify the deployment of FPGAs in heterogeneous datacenters. Using Swif’s application programming interface (API), he’ll describe how system architects and software developers can seamlessly integrate FPGAs into their Spark workflow, and in particular, deploy FPGA-based compression schemes that achieve improved performance compared to software-only approaches. In general, Swif’s software stack, along with the underlying Xeon+FPGA hardware platform, provides a workload-centric processing environment that streamlines the process of offloading CPU-intensive tasks to shared FPGA resources, while providing improved system throughput and high resource utilization.
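For context, this is where compression plugs into Spark: shuffle, broadcast, and spill compression are governed by spark.io.compression.codec, and a hardware-backed codec would be supplied as a custom implementation of that codec interface. A hedged sketch follows; the FPGA codec class name is hypothetical.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("compression-demo")
    # Built-in codecs: lz4 (default), lzf, snappy, zstd.
    .config("spark.io.compression.codec", "zstd")
    # An FPGA-backed codec would be plugged in as a custom CompressionCodec implementation;
    # the class name below is hypothetical.
    # .config("spark.io.compression.codec", "com.example.FpgaGzipCodec")
    .config("spark.shuffle.compress", "true")        # shuffle data compression (on by default)
    .getOrCreate()
)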
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr... (Databricks)
You will learn how CERN has implemented an Apache Spark-based data pipeline to support deep learning research work in High Energy Physics (HEP). HEP is a data-intensive domain. For example, the amount of data flowing through the online systems at LHC experiments is currently of the order of 1 PB/s, with particle collision events happening every 25 ns. Filtering is applied before storing data for later processing.
Improvements in the accuracy of the online event filtering system are key to optimizing the usage and cost of compute and storage resources. A novel prototype of an event filtering system based on a classifier trained using deep neural networks has recently been proposed. This presentation covers how we implemented the data pipeline to train the neural network classifier using solutions from the Apache Spark and Big Data ecosystem, integrated with tools, software, and platforms familiar to scientists and data engineers at CERN. Data preparation and feature engineering make use of PySpark, Spark SQL, and Python code run via Jupyter notebooks.
We will discuss key integrations and libraries that make Apache Spark able to ingest data stored using HEP data format (ROOT) and the integration with CERN storage and compute systems. You will learn about the neural network models used, defined using the Keras API, and how the models have been trained in a distributed fashion on Spark clusters using BigDL and Analytics Zoo. We will discuss the implementation and results of the distributed training, as well as the lessons learned.
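To make the pipeline shape concrete, an illustrative sketch (not CERN's actual code): feature preparation in PySpark, then training a small Keras classifier on the collected features. The file path, column names, and the collect-to-driver step are simplifying assumptions; the talk itself trains in a distributed fashion with BigDL and Analytics Zoo.

import numpy as np
import tensorflow as tf
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hep-pipeline-sketch").getOrCreate()

# Feature preparation in Spark; the path and column names are placeholders.
events = spark.read.parquet("/data/filtered_events.parquet")
features = (events
            .withColumn("label", (F.col("event_type") == "signal").cast("double"))
            .select("feature_vector", "label"))

# For a small sketch we collect to the driver; at scale the training itself is distributed.
pdf = features.toPandas()
X = np.stack(pdf["feature_vector"].values)
y = pdf["label"].values

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(X.shape[1],)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=256)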
Elastify Cloud-Native Spark Application with Persistent Memory (Databricks)
Cloud-native deployment has become one of the major trends for large-scale Big Data analytics. Compared to an on-premises data center, the cloud offers much stronger scalability and higher elasticity to Big Data applications. However, the cloud is also considered less performant than on-premises alternatives due to virtualization and cluster resource disaggregation. We present a new cloud-native Spark application architecture backed by persistent memory technology. The key ingredient of this architecture is a novel acceleration engine that uses Intel's 3D XPoint technology as external memory. We discuss how the performance of multiple aspects of data processing can be improved using this new architecture. As a key takeaway, the audience will gain an understanding of the benefits of the latest persistent memory technology and how such new technology could be leveraged in a cloud data processing architecture.
In this tutorial we will present Koalas, a new open source project that we announced at the Spark + AI Summit in April. Koalas is an open-source Python package that implements the pandas API on top of Apache Spark, to make the pandas API scalable to big data. Using Koalas, data scientists can make the transition from a single machine to a distributed environment without needing to learn a new framework.
We will demonstrate Koalas’ new functionalities since its initial release, discuss its roadmaps, and how we think Koalas could become the standard API for large scale data science.
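As a taste of what the hands-on portion covers, a minimal sketch of the pandas-to-Koalas transition (the DataFrame contents and path are illustrative):

import pandas as pd
import databricks.koalas as ks

pdf = pd.DataFrame({"city": ["SF", "NY", "SF"], "amount": [10.0, 20.0, 5.0]})

# Same pandas-style API, but backed by Spark: operations run distributed rather than on one machine.
kdf = ks.from_pandas(pdf)
print(kdf.groupby("city")["amount"].sum().sort_index())

# Koalas can also read data straight into a distributed frame (path is a placeholder).
# kdf = ks.read_csv("/data/transactions.csv")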
What you will learn:
How to get started with Koalas
Easy transition from Pandas to Koalas on Apache Spark
Similarities between Pandas and Koalas APIs for DataFrame transformation and feature engineering
Single machine Pandas vs distributed environment of Koalas
Prerequisites:
A fully-charged laptop (8-16GB memory) with Chrome or Firefox
Python 3 and pip pre-installed
pip install koalas from PyPI
Pre-register for Databricks Community Edition
Read koalas docs
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop (MapR Technologies)
http://bit.ly/1BTaXZP – Apache Spark is currently one of the most active projects in the Hadoop ecosystem, and as such, there’s been plenty of hype about it in recent months. But how much of the discussion is marketing spin? And what are the facts? MapR and Databricks, the company that created and led the development of the Spark stack, will cut through the noise to uncover practical advantages for having the full set of Spark technologies at your disposal and reveal the benefits of running Spark on Hadoop.
This presentation was given at a webinar hosted by Data Science Central and co-presented by MapR + Databricks.
To see the webinar, please go to: http://www.datasciencecentral.com/video/let-spark-fly-advantages-and-use-cases-for-spark-on-hadoop
Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling (Databricks)
Methods that scale with available computation are the future of AI. Distributed deep learning is one such method that enables data scientists to massively increase their productivity by (1) running parallel experiments over many devices (GPUs/TPUs/servers) and (2) massively reducing training time by distributing the training of a single network over many devices. Apache Spark is a key enabling platform for distributed deep learning, as it enables different deep learning frameworks to be embedded in Spark workflows in a secure end-to-end pipeline. In this talk, we examine the different ways in which Tensorflow can be included in Spark workflows to build distributed deep learning applications.
We will analyse the different frameworks for integrating Spark with TensorFlow, from Horovod to TensorFlowOnSpark to Databricks’ Deep Learning Pipelines. We will also look at where you will find the bottlenecks when training models (in your frameworks, the network, GPUs, and with your data scientists) and how to get around them. We will look at how to use the Spark Estimator model to perform hyper-parameter optimization with Spark/TensorFlow and model-architecture search, where Spark executors perform experiments in parallel to automatically find good model architectures.
The talk will include a live demonstration of training and inference for a Tensorflow application embedded in a Spark pipeline written in a Jupyter notebook on the Hops platform. We will show how to debug the application using both Spark UI and Tensorboard, and how to examine logs and monitor training. The demo will be run on the Hops platform, currently used by over 450 researchers and students in Sweden, as well as at companies such as Scania and Ericsson.
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable... (Sumeet Singh)
Building a real-time monitoring service that handles millions of custom events per second while satisfying complex rules, varied throughput requirements, and numerous dimensions simultaneously is a complex endeavor. Sumeet Singh and Mridul Jain explain how Yahoo approached these challenges with Apache Storm Trident, Kafka, HBase, and OpenTSDB and discuss the lessons learned along the way.
Sumeet and Mridul explain scaling patterns backed by real scenarios and data to help attendees develop their own architectures and strategies for dealing with the scale challenges that come with real-time big data systems. They also explore the tradeoffs made in catering to a diverse set of daily users and the associated usability challenges that motivated Yahoo to build a self-serve, easy-to-use platform that requires minimal programming experience. Sumeet and Mridul then discuss event-level tracking for debugging and troubleshooting problems that our users may encounter at this scale. Over the course of their talk, they also address building infrastructure and operational intelligence with anomaly detection, alert correlation, and trend analysis based on the monitoring platform.
Mobius talk at the Seattle Spark Meetup (Feb 2016). Mobius adds a C# language binding to Apache Spark, enabling the implementation of Spark driver code and data processing operations in C#. More info @ https://github.com/Microsoft/Mobius. Tweet to @MobiusForSpark.
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming (Paco Nathan)
London Spark Meetup 2014-11-11 @Skimlinks
http://www.meetup.com/Spark-London/events/217362972/
To paraphrase the immortal crooner Don Ho: "Tiny Batches, in the wine, make me happy, make me feel fine." http://youtu.be/mlCiDEXuxxA
Apache Spark provides support for streaming use cases, such as real-time analytics on log files, by leveraging a model called discretized streams (D-Streams). These "micro-batch" computations operate on small time intervals, generally from 500 milliseconds up. One major innovation of Spark Streaming is that it leverages a unified engine. In other words, the same business logic can be used across multiple use cases: streaming, but also interactive, iterative, machine learning, etc.
This talk will compare case studies for production deployments of Spark Streaming, emerging design patterns for integration with popular complementary OSS frameworks, plus some of the more advanced features such as approximation algorithms, and take a look at what's ahead — including the new Python support for Spark Streaming that will be in the upcoming 1.2 release.
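A minimal D-Stream word count using the Python API the talk previews (it landed in Spark 1.2); the socket source and port are placeholders.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, batchDuration=1)        # 1-second micro-batches (D-Streams)

lines = ssc.socketTextStream("localhost", 9999)    # placeholder source: text over a socket
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                    # print the counts of each micro-batch

ssc.start()
ssc.awaitTermination()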
Also, let's chat a bit about the new Databricks + O'Reilly developer certification for Apache Spark…
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ... (DataStax Academy)
This session covers our experience with using the Spark and Shark frameworks for running real-time queries on top of Cassandra data. We will start by surveying the current Cassandra analytics landscape, including Hadoop and Hive, and touch on the use of custom input formats to extract data from Cassandra. We will then dive into Spark and Shark, two memory-based cluster computing frameworks, and how they enable often dramatic improvements in query speed and productivity over the standard solutions today.
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft (Chester Chen)
Talk 1. Scaling Apache Spark on Kubernetes at Lyft
As part of its mission, Lyft invests heavily in open source infrastructure and tooling. At Lyft, Kubernetes has emerged as the next generation of cloud-native infrastructure to support a wide variety of distributed workloads. Apache Spark at Lyft has evolved to serve both machine learning and large-scale ETL workloads. By combining the flexibility of Kubernetes with the data processing power of Apache Spark, Lyft is able to drive ETL data processing to a different level. In this talk, we will cover the challenges the Lyft team faced and the solutions they developed to support Apache Spark on Kubernetes in production and at scale. Topics include:
- Key traits of Apache Spark on Kubernetes.
- Deep dive into Lyft's multi-cluster setup and operability to handle petabytes of production data.
- How Lyft extends and enhances Apache Spark to support capabilities such as Spark pod life cycle metrics and state management, resource prioritization, and queuing and throttling.
- Dynamic job scale estimation and runtime dynamic job configuration.
- How Lyft powers internal Data Scientists, Business Analysts, and Data Engineers via a multi-cluster setup.
Speaker: Li Gao
Li Gao is the tech lead of the cloud-native Spark compute initiative at Lyft. Prior to Lyft, Li worked at Salesforce, Fitbit, Marin Software, and a few startups in various technical leadership positions on cloud-native and hybrid-cloud data platforms at scale. Besides Spark, Li has scaled and productionized other open source projects, such as Presto, Apache HBase, Apache Phoenix, Apache Kafka, Apache Airflow, Apache Hive, and Apache Cassandra.
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015 (Mike Broberg)
Use Apache Spark Streaming with IBM Watson on Bluemix to perform sentiment analysis and track how a conversation is trending on Twitter.
By David Taieb: https://twitter.com/DTAIEB55
Video: https://youtu.be/KLc_wazud3s
Tutorial: https://developer.ibm.com/clouddataservices/sentiment-analysis-of-twitter-hashtags/
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met... (Databricks)
This talk is about methods and tools for troubleshooting Spark workloads at scale and is aimed at developers, administrators and performance practitioners. You will find examples illustrating the importance of using the right tools and right methodologies for measuring and understanding performance, in particular highlighting the importance of using data and root cause analysis to understand and improve the performance of Spark applications. The talk has a strong focus on practical examples and on tools for collecting data relevant for performance analysis. This includes tools for collecting Spark metrics and tools for collecting OS metrics. Among others, the talk will cover sparkMeasure, a tool developed by the author to collect Spark task metrics and SQL metrics data, tools for analysing I/O and network workloads, tools for analysing CPU usage and memory bandwidth, tools for profiling CPU usage and for Flame Graph visualization.
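As a pointer to what the instrumentation looks like in practice, a short sparkMeasure sketch; the workload is a toy query, and the package coordinates are indicative and depend on your Spark/Scala versions.

from pyspark.sql import SparkSession
from sparkmeasure import StageMetrics

# Requires the sparkmeasure Python package plus the matching Scala package on the
# classpath, e.g. --packages ch.cern.sparkmeasure:spark-measure_2.11:0.13.
spark = SparkSession.builder.appName("perf-troubleshooting").getOrCreate()
stagemetrics = StageMetrics(spark)

stagemetrics.begin()
spark.range(0, 10000000).selectExpr("sum(id)").show()
stagemetrics.end()
stagemetrics.print_report()      # aggregated task metrics: elapsed time, CPU time, shuffle, GC, ...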
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK (zmhassan)
As Spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF), can be applied to monitor and archive system performance data in a containerized Spark environment. In our examples, we will gather Spark metrics output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.
What's New in Apache Spark 2.3 & Why Should You Care (Databricks)
The Apache Spark 2.3 release marks a big step forward in speed, unification, and API support.
This talk will quickly walk through what’s new and how you can benefit from the upcoming improvements:
* Continuous Processing in Structured Streaming.
* PySpark support for vectorization through Pandas UDFs, giving Python developers the ability to run native Python code fast (see the sketch after this list).
* Native Kubernetes support, marrying the best of container orchestration and distributed data processing.
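A minimal sketch of the vectorized (Pandas) UDFs behind the PySpark item above; the temperature conversion is just an illustrative function.

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()

# The function receives and returns pandas Series, so it operates on whole
# column batches (via Arrow) instead of one row at a time. Requires pyarrow.
@pandas_udf("double", PandasUDFType.SCALAR)
def fahrenheit_to_celsius(f):
    return (f - 32) * 5.0 / 9.0

df = spark.createDataFrame([(32.0,), (212.0,)], ["temp_f"])
df.withColumn("temp_c", fahrenheit_to_celsius("temp_f")).show()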
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform (Yao Yao)
Yao Yao, Mooyoung Lee
https://github.com/yaowser/learn-spark/tree/master/Final%20project
https://www.youtube.com/watch?v=IVMbSDS4q3A
https://www.academia.edu/35646386/Teaching_Apache_Spark_Demonstrations_on_the_Databricks_Cloud_Platform
https://www.slideshare.net/YaoYao44/teaching-apache-spark-demonstrations-on-the-databricks-cloud-platform-86063070/
Apache Spark is a fast and general engine for big data analytics processing with libraries for SQL, streaming, and advanced analytics
Cloud Computing, Structured Streaming, Unified Analytics Integration, End-to-End Applications
Apache Spark 2.4 comes packed with a lot of new functionalities and improvements, including the new barrier execution mode, flexible streaming sink, the native AVRO data source, PySpark’s eager evaluation mode, Kubernetes support, higher-order functions, Scala 2.12 support, and more.
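To give a flavor of two of those additions, a small sketch of the higher-order array functions and the built-in Avro source (the data and path are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-2.4-features").getOrCreate()

df = spark.createDataFrame([(1, [1, 2, 3]), (2, [4, 5])], ["id", "xs"])
df.selectExpr(
    "id",
    "transform(xs, x -> x + 1) AS xs_plus_one",          # higher-order function over an array
    "aggregate(xs, 0, (acc, x) -> acc + x) AS total"     # fold over an array
).show()

# The native Avro source ships as the built-in spark-avro module in 2.4 (path is a placeholder):
# spark.read.format("avro").load("/data/events.avro")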
Conquering Hadoop and Apache Spark with Operational Intelligence with Akshay Rai (Databricks)
At LinkedIn, we have thousands of Hadoop and Spark users ranging from amateurs to experts who run a variety of jobs on our huge 2000-plus node clusters. In just a few years, the number of Hadoop and Spark jobs has grown from hundreds to thousands. With this ever-increasing number of users and jobs, it becomes crucial to have an efficient way to find answers to frequently asked questions like:
1) Why is my job running slow?
2) Why did my job get killed?
3) Can you send me an alert when my job is about to fail or miss SLA?
4) Do we have enough resources on the Hadoop cluster?
Having this information available will help us debug faster, alert on anomalies, perform root cause analysis (RCA), identify workload patterns, and plan capacity. To address this problem, we at LinkedIn have built a Unified Grid Metrics Platform that captures and stores current and historical job metrics. In our experience debugging and tuning jobs and interacting with our users, we have learnt a lot of lessons and have been integrating ideas and solutions into this system. For example, we have learned that capturing and storing the complete set of metrics and its history, though fascinating, is actually rarely useful, just like the verbose logs in Spark. We have come up with derived metrics and a curated list of metrics that we track very closely at LinkedIn.
In this talk, we will discuss the architecture of how we built this platform for both Hadoop and Spark along with the huge challenges in collecting all the standard, derived and custom user metrics in real-time. We will see how this system allows users to build reporting dashboards, perform trend analysis, dimension analysis and view correlated metrics together.
Volodymyr Lyubinets, "Introduction to big data processing with Apache Spark" (IT Event)
In this talk we’ll explore Apache Spark — the most popular cluster computing framework right now. We’ll look at the improvements that Spark brought over Hadoop MapReduce and what makes Spark so fast; explore Spark programming model and RDDs; and look at some sample use cases for Spark and big data in general.
This talk will be interesting for people who have little or no experience with Spark and would like to learn more about it. It will also be interesting to a general engineering audience as we’ll go over the Spark programming model and some engineering tricks that make Spark fast.
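For readers new to the programming model mentioned above, a minimal RDD sketch showing lazy transformations and a single action:

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-intro")

# Transformations (parallelize, filter, map) build the lineage lazily;
# the action (reduce) is what actually triggers the distributed job.
nums = sc.parallelize(range(1, 1001))
total = (nums.filter(lambda x: x % 2 == 0)
             .map(lambda x: x * x)
             .reduce(lambda a, b: a + b))
print(total)

sc.stop()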
Search and Society: Reimagining Information Access for Radical Futures (Bhaskar Mitra)
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
PHP Frameworks: I want to break free (IPC Berlin 2024) (Ralf Eggert)
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti... (Jeffrey Haguewood)
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Essentials of Automations: Optimizing FME Workflows with Parameters (Safe Software)
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
Neuro-symbolic is not enough, we need neuro-*semantic* (Frank van Harmelen)
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
State of ICS and IoT Cyber Threat Landscape Report 2024 preview (Prayukth K V)
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. Also, many times I have seen how developers implement features on the front-end just following the standard rules for a framework and think that this is enough to successfully launch the project, and then the project fails. How to prevent this and what approach to choose? I have launched dozens of complex projects and during the talk we will analyze which approaches have worked for me and which have not.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality (Inflectra)
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology pushes into IT, I found myself wondering, as an “infrastructure container Kubernetes guy”, how does this fancy AI technology get managed from an infrastructure operations view? Is it possible to apply our lovely cloud-native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and give you a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premises strategy we may need to apply it to our own infrastructure and get it to work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into which approaches I have already gotten working for real.
Track A-2: Spark-Based Data Analytics
1. 1
Spark Drives Big Data Analytics Application
Spark-Based Data Analytics
James Chen
Etu CTO
June 16, 2015
2. 2
• Spark Brief
• What Cloudera is doing on Spark
• Spark Use Cases
• Cloudera’s Position on Spark
• Etu and Cloudera
Agenda
3. 3
Key Advances by MapReduce:
• Data Locality: Automatic split computation and appropriate launching of mappers
• Fault Tolerance: Writing out intermediate results and restartable mappers meant the ability to run on commodity hardware
• Linear Scalability: The combination of locality and a programming model that forces developers to write generally scalable solutions to problems
A Brief Review of MapReduce
[Slide diagram: many parallel Map tasks feeding a smaller set of Reduce tasks]
4. 4
MapReduce: Good
The Good:
• Built in fault tolerance
• Optimized IO path
• Scalable
• Developer focuses on Map/Reduce, not infrastructure
• Simple? API
5. 5
MapReduce: Bad
The Bad:
• Optimized for disk IO
  – Doesn't leverage memory
  – Iterative algorithms go through the disk IO path again and again
• Primitive API
  – Developers have to build on a very simple abstraction
  – Key/value in, key/value out
  – Even basic things like join require extensive code (see the Spark join sketch below)
• The result is often many files that need to be combined appropriately
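To make the join point concrete, here is a minimal sketch of the same kind of join expressed as a single Spark transformation. The dataset names and the local-mode SparkContext are illustrative assumptions, not part of the original deck.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sc = new SparkContext("local[*]", "join-sketch")   // local-mode context purely for illustration

// Two small key/value datasets; in raw MapReduce the equivalent join
// needs custom mappers, a tagged shuffle, and a custom reducer.
val orders    = sc.parallelize(Seq((1, "order-42"), (2, "order-43")))
val customers = sc.parallelize(Seq((1, "Alice"), (2, "Bob")))

// In Spark the join is one transformation on pair RDDs.
orders.join(customers).collect().foreach(println)      // (1,(order-42,Alice)), (2,(order-43,Bob))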
6. 6
Spark is a general purpose computational framework with more flexibility than MapReduce.
Key properties:
• Leverages distributed memory
• Full directed-graph expressions for data-parallel computations
• Improved developer experience
Yet retains: linear scalability, fault tolerance, and data-locality-based computations
Reference:
https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
What is Spark?
7. 7
Easy to Develop
– Highly productive language support
– Clean and expressive APIs
– Interactive shell
– Out-of-the-box functionality
Spark: Easy and Fast Big Data
Fast to Run
– General execution graphs
– In-memory storage
2-5× less code
Up to 10× faster on disk, 100× in memory
8. 8
Easy: Example – Word Count
Hadoop MapReduce
public static class WordCountMapClass extends MapReduceBase
implements Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer itr = new StringTokenizer(line);
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
output.collect(word, one);
}
}
}
public static class WordCountReduce extends MapReduceBase
implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
Spark
val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
9. 9
Hadoop Integration
• Works with Hadoop Data
• Runs With YARN
Libraries
• MLlib
• Spark Streaming
• GraphX (alpha)
Out-of-the-Box Functionality
Language support:
• Improved Python support
• SparkR
• Java 8
• Schema support in Spark's APIs
10. 10
data = spark.textFile(...).map(readPoint).cache()
w = numpy.random.rand(D)
for i in range(iterations):
    gradient = data.map(
        lambda p: (1 / (1 + exp(-p.y * w.dot(p.x)))) * p.y * p.x
    ).reduce(lambda x, y: x + y)
    w -= gradient
print "Final w: %s" % w
Example: Logistic Regression
11. 11
• A 100-node Hadoop cluster contains 10+ TB of RAM today, and that will double next year
• 1 GB of RAM ~ $10–$20
• Trends:
  • ½ the price every 18 months
  • 2× the bandwidth every 3 years
Memory Management Leads to Greater Performance
[Slide diagram: a single node with 64–128 GB RAM, 16 cores, and ~50 GB/sec memory bandwidth]
Memory can be an enabler for high-performance big data applications
12. 12
In-memory Caching
• Data partitions read from RAM instead of disk
Operator Graphs
• Scheduling Optimizations
• Fault Tolerance
Fast: Using RAM, Operator Graphs
[Slide diagram: an RDD operator graph – map, join, filter, groupBy, take – over partitioned datasets A–F, with cached partitions highlighted]
13. 13
Expressiveness of Programming Model
[Slide diagram: chains of Map and Reduce stages contrasting pipelined MapReduce jobs (efficient group-by aggregations and other analytics) with iterative jobs (machine learning)]
14. 14
Logistic Regression Performance (Data Fits in Memory)
[Chart: running time in seconds vs. number of iterations (1, 5, 10, 20, 30). Hadoop: ~110 s per iteration. Spark: 80 s for the first iteration, ~1 s for further iterations]
15. 15
• Spark Brief
• What Cloudera is doing on Spark
• Spark Use Cases
• Cloudera’s Position on Spark
• Etu and Cloudera
Agenda
16. 16
Spark Engineering in Cloudera
• Cloudera embraced Spark in early 2014
• Engineering with Intel to broaden Spark ecosystem
– Hive-on-Spark
– Pig-on-Spark
– Spark-over-YARN
– Spark Streaming Reliability
– General Spark Optimization
17. 17
Hive on Spark
• Technology
– Hive: “standard” SQL tool in Hadoop
– Spark: next-gen distributed processing framework
– Hive + Spark
• Performance
• Minimum feature gap
• Industry
– Many customers have invested heavily in Hive
– Want to leverage the Spark engine
18. 18
Design Principles
• No or limited impact on Hive’s existing code path
• Maximize code reuse
• Minimum feature customization
• Low future maintenance cost
20. 20
Work – Metadata for Task
• MapReduceWork contains one MapWork and a possible ReduceWork
• SparkWork contains a graph of MapWorks and ReduceWorks
Query: select name, sum(value) as v from dec group by name order by v;
[Slide diagram: the query compiles to two MR jobs (MapWork1 → ReduceWork1, then MapWork2 → ReduceWork2) but to a single Spark job (MapWork1 → ReduceWork1 → ReduceWork2)]
21. 21
Data Processing via Spark
• Treat Table as HadoopRDD (input RDD)
• Apply the function that wraps MR’s map-side processing
• Shuffle map output using Spark's transformations (groupByKey, sortByKey, etc.)
• Apply the function that wraps MR's reduce-side processing (a rough sketch of this pattern follows)
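As a rough illustration of this pattern (not Hive on Spark's actual internal classes), the sketch below reads a table as an input RDD, applies a map-side function, shuffles with a Spark transformation, and applies a reduce-side function, mirroring the sample query from the previous slide. The paths, the field delimiter, and the parsing logic are assumptions made for the example.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sc = new SparkContext("local[*]", "hive-on-spark-pattern")

// Table read as an input RDD (the slide's HadoopRDD).
val table = sc.textFile("hdfs://.../warehouse/dec")

// Map-side processing: parse each row into (name, value).
val mapOut = table.map { row =>
  val cols = row.split('\u0001')          // assumed field delimiter, for illustration only
  (cols(0), cols(1).toLong)
}

// Shuffle via a Spark transformation, then reduce-side processing.
val reduced = mapOut.groupByKey().map { case (name, values) => (name, values.sum) }

// order by v, then write out (output path is a hypothetical example).
reduced.sortBy(_._2).saveAsTextFile("hdfs://.../dec-sums-by-name")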
22. 22
Spark Plan
• MapInput – encapsulate a table
• MapTran – map-side processing
• ShuffleTran – shuffling
• ReduceTran – reduce-side processing
Query: Select name, sum(value) as v from dec group by name order by v;
23. 23
Current Status
• All functionality in Hive is implemented
• First round of optimization is completed
– Map join, SMB
– Split generation and grouping
– CBO, vectorization
• More optimization and benchmarking coming
• Beta in CDH
– http://archive-primary.cloudera.com/cloudera-labs/hive-on-spark/
– http://www.cloudera.com/content/cloudera/en/documentation/hive-spark/latest/PDF/hive-spark-get-started.pdf
24. 24
• Spark Brief
• What Cloudera is doing on Spark
• Spark Use Cases
• Cloudera’s Position on Spark
Agenda
25. 25
User / Use Case / Spark's Value

Conviva – Optimize end users' online video experience through real-time analysis of traffic rules and finer-grained traffic control.
• Rapid prototyping
• Shared business logic for offline and online computation
• Open-source machine learning algorithms

Yahoo! – Speed up the model-training cycle for ad serving (feature extraction improved ~3×); use collaborative filtering for content recommendation.
• Lower data-pipeline latency
• Iterative machine learning
• Efficient P2P broadcast

Anonymous (large tech company) – Near-real-time log aggregation and analysis for monitoring and alerting.
• Low-latency, high-frequency "mini" batch jobs to process the latest data

Technicolor – Real-time analytics for (telecom) customers; stream processing and real-time query capability.
• Simple deployment, requiring only Spark and Spark Streaming
• Ad-hoc queries over online data
Sample Use Cases
26. 26
Large tech company – Spark is used for new machine learning investigations for search personalization
Financial services – Process millions of stock positions and future scenarios in 4 hrs with Spark (compared with 1 week using MapReduce)
University – Genomics research using Spark pipelines
Video – Spark and Spark Streaming for video streaming and analysis
Hospital – Spark for predictive modeling of disease conditions
Cloudera Use Cases in Verticals
27. 27
• Run ETL on Spark using Pig
  – To achieve very tight SLAs
  – Accenture Smart Water Application
• Spark analytics over HBase
  – Patients' physiological data, experiment and user data
  – Serving researchers
• Traffic analysis using MLlib/clustering at Dylan
• Annotated variants analysis on Spark
  – Using the Spark/Java framework at Duke
• Sepsis detection with Spark Streaming
Cloudera Use Cases with Different Components
28. 28
• A car shopping website where people from all across the nation come to read reviews, compare prices, and in general get help in all matters car related.
• The goal was to build a near real-time dashboard that would provide both unique visitor and page view counts per make and make/model, and that could be engineered in a couple of weeks.
• In the past, these updates had been restricted to hourly granularities with an additional hour of delay.
• Furthermore, as this data was not available in an easy-to-use dashboard, manual processing was needed to visualize the data.
Near real-time dashboard by Edmunds.com
33. 33
Case Study in Etu Insight
• Problem domain:
  − Analyze user behavior from website interaction logs
  − Analyze user behavior from existing offline data
  − Aggregate the data, grouped by time and user
• Approach:
  − ETL the web logs into Hive structured data
  − Import the existing database data
  − Define and implement the aggregation function in Spark (with Scala)
  − Output the calculation results to HBase (see the sketch below)
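A minimal sketch of the aggregation step just described, assuming a hypothetical cleaned log layout (user, URL, hour) and a plain-text output in place of the real HBase sink; it is not Etu Insight's actual code.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sc = new SparkContext("local[*]", "etu-insight-sketch")

// Hypothetical cleaned web log produced by the ETL step: user \t url \t hour
val logs = sc.textFile("hdfs://.../insight/weblog")

// Aggregate page views grouped by (hour, user), as described on the slide.
val counts = logs.map { line =>
  val Array(user, _, hour) = line.split('\t')
  ((hour, user), 1L)
}.reduceByKey(_ + _)

// The real pipeline writes the result to HBase; saving to text keeps the sketch self-contained.
counts.saveAsTextFile("hdfs://.../insight/pageviews-by-hour-user")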
36. 36
Advanced Analytics with Spark
• Written by Cloudera's data science team
  – The first book bridging ML with the Hadoop ecosystem
  – Focused on use cases and examples rather than being a manual
  – Targeted at data scientists solving real-world analysis problems
  – Generally available in May 2015
37. 37
Analyzing Big Data
• Building a model to detect credit card fraud using thousands
of features and billions of transactions
• Intelligently recommend millions of products to millions of
users
• Estimate financial risk through simulations of portfolios
including millions of instruments
• Easily manipulate data from thousands of human genomes to
detect genetic associations with disease
38. 38
• Spark Brief
• What Cloudera is doing on Spark
• Spark Use Cases
• Cloudera’s Position on Spark
• Etu and Cloudera
Agenda
39. 39
Spark is a fully integrated and supported part of Cloudera's enterprise data hub
• First vendor to ship and support Spark
  – Invested early to make it a cohesive part of the platform
  – Complemented by Intel's early investment
  – Developed and supported in collaboration with Databricks to ensure success
• Only vendor with Spark committers on staff
• Several Spark use cases in production
• Well-trained support staff and external training courses
Cloudera's Investment in Spark
40. 40
Hadoop in the Spark World
[Slide diagram: the Hadoop stack in the Spark world – YARN over HDFS and HBase, alongside MapReduce2, Hive, Pig, Impala, and Search, with Spark, Spark Streaming, Spark SQL, GraphX, and MLlib; components are marked as core Hadoop, supported Spark components, or unsupported add-ons]
41. 41
Focusing on Open Standards, not just Open Source
Open Standards are just as important as Open Source. Why does it matter?
• Diverse engineering is more sustainable.
• Broad support ensures vendor portability.
• Project utility depends on ecosystem compatibility, which depends on standards.
Cloudera leads in defining the de facto open standards adopted by the market.
Vendor Support
Component (Founder)         | Cloudera | Pivotal | MapR | Amazon | IBM | Hortonworks
Spark (UC Berkeley)         |    ✔     |    ✔    |  ✔   |   ✔    |  ✔  |     ✔
Impala (Cloudera)           |    ✔     |    ✖    |  ✔   |   ✔    |  ✖  |     ✖
Hue (Cloudera)              |    ✔     |    ✖    |  ✔   |   ✔    |  ✖  |     ✔
Sentry (Cloudera)           |    ✔     |    ✔    |  ✔   |   ✖    |  ✔  |     ✖
Flume (Cloudera)            |    ✔     |    ✔    |  ✔   |   ✖    |  ✔  |     ✔
Parquet (Cloudera/Twitter)  |    ✔     |    ✔    |  ✔   |   ✔    |  ✔  |     ✖
Sqoop (Cloudera)            |    ✔     |    ✔    |  ✔   |   ✔    |  ✔  |     ✔
Falcon                      |    ✖     |    ✖    |  ✖   |   ✖    |  ✖  |     ✔
Knox                        |    ✖     |    ✖    |  ✖   |   ✖    |  ✖  |     ✔
Tez                         |    ✖     |    ✖    |  ✔   |   ✖    |  ✖  |     ✔
Ranger                      |    ✖     |    ✖    |  ✖   |   ✖    |  ✖  |     ✔
42. 42
Cloudera is a member of, and aligned with, the broader Spark community
Spark:
• Will replace MapReduce as the general purpose Hadoop framework
– Broad community and vendor adoption
– Hadoop ecosystem integration (native & 3rd party)
• Goes beyond data science/machine learning
– Cloudera working on Spark Core, Streaming, Security, YARN, and MLlib
• Does not replace special purpose frameworks
– One size does not fit all for SQL, Search, Graph, Stream
Cloudera’s Position on Spark
43. 43
• Spark Brief
• What Cloudera is doing on Spark
• Spark Use Cases
• Cloudera’s Position on Spark
• Etu and Cloudera
Agenda
46. 46
Etu Big Data software platform and services
[Slide diagram: Etu Manager (with Cloudera Manager inside) plus Etu services – Etu Support, Etu Professional Service, Etu Consulting, Etu Training – alongside Cloudera Support]
53. 53
• Read-only partitioned collection of records
• Created through:
– Transformation of data in storage
– Transformation of RDDs
• Contains lineage to compute from storage
• Lazy materialization
• Users control persistence and partitioning
RDD – Resilient Distributed Dataset
55. 55
• Transformations create a new RDD from an existing one
• Actions run a computation on an RDD and return a value
• Transformations are lazy
• Actions materialize RDDs by computing their transformations (see the sketch below)
• RDDs can be cached to avoid re-computation
Operations
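A small sketch of the transformation/action distinction, assuming a SparkContext named sc (as in the earlier word-count example) and a placeholder HDFS path: the filter and map calls below return new RDDs immediately without touching data, and nothing is computed until the count action runs.

val lines  = sc.textFile("hdfs://...")            // source RDD: nothing is read yet
val errors = lines.filter(_.contains("ERROR"))    // transformation: lazy
val fields = errors.map(_.split("\t")(2))         // transformation: still lazy
val n      = fields.count()                       // action: triggers the whole computation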
56. 56
• RDDs contain lineage
• Lineage – source location and list of transformations
• Lost partitions can be re-computed from source data
Fault-Tolerance
msgs = (textFile.filter(lambda s: s.startswith("ERROR"))
                .map(lambda s: s.split("\t")[2]))
[Slide diagram: lineage – HDFS file → filter(func = startswith(...)) → filtered RDD → map(func = split(...)) → mapped RDD]
57. 57
• persist() and cache() mark data for caching
• An RDD is cached after the first action
• Fault-tolerant – lost partitions will be re-computed
• If there is not enough memory, some partitions will not be cached
• Future actions are performed on the cached partitions, so they are much faster
Use caching for iterative algorithms (see the sketch below)
Caching
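A minimal caching sketch, again assuming a SparkContext named sc and a placeholder path: cache() only marks the RDD, the first action materializes it in memory, and later actions reuse the cached partitions (recomputing any lost or evicted ones from lineage).

val data = sc.textFile("hdfs://...").map(_.length.toLong).cache()  // marked for caching; nothing computed yet

val firstPass  = data.reduce(_ + _)   // first action: reads from HDFS and populates the cache
val secondPass = data.reduce(_ + _)   // later actions reuse the cached partitions and run much faster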