24.11.2014 
uweseiler 
Apache Spark
About me
• Big Data Nerd
• Hadoop Trainer
• NoSQL Fan Boy
• Photography Enthusiast
• Travelpirate
About us
specializes in...
• Big Data Nerds
• Agile Ninjas
• Continuous Delivery Gurus
• Enterprise Java Specialists
• Performance Geeks
Join us!
Agenda
• Why? 
• How? 
• What else? 
• Who? 
• Future?
Agenda: Why?
Spark: In a tweet
“Spark … is what you might call a Swiss Army knife of Big Data analytics tools”
– Reynold Xin (@rxin), Berkeley AMPLab Shark Development Lead
Spark: In a nutshell
• Fast and general engine for large-scale data processing
• Advanced DAG execution engine with support for
  • in-memory storage
  • data locality
  • (micro) batch & streaming support
• Improves usability via
  • Rich APIs in Scala, Java, Python
  • Interactive shell
• Runs standalone, on YARN, on Mesos, and on Amazon EC2
Spark is also…
• Came out of AMPLab at UCB in 2009
• A top-level Apache project as of 2014
  – http://spark.apache.org
• Backed by a commercial entity: Databricks
• A toolset for data scientists / analysts
• Implementation of the Resilient Distributed Dataset (RDD) in Scala
• Hadoop compatible
Spark: Trends
[Chart: search interest for Apache Drill, Apache Storm, Apache Spark, Apache YARN, Apache Tez]
Generated using http://www.google.com/trends/
Spark: Community
https://github.com/apache/spark/pulse
Spark: Performance
• Sorted 100 TB of data in 23 minutes, a new world record
• 3X faster using 10X fewer machines than the previous record
http://finance.yahoo.com/news/apache-spark-beats-world-record-130000796.html
http://www.wired.com/2014/10/startup-crunches-100-terabytes-data-record-23-minutes/
Spark: Ecosystem
[Diagram: Spark ecosystem stack]
• Components: Spark SQL (SQL), Spark Streaming (Streaming), MLlib (Machine Learning), SparkR (R on Spark), GraphX (Graph Computation), BlinkDB
• Spark Core
• MapReduce: Cluster resource mgmt. + data processing
• HDFS: Redundant, reliable storage
Agenda: How?
Spark: Core Concept
• Resilient Distributed Dataset (RDD)
  Conceptually, RDDs can be roughly viewed as partitioned, locality-aware distributed vectors
  [Diagram: an RDD split into partitions A11, A12, A13]
• Read-only collection of objects spread across a cluster
• Built through parallel transformations & actions
• Computation can be represented by lazily evaluated lineage DAGs composed of connected RDDs
• Automatically rebuilt on failure
• Controllable persistence
Spark: RDD Example
// Base RDD from HDFS
val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("Error"))
val messages = errors.map(_.split('\t')(2))
// RDD in memory
messages.cache()
// Iterative processing
for (str <- Array("foo", "bar"))
  messages.filter(_.contains(str)).count()
Spark: Transformations
Transformations create new datasets from existing ones
• map(func)
• filter(func)
• flatMap(func)
• mapPartitions(func)
• mapPartitionsWithIndex(func)
• union(otherDataset)
• intersection(otherDataset)
• distinct([numTasks])
• groupByKey([numTasks])
• sortByKey([ascending], [numTasks])
• reduceByKey(func, [numTasks])
• aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
• join(otherDataset, [numTasks])
• cogroup(otherDataset, [numTasks])
• cartesian(otherDataset)
• pipe(command, [envVars])
• coalesce(numPartitions)
• sample(withReplacement, fraction, seed)
• repartition(numPartitions)
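A few of these transformations can be tried without a cluster. The following is a minimal sketch on plain Scala collections, not Spark itself: `TransformationsSketch`, `nums` and `pairs` are illustrative names, and grouping plus a per-key reduce merely stands in for `reduceByKey`.

```scala
object TransformationsSketch {
  // Local stand-ins for RDDs (assumption: small in-memory data, no Spark involved)
  val nums: Seq[Int] = Seq(1, 2, 2, 3, 4)
  val pairs: Seq[(String, Int)] = Seq(("a", 1), ("b", 2), ("a", 3))

  val doubled: Seq[Int] = nums.map(_ * 2)                // map(func)
  val evens: Seq[Int] = nums.filter(_ % 2 == 0)          // filter(func)
  val expanded: Seq[Int] = nums.flatMap(n => Seq(n, n))  // flatMap(func)
  val unique: Seq[Int] = nums.distinct                   // distinct()
  val both: Seq[Int] = nums ++ Seq(5, 6)                 // union(otherDataset)

  // reduceByKey(func) analogue: group by key, then reduce the values of each key
  val summed: Map[String, Int] =
    pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).reduce(_ + _)) }
}
```

Each line returns a new collection and leaves its input untouched, which mirrors the read-only nature of RDDs.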
Spark: Actions
Actions return a value to the client after running a computation on the dataset
• reduce(func)
• collect()
• count()
• first()
• countByKey()
• foreach(func)
• take(n)
• takeSample(withReplacement, num, [seed])
• takeOrdered(n, [ordering])
• saveAsTextFile(path)
• saveAsSequenceFile(path) (only Java and Scala)
• saveAsObjectFile(path) (only Java and Scala)
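Some of these actions have direct counterparts on local collections. This is a sketch under the assumption that a plain Scala `Seq` stands in for the RDD; `ActionsSketch` and its values are illustrative names only.

```scala
object ActionsSketch {
  // Local stand-in for an RDD (assumption: plain Scala collection, no Spark involved)
  val data: Seq[Int] = Seq(5, 3, 8, 1)

  val total: Int = data.reduce(_ + _)              // reduce(func)
  val n: Int = data.size                           // count()
  val head: Int = data.head                        // first()
  val firstTwo: Seq[Int] = data.take(2)            // take(n)
  val smallestTwo: Seq[Int] = data.sorted.take(2)  // takeOrdered(n)

  // countByKey() analogue: number of elements per key
  val perKey: Map[String, Int] =
    Seq(("a", 1), ("b", 2), ("a", 3)).groupBy(_._1).map { case (k, kvs) => (k, kvs.size) }
}
```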
Spark: Dataflow
All transformations in Spark are lazy and are only computed when an action requires it.
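The effect of lazy evaluation can be observed with a plain Scala `Iterator`, whose `map` is also lazy. This is only an analogy to Spark's behaviour, and the counter `evaluations` is a hypothetical helper added to make the moment of execution visible.

```scala
object LazySketch {
  // Counts how often the mapping function actually runs
  var evaluations = 0

  // "Transformation": Iterator.map is lazy, so nothing is computed on this line
  val mapped: Iterator[Int] = Iterator(1, 2, 3).map { x => evaluations += 1; x * 10 }
  val evaluationsBeforeAction: Int = evaluations

  // "Action": summing forces the pipeline to run
  val result: Int = mapped.sum
  val evaluationsAfterAction: Int = evaluations
}
```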
Spark: Persistence
One of the most important capabilities in Spark is caching a dataset in memory across operations
• cache() → MEMORY_ONLY
• persist() → MEMORY_ONLY
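Why caching pays off when a dataset is used by several actions can be sketched with plain Scala: a `def` recomputes on every use, while a `lazy val` materializes once. This is an analogy, not Spark's cache implementation, and `computations` is a hypothetical counter.

```scala
object CacheSketch {
  var computations = 0

  // An "uncached dataset": every traversal recomputes the expensive mapping
  def uncached: Seq[Int] = Seq(1, 2, 3).map { x => computations += 1; x * x }

  val a1: Int = uncached.sum
  val a2: Int = uncached.max
  val computationsUncached: Int = computations // two actions, two full computations

  // A "cached dataset": materialized once on first use, then reused
  lazy val cached: Seq[Int] = uncached
  computations = 0
  val b1: Int = cached.sum
  val b2: Int = cached.max
  val computationsCached: Int = computations // two actions, one computation
}
```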
Spark: Storage Levels
• persist(StorageLevel)
Storage Level: Meaning
• MEMORY_ONLY: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
• MEMORY_AND_DISK: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
• MEMORY_ONLY_SER: Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
• MEMORY_AND_DISK_SER: Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
• DISK_ONLY: Store the RDD partitions only on disk.
• MEMORY_ONLY_2, MEMORY_AND_DISK_2, …: Same as the levels above, but replicate each partition on two cluster nodes.
Spark: Parallelism
Can be specified in a number of different ways
• RDD partition number
  • sc.textFile(input, minPartitions = 10)
  • sc.parallelize(1 to 10000, numSlices = 10)
• Mapper side parallelism
  • Usually inherited from parent RDD(s)
• Reducer side parallelism
  • rdd.reduceByKey(_ + _, numPartitions = 10)
  • rdd.reduceByKey(partitioner = p, _ + _)
• “Zoom in/out”
  • rdd.repartition(numPartitions: Int)
  • rdd.coalesce(numPartitions: Int, shuffle: Boolean)
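What a partition count means for the data can be sketched on local collections. The helpers `slice` and `partitionOf` below are hypothetical and only approximate the idea; Spark's own slicing and partitioners differ in detail (this `slice`, for instance, yields at most `numSlices` pieces).

```scala
object PartitionSketch {
  // parallelize(1 to 10, numSlices = 3) analogue: cut the data into contiguous slices
  def slice[T](data: Seq[T], numSlices: Int): Seq[Seq[T]] = {
    val size = math.ceil(data.size.toDouble / numSlices).toInt
    data.grouped(size).toSeq
  }
  val slices: Seq[Seq[Int]] = slice(1 to 10, 3)

  // Hash partitioning analogue for the reducer side:
  // a given key always lands in the same partition
  def partitionOf(key: Any, numPartitions: Int): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }

  // coalesce(2) analogue without shuffle: merge neighbouring slices into fewer partitions
  val coalesced: Seq[Seq[Int]] = slice(slices, 2).map(_.flatten)
}
```

Because equal keys share a hash code, all values of one key end up in one partition, which is what lets a reducer-side operation such as `reduceByKey` combine them.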
Spark: Example
Text Processing Example 
Top words by frequency
Spark: Frequency Example
Create RDD from external data
• Data sources supported by Hadoop: Cassandra, ElasticSearch, HDFS, S3, HBase, MongoDB, …
• I/O via Hadoop optional
// Step 1. Create RDD from Hadoop text files
val docs = spark.textFile("hdfs://docs/")
Spark: Frequency Example
Function map
RDD[String]: Hello World | This is Spark | Spark | The end
  .map(line => line.toLowerCase)  =  .map(_.toLowerCase)
RDD[String]: hello world | this is spark | spark | the end
// Step 2. Convert lines to lower case
val lower = docs.map(line => line.toLowerCase)
Spark: Frequency Example
map vs. flatMap
RDD[String]: hello world | this is spark | spark | the end
  .map(_.split("\\s+"))
RDD[Array[String]]: [hello, world] | [this, is, spark] | [spark] | [the, end]
  .flatten*
RDD[String]: hello | world | this | is | spark | spark | the | end
.map(…) followed by .flatten*  =  .flatMap(line => line.split("\\s+"))
// Step 3. Split lines into words
val words = lower.flatMap(line => line.split("\\s+"))
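The difference between the two result types shows up on local collections as well. A minimal sketch, assuming a plain Scala `Seq` in place of the RDD and the four example lines from the slides:

```scala
object FlatMapSketch {
  // Local stand-in for the lower-cased RDD[String] (assumption: plain Scala collection)
  val lower: Seq[String] = Seq("hello world", "this is spark", "spark", "the end")

  // map keeps one output element per line: a collection of word arrays
  val nested: Seq[Array[String]] = lower.map(_.split("\\s+"))

  // flatMap splits and flattens in one step: a collection of words
  val words: Seq[String] = lower.flatMap(_.split("\\s+"))
}
```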
Spark: Frequency Example
Key-Value Pairs
RDD[String]: hello | world | spark | spark | end
  .map(word => Tuple2(word, 1))  =  .map(word => (word, 1))
RDD[(String, Int)]: (hello, 1) | (world, 1) | (spark, 1) | (spark, 1) | (end, 1)
// Step 4. Convert into tuples
val counts = words.map(word => (word, 1))
Spark: Frequency Example
Shuffling
RDD[(String, Int)]: (hello, 1) | (world, 1) | (spark, 1) | (spark, 1) | (end, 1)
  .groupByKey
RDD[(String, Iterable[Int])]: (end, [1]) | (hello, [1]) | (spark, [1, 1]) | (world, [1])
  .mapValues(_.reduce((a, b) => a + b))
RDD[(String, Int)]: (end, 1) | (hello, 1) | (spark, 2) | (world, 1)
.groupByKey followed by .mapValues(…)  =  .reduceByKey((a, b) => a + b)
// Step 5. Count all words
val freq = counts.reduceByKey(_ + _)
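Both routes to the word counts can be compared on local collections. This sketch assumes a plain Scala `Seq` in place of the pair RDD; `viaGroup` and `viaReduce` are illustrative names. In Spark, `reduceByKey` is usually preferred because values are already combined within each partition before the shuffle.

```scala
object ShuffleSketch {
  // Local stand-in for RDD[(String, Int)] (assumption: plain Scala collection)
  val counts: Seq[(String, Int)] =
    Seq("hello", "world", "spark", "spark", "end").map(w => (w, 1))

  // groupByKey analogue: collect all values per key ...
  val grouped: Map[String, Seq[Int]] =
    counts.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
  // ... then mapValues(_.reduce((a, b) => a + b))
  val viaGroup: Map[String, Int] =
    grouped.map { case (k, vs) => (k, vs.reduce((a, b) => a + b)) }

  // reduceByKey analogue: fold each value into a running total per key,
  // never materializing the per-key value lists
  val viaReduce: Map[String, Int] =
    counts.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(_ + v).getOrElse(v))
    }
}
```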
Spark: Frequency Example
Top N (Prepare data)
RDD[(String, Int)]: (end, 1) | (hello, 1) | (spark, 2) | (world, 1)
  .map(_.swap)
RDD[(Int, String)]: (1, end) | (1, hello) | (2, spark) | (1, world)
// Step 6. Swap tuples (Partial code)
freq.map(_.swap)
Spark: Frequency Example
Top N (First Attempt)
RDD[(Int, String)]: (1, end) | (1, hello) | (2, spark) | (1, world)
  .sortByKey(ascending = false)
RDD[(Int, String)]: (2, spark) | (1, end) | (1, hello) | (1, world)
Spark: Frequency Example
Top N
RDD[(Int, String)], spread over partitions: (1, end) | (1, hello) | (2, spark) | (1, world)
  .top(N)
  1. Each partition computes its local top N
  2. A reduction merges the local results
Array[(Int, String)]: (2, spark) | (1, world)
// Step 6. Swap tuples (Complete code)
val top = freq.map(_.swap).top(N)
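The two phases, local top N followed by a reduction, can be sketched on local collections. The split into two partitions and the helper `topN` are assumptions made for illustration; ties in the count are broken by the word, since tuples are compared element by element.

```scala
object TopNSketch {
  // Two hypothetical partitions of the swapped RDD[(Int, String)]
  val partitions: Seq[Seq[(Int, String)]] = Seq(
    Seq((1, "end"), (1, "hello")),
    Seq((2, "spark"), (1, "world"))
  )

  // Largest n elements in descending order, using the default tuple ordering
  def topN(data: Seq[(Int, String)], n: Int): Seq[(Int, String)] =
    data.sorted(Ordering[(Int, String)].reverse).take(n)

  val n = 2
  // Step 1: each partition computes its local top N
  val localTops: Seq[Seq[(Int, String)]] = partitions.map(topN(_, n))
  // Step 2: the reduction merges local results and keeps the overall top N
  val top: Seq[(Int, String)] = localTops.reduce((a, b) => topN(a ++ b, n))
}
```

Only N elements per partition take part in the reduction, so the full dataset never has to be sorted or moved.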
Spark: Frequency Example
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val spark = new SparkContext()
val N = 10 // number of top words to return (example value)
// Create RDD from Hadoop text file
val docs = spark.textFile("hdfs://docs/")
// Split lines into words and process
val lower = docs.map(line => line.toLowerCase)
val words = lower.flatMap(line => line.split("\\s+"))
val counts = words.map(word => (word, 1))
// Count all words
val freq = counts.reduceByKey(_ + _)
// Swap tuples and get top results
val top = freq.map(_.swap).top(N)
top.foreach(println)
Agenda: What else?
Spark: Streaming
• Real-time computation
• Similar to Apache Storm…
• Streaming input split into sliding windows of RDDs
• Input distributed to memory for fault tolerance
• Supports input from Kafka, Flume, ZeroMQ, HDFS, S3, Kinesis, Twitter, …
Spark: Streaming
[Diagrams: Discretized Stream; Windowed Computations]
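Spark: Streaming
// Assumes an existing StreamingContext `ssc` with checkpointing enabled;
// the slide interval of 5 seconds is an example value
TwitterUtils.createStream(ssc, None)
  .filter(_.getText.contains("Spark"))
  .countByWindow(Seconds(5), Seconds(5))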
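A windowed count over micro-batches can be sketched without a streaming cluster. The batches below are made-up tweet texts, and the window of two batches sliding by one batch is an assumption chosen to keep the example small.

```scala
object WindowSketch {
  // Hypothetical micro-batches of tweet texts, one inner Seq per batch interval
  val batches: Seq[Seq[String]] = Seq(
    Seq("Spark rocks", "hello"),
    Seq("nothing here"),
    Seq("Spark again", "more Spark", "other"),
    Seq("bye")
  )

  // The filter is applied per batch, as a DStream applies it per RDD
  val filtered: Seq[Seq[String]] = batches.map(_.filter(_.contains("Spark")))

  // countByWindow analogue: window length of 2 batches, sliding by 1 batch
  val windowCounts: Seq[Int] = filtered.sliding(2, 1).map(_.map(_.size).sum).toList
}
```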
Spark: SQL
• Spark SQL allows relational queries expressed in SQL, HiveQL or Scala
• Uses SchemaRDDs composed of Row objects (= table in a traditional RDBMS)
• A SchemaRDD can be created
  • from an existing RDD
  • from a Parquet file
  • from a JSON dataset
  • by running HiveQL against data stored in Apache Hive
• Supports a domain-specific language for writing queries
Spark: SQL
registerFunction("LEN", (_: String).length)
val queryRdd = sql("""
  SELECT * FROM counts
  WHERE LEN(word) = 10
  ORDER BY total DESC
  LIMIT 10
""")
queryRdd
  .map(c => s"word: ${c(0)} \t| total: ${c(1)}")
  .collect()
  .foreach(println)
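What the query computes can be mirrored step by step on a local collection. The rows of `counts` below are invented sample data, and `len` stands in for the registered `LEN` function; this is a sketch of the query's logic, not of Spark SQL itself.

```scala
object QuerySketch {
  // Hypothetical rows of the `counts` table as (word, total)
  val counts: Seq[(String, Int)] = Seq(
    ("statistics", 7), ("spark", 42), ("clustering", 3), ("regression", 9))

  // Counterpart of the LEN user-defined function
  val len: String => Int = _.length

  val result: Seq[String] = counts
    .filter { case (word, _) => len(word) == 10 }  // WHERE LEN(word) = 10
    .sortBy { case (_, total) => -total }          // ORDER BY total DESC
    .take(10)                                      // LIMIT 10
    .map { case (word, total) => s"word: $word \t| total: $total" }
}
```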
Spark: GraphX
• GraphX is the Spark API for graphs and graph-parallel computation
• APIs to join and traverse graphs
• Optimally partitions and indexes vertices & edges (represented as RDDs)
• Supports PageRank, connected components, triangle counting, …
Spark: GraphX
val graph = Graph(userIdRDD, assocRDD)
val ranks = graph.pageRank(0.0001).vertices
val userRDD = sc.textFile("graphx/data/users.txt")
val users = userRDD.map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
val ranksByUsername = users.join(ranks).map {
  case (id, (username, rank)) => (username, rank)
}
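The idea behind a tolerance-based PageRank can be sketched in plain Scala. This is a textbook, normalized formulation on a made-up four-vertex graph; GraphX's own implementation differs in details such as normalization, and all names below are hypothetical.

```scala
object PageRankSketch {
  // Hypothetical follower graph as (source, destination) edges; every vertex has an out-edge
  val edges: Seq[(Long, Long)] = Seq((1L, 2L), (2L, 3L), (3L, 1L), (4L, 3L))
  val vertices: Seq[Long] = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
  val outDegree: Map[Long, Int] = edges.groupBy(_._1).map { case (v, es) => (v, es.size) }

  // Iterates until no rank changes by more than `tol` (the role of 0.0001 on the slide)
  def pageRank(tol: Double, damping: Double = 0.85): Map[Long, Double] = {
    val n = vertices.size
    var ranks: Map[Long, Double] = vertices.map(v => (v, 1.0 / n)).toMap
    var delta = Double.MaxValue
    while (delta > tol) {
      // Each vertex sends rank / outDegree along each of its out-edges
      val received: Map[Long, Double] = edges
        .map { case (s, d) => (d, ranks(s) / outDegree(s)) }
        .groupBy(_._1)
        .map { case (v, cs) => (v, cs.map(_._2).sum) }
      val next = vertices.map { v =>
        (v, (1 - damping) / n + damping * received.getOrElse(v, 0.0))
      }.toMap
      delta = vertices.map(v => math.abs(next(v) - ranks(v))).max
      ranks = next
    }
    ranks
  }

  val ranks: Map[Long, Double] = pageRank(0.0001)
}
```

In this graph vertex 3 receives links from two vertices and ends up with the highest rank, while vertex 4, which nobody links to, keeps only the base rank.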
Spark: MLlib
• Machine learning library similar to 
Apache Mahout 
• Supports statistics, regression, decision 
trees, clustering, PCA, gradient 
descent, … 
• Iterative algorithms much faster due to 
in-memory processing
Spark: MLlib
val data = sc.textFile("data.txt")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(
    parts(0).toDouble,
    Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}
val model = LinearRegressionWithSGD.train(parsedData, 100)
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds
  .map { case (v, p) => math.pow(v - p, 2) }.mean()
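The train-then-evaluate pattern can be followed on a tiny local dataset. In this sketch plain batch gradient descent stands in for MLlib's SGD, the model has a single weight and no intercept, and the data, step size and names are assumptions for illustration.

```scala
object LinearRegressionSketch {
  // Hypothetical training data as (label, feature) pairs following y = 2x
  val data: Seq[(Double, Double)] = Seq((2.0, 1.0), (4.0, 2.0), (6.0, 3.0), (8.0, 4.0))

  // Batch gradient descent for a model y = w * x without intercept
  def train(points: Seq[(Double, Double)], numIterations: Int, stepSize: Double): Double = {
    var w = 0.0
    for (_ <- 1 to numIterations) {
      // Gradient of the halved mean squared error with respect to w
      val gradient = points.map { case (y, x) => (w * x - y) * x }.sum / points.size
      w -= stepSize * gradient
    }
    w
  }

  val w: Double = train(data, numIterations = 100, stepSize = 0.05)

  // Same evaluation as on the slide: pair labels with predictions,
  // then average the squared errors
  val valuesAndPreds: Seq[(Double, Double)] = data.map { case (y, x) => (y, w * x) }
  val mse: Double =
    valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.sum / valuesAndPreds.size
}
```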
Agenda: Who?
Use Case: Yahoo Native Ads
Logistic regression algorithm
• 120 LOC in Spark/Scala
• 30 min. on model creation for 100M samples and 13K features
Initial version launched within 2 hours after Spark-on-YARN announcement
• Compared: Several days on hardware acquisition, system setup and data movement
http://de.slideshare.net/Hadoop_Summit/sparkonyarn-empower-spark-applications-on-hadoop-cluster
Use Case: Yahoo Mobile Ads
Learn from mobile search ads clicks data
• 600M labeled examples on HDFS
• 100M sparse features
Spark programs for Gradient Boosting Decision Trees
• 6 hours for model training with 100 workers
• Model with accuracy very close to heavily manually tuned Logistic Regression models
http://de.slideshare.net/Hadoop_Summit/sparkonyarn-empower-spark-applications-on-hadoop-cluster
Agenda: Future?
Spark-on-YARN (Current)
Hadoop 2: Spark as YARN app
[Diagram: Hadoop 2 stack, top to bottom]
• Applications: Pig, Hive, …
• Execution engines: Tez, MapReduce, Spark (in-memory), Storm (streaming), …
• YARN: Cluster resource management
• HDFS: Redundant, reliable storage
Spark-on-YARN (Future)
Hadoop 2: Spark as execution engine
[Diagram: Hadoop 2 stack, top to bottom]
• Applications: Pig, Hive, …, Mahout
• Execution engines: MapReduce, Tez, Spark, Storm (streaming), …
• Slider
• YARN
• HDFS
Spark: Future work
• Spark Core
  • Focus on maturity, optimization & pluggability
  • Enable long-running services (Slider)
  • Give resources back to cluster when idle
  • Integrate with Hadoop enhancements
    • Timeline server
    • ORC file format
• Spark ecosystem
  • Focus on adding capabilities
One more thing…
Let’s get started with Spark!
Hortonworks Sandbox 2.2
http://hortonworks.com/hdp/downloads/
Hortonworks Sandbox 2.2
# 1. Download
wget http://public-repo-1.hortonworks.com/HDP-LABS/Projects/spark/1.1.1/spark-1.1.0.2.1.5.0-701-bin-2.4.0.tgz
# 2. Untar
tar xvfz spark-1.1.0.2.1.5.0-701-bin-2.4.0.tgz
# 3. Start Spark Shell
cd spark-1.1.0.2.1.5.0-701-bin-2.4.0
./bin/spark-shell
Thanks for listening
Twitter: @uweseiler
Mail: uwe.seiler@codecentric.de
XING: https://www.xing.com/profile/Uwe_Seiler

More Related Content

What's hot

Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingApache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
Hortonworks
 
Ray and Its Growing Ecosystem
Ray and Its Growing EcosystemRay and Its Growing Ecosystem
Ray and Its Growing Ecosystem
Databricks
 
Block join toranomaki
Block join toranomakiBlock join toranomaki
Block join toranomaki
Ebisawa Shinobu
 
Apache Spark超入門 (Hadoop / Spark Conference Japan 2016 講演資料)
Apache Spark超入門 (Hadoop / Spark Conference Japan 2016 講演資料)Apache Spark超入門 (Hadoop / Spark Conference Japan 2016 講演資料)
Apache Spark超入門 (Hadoop / Spark Conference Japan 2016 講演資料)
NTT DATA OSS Professional Services
 
Apache Spark 101
Apache Spark 101Apache Spark 101
Apache Spark 101
Abdullah Çetin ÇAVDAR
 
負荷試験入門公開資料 201611
負荷試験入門公開資料 201611負荷試験入門公開資料 201611
負荷試験入門公開資料 201611
樽八 仲川
 
Real-time Analytics with Apache Flink and Druid
Real-time Analytics with Apache Flink and DruidReal-time Analytics with Apache Flink and Druid
Real-time Analytics with Apache Flink and Druid
Jan Graßegger
 
Oracle Advanced Security Data Redactionのご紹介
Oracle Advanced Security Data Redactionのご紹介Oracle Advanced Security Data Redactionのご紹介
Oracle Advanced Security Data Redactionのご紹介
オラクルエンジニア通信
 
Java EE Introduction
Java EE IntroductionJava EE Introduction
Java EE Introduction
ejlp12
 
Hadoopのシステム設計・運用のポイント
Hadoopのシステム設計・運用のポイントHadoopのシステム設計・運用のポイント
Hadoopのシステム設計・運用のポイント
Cloudera Japan
 
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™
Databricks
 
DeNA データプラットフォームにおける 自由と統制のバランス【DeNA TechCon 2020 ライブ配信】
DeNA データプラットフォームにおける 自由と統制のバランス【DeNA TechCon 2020 ライブ配信】DeNA データプラットフォームにおける 自由と統制のバランス【DeNA TechCon 2020 ライブ配信】
DeNA データプラットフォームにおける 自由と統制のバランス【DeNA TechCon 2020 ライブ配信】
DeNA
 
Oracle Cloud Infrastructure:2022年9月度サービス・アップデート
Oracle Cloud Infrastructure:2022年9月度サービス・アップデートOracle Cloud Infrastructure:2022年9月度サービス・アップデート
Oracle Cloud Infrastructure:2022年9月度サービス・アップデート
オラクルエンジニア通信
 
Oracle GoldenGate Veridata 12cR2 セットアップガイド
Oracle GoldenGate Veridata 12cR2 セットアップガイドOracle GoldenGate Veridata 12cR2 セットアップガイド
Oracle GoldenGate Veridata 12cR2 セットアップガイド
オラクルエンジニア通信
 
Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)
Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)
Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)
NTT DATA Technology & Innovation
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
Databricks
 
Always on 可用性グループ 構築時のポイント
Always on 可用性グループ 構築時のポイントAlways on 可用性グループ 構築時のポイント
Always on 可用性グループ 構築時のポイントMasayuki Ozawa
 
SELENIUM PPT.pdf
SELENIUM PPT.pdfSELENIUM PPT.pdf
SELENIUM PPT.pdf
RebelSnowball
 
私はここでつまづいた! Oracle database 11g から 12cへのアップグレードと Oracle Database 12c の新機能@201...
私はここでつまづいた! Oracle database 11g から 12cへのアップグレードと Oracle Database 12c の新機能@201...私はここでつまづいた! Oracle database 11g から 12cへのアップグレードと Oracle Database 12c の新機能@201...
私はここでつまづいた! Oracle database 11g から 12cへのアップグレードと Oracle Database 12c の新機能@201...
yoshimotot
 
Helidon 概要
Helidon 概要Helidon 概要

What's hot (20)

Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingApache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
Ray and Its Growing Ecosystem
Ray and Its Growing EcosystemRay and Its Growing Ecosystem
Ray and Its Growing Ecosystem
 
Block join toranomaki
Block join toranomakiBlock join toranomaki
Block join toranomaki
 
Apache Spark超入門 (Hadoop / Spark Conference Japan 2016 講演資料)
Apache Spark超入門 (Hadoop / Spark Conference Japan 2016 講演資料)Apache Spark超入門 (Hadoop / Spark Conference Japan 2016 講演資料)
Apache Spark超入門 (Hadoop / Spark Conference Japan 2016 講演資料)
 
Apache Spark 101
Apache Spark 101Apache Spark 101
Apache Spark 101
 
負荷試験入門公開資料 201611
負荷試験入門公開資料 201611負荷試験入門公開資料 201611
負荷試験入門公開資料 201611
 
Real-time Analytics with Apache Flink and Druid
Real-time Analytics with Apache Flink and DruidReal-time Analytics with Apache Flink and Druid
Real-time Analytics with Apache Flink and Druid
 
Oracle Advanced Security Data Redactionのご紹介
Oracle Advanced Security Data Redactionのご紹介Oracle Advanced Security Data Redactionのご紹介
Oracle Advanced Security Data Redactionのご紹介
 
Java EE Introduction
Java EE IntroductionJava EE Introduction
Java EE Introduction
 
Hadoopのシステム設計・運用のポイント
Hadoopのシステム設計・運用のポイントHadoopのシステム設計・運用のポイント
Hadoopのシステム設計・運用のポイント
 
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™
 
DeNA データプラットフォームにおける 自由と統制のバランス【DeNA TechCon 2020 ライブ配信】
DeNA データプラットフォームにおける 自由と統制のバランス【DeNA TechCon 2020 ライブ配信】DeNA データプラットフォームにおける 自由と統制のバランス【DeNA TechCon 2020 ライブ配信】
DeNA データプラットフォームにおける 自由と統制のバランス【DeNA TechCon 2020 ライブ配信】
 
Oracle Cloud Infrastructure:2022年9月度サービス・アップデート
Oracle Cloud Infrastructure:2022年9月度サービス・アップデートOracle Cloud Infrastructure:2022年9月度サービス・アップデート
Oracle Cloud Infrastructure:2022年9月度サービス・アップデート
 
Oracle GoldenGate Veridata 12cR2 セットアップガイド
Oracle GoldenGate Veridata 12cR2 セットアップガイドOracle GoldenGate Veridata 12cR2 セットアップガイド
Oracle GoldenGate Veridata 12cR2 セットアップガイド
 
Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)
Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)
Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
Always on 可用性グループ 構築時のポイント
Always on 可用性グループ 構築時のポイントAlways on 可用性グループ 構築時のポイント
Always on 可用性グループ 構築時のポイント
 
SELENIUM PPT.pdf
SELENIUM PPT.pdfSELENIUM PPT.pdf
SELENIUM PPT.pdf
 
私はここでつまづいた! Oracle database 11g から 12cへのアップグレードと Oracle Database 12c の新機能@201...
私はここでつまづいた! Oracle database 11g から 12cへのアップグレードと Oracle Database 12c の新機能@201...私はここでつまづいた! Oracle database 11g から 12cへのアップグレードと Oracle Database 12c の新機能@201...
私はここでつまづいた! Oracle database 11g から 12cへのアップグレードと Oracle Database 12c の新機能@201...
 
Helidon 概要
Helidon 概要Helidon 概要
Helidon 概要
 

Viewers also liked

Hadoop Operations - Best practices from the field
Hadoop Operations - Best practices from the fieldHadoop Operations - Best practices from the field
Hadoop Operations - Best practices from the field
Uwe Printz
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data Model
Uwe Printz
 
Hadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, FutureHadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, Future
Uwe Printz
 
Apache Spark with Scala
Apache Spark with ScalaApache Spark with Scala
Apache Spark with Scala
Fernando Rodriguez
 
Deep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an IntroductionDeep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an Introduction
Emanuele Bezzi
 
Big Data Asset Maturity Model
Big Data Asset Maturity ModelBig Data Asset Maturity Model
Big Data Asset Maturity Modelnoahwong
 
Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...
Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...
Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...
StampedeCon
 
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Sumeet Singh
 
Deep dive into spark streaming
Deep dive into spark streamingDeep dive into spark streaming
Deep dive into spark streaming
Tao Li
 
Big data, Analytics and Beyond
Big data, Analytics and BeyondBig data, Analytics and Beyond
Big data, Analytics and Beyond
QuantUniversity
 
Fully fault tolerant real time data pipeline with docker and mesos
Fully fault tolerant real time data pipeline with docker and mesos Fully fault tolerant real time data pipeline with docker and mesos
Fully fault tolerant real time data pipeline with docker and mesos
Rahul Kumar
 
Hadoop security landscape
Hadoop security landscapeHadoop security landscape
Hadoop security landscape
Sujee Maniyam
 
Contexti / Oracle - Big Data : From Pilot to Production
Contexti / Oracle - Big Data : From Pilot to ProductionContexti / Oracle - Big Data : From Pilot to Production
Contexti / Oracle - Big Data : From Pilot to Production
Contexti
 
Energy analytics with Apache Spark workshop
Energy analytics with Apache Spark workshopEnergy analytics with Apache Spark workshop
Energy analytics with Apache Spark workshop
QuantUniversity
 
Hadoop Security Now and Future
Hadoop Security Now and FutureHadoop Security Now and Future
Hadoop Security Now and Future
tcloudcomputing-tw
 
Launching your career in Big Data
Launching your career in Big DataLaunching your career in Big Data
Launching your career in Big Data
Sujee Maniyam
 
Hadoop bootcamp getting started
Hadoop bootcamp getting startedHadoop bootcamp getting started
Hadoop bootcamp getting started
JWORKS powered by Ordina
 
HDP Advanced Security: Comprehensive Security for Enterprise Hadoop
HDP Advanced Security: Comprehensive Security for Enterprise HadoopHDP Advanced Security: Comprehensive Security for Enterprise Hadoop
HDP Advanced Security: Comprehensive Security for Enterprise Hadoop
Hortonworks
 
Unicom Big Data Conference
Unicom  Big Data ConferenceUnicom  Big Data Conference
Unicom Big Data Conference
Samudra Kanankearachchi
 
Apache Sentry for Hadoop security
Apache Sentry for Hadoop securityApache Sentry for Hadoop security
Apache Sentry for Hadoop security
bigdatagurus_meetup
 

Viewers also liked (20)

Hadoop Operations - Best practices from the field
Hadoop Operations - Best practices from the fieldHadoop Operations - Best practices from the field
Hadoop Operations - Best practices from the field
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data Model
 
Hadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, FutureHadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, Future
 
Apache Spark with Scala
Apache Spark with ScalaApache Spark with Scala
Apache Spark with Scala
 
Deep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an IntroductionDeep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an Introduction
 
Big Data Asset Maturity Model
Big Data Asset Maturity ModelBig Data Asset Maturity Model
Big Data Asset Maturity Model
 
Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...
Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...
Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...
 
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
 
Deep dive into spark streaming
Deep dive into spark streamingDeep dive into spark streaming
Deep dive into spark streaming
 
Big data, Analytics and Beyond
Big data, Analytics and BeyondBig data, Analytics and Beyond
Big data, Analytics and Beyond
 
Fully fault tolerant real time data pipeline with docker and mesos
Fully fault tolerant real time data pipeline with docker and mesos Fully fault tolerant real time data pipeline with docker and mesos
Fully fault tolerant real time data pipeline with docker and mesos
 
Hadoop security landscape
Hadoop security landscapeHadoop security landscape
Hadoop security landscape
 
Contexti / Oracle - Big Data : From Pilot to Production
Contexti / Oracle - Big Data : From Pilot to ProductionContexti / Oracle - Big Data : From Pilot to Production
Contexti / Oracle - Big Data : From Pilot to Production
 
Energy analytics with Apache Spark workshop
Energy analytics with Apache Spark workshopEnergy analytics with Apache Spark workshop
Energy analytics with Apache Spark workshop
 
Hadoop Security Now and Future
Hadoop Security Now and FutureHadoop Security Now and Future
Hadoop Security Now and Future
 
Launching your career in Big Data
Launching your career in Big DataLaunching your career in Big Data
Launching your career in Big Data
 
Hadoop bootcamp getting started
Hadoop bootcamp getting startedHadoop bootcamp getting started
Hadoop bootcamp getting started
 
HDP Advanced Security: Comprehensive Security for Enterprise Hadoop
HDP Advanced Security: Comprehensive Security for Enterprise HadoopHDP Advanced Security: Comprehensive Security for Enterprise Hadoop
HDP Advanced Security: Comprehensive Security for Enterprise Hadoop
 
Unicom Big Data Conference
Unicom  Big Data ConferenceUnicom  Big Data Conference
Unicom Big Data Conference
 
Apache Sentry for Hadoop security
Apache Sentry for Hadoop securityApache Sentry for Hadoop security
Apache Sentry for Hadoop security
 

Apache Spark

  • 1. Apache Spark (uweseiler, 24.11.2014)
  • 2. About me: Big Data Nerd, Hadoop Trainer, NoSQL Fan Boy, Photography Enthusiast, Travelpirate
  • 3. About us: specializes in... Big Data Nerds, Agile Ninjas, Continuous Delivery Gurus, Enterprise Java Specialists, Performance Geeks. Join us!
  • 4. Agenda • Why? • How? • What else? • Who? • Future?
  • 5. Agenda • Why? • How? • What else? • Who? • Future?
  • 6. Spark: In a tweet. “Spark … is what you might call a Swiss Army knife of Big Data analytics tools” – Reynold Xin (@rxin), Berkeley AMPLab Shark Development Lead
  • 7. Spark: In a nutshell • Fast and general engine for large-scale data processing • Advanced DAG execution engine with support for in-memory storage, data locality, and (micro) batch & streaming • Improves usability via rich APIs in Scala, Java, and Python and an interactive shell • Runs standalone, on YARN, on Mesos, and on Amazon EC2
  • 8. Spark is also… • Came out of the AMPLab at UC Berkeley in 2009 • A top-level Apache project as of 2014 (http://spark.apache.org) • Backed by a commercial entity: Databricks • A toolset for data scientists and analysts • An implementation of the Resilient Distributed Dataset (RDD) in Scala • Hadoop compatible
  • 9. Spark: Trends. Search-interest chart comparing Apache Drill, Apache Storm, Apache Spark, Apache YARN, and Apache Tez. Generated using http://www.google.com/trends/
  • 10. Spark: Community. https://github.com/apache/spark/pulse
  • 11. Spark: Performance. 3X faster using 10X fewer machines. http://finance.yahoo.com/news/apache-spark-beats-world-record-130000796.html http://www.wired.com/2014/10/startup-crunches-100-terabytes-data-record-23-minutes/
  • 12. Spark: Ecosystem (diagram). HDFS (redundant, reliable storage); MapReduce (cluster resource mgmt. + data processing); Spark Core; Spark SQL (SQL); Spark Streaming (streaming); MLlib (machine learning); SparkR (R on Spark); GraphX (graph computation); BlinkDB
  • 13. Agenda • Why? • How? • What else? • Who? • Future?
  • 14. Spark: Core Concept • Resilient Distributed Dataset (RDD): conceptually, RDDs can be roughly viewed as partitioned, locality-aware distributed vectors (diagram: RDD A11, A12, A13) • Read-only collection of objects spread across a cluster • Built through parallel transformations & actions • Computation can be represented by lazily evaluated lineage DAGs composed of connected RDDs • Automatically rebuilt on failure • Controllable persistence
  • 15. Spark: RDD Example
        // Base RDD from HDFS
        val lines = spark.textFile("hdfs://...")
        val errors = lines.filter(_.startsWith("Error"))
        val messages = errors.map(_.split('\t')(2))
        messages.cache()  // RDD in memory
        // Iterative processing
        for (str <- Array("foo", "bar"))
          messages.filter(_.contains(str)).count()
  • 16. Spark: Transformations. Transformations create new datasets from existing ones: map
  • 17. Spark: Transformations. Transformations create new datasets from existing ones: map(func), filter(func), flatMap(func), mapPartitions(func), mapPartitionsWithIndex(func), union(otherDataset), intersection(otherDataset), distinct([numTasks]), groupByKey([numTasks]), sortByKey([ascending], [numTasks]), reduceByKey(func, [numTasks]), aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]), join(otherDataset, [numTasks]), cogroup(otherDataset, [numTasks]), cartesian(otherDataset), pipe(command, [envVars]), coalesce(numPartitions), sample(withReplacement, fraction, seed), repartition(numPartitions)
  • 18. Spark: Actions. Actions return a value to the client after running a computation on the dataset: reduce
  • 19. Spark: Actions. Actions return a value to the client after running a computation on the dataset: reduce(func), collect(), count(), first(), countByKey(), foreach(func), take(n), takeSample(withReplacement, num, [seed]), takeOrdered(n, [ordering]), saveAsTextFile(path), saveAsSequenceFile(path) (Java and Scala only), saveAsObjectFile(path) (Java and Scala only)
  • 20. Spark: Dataflow. All transformations in Spark are lazy and are only computed when an action requires it.
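The laziness described on this slide can be observed directly. The following is a minimal sketch, assuming Spark 2.x or later on the classpath and a local master; the object name `LazyEvaluationSketch`, the method name, and the accumulator label are illustrative and not taken from the slides. It counts how often the function passed to `map` has run before and after the first action.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, not part of the original slides.
object LazyEvaluationSketch {

  // Returns how many elements the map function had processed
  // before and after the first action was called.
  def evaluatedBeforeAndAfter(sc: SparkContext): (Long, Long) = {
    val evaluated = sc.longAccumulator("evaluated elements")

    // Transformation only: Spark records the lineage, nothing runs yet.
    val doubled = sc.parallelize(1 to 100).map { x =>
      evaluated.add(1L)
      x * 2
    }
    val before: Long = evaluated.value

    // Action: triggers the computation of the map above.
    doubled.count()
    val after: Long = evaluated.value

    (before, after)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master with two threads is enough for the demo.
    val conf = new SparkConf().setMaster("local[2]").setAppName("lazy-evaluation-sketch")
    val sc = new SparkContext(conf)
    try println(evaluatedBeforeAndAfter(sc))
    finally sc.stop()
  }
}
```

In a failure-free local run the first number should be 0 and the second 100. Accumulator updates made inside transformations can be applied more than once if a task is re-executed, so this is a demonstration device rather than a reliable counter.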
  • 21. Spark: Persistence. One of the most important capabilities in Spark is caching a dataset in memory across operations • cache(): MEMORY_ONLY • persist(): MEMORY_ONLY (default)
  • 22. Spark: Storage Levels • persist(Storage Level)
        Storage level: meaning
        MEMORY_ONLY: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
        MEMORY_AND_DISK: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
        MEMORY_ONLY_SER: Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
        MEMORY_AND_DISK_SER: Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
        DISK_ONLY: Store the RDD partitions only on disk.
        MEMORY_ONLY_2, MEMORY_AND_DISK_2, …: Same as the levels above, but replicate each partition on two cluster nodes.
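Choosing a storage level in code might look like the following minimal sketch, assuming Spark on the classpath and a local master. The object name `PersistenceSketch`, the method name, and the sample data are illustrative and not from the slides.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical helper object, not part of the original slides.
object PersistenceSketch {

  // Persists two RDDs with different levels and reports the levels Spark recorded.
  def storageLevels(sc: SparkContext): (StorageLevel, StorageLevel) = {
    val squares = sc.parallelize(1 to 1000).map(x => x * x)
    squares.cache() // same as persist(StorageLevel.MEMORY_ONLY)

    val cubes = sc.parallelize(1 to 1000).map(x => x * x * x)
    cubes.persist(StorageLevel.MEMORY_AND_DISK) // spill to disk instead of recomputing

    // Persistence is lazy as well: the first action materializes the cached partitions.
    squares.count()
    cubes.count()

    val levels = (squares.getStorageLevel, cubes.getStorageLevel)

    // Release the cached partitions once they are no longer needed.
    squares.unpersist()
    cubes.unpersist()
    levels
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("persistence-sketch")
    val sc = new SparkContext(conf)
    try println(storageLevels(sc))
    finally sc.stop()
  }
}
```

The storage level of an RDD can only be set once; to switch levels, call `unpersist()` first and then `persist` with the new level.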
  • 23. Spark: Parallelism. Can be specified in a number of different ways:
        • RDD partition number: sc.textFile(input, minSplits = 10); sc.parallelize(1 to 10000, numSlices = 10)
        • Mapper-side parallelism: usually inherited from parent RDD(s)
        • Reducer-side parallelism: rdd.reduceByKey(_ + _, numPartitions = 10); rdd.reduceByKey(partitioner = p, _ + _)
        • "Zoom in/out": rdd.repartition(numPartitions: Int); rdd.coalesce(numPartitions: Int, shuffle: Boolean)
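The effect of these knobs on the partition count can be checked with `partitions.length`. A minimal sketch, assuming Spark on the classpath and a local master; the object name `ParallelismSketch`, the method name, and the chosen partition numbers are illustrative, not from the slides.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, not part of the original slides.
object ParallelismSketch {

  // Returns the number of partitions after each step of a small pipeline.
  def partitionCounts(sc: SparkContext): Seq[Int] = {
    // RDD partition number set explicitly at creation time.
    val numbers = sc.parallelize(1 to 10000, numSlices = 10)

    // Mapper side: a narrow transformation inherits the parent's partitioning.
    val pairs = numbers.map(n => (n % 100, 1))

    // Reducer side: the second argument is the number of output partitions.
    val summed = pairs.reduceByKey(_ + _, 4)

    // "Zoom out": repartition always shuffles.
    val wider = summed.repartition(8)

    // "Zoom in": coalesce merges partitions, by default without a shuffle.
    val narrower = wider.coalesce(2)

    Seq(
      numbers.partitions.length,
      pairs.partitions.length,
      summed.partitions.length,
      wider.partitions.length,
      narrower.partitions.length
    )
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("parallelism-sketch")
    val sc = new SparkContext(conf)
    try println(partitionCounts(sc))
    finally sc.stop()
  }
}
```

Because `coalesce` avoids a shuffle by default, it is usually the cheaper way to reduce the number of partitions, while `repartition` is needed to increase it.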
  • 24. Spark: Example. Text processing example: top words by frequency
  • 25. Spark: Frequency Example. Create RDD from external data • Data sources supported by Hadoop: Cassandra, ElasticSearch, HDFS, S3, HBase, MongoDB, … • I/O via Hadoop is optional • // Step 1. Create RDD from Hadoop text files: val docs = spark.textFile("hdfs://docs/")
  • 26. Spark: Frequency Example. Function map: RDD[String] (Hello World This is Spark Spark The end) .map(line => line.toLowerCase) yields RDD[String] (hello world this is spark spark the end)
  • 27. Spark: Frequency Example. Function map: RDD[String] (Hello World This is Spark Spark The end) .map(line => line.toLowerCase) yields RDD[String] (hello world this is spark spark the end); equivalent shorthand: .map(_.toLowerCase)
  • 28. Spark: Frequency Example. Function map: RDD[String] (Hello World This is Spark Spark The end) .map(line => line.toLowerCase), or equivalently .map(_.toLowerCase), yields RDD[String] (hello world this is spark spark the end) • // Step 2. Convert lines to lower case: val lower = docs.map(line => line.toLowerCase)
  • 29. Spark: Frequency Example. map vs. flatMap: RDD[String] (hello world this is spark spark the end) .map(…) with _.split("\\s+") yields RDD[Array[String]]
  • 30. Spark: Frequency Example. map vs. flatMap: RDD[String] (hello world this is spark spark the end) .map(…) with _.split("\\s+") yields RDD[Array[String]]; .flatten* then yields RDD[String] (hello, world, this, …, end)
  • 31. Spark: Frequency Example. map vs. flatMap: RDD[String] (hello world this is spark spark the end) .map(…) with _.split("\\s+") yields RDD[Array[String]]; .flatten* then yields RDD[String] (hello, world, this, …, end); both steps in one: .flatMap(line => line.split("\\s+"))
  • 32. Spark: Frequency Example. map vs. flatMap: RDD[String] (hello world this is spark spark the end) .map(…) with _.split("\\s+") yields RDD[Array[String]]; .flatten* then yields RDD[String] (hello, world, this, …, end); both steps in one: .flatMap(line => line.split("\\s+")) • // Step 3. Split lines into words: val words = lower.flatMap(line => line.split("\\s+"))
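The difference between the two result types can be made concrete with a minimal sketch, assuming Spark on the classpath and a local master. The object name `MapVsFlatMapSketch`, the method name, and the way the sample words are split into lines are assumptions for illustration; the slides do not show the line boundaries.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, not part of the original slides.
object MapVsFlatMapSketch {

  // Splits every line into words, once with map and once with flatMap.
  def splitBothWays(sc: SparkContext, lines: Seq[String]): (Array[Array[String]], Array[String]) = {
    val rdd = sc.parallelize(lines)

    // map keeps one output element per input line: RDD[Array[String]]
    val nested = rdd.map(_.split("\\s+"))

    // flatMap concatenates the per-line results: RDD[String]
    val flat = rdd.flatMap(_.split("\\s+"))

    (nested.collect(), flat.collect())
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("map-vs-flatmap-sketch")
    val sc = new SparkContext(conf)
    try {
      // Assumed line boundaries for the sample text.
      val (nested, flat) = splitBothWays(sc, Seq("hello world", "this is spark", "spark", "the end"))
      println(nested.map(_.mkString("[", ", ", "]")).mkString(" "))
      println(flat.mkString(" "))
    } finally sc.stop()
  }
}
```

With `map`, the number of elements equals the number of lines; with `flatMap`, it equals the number of words.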
  • 33. 2 Spark: Frequency Example 24.11.2014 Key-Value Pairs RDD[String] hello world spark end .map(word = Tuple2(word,1)) RDD[(String, Int)] hello world spark end spark 1 1 spark 1 1 1
  • 34. 2 Spark: Frequency Example 24.11.2014 Key-Value Pairs RDD[String] hello world spark end .map(word = Tuple2(word,1)) = .map(word = (word,1)) RDD[(String, Int)] hello world spark end spark 1 1 spark 1 1 1
  • 35. Spark: Frequency Example (Key-Value Pairs): RDD[String] (hello, world, spark, spark, end) .map(word => Tuple2(word, 1)), or equivalently .map(word => (word, 1)), → RDD[(String, Int)] ((hello, 1), (world, 1), (spark, 1), (spark, 1), (end, 1)). Step 4, convert into tuples: val counts = words.map(word => (word, 1))
  • 36. Spark: Frequency Example (Shuffling): RDD[(String, Int)] ((hello, 1), (world, 1), (spark, 1), (spark, 1), (end, 1)) .groupByKey → RDD[(String, Iterable[Int])] ((end, [1]), (hello, [1]), (spark, [1, 1]), (world, [1]))
  • 37. Spark: Frequency Example (Shuffling): RDD[(String, Int)] ((hello, 1), (world, 1), (spark, 1), (spark, 1), (end, 1)) .groupByKey → RDD[(String, Iterable[Int])] ((end, [1]), (hello, [1]), (spark, [1, 1]), (world, [1])) .mapValues(_.reduce…) with (a, b) => a + b → RDD[(String, Int)] ((end, 1), (hello, 1), (spark, 2), (world, 1))
  • 38. Spark: Frequency Example (Shuffling): RDD[(String, Int)] ((hello, 1), (world, 1), (spark, 1), (spark, 1), (end, 1)) .groupByKey → RDD[(String, Iterable[Int])] ((end, [1]), (hello, [1]), (spark, [1, 1]), (world, [1])) .mapValues(_.reduce…) with (a, b) => a + b → RDD[(String, Int)] ((end, 1), (hello, 1), (spark, 2), (world, 1)). Both steps in one: .reduceByKey((a, b) => a + b)
  • 39. Spark: Frequency Example (Shuffling): RDD[(String, Int)] ((hello, 1), (world, 1), (spark, 1), (spark, 1), (end, 1)) .groupByKey → RDD[(String, Iterable[Int])] ((end, [1]), (hello, [1]), (spark, [1, 1]), (world, [1])) .mapValues(_.reduce…) with (a, b) => a + b → RDD[(String, Int)] ((end, 1), (hello, 1), (spark, 2), (world, 1)). Step 5, count all words: val freq = counts.reduceByKey(_ + _)
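The two routes shown on the shuffling slides (group, then reduce each group, versus reducing per key directly) can be compared side by side. This is a minimal sketch on plain Scala collections, not Spark's implementation; the helper `reduceByKey` below is a hypothetical local function that only mimics the semantics of the RDD method of the same name.

```scala
// Plain Scala collections stand in for RDDs; names are illustrative only.
object ReduceByKeySketch {
  val counts: List[(String, Int)] =
    List("hello" -> 1, "world" -> 1, "spark" -> 1, "spark" -> 1, "end" -> 1)

  // Variant 1: collect all values per key, then reduce each group
  val grouped: Map[String, List[Int]] =
    counts.groupBy(_._1).map { case (word, pairs) => word -> pairs.map(_._2) }
  val viaGroup: Map[String, Int] =
    grouped.map { case (word, ones) => word -> ones.reduce((a, b) => a + b) }

  // Variant 2: fold each value into a running result per key,
  // so the full group of values is never materialized
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(f(_, v)).getOrElse(v))
    }

  val viaReduce: Map[String, Int] = reduceByKey(counts)(_ + _)

  def main(args: Array[String]): Unit = {
    println(viaGroup)
    println(viaReduce)
  }
}
```

In Spark itself, `reduceByKey` additionally combines values within each partition before the shuffle, which is why it is usually preferred over `groupByKey` followed by a reduction.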
  • 40. Spark: Frequency Example (Top N, prepare data): RDD[(String, Int)] ((end, 1), (hello, 1), (spark, 2), (world, 1)) .map(_.swap) → RDD[(Int, String)] ((1, end), (1, hello), (2, spark), (1, world)). Step 6, swap tuples (partial code): freq.map(_.swap)
  • 41. Spark: Frequency Example (Top N, first attempt): RDD[(Int, String)] ((1, end), (1, hello), (2, spark), (1, world)) .sortByKey → RDD[(Int, String)] ((2, spark), (1, end), (1, hello), (1, world))
  • 42. Spark: Frequency Example (Top N): RDD[(Int, String)] ((1, end), (1, hello), (2, spark), (1, world)) .top(N) → RDD[(Int, String)] ((2, spark), (1, end), (1, hello), (1, world)) with a local top N computed per partition
  • 43. Spark: Frequency Example (Top N): RDD[(Int, String)] ((1, end), (1, hello), (2, spark), (1, world)) .top(N) → RDD[(Int, String)] ((2, spark), (1, end), (1, hello), (1, world)) with a local top N computed per partition → reduction → Array[(Int, String)] ((2, spark), (1, end))
  • 44. Spark: Frequency Example (Top N): RDD[(Int, String)] ((1, end), (1, hello), (2, spark), (1, world)) .top(N) → RDD[(Int, String)] ((2, spark), (1, end), (1, hello), (1, world)) with a local top N computed per partition → reduction → Array[(Int, String)] ((2, spark), (1, end)). Step 6, swap tuples (complete code): val top = freq.map(_.swap).top(N)
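The "local top N, then reduction" idea from the Top N slides can be simulated as follows. This is a rough sketch under stated assumptions: the two inner lists are hypothetical partitions, and `topN` is an illustrative helper on plain Scala collections, not Spark's `top` implementation.

```scala
// Simulates a partitioned RDD with a list of lists; names are illustrative only.
object TopNSketch {
  // Two hypothetical partitions of (count, word) pairs
  val partitions: List[List[(Int, String)]] = List(
    List((1, "end"), (1, "hello")),
    List((2, "spark"), (1, "world"))
  )

  def topN(parts: List[List[(Int, String)]], n: Int): List[(Int, String)] = {
    // Descending order on (count, word) tuples
    val desc = Ordering[(Int, String)].reverse
    // Each partition keeps only its n largest pairs
    val localTops = parts.map(_.sorted(desc).take(n))
    // The small local results are merged and cut down to n once more
    localTops.flatten.sorted(desc).take(n)
  }

  val top: List[(Int, String)] = topN(partitions, 2)

  def main(args: Array[String]): Unit = top.foreach(println)
}
```

Only `n` elements per partition reach the final merge, which avoids the full sort of the first attempt. Under the default tuple ordering, ties on the count are broken by the word, so this sketch returns `(2,spark)` and `(1,world)`.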
  • 45. Spark: Frequency Example
      import org.apache.spark.SparkContext
      import org.apache.spark.SparkContext._ // implicit pair-RDD operations such as reduceByKey
      val spark = new SparkContext()
      // Create RDD from Hadoop text file
      val docs = spark.textFile("hdfs://docs/")
      // Split lines into words and process
      val lower = docs.map(line => line.toLowerCase)
      val words = lower.flatMap(line => line.split("\\s+"))
      val counts = words.map(word => (word, 1))
      // Count all words
      val freq = counts.reduceByKey(_ + _)
      // Swap tuples and get top results (N is the number of results to keep)
      val top = freq.map(_.swap).top(N)
      top.foreach(println)
  • 46. Agenda • Why? • How? • What else? • Who? • Future?
  • 47. Spark: Streaming • Real-time computation • Similar to Apache Storm… • Streaming input split into sliding windows of RDDs • Input distributed to memory for fault tolerance • Supports input from Kafka, Flume, ZeroMQ, HDFS, S3, Kinesis, Twitter, …
  • 48. Spark: Streaming (figures: Discretized Stream; Windowed Computations)
  • 49. Spark: Streaming
      TwitterUtils.createStream()
        .filter(_.getText.contains("Spark"))
        .countByWindow(Seconds(5))
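To make the discretized stream and windowed computation idea from the streaming slides concrete, here is a minimal sketch in plain Scala. It assumes a stream is modelled as a list of micro-batches (the sample tweets, the window length of 2 batches and the slide of 1 batch are made-up illustration values) and does not use the Spark Streaming API.

```scala
// A list of micro-batches stands in for a discretized stream; data is hypothetical.
object WindowSketch {
  // Each inner list is one micro-batch of tweet texts
  val batches: List[List[String]] = List(
    List("Spark rocks", "hello"),
    List("learning Spark"),
    List("nothing here"),
    List("Spark Streaming", "Spark SQL")
  )

  // Per-batch step: count the tweets mentioning "Spark" in every micro-batch
  val matches: List[Int] = batches.map(batch => batch.count(_.contains("Spark")))

  // Windowed step: a window of 2 batches that slides forward by 1 batch
  val windowed: List[Int] = matches.sliding(2, 1).map(_.sum).toList

  def main(args: Array[String]): Unit = {
    println(matches)
    println(windowed)
  }
}
```

Each entry of `windowed` corresponds to one window position, which is roughly what a windowed count over a filtered stream reports at each slide interval.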
  • 50. Spark: SQL • Spark SQL allows relational queries expressed in SQL, HiveQL or Scala • Uses SchemaRDDs composed of Row objects (= table in a traditional RDBMS) • A SchemaRDD can be created from an existing RDD, a Parquet file, a JSON dataset, or by running HiveQL against data stored in Apache Hive • Supports a domain-specific language for writing queries
  • 51. Spark: SQL
      registerFunction("LEN", (_: String).length)
      val queryRdd = sql("""
        SELECT * FROM counts
        WHERE LEN(word) = 10
        ORDER BY total DESC
        LIMIT 10
      """)
      queryRdd
        .map(c => s"word: ${c(0)} \t| total: ${c(1)}")
        .collect()
        .foreach(println)
  • 52. Spark: GraphX • GraphX is the Spark API for graphs and graph-parallel computation • APIs to join and traverse graphs • Optimally partitions and indexes vertices & edges (represented as RDDs) • Supports PageRank, connected components, triangle counting, …
  • 53. Spark: GraphX
      val graph = Graph(userIdRDD, assocRDD)
      val ranks = graph.pageRank(0.0001).vertices
      val userRDD = sc.textFile("graphx/data/users.txt")
      val users = userRDD.map { line =>
        val fields = line.split(",")
        (fields(0).toLong, fields(1))
      }
      val ranksByUsername = users.join(ranks).map {
        case (id, (username, rank)) => (username, rank)
      }
  • 54. Spark: MLlib • Machine learning library similar to Apache Mahout • Supports statistics, regression, decision trees, clustering, PCA, gradient descent, … • Iterative algorithms are much faster due to in-memory processing
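  • 55. Spark: MLlib
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
      // Parse lines of the form "label,feature1 feature2 ..."
      val data = sc.textFile("data.txt")
      val parsedData = data.map { line =>
        val parts = line.split(',')
        LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
      }
      // Train a linear regression model with 100 iterations
      val model = LinearRegressionWithSGD.train(parsedData, 100)
      // Compare labels with predictions and compute the mean squared error
      val valuesAndPreds = parsedData.map { point =>
        val prediction = model.predict(point.features)
        (point.label, prediction)
      }
      val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()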
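The MLlib slides note that iterative algorithms benefit from in-memory processing. The sketch below illustrates why, assuming a tiny made-up data set and plain batch gradient descent in ordinary Scala; it is not MLlib's `LinearRegressionWithSGD` implementation, and the learning rate and iteration count are arbitrary illustration values.

```scala
// Batch gradient descent for y = w * x on an in-memory list; all values are hypothetical.
object GradientDescentSketch {
  // Training data that follows y = 2 * x exactly
  val data: List[(Double, Double)] = List((1.0, 2.0), (2.0, 4.0), (3.0, 6.0))

  // Mean squared error of the model y = w * x
  def mse(w: Double): Double =
    data.map { case (x, y) => math.pow(y - w * x, 2) }.sum / data.size

  // One iteration: move w against the gradient of the mean squared error
  def step(w: Double, rate: Double): Double = {
    val gradient = data.map { case (x, y) => -2.0 * x * (y - w * x) }.sum / data.size
    w - rate * gradient
  }

  // Every iteration scans the full data set again, starting from w = 0.0
  val weight: Double = (1 to 100).foldLeft(0.0)((w, _) => step(w, 0.05))

  def main(args: Array[String]): Unit =
    println(s"weight = $weight, MSE = ${mse(weight)}")
}
```

Because each of the 100 iterations reads all training points, keeping the data in memory rather than re-reading it from disk on every pass is where the speed-up for iterative algorithms comes from.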
  • 56. Agenda • Why? • How? • What else? • Who? • Future?
  • 57. Use Case: Yahoo Native Ads • Logistic regression algorithm: 120 LOC in Spark/Scala; 30 min. for model creation with 100M samples and 13K features • Initial version launched within 2 hours after the Spark-on-YARN announcement (compared with several days for hardware acquisition, system setup and data movement) • Source: http://de.slideshare.net/Hadoop_Summit/sparkonyarn-empower-spark-applications-on-hadoop-cluster
  • 58. Use Case: Yahoo Mobile Ads • Learn from mobile search ad click data: 600M labeled examples on HDFS; 100M sparse features • Spark programs for Gradient Boosting Decision Trees: 6 hours for model training with 100 workers; model accuracy very close to heavily manually tuned Logistic Regression models • Source: http://de.slideshare.net/Hadoop_Summit/sparkonyarn-empower-spark-applications-on-hadoop-cluster
  • 59. Agenda • Why? • How? • What else? • Who? • Future?
  • 60. Spark-on-YARN (Current): Hadoop 2, Spark as YARN app (diagram). Components: Pig, Hive, …; MapReduce execution engine; Tez execution engine; Spark (in-memory); Storm (streaming); …; YARN (cluster resource management); HDFS (redundant, reliable storage)
  • 61. Spark-on-YARN (Future): Hadoop 2, Spark as execution engine (diagram). Components: Pig, Hive, Mahout, …; MapReduce execution engine; Tez execution engine; Spark execution engine; Storm (streaming); …; Slider; YARN; HDFS
  • 62. Spark: Future work • Spark Core: focus on maturity, optimization & pluggability; enable long-running services (Slider); give resources back to the cluster when idle; integrate with Hadoop enhancements (timeline server, ORC file format) • Spark ecosystem: focus on adding capabilities
  • 63. One more thing… Let’s get started with Spark!
  • 64. Hortonworks Sandbox 2.2: http://hortonworks.com/hdp/downloads/
  • 65. Hortonworks Sandbox 2.2
      # 1. Download
      wget http://public-repo-1.hortonworks.com/HDP-LABS/Projects/spark/1.1.1/spark-1.1.0.2.1.5.0-701-bin-2.4.0.tgz
      # 2. Untar
      tar xvfz spark-1.1.0.2.1.5.0-701-bin-2.4.0.tgz
      # 3. Start Spark Shell
      ./bin/spark-shell
  • 66. Thanks for listening • Twitter: @uweseiler • Mail: uwe.seiler@codecentric.de • XING: https://www.xing.com/profile/Uwe_Seiler