24.11.2014 
uweseiler 
Apache Spark
About me
• Big Data Nerd
• Hadoop Trainer
• NoSQL Fan Boy
• Photography Enthusiast
• Travelpirate
About us
specializes in...
• Big Data Nerds
• Agile Ninjas
• Continuous Delivery Gurus
• Enterprise Java Specialists
• Performance Geeks
Join us!
Agenda
• Why? 
• How? 
• What else? 
• Who? 
• Future?
Spark: In a tweet
“Spark … is what you might call a Swiss Army knife of Big Data analytics tools”
– Reynold Xin (@rxin), Berkeley AMPLab Shark Development Lead
Spark: In a nutshell
• Fast and general engine for large-scale data processing
• Advanced DAG execution engine with support for
  – in-memory storage
  – data locality
  – (micro-)batch & streaming support
• Improves usability via
  – Rich APIs in Scala, Java, Python
  – Interactive shell
• Runs standalone, on YARN, on Mesos, and on Amazon EC2
Spark is also…
• Originated at the AMPLab at UC Berkeley in 2009
• A top-level Apache project as of 2014
  – http://spark.apache.org
• Backed by a commercial entity: Databricks
• A toolset for data scientists and analysts
• An implementation of the Resilient Distributed Dataset (RDD) in Scala
• Hadoop compatible
Spark: Trends
[Chart: search interest for Apache Drill, Apache Storm, Apache Spark, Apache YARN and Apache Tez]
Generated using http://www.google.com/trends/
Spark: Community
https://github.com/apache/spark/pulse
Spark: Performance
Sorting 100 TB of data in 23 minutes: 3X faster than the previous record, using 10X fewer machines
http://finance.yahoo.com/news/apache-spark-beats-world-record-130000796.html
http://www.wired.com/2014/10/startup-crunches-100-terabytes-data-record-23-minutes/
Spark: Ecosystem
[Diagram: the Spark stack, top to bottom]
• Libraries: Spark SQL (SQL), Spark Streaming (streaming), MLlib (machine learning), SparkR (R on Spark), GraphX (graph computation), BlinkDB
• Spark Core
• MapReduce (cluster resource mgmt. + data processing)
• HDFS (redundant, reliable storage)
Spark: Core Concept
• Resilient Distributed Dataset (RDD)
  Conceptually, RDDs can be roughly viewed as partitioned, locality-aware distributed vectors
  [Diagram: an RDD split into partitions A11, A12, A13]
• Read-only collection of objects spread across a cluster
• Built through parallel transformations & actions
• Computation can be represented by lazily evaluated lineage DAGs composed of connected RDDs
• Automatically rebuilt on failure
• Controllable persistence
Spark: RDD Example
// Base RDD from HDFS (spark is a SparkContext)
val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("Error"))
// Keep the third tab-separated field of each error line
val messages = errors.map(_.split('\t')(2))
// RDD in memory
messages.cache()
// Iterative processing
for (str <- Array("foo", "bar"))
  messages.filter(_.contains(str)).count()
Spark: Transformations
Transformations: create new datasets from existing ones
map(func)
filter(func)
flatMap(func)
mapPartitions(func)
mapPartitionsWithIndex(func)
union(otherDataset)
intersection(otherDataset)
distinct([numTasks])
groupByKey([numTasks])
sortByKey([ascending], [numTasks])
reduceByKey(func, [numTasks])
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
join(otherDataset, [numTasks])
cogroup(otherDataset, [numTasks])
cartesian(otherDataset)
pipe(command, [envVars])
coalesce(numPartitions)
sample(withReplacement, fraction, seed)
repartition(numPartitions)
Spark: Actions
Actions: return a value to the client after running a computation on the dataset
reduce(func)
collect()
count()
first()
countByKey()
foreach(func)
take(n)
takeSample(withReplacement, num, [seed])
takeOrdered(n, [ordering])
saveAsTextFile(path)
saveAsSequenceFile(path) (only Java and Scala)
saveAsObjectFile(path) (only Java and Scala)
Spark: Dataflow
All transformations in Spark are lazy and are only computed when an action requires it.
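The lazy-evaluation rule can be tried out without a cluster. The following is a minimal sketch that uses a plain Scala collection view as a stand-in for an RDD (an assumption made for illustration; this is not Spark code, and the names `LazyDemo`, `buildPipeline` and `run` are hypothetical). Building the pipeline plays the role of a transformation, summing it plays the role of an action.

```scala
object LazyDemo {
  // Counts how often the mapping function actually runs
  var calls = 0

  // "Transformation": building the view does not evaluate anything yet
  def buildPipeline(data: Seq[Int]) =
    data.view.map { x => calls += 1; x * 2 }

  // "Action": summing forces the pipeline to run
  def run(data: Seq[Int]): Int = buildPipeline(data).sum
}
```

After `buildPipeline` returns, `calls` is still 0; it only increases once `sum` (or another terminal operation) walks over the view.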
Spark: Persistence
One of the most important capabilities in Spark is caching a dataset in memory across operations
• cache(): uses the storage level MEMORY_ONLY
• persist(): defaults to MEMORY_ONLY
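Why caching matters can be sketched with plain Scala collections (again a stand-in for RDDs, not Spark code; `CacheDemo` and its members are hypothetical names). An uncached lazy pipeline is re-evaluated by every terminal operation, while a materialized copy, the analogue of cache(), is computed once and reused.

```scala
object CacheDemo {
  // Counts how often the "expensive" function runs
  var evaluations = 0

  // A lazy pipeline that is evaluated on every traversal
  def pipeline(data: Seq[Int]) =
    data.view.map { x => evaluations += 1; x * x }

  // Two "actions" on the uncached pipeline: the work is done twice
  def uncached(data: Seq[Int]): (Int, Int) = {
    val p = pipeline(data)
    (p.sum, p.max)
  }

  // Materialize once (the analogue of cache()), then reuse the stored result
  def cached(data: Seq[Int]): (Int, Int) = {
    val p = pipeline(data).toVector
    (p.sum, p.max)
  }
}
```

For three input elements, `uncached` runs the mapping function six times and `cached` three times, while both return the same result.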
Spark: Storage Levels
• persist(storageLevel)
Storage Level | Meaning
MEMORY_ONLY | Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
MEMORY_AND_DISK | Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
MEMORY_ONLY_SER | Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
MEMORY_AND_DISK_SER | Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
DISK_ONLY | Store the RDD partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, … | Same as the levels above, but replicate each partition on two cluster nodes.
Spark: Parallelism
Can be specified in a number of different ways
• RDD partition number
  – sc.textFile(input, minPartitions = 10)
  – sc.parallelize(1 to 10000, numSlices = 10)
• Mapper-side parallelism
  – Usually inherited from parent RDD(s)
• Reducer-side parallelism
  – rdd.reduceByKey(_ + _, numPartitions = 10)
  – rdd.reduceByKey(partitioner = p, _ + _)
• “Zoom in/out”
  – rdd.repartition(numPartitions: Int)
  – rdd.coalesce(numPartitions: Int, shuffle: Boolean)
Spark: Example
Text Processing Example 
Top words by frequency
Spark: Frequency Example
Create RDD from external data
• Data sources supported by Hadoop: HDFS, S3, HBase, Cassandra, ElasticSearch, MongoDB, …
• I/O via Hadoop optional
// Step 1. Create RDD from Hadoop text files
val docs = spark.textFile("hdfs://docs/")
Spark: Frequency Example
Function map
[Diagram: an RDD[String] of mixed-case lines (Hello World, This is, Spark, Spark, The end) is mapped to an RDD[String] of lower-case lines (hello world, this is, spark, spark, the end)]
.map(line => line.toLowerCase) = .map(_.toLowerCase)
// Step 2. Convert lines to lower case
val lower = docs.map(line => line.toLowerCase)
Spark: Frequency Example
map vs. flatMap
[Diagram: .map(_.split("\\s+")) turns the RDD[String] of lines into an RDD[Array[String]] with one array of words per line; a conceptual .flatten* step then yields an RDD[String] with one word per element (hello, world, this, …, the, end)]
.flatMap(line => line.split("\\s+")) does both in one step
// Step 3. Split lines into words
val words = lower.flatMap(line => line.split("\\s+"))
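The difference between map and flatMap can be checked on ordinary Scala collections, which offer the same two methods. A minimal sketch, assuming three sample lines chosen for illustration (the object name `MapVsFlatMap` is hypothetical; no Spark involved):

```scala
object MapVsFlatMap {
  // Sample input lines (illustrative data)
  val lines = Seq("hello world", "this is spark", "the end")

  // map keeps one output element per input line: a sequence of word arrays
  val nested: Seq[Array[String]] = lines.map(_.split("\\s+"))

  // flatMap splits and flattens in one step: a sequence of words
  val words: Seq[String] = lines.flatMap(_.split("\\s+"))
}
```

`nested` has 3 elements (one array per line), `words` has 7 elements (one per word).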
Spark: Frequency Example
Key-Value Pairs
[Diagram: the RDD[String] of words (hello, world, spark, spark, end) is mapped to an RDD[(String, Int)] of pairs: (hello,1), (world,1), (spark,1), (spark,1), (end,1)]
.map(word => Tuple2(word, 1)) = .map(word => (word, 1))
// Step 4. Convert into tuples
val counts = words.map(word => (word, 1))
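Spark: Frequency Example
Shuffling
[Diagram: the RDD[(String, Int)] of pairs (hello,1), (world,1), (spark,1), (spark,1), (end,1) is grouped by .groupByKey into an RDD[(String, Iterable[Int])]: (end,[1]), (hello,[1]), (spark,[1,1]), (world,[1]); .mapValues(_.reduce((a,b) => a+b)) then yields an RDD[(String, Int)]: (end,1), (hello,1), (spark,2), (world,1)]
.groupByKey.mapValues(_.reduce((a,b) => a+b)) = .reduceByKey((a,b) => a+b)
// Step 5. Count all words
val freq = counts.reduceByKey(_ + _)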
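The two routes to the per-word totals can be compared on a local collection. This is a sketch with plain Scala collections, not Spark code; `CountByKey`, `viaGroup` and `viaReduce` are hypothetical names. The first route mirrors groupByKey followed by mapValues, the second mirrors reduceByKey, which never materializes the groups. In Spark, reduceByKey can additionally combine values within each partition before the shuffle.

```scala
object CountByKey {
  // The (word, 1) pairs from the slide
  val pairs = Seq(("hello", 1), ("world", 1), ("spark", 1), ("spark", 1), ("end", 1))

  // Two steps: collect all values per key, then reduce each group
  def viaGroup(ps: Seq[(String, Int)]): Map[String, Int] =
    ps.groupBy(_._1).map { case (word, group) =>
      word -> group.map(_._2).reduce(_ + _)
    }

  // One pass: merge each value into a running total per key
  def viaReduce(ps: Seq[(String, Int)]): Map[String, Int] =
    ps.foldLeft(Map.empty[String, Int]) { case (acc, (word, n)) =>
      acc.updated(word, acc.getOrElse(word, 0) + n)
    }
}
```

Both functions return the same map; only the intermediate data differs.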
Spark: Frequency Example
Top N (Prepare data)
[Diagram: .map(_.swap) turns the RDD[(String, Int)] (end,1), (hello,1), (spark,2), (world,1) into an RDD[(Int, String)]: (1,end), (1,hello), (2,spark), (1,world)]
// Step 6. Swap tuples (partial code)
freq.map(_.swap)
Spark: Frequency Example
Top N (First Attempt)
[Diagram: .sortByKey sorts the whole RDD[(Int, String)] by count, shown here in descending order: (2,spark), (1,end), (1,hello), (1,world)]
Spark: Frequency Example
Top N
[Diagram: .top(N) computes a local top N on each partition of the RDD[(Int, String)]; a reduction then merges the local results into an Array[(Int, String)] led by (2,spark)]
// Step 6. Swap tuples (complete code)
val top = freq.map(_.swap).top(N)
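The "local top N, then reduction" idea can be sketched on nested Scala sequences, where each inner sequence stands for one partition. This is an illustration under assumptions, not Spark's implementation: the split into two partitions and the names `TopN`, `localTop` and `top` are hypothetical.

```scala
object TopN {
  // Hypothetical partitioning of the (count, word) pairs
  val partitions: Seq[Seq[(Int, String)]] = Seq(
    Seq((1, "end"), (1, "hello")),
    Seq((2, "spark"), (1, "world"))
  )

  // Largest n elements of one collection, in descending order
  def localTop(part: Seq[(Int, String)], n: Int): Seq[(Int, String)] =
    part.sorted(Ordering[(Int, String)].reverse).take(n)

  // Local top n per partition, then a reduction that merges the partial results
  def top(parts: Seq[Seq[(Int, String)]], n: Int): Seq[(Int, String)] =
    parts.map(localTop(_, n)).reduce((a, b) => localTop(a ++ b, n))
}
```

Only n elements per partition take part in the merge, so no full sort of the data is needed. Ties on the count are broken by the word, because tuples are compared element by element.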
Spark: Frequency Example
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair RDD functions such as reduceByKey
val spark = new SparkContext()
val N = 10 // number of top results (example value)
// Create RDD from Hadoop text file
val docs = spark.textFile("hdfs://docs/")
// Split lines into words and process
val lower = docs.map(line => line.toLowerCase)
val words = lower.flatMap(line => line.split("\\s+"))
val counts = words.map(word => (word, 1))
// Count all words
val freq = counts.reduceByKey(_ + _)
// Swap tuples and get top results
val top = freq.map(_.swap).top(N)
top.foreach(println)
Spark: Streaming
• Real-time computation
• Similar to Apache Storm…
• Streaming input split into sliding windows of RDDs
• Input distributed to memory for fault tolerance
• Supports input from Kafka, Flume, ZeroMQ, HDFS, S3, Kinesis, Twitter, …
Spark: Streaming
[Diagrams: Discretized Stream; Windowed Computations]
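Spark: Streaming
// Simplified: createStream also takes a StreamingContext and optional auth,
// and countByWindow also takes a slide duration
TwitterUtils.createStream()
  .filter(_.getText.contains("Spark"))
  .countByWindow(Seconds(5))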
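What a windowed count computes can be sketched without Spark Streaming. In this illustration each inner sequence stands for one micro-batch of tweet texts, and the window covers a fixed number of consecutive batches. The data and the names `WindowedCount` and `countByWindow` are hypothetical; this is a local model of the idea, not the DStream API.

```scala
object WindowedCount {
  // Each inner Seq stands for one micro-batch of tweet texts (illustrative data)
  val batches: Seq[Seq[String]] = Seq(
    Seq("Spark rocks", "hello"),
    Seq("more Spark"),
    Seq("nothing here"),
    Seq("Spark", "Spark again")
  )

  // Count matching elements over a window of `size` consecutive batches,
  // moving the window forward one batch at a time
  def countByWindow(bs: Seq[Seq[String]], size: Int, keyword: String): List[Int] =
    bs.map(_.count(_.contains(keyword))).sliding(size).map(_.sum).toList
}
```

With a window of two batches, the four sample batches produce three window counts, one per window position.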
Spark: SQL
• Spark SQL allows relational queries expressed in SQL, HiveQL or Scala
• Uses SchemaRDDs composed of Row objects (= table in a traditional RDBMS)
• A SchemaRDD can be created from
  – an existing RDD
  – a Parquet file
  – a JSON dataset
  – running HiveQL against data stored in Apache Hive
• Supports a domain-specific language for writing queries
Spark: SQL
// Register a user-defined function LEN and use it in a query
registerFunction("LEN", (_: String).length)
val queryRdd = sql("""
  SELECT * FROM counts
  WHERE LEN(word) = 10
  ORDER BY total DESC
  LIMIT 10
""")
queryRdd
  .map(c => s"word: ${c(0)} \t| total: ${c(1)}")
  .collect()
  .foreach(println)
Spark: GraphX
• GraphX is the Spark API for graphs and graph-parallel computation
• APIs to join and traverse graphs
• Optimally partitions and indexes vertices & edges (represented as RDDs)
• Supports PageRank, connected components, triangle counting, …
Spark: GraphX
val graph = Graph(userIdRDD, assocRDD)
// Run PageRank until the ranks change by less than the given tolerance
val ranks = graph.pageRank(0.0001).vertices
val userRDD = sc.textFile("graphx/data/users.txt")
val users = userRDD.map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
// Join the ranks with the user names by vertex id
val ranksByUsername = users.join(ranks).map {
  case (id, (username, rank)) => (username, rank)
}
Spark: MLlib
• Machine learning library similar to Apache Mahout
• Supports statistics, regression, decision trees, clustering, PCA, gradient descent, …
• Iterative algorithms much faster due to in-memory processing
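Spark: MLlib
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
// Each line: label, then space-separated features
val data = sc.textFile("data.txt")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(
    parts(0).toDouble,
    Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}
// Train a linear regression model with 100 iterations
val model = LinearRegressionWithSGD.train(parsedData, 100)
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
// Mean squared error on the training data
val MSE = valuesAndPreds
  .map { case (v, p) => math.pow(v - p, 2) }.mean()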
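The last step computes the mean squared error, MSE = (1/n) · Σ (v_i − p_i)², over the (label, prediction) pairs. The same arithmetic on a local sequence, as a small sketch without Spark or MLlib (`MseDemo` and `mse` are hypothetical names):

```scala
object MseDemo {
  // Mean squared error over (label, prediction) pairs
  def mse(valuesAndPreds: Seq[(Double, Double)]): Double =
    valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.sum / valuesAndPreds.size
}
```

For the pairs (1.0, 1.0) and (2.0, 4.0) the squared errors are 0 and 4, so the result is 2.0. For an empty sequence the division yields NaN.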
Use Case: Yahoo Native Ads
Logistic regression algorithm
• 120 LOC in Spark/Scala
• 30 min. on model creation for 100M samples and 13K features
Initial version launched within 2 hours after Spark-on-YARN announcement
• Compared: several days on hardware acquisition, system setup and data movement
http://de.slideshare.net/Hadoop_Summit/sparkonyarn-empower-spark-applications-on-hadoop-cluster
Use Case: Yahoo Mobile Ads
Learn from mobile search ads clicks data
• 600M labeled examples on HDFS
• 100M sparse features
Spark programs for Gradient Boosting Decision Trees
• 6 hours for model training with 100 workers
• Model with accuracy very close to heavily manually tuned Logistic Regression models
http://de.slideshare.net/Hadoop_Summit/sparkonyarn-empower-spark-applications-on-hadoop-cluster
Spark-on-YARN (Current)
Hadoop 2: Spark as YARN app
[Diagram: Hadoop 2 stack. Applications: Pig, Hive, …; execution engines: MapReduce, Tez; in-memory: Spark; streaming: Storm; …; all running on YARN (cluster resource management) on top of HDFS (redundant, reliable storage)]
Spark-on-YARN (Future)
Hadoop 2: Spark as execution engine
[Diagram: Hadoop 2 stack. Applications: Pig, Hive, …, Mahout; execution engines: MapReduce, Tez, Spark; streaming: Storm; …; Slider; all running on YARN on top of HDFS]
Spark: Future work
• Spark Core
  – Focus on maturity, optimization & pluggability
  – Enable long-running services (Slider)
  – Give resources back to cluster when idle
  – Integrate with Hadoop enhancements: Timeline server, ORC file format
• Spark Ecosystem
  – Focus on adding capabilities
One more thing…
Let’s get started with Spark!
Hortonworks Sandbox 2.2
http://hortonworks.com/hdp/downloads/
Hortonworks Sandbox 2.2
# 1. Download
wget http://public-repo-1.hortonworks.com/HDP-LABS/Projects/spark/1.1.1/spark-1.1.0.2.1.5.0-701-bin-2.4.0.tgz
# 2. Untar
tar xvfz spark-1.1.0.2.1.5.0-701-bin-2.4.0.tgz
# 3. Start Spark Shell (assuming the archive unpacks into a directory of the same name)
cd spark-1.1.0.2.1.5.0-701-bin-2.4.0
./bin/spark-shell
Thanks for listening
Twitter: @uweseiler
Mail: uwe.seiler@codecentric.de
XING: https://www.xing.com/profile/Uwe_Seiler

Big Data Analytics with Apache SparkBig Data Analytics with Apache Spark
Big Data Analytics with Apache Spark
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Zero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraZero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and Cassandra
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Webinar: Solr & Spark for Real Time Big Data Analytics
Webinar: Solr & Spark for Real Time Big Data AnalyticsWebinar: Solr & Spark for Real Time Big Data Analytics
Webinar: Solr & Spark for Real Time Big Data Analytics
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
 
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
 
NYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / SolrNYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / Solr
 

More from Uwe Printz

Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Uwe Printz
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Uwe Printz
 
Lightning Talk: Agility & Databases
Lightning Talk: Agility & DatabasesLightning Talk: Agility & Databases
Lightning Talk: Agility & DatabasesUwe Printz
 
Hadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduceHadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduceUwe Printz
 
Welcome to Hadoop2Land!
Welcome to Hadoop2Land!Welcome to Hadoop2Land!
Welcome to Hadoop2Land!Uwe Printz
 
Hadoop 2 - Beyond MapReduce
Hadoop 2 - Beyond MapReduceHadoop 2 - Beyond MapReduce
Hadoop 2 - Beyond MapReduceUwe Printz
 
MongoDB für Java Programmierer (JUGKA, 11.12.13)
MongoDB für Java Programmierer (JUGKA, 11.12.13)MongoDB für Java Programmierer (JUGKA, 11.12.13)
MongoDB für Java Programmierer (JUGKA, 11.12.13)Uwe Printz
 
Hadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduceHadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduceUwe Printz
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Uwe Printz
 
MongoDB for Coder Training (Coding Serbia 2013)
MongoDB for Coder Training (Coding Serbia 2013)MongoDB for Coder Training (Coding Serbia 2013)
MongoDB for Coder Training (Coding Serbia 2013)Uwe Printz
 
MongoDB für Java-Programmierer
MongoDB für Java-ProgrammiererMongoDB für Java-Programmierer
MongoDB für Java-ProgrammiererUwe Printz
 
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Uwe Printz
 
Introduction to Twitter Storm
Introduction to Twitter StormIntroduction to Twitter Storm
Introduction to Twitter StormUwe Printz
 
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Uwe Printz
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Uwe Printz
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Uwe Printz
 
Map/Confused? A practical approach to Map/Reduce with MongoDB
Map/Confused? A practical approach to Map/Reduce with MongoDBMap/Confused? A practical approach to Map/Reduce with MongoDB
Map/Confused? A practical approach to Map/Reduce with MongoDBUwe Printz
 
First meetup of the MongoDB User Group Frankfurt
First meetup of the MongoDB User Group FrankfurtFirst meetup of the MongoDB User Group Frankfurt
First meetup of the MongoDB User Group FrankfurtUwe Printz
 

More from Uwe Printz (18)

Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
Lightning Talk: Agility & Databases
Lightning Talk: Agility & DatabasesLightning Talk: Agility & Databases
Lightning Talk: Agility & Databases
 
Hadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduceHadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduce
 
Welcome to Hadoop2Land!
Welcome to Hadoop2Land!Welcome to Hadoop2Land!
Welcome to Hadoop2Land!
 
Hadoop 2 - Beyond MapReduce
Hadoop 2 - Beyond MapReduceHadoop 2 - Beyond MapReduce
Hadoop 2 - Beyond MapReduce
 
MongoDB für Java Programmierer (JUGKA, 11.12.13)
MongoDB für Java Programmierer (JUGKA, 11.12.13)MongoDB für Java Programmierer (JUGKA, 11.12.13)
MongoDB für Java Programmierer (JUGKA, 11.12.13)
 
Hadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduceHadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduce
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
 
MongoDB for Coder Training (Coding Serbia 2013)
MongoDB for Coder Training (Coding Serbia 2013)MongoDB for Coder Training (Coding Serbia 2013)
MongoDB for Coder Training (Coding Serbia 2013)
 
MongoDB für Java-Programmierer
MongoDB für Java-ProgrammiererMongoDB für Java-Programmierer
MongoDB für Java-Programmierer
 
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
 
Introduction to Twitter Storm
Introduction to Twitter StormIntroduction to Twitter Storm
Introduction to Twitter Storm
 
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)
 
Map/Confused? A practical approach to Map/Reduce with MongoDB
Map/Confused? A practical approach to Map/Reduce with MongoDBMap/Confused? A practical approach to Map/Reduce with MongoDB
Map/Confused? A practical approach to Map/Reduce with MongoDB
 
First meetup of the MongoDB User Group Frankfurt
First meetup of the MongoDB User Group FrankfurtFirst meetup of the MongoDB User Group Frankfurt
First meetup of the MongoDB User Group Frankfurt
 

Recently uploaded

(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate AgentsRyan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate AgentsRyan Mahoney
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 

Recently uploaded (20)

(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate AgentsRyan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 

Apache Spark

  • 1. 2 24.11.2014 uweseiler Apache Spark
  • 2. About me • Big Data Nerd • Hadoop Trainer • NoSQL Fan Boy • Photography Enthusiast • Travelpirate
  • 3. About us • specializes in... • Big Data Nerds • Agile Ninjas • Continuous Delivery Gurus • Enterprise Java Specialists • Performance Geeks • Join us!
  • 4. Agenda • Why? • How? • What else? • Who? • Future?
  • 6. Spark: In a tweet • “Spark … is what you might call a Swiss Army knife of Big Data analytics tools” – Reynold Xin (@rxin), Berkeley AMPLab Shark Development Lead
  • 7. Spark: In a nutshell • Fast and general engine for large-scale data processing • Advanced DAG execution engine with support for in-memory storage, data locality, and (micro) batch and streaming processing • Improves usability via rich APIs in Scala, Java, and Python and an interactive shell • Runs standalone, on YARN, on Mesos, and on Amazon EC2
  • 8. Spark is also… • Came out of AMPLab at UC Berkeley in 2009 • A top-level Apache project as of 2014 – http://spark.apache.org • Backed by a commercial entity: Databricks • A toolset for data scientists and analysts • An implementation of the Resilient Distributed Dataset (RDD) in Scala • Hadoop compatible
  • 9. Spark: Trends • [Chart: search interest for Apache Drill, Apache Storm, Apache Spark, Apache YARN, and Apache Tez] • Generated using http://www.google.com/trends/
  • 10. Spark: Community • https://github.com/apache/spark/pulse
  • 11. Spark: Performance • Sorted 100 TB in 23 minutes: 3X faster using 10X fewer machines than the previous Hadoop MapReduce record • http://finance.yahoo.com/news/apache-spark-beats-world-record-130000796.html • http://www.wired.com/2014/10/startup-crunches-100-terabytes-data-record-23-minutes/
  • 12. Spark: Ecosystem • HDFS: redundant, reliable storage • MapReduce: cluster resource mgmt. + data processing • Spark Core • Spark SQL: SQL • Spark Streaming: streaming • MLlib: machine learning • SparkR: R on Spark • GraphX: graph computation • BlinkDB
  • 13. Agenda • Why? • How? • What else? • Who? • Future?
  • 14. Spark: Core Concept • Resilient Distributed Dataset (RDD): conceptually, RDDs can be roughly viewed as partitioned, locality-aware distributed vectors • Read-only collection of objects spread across a cluster • Built through parallel transformations and actions • Computation can be represented by lazily evaluated lineage DAGs composed of connected RDDs • Automatically rebuilt on failure • Controllable persistence
  • 15. Spark: RDD Example
      val lines = spark.textFile("hdfs://...")          // base RDD from HDFS
      val errors = lines.filter(_.startsWith("Error"))
      val messages = errors.map(_.split('\t')(2))
      messages.cache()                                  // RDD in memory
      for (str <- Array("foo", "bar"))                  // iterative processing
        messages.filter(_.contains(str)).count()
  • 16-17. Spark: Transformations • Transformations create new datasets from existing ones: map(func), filter(func), flatMap(func), mapPartitions(func), mapPartitionsWithIndex(func), union(otherDataset), intersection(otherDataset), distinct([numTasks]), groupByKey([numTasks]), sortByKey([ascending], [numTasks]), reduceByKey(func, [numTasks]), aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]), join(otherDataset, [numTasks]), cogroup(otherDataset, [numTasks]), cartesian(otherDataset), pipe(command, [envVars]), coalesce(numPartitions), sample(withReplacement, fraction, seed), repartition(numPartitions)
  • 18-19. Spark: Actions • Actions return a value to the client after running a computation on the dataset: reduce(func), collect(), count(), first(), countByKey(), foreach(func), take(n), takeSample(withReplacement, num, [seed]), takeOrdered(n, [ordering]), saveAsTextFile(path), saveAsSequenceFile(path) (Java and Scala only), saveAsObjectFile(path) (Java and Scala only)
  • 20. Spark: Dataflow • All transformations in Spark are lazy and are only computed when an action requires them.
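The lazy behaviour described on this slide can be imitated without a cluster. The following is a minimal sketch that uses a plain Scala collection view as a stand-in for an RDD; `LazySketch` and its counter are hypothetical names for illustration, not part of Spark.

```scala
object LazySketch {
  // Returns (evaluations before the "action", result, evaluations after it).
  def run(): (Int, Int, Int) = {
    var evaluations = 0
    // "Transformation": mapping over a view only records the work to be done.
    val doubled = (1 to 5).view.map { n => evaluations += 1; n * 2 }
    val before = evaluations
    // "Action": summing forces the mapping function to run for every element.
    val total = doubled.sum
    (before, total, evaluations)
  }

  def main(args: Array[String]): Unit =
    println(run())
}
```

The counter stays at zero until `sum` is called, which mirrors how Spark postpones the work of `map` or `filter` until an action such as `count()` or `collect()` asks for a result.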
  • 21. Spark: Persistence • One of the most important capabilities in Spark is caching a dataset in memory across operations • cache(): MEMORY_ONLY • persist(): MEMORY_ONLY by default
  • 22. Spark: Storage Levels • persist(storageLevel)
      MEMORY_ONLY: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
      MEMORY_AND_DISK: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
      MEMORY_ONLY_SER: Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
      MEMORY_AND_DISK_SER: Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
      DISK_ONLY: Store the RDD partitions only on disk.
      MEMORY_ONLY_2, MEMORY_AND_DISK_2, …: Same as the levels above, but replicate each partition on two cluster nodes.
  • 23. Spark: Parallelism • Can be specified in a number of different ways • RDD partition number: sc.textFile(input, minPartitions = 10), sc.parallelize(1 to 10000, numSlices = 10) • Mapper-side parallelism: usually inherited from parent RDD(s) • Reducer-side parallelism: rdd.reduceByKey(_ + _, numPartitions = 10), rdd.reduceByKey(partitioner = p, _ + _) • “Zoom in/out”: rdd.repartition(numPartitions: Int), rdd.coalesce(numPartitions: Int, shuffle: Boolean)
  • 24. Spark: Example • Text processing example: top words by frequency
  • 25. Spark: Frequency Example • Create RDD from external data • Data sources supported by Hadoop: Cassandra, ElasticSearch, HDFS, S3, HBase, MongoDB, … • I/O via Hadoop is optional
      // Step 1. Create RDD from Hadoop text files
      val docs = spark.textFile("hdfs://docs/")
  • 26-28. Spark: Frequency Example • Function map • RDD[String] (Hello World This is Spark Spark The end) → .map(line => line.toLowerCase) → RDD[String] (hello world this is spark spark the end) • Equivalent shorthand: .map(_.toLowerCase)
      // Step 2. Convert lines to lower case
      val lower = docs.map(line => line.toLowerCase)
  • 29-32. Spark: Frequency Example • map vs. flatMap • RDD[String] (hello world this is spark spark the end) → .map(_.split("\\s+")) → RDD[Array[String]] → .flatten* → RDD[String] (hello, world, this, is, spark, spark, the, end) • .flatMap(line => line.split("\\s+")) produces the same RDD[String] in one step
      // Step 3. Split lines into words
      val words = lower.flatMap(line => line.split("\\s+"))
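The difference between map and flatMap can be tried out on ordinary Scala lists, which offer the same two operations. This is a sketch only: the grouping of the words into four lines is an assumption (the slide does not show the line breaks), and `MapVsFlatMap` is a hypothetical name.

```scala
object MapVsFlatMap {
  // Assumed input lines; a List stands in for RDD[String].
  val lines: List[String] = List("hello world", "this is spark", "spark", "the end")

  // map keeps one output element per input line: a list of word arrays.
  val nested: List[Array[String]] = lines.map(_.split("\\s+"))

  // flatMap splits each line and concatenates the results into one flat list.
  val words: List[String] = lines.flatMap(_.split("\\s+"))

  def main(args: Array[String]): Unit = {
    println(nested.map(_.mkString("[", ", ", "]")))
    println(words)
  }
}
```

With map the result still has four elements, one array per line; with flatMap it has eight, one per word, which is the shape the later counting steps need.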
  • 33-35. Spark: Frequency Example • Key-Value Pairs • RDD[String] (hello, world, spark, spark, end) → .map(word => Tuple2(word, 1)) → RDD[(String, Int)] ((hello, 1), (world, 1), (spark, 1), (spark, 1), (end, 1)) • Equivalent shorthand: .map(word => (word, 1))
      // Step 4. Convert into tuples
      val counts = words.map(word => (word, 1))
  • 36-39. Spark: Frequency Example • Shuffling • RDD[(String, Int)] ((hello, 1), (world, 1), (spark, 1), (spark, 1), (end, 1)) → .groupByKey → RDD[(String, Iterable[Int])] ((end, [1]), (hello, [1]), (spark, [1, 1]), (world, [1])) → .mapValues(_.reduce…) with (a, b) => a + b → RDD[(String, Int)] ((end, 1), (hello, 1), (spark, 2), (world, 1)) • .reduceByKey((a, b) => a + b) produces the same result in one step
      // Step 5. Count all words
      val freq = counts.reduceByKey(_ + _)
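The two routes to the word counts shown on these slides can be compared on plain Scala collections. The sketch below is an analogy, not Spark code: `CountByKeySketch`, `viaGroup`, and `viaReduce` are hypothetical names, and a List of pairs stands in for RDD[(String, Int)].

```scala
object CountByKeySketch {
  val pairs: List[(String, Int)] =
    List("hello" -> 1, "world" -> 1, "spark" -> 1, "spark" -> 1, "end" -> 1)

  // groupByKey style: first collect every value per key, then reduce each group.
  def viaGroup(ps: List[(String, Int)]): Map[String, Int] =
    ps.groupBy(_._1).map { case (word, group) => word -> group.map(_._2).reduce(_ + _) }

  // reduceByKey style: fold each value into a running total per key,
  // so the full list of values per key is never materialized.
  def viaReduce(ps: List[(String, Int)]): Map[String, Int] =
    ps.foldLeft(Map.empty[String, Int]) { case (acc, (word, n)) =>
      acc.updated(word, acc.getOrElse(word, 0) + n)
    }

  def main(args: Array[String]): Unit = {
    println(viaGroup(pairs))
    println(viaReduce(pairs))
  }
}
```

Both functions return the same map. In Spark the second style is usually preferred for aggregations because values can be combined on each partition before the shuffle, so less data crosses the network.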
  • 40. Spark: Frequency Example • Top N (prepare data) • RDD[(String, Int)] ((end, 1), (hello, 1), (spark, 2), (world, 1)) → .map(_.swap) → RDD[(Int, String)] ((1, end), (1, hello), (2, spark), (1, world))
      // Step 6. Swap tuples (partial code)
      freq.map(_.swap)
  • 41. Spark: Frequency Example • Top N (first attempt) • RDD[(Int, String)] ((1, end), (1, hello), (2, spark), (1, world)) → .sortByKey(ascending = false) → RDD[(Int, String)] ((2, spark), (1, end), (1, hello), (1, world))
  • 42-44. Spark: Frequency Example • Top N • RDD[(Int, String)] ((1, end), (1, hello), (2, spark), (1, world)) → .top(N): a local top N is computed on each partition, followed by a reduction → Array[(Int, String)], for N = 2: ((2, spark), (1, world)), because ties on the count are broken by the word in descending order
      // Step 6. Swap tuples (complete code)
      val top = freq.map(_.swap).top(N)
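The "local top N, then reduction" idea can be sketched with nested lists standing in for partitions. Everything here is an assumption made for illustration: the split into two partitions, the name `TopNSketch`, and the use of sorting rather than whatever bounded structure Spark uses internally.

```scala
object TopNSketch {
  // Two hypothetical partitions of (count, word) pairs.
  val partitions: List[List[(Int, String)]] = List(
    List((1, "end"), (1, "hello")),
    List((2, "spark"), (1, "world"))
  )

  def top(parts: List[List[(Int, String)]], n: Int): List[(Int, String)] = {
    // Descending tuple ordering: by count first, then by word.
    val descending = Ordering[(Int, String)].reverse
    // Local top N: each partition keeps only its n largest elements.
    val localTops = parts.map(_.sorted(descending).take(n))
    // Reduction: merge the small local results and keep the overall n largest.
    localTops.flatten.sorted(descending).take(n)
  }

  def main(args: Array[String]): Unit =
    println(top(partitions, 2))
}
```

Only n elements per partition reach the final merge, which is why this is cheaper than sorting the whole dataset. Note that with the default tuple ordering, ties on the count are broken by the word, so for N = 2 the second entry is (1, world).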
  • 45. Spark: Frequency Example
      val spark = new SparkContext()
      // Create RDD from Hadoop text file
      val docs = spark.textFile("hdfs://docs/")
      // Split lines into words and process
      val lower = docs.map(line => line.toLowerCase)
      val words = lower.flatMap(line => line.split("\\s+"))
      val counts = words.map(word => (word, 1))
      // Count all words
      val freq = counts.reduceByKey(_ + _)
      // Swap tuples and get top results
      val top = freq.map(_.swap).top(N)
      top.foreach(println)
  • 46. 2 Agenda 24.11.2014 • Why? • How? • What else? • Who? • Future?
  • 47. 2 Spark: Streaming 24.11.2014 • Real-time computation • Similar to Apache Storm… • Streaming input split into sliding windows of RDD‘s • Input distributed to memory for fault tolerance • Supports input from Kafka, Flume, ZeroMQ, HDFS, S3, Kinesis, Twitter, …
  • 48. Spark: Streaming
      [Figures: Discretized Stream; Windowed Computations]
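  • 49. Spark: Streaming
      // ssc: an existing StreamingContext (assumed); None: use the default Twitter4J credentials
      TwitterUtils.createStream(ssc, None)
        .filter(_.getText.contains("Spark"))
        .countByWindow(Seconds(5), Seconds(5)) // window length, slide interval (slide interval assumed)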
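What a windowed count does can be shown without a streaming cluster. The following is a rough sketch on plain Scala collections, assuming each inner `Seq` is one micro-batch of tweet texts; `WindowSketch` and its parameters are hypothetical and only mimic the idea behind `filter` plus `countByWindow`.

```scala
object WindowSketch {
  // batches:   one inner Seq per micro-batch of tweet texts
  // windowLen: number of consecutive batches per window
  // slide:     number of batches the window advances each step
  def countByWindow(
      batches: Seq[Seq[String]],
      windowLen: Int,
      slide: Int,
      keyword: String
  ): List[Int] =
    batches
      .map(_.count(_.contains(keyword))) // matching tweets per batch
      .sliding(windowLen, slide)         // group consecutive batches into windows
      .map(_.sum)                        // total per window
      .toList

  def main(args: Array[String]): Unit = {
    val batches = Seq(
      Seq("Spark rocks", "hello"),
      Seq("nothing here"),
      Seq("Spark", "Spark SQL"),
      Seq("bye")
    )
    println(countByWindow(batches, windowLen = 2, slide = 1, keyword = "Spark"))
  }
}
```

Unlike a real stream, the sketch sees all batches up front, and Scala's `sliding` may emit a shorter final window when the slide is larger than 1, so treat it as an illustration of the window arithmetic only.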
  • 50. Spark: SQL
      • Spark SQL allows relational queries expressed in SQL, HiveQL or Scala
      • Uses SchemaRDDs composed of Row objects (a SchemaRDD corresponds to a table in a traditional RDBMS)
      • A SchemaRDD can be created from
        • an existing RDD
        • a Parquet file
        • a JSON dataset
        • HiveQL run against data stored in Apache Hive
      • Supports a domain-specific language for writing queries
  • 51. Spark: SQL
      // Assumes the members of a SQLContext are in scope, e.g. via import sqlContext._
      registerFunction("LEN", (_: String).length)
      val queryRdd = sql("""
        SELECT * FROM counts
        WHERE LEN(word) = 10
        ORDER BY total DESC
        LIMIT 10
      """)
      queryRdd
        .map(c => s"word: ${c(0)} \t| total: ${c(1)}")
        .collect()
        .foreach(println)
  • 52. Spark: GraphX
      • GraphX is the Spark API for graphs and graph-parallel computation
      • APIs to join and traverse graphs
      • Optimally partitions and indexes vertices and edges (represented as RDDs)
      • Supports PageRank, connected components, triangle counting, …
  • 53. Spark: GraphX
      import org.apache.spark.graphx._
      // userIdRDD (vertices) and assocRDD (edges) are assumed to exist already
      val graph = Graph(userIdRDD, assocRDD)
      val ranks = graph.pageRank(0.0001).vertices
      val userRDD = sc.textFile("graphx/data/users.txt")
      val users = userRDD.map { line =>
        val fields = line.split(",")
        (fields(0).toLong, fields(1))
      }
      val ranksByUsername = users.join(ranks).map {
        case (id, (username, rank)) => (username, rank)
      }
  • 54. Spark: MLlib
      • Machine learning library similar to Apache Mahout
      • Supports statistics, regression, decision trees, clustering, PCA, gradient descent, …
      • Iterative algorithms run much faster due to in-memory processing
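  • 55. Spark: MLlib
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
      val data = sc.textFile("data.txt")
      // Each line: label, comma, space-separated features
      val parsedData = data.map { line =>
        val parts = line.split(',')
        LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
      }
      // Train with 100 iterations
      val model = LinearRegressionWithSGD.train(parsedData, 100)
      val valuesAndPreds = parsedData.map { point =>
        val prediction = model.predict(point.features)
        (point.label, prediction)
      }
      // Mean squared error on the training data
      val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()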
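The last line of the slide computes a mean squared error over (label, prediction) pairs. The arithmetic can be checked locally with a small sketch on a plain `Seq`; `MseSketch` and `mse` are hypothetical names, and the sample values in `main` are made up.

```scala
object MseSketch {
  // Mean squared error: the average of (label - prediction)^2 over all pairs.
  // Returns NaN for an empty input, since 0.0 / 0 is NaN.
  def mse(valuesAndPreds: Seq[(Double, Double)]): Double =
    valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.sum / valuesAndPreds.size

  def main(args: Array[String]): Unit = {
    // One exact prediction and one that is off by 2: (0 + 4) / 2
    println(mse(Seq((1.0, 1.0), (2.0, 4.0))))
  }
}
```

On the slide the same computation runs distributed, with `mean()` on the RDD taking the place of the explicit sum and division.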
  • 56. Agenda • Why? • How? • What else? • Who? • Future?
  • 57. Use Case: Yahoo Native Ads
      Logistic regression algorithm
      • 120 LOC in Spark/Scala
      • 30 min. of model creation for 100M samples and 13K features
      Initial version launched within 2 hours of the Spark-on-YARN announcement
      • Compared with: several days for hardware acquisition, system setup and data movement
      http://de.slideshare.net/Hadoop_Summit/sparkonyarn-empower-spark-applications-on-hadoop-cluster
  • 58. Use Case: Yahoo Mobile Ads
      Learn from mobile search ads click data
      • 600M labeled examples on HDFS
      • 100M sparse features
      Spark programs for Gradient Boosting Decision Trees
      • 6 hours for model training with 100 workers
      • Model accuracy very close to heavily hand-tuned Logistic Regression models
      http://de.slideshare.net/Hadoop_Summit/sparkonyarn-empower-spark-applications-on-hadoop-cluster
  • 59. Agenda • Why? • How? • What else? • Who? • Future?
  • 60. Spark-on-YARN (Current)
      [Diagram: Hadoop 2 with Spark as a YARN app. HDFS (redundant, reliable storage) at the base; YARN (cluster resource management) on top of it; running on YARN: the MapReduce and Tez execution engines (used by Pig, Hive, …), Storm (streaming, …) and Spark (in-memory).]
  • 61. Spark-on-YARN (Future)
      [Diagram: Hadoop 2 with Spark as an execution engine. Components shown: HDFS, YARN, the MapReduce, Tez and Spark execution engines, Pig, Hive, Mahout, …, Storm (streaming, …) and Slider.]
  • 62. Spark: Future work
      • Spark Core
        • Focus on maturity, optimization & pluggability
        • Enable long-running services (Slider)
        • Give resources back to the cluster when idle
        • Integrate with Hadoop enhancements
          • Timeline server
          • ORC file format
      • Spark ecosystem
        • Focus on adding capabilities
  • 63. One more thing…
      Let’s get started with Spark!
  • 64. Hortonworks Sandbox 2.2
      http://hortonworks.com/hdp/downloads/
  • 65. Hortonworks Sandbox 2.2
      # 1. Download
      wget http://public-repo-1.hortonworks.com/HDP-LABS/Projects/spark/1.1.1/spark-1.1.0.2.1.5.0-701-bin-2.4.0.tgz
      # 2. Untar
      tar xvfz spark-1.1.0.2.1.5.0-701-bin-2.4.0.tgz
      # 3. Start Spark Shell (from the unpacked directory)
      ./bin/spark-shell
  • 66. Thanks for listening
      Twitter: @uweseiler
      Mail: uwe.seiler@codecentric.de
      XING: https://www.xing.com/profile/Uwe_Seiler