Hands-On Apache Spark


Spark-HandsOn
In this Hands-On, we show how you can use Apache Spark and some components of its ecosystem for data processing. The workshop is split into four parts. We will use a dataset of tweets containing just a few fields: id, user, text, country and place.
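For illustration, a single record might look like this (the field values here are hypothetical; the real schema is the one shipped with the repository):

    {"id": 123456789, "user": "jdoe", "text": "Hello Spark!", "country": "France", "place": "Paris"}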

In the first one, you will play with the Spark API for basic operations like counting, filtering and aggregating.

After that, you will get to know Spark SQL and use it to query structured data (here, JSON) using SQL.

In the third part, you will use Spark Streaming and the Twitter streaming API to analyse a live stream of tweets.

To finish, we will build a simple model that identifies the language of a text. For that you will use MLlib.

Let's go and have fun!

Prerequisites

Java 6 or newer (Java 8 is better, so you can use lambdas)
An IDE

Some links
Apache Spark https://spark.apache.org

https://speakerdeck.com/nivdul/lightning-fast-machine-learning-with-spark
https://speakerdeck.com/samklr/scalable-machine-learning-with-spark


Hands-On Apache Spark

  1. Hands-On
     @samklr & @nivdul
     2015-03-10
  2. Spark is a fast and general engine for large-scale data processing
     • big data analytics in memory/disk
     • complements Hadoop
     • faster and more flexible
     • Resilient Distributed Datasets (RDD)
     • shared variables
     • interactive shell (Scala & Python)
     • lambdas (Java 8)
  3. RDD (Resilient Distributed Dataset)
     • processed in parallel
     • controllable persistence (memory, disk…)
     • higher-level operations (transformations & actions)
     • rebuilt automatically
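A minimal sketch of these ideas in Scala, assuming a SparkContext sc is already available (the RDD names and sample data are illustrative):

     // build an RDD from a local collection, split into 4 partitions
     val numbers = sc.parallelize(1 to 100, 4)

     // transformations such as filter are lazy: nothing runs yet
     val evens = numbers.filter(_ % 2 == 0)

     // ask Spark to keep this RDD in memory across actions
     evens.cache()

     // actions trigger the actual distributed computation
     val count = evens.count()
     val sample = evens.take(5)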
  4. Dataflow
  5. Example: Wordcount

     // create the configuration for Spark and the context
     val conf = new SparkConf()
       .setAppName("Spark word count")
       .setMaster("local")

     val sc = new SparkContext(conf)

     // load the data
     val data = sc.textFile("filepath/wordcount.txt")

     // map then reduce step
     val wordCounts = data.flatMap(line => line.split("\\s+"))
                          .map(word => (word, 1))
                          .reduceByKey(_ + _)

     // persist the result in memory
     wordCounts.cache()
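Note that cache() is lazy like the transformations before it; nothing is computed until an action runs. A small follow-up sketch (the output path is illustrative):

     // an action triggers the computation and materializes the cache
     wordCounts.collect().foreach { case (word, n) => println(s"$word: $n") }

     // alternatively, write the counts out as text files
     wordCounts.saveAsTextFile("filepath/wordcount-output")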
  6. Spark ecosystem
  7. SQL: unifies access to structured data

     // create a SQL context from the Spark context
     val sqlContext = new SQLContext(sc)

     // load the data and create an RDD
     val tweets = sqlContext.jsonFile(pathToFile)

     // register tweets as a table to operate on it later
     tweets.registerAsTable("tweet")

     // run a SQL query on the RDD
     val nb = sqlContext.sql("SELECT user, COUNT(*) AS c FROM tweet " +
       "WHERE user <> '' " +
       "GROUP BY user " +
       "ORDER BY c ")
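The query itself is also evaluated lazily; an action pulls the rows back to the driver. A minimal sketch using positional access on Row, as in the Spark 1.x API:

     // run the query and print each (user, count) pair
     nb.collect().foreach(row => println(row.getString(0) + " -> " + row.getLong(1)))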
  8. Streaming: makes it easy to build scalable, fault-tolerant streaming applications

     // create a streaming context with a 10-second batch window
     val ssc = new StreamingContext(conf, Seconds(10))

     // create our DStream (a sequence of RDDs) from the Twitter stream
     val tweetsStream = TwitterUtils.createStream(ssc, StreamUtils.getAuth())

     // extract the user of each tweet
     val tweetUser = tweetsStream.map(tweetStatus => tweetStatus.getUser())
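Defining the DStream only declares the pipeline; nothing is processed until the streaming context is started. A minimal sketch:

     // print a sample of the users seen in each 10-second batch
     tweetUser.print()

     // start the computation and block until it is stopped
     ssc.start()
     ssc.awaitTermination()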
  9. MLlib is Apache Spark's scalable machine learning library
     • regression
     • classification
     • clustering
     • optimization
     • collaborative filtering
     • feature extraction (TF-IDF, Word2Vec…)
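As a taste of the API, here is a hedged sketch of the kind of language classifier part 4 builds, assuming an RDD labeledTexts of (numericLanguageLabel, text) pairs prepared elsewhere (that RDD and its contents are illustrative; HashingTF and NaiveBayes are the standard MLlib 1.x calls):

     import org.apache.spark.mllib.classification.NaiveBayes
     import org.apache.spark.mllib.feature.HashingTF
     import org.apache.spark.mllib.regression.LabeledPoint

     // hash each text into a fixed-size term-frequency vector
     val tf = new HashingTF(1000)
     val training = labeledTexts.map { case (label, text) =>
       LabeledPoint(label, tf.transform(text.split(" ").toSeq))
     }

     // train a Naive Bayes model and score an unseen text
     val model = NaiveBayes.train(training)
     val prediction = model.predict(tf.transform("bonjour tout le monde".split(" ").toSeq))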
  10. Exercises
      Part 1: Spark API
      Part 2: Spark Streaming
      Part 3: Spark SQL
      Part 4: MLlib
  11. Let’s go!
      Clone the project from the Duchess France GitHub repository:
      Java: https://github.com/DuchessFrance/Hands-On-Spark-java
      Scala: https://github.com/DuchessFrance/Hands-On-Spark-scala
      All about Spark: http://spark.apache.org/
      And ask if you have any questions :)
      Have fun!
