Hands-On Apache Spark
@samklr & @nivdul
2015-03-10
Apache Spark is a fast and general engine for large-scale data processing
• big data analytics in memory/disk
• complements Hadoop
• faster and more flexible
• Resilient Distributed Datasets (RDD)
• shared variables
interactive shell (Scala & Python)
Lambda (Java 8)
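The "shared variables" bullet above can be illustrated with a broadcast variable, which ships a read-only value to the workers once instead of with every task. This is a minimal sketch, assuming Spark is on the classpath and running in local mode; the object name, the lookup table and the country codes are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastSketch {
  // Replaces each country code with a label taken from a broadcast lookup table.
  def labelCodes(codes: Seq[String]): Seq[String] = {
    val conf = new SparkConf().setAppName("broadcast sketch").setMaster("local")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical lookup table, broadcast once to every executor
      val lookup = sc.broadcast(Map("fr" -> "France", "de" -> "Germany"))
      sc.parallelize(codes)
        .map(code => lookup.value.getOrElse(code, "unknown")) // read-only access in the task
        .collect()
        .toSeq
    } finally {
      sc.stop() // always release the local context
    }
  }
}
```

Calling `BroadcastSketch.labelCodes(Seq("fr", "xx"))` maps the known code to its label and falls back to "unknown" for the other one.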
RDD
(Resilient Distributed Dataset)
• process in parallel
• controllable persistence (memory, disk…)
• higher-level operations (transformations & actions)
• rebuilt automatically on failure (from lineage)
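The properties listed above can be sketched in a few lines: transformations only describe the computation, persistence is requested explicitly, and an action triggers the work. This is a sketch under assumptions (Spark on the classpath, local mode); the object and method names are invented for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  // Returns how many squares of 1..n are even, together with those squares.
  def evenSquares(n: Int): (Long, Seq[Int]) = {
    val conf = new SparkConf().setAppName("rdd sketch").setMaster("local")
    val sc = new SparkContext(conf)
    try {
      // Transformations are lazy: nothing is computed on these two lines
      val squares = sc.parallelize(1 to n).map(x => x * x)
      val evens = squares.filter(_ % 2 == 0)

      // Controllable persistence: keep the result in memory once it is computed
      evens.cache()

      // The lineage Spark would replay to rebuild lost partitions
      println(evens.toDebugString)

      // Actions trigger the computation; the second one reuses the cached data
      val count = evens.count()
      val values = evens.collect().toSeq
      (count, values)
    } finally {
      sc.stop()
    }
  }
}
```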
Dataflow
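Example: word count
import org.apache.spark.{SparkConf, SparkContext}
// before Spark 1.3, reduceByKey also needs: import org.apache.spark.SparkContext._

// create the configuration for Spark and the context
val conf = new SparkConf()
  .setAppName("Spark word count")
  .setMaster("local")

val sc = new SparkContext(conf)

// load the data
val data = sc.textFile("filepath/wordcount.txt")
// map then reduce step: split each line on whitespace, then sum the counts per word
val wordCounts = data.flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
// mark the result for in-memory persistence (nothing runs until an action is called)
wordCounts.cache()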
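The word count above only builds the pipeline, since no action is called. A runnable variant might look like the sketch below, which swaps the text file for in-memory lines and adds `collect()` as the action. It assumes Spark 1.3 or later on the classpath and local mode; `WordCountSketch` and its `count` method are names invented here, and the extra `filter` is an addition that drops the empty token produced by leading whitespace.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  // Counts the words of the given lines and returns the result to the driver.
  def count(lines: Seq[String]): Map[String, Int] = {
    val conf = new SparkConf().setAppName("word count sketch").setMaster("local")
    val sc = new SparkContext(conf)
    try {
      sc.parallelize(lines)             // in-memory stand-in for sc.textFile(...)
        .flatMap(_.split("\\s+"))       // one record per whitespace-separated token
        .filter(_.nonEmpty)             // drop the empty token from leading whitespace
        .map(word => (word, 1))
        .reduceByKey(_ + _)             // sum the 1s per word
        .collect()                      // action: runs the job, brings results back
        .toMap
    } finally {
      sc.stop()
    }
  }
}
```

`collect()` is only reasonable here because the result is tiny; on real data an action such as `take` or `saveAsTextFile` is the safer choice.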
Spark ecosystem
Spark SQL
unifies access to structured data
import org.apache.spark.sql.SQLContext

// create a SQL context from the Spark context
val sqlContext = new SQLContext(sc)

// load the JSON data and create an RDD with a schema
val tweets = sqlContext.jsonFile(pathToFile)
// register tweets as a table so that it can be queried
tweets.registerAsTable("tweet")

// run a SQL query on the registered table: tweets per user, fewest first
val nb = sqlContext.sql("SELECT user, COUNT(*) AS c FROM tweet " +
  "WHERE user <> '' " +
  "GROUP BY user " +
  "ORDER BY c ")
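To make the meaning of the query concrete, here is the same aggregation written against an ordinary Scala collection. This is a plain-Scala analogy, not Spark SQL: the `Tweet` case class and its fields are assumptions made for illustration, and SQL NULL handling is not modeled.

```scala
object TweetCountSketch {
  // Hypothetical, simplified tweet record; the real JSON has many more fields
  final case class Tweet(user: String, text: String)

  // Plain-Scala equivalent of:
  //   SELECT user, COUNT(*) AS c FROM tweet WHERE user <> '' GROUP BY user ORDER BY c
  def tweetsPerUser(tweets: Seq[Tweet]): Seq[(String, Int)] =
    tweets
      .filter(_.user != "")                        // WHERE user <> ''
      .groupBy(_.user)                             // GROUP BY user
      .map { case (user, ts) => (user, ts.size) }  // COUNT(*) AS c
      .toSeq
      .sortBy(_._2)                                // ORDER BY c (ascending)
}
```

Users with the same count come back in no particular order, just as the SQL version does not specify an order for ties.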
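Spark Streaming
makes it easy to build scalable fault-tolerant streaming applications
import org.apache.spark.streaming.{Durations, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

// create a streaming context with a 10-second batch interval
val jssc = new StreamingContext(conf, Durations.seconds(10))

// create our DStream (a sequence of RDDs)
// StreamUtils is not part of Spark; presumably a helper from the hands-on project
val tweetsStream = TwitterUtils.createStream(jssc, StreamUtils.getAuth())

// extract the user of each tweet
val tweetUser = tweetsStream.map(tweetStatus => tweetStatus.getUser())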
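The snippet above defines the stream but never starts it, and it needs Twitter credentials. The sketch below shows the full life cycle (define, start, wait, stop) on a self-contained source: a queue holding one RDD of user names stands in for the Twitter stream. It assumes Spark Streaming 1.3 or later on the classpath; the object name, the one-second batch interval and the five-second timeout are arbitrary choices for the example.

```scala
import java.util.concurrent.atomic.AtomicReference

import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QueueStreamSketch {
  // Feeds one batch of user names through a DStream and returns the per-user counts.
  def countUsers(batch: Seq[String]): Map[String, Int] = {
    val conf = new SparkConf().setAppName("queue stream sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1)) // 1-second batch interval
    val latest = new AtomicReference(Map.empty[String, Int])

    // A queue of RDDs stands in for the Twitter source (one RDD is read per batch)
    val queue = new mutable.Queue[RDD[String]]()
    queue += ssc.sparkContext.parallelize(batch)

    ssc.queueStream(queue)
      .map(user => (user, 1))
      .reduceByKey(_ + _)                  // counts within each batch
      .foreachRDD { rdd =>
        val counts = rdd.collect().toMap
        if (counts.nonEmpty) latest.set(counts) // ignore the empty batches that follow
      }

    ssc.start()                            // nothing runs before this call
    ssc.awaitTerminationOrTimeout(5000)    // let a few batches go through
    ssc.stop(stopSparkContext = true, stopGracefully = false)
    latest.get()
  }
}
```

In a long-running application the timeout would usually be replaced by `awaitTermination()`.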
MLlib
is Apache Spark's scalable machine learning library
• regression
• classification
• clustering
• optimization
• collaborative filtering
• feature extraction (TF-IDF, Word2Vec…)
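As one concrete instance of the clustering bullet, the sketch below runs k-means from the RDD-based MLlib API on six hand-written 2-D points. It assumes spark-mllib is on the classpath and local mode; the points, `k = 2` and the 20 iterations are made up for illustration, and the cluster ids assigned by the model are arbitrary.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansSketch {
  // Clusters six toy points into two groups and returns the size of each cluster.
  def clusterSizes(): Seq[Long] = {
    val conf = new SparkConf().setAppName("kmeans sketch").setMaster("local")
    val sc = new SparkContext(conf)
    try {
      // Two visually separate groups of points (illustrative data)
      val points = sc.parallelize(Seq(
        Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1), Vectors.dense(0.2, 0.0),
        Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.2), Vectors.dense(8.9, 9.0)
      )).cache() // k-means is iterative, so caching the input avoids recomputation

      val model = KMeans.train(points, 2, 20) // k = 2, at most 20 iterations

      // Assign every point to its nearest center and count the points per cluster
      model.predict(points).countByValue().values.toSeq
    } finally {
      sc.stop()
    }
  }
}
```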
Exercises
Part 1: Spark API
Part 2: Spark Streaming
Part 3: Spark SQL
Part 4: MLlib
Let's go!
Clone the project from the Duchess France GitHub repository

Java
https://github.com/DuchessFrance/Hands-On-Spark-java
Scala
https://github.com/DuchessFrance/Hands-On-Spark-scala

All about Spark
http://spark.apache.org/

And ask if you have any questions :)

Have fun!
