SlideShare a Scribd company logo
1 of 34
Download to read offline
Spark Streaming
Data with Wings
Never stop the movement
Agenda
• Spark overview
• Spark Streaming overview
• Processing streams from meetup.com
• Little intro to stateful stream processing
• All together
There will be no pictures of
cats!
Lambda Architecture
BUFFER
BATCH
SPEED
SERVING
Hourly/Daily Loads
Every Second
Near
Real-time
Views
Real-time Decision
Making
Do you see any problems
with Lambda Architecture?
Spark Overview
What if you could write programs operating with petabyte size
datasets and large streams same way you operate Iterable
collections?
Apache Spark
Resilient Distributed Dataset
Resilient - resilience is addressed by tracking the
log of operations performed on the dataset.
Because of the side effects are eliminated, every
lost partition can be recalculated in case of a loss.
Distributed - the dataset is partitioned. We can
specify partitioning scheme for every operation.
Dataset - can be built from regular files, HDFS
large files, Cassandra table, HBase table, etc.
Obligatory word count
lines = spark.textFile("hdfs://...")
lines.flatMap{line: line.split(“ “)}
.map({word: (word, 1)})
.reduceByKey(_+_)
Spark Runtime
Spark is like Yoda and Hulk
combined
Discretized Stream
New RDD every second
https://spark.apache.org/docs/1.2.0/streaming-programming-guide.html
DStream - still pretty
You operate on stream just like on collections
https://spark.apache.org/docs/1.2.0/streaming-programming-guide.html
meetup.com Streams
Let’s play with meetup.com streams:
• Events
• RSVPs
RSVP Schema
{ "event" : { "event_id" : "220993343",
"event_name" : "Paper Collage",
"event_url" : "http://www.meetup.com/ELEOS-Art-School-Studio/events/220993343/",
"time" : 1427551200000
},
"group" : {...},
"guests" : 0,
"member" : { "member_id" : 120762942,
"member_name" : "Esther",
"photo" : "http://photos1.meetupstatic.com/photos/member/b/b/0/thumb_159962992.jpeg"
},
...
}
Event Schema
{ "description" : "<p>90 Minute walking tour with your dog!  Please arrive early.</p>n<p>Tour of Balboa Parks spookiest
locations on ...>",
"duration" : 5400000,
"event_url" : "http://www.meetup.com/SanDiegoDogWalkers/events/220302036/",
"group" : {...},
"id" : "220302036",
"maybe_rsvp_count" : 0,
"mtime" : 1425785739616,
"name" : "After Dark Ghost Walking Tour in Balboa Park with your dog!",
"payment_required" : "0",
"rsvp_limit" : 20,
"status" : "upcoming",
"time" : 1426993200000,
"utc_offset" : -25200000,
"venue" : {....},
Meetup Receiver
Receiver is the way to pump data to spark streaming.
In our case, we connect to meetup streaming api and
send json to Spark.
Code is a bit long, but you can explore it:
https://github.com/actions/meetup-stream/blob/
master/src/main/scala/receiver/MeetupReceiver.scala
Intro to Stateful Stream
Processing
def locaitonCounts(eventsStream: DStream[(Venue,Event)])=
liveEvents

.filter{case(venue,event)=>venue.country==“usa”}
.map{case(venue,event)=>(venue.toCityState,1)}
.updateStateByKey(countForLocaiton)
.print
def countForLocaiton(counts: Seq[Int], initCount: Option[Int])
=Some(initCount.getOrElse(0) + counts.sum)
How it works again…?
CityState Count
San Francisco, CA 5
LA, CA 3
Portland, OR 2
CityState Count
San Francisco,CA 1
Miami, FL 1
CityState Count
San Francisco, CA 6
LA, CA 3
Portland, OR 2
Miami, FL 1
11 sec10 sec
CityState Count
San Francisco,CA 1
Miami, FL 1
Incoming Stream
State Stream
Aerial refueling
http://commons.wikimedia.org/wiki/File:Aerial_refueling_CH-53_DF-SD-06-02984.JPG
meetup.com connection
recommendation app
Meetup Event Clustering
Meetup Professional
Connection Recommendation
Meetup Recommendations
Pipeline
Initializing RSVP Stream and
the Event Dataset
val conf = new SparkConf()
.setMaster("local[4]")
.setAppName("MeetupExperiments")
.set("spark.executor.memory", "1g")
.set("spark.driver.memory", "1g")
val ssc=new StreamingContext(conf, Seconds(1))
val rsvpStream = ssc.receiverStream(
new MeetupReceiver(“http://stream.meetup.com/2/
rsvps”)).flatMap(parseRsvp)
val eventsHistory = ssc.sparkContext.textFile("data/events/
events.json", 1).flatMap(parseEvent)
Broadcasting Dictionary
val localDictionary=Source
.fromURL(getClass.getResource("/wordsEn.txt"))
.getLines
.zipWithIndex
.toMap
val dictionary=ssc.sparkContext
.broadcast(localDictionary)
Feature Extraction
10 most popular words in the description.
def eventToVector(event: Event): Option[Vector]={
val wordsIterator =
event.description.map(breakToWords).getOrElse(Iterator())
val topWords=popularWords(wordsIterator)
if (topWords.size==10)
Some(Vectors.sparse(dictionary.value.size,topWords)) else None
}
val eventVectors=eventsHistory.flatMap{event=>eventToVector(event)}
Training based on existing
dataset
val eventClusters = KMeans.train(eventVectors, 10, 2)
http://scikit-learn.org/0.11/auto_examples/cluster/plot_kmeans_digits.html
Event History By ID
val eventHistoryById=eventsHistory
.map{event=>(event.id, event.description.getOrElse(""))}
.reduceByKey{(first: String, second: String)=>first}
(220302036,“…description1 …”)

(220302037,“…description2 …”)

(220302038,”…description3 …”)
Streaming lookups
Looking up the event description by eventId from
rsvp.
val rsvpEventInfo = membersByEventId.transform(
rdd=>rdd.join(eventHistoryById)
)
(eventId, (member, response), description)
(220819928,((Member(Some(cecelia rogers),Some(162556712)),yes),"...")
(221153676,((Member(Some(Carol),Some(183499291)),no),”...")
…
Streaming Clustering
val memberEventInfo = rsvpEventInfo
.flatMap{
case(eventId, ((member, response), event)) => {
eventToVector(event).map{ eventVector=>
val eventCluster=eventClusters.predict(eventVector)
(eventCluster,(member, response))
}
}}
Clustering members
def groupMembers(memberResponses: Seq[(Member, String)], initList: Option[Set[Member]]) =
{
val initialMemberList=initList.getOrElse(Set())
val newMemberList=(memberResponses : initialMemberList) {
case((member, response), memberList) =>
if (response == "yes") memberList + member else memberList - member
}
if (newMemberList.size>0) Some(newMemberList) else None
}
val memberGroups =
memberEventInfo.updateStateByKey(groupMembers)
Recommendations
val recommendations=memberEventInfo
.join(memberGroups)
.map{
case(cluster, ((member, memberResponse), members)) =>
(member.memberName, members-member)
}
(Some(Mike D) -> Set(Member(Some(Sioux),Some(85761302)),
Member(Some(Aileen),Some(12579632)),
Member(Some(Teri),Some(148306762))))
Try it
https://github.com/actions/meetup-stream

More Related Content

What's hot

An Introduction to Spark
An Introduction to SparkAn Introduction to Spark
An Introduction to Sparkjlacefie
 
Bellevue Big Data meetup: Dive Deep into Spark Streaming
Bellevue Big Data meetup: Dive Deep into Spark StreamingBellevue Big Data meetup: Dive Deep into Spark Streaming
Bellevue Big Data meetup: Dive Deep into Spark StreamingSantosh Sahoo
 
Cassandra spark connector
Cassandra spark connectorCassandra spark connector
Cassandra spark connectorDuyhai Doan
 
Zero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraZero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraRussell Spitzer
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraPatrick McFadin
 
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Brian O'Neill
 
Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra Matthias Niehoff
 
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials DayAnalytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials DayMatthias Niehoff
 
[Spark meetup] Spark Streaming Overview
[Spark meetup] Spark Streaming Overview[Spark meetup] Spark Streaming Overview
[Spark meetup] Spark Streaming OverviewStratio
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming JobsDatabricks
 
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache SparkArbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache SparkDatabricks
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterDon Drake
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousJen Aman
 
Dive into Spark Streaming
Dive into Spark StreamingDive into Spark Streaming
Dive into Spark StreamingGerard Maas
 
Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data prajods
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax EnablementVincent Poncet
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Anton Kirillov
 
Spark Streaming @ Berlin Apache Spark Meetup, March 2015
Spark Streaming @ Berlin Apache Spark Meetup, March 2015Spark Streaming @ Berlin Apache Spark Meetup, March 2015
Spark Streaming @ Berlin Apache Spark Meetup, March 2015Stratio
 
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016StampedeCon
 

What's hot (20)

An Introduction to Spark
An Introduction to SparkAn Introduction to Spark
An Introduction to Spark
 
Bellevue Big Data meetup: Dive Deep into Spark Streaming
Bellevue Big Data meetup: Dive Deep into Spark StreamingBellevue Big Data meetup: Dive Deep into Spark Streaming
Bellevue Big Data meetup: Dive Deep into Spark Streaming
 
Cassandra spark connector
Cassandra spark connectorCassandra spark connector
Cassandra spark connector
 
Zero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraZero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and Cassandra
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and Cassandra
 
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
 
Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra
 
Spark streaming: Best Practices
Spark streaming: Best PracticesSpark streaming: Best Practices
Spark streaming: Best Practices
 
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials DayAnalytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
 
[Spark meetup] Spark Streaming Overview
[Spark meetup] Spark Streaming Overview[Spark meetup] Spark Streaming Overview
[Spark meetup] Spark Streaming Overview
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
 
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache SparkArbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 Furious
 
Dive into Spark Streaming
Dive into Spark StreamingDive into Spark Streaming
Dive into Spark Streaming
 
Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
 
Spark Streaming @ Berlin Apache Spark Meetup, March 2015
Spark Streaming @ Berlin Apache Spark Meetup, March 2015Spark Streaming @ Berlin Apache Spark Meetup, March 2015
Spark Streaming @ Berlin Apache Spark Meetup, March 2015
 
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
 

Viewers also liked

Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Holden Karau
 
검색어 대중도, 연결망 분석 - 21021899 김수빈
검색어 대중도, 연결망 분석 - 21021899 김수빈검색어 대중도, 연결망 분석 - 21021899 김수빈
검색어 대중도, 연결망 분석 - 21021899 김수빈Webometrics Class
 
Transformations and actions a visual guide training
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide trainingSpark Summit
 
Using Spark Part Time
Using Spark Part TimeUsing Spark Part Time
Using Spark Part TimeRajiv Shah
 
Build and Deploy a Python Web App to Amazon in 30 Mins
Build and Deploy a Python Web App to Amazon in 30 MinsBuild and Deploy a Python Web App to Amazon in 30 Mins
Build and Deploy a Python Web App to Amazon in 30 MinsJeff Hull
 
Dictionary Based Annotation at Scale with Spark by Sujit Pal
Dictionary Based Annotation at Scale with Spark by Sujit PalDictionary Based Annotation at Scale with Spark by Sujit Pal
Dictionary Based Annotation at Scale with Spark by Sujit PalSpark Summit
 
Microsoft Machine Learning Smackdown
Microsoft Machine Learning SmackdownMicrosoft Machine Learning Smackdown
Microsoft Machine Learning SmackdownLynn Langit
 
Demystifying Stream Processing with Apache Kafka
Demystifying Stream Processing with Apache KafkaDemystifying Stream Processing with Apache Kafka
Demystifying Stream Processing with Apache Kafkaconfluent
 
Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...
Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...
Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...Spark Summit
 
Real-time Big Data Processing with Storm
Real-time Big Data Processing with StormReal-time Big Data Processing with Storm
Real-time Big Data Processing with Stormviirya
 
Building Distributed Data Streaming System
Building Distributed Data Streaming SystemBuilding Distributed Data Streaming System
Building Distributed Data Streaming SystemAshish Tadose
 
A product-focused introduction to Machine Learning
A product-focused introduction to Machine LearningA product-focused introduction to Machine Learning
A product-focused introduction to Machine LearningSatpreet Singh
 
Data integration with Apache Kafka
Data integration with Apache KafkaData integration with Apache Kafka
Data integration with Apache Kafkaconfluent
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Cloudera, Inc.
 
Introduction To Streaming Data and Stream Processing with Apache Kafka
Introduction To Streaming Data and Stream Processing with Apache KafkaIntroduction To Streaming Data and Stream Processing with Apache Kafka
Introduction To Streaming Data and Stream Processing with Apache Kafkaconfluent
 
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Spark Summit
 
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...Spark Summit
 
Apache kafka-a distributed streaming platform
Apache kafka-a distributed streaming platformApache kafka-a distributed streaming platform
Apache kafka-a distributed streaming platformconfluent
 
Introduction to Machine Learning (case studies)
Introduction to Machine Learning (case studies)Introduction to Machine Learning (case studies)
Introduction to Machine Learning (case studies)Dmitry Efimov
 

Viewers also liked (20)

Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
 
검색어 대중도, 연결망 분석 - 21021899 김수빈
검색어 대중도, 연결망 분석 - 21021899 김수빈검색어 대중도, 연결망 분석 - 21021899 김수빈
검색어 대중도, 연결망 분석 - 21021899 김수빈
 
Transformations and actions a visual guide training
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide training
 
Using Spark Part Time
Using Spark Part TimeUsing Spark Part Time
Using Spark Part Time
 
Build and Deploy a Python Web App to Amazon in 30 Mins
Build and Deploy a Python Web App to Amazon in 30 MinsBuild and Deploy a Python Web App to Amazon in 30 Mins
Build and Deploy a Python Web App to Amazon in 30 Mins
 
Dictionary Based Annotation at Scale with Spark by Sujit Pal
Dictionary Based Annotation at Scale with Spark by Sujit PalDictionary Based Annotation at Scale with Spark by Sujit Pal
Dictionary Based Annotation at Scale with Spark by Sujit Pal
 
Microsoft Machine Learning Smackdown
Microsoft Machine Learning SmackdownMicrosoft Machine Learning Smackdown
Microsoft Machine Learning Smackdown
 
Real Time Machine Learning Visualization with Spark
Real Time Machine Learning Visualization with SparkReal Time Machine Learning Visualization with Spark
Real Time Machine Learning Visualization with Spark
 
Demystifying Stream Processing with Apache Kafka
Demystifying Stream Processing with Apache KafkaDemystifying Stream Processing with Apache Kafka
Demystifying Stream Processing with Apache Kafka
 
Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...
Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...
Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...
 
Real-time Big Data Processing with Storm
Real-time Big Data Processing with StormReal-time Big Data Processing with Storm
Real-time Big Data Processing with Storm
 
Building Distributed Data Streaming System
Building Distributed Data Streaming SystemBuilding Distributed Data Streaming System
Building Distributed Data Streaming System
 
A product-focused introduction to Machine Learning
A product-focused introduction to Machine LearningA product-focused introduction to Machine Learning
A product-focused introduction to Machine Learning
 
Data integration with Apache Kafka
Data integration with Apache KafkaData integration with Apache Kafka
Data integration with Apache Kafka
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
 
Introduction To Streaming Data and Stream Processing with Apache Kafka
Introduction To Streaming Data and Stream Processing with Apache KafkaIntroduction To Streaming Data and Stream Processing with Apache Kafka
Introduction To Streaming Data and Stream Processing with Apache Kafka
 
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
 
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
 
Apache kafka-a distributed streaming platform
Apache kafka-a distributed streaming platformApache kafka-a distributed streaming platform
Apache kafka-a distributed streaming platform
 
Introduction to Machine Learning (case studies)
Introduction to Machine Learning (case studies)Introduction to Machine Learning (case studies)
Introduction to Machine Learning (case studies)
 

Similar to Spark Streaming, Machine Learning and meetup.com streaming API.

Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with ScalaHimanshu Gupta
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study NotesRichard Kuo
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's comingDatabricks
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkDatabricks
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Databricks
 
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIsBig Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIsMatt Stubbs
 
Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Samir Bessalah
 
Toying with spark
Toying with sparkToying with spark
Toying with sparkRaymond Tay
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionLightbend
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesWalaa Hamdy Assy
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfWalmirCouto3
 
SparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on HadoopSparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on HadoopDataWorks Summit
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17spark-project
 

Similar to Spark Streaming, Machine Learning and meetup.com streaming API. (20)

20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with Scala
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's coming
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
 
Spark training-in-bangalore
Spark training-in-bangaloreSpark training-in-bangalore
Spark training-in-bangalore
 
So you think you can stream.pptx
So you think you can stream.pptxSo you think you can stream.pptx
So you think you can stream.pptx
 
Yahoo compares Storm and Spark
Yahoo compares Storm and SparkYahoo compares Storm and Spark
Yahoo compares Storm and Spark
 
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIsBig Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
 
Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013
 
Toying with spark
Toying with sparkToying with spark
Toying with spark
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdf
 
Scala+data
Scala+dataScala+data
Scala+data
 
SparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on HadoopSparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on Hadoop
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
 

Recently uploaded

SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 

Recently uploaded (20)

SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 

Spark Streaming, Machine Learning and meetup.com streaming API.

  • 1. Spark Streaming Data with Wings Never stop the movement
  • 2. Agenda • Spark overview • Spark Streaming overview • Processing streams from meetup.com • Little intro to stateful stream processing • All together
  • 3. There will be no pictures of cats!
  • 4. Lambda Architecture BUFFER BATCH SPEED SERVING Hourly/Daily Loads Every Second Near Real-time Views Real-time Decision Making
  • 5. Do you see any problems with Lambda Architecture?
  • 6. Spark Overview What if you could write programs operating with petabyte size datasets and large streams same way you operate Iterable collections?
  • 8. Resilient Distributed Dataset Resilient - resilience is addressed by tracking the log of operations performed on the dataset. Because of the side effects are eliminated, every lost partition can be recalculated in case of a loss. Distributed - the dataset is partitioned. We can specify partitioning scheme for every operation. Dataset - can be built from regular files, HDFS large files, Cassandra table, HBase table, etc.
  • 9. Obligatory word count lines = spark.textFile("hdfs://...") lines.flatMap{line: line.split(“ “)} .map({word: (word, 1)}) .reduceByKey(_+_)
  • 11. Spark is like Yoda and Hulk combined
  • 12. Discretized Stream New RDD every second https://spark.apache.org/docs/1.2.0/streaming-programming-guide.html
  • 13. DStream - still pretty You operate on stream just like on collections https://spark.apache.org/docs/1.2.0/streaming-programming-guide.html
  • 14. meetup.com Streams Let’s play with meetup.com streams: • Events • RSVPs
  • 15. RSVP Schema { "event" : { "event_id" : "220993343", "event_name" : "Paper Collage", "event_url" : "http://www.meetup.com/ELEOS-Art-School-Studio/events/220993343/", "time" : 1427551200000 }, "group" : {...}, "guests" : 0, "member" : { "member_id" : 120762942, "member_name" : "Esther", "photo" : "http://photos1.meetupstatic.com/photos/member/b/b/0/thumb_159962992.jpeg" }, ... }
  • 16. Event Schema { "description" : "<p>90 Minute walking tour with your dog!  Please arrive early.</p>n<p>Tour of Balboa Parks spookiest locations on ...>", "duration" : 5400000, "event_url" : "http://www.meetup.com/SanDiegoDogWalkers/events/220302036/", "group" : {...}, "id" : "220302036", "maybe_rsvp_count" : 0, "mtime" : 1425785739616, "name" : "After Dark Ghost Walking Tour in Balboa Park with your dog!", "payment_required" : "0", "rsvp_limit" : 20, "status" : "upcoming", "time" : 1426993200000, "utc_offset" : -25200000, "venue" : {....},
  • 17. Meetup Receiver Receiver is the way to pump data to spark streaming. In our case, we connect to meetup streaming api and send json to Spark. Code is a bit long, but you can explore it: https://github.com/actions/meetup-stream/blob/ master/src/main/scala/receiver/MeetupReceiver.scala
  • 18. Intro to Stateful Stream Processing def locaitonCounts(eventsStream: DStream[(Venue,Event)])= liveEvents
 .filter{case(venue,event)=>venue.country==“usa”} .map{case(venue,event)=>(venue.toCityState,1)} .updateStateByKey(countForLocaiton) .print def countForLocaiton(counts: Seq[Int], initCount: Option[Int]) =Some(initCount.getOrElse(0) + counts.sum)
  • 19. How it works again…? CityState Count San Francisco, CA 5 LA, CA 3 Portland, OR 2 CityState Count San Francisco,CA 1 Miami, FL 1 CityState Count San Francisco, CA 6 LA, CA 3 Portland, OR 2 Miami, FL 1 11 sec10 sec CityState Count San Francisco,CA 1 Miami, FL 1 Incoming Stream State Stream
  • 25. Initializing RSVP Stream and the Event Dataset val conf = new SparkConf() .setMaster("local[4]") .setAppName("MeetupExperiments") .set("spark.executor.memory", "1g") .set("spark.driver.memory", "1g") val ssc=new StreamingContext(conf, Seconds(1)) val rsvpStream = ssc.receiverStream( new MeetupReceiver(“http://stream.meetup.com/2/ rsvps”)).flatMap(parseRsvp) val eventsHistory = ssc.sparkContext.textFile("data/events/ events.json", 1).flatMap(parseEvent)
  • 27. Feature Extraction 10 most popular words in the description. def eventToVector(event: Event): Option[Vector]={ val wordsIterator = event.description.map(breakToWords).getOrElse(Iterator()) val topWords=popularWords(wordsIterator) if (topWords.size==10) Some(Vectors.sparse(dictionary.value.size,topWords)) else None } val eventVectors=eventsHistory.flatMap{event=>eventToVector(event)}
  • 28. Training based on existing dataset val eventClusters = KMeans.train(eventVectors, 10, 2) http://scikit-learn.org/0.11/auto_examples/cluster/plot_kmeans_digits.html
  • 29. Event History By ID val eventHistoryById=eventsHistory .map{event=>(event.id, event.description.getOrElse(""))} .reduceByKey{(first: String, second: String)=>first} (220302036,“…description1 …”)
 (220302037,“…description2 …”)
 (220302038,”…description3 …”)
  • 30. Streaming lookups Looking up the event description by eventId from rsvp. val rsvpEventInfo = membersByEventId.transform( rdd=>rdd.join(eventHistoryById) ) (eventId, (member, response), description) (220819928,((Member(Some(cecelia rogers),Some(162556712)),yes),"...") (221153676,((Member(Some(Carol),Some(183499291)),no),”...") …
  • 31. Streaming Clustering val memberEventInfo = rsvpEventInfo .flatMap{ case(eventId, ((member, response), event)) => { eventToVector(event).map{ eventVector=> val eventCluster=eventClusters.predict(eventVector) (eventCluster,(member, response)) } }}
  • 32. Clustering members def groupMembers(memberResponses: Seq[(Member, String)], initList: Option[Set[Member]]) = { val initialMemberList=initList.getOrElse(Set()) val newMemberList=(memberResponses : initialMemberList) { case((member, response), memberList) => if (response == "yes") memberList + member else memberList - member } if (newMemberList.size>0) Some(newMemberList) else None } val memberGroups = memberEventInfo.updateStateByKey(groupMembers)
  • 33. Recommendations val recommendations=memberEventInfo .join(memberGroups) .map{ case(cluster, ((member, memberResponse), members)) => (member.memberName, members-member) } (Some(Mike D) -> Set(Member(Some(Sioux),Some(85761302)), Member(Some(Aileen),Some(12579632)), Member(Some(Teri),Some(148306762))))