Fast Data Intelligence in the IoT
Real-time Data Analytics with Spark Streaming and MLlib
Bas Geerdink
#iottechday
ABOUT ME
• Chapter Lead in Analytics area at ING
• Academic background in Artificial Intelligence and Informatics
• Working in IT since 2004, previously as developer and software architect
• Spark Certified Developer
• Twitter: @bgeerdink
• Github: geerdink
WHAT’S NEW IN THE IOT?
• More data
– Streaming data from multiple sources
• New use cases
– Combining data streams
• New technology
– Fast processing and scalability
[Diagram labels: Front End, Back End, Data]
PATTERNS & PRACTICES FOR FAST DATA ANALYTICS
• Lambda Architecture
• Reactive Principles
• Pipes & filters
• Event Sourcing
• REST, HATEOAS
• …
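Event Sourcing, from the list above, can be illustrated with a small sketch: state is never stored directly but rebuilt by replaying an append-only event log. The event types and the score weights (+1 for a search, +5 for a purchase) are hypothetical and only serve the illustration; they are not taken from the talk.

```scala
object EventSourcingSketch {
  // Hypothetical event types, invented for this illustration
  sealed trait Event
  final case class Searched(user: String, category: String) extends Event
  final case class Purchased(user: String, category: String) extends Event

  // Derived state: a score per (user, category) pair
  type State = Map[(String, String), Int]

  // Pure function that applies one event to the current state
  def applyEvent(state: State, event: Event): State = event match {
    case Searched(u, c)  => state.updated((u, c), state.getOrElse((u, c), 0) + 1) // assumed weight
    case Purchased(u, c) => state.updated((u, c), state.getOrElse((u, c), 0) + 5) // assumed weight
  }

  // Rebuild the state from scratch by folding over the event log
  def replay(log: Seq[Event]): State =
    log.foldLeft[State](Map.empty)(applyEvent)
}
```

Because the log is the source of truth, a changed scoring rule only requires replaying the same events with a new applyEvent.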
LAMBDA ARCHITECTURE
Source: Nathan Marz & James Warren (2015)
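A minimal sketch of the Lambda Architecture idea, assuming a simple count-per-key use case (not from the talk): the batch layer recomputes a view from the full master dataset, the speed layer keeps increments for events that arrived since the last batch run, and a query merges both views. All names here are hypothetical.

```scala
object LambdaSketch {
  // Batch layer: recompute the complete view from the immutable master dataset
  def batchLayer(masterData: Seq[String]): Map[String, Long] =
    masterData.groupBy(identity).map { case (k, v) => k -> v.size.toLong }

  // Speed layer: incrementally update a small real-time view per incoming event
  def speedLayer(view: Map[String, Long], event: String): Map[String, Long] =
    view.updated(event, view.getOrElse(event, 0L) + 1L)

  // Serving layer: answer a query by merging the batch view and the speed view
  def query(batchView: Map[String, Long], speedView: Map[String, Long], key: String): Long =
    batchView.getOrElse(key, 0L) + speedView.getOrElse(key, 0L)
}
```

When the next batch run has absorbed the recent events, the speed view can be discarded and rebuilt from empty.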
REACTIVE PRINCIPLES
Source: Reactive Manifesto (2014)
USE CASE
[Diagram labels: nest, WWW]
FAST DATA ARCHITECTURE
[Diagram labels: Social Media, Search History, GPS Data, …; Events; Message Broker; Streaming (Business Logic); Batch (Machine Learning); Processing; Database; Visualize; API; App, Web, …; Products; Users]
A SHIFT IN TECHNOLOGY PARADIGMS
Disk → In-memory
Database → Stream
Objects → Functions
Centralized → Distributed
Shared Memory/CPU/Disk → Shared Nothing
TOOLS FOR THE JOB
• Apache Kafka
• Apache Cassandra
• Apache Spark
• Apache Zeppelin
• Akka
• Scala
FAST DATA ARCHITECTURE
[Diagram labels: Social Media, Search History, GPS Data; Events; Message Broker; Streaming (Business Logic); Batch (Machine Learning); Processing; Database; Visualize; API; App, Web, …; Products; Users]
KAFKA
• Distributed Message broker
• Built for speed, scalability, fault-tolerance
• Works with topics, producers, consumers
• Created at LinkedIn, now open source
• Written in Scala
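The topic/producer/consumer model can be pictured with a toy in-memory sketch. This is not Kafka's API; it only assumes the basic idea that a topic is an append-only log and that each consumer tracks its own read offset. The class names are hypothetical.

```scala
object TopicSketch {
  import scala.collection.mutable.ArrayBuffer

  // A topic modelled as an append-only log; producers append, nothing is removed on read
  final class Topic[A] {
    private val log = ArrayBuffer.empty[A]
    // Append a message and return the offset it was stored at
    def append(msg: A): Int = { log += msg; log.size - 1 }
    // Read every message from the given offset onwards
    def read(fromOffset: Int): Seq[A] = log.drop(fromOffset).toList
  }

  // Each consumer keeps its own offset, so consumers read independently of each other
  final class Consumer[A](topic: Topic[A]) {
    private var offset = 0
    def poll(): Seq[A] = {
      val msgs = topic.read(offset)
      offset += msgs.size
      msgs
    }
  }
}
```

A slow consumer does not block a fast one here: both read the same log at their own pace.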
CODE: KAFKA
• build.sbt:
"org.apache.kafka" %% "kafka" % kafkaVersion
• Application.conf:
kafka { producer … consumer }
• KafkaConnection.scala:
def producer, def consumer
• KafkaProducerActor.scala:
producer.send(msg)
• KafkaConsumerActor.scala:
val kafkaStream =
connection.createMessageStreams(Map(topic -> 1))(topic)(0)
CASSANDRA
• NoSQL database
• Built for speed, scalability, fault-tolerance
• Works with CQL, consistency levels, replication factors
• Created at Facebook, now open source
• Written in Java
CODE: CASSANDRA
CREATE TABLE products (
  user_name text, product_category text, product_name text,
  score int, insertion_time timeuuid,
  PRIMARY KEY (user_name, product_category, product_name));

import com.datastax.driver.core.{Cluster, QueryOptions}

// connect to the cluster and select the keyspace
val cluster = new Cluster.Builder()
  .addContactPoints(uri.hosts.toArray: _*)
  .withPort(uri.port)
  .withQueryOptions(new QueryOptions().setConsistencyLevel(defaultConsistencyLevel))
  .build
val session = cluster.connect
session.execute(s"USE ${uri.keyspace}")

// insert one score; values are interpolated straight into the CQL string,
// so they must come from a trusted source
def insertScore(productScore: ProductScore): Unit = {
  val query = s"""INSERT INTO products (user_name, product_category, product_name,
    score, insertion_time) VALUES ('${productScore.userName}',
    '${productScore.productCategory}', '${productScore.productName}',
    ${productScore.score}, now())"""
  session.execute(query)
}
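One property of the table above is worth a sketch: a CQL INSERT without IF NOT EXISTS behaves as an upsert, so writing a row whose primary key (user_name, product_category, product_name) already exists overwrites the stored score instead of adding a second row. The in-memory model below only mimics that behaviour; the names are hypothetical and it does not talk to Cassandra.

```scala
object UpsertSketch {
  // Mirrors the table's primary key columns
  final case class Key(userName: String, productCategory: String, productName: String)

  // Model the table as a map from primary key to score; inserting an existing key overwrites it
  def insert(table: Map[Key, Int], key: Key, score: Int): Map[Key, Int] =
    table.updated(key, score)
}
```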
SPARK
• Fast, parallel, in-memory, general-purpose data processing engine
• Winner of Daytona Gray Sort benchmark 2014
• Runs on Hadoop YARN, Mesos, cloud, or standalone
• Created at AMPLab UC Berkeley, now open source
• Written in Scala
CODE: SPARK BASICS
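val l = List(1, 2, 3, 4, 5)
val p = sc.parallelize(l) // create an RDD from a local collection (sc is the SparkContext)
p.count() // action: returns 5
def fun1(x: Int): Int = x * 2
p.map(fun1).collect() // map is a transformation; collect is the action that runs it
p.map(i => i * 2).filter(_ < 6).collect() // same with a lambda plus a filter: Array(2, 4)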
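The split between transformations and actions means nothing is computed until an action asks for a result. The sketch below mimics that with a plain Scala Iterator instead of an RDD (an analogy, not Spark itself): building the pipeline runs no code, and only forcing it with toList does. The call counter is a hypothetical helper added to make the laziness visible.

```scala
object LazyPipelineSketch {
  // Counts how often the mapping function actually ran
  var mapCalls = 0

  def double(x: Int): Int = { mapCalls += 1; x * 2 }

  // Build the pipeline; like Spark transformations, map and filter on an Iterator are lazy
  def pipeline(data: List[Int]): Iterator[Int] =
    data.iterator.map(double).filter(_ < 6)
}
```

Calling pipeline(List(1, 2, 3, 4, 5)) leaves mapCalls at 0; calling toList on the result plays the role of collect() and yields List(2, 4).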
SPARK
SPARK STREAMING
CODE: SPARK STREAMING
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("fast-data-search-history").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(2)) // batch interval = 2 sec
val kafkaParams = Map[String, String]("metadata.broker.list" -> "localhost:9092")
val kafkaDirectStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("search_history"))
kafkaDirectStream
  .map(record => ProductScoreHelper.createProductScore(record._2)) // each record is a (key, value) pair; use the value
  .filter(_.productCategory != "Sneakers")
  .foreachRDD(rdd => rdd.foreach(CassandraHelper.insertScore))
ssc.start() // the StreamingContext must be started explicitly before it receives data
ssc.awaitTermination() // block until the streaming job is stopped or fails
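The batch interval in the code above means the stream is processed as a sequence of small batches, one per interval. A rough sketch of that micro-batch model in plain Scala, assuming each event carries a timestamp in milliseconds (the Event type and function names are hypothetical; Spark itself batches by arrival time rather than by a field in the data):

```scala
object MicroBatchSketch {
  final case class Event(timestampMs: Long, category: String)

  // Assign each event to a batch by dividing its timestamp by the batch interval
  def toBatches(events: Seq[Event], intervalMs: Long): Seq[(Long, Seq[Event])] =
    events.groupBy(_.timestampMs / intervalMs).toSeq.sortBy(_._1)

  // Per-batch logic, mirroring the filter step in the streaming job
  def process(batch: Seq[Event]): Seq[Event] =
    batch.filter(_.category != "Sneakers")
}
```

With a 2000 ms interval, events at 0 and 1500 ms fall into batch 0 and an event at 2100 ms into batch 1; the same process function then runs once per batch.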
CODE: SPARK MLLIB
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.DenseVector
import org.apache.spark.mllib.regression.LinearRegressionModel

// initialize Spark MLlib
val conf = new SparkConf().setAppName("fast-data-social-media").setMaster("local[2]")
val sc = new SparkContext(conf)
// load machine learning model from disk
val model = LinearRegressionModel.load(sc, "/home/social_media.model")
def processEvent(sme: SocialMediaEvent): Unit = {
  // feature vector extraction
  // note: DenseVector takes an Array[Double], so userName and message must
  // already be numeric encodings here; the slide does not show that step
  val vector = new DenseVector(Array(sme.userName, sme.message))
  // get a new prediction for the top user category
  val value = model.predict(vector)
  // store the predicted category value
  val user = new User(sme.userName, UserHelper.getCategory(value))
  CassandraHelper.updateUserCategory(user)
}
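What model.predict does for a linear regression model can be sketched without Spark: the prediction is the dot product of the learned weights and the feature vector, plus the intercept. The function below is a plain-Scala illustration under that assumption, with hypothetical names; it is not MLlib's implementation.

```scala
object LinearPredictSketch {
  // prediction = sum(weights(i) * features(i)) + intercept
  def predict(weights: Array[Double], intercept: Double, features: Array[Double]): Double = {
    require(weights.length == features.length, "feature vector must match the model dimension")
    weights.zip(features).map { case (w, x) => w * x }.sum + intercept
  }
}
```

This also shows why the feature vector has to be numeric and have the same length as the model's weight vector.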
THREE KEY TAKEAWAYS
• The IoT comes with a new architecture: reactive and scalable are the new normal
• Be aware of the paradigm shift: in-memory, streaming, distributed, shared nothing
• Open source tooling such as Kafka, Cassandra, and Spark can help to process fast data flows
Thank You!
“Please rate my talk in the official IoT Tech Day app”
@bgeerdink
#iottechday

Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfThe Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
 

Fast Data Intelligence in the IoT - real-time data analytics with Spark

  • 1. Fast Data Intelligence in the IoT Real-time Data Analytics with Spark Streaming and MLlib Bas Geerdink #iottechday
  • 2. ABOUT ME • Chapter Lead in Analytics area at ING • Academic background in Artificial Intelligence and Informatics • Working in IT since 2004, previously as developer and software architect • Spark Certified Developer • Twitter: @bgeerdink • Github: geerdink
  • 3.
  • 4. WHAT’S NEW IN THE IOT? • More data – Streaming data from multiple sources • New use cases – Combining data streams • New technology – Fast processing and scalability Front End Back End Data
  • 5. PATTERNS & PRACTICES FOR FAST DATA ANALYTICS • Lambda Architecture • Reactive Principles • Pipes & filters • Event Sourcing • REST, HATEOAS • …
  • 6. LAMBDA ARCHITECTURE Source: Nathan Marz & James Warren (2015)
  • 9. FAST DATA ARCHITECTURE Products Users API App Web … Batch (Machine Learning) Social Media Search History GPS Data … Message Broker Events Streaming (Business Logic) VisualizeProcessing Database
  • 10. A SHIFT IN TECHNOLOGY PARADIGMS Disk → In-memory; Database → Stream; Objects → Functions; Centralized → Distributed; Shared Memory/CPU/Disk → Shared Nothing
  • 11. TOOLS FOR THE JOB • Apache Kafka • Apache Cassandra • Apache Spark • Apache Zeppelin • Akka • Scala
  • 12. FAST DATA ARCHITECTURE Products Users API App Web … Batch Machine Learning Social Media Search History GPS Data GPS Data Message Broker Streaming Business Logic Events VisualizeProcessing Database
  • 13. KAFKA • Distributed Message broker • Built for speed, scalability, fault-tolerance • Works with topics, producers, consumers • Created at LinkedIn, now open source • Written in Scala
  • 14. CODE: KAFKA • build.sbt: "org.apache.kafka" %% "kafka" % kafkaVersion • Application.conf: kafka { producer … consumer } • KafkaConnection.scala: def producer, def consumer • KafkaProducerActor.scala: producer.send(msg) • KafkaConsumerActor.scala: val kafkaStream = connection.createMessageStreams(Map(topic -> 1))(topic)(0)
  • 15. CASSANDRA • NoSQL database • Built for speed, scalability, fault-tolerance • Works with CQL, consistency levels, replication factors • Created at Facebook, now open source • Written in Java
  • 16. CODE: CASSANDRA CREATE TABLE products (user_name text, product_category text, product_name text, score int, insertion_time timeuuid, PRIMARY KEY (user_name, product_category, product_name)); val cluster = new Cluster.Builder(). addContactPoints(uri.hosts.toArray: _*). withPort(uri.port). withQueryOptions(new QueryOptions().setConsistencyLevel(defaultConsistencyLevel)).build val session = cluster.connect session.execute(s"USE ${uri.keyspace}") def insertScore(productScore: ProductScore): Unit = { val query = s"INSERT INTO products (user_name, product_category, product_name, score, insertion_time) VALUES ('${productScore.userName}', '${productScore.productCategory}', '${productScore.productName}', ${productScore.score}, now())" session.execute(query) }
  • 17. SPARK • Fast, parallel, in-memory, general-purpose data processing engine • Winner of Daytona Gray Sort benchmark 2014 • Runs on Hadoop YARN, Mesos, cloud, or standalone • Created at AMPLab UC Berkeley, now open source • Written in Scala
  • 18. CODE: SPARK BASICS val l = List(1,2,3,4,5) val p = sc.parallelize(l) // create RDD p.count() // action def fun1(x: Int): Int = x * 2 p.map(fun1).collect() // transformation p.map(i => i * 2).filter(_ < 6).collect() // lambda
  • 19. SPARK
  • 21. CODE: SPARK STREAMING val conf = new SparkConf().setAppName("fast-data-search-history").setMaster("local[2]") val ssc = new StreamingContext(conf, Seconds(2)) // batch interval = 2 sec val kafkaParams = Map[String, String]("metadata.broker.list" -> "localhost:9092") val kafkaDirectStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, Set("search_history")) kafkaDirectStream .map(rdd => ProductScoreHelper.createProductScore(rdd._2)) .filter(_.productCategory != "Sneakers") .foreachRDD(rdd => rdd.foreach(CassandraHelper.insertScore)) ssc.start() // it's necessary to explicitly tell the StreamingContext to start receiving data ssc.awaitTermination() // wait for the job to finish
  • 22. CODE: SPARK MLLIB // initialize Spark MLlib val conf = new SparkConf().setAppName("fast-data-social-media").setMaster("local[2]") val sc = new SparkContext(conf) // load machine learning model from disk val model = LinearRegressionModel.load(sc, "/home/social_media.model") def processEvent(sme: SocialMediaEvent): Unit = { // feature vector extraction val vector = new DenseVector(Array(sme.userName, sme.message)) // get a new prediction for the top user category val value = model.predict(vector) // store the predicted category value val user = new User(sme.userName, UserHelper.getCategory(value)) CassandraHelper.updateUserCategory(user) }
  • 23. THREE KEY TAKEAWAYS • The IoT comes with new architecture: reactive and scalable are the new normal • Be aware of the paradigm shift: in-memory, streaming, distributed, shared nothing • Open source tooling such as Kafka, Cassandra, and Spark can help to process the fast data flows
  • 24. Thank You! “please rate my talk in the official IoT Tech Day app” @bgeerdink #iottechday
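
The map/filter steps of the Spark Streaming slide (slide 21) can be tried out without a cluster. The following is a minimal sketch in plain Scala collections, not the talk's actual code: the comma-separated message format and the `parse` and `process` helpers are assumptions made for illustration, and `ProductScore` only mirrors the columns of the `products` table on the Cassandra slide.

```scala
import scala.util.Try

// Mirrors the columns of the products table (insertion_time is left to the database).
final case class ProductScore(
    userName: String,
    productCategory: String,
    productName: String,
    score: Int
)

object SearchHistoryPipeline {

  // Assumed wire format (hypothetical): "user,category,product,score".
  // Malformed messages yield None instead of throwing.
  def parse(line: String): Option[ProductScore] =
    line.split(",").map(_.trim) match {
      case Array(user, category, product, score) =>
        Try(score.toInt).toOption.map(ProductScore(user, category, product, _))
      case _ => None
    }

  // Same shape as the streaming job: map each message to a score,
  // then drop the category that the slide filters out.
  def process(batch: Seq[String]): Seq[ProductScore] =
    batch.flatMap(parse).filter(_.productCategory != "Sneakers")
}
```

In a real job the same `parse` function could plausibly be passed to the stream's `map`, with the result written to Cassandra instead of being returned.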

Editor's Notes

  1. In this session, streaming data from IoT sources (sensors) will be pulled into an analytics engine to make predictions about the future. We use Spark as the technology of choice, since this framework is well suited for combining streaming data with machine learning techniques. Join this session to get an overview of a (nearly) full-blown analytics application, and to get inspired to set up your own predictive API for the IoT!
  2. This is a dream for engineers…
  3. Who is actually working on an IoT application in production right now? Compare with a conference on Content Management Systems, ERP, … Big data vs. fast data: the 3 Vs (Volume, Variety, Velocity). Storage is not an issue anymore… Hadoop is 10 years old! Speed and responsiveness are the new challenges. Same as with big data: you have to do something with the data. Machine learning works best with lots of data, e.g. historical events.
  4. Reusable solutions to common problems. Building blocks, guidelines, blueprints of architecture. I’m going to say a little about the first two.
  5. 1. All data entering the system is dispatched to both the batch layer and the speed layer for processing. 2. The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing the batch views. 3. The serving layer indexes the batch views so that they can be queried in a low-latency, ad-hoc way. 4. The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only. 5. Any incoming query can be answered by merging results from batch views and real-time views.
  6. Elastic = scalable on demand, up & down. The system stays responsive under varying workload. Resilient = the system stays responsive in the face of failure. Responsive = the system should respond in a timely manner if at all possible, even if parts of it are failing. Deal with problems quickly. Message-driven = rely on asynchronous message passing to ensure loose coupling, isolation, non-blocking behavior, and back-pressure. Back-pressure = the ability to communicate that a component is under stress. This feedback is used by upstream components to reduce the load, thereby ensuring the system as a whole doesn’t fail.
  7. We have this nice guy. He is a little strange, because he has a social network. Even worse: he is on the internet, buying stuff, searching for items. He is even connected to the IoT: car, house, fridge, phone, etc. Scary! Now, meet an evil guy. He wants to take advantage of all this nice data! He sets up a company that combines all these data flows, and does something very clever: he shows Mr. Nice Guy ads in banners. He wants to give him an offer he can’t refuse! Obviously nobody in the audience would click on such advertisement spam, but please consider that there are people on this planet who might. So, I am a developer in this company; how should I build my system? It has to be scalable: we start small, but what if this becomes a success? I’ve heard something about fast data and the lambda architecture, let’s give that a try…
  8. Batch: Based on historical behavior and user profile, predict (recommend) the product category that a user is interested in. Algorithm on daily/hourly basis. Speed: Based on current, actual data, score the products of a category. Store events for API: select data from tables, define order/priority of products within a category. What do we need to set up such a system nowadays?
  9. Parallelism. Fault-tolerance (stateless, immutable).
  10. All open source. The reason: not because it’s free, but because we want to contribute to the community. All running on commodity hardware and in the cloud. I will discuss the top three…
  11. Batch: Based on historical behavior and user profile, predict the product category that a user is interested in. Algorithm on daily/hourly basis. Speed: Based on current, actual data, score the products of a category. Store events for API: select data from tables, define order/priority of products within a category. What do we need to set up such a system nowadays?
  12. Allows SOA and microservices architecture, but it’s not an ESB (too little functionality). Elastic: one instance can serve a large organization. One broker can handle 100s of megabytes per second from 1000s of clients. Runs on Zookeeper: a high-performance coordination service. Publish-subscribe mechanism. Too fast? (Precision in real time can lead to misses.)
  13. Consistency level: tradeoff between speed and data quality (1 = fast, may not read the last written value; quorum = strict majority w.r.t. replication factor; all = slow, guaranteed reads). CAP theorem: it’s impossible to provide all three guarantees of Consistency (= quality; all nodes see the same data at the same time), Availability, and Partition Tolerance. ACID vs BASE consistency model: relational/‘safe’ vs scalable/resilient/‘eventually consistent’. Commercialized by DataStax.
  14. Spark = data processing framework with built-in parallel distribution and in-memory computing. Biggest ‘big data’ project at Apache. Daytona Gray Sort: 2009: Hadoop, 100 TB in 173 minutes, 3452 nodes x 4 cores; 2013: Hadoop, about 100 TB in 72 minutes, 2100 nodes x 8 cores; 2014: Spark, 100 TB in 23 minutes, 207 nodes x 32 cores. Commercialized by Databricks, Cloudera, Hortonworks, Amazon, IBM, … StorageLevel can be chosen: memory and/or disk, optionally serialized. Number and size of partitions are configurable.
  15. RDD = resilient distributed dataset. Transformations, actions. Accumulators, broadcast variables.
  16. History: General batch processing: MapReduce. Specialized systems: Dremel, Drill, Impala, Storm, S4, … Unified platform: Spark. Spark SQL = query structured data. GraphX = for graph structures, e.g. hyperlinks, communities, …
  17. RDD = Resilient Distributed Dataset. For true streaming: Apache Flink.
  18. Also show CassandraWriterActor
  19. Show Zeppelin. ML variations: classification, regression, clustering
  20. Fourth one: maybe don’t use social media??
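
The last step of the Lambda Architecture note (answering a query by merging batch views and real-time views) can be sketched as follows. This is an illustration under assumptions, not code from the talk: the `View` type, the `merge` and `topProducts` helpers, and the choice to sum scores are all hypothetical. Summing is only valid for additive aggregates such as counts or accumulated scores.

```scala
object LambdaQuery {

  // Hypothetical view shape: product name -> accumulated score.
  type View = Map[String, Long]

  // Merge the precomputed batch view with the speed layer's view of recent data.
  // Keys missing from one side contribute 0.
  def merge(batchView: View, realtimeView: View): View =
    (batchView.keySet ++ realtimeView.keySet).iterator.map { key =>
      key -> (batchView.getOrElse(key, 0L) + realtimeView.getOrElse(key, 0L))
    }.toMap

  // Answer a query from the merged view: the n highest-scoring products,
  // with ties broken alphabetically so the result is deterministic.
  def topProducts(batchView: View, realtimeView: View, n: Int): Seq[(String, Long)] =
    merge(batchView, realtimeView).toSeq
      .sortBy { case (name, score) => (-score, name) }
      .take(n)
}
```

Once the batch layer has recomputed its views over the newest data, the corresponding entries can be dropped from the real-time view, which keeps the speed layer small.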