1. Fast Data Intelligence in the IoT
Real-time Data Analytics with Spark Streaming and MLlib
Bas Geerdink
#iottechday
2. ABOUT ME
• Chapter Lead in Analytics area at ING
• Academic background in Artificial Intelligence and Informatics
• Working in IT since 2004, previously as developer and software architect
• Spark Certified Developer
• Twitter: @bgeerdink
• Github: geerdink
4. WHAT’S NEW IN THE IOT?
• More data
– Streaming data from multiple sources
• New use cases
– Combining data streams
• New technology
– Fast processing and scalability
(diagram: Front End → Back End → Data)
5. PATTERNS & PRACTICES FOR FAST DATA ANALYTICS
• Lambda Architecture
• Reactive Principles
• Pipes & filters
• Event Sourcing
• REST, HATEOAS
• …
13. KAFKA
• Distributed message broker
• Built for speed, scalability, fault-tolerance
• Works with topics, producers, consumers
• Created at LinkedIn, now open source
• Written in Scala
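As an aside, here is a minimal producer sketch in Scala (my illustration, not from the talk; the broker address is an assumption, the topic name is borrowed from the streaming code later on) showing how producers and topics fit together:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092") // assumed local broker
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
// publish one event to the "search_history" topic; subscribed consumers will pick it up
producer.send(new ProducerRecord[String, String]("search_history", "user42", "sneakers"))
producer.close()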
15. CASSANDRA
• NoSQL database
• Built for speed, scalability, fault-tolerance
• Works with CQL, consistency levels, replication factors
• Created at Facebook, now open source
• Written in Java
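A minimal sketch of talking to Cassandra from Scala with the DataStax Java driver (3.x API); the keyspace, table, and replication settings are illustrative assumptions:

import com.datastax.driver.core.Cluster

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect()
// CQL looks like SQL; the replication factor is set per keyspace
session.execute("CREATE KEYSPACE IF NOT EXISTS shop WITH replication = " +
  "{'class': 'SimpleStrategy', 'replication_factor': 3}")
session.execute("CREATE TABLE IF NOT EXISTS shop.product_scores " +
  "(category text, product text, score double, PRIMARY KEY (category, product))")
session.execute("INSERT INTO shop.product_scores (category, product, score) " +
  "VALUES ('Sneakers', 'AirMax', 0.9)")
cluster.close()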
17. SPARK
• Fast, parallel, in-memory, general-purpose data processing engine
• Winner of Daytona Gray Sort benchmark 2014
• Runs on Hadoop YARN, Mesos, cloud, or standalone
• Created at AMPLab UC Berkeley, now open source
• Written in Scala
18. CODE: SPARK BASICS
val l = List(1,2,3,4,5)
val p = sc.parallelize(l) // create RDD
p.count() // action
def fun1(x: Int): Int = x * 2
p.map(fun1).collect() // transformation
p.map(i => i * 2).filter(_ < 6).collect() // lambda
21. CODE: SPARK STREAMING
val conf = new SparkConf().setAppName("fast-data-search-history").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(2)) // batch interval = 2 sec
val kafkaParams = Map[String, String]("metadata.broker.list" -> "localhost:9092")
val kafkaDirectStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("search_history"))
kafkaDirectStream
  .map(record => ProductScoreHelper.createProductScore(record._2)) // record = (key, message)
  .filter(_.productCategory != "Sneakers")
  .foreachRDD(rdd => rdd.foreach(CassandraHelper.insertScore))
ssc.start() // explicitly tell the StreamingContext to start receiving data
ssc.awaitTermination() // wait for the job to finish
22. CODE: SPARK MLLIB
// initialize Spark MLlib
val conf = new SparkConf().setAppName("fast-data-social-media").setMaster("local[2]")
val sc = new SparkContext(conf)
// load machine learning model from disk
val model = LinearRegressionModel.load(sc, "/home/social_media.model")
def processEvent(sme: SocialMediaEvent): Unit = {
  // feature vector extraction: DenseVector holds doubles, so the string fields
  // must first be mapped to numeric features (hashing is one simple option)
  val vector = new DenseVector(Array(sme.userName.hashCode.toDouble, sme.message.hashCode.toDouble))
  // get a new prediction for the top user category
  val value = model.predict(vector)
  // store the predicted category value
  val user = new User(sme.userName, UserHelper.getCategory(value))
  CassandraHelper.updateUserCategory(user)
}
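For context, a hedged sketch of how a model like the one loaded above could be trained and saved in a batch job (the CSV path and feature layout are assumptions, not from the talk):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// each line: label followed by comma-separated numeric features
val trainingData = sc.textFile("/home/social_media_history.csv").map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts.head, Vectors.dense(parts.tail))
}.cache()

val trainedModel = LinearRegressionWithSGD.train(trainingData, 100) // 100 iterations of SGD
trainedModel.save(sc, "/home/social_media.model")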
23. THREE KEY TAKEAWAYS
• The IoT comes with new architecture: reactive and scalable are the new normal
• Be aware of the paradigm shift: in-memory, streaming, distributed, shared nothing
• Open source tooling such as Kafka, Cassandra, and Spark can help to process the fast data flows
In this session, streaming data from IoT sources (sensors) will be pulled into an analytics engine to make predictions about the future. We use Spark as the technology of choice, since this framework is well suited for combining streaming data with machine learning techniques. Join this session to get an overview of a (nearly) full-blown analytics application, and to get inspired to set up your own predictive API for the IoT!
This is a dream for engineers…
Who here is actually working on an IoT application in production?
Compare this to a conference about Content Management Systems, ERP, …
Big data vs Fast data:
The 3 Vs: Volume, Variety, Velocity
Storage is not an issue anymore… Hadoop is 10 years old! Speed and responsiveness are the new challenges.
Same as with big data: you have to do something with the data. Machine learning = best with lots of data, e.g. historical events
Reusable solutions to common problems
Building blocks, guidelines, blueprints of architecture.
I’m going to say a little about the first two.
1. All data entering the system is dispatched to both the batch layer and the speed layer for processing.
2. The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing the batch views.
3. The serving layer indexes the batch views so that they can be queried in a low-latency, ad-hoc way.
4. The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
5. Any incoming query can be answered by merging results from batch views and real-time views (see the sketch below).
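A minimal sketch of point 5 in Scala (my illustration; the views and counts are made up): the batch view is complete but hours old, the real-time view covers only recent events, and a query merges the two.

val batchView = Map("Sneakers" -> 120L, "Phones" -> 80L) // pre-computed, complete, stale
val speedView = Map("Sneakers" -> 3L) // recent events only

def query(category: String): Long =
  batchView.getOrElse(category, 0L) + speedView.getOrElse(category, 0L)

query("Sneakers") // 123: batch result plus recent events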
Elastic = Scalable on demand, up & down. System stays responsive under varying workload.
Resilient = system stays responsive in face of failure
Responsive = system should respond in a timely manner if at all possible, even if (parts) are failing. Deal with problems quickly.
Message-driven = rely on asynchronous message passing to ensure loose coupling, isolation, non-blocking behavior, back-pressure.
Back-pressure = the ability to communicate that a component is under stress. This feedback is used by upstream components to reduce the load, thereby ensuring the system as a whole doesn’t fail.
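An illustrative back-pressure sketch in Scala (not from the talk), using a bounded queue: when the consumer falls behind, the queue fills up and put() blocks, which slows the producer down instead of letting the system fail.

import java.util.concurrent.ArrayBlockingQueue

val queue = new ArrayBlockingQueue[Int](10) // capacity bounds how far ahead the producer may run

val producer = new Thread(new Runnable {
  def run(): Unit = for (i <- 1 to 100) queue.put(i) // put() blocks when the queue is full
})
val consumer = new Thread(new Runnable {
  def run(): Unit = for (_ <- 1 to 100) { queue.take(); Thread.sleep(10) } // deliberately slow
})
producer.start(); consumer.start()
producer.join(); consumer.join()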
We have this nice guy. He is a little strange, because he has a social network. Even worse: he is on the internet, buying stuff, searching for items. He is even connected to the IoT: car, house, fridge, phone, etc. Scary!
Now, meet an evil guy. He wants to take advantage of all this nice data! He sets up a company that combines all these data flows and does something very clever: he shows mr. nice guy ads in banners. He wants to give him an offer he can’t refuse! Obviously no one in the audience would click on such advertisement spam, but please consider that there are people on this planet who might.
So, I am a developer in this company; how should I build my system? It has to be scalable: we start small, but what if this becomes a success? I’ve heard something about fast data and the lambda architecture, let’s give that a try…
Batch:
Based on historical behavior and user profile, predict (recommend) the product category that a user is interested in.
Algorithm on daily/hourly basis
Speed:
Based on current, actual data, score the products of a category.
Store events for
API:
- Select data from tables, define order/priority of products within a category.
What do we need to set up such a system nowadays?
All open source, reason: not because it’s free, but because we want to contribute to the community.
All running on commodity hardware and cloud.
I will discuss the top three…
Allows SOA and Microservices architecture, but it’s not an ESB (too little functionality)
Elastic: 1 instance can serve a large organization. One broker can handle hundreds of megabytes per second from thousands of clients.
Runs on ZooKeeper: a high-performance coordination service
Publish-subscribe mechanism
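A companion sketch to the producer example earlier: a consumer subscribing to the same (assumed) topic; poll(long) matches the older client versions of this era.

import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "fast-data-demo") // consumers in one group share the partitions
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("search_history"))
while (true) {
  val records = consumer.poll(100) // wait up to 100 ms for new records
  for (record <- records.asScala) println(s"${record.key}: ${record.value}")
}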
Too fast? (Reads at low consistency may miss the last written value)
Consistency level: tradeoff between speed and data quality
(1 = fast, may not read last written value, quorum = strict majority w.r.t. replication factor, all = slow, guaranteed reads)
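For illustration, choosing a consistency level per statement with the DataStax 3.x driver (keyspace and table assumed from the earlier sketch):

import com.datastax.driver.core.{Cluster, ConsistencyLevel, SimpleStatement}

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect()
val stmt = new SimpleStatement("SELECT * FROM shop.product_scores WHERE category = 'Sneakers'")
stmt.setConsistencyLevel(ConsistencyLevel.QUORUM) // strict majority of replicas must answer
session.execute(stmt)
cluster.close()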
CAP theorem: it’s impossible to provide all three guarantees of Consistency (= quality; all nodes see the same data at the same time), Availability, Partition Tolerance
ACID vs BASE consistency model: relational/’safe’ vs scalable/resilient/’eventually consistent’
Commercialized by Datastax
Spark = data processing framework
With built-in parallel distribution, in-memory computing.
Biggest ‘big data’ project at Apache
Daytona Sort:
2009: Hadoop, 100 TB in 173 minutes, 3452 nodes x 4 cores
2013: Hadoop, 102.5 TB in 72 minutes, 2100 nodes x 24 cores
2014: Spark, 100 TB in 23 minutes, 206 nodes x 32 cores
Commercialized by Databricks, Cloudera, Hortonworks, Amazon, IBM, …
StorageLevel can be chosen: memory and/or disk, optionally serialized
Number and size of partitions is configurable.
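Both knobs in one short sketch (illustrative values):

import org.apache.spark.storage.StorageLevel

val data = sc.parallelize(1 to 1000000, 8) // 8 partitions
val cached = data.persist(StorageLevel.MEMORY_AND_DISK_SER) // spill to disk, keep serialized
cached.count() // materializes the cache
val wider = cached.repartition(16) // reshuffle into more partitions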
History:
General batch processing: MapReduce
Specialized systems: Dremel, Drill, Impala, Storm, S4, …
Unified Platform: Spark
Spark SQL = query structured data
GraphX = for graph structures, e.g. hyperlinks, communities, …
RDD = Resilient Distributed Dataset
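A quick sketch of querying structured data with Spark SQL on the 1.x API of this era (the JSON path and field names are assumptions):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val events = sqlContext.read.json("/home/search_events.json")
events.registerTempTable("events")
sqlContext.sql("SELECT productCategory, COUNT(*) FROM events GROUP BY productCategory").show()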
For true streaming: Apache Flink
Also show CassandraWriterActor
Show Zeppelin.
ML variations: classification, regression, clustering
Fourth one: maybe don’t use social media??