1. Fast Data Intelligence in the IoT
Real-time Data Analytics with Spark Streaming and MLlib
Bas Geerdink
#iottechday
2. ABOUT ME
• Chapter Lead in Analytics area at ING
• Academic background in Artificial Intelligence and Informatics
• Working in IT since 2004, previously as developer and software architect
• Spark Certified Developer
• Twitter: @bgeerdink
• Github: geerdink
4. WHAT’S NEW IN THE IOT?
• More data
– Streaming data from multiple sources
• New use cases
– Combining data streams
• New technology
– Fast processing and scalability
(diagram: Front End → Back End → Data)
5. PATTERNS & PRACTICES FOR FAST DATA ANALYTICS
• Lambda Architecture
• Reactive Principles
• Pipes & filters
• Event Sourcing
• REST, HATEOAS
• …
13. KAFKA
• Distributed message broker
• Built for speed, scalability, fault-tolerance
• Works with topics, producers, consumers
• Created at LinkedIn, now open source
• Written in Scala
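As an aside, here is a minimal producer sketch in Scala (my illustration, not from the talk; the broker address is an assumption, the topic name is borrowed from the streaming code later on) showing how producers and topics fit together:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092") // assumed local broker
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
// publish one event to the "search_history" topic; subscribed consumers will pick it up
producer.send(new ProducerRecord[String, String]("search_history", "user42", "sneakers"))
producer.close()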
15. CASSANDRA
• NoSQL database
• Built for speed, scalability, fault-tolerance
• Works with CQL, consistency levels, replication factors
• Created at Facebook, now open source
• Written in Java
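A minimal sketch of talking to Cassandra from Scala with the DataStax Java driver (3.x API); the keyspace, table, and replication settings are illustrative assumptions:

import com.datastax.driver.core.Cluster

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect()
// CQL looks like SQL; the replication factor is set per keyspace
session.execute("CREATE KEYSPACE IF NOT EXISTS shop WITH replication = " +
  "{'class': 'SimpleStrategy', 'replication_factor': 3}")
session.execute("CREATE TABLE IF NOT EXISTS shop.product_scores " +
  "(category text, product text, score double, PRIMARY KEY (category, product))")
session.execute("INSERT INTO shop.product_scores (category, product, score) " +
  "VALUES ('Sneakers', 'AirMax', 0.9)")
cluster.close()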
17. SPARK
• Fast, parallel, in-memory, general-purpose data processing engine
• Winner of Daytona Gray Sort benchmark 2014
• Runs on Hadoop YARN, Mesos, cloud, or standalone
• Created at AMPLab UC Berkeley, now open source
• Written in Scala
18. CODE: SPARK BASICS
val l = List(1,2,3,4,5)
val p = sc.parallelize(l) // create RDD
p.count() // action
def fun1(x: Int): Int = x * 2
p.map(fun1).collect() // transformation
p.map(i => i * 2).filter(_ < 6).collect() // lambda
21. CODE: SPARK STREAMING
val conf = new SparkConf().setAppName("fast-data-search-history").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(2)) // batch interval = 2 sec
val kafkaParams = Map[String, String]("metadata.broker.list" -> "localhost:9092")
val kafkaDirectStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("search_history"))
kafkaDirectStream
  .map(record => ProductScoreHelper.createProductScore(record._2)) // record = (key, message)
  .filter(_.productCategory != "Sneakers")
  .foreachRDD(rdd => rdd.foreach(CassandraHelper.insertScore))
ssc.start() // explicitly tell the StreamingContext to start receiving data
ssc.awaitTermination() // wait for the job to finish
22. CODE: SPARK MLLIB
// initialize Spark MLlib
val conf = new SparkConf().setAppName("fast-data-social-media").setMaster("local[2]")
val sc = new SparkContext(conf)
// load machine learning model from disk
val model = LinearRegressionModel.load(sc, "/home/social_media.model")
def processEvent(sme: SocialMediaEvent): Unit = {
  // feature vector extraction: DenseVector holds doubles, so the string fields
  // must first be mapped to numeric features (hashing is one simple option)
  val vector = new DenseVector(Array(sme.userName.hashCode.toDouble, sme.message.hashCode.toDouble))
  // get a new prediction for the top user category
  val value = model.predict(vector)
  // store the predicted category value
  val user = new User(sme.userName, UserHelper.getCategory(value))
  CassandraHelper.updateUserCategory(user)
}
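For context, a hedged sketch of how a model like the one loaded above could be trained and saved in a batch job (the CSV path and feature layout are assumptions, not from the talk):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// each line: label followed by comma-separated numeric features
val trainingData = sc.textFile("/home/social_media_history.csv").map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts.head, Vectors.dense(parts.tail))
}.cache()

val trainedModel = LinearRegressionWithSGD.train(trainingData, 100) // 100 iterations of SGD
trainedModel.save(sc, "/home/social_media.model")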
23. THREE KEY TAKEAWAYS
• The IoT comes with new architecture: reactive and scalable are the new normal
• Be aware of the paradigm shift: in-memory, streaming, distributed, shared nothing
• Open source tooling such as Kafka, Cassandra, and Spark can help to process the fast data flows
In this session, streaming data from IoT sources (sensors) will be pulled into an analytics engine to make predictions about the future. We use Spark as the technology of choice, since this framework is well suited for combining streaming data with machine learning techniques. Join this session to get an overview of a (nearly) full-blown analytics application, and to get inspired to set up your own predictive API for the IoT!
This is a dream for engineers…
Who here is actually working on an IoT application in production?
Compare this to a conference about Content Management Systems, ERP, …
Big data vs Fast data:
The 3 Vs: Volume, Variety, Velocity
Storage is not an issue anymore… Hadoop is 10 years old! Speed and responsiveness are the new challenges.
Same as with big data: you have to do something with the data. Machine learning = best with lots of data, e.g. historical events
Reusable solutions to common problems
Building blocks, guidelines, blueprints of architecture.
I’m going to say a little about the first two.
1. All data entering the system is dispatched to both the batch layer and the speed layer for processing.
2. The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing the batch views.
3. The serving layer indexes the batch views so that they can be queried in a low-latency, ad-hoc way.
4. The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
5. Any incoming query can be answered by merging results from batch views and real-time views (see the sketch below).
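A minimal sketch of point 5 in Scala (my illustration; the views and counts are made up): the batch view is complete but hours old, the real-time view covers only recent events, and a query merges the two.

val batchView = Map("Sneakers" -> 120L, "Phones" -> 80L) // pre-computed, complete, stale
val speedView = Map("Sneakers" -> 3L) // recent events only

def query(category: String): Long =
  batchView.getOrElse(category, 0L) + speedView.getOrElse(category, 0L)

query("Sneakers") // 123: batch result plus recent events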
Elastic = Scalable on demand, up & down. System stays responsive under varying workload.
Resilient = system stays responsive in face of failure
Responsive = system should respond in a timely manner if at all possible, even if (parts) are failing. Deal with problems quickly.
Message-driven = rely on asynchronous message passing to ensure loose coupling, isolation, non-blocking behavior, back-pressure.
Back-pressure = the ability to communicate that a component is under stress. This feedback is used by upstream components to reduce the load, thereby ensuring the system as a whole doesn’t fail.
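An illustrative back-pressure sketch in Scala (not from the talk), using a bounded queue: when the consumer falls behind, the queue fills up and put() blocks, which slows the producer down instead of letting the system fail.

import java.util.concurrent.ArrayBlockingQueue

val queue = new ArrayBlockingQueue[Int](10) // capacity bounds how far ahead the producer may run

val producer = new Thread(new Runnable {
  def run(): Unit = for (i <- 1 to 100) queue.put(i) // put() blocks when the queue is full
})
val consumer = new Thread(new Runnable {
  def run(): Unit = for (_ <- 1 to 100) { queue.take(); Thread.sleep(10) } // deliberately slow
})
producer.start(); consumer.start()
producer.join(); consumer.join()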
We have this nice guy. He is a little strange, because he has a social network. Even worse: he is on the internet, buying stuff, searching for items. He is even connected to the IoT: car, house, fridge, phone, etc. Scary!
Now, meet an evil guy. He wants to take advantage of all this nice data! He sets up a company that combines all these data flows and does something very clever: he shows mr. nice guy ads in banners. He wants to give him an offer he can’t refuse! Obviously no one in the audience would click on such advertisement spam, but please consider that there are people on this planet who might.
So, I am a developer in this company; how should I build my system? It has to be scalable: we start small, but what if this becomes a success? I’ve heard something about fast data and the lambda architecture, let’s give that a try…
Batch:
Based on historical behavior and user profile, predict (recommend) the product category that a user is interested in.
Algorithm on daily/hourly basis
Speed:
Based on current, actual data, score the products of a category.
Store events for
API:
- Select data from tables, define order/priority of products within a category.
What do we need to set up such a system nowadays?
All open source, reason: not because it’s free, but because we want to contribute to the community.
All running on commodity hardware and cloud.
I will discuss the top three…
Allows SOA and Microservices architecture, but it’s not an ESB (too little functionality)
Elastic: 1 instance can serve a large organization. One broker can handle hundreds of megabytes per second from thousands of clients.
Runs on ZooKeeper: a high-performance coordination service
Publish-subscribe mechanism
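A companion sketch to the producer example earlier: a consumer subscribing to the same (assumed) topic; poll(long) matches the older client versions of this era.

import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "fast-data-demo") // consumers in one group share the partitions
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("search_history"))
while (true) {
  val records = consumer.poll(100) // wait up to 100 ms for new records
  for (record <- records.asScala) println(s"${record.key}: ${record.value}")
}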
Too fast? (Reads at low consistency may miss the last written value)
Consistency level: tradeoff between speed and data quality
(1 = fast, may not read last written value, quorum = strict majority w.r.t. replication factor, all = slow, guaranteed reads)
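For illustration, choosing a consistency level per statement with the DataStax 3.x driver (keyspace and table assumed from the earlier sketch):

import com.datastax.driver.core.{Cluster, ConsistencyLevel, SimpleStatement}

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect()
val stmt = new SimpleStatement("SELECT * FROM shop.product_scores WHERE category = 'Sneakers'")
stmt.setConsistencyLevel(ConsistencyLevel.QUORUM) // strict majority of replicas must answer
session.execute(stmt)
cluster.close()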
CAP theorem: it’s impossible to provide all three guarantees of Consistency (= quality; all nodes see the same data at the same time), Availability, Partition Tolerance
ACID vs BASE consistency model: relational/’safe’ vs scalable/resilient/’eventually consistent’
Commercialized by Datastax
Spark = data processing framework
With built-in parallel distribution, in-memory computing.
Biggest ‘big data’ project at Apache
Daytona Sort:
2009: Hadoop, 100 TB in 173 minutes, 3452 nodes x 4 cores
2013: Hadoop, 102.5 TB in 72 minutes, 2100 nodes x 24 cores
2014: Spark, 100 TB in 23 minutes, 206 nodes x 32 cores
Commercialized by Databricks, Cloudera, Hortonworks, Amazon, IBM, …
StorageLevel can be chosen: memory and/or disk, optionally serialized
Number and size of partitions is configurable.
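Both knobs in one short sketch (illustrative values):

import org.apache.spark.storage.StorageLevel

val data = sc.parallelize(1 to 1000000, 8) // 8 partitions
val cached = data.persist(StorageLevel.MEMORY_AND_DISK_SER) // spill to disk, keep serialized
cached.count() // materializes the cache
val wider = cached.repartition(16) // reshuffle into more partitions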
History:
General batch processing: MapReduce
Specialized systems: Dremel, Drill, Impala, Storm, S4, …
Unified Platform: Spark
Spark SQL = query structured data
GraphX = for graph structures, e.g. hyperlinks, communities, …
RDD = Resilient Distributed Dataset
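A quick sketch of querying structured data with Spark SQL on the 1.x API of this era (the JSON path and field names are assumptions):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val events = sqlContext.read.json("/home/search_events.json")
events.registerTempTable("events")
sqlContext.sql("SELECT productCategory, COUNT(*) FROM events GROUP BY productCategory").show()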
For true streaming: Apache Flink
Also show CassandraWriterActor
Show Zeppelin.
ML variations: classification, regression, clustering
Fourth one: maybe don’t use social media??