Building Self-Healing, Intelligent Platforms: systems that learn, multi-datacenter, removing human intervention with ML. Reactive Summit 2016 @helenaedelson
Building Reactive Distributed Systems For Streaming Big Data, Analytics & Machine Learning
1.
Building Reactive Distributed Systems
For Streaming Big Data, Analytics & ML
Helena Edelson @helenaedelson Reactive Summit 2016
2. @helenaedelson #reactivesummit
Distributed Systems & Big Data Platform
Engineer, Event-Driven systems, Streaming
Analytics, Machine Learning, Scala
Committer: FiloDB, Spark Cassandra Connector,
Kafka Connect Cassandra
Contributor: Akka (Akka Cluster), prev: Spring
Integration
Speaker: Reactive Summit, Kafka Summit, Spark
Summit, Strata, QCon, Scala Days, Philly ETE
I play with big data
twitter.com/helenaedelson
slideshare.net/helenaedelson
linkedin.com/in/helenaedelson
github.com/helena
and think about FT
4.
It’s Not Easy Being…
Globally Distributed
Eventually Consistent
Highly Available
While handling TBs of data
Self Healing
Fast
5.
Massive event spikes & bursty traffic
Fast producers / slow consumers
Network partitioning & out of sync systems
DC down
Failure before commit
DDoS'ing yourself: no backpressure
Lack of graceful lifecycle handling, e.g. data loss
when auto-scaling down
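The "no backpressure" failure mode above is exactly what Reactive Streams semantics address: demand flows upstream, so a slow consumer automatically slows a fast producer instead of being overwhelmed. A minimal sketch, assuming the 2016-era Akka Streams API:

```scala
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}

object BackpressureDemo extends App {
  implicit val system = ActorSystem("demo")
  implicit val mat = ActorMaterializer()

  // Fast producer: emits only as fast as downstream demand allows
  Source(1 to 1000000)
    // Slow consumer: the sink signals demand one element at a time,
    // so the source never outruns it - nothing dropped, no unbounded buffer
    .runWith(Sink.foreach { n =>
      Thread.sleep(5) // simulate slow processing
    })
}
```

The point is that no explicit rate-limiting code appears anywhere: backpressure is the default contract between stages.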
9.
“Everything fails, all the time” - Werner Vogels
Start with this premise.
Build failure, chaos routing
and intelligence into your
platform infrastructure.
Be fault-proactive, not just
reactive.
10.
The Matrix is everywhere…
It is a world where humans,
not machines, need to be
directly involved in the health of
systems, responding to alerts…
13.
Self Healing, Intelligent Platforms
Imagine a world
where engineers designed software
while the machines did triage, failure analysis and
took action to correct failures
and learned from it…
15.
Building Systems That Learn
You have systems receiving, processing & collecting data. They can provide insight into
trends, patterns, and when they start failing, your monitoring triggers alerts to humans.
Alternative
1. Automatically route data only to nodes that can process it.
2. Automatically route failure contexts to event streams with ML algorithms to learn:
What is normal?
What is an anomaly?
When these anomalies occur, what are they indicative of?
3. Automate proactive actions when those anomalies bubble up to a fault-aware layer.
16.
Define Failure vs. Error Fault-Tolerance
Pathways & Routing
Failures: more network-related, e.g. connectivity, …
Errors: application errors, user input from REST
requests, config load-time errors…
Escape loops and feedback systems
Built-in failure detection
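One way to encode separate error vs. failure pathways is an Akka supervision strategy. A sketch, assuming the Akka 2.4 API; the exception types chosen here are illustrative stand-ins for the two categories:

```scala
import java.io.IOException
import scala.concurrent.duration._
import akka.actor.{Actor, OneForOneStrategy, SupervisorStrategy}
import akka.actor.SupervisorStrategy.{Escalate, Restart, Resume}

class WorkerSupervisor extends Actor {
  override val supervisorStrategy: SupervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
      // Error pathway: bad user input from a REST request - drop it, keep state
      case _: IllegalArgumentException => Resume
      // Failure pathway: connectivity trouble - restart the child and retry
      case _: IOException => Restart
      // Unknown: escalate so a fault-aware layer can decide
      case _: Exception => Escalate
    }

  def receive = Actor.emptyBehavior
}
```

Routing each category down its own pathway is what makes the later "route failure context to a learning stream" step possible.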
17.
I want to keep all my apps
healthy, handle errors with a
similar strategy, and failures with
a similar strategy…
…and share that data across
datacenters with my multi-dc
aware kernel to build smarter,
automated systems.
21.
I want to
route failure context
(when a node is alive
enough to talk)
to my.failures.fu Kafka topic for
my X-DC kernel to process through
my ML streams
to learn about usage,
traffic, memory patterns and
anomalies, to anticipate failures,
intercept and correct
Without Human Intervention
22.
Akka Cluster it. The cluster knows
the node is down. Many options/
mechanisms to send/publish failure
metadata to learning streams.
If it’s a network failure and my node can’t
report to the kernel?
Bring it
23.
I think I’ll start with a
base cluster-aware Akka Extension for
my platform.
Any non-JVM apps can share data
with these via Kafka, ZeroMQ,
Cassandra…
Sew easy.
Might sneak it in without a
PR.
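A skeleton of such a cluster-aware base Extension might look like this; the names are hypothetical, the Extension machinery is per Akka 2.4:

```scala
import akka.actor.{ExtendedActorSystem, Extension, ExtensionId, ExtensionIdProvider}
import akka.cluster.Cluster

// Hypothetical platform extension: every JVM app that loads it
// becomes cluster-aware with zero per-app boilerplate.
class PlatformExtension(system: ExtendedActorSystem) extends Extension {
  val cluster = Cluster(system)
  def selfRoles: Set[String] = cluster.selfRoles
  def unreachableMembers = cluster.state.unreachable
}

object PlatformExtension extends ExtensionId[PlatformExtension]
  with ExtensionIdProvider {
  override def lookup = PlatformExtension
  override def createExtension(system: ExtendedActorSystem) =
    new PlatformExtension(system)
}

// Usage in any app on the platform:
//   PlatformExtension(system).selfRoles
```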
24.
Then I can bake my fault-tolerance
strategy & route awareness into every
app & node.
And make it know when to use
Kafka and when to use Akka for
communication.
25.
And all logic and flows for graceful handling of
node lifecycle are baked into the framework too.
Dude, that’s tight.
Graceful Lifecycle
26.
Akka Cluster
Node-aware cluster membership
Gossip protocol
Status of all member nodes in the ring
Tunable, automatic failure detection
No single point of failure or bottleneck
A level of composability above Actors & Actor
Hierarchy
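The membership and failure-detection features above are delivered to your actors as events. A sketch of a listener, per the Akka 2.4 cluster API, with the hook where a fault-aware layer could publish failure metadata:

```scala
import akka.actor.{Actor, ActorLogging}
import akka.cluster.Cluster
import akka.cluster.ClusterEvent._

class ClusterListener extends Actor with ActorLogging {
  val cluster = Cluster(context.system)

  override def preStart(): Unit =
    cluster.subscribe(self, initialStateMode = InitialStateAsEvents,
      classOf[MemberEvent], classOf[UnreachableMember])

  override def postStop(): Unit = cluster.unsubscribe(self)

  def receive = {
    case MemberUp(member) =>
      log.info("Member up: {}", member.address)
    case UnreachableMember(member) =>
      // Failure context: this is where metadata could be routed
      // to a Kafka topic for the learning streams
      log.warning("Member unreachable: {}", member.address)
    case MemberRemoved(member, previousStatus) =>
      log.info("Member removed: {} (was {})", member.address, previousStatus)
  }
}
```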
27.
Akka Cluster has tunable gossip. I
can partition segments of my node ring
by role, and route data by role or
broadcast.
Tunable Gossip
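Roles and gossip tuning are plain configuration. A sketch of the relevant settings; the role name and values here are illustrative, the keys follow the Akka 2.4 reference config:

```hocon
akka {
  actor.provider = "akka.cluster.ClusterActorRefProvider"
  cluster {
    # partition segments of the ring by role
    roles = ["analytics"]
    # tunable gossip & phi-accrual failure detection
    gossip-interval = 1s
    failure-detector.threshold = 10.0
  }
}
```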
28.
Akka Cluster has
AdaptiveLoadBalancing logic.
I can route events away from nodes in trouble and
direct them to healthier nodes. Many options, highly
configurable, good granularity.
AdaptiveLoadBalancing*
Router
Cluster Metrics API
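An adaptive router is typically declared in deployment config. A sketch; the actor paths and role name are hypothetical, the structure follows the Akka cluster-metrics documentation:

```hocon
akka.actor.deployment {
  /analytics/workerRouter = {
    router = adaptive-group
    # weigh routees by heap, cpu, load, or a mix of all three
    metrics-selector = mix
    routees.paths = ["/user/worker"]
    cluster {
      enabled = on
      use-role = analytics
      allow-local-routees = on
    }
  }
}
```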
30.
Dynamically Route by Node Health
Cluster Metrics API & AdaptiveLBR
Oh look,
slow consumers or something in GC
thrash…
Leader
Uses an EWMA algorithm to decay the weight of older health data
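The EWMA itself is a one-line recurrence. A plain-Scala sketch of how newer health samples dominate decayed older ones; the alpha derivation mirrors the half-life idea in Akka's cluster metrics but is not copied from it:

```scala
// EWMA recurrence: next = prev + alpha * (sample - prev)
final case class EWMA(value: Double, alpha: Double) {
  require(0.0 <= alpha && alpha <= 1.0)
  def :+(sample: Double): EWMA = copy(value = value + alpha * (sample - value))
}

// Derive alpha from a half-life expressed in collect intervals:
// after that many samples, an old data point's weight has halved
def alphaFor(halfLifeIntervals: Double): Double =
  1.0 - math.pow(0.5, 1.0 / halfLifeIntervals)

var heapUse = EWMA(value = 0.0, alpha = 0.7)
heapUse = heapUse :+ 10.0 // value = 7.0
heapUse = heapUse :+ 20.0 // value = 16.1 - the recent sample dominates
```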
31.
Dynamically Route by Node Health
Cluster Metrics API & AdaptiveLBR
Leader
The
Cluster leader ‘knew’ the red
node was nearing capacity & routed my
traffic to healthy nodes.
41.
Replaying Data from Time (T)
General time - We updated some compute
algorithms which run on raw and aggregated
data. I need to replay some datasets against the
updated algos
Specific time - A node crashed or there was a
network partitioning event and we lost data
from T1 to T2. I need to replay from T1 to T2
without having to de-dupe data
42.
Cassandra Bakes Clustered
Ordering Into Data Model
CREATE TABLE timeseries_t (
  id text,
  year int,
  month int,
  day int,
  hour int,
  PRIMARY KEY ((id), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);
Now read-path queries automatically return rows sorted most
recent first. No sorting or ordering to do in your code.
Can also easily bucket in cold storage for faster reads.
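This is also what makes the earlier "replay from T1 to T2" scenario a single range query, with no de-duping. A sketch against the timeseries_t table above; the id value and window are hypothetical:

```sql
-- Rows come back most-recent-first per the clustering order;
-- no sorting or de-duping in application code.
SELECT * FROM timeseries_t
WHERE id = 'sensor-1'
  AND year = 2016 AND month = 10
  AND day >= 1 AND day <= 2;
```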
44.
Kafka Streams Key Features
Secure stream processing using Kafka security
Elastic and highly scalable
Fault-tolerant
Stateful and stateless computations
45.
Kafka Streams Key Features
Interactive queries
Time Model & Windowing
Supports late-arriving and out-of-order data
Millisecond processing latency
At-least-once processing guarantees
exactly-once is in progress :)
46.
val builder = new KStreamBuilder()

// Consume from topic x in the stream, from any or one DC
val kstream: KStream[K, V] = builder.stream(des, des, "fu.input.topic")

// Do some analytics computations
// Publish to subscribers in all DCs in the stream
kstream.to("serviceA.questionB.aggregC.topic", …)

// Start the stream
val streams = new KafkaStreams(builder, props)
streams.start()

// Do some more work, then close the stream
streams.close()
Kafka Streams: KStream
47.
Kafka Streams: KTable
val locations: KTable[UserId, Location] = builder.table("user-locations-topic")
val userPrefs: KTable[UserId, Prefs] = builder.table("user-preferences-topic")

// Join detailed info from 2 streams as events stream in
val userProfiles: KTable[UserId, UserProfile] =
  locations.join(userPrefs, (loc: Location, prefs: Prefs) => new UserProfile(loc, prefs))

// Compute statistics: users under 30, counted per region
val usersPerRegion: KTable[Location, Long] = userProfiles
  .filter((userId, profile) => profile.age < 30)
  .groupBy((userId, profile) => profile.location)
  .count()
48.
val streams = new KafkaStreams(builder, props)

// Called when a stream thread abruptly terminates
// due to an uncaught exception; register before starting
streams.setUncaughtExceptionHandler(myErrorHandler)
streams.start()
Kafka Streams DSL Basics
58.
What do I want to predict & what can I learn from the data?
Which attributes in datasets are valuable predictors?
Which algorithms will best uncover valuable patterns?
Optimization &
Predictive Models
60.
Read from stored training data
into MLlib
import org.apache.spark.mllib.stat.Statistics
import filodb.spark._

val df = sqlContext.filoDataset("table", database = Some("ks"))
val seriesX = df.select("NumX").map(_.getInt(0).toDouble)
val seriesY = df.select("NumY").map(_.getInt(0).toDouble)
val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
Calculate correlation between multiple series of data
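For reference, what Statistics.corr computes here, Pearson's r, sketched in plain Scala:

```scala
// Pearson correlation of two equal-length series:
// covariance normalized by the product of standard deviations
def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
  require(xs.size == ys.size && xs.nonEmpty)
  val n  = xs.size
  val mx = xs.sum / n
  val my = ys.sum / n
  val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
  val sx  = math.sqrt(xs.map(x => (x - mx) * (x - mx)).sum)
  val sy  = math.sqrt(ys.map(y => (y - my) * (y - my)).sum)
  cov / (sx * sy)
}

pearson(Seq(1.0, 2.0, 3.0), Seq(2.0, 4.0, 6.0)) // 1.0: perfectly linear
```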
61.
Read from the stream into MLlib
& store for later work in FiloDB
val stream = KafkaUtils.createDirectStream[..](..)
  .map(transformFunc)
  .map(LabeledPoint.parse)

// historical training data already stored in FiloDB
val trainingData = sqlContext.filoDataset("trainings_fu")

val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.dense(weights))
  .setStepSize(0.2).setNumIterations(25)

// trainOn returns Unit: train, then predict, as separate calls
model.trainOn(stream)

model
  .predictOnValues(stream.map(lp => (lp.label, lp.features)))
  .insertIntoFilo("predictions_fu")
63.
Don’t take it at face value when
someone says X doesn’t work.
It may not have been the right choice
for their use case
They may not have configured,
deployed, or written code for it properly
for their load
All technologies are optimized for
a set of things, not for everything
Invest in R&D
Cultivate Skepticism
66.
Build A Reactive / Proactive Failure & Chaos
Intelligence System for platform infrastructure,
versus duplicating, with disparate
strategies, all over your teams & orgs.