Stream Computing
(The engineer’s perspective)
Ilya Ganelin
Batch vs. Stream
• Batch
• Process chunks of data instead of one record at a time
• Throughput over latency (seconds, minutes, hours)
• E.g. MapReduce, Spark, Tez
• Stream
• Data processed one record at a time
• Latency over throughput (microseconds, milliseconds)
• E.g. Storm, Flink, Apex, KafkaStreams, GearPump
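A minimal sketch of the difference, in plain Python rather than any of the platforms named above. The function names and the sample data are illustrative assumptions: the batch version waits for a full chunk before producing a result, while the stream version produces a result after every record.

```python
from typing import Iterable, Iterator, List


def batch_totals(records: List[int], chunk_size: int) -> List[int]:
    """Batch style: wait for a full chunk, then process it in one pass."""
    return [sum(records[i:i + chunk_size])
            for i in range(0, len(records), chunk_size)]


def stream_totals(records: Iterable[int]) -> Iterator[int]:
    """Stream style: update the result as each record arrives."""
    running = 0
    for value in records:
        running += value
        yield running  # a result is available after every single record


events = [3, 1, 4, 1, 5, 9]
print(batch_totals(events, chunk_size=3))  # one result per chunk
print(list(stream_totals(events)))         # one result per record
```

The batch call cannot answer until a chunk is complete, which is where the extra latency comes from. The stream call answers immediately but does work on every record.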
Scalability, Performance, Durability, Availability
• How do we handle more data?
• Quickly?
• Without ever losing data or compute?
• And ensure the system keeps working, even if there are failures?
What are the tradeoffs?
• If we focus on scalability, it’s harder to guarantee
• Durability – more moving pieces, more coordination, more failures
• Availability – more failures, harder to stay operational
• Performance – bottlenecks and synchronization
• If we focus on availability, it’s harder to guarantee
• Performance – monitoring and synchronization overhead
• Scalability and performance
• Durability – must recover without losing data
• If we focus on durability, it’s harder to guarantee
• Performance
• Scalability
Batch compute has it easy.
• Get scale-out and performance by adding hardware and taking longer
• Get durability with a durable data store and recompute
• Get availability by taking longer to recover (this makes life easier!)
• In stream processing, you don’t have time!
It’s not about performance and scale.
• Most platforms handle large volumes of data relatively quickly
• It’s about:
• Ease of use – how quickly can I build a complex application? Not word count.
• Failure-handling – what happens when things break?
• Durability – how do I avoid losing data without sacrificing performance?
• Availability – how can I keep my system operational with a minimum of labor
and without sacrificing performance?
Next: Case Studies in Open-Source Streaming
• Storm
• Flink
• Apex
Apache Storm
• Tried and true: was deployed on 10,000-node clusters at Twitter
• Scalable
• Performant
• Easy to use
• Weaknesses:
• Failure handling
• Operationalization at scale
• Flexibility
• Obsolete?
How does it work?
Failure Detection
No durability of data in flight or guarantee of exactly once processing!
Where do the weaknesses come from?
• Nimbus was a single point of failure (fixed as of 1.0.0 release)
• Upstream bolt/spout failure triggers re-compute on entire tree
• Can only create parallel independent streams by having separate redundant
topologies
• Bolts/spouts share JVM → Hard to debug
• Failed tuples cannot be replayed quicker than 1s (lower limit on Ack)
• No dynamic topologies
• Cannot add or remove applications without service interruption
• Poor resource sharing in large clusters
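A toy sketch of ack-and-replay, to show why replay alone does not give exactly-once processing. This is plain Python, not the Storm API: the ReplayingSource class, its emit/ack methods, and the 1-second timeout are assumptions made for illustration only.

```python
class ReplayingSource:
    """Toy source that remembers every emitted tuple until it is acked."""

    def __init__(self, items, timeout=1.0):
        self.queue = list(items)
        self.timeout = timeout  # seconds before an un-acked tuple is replayed
        self.pending = {}       # msg_id -> (item, time of last emit)
        self.next_id = 0

    def emit(self, now):
        """Replay the oldest timed-out tuple if there is one, else emit a new one."""
        for msg_id, (item, sent) in list(self.pending.items()):
            if now - sent >= self.timeout:
                self.pending[msg_id] = (item, now)
                return msg_id, item
        if not self.queue:
            return None
        msg_id, item = self.next_id, self.queue.pop(0)
        self.next_id += 1
        self.pending[msg_id] = (item, now)
        return msg_id, item

    def ack(self, msg_id):
        """Forget a tuple once downstream confirms it was processed."""
        self.pending.pop(msg_id, None)


source = ReplayingSource(["a", "b"], timeout=1.0)
seen = []

# t=0.0: "a" is emitted and processed downstream, but its ack is lost.
msg_id, item = source.emit(now=0.0)
seen.append(item)

# t=0.5: "b" is emitted, processed, and acked normally.
msg_id, item = source.emit(now=0.5)
seen.append(item)
source.ack(msg_id)

# t=1.0: "a" has timed out, so it is replayed and this time acked.
msg_id, item = source.emit(now=1.0)
seen.append(item)
source.ack(msg_id)

print(seen)  # "a" was processed twice
```

Downstream sees "a" twice, so any non-idempotent operation (a counter, a sum) is now wrong. The replay also cannot happen sooner than the timeout, which is the latency floor noted above.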
Enter the Competition – Apache Flink
• Declarative functional API (like Spark)
• But, true streaming platform (sort of) with support for CEP
• Optimized query execution
• Weaknesses:
• Depends on network micro-batching under the hood!
• Not battle-tested
• Failures still affect the entire topology
How does it work?
Failure Handling
So what’s different from Storm?
• Flink handles planning and optimization for you
• Abstracts lower level internals
• Clear semantics around windowing (which Storm has lacked)
• Failure handling is lightweight and fast!
• Exactly once processing (given appropriate connectors at start/end)
• Can run Storm
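To make "windowing semantics" concrete, here is a minimal sketch of a tumbling (fixed, non-overlapping) event-time window. It is plain Python, not the Flink API. The function name, the (timestamp, key) event shape, and the 10-unit window size are assumptions for illustration.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size):
    """Count events per key in fixed, non-overlapping event-time windows.

    events: iterable of (timestamp, key) pairs with integer timestamps.
    Returns a dict mapping (window_start, key) to a count.
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        # Every timestamp belongs to exactly one window.
        window_start = (timestamp // window_size) * window_size
        counts[(window_start, key)] += 1
    return dict(counts)


events = [(1, "x"), (4, "x"), (5, "y"), (11, "x"), (14, "y"), (19, "y")]
result = tumbling_window_counts(events, window_size=10)
for (window_start, key), count in sorted(result.items()):
    print(window_start, key, count)
```

A real platform also has to decide when a window is complete and what to do with late events. This sketch sidesteps both by seeing all the events up front.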
What can’t it do?
• Dynamically update topology
• Dynamically scale
• Recover from errors without stopping the entire DAG
• Allow fine-grained control of how data moves through the system –
locality, data partitioning, routing
• You can do these individually, but not all at once
• The high level API is a curse!
• Run in production (Maybe?)
So what else is there?
Onyx
Which are unique?
• Apache Beam (Google’s baby - unifies all the platforms)
• Apache Apex (Robust architecture, scalable, fast, durable)
• IBM InfoSphere Streams (proprietary, expensive, the best)
Let’s look at Apex
• Unique provenance
• Built for the business at Yahoo – not a research project
• Built for reliability and strict processing semantics, not performance
• Apex just works
• Strengths
• Dynamism
• Scalability
• Failure-handling
• Weaknesses
• No high-level API
• More complex architecture
How does it work?
Failure Handling
So it’s the best? Sort of!
• Most robust failure-handling
• Allows fine-tuning of data flows and DAG setup
• Excellent exploratory UI
• But
• Learning curve
• No high-level API
• No machine learning support
• Built for business, not for simplicity
Streaming is great – what about state?
• What if I need to persist data?
• Across operators?
• Retrieve it quickly?
• Do complex analytics?
• And build models?
Why state?
• Historical features (e.g. spend amount over 30 days)
• Statistical aggregates
• Machine learning model training
• Why cross-operator? Because of how data is partitioned: shared state allows
aggregation over multiple fields.
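A minimal sketch of a historical feature such as "spend amount over 30 days", kept as per-key state. The RollingSpend class and its field names are hypothetical, and the sketch assumes purchases for each customer arrive in day order.

```python
from collections import defaultdict, deque


class RollingSpend:
    """Per-customer spend over a trailing window of days."""

    def __init__(self, window_days=30):
        self.window_days = window_days
        self.events = defaultdict(deque)   # customer -> deque of (day, amount)
        self.totals = defaultdict(float)   # customer -> spend inside the window

    def update(self, customer, day, amount):
        """Record a purchase and return the customer's spend over the window."""
        purchases = self.events[customer]
        purchases.append((day, amount))
        self.totals[customer] += amount
        # Evict purchases that have fallen out of the trailing window.
        while purchases and purchases[0][0] <= day - self.window_days:
            _, old_amount = purchases.popleft()
            self.totals[customer] -= old_amount
        return self.totals[customer]


state = RollingSpend(window_days=30)
print(state.update("alice", day=1, amount=20.0))
print(state.update("alice", day=15, amount=5.0))
print(state.update("alice", day=31, amount=10.0))  # the day-1 purchase is evicted
```

The state here lives in one process. In a streaming topology it has to survive operator restarts and be reachable from wherever the key is routed, which is the problem the following slides address.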
Distributed In-Memory Databases
• Can support low-latency streaming use cases
• Durability becomes complicated because memory is volatile
• Memory is expensive and limited
• Examples: Memcached, Redis, MemSQL, Ignite, Hazelcast, Distributed
Hash Tables
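A sketch of the core idea behind a distributed hash table: a stable hash of the key decides which node holds the value. The node names and the PartitionedStore class are hypothetical, and the "nodes" here are just dictionaries in one process.

```python
import hashlib


def node_for_key(key, nodes):
    """Pick a node with a hash that is stable across processes and restarts."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


class PartitionedStore:
    """In-memory key-value store split across named partitions."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.data = {node: {} for node in self.nodes}

    def put(self, key, value):
        self.data[node_for_key(key, self.nodes)][key] = value

    def get(self, key):
        return self.data[node_for_key(key, self.nodes)].get(key)


store = PartitionedStore(["node-a", "node-b", "node-c"])
store.put("customer:42", {"spend_30d": 15.0})
print(node_for_key("customer:42", store.nodes))
print(store.get("customer:42"))
```

Plain modulo placement moves most keys when a node is added or removed. Real systems often use consistent hashing to limit that, and add replication because memory is volatile.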
Lab!
• Build and deploy a simple architecture on a streaming platform
• Ingest data
• Engineer features
• Build a model
• Score against the model
• Storm + H2O
• Model build and model score are two different steps
• H2O allows you to export your model as a POJO that can be added as Java
code in a Storm Bolt
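A sketch of the build/score split in plain Python. It does not use the H2O or Storm APIs: MODEL stands in for an exported model, its coefficients are made up, and ScoringBolt with its process method is a hypothetical stand-in for a bolt that scores one tuple at a time.

```python
import math

# Stands in for a model built offline and exported. The coefficients are made up.
MODEL = {"intercept": -1.0, "weights": {"age": -0.02, "fare": 0.03}}


def score(model, features):
    """Logistic score in (0, 1) from frozen coefficients; no training happens here."""
    z = model["intercept"] + sum(
        weight * features[name] for name, weight in model["weights"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))


class ScoringBolt:
    """Hypothetical bolt-like wrapper: takes one tuple in, returns one scored tuple."""

    def __init__(self, model):
        self.model = model

    def process(self, tup):
        return {**tup, "score": score(self.model, tup)}


bolt = ScoringBolt(MODEL)
out = bolt.process({"age": 30.0, "fare": 50.0})
print(out)
```

The scoring path only reads the model, so it is cheap and can be replicated across parallel bolts. Rebuilding the model is a separate, slower job that swaps in new coefficients.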
Goals
• Demonstrate parallel feature computation
• Demonstrate model creation and export using H2O
• Given a labeled dataset (e.g. Titanic), generate a set of scores by
running the model within the Storm topology
• Validate the generated results against a validation dataset (Storm or
offline)
Plan of attack
• Step 0:
• Storm topology, executing a model (could be linear regression you coded
yourself), locally on a single node.
• Step 1:
• Storm topology, executing an H2O model locally on a single node
• Step 2:
• Storm topology, executing an H2O model, on multiple nodes (real or virtual)
• Step 3 (Extra credit):
• Install Redis as a state store and use a Redis client to access Redis from Storm
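For Step 0, a hand-coded model can be as small as one-feature linear regression fitted by ordinary least squares. This is a sketch with made-up training data. The function names are illustrative, and the prediction function is the part that would run inside the topology.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Score a single record; this is the per-tuple work."""
    return slope * x + intercept


# Made-up training data that lies exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
print(predict(slope, intercept, 10.0))
```

Fit once offline, then pass only slope and intercept to the scoring step. That keeps the same build/score split used later with H2O.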
Final Deliverable
• A report detailing your experience working with this technology
• What worked?
• What did not work?
• What was setup and usability like?
• What issues did you run into?
• How did you resolve these issues?
• Were you able to get the system operational?
• Were you able to get the results you wanted?
Setup
• Download and install Apache Storm
• http://storm.apache.org/releases/1.0.0/index.html
• http://storm.apache.org/downloads.html
• http://storm.apache.org/releases/1.0.0/Setting-up-a-Storm-cluster.html
• Download and install H2O
• http://www.h2o.ai/download/
• https://h2o-release.s3.amazonaws.com/h2o/rel-turchin/3/docs-website/h2o-docs/index.html
• https://h2o-release.s3.amazonaws.com/h2o/rel-turchin/3/docs-website/h2o-py/docs/index.html

Editor's Notes

  1. Independence of partitions Auto-scaling (throughput and latency)
  2. Batch, micro-batch, and true streaming