POLYGLOT PROCESSING –
AN INTRODUCTION
Dr. Mohan K. Bavirisetty
Chief Scientist
Modern Renaissance
Agenda
1. Big Data Landscape
2. Lambda vs. Kappa Architecture
3. Spark vs. Storm vs. Flink
4. Demo 1 – Apache Spark
5. Demo 2 – Storm, Kafka and Redis
6. Demo 3 – Flink with DataStream API?
7. Summary
8. Questions
The purpose of computing is insight not data – Richard Hamming
BIG DATA LANDSCAPE
What is Big Data?
Big data is high-volume, high-velocity and high-variety information assets
that demand cost-effective, innovative forms of information processing
for enhanced insight and decision making.
Source: Gartner Research
What is a Real-time Analytics Platform?
3 Common Kinds of Workloads
1. Batch Operations
2. Micro-batch Operations
3. Real-time Streaming
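The three workloads can be contrasted with a minimal plain-Python sketch. It uses no Spark, Storm or Flink API; the event values and function names are hypothetical and chosen only for illustration. The same events are summed once as a batch, once in small buffered groups, and once event by event.

```python
from typing import Iterable, Iterator, List

# Hypothetical sensor readings; the values are illustrative only.
EVENTS = [3, 1, 4, 1, 5, 9, 2, 6]


def batch_total(events: List[int]) -> int:
    """Batch: the full, bounded data set is available before processing starts."""
    return sum(events)


def micro_batch_totals(events: Iterable[int], batch_size: int) -> Iterator[int]:
    """Micro-batch: buffer a few events, then emit one result per small batch."""
    buffer: List[int] = []
    for event in events:
        buffer.append(event)
        if len(buffer) == batch_size:
            yield sum(buffer)
            buffer = []
    if buffer:  # flush the final, possibly smaller batch
        yield sum(buffer)


def streaming_totals(events: Iterable[int]) -> Iterator[int]:
    """Streaming: update and emit the running result after every single event."""
    total = 0
    for event in events:
        total += event
        yield total


if __name__ == "__main__":
    print(batch_total(EVENTS))                             # 31
    print(list(micro_batch_totals(EVENTS, batch_size=3)))  # [8, 15, 8]
    print(list(streaming_totals(EVENTS)))                  # [3, 4, 8, 9, 14, 23, 25, 31]
```

The trade-off is latency against overhead: batch answers once at the end, streaming answers after every event, and micro-batch sits in between.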
“Evidence-based decision-making (aka Big Data) is not just the latest fad, it's
the future of how we are going to guide and grow business.”
– Kristian Hammond, CTO, Narrative Science
8 Requirements of Real-time Computing
Keep Data Moving
Allow SQL Queries
Handle Stream Imperfections
Generate Predictable Outcomes
Integrate Streaming Data and Stored Data
Guarantee Data Safety and Availability
Partition and Scale Applications Automatically
Process and Respond Instantaneously
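"Handle Stream Imperfections" and "Generate Predictable Outcomes" can be illustrated with a small plain-Python sketch. It is an assumption-laden illustration, not the mechanism of any particular engine: the window length, the allowed lateness and the name `windowed_counts` are hypothetical. Events are counted per fixed event-time window, out-of-order events are still accepted while their window is open, and events that arrive after their window has closed are set aside instead of silently changing old results.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

WINDOW = 10           # window length in seconds (assumed value)
ALLOWED_LATENESS = 5  # how long to wait for stragglers, in seconds (assumed value)

Event = Tuple[int, str]  # (event timestamp in seconds, payload)


def windowed_counts(events: Iterable[Event]) -> Tuple[Dict[int, int], List[Event]]:
    """Count events per fixed event-time window.

    Returns (counts keyed by window start, events dropped as too late).
    """
    counts: Dict[int, int] = defaultdict(int)
    dropped: List[Event] = []
    watermark = 0  # highest event time seen so far, minus the allowed lateness
    for timestamp, payload in events:
        watermark = max(watermark, timestamp - ALLOWED_LATENESS)
        window_start = timestamp - timestamp % WINDOW
        window_end = window_start + WINDOW
        if window_end <= watermark:
            dropped.append((timestamp, payload))  # its window is already closed
        else:
            counts[window_start] += 1
    return dict(counts), dropped


if __name__ == "__main__":
    arrivals = [(1, "a"), (12, "b"), (8, "c"), (27, "d"), (9, "e"), (21, "f")]
    counts, dropped = windowed_counts(arrivals)
    print(counts)   # {0: 2, 10: 1, 20: 2}
    print(dropped)  # [(9, 'e')]
```

Event (8, "c") arrives out of order but is still counted; event (9, "e") arrives after its window has closed and is reported separately.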
How do major data engines compare?
Real-time Streaming Architecture
Berkeley Data Analytics Stack
Polyglot …

Polyglot Programming
• Polyglot: one who is versed in many languages
• Different languages, frameworks and services
• Example: Java with Scala, Clojure inside Trident

Polyglot Persistence
• Capacity to store data in multiple formats
• Structured, document, log, GPS

Polyglot Processing
• Refers to the capability to process any kind of data,
any kind of workload, any kind of workflow
LAMBDA VS. KAPPA
ARCHITECTURES
Lambda Architecture
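The Lambda Architecture is usually described as three layers: a batch layer that recomputes views from an immutable master data set, a speed layer that keeps an incremental view of recent data, and a serving layer that merges both at query time. A minimal plain-Python sketch of that idea follows. The class name `LambdaPageViews` and the page-view example are hypothetical, and real deployments would use separate systems for each layer.

```python
from collections import Counter
from typing import List


class LambdaPageViews:
    """Toy page-view counter organised along the three Lambda layers."""

    def __init__(self) -> None:
        self.master: List[str] = []                # immutable, append-only master data set
        self.batch_view: Counter = Counter()       # recomputed from scratch by the batch layer
        self.realtime_view: Counter = Counter()    # incremental, covers data since the last batch run

    def ingest(self, page: str) -> None:
        """New data goes to the master data set and to the speed layer."""
        self.master.append(page)
        self.realtime_view[page] += 1

    def run_batch(self) -> None:
        """Batch layer: recompute the view from all data, then reset the speed layer."""
        self.batch_view = Counter(self.master)
        self.realtime_view = Counter()

    def query(self, page: str) -> int:
        """Serving layer: merge the batch view with the real-time view."""
        return self.batch_view[page] + self.realtime_view[page]


if __name__ == "__main__":
    views = LambdaPageViews()
    for page in ["/home", "/home", "/about"]:
        views.ingest(page)
    views.run_batch()
    views.ingest("/home")
    print(views.query("/home"))  # 3 (2 from the batch view, 1 from the speed layer)
```

The cost of this design is that the same logic lives in two code paths, batch and speed, which must be kept consistent.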
What is Apache Storm?
Apache Storm is a free and open source
distributed real-time computation system. It
makes it easy to reliably process unbounded
streams of data.
Why Apache Storm?
Storm is fast, horizontally scalable,
fault-tolerant, easy to set up and
operate, and programming-language
agnostic.
Apache Storm
Apache Storm can be used to realize an APM Use Case
Apache Spark
Apache Spark is a fast and general
engine for large-scale data processing.
• Spark is fast
• Spark is easy
• Spark is extensible
Lambda Implementation with Spark
Kappa Architecture
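The Kappa Architecture keeps a single stream-processing path and treats a replayable log as the source of truth. When the logic changes, the log is replayed through the new version of the job to build a new view. A minimal plain-Python sketch follows, under stated assumptions: a Python list stands in for the log (for example a Kafka topic), and `run_job` and the two key functions are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List

Log = List[str]  # stands in for a replayable, append-only log


def run_job(log: Log, key_fn: Callable[[str], str]) -> Dict[str, int]:
    """The single streaming job: fold every record in the log into a view."""
    view: Counter = Counter()
    for record in log:
        view[key_fn(record)] += 1
    return dict(view)


# Hypothetical click log; the URLs are illustrative only.
log = ["/home", "/Home", "/about", "/home"]

# Version 1 of the job counts URLs exactly as written.
view_v1 = run_job(log, key_fn=lambda url: url)

# Version 2 fixes a bug (case sensitivity). No separate batch layer is needed:
# the same log is replayed through the new code to produce a new view.
view_v2 = run_job(log, key_fn=str.lower)

if __name__ == "__main__":
    print(view_v1)  # {'/home': 2, '/Home': 1, '/about': 1}
    print(view_v2)  # {'/home': 3, '/about': 1}
```

Compared with Lambda, there is only one code path to maintain; the price is that the log must retain enough history to be replayed.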
Apache Flink
Apache Flink has a unified runtime engine for batch and stream processing
DEMONSTRATION
SUMMARY
Summary
• Big Data challenges are being met with new and
innovative approaches and architectures.
• Lambda Architecture is a pragmatic near-term
solution. Fidelity is already implementing it.
• Kappa Architecture could turn out to be an elegant
long-term solution to Polyglot Processing.
• Apache Spark, Storm and Flink have their strengths
and niche areas of applicability.
• Apache SAMOA, Apache Zeppelin and Tachyon add
further value by providing additional capabilities.
Working Toward Analytics Mastery
[Figure: analytics stages plotted on axes Maturity and Time: Descriptive, Predictive, Preventive/Prescriptive]
Next Stage of Data Explosion
QUESTIONS?
We do not learn by inference and deduction and the application of mathematics to
philosophy, but by direct intercourse …
- Henry David Thoreau
THANK YOU
Appendix – References and Resources
• 8 Requirements of Real-time Stream Processing
http://cs.brown.edu/~ugur/8rulesSigRec.pdf
• Design Patterns for Real-Time Streaming Analytics
http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38774
• Big Data: Principles and Best Practices of Scalable Real-time Data Systems.
http://bit.ly/1LscB7z
• Real-time Stream Processing: The Next Step for Apache Flink
http://www.confluent.io/blog/2015/05/06/real-time-stream-processing-the-next-step-for-apache-flink/
• SAMOA – Scalable Advanced Massive Online Analysis
http://jmlr.csail.mit.edu/papers/volume16/morales15a/morales15a.pdf
• Lambda Architecture http://lambda-architecture.net/
• Kappa Architecture http://www.kappa-architecture.com/
• Apache Spark http://spark.apache.org/
• Apache Storm https://storm.apache.org/
• Apache Flink https://flink.apache.org/
• Apache SAMOA https://samoa.incubator.apache.org/
• Apache Zeppelin https://zeppelin.incubator.apache.org/
• Tachyon http://tachyon-project.org/
