Storm distributed processing

             BarCamp Saigon 2012
                       Duc Quoc
Hello! I’m Duc
• Senior Software Engineer
  – KMS Technology


• Open source advocate
  – www.ducquoc.vn
  – ducquoc.vn@gmail.com
  – @ducquoc_vn
Agenda
• Why Storm was created

• Basic concepts

• Some use cases

• Q&A
Storm?
• Twitter’s stream processing framework
Storm
• Originally from BackType for analyzing tweets
  – (More than 2000 watchers on GitHub)


• “the realtime Hadoop”
  – continuous computation system (open source)


• distributed, reliable, fault-tolerant
  – suitable for big data processing
Big Data challenges
• Scalability
   – vertical, horizontal


• (high) Availability

• Stability (fault-tolerance)

caching, replication, partitioning/sharding, load-balancing, …
Google!
• Published papers on MapReduce, the Google File System (GFS),
  and Bigtable
Apache Hadoop
• MapReduce, HDFS, HBase
  – later on: Hive, Pig, Mahout, ZooKeeper, …
[Diagram: a JobTracker, three ZooKeeper nodes, and five TaskTrackers]
Hadoop limits
• Batch processing with jobs -> not realtime
• Stateful nodes, SPOF – JobTracker/NameNode
• Cumbersome API
[Timeline diagram along t, ending at “now”: fully processed data; the latest full period; the time a Hadoop job takes for this data; unprocessed data]
Cluster
• Nimbus: daemon on the master node
• Supervisor: daemon on each worker node
• Coordination via ZooKeeper
[Diagram: Nimbus and the UI, three ZooKeeper nodes, and five Supervisors]
Tuple
• Ordered list of elements
  – (“user-1234”, “email:ducquoc.vn@gmail.com”)
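A minimal Python sketch of the idea, not Storm's own API: a tuple is an ordered list of values, and the component that emits it declares a name for each position. The class `StreamTuple`, its method `get_by_field`, and the field names `user` and `contact` are assumptions made for illustration; only the two values come from the slide.

```python
class StreamTuple:
    """Ordered values plus the field names declared by the emitter (hypothetical class)."""

    def __init__(self, fields, values):
        if len(fields) != len(values):
            raise ValueError("one value is needed per declared field")
        self.fields = list(fields)
        self.values = list(values)

    def get_by_field(self, name):
        # Look the value up through the position of its declared field name.
        return self.values[self.fields.index(name)]


# Field names are made up for this sketch; the values are the ones on the slide.
t = StreamTuple(["user", "contact"], ["user-1234", "email:ducquoc.vn@gmail.com"])
```

Access by position (`t.values[0]`) and access by name (`t.get_by_field("user")`) return the same element, which is why the order of the declared fields matters.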
Stream
• Unbounded sequence of tuples
Spout
• Source of a stream – emits tuples
• Reads from queues, logs, API calls, event data
Bolt
• Processes tuples; may emit new streams

• Apply functions, transforms, access DB & API
  – filter, aggregate, join, …
Topology
• A directed graph of spouts and bolts
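The graph structure can be sketched in plain Python (this is not Storm's topology builder). Each component lists the components it subscribes to, and a topological sort gives an order in which every source precedes its subscribers. The component names are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical component names; each key maps to the components it subscribes to.
subscriptions = {
    "word-spout": [],                  # a spout has no upstream component
    "question-bolt": ["word-spout"],   # this bolt consumes the spout's stream
    "count-bolt": ["question-bolt"],   # this bolt consumes the previous bolt's stream
}


def processing_order(graph):
    """Return the components so that every source comes before its subscribers."""
    return list(TopologicalSorter(graph).static_order())


order = processing_order(subscriptions)
```

For this linear chain the only valid order is spout first, then each bolt in turn. A cycle in the subscriptions would make `static_order()` raise `graphlib.CycleError`.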
Task
• Thread which executes a Spout or Bolt

• Deploy a topology:
  $ storm jar myCode.jar com.example.MyTopology arg1 arg2


• Kill a topology:
  $ storm kill topologyName
Sample code
• Create stream called “word”
  – run 10 tasks
• Create stream called “first-…”
  – run 3 tasks
  – subscribes to stream “word”, using shuffle grouping

Source code of this sample: https://ducquoc.googlecode.com/svn/trunk/storm/
Sample code (2/3)
• RandomWordSpout




  emits a random string from the words array every 100
  milliseconds
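The spout's code is not part of this transcript, so the following is only a plain-Python sketch of the described behaviour, not the original Java class. The function name, the word list, and the `count` limit are assumptions. A real spout keeps emitting indefinitely, while this generator stops after `count` tuples so that it can be run as a demo.

```python
import random
import time


def random_word_spout(words, count, interval_s=0.1, rng=None):
    """Yield `count` one-field tuples, each holding a random word, pausing between emits."""
    rng = rng or random.Random()
    for _ in range(count):
        yield (rng.choice(words),)   # a tuple with a single field
        time.sleep(interval_s)       # 0.1 s corresponds to the 100 ms on the slide


# The word list is made up; interval_s=0 keeps the demo instant.
words = ["storm", "hadoop", "tuple"]
emitted = list(random_word_spout(words, count=5, interval_s=0))
```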
Sample code (3/3)
• InterrogativeBolt




  appends a question mark to the first field of the tuple, then emits it
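As with the spout, the bolt's code is not in this transcript. A plain-Python sketch of the described transformation might look like this. The function name is hypothetical, and passing any remaining fields through unchanged is an assumption, since the slide only mentions the first field.

```python
def interrogative_bolt(tup):
    """Return a new tuple whose first field has a question mark appended."""
    first, *rest = tup
    # Assumption: fields after the first are passed through unchanged.
    return (f"{first}?", *rest)


result = interrogative_bolt(("storm",))
```

The input tuple is left untouched and a new one is returned, which mirrors a bolt emitting a new tuple rather than modifying the one it received.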
Stream grouping
• Decides which task of the bolt a tuple is
  sent to

• ShuffleGrouping: tuples are distributed randomly across tasks
• FieldsGrouping: tuples with the same values in the named fields go to the same task
• Global grouping, All grouping, None grouping,
  Direct grouping
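The routing decision can be sketched in plain Python (not Storm's implementation). The function name, the grouping strings, and the use of CRC32 as the hash are assumptions; the behaviour of global grouping (lowest task id) and all grouping (every task) follows the editor's notes of this deck. None and direct groupings are left out of the sketch.

```python
import random
import zlib


def target_tasks(grouping, tup, num_tasks, field_indexes=(0,), rng=random):
    """Return the list of task ids (0..num_tasks-1) that should receive `tup`."""
    if grouping == "shuffle":
        return [rng.randrange(num_tasks)]          # any task, chosen at random
    if grouping == "fields":
        key = repr([tup[i] for i in field_indexes]).encode("utf-8")
        # crc32 is stable across runs, unlike Python's built-in hash() for strings.
        return [zlib.crc32(key) % num_tasks]
    if grouping == "global":
        return [0]                                 # the task with the lowest id
    if grouping == "all":
        return list(range(num_tasks))              # every task gets a copy
    raise ValueError(f"unsupported grouping: {grouping}")
```

With fields grouping on the first field, `("storm", 1)` and `("storm", 2)` land on the same task, which is what makes per-key aggregation such as counting possible inside a single task.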
Local/distributed mode
More abstractions
• Distributed RPC server

• Transactional/Batch

• Trident

• https://github.com/nathanmarz/storm/wiki
  – http://groups.google.com/group/storm-user
Popular use cases
• Continuous/realtime query with low latency
  – analyzing, monitoring, statistics, classifying, …


• Back-end processing for streaming data
  – automated scoring, log processing/auditing, …


• Distributed, high-volume data processing
  – ETL, realtime integration/synchronization, …
Storm integration
• Data to Storm
  – storm-jms, storm-kafka, storm-redis-pubsub,
    storm-scribe, storm-contrib-sqs, …

• Storm to databases
  – storm-cassandra, storm-hbase, storm-contrib-mongo,
    storm-state, storm-rdbms, …


• Polyglotism (language agnostic)
  – Clojure, Java, python, ruby, PHP, Perl, JRuby, …
Storm dependencies
• Java 5+, Clojure

• ZeroMQ 2.1.7-, JZMQ, Python 2.6+

• Thrift, ZooKeeper, Kryo, Jetty, …
  – slf4j, joda, snakeyaml, guava, …
Storm UI
In production
• https://github.com/nathanmarz/storm/wiki/Powered-By
Q&A

Thank you!
Bonus
• I wanna know how many queries I get
  – Per second, minute, day, week
• Results should be available
  – within <2 seconds 99.8+% of the time
  – within 50 seconds almost always
• History should last >2 years
• Should work for 0.01 q/s up to 50,000 q/s
• Failure tolerant, yadda, yadda
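The counting requirement above can be sketched in plain Python as a set of per-bucket counters, one per granularity. This is only an illustration under stated assumptions: the names are hypothetical, timestamps are epoch seconds, and buckets are aligned to the epoch rather than to calendar days or weeks. It does not address the latency, history, or fault-tolerance requirements.

```python
from collections import Counter

# Bucket sizes in seconds for each granularity named on the slide.
BUCKET_SECONDS = {"second": 1, "minute": 60, "day": 86_400, "week": 604_800}


def count_queries(timestamps):
    """Count queries per bucket for each granularity; timestamps are epoch seconds."""
    counts = {name: Counter() for name in BUCKET_SECONDS}
    for ts in timestamps:
        for name, size in BUCKET_SECONDS.items():
            counts[name][int(ts // size)] += 1   # bucket index = floor(ts / size)
    return counts


counts = count_queries([0.2, 0.9, 1.5, 61.0])
```

In a streaming setting the same update would run once per incoming query, so each counter is always current up to the last query seen.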
Real-time and Long-time together
[Timeline diagram along t, ending at “now”: Hadoop works great back here (older data); Storm works here (the most recent data); together they form a blended view]
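One way to picture the blended view is a query that reads older buckets from the batch results and recent buckets from the realtime results. The plain-Python sketch below assumes per-bucket counts stored in dictionaries and a cutoff before which the batch results are complete. All names and numbers are made up for illustration.

```python
def blended_count(batch_view, realtime_view, batch_cutoff, start, end):
    """Sum per-bucket counts in [start, end): batch view before the cutoff, realtime after."""
    total = 0
    for bucket in range(start, end):
        view = batch_view if bucket < batch_cutoff else realtime_view
        total += view.get(bucket, 0)
    return total


batch_view = {0: 10, 1: 12, 2: 9}    # complete up to, but not including, bucket 3
realtime_view = {2: 8, 3: 7, 4: 5}   # recent buckets; the overlapping bucket 2 is ignored
total = blended_count(batch_view, realtime_view, batch_cutoff=3, start=0, end=5)
```

Taking the batch value wherever one exists means that any approximate realtime count is replaced once the batch job has caught up with that bucket.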

Editor's Notes

  1. Map takes one pair of data with a type in one data domain and returns a list of pairs in a different domain: Map(k1, v1) -> list(k2, v2). The pairs are grouped by key k2, and the Reduce function is then applied in parallel to each group, which in turn produces a collection of values in the same domain: Reduce(k2, list(v2)) -> list(v3)
  2. A bolt can subscribe to an unlimited number of streams, by chaining groupings.
  3. declareOutputFields is used to declare streams and their schemas. It is possible to declare several streams and specify the stream to use when outputting tuples in the emit function call.
  4. Each spout or bolt runs X instances in parallel (called tasks). All grouping: send to all tasks. Global grouping: pick the task with the lowest id.
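The Map and Reduce signatures in note 1 can be illustrated with a word count in plain Python. This is a single-process sketch with hypothetical function names, not Hadoop's API; it only shows how the key and value types flow from Map through grouping to Reduce.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map(k1, v1) -> list(k2, v2): one (word, 1) pair per word in the text."""
    return [(word, 1) for word in text.split()]


def reduce_fn(word, counts):
    """Reduce(k2, list(v2)) -> list(v3): a single total per word."""
    return [sum(counts)]


def map_reduce(inputs):
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)   # group the intermediate pairs by key
    return {k2: reduce_fn(k2, v2s) for k2, v2s in groups.items()}


result = map_reduce([("d1", "storm hadoop storm"), ("d2", "hadoop")])
```

Here k1 is a document id, v1 its text, k2 a word, v2 the number 1, and v3 the total for that word.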