SlideShare a Scribd company logo
PrashanthBabuV V
@P7h
https://twitter.com/P7h
https://github.com/P7h
http://P7h.org
http://about.me/Prashanth
Laptop with anyOS
JDK v7.x installed
Mavenv3.0.5+ installed
IDE[either Eclipse with m2eclipse plugin or IntelliJ IDEA]
CreatedTwitter app for retrieving tweets
Cloned ordownloaded Storm Projectsfrom my GitHub Account:
 https://github.com/P7h/StormWordCount
 https://github.com/P7h/StormTweetsWordCount
Prerequisites for Workshop
Big Data
Batchvs. Real-time processing
Intro to Storm
CompaniesusingStorm
Storm Dependencies
Storm Concepts
Anatomyof Storm Cluster
Live codinga use caseusingStormTopology
Storm vs. Hadoop
Ag e n d a
Dataoverload everysecond
Batch vs. Real-time Processing
Batch processing
 Gatheringof dataand processingas a group at one time.
Real-time processing
 Processingof datathat takes place as theinformation is being entered.
Event Processing
Simple Event Processing
 Actingon a single event,filter in theESP
Event Stream Processing
 Looking across multiple events
Complex Event Processing
 Looking across multiple eventsfrom multipleevent streams
Storm
Createdby Nathan Marz @BackType
 Analyzetweets, links,users on Twitter
Open sourced on 19th September, 2011
 EclipsePublic License 1.0
 Stormv0.5.2
 16k Java and 7k Clojure LOC
Latest Updates
 Currentstable release v0.8.2released on 11th January,2013
 Major core improvementsplanned for v0.9.0
 Stormwill be anApacheProject [soon..]
Storm
Open source distributedreal-time computation system
Hadoop of real-time
Fast
Scalable
Fault-tolerant
Guarantees datawill be processed
Programminglanguage agnostic
Easyto set up and operate
Excellent documentation
Polyglotism(language agnostic) – Clojure,Java, Python, Ruby, PHP,
Perl,… and yes, evenJavaScript
https://github.com/nathanmarz/storm-starter/blob/master/multilang/resources/splitsentence.py
https://github.com/nathanmarz/storm-starter/blob/master/multilang/resources/splitsentence.rb
Real-time analytics
Stream processing
Online machine learning
Continuous computation
Distributed RPC
Extract,Transform and Load (ETL)
Use cases
http://tweitgeist.colinsurprenant.com/
https://github.com/nathanmarz/storm/wiki/Powered-By
CompaniesusingStorm
enables the
convergenceof Big Data and
low-latency processing.
Empowers stream / micro-batch
processing of user events,
content feeds and application
logs.
https://github.com/P7h/storm-camel-example
Clojure
 a dialect of the Lisp programminglanguagerunson the JVM, CLR, andJavaScript engines
ApacheThrift
 Cross languagebridge,RPC; Frameworktobuild services
ØMQ
 Asynchronous messagetransportlayer
Jetty
 Embeddedweb server
Stormunderthe hood
Stormunderthe hood
ApacheZooKeeper
 Distributed system, used to store metadata
LMAX Disruptor
 High performancequeuesharedbythreads
Kryo
 Serializationframework
Misc.
 SLF4J,Python, Java 5+,JZMQ, JODA,Guava
Tuples
Main data structure in Storm.
An ordered list of objects.
 (“user”, “Prashanth”, “Babu”, “Engineer”, “Bangalore“)
Key-valuepairs – keys are strings, values canbe of anytype.
Tuple
Streams
Unbounded sequence of tuples.
Edges in the topology.
Defined with a schema.
Tuple
Tuple
Tuple
Tuple
Tuple
Spouts
Source of streams.
Spouts are like sourcesin a graph.
Examples are API Calls, log files,
event data,queues, Kestrel,AMQP,
JMS, Kafka, etc.
BaseRichSpout
Bolts
Processinput streams and [might] produce new streams.
Can do anything i.e. filtering, streaming joins, aggregations,readfrom
/ write to databases, APIs, run arbitraryfunctions, etc.
All sinks in the topology are bolts but not all bolts are sinks.
Tuple Tuple Tuple
Bolts
Topology
Networkof spouts and bolts.
Can be visualized like a graph.
Container for application logic.
Analogous to a MapReduce job. But runs forever.
Sample Topology
[Sentence]
[Word, Count]
[Sentence]
RandomSentenceSpout
SplitSentenceBolt
SplitSentenceBolt
WordCountBolt
DBBolt /
JMSBolt
…………..
More such
bolts
RandomSentenceSpout
https://github.com/P7h/StormWordCount
Stream Groupings
Each Spout or Bolt might be running n instances in parallel[tasks].
Groupings are used to decide which task in the subscribing bolt, the
tuple is sent to.
Grouping Feature
Shuffle Randomgrouping
Fields Groupedby valuesuch that equal valueresults in sametask
All Replicates to all tasks
Global Makesall tuples goto one task
None MakesBolt run in the samethread as the Bolt / Spout it subscribesto
Direct Producer(task that emits) controls which Consumerwill receive
Local or Shuffle If the targetbolt has one or moretasks in the sameworkerprocess,
tuples will be shuffled to just those in-process tasks
Storm Cluster
NIMBUS
ZooKeeper#1
ZooKeeper#2
ZooKeeper#n
Supervisor#1
Supervisor#2
Supervisor#3
Supervisor#n
UI
Workers
Workers
Workers
Workers
Storm Cluster
Nimbus daemon is the masterof this cluster.
 Managestopologies.
 Comparable to HadoopJobTracker.
Supervisor daemon spawns workers.
 Comparable to HadoopTaskTracker.
Workersare spawned bysupervisors.
 One per port defined in storm.yaml configuration.
Storm Cluster [contd..]
Task is run as a thread in workers.
Zookeeperis a distributed system, used to store metadata.
UI is a webapp which gives diagnostics on the cluster and topologies.
Nimbus and Supervisor daemons are fail-fast and stateless.
 State is storedin Zookeeper.
Storm – Modes of operation
Localmode
 Develop,test and debug topologies onyour localmachine.
 Mavenis used to include Storm asa dev dependency for the project.
mvn clean compile package && java -jar target/storm-wordcount-1.0-SNAPSHOT-jar-
with-dependencies.jar
Remote [or Production]mode
 Topologies are submitted for executionon a clusterof machines.
 Cluster information is added in storm.yaml file.
 More details on storm.yaml file canbe found here:
https://github.com/nathanmarz/storm/wiki/Setting-up-a-Storm-cluster#fill-in-mandatory-configurations-
into-stormyaml
storm jar target/storm-wordcount-1.0-SNAPSHOT.jar
org.p7h.storm.offline.wordcount.topology.WordCountTopology WordCount
Storm – Modes of operation [contd..]
StormUI – ClusterSummary
StormUI – TopologySummary
StormUI – ComponentSummary
Code Sample – Topology
Code Sample– Spout
Code Sample– Bolt#1
Code Sample– Bolt#2
Problem#1 – WordCount [if there are internetissues]
Create a Spout which feeds randomsentences [you can define your own
set of randomsentences].
Create a Bolt which receivessentencesfrom the Spout and then splits
them into words and forwardsthem to next bolt.
Create another Bolt to count the words.
https://github.com/P7h/StormWordCount
Create a Spout which gets data from Twitter [please use Twitter4Jand
OAUTH Credentials to get tweets using Streaming API].
 Forsimplicity consideronly tweets which arein English.
 Emit only the stuff which we areinterested,i.e.A tweet’s getRetweetedStatus().
Create another Bolt to count the count the retweetsof a particular
tweet.
 Makean in-memoryMapwith retweetscreen name andthe counterof the retweet asthe
value.
 Logthe counterevery few seconds / minutes [should be configurable].
https://github.com/P7h/StormTopRetweets
Problem#2 – Top5 retweetedtweets[if internetworks fine]
Storm vs. Hadoop
Batch processing
Jobs runto completion
[Pre-YARN]NameNode is SPOF
Statefulnodes
Scalable
Guarantees nodata loss
Open source
Real-time processing
Topologies run forever
No SPOF
Stateless nodes
Scalable
Gurantees nodataloss
Open source
t
now
Hadoop works great back here Stormworks here
Hadoop AND Storm
Blended
view
Blended
view
Blended View
Convergenceofbatchand low-latencyprocessing
PersonalizationbasedonUserInterests
HadoopAND Stormat Yahoo
Advanced Topics [not covered in this session]
Distributed RPC
Transactional topologies
Trident
Unit testing
Patterns
References
This Slide deck[on slideshare] – http://j.mp/5thEleStorm_SS
This Slide deck[on speakerdeck] – http://j.mp/5thEleStorm_SD
MyGitHub Account for code repos – https://github.com/P7h
Bit.ly Bundle for Storm curatedby me – http://j.mp/YrDgcs
PrashanthBabuV V
Follow me on Twitter: @P7h

More Related Content

What's hot

Data analytics in the cloud with Jupyter notebooks.
Data analytics in the cloud with Jupyter notebooks.Data analytics in the cloud with Jupyter notebooks.
Data analytics in the cloud with Jupyter notebooks.
Graham Dumpleton
 
[BGOUG] Java GC - Friend or Foe
[BGOUG] Java GC - Friend or Foe[BGOUG] Java GC - Friend or Foe
[BGOUG] Java GC - Friend or Foe
SAP HANA Cloud Platform
 
Tracing python applications
Tracing python applicationsTracing python applications
Tracing python applications
Nikolay Stoitsev
 
Juggva cloud
Juggva cloudJuggva cloud
Juggva cloud
Jean-Frederic Clere
 
Introduction to IPython & Jupyter Notebooks
Introduction to IPython & Jupyter NotebooksIntroduction to IPython & Jupyter Notebooks
Introduction to IPython & Jupyter Notebooks
Eueung Mulyana
 
Sneaky computation
Sneaky computationSneaky computation
Sneaky computation
Juan J. Merelo
 
Threading Successes 04 Hellgate
Threading Successes 04   HellgateThreading Successes 04   Hellgate
Threading Successes 04 Hellgateguest40fc7cd
 
Interactive Data Analysis for End Users on HN Science Cloud
Interactive Data Analysis for End Users on HN Science CloudInteractive Data Analysis for End Users on HN Science Cloud
Interactive Data Analysis for End Users on HN Science Cloud
Helix Nebula The Science Cloud
 

What's hot (8)

Data analytics in the cloud with Jupyter notebooks.
Data analytics in the cloud with Jupyter notebooks.Data analytics in the cloud with Jupyter notebooks.
Data analytics in the cloud with Jupyter notebooks.
 
[BGOUG] Java GC - Friend or Foe
[BGOUG] Java GC - Friend or Foe[BGOUG] Java GC - Friend or Foe
[BGOUG] Java GC - Friend or Foe
 
Tracing python applications
Tracing python applicationsTracing python applications
Tracing python applications
 
Juggva cloud
Juggva cloudJuggva cloud
Juggva cloud
 
Introduction to IPython & Jupyter Notebooks
Introduction to IPython & Jupyter NotebooksIntroduction to IPython & Jupyter Notebooks
Introduction to IPython & Jupyter Notebooks
 
Sneaky computation
Sneaky computationSneaky computation
Sneaky computation
 
Threading Successes 04 Hellgate
Threading Successes 04   HellgateThreading Successes 04   Hellgate
Threading Successes 04 Hellgate
 
Interactive Data Analysis for End Users on HN Science Cloud
Interactive Data Analysis for End Users on HN Science CloudInteractive Data Analysis for End Users on HN Science Cloud
Interactive Data Analysis for End Users on HN Science Cloud
 

Similar to Storm @ Fifth Elephant 2013

Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Folio3 Software
 
Introduction to Streaming Distributed Processing with Storm
Introduction to Streaming Distributed Processing with StormIntroduction to Streaming Distributed Processing with Storm
Introduction to Streaming Distributed Processing with Storm
Brandon O'Brien
 
Distributed Realtime Computation using Apache Storm
Distributed Realtime Computation using Apache StormDistributed Realtime Computation using Apache Storm
Distributed Realtime Computation using Apache Storm
the100rabh
 
C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
C*ollege Credit: CEP Distribtued Processing on Cassandra with StormC*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
DataStax
 
storm-170531123446.pptx
storm-170531123446.pptxstorm-170531123446.pptx
storm-170531123446.pptx
IbrahimBenhadhria
 
Monitoring MySQL with DTrace/SystemTap
Monitoring MySQL with DTrace/SystemTapMonitoring MySQL with DTrace/SystemTap
Monitoring MySQL with DTrace/SystemTap
Padraig O'Sullivan
 
Storm
StormStorm
ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...
ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...
ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...
DynamicInfraDays
 
Real time stream processing presentation at General Assemb.ly
Real time stream processing presentation at General Assemb.lyReal time stream processing presentation at General Assemb.ly
Real time stream processing presentation at General Assemb.ly
Varun Vijayaraghavan
 
PuppetDB: Sneaking Clojure into Operations
PuppetDB: Sneaking Clojure into OperationsPuppetDB: Sneaking Clojure into Operations
PuppetDB: Sneaking Clojure into Operationsgrim_radical
 
Introduction to apache_cassandra_for_developers-lhg
Introduction to apache_cassandra_for_developers-lhgIntroduction to apache_cassandra_for_developers-lhg
Introduction to apache_cassandra_for_developers-lhgzznate
 
Confitura 2018 — Apache Beam — Promyk Nadziei Data Engineera
Confitura 2018 — Apache Beam — Promyk Nadziei Data EngineeraConfitura 2018 — Apache Beam — Promyk Nadziei Data Engineera
Confitura 2018 — Apache Beam — Promyk Nadziei Data Engineera
Piotr Wikiel
 
Node js introduction
Node js introductionNode js introduction
Node js introduction
Joseph de Castelnau
 
Sinfonier: How I turned my grandmother into a data analyst - Fran J. Gomez - ...
Sinfonier: How I turned my grandmother into a data analyst - Fran J. Gomez - ...Sinfonier: How I turned my grandmother into a data analyst - Fran J. Gomez - ...
Sinfonier: How I turned my grandmother into a data analyst - Fran J. Gomez - ...
Codemotion
 
Kubernetes for the PHP developer
Kubernetes for the PHP developerKubernetes for the PHP developer
Kubernetes for the PHP developer
Paul Czarkowski
 
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Jason Dai
 
Porting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustPorting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to Rust
Evan Chan
 
Breaking The Clustering Limits @ AlphaCSP JavaEdge 2007
Breaking The Clustering Limits @ AlphaCSP JavaEdge 2007Breaking The Clustering Limits @ AlphaCSP JavaEdge 2007
Breaking The Clustering Limits @ AlphaCSP JavaEdge 2007Baruch Sadogursky
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
DataWorks Summit/Hadoop Summit
 
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Alluxio, Inc.
 

Similar to Storm @ Fifth Elephant 2013 (20)

Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
 
Introduction to Streaming Distributed Processing with Storm
Introduction to Streaming Distributed Processing with StormIntroduction to Streaming Distributed Processing with Storm
Introduction to Streaming Distributed Processing with Storm
 
Distributed Realtime Computation using Apache Storm
Distributed Realtime Computation using Apache StormDistributed Realtime Computation using Apache Storm
Distributed Realtime Computation using Apache Storm
 
C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
C*ollege Credit: CEP Distribtued Processing on Cassandra with StormC*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
 
storm-170531123446.pptx
storm-170531123446.pptxstorm-170531123446.pptx
storm-170531123446.pptx
 
Monitoring MySQL with DTrace/SystemTap
Monitoring MySQL with DTrace/SystemTapMonitoring MySQL with DTrace/SystemTap
Monitoring MySQL with DTrace/SystemTap
 
Storm
StormStorm
Storm
 
ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...
ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...
ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...
 
Real time stream processing presentation at General Assemb.ly
Real time stream processing presentation at General Assemb.lyReal time stream processing presentation at General Assemb.ly
Real time stream processing presentation at General Assemb.ly
 
PuppetDB: Sneaking Clojure into Operations
PuppetDB: Sneaking Clojure into OperationsPuppetDB: Sneaking Clojure into Operations
PuppetDB: Sneaking Clojure into Operations
 
Introduction to apache_cassandra_for_developers-lhg
Introduction to apache_cassandra_for_developers-lhgIntroduction to apache_cassandra_for_developers-lhg
Introduction to apache_cassandra_for_developers-lhg
 
Confitura 2018 — Apache Beam — Promyk Nadziei Data Engineera
Confitura 2018 — Apache Beam — Promyk Nadziei Data EngineeraConfitura 2018 — Apache Beam — Promyk Nadziei Data Engineera
Confitura 2018 — Apache Beam — Promyk Nadziei Data Engineera
 
Node js introduction
Node js introductionNode js introduction
Node js introduction
 
Sinfonier: How I turned my grandmother into a data analyst - Fran J. Gomez - ...
Sinfonier: How I turned my grandmother into a data analyst - Fran J. Gomez - ...Sinfonier: How I turned my grandmother into a data analyst - Fran J. Gomez - ...
Sinfonier: How I turned my grandmother into a data analyst - Fran J. Gomez - ...
 
Kubernetes for the PHP developer
Kubernetes for the PHP developerKubernetes for the PHP developer
Kubernetes for the PHP developer
 
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
 
Porting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustPorting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to Rust
 
Breaking The Clustering Limits @ AlphaCSP JavaEdge 2007
Breaking The Clustering Limits @ AlphaCSP JavaEdge 2007Breaking The Clustering Limits @ AlphaCSP JavaEdge 2007
Breaking The Clustering Limits @ AlphaCSP JavaEdge 2007
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
 
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
 

Recently uploaded

Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
g2nightmarescribd
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 

Recently uploaded (20)

Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 

Storm @ Fifth Elephant 2013

  • 3.
  • 4. Laptop with anyOS JDK v7.x installed Mavenv3.0.5+ installed IDE[either Eclipse with m2eclipse plugin or IntelliJ IDEA] CreatedTwitter app for retrieving tweets Cloned ordownloaded Storm Projectsfrom my GitHub Account:  https://github.com/P7h/StormWordCount  https://github.com/P7h/StormTweetsWordCount Prerequisites for Workshop
  • 5. Big Data Batchvs. Real-time processing Intro to Storm CompaniesusingStorm Storm Dependencies Storm Concepts Anatomyof Storm Cluster Live codinga use caseusingStormTopology Storm vs. Hadoop Ag e n d a
  • 6.
  • 7.
  • 8.
  • 10. Batch vs. Real-time Processing Batch processing  Gatheringof dataand processingas a group at one time. Real-time processing  Processingof datathat takes place as theinformation is being entered.
  • 11. Event Processing Simple Event Processing  Actingon a single event,filter in theESP Event Stream Processing  Looking across multiple events Complex Event Processing  Looking across multiple eventsfrom multipleevent streams
  • 12.
  • 13. Storm Createdby Nathan Marz @BackType  Analyzetweets, links,users on Twitter Open sourced on 19th September, 2011  EclipsePublic License 1.0  Stormv0.5.2  16k Java and 7k Clojure LOC Latest Updates  Currentstable release v0.8.2released on 11th January,2013  Major core improvementsplanned for v0.9.0  Stormwill be anApacheProject [soon..]
  • 14. Storm Open source distributedreal-time computation system Hadoop of real-time Fast Scalable Fault-tolerant Guarantees datawill be processed Programminglanguage agnostic Easyto set up and operate Excellent documentation
  • 15.
  • 16. Polyglotism(language agnostic) – Clojure,Java, Python, Ruby, PHP, Perl,… and yes, evenJavaScript https://github.com/nathanmarz/storm-starter/blob/master/multilang/resources/splitsentence.py https://github.com/nathanmarz/storm-starter/blob/master/multilang/resources/splitsentence.rb
  • 17. Real-time analytics Stream processing Online machine learning Continuous computation Distributed RPC Extract,Transform and Load (ETL) Use cases
  • 20.
  • 21.
  • 22.
  • 23. enables the convergenceof Big Data and low-latency processing. Empowers stream / micro-batch processing of user events, content feeds and application logs.
  • 24.
  • 26.
  • 27. Clojure  a dialect of the Lisp programminglanguagerunson the JVM, CLR, andJavaScript engines ApacheThrift  Cross languagebridge,RPC; Frameworktobuild services ØMQ  Asynchronous messagetransportlayer Jetty  Embeddedweb server Stormunderthe hood
  • 28. Stormunderthe hood ApacheZooKeeper  Distributed system, used to store metadata LMAX Disruptor  High performancequeuesharedbythreads Kryo  Serializationframework Misc.  SLF4J,Python, Java 5+,JZMQ, JODA,Guava
  • 29.
  • 30. Tuples Main data structure in Storm. An ordered list of objects.  (“user”, “Prashanth”, “Babu”, “Engineer”, “Bangalore“) Key-valuepairs – keys are strings, values canbe of anytype. Tuple
  • 31. Streams Unbounded sequence of tuples. Edges in the topology. Defined with a schema. Tuple Tuple Tuple Tuple Tuple
  • 32. Spouts Source of streams. Spouts are like sourcesin a graph. Examples are API Calls, log files, event data,queues, Kestrel,AMQP, JMS, Kafka, etc.
  • 34. Bolts Processinput streams and [might] produce new streams. Can do anything i.e. filtering, streaming joins, aggregations,readfrom / write to databases, APIs, run arbitraryfunctions, etc. All sinks in the topology are bolts but not all bolts are sinks. Tuple Tuple Tuple
  • 35. Bolts
  • 36. Topology Networkof spouts and bolts. Can be visualized like a graph. Container for application logic. Analogous to a MapReduce job. But runs forever.
  • 37. Sample Topology [Sentence] [Word, Count] [Sentence] RandomSentenceSpout SplitSentenceBolt SplitSentenceBolt WordCountBolt DBBolt / JMSBolt ………….. More such bolts RandomSentenceSpout https://github.com/P7h/StormWordCount
  • 38. Stream Groupings Each Spout or Bolt might be running n instances in parallel[tasks]. Groupings are used to decide which task in the subscribing bolt, the tuple is sent to. Grouping Feature Shuffle Randomgrouping Fields Groupedby valuesuch that equal valueresults in sametask All Replicates to all tasks Global Makesall tuples goto one task None MakesBolt run in the samethread as the Bolt / Spout it subscribesto Direct Producer(task that emits) controls which Consumerwill receive Local or Shuffle If the targetbolt has one or moretasks in the sameworkerprocess, tuples will be shuffled to just those in-process tasks
  • 39.
  • 41. Storm Cluster Nimbus daemon is the masterof this cluster.  Managestopologies.  Comparable to HadoopJobTracker. Supervisor daemon spawns workers.  Comparable to HadoopTaskTracker. Workersare spawned bysupervisors.  One per port defined in storm.yaml configuration.
  • 42. Storm Cluster [contd..] Task is run as a thread in workers. Zookeeperis a distributed system, used to store metadata. UI is a webapp which gives diagnostics on the cluster and topologies. Nimbus and Supervisor daemons are fail-fast and stateless.  State is storedin Zookeeper.
  • 43. Storm – Modes of operation Localmode  Develop,test and debug topologies onyour localmachine.  Mavenis used to include Storm asa dev dependency for the project. mvn clean compile package && java -jar target/storm-wordcount-1.0-SNAPSHOT-jar- with-dependencies.jar
  • 44. Remote [or Production]mode  Topologies are submitted for executionon a clusterof machines.  Cluster information is added in storm.yaml file.  More details on storm.yaml file canbe found here: https://github.com/nathanmarz/storm/wiki/Setting-up-a-Storm-cluster#fill-in-mandatory-configurations- into-stormyaml storm jar target/storm-wordcount-1.0-SNAPSHOT.jar org.p7h.storm.offline.wordcount.topology.WordCountTopology WordCount Storm – Modes of operation [contd..]
  • 45.
  • 49. Code Sample – Topology
  • 53.
  • 54. Problem#1 – WordCount [if there are internetissues] Create a Spout which feeds randomsentences [you can define your own set of randomsentences]. Create a Bolt which receivessentencesfrom the Spout and then splits them into words and forwardsthem to next bolt. Create another Bolt to count the words. https://github.com/P7h/StormWordCount
  • 55. Create a Spout which gets data from Twitter [please use Twitter4Jand OAUTH Credentials to get tweets using Streaming API].  Forsimplicity consideronly tweets which arein English.  Emit only the stuff which we areinterested,i.e.A tweet’s getRetweetedStatus(). Create another Bolt to count the count the retweetsof a particular tweet.  Makean in-memoryMapwith retweetscreen name andthe counterof the retweet asthe value.  Logthe counterevery few seconds / minutes [should be configurable]. https://github.com/P7h/StormTopRetweets Problem#2 – Top5 retweetedtweets[if internetworks fine]
  • 56.
  • 57. Storm vs. Hadoop Batch processing Jobs runto completion [Pre-YARN]NameNode is SPOF Statefulnodes Scalable Guarantees nodata loss Open source Real-time processing Topologies run forever No SPOF Stateless nodes Scalable Gurantees nodataloss Open source
  • 58. t now Hadoop works great back here Stormworks here Hadoop AND Storm Blended view Blended view Blended View
  • 60. Advanced Topics [not covered in this session] Distributed RPC Transactional topologies Trident Unit testing Patterns
  • 61. References This Slide deck[on slideshare] – http://j.mp/5thEleStorm_SS This Slide deck[on speakerdeck] – http://j.mp/5thEleStorm_SD MyGitHub Account for code repos – https://github.com/P7h Bit.ly Bundle for Storm curatedby me – http://j.mp/YrDgcs
  • 62.
  • 63. PrashanthBabuV V Follow me on Twitter: @P7h

Editor's Notes

  1. https://fifthelephant.in/2013/workshops
  2. https://twitter.com/jeremyckahn/status/322390046491688960
  3. http://visual.ly/what-big-data
  4. https://twitter.com/Bill_Gross/statuses/329069954911580160
  5. Advantages of batch processing:>> Once the data process is started, the computer can be left running without supervision.>> Large amount of transactions can be combined into a batch rather than processing them each individually.>> Is cost efficient for an organization. Batch processing is only carried out when needed and uses less computer processing time.Disadvantages of batch processing:>> With batch processing there is a time delay before the work is processed and returned.>> Jobs must be run through to completion, and any changes in the data require reprocessing of the batch job itself.>> In batch processing, the master file is not always kept up to date.>> Batch processing can be slow.>> Complete or no insight into the outcomeAdvantages of real-time processing:>> Any errors in the processing of data are caught at the time the data is entered and can be fixed when real-time processing is used.>> Real-time processing provides better control over inventory and inventory turnover.>> Real-time processing reduces the delay in analyzing.Disadvantages of real-time processing:>> Real-time processing is more complex and expensive than batch systems. >> A real-time processing system is harder to audit.>> The backup of real time processing is necessary to retain the data.
  6. Open sourced on 19th September, 2011 [post Twitter acquisition of BackType]Current stable release [as of 11th July, 2013] v0.8.2 released on 11th January, 2013
  7. https://github.com/nathanmarz/storm-starter/tree/master/multilang/resourceshttp://storm-project.net/about/multi-language.html
  8. Real-time analytics – trending topics in Twitter
  9. http://tweitgeist.colinsurprenant.com/
  10. http://storm-project.net/https://github.com/nathanmarz/storm/wiki/Powered-By
  11. https://github.com/nathanmarz/storm/wiki/Powered-Byhttps://speakerdeck.com/mariuszgil/streams-processing-with-strom
  12. https://github.com/nathanmarz/storm/wiki/Powered-Byhttps://speakerdeck.com/mariuszgil/streams-processing-with-strom
  13. https://github.com/nathanmarz/storm/wiki/Powered-Byhttps://speakerdeck.com/mariuszgil/streams-processing-with-strom
  14. http://developer.yahoo.com/blogs/ydn/storm-yarn-released-open-source-143745133.htmlhttps://github.com/nathanmarz/storm/wiki/Powered-Byhttps://speakerdeck.com/mariuszgil/streams-processing-with-strom
  15. https://github.com/nathanmarz/storm/wiki/Concepts
  16. https://github.com/P7h/StormTopRetweets
  17. http://developer.yahoo.com/blogs/ydn/storm-hadoop-convergence-big-data-low-latency-processing-54503.html
  18. https://github.com/nathanmarz/storm/wiki/Tutorial
  19. https://github.com/nathanmarz/storm/wiki/Tutorial