WHEN STORM HITS DATA.
DATA STREAMS PROCESSING IN REAL TIME.
MARCIN STANISLAWSKI
WHO AM I?
Architect/Developer at Interia.pl
Storm and Hadoop user
Github: webik
Twitter: @unilama
BIG DATA
HADOOP
WELCOME TO THE ZOO
RUN JOB
COFFEE BREAK*
RESULTS
* there are some solutions
IMPALA
implemented in C++
non-MapReduce solution
KIJI
KijiREST
HDFS/HBase/Cassandra
BATCH PROCESSING VS. STREAMING
STREAMING SOLUTIONS
Yahoo S4
Akka
Spark Streaming
Storm
STORM WHAT IS THAT?
README.MD
Storm is a distributed realtime computation system.
Storm is simple, can be used with any programming
language, and is a lot of fun to use!
CURRENT STATUS
Apache Incubation
Included in Hortonworks Data Platform
Contributed by Yahoo
Easy deploy to Amazon EC2
WHO USES
BASIC IDEA
SPOUTS
TAKE EVENTS FROM:
Kafka
Kestrel
RabbitMQ
...
AND PASS THEM TO...
BOLTS
TUPLES ARE PROCESSED THE WAY YOU IMPLEMENT IT
EVENTS ARE TUPLES
( 1, "TEST", "ATMOSPHERE", "2014-05-20 10:00:40", ... )
OBJECTS ARE SERIALIZED USING KRYO
WRITTEN IN JAVA & CLOJURE
TOPOLOGIES ARE DAGS
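
The spout, bolt and DAG idea above can be sketched in a few lines of plain Python. This is not the Storm API: `run_topology`, the bolt names and the sample events are hypothetical, and everything runs in a single process, only to show how tuples flow along the edges of a directed acyclic graph.

```python
from collections import defaultdict

def run_topology(spout, bolts, edges):
    """Push every tuple from the spout through a DAG of bolts.

    bolts: name -> function(tuple) returning a list of output tuples
    edges: name -> list of downstream bolt names ("spout" is the source)
    Returns the tuples emitted by each bolt, keyed by bolt name.
    """
    emitted = defaultdict(list)

    def deliver(source, tup):
        # Hand the tuple to every bolt subscribed to this source.
        for target in edges.get(source, []):
            for out in bolts[target](tup):
                emitted[target].append(out)
                deliver(target, out)

    for tup in spout:
        deliver("spout", tup)
    return dict(emitted)

# Hypothetical events shaped like the tuple on the slide.
events = [
    (1, "TEST", "ATMOSPHERE", "2014-05-20 10:00:40"),
    (2, "PROD", "ATMOSPHERE", "2014-05-20 10:00:41"),
]
bolts = {
    # Filter-style bolt: passes only tuples marked "TEST".
    "only_test": lambda t: [t] if t[1] == "TEST" else [],
    # Transform-style bolt: emits (id, lower-cased name).
    "lowercase_name": lambda t: [(t[0], t[2].lower())],
}
edges = {"spout": ["only_test"], "only_test": ["lowercase_name"]}

result = run_topology(iter(events), bolts, edges)
```

In a real cluster the spout and each bolt would run as many parallel tasks on different workers; the sketch only keeps the wiring.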
ARCHITECTURE
Nimbus
Nodes (Supervisors)
UI
DRPC
EACH EVENT IS PROCESSED ONE OR MORE TIMES.
ACKING FRAMEWORK
Each tuple must be acked or failed
TUPLES TRACKING
each tuple has a random 64-bit id
XOR of all tuple ids that have been created
and/or acked in the tree
if the XOR value equals 0, the tuple tree is fully processed
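
The XOR trick can be illustrated with a small Python sketch. It is a simplified model, not Storm's acker code: `AckTracker` and its method names are hypothetical. Every id is XOR-ed in once when the tuple is created and once when it is acked, so the value returns to 0 exactly when each created tuple has also been acked.

```python
import random

class AckTracker:
    """Tracks one tuple tree with a single 64-bit XOR value."""

    def __init__(self, seed=0):
        self.ack_val = 0
        # Seeded only to make this example reproducible.
        self._rng = random.Random(seed)

    def new_id(self):
        # Random non-zero 64-bit id.
        return self._rng.getrandbits(64) or 1

    def created(self, tuple_id):
        self.ack_val ^= tuple_id

    def acked(self, tuple_id):
        # XOR-ing the same id a second time cancels it out.
        self.ack_val ^= tuple_id

    def fully_processed(self):
        return self.ack_val == 0

tracker = AckTracker()

# The spout emits a root tuple; a bolt emits three children anchored to it.
root = tracker.new_id()
tracker.created(root)
children = [tracker.new_id() for _ in range(3)]
for child in children:
    tracker.created(child)

tracker.acked(root)
done_early = tracker.fully_processed()  # children are still pending

for child in children:
    tracker.acked(child)
done = tracker.fully_processed()
```

The memory cost is constant per tuple tree regardless of its size. A false zero is possible in principle, but with random 64-bit ids it is extremely unlikely.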
COMMUNICATION
Between:
Tasks: LMAX Disruptor
Workers: ØMQ -> Netty
TRIDENT
high-level abstraction
like Cascading/Scalding in the Hadoop world
SPOUT
Key difference: producing Stream(s)
STREAM
Chain of batches, with the ability to multiply
STREAM OPERATIONS
Functions
Filters
Projections
Joins
Merges
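
As a rough illustration of these operations, here is a plain-Python sketch working on batches of tuples represented as dicts. The helper names (`apply_function`, `apply_filter`, `project`, `join`, `merge`) and the sample data are hypothetical and are not Trident's API; they only mirror the semantics: a function adds fields, a filter drops tuples, a projection keeps a subset of fields.

```python
def apply_function(batch, fn, out_field):
    # A function appends a new field computed from each tuple.
    return [{**t, out_field: fn(t)} for t in batch]

def apply_filter(batch, keep):
    # A filter decides per tuple whether it stays in the stream.
    return [t for t in batch if keep(t)]

def project(batch, fields):
    # A projection keeps only the listed fields.
    return [{f: t[f] for f in fields} for t in batch]

def merge(*batches):
    # A merge concatenates streams that share the same fields.
    return [t for b in batches for t in b]

def join(left, right, key):
    # An inner join of two batches on a shared key field.
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

clicks = [
    {"user": "a", "ms": 120},
    {"user": "b", "ms": 900},
    {"user": "a", "ms": 40},
]
users = [{"user": "a", "country": "PL"}]

slow = apply_filter(clicks, lambda t: t["ms"] >= 100)
with_sec = apply_function(slow, lambda t: t["ms"] / 1000, "sec")
joined = join(with_sec, users, "user")
out = project(joined, ["user", "country"])
both = merge(clicks, clicks)
```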
STATE
Operations:
Grouping
Aggregate
Query
STATE TYPES
non-transactional
transactional
opaque transactional
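
The difference between the three state types shows up when a batch is replayed. The sketch below models a counter under each strategy; the class names are hypothetical and storage is just in-memory attributes. It assumes a transactional spout always replays the same batch content for a given transaction id, while an opaque spout may replay a batch with different content.

```python
class NonTransactionalCount:
    """No bookkeeping: a replayed batch is counted twice."""
    def __init__(self):
        self.value = 0

    def update(self, txid, delta):
        self.value += delta

class TransactionalCount:
    """Stores the last txid next to the value and skips replays."""
    def __init__(self):
        self.txid = None
        self.value = 0

    def update(self, txid, delta):
        if txid == self.txid:
            return  # same batch again, already applied
        self.value += delta
        self.txid = txid

class OpaqueCount:
    """Also stores the previous value, so a replay can be recomputed."""
    def __init__(self):
        self.txid = None
        self.prev = 0
        self.value = 0

    def update(self, txid, delta):
        if txid == self.txid:
            # Replay with possibly different content: redo from prev.
            self.value = self.prev + delta
        else:
            self.prev = self.value
            self.value += delta
            self.txid = txid

def feed(state, batches):
    for txid, delta in batches:
        state.update(txid, delta)
    return state.value

same_replay = [(1, 5), (2, 3), (2, 3)]       # batch 2 replayed unchanged
changed_replay = [(1, 5), (2, 3), (2, 4)]    # batch 2 replayed with new content

non_tx = feed(NonTransactionalCount(), same_replay)
tx = feed(TransactionalCount(), same_replay)
opaque = feed(OpaqueCount(), changed_replay)
```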
STATE
In-memory state
NoSQL databases
External systems via APIs
DRPC
DRPC TOPOLOGY
NAMED DRPC SPOUT
USES MAIN TOPOLOGY STATES
GENERATES ONE TUPLE OUTPUT
DRPC ELEMENTS
THRIFT SERVER(S)
WITH PREDEFINED SPOUT
AND BOLT
ARE YOU PROGRAMMING IN A NON-JVM
LANGUAGE?
NO PROBLEM :)
Ruby
Python
Perl
PHP
...
STREAMING API
API defined as Thrift
JSON-based communication
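
A minimal sketch of what JSON-based communication with a non-JVM component could look like. It assumes the framing of Storm's multi-language protocol as I recall it (each JSON message is followed by a line containing `end`, with `emit` and `ack` commands); the field names should be checked against the Storm documentation, the initial handshake is left out, and `split_sentence_bolt` is a hypothetical bolt. The example works on strings instead of stdin/stdout so that it can be run anywhere.

```python
import json

def read_messages(lines):
    """Yield decoded JSON messages; each message ends with a line 'end'."""
    buf = []
    for line in lines:
        line = line.rstrip("\n")
        if line == "end":
            yield json.loads("\n".join(buf))
            buf = []
        else:
            buf.append(line)

def encode(msg):
    """Serialize one message in the assumed 'JSON + end line' framing."""
    return json.dumps(msg) + "\nend\n"

def split_sentence_bolt(tup):
    """Emit one word per tuple, anchored to the input, then ack the input."""
    out = [
        {"command": "emit", "anchors": [tup["id"]], "tuple": [word]}
        for word in tup["tuple"][0].split()
    ]
    out.append({"command": "ack", "id": tup["id"]})
    return out

# What the JVM side would write to the component's stdin (assumed shape).
incoming = encode({
    "id": "42", "comp": "1", "stream": "default", "task": 9,
    "tuple": ["storm hits data"],
})

replies = [
    reply
    for message in read_messages(incoming.splitlines())
    for reply in split_sentence_bolt(message)
]
# What the component would write back to stdout.
wire = "".join(encode(reply) for reply in replies)
```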
RED STORM
Writing topologies in Ruby
REAL TIME ALGORITHMS
SIMPLE OPERATIONS
Sum
Count
Multiplication
MAXIMUM AND MINIMUM
don't lose the current value
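
These simple operations only need a small amount of state that is updated per event. A sketch with a hypothetical `RunningStats` class: the point about minimum and maximum is that the current value must be kept between events (and between batches), because past events are no longer available for a rescan.

```python
class RunningStats:
    """Incremental sum, count, product, minimum and maximum."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.product = 1.0
        self.minimum = None
        self.maximum = None

    def update(self, x):
        self.count += 1
        self.total += x
        self.product *= x
        # Compare against the stored value instead of recomputing.
        self.minimum = x if self.minimum is None else min(self.minimum, x)
        self.maximum = x if self.maximum is None else max(self.maximum, x)

stats = RunningStats()
for value in [3, 1, 4, 2]:      # first batch
    stats.update(value)
after_first = (stats.minimum, stats.maximum)

for value in [0.5, 9]:          # a later batch reuses the same state
    stats.update(value)
```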
USUALLY TWO TOPOLOGIES
LEARNING
Classification
Clustering
MODEL
Evaluator
Visualiser
BASIC ELEMENT TABLE
SIMPLE EXAMPLE
ALGORITHM EXAMPLES
k-means clustering
statistical tests (T, F, Z, Chi2)
Hidden Markov Models
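
As one possible streaming variant of k-means, the sketch below uses the sequential update rule: each incoming point is assigned to its nearest centroid, and that centroid moves towards the point by 1/n, where n is the number of points it has absorbed. This is an assumption about how such an algorithm could be adapted to streams, not necessarily the variant used in the talk; `OnlineKMeans` and the seed centroids are hypothetical.

```python
class OnlineKMeans:
    """Sequential k-means: one centroid update per incoming point."""

    def __init__(self, centroids):
        self.centroids = [list(c) for c in centroids]
        # Each seed centroid counts as one already-seen point.
        self.counts = [1] * len(self.centroids)

    def nearest(self, point):
        # Index of the centroid with the smallest squared distance.
        def dist2(c):
            return sum((a - b) ** 2 for a, b in zip(c, point))
        return min(range(len(self.centroids)),
                   key=lambda i: dist2(self.centroids[i]))

    def update(self, point):
        i = self.nearest(point)
        self.counts[i] += 1
        rate = 1.0 / self.counts[i]
        # Move the centroid towards the point: c += (p - c) / n.
        self.centroids[i] = [
            c + rate * (p - c) for c, p in zip(self.centroids[i], point)
        ]
        return i

model = OnlineKMeans([(0.0, 0.0), (10.0, 10.0)])
assignments = [model.update(p) for p in [(1, 1), (9, 9), (2, 0)]]
```

`update` plays the role of the learning side, while `nearest` alone could serve as the evaluator that classifies new events against the current model.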
ADVERT TIME :)
STORMUNIT
http://github.com/webik/StormUnit
MAVEN MOJO - COMING SOON :)
http://github.com/webik/storm-maven
WHAT NEXT...
SUMMINGBIRD
Write once, run on:
Storm
Hadoop (Scalding)
Amazon Kinesis
MAYBE BACK TO THE ZOO
STORM YARN
THANK YOU.
QUESTIONS?
Atmosphere 2014: When Storm hits data. Data streams processing in real time - Marcin Stanislawski

Performing analysis and prediction at large scale (aka Big Data) is very popular in the IT world now. But how do you deal with this amount of data in real time? How do you answer questions asked by the business in seconds rather than hours? How do you forget about MapReduce jobs and still deal with Big Data?
In this talk I will try to answer these questions and introduce you to real time "recipes" and algorithms. The talk also covers the basics of the Apache Storm project, which is my preferred tool for this type of analysis. The presentation also contains facts and lessons learned from designing and developing real time topologies at INTERIA.PL.

Marcin Stanislawski - Marcin Stanislawski works as a software architect/engineer at INTERIA.PL, the 4th biggest portal of the Polish internet. Currently he is working on a system that analyzes and predicts users' behaviour on the portal websites. Moreover, he is a huge fan of Open Source, Continuous Delivery, TDD, BDD and functional programming languages like Scala. On a daily basis, a husband, father and urbanomics enthusiast.
