Atmosphere 2014: When Storm hits data. Data streams processing in real time - Marcin Stanislawski

Analysis and prediction at large scale (aka Big Data) are very popular in the IT world now. But how do you deal with this amount of data in real time? How do you answer questions asked by the people running the business in seconds rather than in hours? How do you forget about MapReduce jobs and still deal with Big Data?
In this talk I will try to answer these questions and introduce you to real-time "recipes" and algorithms. The talk also covers the basics of the Apache Storm project, which is my preferred tool for this type of analysis. The presentation also contains facts and lessons learned from designing and developing real-time topologies at INTERIA.PL.

Marcin Stanislawski works as a software architect/engineer at INTERIA.PL, the 4th biggest portal on the Polish internet. Currently he is working on a system that analyzes and predicts users' behaviour on the portal's websites. He is also a huge fan of Open Source, Continuous Delivery, TDD, BDD and functional programming languages like Scala. On a daily basis: husband, father and urbanomics enthusiast.

Transcript

  • 1. WHEN STORM HITS DATA. DATA STREAMS PROCESSING IN REAL TIME. MARCIN STANISLAWSKI
  • 2. WHO AM I? Architect/Developer at Interia.pl Storm and Hadoop user GitHub: webik Twitter: @unilama
  • 3. BIG DATA HADOOP
  • 4. WELCOME TO THE ZOO
  • 5. RUN JOB, COFFEE BREAK*, RESULTS (*there are some solutions)
  • 6. IMPALA implemented in C++; non-MapReduce solution
  • 7. KIJI KijiREST HDFS/HBase/Cassandra
  • 8. BATCH PROCESSING VS. STREAMING
  • 9. STREAMING SOLUTIONS Yahoo S4 Akka Spark Streaming Storm
  • 10. STORM WHAT IS THAT?
  • 11. README.MD Storm is a distributed realtime computation system. Storm is simple, can be used with any programming language, and is a lot of fun to use!
  • 12. CURRENT STATUS Apache Incubation; Included in Hortonworks Data Platform; Contributed by Yahoo; Easy deploy to Amazon EC2
  • 13. WHO USES
  • 14. BASIC IDEA
  • 15. SPOUTS TAKE EVENTS FROM: Kafka Kestrel RabbitMQ ... AND PASS THEM TO...
  • 16. BOLTS TUPLES ARE PROCESSED IN THE WAY THAT YOU IMPLEMENT IT
  • 17. EVENTS ARE TUPLES ( 1, "TEST", "ATMOSPHERE", "2014-05-20 10:00:40", ... ) OBJECTS ARE SERIALIZED USING KRYO
  • 18. WRITTEN IN JAVA & CLOJURE TOPOLOGIES ARE DAGS
  • 19. ARCHITECTURE Nimbus Nodes (Supervisors) UI DRPC
  • 20. EVENTS ARE PROCESSED ONE OR MORE TIMES.
  • 21. ACKING FRAMEWORK Each tuple must be acked or failed
  • 22. TUPLES TRACKING each tuple has a random 64-bit id; the tracker keeps the XOR of all tuple ids that have been created and/or acked in the tree; if that XOR value equals 0, the tuple is fully processed
  • 23. COMMUNICATION Between: Tasks: LMAX Disruptor Workers: ØMQ -> Netty
  • 24. TRIDENT high-level abstraction, same as Cascading/Scalding in the Hadoop world
  • 25. SPOUT Key difference - producing Stream(s)
  • 26. STREAM Batches chain with multiplication ability
  • 27. STREAM OPERATIONS Functions Filters Projections Joins Merges
  • 28. STATE Operations: Grouping Aggregate Query
  • 29. STATE TYPES non-transactional transactional opaque transactional
  • 30. STATE In-memory state NoSQL databases External systems via APIs
  • 31. DRPC
  • 32. DRPC TOPOLOGY NAMED DRPC SPOUT USES MAIN TOPOLOGY STATES GENERATES ONE TUPLE OUTPUT
  • 33. DRPC ELEMENTS THRIFT SERVER(S) WITH PREDEFINED SPOUT AND BOLT
  • 34. ARE YOU PROGRAMMING IN NON-JVM LANGUAGE? NO PROBLEM :) Ruby Python Perl PHP ...
  • 35. STREAMING API API defined as Thrift; JSON-based communication
  • 36. RED STORM Writing topologies in Ruby
  • 37. REAL TIME ALGORITHMS
  • 38. SIMPLE OPERATIONS Sum Count Multiplication
  • 39. MAXIMUM AND MINIMUM don't lose current value
  • 40. USUALLY TWO TOPOLOGIES
  • 41. LEARNING Classification Clustering
  • 42. MODEL Evaluator Visualiser
  • 43. BASIC ELEMENT TABLE
  • 44. SIMPLE EXAMPLE
  • 45. ALGORITHM EXAMPLES k-means clustering; statistical tests (T, F, Z, Chi2); Hidden Markov Models
  • 46. ADVERT TIME :)
  • 47. STORMUNIT http://github.com/webik/StormUnit MAVEN MOJO - COMING SOON :) http://github.com/webik/storm-maven
  • 48. WHAT NEXT...
  • 49. SUMMINGBIRD Write once, run on: Storm, Hadoop (Scalding), Amazon Kinesis
  • 50. MAYBE BACK INTO ZOO STORM YARN
  • 51. THANK YOU.
  • 52. QUESTIONS?
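
The tuple tracking idea from slide 22 can be sketched in a few lines of Python. This is a toy model written for illustration, not Storm's actual acker: the `Acker` class, its `update` method, the `new_id` helper and the `"msg-1"` key are all hypothetical names. It assumes that every tuple id is XORed into a running value once when the tuple is created and once when it is acked, so the value returns to 0 exactly when every tuple in the tree has been acked.

```python
import random


class Acker:
    """Toy model of XOR-based tuple tracking; not Storm's actual implementation."""

    def __init__(self):
        # Hypothetical bookkeeping: one running XOR value per spout message.
        self.pending = {}

    def update(self, root, value):
        """XOR value into the entry for root; return True once it reaches 0."""
        current = self.pending.get(root, 0) ^ value
        if current == 0:
            self.pending.pop(root, None)
            return True
        self.pending[root] = current
        return False


def new_id(rng):
    """Random 64-bit tuple id (0 is avoided so a lone id never looks finished)."""
    return rng.getrandbits(64) or 1


rng = random.Random(2014)
acker = Acker()
steps = []

# The spout emits one tuple: its id enters the XOR.
spout_tuple = new_id(rng)
steps.append(acker.update("msg-1", spout_tuple))

# A bolt acks the spout tuple and anchors two new tuples to it.
# The acked id and the created ids are reported together as a single XOR.
child_a, child_b = new_id(rng), new_id(rng)
steps.append(acker.update("msg-1", spout_tuple ^ child_a ^ child_b))

# Downstream bolts ack the two children without emitting anything new.
steps.append(acker.update("msg-1", child_a))
steps.append(acker.update("msg-1", child_b))

print(steps)
```

Only the last update reports completion, because that is the first moment every id has been XORed in twice. The appeal of this design is that the tracker needs a single 64-bit value per spout message, no matter how large the tuple tree grows.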