Budapest Big Data Meetup Real-time stream processing


Short introduction to open source stream processing solutions and some features of Apache Giraph and Storm, with one POC story at the end.

Published in: Technology

  1. Starschema: Experience and Innovation
  2. Topics today
     • Who we are and what we are doing
     • Big Data era
     • BSP (Bulk synchronous parallel)
     • Apache Giraph
     • Storm
     • Our use case
     • Conclusion
  3. Facts about Starschema (company data)
     • Founded in 2006: the company was founded by private owners with a decade of BI and data warehouse background.
     • Continuous growth: 25 FTE plus external resources, over $1.5 million EBIT.
     • Open source projects: sharing knowledge with the public; open source projects in the ETL and data warehousing fields.
     • R&D: cooperation with Obuda University, NKE, and EU co-funded technology research and development.
  4. Big Data era: the rise of Hadoop

     Google               Year of WP   Apache      Year
     GFS                  2003         HDFS        2007
     MapReduce            2004         Hadoop MR   2007
     BigTable             2006         HBase       2007
     Chubby Lock Service  2006         ZooKeeper   2007
     Pregel               2009         Giraph      2011
     Dremel               2010         Drill       2012 ?

     Which is next? (Curator, Falcon, MRQL, etc.)
  5. BSP (Bulk synchronous parallel): What is it? What is it good for?
     • Leslie Valiant, article in Nov. 1990
     • Supersteps
     • Data stored in local memory
     • Asynchronous data processing
     • Barrier sync
     • Optimal load balancing (more logical processes than physical processors, random allocation of processes)
     • Solution differences (protocols, buffer management, routing strategies)
     • No deadlocks or other race conditions (since there are no circular dependencies)
     • Use cases
  6. Storm, Apache Giraph
  7. Apache Giraph: What is it?
     • A loose implementation of Pregel
     • Avery Ching: "We can't use it at Yahoo, that's too bad"
     • Developed at Yahoo
     • Runs on existing MapReduce infrastructure
     • Netty-based communication instead of Hadoop RPC
     • In-memory
     • Fault tolerant
     • Internal state is saved at user-defined intervals
     • Master/slave architecture
  8. Storm: What is it?
     • Storm is a free and open source distributed real-time computation system
     • Developed at BackType, open-sourced by Twitter in 2011
     • Guaranteed data processing
     • Horizontal scalability
     • Fault tolerance
     • ZeroMQ for message passing
     • Processing unbounded sequences of tuples
     • Groupings
  9. Storm: What is it for?
     • Analyze, clean, normalize
     • Real-time calculation
     • Real-time ETL
     • Failure detection from log files
     • Machine data analysis
     • IT early-warning systems, security and fraud detection
     • Traffic information, DoS attacks
     • Stream processing, continuous computation, distributed RPC
  10. Our use case. POC: Processing machine data from sensors
      Requirements:
      • Real-time calculation
      • Error detection
      • Horizontal scalability
      • Fast implementation
      • High availability
      • Error prediction
  11. Our use case, part 2. POC: Processing machine data from sensors
      Solution:
      • Chosen tool: Storm
      • One spout for each sensor
      • Dynamic addition and removal of spouts
      • Error detection based on statistical calculations
      • ~200 lines
      • HA capability of Storm
  12. Conclusion
      • Extend existing infrastructure
      • Answer new questions
      • Re-think old problems
      • New solutions, new features
      • Happy customers/users
      • $$$
  13. What else to use?
      • Yahoo S4 (Apache Incubator project)
      • Apache Hama (top-level Apache project)
      • GoldenOrb
      • Signal/Collect
  14. Questions & Answers
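
The BSP bullets on slide 5 (supersteps, data in local memory, barrier sync) can be made concrete with a small sketch. What follows is a single-process Python simulation, not Giraph or Pregel code: the function name `bsp_max_value`, the adjacency-dict graph format, and the choice of maximum-value propagation as the example problem are all assumptions made for illustration.

```python
from collections import defaultdict


def bsp_max_value(graph, values):
    """Spread the largest vertex value through a graph in BSP supersteps.

    graph:  dict mapping each vertex to a list of its neighbours (hypothetical format)
    values: dict mapping each vertex to its initial local value
    Returns the final values and the number of supersteps executed.
    """
    values = dict(values)  # each vertex only reads and writes its own entry

    # Superstep 0: every vertex sends its local value to its neighbours.
    inbox = defaultdict(list)
    for vertex, neighbours in graph.items():
        for neighbour in neighbours:
            inbox[neighbour].append(values[vertex])
    supersteps = 1

    # Later supersteps run until no messages are in flight.
    while inbox:
        outbox = defaultdict(list)
        for vertex, messages in inbox.items():
            best = max(messages)
            if best > values[vertex]:
                # Local update, then notify neighbours in the NEXT superstep.
                values[vertex] = best
                for neighbour in graph[vertex]:
                    outbox[neighbour].append(best)
        # Barrier: messages sent in this superstep become visible only now.
        inbox = outbox
        supersteps += 1

    return values, supersteps
```

For example, on the chain a, b, c with initial values 3, 6, 1, every vertex ends up holding 6. In a distributed runtime the per-vertex loop would run in parallel on many workers, and the `inbox = outbox` swap would be a real synchronization barrier.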
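
Slide 8 lists "Groupings" without detail. Storm's stream groupings decide which task of a downstream bolt receives each tuple; two of them are shuffle grouping (random distribution) and fields grouping (tuples with the same field value go to the same task). The sketch below illustrates the idea in plain Python. It is not Storm's API: the function names, the dict-based tuple, and the CRC32 hash are assumptions chosen for illustration.

```python
import random
import zlib


def shuffle_grouping(tup, num_tasks, rng=random):
    """Pick a task at random, spreading load evenly across tasks."""
    return rng.randrange(num_tasks)


def fields_grouping(tup, field, num_tasks):
    """Pick a task from a hash of one field, so equal values share a task."""
    key = str(tup[field]).encode("utf-8")
    # CRC32 is an illustrative, process-independent hash (an assumption).
    return zlib.crc32(key) % num_tasks
```

With fields grouping on a hypothetical `sensor` field, all readings of one sensor reach the same task, which is what per-sensor statistics need.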
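
Slide 11 mentions "error detection based on statistical calculations" but does not say which calculations were used. One plausible reading, shown here purely as an assumption, is a rolling z-score check per sensor: a reading is flagged when it lies more than a few standard deviations from the mean of recent readings. The class name `SensorMonitor`, the window size, and the threshold are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


class SensorMonitor:
    """Flags readings that deviate strongly from a sensor's recent history."""

    def __init__(self, window=20, threshold=3.0):
        self.window = deque(maxlen=window)  # most recent accepted readings
        self.threshold = threshold          # allowed distance in std deviations

    def observe(self, reading):
        """Return True if the reading looks like an error, else False."""
        is_error = False
        # Only judge readings once the window is full (warm-up period).
        if len(self.window) == self.window.maxlen:
            mu = mean(self.window)
            sigma = pstdev(self.window, mu)
            if sigma > 0 and abs(reading - mu) > self.threshold * sigma:
                is_error = True
        # Flagged readings are kept out of the window so they do not
        # distort the statistics used for later readings.
        if not is_error:
            self.window.append(reading)
        return is_error
```

In a topology with one spout per sensor, one such monitor per sensor could live in a downstream bolt; creating and discarding monitors would then follow the dynamic addition and removal of spouts.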