Hadoop and Storm - AJUG talk


Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation.
Storm often coexists with Hadoop in Big Data architectures. We will talk about different approaches to interoperability between the two systems, their benefits and trade-offs, and a new open source library available for high-throughput use.

Published in: Technology

  1. 1. ©MapR Technologies. Hadoop and Storm. AJUG, 5/21/2013
  2. 2. whoami • Brad Anderson • Solutions Architect at MapR (Atlanta) • ATLHUG co-chair • NoSQL East Conference 2009 • “boorad” most places (twitter, github) • banderson@maprtech.com
  3. 3. Hadoop: A Paradigm Shift • Distributed computing platform – Large clusters – Commodity hardware • Pioneered at Google – Google File System, MapReduce and BigTable • Commercially available as Hadoop
  4. 4. Ship the Function to the Data [Diagram: Traditional Architecture (data on SAN/NAS, function in the RDBMS) vs. Distributed Computing (a function next to each block of data)]
  5. 5. MapReduce Flow: Input → Map → Combine → Shuffle and sort → Reduce → Output
  6. 6. Variation: No Reduce Necessary. Example: Batch File Transformation. Input → Map → Output (MPG → M4V)
  7. 7. Variation: Multiple MapReduces. Example: Fraud Detection in User Transactions. [Diagram labels: Transaction data; LDA training; LDA scoring; HBase / MapR M7 Edition; G2 score; 95 %-ile LDA anomaly; Candidate events for analyst review; MapReduce] http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
  8. 8. Pig
  9. 9. MR Equivalent to Pig Script
  10. 10. Hive
  11. 11. MapR Distribution for Apache Hadoop • Complete Hadoop distribution • Comprehensive management suite • Industry-standard interfaces • Enterprise-grade dependability • Enterprise-grade security (US Intelligence Agency) • Patents - IP • Higher performance
  12. 12. Hadoop Use Cases • ETL/EDW Offload • Sensor / Telemetry Data • Recommendation Engine • Search • ML algorithms • eDiscovery • Fleet Management • Fraud Detection / Risk Management • Traffic Decongestion
  13. 13. One Platform for Big Data… • 99.999% HA • Data Protection • Disaster Recovery • Scalability & Performance • Enterprise Integration • Multi-tenancy • MapReduce • File-Based Applications • SQL • Database • Search • Stream Processing • Batch • Interactive • Realtime. Batch: Log file Analysis, Data Warehouse Offload, Fraud Detection, Clickstream Analytics. Realtime: Sensor Analysis, “Twitterscraping”, Telematics, Process Optimization. Interactive: Forensic Analysis, Analytic Modeling, BI User Focus
  14. 14. Storm: “Hadoop for Realtime”
  15. 15. Before Storm: Queues, Workers
  16. 16. Example (simplified)
  17. 17. Storm • Guaranteed data processing • Horizontal scalability • Fault-tolerance • No intermediate message brokers! • Higher level abstraction than message passing • “Just works”
  18. 18. Streams: Unbounded sequence of tuples
  19. 19. Spouts: Source of streams
  20. 20. Spouts
      public interface ISpout extends Serializable {
          void open(Map conf, TopologyContext context, SpoutOutputCollector collector);
          void close();
          void nextTuple();
          void ack(Object msgId);
          void fail(Object msgId);
      }
  21. 21. Bolts: Processes input streams and produces new streams
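  22. 22. Bolts
      // Package names below are those of Storm 0.8.x (backtype.storm).
      import java.util.Map;
      import backtype.storm.task.OutputCollector;
      import backtype.storm.task.TopologyContext;
      import backtype.storm.topology.OutputFieldsDeclarer;
      import backtype.storm.topology.base.BaseRichBolt;
      import backtype.storm.tuple.Fields;
      import backtype.storm.tuple.Tuple;
      import backtype.storm.tuple.Values;
      public class DoubleAndTripleBolt extends BaseRichBolt {
          // BaseRichBolt.prepare receives an OutputCollector, not an OutputCollectorBase.
          private OutputCollector _collector;
          public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
              _collector = collector;
          }
          public void execute(Tuple input) {
              int val = input.getInteger(0);
              // Anchor the emitted tuple to the input, then ack the input.
              _collector.emit(input, new Values(val * 2, val * 3));
              _collector.ack(input);
          }
          public void declareOutputFields(OutputFieldsDeclarer declarer) {
              declarer.declare(new Fields("double", "triple"));
          }
      }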
  23. 23. Topologies: Network of spouts and bolts
  24. 24. Trident: Cascading for Storm
      TridentTopology topology = new TridentTopology();
      TridentState wordCounts =
          topology.newStream("spout1", spout)
              .each(new Fields("sentence"), new Split(), new Fields("word"))
              .groupBy(new Fields("word"))
              .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
              .parallelismHint(6);
  25. 25. Parallel Cluster Ingest [Diagram labels: Raw Data; Queue (Kafka); Storm; realtime processes; Hadoop; batch processes; Apps; Business Value]
  26. 26. [Diagram labels: Raw Data; Queue (Kafka); Franz; TailSpout; Storm; realtime processes; Hadoop; batch processes; Apps; Business Value]
  27. 27. [Diagram labels: Storm; Kafka; Twitter; Twitter API; TweetLogger; Kafka Cluster (three); Kafka API; Storm; Web Service; NAS; Web Data; Hadoop; Flume; HDFS; Data]
  28. 28. [Diagram labels: Twitter; Twitter API; Catcher; Storm; Topic Queue; Web-server; http; Web Data; MapR; TweetLogger]
  29. 29. Scaling Estimates: Twitter Firehose. Old School – 8+ separate clusters, 20-25 nodes • >3 Kafka nodes • >2 TweetLoggers • 5-10 Hadoop • >2 Catcher nodes • >3 Storm • 3 zookeepers • NAS for web storage • >2 web servers. MapR – One Platform • 5-10 nodes total • Any node does any job • Full HA included • Backups included
  30. 30. github • Watch TailSpout & Franz development • https://github.com/{tdunning | boorad | pfcurtis}/mapr-spout • And our example Twitter implementation • https://github.com/{tdunning | boorad | pfcurtis}/mapr-spout-test
  31. 31. Demo
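
The MapReduce flow on slide 5 (Input, Map, Shuffle and sort, Reduce, Output) can be illustrated with a small in-memory word count. This is only a sketch in plain Java, not Hadoop code: the class `MapReduceFlowSketch` and its method names are made up for illustration, the Combine step is left out, and everything runs in one process rather than across a cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process illustration of the MapReduce phases.
public class MapReduceFlowSketch {

    // Map: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle and sort: group values by key, with keys kept in sorted order.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Run all phases over the input lines and return word -> count.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        Map<String, Integer> output = new TreeMap<>();
        shuffle(mapped).forEach((word, ones) -> output.put(word, reduce(ones)));
        return output;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("storm and hadoop", "hadoop and storm and kafka")));
    }
}
```

In a real job the map and reduce calls would run on different nodes, and a combiner would typically apply the same summing logic to each mapper's local output before the shuffle.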
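
The `ack` and `fail` methods of the `ISpout` interface on slide 20 are what make "guaranteed data processing" (slide 17) possible: a source can remember what it emitted and replay anything that fails. The following is a minimal sketch of that idea, assuming a simple in-memory queue. `ReplayingSourceSketch` is a hypothetical class that does not implement or depend on Storm; its `nextTuple` returns a message id only so the example can be driven by hand.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical stand-in for a spout; it does not implement Storm's ISpout.
public class ReplayingSourceSketch {
    private final Queue<String> ready = new ArrayDeque<>();
    private final Map<Long, String> pending = new HashMap<>();
    private final List<String> emitted = new ArrayList<>();
    private long nextId = 0;

    public ReplayingSourceSketch(List<String> messages) {
        ready.addAll(messages);
    }

    // Emit the next ready message under a fresh id; returns -1 if nothing is ready.
    public long nextTuple() {
        String msg = ready.poll();
        if (msg == null) {
            return -1;
        }
        long id = nextId++;
        pending.put(id, msg);   // remember it until it is acked or failed
        emitted.add(msg);
        return id;
    }

    // Fully processed downstream: forget the message.
    public void ack(long msgId) {
        pending.remove(msgId);
    }

    // Processing failed: put the message back so it is emitted again.
    public void fail(long msgId) {
        String msg = pending.remove(msgId);
        if (msg != null) {
            ready.add(msg);
        }
    }

    public int pendingCount() {
        return pending.size();
    }

    public List<String> emitted() {
        return emitted;
    }

    public static void main(String[] args) {
        ReplayingSourceSketch source = new ReplayingSourceSketch(List.of("a", "b"));
        long first = source.nextTuple();
        long second = source.nextTuple();
        source.ack(first);      // "a" succeeded
        source.fail(second);    // "b" failed and is queued for replay
        source.ack(source.nextTuple());
        System.out.println(source.emitted());
    }
}
```

The sketch gives at-least-once behavior: a failed message is emitted again, so downstream processing has to tolerate duplicates.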
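
Slide 23 describes a topology as a network of spouts and bolts. As a rough illustration, the sketch below wires one source into a linear chain of two bolts in plain Java. Everything here is an assumption made for the example: the `Bolt` interface, the `run` method and the `KEEP_EVEN_TRIPLES` filter are hypothetical, tuples are modeled as `List<Integer>`, and there is no parallelism, grouping or acking as there would be in Storm. `DOUBLE_AND_TRIPLE` mirrors the logic of the `DoubleAndTripleBolt` on slide 22.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical single-threaded model of a spout feeding a chain of bolts.
public class TopologySketch {

    // A "bolt" here maps one input tuple to zero or more output tuples.
    interface Bolt extends Function<List<Integer>, List<List<Integer>>> {}

    // Emits (val * 2, val * 3) for each input tuple, like slide 22's bolt.
    static final Bolt DOUBLE_AND_TRIPLE =
        t -> List.of(List.of(t.get(0) * 2, t.get(0) * 3));

    // Passes a tuple through only if its second field is even.
    static final Bolt KEEP_EVEN_TRIPLES = t -> {
        List<List<Integer>> out = new ArrayList<>();
        if (t.get(1) % 2 == 0) {
            out.add(t);
        }
        return out;
    };

    // Pull tuples from the spout and push each one through every bolt in order.
    static List<List<Integer>> run(Iterator<List<Integer>> spout, List<Bolt> bolts) {
        List<List<Integer>> results = new ArrayList<>();
        while (spout.hasNext()) {
            List<List<Integer>> current = List.of(spout.next());
            for (Bolt bolt : bolts) {
                List<List<Integer>> next = new ArrayList<>();
                for (List<Integer> tuple : current) {
                    next.addAll(bolt.apply(tuple));
                }
                current = next;
            }
            results.addAll(current);
        }
        return results;
    }

    public static void main(String[] args) {
        Iterator<List<Integer>> spout =
            List.of(List.of(1), List.of(2), List.of(3)).iterator();
        System.out.println(run(spout, List.of(DOUBLE_AND_TRIPLE, KEEP_EVEN_TRIPLES)));
    }
}
```

A real topology is a graph rather than a single chain, and Storm runs each spout and bolt as many parallel tasks spread across the cluster.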