High-throughput data analysis: A Streaming Reports Platform

Authors: J Singh, Early Stage IT; David Zheng, Early Stage IT
Contributor: Satya Gupta, Virsec Systems
October 3, 2011
High-throughput data analysis
- A few examples of streaming data problems
- A concrete problem we solved
- How we solved it
- Take-away lessons
Streaming Data
- Data arrives continuously
- Must be processed continuously
- Emit analysis results or alerts as needed
A few examples of streaming data problems:
- Security: scrapers, spammers, …
- Monitoring and alerts
- Financial markets
High-throughput data analysis
- A few examples of streaming data problems
- A concrete problem we solved
- How we solved it
- Take-away lessons
Example Use Case
- The Resolve Virtual Machine (a product of Virsec Systems) profiles an application and gathers data about it
  - Previously, data analysis was blocked until data collection was complete; it took several hours before conclusions could be drawn
- Project goals
  - Stream-mode analysis: begin within a few seconds of the start of profiling
  - Continuous update
  - Data rates up to 5 GB per hour
  - Ability to sustain that rate for 24 hours
- Analysis and reporting configured to run in the Amazon EC2 environment
  - Can be scaled up (bigger machines) or scaled out (more machines)

Approach
- A variant of Eric Ries’ Lean Startup approach
  - Introduced in his book, The Lean Startup, Crown Publishing Group, 2011
  - It is a recipe for learning quickly
- Re-adapted by us for this and other “learning projects”
  - Do what it takes to get an end-to-end solution
  - Measure, Learn, Build, repeat
  - When the cycle has stabilized, expand scope appropriately
Requirements
- Fast inserts into the database (a sketch follows this list)
- The nature and amount of analysis required was hard to judge at the outset
  - Previous experience with Map/Reduce in the Google App Engine environment had shown promise, but GAE was not appropriate for this application
- Slick, demo-worthy web interface for presenting results
- Stream-mode operation
  - Start showing results within a few seconds of starting the Resolve Virtual Machine, and update them periodically as more data is collected and analyzed
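The “fast inserts” requirement is easiest to see in code. Below is a minimal sketch in Python with pymongo (the production listener was written in C++; the collection name, batch size, and the read_events() stub are assumptions for illustration). Batching inserts amortizes the per-write network round-trip:

    from pymongo import MongoClient

    def read_events():
        """Stand-in for the socket feed: yields one document per profiling event."""
        for i in range(10000):
            yield {"FnName": "CreateRaceObjects", "TotalTime": i % 10, "PID": 1}

    client = MongoClient("localhost", 27017)
    coll = client.profiling.raw_events          # collection name is illustrative

    batch = []
    for event in read_events():
        batch.append(event)
        if len(batch) >= 1000:                  # amortize network round-trips
            coll.insert_many(batch, ordered=False)
            batch = []
    if batch:
        coll.insert_many(batch, ordered=False)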
High-throughput data analysis
- A few examples of streaming data problems
- A concrete problem we solved
- How we solved it
- Take-away lessons
Key decisions
- Chunk the data into 1-second “slices” as it arrives
  - Use a collection for signaling the availability of each data slice (see the sketch below)
  - Process each chunk as it becomes available
- Use Map/Reduce for analysis
  - Exploit the parallelism of the data by using as many processors as needed to maintain the “flow rate”
  - Pipeline the various Map/Reduce jobs to preserve the ordering of the data
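The deck does not show how the signaling collection worked; the following is one plausible sketch in Python with pymongo. The slices collection, the state field, and find_one_and_update as the claim mechanism are all assumptions. The idea is that the writer publishes one marker document per 1-second slice, and an analyzer claims slices in order as they become available:

    import time
    from pymongo import MongoClient

    db = MongoClient().profiling

    def process(slice_id):
        """Stand-in for the per-slice analysis (see the Map/Reduce sketches below)."""
        print("processing slice", slice_id)

    # Writer side: once the events for a 1-second slice are all inserted,
    # publish a marker document announcing its availability.
    def publish_slice(slice_id):
        db.slices.insert_one({"_id": slice_id, "state": "ready"})

    # Analyzer side: atomically claim the oldest ready slice.
    def claim_next_slice():
        return db.slices.find_one_and_update(
            {"state": "ready"},
            {"$set": {"state": "processing"}},
            sort=[("_id", 1)],                  # preserve slice order
        )

    while True:
        doc = claim_next_slice()
        if doc is None:
            time.sleep(0.2)                     # nothing ready yet; poll again
            continue
        process(doc["_id"])
        db.slices.update_one({"_id": doc["_id"]}, {"$set": {"state": "done"}})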
Pipeline Component: Listener
- Goal: push the data into MongoDB as fast as possible (sketched below)
- Receives the data from the Resolve Virtual Machine and stores it in MongoDB
  - Self-describing data
  - 12 different types of data fed over 12 different sockets
- Written in C++
  - Socket interface at one end
  - MongoDB C++ driver at the other end
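The real listener is C++ built on the MongoDB C++ driver; as a minimal illustration of the same socket-to-collection flow, here is a Python sketch for one of the 12 feeds. The port number, the newline-delimited JSON wire format, and the collection name are assumptions (the deck says only that the data is self-describing):

    import json
    import socket
    from pymongo import MongoClient

    # One socket per feed type; port and collection name are illustrative.
    FEED_PORT = 5001
    coll = MongoClient().profiling.function_call_structure

    srv = socket.create_server(("0.0.0.0", FEED_PORT))   # Python 3.8+
    conn, _addr = srv.accept()
    buf = b""
    while True:
        data = conn.recv(65536)
        if not data:
            break
        buf += data
        while b"\n" in buf:                              # assumed framing: one record per line
            line, buf = buf.split(b"\n", 1)
            coll.insert_one(json.loads(line))            # self-describing record straight in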
Pipeline Component: MongoDB
- Goal: persistence
- Did everything it was supposed to do
- Allowed us to focus on our problem, not on MongoDB
- Will use replica sets to make the data available to the analysis servers

Pipeline Component: Map/Reduce Analysis Program
- “Function Call Structure” data type
  - Calculation of T was better done in the listener, so it was moved there
  - Could scale the solution up, but could not scale it out
  - Map/Reduce was much faster and could scale out
- “Memory Usage” data type
  - Needed multiple map/reduce stages (a pipelined sketch follows this list)
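As a sketch of what multiple map/reduce stages can look like in MongoDB, here are two chained jobs in Python (pymongo 3.x; Collection.map_reduce was removed in pymongo 4, and MongoDB has since deprecated mapReduce in favor of aggregation pipelines). The memory_usage field names and the two stages (per-second totals, then per-PID peaks) are assumptions, not the original analysis; the point is that each stage reads the output collection of the previous one:

    from bson.code import Code
    from pymongo import MongoClient

    db = MongoClient().profiling

    # Stage 1: total bytes per (PID, second) -- field names are assumed.
    map1 = Code("function () { emit({pid: this.PID, sec: this.sec}, this.bytes); }")
    reduce1 = Code("function (key, values) { return Array.sum(values); }")
    db.memory_usage.map_reduce(map1, reduce1, out="mem_per_pid_sec")

    # Stage 2: peak per-second usage for each PID, read from stage 1's
    # output collection -- this chaining is the "pipeline" of jobs.
    map2 = Code("function () { emit(this._id.pid, this.value); }")
    reduce2 = Code("function (key, values) { return Math.max.apply(null, values); }")
    db.mem_per_pid_sec.map_reduce(map2, reduce2, out="mem_peak_per_pid")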
“Function Call Structure” Analysis
- Input record fields: FnName, TotalTime, SrcFnAddress, PID, SF_CFA
- map (emit: FnName, TotalTime, min_addr, NumOfCalls, PID, …)
- Shuffle stage:
    FnName: CreateRaceObjects {
        {TotalTime: 3, min_addr: 2, NumOfCalls: 1, PID: 1, …}
        {TotalTime: 7, min_addr: 3, NumOfCalls: 1, PID: 1, …}
        {TotalTime: 4, min_addr: 1, NumOfCalls: 1, PID: 1, …}
        {TotalTime: 6, min_addr: 1, NumOfCalls: 1, PID: 1, …}
    }
- reduce output:
    {FnName: CreateRaceObjects, TotalTime: 20, min_addr: 1, NumOfCalls: 2, PID: 1, …}
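A plausible reconstruction of the map and reduce functions this slide implies, in Python with pymongo (MongoDB map/reduce functions are JavaScript passed as Code objects). Deriving min_addr from SrcFnAddress and the output collection name are assumptions:

    from bson.code import Code
    from pymongo import MongoClient

    db = MongoClient().profiling

    # Map: key on FnName; one emitted value per call record.
    mapper = Code("""
    function () {
        emit(this.FnName, {TotalTime: this.TotalTime,
                           min_addr: this.SrcFnAddress,
                           NumOfCalls: 1,
                           PID: this.PID});
    }
    """)

    # Reduce: sum times and call counts, keep the minimum address.
    # The output shape matches the emitted shape, so re-reduction is safe.
    reducer = Code("""
    function (key, values) {
        var out = {TotalTime: 0, min_addr: Infinity, NumOfCalls: 0, PID: values[0].PID};
        values.forEach(function (v) {
            out.TotalTime += v.TotalTime;
            out.NumOfCalls += v.NumOfCalls;
            if (v.min_addr < out.min_addr) out.min_addr = v.min_addr;
        });
        return out;
    }
    """)

    db.function_call_structure.map_reduce(mapper, reducer, out="fcs_summary")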
Pipeline Component: Presentation
- Each feed type presents a different view of the user’s application
- And requires a page design of its own
- Tool of choice: DjangoNonRel (a sketch of one such view follows)
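For illustration only, a hypothetical Django view for one feed type; the model, field, and template names are invented, and a real Django-nonrel project would define the model against its non-relational backend:

    from django.shortcuts import render

    from reports.models import FunctionCallSummary   # assumed Django-nonrel model

    def function_call_view(request):
        # Each feed type gets its own view and template ("a page design of its own").
        rows = FunctionCallSummary.objects.order_by("-TotalTime")[:50]
        return render(request, "reports/function_calls.html", {"rows": rows})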
