High-throughput data analysis: A Streaming Reports Platform

Authors: J Singh, Early Stage IT; David Zheng, Early Stage IT
Contributor: Satya Gupta, Virsec Systems
October 3, 2011
High-throughput data analysis
- A few examples of streaming data problems
- A concrete problem we solved
- How we solved it
- Take-away lessons
Streaming Data
- Data arrives continuously
- Must be processed continuously
- Emit analysis results or alerts as needed
A few examples of streaming data problems:
- Security: scrapers, spammers, …
- Monitoring and alerts
- Financial markets
High-throughput data analysis
- A few examples of streaming data problems
- A concrete problem we solved
- How we solved it
- Take-away lessons
Example Use Case
- The Resolve Virtual Machine (a product of Virsec Systems) profiles an application and gathers data about it
  - Previously, data analysis was blocked until data collection was complete; it took several hours before conclusions could be drawn
- Project goals
  - Stream-mode analysis: begin within a few seconds of the start of profiling
  - Continuous update
  - Data rates up to 5 GB per hour
  - Ability to sustain that rate for 24 hours
- Analysis and reporting configured to run in the Amazon EC2 environment
  - Can be scaled up (bigger machines) or scaled out (more machines)

Approach
- A variant of Eric Ries’ Lean Startup approach
  - Introduced in his book, The Lean Startup, Crown Publishing Group, 2011
  - It is a recipe for learning quickly
- Re-adapted by us for this and other “learning projects”
  - Do what it takes to get an end-to-end solution
  - Measure, Learn, Build, repeat
  - When the cycle has stabilized, expand scope appropriately
Requirements
- Fast inserts into the database (a sketch follows this list)
- The nature and amount of analysis required was hard to judge at the outset
  - Previous experience with Map/Reduce in the Google App Engine environment had shown promise, but GAE was not appropriate for this application
- Slick, demo-worthy web interface for presenting results
- Stream-mode operation
  - Start showing results within a few seconds of starting the Resolve Virtual Machine, and update them periodically as more data is collected and analyzed
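The “fast inserts” requirement is easiest to see in code. Below is a minimal sketch in Python with pymongo (the production listener was written in C++; the collection name, batch size, and the read_events() stub are assumptions for illustration). Batching inserts amortizes the per-write network round-trip:

    from pymongo import MongoClient

    def read_events():
        """Stand-in for the socket feed: yields one document per profiling event."""
        for i in range(10000):
            yield {"FnName": "CreateRaceObjects", "TotalTime": i % 10, "PID": 1}

    client = MongoClient("localhost", 27017)
    coll = client.profiling.raw_events          # collection name is illustrative

    batch = []
    for event in read_events():
        batch.append(event)
        if len(batch) >= 1000:                  # amortize network round-trips
            coll.insert_many(batch, ordered=False)
            batch = []
    if batch:
        coll.insert_many(batch, ordered=False)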
High-throughput data analysis
- A few examples of streaming data problems
- A concrete problem we solved
- How we solved it
- Take-away lessons
Key decisions
- Chunk the data into 1-second “slices” as it arrives
  - Use a collection for signaling the availability of each data slice (see the sketch below)
  - Process each chunk as it becomes available
- Use Map/Reduce for analysis
  - Exploit the parallelism of the data by using as many processors as needed to maintain the “flow rate”
  - Pipeline the various Map/Reduce jobs to preserve the ordering of the data
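The deck does not show how the signaling collection worked; the following is one plausible sketch in Python with pymongo. The slices collection, the state field, and find_one_and_update as the claim mechanism are all assumptions. The idea is that the writer publishes one marker document per 1-second slice, and an analyzer claims slices in order as they become available:

    import time
    from pymongo import MongoClient

    db = MongoClient().profiling

    def process(slice_id):
        """Stand-in for the per-slice analysis (see the Map/Reduce sketches below)."""
        print("processing slice", slice_id)

    # Writer side: once the events for a 1-second slice are all inserted,
    # publish a marker document announcing its availability.
    def publish_slice(slice_id):
        db.slices.insert_one({"_id": slice_id, "state": "ready"})

    # Analyzer side: atomically claim the oldest ready slice.
    def claim_next_slice():
        return db.slices.find_one_and_update(
            {"state": "ready"},
            {"$set": {"state": "processing"}},
            sort=[("_id", 1)],                  # preserve slice order
        )

    while True:
        doc = claim_next_slice()
        if doc is None:
            time.sleep(0.2)                     # nothing ready yet; poll again
            continue
        process(doc["_id"])
        db.slices.update_one({"_id": doc["_id"]}, {"$set": {"state": "done"}})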
Pipeline Component: Listener
- Goal: push the data into MongoDB as fast as possible (sketched below)
- Receives the data from the Resolve Virtual Machine and stores it in MongoDB
  - Self-describing data
  - 12 different types of data fed over 12 different sockets
- Written in C++
  - Socket interface at one end
  - MongoDB C++ driver at the other end
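The real listener is C++ built on the MongoDB C++ driver; as a minimal illustration of the same socket-to-collection flow, here is a Python sketch for one of the 12 feeds. The port number, the newline-delimited JSON wire format, and the collection name are assumptions (the deck says only that the data is self-describing):

    import json
    import socket
    from pymongo import MongoClient

    # One socket per feed type; port and collection name are illustrative.
    FEED_PORT = 5001
    coll = MongoClient().profiling.function_call_structure

    srv = socket.create_server(("0.0.0.0", FEED_PORT))   # Python 3.8+
    conn, _addr = srv.accept()
    buf = b""
    while True:
        data = conn.recv(65536)
        if not data:
            break
        buf += data
        while b"\n" in buf:                              # assumed framing: one record per line
            line, buf = buf.split(b"\n", 1)
            coll.insert_one(json.loads(line))            # self-describing record straight in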
Pipeline Component: MongoDB
- Goal: persistence
- Did everything it was supposed to do
- Allowed us to focus on our problem, not on MongoDB
- Will use replica sets to make the data available to the analysis servers

Pipeline Component: Map/Reduce Analysis Program
- “Function Call Structure” data type
  - Calculation of T was better done in the listener, so it was moved there
  - Could scale the solution up, but could not scale it out
  - Map/Reduce was much faster and could scale out
- “Memory Usage” data type
  - Needed multiple map/reduce stages (a pipelined sketch follows this list)
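As a sketch of what multiple map/reduce stages can look like in MongoDB, here are two chained jobs in Python (pymongo 3.x; Collection.map_reduce was removed in pymongo 4, and MongoDB has since deprecated mapReduce in favor of aggregation pipelines). The memory_usage field names and the two stages (per-second totals, then per-PID peaks) are assumptions, not the original analysis; the point is that each stage reads the output collection of the previous one:

    from bson.code import Code
    from pymongo import MongoClient

    db = MongoClient().profiling

    # Stage 1: total bytes per (PID, second) -- field names are assumed.
    map1 = Code("function () { emit({pid: this.PID, sec: this.sec}, this.bytes); }")
    reduce1 = Code("function (key, values) { return Array.sum(values); }")
    db.memory_usage.map_reduce(map1, reduce1, out="mem_per_pid_sec")

    # Stage 2: peak per-second usage for each PID, read from stage 1's
    # output collection -- this chaining is the "pipeline" of jobs.
    map2 = Code("function () { emit(this._id.pid, this.value); }")
    reduce2 = Code("function (key, values) { return Math.max.apply(null, values); }")
    db.mem_per_pid_sec.map_reduce(map2, reduce2, out="mem_peak_per_pid")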
“Function Call Structure” Analysis
- Input record fields: FnName, TotalTime, SrcFnAddress, PID, SF_CFA
- map (emit: FnName, TotalTime, min_addr, NumOfCalls, PID, …)
- Shuffle stage:
    FnName: CreateRaceObjects {
        {TotalTime: 3, min_addr: 2, NumOfCalls: 1, PID: 1, …}
        {TotalTime: 7, min_addr: 3, NumOfCalls: 1, PID: 1, …}
        {TotalTime: 4, min_addr: 1, NumOfCalls: 1, PID: 1, …}
        {TotalTime: 6, min_addr: 1, NumOfCalls: 1, PID: 1, …}
    }
- reduce output:
    {FnName: CreateRaceObjects, TotalTime: 20, min_addr: 1, NumOfCalls: 2, PID: 1, …}
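A plausible reconstruction of the map and reduce functions this slide implies, in Python with pymongo (MongoDB map/reduce functions are JavaScript passed as Code objects). Deriving min_addr from SrcFnAddress and the output collection name are assumptions:

    from bson.code import Code
    from pymongo import MongoClient

    db = MongoClient().profiling

    # Map: key on FnName; one emitted value per call record.
    mapper = Code("""
    function () {
        emit(this.FnName, {TotalTime: this.TotalTime,
                           min_addr: this.SrcFnAddress,
                           NumOfCalls: 1,
                           PID: this.PID});
    }
    """)

    # Reduce: sum times and call counts, keep the minimum address.
    # The output shape matches the emitted shape, so re-reduction is safe.
    reducer = Code("""
    function (key, values) {
        var out = {TotalTime: 0, min_addr: Infinity, NumOfCalls: 0, PID: values[0].PID};
        values.forEach(function (v) {
            out.TotalTime += v.TotalTime;
            out.NumOfCalls += v.NumOfCalls;
            if (v.min_addr < out.min_addr) out.min_addr = v.min_addr;
        });
        return out;
    }
    """)

    db.function_call_structure.map_reduce(mapper, reducer, out="fcs_summary")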
Pipeline Component: Presentation
- Each feed type presents a different view of the user’s application
- And requires a page design of its own
- Tool of choice: DjangoNonRel (a sketch of one such view follows)
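For illustration only, a hypothetical Django view for one feed type; the model, field, and template names are invented, and a real Django-nonrel project would define the model against its non-relational backend:

    from django.shortcuts import render

    from reports.models import FunctionCallSummary   # assumed Django-nonrel model

    def function_call_view(request):
        # Each feed type gets its own view and template ("a page design of its own").
        rows = FunctionCallSummary.objects.order_by("-TotalTime")[:50]
        return render(request, "reports/function_calls.html", {"rows": rows})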
