High Throughput Data Analysis

Receiving data from a source that produces 5-10 GBytes per hour and presenting analysis results as the data streams in poses some interesting challenges.

We used MongoDB running on Amazon EC2 to house the data, Map/Reduce to analyze it, and Django-nonrel to present the results in near real time.

(Slides from my presentation at MongoDB Boston)

  1. High-throughput data analysis: A Streaming Reports Platform
     Authors: J Singh, Early Stage IT; David Zheng, Early Stage IT
     Contributor: Satya Gupta, Virsec Systems
     October 3, 2011
  2. High-throughput data analysis
     - A few examples of streaming data problems
     - A concrete problem we solved
     - How we solved it
     - Take-away lessons
  3. Streaming Data
     - Data arrives continuously
     - Must be processed continuously
     - Emit analysis results or alerts as needed
  4. Security: Scrapers, Spammers, …
  5. Monitoring and Alerts
  6. Financial Markets
  7. High-throughput data analysis
     - A few examples of streaming data problems
     - A concrete problem we solved
     - How we solved it
     - Take-away lessons
  8. Example Use Case
     - Resolve Virtual Machine, a product of Virsec Systems, profiles an
       application and gathers data about it
     - Previously, the data analysis blocked until data collection was
       complete; it took several hours before conclusions could be drawn
     - Project goals:
       - Stream-mode analysis: begin within a few seconds of the start of
         profiling, with continuous updates
       - Data rates up to 5 GB per hour, sustainable for 24 hours
     - Analysis and reporting configured to run in the Amazon EC2
       environment: can be scaled up (bigger machines) or scaled out
       (more machines)
  9. Approach
     - A variant of Eric Ries’ Lean Startup approach, introduced in his book
       The Lean Startup (Crown Publishing Group, 2011)
     - It is a recipe for learning quickly, re-adapted by us for this and
       other “learning projects”:
       - Do what it takes to get an end-to-end solution
       - Measure, Learn, Build, repeat
       - When the cycle has stabilized, expand scope appropriately
  10. Requirements
      - Fast inserts into the database
      - The nature and amount of analysis required was hard to judge in the
        beginning; previous experience with Map/Reduce in the Google App
        Engine environment had shown promise, but GAE was not appropriate
        for this application
      - Slick, demo-worthy web interface for presenting results
      - Stream-mode operation: start showing results within a few seconds of
        starting the Resolve Virtual Machine, and update them periodically
        as more data is collected and analyzed
  11. High-throughput data analysis
      - A few examples of streaming data problems
      - A concrete problem we solved
      - How we solved it
      - Take-away lessons
  12. Key decisions
      - Chunk the data into 1-second “slices” as it arrives
      - Use a collection to signal the availability of each data slice
      - Process each chunk as it becomes available
      - Use Map/Reduce for analysis:
        - Exploit the parallelism of the data by using as many processors
          as needed to maintain the “flow rate”
        - Pipeline the various Map/Reduce jobs to maintain sequentiality
          of the data
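The slicing-and-signaling scheme above can be sketched in a few lines of Python. This is a minimal stand-in, not the production code: plain dicts play the role of the MongoDB collections, and all names (`slice_key`, `route_record`, the `available` field) are hypothetical.

```python
def slice_key(timestamp):
    """Bucket a record into its 1-second slice by truncating the timestamp."""
    return int(timestamp)

def route_record(record, slices, signals):
    """Append a record to its slice and signal that the slice has data.

    `slices` maps slice key -> list of records, and `signals` stands in
    for the signaling collection that analysis workers poll to find
    slices that are ready to process.
    """
    key = slice_key(record["ts"])
    slices.setdefault(key, []).append(record)
    signals[key] = {"slice": key, "available": True}
    return key
```

In the real pipeline each slice would be a set of documents in MongoDB, and the signal write is what lets downstream Map/Reduce jobs start within seconds of the data arriving.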
  13. Pipeline Component: Listener
      - Goal: push the data into MongoDB as fast as possible
      - Receives the data from the Resolve Virtual Machine and stores it
        in MongoDB
      - Self-describing data: 12 different types of data fed over 12
        different sockets
      - Written in C++: socket interface at one end, MongoDB C++ driver
        at the other
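The routing of self-describing records by feed type can be illustrated with a small Python sketch (the actual listener is C++, and the type tags, handler names, and collection names here are invented for illustration only):

```python
import json

# Hypothetical handlers for two of the twelve feed types; the real
# listener has one socket per feed type and inserts the parsed
# documents via the MongoDB C++ driver.
def handle_call_structure(doc):
    return ("SF_CFA", doc)   # route to the call-structure collection

def handle_memory_usage(doc):
    return ("SF_MEM", doc)   # route to a memory-usage collection

HANDLERS = {
    "call_structure": handle_call_structure,
    "memory_usage": handle_memory_usage,
}

def dispatch(raw_line):
    """Parse one self-describing record and route it by its type field."""
    doc = json.loads(raw_line)
    return HANDLERS[doc["type"]](doc)
```

Because each record carries its own type, the listener never needs per-feed parsing logic beyond this lookup, which keeps the insert path fast.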
  14. Pipeline Component: MongoDB
      - Goal: persistence
      - Did everything it was supposed to do
      - Allowed us to focus on our problem, not on MongoDB
      - Will use replica sets to make the data available to analysis
        servers
  17. Pipeline Component: Map/Reduce
      Analysis Program
      - “Function Call Structure” data type:
        - Calculation of T was better done in the listener, so it was
          moved there
        - Could scale the solution up, but could not scale it out
        - Map/Reduce was much faster and could scale out
      - “Memory Usage” data type: needed multiple map/reduce stages
  18. “Function Call Structure” Analysis
      Input columns: FnName, TotalTime, SrcFnAddress, PID (collection SF_CFA)
      map (emit: FnName, TotalTime, min_addr, NumOfCalls, PID, …)
      Shuffle stage:
        FnName: CreateRaceObjects {
          {TotalTime: 3, min_addr: 2, NumOfCalls: 1, PID: 1, …}
          {TotalTime: 7, min_addr: 3, NumOfCalls: 1, PID: 1, …}
          {TotalTime: 4, min_addr: 1, NumOfCalls: 1, PID: 1, …}
          {TotalTime: 6, min_addr: 1, NumOfCalls: 1, PID: 1, …}
        }
      reduce
      Output:
        {FnName: CreateRaceObjects, {TotalTime: 20, min_addr: 1, NumOfCalls: 2, PID: 1, …}}
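The map, shuffle, and reduce stages on the slide can be emulated in pure Python to show the logic end to end. In production this ran as MongoDB Map/Reduce with JavaScript map/reduce functions; this sketch only mirrors the field names shown above, and the exact reduce semantics (summing times and call counts, keeping the minimum address) are an assumption.

```python
from collections import defaultdict

def map_record(rec):
    """Map stage: emit the function name as key, as on the slide."""
    yield rec["FnName"], {
        "TotalTime": rec["TotalTime"],
        "min_addr": rec["SrcFnAddress"],
        "NumOfCalls": 1,
        "PID": rec["PID"],
    }

def reduce_values(values):
    """Reduce stage: sum times and call counts, keep the smallest address."""
    return {
        "TotalTime": sum(v["TotalTime"] for v in values),
        "min_addr": min(v["min_addr"] for v in values),
        "NumOfCalls": sum(v["NumOfCalls"] for v in values),
        "PID": values[0]["PID"],
    }

def map_reduce(records):
    groups = defaultdict(list)          # the shuffle stage: group by key
    for rec in records:
        for key, value in map_record(rec):
            groups[key].append(value)
    return {key: reduce_values(vals) for key, vals in groups.items()}
```

Because the reduce function is associative, MongoDB can apply it incrementally as each 1-second slice arrives, which is what makes the streaming update model work.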
  19. Pipeline Component: Presentation
      - Each feed type presents a different view of the user’s application,
        and requires a page design of its own
      - Tool of choice: DjangoNonRel
      - But the Python driver for MongoDB was sufficient for most work;
        DjangoNonRel was not really required
  23. High-throughput data analysis
      - A few examples of streaming data problems
      - A concrete problem we solved
      - How we solved it
      - Take-away lessons
  24. Progression of Solutions
  25. Endpoint Stack
      - Data Capture (Listener): custom, preferably written in C++ or Java
      - NoSQL Database: MongoDB, well suited for high-speed inserts
      - Calculation Platform: MongoDB Map/Reduce; could use Hadoop, but
        startup times are a concern
      - Presentation: Django Non-Rel
  26. About Us
      - Involved with Map/Reduce and NoSQL technologies on several platforms
      - Many students in J’s Database Systems class at WPI did a project on
        a NoSQL database
      - DataThinks.org is a new service of Early Stage IT, building and
        operating “Big Data” analytics services
      Thanks
