Data Analytic Technology Platforms: Options and Tradeoffs


  1. Data Analytic Technology Platforms: Options and Tradeoffs
     J Singh, January 7, 2014
  2. Do you have a “Big Data” problem?
     • Or do you have a big “data problem”?
     © DataThinks 2013-14
  3. Some Big Data problems (1)
     • Recommendations
  4. Some Big Data problems (2)
     • Financial Analysis
       – Really Big Data if we want Real Time analysis
  5. Some Big Data problems (3)
     • Internet Infrastructure Security Monitoring
  6. Other Big Data problems
     • Network graph problems (Social Media data)
     • Bioinformatics problems (Genomics data)
     • Physics/engineering problems (Sensor data)
     • …
     • Key characteristics
       1. Not much in common between problems
       2. Data too big to download or upload
       3. Data changes fast and requires near-real-time analysis
  7. Just Big “Data Problems” (JBDP)
     • Most problems on Kaggle
     • Popular data sets (e.g., Amazon, Kaggle, …, data sets)
       – If it can be downloaded,
       – If it doesn’t change very often, …
       – It’s a JBDP
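The rule on slide 7 can be written down as a small check. This is only a sketch of the slide's two criteria; the `Dataset` type, its field names, and the two example data sets are hypothetical and not from the talk.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical description of a data set (fields are assumptions)."""
    name: str
    downloadable: bool   # can a full copy be fetched in practice?
    changes_often: bool  # does it change faster than it can be re-fetched?


def is_jbdp(ds: Dataset) -> bool:
    """Slide 7's rule: downloadable and rarely changing means a big "data problem"."""
    return ds.downloadable and not ds.changes_often


# Hypothetical examples, chosen only to exercise both outcomes.
competition_file = Dataset("competition CSV", downloadable=True, changes_often=False)
security_feed = Dataset("live security event stream", downloadable=False, changes_often=True)

for ds in (competition_file, security_feed):
    label = "JBDP" if is_jbdp(ds) else "Big Data problem"
    print(f"{ds.name}: {label}")
```

Deciding what counts as "downloadable" or "changes often" is the real judgment call; the function only records the outcome of that judgment.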
  8. About us
     • Technology and analytics service based on Big Data problems, focused on small & medium companies
     • Analytics products
       – App Kinetics: application analytics for servicing users
       – Pop Kinetics: population analytics for targeting prospects
  9. Background for this talk
     • Experience building the “Kinetics” products
       – Harvest the kinetic energy of your data for the benefit of your business
     • Prior work
       – Like-you: an application that trolls through Facebook data to find users who like the same things you do
  10. Governing Principle for Platform Choices
      • Big Data is difficult to move
        – If you can move it easily, how big can it really be?
      • Processing needs to be brought closer to the data
        – Moving the data to processing is a losing proposition.
      • Connector solutions for a database won’t scale
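A back-of-the-envelope calculation shows why moving the data is the losing side of slide 10. The sketch below estimates how long a bulk transfer takes; the 100 Mbit/s link speed, the 80% efficiency factor, and the data sizes are illustrative assumptions, not figures from the talk.

```python
def transfer_hours(data_bytes: float, link_bits_per_sec: float,
                   efficiency: float = 0.8) -> float:
    """Hours needed to move data_bytes over a link.

    efficiency is an assumed fraction of the nominal bandwidth that is
    actually achieved (protocol overhead, contention, and so on).
    """
    if link_bits_per_sec <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    seconds = data_bytes * 8 / (link_bits_per_sec * efficiency)
    return seconds / 3600


GB = 10**9
TB = 10**12
LINK = 100e6  # assumed 100 Mbit/s connection

for label, size in [("10 GB", 10 * GB), ("1 TB", 1 * TB), ("100 TB", 100 * TB)]:
    print(f"{label:>7}: {transfer_hours(size, LINK):10.1f} hours")
```

Under these assumptions a terabyte takes more than a day to move and a hundred terabytes takes months, while shipping the analysis code to where the data lives takes seconds.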
  11. Implications of the Governing Principle
      • Architecture has to be optimized across the entire pipeline
        – Lessons learned:
          • The architecture is a giant jig-saw puzzle
          • Best of breed solutions may not fit!
          • Importance of caching in the pipeline
          • Vendor lock-in may be inevitable
        – Cost, Data Volume and Bandwidth are primary drivers
      • Different stacks for different applications
        – App Kinetics: MongoDB-based stack
        – Pop Kinetics: S3, Elastic Map Reduce-based stack
        – Similarities: Google App Engine, Google Map Reduce
  12. Governing Principle in Action
      • Data Collection
        – App Kinetics: Custom “probes”
        – Pop Kinetics: Facebook API
        – Like-You: Facebook API
      • Data Storage
        – App Kinetics: MongoDB
        – Pop Kinetics: Amazon S3
        – Like-You: Google Datastore
      • Analysis
        – App Kinetics: Mongo M/R (JS), PyMongo (Python)
        – Pop Kinetics: Amazon EMR (Hadoop, Python)
        – Like-You: Google App Engine M/R (Python)
      • Visualization
        – App Kinetics: HTML+D3 (JS)
        – Pop Kinetics: Text
        – Like-You: HTML+JS
      • Recommendations
        – Text
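All three analysis stacks on slide 12 rely on the map/reduce pattern. The sketch below shows that pattern in plain Python, with no MongoDB, Hadoop, or App Engine involved; the `LIKES` records, the function names, and the in-memory `run_map_reduce` driver are hypothetical stand-ins for what those platforms do at scale.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator

# Hypothetical input records: (user, liked_item) pairs.
LIKES = [
    ("ann", "jazz"),
    ("bob", "jazz"),
    ("ann", "chess"),
    ("cat", "jazz"),
    ("bob", "hiking"),
]


def map_like(record: tuple) -> Iterator[tuple]:
    """Map step: emit (item, 1) for every like."""
    _user, item = record
    yield item, 1


def reduce_counts(key: str, values: list) -> tuple:
    """Reduce step: add up the counts emitted for one item."""
    return key, sum(values)


def run_map_reduce(records: Iterable, mapper: Callable, reducer: Callable) -> dict:
    """Toy single-process driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


counts = run_map_reduce(LIKES, map_like, reduce_counts)
print(counts)
```

On a real platform the mapper and reducer run next to the stored data and only the small per-key results travel over the network, which is how the pattern serves the governing principle.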
  13. The decision-making process
      • An iterative process (like solving a jig-saw puzzle)
        – Not linear or formulaic
      • What is the objective?
        – Discovery?
          • If there is a market?
          • If the concept is feasible?
        – Time to market?
        – Hitting a cost target?
        – A scalable solution?
        – Minimizing lock-in?
      • About the data
        – Volume
          • Rate of Growth
        – Velocity
        – Variety
        – Format
        – Location, location, …
  14. Thank you
      • J Singh
        – Principal, DataThinks
          • j.singh@datathinks.org
        – Adj. Prof, WPI
