Data Analytic Technology Platforms: Options and Tradeoffs
- 2. Do you have a “Big Data” problem?
• Or do you have a big “data problem”?
© DataThinks 2013-14
- 3. Some Big Data problems (1)
• Recommendations
- 4. Some Big Data problems (2)
• Financial Analysis
– Really Big Data if we want Real Time analysis
- 5. Some Big Data problems (3)
• Internet Infrastructure Security Monitoring
- 6. Other Big Data problems
• Network graph problems (Social Media data)
• Bioinformatics problems (Genomics data)
• Physics/engineering problems (Sensor data)
•…
• Key characteristics
1. Not much in common between problems
2. Data too big to download or upload.
3. Data changes fast, requires near-real-time analysis.
- 7. Just Big “Data Problems” (JBDP)
• Most problems on Kaggle
• Popular data sets (e.g., Amazon, Kaggle, …, data sets)
– If it can be downloaded,
– If it doesn’t change very often, …
– It’s a JBDP
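The rule of thumb on this slide can be written down as a small predicate. This is only a sketch: the `DataSet` class, its field names, and the `is_jbdp` function are hypothetical names introduced for illustration, and deciding what counts as "downloadable" or "changes often" is left to the reader.

```python
from dataclasses import dataclass


@dataclass
class DataSet:
    """Hypothetical description of a data set (not from the slides)."""
    name: str
    downloadable: bool   # can the whole data set be copied to where you work?
    changes_often: bool  # does it change faster than you can re-fetch it?


def is_jbdp(ds: DataSet) -> bool:
    """Return True for a 'Just Big Data Problem': downloadable and fairly static."""
    return ds.downloadable and not ds.changes_often


competition_data = DataSet("competition download", downloadable=True, changes_often=False)
live_feed = DataSet("live sensor feed", downloadable=False, changes_often=True)
```

Here `is_jbdp(competition_data)` is true and `is_jbdp(live_feed)` is false, which matches the distinction the slide draws.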
- 8. About us
• Technology and analytics service based on Big Data problems, focused on small & medium companies
• Analytics products
– App Kinetics – Application analytics for servicing users
– Pop Kinetics – Population analytics for targeting prospects
- 9. Background for this talk
• Experience building the “Kinetics” products
– Harvest the kinetic energy of your data for the benefit of your business
• Prior work
– Like-you: an application that trolls through Facebook data to find users who like the same things you do
- 10. Governing Principle for Platform Choices
• Big Data is difficult to move
– If you can move it easily, how big can it really be?
• Processing needs to be brought closer to the data
– Moving the data to processing is a losing proposition.
• Connector solutions for a database won’t scale
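A back-of-the-envelope calculation shows why moving Big Data is a losing proposition. The sketch below assumes an ideal link with no protocol overhead; the function name `transfer_hours` and the example figures (10 TB, 100 Mbit/s) are illustrative assumptions, not numbers from the talk.

```python
def transfer_hours(data_bytes: float, bandwidth_bits_per_s: float) -> float:
    """Hours needed to move data_bytes over a link, ignoring all overhead."""
    seconds = data_bytes * 8 / bandwidth_bits_per_s
    return seconds / 3600


TB = 10**12  # decimal terabyte, in bytes

# Hypothetical example: 10 TB over a sustained 100 Mbit/s connection.
hours = transfer_hours(10 * TB, 100e6)
days = hours / 24
```

Under these assumptions the copy takes about 222 hours, a little over nine days, before any analysis starts. If the data changes during that time, the copy is stale on arrival.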
- 11. Implications of the Governing Principle
• Architecture has to be optimized across the entire pipeline
– Lessons learned:
  • The architecture is a giant jig-saw puzzle
  • Best of breed solutions may not fit!
  • Importance of caching in the pipeline
  • Vendor lock-in may be inevitable
– Cost, Data Volume and Bandwidth are primary drivers
• Different stacks for different applications
– App Kinetics: MongoDB-based stack
– Pop Kinetics: S3, Elastic Map Reduce-based stack
– Similarities: Google App Engine, Google Map Reduce
- 12. Governing Principle in Action
• Data Collection
– App Kinetics: Custom “probes”
– Pop Kinetics: Facebook API
– Like-You: Facebook API
• Data Storage
– App Kinetics: MongoDB
– Pop Kinetics: Amazon S3
– Like-You: Google Datastore
• Analysis
– App Kinetics: Mongo M/R (JS), PyMongo (Python)
– Pop Kinetics: Amazon EMR (Hadoop, Python)
– Like-You: Google App Engine M/R (Python)
• Visualization
– App Kinetics: HTML+D3 (JS)
– Pop Kinetics: Text
– Like-You: HTML+JS
• Recommendations
– Text
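All three stacks rely on a Map/Reduce step for analysis. The following is a minimal in-memory sketch of that pattern in plain Python, applied to a Like-You style question (which users share likes). It does not use the MongoDB, Amazon EMR, or Google App Engine APIs, and the sample data and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical input: user -> set of things that user likes.
likes = {
    "ann": {"jazz", "hiking", "chess"},
    "bob": {"jazz", "chess"},
    "cat": {"hiking"},
}


def map_phase(user, items):
    """Map: emit one (item, user) pair per like."""
    for item in items:
        yield item, user


def shuffle(pairs):
    """Group mapped values by key, as a Map/Reduce framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(item, users):
    """Reduce: emit ((user_a, user_b), 1) for every pair sharing this item."""
    for a, b in combinations(sorted(users), 2):
        yield (a, b), 1


def shared_likes(likes_by_user):
    """Count how many liked items each pair of users has in common."""
    mapped = (pair
              for user, items in likes_by_user.items()
              for pair in map_phase(user, items))
    counts = defaultdict(int)
    for item, users in shuffle(mapped).items():
        for pair, one in reduce_phase(item, users):
            counts[pair] += one  # final sum over the emitted pairs
    return dict(counts)


result = shared_likes(likes)
```

With the sample data, `result` maps `("ann", "bob")` to 2 and `("ann", "cat")` to 1. In a real deployment the map and reduce functions would run next to the stored data, which is the point of the governing principle; here everything runs in one process purely for illustration.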
- 13. The decision-making process
• An iterative process (like solving a jig-saw puzzle)
– Not linear or formulaic
• What is the objective?
– Discovery?
  • If there is a market?
  • If the concept is feasible?
– Time to market?
– Hitting a cost target?
– A scalable solution?
– Minimizing lock-in?
• About the data
– Volume
  • Rate of Growth
– Velocity
– Variety
– Format
– Location, location, …
- 14. Thank you
• J Singh
– Principal, DataThinks
• j.singh@datathinks.org
– Adj. Prof, WPI