Data Analytic Technology Platforms
Options and Tradeoffs
J Singh
January 7, 2014
Do you have a “Big Data” problem?
• Or do you have a big “data problem”?

© DataThinks 2013-14

Some Big Data problems (1)
• Recommendations

Some Big Data problems (2)
• Financial Analysis
– Really Big Data if we want Real Time analysis

Some Big Data problems (3)
• Internet Infrastructure Security Monitoring

Other Big Data problems
• Network graph problems (Social Media data)
• Bioinformatics problems (Genomics data)
• Physics/engineering problems (Sensor data)
•…

• Key characteristics
1. Not much in common between problems
2. Data too big to download or upload.
3. Data changes fast, requires near-real-time analysis.

Just Big “Data Problems” (JBDP)
• Most problems on Kaggle
• Popular data sets (e.g., Amazon, Kaggle, …, data sets)
– If it can be downloaded,
– If it doesn’t change very often, …
– It’s a JBDP

About us
• Technology and analytics service based on Big Data problems, focused on small & medium companies
• Analytics products
– App Kinetics – Application analytics for servicing users
– Pop Kinetics – Population analytics for targeting prospects

Background for this talk
• Experience building the “Kinetics” products
– Harvest the kinetic energy of your data for the benefit of your business

• Prior work
– Like-You: an application that trolls through Facebook data to find users who like the same things you do

Governing Principle for Platform Choices
• Big Data is difficult to move
– If you can move it easily, how big can it really be?

• Processing needs to be brought closer to the data
– Moving the data to processing is a losing proposition.

• Connector solutions for a database won’t scale
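
To make the "difficult to move" point concrete, here is a minimal back-of-the-envelope sketch of how long a bulk transfer takes. The function name, the constants, and the 80% link-efficiency default are illustrative assumptions, not figures from the talk.

```python
TB = 10**12    # bytes in a terabyte (decimal)
MBPS = 10**6   # bits per second in one megabit per second


def transfer_hours(size_bytes, bandwidth_bits_per_sec, efficiency=0.8):
    """Estimate the hours needed to move size_bytes over a network link.

    efficiency is an assumed fraction of the nominal bandwidth that is
    actually achieved; 0.8 is a placeholder, not a measured value.
    """
    if bandwidth_bits_per_sec <= 0:
        raise ValueError("bandwidth must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = size_bytes * 8 / (bandwidth_bits_per_sec * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    hours = transfer_hours(10 * TB, 100 * MBPS)
    print(f"10 TB over 100 Mbit/s: {hours:.0f} hours ({hours / 24:.1f} days)")
```

Under these assumptions, 10 TB over a 100 Mbit/s link needs about 10^6 seconds, or roughly 11.6 days. If the data changes faster than it can be moved, the analysis has to run where the data already lives.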

Implications of the Governing Principle
• Architecture has to be optimized across the entire pipeline
– Lessons learned:
• The architecture is a giant jig-saw puzzle
• Best of breed solutions may not fit!
• Importance of caching in the pipeline
• Vendor lock-in may be inevitable
– Cost, Data Volume and Bandwidth are primary drivers

• Different stacks for different applications
– App Kinetics: MongoDB-based stack
– Pop Kinetics: S3, Elastic Map Reduce-based stack
– Similarities: Google App Engine, Google Map Reduce

Governing Principle in Action
Function         App Kinetics                      Pop Kinetics                 Like-You
Data Collection  Custom “probes”                   Facebook API                 Facebook API
Data Storage     MongoDB                           Amazon S3                    Google Datastore
Analysis         Mongo M/R (JS), PyMongo (Python)  Amazon EMR (Hadoop, Python)  Google App Engine M/R (Python)
Visualization    HTML+D3 (JS)                      Text                         HTML+JS

Recommendations: Text
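
All three stacks above use some form of map/reduce for the analysis step. The pattern can be sketched in plain Python on the Like-You problem (finding users who like the same things you do). This is an illustration only, not the author's code: the function names and the sample data are hypothetical, and a real deployment would run the map and reduce steps next to the data store (MongoDB, EMR, or App Engine) rather than in a single process.

```python
from collections import defaultdict
from itertools import chain


def map_likes(user, pages):
    """Map step: emit one (page, user) pair for every page a user likes."""
    return [(page, user) for page in pages]


def reduce_by_key(pairs):
    """Shuffle and reduce step: collect the set of users for each page."""
    grouped = defaultdict(set)
    for page, user in pairs:
        grouped[page].add(user)
    return dict(grouped)


def users_like(target, likes):
    """Count how many liked pages each other user shares with target."""
    pairs = chain.from_iterable(
        map_likes(user, pages) for user, pages in likes.items()
    )
    by_page = reduce_by_key(pairs)
    shared = defaultdict(int)
    for users in by_page.values():
        if target in users:
            for other in users - {target}:
                shared[other] += 1
    return dict(shared)


# Hypothetical sample data, not taken from the talk.
SAMPLE_LIKES = {
    "ann": ["jazz", "chess", "hiking"],
    "bob": ["jazz", "chess"],
    "cat": ["hiking"],
    "dan": ["opera"],
}

if __name__ == "__main__":
    print(users_like("ann", SAMPLE_LIKES))
```

For the sample data, "ann" shares two likes with "bob" and one with "cat". Because the map and reduce functions are independent of where the pairs are stored, the same logic can in principle be ported to whichever engine sits closest to the data.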

The decision-making process
• An iterative process (like solving a jig-saw puzzle)
– Not linear or formulaic

• What is the objective?
– Discovery?
• If there is a market?
• If the concept is feasible?
– Time to market?
– Hitting a cost target?
– A scalable solution?
– Minimizing lock-in?

• About the data
– Volume
• Rate of Growth
– Velocity
– Variety
– Format
– Location, location, …

Thank you
• J Singh
– Principal, DataThinks
– Adj. Prof, WPI
• j.singh@datathinks.org
