Google App Engine
A Big Data Laboratory?
J Singh, Early Stage IT
March 20, 2012
© J Singh, 2011
App Engine as a Big Data Laboratory?
• Why bother? Why not use Hadoop?
• Evaluating App Engine as a Big Data Laboratory
– Loading Data
– Analytics Capabilities
– Visualization Capabilities
• Conclusions
Why Bother? Why not Hadoop?
• No install and configuration required
– Focus on the task: Analytics and Visualization
– Use the technology that powers Google Earth and Google Finance
• Works with Google Datastore
– Makes sense if your data is already there
• No import/export of data necessary
• But a purely 'low-level' programming environment
– Write Map and Reduce functions in Python / Java
– No Pig, Hive, …
• Is this story for real? We wanted to find out.
Loading Data into GAE
• What? No native OS environment to work in?
– No OS commands, no file system accessible to the programmer
– Data Prep must be done elsewhere.
• But other options exist
1. Upload a file into Blobstore through an HTTP request
• Max object size 2GB, max get/put in one call: 1MB.
• Process into Datastore entities using the BlobstoreLineInputReader or BlobstoreZipInputReader classes.
2. Use remote_api to upload CSV files
• It's painful
– Only needs to be done once, we hope
– Or we need to set up a process for staging and feeding the data
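The 1MB-per-call limit means a large uploaded file has to be read in pieces and reassembled into records. The following is a minimal sketch in plain Python 3, not the Blobstore API: `iter_lines` and `rows_to_entities` are hypothetical helpers, an in-memory `io.BytesIO` stands in for the blob, and plain dicts stand in for Datastore entities. It assumes CSV fields contain no embedded newlines.

```python
import csv
import io

MAX_FETCH = 1024 * 1024  # 1MB per call, the limit quoted on the slide


def iter_lines(blob, chunk_size=MAX_FETCH):
    """Yield complete text lines from a file-like object read in bounded chunks."""
    pending = b""
    while True:
        chunk = blob.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        lines = pending.split(b"\n")
        pending = lines.pop()  # the last piece may be an incomplete line
        for line in lines:
            yield line.decode("utf-8")
    if pending:
        yield pending.decode("utf-8")


def rows_to_entities(lines, field_names):
    """Turn CSV lines into dicts (stand-ins for Datastore entities)."""
    for row in csv.reader(lines):
        if row:
            yield dict(zip(field_names, row))


# Usage: a tiny chunk size exercises the chunk-boundary handling.
blob = io.BytesIO(b"city,temp\nBoston,41\nAustin,75\n")
lines = iter_lines(blob, chunk_size=7)
header = next(csv.reader([next(lines)]))
entities = list(rows_to_entities(lines, header))
```

In a real staging process the dicts would be written to the Datastore in batches; that step is left out here because it depends on the entity model.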
Data Analysis: NumPy and SciPy
• NumPy and SciPy libraries using the traditional computing
model (not Map/Reduce) include:
– Array and Matrix manipulation
– Optimization algorithms, e.g., curve fitting, linear regression,
multi-variate regression.
– Multithreading (for embarrassingly parallel problems)
• Replace map(…) with parallel_map(…).
– map is a Python primitive
– parallel_map is a helper from the SciPy Cookbook's multithreading recipe (handythread), not part of NumPy itself
– Other scientific algorithms, e.g., Kalman Filtering, Signal
smoothing, Markov Chains.
• NumPy and SciPy depend on the Python 2.7 runtime
– Enabled in Fall, 2011.
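To make the map(…) to parallel_map(…) swap concrete, here is a sketch of what such a helper could look like using only the Python 3 standard library. This `parallel_map` is an illustration written for this example, not the cookbook's implementation; the `threads` parameter name and the `slow_square` function are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_map(fn, items, threads=3):
    """Apply fn to every item using a pool of worker threads.

    Results come back in input order, like the built-in map().
    """
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(fn, items))


def slow_square(x):
    # Stand-in for an expensive, independent computation.
    return x * x


serial = list(map(slow_square, range(6)))
parallel = parallel_map(slow_square, range(6))
```

Threads only pay off when the work releases Python's global interpreter lock, as I/O and many NumPy array operations do; for pure-Python arithmetic like `slow_square` the two calls give the same answer at about the same speed.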
Data Analysis: MapReduce
• Input Reader
– Several provided by GAE, can write your own
• Map function: Written by Programmer
• Shuffle function: Provided
– Can write your own overrides for partitioning (sharding) and comparison (used in sort)
• Reduce function: Written by Programmer
– Can be skipped if not needed
• Output Writer
– Several provided by GAE, can write your own
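The stages listed above can be sketched in a few lines of plain Python 3. This is an in-memory illustration of the map, shuffle and reduce roles using word count, not the GAE MapReduce library; `run_job`, `shuffle` and the word-count functions are hypothetical names, and a list of strings stands in for the input reader.

```python
from collections import defaultdict


def word_count_map(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group values by key and sort the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def word_count_reduce(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(records, map_fn, reduce_fn):
    """Chain the stages: records -> map -> shuffle -> reduce -> output list."""
    pairs = (pair for record in records for pair in map_fn(record))
    return [reduce_fn(key, values) for key, values in shuffle(pairs)]


counts = run_job(["big data", "Big lab"], word_count_map, word_count_reduce)
```

Only the map and reduce functions are specific to the problem; the shuffle and the driver are generic, which is why the framework can provide them.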
Data Analysis: Pipeline API
• Based on Python Generator functions
• Allows chaining of MapReduce jobs
– Primitives for setting up various types of chains
• MapreducePipeline (previous slide) is just one type of pipeline
• Available for Python or Java
– Python side better documented
Split and Merge example
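# Assumes the standalone appengine-pipeline package; bPipe is another Pipeline subclass (not shown).
from pipeline import common, pipeline

class aPipe(pipeline.Pipeline):
    def run(self, e_kind, prop_name, *value_list):
        all_bs = []
        # Split: start one bPipe child per value; each yield hands back a future.
        for v in value_list:
            stage = yield bPipe(e_kind, prop_name, v)
            all_bs.append(stage)
        # Merge: pass all the child futures to a single combining stage.
        yield common.Append(*all_bs)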
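To show why generator functions suit this style, here is a toy driver in plain Python 3. It is not the Pipeline API: `Future`, `Stage` and `run_pipeline` are hypothetical names, and the driver runs each stage immediately in-process, whereas the real library schedules child stages as separate tasks. What it keeps is the shape of the example above: each `yield` hands a child stage to the driver and receives a placeholder for that child's result.

```python
class Future:
    """Placeholder for a child stage's result, filled in by the driver."""

    def __init__(self):
        self.value = None


class Stage:
    """A deferred call: a function plus arguments, which may be Futures."""

    def __init__(self, fn, *args):
        self.fn = fn
        self.args = args


def resolve(arg):
    """Replace a Future by its value; pass other arguments through."""
    return arg.value if isinstance(arg, Future) else arg


def run_pipeline(generator):
    """Drive a generator pipeline; return the result of the last stage yielded."""
    future = None
    try:
        stage = next(generator)
        while True:
            future = Future()
            future.value = stage.fn(*[resolve(a) for a in stage.args])
            stage = generator.send(future)  # hand the placeholder back
    except StopIteration:
        pass
    return future.value if future else None


def split_and_merge(values):
    parts = []
    for v in values:                      # split: one child stage per value
        part = yield Stage(lambda x: x * x, v)
        parts.append(part)
    yield Stage(lambda *xs: list(xs), *parts)  # merge the children's results


result = run_pipeline(split_and_merge([1, 2, 3]))
```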
Data Visualization
• App Engine supports multiple web frameworks for serving data
directly from the Datastore into an HTML5 browser:
– Django, Jinja2, CherryPy, …
• Options:
– jQuery Visualize
– Google Visualization API
• Including MotionCharts
– Hans Rosling's Visualization API
– Check out his TED talk
• Conclusion:
– A rich set of facilities for visualization and taking action
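On the server side, feeding the Google Visualization API mostly comes down to emitting JSON in its DataTable layout: a `cols` list describing each column and a `rows` list of cells. A minimal Python 3 sketch follows; `to_datatable` is a hypothetical helper, the sample rows are made up for illustration, and plain dicts stand in for Datastore entities.

```python
import json


def to_datatable(rows, columns):
    """Serialize rows as Google Visualization DataTable JSON.

    columns is a list of (key, label, type) tuples, where type is a
    DataTable column type such as "string" or "number".
    """
    table = {
        "cols": [
            {"id": key, "label": label, "type": col_type}
            for key, label, col_type in columns
        ],
        "rows": [
            {"c": [{"v": row[key]} for key, _, _ in columns]}
            for row in rows
        ],
    }
    return json.dumps(table)


# Illustrative data only.
sample_rows = [
    {"year": "2010", "users": 120},
    {"year": "2011", "users": 340},
]
payload = to_datatable(
    sample_rows,
    [("year", "Year", "string"), ("users", "Users", "number")],
)
```

A request handler would return `payload` to the browser, where the charting script builds a DataTable from it.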
Decision Factors
Usage and Discussion
• Proof of Concept or Demo in GAE
– Need a process for Data Loading
– But saves on having to do Hadoop setup
– Absence of Pig/Hive may be a limiting factor
– Advantage in Visualization
– Better security and isolation than Hadoop
• Production in GAE
– Analyze cost before committing
– Lock-in risk?
• Production elsewhere
– Good semantic match between Datastore and HBase
– Need to do Hadoop setup and operation
Thank you
• J Singh
– President, Early Stage IT
• Technology Services and Strategy for Startups
• DataThinks.org is a new service of Early Stage IT
– “Big Data” analytics solutions
