Big Data Laboratory

Google App Engine
A Big Data Laboratory?
J Singh, Early Stage IT
March 20, 2012

© J Singh, 2011
App Engine as a Big Data Laboratory?
• Why bother? Why not use Hadoop?
• Evaluating App Engine as a Big Data Laboratory
– Loading Data
– Analytics Capabilities
– Visualization Capabilities
• Conclusions

Why Bother? Why not Hadoop?
• No install and configuration required
– Focus on the task: Analytics and Visualization
– Use the technology that powers Google Earth and Google Finance
• Works with Google Datastore
– Makes sense if your data is already there
• No import/export of data necessary
• But a purely 'low-level' programming environment
– Write Map and Reduce functions in Python / Java
– No Pig, Hive, …
• Is this story for real? We wanted to find out.

Loading Data into GAE
• What? No native OS environment to work in?
– No OS commands, no file system accessible to the programmer
– Data Prep must be done elsewhere.
• But other options exist
1. Upload a file into Blobstore through an HTTP request
• Max object size 2GB, max get/put in one call: 1MB.
• Process into Datastore entities using the BlobstoreLineInputReader or BlobstoreZipInputReader classes.
2. Use remote_api to upload CSV files
• It's painful
– Only needs to be done one-time, we hope
– Or we need to set up a process for staging and feeding the data
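The 1MB-per-call limit above means a large blob has to be read in pieces. A minimal sketch of the arithmetic in plain Python: chunk_ranges and read_in_chunks are hypothetical helpers (not part of App Engine), and the default chunk size of 1,000,000 bytes is an assumption chosen to stay under the limit.

```python
def chunk_ranges(total_size, chunk_size=1000000):
    """Yield inclusive (start, end) byte offsets that cover total_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    start = 0
    while start < total_size:
        end = min(start + chunk_size, total_size) - 1
        yield (start, end)
        start = end + 1


def read_in_chunks(read_range, total_size, chunk_size=1000000):
    """Reassemble a blob by calling read_range(start, end) once per chunk."""
    return b"".join(
        read_range(start, end)
        for start, end in chunk_ranges(total_size, chunk_size)
    )


# Stand-in for a real blob: slice an in-memory byte string (made-up sample data).
blob = b"id,name\n1,alpha\n2,beta\n"


def fetch(start, end):
    """Return bytes start..end inclusive, like a ranged read would."""
    return blob[start:end + 1]


assert read_in_chunks(fetch, len(blob), chunk_size=8) == blob
```

On App Engine, read_range would presumably wrap a ranged Blobstore read; here it only slices a byte string so the sketch runs anywhere.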

Data Analysis: NumPy and SciPy
• The NumPy and SciPy libraries use the traditional computing model (not Map/Reduce) and include:
– Array and Matrix manipulation
– Optimization algorithms, e.g., curve fitting, linear regression, multi-variate regression.
– Multithreading (for embarrassingly parallel problems)
• Replace map(…) with parallel_map(…).
– map is a Python built-in
– parallel_map is not part of NumPy itself; it is a helper from the SciPy Cookbook's handythread recipe
– Other scientific algorithms, e.g., Kalman Filtering, Signal smoothing, Markov Chains.
• NumPy and SciPy depend on the Python 2.7 runtime
– Enabled on App Engine in Fall 2011.
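A minimal sketch of the map versus parallel_map idea, using only the modern Python 3 standard library rather than NumPy or SciPy. The parallel_map below is a stand-in built on a thread pool, fit_line is a hypothetical pure-Python least-squares fit, and the datasets are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept.

    Assumes at least two points with distinct x values.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def parallel_map(func, items, threads=3):
    """Apply func to every item using a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(func, items))


# Each dataset is independent, so the fits are embarrassingly parallel.
datasets = [
    [(0, 1), (1, 3), (2, 5)],  # points on y = 2x + 1
    [(0, 0), (2, 1), (4, 2)],  # points on y = 0.5x
]
serial = list(map(fit_line, datasets))
parallel = parallel_map(fit_line, datasets)
assert serial == parallel
```

The only change between the serial and parallel versions is the call that replaces map, which is the point of the bullet above.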

Data Analysis: MapReduce
• Input Reader
– Several provided by GAE, can write your own
• Map function: Written by Programmer
• Shuffle function: Provided
– Can write your own overrides for partitioning (sharding) and comparison (used in sorting)
• Reduce function: Written by Programmer
– Can be skipped if not needed
• Output Writer
– Several provided by GAE, can write your own
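The stages above can be modeled in a few lines of plain Python. This is an in-memory sketch of the data flow only, not the App Engine MapReduce library; every function name here is hypothetical, and a list of strings stands in for the input reader.

```python
from collections import defaultdict


def map_words(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Shuffle: group values by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_counts(key, values):
    """Reduce: collapse all values for one key into a single result."""
    yield (key, sum(values))


def run_mapreduce(records, mapper, reducer):
    """Feed records through map, shuffle and reduce; return the output list."""
    mapped = (pair for record in records for pair in mapper(record))
    output = []
    for key, values in shuffle(mapped):
        output.extend(reducer(key, values))
    return output


lines = ["big data lab", "Big Data on App Engine"]  # made-up input records
result = run_mapreduce(lines, map_words, reduce_counts)
```

Only map_words and reduce_counts are specific to word counting; the shuffle and driver are generic, which mirrors the split between programmer-written and provided stages.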

Data Analysis: Pipeline API
• Based on Python Generator functions
• Allows chaining of map reduce jobs
– Primitives for setting up various types of chains
• MapreducePipeline (prev page) was just one type of pipeline
• Available for Python or Java
– Python side better documented
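Split and Merge example
    # Assumes the App Engine Pipeline library is importable as `pipeline`;
    # bPipe is another Pipeline subclass defined elsewhere.
    import pipeline
    from pipeline import common

    class aPipe(pipeline.Pipeline):
        def run(self, e_kind, prop_name, *value_list):
            all_bs = []
            # Split: start one child bPipe per value; each yield returns a future.
            for v in value_list:
                stage = yield bPipe(e_kind, prop_name, v)
                all_bs.append(stage)
            # Merge: collect the children's outputs into a single list.
            yield common.Append(*all_bs)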
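To show why generator functions suit this style, here is a toy driver in plain Python 3. It is a synchronous sketch and not the Pipeline API: the real library hands back futures and schedules children asynchronously, while run_stage below simply runs each yielded child and sends its result back. All names and the sample records are hypothetical.

```python
def select_by_value(records, prop_name, value):
    """Child stage: keep the records whose property equals value."""
    return [r for r in records if r.get(prop_name) == value]


def split_and_merge(records, prop_name, *values):
    """Parent stage: yield one child request per value, then merge the results."""
    parts = []
    for value in values:
        part = yield (select_by_value, (records, prop_name, value))
        parts.append(part)
    return [record for part in parts for record in part]


def run_stage(stage):
    """Drive a generator stage: run each yielded child and send its result back."""
    try:
        request = next(stage)
        while True:
            func, args = request
            request = stage.send(func(*args))
    except StopIteration as done:
        # The generator's return value is the merged output.
        return done.value


records = [
    {"name": "a", "color": "red"},
    {"name": "b", "color": "blue"},
    {"name": "c", "color": "green"},
    {"name": "d", "color": "red"},
]
merged = run_stage(split_and_merge(records, "color", "red", "green"))
```

The parent never calls its children directly; it only describes them, which is what lets a framework decide where and when each child runs.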

Data Visualization
• App Engine supports multiple web frameworks for serving data directly from the Datastore into an HTML5 Browser:
– Django, Jinja2, CherryPy, …
• Options:
– jQuery Visualize
– Google Visualization API
• Including MotionCharts
– Hans Rosling's Visualization API
– Check out his TED talk
• Conclusion:
– A rich set of facilities for visualization and taking action
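Serving data to a browser charting library usually comes down to emitting JSON in the shape the library expects. The sketch below builds the cols/rows layout that, as far as we understand it, the Google Visualization API accepts for a DataTable. The to_datatable helper, the field names and the sample rows are all hypothetical; on App Engine the rows would come from a Datastore query.

```python
import json


def to_datatable(rows, columns):
    """Build a DataTable-style dict from a list of row dicts.

    columns is a list of (field, label, type) tuples, in display order.
    """
    return {
        "cols": [
            {"id": field, "label": label, "type": col_type}
            for field, label, col_type in columns
        ],
        "rows": [
            {"c": [{"v": row.get(field)} for field, _, _ in columns]}
            for row in rows
        ],
    }


# Made-up sample data for illustration only.
rows = [{"year": "2010", "visits": 120}, {"year": "2011", "visits": 340}]
columns = [("year", "Year", "string"), ("visits", "Visits", "number")]

# The JSON string a request handler would return to the page.
payload = json.dumps(to_datatable(rows, columns))
```

Keeping the conversion in one small function means the same handler can feed any of the charting options listed above once the output shape is adjusted.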

Decision Factors
Usage / Discussion
• Proof of Concept or Demo in GAE
– Need a process for Data Loading
– But saves on having to do Hadoop setup
– Absence of Pig/Hive may be a limiting factor
– Advantage in Visualization
– Better security and isolation than Hadoop
• Production in GAE
– Analyze cost before committing
– Lock-in risk?
• Production elsewhere
– Good semantic match between Datastore and HBase.
– Need to do Hadoop setup and operation

Thank you
• J Singh
– President, Early Stage IT
• Technology Services and Strategy for Startups
• DataThinks.org is a new service of Early Stage IT
– “Big Data” analytics solutions
