Big Data Laboratory


Does it make sense to use Google App Engine as a quick prototyping environment for Big Data use cases? It would avoid all the hassles of setting up Hadoop and its bestiary.

The answer is a definite "maybe".

Published in: Technology

  1. Google App Engine: A Big Data Laboratory?
     J Singh, Early Stage IT
     March 20, 2012
  2. App Engine as a Big Data Laboratory?
     • Why bother? Why not use Hadoop?
     • Evaluating App Engine as a Big Data Laboratory
       – Loading Data
       – Analytics Capabilities
       – Visualization Capabilities
     • Conclusions
     © J Singh, 2011
  3. Why Bother? Why not Hadoop?
     • No install and configuration required
       – Focus on the task: Analytics and Visualization
       – Use the technology that powers Google Earth and Google Finance
     • Works with Google Datastore
       – Makes sense if your data is already there
         • No import/export of data necessary
     • But a purely "low-level" programming environment
       – Write Map and Reduce functions in Python / Java
       – No Pig, Hive, …
     • Is this story for real? We wanted to find out.
  4. Loading Data into GAE
     • What? No native OS environment to work in?
       – No OS commands, no file system accessible to the programmer
       – Data prep must be done elsewhere.
     • But other options exist
       1. Upload a file into Blobstore through an HTTP request
          • Max object size: 2 GB; max get/put in one call: 1 MB.
          • Process into Datastore entities using the BlobstoreInputReader or BlobstoreZipInputReader classes.
       2. Use remote_api to upload CSV files
     • It's painful
       – Only needs to be done once, we hope
       – Or we need to set up a process for staging and feeding the data
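The 1 MB per-call limit means a large uploaded file has to be read in slices and stitched back into records. The following is a minimal sketch of that idea in plain Python, not the App Engine API: `read_in_chunks` and `lines_from_chunks` are hypothetical helpers, an in-memory `io.BytesIO` stands in for the blob, and the CSV is assumed to have no quoted fields containing line breaks.

```python
import csv
import io

MAX_CALL_BYTES = 1024 * 1024  # per-call limit quoted on the slide (1 MB)


def read_in_chunks(stream, chunk_size=MAX_CALL_BYTES):
    """Yield successive byte chunks, each no larger than chunk_size."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


def lines_from_chunks(chunks, encoding="utf-8"):
    """Reassemble text lines from chunks; a line may span a chunk boundary."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Everything before the last newline is complete; keep the remainder.
        *complete, buffer = buffer.split(b"\n")
        for line in complete:
            yield line.decode(encoding)
    if buffer:
        yield buffer.decode(encoding)


# Hypothetical sample data standing in for an uploaded CSV file.
blob = io.BytesIO(b"city,temp\nBoston,41\nAustin,72\n")
rows = list(csv.DictReader(lines_from_chunks(read_in_chunks(blob, chunk_size=7))))
```

Each resulting row is a dictionary keyed by the header line, which is roughly the shape a Datastore entity would be built from.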
  5. Data Analysis: NumPy and SciPy
     • NumPy and SciPy libraries using the traditional computing model (not Map/Reduce) include:
       – Array and Matrix manipulation
       – Optimization algorithms, e.g., curve fitting, linear regression, multivariate regression.
       – Multithreading (for embarrassingly parallel problems)
         • Replace map(…) with parallel_map(…).
           – map is a Python primitive
           – parallel_map is not a built-in NumPy function; a version is published as the handythread recipe in the SciPy Cookbook
       – Other scientific algorithms, e.g., Kalman filtering, signal smoothing, Markov chains.
     • NumPy and SciPy depend on the Python 2.7 runtime
       – Enabled in Fall 2011.
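To illustrate the "replace map with parallel_map" idea, here is a minimal stand-in written with the Python 3 standard library. This `parallel_map` is a hypothetical helper defined for the example, not a library function, and `mean_square` and the sample batches are invented for illustration. Threads only speed things up when the work releases the interpreter lock (as many NumPy array operations and I/O calls do).

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_map(func, items, threads=4):
    """Apply func to every item on a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(func, items))


def mean_square(samples):
    """Toy per-batch computation: the mean of the squared samples."""
    return sum(x * x for x in samples) / len(samples)


batches = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

serial = list(map(mean_square, batches))        # built-in map
parallel = parallel_map(mean_square, batches)   # drop-in replacement
```

Because each batch is independent of the others, the two calls produce the same list; only the scheduling differs.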
  6. Data Analysis: MapReduce
     • Input Reader
       – Several provided by GAE, can write your own
     • Map function: Written by Programmer
     • Shuffle function: Provided
       – Can write your own overrides for partitioning (sharding) and comparison (used in sort)
     • Reduce function: Written by Programmer
       – Can be skipped if not needed
     • Output Writer
       – Several provided by GAE, can write your own
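The division of labor on this slide can be sketched in a few lines of plain Python. This is an in-memory illustration of the map, shuffle, and reduce stages, not the App Engine MapReduce library; `map_fn`, `shuffle`, `reduce_fn`, and `run_job` are names chosen for the example, and word counting is used only because it is the customary demonstration.

```python
from collections import defaultdict


def map_fn(line):
    """Map (written by the programmer): emit (word, 1) for each word."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle (provided by the framework): group values by key, sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_fn(key, values):
    """Reduce (written by the programmer): collapse each group to one result."""
    return key, sum(values)


def run_job(records):
    """Chain the three stages over an iterable of input records."""
    mapped = (pair for record in records for pair in map_fn(record))
    return [reduce_fn(key, values) for key, values in shuffle(mapped)]


counts = run_job(["big data lab", "big lab"])
```

In the real service the input reader would supply the records and an output writer would persist `counts`; here a list plays both roles.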
  7. Data Analysis: Pipeline API
     • Based on Python generator functions
     • Allows chaining of map reduce jobs
       – Primitives for setting up various types of chains
         • MapreducePipeline (previous slide) was just one type of pipeline
     • Available for Python or Java
       – Python side better documented

     Split and Merge example

         # Assumes the appengine-pipeline library is importable as `pipeline`.
         import pipeline
         from pipeline import common

         class aPipe(pipeline.Pipeline):
             def run(self, e_kind, prop_name, *value_list):
                 all_bs = []
                 # Split: start one bPipe (defined elsewhere) per value.
                 for v in value_list:
                     stage = yield bPipe(e_kind, prop_name, v)
                     all_bs.append(stage)
                 # Merge: combine the outputs of all child stages.
                 yield common.Append(*all_bs)
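The generator mechanics behind the split-and-merge pattern can be imitated without App Engine. The sketch below is a toy runner, not the Pipeline API: `Stage`, `execute`, `Square`, `Append`, and `SplitAndMerge` are all hypothetical names, and the runner evaluates each yielded child immediately and sends its result back into the generator, whereas the real library hands back placeholder futures and schedules children as separate tasks.

```python
import inspect


class Stage(object):
    """Toy stand-in for a pipeline stage; subclasses override run()."""

    def __init__(self, *args):
        self.args = args

    def run(self, *args):
        raise NotImplementedError


def execute(stage):
    """Run a stage; if run() is a generator, run each yielded child in turn."""
    outcome = stage.run(*stage.args)
    if not inspect.isgenerator(outcome):
        return outcome  # leaf stage: run() returned a plain value
    result = None
    try:
        child = next(outcome)
        while True:
            result = execute(child)
            child = outcome.send(result)  # resume the parent with the child's result
    except StopIteration:
        return result  # the parent's value is that of its last child


class Square(Stage):
    def run(self, x):
        return x * x


class Append(Stage):
    def run(self, *values):
        return list(values)


class SplitAndMerge(Stage):
    def run(self, *values):
        results = []
        for v in values:
            squared = yield Square(v)  # split: one child per value
            results.append(squared)
        yield Append(*results)         # merge: gather the children's outputs


merged = execute(SplitAndMerge(1, 2, 3))
```

The shape of `SplitAndMerge.run` mirrors the slide's example: a loop that yields one child per input, followed by a final yield that combines them.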
  8. Data Visualization
     • App Engine supports multiple web frameworks for serving data directly from the Datastore into an HTML5 browser:
       – Django, Jinja2, CherryPy, …
     • Options:
       – jQuery Visualize
       – Google Visualization API
         • Including MotionCharts
           – Hans Rosling's Visualization API
           – Check out his TED talk
     • Conclusion:
       – A rich set of facilities for visualization and taking action
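Serving data to a browser-side chart mostly comes down to shaping query results into the JSON the charting library expects. The sketch below builds the `cols`/`rows` structure used by the Google Visualization API's DataTable JSON format. The `to_datatable` helper and the sample figures are invented for illustration, and the web handler that would return the payload is omitted.

```python
import json


def to_datatable(columns, rows):
    """Build a DataTable-style dict.

    columns: list of (id, label, type) tuples, e.g. ("year", "Year", "number").
    rows: list of tuples whose values follow the column order.
    """
    return {
        "cols": [
            {"id": col_id, "label": label, "type": col_type}
            for col_id, label, col_type in columns
        ],
        "rows": [{"c": [{"v": value} for value in row]} for row in rows],
    }


# Hypothetical sample data standing in for Datastore query results.
columns = [("region", "Region", "string"), ("sales", "Sales", "number")]
rows = [("East", 120), ("West", 95)]

payload = json.dumps(to_datatable(columns, rows))
```

On the client, a string like `payload` would typically be passed to the `google.visualization.DataTable` constructor before drawing a chart.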
  9. Decision Factors
     Usage: Proof of Concept or Demo in GAE
     Discussion:
       – Need a process for Data Loading
       – But saves on having to do Hadoop setup
       – Absence of Pig/Hive may be a limiting factor
       – Advantage in Visualization
       – Better security and isolation than Hadoop
     Usage: Production in GAE
     Discussion:
       – Analyze cost before committing
       – Lock-in risk?
     Usage: Production elsewhere
     Discussion:
       – Good semantic match between Datastore and HBase.
       – Need to do Hadoop setup and operation
  10. Thank you
      • J Singh
        – President, Early Stage IT
          • Technology Services and Strategy for Startups
      • is a new service of Early Stage IT
        – "Big Data" analytics solutions