Social Media Mining using GAE Map Reduce

Map Reduce under Google App Engine offers capabilities to analyze and draw conclusions about data residing in social networks.

  1. Social Network Mining Solutions using Google App Engine Map Reduce
     J Singh, DataThinks.org
     October 19, 2011
  2. MapReduce: A Genealogical Perspective
     • Roots
       – Lisp, Scheme
       – APL
     • Google OS papers, 2004
       – Exploit extreme parallelism of data
     • Apache Top Level Project (Hadoop)
     • MapReduceGAE borrows from these
     © J Singh, 2011
  3. Social Network Mining
     • Finding people based on data in social networks
       – Love and Romance
       – Common interests
       – Similar buying habits
       – Similar voting propensities
       – Location
     • It's not a new problem
       – We have additional solutions for the old problem
         • Examples based on proprietary data: eHarmony, etc.
         • Early examples based on social network data: ShoutFlow, WhoIsJustLikeMe.
  4. Based on clustering algorithms
     • On-line demo of clustering
     • Resource intensive.
       – Best done in batch mode
     • Exploit data parallelism of the algorithm
       – App Engine Map Reduce, employing one map job for each cluster
       – App Engine Pipeline API, employing one stage of the pipeline for each 'step'
     • But first, a detour into Map Reduce…
  5. MapReduce Conceptual Underpinnings
     • Based on Functional Programming model
       – From Lisp / Scheme
         • (map square (1 2 3 4)) → (1 4 9 16)
         • (reduce plus (1 4 9 16)) → 30
       – From APL
         • +/ N, where N ← 1 2 3 4
     • Easy to distribute (based on each element of the vector)
     • New for Map/Reduce: Nice failure/retry semantics
       – Hundreds and thousands of low-end servers are running at the same time
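The Lisp/Scheme expressions on this slide translate almost directly into Python. This is only an illustrative sketch of the functional idea and does not use App Engine; the names `square`, `numbers`, `squares`, and `total` are chosen here for the example.

```python
from functools import reduce
from operator import add


def square(x):
    """Return x squared; this is the per-element 'map' step."""
    return x * x


numbers = [1, 2, 3, 4]

# map: apply square to every element independently of the others
squares = list(map(square, numbers))

# reduce: fold the mapped values into one result with addition
total = reduce(add, squares)

print(squares, total)
```

Because each call to `square` depends only on its own element, the map step could be split across many workers, which is the property the slide describes as "easy to distribute".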
  6. MapReduce Flow
  7. MapReduce Components in GAE 2011
     • Input Reader
       – Several provided by GAE, can write your own
     • Map function: Written by Programmer
     • Shuffle function:
       – Provided by GAE, can write your own
     • Reduce function: Written by Programmer
     • Output Writer
       – Several provided by GAE, can write your own
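To make the five components concrete, here is a minimal in-memory sketch of how they fit together. It is a local stand-in, not the GAE library: every function name (`input_reader`, `map_interests`, `shuffle`, `reduce_users`, `output_writer`, `run_job`) and the sample `profiles` data are assumptions made up for illustration. The job groups users by shared interest, one of the matching criteria listed earlier in the deck.

```python
from collections import defaultdict


def input_reader(profiles):
    """Yield one (user, interests) record at a time."""
    yield from profiles.items()


def map_interests(user, interests):
    """Map: emit an (interest, user) pair for every interest a user lists."""
    for interest in interests:
        yield interest, user


def shuffle(pairs):
    """Shuffle: group all emitted values under their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_users(interest, users):
    """Reduce: produce one output record per interest."""
    yield interest, sorted(users)


def output_writer(results):
    """Collect the reduced records into a dictionary."""
    return dict(results)


def run_job(profiles):
    """Chain reader, map, shuffle, reduce, and writer in order."""
    mapped = (pair
              for user, interests in input_reader(profiles)
              for pair in map_interests(user, interests))
    reduced = (out
               for key, values in shuffle(mapped)
               for out in reduce_users(key, values))
    return output_writer(reduced)


profiles = {
    "ann": ["jazz", "chess"],
    "bob": ["chess"],
    "cy": ["jazz", "hiking"],
}
result = run_job(profiles)
print(result)
```

Only the map and reduce functions carry application logic here, which mirrors the slide's split between what the programmer writes and what the framework provides.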
  8. Invoking GAE Map Reduce
         class MapreducePipeline(…):
             def run(self,
                     job_name,            # A string
                     mapper_spec,         # Mapper function
                     reducer_spec,        # Reducer function
                     input_reader_spec,   # Input reader fn
                     output_writer_spec,  # Output writer
                     mapper_params,       # A dictionary
                     reducer_params,      # A dictionary
                     shards,              # An int
                     )
  9. GAE Pipeline API
     • Based on Python Generator functions
     • The old Unix idea on steroids:
       – Perform complex operations by piping data between primitives
       – But the primitives are not so primitive
       – Data lives in permanent storage between pipeline stages
     • MapreducePipeline (prev page) was just one type of pipeline
  10. Pipeline API Example Code
      Split and Merge example
          class aPipe(pipeline.Pipeline):
              def run(self, e_kind, prop_name, *value_list):
                  all_bs = []
                  for v in value_list:
                      stage = yield bPipe(e_kind, prop_name, v)
                      all_bs.append(stage)
                  yield common.Append(*all_bs)
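The split-and-merge pattern can be tried locally with a toy driver. The sketch below is not the GAE Pipeline library: `Stage`, `execute`, `CountMatches`, `Append`, and `SplitAndMerge` are hypothetical stand-ins written for this example. One deliberate simplification: the toy driver sends each child's actual result back into the generator, whereas a real pipeline framework would typically hand back a placeholder for a result that is computed later.

```python
import inspect


class Stage:
    """Hypothetical stand-in for a pipeline stage (not the GAE class)."""

    def __init__(self, *args):
        self.args = args

    def run(self, *args):
        raise NotImplementedError


def execute(stage):
    """Run a stage; if run() is a generator, run each yielded child stage
    and send its result back in. The last child's result is returned."""
    outcome = stage.run(*stage.args)
    if not inspect.isgenerator(outcome):
        return outcome
    last = None
    try:
        child = next(outcome)
        while True:
            last = execute(child)
            child = outcome.send(last)
    except StopIteration:
        return last


class CountMatches(Stage):
    """Child stage: count records whose property equals a given value."""

    def run(self, records, prop_name, value):
        return sum(1 for r in records if r.get(prop_name) == value)


class Append(Stage):
    """Merge stage: gather the child results into one list."""

    def run(self, *items):
        return list(items)


class SplitAndMerge(Stage):
    """Parent stage: one child per value (split), then combine (merge)."""

    def run(self, records, prop_name, *values):
        counts = []
        for v in values:
            count = yield CountMatches(records, prop_name, v)
            counts.append(count)
        yield Append(*counts)


records = [{"city": "Boston"}, {"city": "Austin"}, {"city": "Boston"}]
merged = execute(SplitAndMerge(records, "city", "Boston", "Austin"))
print(merged)
```

The shape of `SplitAndMerge.run` follows the slide's `aPipe.run`: a loop that yields one child stage per value, followed by a final yield that merges the collected results.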
  11. Pause and Assess
      • Assertion:
        – GAE Map/Reduce is a complete solution for social network mining analysis
        – We know it will scale; the question is how far.
      • Working on one Proof of Concept for Social Network Mining
        – Recruiting a second test case
      • Will report back in 3-4 months with data on
        – Performance
        – Cost
        – Limits of scalability
  12. Adapting the algorithm to M/R
      • Clustering Algorithm
        1. Create k randomly placed centroids
        2. Find the centroid (1 to k) closest to each data point (Map: each data point)
        3. Move each centroid to the average of its members (Reduce: each centroid)
        4. Repeat 2 and 3 until there is no more change (connect to the next stage using the Pipeline API)
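The four steps above can be sketched as a small local program in which step 2 is written as a map function and step 3 as a reduce function. This is an illustrative sketch and not the author's implementation: the names `map_point`, `reduce_centroid`, and `kmeans` are assumptions, the loop runs in one process instead of chaining pipeline stages, and the initial centroids are sampled from the data points (a common variant of "randomly placed"). It needs Python 3.8 or later for `math.dist`.

```python
import math
import random
from collections import defaultdict


def map_point(point, centroids):
    """Map (step 2): emit (index of the closest centroid, point)."""
    index = min(range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]))
    yield index, point


def reduce_centroid(members):
    """Reduce (step 3): the new centroid is the mean of its members."""
    dims = len(members[0])
    return tuple(sum(m[d] for m in members) / len(members)
                 for d in range(dims))


def kmeans(points, k, seed=0, max_iters=100):
    """Repeat map and reduce until the centroids stop changing (step 4)."""
    rng = random.Random(seed)
    # Step 1 (assumption: initial centroids are sampled from the data)
    centroids = rng.sample(points, k)
    for _ in range(max_iters):
        groups = defaultdict(list)
        for point in points:
            for index, member in map_point(point, centroids):
                groups[index].append(member)
        # A centroid with no members keeps its previous position
        updated = [reduce_centroid(groups[i]) if groups[i] else centroids[i]
                   for i in range(k)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids = kmeans(points, k=2)
print(sorted(centroids))
```

Each iteration corresponds to one map/reduce pass, so in a pipelined setting the loop body would become one stage whose output centroids feed the next stage.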
  13. About Us
      • Involved with Map/Reduce and NoSQL technologies on several platforms
        – Google App Engine, MongoDB
      • DataThinks.org is a new service of Early Stage IT
        – Building and operating “Big Data” analytics services
      Thanks
