Facebook Analytics with Elastic Map/Reduce
 

A workshop on analyzing data about Facebook likes of a set of people

Presentation Transcript

  • Data + Algorithms = Knowledge
      Facebook Analytics With Elastic Map/Reduce – a Hands-on Workshop
      November 12, 2012
      J Singh, DataThinks.org
  • Take-away Messages
      • Map Reduce is simple; Hadoop is one implementation of MR…
          – …made even simpler by services like Elastic Map Reduce
      • But Map Reduce requires a different style of programming…
          – …and a different set of techniques for debugging
      • Facebook data can get big very quickly…
          – …and storage and bandwidth costs can dominate your solution
      • Analytics is an iterative (agile) process…
          – …each iteration requires evaluating results and tuning the algorithms, and possibly acquiring more data
      © J Singh, 2012
  • Signing Up for AWS
      The steps required to obtain an AWS account:
      • Create an AWS account (http://aws.amazon.com).
          – http://www.slideshare.net/AmazonWebServices/video-how-to-sign-up-for-amazon-web-services-8700872
          – Requires a valid credit card and phone-based identification.
      • Sign in to the AWS Management Console
          – http://aws.amazon.com/console
  • Elastic Map Reduce Resources
      • Summary of the offering
      • Elastic MapReduce Training
      • Getting Started Guide
      • Developers Guide
  • MapReduce Conceptual Underpinnings
      • Based on the Functional Programming model
          – From Lisp
              • (map square (1 2 3 4)) → (1 4 9 16)
              • (reduce plus (1 4 9 16)) → 30
          – From APL
              • +/ N, where N ← 1 2 3 4
      • Easy to distribute (based on each element of the vector)
      • New for Map/Reduce: nice failure/retry semantics
          – Hundreds and thousands of low-end servers are running at the same time
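The Lisp examples on this slide translate almost directly into Python. The snippet below is only an illustration of the functional idea (the name `square` mirrors the slide; nothing here is Hadoop-specific).

```python
from functools import reduce
from operator import add


def square(x):
    """Per-element function: needs no knowledge of the other elements."""
    return x * x


numbers = [1, 2, 3, 4]

# Map step: each element is transformed independently, which is what
# makes the work easy to spread across many machines.
squares = list(map(square, numbers))

# Reduce step: fold the mapped values into a single result.
total = reduce(add, squares)

print(squares)  # [1, 4, 9, 16]
print(total)    # 30
```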
  • MapReduce Flow
  • Elastic Map Reduce – Summary
      • Hadoop installed and maintained by Amazon
          – We can focus on programming
          – Offers a few options on map and reduce programs
      • Streaming
          – Map and Reduce programs connect through stdin and stdout
          – Allows Map and Reduce to be written in any language
      • Hive, Pig
          – Translate to Map/Reduce JARs
          – Can cascade M/R pipelines
      • Custom JAR
          – For special cases
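To make the Streaming option concrete, here is a minimal word-count sketch in the streaming style: the mapper turns lines of text into tab-separated `key<TAB>value` records, and the reducer sums records that arrive grouped by key. This is not the workshop's wordSplitter.py; the function names `map_lines` and `reduce_lines` are hypothetical, and the `sorted()` call merely stands in for the sort that Hadoop performs between the map and reduce phases. In a real streaming job each function would read `sys.stdin` and print to stdout instead of using the in-memory sample.

```python
import io
import re
from itertools import groupby

WORD = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record per word in the input."""
    for line in lines:
        for word in WORD.findall(line):
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_records):
    """Reducer: sum the counts of consecutive records sharing a key."""
    pairs = (rec.rstrip("\n").split("\t", 1) for rec in sorted_records)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


# Local stand-in for a streaming run (assumed sample text).
sample = io.StringIO("the quick fox\nthe lazy dog\n")
mapped = sorted(map_lines(sample))   # stands in for Hadoop's shuffle/sort
counts = list(reduce_lines(mapped))

for record in counts:
    print(record)
```

Keeping the two functions free of direct stdin/stdout access makes them easy to test locally before paying for a cluster run.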
  • Elastic Map Reduce – Architecture
      • Starting with data in S3
      • EMR Service initiates the job
      • Hadoop Master coordinates operation
      • Slave nodes are initiated and data loaded into them
      • Extra nodes can be invoked if needed
      • Results are copied back into S3
          – Nodes are destroyed
  • Elastic Map Reduce – Word Count
      • Use the AWS Management Console >> Elastic MapReduce
          – Define Job Flow
              • Hadoop Version 1.0.3
              • Run your own application – Streaming
          – Specify Parameters
              • For input files, elasticmapreduce/samples/wordcount/input
              • For output files, you need to define your own S3 bucket
                  – In a separate browser tab, AWS Management Console >> S3
                  – Bucket names can include lowercase letters, numbers, period, dash
              • Mapper code can be seen at http://goo.gl/EbCme
                  – Copy this code to one of your buckets
                  – Specify path <your-bucket>/wordSplitter.py
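A quick way to check a candidate bucket name against the character rule stated on this slide is sketched below. It checks only that rule (lowercase letters, numbers, period, dash); S3 enforces further naming rules, such as length limits, that this sketch deliberately does not cover. The helper name `uses_allowed_characters` is hypothetical.

```python
import re

# Only the character set named on the slide; not the full S3 naming rules.
ALLOWED = re.compile(r"[a-z0-9.-]+")


def uses_allowed_characters(name):
    """Return True if every character in name is allowed by the slide's rule."""
    return ALLOWED.fullmatch(name) is not None


print(uses_allowed_characters("my-bucket.2012"))  # True
print(uses_allowed_characters("My_Bucket"))       # False
```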
  • Elastic Map Reduce – Word Count (p2)
      • Configure EC2 Instances
      • Advanced Options
          – Optional: Amazon EC2 Key Pair
              • To log into the master and make changes to a running job
                  – E.g., add extra nodes to speed up processing
          – Amazon S3 Log Path
              • <your-bucket>/log-2012-11-12--19-30
      • Accept all other defaults and go!
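The log path on this slide embeds the date and time of the run, which keeps each run's logs separate. A small sketch of generating such a path follows; the helper name `fresh_log_path` and the bucket name are assumptions, and the timestamp layout is inferred from the example `log-2012-11-12--19-30`.

```python
from datetime import datetime


def fresh_log_path(bucket, now=None):
    """Build '<bucket>/log-YYYY-MM-DD--HH-MM' for the given (or current) time."""
    now = now or datetime.now()
    return f"{bucket}/log-{now.strftime('%Y-%m-%d--%H-%M')}"


# Reproduces the slide's example for a hypothetical bucket name.
print(fresh_log_path("your-bucket", datetime(2012, 11, 12, 19, 30)))
```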
  • Monitoring Operation
      • AWS Management Console provides a view into the operation
          – These screenshots were taken at minute 27 of a 30-minute run
          – Configuration default in this case was for 2 map slots
          – First slot became available at 12:00, second around 12:10
  • Elastic Map Reduce – Debugging
      • AWS console and the log files provide clues on what went wrong and how to fix it
      • Make a change that will break the operation and examine the AWS console to find the error you introduced
          – Introduce a parsing error in the mapper program
          – Uncomment these lines to have it raise an exception (a ZeroDivisionError whenever randint returns 0):
              import random
              x = 1 / random.randint(0,1000)
          – Save the file to an S3 bucket and run
          – Can you find where EMR reveals what happened?
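One technique that can make failures easier to locate is to have the mapper report the offending record on stderr before giving up on it. The sketch below is not part of the workshop code: `process` is a hypothetical per-record function, and it assumes that whatever a streaming task writes to stderr ends up in the task logs that EMR copies to the S3 log path.

```python
import io
import traceback


def process(line):
    """Hypothetical per-record work; raises ValueError on malformed input."""
    key, value = line.split("\t")
    return f"{key}\t{int(value)}"


def run(lines, out, err):
    """Process every record, logging failures to err; return the failure count."""
    bad = 0
    for number, line in enumerate(lines, 1):
        try:
            print(process(line.rstrip("\n")), file=out)
        except Exception:
            bad += 1
            # Say which record failed, then include the full traceback.
            print(f"record {number} failed: {line!r}", file=err)
            traceback.print_exc(file=err)
    return bad


# Local demonstration with in-memory streams; a streaming job would pass
# sys.stdin, sys.stdout and sys.stderr instead.
out, err = io.StringIO(), io.StringIO()
failures = run(["a\t1\n", "broken\n", "b\t2\n"], out, err)
print(failures)
print(err.getvalue())
```

Whether to skip bad records, as here, or to let the job fail fast is a design choice; for the exercise on this slide, letting the exception propagate is the point.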
  • Facebook Analytics – Summary
      • Extend the architecture
          – Import Facebook data into S3
          – Change Map Reduce programs as required
  • Facebook Analytics – Observations
      • Fetching and staging data is the real challenge in putting together an analytics solution
          – For unstructured data, it requires
              • An understanding of the data model at the source
              • Custom code to read it
          – For structured data, consider Pig/Hive (higher-level Hadoop components)
              • Pig/Hive can read/write tables formatted as CSV/TSV files in S3
                  – Either we need to bring files into S3
                  – Or point Pig/Hive at a JDBC connection
              • An opportunity to rethink the ETL pipeline?
  • Facebook Analytics – Data Collection
      • The exercise is based on everyone's Facebook data
      • Log into http://apps.facebook.com/map-reduce-workshop
          – Requires permission to get
              • Information about you,
              • Your friends,
              • Your likes, your friends' likes.
          – Randomly selects 10 of those friends
          – Randomly selects 25 of their likes
          – Anonymizes your friends' Facebook IDs before storing into S3
      • All data, even though opaque, will be deleted at the end of the workshop
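The sampling and anonymization steps described on this slide could look roughly like the sketch below. The slides do not say how the workshop app anonymizes IDs, so the salted SHA-256 approach, the function names `anonymize` and `sample_likes`, and the input layout (a dict mapping each friend's ID to a list of like IDs) are all assumptions made for illustration.

```python
import hashlib
import random


def anonymize(facebook_id, salt):
    """Map an ID to an opaque number via a salted hash (assumed scheme)."""
    digest = hashlib.sha256(f"{salt}:{facebook_id}".encode()).hexdigest()
    return int(digest[:12], 16)


def sample_likes(friend_likes, salt, rng, max_friends=10, max_likes=25):
    """Pick up to max_friends friends and up to max_likes likes for each."""
    friend_ids = sorted(friend_likes)
    chosen = rng.sample(friend_ids, min(max_friends, len(friend_ids)))
    records = {}
    for friend in chosen:
        likes = friend_likes[friend]
        records[anonymize(friend, salt)] = rng.sample(likes, min(max_likes, len(likes)))
    return records


# Made-up data: 30 friends with 40 likes each.
fake = {1000 + f: [f * 100 + k for k in range(40)] for f in range(30)}
records = sample_likes(fake, salt="workshop", rng=random.Random(0))
print(len(records), "friends sampled")
```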
  • Facebook Analytics – Data Collected
      Original = 75, Friends = 750, Likes = up to about 20,000
      • Each user record shows anonymized user ID and their likes
          – 4110002004281 [21506845769, 345722385482735, 93433060687]
  • Facebook Analytics – Likes Count
      • Use the AWS Management Console >> Elastic MapReduce
          – Define Job Flow
              • Hadoop Version 1.0.3
              • Run Your Own Application – Streaming
          – Specify Parameters
              • For input files, use bucket datathinks-users
              • For output files, you need to define your own S3 bucket
                  – In a separate browser tab, AWS Management Console >> S3
              • Mapper: copy goo.gl/PcLK4 into a bucket you own
          – Advanced options:
              • Choose a fresh log file location
          – Accept all other defaults and go!
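A likes-count job over records in the format shown earlier (an anonymized user ID followed by a bracketed list of like IDs) might be sketched as follows. This is not the mapper published at goo.gl/PcLK4; the function names are hypothetical, the second sample record is made up, and `Counter` stands in locally for the reduce phase.

```python
import re
from collections import Counter


def map_likes(lines):
    """Mapper: emit (like_id, 1) for every like in every user record."""
    for line in lines:
        # The user ID comes first; everything after the first space is the like list.
        _user_id, _, like_list = line.strip().partition(" ")
        for like_id in re.findall(r"\d+", like_list):
            yield like_id, 1


def count_likes(pairs):
    """Reducer stand-in: total the emitted counts per like ID."""
    counts = Counter()
    for like_id, n in pairs:
        counts[like_id] += n
    return counts


records = [
    "4110002004281 [21506845769, 345722385482735, 93433060687]",  # from the slides
    "4110002004282 [21506845769, 93433060687]",                   # made-up record
]
counts = count_likes(map_likes(records))
for like_id, n in sorted(counts.items()):
    print(f"{like_id}\t{n}")
```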
  • Viewing the Results
      • The results of the data analysis are available in S3.
          – Partial example:
              139784736075551   1
              140413412750046   6
              184331976202      3
              220854914702193   1
              29092950651       1
      • How to interpret the results
          – Sort by frequency, then examine the most frequent likes
              • 140413412750046 is cryptic
              • But http://www.facebook.com/pages/w/140413412750046 reveals what it is (DataThinks)
      • Requires further action: what to do with the results?
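Sorting the output by frequency, as suggested on this slide, takes only a few lines once the result file has been downloaded from S3. The sketch below uses the partial example from the slide as inline data and assumes each line holds an ID and a count separated by whitespace; `most_frequent` is a hypothetical helper, and the URL pattern is the one shown on the slide.

```python
RESULTS = """\
139784736075551\t1
140413412750046\t6
184331976202\t3
220854914702193\t1
29092950651\t1
"""


def most_frequent(lines, top=3):
    """Return the top (like_id, count) pairs, highest count first."""
    pairs = []
    for line in lines:
        like_id, count = line.split()
        pairs.append((like_id, int(count)))
    # Ties are broken by ID so the ordering is deterministic.
    return sorted(pairs, key=lambda p: (-p[1], p[0]))[:top]


top_likes = most_frequent(RESULTS.splitlines())
for like_id, count in top_likes:
    print(count, f"http://www.facebook.com/pages/w/{like_id}")
```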
  • Algorithm Discussion
      • The algorithm based on exact matches for likes may be too restrictive
          – 'Ella Fitzgerald' != 'Duke Ellington'
          – But people who like Ella Fitzgerald may be reachable the same way as people who like Duke Ellington
          – An idea to explore further:
              • Is there a way to find IDs that we might consider equivalent?
  • Data Collected and Embellished
      Original = 75, Friends = 750, Likes = 15,000, Similar Likes = 150,000
  • Extended Facebook Analytics – Summary
      • Extend the architecture
          – Get mappers to fetch "similar likes" from the internet
  • Facebook Analytics – Showing Results
      • The other challenge in putting together an analytics solution is displaying results
          – Demo of our results page
  • Take-away Messages
      • Map Reduce is simple; Hadoop is one implementation of MR…
          – …made even simpler by services like Elastic Map Reduce
      • But Map Reduce requires a different style of programming…
          – …and a different set of techniques for debugging
      • Facebook data can get big very quickly…
          – …and storage and bandwidth costs can dominate your solution
      • Analytics is an iterative (agile) process…
          – …each iteration requires evaluating results and tuning the algorithms, and possibly acquiring more data
  • Thank you
      • J Singh
          – President, Early Stage IT
              • Technology Services and Strategy for Startups
      • DataThinks.org is a service of Early Stage IT
          – "Big Data" analytics solutions