Facebook Analytics with Elastic Map/Reduce
J Singh
Organizer at Boston Cloud Services Meetup
Nov. 11, 2012
A workshop on analyzing data about Facebook likes of a set of people
Data + Algorithms = Knowledge
Facebook Analytics With Elastic Map/Reduce – a Hands-on Workshop
November 12, 2012
J Singh, DataThinks.org

Take-away Messages
• Map Reduce is simple, Hadoop is one implementation of MR…
  – …made even simpler by services like Elastic Map Reduce
• But Map Reduce requires a different style of programming…
  – …and a different set of techniques for debugging
• Facebook data can get big very quickly…
  – …and storage and bandwidth costs can dominate your solution
• Analytics is an iterative (agile) process…
  – …each iteration requires evaluating results, and tuning the algorithms, possibly the acquisition of more data
© J Singh, 2012

Signing Up for AWS
The steps required to obtain an AWS account:
Create an AWS account (http://aws.amazon.com).
  – http://www.slideshare.net/AmazonWebServices/video-how-to-sign-up-for-amazon-web-services-8700872
  – Requires a valid credit card and phone-based identification.
Sign in to the AWS Management Console
  – http://aws.amazon.com/console

Elastic Map Reduce Resources
• Summary of the offering
• Elastic MapReduce Training
• Getting Started Guide
• Developers Guide

MapReduce Conceptual Underpinnings
• Based on Functional Programming model
  – From Lisp
    • (map square '(1 2 3 4)) → (1 4 9 16)
    • (reduce plus '(1 4 9 16)) → 30
  – From APL
    • +/ N (with N = 1 2 3 4)
• Easy to distribute (based on each element of the vector)
• New for Map/Reduce: Nice failure/retry semantics
  – Hundreds and thousands of low-end servers are running at the same time

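The Lisp and APL expressions above have close equivalents in Python's built-in map and functools.reduce. The sketch below is only an illustration of the functional idea, not code from the workshop; the helper name square and the variable names are assumptions made for this example.

```python
from functools import reduce
from operator import add


def square(x):
    """Return x squared; plays the role of 'square' in the Lisp example."""
    return x * x


numbers = [1, 2, 3, 4]

# map applies square to each element independently, which is why the
# work is easy to distribute: no element depends on any other.
squares = list(map(square, numbers))

# reduce folds the mapped values into one result, like (reduce plus ...).
total = reduce(add, squares)

# The APL expression +/ N is a plus-reduction over the vector N itself.
apl_sum = reduce(add, numbers)

print(squares)
print(total)
print(apl_sum)
```

Because each call to square looks at one element only, the map phase can be split across machines without coordination; only the reduce phase has to bring values together.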
MapReduce Flow
[Diagram: MapReduce flow]

Elastic Map Reduce – Summary
• Hadoop installed and maintained by Amazon
  – We can focus on programming
  – Offers a few options on map and reduce programs
• Streaming
  – Map and Reduce programs connect through stdin and stdout
  – Allows Map and Reduce to be written in any language
• Hive, Pig
  – Translates to Map/Reduce JARs
  – Can cascade M/R pipelines
• Custom JAR
  – for special cases

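To make the Streaming option concrete, here is a minimal sketch of the contract between a mapper and a reducer: both exchange plain text lines of the form key<TAB>value, and the framework sorts by key in between. The function names (map_lines, reduce_lines, run_locally) and the sample input are assumptions for illustration. In a real streaming job the mapper and the reducer would be two separate scripts reading sys.stdin and printing to stdout; here the sort-and-group step is simulated in-process so the example runs on its own.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line for every word seen."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def key_of(pair_line):
    """Return the key part of a 'key<TAB>value' line."""
    return pair_line.split("\t", 1)[0]


def reduce_lines(sorted_lines):
    """Reducer: lines arrive sorted by key, so equal keys are adjacent."""
    for key, group in groupby(sorted_lines, key=key_of):
        total = sum(int(line.split("\t", 1)[1]) for line in group)
        yield f"{key}\t{total}"


def run_locally(lines):
    """Stand-in for the framework: map, sort by key, then reduce."""
    shuffled = sorted(map_lines(lines), key=key_of)
    return list(reduce_lines(shuffled))


sample = ["the quick fox", "The lazy dog"]  # made-up input
word_counts = run_locally(sample)
for line in word_counts:
    print(line)
```

Keeping the mapper and reducer as plain generator functions makes them easy to test on a laptop before paying for a cluster run.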
Elastic Map Reduce – Architecture
• Starting with data in S3
• EMR Service initiates the job
• Hadoop Master coordinates operation
• Slave nodes are initiated and data loaded into them
• Extra nodes can be invoked if needed
• Results are copied back into S3
  – Nodes are destroyed

Elastic Map Reduce – Word Count
• Use the AWS Management Console >> Elastic MapReduce
  – Define Job Flow
    • Hadoop Version 1.0.3
    • Run your own application – Streaming
  – Specify Parameters
    • For input files, elasticmapreduce/samples/wordcount/input
    • For output files, you need to define your own S3 bucket
      – In a separate browser tab, AWS Management Console >> S3
      – Bucket names can include lowercase letters, numbers, period, dash
    • Mapper code can be seen at http://goo.gl/EbCme
      – Copy this code to one of your buckets
      – Specify path <your-bucket>/wordSplitter.py

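The bucket-naming rule on this slide can be checked before creating the bucket. The sketch below encodes only the character rule stated above (lowercase letters, numbers, period, dash); the function name uses_allowed_characters is an assumption for this example, and S3 applies further rules (for instance on length and on the first and last character) that this check deliberately does not cover.

```python
import re

# Only the rule stated on the slide: lowercase letters, digits, '.', '-'.
# Assumption: other S3 naming rules are out of scope for this sketch.
_ALLOWED = re.compile(r"[a-z0-9.-]+")


def uses_allowed_characters(name):
    """Return True if every character in name is allowed by the slide's rule."""
    return _ALLOWED.fullmatch(name) is not None


print(uses_allowed_characters("my-bucket.2012"))
print(uses_allowed_characters("My_Bucket"))
```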
Elastic Map Reduce – Word Count (p2)
• Configure EC2 Instances
• Advanced Options
  – Optional: Amazon EC2 Key Pair
    • To log into the master and make changes to a running job
      – E.g., add extra nodes to speed up processing
  – Amazon S3 Log Path
    • <your-bucket>/log-2012-11-12--19-30
• Accept all other defaults and go!

Monitoring Operation
• AWS Management Console provides a view into the operation
  – These screen-shots were taken at minute 27 of a 30-minute run
  – Configuration default in this case was for 2 map slots
  – First slot became available at 12:00, second around 12:10

Elastic Map Reduce – Debugging
• AWS console and the log files provide clues on what went wrong and how to fix it
• Make a change that will break the operation and examine the AWS console to find the error you introduced
  – Introduce a parsing error in the mapper program
  – Uncomment these lines to have it raise an exception
      import random
      x = 1 / random.randint(0,1000)
  – Save the file to an S3 bucket and run
  – Can you find where EMR reveals what happened?

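The two lines on this slide fail only occasionally: random.randint(0, 1000) includes both endpoints, so it returns 0 roughly once in 1001 calls, and the division then raises ZeroDivisionError. The sketch below wraps that failure so a message is written to stderr before the exception propagates, which is one way to leave a clue in the task logs. The names flaky_step, process and FixedRandom are assumptions for illustration, not part of the workshop's mapper.

```python
import io
import random
import sys


def flaky_step(rng=random):
    """The two lines from the slide: divides by zero when randint returns 0."""
    return 1 / rng.randint(0, 1000)


def process(line, rng=random, log=sys.stderr):
    """Handle one input line, logging a diagnostic before re-raising errors."""
    try:
        flaky_step(rng)
    except ZeroDivisionError as exc:
        log.write(f"mapper failed on input {line!r}: {exc}\n")
        raise
    return line.strip()


class FixedRandom:
    """Test double: randint always returns the value given at construction."""

    def __init__(self, value):
        self.value = value

    def randint(self, low, high):
        return self.value


# Force the failure deterministically and capture what would go to stderr.
demo_log = io.StringIO()
try:
    process("some input\n", rng=FixedRandom(0), log=demo_log)
    failed = False
except ZeroDivisionError:
    failed = True

print(failed)
print(demo_log.getvalue(), end="")
```

Injecting the random source makes the intermittent failure reproducible locally, so the log message can be checked before the job is uploaded to S3.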
Facebook Analytics – Summary
• Extend the architecture
  – Import Facebook data into S3
  – Change Map Reduce programs as required

Facebook Analytics – Observations
• Fetching and staging data is the real challenge in putting together an analytics solution
  – For unstructured data, it requires
    • An understanding of the data model at the source
    • Custom code to read it
  – For structured data, consider Pig/Hive (higher-level Hadoop components)
    • Pig/Hive can read/write tables formatted as CSV/TSV files in S3
      – Either we need to bring files into S3
      – Or point Pig/Hive at a JDBC connection
• An opportunity to rethink the ETL pipeline?

Facebook Analytics – Data Collection
• The exercise is based on everyone's Facebook data
• Log into http://apps.facebook.com/map-reduce-workshop
  – Requires permission to get
    • Information about you,
    • Your friends,
    • Your likes, your friends' likes.
  – Randomly selects 10 of those friends
  – Randomly selects 25 of their likes
  – Anonymizes your friends' Facebook IDs before storing into S3
• All data, even though opaque, will be deleted at the end of the workshop

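The sampling rules above bound how much data the exercise can produce. The arithmetic below is a sketch that assumes 75 participants (the "Original = 75" figure on the Data Collected slide) and that the 25-like cap applies per sampled friend; the deck does not say whether participants' own likes are capped the same way, so that case is shown separately as an assumption.

```python
PARTICIPANTS = 75      # "Original = 75" on the Data Collected slide
FRIENDS_SAMPLED = 10   # friends randomly selected per participant
LIKES_SAMPLED = 25     # likes randomly selected per friend

# Friends collected across all participants.
friends = PARTICIPANTS * FRIENDS_SAMPLED

# Upper bound on friends' likes: reached only if every friend has 25+ likes.
friend_likes_max = friends * LIKES_SAMPLED

# Assumption: if participants' own likes were capped at 25 as well,
# the overall bound would be slightly higher.
all_likes_max = (PARTICIPANTS + friends) * LIKES_SAMPLED

print(friends)
print(friend_likes_max)
print(all_likes_max)
```

Both bounds are in line with the deck's "up to about 20,000" likes; the real count is lower whenever a sampled friend has fewer than 25 likes.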
Facebook Analytics – Data Collected
Original = 75
Friends = 750
Likes = up to about 20,000
• Each user record shows anonymized user ID and their likes
  – 4110002004281 ['21506845769', '345722385482735', '93433060687']

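The sample record above looks like a user ID followed by a Python-style list literal, so it can be read without a custom parser. This is a minimal sketch under that assumption; the function name parse_record is invented here, and the separator between the ID and the list (space or tab) is not stated in the deck, so the code splits on any whitespace.

```python
import ast


def parse_record(line):
    """Split a record into (user_id, list of like IDs as strings)."""
    # Split once on the first run of whitespace: ID on the left, list on the right.
    user_id, likes_text = line.strip().split(None, 1)
    # literal_eval safely evaluates the list literal without running code.
    likes = ast.literal_eval(likes_text)
    return user_id, [str(like) for like in likes]


sample = "4110002004281 ['21506845769', '345722385482735', '93433060687']"
user_id, likes = parse_record(sample)
print(user_id)
print(likes)
```

Using ast.literal_eval rather than eval matters here because the input comes from outside the program.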
Facebook Analytics – Likes Count
• Use the AWS Management Console >> Elastic MapReduce
  – Define Job Flow
    • Hadoop Version 1.0.3
    • Run Your Own Application – Streaming
  – Specify Parameters
    • For input files, use bucket datathinks-users
    • For output files, you need to define your own S3 bucket
      – In a separate browser tab, AWS Management Console >> S3
    • Mapper: copy goo.gl/PcLK4 into a bucket you own
  – Advanced options:
    • Choose a fresh log file location
  – Accept all other defaults and go!

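The workshop's mapper lives at the short link above and is not reproduced here. As a rough idea of what a likes-count mapper has to do, the sketch below emits one like_id<TAB>1 pair per like and then tallies the pairs locally in place of the reduce step. The names map_likes and count_pairs are assumptions, the record layout is inferred from the single sample record in the deck, and the second and third records are made up for illustration.

```python
import ast
from collections import Counter


def map_likes(lines):
    """Mapper sketch: emit 'like_id<TAB>1' for every like in every record."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _user_id, likes_text = line.split(None, 1)
        for like_id in ast.literal_eval(likes_text):
            yield f"{like_id}\t1"


def count_pairs(pairs):
    """Local stand-in for the reduce step: sum the values per key."""
    counts = Counter()
    for pair in pairs:
        key, value = pair.split("\t", 1)
        counts[key] += int(value)
    return counts


records = [
    "4110002004281 ['21506845769', '345722385482735', '93433060687']",
    "4110002004282 ['21506845769', '93433060687']",  # made-up record
    "4110002004283 ['21506845769']",                 # made-up record
]
like_counts = count_pairs(map_likes(records))
for like_id, count in sorted(like_counts.items()):
    print(like_id, count)
```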
Viewing the Results
• The results of Data Analysis are available in S3.
  – Partial example:
      139784736075551 1
      140413412750046 6
      184331976202 3
      220854914702193 1
      29092950651 1
• How to interpret the results.
  – Sort by frequency, then examine most frequent likes
    • 140413412750046 is cryptic
    • But http://www.facebook.com/pages/w/140413412750046 reveals what it is (DataThinks)
• Requires further action: what to do with the results?

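The "sort by frequency" step can be done on the downloaded output with a few lines of Python. The sketch below uses the partial example from this slide as input; the function name sort_by_frequency is an assumption, and breaking ties by ID is an arbitrary choice made so the order is deterministic.

```python
partial_output = """\
139784736075551 1
140413412750046 6
184331976202 3
220854914702193 1
29092950651 1
"""


def sort_by_frequency(text):
    """Parse 'like_id count' lines and return them most frequent first."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            like_id, count = line.split()
            rows.append((like_id, int(count)))
    # Sort by descending count; ties are ordered by ID for determinism.
    return sorted(rows, key=lambda row: (-row[1], row[0]))


ranked = sort_by_frequency(partial_output)
for like_id, count in ranked[:2]:
    print(like_id, count)
```

The top entry is the ID the slide resolves to DataThinks; the remaining IDs would need the same lookup before they mean anything to a reader.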
Algorithm Discussion
• The algorithm based on exact matches for likes may be too restrictive
  – 'Ella Fitzgerald' != 'Duke Ellington'
  – But people who like Ella Fitzgerald may be reachable the same way as people who like Duke Ellington
  – An idea to explore further:
    • Is there a way to find IDs that we might consider equivalent?

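One possible reading of "equivalent IDs" is to map each like to a shared group label before counting, so that related likes add up. The sketch below is purely illustrative: the EQUIVALENT table, the "jazz" label and the function names are all hypothetical, and the deck does not say how such a table would be built.

```python
from collections import Counter

# Hypothetical equivalence table; how to derive it is the open question
# the slide raises, and nothing here comes from the workshop.
EQUIVALENT = {
    "Ella Fitzgerald": "jazz",
    "Duke Ellington": "jazz",
}


def canonical(like):
    """Return the group label for a like, or the like itself if ungrouped."""
    return EQUIVALENT.get(like, like)


def count_grouped(likes):
    """Count likes after replacing each one with its group label."""
    return Counter(canonical(like) for like in likes)


grouped = count_grouped(["Ella Fitzgerald", "Duke Ellington", "DataThinks"])
print(grouped["jazz"])
print(grouped["DataThinks"])
```

With exact matching the two jazz likes would each count once; grouped, they count as two for the same audience.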
Data Collected and Embellished
Original = 75
Friends = 750
Likes = 15,000
Similar Likes = 150,000

Extended Facebook Analytics – Summary
• Extend the architecture
  – Get mappers to fetch “similar likes” from the internet

Facebook Analytics – Showing Results
• The other challenge in putting together an analytics solution is displaying results
  – Demo of our results page

Take-away Messages
• Map Reduce is simple, Hadoop is one implementation of MR…
  – …made even simpler by services like Elastic Map Reduce
• But Map Reduce requires a different style of programming…
  – …and a different set of techniques for debugging
• Facebook data can get big very quickly…
  – …and storage and bandwidth costs can dominate your solution
• Analytics is an iterative (agile) process…
  – …each iteration requires evaluating results, and tuning the algorithms, possibly the acquisition of more data

Thank you
• J Singh
  – President, Early Stage IT
    • Technology Services and Strategy for Startups
• DataThinks.org is a service of Early Stage IT
  – “Big Data” analytics solutions

Editor's Notes
Get started with Hadoop