SlideShare a Scribd company logo
1 of 24
Data + Algorithms = Knowledge




Facebook Analytics


                  With Elastic Map/Reduce
                      – a Hands-on Workshop

                                            November 12, 2012
                                        J Singh, DataThinks.org




                             1
Take-away Messages

• Map Reduce is simple, Hadoop is one implementation of MR…
   – …made even simpler by services like Elastic Map Reduce


• But Map Reduce requires a different style of programming…
   – …and a different set of techniques for debugging


• Facebook data can get big very quickly…
   – …and storage and bandwidth costs can dominate your solution


• Analytics is an iterative (agile) process…
   – …each iteration requires evaluating results, and tuning the algorithms,
     possibly the acquisition of more data

                       © J Singh, 2012                                  2
                                2
Signing Up for AWS

The steps required to obtain an AWS account
   Create an AWS account (http://aws.amazon.com).
    –   http://www.slideshare.net/AmazonWebServices/video-how-to-sign-up-for-
        amazon-web-services-8700872
    –   Requires a valid credit card and a phone based identification.
   Sign in to the AWS Management Console
    – http://aws.amazon.com/console




                          © J Singh, 2012                                   3
                                   3
Elastic Map Reduce Resources

• Summary of the offering

• Elastic MapReduce Training

• Getting Started Guide

• Developers Guide




                     © J Singh, 2012   4
                              4
MapReduce Conceptual Underpinnings

• Based on Functional Programming model
   – From Lisp
       • (map square '(1 2 3 4))   (1 4 9 16)
       • (reduce plus '(1 4 9 16))   30
   – From APL
       • +/ N    N  1 2 3 4


• Easy to distribute (based on each element of the vector)

• New for Map/Reduce: Nice failure/retry semantics
   – Hundreds and thousands of low-end servers are running at
     the same time
                     © J Singh, 2012                            5
                              5
MapReduce Flow




            © J Singh, 2012   6
                     6
Elastic Map Reduce – Summary

• Hadoop installed and maintained by Amazon
   – We can focus on programming
   – Offers a few options on map and reduce programs

• Streaming
   – Map and Reduce programs
     connect through stdin and
     stdout
   – Allows Map and Reduce to be
     written in any language
• Hive, Pig
   – Translates to Map/Reduce JARs
   – Can cascade M/R pipelines
• Custom JAR – for special cases

                      © J Singh, 2012                  7
                               7
Elastic Map Reduce – Architecture

• Starting with data in S3

• EMR Service initiates the job
• Hadoop Master coordinates
  operation
• Slave nodes are initiated and
  data loaded into them
• Extra nodes can be invoked if
  needed

• Results are copied back into S3
   – Nodes are destroyed

                      © J Singh, 2012   8
                               8
Elastic Map Reduce – Word Count

• Use the AWS Management Console >> Elastic MapReduce
  – Define Job Flow
      • Hadoop Version 1.0.3
      • Run your own application
          – Steaming
  – Specify Parameters
      • For input files,
        elasticmapreduce/samples/wordcount/input
      • For output files, you need to define your own S3 bucket
          – In a separate browser tab, AWS Management Console >> S3
          – Bucket names can include lowercase letters, numbers, period, dash
      • Mapper code can be seen at http://goo.gl/EbCme
          – Copy this code to one of your buckets
          – Specify path <your-bucket>/wordSplitter.py
                           © J Singh, 2012                                  9
                                    9
Elastic Map Reduce – Word Count (p2)

• Configure EC2 Instances
• Advanced Options
   – Optional: Amazon EC2 Key Pair
       • To log into the master and make changes to a running job
          – E.g,, add extra nodes to speed up processing
   – Amazon S3 Log Path
       • <your-bucket>/log-2012-11-12--19-30
• Accept all other defaults and go!




                       © J Singh, 2012                              10
                                10
Monitoring Operation

• AWS Management Console provides a view into the
  operation




  – These screen-shots were taken at minute 27 of a 30-minute
    run
  – Configuration default in this case was for 2 map slots
  – First slot became available at 12:00, second around 12:10

                   © J Singh, 2012                              11
                           11
Elastic Map Reduce – Debugging

• AWS console and the log files provide clues on what went
  wrong and how to fix it

• Make a change that will break the operation and examine
  the AWS console to find the error you introduced
   – Introduce a parsing error in the mapper program
   – Uncomment these lines to have it raise an exception
                 import random
                 x = 1 / random.randint(0,1000)
   – Save the file to an S3 bucket and run
   – Can you find where EMR reveals what happened?


                     © J Singh, 2012                         12
                             12
Facebook Analytics – Summary

• Extend the architecture
   – Import Facebook data into S3
   – Change Map Reduce programs as required




                      © J Singh, 2012         13
                              13
Facebook Analytics – Observations

• Fetching and staging data is the real challenge in putting
  together an analytics solution
   – For unstructured data, it requires
       • An understanding of the data model at the source
       • Custom code to read it


   – For structured data, consider Pig/Hive (higher-level Hadoop
     components)
       • Pig/Hive can read/write tables formatted as CSV/TSV files in S3
          – Either we need to bring files into S3
          – Or point Pig/Hive at a JDBC connection
       • An opportunity to rethink the ETL pipeline?


                       © J Singh, 2012                                 14
                                 14
Facebook Analytics – Data Collection

• The exercise is based on everyone‟s Facebook data
• Log into http://apps.facebook.com/map-reduce-workshop
   – Requires permission to get
       • Information about you,
       • Your friends,
       • Your likes, your friends‟ likes.
   – Randomly selects 10 of those friends
   – Randomly selects 25 of their likes
   – Anonymizes your friends‟ Facebook IDs before storing into
     S3
• All data, even though opaque, will be deleted at the end of
 the workshop

                        © J Singh, 2012                          15
                                  15
Facebook Analytics – Data Collected




Original = 75   Friends = 750        Likes = up to about 20,000

• Each user record shows anonymized user ID and their likes
   –   4110002004281   ['21506845769', '345722385482735', '93433060687']




                        © J Singh, 2012                              16
                                16
Facebook Analytics – Likes Count

• Use the AWS Management Console >> Elastic MapReduce
  – Define Job Flow
      • Hadoop Version 1.0.3
      • Run Your Own Application
         – Streaming
  – Specify Parameters
      • For input files, use bucket datathinks-users
      • For output files, you need to define your own S3 bucket
         – In a separate browser tab, AWS Management Console >> S3
      • Mapper: copy goo.gl/PcLK4 into a bucket you own
  – Advanced options:
      • Choose a fresh log file location
  – Accept all other defaults and go!
                       © J Singh, 2012                               17
                               17
Viewing the Results

• The results of Data Analysis are available in S3.
   – Partial example:     139784736075551      1
                          140413412750046      6
                          184331976202         3
                          220854914702193      1
                          29092950651          1


• How to interpret the results.
   – Sort by frequency, then examine most frequent likes
       • 140413412750046 is cryptic
       • But http://www.facebook.com/pages/w/140413412750046
         reveals what it is (DataThinks)
• Requires further action: what to do with the results?
                        © J Singh, 2012                        18
                                18
Algorithm Discussion

• The algorithm based on exact matches for likes may be
  too restrictive
  – „Ella Fitzgerald‟ != „Duke Ellington‟
  – But people who like Ella Fitzgerald may be reachable the
    same way as people who like Duke Ellington

  – An idea to explore further:
      • Is there a way to find ID‟s that we might consider equivalent?




                      © J Singh, 2012                                    19
                              19
Data Collected and Embellished




Original = 75   Friends = 750   Likes = 15,000   Similar Likes = 150,000




                         © J Singh, 2012                                   20
                                  20
Extended Facebook Analytics – Summary

• Extend the architecture
   – Get mappers to fetch “similar likes” from the internet




                        © J Singh, 2012                       21
                                21
Facebook Analytics – Showing Results

• The other challenge in putting together an analytics
  solution is displaying results
   – Demo of our results page




                    © J Singh, 2012                      22
                            22
Take-away Messages

• Map Reduce is simple, Hadoop is one implementation of MR…
   – …made even simpler by services like Elastic Map Reduce


• But Map Reduce requires a different style of programming…
   – …and a different set of techniques for debugging


• Facebook data can get big very quickly…
   – …and storage and bandwidth costs can dominate your solution


• Analytics is an iterative (agile) process…
   – …each iteration requires evaluating results, and tuning the algorithms,
     possibly the acquisition of more data

                       © J Singh, 2012                                  23
                                23
Thank you

• J Singh
   – President, Early Stage IT
       • Technology Services and Strategy for Startups


• DataThinks.org is a service of Early Stage IT
   – “Big Data” analytics solutions




                      © J Singh, 2012                    24
                              24

More Related Content

What's hot

Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015Joydeep Sen Sarma
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop EcosystemLior Sidi
 
Facebook Retrospective - Big data-world-europe-2012
Facebook Retrospective - Big data-world-europe-2012Facebook Retrospective - Big data-world-europe-2012
Facebook Retrospective - Big data-world-europe-2012Joydeep Sen Sarma
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and FutureDataWorks Summit
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadooproyans
 
Summary machine learning and model deployment
Summary machine learning and model deploymentSummary machine learning and model deployment
Summary machine learning and model deploymentNovita Sari
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceMahantesh Angadi
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop EasyNick Dimiduk
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemShivaji Dutta
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big DataDataWorks Summit
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Casesnzhang
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Uwe Printz
 
Hadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-DelhiHadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-DelhiJoydeep Sen Sarma
 
Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache DrillDataWorks Summit
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyRohit Kulkarni
 
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010Jonathan Seidman
 
Jan 2013 HUG: Cloud-Friendly Hadoop and Hive
Jan 2013 HUG: Cloud-Friendly Hadoop and HiveJan 2013 HUG: Cloud-Friendly Hadoop and Hive
Jan 2013 HUG: Cloud-Friendly Hadoop and HiveYahoo Developer Network
 
Getting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceGetting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceobdit
 

What's hot (20)

Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
Facebook Retrospective - Big data-world-europe-2012
Facebook Retrospective - Big data-world-europe-2012Facebook Retrospective - Big data-world-europe-2012
Facebook Retrospective - Big data-world-europe-2012
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
 
Summary machine learning and model deployment
Summary machine learning and model deploymentSummary machine learning and model deployment
Summary machine learning and model deployment
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop Easy
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
 
Hadoop Primer
Hadoop PrimerHadoop Primer
Hadoop Primer
 
Nextag talk
Nextag talkNextag talk
Nextag talk
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big Data
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
 
Hadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-DelhiHadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-Delhi
 
Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache Drill
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
 
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
 
Jan 2013 HUG: Cloud-Friendly Hadoop and Hive
Jan 2013 HUG: Cloud-Friendly Hadoop and HiveJan 2013 HUG: Cloud-Friendly Hadoop and Hive
Jan 2013 HUG: Cloud-Friendly Hadoop and Hive
 
Getting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceGetting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduce
 

Viewers also liked

Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Am...
Apache Hadoop India Summit 2011 talk  "Making Hadoop Enterprise Ready with Am...Apache Hadoop India Summit 2011 talk  "Making Hadoop Enterprise Ready with Am...
Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Am...Yahoo Developer Network
 
Introduction to Elastic MapReduce
Introduction to Elastic MapReduceIntroduction to Elastic MapReduce
Introduction to Elastic MapReduceAmazon Web Services
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce ParadigmDilip Reddy
 
Scaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMRScaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMRIsrael AWS User Group
 
October 2016 HUG: The Pillars of Effective Data Archiving and Tiering in Hadoop
October 2016 HUG: The Pillars of Effective Data Archiving and Tiering in HadoopOctober 2016 HUG: The Pillars of Effective Data Archiving and Tiering in Hadoop
October 2016 HUG: The Pillars of Effective Data Archiving and Tiering in HadoopYahoo Developer Network
 
October 2016 HUG: Pulsar,  a highly scalable, low latency pub-sub messaging s...
October 2016 HUG: Pulsar,  a highly scalable, low latency pub-sub messaging s...October 2016 HUG: Pulsar,  a highly scalable, low latency pub-sub messaging s...
October 2016 HUG: Pulsar,  a highly scalable, low latency pub-sub messaging s...Yahoo Developer Network
 
Hadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job PerformanceHadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job PerformanceCloudera, Inc.
 
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...Yahoo Developer Network
 
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...Amazon Web Services
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce AlgorithmsAmund Tveit
 
Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)Amazon Web Services
 
BigData_Chp5: Putting it all together
BigData_Chp5: Putting it all togetherBigData_Chp5: Putting it all together
BigData_Chp5: Putting it all togetherLilia Sfaxi
 
BigData_TP3 : Spark
BigData_TP3 : SparkBigData_TP3 : Spark
BigData_TP3 : SparkLilia Sfaxi
 
Building a Sustainable Data Platform on AWS
Building a Sustainable Data Platform on AWSBuilding a Sustainable Data Platform on AWS
Building a Sustainable Data Platform on AWSSmartNews, Inc.
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 
Big data, map reduce and beyond
Big data, map reduce and beyondBig data, map reduce and beyond
Big data, map reduce and beyonddatasalt
 
Dynamo and BigTable - Review and Comparison
Dynamo and BigTable - Review and ComparisonDynamo and BigTable - Review and Comparison
Dynamo and BigTable - Review and ComparisonGrisha Weintraub
 
Slideshare Powerpoint presentation
Slideshare Powerpoint presentationSlideshare Powerpoint presentation
Slideshare Powerpoint presentationelliehood
 

Viewers also liked (20)

Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Am...
Apache Hadoop India Summit 2011 talk  "Making Hadoop Enterprise Ready with Am...Apache Hadoop India Summit 2011 talk  "Making Hadoop Enterprise Ready with Am...
Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Am...
 
Introduction to Elastic MapReduce
Introduction to Elastic MapReduceIntroduction to Elastic MapReduce
Introduction to Elastic MapReduce
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
 
Scaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMRScaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMR
 
October 2016 HUG: The Pillars of Effective Data Archiving and Tiering in Hadoop
October 2016 HUG: The Pillars of Effective Data Archiving and Tiering in HadoopOctober 2016 HUG: The Pillars of Effective Data Archiving and Tiering in Hadoop
October 2016 HUG: The Pillars of Effective Data Archiving and Tiering in Hadoop
 
October 2016 HUG: Pulsar,  a highly scalable, low latency pub-sub messaging s...
October 2016 HUG: Pulsar,  a highly scalable, low latency pub-sub messaging s...October 2016 HUG: Pulsar,  a highly scalable, low latency pub-sub messaging s...
October 2016 HUG: Pulsar,  a highly scalable, low latency pub-sub messaging s...
 
Hadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job PerformanceHadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job Performance
 
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
 
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)
 
BigData_Chp5: Putting it all together
BigData_Chp5: Putting it all togetherBigData_Chp5: Putting it all together
BigData_Chp5: Putting it all together
 
BigData_TP3 : Spark
BigData_TP3 : SparkBigData_TP3 : Spark
BigData_TP3 : Spark
 
Building a Sustainable Data Platform on AWS
Building a Sustainable Data Platform on AWSBuilding a Sustainable Data Platform on AWS
Building a Sustainable Data Platform on AWS
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
MapReduce in Simple Terms
MapReduce in Simple TermsMapReduce in Simple Terms
MapReduce in Simple Terms
 
Big data, map reduce and beyond
Big data, map reduce and beyondBig data, map reduce and beyond
Big data, map reduce and beyond
 
Bigtable and Dynamo
Bigtable and DynamoBigtable and Dynamo
Bigtable and Dynamo
 
Dynamo and BigTable - Review and Comparison
Dynamo and BigTable - Review and ComparisonDynamo and BigTable - Review and Comparison
Dynamo and BigTable - Review and Comparison
 
Slideshare Powerpoint presentation
Slideshare Powerpoint presentationSlideshare Powerpoint presentation
Slideshare Powerpoint presentation
 

Similar to Facebook Analytics with Elastic Map/Reduce

[AWS DC Meetup] Not Your Father’s WebApp: The Cloud-Native Architecture of im...
[AWS DC Meetup] Not Your Father’s WebApp: The Cloud-Native Architecture of im...[AWS DC Meetup] Not Your Father’s WebApp: The Cloud-Native Architecture of im...
[AWS DC Meetup] Not Your Father’s WebApp: The Cloud-Native Architecture of im...Chris Shenton
 
Not Your Father’s Web App: The Cloud-Native Architecture of images.nasa.gov
Not Your Father’s Web App: The Cloud-Native Architecture of images.nasa.govNot Your Father’s Web App: The Cloud-Native Architecture of images.nasa.gov
Not Your Father’s Web App: The Cloud-Native Architecture of images.nasa.govChris Shenton
 
SQL to NoSQL: Top 6 Questions
SQL to NoSQL: Top 6 QuestionsSQL to NoSQL: Top 6 Questions
SQL to NoSQL: Top 6 QuestionsMike Broberg
 
MongoDB for Spatio-Behavioral Data Analysis and Visualization
MongoDB for Spatio-Behavioral Data Analysis and VisualizationMongoDB for Spatio-Behavioral Data Analysis and Visualization
MongoDB for Spatio-Behavioral Data Analysis and VisualizationMongoDB
 
2013 CPM Conference, Nov 6th, NoSQL Capacity Planning
2013 CPM Conference, Nov 6th, NoSQL Capacity Planning2013 CPM Conference, Nov 6th, NoSQL Capacity Planning
2013 CPM Conference, Nov 6th, NoSQL Capacity Planningasya999
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
eHarmony in the Cloud
eHarmony in the CloudeHarmony in the Cloud
eHarmony in the CloudCraig Dickson
 
Shop talk - Project Server 2013
Shop talk - Project Server 2013Shop talk - Project Server 2013
Shop talk - Project Server 2013Chris Givens
 
Accelerating Data Science with Better Data Engineering on Databricks
Accelerating Data Science with Better Data Engineering on DatabricksAccelerating Data Science with Better Data Engineering on Databricks
Accelerating Data Science with Better Data Engineering on DatabricksDatabricks
 
SharePoint Saturday - Chicago - 2014 - Decoding the Business Intelligence Alp...
SharePoint Saturday - Chicago - 2014 - Decoding the Business Intelligence Alp...SharePoint Saturday - Chicago - 2014 - Decoding the Business Intelligence Alp...
SharePoint Saturday - Chicago - 2014 - Decoding the Business Intelligence Alp...Scott_Brickey
 
Using Power BI and Azure as analytics engine for business applications
Using Power BI and Azure as analytics engine for business applicationsUsing Power BI and Azure as analytics engine for business applications
Using Power BI and Azure as analytics engine for business applicationsDigital Illustrated
 
Tableau & MongoDB: Visual Analytics at the Speed of Thought
Tableau & MongoDB: Visual Analytics at the Speed of ThoughtTableau & MongoDB: Visual Analytics at the Speed of Thought
Tableau & MongoDB: Visual Analytics at the Speed of ThoughtMongoDB
 
Weathering the Data Storm – How SnapLogic and AWS Deliver Analytics in the Cl...
Weathering the Data Storm – How SnapLogic and AWS Deliver Analytics in the Cl...Weathering the Data Storm – How SnapLogic and AWS Deliver Analytics in the Cl...
Weathering the Data Storm – How SnapLogic and AWS Deliver Analytics in the Cl...SnapLogic
 
Tableau Seattle BI Event How Tableau Changed My Life
Tableau Seattle BI Event How Tableau Changed My LifeTableau Seattle BI Event How Tableau Changed My Life
Tableau Seattle BI Event How Tableau Changed My LifeRussell Spangler
 
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriarAdf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriarNilesh Shah
 
SQL Saturday Columbus 2014 PowerBI with SQL Excel and SharePoint
SQL Saturday Columbus 2014 PowerBI with SQL Excel and SharePointSQL Saturday Columbus 2014 PowerBI with SQL Excel and SharePoint
SQL Saturday Columbus 2014 PowerBI with SQL Excel and SharePointScott_Brickey
 

Similar to Facebook Analytics with Elastic Map/Reduce (20)

[AWS DC Meetup] Not Your Father’s WebApp: The Cloud-Native Architecture of im...
[AWS DC Meetup] Not Your Father’s WebApp: The Cloud-Native Architecture of im...[AWS DC Meetup] Not Your Father’s WebApp: The Cloud-Native Architecture of im...
[AWS DC Meetup] Not Your Father’s WebApp: The Cloud-Native Architecture of im...
 
Not Your Father’s Web App: The Cloud-Native Architecture of images.nasa.gov
Not Your Father’s Web App: The Cloud-Native Architecture of images.nasa.govNot Your Father’s Web App: The Cloud-Native Architecture of images.nasa.gov
Not Your Father’s Web App: The Cloud-Native Architecture of images.nasa.gov
 
SQL to NoSQL: Top 6 Questions
SQL to NoSQL: Top 6 QuestionsSQL to NoSQL: Top 6 Questions
SQL to NoSQL: Top 6 Questions
 
MongoDB for Spatio-Behavioral Data Analysis and Visualization
MongoDB for Spatio-Behavioral Data Analysis and VisualizationMongoDB for Spatio-Behavioral Data Analysis and Visualization
MongoDB for Spatio-Behavioral Data Analysis and Visualization
 
2013 CPM Conference, Nov 6th, NoSQL Capacity Planning
2013 CPM Conference, Nov 6th, NoSQL Capacity Planning2013 CPM Conference, Nov 6th, NoSQL Capacity Planning
2013 CPM Conference, Nov 6th, NoSQL Capacity Planning
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
eHarmony in the Cloud
eHarmony in the CloudeHarmony in the Cloud
eHarmony in the Cloud
 
Shop talk - Project Server 2013
Shop talk - Project Server 2013Shop talk - Project Server 2013
Shop talk - Project Server 2013
 
Accelerating Data Science with Better Data Engineering on Databricks
Accelerating Data Science with Better Data Engineering on DatabricksAccelerating Data Science with Better Data Engineering on Databricks
Accelerating Data Science with Better Data Engineering on Databricks
 
SharePoint Saturday - Chicago - 2014 - Decoding the Business Intelligence Alp...
SharePoint Saturday - Chicago - 2014 - Decoding the Business Intelligence Alp...SharePoint Saturday - Chicago - 2014 - Decoding the Business Intelligence Alp...
SharePoint Saturday - Chicago - 2014 - Decoding the Business Intelligence Alp...
 
Using Power BI and Azure as analytics engine for business applications
Using Power BI and Azure as analytics engine for business applicationsUsing Power BI and Azure as analytics engine for business applications
Using Power BI and Azure as analytics engine for business applications
 
Dax & sql in power bi
Dax & sql in power biDax & sql in power bi
Dax & sql in power bi
 
L19 Application Architecture
L19 Application ArchitectureL19 Application Architecture
L19 Application Architecture
 
Tableau & MongoDB: Visual Analytics at the Speed of Thought
Tableau & MongoDB: Visual Analytics at the Speed of ThoughtTableau & MongoDB: Visual Analytics at the Speed of Thought
Tableau & MongoDB: Visual Analytics at the Speed of Thought
 
Weathering the Data Storm – How SnapLogic and AWS Deliver Analytics in the Cl...
Weathering the Data Storm – How SnapLogic and AWS Deliver Analytics in the Cl...Weathering the Data Storm – How SnapLogic and AWS Deliver Analytics in the Cl...
Weathering the Data Storm – How SnapLogic and AWS Deliver Analytics in the Cl...
 
Tableau Seattle BI Event How Tableau Changed My Life
Tableau Seattle BI Event How Tableau Changed My LifeTableau Seattle BI Event How Tableau Changed My Life
Tableau Seattle BI Event How Tableau Changed My Life
 
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriarAdf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
 
EMR and DynamoDB
EMR and DynamoDBEMR and DynamoDB
EMR and DynamoDB
 
Building a Front End for a Sensor Data Cloud
Building a Front End for a Sensor Data CloudBuilding a Front End for a Sensor Data Cloud
Building a Front End for a Sensor Data Cloud
 
SQL Saturday Columbus 2014 PowerBI with SQL Excel and SharePoint
SQL Saturday Columbus 2014 PowerBI with SQL Excel and SharePointSQL Saturday Columbus 2014 PowerBI with SQL Excel and SharePoint
SQL Saturday Columbus 2014 PowerBI with SQL Excel and SharePoint
 

More from J Singh

Designing analytics for big data
Designing analytics for big dataDesigning analytics for big data
Designing analytics for big dataJ Singh
 
Open LSH - september 2014 update
Open LSH  - september 2014 updateOpen LSH  - september 2014 update
Open LSH - september 2014 updateJ Singh
 
PaaS - google app engine
PaaS  - google app enginePaaS  - google app engine
PaaS - google app engineJ Singh
 
Mining of massive datasets using locality sensitive hashing (LSH)
Mining of massive datasets using locality sensitive hashing (LSH)Mining of massive datasets using locality sensitive hashing (LSH)
Mining of massive datasets using locality sensitive hashing (LSH)J Singh
 
Data Analytic Technology Platforms: Options and Tradeoffs
Data Analytic Technology Platforms: Options and TradeoffsData Analytic Technology Platforms: Options and Tradeoffs
Data Analytic Technology Platforms: Options and TradeoffsJ Singh
 
Social Media Mining using GAE Map Reduce
Social Media Mining using GAE Map ReduceSocial Media Mining using GAE Map Reduce
Social Media Mining using GAE Map ReduceJ Singh
 
High Throughput Data Analysis
High Throughput Data AnalysisHigh Throughput Data Analysis
High Throughput Data AnalysisJ Singh
 
NoSQL and MapReduce
NoSQL and MapReduceNoSQL and MapReduce
NoSQL and MapReduceJ Singh
 
CS 542 -- Concurrency Control, Distributed Commit
CS 542 -- Concurrency Control, Distributed CommitCS 542 -- Concurrency Control, Distributed Commit
CS 542 -- Concurrency Control, Distributed CommitJ Singh
 
CS 542 -- Failure Recovery, Concurrency Control
CS 542 -- Failure Recovery, Concurrency ControlCS 542 -- Failure Recovery, Concurrency Control
CS 542 -- Failure Recovery, Concurrency ControlJ Singh
 
CS 542 -- Query Optimization
CS 542 -- Query OptimizationCS 542 -- Query Optimization
CS 542 -- Query OptimizationJ Singh
 
CS 542 -- Query Execution
CS 542 -- Query ExecutionCS 542 -- Query Execution
CS 542 -- Query ExecutionJ Singh
 
CS 542 Putting it all together -- Storage Management
CS 542 Putting it all together -- Storage ManagementCS 542 Putting it all together -- Storage Management
CS 542 Putting it all together -- Storage ManagementJ Singh
 
CS 542 Parallel DBs, NoSQL, MapReduce
CS 542 Parallel DBs, NoSQL, MapReduceCS 542 Parallel DBs, NoSQL, MapReduce
CS 542 Parallel DBs, NoSQL, MapReduceJ Singh
 
CS 542 Database Index Structures
CS 542 Database Index StructuresCS 542 Database Index Structures
CS 542 Database Index StructuresJ Singh
 
CS 542 Controlling Database Integrity and Performance
CS 542 Controlling Database Integrity and PerformanceCS 542 Controlling Database Integrity and Performance
CS 542 Controlling Database Integrity and PerformanceJ Singh
 
CS 542 Overview of query processing
CS 542 Overview of query processingCS 542 Overview of query processing
CS 542 Overview of query processingJ Singh
 
CS 542 Introduction
CS 542 IntroductionCS 542 Introduction
CS 542 IntroductionJ Singh
 
Cloud Computing from an Entrpreneur's Viewpoint
Cloud Computing from an Entrpreneur's ViewpointCloud Computing from an Entrpreneur's Viewpoint
Cloud Computing from an Entrpreneur's ViewpointJ Singh
 

More from J Singh (19)

Designing analytics for big data
Designing analytics for big dataDesigning analytics for big data
Designing analytics for big data
 
Open LSH - september 2014 update
Open LSH  - september 2014 updateOpen LSH  - september 2014 update
Open LSH - september 2014 update
 
PaaS - google app engine
PaaS  - google app enginePaaS  - google app engine
PaaS - google app engine
 
Mining of massive datasets using locality sensitive hashing (LSH)
Mining of massive datasets using locality sensitive hashing (LSH)Mining of massive datasets using locality sensitive hashing (LSH)
Mining of massive datasets using locality sensitive hashing (LSH)
 
Data Analytic Technology Platforms: Options and Tradeoffs
Data Analytic Technology Platforms: Options and TradeoffsData Analytic Technology Platforms: Options and Tradeoffs
Data Analytic Technology Platforms: Options and Tradeoffs
 
Social Media Mining using GAE Map Reduce
Social Media Mining using GAE Map ReduceSocial Media Mining using GAE Map Reduce
Social Media Mining using GAE Map Reduce
 
High Throughput Data Analysis
High Throughput Data AnalysisHigh Throughput Data Analysis
High Throughput Data Analysis
 
NoSQL and MapReduce
NoSQL and MapReduceNoSQL and MapReduce
NoSQL and MapReduce
 
CS 542 -- Concurrency Control, Distributed Commit
CS 542 -- Concurrency Control, Distributed CommitCS 542 -- Concurrency Control, Distributed Commit
CS 542 -- Concurrency Control, Distributed Commit
 
CS 542 -- Failure Recovery, Concurrency Control
CS 542 -- Failure Recovery, Concurrency ControlCS 542 -- Failure Recovery, Concurrency Control
CS 542 -- Failure Recovery, Concurrency Control
 
CS 542 -- Query Optimization
CS 542 -- Query OptimizationCS 542 -- Query Optimization
CS 542 -- Query Optimization
 
CS 542 -- Query Execution
CS 542 -- Query ExecutionCS 542 -- Query Execution
CS 542 -- Query Execution
 
CS 542 Putting it all together -- Storage Management
CS 542 Putting it all together -- Storage ManagementCS 542 Putting it all together -- Storage Management
CS 542 Putting it all together -- Storage Management
 
CS 542 Parallel DBs, NoSQL, MapReduce
CS 542 Parallel DBs, NoSQL, MapReduceCS 542 Parallel DBs, NoSQL, MapReduce
CS 542 Parallel DBs, NoSQL, MapReduce
 
CS 542 Database Index Structures
CS 542 Database Index StructuresCS 542 Database Index Structures
CS 542 Database Index Structures
 
CS 542 Controlling Database Integrity and Performance
CS 542 Controlling Database Integrity and PerformanceCS 542 Controlling Database Integrity and Performance
CS 542 Controlling Database Integrity and Performance
 
CS 542 Overview of query processing
CS 542 Overview of query processingCS 542 Overview of query processing
CS 542 Overview of query processing
 
CS 542 Introduction
CS 542 IntroductionCS 542 Introduction
CS 542 Introduction
 
Cloud Computing from an Entrpreneur's Viewpoint
Cloud Computing from an Entrpreneur's ViewpointCloud Computing from an Entrpreneur's Viewpoint
Cloud Computing from an Entrpreneur's Viewpoint
 

Recently uploaded

Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 

Recently uploaded (20)

Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 

Facebook Analytics with Elastic Map/Reduce

  • 1. Data + Algorithms = Knowledge Facebook Analytics With Elastic Map/Reduce – a Hands-on Workshop November 12, 2012 J Singh, DataThinks.org 1
  • 2. Take-away Messages • Map Reduce is simple, Hadoop is one implementation of MR… – …made even simpler by services like Elastic Map Reduce • But Map Reduce requires a different style of programming… – …and a different set of techniques for debugging • Facebook data can get big very quickly… – …and storage and bandwidth costs can dominate your solution • Analytics is an iterative (agile) process… – …each iteration requires evaluating results, and tuning the algorithms, possibly the acquisition of more data © J Singh, 2012 2 2
  • 3. Signing Up for AWS The steps required to obtain an AWS account  Create an AWS account (http://aws.amazon.com). – http://www.slideshare.net/AmazonWebServices/video-how-to-sign-up-for- amazon-web-services-8700872 – Requires a valid credit card and a phone based identification.  Sign in to the AWS Management Console – http://aws.amazon.com/console © J Singh, 2012 3 3
  • 4. Elastic Map Reduce Resources • Summary of the offering • Elastic MapReduce Training • Getting Started Guide • Developers Guide © J Singh, 2012 4 4
  • 5. MapReduce Conceptual Underpinnings • Based on Functional Programming model – From Lisp • (map square '(1 2 3 4)) (1 4 9 16) • (reduce plus '(1 4 9 16)) 30 – From APL • +/ N N  1 2 3 4 • Easy to distribute (based on each element of the vector) • New for Map/Reduce: Nice failure/retry semantics – Hundreds and thousands of low-end servers are running at the same time © J Singh, 2012 5 5
  • 6. MapReduce Flow © J Singh, 2012 6 6
  • 7. Elastic Map Reduce – Summary • Hadoop installed and maintained by Amazon – We can focus on programming – Offers a few options on map and reduce programs • Streaming – Map and Reduce programs connect through stdin and stdout – Allows Map and Reduce to be written in any language • Hive, Pig – Translates to Map/Reduce JARs – Can cascade M/R pipelines • Custom JAR – for special cases © J Singh, 2012 7 7
  • 8. Elastic Map Reduce – Architecture • Starting with data in S3 • EMR Service initiates the job • Hadoop Master coordinates operation • Slave nodes are initiated and data loaded into them • Extra nodes can be invoked if needed • Results are copied back into S3 – Nodes are destroyed © J Singh, 2012 8 8
  • 9. Elastic Map Reduce – Word Count • Use the AWS Management Console >> Elastic MapReduce – Define Job Flow • Hadoop Version 1.0.3 • Run your own application – Steaming – Specify Parameters • For input files, elasticmapreduce/samples/wordcount/input • For output files, you need to define your own S3 bucket – In a separate browser tab, AWS Management Console >> S3 – Bucket names can include lowercase letters, numbers, period, dash • Mapper code can be seen at http://goo.gl/EbCme – Copy this code to one of your buckets – Specify path <your-bucket>/wordSplitter.py © J Singh, 2012 9 9
  • 10. Elastic Map Reduce – Word Count (p2) • Configure EC2 Instances • Advanced Options – Optional: Amazon EC2 Key Pair • To log into the master and make changes to a running job – E.g,, add extra nodes to speed up processing – Amazon S3 Log Path • <your-bucket>/log-2012-11-12--19-30 • Accept all other defaults and go! © J Singh, 2012 10 10
  • 11. Monitoring Operation • AWS Management Console provides a view into the operation – These screen-shots were taken at minute 27 of a 30-minute run – Configuration default in this case was for 2 map slots – First slot became available at 12:00, second around 12:10 © J Singh, 2012 11 11
  • 12. Elastic Map Reduce – Debugging • AWS console and the log files provide clues on what went wrong and how to fix it • Make a change that will break the operation and examine the AWS console to find the error you introduced – Introduce a parsing error in the mapper program – Uncomment these lines to have it raise an exception import random x = 1 / random.randint(0,1000) – Save the file to an S3 bucket and run – Can you find where EMR reveals what happened? © J Singh, 2012 12 12
  • 13. Facebook Analytics – Summary • Extend the architecture – Import Facebook data into S3 – Change Map Reduce programs as required © J Singh, 2012 13 13
  • 14. Facebook Analytics – Observations • Fetching and staging data is the real challenge in putting together an analytics solution – For unstructured data, it requires • An understanding of the data model at the source • Custom code to read it – For structured data, consider Pig/Hive (higher-level Hadoop components) • Pig/Hive can read/write tables formatted as CSV/TSV files in S3 – Either we need to bring files into S3 – Or point Pig/Hive at a JDBC connection • An opportunity to rethink the ETL pipeline? © J Singh, 2012 14 14
  • 15. Facebook Analytics – Data Collection • The exercise is based on everyone‟s Facebook data • Log into http://apps.facebook.com/map-reduce-workshop – Requires permission to get • Information about you, • Your friends, • Your likes, your friends‟ likes. – Randomly selects 10 of those friends – Randomly selects 25 of their likes – Anonymizes your friends‟ Facebook IDs before storing into S3 • All data, even though opaque, will be deleted at the end of the workshop © J Singh, 2012 15 15
  • 16. Facebook Analytics – Data Collected Original = 75 Friends = 750 Likes = up to about 20,000 • Each user record shows anonymized user ID and their likes – 4110002004281 ['21506845769', '345722385482735', '93433060687'] © J Singh, 2012 16 16
  • 17. Facebook Analytics – Likes Count • Use the AWS Management Console >> Elastic MapReduce – Define Job Flow • Hadoop Version 1.0.3 • Run Your Own Application – Streaming – Specify Parameters • For input files, use bucket datathinks-users • For output files, you need to define your own S3 bucket – In a separate browser tab, AWS Management Console >> S3 • Mapper: copy goo.gl/PcLK4 into a bucket you own – Advanced options: • Choose a fresh log file location – Accept all other defaults and go! © J Singh, 2012 17 17
  • 18. Viewing the Results • The results of Data Analysis are available in S3. – Partial example: 139784736075551 1 140413412750046 6 184331976202 3 220854914702193 1 29092950651 1 • How to interpret the results. – Sort by frequency, then examine most frequent likes • 140413412750046 is cryptic • But http://www.facebook.com/pages/w/140413412750046 reveals what it is (DataThinks) • Requires further action: what to do with the results? © J Singh, 2012 18 18
  • 19. Algorithm Discussion • The algorithm based on exact matches for likes may be too restrictive – „Ella Fitzgerald‟ != „Duke Ellington‟ – But people who like Ella Fitzgerald may be reachable the same way as people who like Duke Ellington – An idea to explore further: • Is there a way to find ID‟s that we might consider equivalent? © J Singh, 2012 19 19
  • 20. Data Collected and Embellished Original = 75 Friends = 750 Likes = 15,000 Similar Likes = 150,000 © J Singh, 2012 20 20
  • 21. Extended Facebook Analytics – Summary • Extend the architecture – Get mappers to fetch “similar likes” from the internet © J Singh, 2012 21 21
  • 22. Facebook Analytics – Showing Results • The other challenge in putting together an analytics solution is displaying results – Demo of our results page © J Singh, 2012 22 22
  • 23. Take-away Messages • Map Reduce is simple, Hadoop is one implementation of MR… – …made even simpler by services like Elastic Map Reduce • But Map Reduce requires a different style of programming… – …and a different set of techniques for debugging • Facebook data can get big very quickly… – …and storage and bandwidth costs can dominate your solution • Analytics is an iterative (agile) process… – …each iteration requires evaluating results, and tuning the algorithms, possibly the acquisition of more data © J Singh, 2012 23 23
  • 24. Thank you • J Singh – President, Early Stage IT • Technology Services and Strategy for Startups • DataThinks.org is a service of Early Stage IT – “Big Data” analytics solutions © J Singh, 2012 24 24

Editor's Notes

  1. Get started with Hadoop
  2. Get started with Hadoop