Rapid Data Exploration
    With Hadoop
      Peter Skomoroch
     Senior Data Scientist




      @peteskomoroch
Outline
• Overview: LinkedIn Biz, Tech, & Analytics
• Rapid Data Exploration 101
        - Spatial Analytics Pig Code
        - Trend detection with Pig & Python
        - R Streaming Example
• Deep Dive: Our Data Analysis Approach
• Building Data Products
• LinkedIn Data Insights
Connect the world’s professionals to make
  them more productive and successful
Professional Identity
LinkedIn at a glance
• Founded in 2003
• #17 site in the US (Alexa)
• 60+ million members
• First million members = 477 days
• Latest million = 9 days
• 500K+ company profiles
• 12+ million small business professionals
• In 2009 - 1 billion people searches
• Average age: 41
• Household income $107,000
• 42% are “decision makers”
How International?
• More than 50% international
  (members in over 200 countries & territories)
• 13+ million in Europe
• 4+ million in India
• 3+ million in UK
• #13 site in UK (Alexa)
How do we keep the lights on?
• Profitable since 2007
• Valued at over $1B at the last funding round
• Subscriptions
• Ads
• Job Postings
• Enterprise Client
Hadoop on LinkedIn
1,400+ members list “Hadoop” on their profile
What other skills do they have?
• HBase, Lucene, Solr, MapReduce, Nutch...
Where are they?        Who do they work for?
 • 36% in Bay Area      • 11% Yahoo!
 • 8% in India          • 2% Apache Software Foundation
 • 6% in NYC            • 1% LinkedIn
 • 4% in Seattle        • 1% Google
 • 4% in Los Angeles    • 1% Facebook
Hadoop at LinkedIn
Voldemort Data Storage
Compact, compressed, binary data (something like Avro)
 Type can be any combination of int, double, float, String,
Map, List, etc. => Sequence Files
 Example member definition:
  {
     'member_id'  : 'int32',
     'first_name' : 'string',
     'last_name'  : 'string',
     'age'        : 'int32'
      …
  }
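A minimal sketch of how a record could be checked against a definition like the one above. The type names 'int32' and 'string' come from the slide; the mapping to Python types, the function names, and the four-field schema (the slide elides the remaining fields) are assumptions for illustration, not the actual Voldemort serialization code.

```python
# Hypothetical schema: only the four fields visible on the slide.
MEMBER_SCHEMA = {
    "member_id": "int32",
    "first_name": "string",
    "last_name": "string",
    "age": "int32",
}

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def conforms(value, type_name):
    """Return True if value fits the named primitive type."""
    if type_name == "int32":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return (
            isinstance(value, int)
            and not isinstance(value, bool)
            and INT32_MIN <= value <= INT32_MAX
        )
    if type_name == "string":
        return isinstance(value, str)
    raise ValueError(f"unsupported type: {type_name}")


def validate(record, schema):
    """Return a list of error strings; an empty list means the record is valid."""
    errors = []
    for field, type_name in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not conforms(record[field], type_name):
            errors.append(f"bad type for {field}: expected {type_name}")
    return errors
```

Running a check like this before writing records is one cheap way to catch malformed rows early.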
Getting Data In
• From databases (user data, news, jobs, etc.)
  • Need a way to get data reliably periodically
  • Need tests to verify data
  • Support for incremental replication
  • Solution: Transmogrify Driver Program
    • InputReader: JDBCReader, CSV Reader
    • Output Writer: JDBCWriter, HDFS writers
• From web logs (page views, search, clicks, etc.)
  • Web log files are rsynced and loaded into HDFS
  • Hadoop jobs for data cleaning and transformation
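The reader/writer split and incremental replication described above can be sketched roughly as follows. This is not the Transmogrify driver itself: the function and class names, the `id` column, and the high-water-mark approach to incremental loads are all assumptions made for illustration, and an in-memory list stands in for an HDFS or JDBC writer.

```python
import csv
import io


def csv_reader(text):
    """Yield one dict per CSV row (stand-in for an input reader)."""
    yield from csv.DictReader(io.StringIO(text))


class ListWriter:
    """Collects rows in memory (stand-in for an HDFS or JDBC writer)."""

    def __init__(self):
        self.rows = []

    def write(self, row):
        self.rows.append(row)


def run_driver(reader, writer, last_seen_id=0):
    """Copy rows whose id exceeds last_seen_id.

    Returns (rows_copied, new_high_water_mark) so the next run can
    resume where this one stopped.
    """
    copied = 0
    high_water = last_seen_id
    for row in reader:
        row_id = int(row["id"])
        if row_id <= last_seen_id:
            continue  # already replicated in an earlier run
        writer.write(row)
        copied += 1
        high_water = max(high_water, row_id)
    return copied, high_water
```

Returning the row count alongside the new mark gives a simple hook for the "tests to verify data" step, for example by comparing it with a count taken at the source.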
Getting Data Out
Giving Back: Open Source
http://sna-projects.com/sna/
Analytics Technologies
We Build Things With Data

           Give smart people great tools,
           enable them to solve problems
Prototyping Culture
How does Hadoop
 enable rapid data
   exploration?
Pig for Spatial Analytics
US County HeatMap
Pig for Trend Detection
Python Streaming Script
Sort Output & Display
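The code from these slides did not survive in the transcript. As a rough stand-in, here is a sketch of what a streaming reducer for trend detection might look like. The input layout (tab-separated key, ISO date, count), the scoring rule (latest count over the mean of earlier counts, with add-one smoothing), and all function names are assumptions, not the method shown in the talk.

```python
import sys
from itertools import groupby


def parse(lines):
    """Split tab-separated 'key<TAB>date<TAB>count' lines into tuples."""
    for line in lines:
        key, date, count = line.rstrip("\n").split("\t")
        yield key, date, int(count)


def trend_scores(lines):
    """Yield (key, score) per key.

    Input must arrive grouped by key, which is how Hadoop Streaming
    delivers records to a reducer.
    """
    for key, group in groupby(parse(lines), key=lambda rec: rec[0]):
        # ISO dates sort correctly as strings, so the last count is the latest.
        counts = [c for _, _, c in sorted(group, key=lambda rec: rec[1])]
        if len(counts) < 2:
            continue  # not enough history to compare against
        baseline = sum(counts[:-1]) / (len(counts) - 1)
        # Add-one smoothing keeps the ratio finite when the baseline is zero.
        yield key, (counts[-1] + 1) / (baseline + 1)


def top_trends(lines, n=10):
    """Return the n highest-scoring (key, score) pairs, best first."""
    return sorted(trend_scores(lines), key=lambda kv: kv[1], reverse=True)[:n]


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Entry point when used as a streaming reducer script."""
    for key, score in trend_scores(stdin):
        stdout.write(f"{key}\t{score:.3f}\n")
```

To use it as a streaming script, call `main()` at the bottom of the file; `top_trends` covers the sort-and-display step on the collected output.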
R Streaming Also Easy




*from http://www.stat.uiowa.edu/~luke/classes/295-hpc/
Let’s Talk Data
Business is recognizing the importance of analytics
What data do we start with?
We can also leverage...
• Connection Graph          • Company Pages
• Recommendations           • Talent Match
• Address Book Uploads      • Web Referrals
• Search Logs               • 1M+ Twitter Accounts
• Profile Views & Activity   • Wikipedia Data
• Job Postings              • Mechanical Turk
• LinkedIn Groups           • Census, BLS, & Data.gov
• LinkedIn Questions        • Much more...
How do we think of Analytics?




      Data Jujitsu
Lots of Medium can be
more powerful than Big


             >
Reconstruct Reality
        from Data Exhaust
Data Scientist Lessons
• Follow the data, avoid assumptions
• Sanity check the extremes (0, infinity)
• Don’t get mired in rare edge cases
• Data Jujitsu: solve easier auxiliary problems
• Build smaller consistent samples to test code
• Establish a baseline model quickly, iterate often
• Use the right tool for the job at hand
• Iterate quickly with high level languages
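One way to act on "build smaller consistent samples to test code" is to sample by hashing a stable key, so the same members land in the sample on every run and in every dataset. A minimal sketch, assuming a member id as the key; the function name, the salt value, and the choice of MD5 are illustrative, not taken from the talk.

```python
import hashlib


def in_sample(key, percent, salt="sample-v1"):
    """Deterministically decide whether key falls in a percent-sized sample."""
    digest = hashlib.md5(f"{salt}:{key}".encode("utf-8")).hexdigest()
    # Map the hash to a bucket in [0, 100) and keep the lowest `percent` buckets.
    return int(digest, 16) % 100 < percent
```

Because the decision depends only on the key and the salt, filtering two tables with the same call keeps them joinable, and a 10% sample is always a subset of a 20% sample. Changing the salt draws a fresh sample.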
Where did the bankers go?
We’re Hiring!
http://sna-projects.com/sna/
pskomoro@linkedin.com
@peteskomoroch

 

Rapid Data Exploration With Hadoop

  • 1. Rapid Data Exploration With Hadoop Peter Skomoroch Senior Data Scientist @peteskomoroch
  • 2. Outline • Overview: LinkedIn Biz, Tech, & Analytics • Rapid Data Exploration 101 - Spatial Analytics Pig Code - Trend detection with Pig & Python - R Streaming Example • Deep Dive: Our Data Analysis Approach • Building Data Products • LinkedIn Data Insights
  • 3. Connect the world’s professionals to make them more productive and successful
  • 5. LinkedIn at a glance • Founded in 2003 • #17 site in the US (Alexa) • 60+ million members • First million members = 477 days • Latest million = 9 days • 500K+ company profiles • 12+ million small business professionals • In 2009: 1 billion people searches • Average age: 41 • Household income: $107,000 • 42% are “decision makers”
  • 6. How International? • More than 50% international (members in over 200 countries & territories) • 13+ million in Europe • 4+ million in India • 3+ million in UK • #13 site in UK (Alexa)
  • 7. How do we keep the lights on? • Profitable since 2007 • Valued at over $1B at the last funding round • Subscriptions • Ads • Job Postings • Enterprise Client
  • 8. Hadoop on LinkedIn 1,400+ members list “Hadoop” on their profile What other skills do they have? • HBase, Lucene, Solr, MapReduce, Nutch... Where are they? • 36% in Bay Area • 8% in India • 6% in NYC • 4% in Seattle • 4% in Los Angeles Who do they work for? • 11% Yahoo! • 2% Apache Software Foundation • 1% LinkedIn • 1% Google • 1% Facebook
  • 10. Voldemort Data Storage Compact, compressed, binary data (something like Avro) Type can be any combination of int, double, float, String, Map, List, etc. => Sequence Files Example member definition: { 'member_id': 'int32', 'first_name': 'string', 'last_name': 'string', 'age': 'int32' … }
  • 11. Getting Data In • From databases (user data, news, jobs, etc.) • Need a way to get data reliably and periodically • Need tests to verify data • Support for incremental replication • Solution: Transmogrify Driver Program • InputReader: JDBCReader, CSV Reader • Output Writer: JDBCWriter, HDFS writers • From web logs (page views, search, clicks, etc.) • Weblog files are rsynced and loaded into HDFS • Hadoop jobs for data cleaning and transformation
  • 13. Giving Back: Open Source http://sna-projects.com/sna/
  • 15. We Build Things With Data Give smart people great tools, enable them to solve problems
  • 17. How does Hadoop enable rapid data exploration?
  • 18. Pig for Spatial Analytics
  • 20. Pig for Trend Detection
  • 22. Sort Output & Display
  • 23. R Streaming Also Easy *from http://www.stat.uiowa.edu/~luke/classes/295-hpc/
  • 25. Business is recognizing the importance of analytics
  • 26. What data do we start with?
  • 27. We can also leverage... • Connection Graph • Company Pages • Recommendations • Talent Match • Address Book Uploads • Web Referrals • Search Logs • 1M+ Twitter Accounts • Profile Views & Activity • Wikipedia Data • Job Postings • Mechanical Turk • LinkedIn Groups • Census, BLS, & Data.gov • LinkedIn Questions • Much more...
  • 28. How do we think of Analytics? Data Jujitsu
  • 29. Lots of Medium can be more powerful than Big
  • 30. Reconstruct Reality from Data Exhaust
  • 31. Data Scientist Lessons • Follow the data, avoid assumptions • Sanity check the extremes (0, infinity) • Don’t get mired in rare edge cases • Data Jujitsu: solve easier auxiliary problems • Build smaller, consistent samples to test code • Establish a baseline model quickly, iterate often • Use the right tool for the job at hand • Iterate quickly with high-level languages
  • 32. Where did the bankers go?