Ana Martinez
Kin Lane

February 2012   M.C. Escher
The problem
Big Bottleneck!
Single point of failure!
Places Processing
Places Processing
              Source 2
              • Name
              • Address
              • Phone
              • reviews
  Source 1                 Source 3
  • Name                   • Name
  • Address                • Address
  • Phone                  • Phone
  • Images                 • menu



                CityGrid
                 Place
Why is it hard?
Book is to ISBN what Product is to UPC and what Place is to ______


No centrally regulated unique ID (a tax ID is one, but it is not public). Now what?

Spago
176 Canon Dr
Beverly Hills, CA 90210
310-944-3924



R. French Ac & Heating Inc               Ray French Air Conditioning & Heating Service
2211 martin luther king blvd             2211 MLK boulevard #104
los angeles, CA, 90069                   west Hollywood, CA, 90069
310-358-5903                             866-465-5303
Problem Definition
• Medium-sized data set
  – 21 million rows, 120 columns

• Time to process: Daily

• Hybrid environment

• Not all data comes from the same source
Solution




       Normalizer   Matcher   Merger
Normalizer


  Soundex     Metaphone      NYSIIS

  Match Rating Approach      Caverphone
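To make the phonetic-encoding idea concrete, here is a minimal sketch of American Soundex, the first algorithm named on the slide. It is an illustration only: the slides do not say which of these encoders CityGrid used in production, and the function name `soundex` is ours.

```python
# Letter-to-digit table for American Soundex.
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code of a name ('' if it has no letters)."""
    letters = [c for c in name.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w are skipped and do not separate equal codes
        code = CODES.get(ch, "")  # vowels and y map to ""
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is emitted again
        if len(result) == 4:
            break
    return (result + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))  # both encode to R163
```

Names that sound alike collapse to the same code, which makes the code usable as a cheap candidate key before more expensive comparisons.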
Know Your Data
Stop Words
 • The Viper Room           Viper Room

Stemming
 • av               aven           avenu
 • avenue           avn            avnue
Compression
 • county line      county rd      county road

Truncation
 • apt                      unit                 #
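A rough sketch of how three of the rules above could be applied. The word lists are taken from the slide; everything else is an assumption: the helper names `clean_name` and `clean_street` are hypothetical, the target spelling `ave` is our choice, and "truncation" is read here as cutting the street at the first unit marker. The compression rule is left out because the slide does not spell it out.

```python
STOP_WORDS = {"the"}
# Spelling variants from the slide, all mapped to one form (the target is our choice).
STREET_VARIANTS = {v: "ave" for v in ("av", "aven", "avenu", "avenue", "avn", "avnue")}
UNIT_MARKERS = {"apt", "unit", "#"}


def clean_name(name: str) -> str:
    """Lowercase a place name and drop stop words."""
    return " ".join(t for t in name.lower().split() if t not in STOP_WORDS)


def clean_street(street: str) -> str:
    """Lowercase a street line, unify variants, cut at the first unit marker."""
    tokens = street.lower().replace("#", " # ").split()
    kept = []
    for token in tokens:
        if token in UNIT_MARKERS:
            break  # assumed meaning of "truncation": drop the unit and what follows
        kept.append(STREET_VARIANTS.get(token, token))
    return " ".join(kept)


print(clean_name("The Viper Room"))           # viper room
print(clean_street("100 Main Avnue #4"))      # 100 main ave
```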
Normalizer
         123 Martin Luther King.n

           123 MartinLutherKing.

           123 martinlutherking.

    Martin Luther King | martinlutherking
                  canon column



          the | n | ave | (tokens)
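The slide suggests keeping the original value next to a canonical column built by lowercasing, stripping punctuation and digits, dropping noise tokens, and concatenating what is left. A minimal sketch under that reading (the function names and the exact token list are assumptions based on the slide):

```python
import re

# Noise tokens listed on the slide.
DROP_TOKENS = {"the", "n", "ave"}


def canon(street: str) -> str:
    """Build the canonical key: letters only, noise tokens removed, no spaces."""
    tokens = re.findall(r"[a-z]+", street.lower())
    return "".join(t for t in tokens if t not in DROP_TOKENS)


def with_canon(street: str) -> tuple[str, str]:
    """Return (display value, canonical column) so the original text is preserved."""
    return street, canon(street)


print(with_canon("123 Martin Luther King. N"))
```

Keeping both columns means the canonical key can change as the rules evolve without losing what the provider actually sent.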
Matching Strategy




   Do what you can in an automated fashion and
       complement it with manual steps.
Matching Strategy




Exact matching
            Set similarity joins
                                   Custom fuzzy matching
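The cascade above can be sketched as a pairwise check: try exact equality first, then a set-similarity score, and leave the rest for fuzzy or manual handling. The version below uses Jaccard similarity over character trigrams; the choice of trigrams, the 0.5 threshold, and the function names are illustrative assumptions, not values from the talk. A real set-similarity join would also use an index or prefix filtering to avoid comparing every pair.

```python
def qgrams(text: str, q: int = 3) -> set[str]:
    """Set of character q-grams, padded so short strings still produce grams."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two q-gram sets (1.0 means identical sets)."""
    ga, gb = qgrams(a), qgrams(b)
    return len(ga & gb) / len(ga | gb)


def match_stage(a: str, b: str, threshold: float = 0.5) -> str:
    """Report which stage of the cascade accepts the pair, if any."""
    if a == b:
        return "exact"
    if jaccard(a, b) >= threshold:
        return "set-similarity"
    return "needs-fuzzy-or-manual"


print(match_stage("viper room", "viper rooms"))
```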
Matching Strategy
• C - Support Vector Machine

• Threshold: 0.996
  – Precision: 98.1%
  – Recall: 97.5%




        84% + manual -> % Match Rate
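For readers less familiar with the metrics, this sketch shows how precision and recall follow from a score threshold. The scored pairs are invented toy data, not CityGrid results, and we assume the classifier emits a score where higher means "more likely the same place".

```python
def precision_recall(scored_pairs, threshold):
    """Compute (precision, recall) for pairs given as (score, is_true_match)."""
    tp = fp = fn = 0
    for score, is_match in scored_pairs:
        predicted = score >= threshold
        if predicted and is_match:
            tp += 1
        elif predicted:
            fp += 1
        elif is_match:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labelled pairs, for illustration only.
toy_pairs = [(0.999, True), (0.998, True), (0.997, False),
             (0.990, True), (0.400, False)]
print(precision_recall(toy_pairs, threshold=0.996))
```

Raising the threshold generally trades recall for precision, which is why pairs that fall just under a strict cutoff are natural candidates for the manual step.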
Merger

Rules:
   Provider trustworthiness
   Voting rules
   New data vs Old data
   Super providers
                              History:
                                         Accepted
                                         Rejected
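One way to combine provider trustworthiness with voting is a weighted vote per field. The sketch below is an assumption about how such a rule might look; the provider names and weights are hypothetical, and it does not model the new-versus-old rule or the accepted/rejected history.

```python
from collections import defaultdict


def merge_field(candidates, trust):
    """Pick a field value by trust-weighted vote.

    candidates: list of (provider, value) pairs for one field.
    trust: provider -> weight; unknown providers count as 1.0.
    """
    votes = defaultdict(float)
    for provider, value in candidates:
        if value:  # ignore empty or missing values
            votes[value] += trust.get(provider, 1.0)
    if not votes:
        return None
    return max(votes, key=votes.get)


phones = [("source1", "310-358-5903"),
          ("source2", "310-358-5903"),
          ("source3", "866-465-5303")]

# Two ordinary providers outvote a slightly more trusted one...
print(merge_field(phones, {"source3": 1.5}))
# ...but a heavily weighted "super provider" wins on its own.
print(merge_field(phones, {"source3": 2.5}))
```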
Example
123 M L K Road Ste 45                 123 Martin Luther King Rd             123 Martin L King Drive #45
123 m l k road ste 45                 123 martin luther king rd             123 martin l king drive #45
(123) (m) (l) (k) (road) (ste) (45)   (123) (martin) (luther) (king) (rd)   (123) (martin) (l) (king) (drive) (#) (45)
123 mlk road ste 45                   123 martinlutherkingrd                123 martinlking drive # 45
123 mlkrdste 45                       123 mlkrd                             123 mlkdr #45
123 mlkrd                             123 mlkrd                             123 mlkdr
123 mlk                               123 mlk                               123 mlk

MATCH!                                MATCH!                                MATCH!
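The walk-through above can be approximated in a few lines: lowercase, tokenize, cut at the unit marker, drop the street type, and collapse the remaining words to initials. This is a sketch of the idea, not the production normalizer; the token lists and the name `address_key` are assumptions. Reducing words to initials is aggressive and will also collide unrelated streets, so a key like this is better suited to finding candidates than to final decisions.

```python
import re

STREET_TYPES = {"road", "rd", "drive", "dr"}
UNIT_MARKERS = {"ste", "suite", "apt", "unit", "#"}


def address_key(address: str) -> str:
    """Reduce a street line to '<house number> <initials>'."""
    tokens = re.findall(r"[a-z0-9]+|#", address.lower())
    kept = []
    for token in tokens:
        if token in UNIT_MARKERS:
            break  # everything from the unit marker onward is dropped
        if token not in STREET_TYPES:
            kept.append(token)
    numbers = [t for t in kept if t.isdigit()]
    initials = "".join(t[0] for t in kept if not t.isdigit())
    return " ".join(numbers + ([initials] if initials else []))


for line in ("123 M L K Road Ste 45",
             "123 Martin Luther King Rd",
             "123 Martin L King Drive #45"):
    print(address_key(line))  # each prints: 123 mlk
```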
Findings & Tips
• Domain Knowledge




                     • Automation
                     • Mechanical Turk
                     • Machine Learning

  Run every 2hrs -> Match Rate of %
Solution for Search APIs
Solution for Places API
Performance Results
Updates


          • Hours


          • Real Time
Places Detail – Demo Time!
• Details by ID

  – http://api.citygridmedia.com/content/places/v2/detail?listing_id=11280452&client_ip=123.4.56.78&publisher=test

  – http://api.citygridmedia.com/content/places/v2/detail?public_id=pinks-hot-dogs-los-angeles-2&client_ip=123.4.56.78&publisher=test
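The two demo URLs differ only in which identifier they pass. A small helper can build either form; the sketch below only constructs the URL string and sends no request. The parameter names come from the slide, while the function name `detail_url` and the "exactly one identifier" check are our assumptions.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.citygridmedia.com/content/places/v2/detail"


def detail_url(*, listing_id=None, public_id=None,
               client_ip="123.4.56.78", publisher="test"):
    """Build a Places detail URL from either a listing_id or a public_id."""
    if (listing_id is None) == (public_id is None):
        raise ValueError("pass exactly one of listing_id or public_id")
    if listing_id is not None:
        params = {"listing_id": listing_id}
    else:
        params = {"public_id": public_id}
    params.update(client_ip=client_ip, publisher=publisher)
    return f"{BASE_URL}?{urlencode(params)}"


print(detail_url(listing_id=11280452))
print(detail_url(public_id="pinks-hot-dogs-los-angeles-2"))
```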
Improvements
• Shard Listing and Content Data

• Integrate Mongo across all APIs
APIs
        Now we have a rich Places API

How do we make developers aware it exists?

How do we get them to successfully integrate?
APIs – Supporting Developer Area
 Common Building Blocks

   • Getting Started
   • Publisher Overview
   • Documentation
   • FAQ
   • Terms of Use
APIs – Supporting Developer Area
 Developer Tools
   • Code Samples
   • Libraries
   • Mobile SDKs
   • Starter Kits
   • Hackathon Toolkits
   • Partner APIs
APIs – Evangelism - Online
 •   Blogging
 •   Twitter
 •   LinkedIn
 •   Facebook
 •   GitHub
 •   Stack Overflow
 •   Quora
 •   Hacker News
 •   StumbleUpon
 •   Reddit
APIs – Evangelism - Offline


 •   Conferences
 •   Hackathons
 •   Meetups
 •   Workshops
APIs – Easy Start + Engage Immediately

•   Testable APIs
•   Self-Service
•   Email After Registration
•   Follow on Twitter
•   Follow on LinkedIn
APIs – Feedback Loop + Voice

•   Email Support
•   Forum(s)
•   Twitter
•   LinkedIn
APIs – Monetization = Sustainability

•   Local Web Advertising
•   Local Mobile Advertising
•   Local Custom Ads
•   Places that Pay
APIs – Evangelize Internally

•   Developer Feedback
•   Roadmap Suggestions
•   Landscape Analysis
•   Technology Awareness
•   Trends
•   Internal Hackathons
APIs – Measure & Repeat


CityGrid Architecture + API Overview from O'Reilly Strata Conference

Editor's Notes

  1. Demo