A living hell - lessons learned in eight years of parsing real estate data

lokku
A living hell: lessons learned in eight years of processing real estate listings
Ed Freyfogle
CSVConf Berlin
15 July 2014
Residential property search engine in nine markets
3-4 million unique users per month
Processing close to 20M listings daily
Extensive experience / painful lessons in ETL, geocoding, deduping, ...
http://www.nestoria.com
What we do
Real estate is a complex, high-value transaction. Our goal is to be:
Simple
Comprehensive
Fast (user time and time to market)
Where does the data come from?
Seller
Agent 1
Agent 2
Agent 3
Portal 1
Portal 2
Portal 3
Plenty of chances for data to go bad
Where we do it
India
Very, very good at:
Cricket
Amazing cuisine
World’s largest democracy
Too many other things to list here
Utterly fucking terrible at:
Real Estate data quality
Addresses / Geodata
Must garbage in be garbage out?
Can we turn multiple bits of shit into something useful?
What we really do
something
useful
Chaos
Caveat: I love our clients
All the examples you are about to see are theoretical *wink, wink*
Examples / Horror stories
Us: “Please set up an automated data transfer. Thx!”
Them: “It’s impossible to export the data from the database”
Them: “Just crawl our website”
Them: “Let’s do incremental updates to save bandwidth”
Them: “I’ll just send you an email when there is new stuff … starting when I get back from holiday”
Getting the data
zip or tar full of subdirs, names of which change with each upload
filename “feed.xml?key=SsKpyM62QN0RbqCwnaAc”
One file per agent; when a file is not supplied, there is no way to know whether it is missing due to an error or intentionally
Format A on Monday, B on Tuesday, ...
Fun with files
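A minimal sketch of one way to cope with archives whose directory names change on every upload: ignore the directories and match on the base filename, after dropping any stray query string such as "?key=...". The function names and the ".xml" suffix are assumptions for illustration, not what Nestoria actually runs.

```python
import zipfile
from pathlib import PurePosixPath


def clean_feed_name(raw_name: str) -> str:
    """Drop a trailing query string such as '?key=...' from a saved filename."""
    return raw_name.split("?", 1)[0]


def find_feed_members(archive: zipfile.ZipFile, suffix: str = ".xml") -> list[str]:
    """Return archive members that match the suffix, whatever directory they sit in."""
    matches = []
    for info in archive.infolist():
        if info.is_dir():
            continue
        # Only the last path component matters; directory names are not stable.
        base = clean_feed_name(PurePosixPath(info.filename).name)
        if base.lower().endswith(suffix):
            matches.append(info.filename)
    return sorted(matches)
```

This does nothing about the "missing file" problem; telling an error apart from an intentional omission still needs an explicit manifest or a conversation with the sender.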
<Description>Residential Plot available in Suncity&amp;lt;br /&amp;gt;&#13;&amp;lt;br&amp;gt;&amp;lt;br /&amp;gt;&#13;&amp;lt;br&amp;gt;SUNCITY PROJECT&amp;lt;br /&amp;gt;&#13;&amp;lt;br&amp;gt;&amp;lt;br /&amp;gt;&#13;&amp;lt;br&amp;gt;A complete township...
"&amp;gt;" - for when you really, really want to be sure you've escaped your XML
&#13; anyone?
XML, LOL
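One hedged way to handle text that has been escaped more than once is to unescape repeatedly until nothing changes. The sketch below uses Python's standard html.unescape; the function name and the round limit are assumptions for illustration.

```python
import html


def unescape_fully(text: str, max_rounds: int = 5) -> str:
    """Apply html.unescape until the text stops changing or max_rounds is reached."""
    for _ in range(max_rounds):
        decoded = html.unescape(text)
        if decoded == text:
            break
        text = decoded
    return text


# "&amp;lt;br /&amp;gt;" needs two rounds to become a plain "<br />" tag.
print(unescape_fully("&amp;lt;br /&amp;gt;"))
```

The trade-off is that text which legitimately contains an escaped entity will also be decoded, so this only suits feeds already known to over-escape.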
One 500 MB file of XML
On a single line … to save space
Go grep yourself
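Line-oriented tools are no help with a single-line file, but a streaming parser does not care about newlines. A minimal sketch using the standard library's iterparse, assuming each record is wrapped in a hypothetical "listing" element:

```python
import io
import xml.etree.ElementTree as ET


def iter_listings(source, tag: str = "listing"):
    """Yield matching elements one at a time instead of loading the whole tree."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem
            # Free the element's children once the caller has finished with it.
            elem.clear()


sample = io.BytesIO(
    b"<feed><listing><id>1</id></listing><listing><id>2</id></listing></feed>"
)
for listing in iter_listings(sample):
    print(listing.findtext("id"))
```

The root element still keeps a reference to each emptied child, so very large feeds may need further pruning.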
Newlines, newlines, newlines
Choose your delimiter wisely - ^B
So simple even a child could get it wrong
Microsoft quotes vs. ASCII quotes
Excel vs. CSV
CSV, LOL
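A sketch of how the three problems above (newlines inside fields, a ^B delimiter, Microsoft "smart" quotes) might be handled with Python's csv module. It assumes the curly quotes were meant as field quoting, which will not hold for every feed; the function name is hypothetical.

```python
import csv
import io

# Map Microsoft-style "smart" quotes to their ASCII equivalents.
SMART_QUOTES = str.maketrans({
    "\u201c": '"', "\u201d": '"',
    "\u2018": "'", "\u2019": "'",
})


def read_rows(text: str, delimiter: str = "\x02") -> list[list[str]]:
    """Parse delimited text, keeping newlines that sit inside quoted fields."""
    cleaned = text.translate(SMART_QUOTES)
    # newline="" leaves line endings untouched so the csv module can handle them.
    stream = io.StringIO(cleaned, newline="")
    return list(csv.reader(stream, delimiter=delimiter))


raw = "id\x02description\r\n1\x02\u201cTwo bed flat,\nnear station\u201d\r\n"
for row in read_rows(raw):
    print(row)
```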
Them: “We will send the data in X format” (where X is a large industry player)
Us: “Not even X uses that format”
Them: “We use X format, but changed it slightly so we could …”
Us: *sigh*
Wrong tool for right job
Are they really unique?
Are they unique across time?
Partner re-uses numeric unique ids … in case there is ever a shortage of numbers
Unique identifiers
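One way to catch a partner recycling ids is to store a content fingerprint alongside each id and flag any id whose fingerprint changes. This is a sketch under assumptions: the field names ("address", "property_type") are hypothetical, and choosing fields that stay stable for one property is the hard part.

```python
import hashlib


def fingerprint(listing: dict, fields=("address", "property_type")) -> str:
    """Hash the fields that should stay constant for one physical property."""
    joined = "\x1f".join(str(listing.get(f, "")).strip().lower() for f in fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def find_reused_ids(previous: dict, current: list[dict]) -> list[str]:
    """Return ids whose fingerprint differs from the one recorded earlier."""
    reused = []
    for listing in current:
        seen = previous.get(listing["id"])
        if seen is not None and seen != fingerprint(listing):
            reused.append(listing["id"])
    return reused
```

Here "previous" maps each id to the fingerprint recorded on an earlier run; ids never seen before are not flagged.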
I’m ranting
Topics we haven’t yet even touched upon:
Character encodings
Geocoding / Parsing addresses
Image processing/classification at scale
Parsing free text descriptions
Deduplication
Too many other things to list here
Never trust, check everything, every single time
Tests, tests, tests, tests
Embrace UNIX philosophy of many small tools in a chain
Reuse rather than reinvent (but not always)
Technology helps manage the problem, it is not “the solution”.
Problems are almost always cultural not technical
What have we learned?
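The "check everything" and "many small tools in a chain" lessons can be sketched as a pipeline of tiny generator stages, each doing one check. The stage names and record fields below are assumptions for illustration only.

```python
def require_fields(records, fields=("id", "price")):
    """Pass through only records where every required field is present."""
    for rec in records:
        if all(rec.get(f) not in (None, "") for f in fields):
            yield rec


def parse_price(records):
    """Convert price to an int, dropping records whose price is not numeric."""
    for rec in records:
        try:
            price = int(str(rec["price"]).replace(",", ""))
        except ValueError:
            continue
        yield {**rec, "price": price}


def run_pipeline(records, stages):
    """Feed the records through each stage in order and collect the result."""
    for stage in stages:
        records = stage(records)
    return list(records)


raw = [
    {"id": "a", "price": "1,200"},
    {"id": "b", "price": "call us"},
    {"id": "", "price": "5"},
]
print(run_pipeline(raw, [require_fields, parse_price]))
```

Each stage is small enough to test on its own, which is what makes "tests, tests, tests, tests" practical.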
Misaligned incentives
Technology laggards
Apathy
Ignorance
Why do they hate us?
Tricked you - there is of course no single perfect solution
Closest thing is dialog, ideally face to face.
People generally want to do the right thing; they need help to know why and how to do it.
One five-minute conversation is often more useful than five months of email
The solution
Unless you hate life, do NOT try to scrape real estate data
Re-read the line above.
Our API: http://nestoria.com/api
One more thing
http://nestoria.com and http://nestoria.com/api
http://devblog.nestoria.com - our dev blog
http://www.lokku.com - our parent company
http://opencagedata.com - all your geocoding are belong to us
Twitter: @nestoria, @lokku, @opencagedata, @freyfogle
Slides will be on http://slideshare.net/lokku later today
Learn more