A living hell - lessons learned in eight years of parsing real estate data

Slides of the talk delivered by Ed Freyfogle (@freyfogle) at #csvconf in Berlin on 15 July 2014.

  1. A living hell: lessons learned in eight years of processing real estate listings
      - Ed Freyfogle
      - CSVConf, Berlin, 15 July 2014
  2. Nestoria (http://www.nestoria.com)
      - Residential property search engine in nine markets
      - 3-4 million unique users per month
      - Processing close to 20M listings daily
      - Extensive experience / painful lessons in ETL, geocoding, deduping, ...
  3. What we do
      - Real estate is a complex, high-value transaction.
      - Our goal is to be:
          - Simple
          - Comprehensive
          - Fast (user time and time to market)
  4. Where does the data come from?
      - Seller
      - Agent 1, Agent 2, Agent 3
  5. Where does the data come from?
      - Seller
      - Agent 1, Agent 2, Agent 3
      - Portal 1, Portal 2, Portal 3
  8. Where does the data come from?
      - Seller
      - Agent 1, Agent 2, Agent 3
      - Portal 1, Portal 2, Portal 3
      - Plenty of chances for data to go bad
  9. Where we do it
  10. India
      - Very, very good at:
          - Cricket
          - Amazing cuisine
          - World’s largest democracy
          - Too many other things to list here
  11. India
      - Very, very good at:
          - Cricket
          - Amazing cuisine
          - World’s largest democracy
          - Too many other things to list here
      - Utterly fucking terrible at:
          - Real Estate data quality
          - Addresses / Geodata
  12. What we really do
      - Must garbage in be garbage out?
      - Can we turn multiple bits of shit into something useful?
  13. What we really do
      - Must garbage in be garbage out?
      - Can we turn multiple bits of shit into something useful?
      - Diagram labels: Chaos, something useful
  14. Examples / Horror stories
      - Caveat: I love our clients
      - All the examples you are about to see are theoretical *wink, wink*
  15. Getting the data
      - Us: “Please set up an automated data transfer. Thx!”
      - Them: “It’s impossible to export the data from the database”
      - Them: “Just crawl our website”
      - Them: “Let’s do incremental updates to save bandwidth”
      - Them: “I’ll just send you an email when there is new stuff … starting when I get back from holiday”
  16. Fun with files
      - zip or tar full of subdirs, the names of which change with each upload
      - filename “feed.xml?key=SsKpyM62QN0RbqCwnaAc”
      - One file per agent; when a file is not supplied, there is no way to know whether it is missing due to an error or intentionally
      - Format A on Monday, B on Tuesday, ...
  17. XML, LOL
      - <Description>Residential Plot available in Suncity&amp;lt;br /&amp;gt; &amp;lt;br&amp;gt;&amp;lt;br /&amp;gt; &amp;lt;br&amp;gt;SUNCITY PROJECT&amp;lt;br /&amp;gt; &amp;lt;br&amp;gt;&amp;lt;br /&amp;gt; &amp;lt;br&amp;gt;A complete township...
      - "&amp;gt;" - for when you really, really want to be sure you've escaped your XML
      - anyone?
  18. Go grep yourself
      - One 500 MB file of XML
      - On a single line
      - … to save space
  19. CSV, LOL
      - Newlines, newlines, newlines
      - Choose your delimiter wisely - ^B
      - So simple even a child could get it wrong
      - Microsoft quotes vs. ASCII quotes
      - Excel vs. CSV
  20. Wrong tool for right job
      - Them: “We will send the data in X format” (where X is a large industry player)
      - Us: “Not even X uses that format”
      - Them: “We use X format, but changed it slightly so we could ….”
      - Us: *sigh*
  21. Unique identifiers
      - Are they really unique?
      - Are they unique across time?
      - Partner re-uses numeric unique ids … in case there is ever a shortage of numbers
  22. I’m ranting
      - Topics we haven’t yet even touched upon:
          - Character encodings
          - Geocoding / Parsing addresses
          - Image processing/classification at scale
          - Parsing free text descriptions
          - Deduplication
          - Too many other things to list here
  23. What have we learned?
      - Never trust, check everything, every single time
      - Tests, tests, tests, tests
      - Embrace UNIX philosophy of many small tools in a chain
      - Reuse rather than reinvent (but not always)
      - Technology helps manage the problem, it is not “the solution”. Problems are almost always cultural not technical
  24. Why do they hate us?
      - Misaligned incentives
      - Technology laggards
      - Apathy
      - Ignorance
  25. The solution
      - Tricked you - there is of course no single perfect solution
      - Closest thing is dialog, ideally face to face.
      - People generally want to do the right thing, but need help to know why and how to do it.
      - One five-minute conversation is often more useful than five months of email
  26. One more thing
      - Unless you hate life, do NOT try to scrape real estate data
      - Re-read the line above.
      - Our API: http://nestoria.com/api
  27. Learn more
      - http://nestoria.com and http://nestoria.com/api
      - http://devblog.nestoria.com - our dev blog
      - http://www.lokku.com - our parent company
      - http://opencagedata.com - all your geocoding are belong to us
      - Twitter: @nestoria, @lokku, @opencagedata, @freyfogle
      - Slides will be on http://slideshare.net/lokku later today
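
The "XML, LOL" slide shows markup that was entity-escaped more than once (`&amp;lt;br /&amp;gt;`). A minimal Python sketch of one way to undo that is to unescape repeatedly until the text stops changing. This is an illustration, not the pipeline used in the talk: the helper name `unescape_fully` and the `max_rounds` cap are assumptions made here.

```python
import html


def unescape_fully(text: str, max_rounds: int = 5) -> str:
    """Apply html.unescape until the text stops changing or max_rounds is hit.

    Hypothetical helper: the name and the round limit are illustrative only.
    """
    for _ in range(max_rounds):
        decoded = html.unescape(text)
        if decoded == text:  # nothing left to decode
            break
        text = decoded
    return text


# Fragment as it sits in the raw feed, before any XML parsing.
raw = "Residential Plot available in Suncity&amp;lt;br /&amp;gt; SUNCITY PROJECT"
print(unescape_fully(raw))
# prints: Residential Plot available in Suncity<br /> SUNCITY PROJECT
```

The trade-off is that text which legitimately contains a literal `&amp;lt;` will be decoded too far, so a check like this probably belongs behind a per-feed switch rather than applied to every source.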

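For the "Go grep yourself" case (one 500 MB XML file on a single line), line-oriented tools are of little use, but a streaming parser does not care about newlines. The sketch below uses `xml.etree.ElementTree.iterparse` from the Python standard library, which reads its input in chunks. The element names `feed`, `listing`, `id` and `price` are assumptions for the example, not the format of any real partner feed.

```python
import io
import xml.etree.ElementTree as ET


def iter_listings(source, tag="listing"):
    """Yield one dict per <listing> element without building the whole tree.

    `source` is a filename or a binary file object; `tag` is a hypothetical
    element name chosen for this example.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield {child.tag: (child.text or "") for child in elem}
            # Drop the element's children and text once consumed; the emptied
            # element itself stays attached to the root.
            elem.clear()


# A whole feed on one line, as in the slide (tiny stand-in for 500 MB).
sample = io.BytesIO(
    b"<feed><listing><id>1</id><price>100</price></listing>"
    b"<listing><id>2</id><price>250</price></listing></feed>"
)
for listing in iter_listings(sample):
    print(listing)
```

Because each listing is handed over as soon as its closing tag is seen, a malformed record further along in the file does not prevent the earlier ones from being processed.
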
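
The "CSV, LOL" slide lists embedded newlines, an unusual delimiter (^B, the control character 0x02) and Microsoft "smart" quotes. A hedged Python sketch of reading such a file with the standard `csv` module follows. It assumes the exporting tool replaced the quoting characters with curly ones, so mapping them back to ASCII before parsing is safe; the helper name `read_rows` and the sample columns are invented for the example.

```python
import csv
import io

# Map Microsoft "smart" quotes to their ASCII equivalents.
SMART_QUOTES = str.maketrans({
    "\u201c": '"', "\u201d": '"',  # curly double quotes
    "\u2018": "'", "\u2019": "'",  # curly single quotes
})


def read_rows(text, delimiter="\x02"):
    """Parse delimiter-separated text into a list of rows.

    Hypothetical helper: assumes curly quotes were meant as field quoting.
    """
    cleaned = text.translate(SMART_QUOTES)
    # newline="" leaves line endings untouched so the csv module can keep
    # newlines that appear inside quoted fields.
    return list(csv.reader(io.StringIO(cleaned, newline=""), delimiter=delimiter))


sample = (
    "id\x02title\x02description\r\n"
    "1\x02\u201cSunny flat\u201d\x02\"Two bedrooms,\nclose to station\"\r\n"
)
for row in read_rows(sample):
    print(row)
```

If curly quotes can also occur inside a field as ordinary punctuation, normalising them before parsing would corrupt the quoting, and it is safer to parse first and clean the individual fields afterwards.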