A living hell: lessons learned in eight years of processing real estate listings
Ed Freyfogle
CSVConf Berlin
15 July 2014
Slides of the talk delivered by Ed Freyfogle (@freyfogle) at #csvconf in Berlin on 15 July 2014.

Published in: Internet

  1. A living hell: lessons learned in eight years of processing real estate listings
      • Ed Freyfogle
      • CSVConf Berlin
      • 15 July 2014
  2. http://www.nestoria.com
      • Residential property search engine in nine markets
      • 3-4 million unique users per month
      • Processing close to 20M listings daily
      • Extensive experience / painful lessons in ETL, geocoding, deduping, ...
  3. What we do
      • Real estate is a complex, high-value transaction. Our goal is to be:
      • Simple
      • Comprehensive
      • Fast (user time and time to market)
  4. Where does the data come from?
      • Seller
      • Agent 1, Agent 2, Agent 3
  5. Where does the data come from?
      • Seller
      • Agent 1, Agent 2, Agent 3
      • Portal 1, Portal 2, Portal 3
  6. Where does the data come from?
      • Seller
      • Agent 1, Agent 2, Agent 3
      • Portal 1, Portal 2, Portal 3
  7. Where does the data come from?
      • Seller
      • Agent 1, Agent 2, Agent 3
      • Portal 1, Portal 2, Portal 3
  8. Where does the data come from?
      • Seller
      • Agent 1, Agent 2, Agent 3
      • Portal 1, Portal 2, Portal 3
      • Plenty of chances for data to go bad
  9. Where we do it
  10. India
      • Very, very good at: Cricket; Amazing cuisine; World’s largest democracy; Too many other things to list here
  11. India
      • Very, very good at: Cricket; Amazing cuisine; World’s largest democracy; Too many other things to list here
      • Utterly fucking terrible at: Real Estate data quality; Addresses / Geodata
  12. What we really do
      • Must garbage in be garbage out?
      • Can we turn multiple bits of shit into something useful?
  13. What we really do
      • Must garbage in be garbage out?
      • Can we turn multiple bits of shit into something useful?
      • [diagram labels: something useful, Chaos]
  14. Examples / Horror stories
      • Caveat: I love our clients
      • All the examples you are about to see are theoretical *wink, wink*
  15. Getting the data
      • Us: “Please set up an automated data transfer. Thx!”
      • Them: “It’s impossible to export the data from the database”
      • Them: “Just crawl our website”
      • Them: “Let’s do incremental updates to save bandwidth”
      • Them: “I’ll just send you an email when there is new stuff … starting when I get back from holiday”
  16. Fun with files
      • zip or tar full of subdirs, the names of which change with each upload
      • filename “feed.xml?key=SsKpyM62QN0RbqCwnaAc”
      • One file per agent; when a file is not supplied, there is no way to know whether it is missing due to an error or intentionally
      • Format A on Monday, B on Tuesday, ...
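One way to cope with the file problems above is to ignore the directory layout entirely and normalise member names before matching. The following is a minimal sketch, not the speaker's actual pipeline: the function names `clean_member_name`, `find_feeds` and `missing_sources`, and the idea of keeping an expected-source list, are assumptions made for illustration.

```python
import io
import zipfile
from pathlib import PurePosixPath


def clean_member_name(raw_name: str) -> str:
    """Drop directories and any '?key=...' suffix from an archive member name."""
    base = PurePosixPath(raw_name).name
    return base.split("?", 1)[0]


def find_feeds(archive_bytes: bytes, wanted_suffix: str = ".xml") -> list:
    """Return (clean_name, content) pairs for every feed file in a zip,
    regardless of which subdirectory it was uploaded in."""
    feeds = []
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            name = clean_member_name(info.filename)
            if name.lower().endswith(wanted_suffix):
                feeds.append((name, zf.read(info)))
    return feeds


def missing_sources(expected: set, received: set) -> list:
    """List sources we expected a file from but did not get.
    This cannot say *why* a file is missing, only that someone should ask."""
    return sorted(expected - received)
```

Reporting the missing sources does not resolve the ambiguity the slide describes; it only makes sure the absence is noticed instead of silently dropping an agent's listings.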
  17. XML, LOL
      • <Description>Residential Plot available in Suncity&amp;lt;br /&amp;gt; &amp;lt;br&amp;gt;&amp;lt;br /&amp;gt; &amp;lt;br&amp;gt;SUNCITY PROJECT&amp;lt;br /&amp;gt; &amp;lt;br&amp;gt;&amp;lt;br /&amp;gt; &amp;lt;br&amp;gt;A complete township...
      • "&amp;gt;" - for when you really, really want to be sure you've escaped your XML anyone?
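Text that has been escaped more than once, as in the example above, can be repaired by unescaping repeatedly until the string stops changing. This is a minimal sketch using Python's standard `html.unescape`; the helper name `unescape_fully` and the `max_rounds` cap are assumptions for illustration.

```python
import html


def unescape_fully(text: str, max_rounds: int = 5) -> str:
    """Undo repeated entity escaping by unescaping until a fixed point.

    max_rounds is a safety cap so that odd input cannot loop for long.
    """
    for _ in range(max_rounds):
        decoded = html.unescape(text)
        if decoded == text:
            break
        text = decoded
    return text
```

For example, `unescape_fully("Suncity&amp;lt;br /&amp;gt;")` yields `Suncity<br />`. The trade-off is that a description which legitimately contains a literal `&lt;` will also be decoded, so this is only appropriate for fields known to carry over-escaped markup.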
  18. One 500 MB file of XML
      • On a single line … to save space
      • Go grep yourself
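A very large single-line XML file defeats line-oriented tools, but a streaming parser does not care about line breaks. The sketch below uses the standard library's `xml.etree.ElementTree.iterparse`; the record tag `listing` and the flat child-element layout are hypothetical, since the slides do not show the feed schema.

```python
import xml.etree.ElementTree as ET


def iter_records(stream, record_tag: str = "listing"):
    """Yield one dict per record element without loading the whole file.

    `stream` is a binary file object or a filename; `record_tag` is an
    assumed element name for a single listing.
    """
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == record_tag:
            # Children are fully parsed once the record's "end" event fires.
            yield {child.tag: (child.text or "") for child in elem}
            # Free the record's children so memory use stays roughly flat.
            elem.clear()
```

Each record is handed over as soon as its closing tag is read, so downstream checks can start before the whole 500 MB has been parsed.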
  19. CSV, LOL
      • Newlines, newlines, newlines
      • Choose your delimiter wisely - ^B
      • So simple even a child could get it wrong
      • Microsoft quotes vs. ASCII quotes
      • Excel vs. CSV
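Several of the CSV problems above (newlines inside fields, unusual delimiters such as ^B, "smart" quotes) can be handled with Python's standard `csv` module instead of splitting on delimiters by hand. This is a sketch under assumptions: the function name `read_rows` and the opt-in `normalize_quotes` flag are illustrative, and whether curly quotes should be rewritten at all depends on the feed.

```python
import csv
import io

# Map typographic ("Microsoft") quotes to their ASCII equivalents.
SMART_QUOTES = str.maketrans({
    "\u201c": '"', "\u201d": '"',
    "\u2018": "'", "\u2019": "'",
})


def read_rows(text: str, delimiter: str = "\x02", normalize_quotes: bool = False) -> list:
    """Parse delimited text, allowing newlines inside quoted fields.

    "\x02" is the ^B control character mentioned on the slide.
    """
    if normalize_quotes:
        text = text.translate(SMART_QUOTES)
    # newline="" leaves line endings untouched so the csv module can
    # handle embedded newlines itself.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter, quotechar='"')
    return list(reader)
```

Quote normalisation is off by default because it also rewrites curly quotes that appear legitimately inside descriptions.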
  20. Wrong tool for right job
      • Them: “we will send the data in X (where X is large industry player) format”
      • Us: “not even X uses that format”
      • Them: “We use X format, but changed it slightly so we could ….”
      • Us: *sigh*
  21. Unique identifiers
      • Are they really unique?
      • Are they unique across time?
      • Partner re-uses numeric unique ids … in case there is ever a shortage of numbers
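One defensive approach to untrustworthy identifiers is to namespace each partner's ids by source and to remember a fingerprint of the listing's stable attributes, so that a recycled id shows up as a mismatch. This is a sketch, not the method described in the talk; the field names `address` and `property_type` and all function names are hypothetical.

```python
import hashlib


def listing_key(source: str, partner_id: str) -> str:
    """Namespace the partner's id so ids from different sources never collide."""
    return f"{source}:{partner_id}"


def fingerprint(listing: dict, fields=("address", "property_type")) -> str:
    """Hash attributes that should not change for a genuine listing."""
    joined = "|".join(str(listing.get(f, "")).strip().lower() for f in fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def detect_reused_ids(seen: dict, source: str, listings: list) -> list:
    """Return keys whose fingerprint differs from the one stored in `seen`.

    `seen` maps listing_key -> fingerprint and is updated in place, so it
    can be persisted between feed runs.
    """
    reused = []
    for listing in listings:
        key = listing_key(source, str(listing["id"]))
        fp = fingerprint(listing)
        if key in seen and seen[key] != fp:
            reused.append(key)
        seen[key] = fp
    return reused
```

A flagged key is only a hint: a corrected address produces the same signal as a recycled id, so the result is best treated as a prompt for review.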
  22. I’m ranting
      • Topics we haven’t yet even touched upon:
      • Character encodings
      • Geocoding / Parsing addresses
      • Image processing/classification at scale
      • Parsing free text descriptions
      • Deduplication
      • Too many other things to list here
  23. What have we learned?
      • Never trust, check everything, every single time
      • Tests, tests, tests, tests
      • Embrace UNIX philosophy of many small tools in a chain
      • Reuse rather than reinvent (but not always)
      • Technology helps manage the problem, it is not “the solution”.
      • Problems are almost always cultural, not technical
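"Never trust, check everything" can be sketched as a small validation step that every record passes through on every run, in the spirit of one small tool in a chain. The field names (`id`, `price`, `latitude`, `longitude`) and the specific checks are assumptions for illustration; a real feed would need its own rules.

```python
import math


def _to_float(value):
    """Return value as a finite float, or None if it cannot be parsed."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    return number if math.isfinite(number) else None


def validate_listing(listing: dict) -> list:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    raw_id = listing.get("id")
    if raw_id is None or not str(raw_id).strip():
        problems.append("missing id")
    price = _to_float(listing.get("price"))
    if price is None or price <= 0:
        problems.append("bad price")
    lat = _to_float(listing.get("latitude"))
    if lat is None or not -90 <= lat <= 90:
        problems.append("bad latitude")
    lon = _to_float(listing.get("longitude"))
    if lon is None or not -180 <= lon <= 180:
        problems.append("bad longitude")
    return problems
```

Returning the full list of problems, instead of stopping at the first one, makes it easier to tell a partner everything that is wrong with a record in a single conversation.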
  24. Why do they hate us?
      • Misaligned incentives
      • Technology laggards
      • Apathy
      • Ignorance
  25. The solution
      • Tricked you - there is of course no single perfect solution
      • Closest thing is dialog, ideally face to face.
      • People generally want to do the right thing; they need help to know why and how to do it.
      • One five-minute conversation is often more useful than five months of email
  26. One more thing
      • Unless you hate life, do NOT try to scrape real estate data
      • Re-read the line above.
      • Our API: http://nestoria.com/api
  27. Learn more
      • http://nestoria.com and http://nestoria.com/api
      • http://devblog.nestoria.com - our dev blog
      • http://www.lokku.com - our parent company
      • http://opencagedata.com - all your geocoding are belong to us
      • Twitter: @nestoria, @lokku, @opencagedata, @freyfogle
      • Slides will be on http://slideshare.net/lokku later today
