Data Science At Zillow


Published in: Data & Analytics


  1. DATA SCIENCE AT ZILLOW: The Zestimate® and Beyond
  2. Machine Learning vs. Statistics: Glossary (Rob Tibshirani)
     Machine learning | Statistics
     network, graphs | model
     weights | parameters
     learning | fitting
     generalization | test set performance
     supervised learning | regression/classification
     unsupervised learning | density estimation, clustering
     large grant = $1,000,000 | large grant = $50,000
     nice place to have a meeting: Snowbird, Utah, French Alps | nice place to have a meeting: Las Vegas in August
  3. Decision Tree: Machine Learning vs. Statistics
     • Ross Quinlan (1993): Programs for Machine Learning (C4.5)
     • Breiman et al. (1984): Classification and Regression Trees (CART)
  4. Zillow Traffic & Usage
     • More than 73 million unique users visited Zillow’s mobile apps and websites.
       – Source: Internal tracking via Google Analytics, December 2014
     • The Yahoo!-Zillow Real Estate Network is the largest real estate network on the Web.
       – Source: comScore Media Metrix Real Estate Category Ranking by Unique Visitors, November 2014, US Data
     • Zillow.com is the largest rental site on the Web.
       – Source: comScore Media Metrix Real Estate Category Ranking by Unique Visitors, November 2014, US Data
     • 4 out of 5 U.S. homes have been viewed on Zillow.
       – Source: Zillow Internal, December 2014
     • Zillow has data on more than 110 million U.S. homes.
       – Source: Zillow Internal, December 2014
     • Zestimates and Rent Zestimates on more than 100 million U.S. homes.
       – Source: Zillow Internal, December 2014
  5. How Big are Our Data? Zestimate Scoring Data Size
     Homes on Zillow: 110 million
     Home attributes: 103
     Double precision: 8 bytes
     Time series: 220 months
     Total: 20 TB ~ 110M * 103 * 220 * 8 bytes
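The total on this slide can be checked with a few lines of arithmetic. The sketch below uses the slide's own figures and assumes a dense homes x attributes x months array of doubles and decimal terabytes (1 TB = 10**12 bytes); the function name is made up for illustration.

```python
# Figures copied from the slide; decimal terabytes are an assumption.
HOMES = 110_000_000
ATTRIBUTES = 103
MONTHS = 220
BYTES_PER_DOUBLE = 8


def scoring_data_bytes(homes=HOMES, attributes=ATTRIBUTES,
                       months=MONTHS, width=BYTES_PER_DOUBLE):
    """Size in bytes of a dense homes x attributes x months array of doubles."""
    return homes * attributes * months * width


total_tb = scoring_data_bytes() / 10**12
print(f"{total_tb:.2f} TB")  # prints "19.94 TB", i.e. roughly 20 TB
```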
  6. How Frequently Do Our Data Change?
  7. Agenda
     Name | Topics
     Yeng Bun, Sr. Data Scientist | Zestimates and Zillow Home Value Index
     Mike Babb, GIS Analyst | Automated Waterfront Determination and Home Street Feature Discovery
     Nick McClure, Sr. Data Scientist | Data Cleaning, Fraud Detection, and Address Matching
  8. How Do We Do It? Prototype (Interactive mode): Query, Analysis, Modeling, Visualization. Database; Query, Query, Query; Train and Score; Combine Data. Production (Batch mode): Models. Production (Real-time service): Scoring Engine. Early Day: Java, R & C; Present Day: R & C/C++. Early Day: R; Present Day: R; Future: R & Python; Future: TBD.
  9. Prototype vs. Production
     • Prototype: turn an idea into software quickly.
       – Flexible: interactive mode, creative, experimental.
       – Incomplete software versions: proof of concept, run on a small sample dataset.
     • Production: run the software automatically.
       – Rigid: batch and real-time modes, repeatability, maintainability.
       – Complete software versions: error free, run on the full dataset.
  10. Production Deployment: Prototype (sample dataset) -> Development (full dataset) -> Staging (test site) -> Production (live site)
  11. Software Hierarchy: App, Framework, Infrastructure
      • Define a standard structure for apps.
      • Provide a generic app.
      • Build on top of a framework.
      • Deal with specific details and complexities of the application.
      • Basic services, communication, storage management, version control, etc.
      • Rterm, system(), .Call(), .Fortran(), library(), load(), save(), …
      • Rserve, SQL, Git
      • EconBot
      • ZPL
      • One Pagers, CaseShiller Forecast, …
      • Zhvi, Zhvi Forecast, Zri, Price/Rent ratio, Export MarketReport Data, …
      • Zestimate, ZestimateForecast, RentZestimate, Diagnostics, …
      • ZillowRserve
  12. MapReduce vs. ZPL
      • MapReduce: Input, Head Node, Worker Node 2, Map, Reduce, Output; Hadoop
      • ZPL: Input, Head Node, Worker Node 2, zplOnCompute(), zplOnUpdate(), Update Node, Output; RDBMS
  13. Data Partitioning and Parallel Computing: Input Database -> Head Node (TaskQueue of state partitions: AK, AL, AR, AZ, CA, CO, CT, …) -> Parallel R Worker Nodes (AL, AR, AZ) -> Update Node (Combine Data) -> Output Database
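The partition-then-combine pattern on this slide can be sketched in a few lines. This is a Python stand-in, not ZPL or Zillow's parallel R setup: the thread pool plays the role of the worker nodes, and the record fields (`id`, `state`, `sqft`) and the placeholder scoring rule are assumptions made up for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition_by_state(homes):
    """Group home records by state so each partition can be scored independently."""
    partitions = defaultdict(list)
    for home in homes:
        partitions[home["state"]].append(home)
    return dict(partitions)


def score_partition(item):
    """Hypothetical worker task: score every home in one state partition."""
    state, homes = item
    # Placeholder rule (area times a constant), not Zillow's model.
    return state, [{"id": h["id"], "value": h["sqft"] * 200.0} for h in homes]


def run_batch(homes, workers=4):
    """Head node: queue one task per state, then combine the worker outputs."""
    partitions = partition_by_state(homes)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in task order (states sorted alphabetically).
        results = list(pool.map(score_partition, sorted(partitions.items())))
    combined = []
    for _state, scored in results:
        combined.extend(scored)
    return combined
```

Because each state is scored independently, the thread pool could be swapped for a process pool or a cluster scheduler without changing the partitioning or combining steps.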
  14. Zillow RServe
      • Binary TCP/IP servers
        – Expose R functions for client apps to call
        – Auto load models generated by batch jobs
        – Auto cleanup
        – Redundancy: multiple ports and multiple boxes
      • Clients
        – C/C++, Python, R, Java, C#, etc.
        – SQL Server
        – Web server
        – Mobile devices
  15. Performance
      Machine: 2.5 GHz, 16 cores, 128 GB RAM
      Real-Time Zestimates: throughput per connection, 12/sec
      Real-Time RentZestimates: throughput per connection, 20/sec
      Zestimates: run in batch mode 3 times/week, 13 hours
      RentZestimates: run in batch mode weekly, 3 hours
      Historical Zestimates: 220 monthly data points, 5 days ~ 220*13/24/24 (24 boxes)
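The "5 days" figure for the historical run follows from the other numbers on the slide. A minimal sketch of that arithmetic, assuming each monthly data point costs one 13-hour batch run and that the work divides evenly across the 24 boxes:

```python
def historical_run_days(months=220, hours_per_run=13, boxes=24):
    """Wall-clock days to recompute every monthly data point across all boxes."""
    total_hours = months * hours_per_run   # 2,860 machine-hours in total
    hours_per_box = total_hours / boxes    # about 119 hours on each box
    return hours_per_box / 24              # convert hours to days


print(f"{historical_run_days():.2f} days")
```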
  16. A SNEAK PEEK: Rent Zestimate®
  17. Rent Zestimate SCORING ENGINE. Yes; No; Dynamic Filter; Train County Model: Model (B); Train State Model: Model (C); Reconcile Edited Facts; Score County Model; Model (B); Model (C); Good?; Score State Model; Models. Tables: PropertyDimensionEditedFacts, PropertyDimension, PropertyUserDimension, PropertyTaxAssessment, RegionDimension, <ZestimateDate>_ForRentPosting, <ZestimateDate>_RZest, PropertyDimensionImputedFacts, <ZestimateDate>_ZestSmooth. Steps: Query, Query, pre-process, SQL Server, Query, Query, Impute, Scoring Data, Training Data, Batch job.
  18. Measuring Accuracy
      • Hold out 30% of the data.
      • If rz are the Rent Zestimates for homes in the hold-out dataset and r are the actual rental listing prices, then the percent errors are e = 100*(rz - r)/r.
      • Two key metrics:
        – median(abs(e))
        – percent of estimates within 10% of the rent price: 100*count(abs(e) < 10)/count(e)
      • Metrics are grouped by county, metro, state, and nation.
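The two metrics above translate directly into code. This is a minimal sketch using only the standard library; the function names and the sample numbers are made up for illustration, and the strict `<` mirrors the slide's `abs(e) < 10`.

```python
from statistics import median


def percent_errors(estimates, actuals):
    """e = 100 * (rz - r) / r for each hold-out home."""
    return [100.0 * (rz - r) / r for rz, r in zip(estimates, actuals)]


def accuracy_metrics(estimates, actuals, tolerance=10.0):
    """Median absolute percent error and percent of estimates within the tolerance."""
    abs_e = [abs(e) for e in percent_errors(estimates, actuals)]
    within = sum(e < tolerance for e in abs_e)
    return {
        "median_abs_pct_error": median(abs_e),
        "pct_within_tolerance": 100.0 * within / len(abs_e),
    }


# Illustrative values only, not Zillow data.
metrics = accuracy_metrics([1050, 1800, 2400, 900], [1000, 2000, 2000, 1000])
```

Grouping by county, metro, or state would amount to calling `accuracy_metrics` once per group.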
  19. Rent Zestimate Accuracy: National
  20. Rent Zestimate Accuracy: National
  21. HOUSING MARKET: Zillow Rent Index (ZRI)
  22. Zillow Rent Index (ZRI): Methodology
      • Calculate Raw Median Rent Zestimates (ZRI raw)
      • Apply Smoothing Filter
      • Apply Seasonal Adjustment
      • Quality Control
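The first three methodology steps can be illustrated with a toy pipeline. The slides do not say which smoothing filter or seasonal adjustment Zillow uses, so the trailing moving average and the ratio-to-mean monthly factors below are assumptions chosen only to make the steps concrete; all function names are hypothetical.

```python
from statistics import mean, median


def raw_index(estimates_by_month):
    """{(year, month): [estimates]} -> {(year, month): median}, in date order."""
    return {ym: median(v) for ym, v in sorted(estimates_by_month.items())}


def smooth(series, window=3):
    """Trailing moving average; shorter windows are used at the start."""
    keys = list(series)
    out = {}
    for i, key in enumerate(keys):
        chunk = [series[k] for k in keys[max(0, i - window + 1): i + 1]]
        out[key] = mean(chunk)
    return out


def seasonally_adjust(series):
    """Divide each value by its calendar month's average ratio to the overall mean."""
    overall = mean(series.values())
    ratios = {}
    for (_year, month), value in series.items():
        ratios.setdefault(month, []).append(value / overall)
    factors = {month: mean(r) for month, r in ratios.items()}
    return {ym: value / factors[ym[1]] for ym, value in series.items()}
```

The same three functions would apply to the ZHVI with Zestimates in place of Rent Zestimates; the systematic error correction and quality control steps are not sketched here.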
  23. National
  24. HOUSING MARKET: Zillow Home Value Index (ZHVI)
  25. Zillow Home Value Index (ZHVI): Methodology
      • Calculate Raw Median Zestimates (ZHVI raw)
      • Apply Systematic Error Correction
      • Apply Smoothing Filter
      • Apply Seasonal Adjustment
      • Quality Control
  26. Zestimate Accuracy: National
  27. National
  28. The Good, The Bad And The Ugly: Ad-hoc; Prototype; Interactive mode; Batch mode; Real-time service; ZHVI, Forecast, ZRI, Price/Rent, Diagnostics, …
  29. Zillow, Python, R, and GIS. Mike Babb, GIS Analyst
  30. Overview
      • GIS@Zillow
        – Who we are, how we function within the larger organization
      • Technology Stack
        – How we do what we do
      • Several examples
        – Automated Waterfront Determination
        – Home Street Feature Discovery
  31. GIS@Zillow
      • Three-person team, somewhat like an in-house GIS consulting shop
        – Michalis Avraam, Ph.D.: Lead GIS Analyst
        – Mike Babb, Ph.C.: GIS Analyst
        – Andrew Smyth: GIS Analyst
      • What we do:
        – Automating the incorporation of spatially explicit data (spatial ETL).
        – Adjusting boundary geometry (cities, school districts, Zip Codes, etc.).
        – Conflating geospatial data from different vendors into a congruent product.
        – The discovery, creation, and formalization of spatial relationships into machine-comprehensible data for input into the Zestimation algorithm.
  32. How we do it
      • Most development done in Windows.
      • Highly available SQL Server DBs store current and historical property data.
      • 75% Python, 15% R, 5% SQL Server, 5% bash and shell.
      • But…
      • Proprietary Linux-only in-house database used for blazingly fast in-memory and HTTP lookup.
      • Crawl – walk – run.
  33. Tools and libraries
      • PYTHON LIBRARIES
        – Data Management, Analysis, and Storage: multiprocessing, Pandas, numpy, sqlite3
        – Spatial Analysis: ArcPy, gdal/ogr/osr, Rtree, shapely
      • R LIBRARIES
        – Data Management, Analysis, and Storage: data.table, doSNOW, RSQLite
        – Spatial Analysis: gpclib, maptools, rgdal, rgeos, sp
  34. AUTOMATED WATERFRONT DETERMINATION
  35. Automated Waterfront Determination
      • Motivation
        – Homes on the waterfront are valued differently than homes not on the waterfront.
        – Incorporating measures of waterfront access into our Zestimation algorithm helps increase the accuracy of our models.
      • Needs
        – Distinguish between proximity and access.
        – Identify properties that are near the waterfront but have intervening properties and intervening streets.
      • Tools
        – Most processing done using R and the following geospatial libraries: sp, maptools, rgdal, and rgeos. Native R objects and data.table objects are used for storage and data management.
        – Multiprocessing techniques were used where possible.
      • Technique
        – Identify parcels within 250 meters of the shore.
        – Use ray tracing to identify intervening features.
        – Visualize and check results using ArcMap.
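The technique above can be sketched with plain planar geometry. The original work was done in R with sp, maptools, rgdal, and rgeos; the Python below is only an illustration of the idea, assuming projected coordinates in meters, a single ray from a home point to one shore point, and polygons given as lists of (x, y) vertices. All function names are hypothetical.

```python
import math


def _orient(a, b, c):
    """Signed area test: > 0 if a, b, c turn counter-clockwise, < 0 if clockwise."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def segments_cross(p1, p2, q1, q2):
    """True when segments p1-p2 and q1-q2 properly intersect (touching is ignored)."""
    return (_orient(q1, q2, p1) * _orient(q1, q2, p2) < 0
            and _orient(p1, p2, q1) * _orient(p1, p2, q2) < 0)


def polygon_edges(polygon):
    """Consecutive vertex pairs, including the closing edge."""
    return list(zip(polygon, polygon[1:] + polygon[:1]))


def near_shore(home, shore_point, max_meters=250.0):
    """Proximity test: is the home within the search distance of the shore?"""
    return math.dist(home, shore_point) <= max_meters


def has_water_access(home, shore_point, other_parcels, streets):
    """Access test: the ray to the shore must not cross another parcel or a street."""
    for parcel in other_parcels:
        if any(segments_cross(home, shore_point, a, b) for a, b in polygon_edges(parcel)):
            return False
    return not any(segments_cross(home, shore_point, a, b) for a, b in streets)
```

Keeping `near_shore` and `has_water_access` separate mirrors the slide's distinction between proximity and access: a parcel can pass the first test and still fail the second.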
  36. Parcel situation
  37. Parcels within 250 meters
  38. Identify intervening parcels
  39. Identify intervening streets
  40. Final waterfront determination
  41. HOME STREET FEATURE DISCOVERY
  42. Home Street Feature Discovery
      • Motivation
        – Homes are on a street network.
        – What information about a home can we gather from a home’s relationship to its street?
      • Needs
        – The orientation of the home to the street.
        – The sequence of homes along a street.
        – Various other needs.
      • Tools
        – Proprietary database used to fuzzy-match a home to a street segment. Database accepts both same-machine in-memory lookup and HTTP requests.
        – Pandas for IO, batching, analysis, and storage.
        – ArcPy for prototyping.
        – Rtree, shapely, and gdal/ogr/osr for production.
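Two of the needs listed above, the sequence of homes along a street and which side of the street a home sits on, reduce to projecting each home onto its street segment. The sketch below assumes the home has already been matched to a straight segment with planar coordinates; it does not use the proprietary database or the spatial libraries named on the slide, and "side of street" is only one possible reading of orientation.

```python
def project_onto_segment(point, start, end):
    """Return (t, side) for a point relative to the directed segment start -> end.

    t is the fractional position along the segment, clamped to [0, 1];
    side is 'left' or 'right' of the direction of travel, or 'on' if collinear.
    """
    dx, dy = end[0] - start[0], end[1] - start[1]
    px, py = point[0] - start[0], point[1] - start[1]
    t = (px * dx + py * dy) / (dx * dx + dy * dy)
    cross = dx * py - dy * px
    side = "left" if cross > 0 else "right" if cross < 0 else "on"
    return max(0.0, min(1.0, t)), side


def sequence_along_street(homes, start, end):
    """homes: {home_id: (x, y)} -> [(home_id, side)] ordered from start to end."""
    placed = [(project_onto_segment(xy, start, end), home_id)
              for home_id, xy in homes.items()]
    placed.sort(key=lambda item: item[0][0])
    return [(home_id, side) for (_t, side), home_id in placed]
```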
  43. The Laurelhurst Neighborhood
  44. House orientation
  45. Sequence of houses along a street
  46. DATA SCIENCE AT ZILLOW, PART 3: PYTHON AND GRAPHLAB. Nick McClure, Senior Data Scientist, Zillow
  47. Why Data Modeling?
      • Find outliers
      • Find bad data
      • Database cleaning
      • Imputation of missing data
  48. The Role of Python in Data Science
      • Increase in computation!
      • New algorithms and complex old ones can be implemented now, easier than ever.
      • The demand for analytic talent is insatiable.
      • So a versatile, fast, and easy-to-learn language is invaluable.
      • That language must double as usable by developers and by analysts.
  49. Heavily Used Python Tools in Zillow Data Science
      • NumPy – Speeds up computations by pre-allocating sizes of objects.
      • Pandas – Creates the familiar ‘data frame’ object and relevant tools.
      • Scikit Learn – Easy to use machine learning tools.
      • Textmining – Allows analysis of unstructured text fields.
      • Pymssql/Pyodbc – Connections to SQL Server.
      • SQLite3 – Creation of local databases.
      • Graphlab Create – Strikingly fast machine learning, easy and scalable application creation by Dato (previously Graphlab).
  50. Dato maintains ‘Graphlab Create’, an open source Python package that allows very easy and scalable applications to be built in simple code. (Functions should have intelligent defaults!)
  51. Dato Example: Finding Quantiles of MoM Change in ALL Zestimates
  52. Dato in action!
  53. Dato: MoM Quantile Results!
  54. Dato and Zestimate MoM Analysis Takeaways
      • Dato’s Graphlab Create tool is immensely powerful and easy to use.
      • Integration into AWS is as easy as setting up an environment and writing a function.
      • We now have a tool that can slice and dice the Zestimate and look at all of our data by any number of factors.
      • Fast in comparison to alternatives (~1-3 hours total).
      • Currently setting up this tool to catch problematic Zestimates every month.
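The slides that showed the Graphlab Create code did not survive as text, so the sketch below is not the presenters' code and does not use Graphlab Create. It only illustrates the analysis itself with the Python standard library (3.8+): compute each home's month-over-month (MoM) percent change, summarize the distribution with quantiles, and flag homes outside chosen bounds. The function names, thresholds, and sample values are assumptions.

```python
from statistics import quantiles


def mom_changes(previous, current):
    """Percent month-over-month change for homes present in both months."""
    shared = previous.keys() & current.keys()
    return {hid: 100.0 * (current[hid] - previous[hid]) / previous[hid]
            for hid in shared}


def mom_quantiles(changes, n=4):
    """The n - 1 cut points that split the changes into n equal-sized groups."""
    return quantiles(sorted(changes.values()), n=n, method="inclusive")


def flag_outliers(changes, low, high):
    """Home ids whose change falls outside [low, high] percent."""
    return sorted(hid for hid, change in changes.items()
                  if change < low or change > high)


# Illustrative values only, not Zillow data.
previous = {1: 100.0, 2: 200.0, 3: 400.0, 4: 100.0, 5: 100.0}
current = {1: 101.0, 2: 202.0, 3: 404.0, 4: 150.0, 5: 100.0}
changes = mom_changes(previous, current)
suspicious = flag_outliers(changes, low=-5.0, high=5.0)
```

At the scale described in the talk (about 100 million homes), a distributed or out-of-core tool would replace these in-memory dictionaries.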
  55. Using Scikit Learn for Fraudulent Listing Detection. Examples shown: Make Me Move fraud; commercial listing.
  56. Finding Fraud: Methodology
      • Every property has lots of information: attributes (bds, bths, …), address, pricing data, transactional data, account information, and unstructured text descriptions.
      • We create features based on this information.
      • Train a gradient-boosted random forest with features on known fraudulent and non-fraudulent listings.
      • A listing is scored as fraud when P(fraud) > 0.5, and as not fraud otherwise.
      • Scored data is fed back into the fraud model weekly for training.
  57. Finding Fraud: Results
      • Fraudulent Listing Model
      • The current iteration of fraud detection is 96.9%* accurate.
      • The null prediction benchmark is 96.1% accurate.
      * Note that high % accuracy when predicting rare events is expected.
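The footnote about rare events can be made concrete. If the null model always predicts "not fraud", a 96.1% null accuracy implies that roughly 3.9% of the labeled listings are fraudulent; that reading is an inference from the slide, not something it states. The sketch below builds a synthetic label set with that rate and shows why accuracy alone says little: the null model scores 96.1% while catching no fraud at all.

```python
def accuracy(predictions, labels):
    """Percent of predictions equal to the true label."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


def null_predictions(labels):
    """Always predict the most common label (the 'null' model)."""
    majority = max(set(labels), key=labels.count)
    return [majority] * len(labels)


def recall(predictions, labels, positive=1):
    """Percent of truly positive cases that the predictions flag."""
    caught = sum(p == positive and y == positive for p, y in zip(predictions, labels))
    return 100.0 * caught / sum(y == positive for y in labels)


# Synthetic labels: 39 fraudulent listings (1) out of 1,000.
labels = [1] * 39 + [0] * 961
baseline = null_predictions(labels)
print(accuracy(baseline, labels), recall(baseline, labels))
```

Metrics such as recall or precision on the fraud class give a clearer picture of how much the 96.9% model improves on the benchmark.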
  58. Record Linkage: Property Matching
      • Diagram labels: Zestimate; Data Cleaning; County Records; User Records; MLS-1; MLS-2; Matching
      • Example: 123 Main St. Bellevue, WA 89555 vs. 123 Main St. Seattle, WA 89555
  59. Property Matching: The Problem (Property 1 vs. Property 2)
  60. Property Matching Methodology
      • New Data: split by zip code (Zip code 1, Zip code 2, Zip code 3, …)
      • Superset of current properties in zip code 1
      • Matching Algorithm
        – knn algorithm (Dato)
        – text & numeric features
        – feature weights
        – distance metrics
      • Results
        – Outputs all new data with most probable matches
        – P(match) > constant
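The methodology above (block by zip code, then find the nearest neighbor under a weighted distance over text and numeric features) can be sketched without Dato's knn toolkit. Everything specific here is an assumption for illustration: the record fields, the use of `difflib` for address similarity, the weights, and a distance cutoff standing in for the slide's "P(match) > constant".

```python
from difflib import SequenceMatcher


def distance(a, b, weights):
    """Weighted mix of address dissimilarity and relative numeric differences."""
    text = 1.0 - SequenceMatcher(None, a["address"].lower(), b["address"].lower()).ratio()
    beds = abs(a["beds"] - b["beds"]) / max(a["beds"], b["beds"], 1)
    sqft = abs(a["sqft"] - b["sqft"]) / max(a["sqft"], b["sqft"], 1)
    return weights["address"] * text + weights["beds"] * beds + weights["sqft"] * sqft


def best_matches(new_records, known_records, weights, max_distance=0.2):
    """Block by zip code, then keep the nearest known record if it is close enough."""
    by_zip = {}
    for rec in known_records:
        by_zip.setdefault(rec["zip"], []).append(rec)
    matches = {}
    for new in new_records:
        candidates = by_zip.get(new["zip"], [])
        if not candidates:
            matches[new["id"]] = None  # nothing to compare against in this zip code
            continue
        nearest = min(candidates, key=lambda rec: distance(new, rec, weights))
        close = distance(new, nearest, weights) <= max_distance
        matches[new["id"]] = nearest["id"] if close else None
    return matches
```

Blocking means a new record is only compared with known properties in its own zip code, which is what keeps the number of comparisons manageable.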
  61. Property Matching: Results
      • Speed:
        – Unmatched LA County Records: ~150k
        – LA Database Records: ~2 Million
        – Naively, at 100k comparisons a second = ~33 days of computing time.
        – Graphlab’s shortcuts reduce time to 20-40 minutes!!! (8 core personal desktop)
      • Previous match rate was roughly 65-95%.
      • Test cases so far have resulted in new match rates of >97%.
      • What are we missing? New lots, construction, incomplete addresses.
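The naive estimate can be reproduced from the rounded counts on the slide. With exactly 150,000 new records and 2,000,000 known records the arithmetic gives about 35 days, the same order as the slide's ~33 days (which presumably used unrounded counts). The second function shows how comparing only within a zip code shrinks the workload; the slides do not describe what Graphlab's shortcuts actually are, so this is not a claim about them, and the per-zip counts in the usage line are made up.

```python
SECONDS_PER_DAY = 86_400


def naive_days(new_records=150_000, known_records=2_000_000,
               comparisons_per_second=100_000):
    """Days needed to compare every new record against every known record."""
    comparisons = new_records * known_records
    return comparisons / comparisons_per_second / SECONDS_PER_DAY


def blocked_comparisons(new_counts, known_counts):
    """Comparisons needed when records are only compared within the same zip code."""
    return sum(n * known_counts.get(zip_code, 0) for zip_code, n in new_counts.items())


print(f"{naive_days():.1f} days")
# Hypothetical per-zip counts, for illustration only.
print(blocked_comparisons({"a": 10, "b": 5}, {"a": 100, "b": 20, "c": 7}))
```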
  62. Current Openings at Zillow! (www.Zillow.com/jobs)
      • Software Development Engineer, Machine Learning
      • Senior Business Intelligence Developer
      • Manager, Business Analytics
      • Program Manager, Enterprise Data Warehouse
      • Data Analyst, Listing and Data Quality
      • Reporting Analyst
      • Data Quality Control Specialist
      • Senior Software Development Engineer
      • Software Development Engineer
      • Hardware/Datacenter Technician
      • Cloud Architect
      • Associate Software Test Engineer
      • Software Development Manager
      • Full Stack Software Dev. Engineer
      • Search Infrastructure Engineer
      • Senior Database Developer
      • SOC Engineer
      • UX Developer
      (Column headings on the slide: Jobs within Zillow Analytics; Other SDE/IT Jobs within Zillow)
  63. Fun Benefits of Working at Zillow
      • Great benefits (gym access, matched 401k, medical, dental, vision, …)
      • Free Fitbit!
      • Monthly zSpeakers.
        – Previous speakers: Arianna Huffington, Mexican Pres. Vicente Fox, Seahawks Defensive End Cliff Avril, Joel Spolsky, …
      • Free snacks, drinks, candy, …
      • Treadmill rooms, ping pong, shuffleboard, game room, …
      • Free Orca card.
      • Bi-annual Hackweek.
      • Quarterly Group Outings (Clam-bake, kayaking, ice skating, …)
      • Smart, fun, and helpful colleagues in a relaxed atmosphere!
  64. Thank You! Questions?
