ZESTIMATE + LAMBDA ARCHITECTURE
Steven Hoelscher, Machine Learning Engineer
How we produce low-latency, high-quality home estimates
Goals of the Zestimate
• Independent
• Transparent
• High Accuracy
• Low Bias
• Stable over time
• Respond quickly to data updates
• High coverage (about 100M homes)
www.zillow.com/zestimate
In early 2015, we shared the original architecture of the
Zestimate…
…but a lot has changed
Then (2015)
• Languages: R and Python
• Data Storage: on-prem RDBMSs
• Compute: on-prem hosts
• Framework: in-house parallelization library (ZPL)
• People: Data Analysts and Scientists
Now (2017)
• Languages: Python and R
• Data Storage: AWS Simple Storage Service (S3), Redis
• Compute: AWS Elastic MapReduce (EMR)
• Framework: Apache Spark
• People: Data Analysts, Scientists, and Engineers
So, what’s changed?
Lambda Architecture
• Introduced by Nathan Marz (Apache Storm) and highlighted in his book, Big Data (2015)
• An architecture for scalable, fault-tolerant, low-latency big data systems
Latency-Accuracy Tradeoff
[Figure: spectrum of approaches, from low-latency approximate answers to high-latency exact answers]
www.databricks.com/blog/2016/05/19/approximate-algorithms-in-apache-spark-hyperloglog-and-quantiles.html
>>> review_lengths.approxQuantile("lengths", quantiles, relative_error)
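The `approxQuantile` call above (from the Databricks post) trades a little accuracy for a lot of speed. The same tradeoff can be sketched in plain Python: computing a median over a random sample gives a slightly noisy answer far faster than scanning everything. The sample size and dataset here are illustrative.

```python
import random
import statistics

# Synthetic data standing in for a large column of values
rng = random.Random(0)
values = [rng.gauss(100, 15) for _ in range(100_000)]

def exact_median(vals):
    # Must sort every value: high latency, exact answer
    return statistics.median(vals)

def approx_median(vals, sample_size=1_000, seed=42):
    # Median of a random sample: low latency, small bounded error
    sample = random.Random(seed).sample(vals, sample_size)
    return statistics.median(sample)
```

For a Gaussian like this, a 1,000-point sample keeps the median within a point or two of the exact answer while touching 1% of the data.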
High-level Lambda Architecture
• We can process new data with only a batch layer, but for computationally expensive queries, the results will be out-of-date
• The speed layer compensates for this lack of timeliness by computing generally approximate views
Master Data Architecture
Lock down permissions to prevent data deletes and updates!
Data is immutable
Below, we see the evolution of a home over time:
• Constructed in 2010 with 2 bedrooms and 1 bath
• A full bath added five years later, increasing the square footage
• Finally, another bedroom is added as well as a half-bath

PropertyId  Bedrooms  Bathrooms  SquareFootage  UpdateDate
1           2.0       1.0        1450           2010-03-13
1           2.0       2.0        1500           2015-05-15
1           3.0       2.5        1800           2016-06-24
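With an append-only fact log like this, the familiar "current state" of a home is just a derived view: the latest record per property. A minimal sketch in plain Python, using the illustrative rows above:

```python
from datetime import date

# Append-only home-fact log: (property_id, beds, baths, sqft, update_date)
facts = [
    (1, 2.0, 1.0, 1450, date(2010, 3, 13)),
    (1, 2.0, 2.0, 1500, date(2015, 5, 15)),
    (1, 3.0, 2.5, 1800, date(2016, 6, 24)),
]

def latest_view(rows):
    """Derive the mutable 'current state' view from the immutable log."""
    latest = {}
    for row in sorted(rows, key=lambda r: r[4]):  # order by update date
        latest[row[0]] = row  # later facts shadow earlier ones; nothing is deleted
    return latest
```

Because deletes and updates are forbidden, a bug in the view logic is recoverable: just recompute the view from the log. That is the human-fault tolerance the master data slide asks for.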
Data is eternally true

PropertyId  Bathrooms  UpdateTime
1           2.0        2015-05-15
1           2.5        2016-06-24

PropertyId  SaleValue  SaleTime
1           450000     2015-08-19

In a mutable data view, the 2.0-bathroom value would have been overwritten; the 2015-08-19 transaction in our training data would then erroneously use the 2.5-bathroom upgrade from the future.
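Avoiding that leakage is a point-in-time (as-of) lookup: for a sale, take the latest fact recorded at or before the sale time. A minimal sketch with the values from the tables above:

```python
from datetime import date

# (update_time, bathrooms) facts and a sale time, per the tables above
bathroom_facts = [(date(2015, 5, 15), 2.0), (date(2016, 6, 24), 2.5)]
sale_time = date(2015, 8, 19)

def value_as_of(facts, when):
    """Latest fact recorded at or before `when`: never a value from the future."""
    eligible = [value for t, value in facts if t <= when]
    return eligible[-1] if eligible else None  # facts assumed sorted by time
```

The 2015 sale trains against 2.0 bathrooms, even though the home now has 2.5.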
Batch Layer Architecture
ETL
• Ingests master data
• Standardizes data across many sources
• Dedupes, cleanses and performs sanity checks on data
• Stores partitioned training and scoring sets in Parquet format
Train
• Large memory requirements (caching training sets for various models)
Score
• Scoring set partitioned in uniform chunks for parallelization
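Uniform partitioning of the scoring set is what lets workers score in parallel with balanced load. A toy sketch of that chunking in plain Python (the real pipeline partitions Parquet files on Spark):

```python
def uniform_chunks(property_ids, n_chunks):
    """Round-robin a scoring set into near-equal partitions for parallel scoring."""
    chunks = [[] for _ in range(n_chunks)]
    for i, pid in enumerate(property_ids):
        chunks[i % n_chunks].append(pid)  # partition sizes differ by at most one
    return chunks
```

Round-robin keeps chunk sizes within one of each other, so no worker becomes a straggler because of partition skew alone.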
Batch Layer Highlights
• The number one source of Zestimate error is the facts that flow into it: bedrooms, bathrooms, and square footage
• To combat data issues, we give homeowners the ability to update such facts and immediately see a change to their Zestimate
• Beyond that, we want to recalculate Zestimates when homes are listed on the market
Responding to data changes quickly
• Kinesis consumer is responsible for low-latency transformations to the data
• Much of the data cleansing in the batch layer relies on a longitudinal view of the data, so we cannot afford those computations in the speed layer
• It looks up pertinent property information in Redis and decides whether to update the Zestimate by calling the API
Speed Layer Architecture: Kinesis Consumer
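The consumer's decision loop might look like the following hypothetical sketch. `redis_lookup` and `call_zestimate_api` are illustrative stand-ins for the Redis client and the Zestimate API client, and the field names are invented for the example; the real consumer reads records from Kinesis.

```python
def handle_record(record, redis_lookup, call_zestimate_api):
    """Decide whether a streamed home-fact update warrants a fresh Zestimate."""
    prop = redis_lookup(record["property_id"])
    if prop is None:
        return False  # unknown property: leave it to the next batch run
    # Rescore only when a model-relevant fact actually changed
    changed = any(
        record.get(field) is not None and record[field] != prop.get(field)
        for field in ("bedrooms", "bathrooms", "sqft")
    )
    if changed:
        call_zestimate_api(record["property_id"], {**prop, **record})
    return changed
```

Comparing against the cached Redis view is what keeps this low-latency: no longitudinal scan, just a key lookup and a field diff.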
Speed Layer Architecture: Zestimate API
• Uses latest, pre-trained models from batch layer to avoid costly retraining
• All property information required for scoring is stored in Redis, reusing a majority of the exact calculations from the batch layer
• Relies on sharding of pre-trained region models due to individual model memory requirements
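The sharding bullet can be sketched minimally: route each region's pre-trained model to exactly one shard with a stable hash, and load it lazily on that shard. Names and the lazy-loading scheme here are illustrative, not Zillow's actual implementation.

```python
import zlib

def shard_for_region(region_id, n_shards):
    """Stable routing: each region's pre-trained model lives on exactly one shard."""
    return zlib.crc32(region_id.encode()) % n_shards

class ModelShard:
    """A shard keeps only its own regions' models in memory."""
    def __init__(self, shard_id, n_shards, load_model):
        self.shard_id, self.n_shards = shard_id, n_shards
        self.load_model = load_model  # stand-in for fetching a pre-trained model
        self.models = {}              # region_id -> model, loaded lazily

    def score(self, region_id, features):
        assert shard_for_region(region_id, self.n_shards) == self.shard_id
        if region_id not in self.models:
            self.models[region_id] = self.load_model(region_id)
        return self.models[region_id](features)
```

`zlib.crc32` is used rather than Python's built-in `hash` because it is stable across processes, which matters when many API hosts must agree on the routing.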
Remember: Eventual Accuracy
• The speed layer is not meant to be perfect; it's meant to be lightning fast. Your batch layer will correct mistakes, eventually.
• As a result, we can think of the speed layer view as ephemeral

PropertyId  LotSize
0           21
1           16
2           5
Toy Example: Square feet or Acres?
Imagine a GIS model for validating lot size by looking at a given property's parcel and its neighboring parcels. But what happens if that model is slow to compute?
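In the spirit of eventual accuracy, the speed layer could skip the slow GIS model and apply a cheap heuristic, letting the next batch run correct any mistakes. A toy sketch; the function name and the 1/100 threshold are invented for illustration:

```python
def looks_like_acres(lot_size, neighbor_sizes):
    """Cheap stand-in for the slow GIS validation: a lot-size value tiny
    relative to its neighbors was probably recorded in acres, not square feet.
    """
    if not neighbor_sizes:
        return False
    typical = sorted(neighbor_sizes)[len(neighbor_sizes) // 2]  # median-ish
    return lot_size < typical / 100  # illustrative threshold
```

A lot size of 21 among neighbors near 9,000 square feet is flagged as likely acres; the authoritative parcel-geometry check runs later in the batch layer.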
• We still rely on our on-prem SQL Server for serving Zestimates on Zillow.com
• Reconciliation of views requires knowing when the batch layer started: if a home fact comes in after the batch layer began, we serve the speed layer's calculation
Serving Layer Architecture
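The reconciliation rule above is a one-branch decision, sketched here with illustrative field names: if the home's facts changed after the batch run started, the batch view cannot reflect them yet, so serve the speed layer's number.

```python
from datetime import datetime

def serve_zestimate(views, batch_started_at):
    """Reconcile batch and speed views for one property.

    `views` holds illustrative fields: fact_updated_at, speed_zestimate
    (may be None), and batch_zestimate.
    """
    if (views["fact_updated_at"] > batch_started_at
            and views.get("speed_zestimate") is not None):
        return views["speed_zestimate"]  # batch view is stale for this home
    return views["batch_zestimate"]
```

Keying the decision on the batch start time, not its finish time, is the safe choice: any fact arriving mid-run may or may not have made it into the batch view.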
The Big Picture
(1) Data is immutable and human-fault tolerant
(2) Performs heavy-lifting cleaning and training
(3) Reduces latency and improves timeliness
(4) Reconciles views to ensure the better estimate is chosen
SO DID YOU FIX MY
ZESTIMATE?
Andrew Martin, Zestimate Research Manager
Accuracy Metrics for Real-Estate Valuation
• Median Absolute Percent Error (MAPE)
• Measures the "average" amount of error in prediction, in terms of percentage off the correct answer in either direction
• Measuring error in percentages is more natural for home prices since they are heteroscedastic
• Percent Error Within 5%, 10%, 20%
• Measure of how many predictions fell within +/-X% of the true value
MAPE = Median_i( |SalePrice_i - Zestimate_i| / SalePrice_i )

Within X% = (1 / |Sales|) * Σ_i 1[ |SalePrice_i - Zestimate_i| / SalePrice_i < X% ]
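Both metrics are a few lines of Python; the sale prices and predictions below are made-up numbers for illustration.

```python
from statistics import median

def mape(sales, zestimates):
    """Median absolute percent error."""
    return median(abs(s - z) / s for s, z in zip(sales, zestimates))

def within_pct(sales, zestimates, x):
    """Share of predictions within +/- x (a fraction, e.g. 0.10) of sale price."""
    hits = sum(abs(s - z) / s < x for s, z in zip(sales, zestimates))
    return hits / len(sales)

sales = [300_000, 500_000, 250_000, 400_000]
preds = [315_000, 490_000, 200_000, 404_000]
```

Using the median rather than the mean keeps MAPE robust to the occasional wildly mispriced home.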
Did you know we keep a public scorecard? www.zillow.com/zestimate/
Comparing Accuracy at 10,000 Feet
• Let's focus on King County, WA, since the new architecture has been live there since January 2017
• We compute accuracy by using the Zestimate at the end of the month prior to when a home was sold as our prediction
• i.e., if a home sold in Kent for $300,000 on April 10th, we'd use the Zestimate from March 31st
• We went back and recomputed Zestimates at month ends with the new architecture for all homes and months in 2016
• We compare architectures by looking at error on the same set of sales
Architecture  MAPE  Within 5%  Within 10%  Within 20%
2015 (Z5.4)   5.1%  49.0%      75.0%       92.5%
2017 (Z6)     4.5%  54.1%      81.0%       94.9%
Breaking Accuracy out by Price
[Chart: MAPE by price tier (0-8%) with sale counts (0-6000), comparing 2015 (Z5.4) and 2017 (Z6)]
Breaking Accuracy out by Home Type

Architecture  Home Type  MAPE  Within 5%  Within 10%  Within 20%
2015 (Z5.4)   SFR        5.1%  49.2%      74.8%       92.4%
2015 (Z5.4)   Condo      5.1%  49.5%      76.8%       93.7%
2017 (Z6)     SFR        4.5%  54.6%      81.1%       94.6%
2017 (Z6)     Condo      4.6%  53.4%      81.6%       96.0%
Think that you might have an idea for how to improve the Zestimate? We're all ears...
www.zillow.com/promo/zillow-prize
We are hiring!
• Data Scientist
• Machine Learning Engineer
• Data Scientist, Computer Vision and Deep Learning
• Software Development Engineer, Computer Vision
• Economist
• Data Analyst
www.zillow.com/jobs

More Related Content

What's hot

Neo4j Graph Use Cases, Bruno Ungermann, Neo4j
Neo4j Graph Use Cases, Bruno Ungermann, Neo4jNeo4j Graph Use Cases, Bruno Ungermann, Neo4j
Neo4j Graph Use Cases, Bruno Ungermann, Neo4j
Neo4j
 
Pitch Deck Teardown: Mint House's $35M Series B deck
Pitch Deck Teardown: Mint House's $35M Series B deckPitch Deck Teardown: Mint House's $35M Series B deck
Pitch Deck Teardown: Mint House's $35M Series B deck
HajeJanKamps
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
Knoldus Inc.
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture
Guido Schmutz
 
Dimensional modeling primer
Dimensional modeling primerDimensional modeling primer
Dimensional modeling primer
Terry Bunio
 
Continuous Data Ingestion pipeline for the Enterprise
Continuous Data Ingestion pipeline for the EnterpriseContinuous Data Ingestion pipeline for the Enterprise
Continuous Data Ingestion pipeline for the Enterprise
DataWorks Summit
 
predictive analytics
predictive analyticspredictive analytics
predictive analytics
Astha Jagetiya
 
Become a Data Analyst
Become a Data Analyst Become a Data Analyst
Become a Data Analyst
Aaron Lamphere
 
Data Quality & Data Governance
Data Quality & Data GovernanceData Quality & Data Governance
Data Quality & Data Governance
Tuba Yaman Him
 
How to Become a Data Analyst? | Data Analyst Skills | Data Analyst Training |...
How to Become a Data Analyst? | Data Analyst Skills | Data Analyst Training |...How to Become a Data Analyst? | Data Analyst Skills | Data Analyst Training |...
How to Become a Data Analyst? | Data Analyst Skills | Data Analyst Training |...
Edureka!
 
How Kafka Powers the World's Most Popular Vector Database System with Charles...
How Kafka Powers the World's Most Popular Vector Database System with Charles...How Kafka Powers the World's Most Popular Vector Database System with Charles...
How Kafka Powers the World's Most Popular Vector Database System with Charles...
HostedbyConfluent
 
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
DataWorks Summit
 
BIG DATA and USE CASES
BIG DATA and USE CASESBIG DATA and USE CASES
BIG DATA and USE CASES
Bhaskara Reddy Sannapureddy
 
The Value of the Modern Data Architecture with Apache Hadoop and Teradata
The Value of the Modern Data Architecture with Apache Hadoop and Teradata The Value of the Modern Data Architecture with Apache Hadoop and Teradata
The Value of the Modern Data Architecture with Apache Hadoop and Teradata
Hortonworks
 
Overview of Data Science at Zillow
Overview of Data Science at ZillowOverview of Data Science at Zillow
Overview of Data Science at Zillow
njstevens
 
Pitch Deck Teardown: Gable's $12M Series A deck
Pitch Deck Teardown: Gable's $12M Series A deckPitch Deck Teardown: Gable's $12M Series A deck
Pitch Deck Teardown: Gable's $12M Series A deck
HajeJanKamps
 
Fraud Detection with Amazon Machine Learning on AWS
Fraud Detection with Amazon Machine Learning on AWSFraud Detection with Amazon Machine Learning on AWS
Fraud Detection with Amazon Machine Learning on AWS
Amazon Web Services
 
Late Arrival Facts
Late Arrival FactsLate Arrival Facts
Late Arrival Facts
Punya Sloka Muduli
 
Storemates Pitch Deck
Storemates Pitch DeckStoremates Pitch Deck
Storemates Pitch Deck
Shaffique Prabatani
 

What's hot (20)

Neo4j Graph Use Cases, Bruno Ungermann, Neo4j
Neo4j Graph Use Cases, Bruno Ungermann, Neo4jNeo4j Graph Use Cases, Bruno Ungermann, Neo4j
Neo4j Graph Use Cases, Bruno Ungermann, Neo4j
 
Pitch Deck Teardown: Mint House's $35M Series B deck
Pitch Deck Teardown: Mint House's $35M Series B deckPitch Deck Teardown: Mint House's $35M Series B deck
Pitch Deck Teardown: Mint House's $35M Series B deck
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture
 
Dimensional modeling primer
Dimensional modeling primerDimensional modeling primer
Dimensional modeling primer
 
Continuous Data Ingestion pipeline for the Enterprise
Continuous Data Ingestion pipeline for the EnterpriseContinuous Data Ingestion pipeline for the Enterprise
Continuous Data Ingestion pipeline for the Enterprise
 
predictive analytics
predictive analyticspredictive analytics
predictive analytics
 
Become a Data Analyst
Become a Data Analyst Become a Data Analyst
Become a Data Analyst
 
Data Quality & Data Governance
Data Quality & Data GovernanceData Quality & Data Governance
Data Quality & Data Governance
 
How to Become a Data Analyst? | Data Analyst Skills | Data Analyst Training |...
How to Become a Data Analyst? | Data Analyst Skills | Data Analyst Training |...How to Become a Data Analyst? | Data Analyst Skills | Data Analyst Training |...
How to Become a Data Analyst? | Data Analyst Skills | Data Analyst Training |...
 
Applications of Big Data
Applications of Big DataApplications of Big Data
Applications of Big Data
 
How Kafka Powers the World's Most Popular Vector Database System with Charles...
How Kafka Powers the World's Most Popular Vector Database System with Charles...How Kafka Powers the World's Most Popular Vector Database System with Charles...
How Kafka Powers the World's Most Popular Vector Database System with Charles...
 
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
 
BIG DATA and USE CASES
BIG DATA and USE CASESBIG DATA and USE CASES
BIG DATA and USE CASES
 
The Value of the Modern Data Architecture with Apache Hadoop and Teradata
The Value of the Modern Data Architecture with Apache Hadoop and Teradata The Value of the Modern Data Architecture with Apache Hadoop and Teradata
The Value of the Modern Data Architecture with Apache Hadoop and Teradata
 
Overview of Data Science at Zillow
Overview of Data Science at ZillowOverview of Data Science at Zillow
Overview of Data Science at Zillow
 
Pitch Deck Teardown: Gable's $12M Series A deck
Pitch Deck Teardown: Gable's $12M Series A deckPitch Deck Teardown: Gable's $12M Series A deck
Pitch Deck Teardown: Gable's $12M Series A deck
 
Fraud Detection with Amazon Machine Learning on AWS
Fraud Detection with Amazon Machine Learning on AWSFraud Detection with Amazon Machine Learning on AWS
Fraud Detection with Amazon Machine Learning on AWS
 
Late Arrival Facts
Late Arrival FactsLate Arrival Facts
Late Arrival Facts
 
Storemates Pitch Deck
Storemates Pitch DeckStoremates Pitch Deck
Storemates Pitch Deck
 

Similar to Zestimate Lambda Architecture

Rsqrd AI: Zestimates and Zillow AI Platform
Rsqrd AI: Zestimates and Zillow AI PlatformRsqrd AI: Zestimates and Zillow AI Platform
Rsqrd AI: Zestimates and Zillow AI Platform
Sanjana Chowdhury
 
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical SolutionEnterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Dmitry Anoshin
 
AWS re:Invent 2016: Billions of Rows Transformed in Record Time Using Matilli...
AWS re:Invent 2016: Billions of Rows Transformed in Record Time Using Matilli...AWS re:Invent 2016: Billions of Rows Transformed in Record Time Using Matilli...
AWS re:Invent 2016: Billions of Rows Transformed in Record Time Using Matilli...
Amazon Web Services
 
ABD217_From Batch to Streaming
ABD217_From Batch to StreamingABD217_From Batch to Streaming
ABD217_From Batch to Streaming
Amazon Web Services
 
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
Amazon Web Services
 
Managing Large Amounts of Data with Salesforce
Managing Large Amounts of Data with SalesforceManaging Large Amounts of Data with Salesforce
Managing Large Amounts of Data with Salesforce
Sense Corp
 
Dev Lakhani, Data Scientist at Batch Insights "Real Time Big Data Applicatio...
Dev Lakhani, Data Scientist at Batch Insights  "Real Time Big Data Applicatio...Dev Lakhani, Data Scientist at Batch Insights  "Real Time Big Data Applicatio...
Dev Lakhani, Data Scientist at Batch Insights "Real Time Big Data Applicatio...
Dataconomy Media
 
An overview of modern scalable web development
An overview of modern scalable web developmentAn overview of modern scalable web development
An overview of modern scalable web development
Tung Nguyen
 
Migrate from Netezza to Amazon Redshift: Best Practices with Financial Engine...
Migrate from Netezza to Amazon Redshift: Best Practices with Financial Engine...Migrate from Netezza to Amazon Redshift: Best Practices with Financial Engine...
Migrate from Netezza to Amazon Redshift: Best Practices with Financial Engine...
Amazon Web Services
 
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander ZaitsevClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
Altinity Ltd
 
Streamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache PulsarStreamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache Pulsar
Streamlio
 
Amazon Redshift with Full 360 Inc.
Amazon Redshift with Full 360 Inc.Amazon Redshift with Full 360 Inc.
Amazon Redshift with Full 360 Inc.
Amazon Web Services
 
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Amazon Web Services
 
AWS Sydney Summit 2013 - Big Data Analytics
AWS Sydney Summit 2013 - Big Data AnalyticsAWS Sydney Summit 2013 - Big Data Analytics
AWS Sydney Summit 2013 - Big Data Analytics
Amazon Web Services
 
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
Amazon Web Services
 
AWS APAC Webinar Week - Real Time Data Processing with Kinesis
AWS APAC Webinar Week - Real Time Data Processing with KinesisAWS APAC Webinar Week - Real Time Data Processing with Kinesis
AWS APAC Webinar Week - Real Time Data Processing with Kinesis
Amazon Web Services
 
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
Amazon Web Services
 
Which Database is Right for My Workload?: Database Week San Francisco
Which Database is Right for My Workload?: Database Week San FranciscoWhich Database is Right for My Workload?: Database Week San Francisco
Which Database is Right for My Workload?: Database Week San Francisco
Amazon Web Services
 
Which Database is Right for My Workload: Database Week SF
Which Database is Right for My Workload: Database Week SFWhich Database is Right for My Workload: Database Week SF
Which Database is Right for My Workload: Database Week SF
Amazon Web Services
 
Which Database is Right for My Workload?
Which Database is Right for My Workload?Which Database is Right for My Workload?
Which Database is Right for My Workload?
Amazon Web Services
 

Similar to Zestimate Lambda Architecture (20)

Rsqrd AI: Zestimates and Zillow AI Platform
Rsqrd AI: Zestimates and Zillow AI PlatformRsqrd AI: Zestimates and Zillow AI Platform
Rsqrd AI: Zestimates and Zillow AI Platform
 
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical SolutionEnterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
 
AWS re:Invent 2016: Billions of Rows Transformed in Record Time Using Matilli...
AWS re:Invent 2016: Billions of Rows Transformed in Record Time Using Matilli...AWS re:Invent 2016: Billions of Rows Transformed in Record Time Using Matilli...
AWS re:Invent 2016: Billions of Rows Transformed in Record Time Using Matilli...
 
ABD217_From Batch to Streaming
ABD217_From Batch to StreamingABD217_From Batch to Streaming
ABD217_From Batch to Streaming
 
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
 
Managing Large Amounts of Data with Salesforce
Managing Large Amounts of Data with SalesforceManaging Large Amounts of Data with Salesforce
Managing Large Amounts of Data with Salesforce
 
Dev Lakhani, Data Scientist at Batch Insights "Real Time Big Data Applicatio...
Dev Lakhani, Data Scientist at Batch Insights  "Real Time Big Data Applicatio...Dev Lakhani, Data Scientist at Batch Insights  "Real Time Big Data Applicatio...
Dev Lakhani, Data Scientist at Batch Insights "Real Time Big Data Applicatio...
 
An overview of modern scalable web development
An overview of modern scalable web developmentAn overview of modern scalable web development
An overview of modern scalable web development
 
Migrate from Netezza to Amazon Redshift: Best Practices with Financial Engine...
Migrate from Netezza to Amazon Redshift: Best Practices with Financial Engine...Migrate from Netezza to Amazon Redshift: Best Practices with Financial Engine...
Migrate from Netezza to Amazon Redshift: Best Practices with Financial Engine...
 
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander ZaitsevClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
 
Streamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache PulsarStreamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache Pulsar
 
Amazon Redshift with Full 360 Inc.
Amazon Redshift with Full 360 Inc.Amazon Redshift with Full 360 Inc.
Amazon Redshift with Full 360 Inc.
 
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
 
AWS Sydney Summit 2013 - Big Data Analytics
AWS Sydney Summit 2013 - Big Data AnalyticsAWS Sydney Summit 2013 - Big Data Analytics
AWS Sydney Summit 2013 - Big Data Analytics
 
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
 
AWS APAC Webinar Week - Real Time Data Processing with Kinesis
AWS APAC Webinar Week - Real Time Data Processing with KinesisAWS APAC Webinar Week - Real Time Data Processing with Kinesis
AWS APAC Webinar Week - Real Time Data Processing with Kinesis
 
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
 
Which Database is Right for My Workload?: Database Week San Francisco
Which Database is Right for My Workload?: Database Week San FranciscoWhich Database is Right for My Workload?: Database Week San Francisco
Which Database is Right for My Workload?: Database Week San Francisco
 
Which Database is Right for My Workload: Database Week SF
Which Database is Right for My Workload: Database Week SFWhich Database is Right for My Workload: Database Week SF
Which Database is Right for My Workload: Database Week SF
 
Which Database is Right for My Workload?
Which Database is Right for My Workload?Which Database is Right for My Workload?
Which Database is Right for My Workload?
 

Recently uploaded

一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理
一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理
一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理
eutxy
 
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and GuidelinesMulti-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Sanjeev Rampal
 
Comptia N+ Standard Networking lesson guide
Comptia N+ Standard Networking lesson guideComptia N+ Standard Networking lesson guide
Comptia N+ Standard Networking lesson guide
GTProductions1
 
Latest trends in computer networking.pptx
Latest trends in computer networking.pptxLatest trends in computer networking.pptx
Latest trends in computer networking.pptx
JungkooksNonexistent
 
BASIC C++ lecture NOTE C++ lecture 3.pptx
BASIC C++ lecture NOTE C++ lecture 3.pptxBASIC C++ lecture NOTE C++ lecture 3.pptx
BASIC C++ lecture NOTE C++ lecture 3.pptx
natyesu
 
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdfJAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
Javier Lasa
 
一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理
一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理
一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理
keoku
 
The+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptxThe+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptx
laozhuseo02
 
How to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptxHow to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptx
Gal Baras
 
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC
 
History+of+E-commerce+Development+in+China-www.cfye-commerce.shop
History+of+E-commerce+Development+in+China-www.cfye-commerce.shopHistory+of+E-commerce+Development+in+China-www.cfye-commerce.shop
History+of+E-commerce+Development+in+China-www.cfye-commerce.shop
laozhuseo02
 
This 7-second Brain Wave Ritual Attracts Money To You.!
This 7-second Brain Wave Ritual Attracts Money To You.!This 7-second Brain Wave Ritual Attracts Money To You.!
This 7-second Brain Wave Ritual Attracts Money To You.!
nirahealhty
 
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
3ipehhoa
 
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
3ipehhoa
 
guildmasters guide to ravnica Dungeons & Dragons 5...
guildmasters guide to ravnica Dungeons & Dragons 5...guildmasters guide to ravnica Dungeons & Dragons 5...
guildmasters guide to ravnica Dungeons & Dragons 5...
Rogerio Filho
 
Internet-Security-Safeguarding-Your-Digital-World (1).pptx
Internet-Security-Safeguarding-Your-Digital-World (1).pptxInternet-Security-Safeguarding-Your-Digital-World (1).pptx
Internet-Security-Safeguarding-Your-Digital-World (1).pptx
VivekSinghShekhawat2
 
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptxBridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Brad Spiegel Macon GA
 
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
ufdana
 
1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...
JeyaPerumal1
 
test test test test testtest test testtest test testtest test testtest test ...
test test  test test testtest test testtest test testtest test testtest test ...test test  test test testtest test testtest test testtest test testtest test ...
test test test test testtest test testtest test testtest test testtest test ...
Arif0071
 

Recently uploaded (20)

一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理
一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理
一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理
 
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and GuidelinesMulti-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
 
Comptia N+ Standard Networking lesson guide
Comptia N+ Standard Networking lesson guideComptia N+ Standard Networking lesson guide
Comptia N+ Standard Networking lesson guide
 
Latest trends in computer networking.pptx
Latest trends in computer networking.pptxLatest trends in computer networking.pptx
Latest trends in computer networking.pptx
 
BASIC C++ lecture NOTE C++ lecture 3.pptx
BASIC C++ lecture NOTE C++ lecture 3.pptxBASIC C++ lecture NOTE C++ lecture 3.pptx
BASIC C++ lecture NOTE C++ lecture 3.pptx
 
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdfJAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
 
一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理
一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理
一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理
 
The+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptxThe+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptx
 
How to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptxHow to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptx
 
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024

Zestimate Lambda Architecture

  • 1. ZESTIMATE + LAMBDA ARCHITECTURE Steven Hoelscher, Machine Learning Engineer How we produce low-latency, high-quality home estimates
  • 2. Goals of the Zestimate • Independent • Transparent • High Accuracy • Low Bias • Stable over time • Respond quickly to data updates • High coverage (about 100M homes) www.zillow.com/zestimate
  • 3. In early 2015, we shared the original architecture of the Zestimate… …but a lot has changed
  • 4. Then (2015) • Languages: R and Python • Data Storage: on-prem RDBMSs • Compute: on-prem hosts • Framework: in-house parallelization library (ZPL) • People: Data Analysts and Scientists Now (2017) • Languages: Python and R • Data Storage: AWS Simple Storage Service (S3), Redis • Compute: AWS Elastic MapReduce (EMR) • Framework: Apache Spark • People: Data Analysts, Scientists, and Engineers So, what’s changed?
  • 5. Lambda Architecture • Introduced by Nathan Marz (Apache Storm) and highlighted in his book, Big Data (2015) • An architecture for scalable, fault-tolerant, low-latency big data systems [Diagram: the latency-accuracy tradeoff, spanning low latency/accuracy to high latency/accuracy]
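The latency-accuracy tradeoff behind Spark's approxQuantile can be illustrated in plain Python: a median estimated from a random sample costs far less than an exact median over the full dataset, at the price of a small, bounded error. This is an illustrative sketch, not code from the talk; the function names and sample size are arbitrary.

```python
import random
import statistics

def exact_median(values):
    # Exact median: must consider every value (O(n log n) via sorting).
    return statistics.median(values)

def approx_median(values, sample_size=1000, seed=0):
    # Approximate median from a random sample: much cheaper on large
    # datasets, but only accurate to within a sampling error.
    rng = random.Random(seed)
    sample = rng.sample(values, min(sample_size, len(values)))
    return statistics.median(sample)

data = list(range(1_000_000))
exact = exact_median(data)
approx = approx_median(data)
# The approximation lands near the true median, not exactly on it.
print(exact, approx, abs(approx - exact) / len(data))
```

Increasing `sample_size` tightens the error bound at the cost of more computation, which is exactly the dial the speed layer turns.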
  • 7. High-level Lambda Architecture • We can process new data with only a batch layer, but for computationally expensive queries, the results will be out-of-date • The speed layer compensates for this lack of timeliness by computing generally approximate views
  • 8. Master Data Architecture Lock down permissions to prevent data deletes and updates!
  • 9. PropertyId Bedrooms Bathrooms SquareFootage UpdateDate 1 2.0 1.0 1450 2010-03-13 1 2.0 2.0 1500 2015-05-15 1 3.0 2.5 1800 2016-06-24 Data is immutable Below, we see the evolution of a home over time: • Constructed in 2010 with 2 bedrooms and 1 bath • A full-bath added five years later, increasing the square footage • Finally, another bedroom is added as well as a half-bath
  • 10. Data is eternally true PropertyId Bathrooms UpdateTime 1 2.0 2015-05-15 1 2.5 2016-06-24 PropertyId SaleValue SaleTime 1 450000 2015-08-19 This bathroom value would have been overwritten in our mutable data view This transaction in our training data would erroneously use a bathroom upgrade from the future
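The point-in-time lookup described above (joining a sale to the facts that were true at sale time) can be sketched as an as-of query over an append-only fact log. The data below mirrors the slide's example; the function name is illustrative.

```python
from datetime import date

# Immutable fact log for the property in the slide: every update is
# appended with a timestamp, nothing is overwritten.
BATHROOM_FACTS = [
    {"property_id": 1, "bathrooms": 1.0, "update_date": date(2010, 3, 13)},
    {"property_id": 1, "bathrooms": 2.0, "update_date": date(2015, 5, 15)},
    {"property_id": 1, "bathrooms": 2.5, "update_date": date(2016, 6, 24)},
]

def bathrooms_as_of(facts, property_id, as_of):
    """Return the latest bathroom count known on `as_of` (an as-of join)."""
    known = [f for f in facts
             if f["property_id"] == property_id and f["update_date"] <= as_of]
    if not known:
        return None
    return max(known, key=lambda f: f["update_date"])["bathrooms"]

# The 2015-08-19 sale joins to 2.0 baths, not the future 2.5 upgrade.
print(bathrooms_as_of(BATHROOM_FACTS, 1, date(2015, 8, 19)))  # → 2.0
```

With a mutable table holding only the latest row, the same join would silently attach the 2016 half-bath to the 2015 transaction.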
  • 12. ETL • Ingests master data • Standardizes data across many sources • Dedupes, cleanses and performs sanity checks on data • Stores partitioned training and scoring sets in Parquet format Train • Large memory requirements (caching training sets for various models) Score • Scoring set partitioned in uniform chunks for parallelization Batch Layer Highlights
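The actual ETL layer runs on Apache Spark, but two of the cleansing steps it names (deduping to the latest record per property, and sanity-checking values like square footage) can be sketched in plain Python. The thresholds and field names here are hypothetical.

```python
def sanity_check(record, min_sqft=100, max_sqft=50_000):
    """Flag implausible square footage (e.g. a fat-fingered 500 vs 5000)."""
    sqft = record.get("square_footage")
    return sqft is not None and min_sqft <= sqft <= max_sqft

def dedupe_latest(records):
    """Keep only the most recent record per property id."""
    latest = {}
    for rec in records:
        key = rec["property_id"]
        if key not in latest or rec["update_date"] > latest[key]["update_date"]:
            latest[key] = rec
    return list(latest.values())

raw = [
    {"property_id": 1, "square_footage": 1500, "update_date": "2015-05-15"},
    {"property_id": 1, "square_footage": 1800, "update_date": "2016-06-24"},
    {"property_id": 2, "square_footage": 50,   "update_date": "2016-01-01"},
]
# Property 1's latest record survives; property 2 fails the sanity check.
clean = [r for r in dedupe_latest(raw) if sanity_check(r)]
print(clean)
```

In the real pipeline the equivalent of `clean` would then be written out as partitioned Parquet for the train and score steps.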
  • 13. • The number one source of Zestimate error is the facts that flow into it – about bedrooms, bathrooms, and square footage. • To combat data issues, we give homeowners the ability to update such facts and immediately see a change to their Zestimate • Beyond that, we want to recalculate Zestimates when homes are listed on the market Responding to data changes quickly
  • 14. • Kinesis consumer is responsible for low-latency transformations to the data. • Much of the data cleansing in the batch layer relies on a longitudinal view of the data, so we cannot afford these computations in the speed layer • It looks up pertinent property information in Redis and decides whether to update the Zestimate by calling the API Speed Layer Architecture: Kinesis Consumer
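The consumer's decision step might look roughly like the sketch below. This is not Zillow's implementation: `property_store` is a plain dict standing in for the Redis lookup, and `score_api` stands in for the Zestimate API call.

```python
def handle_fact_update(event, property_store, score_api):
    """Decide whether a streamed fact update warrants a rescore.

    `property_store` stands in for the Redis cache and `score_api`
    for the Zestimate scoring API; both names are illustrative.
    """
    cached = property_store.get(event["property_id"], {})
    changed = {k: v for k, v in event["facts"].items()
               if cached.get(k) != v}
    if not changed:
        return None  # nothing new; skip the costly rescore
    # Merge the update into the cached view and request a rescore.
    property_store[event["property_id"]] = {**cached, **changed}
    return score_api(event["property_id"],
                     property_store[event["property_id"]])

store = {1: {"bedrooms": 2.0, "bathrooms": 2.0}}
result = handle_fact_update(
    {"property_id": 1, "facts": {"bathrooms": 2.5}},
    store,
    score_api=lambda pid, facts: f"rescored {pid}",
)
print(result, store[1])
```

The key property is that a no-op update short-circuits before touching the scoring API, keeping the consumer's latency low.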
  • 15. Speed Layer Architecture: Zestimate API • Uses latest, pre-trained models from batch layer to avoid costly retraining • All property information required for scoring is stored in Redis, reusing a majority of the exact calculations from the batch layer • Relies on sharding of pre-trained region models due to individual model memory requirements
  • 16. Remember: Eventual Accuracy • The speed layer is not meant to be perfect; it's meant to be lightning fast. Your batch layer will correct mistakes, eventually. • As a result, we can think of the speed layer view as ephemeral Toy Example: Square Feet or Acres? Imagine a GIS model for validating lot size by looking at a given property's parcel and its neighboring parcels. But what happens if that model is slow to compute? [Table: PropertyId/LotSize: 0/21, 1/16, 2/5]
  • 17. • We still rely on our on-prem SQL Server for serving Zestimates on Zillow.com • Reconciliation of views requires knowing when the batch layer started: if a home fact comes in after the batch layer began, we serve the speed layer’s calculation Serving Layer Architecture
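The reconciliation rule on this slide reduces to a timestamp comparison. A minimal sketch, with illustrative names (the real serving layer works against the on-prem SQL Server, not a function like this):

```python
from datetime import datetime

def choose_view(batch_started_at, last_fact_update, batch_view, speed_view):
    """Pick which Zestimate to serve for a property.

    If a home fact arrived after the batch layer began, the batch
    result is stale for that home, so serve the speed layer's value.
    """
    if last_fact_update is not None and last_fact_update > batch_started_at:
        return speed_view
    return batch_view

batch_start = datetime(2017, 4, 1, 2, 0)
# A fact update landed after the batch run began, so the speed
# layer's estimate wins for this home.
print(choose_view(batch_start, datetime(2017, 4, 1, 9, 30),
                  298_000, 305_000))
```

Once the next batch run completes (having seen that fact), its view takes over again, which is the "eventual accuracy" guarantee from the previous slide.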
  • 18. The Big Picture (1) Data is immutable and human-fault tolerant (2) Performs heavy-lifting cleaning and training (3) Reduces latency and improves timeliness (4) Reconciles views to ensure better estimation is chosen
  • 19. SO DID YOU FIX MY ZESTIMATE? Andrew Martin, Zestimate Research Manager
  • 20. Accuracy Metrics for Real-Estate Valuation • Median Absolute Percent Error (MAPE) • Measures the "average" amount of error in prediction in terms of percentage off the correct answer in either direction • Measuring error in percentages is more natural for home prices since they are heteroscedastic • Percent Error Within 5%, 10%, 20% • Measure of how many predictions fell within +/-X% of the true value. MAPE = Median(|SalePrice − Zestimate| / SalePrice); Within X% = fraction of sales with |SalePrice_i − Zestimate_i| / SalePrice_i < X%
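Both metrics on this slide are a few lines of Python. This is a direct transcription of the slide's definitions, not Zillow's internal code; the sample prices are made up.

```python
import statistics

def mape(sale_prices, zestimates):
    """Median absolute percent error, as defined on the slide."""
    errors = [abs(s - z) / s for s, z in zip(sale_prices, zestimates)]
    return statistics.median(errors)

def within(sale_prices, zestimates, threshold):
    """Share of predictions within +/- threshold of the sale price."""
    errors = [abs(s - z) / s for s, z in zip(sale_prices, zestimates)]
    return sum(e < threshold for e in errors) / len(errors)

sales = [300_000, 450_000, 600_000]
preds = [285_000, 460_000, 540_000]
print(mape(sales, preds))          # median of [0.05, 0.0222..., 0.10]
print(within(sales, preds, 0.10))  # fraction of errors strictly below 10%
```

Using the median (rather than the mean) keeps a handful of wildly mispriced sales from dominating the headline number.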
  • 21. Did you know we keep a public scorecard? www.zillow.com/zestimate/
  • 22. Comparing Accuracy at 10,000FT • Let's focus on King County, WA since the new architecture has been live here since January 2017 • We compute accuracy by using the Zestimate at the end of the month prior to when a home was sold as our prediction • i.e. if a home sold in Kent for $300,000 on April 10th we'd use the Zestimate from March 31st • We went back and recomputed Zestimates at month ends with the new architecture for all homes and months in 2016 • We compare architectures by looking at error on the same set of sales Architecture MAPE Within 5% Within 10% Within 20% 2015 (Z5.4) 5.1% 49.0% 75.0% 92.5% 2017 (Z6) 4.5% 54.1% 81.0% 94.9%
  • 23. Breaking Accuracy out by Price [Chart: MAPE (0.0%–8.0%) by sale price, with sales counts (0–6,000), comparing 2015 (Z5.4) and 2017 (Z6)]
  • 24. Breaking Accuracy out by Home Type Architecture Home Type MAPE Within 5% Within 10% Within 20% 2015 (Z5.4) SFR 5.1% 49.2% 74.8% 92.4% Condo 5.1% 49.5% 76.8% 93.7% 2017 (Z6) SFR 4.5% 54.6% 81.1% 94.6% Condo 4.6% 53.4% 81.6% 96.0%
  • 25. Think that you might have an idea for how to improve the Zestimate? We’re all ears... + www.zillow.com/promo/zillow-prize
  • 26. We are hiring! • Data Scientist • Machine Learning Engineer • Data Scientist, Computer Vision and Deep Learning • Software Development Engineer, Computer Vision • Economist • Data Analyst www.zillow.com/jobs

Editor's Notes

  1. Hi everyone, thanks for joining me here at Zillow for today’s meet up. My name is Steven Hoelscher, and I’m a machine learning engineer on the data science and engineering team. I’ve been with Zillow for 2.5 years now and had the opportunity to work on the team responsible for building and rearchitecting a new Zestimate pipeline, largely inspired by Lambda Architecture. It’s my hope that you’ll walk away from this presentation with a better understanding of what lambda architecture means and will have seen an in-production example for actually realizing it.
  2. Without further ado, let’s start with the Zestimate itself and its goals. For those who aren’t familiar, the Zestimate is simply our estimated market value for individual homes nationwide. We strive to put a Zestimate on every rooftop, just as we see in this screenshot. Every day, the Zestimate team thinks about how we can improve our algorithm, and from a data science perspective, improvement is based on whether we achieve these goals. To talk about a few: obviously, we would like our Zestimates to have high accuracy; when a home sells, it’s our goal for the Zestimate to be that near sale price. The Zestimate, as an algorithm, should also be stable over time and not exhibit erratic behavior day-to-day. The Zestimate should also be able to respond quickly to data updates. Users can supply us with more accurate data to improve our estimates, and their Zestimate should immediately reflect fact updates. In a sense, these are the goals that our pipeline must support and we’re going to spend some more time talking about how to balance these goals in a big data system.
  3. In early 2015, right around the time I started at Zillow, a few of my colleagues presented on the Zestimate architecture…as it was then. But a lot has changed since that presentation, only just 2 years ago.
  4. At the core, the Zestimate in 2015 was largely written in R. Our team was comprised of R language experts and we even built an in-house R framework for parallelization a la MapReduce. We were a smaller team back then, mostly data scientists who also had a knack for engineering. We relied on collaboration with other teams, especially our database administrators, to interface with on-premises relational databases. Two years later, we’ve made a hiring push across all skill sets and invited engineers to join the fray. Python has become the new language of choice, thanks mostly to its long history of support in Apache Spark. We started leveraging more and more cloud-based services, such as Amazon’s Simple Storage Service for storing our data and Elastic MapReduce for compute. No longer are we bottlenecked by the size of a single machine. With all of these changes, we had the opportunity to start afresh and design a system that would handle large amounts of data in the cloud, that would rely on horizontal scaling, and most importantly would meet the goals of the Zestimate.
  5. Enter Lambda Architecture. The idea of Lambda Architecture was introduced by Nathan Marz, the creator of Apache Storm. I highly recommend the book he published in 2015 with the title *Big Data*. This book for the uninitiated provided the foundations for Lambda Architecture, with great case studies for understanding how to achieve this architecture. Simply put, Lambda Architecture is a generic data processing architecture that is horizontally scalable, fault-tolerant (in the face of both human and hardware failures), and capable of low-latency responses. Shortly, we’ll see what a high-level lambda architecture looks like. But before we dive into that, I want to talk about making a tradeoff between latency and accuracy. In some cases, we cannot expect to have low latency responses when dealing with enormous amounts of data. As such, we have to trade off some degree of accuracy to reduce our latency. This idea underpins Lambda Architecture.
  6. Let’s look at an example, highlighted by the Databricks team. Apache Spark implements an algorithm for calculating approximate percentiles of numerical data, with a function called approxQuantile. This algorithm requires a user to specify a target error bound and the result is guaranteed to be within this bound. This algorithm can be adjusted to trade accuracy against computation time and memory. In the example here, the Databricks team studies the length of the text in each Amazon review. On the x-axis, we have the targeted residual. As we would guess, the higher the residual, the less computationally expensive our calculation becomes, but the tradeoff is accuracy.
  7. Let’s start thinking about what this means for a big data processing system. We could start simple by building a batch system with low complexity. It reads directly from a master dataset, that contains all of the data so far. This batch layer, as it’s called, will virtually freeze the data at the time the job begins and start running computations. The problem is that once the batch layer finishes computing a query, the data is already out-of-date: new changes have come in and were not accounted for. This is the gap that the lambda architecture is trying to solve. We can rely on a speed layer that will compensate for the batch layer’s lack of timeliness. But the speed layer, generally speaking, cannot rely on the same algorithms that the batch layer did. In the example before, we would want our batch layer to calculate a correct and highly accurate quantile, but the speed layer should rely on approximation to be more nimble. In this way, at any given moment, we could have two different views: one view from the batch layer that is accurate but not so timely and one view from the speed layer that is less accurate but timely. Reconciling these two views, we can answer a query in a relatively accurate and timely fashion.
  8. At this point, we’re going to explore a few of the layers of the Lambda Architecture and see how we implement each layer for the Zestimate itself. To begin, we start with the data. As I mentioned before, most of our data in 2015 was only stored on premises in relational databases. Our first goal, then, was to move this data to the cloud and have new data-generating processes write directly to the cloud store. At Zillow, we use AWS S3 for our data lake / master dataset. It is optimized to handle a large, constantly growing set of data. In our case, we have a bucket specifically designated for raw data. In this design, we don’t want to actually modify or update the raw data, and I’ll talk about why we don’t want to do this in a second. As such, we set permissions on the bucket itself to prevent data deletes and updates. Any generic data-generating process is responsible for only appending new records to this object store, never deleting. Most data-generating processes are writing JSON data. We do mandate a schema contract between the producers and consumers of the data, to ensure data types are conformed to.
  9. Data is immutable. Let’s understand what this means by working through this example. We have a sample home and how it has evolved over time. In 2010, it was constructed with 2 bedrooms and 1 bathroom. Five years later, the homeowner added a full-bath, therefore increasing the square footage. This was done just a few months before selling the home in 2015. A new owner purchased the home, and nearly a year later, decided to add another bedroom and half-bath. With mutable data, this story is lost. One way of storing these attributes in a relational database would be to update records with the new attributes.
  10. Data is eternally true. Now let’s introduce the transaction that I referred to. It occurred before the number of bathrooms changed again. In our mutable data view, this transaction would have been tied with a bathroom upgrade from the future. Once we attach a timestamp to data, we ensure it is eternally true. It is eternally true that in 2015, this home had 2 bathrooms, but in 2016, a half bath was added. This story is extremely important for data scientists. And while this example may be trivial, you can imagine tying a sale value to a larger set of home facts that weren’t actually true at that point in time. Immutability of data allows us to retain this story. We’re no longer updating data, and as a benefit, we are less prone to human mistakes, especially when it comes to what all data scientists hold dear: the raw data itself.
  11. After migrating our data to AWS S3, we began work on the batch layer for the Zestimate pipeline. From a high level, the Zestimate batch layer has a few components: first, we need to make available the raw master dataset. Apache Spark allows us to read directly from S3, but some of our raw data sources suffer from the painful small-files problem in Hadoop. Simply put, big data systems expect to consume fewer large files rather than a lot of small files. Apache Spark suffers from this same problem. We rely heavily on vacuuming applications, such as Hadoop’s distcp, to aggregate data into larger files, by pulling from S3 and storing the aggregates on HDFS. From there, our jobs read directly from HDFS: we begin with an ETL layer, responsible for producing training and scoring sets for our various models. Then, training and scoring takes place for about 100M homes in the nation. Models, training and scoring sets, and performance metrics are all stored in a different bucket in S3, one for transformed data. This ensures that we’re distinguishing between the raw data (our master dataset) and the data derived from the raw data.
  12. The ETL layer is responsible for interfacing with the master dataset and transforming it in order to arrive at cleaner, standardized datasets that are consumable by our Zestimate models. We have a wide variety of data sources that we deal with and so need to pull appropriate features from each to build a rich feature set. We invest a lot of time into ensuring our data is clean. As we know, garbage in, garbage out, and this holds true for the Zestimate algorithm. One example we always talk about is the case of fat-fingers. You can imagine that typing 500 square feet instead of 5000 square feet could drastically change how we perceive that home’s value. This cleaning process, in addition to the partitioning required, can be very expensive computationally. This is one area where a speed layer would need to be more nimble, as it won’t be able to look at historical data to make inferences about the quality of new data. After the ETL step, we can begin training models. Training, in our case, requires large amounts of memory to support caching of training sets for various models. We train models on various geographies, making tradeoffs between data skew and volume of data available. Scoring is then done in parallel, using data partitioned in uniform chunks. At this point, we have a view created (the Zestimates for about 100M homes in the nation) as well as pre-trained models for the speed layer. But at this point, some of the facts that went into our model training and scoring could be out of date.
  13. The number one source of Zestimate error is the facts that flow into it, like bedroom counts, bathroom counts, and square footage. We provide homeowners with a means for proactively making adjustments to their Zestimate. They can update a bathroom count or square footage and immediately see a change in their Zestimate. Beyond that, we want to recalculate Zestimates when homes are listed on the market, because in these cases an off-the-market home is updated with all of the latest facts so that it is represented accurately on the market.
  14. In lambda architecture, we want our speed layer to read from the same generic data-generating processes that our batch layer does. Amazon Kinesis (firehose and streams) makes it easy to both write to S3 as well as have consumers read directly from the stream. At this stage, you have the choice of which consumer to use. Spark Streaming can be used directly to enable code sharing (specifically, code relying on the Spark API) between the batch layer and the speed layer, but if Spark-specific code sharing is not a requirement, Amazon’s Kinesis Client Library (which Spark Streaming relies on) is a good solution. In our case, we built our Kinesis Consumer with just the Kinesis Client Library, for three reasons: (1) simplicity, (2) lack of Spark processing, and (3) Elastic MapReduce would be more expensive than a small Elastic Compute Cloud (EC2) instance.
  15. Steven