Securing your real estate listing data is harder than ever. Why? Web scraping is cheap, and bots simply steal whatever content they have been programmed to fetch: real estate listing text, photos, and other data that should be available only to paid subscribers and legitimate consumers.
Learn how to avoid expensive litigation by protecting your content before the theft occurs. Review the latest research on how non-human traffic has evolved over the past few years, along with best practices for protecting both copyrighted and non-copyrightable content.
See the results of research conducted with more than 100 MLS executives and IDX vendors on the current state of anti-scraping efforts, and the related efforts and rules from NAR.
3. Agenda: Toward Better Security for Real Estate Data Online
Introductions and Background
Trends in Scraping Real Estate Websites
Overview of Study and Findings
Immediate Opportunities and Threats from Scraping
5. Market Leader in Bot Detection and Mitigation
● The only bot detection vendor included in Gartner's 2015 Online Fraud Detection Market Guide
● Key attack trend: "Fraudsters spreading their attacks over thousands of IP addresses"
● Key inclusion criterion: "Ability to detect online fraud as transactions occur in real time or near real time"
● Interesting to note: no WAF vendors appear in this report, as their detection model is primarily rules-based
6. What Is Web Scraping?
Web Scraping
Also known as screen scraping, web scraping is the act of copying large amounts of data from a website, either manually or with an automated program (a bot).
Legitimate Scraping
Scraping can sometimes be benevolent and entirely acceptable; for example, the search engine bots that index your website.
Malicious Scraping
The systematic theft of intellectual property accessible on a website, including pricing, content, images, and proprietary data.
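To make concrete how little an "automated program (bot)" requires, here is a minimal sketch of a listing scraper using only the Python standard library. The HTML snippet and the class="price" marker are hypothetical, purely for illustration; a real scrape would fetch the page over HTTP first.

```python
# A minimal sketch of a listing scraper. The HTML structure and the
# class="price" marker are hypothetical, not from any real site.
from html.parser import HTMLParser

class ListingScraper(HTMLParser):
    """Collects the text of every element tagged with class="price"."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())
            self.in_price = False

# In a real scrape the HTML would come from an HTTP request
# (e.g. urllib.request.urlopen); a static snippet is used here.
sample = '<div class="listing"><span class="price">$350,000</span></div>'
scraper = ListingScraper()
scraper.feed(sample)
print(scraper.prices)  # → ['$350,000']
```

Wrap that in a loop over listing URLs and you have a working scraper in under thirty lines, which is exactly why the barrier to entry is so low.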
7. Why Bots / Scraping Are a Problem in Real Estate
MLSs:
○ Obligation to protect copyright
○ Higher cost of reactive methods (beacons, legal action, etc.)
○ Duty to enforce NAR policy (VOWs, so far)
○ Missed revenue opportunities for licensing content
Brokers / Agents:
○ Provide a content license on listings for a specific purpose
○ Responsible for NAR policy (VOWs, so far)
○ Stale (scraped) data undermines trust in, and the reputation of, the brand
○ Higher costs: bots drive up the cost of online services
8. Why Bots / Scraping Are a Problem in Real Estate
Software Vendors / Publishers:
○ Resource utilization: more server and bandwidth costs
○ Poor website performance: latency, brownouts, etc.
○ Polluted marketing metrics: optimize for humans, not bots
○ Ad fraud: advertisers are not paying for non-human traffic
○ People resources: keep your team focused on revenue!
Bottom Line
Scrapers scrape because they are making money with your listings! And the real estate industry is left with...
→ Higher costs
→ Lost revenues
9. Industry Help? ...Way Behind on Bad Bots
Realtor.org offers free tools to track data, but the reactive approach is expensive:
○ The Checklist for Syndication makes many references to data scraping, mostly legal guidance
○ NoScrape: an aborted project, with no update since 2010(?)
The problem is not going away; there are even ads for scraping programs on Realtor.com!
The Realtor.com blog post on how to "deter scraping" relies on obsolete IP address blocking and expensive IP litigation:
"REALTOR.com® logging, tracking and monitoring patterns that indicate data is being stolen for these illegitimate purposes. Once an offender is identified, their IP address is blocked from accessing the site." (Oct 10, 2014)
10. Web Scraping: Cheap, Easy & DIY
Scraping-as-a-service sites proliferate; scraping is VERY accessible!
○ Search for "web data scraping" on elance.com, odesk.com, freelancer.com, etc.
○ Google search terms: "scraping real estate data" and "scrape MLS listings"
○ Services: Mozenda.com, 80legs.com, webharvey.com, scraping.pro, etc.
The problem is not going away.
11. Bottom Line on Scraping
Costs of Scraping MLS Data
○ Resource costs: 10% to 40% of server utilization and bandwidth
○ Customer care: what is the cost per consumer call, and how many calls per month?
○ Website performance: a brownout results in roughly 3 days of low traffic
○ Ad fraud: if 30% of ads are seen by bots, are advertisers paying?
○ Lead generation: $15 per mover, $30 per storage facility, and hundreds of dollars per listing going to third parties, not the broker and not the agent
→ Biggest losers: MLSs and brokers
Value of a Solution?
○ Antivirus costs $40 to $75 per member per year ($3 to $6 per month)
○ Anti-scraping protection should cost the same or less
12. Study Methodology
Invitations by email; web surveys conducted over several weeks. For now, two surveys:
○ MLS Executives: 100 MLS executives representing MLSs with over 600,000 subscribers. Chosen because they play a part in all scraping contexts (MLS, publishers, and IDX/VOW):
● Technology selection: selects and contracts for the MLS systems
● Data licensing: manages the data license agreements with the advertising portals
● Industry policy: collectively sets the IDX / VOW rules
○ IDX Vendors: 14 vendors representing 400,000 IDX & VOW websites; others would only speak informally. Chosen because they manage the largest set of scraping targets.
13. MLS Study – Key Results
99% say compliance with rules protecting against misuse of MLS data is important.
Implementing anti-scraping should be a priority for MLS vendors:
95% agree that IDX sites should be subject to rules specifically mandating scraping protections; this needs follow-up with NAR committees.
59% of respondents do NOT test VOW sites for anti-scraping compliance:
○ Most testing performed is not rigorous
○ Some rely on self-reporting
98% of respondents want a set of standardized tests to verify that VOW and syndication sites are protected.
14. IDX / VOW Study – Key Results
43% of IDX/VOW vendors were not aware of how pervasive the issue is.
62% rate compliance with MLS rules as the most important factor in having IDX/VOW vendors implement an anti-scraping solution.
Other drivers for adoption of anti-scraping protection:
○ Customer demand for anti-scraping protections
○ Cost of infrastructure use/abuse
○ Security concerns
○ System performance issues
15. IDX / VOW Study – Misaligned, Lacking Key Data
○ 50% of IDX vendor respondents believe 15-30% bot traffic is acceptable
○ 50% believe less than 1% bot traffic is acceptable (closer to the MLS view)
○ Most IDX/VOW vendors are using reactive detection tactics:
Log analysis: reactive and labor-intensive monitoring
IP-based methods: ineffective against sophisticated scrapers
Obsolete preventions: IP-based rate limiting and CAPTCHAs
→ They are likely underestimating bot traffic (missing bots) with these methods!
○ More than half cannot identify the cost of bots to their business. If you cannot measure it, you cannot manage it, and you certainly cannot budget for it.
○ While 100% rank NAR compliance as a priority, only 25% have budgeted for services that provide anti-scraping protection to comply with VOW rules.
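The reactive log analysis most vendors rely on boils down to counting requests per IP after the fact. A minimal sketch follows; the log format and flagging threshold are illustrative assumptions. It also shows the weakness: an IP-cycling bot that stays under the threshold per address is never flagged.

```python
# Sketch of reactive log analysis: count requests per IP, flag heavy
# hitters. Log lines and the threshold are illustrative assumptions.
from collections import Counter

log_lines = [
    "203.0.113.5 - - [10/Oct/2015] GET /listings/123",
    "203.0.113.5 - - [10/Oct/2015] GET /listings/124",
    "203.0.113.5 - - [10/Oct/2015] GET /listings/125",
    "198.51.100.7 - - [10/Oct/2015] GET /listings/123",
]

# First whitespace-separated field of a common-format log line is the IP.
hits = Counter(line.split()[0] for line in log_lines)

THRESHOLD = 3  # flag IPs at or above this request count
suspects = [ip for ip, n in hits.items() if n >= THRESHOLD]
print(suspects)  # → ['203.0.113.5']
```

Note that a bot cycling through addresses at fewer than 5 requests per IP, like the ones described later in this deck, stays below any sensible threshold and is invisible to this approach.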
16. It's an Arms Race... Modern Anti-Scraping Tool Requirements
In more detail, a modern tool must handle:
○ Scripts (e.g. cURL or Ruby) making requests at any rate
○ Selenium, a fully automated browser, making requests at any rate
○ Headless browsers, with or without PhantomJS (fully simulating a browser, including pre-rendering)
○ IP cycling: any bot technology making fewer than 5 requests per IP address, then changing IPs
○ Crawlers at any speed, even slow crawlers making 10 requests per minute or less
○ Anonymized proxies used to make requests with any technology, at any request rate
○ Spoofed bot user agents, e.g. a fake "googlebot" or "bingbot" user agent, IE running on Linux, etc.
○ Non-browser user agents: spoofed user agents for mobile browsers or mobile applications
○ Blocking traffic from data centers and hosting providers (why would consumers be using those IPs?)
○ Blocking bots on consumer ISPs while letting legitimate requests through
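To illustrate the spoofed user-agent item from the list above: a naive user-agent filter (the rule set here is hypothetical, not any vendor's) blocks honest scripts while a spoofed "googlebot" sails straight through.

```python
# Why user-agent filtering alone fails: any client can claim to be
# Googlebot. The blocklist below is a hypothetical rule set.
BLOCKED_AGENTS = ("curl", "python-requests", "scrapy")

def naive_filter(user_agent):
    """Return True if the request would be allowed through."""
    ua = user_agent.lower()
    return not any(bad in ua for bad in BLOCKED_AGENTS)

# An honest script announces itself and gets blocked...
print(naive_filter("python-requests/2.7"))  # → False
# ...while a scraper spoofing Googlebot passes without challenge.
print(naive_filter("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # → True
```

Telling a real Googlebot from a spoofed one requires more than string matching, for example verifying that the requesting IP reverse-resolves to a googlebot.com hostname, which is exactly the kind of layered check this slide is arguing for.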
17. The Facts on Scraping Real Estate Data
Highlights of Bot Sophistication in Real Estate
○ 7 of the top 10 sources of bots are consumer ISPs: (1) Comcast, (2) Time Warner Cable, (3) Verizon FiOS, (4) Charter, (5) Cox, (6) CenturyLink, and (7) AT&T U-verse
○ 50% to 75% of bot traffic on real estate sites comes from consumer ISPs
○ Most consumer ISPs had 1,500+ IPs with bot traffic
○ 18-45% automated browsers, mimicking humans
○ 14-25% in the bot database: fingerprinted, known bots
○ 16-42% slow crawlers, recycling IPs and user agents
18. Purpose-Built Solution, Not a Feature
Bot detection is a new category, NOT a feature:
○ NOT a content delivery network (CDN)
○ NOT a distributed denial-of-service (DDoS) protection solution
○ NOT a simple IP list or set of scripts
○ NOT a web application firewall (WAF)
A purpose-built bot detection solution is always updating and evolving.
19. Catch 99.9% of Malicious Bots with Distil
A typical WAF catches about 20% of malicious bots, relying mainly on IP blocking and user agent testing. Distil catches up to 99.9% by layering detection techniques: IP analysis, user agent testing, JavaScript tests, cookie tests, Selenium tests, browser rate limiting, and machine learning to catch automated browsers, PhantomJS, and IP cycling.
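One of the layers named above, the JavaScript test, can be sketched as follows. The challenge scheme here is illustrative only, not Distil's actual mechanism: the server embeds a nonce plus a small computation in the page, and only a client that actually executes JavaScript returns the matching answer (typically via a cookie). Bare scripts like cURL never run the computation and fail the check.

```python
# Sketch of a JavaScript-challenge check. The scheme is illustrative,
# not any vendor's real mechanism.
import hashlib
import secrets

def issue_challenge():
    """Return (nonce embedded in the page, answer the client JS must compute)."""
    nonce = secrets.token_hex(8)
    answer = hashlib.sha256(nonce.encode()).hexdigest()
    return nonce, answer

def verify(nonce, submitted):
    """Server-side check of the value the client's JavaScript sent back."""
    return submitted == hashlib.sha256(nonce.encode()).hexdigest()

nonce, answer = issue_challenge()
print(verify(nonce, answer))   # → True  (a real browser ran the JS)
print(verify(nonce, "no-js"))  # → False (a bare script never executed it)
```

This single layer defeats cURL-style scripts but not Selenium or PhantomJS, which do execute JavaScript; that is why the slide stresses layering it with Selenium tests, fingerprinting, and behavioral analysis.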
21. Control Over Your Bot Traffic
Monitor: inspect requests and record the traffic to Distil and/or your own server logs
Block: serve the client an unblock verification form
CAPTCHA: serve a hardened CAPTCHA to test the client for verification
Drop: present the client with an access-denied page
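The four response modes above could be wired up as a simple escalation policy. The bot-confidence score and the thresholds in this sketch are hypothetical, purely to show how configurable actions might map to detection certainty.

```python
# Sketch of an escalating response policy. The 0.0-1.0 bot-confidence
# score and the thresholds are hypothetical, for illustration only.
from enum import Enum

class Action(Enum):
    MONITOR = "log the request and serve content normally"
    CAPTCHA = "serve a hardened CAPTCHA challenge"
    BLOCK = "serve an unblock verification form"
    DROP = "return an access-denied page"

def choose_action(bot_score):
    """Map a bot-confidence score to a response, mildest first."""
    if bot_score < 0.3:
        return Action.MONITOR
    if bot_score < 0.6:
        return Action.CAPTCHA
    if bot_score < 0.9:
        return Action.BLOCK
    return Action.DROP

print(choose_action(0.95).name)  # → DROP
```

Starting every new deployment in Monitor mode, then tightening thresholds once false positives are understood, is the usual way to roll out a policy like this without blocking real customers.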
22. Flexible Deployment Options
Cloud
○ Deploys in hours
○ Blazing-fast Anycast DNS-based GeoIP routing; automatic content compression optimizes for faster delivery
○ 17 data centers automatically fail over when a primary location goes offline
○ Automatically increases infrastructure and bandwidth to accommodate spikes
Traffic flow: USER → DISTIL CLOUD CDN → LOAD BALANCER → WEB SERVER
23. Flexible Deployment Options
Physical or Virtual Appliance(s)
○ Install on virtualized or bare-metal appliance(s)
○ Deploys in days
○ High-availability configurations with failover monitoring
○ Heartbeat up to the Distil Cloud
Traffic flow: USER → INTERNET → LOAD BALANCER → WEB SERVER, with the Distil appliance inline
24. Selection Criteria for Anti-Scraping
A best-of-breed solution will include:
○ 99% accuracy: it cannot rely on IP addresses to identify bots, or on IP-based rate limiting
○ A dedicated service, NOT a button/feature/add-on
○ Layers of tactics: multiple detection tactics, with ongoing R&D
○ Easy to implement: deploys in days or weeks
○ Real-time detection and mitigation: be proactive to save time and money
○ Flexible, configurable options for actions to mitigate bots
○ Affordable cost per member, per site, or per MLS, with a flexible business model