Big Data at Scrapinghub
Shane Evans
About Shane
● 9 years web scraping
● Decades with Big Data
● Scrapy, Portia, Frontera,
Scrapy Cloud, etc.
● Co-founded Scrapinghub
We turn web content into useful data
Founded in 2010, largest 100% remote company based outside of the US
We’re 126 teammates in 41 countries
About Scrapinghub
Scrapinghub specializes in data extraction. Our platform is
used to scrape over 4 billion web pages a month.
We offer:
● Professional Services to handle the web scraping for you
● Off-the-shelf datasets so you can get data hassle free
● A cloud-based platform that makes scraping a breeze
Who Uses Web Scraping
Used by everyone from individuals to
multinational companies:
● Monitor your competitors’ prices by scraping
product information
● Detect fraudulent reviews and sentiment
changes by scraping product reviews
● Track online reputation by scraping social
media profiles
● Create apps that use public data
● Track SEO by scraping search engine results
“Getting information off the
Internet is like taking a drink
from a fire hydrant.”
– Mitchell Kapor
Scrapy
Scrapy is a web scraping framework that
gets the dirty work related to web crawling
out of your way.
Benefits
● No platform lock-in: Open Source
● Very popular (13k+ ★)
● Battle tested
● Highly extensible
● Great documentation
Introducing Portia
Portia is a Visual Scraping tool that lets you
get data without needing to write code.
Benefits
● No platform lock-in: Open Source
● JavaScript dynamic content generation
● Ideal for non-developers
● Extensible
● It’s as easy as annotating a page
How Portia Works
User provides seed URLs:
Follows links
● Users specify which links to follow (regexp, point-and-click)
● Automatically guesses: finds and follows pagination, infinite scroll, prioritizes content
● Knows when to stop
Extracts data
● Given a sample, extracts the same data from all similar pages
● Understands repetitive patterns
● Manages item schemas
Run standalone or on Scrapy Cloud
Portia UI
Large Scale Infrastructure
Meet Scrapy Cloud, our PaaS for web crawlers:
● Scalable: Crawlers run on our cloud infrastructure
● Crawlera add-on
● Control your spiders: Command line, API or web UI
● Machine learning integration: BigML, MonkeyLearn, among others
● No lock-in: use scrapyd, Scrapy or Portia to run spiders on your own
infrastructure
Data Growth
● Items, logs and requests are collected in real time
● Millions of web crawling jobs each month
● Now at 4 billion scraped pages a month and growing
● Thousands of separate active projects
● Browse data as the crawl is running
● Filter and download huge datasets
● Items can have arbitrary schemas
Data Dashboard
MongoDB - v1.0
MongoDB was a good fit to get a demo up and
running, but it’s a bad fit for our use at scale
● Cannot keep hot data in memory
● Lock contention
● Cannot order data without sorting; skip+limit
queries are slow
● Poor space efficiency
See https://blog.scrapinghub.com/2013/05/13/mongo-bad-for-scraped-data/
Storage Requirements - v2.0
● High write volume. Writes are micro-batched
● Much of the data is written in order and immutable (like logs)
● Items are semi-structured nested data
● Expect exponential growth
● Random access from dashboard users, keep summary stats
● Sequential reading important (downloading & analyzing)
● Store data on disk, many TB per node
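The micro-batched writes listed in the storage requirements can be sketched as a small buffering writer. This is only an illustration under assumptions: the class name MicroBatchWriter, the thresholds and the flush callback are hypothetical and not taken from the talk.

```python
import time
from typing import Callable, List


class MicroBatchWriter:
    """Buffer records and hand them to a sink in small batches (hypothetical helper)."""

    def __init__(self, sink: Callable[[List[bytes]], None],
                 max_records: int = 100, max_age_s: float = 1.0) -> None:
        self._sink = sink
        self._max_records = max_records
        self._max_age_s = max_age_s
        self._buffer: List[bytes] = []
        self._oldest = 0.0

    def write(self, record: bytes) -> None:
        if not self._buffer:
            self._oldest = time.monotonic()  # age is measured from the first buffered record
        self._buffer.append(record)
        too_many = len(self._buffer) >= self._max_records
        too_old = time.monotonic() - self._oldest >= self._max_age_s
        if too_many or too_old:  # age is only checked when a new record arrives
            self.flush_now()

    def flush_now(self) -> None:
        """Send whatever is buffered to the sink as one batch."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._sink(batch)


# Usage: collect batches in a list instead of writing to a real store.
batches: List[List[bytes]] = []
writer = MicroBatchWriter(batches.append, max_records=3, max_age_s=60.0)
for i in range(7):
    writer.write(f"item-{i}".encode())
writer.flush_now()
```

Batching by both count and age keeps the write rate to the store low without letting a slow crawl hold records back indefinitely.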
Bigtable looks good...
Google’s Bigtable provides a sparse,
distributed, persistent
multidimensional sorted map
Can express our requirements in what
Bigtable provides
Performance characteristics should
match our workload
Inspired several open source projects
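To see why a sorted map fits this workload, here is a minimal in-memory sketch of the idea: keys are kept in byte order, so reading one job's records is a seek plus a sequential scan rather than a skip+limit query. The SortedMap class and the "job/sequence" key layout are assumptions made for illustration; they are not Bigtable's or HBase's API.

```python
import bisect
from typing import Iterator, List, Tuple


class SortedMap:
    """In-memory stand-in for a sorted key/value store (illustration only)."""

    def __init__(self) -> None:
        self._keys: List[bytes] = []
        self._values: List[bytes] = []

    def put(self, key: bytes, value: bytes) -> None:
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value  # overwrite an existing key
        else:
            self._keys.insert(i, key)
            self._values.insert(i, value)

    def scan(self, start: bytes, stop: bytes) -> Iterator[Tuple[bytes, bytes]]:
        """Yield pairs with start <= key < stop, in key order."""
        i = bisect.bisect_left(self._keys, start)  # binary search to the first key
        while i < len(self._keys) and self._keys[i] < stop:
            yield self._keys[i], self._values[i]
            i += 1


store = SortedMap()
for job, seq in [("job2", 1), ("job1", 2), ("job1", 0), ("job1", 1)]:
    store.put(f"{job}/{seq:04d}".encode(), b"item")

# "/" is byte 0x2F and "0" is byte 0x30, so this range covers exactly the "job1/" prefix.
job1_keys = [k for k, _ in store.scan(b"job1/", b"job10")]
```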
Apache HBase
● Modelled after Google’s Bigtable
● Provides real time random read and write to billions of rows with
millions of columns
● Runs on Hadoop and uses HDFS
● Strictly consistent reads and writes
● Extensible via server side filters and coprocessors
● Java-based
HBase Architecture
HBase Key Selection
Key selection is critical
● Atomic operations are at the row level: we use fat columns, update counts on write
operations and delete whole rows at once
● Order is determined by the binary key: our offsets preserve order
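Because rows are ordered by the raw bytes of the key, numeric offsets only sort correctly if they are encoded at a fixed width in big-endian form. The sketch below shows that property with a hypothetical (job_id, offset) key layout; the actual key schema used at Scrapinghub is not given in the slides.

```python
import struct


def make_key(job_id: int, offset: int) -> bytes:
    """Pack (job_id, offset) as two fixed-width big-endian unsigned 64-bit integers."""
    return struct.pack(">QQ", job_id, offset)


def naive_key(job_id: int, offset: int) -> bytes:
    """Decimal text key, shown only to illustrate the ordering pitfall."""
    return f"{job_id}/{offset}".encode()


offsets = [1, 2, 10, 255, 256, 70000]

# Sorting the binary keys as bytes keeps the offsets in numeric order.
binary_sorted = sorted(make_key(7, n) for n in offsets)
decoded = [struct.unpack(">QQ", k)[1] for k in binary_sorted]

# Sorting the text keys as bytes does not: "7/10" sorts before "7/2".
text_sorted = sorted(naive_key(7, n) for n in offsets)
```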
HBase Values
● Msgpack is like JSON but fast and small
● Storing entire records as a value has low
overhead (vs. splitting records into multiple
key/values in HBase)
● HBase doesn’t handle very large values well, which
requires us to limit the size of single records
● We need arbitrarily nested data anyway, so we
need some custom binary encoding
● Write custom Filters to support simple queries
We store the entire item record as msgpack encoded data in a single value
HBase Deployment
● All access is via a single service that provides a restricted API
● Ensure no long running queries, deal with timeouts everywhere, ...
● Tune settings to work with a lot of data per node
● Set block size and compression for each Column Family
● Do not use block cache for large scans (Scan.setCacheBlocks) and
‘batch’ every time you touch fat columns
● Scripts to manage regions (balancing, merging, bulk delete)
● We host in Hetzner, on dedicated servers
● Data replicated to backup clusters, where we run analytics
HBase Lessons Learned
● It was a lot of work
○ API is low level (untyped bytes) - check out Apache Phoenix
○ Many parts -> longer learning curve and difficult to debug. Tools
are getting better
● Many of our early problems were addressed in later releases
○ reduced memory allocation & GC times
○ improved MTTR
○ online region merging
○ scanner heartbeat
Broad Crawls
Frontera allows us to build large scale web crawlers in Python:
● Scrapy support out of the box
● Distribute and scale custom web crawlers across servers
● Crawl Frontier Framework: large scale URL prioritization logic
● Aduana to prioritize URLs based on link analysis (PageRank, HITS)
Broad Crawls
Many uses of Frontera:
○ News analysis, Topical crawling
○ Plagiarism detection
○ Sentiment analysis (popularity, likeability)
○ Due diligence (profile/business data)
○ Lead generation (extracting contact information)
○ Track criminal activity & find lost persons (DARPA)
Frontera Motivation
Frontera started when we needed to identify frequently changing
hubs
We had to crawl about 1 billion pages per week
Frontera Architecture
Supports both local and distributed mode
● Scrapy for crawl spiders
● Kafka for message bus
● HBase for storage and frontier
maintenance
● twisted.internet for async primitives
● Snappy for compression
Frontera: Big and Small hosts
Ordering of URLs across hosts is important:
● Politeness: a single host crawled by one Scrapy process
● Each Scrapy process crawls multiple hosts
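One way to satisfy both politeness rules is to assign URLs to spider partitions by a stable hash of the host name, so every URL of a host lands on the same process while each process still serves many hosts. This is a sketch under assumptions: the function name partition_for and the choice of CRC32 are illustrative, not Frontera's actual partitioner.

```python
import zlib
from urllib.parse import urlsplit


def partition_for(url: str, num_partitions: int) -> int:
    """Map a URL to a partition using a stable hash of its host name."""
    host = (urlsplit(url).hostname or "").lower()
    # zlib.crc32 is stable across processes, unlike Python's salted built-in hash().
    return zlib.crc32(host.encode("utf-8")) % num_partitions


urls = [
    "http://example.com/a",
    "http://example.com/b?page=2",
    "https://EXAMPLE.com/c",
    "http://example.org/",
]
assignments = {u: partition_for(u, 8) for u in urls}
```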
Challenges we found at scale:
● Queue flooded with URLs from the same host
○ Result: underuse of spider resources
○ Fix: additional per-host (per-IP) queue and metering algorithm;
URLs from big hosts are cached in memory
● Found a few very large hosts (>20M docs)
○ Result: all queue partitions were flooded with these huge hosts
○ Fix: two MapReduce jobs: queue shuffling, and limiting every host
to 100 docs max
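The per-host queue and the per-host cap can be illustrated in memory with a few lines of Python. This is a simplified analogue, not the MapReduce jobs or the metering algorithm from the talk; the function name next_batch and its parameters are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, Iterable, List
from urllib.parse import urlsplit


def next_batch(urls: Iterable[str], batch_size: int, per_host_limit: int) -> List[str]:
    """Pick up to batch_size URLs, at most per_host_limit per host, interleaved by host."""
    queues: Dict[str, Deque[str]] = {}
    for url in urls:
        host = urlsplit(url).hostname or ""
        q = queues.setdefault(host, deque())
        if len(q) < per_host_limit:  # cap each host's share of this batch
            q.append(url)

    batch: List[str] = []
    while queues and len(batch) < batch_size:
        for host in list(queues):  # round-robin: one URL per host per pass
            if len(batch) >= batch_size:
                break
            batch.append(queues[host].popleft())
            if not queues[host]:
                del queues[host]
    return batch


pending = [f"http://big.example/{i}" for i in range(1000)]
pending += ["http://small-a.example/1", "http://small-b.example/1"]
batch = next_batch(pending, batch_size=6, per_host_limit=3)
```

With the cap in place the two small hosts are served in the first pass instead of waiting behind a thousand URLs from the big host.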
Frontera: tuning
Breadth-first strategy: huge amount of DNS requests
● Recursive DNS server on every spider node, upstream to Verizon & OpenDNS
● Scrapy patch for large thread pool for DNS resolving and timeout customization
Intensive network traffic from workers to services
● Throughput between workers and Kafka/HBase ~ 1Gbit/s
● Thrift compact protocol for HBase
● Message compression in Kafka with Snappy
Batching and caching to achieve performance
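For the DNS tuning above, the slides mention a Scrapy patch. Current Scrapy releases expose comparable knobs as ordinary settings, so a project today could probably get a similar effect with a settings.py fragment like the one below. Whether these settings match the original patch is an assumption, and the values are illustrative only.

```python
# settings.py fragment for a Scrapy project (values are illustrative, not from the talk)
REACTOR_THREADPOOL_MAXSIZE = 30  # Twisted thread pool size; DNS lookups run in these threads
DNS_TIMEOUT = 20                 # seconds to wait for a DNS answer
DNSCACHE_ENABLED = True          # keep resolved hosts in an in-memory cache
DNSCACHE_SIZE = 100000           # maximum number of cached host entries
```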
Duplicate Content
The web is full of duplicate content.
Duplicate Content negatively impacts:
● Storage
● Re-crawl performance
● Quality of data
Efficient algorithms for Near Duplicate Detection, like SimHash, are
applied to estimate similarity between web pages to avoid scraping
duplicated content.
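A minimal SimHash sketch follows, assuming unweighted lowercase word tokens and a 64-bit fingerprint built from BLAKE2b hashes. Production implementations usually weight tokens or use shingles, and the sample texts here are made up for illustration.

```python
import hashlib
import re
from typing import List

BITS = 64


def simhash(text: str) -> int:
    """Compute a 64-bit SimHash fingerprint from lowercase word tokens."""
    weights: List[int] = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(BITS) if weights[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


page_a = ("Scrapinghub specializes in data extraction and our platform is used "
          "to scrape over four billion web pages every month for customers "
          "ranging from individuals to multinational companies")
page_b = page_a.replace("billion", "million")  # near duplicate: one word changed
page_c = ("Begun in 1863 the cathedral was the first major work of the Victorian "
          "architect William Burges and it was consecrated in 1870 after several "
          "years of construction in Cork")

near = hamming(simhash(page_a), simhash(page_b))
far = hamming(simhash(page_a), simhash(page_c))
```

Pages whose fingerprints differ in only a few bits are treated as near duplicates; the threshold is a tuning choice that depends on the corpus.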
Near Duplicate Detection Uses
Compare prices of products scraped from different retailers by finding
near duplicates in a dataset:
Title | Store | Price
ThinkPad X220 Laptop Lenovo (i7 2.8GHz, 12.5 LED, 320 GB) | Acme Store | 599.89
Lenovo Thinkpad Notebook Model X220 (i7 2.8, 12.5’’, HDD 320) | XYZ Electronics | 559.95
Merge similar items to avoid duplicate entries:
Name | Summary | Location
Saint Fin Barre’s Cathedral | Begun in 1863, the cathedral was the first major work of the Victorian architect William Burges… | 51.8944, -8.48064
St. Finbarr’s Cathedral Cork | Designed by William Burges and consecrated in 1870, ... | 51.894401550293, -8.48064041137695
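As a simpler stand-in for SimHash on short strings such as product titles, token-set Jaccard similarity already shows that the two laptop listings above describe the same product. The tokenizer and any match threshold are assumptions for illustration; the talk does not say how titles were compared.

```python
import re
from typing import Set


def tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens, keeping decimals such as 2.8 together."""
    return set(re.findall(r"[a-z0-9]+(?:\.[0-9]+)?", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Size of the token intersection divided by the size of the token union."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


acme = "ThinkPad X220 Laptop Lenovo (i7 2.8GHz, 12.5 LED, 320 GB)"
xyz = "Lenovo Thinkpad Notebook Model X220 (i7 2.8, 12.5’’, HDD 320)"
other = "Saint Fin Barre’s Cathedral"

same_product = jaccard(acme, xyz)
different_item = jaccard(acme, other)
```

Word order and punctuation differ between the two listings, but half of the combined tokens are shared, while the unrelated record shares none.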
What we’re seeing...
● More data is available than ever
● Scrapinghub can provide web data in a usable format
● We’re combining multiple data sources and analyzing
● The technology to use big data is rapidly improving and
becoming more accessible
● Data Science is everywhere
Thank you!
Shane Evans
shane@scrapinghub.com
scrapinghub.com
Monitoring Java Application Security with JDK Tools and JFR Events
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 

Big data at scrapinghub

  • 1. Big Data at Scrapinghub Shane Evans
  • 2. About Shane ● 9 years web scraping ● Decades with Big Data ● Scrapy, Portia, Frontera, Scrapy Cloud, etc. ● Co-founded Scrapinghub
  • 3. We turn web content into useful data
  • 4. Founded in 2010, largest 100% remote company based outside of the US. We’re 126 teammates in 41 countries.
  • 5. About Scrapinghub Scrapinghub specializes in data extraction. Our platform is used to scrape over 4 billion web pages a month. We offer: ● Professional Services to handle the web scraping for you ● Off-the-shelf datasets so you can get data hassle free ● A cloud-based platform that makes scraping a breeze
  • 6. Who Uses Web Scraping Used by everyone from individuals to multinational companies: ● Monitor your competitors’ prices by scraping product information ● Detect fraudulent reviews and sentiment changes by scraping product reviews ● Track online reputation by scraping social media profiles ● Create apps that use public data ● Track SEO by scraping search engine results
  • 7. “Getting information off the Internet is like taking a drink from a fire hydrant.” – Mitchell Kapor
  • 8. Scrapy Scrapy is a web scraping framework that gets the dirty work related to web crawling out of your way. Benefits ● No platform lock-in: Open Source ● Very popular (13k+ ★) ● Battle tested ● Highly extensible ● Great documentation
  • 9. Introducing Portia Portia is a Visual Scraping tool that lets you get data without needing to write code. Benefits ● No platform lock-in: Open Source ● JavaScript dynamic content generation ● Ideal for non-developers ● Extensible ● It’s as easy as annotating a page
  • 10. How Portia Works User provides seed URLs: Follows links ● Users specify which links to follow (regexp, point-and-click) ● Automatically guesses: finds and follows pagination, infinite scroll, prioritizes content ● Knows when to stop Extracts data ● Given a sample, extracts the same data from all similar pages ● Understands repetitive patterns ● Manages item schemas Run standalone or on Scrapy Cloud
  • 12. Large Scale Infrastructure Meet Scrapy Cloud, our PaaS for web crawlers: ● Scalable: Crawlers run on our cloud infrastructure ● Crawlera add-on ● Control your spiders: Command line, API or web UI ● Machine learning integration: BigML, MonkeyLearn, among others ● No lock-in: scrapyd, Scrapy or Portia to run spiders on your own infrastructure
  • 13. Data Growth ● Items, logs and requests are collected in real time ● Millions of web crawling jobs each month ● Now at 4 billion a month and growing ● Thousands of separate active projects
  • 14. Data Dashboard ● Browse data as the crawl is running ● Filter and download huge datasets ● Items can have arbitrary schemas
  • 15. MongoDB - v1.0 MongoDB was a good fit to get a demo up and running, but it’s a bad fit for our use at scale ● Cannot keep hot data in memory ● Lock contention ● Cannot order data without sorting, skip+limit queries slow ● Poor space efficiency See https://blog.scrapinghub.com/2013/05/13/mongo-bad-for-scraped-data/
  • 16. Storage Requirements - v2.0 ● High write volume. Writes are micro-batched ● Much of the data is written in order and immutable (like logs) ● Items are semi-structured nested data ● Expect exponential growth ● Random access from dashboard users, keep summary stats ● Sequential reading important (downloading & analyzing) ● Store data on disk, many TB per node
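The "writes are micro-batched" requirement can be sketched as a small buffer that flushes when it reaches a size or age limit. This is only an illustration under stated assumptions: the class name `MicroBatcher`, its parameters, and the `flush` callback are hypothetical and are not taken from the slides or from any Scrapinghub code.

```python
import time


class MicroBatcher:
    """Buffer records and hand them to `flush` in small batches (hypothetical helper)."""

    def __init__(self, flush, max_items=1000, max_age=1.0, clock=time.monotonic):
        self.flush_fn = flush        # callable that receives a list of records
        self.max_items = max_items   # flush once this many records are buffered
        self.max_age = max_age       # ...or once the oldest record is this old (seconds)
        self.clock = clock
        self.buffer = []
        self.first_at = None

    def write(self, record):
        if not self.buffer:
            self.first_at = self.clock()
        self.buffer.append(record)
        too_many = len(self.buffer) >= self.max_items
        too_old = self.clock() - self.first_at >= self.max_age
        if too_many or too_old:
            self.flush()

    def flush(self):
        """Send whatever is buffered; a no-op when the buffer is empty."""
        if self.buffer:
            batch, self.buffer = self.buffer, []
            self.flush_fn(batch)


batches = []
batcher = MicroBatcher(batches.append, max_items=3, max_age=60)
for i in range(7):
    batcher.write(i)
batcher.flush()  # flush the final partial batch
print(batches)
```

Note that in this sketch the age limit is only checked when a new record arrives; a real writer would presumably also flush on a timer.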
  • 17. Bigtable looks good... Google’s Bigtable provides a sparse, distributed, persistent multidimensional sorted map. We can express our requirements in what Bigtable provides. Performance characteristics should match our workload. It inspired several open source projects.
  • 18. Apache HBase ● Modelled after Google’s Bigtable ● Provides real time random read and write to billions of rows with millions of columns ● Runs on Hadoop and uses HDFS ● Strictly consistent reads and writes ● Extensible via server side filters and coprocessors ● Java-based
  • 20. HBase Key Selection Key selection is critical ● Atomic operations are at the row level: we use fat columns, update counts on write operations and delete whole rows at once ● Order is determined by the binary key: our offsets preserve order
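The point that row order is determined by the binary key can be illustrated with a short sketch. The key layout below (project, spider, job and item offset packed as fixed-width big-endian integers) is an assumption made for illustration and is not confirmed by the slides; the functions `item_key` and `parse_key` are hypothetical. The idea shown is only that fixed-width big-endian encoding makes byte order agree with numeric order, while decimal strings do not.

```python
import struct

# Hypothetical key layout: four unsigned 32-bit big-endian fields.
_KEY = struct.Struct(">IIII")


def item_key(project: int, spider: int, job: int, offset: int) -> bytes:
    """Pack the fields so that comparing keys as bytes matches comparing the tuples."""
    return _KEY.pack(project, spider, job, offset)


def parse_key(key: bytes) -> tuple:
    """Inverse of item_key."""
    return _KEY.unpack(key)


offsets = [0, 2, 10, 9, 100]
binary_keys = sorted(item_key(1, 1, 1, n) for n in offsets)
string_keys = sorted(f"1/1/1/{n}".encode() for n in offsets)

print([parse_key(k)[3] for k in binary_keys])  # offsets come back in numeric order
print([k.decode() for k in string_keys])       # lexicographic order, not numeric
```

With keys built this way, a sequential scan over a key range returns items in the order they were written, which is what the "sequential reading" requirement needs.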
  • 21. HBase Values ● Msgpack is like JSON but fast and small ● Storing entire records as a value has low overhead (vs. splitting records into multiple key/values in HBase) ● Doesn’t handle very large values well, requires us to limit the size of single records ● We need arbitrarily nested data anyway, so we need some custom binary encoding ● Write custom Filters to support simple queries We store the entire item record as msgpack encoded data in a single value
  • 22. HBase Deployment ● All access is via a single service that provides a restricted API ● Ensure no long running queries, deal with timeouts everywhere, ... ● Tune settings to work with a lot of data per node ● Set block size and compression for each Column Family ● Do not use block cache for large scans (Scan.setCacheBlocks) and ‘batch’ every time you touch fat columns ● Scripts to manage regions (balancing, merging, bulk delete) ● We host in Hetzner, on dedicated servers ● Data replicated to backup clusters, where we run analytics
  • 23. HBase Lessons Learned ● It was a lot of work ○ API is low level (untyped bytes) - check out Apache Phoenix ○ Many parts -> longer learning curve and difficult to debug. Tools are getting better ● Many of our early problems were addressed in later releases ○ reduced memory allocation & GC times ○ improved MTTR ○ online region merging ○ scanner heartbeat
  • 25. Broad Crawls Frontera allows us to build large scale web crawlers in Python: ● Scrapy support out of the box ● Distribute and scale custom web crawlers across servers ● Crawl Frontier Framework: large scale URL prioritization logic ● Aduana to prioritize URLs based on link analysis (PageRank, HITS)
  • 26. Broad Crawls Many uses of Frontera: ○ News analysis, Topical crawling ○ Plagiarism detection ○ Sentiment analysis (popularity, likeability) ○ Due diligence (profile/business data) ○ Lead generation (extracting contact information) ○ Track criminal activity & find lost persons (DARPA)
  • 27. Frontera Motivation Frontera started when we needed to identify frequently changing hubs We had to crawl about 1 billion pages per week
  • 28. Frontera Architecture Supports both local and distributed mode ● Scrapy for crawl spiders ● Kafka for message bus ● HBase for storage and frontier maintenance ● Twisted.Internet for async primitives ● Snappy for compression
  • 29. Frontera: Big and Small hosts Ordering of URLs across hosts is important: ● Politeness: a single host crawled by one Scrapy process ● Each Scrapy process crawls multiple hosts Challenges we found at scale: ○ Queue flooded with URLs from the same host, leading to underuse of spider resources. Solution: additional per-host (per-IP) queue and metering algorithm; URLs from big hosts are cached in memory. ○ Found a few very huge hosts (>20M docs), and all queue partitions were flooded with huge hosts. Solution: two MapReduce jobs, one for queue shuffling and one to limit all hosts to 100 docs max.
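A minimal sketch of the per-host limiting idea, assuming a plain in-memory list of URLs stands in for a queue partition. The function name `cap_per_host` and the round-robin interleaving are assumptions for illustration; the fix described on the slide ran as MapReduce jobs over the real queue and may differ in detail.

```python
from collections import defaultdict
from itertools import zip_longest
from urllib.parse import urlsplit


def cap_per_host(urls, max_per_host=100):
    """Keep at most max_per_host URLs per host, then interleave hosts round-robin."""
    by_host = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname or ""
        if len(by_host[host]) < max_per_host:
            by_host[host].append(url)
    # Take one URL from each host in turn so that no single host
    # occupies a long consecutive run of the queue.
    interleaved = []
    for group in zip_longest(*by_host.values()):
        interleaved.extend(u for u in group if u is not None)
    return interleaved


urls = [f"http://big.example/page/{i}" for i in range(1000)]
urls += ["http://small.example/a", "http://small.example/b"]
batch = cap_per_host(urls, max_per_host=3)
print(batch)
```

Here the huge host contributes only three URLs to the batch, and the small host's URLs are no longer stuck behind a thousand URLs from the big one.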
  • 30. Frontera: tuning Breadth-first strategy: huge number of DNS requests ● Recursive DNS server on every spider node, upstream to Verizon & OpenDNS ● Scrapy patch for large thread pool for DNS resolving and timeout customization Intensive network traffic from workers to services ● Throughput between workers and Kafka/HBase ~ 1Gbit/s ● Thrift compact protocol for HBase ● Message compression in Kafka with Snappy Batching and caching to achieve performance
  • 31. Duplicate Content The web is full of duplicate content. Duplicate Content negatively impacts: ● Storage ● Re-crawl performance ● Quality of data Efficient algorithms for Near Duplicate Detection, like SimHash, are applied to estimate similarity between web pages to avoid scraping duplicated content.
  • 32. Near Duplicate Detection Uses
    Compare prices of products scraped from different retailers by finding near duplicates in a dataset:
      Title | Store | Price
      ThinkPad X220 Laptop Lenovo (i7 2.8GHz, 12.5 LED, 320 GB) | Acme Store | 599.89
      Lenovo Thinkpad Notebook Model X220 (i7 2.8, 12.5’’, HDD 320) | XYZ Electronics | 559.95
    Merge similar items to avoid duplicate entries:
      Name | Summary | Location
      Saint Fin Barre’s Cathedral | Begun in 1863, the cathedral was the first major work of the Victorian architect William Burges… | 51.8944, -8.48064
      St. Finbarr’s Cathedral Cork | Designed by William Burges and consecrated in 1870, ... | 51.894401550293, -8.48064041137695
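The SimHash technique named in the Duplicate Content slide can be sketched in a few lines. This is a simplified version under stated assumptions: tokens are lowercase words with equal weight, and BLAKE2b is used as the per-token hash. Neither choice comes from the slides, and production implementations typically use weighted features or shingles.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint from lowercase word tokens (equal weights)."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


a = simhash("ThinkPad X220 Laptop Lenovo (i7 2.8GHz, 12.5 LED, 320 GB)")
b = simhash("Lenovo Thinkpad Notebook Model X220 (i7 2.8, 12.5'', HDD 320)")
print(hamming_distance(a, b))
```

Texts that share most of their tokens tend to produce fingerprints with a small Hamming distance, so two records are treated as near duplicates when the distance falls below a threshold. The threshold is an application-level choice and has to be tuned on real data.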
  • 33. What we’re seeing.. ● More data is available than ever ● Scrapinghub can provide web data in a usable format ● We’re combining multiple data sources and analyzing ● The technology to use big data is rapidly improving and becoming more accessible ● Data Science is everywhere

Editor's Notes

  1. 8 years ago I started scraping in anger. I saw quite a few examples of what not to do, which is one reason I started to write a framework. That framework was later open sourced as Scrapy. I worked on a visual scraper that turned into Portia, and on the design for Frontera. If you’ve never heard of these, don’t worry, we’ll get to them in a while. Co-founded Scrapinghub with Pablo Hoffman. Work with lots of amazing spidermen and spiderwomen, so I’m around web scraping all the time.
  2. 3 billion pages a month: around 1200 pages per second
  3. Nice things about Scrapy: Async networking. Deals with retrying, redirection, duplicated requests, noscript traps, robots.txt, cookies, logins, throttling, JS (Splash), community plugins, Scrapy Cloud or scrapyd to deploy, tools that make Scrapy even better: Crawlera, Frontera, Splash.
  4. Nice things about Portia: open source, uses Splash to render JS code, addons, scraping for non-devs, speeds up the work for devs, JavaScript, data journalists can use it
  5. Clients are Java based; there is a Thrift gateway for non-Java clients. Multiple region servers (like data storage nodes). Each region holds a range of data and HBase maintains its start and end key internally. Once a region grows beyond a certain size, it is split in two. Many regions per region server. A directory of what regions are allocated where is kept in a META table, whose location is stored in ZooKeeper. Data is aggregated in memory (in the memstore) and written to the WAL. The memstore is periodically flushed. HFiles are merged together during compaction.
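The region directory described in the last note can be pictured as a sorted list of region start keys that is searched by row key. The sketch below is a toy model, not HBase's implementation: the class `RegionDirectory`, its method names, and the idea of assigning a server at split time are all assumptions for illustration.

```python
import bisect


class RegionDirectory:
    """Toy stand-in for the META table: sorted region start keys mapped to servers."""

    def __init__(self):
        # The first region starts at the empty key, so every row key falls in some region.
        self.start_keys = [b""]
        self.servers = ["rs1"]

    def locate(self, row_key: bytes):
        """Return (start_key, server) of the region whose key range contains row_key."""
        i = bisect.bisect_right(self.start_keys, row_key) - 1
        return self.start_keys[i], self.servers[i]

    def split(self, split_key: bytes, new_server: str):
        """Split the region containing split_key; the upper half starts at split_key."""
        if split_key in self.start_keys:
            raise ValueError("a region already starts at this key")
        i = bisect.bisect_right(self.start_keys, split_key)
        self.start_keys.insert(i, split_key)
        self.servers.insert(i, new_server)


directory = RegionDirectory()
print(directory.locate(b"row-500"))   # everything is in the single initial region
directory.split(b"row-300", "rs2")    # simplification: the new half gets its own server
print(directory.locate(b"row-100"))
print(directory.locate(b"row-500"))
```

Because regions are contiguous, sorted key ranges, a lookup is a binary search over start keys, and a split only inserts one new start key into the directory.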