White Paper | eCommerce Crawling & Scraping Infrastructure
How do you accurately
and cost-effectively
crawl hundreds of millions
of eCommerce retail
products?
We’ve built crawling
infrastructure that accurately
screen scrapes data from large
eCommerce websites for
customers like cdscience.com,
wishclouds.com, and
adchemy.com (acquired by
WalmartLabs). Through these
consulting projects we’ve
learned a lot and we’ve
designed a new technical
architecture with superior
scalability and a lower total cost
of ownership. In this white
paper, we provide an overview
of the problem and our solution.
The problem
It’s easy to build a basic one-off
screen scraper. In fact, many software
engineers build simple screen
scrapers early in their career because
a project they are working on needs to
fetch data from a third-party website
that doesn’t have an API. The problem
is one of scale.
When you crawl hundreds of millions of
product pages daily, you need
crawling infrastructure that’s highly
configurable and cost efficient. As an
example, most architectures don’t
accurately maintain the state of the
crawler, which tracks how many
pages have been crawled and how
many pages still need to be crawled.
You need to understand the state of
the crawler when it fails in order to
quickly debug it.
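To make the idea of crawler state concrete, here is a minimal sketch of a state tracker that can be saved and restored after a failure. The class name `CrawlState`, its fields, and the JSON layout are assumptions made for illustration; the white paper does not describe how state is actually stored.

```python
import json
from dataclasses import dataclass, field


@dataclass
class CrawlState:
    """Tracks which URLs are waiting, in progress, finished, or failed."""
    pending: list = field(default_factory=list)
    in_flight: set = field(default_factory=set)
    done: set = field(default_factory=set)
    failed: dict = field(default_factory=dict)

    def add(self, url: str) -> None:
        """Queue a URL unless the crawler already knows about it."""
        known = (url in self.pending or url in self.in_flight
                 or url in self.done or url in self.failed)
        if not known:
            self.pending.append(url)

    def next_url(self):
        """Hand out the next URL and remember that it is being worked on."""
        if not self.pending:
            return None
        url = self.pending.pop(0)
        self.in_flight.add(url)
        return url

    def mark_done(self, url: str) -> None:
        self.in_flight.discard(url)
        self.done.add(url)

    def mark_failed(self, url: str, reason: str) -> None:
        self.in_flight.discard(url)
        self.failed[url] = reason

    def summary(self) -> dict:
        """Answer 'how many pages are crawled and how many are left?'."""
        remaining = len(self.pending) + len(self.in_flight)
        return {"crawled": len(self.done), "remaining": remaining,
                "failed": len(self.failed)}

    def to_json(self) -> str:
        # In-flight URLs are saved as pending so a restart retries them.
        return json.dumps({
            "pending": self.pending + sorted(self.in_flight),
            "done": sorted(self.done),
            "failed": self.failed,
        })

    @classmethod
    def from_json(cls, text: str) -> "CrawlState":
        raw = json.loads(text)
        return cls(pending=raw["pending"], done=set(raw["done"]),
                   failed=raw["failed"])
```

Saving the output of `to_json()` after each batch would let a restarted crawler call `from_json()` and continue from the last checkpoint, while `summary()` gives the crawled and remaining counts needed when debugging a failure.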
The solution
The crawling architecture we’ve
defined solves the core problems
faced by crawlers that scrape hundreds of
millions of eCommerce products.
‣ Machine learning is used to
scrape standard product data
‣ Error handling and debugging
tools speed up the process of
writing new crawlers and
debugging existing ones
‣ Requests to each website are
throttled so the crawler never
overloads a site the way a
denial-of-service (DoS) attack would
‣ Settings are configurable per
eCommerce retailer to maximize
the requests per second, saving
you money on infrastructure costs
‣ The architecture scales
horizontally to accomplish jobs of
any size
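The per-retailer throttling described above could be sketched as follows. This is a simple minimum-interval limiter, not necessarily the mechanism used in the architecture; the class name, the host names, and the rates in `RETAILER_SETTINGS` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RetailerThrottle:
    """Spaces requests to one retailer at least 1 / max_rps seconds apart."""
    max_rps: float
    next_allowed: float = 0.0

    def reserve(self, now: float) -> float:
        """Reserve the next request slot and return the seconds to wait."""
        interval = 1.0 / self.max_rps
        start = max(now, self.next_allowed)
        self.next_allowed = start + interval
        return start - now


# Hypothetical per-retailer settings: requests per second each site tolerates.
RETAILER_SETTINGS = {
    "big-retailer.example": 50.0,
    "small-retailer.example": 2.0,
}

throttles = {host: RetailerThrottle(rps) for host, rps in RETAILER_SETTINGS.items()}
```

A worker would call `throttles[host].reserve(time.monotonic())` and sleep for the returned number of seconds before sending its request. Raising `max_rps` for a retailer that can handle more traffic shortens the crawl, which is where the infrastructure savings come from.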
Technical architecture
Our solution is composed of two
applications:
‣ A web crawling application that
scrapes product data
‣ An API application that serves the
products
Solution at a glance
You want to
‣ Crawl, scrape and index hundreds of
millions of products daily
We help you
‣ Architect and build scalable
crawling infrastructure that
improves your data quality and
reduces your infrastructure costs
The web crawling
application
Because requirements vary from
customer to customer and the
available data varies from retailer to
retailer, we’ve divided the crawling
application that populates the
database with product data from
eCommerce websites into eight
configurable, decoupled steps.
1) Crawl eCommerce
websites
We start with a list of eCommerce
URLs that you provide via API or
spreadsheet (if you only have a limited
number of websites you’d like us to
crawl). Then, we identify and store a
list of product URLs and product
metadata for each eCommerce website.
‣ A breadth-first search (BFS) or
depth-first search (DFS) crawl,
chosen based on the situation,
generates a complete list
of all URLs belonging to each
eCommerce website
‣ Jobs are run in parallel using
event loops
‣ Product URLs are identified,
stored along with the metadata in
the data store and placed into a
Queue for batch processing
‣ PhantomJS is a headless browser
we use to crawl some HTTPS
websites
‣ Our crawler framework provides
the ability to configure the
maximum concurrent requests to
a particular retailer so the crawl
never overloads the site like a
denial-of-service (DoS) attack
‣ Sites like amazon.com can
handle 50 or more requests
per second, but less
established sites like
fancy.com can only handle
1-5 requests per second
‣ Our exception management
system enables developers to
easily debug crawlers when they
fail
‣ Asynchronous
‣ Separate log file for errors
‣ Tagged with job-id, retailer-id
and URL
‣ Persisted to disk in the
database
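The breadth-first crawl in this step can be illustrated with a short sketch. The link fetcher `get_links` and the product-URL test `is_product_url` are passed in as hypothetical callables so that the example runs without network access; a real crawler would fetch each page and parse its anchors.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def crawl_site(start_url, get_links, is_product_url, max_pages=10_000):
    """Breadth-first walk of one site; returns product URLs in visit order."""
    site = urlparse(start_url).netloc
    seen = {start_url}
    frontier = deque([start_url])
    product_urls = []
    visited = 0
    while frontier and visited < max_pages:
        url = frontier.popleft()  # frontier.pop() here would give a DFS crawl
        visited += 1
        if is_product_url(url):
            product_urls.append(url)
        for href in get_links(url):
            link = urljoin(url, href)  # resolve relative links
            if urlparse(link).netloc != site or link in seen:
                continue  # skip other sites and URLs already queued
            seen.add(link)
            frontier.append(link)
    return product_urls
```

The `seen` set is what keeps the crawl from looping on sites whose pages link back to each other, and swapping `popleft()` for `pop()` is the only change needed to switch from BFS to DFS.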
2) Fetch product
pages
In this phase, the web crawling
application grabs the next product
URL from the Queue and fetches the
HTML for each product page with an
HTTP request.
‣ We use workers to fetch the
HTML for each product page
‣ The HTML is compressed and
stored to disk
‣ We use Amazon S3 as the
data store for the HTML
pages of each retailer
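As a rough illustration of the compress-and-store step, the sketch below gzips a page and derives an object key for it. The key layout (`retailer_id/sha256-of-URL.html.gz`) and the function names are assumptions; the upload to Amazon S3 itself is left out.

```python
import gzip
import hashlib


def storage_key(retailer_id: str, url: str) -> str:
    """Build a stable object key from the retailer and a hash of the URL."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return f"{retailer_id}/{digest}.html.gz"


def pack_html(html: str) -> bytes:
    """Compress a fetched page before it is written to storage."""
    return gzip.compress(html.encode("utf-8"))


def unpack_html(blob: bytes) -> str:
    """Restore the original HTML for the mining step."""
    return gzip.decompress(blob).decode("utf-8")
```

Hashing the URL keeps keys short and free of characters that are awkward in object names, and grouping keys by retailer makes it easy to list or delete everything fetched from one site.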
3) Fetch images
A separate process parses the HTML
pages and fetches each product’s
images, which are processed and
stored in Amazon S3. A configurable
job can batch process all the product
images to normalize image type and
size. By normalizing all the product
images, users in developing countries
with slower average internet speeds
will experience faster page loads.
4) Business logic
(mining)
A separate worker / daemon fetches
the HTML pages from the data store
and extracts the required fields. We
utilize a standardized data model like
STEP to map the data extracted from
each HTML product page to the
database. Once extracted, the data is
stored in a data store.
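The mapping from a stored HTML page to a fixed set of fields might look like the sketch below. It reads Open Graph style `<meta>` tags using Python's standard `html.parser`; the `FIELD_MAP` names are hypothetical, and this is not the STEP model or the machine-learning extraction mentioned in this paper, only a simple stand-in.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects <meta property=... content=...> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("property") or attributes.get("name")
        content = attributes.get("content")
        if key and content is not None:
            self.meta.setdefault(key, content)  # keep the first occurrence


# Hypothetical mapping from our field names to the meta tags that carry them.
FIELD_MAP = {
    "title": "og:title",
    "image_url": "og:image",
    "price": "product:price:amount",
    "currency": "product:price:currency",
}


def extract_product(html: str) -> dict:
    """Return one record per page; missing fields are None."""
    collector = MetaCollector()
    collector.feed(html)
    return {name: collector.meta.get(source) for name, source in FIELD_MAP.items()}
```

Because every retailer is mapped onto the same field names, the later steps (price checks, product linking, the API) can work on uniform records regardless of which site a page came from.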
5) Check product
prices
The price check process enables us
to track the price of a specific product
on an eCommerce website over time.
It can be configured to check the
price of a product as often as every
day, so you can tell your customers
within a day of a product going on sale.
This is a separate process that fetches
the price from the HTML pages and
updates the data store.
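A price check of this kind can be sketched as a parser plus a history list. The function names are hypothetical, and `parse_price` assumes prices that use "." as the decimal separator; other locales would need their own handling.

```python
from datetime import date
from decimal import Decimal


def parse_price(text: str) -> Decimal:
    """Turn a scraped string such as '$1,299.00' into an exact decimal."""
    cleaned = "".join(ch for ch in text if ch.isdigit() or ch == ".")
    return Decimal(cleaned)


def record_price(history: list, day: date, price: Decimal) -> bool:
    """Append an observation; return True if the price dropped since the last one."""
    dropped = bool(history) and price < history[-1][1]
    history.append((day, price))
    return dropped
```

`Decimal` is used instead of `float` so that prices compare exactly, and the boolean returned by `record_price` is the signal a notification job could act on.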
6) Link products
across retailers
To enable you to offer price
comparison features like those of
shopzilla.com and Google Shopping,
we use several different variables to
determine the probability that two
products are the same. We start by
looking for the manufacturer’s product
ID in the metadata and HTML
scraped from each site. If the retailer
creates its own SKU numbers for
each product, we perform entity
recognition on the product title. We
continue to add variables until we
achieve an acceptable statistical
confidence that two products are the
same. A relationship is created in the
data store once we’re confident that
we’ve found the same product sold at
multiple retailers.
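The matching logic can be illustrated with a deliberately simple score. It compares manufacturer part numbers when both records have one and otherwise falls back to token overlap (Jaccard similarity) between titles. This is a stand-in for the entity recognition and statistical confidence described above, and the record keys `mpn` and `title` are assumptions.

```python
import re


def tokens(title: str) -> set:
    """Lower-case alphanumeric tokens of a product title."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard(a: set, b: set) -> float:
    """Share of tokens two titles have in common, from 0.0 to 1.0."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_score(p: dict, q: dict) -> float:
    """Score how likely two retailer records describe the same product."""
    mpn_p, mpn_q = p.get("mpn"), q.get("mpn")
    if mpn_p and mpn_q:
        # A manufacturer ID on both sides is treated as decisive.
        return 1.0 if mpn_p.strip().lower() == mpn_q.strip().lower() else 0.0
    return jaccard(tokens(p["title"]), tokens(q["title"]))
```

A pipeline would create the cross-retailer relationship only when the score clears a chosen threshold, and further variables (brand, price range, image similarity) could be added where title overlap alone is not conclusive.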
7) Update products
Because the processes are
decoupled, we can update the entire
product catalog from an eCommerce
retailer as often as once a day.
8) Exception handling
All errors are logged and stored in a
temporary data store so the errors
can be easily reviewed and fixed.
Errors are tagged with job-id, phase
(crawl, fetch, mine), URL, and
message.
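The tagging scheme above might be sketched like this. The record layout follows the four tags named in the text plus a timestamp; the function names and the use of a plain list as the temporary store are assumptions made for illustration.

```python
import json
from datetime import datetime, timezone

PHASES = {"crawl", "fetch", "mine"}


def error_record(job_id: str, phase: str, url: str, exc: Exception) -> dict:
    """Build one error entry tagged with job-id, phase, URL, and message."""
    if phase not in PHASES:
        raise ValueError(f"unknown phase: {phase}")
    return {
        "job_id": job_id,
        "phase": phase,
        "url": url,
        "message": f"{type(exc).__name__}: {exc}",
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }


def run_step(job_id, phase, url, step, error_log):
    """Run one pipeline step; on failure, log the error and keep going."""
    try:
        return step(url)
    except Exception as exc:
        error_log.append(json.dumps(error_record(job_id, phase, url, exc)))
        return None
```

Catching the exception at the step boundary means one bad page does not stop the job, while the JSON line keeps enough context to replay the failing URL in the failing phase.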
The API application
We’ll expose a REST API that
generates a list of products based on
a natural language query. The API will
also enable you to filter products
based on categories, price, and other
metadata.
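The filtering side of such an API could be sketched as below. The query parameter names (`category`, `min_price`, `max_price`) and the product record keys are hypothetical, and the natural language search itself is not covered here.

```python
from decimal import Decimal
from urllib.parse import parse_qs


def filters_from_query(query: str) -> dict:
    """Read filter values from a query string such as 'category=shoes&max_price=50'."""
    params = parse_qs(query)

    def first(name):
        values = params.get(name)
        return values[0] if values else None

    def price(name):
        value = first(name)
        return Decimal(value) if value is not None else None

    return {
        "category": first("category"),
        "min_price": price("min_price"),
        "max_price": price("max_price"),
    }


def filter_products(products, category=None, min_price=None, max_price=None):
    """Keep the products that satisfy every filter that was supplied."""
    matches = []
    for product in products:
        if category is not None and product["category"] != category:
            continue
        if min_price is not None and product["price"] < min_price:
            continue
        if max_price is not None and product["price"] > max_price:
            continue
        matches.append(product)
    return matches
```

A request handler would call `filter_products(catalog, **filters_from_query(query_string))`, so any filter the client omits is simply not applied.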

Contact
Sales
‣ Ben Obear
‣ Business Development
‣ ben@cognitiveclouds.com
‣ 415.215.9082
Locations
‣ San Francisco, USA
‣ Bangalore, India
Case Study | First Version Native Mobile Apps for Yatra
Proposal v01
We’re not just another app developer.
We create connected experiences that stick.
CognitiveClouds’ team has worked on complex stuff like database B-trees,
document management systems, compilers, security systems, operating
systems, video platforms, telecom switches, mobile risk management
platforms and enterprise applications.
We utilize a mobile first engineering methodology to craft robust products
your users love. Products we build are always production-ready, which
means you can base your decision to go live on business factors and not
technical ones.
PROCESS
Our agile development process is
proven with distributed teams.
DISCOVER
Meet face-to-face with your team and
ensure they understand your vision.
We evaluate technologies.
PLAN
We deliver a project plan and technical
architecture.
PROTOTYPE
We deliver a functional prototype with
your product’s core features.
SHIP
Receive weekly product builds as we
add features and optimize. Your
product goes live.
LEARN
We measure analytics and user
feedback and prioritize iterations.
MAINTAIN
Our operations team maintains and
upgrades your infrastructure.
COMMUNICATION
Excellent communication ensures
timely completion of deliverables.
TEAM STRUCTURE
Your product manager is your main
point of contact. After kickoff, you’re
encouraged to maintain the
relationship you’ve built with your
technical lead and software engineers.
TEAM LOCATION
People on your team are located in
the USA and India.
DAILY SCRUMS
Participate in daily scrums to provide
input or gain visibility.
SPRINT BUILDS
After prototype completion, get weekly
product builds.
TOOLS
We use cloud productivity tools like
Basecamp for communication and
Pivotal Tracker for Scrum user stories.
PARTNERSHIP
As a partner, our success is
dependent upon your success.
LONG-TERM
All engagements start with the goal of
building a long-term relationship with
you and your team. Your best interests
are in mind when we provide advice
and strategic recommendations.
EXTENDED-TEAM
Our team is your team and that’s why
we recommend you meet face-to-face
with the people building your product.
BEST IDEAS
We share our best ideas to help you
build a better product.
SUCCESS METRICS
Success is measured by how the
product we build is received by your
customers. When you meet your
goals, we meet our goals.
Company Overview
