Scraping the Web with Scrapinghub
For Finance
We turn web content into useful data
About Scrapinghub
Scrapinghub specializes in data extraction. Our platform is used to scrape over 4 billion web pages a month.
We offer:
● Professional Services to handle the web scraping for you
● Off-the-shelf datasets so you can get data hassle-free
● A cloud-based platform that makes scraping a breeze
Founded in 2010, we are the largest 100% remote company based outside of the US.
We're 134 teammates in 48 countries.
“Getting information off the Internet is like taking a drink from a fire hydrant.”
– Mitchell Kapor
Scrapy
Scrapy is a web scraping framework that gets the dirty work related to web crawling out of your way.
Benefits
● No platform lock-in: Open Source
● Very popular (13k+ ★)
● Battle tested
● Highly extensible
● Great documentation
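To give a feel for the framework, here is a minimal spider sketch. The target site (quotes.toscrape.com, a public practice site) and the CSS selectors are illustrative examples, not part of the deck:

    import scrapy


    class QuotesSpider(scrapy.Spider):
        """Minimal Scrapy spider: the framework handles scheduling, retries,
        throttling and output serialization for us."""
        name = "quotes"
        start_urls = ["http://quotes.toscrape.com/"]  # public practice site

        def parse(self, response):
            # One item per quote block on the page
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }
            # Follow pagination until there is no "next" link
            next_page = response.css("li.next a::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)

Running it with "scrapy runspider quotes_spider.py -o quotes.json" yields structured JSON without writing any crawling plumbing.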
Portia
Portia is a Visual Scraping tool that lets you get data without needing to write code.
Benefits
● No platform lock-in: Open Source
● Handles JavaScript-generated dynamic content
● Ideal for non-developers
● Extensible
● It’s as easy as annotating a page
Large Scale Infrastructure
Meet Scrapy Cloud, our PaaS for web crawlers (see the scheduling sketch after this list):
● Scalable: Crawlers run on EC2 instances or dedicated servers
● Crawlera add-on (smart proxy rotation)
● Control your spiders: Command line, API or web UI
● Machine learning integration: BigML, MonkeyLearn
● No lock-in: scrapyd to run Scrapy spiders on your own infrastructure
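Because there is no lock-in, the same spiders can be scheduled on your own infrastructure through scrapyd's HTTP API. A minimal sketch, assuming a scrapyd instance on its default port 6800; the "pricewatch" project and "retailer" spider names are placeholders:

    import requests

    SCRAPYD = "http://localhost:6800"  # default scrapyd port; adjust for your host

    # Schedule a crawl ("pricewatch"/"retailer" are placeholder names)
    resp = requests.post(
        f"{SCRAPYD}/schedule.json",
        data={"project": "pricewatch", "spider": "retailer"},
    )
    print("scheduled job", resp.json()["jobid"])

    # Check pending/running/finished jobs for the project
    jobs = requests.get(
        f"{SCRAPYD}/listjobs.json", params={"project": "pricewatch"}
    ).json()
    print(len(jobs["running"]), "running,", len(jobs["finished"]), "finished")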
Broad Crawls
Frontera allows us to build large-scale web crawlers in Python (see the settings sketch after this list):
● Scrapy support out of the box
● Distribute and scale custom web crawlers across servers
● Crawl Frontier Framework: large-scale URL prioritization logic
● Aduana to prioritize URLs based on link analysis (PageRank, HITS)
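As a rough idea of the Scrapy integration, the sketch below hands scheduling over to Frontera from a Scrapy project's settings. The module paths follow Frontera's documented Scrapy quickstart but should be treated as assumptions to verify against the Frontera version you install; 'myproject.frontera_settings' is a placeholder module name:

    # settings.py of a Scrapy project -- sketch of handing scheduling to Frontera.

    # Replace Scrapy's default scheduler with Frontera's crawl-frontier scheduler
    SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'

    SPIDER_MIDDLEWARES = {
        'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
    }
    DOWNLOADER_MIDDLEWARES = {
        'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
    }

    # Separate module holding Frontera's own settings (backend, URL ordering, ...);
    # 'myproject.frontera_settings' is a placeholder name
    FRONTERA_SETTINGS = 'myproject.frontera_settings'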
Web Scraping Use Cases
Competitive Pricing
Companies use web scraping to monitor the pricing and ratings of competitors (see the spider sketch after this list):
● Scrape online retailers
● Structure the data in a search engine or DB
● Create an interface to search for products
● Sentiment analysis for product rankings
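As a sketch of what such a price-monitoring spider yields, the example below emits one structured record per product. The retailer URL and CSS selectors are hypothetical and must be adapted per site; the records would then be loaded into a database or search engine by an item pipeline:

    import scrapy


    class PriceSpider(scrapy.Spider):
        """Sketch of a price/ratings monitoring spider; the URL and the CSS
        selectors below are hypothetical and must be adapted per retailer."""
        name = "prices"
        start_urls = ["https://example-retailer.com/laptops"]  # placeholder URL

        def parse(self, response):
            for product in response.css("div.product"):  # hypothetical markup
                yield {
                    "name": product.css("h2.title::text").get(),
                    "price": product.css("span.price::text").get(),
                    "rating": product.css("span.rating::attr(data-value)").get(),
                    "url": response.urljoin(product.css("a::attr(href)").get() or ""),
                }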
Monitor Resellers
We help a leading IT manufacturer monitor the activities of their resellers:
● Tracking and watching out for stolen goods
● Pricing agreement violations
● Customer support responses to complaints
● Product line quality checks
Lead Generation
Mine scraped data to identify who to target in a company for your outbound sales campaigns:
● Locate possible leads in your target market
● Identify the right contacts within each one
● Augment the information you already have on them
Real Estate
Crawl property websites and use the data obtained to:
● Estimate house prices
● Track rental values
● Follow housing stock movements
● Gain insight into real estate agents and homeowners
Fraud Detection
Monitor for sellers offering products that violate credit card companies' terms of service (ToS), including:
● Drugs
● Weapons
● Gambling
Identify stolen cards and IDs on the Dark Web:
● Forums where hackers share ID numbers / PINs
Company Reputation
Sentiment analysis of a company or product through newsletters, social networks and other natural-language data sources:
● Use NLP to create an associated sentiment indicator (see the sketch after this list)
● Tracking the relevant news behind the indicator can lead to market insights on long-term trends
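As a toy illustration of such an indicator, the sketch below scores scraped headlines against a hand-made word list and averages them. The word lists and the headlines are illustrative only; a production system would use a trained NLP model instead:

    # Toy sentiment indicator over scraped headlines (illustrative only).
    POSITIVE = {"beats", "growth", "record", "upgrade", "strong"}
    NEGATIVE = {"misses", "lawsuit", "recall", "downgrade", "weak"}

    def headline_score(headline):
        # +1 for each positive word present, -1 for each negative word
        words = set(headline.lower().split())
        return len(words & POSITIVE) - len(words & NEGATIVE)

    def sentiment_indicator(headlines):
        # Average per-headline score; > 0 leans positive, < 0 leans negative
        if not headlines:
            return 0.0
        return sum(headline_score(h) for h in headlines) / len(headlines)

    print(sentiment_indicator([
        "ACME beats earnings expectations on record cloud growth",
        "Regulator opens lawsuit against ACME over data practices",
    ]))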
Consumer Behavior
Extract data from forums and websites like Reddit to evaluate consumer reviews and commentary:
● Volume of comments across brands
● Topics of discussion
● Comparisons with other brands and products
● Evaluate product launches and marketing tactics
Tracking Legislation
Monitor bills and regulations that are being discussed in Congress. Access court judgments and opinions in order to:
● Follow discussions
● Try to forecast legislative outcomes
● Track regulations that impact different economic sectors
Hiring
Crawl and extract data from job boards and other sources to:
● Understand hiring trends in different sectors or regions
● Find candidates for jobs, or future leaders
● Spot and rescue employees who are shopping for a new job
Monitoring Corruption
Journalists and analysts can create Open Data by extracting information from difficult-to-access government websites:
● Track the activities of lobbyists
● Spot patterns in the behavior of government officials
● Detect disruptions in the economy due to corruption allegations
Thank you!
scrapinghub.com