Crawling the web for fun and profit

Federico Feroldi, Digital Transformation leadership & CTO
Crawling the Web
(for fun and profit)
      Federico Feroldi
“A Web crawler is a computer
program that browses the World
Wide Web in a methodical,
automated manner.”
                         Wikipedia
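
As a minimal sketch of what "methodical, automated" browsing can look like, the Python below crawls breadth-first from a start URL. The names `LinkParser`, `crawl`, and the `fetch` callback are illustrative assumptions, not part of the talk. `fetch` is passed in so the sketch can run against a small in-memory set of pages instead of the live web.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; returns URLs in the order they were visited.

    fetch(url) is a caller-supplied function (an assumption of this sketch)
    that returns the page HTML as a string, or None if the page is unavailable.
    """
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before deduplicating.
            absolute, _ = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Tiny in-memory "web" so the example runs without network access.
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.com/b": '<a href="/a#top">A again</a>',
}

order = crawl("http://example.com/", PAGES.get)
print(order)
```

Against real sites, `fetch` would perform an HTTP request, and a polite crawler would also honor robots.txt and limit its request rate.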




                         Picture greetings to photoholic1 --LennyB
Search engines only show you
what their crawlers can catch




                                Picture greetings to jimbrickett
The deep web contains a
lot of valuable information:
e-commerce, finance, transportation, yellow pages, medicine, government,
opinions, real estate, personal, intranets, social
                               Picture greetings to tricky ™
Dig deeper with
your own crawler
          Picture greetings to Super*Junk
Information = Competitive Advantage
              Picture greetings to mastrobiggo
Backup historical
data: web sites, blogs
Social network analysis: find
influencers and interests
based on “social circles”
Find what people like
Sentiment analysis: find
what people say about
your brand or product
Trending topics
and products
Competitor price tracking
Real estate
Personal data and
online reputation
Do It Yourself




                 Picture greetings to vic_206
Anybody can build
a search engine
Scrapy architecture
[Diagram. Components: Scheduler, Scrapy Engine, Downloader, Spider, Item pipeline, Internet. Arrow labels: Requests, Responses, Items, Data.]
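
The data flow in the diagram can be sketched in plain Python. This is a toy model that does not use Scrapy's actual API: the function names and the in-memory `FAKE_WEB` are hypothetical, chosen only to mirror the boxes in the diagram (scheduler, downloader, spider, item pipeline), with an engine loop moving requests, responses, and items between them.

```python
from collections import deque

# Hypothetical stand-in for the Internet: URL -> (page title, outgoing links).
FAKE_WEB = {
    "site/home": ("Home", ["site/about", "site/blog"]),
    "site/about": ("About", ["site/home"]),
    "site/blog": ("Blog", []),
}


def downloader(request):
    """Downloader: turns a request (a URL) into a response."""
    title, links = FAKE_WEB[request]
    return {"url": request, "title": title, "links": links}


def spider(response):
    """Spider: parses a response, yielding items and follow-up requests."""
    yield ("item", {"url": response["url"], "title": response["title"]})
    for link in response["links"]:
        yield ("request", link)


def item_pipeline(item, store):
    """Item pipeline: post-processes and stores each scraped item."""
    item["title"] = item["title"].upper()
    store.append(item)


def engine(start_requests):
    """Engine: moves data between scheduler, downloader, spider and pipeline."""
    scheduler = deque(start_requests)  # Scheduler: queue of pending requests
    seen = set(start_requests)
    store = []
    while scheduler:
        request = scheduler.popleft()
        response = downloader(request)
        for kind, payload in spider(response):
            if kind == "item":
                item_pipeline(payload, store)
            elif payload not in seen and payload in FAKE_WEB:
                seen.add(payload)
                scheduler.append(payload)
    return store


items = engine(["site/home"])
print([item["title"] for item in items])
```

Keeping each role in its own function is the point of the sketch: the spider only knows how to parse, the pipeline only knows how to store, and the engine is the only piece that wires them together.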
Twitter social graph crawler
with Scrapy in 150 LOC
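
The crawler from the talk is not reproduced here. As a rough sketch of the general idea only, the Python below walks a "who follows whom" graph breadth-first and ranks accounts by how many crawled users follow them. `SAMPLE_FOLLOWING`, `get_following`, `crawl_social_graph`, and `most_followed` are all hypothetical names, and the sample data is made up; a real crawler would replace `get_following` with requests to the social network.

```python
from collections import Counter, deque

# Hypothetical sample data: user -> accounts that user follows.
SAMPLE_FOLLOWING = {
    "alice": ["bob", "carol"],
    "bob": ["carol"],
    "carol": ["alice"],
    "dave": ["carol"],
}


def get_following(user):
    """Hypothetical fetch step; a real crawler would query the network here."""
    return SAMPLE_FOLLOWING.get(user, [])


def crawl_social_graph(seed, max_depth=2):
    """Breadth-first walk from seed; returns (follower, followed) edges."""
    edges = []
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        user, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand users beyond the depth limit
        for followed in get_following(user):
            edges.append((user, followed))
            if followed not in seen:
                seen.add(followed)
                queue.append((followed, depth + 1))
    return edges


def most_followed(edges, top=1):
    """Ranks users by how many crawled accounts follow them (in-degree)."""
    counts = Counter(followed for _, followed in edges)
    return counts.most_common(top)


edges = crawl_social_graph("alice")
print(edges)
print(most_followed(edges))
```

The depth limit matters in practice: social graphs fan out quickly, so an unbounded walk from a single seed can reach a very large number of accounts.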
The Web is much bigger
than what you can search
with Google
Thank you

federico@cloudify.me

twitter.com/cloudify