Scraping content from the web for
a location-based mobile app
Nguyen Hong Diep
founder, magik.vn
Summary
1. Web Scraping
– Definitions
– Value added
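– Analysis of a sample case
2. Scrapy Framework
– Overview
– Architecture
– A simple Scrapy program
3. Build an automatic scraping system for location-based apps
– Extract LatLng from address
– Extract phone number
– Real-time update, running 24/7
– Prevent duplicate data
– Deploy without a dedicated server or VPS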
Web crawler
Internet bot that systematically browses
the World Wide Web,
typically for web indexing.
Source: wikipedia.org
Scrape
Crawl websites and
extract structured data from pages.
Source: wikipedia.org
Added Value?
giamua.com – “groupon”
baomoi.com
Added Value?
same user experience
but
more content than
oizoioi.vn
Price comparison for electronics
Added Value?
Create
new knowledge
from many pieces of information
Wisdom
Knowledge
Information
Data
DIKW Hierarchy
Nha Tro Tot
Added Value?
The smartphone revolution
A new platform
needs
a new user experience
Source: www.widexconnect.ca
And more
Source: Laban.vn
Analysis of a sample case
Need:
(1) collect [homes for sale] records
from the web
(2) from many websites in Vietnam
(3) as soon as they are posted
(4) continuously, 24/7
Step 1: List the sources
Step 2: Build a general database
Step 3: Ctrl+C, Ctrl+V
• For every site:
– Find the link to the page that lists the latest records.
– For every record:
• Check whether the record is new.
– If it is, copy and paste its fields into a new record in my DB.
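The manual procedure above is already an algorithm, which is why it can be automated. A minimal Python sketch of the same loop, assuming each source's "latest records" page has already been fetched into a list of record dicts and that a record's url identifies it (both are assumptions for illustration, not taken from the slides):

```python
def copy_new_records(sources, database):
    """Copy records that are not stored yet into `database`.

    sources:  {site_name: [record_dict, ...]}, the latest records of each
              site, already fetched and parsed (hypothetical input shape).
    database: {record_url: record_dict}, standing in for the real DB.
    Returns the URLs of the records that were added.
    """
    added = []
    for site, records in sources.items():             # for every site
        for record in records:                        # for every record
            url = record["url"]
            if url in database:                       # check if it is new
                continue
            database[url] = {**record, "site": site}  # "copy & paste" the fields
            added.append(url)
    return added


sources = {
    "site-a.example": [{"url": "http://site-a.example/1", "title": "House A"}],
    "site-b.example": [{"url": "http://site-b.example/7", "title": "House B"}],
}
database = {"http://site-a.example/1": {"title": "House A", "site": "site-a.example"}}
added = copy_new_records(sources, database)
print(added)
```

Only the record from site-b.example is copied, because the other one is already in the database.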
Step 3: Ctrl+C, Ctrl+V
Step 3: Let’s Scrapy
Scrapy Framework
• Overview
• Architecture
• XPath
• Make a simple Scrapy program.
• Scrapy is a fast high-level screen
scraping and web crawling
framework.
• Open-source, 100% Python => Portable
Scrapy’s GitHub info
• Since 2008
• Stats
Architecture
Source: http://doc.scrapy.org/en/0.12/topics/architecture.html
XPath
Navigate through
elements and attributes
in an XML document.
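A small illustration of navigating elements and attributes with path expressions. It uses Python's standard xml.etree.ElementTree, which supports only a subset of XPath; Scrapy's own selectors accept full XPath 1.0. The XML document and its element names are invented for the example.

```python
import xml.etree.ElementTree as ET

XML = """
<listings>
  <home id="101" city="Hanoi"><title>House near the lake</title></home>
  <home id="102" city="Hue"><title>Small apartment</title></home>
</listings>
"""

root = ET.fromstring(XML)

# Element path: every <title> inside a <home> child of the root element.
titles = [t.text for t in root.findall("./home/title")]

# Attribute predicate: only <home> elements whose city attribute is "Hue".
hue_ids = [h.get("id") for h in root.findall("./home[@city='Hue']")]

print(titles)
print(hue_ids)
```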
Simple Scrapy Program
• (1) Pick a website
– http://www.mininova.org/today
• (2) Define the data you want to
scrape
Simple Scrapy Program (cont.)
• (3) Write a Spider to extract the data
Simple Scrapy Program (cont.)
(4) Run the spider to extract the data
(5) Review scraped data
Build an automatic scraping system for
location-based apps
• Extract LatLng from address
• Extract phone number
• Real-time update, running 24/7
• Prevent duplicate data
• Deploy without a dedicated server or VPS
Extract LatLng from address
• Use the Google Geocoding API
• https://maps.googleapis.com/maps/api/geocode/json?address=xxx&sensor=true_or_false&key=API_KEY
Extract LatLng from address (cont.)
Extract LatLng from address (cont.)
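A minimal sketch of the geocoding step using only the standard library. It builds the request URL with the same parameters as the slide and reads the coordinates from the results[0].geometry.location part of the JSON response. The helper names and the sample address are hypothetical, and the demonstration parses a hand-written response so that it runs without a network connection or an API key.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address, api_key, sensor="false"):
    """Build the request URL for one address (parameters as on the slide)."""
    query = urlencode({"address": address, "sensor": sensor, "key": api_key})
    return f"{GEOCODE_URL}?{query}"


def extract_latlng(payload):
    """Return (lat, lng) of the first result, or None if nothing matched."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


def geocode(address, api_key):
    """Call the API over the network (needs a valid key)."""
    with urlopen(build_geocode_url(address, api_key), timeout=10) as resp:
        return extract_latlng(json.load(resp))


# Offline demonstration with a hand-written response.
sample = {
    "status": "OK",
    "results": [{"geometry": {"location": {"lat": 21.0278, "lng": 105.8342}}}],
}
print(build_geocode_url("1 Trang Tien, Ha Noi", "API_KEY"))
print(extract_latlng(sample))
```

Returning None for an unmatched address lets the scraper store the record without coordinates and retry later instead of failing the whole crawl.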
Extract Phone Number
• Use the Python port of libphonenumber.
• Sample
“Real-time” updates, running
continuously 24/7
• Task Scheduler (Windows)
• Cron jobs (Linux)
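Both schedulers simply run the crawl command at a fixed interval. On Linux that could be a crontab line such as */10 * * * * cd /path/to/project && scrapy crawl homes, where the path and the spider name are hypothetical. Where neither scheduler is available, the same effect can be approximated inside a Python process. The loop below is a stand-in for illustration, not what the slide uses.

```python
import time


def run_periodically(job, interval_seconds, max_runs=None, sleep=time.sleep):
    """Call `job` every `interval_seconds`; stop after `max_runs` if given."""
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            job()
        except Exception as exc:     # one failed crawl must not stop the loop
            print(f"job failed: {exc}")
        runs += 1
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return runs


def crawl():
    # Placeholder: in practice this could start the spider, for example with
    # subprocess.run(["scrapy", "crawl", "homes"]) and a real spider name.
    print("crawling...")


# Demonstration: three runs with no real waiting between them.
run_periodically(crawl, interval_seconds=600, max_runs=3, sleep=lambda s: None)
```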
Prevent duplicate data
• Write a middleware that ignores items
that already exist: IgnoreExistsMiddleWare
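The middleware's code is not included in the slide text. One way such a class could look is a Scrapy spider middleware, whose process_spider_output hook sees everything a spider yields and can filter it. In this sketch the set of known keys is kept in memory and the url field is used as the identity of an item; both are assumptions, and a real project would check its database instead.

```python
from collections.abc import Mapping


class IgnoreExistsMiddleWare:
    """Spider middleware that drops items whose key is already known."""

    def __init__(self, known_keys=None, key_field="url"):
        # In a real project the known keys would come from the database.
        self.known_keys = set(known_keys or [])
        self.key_field = key_field

    def process_spider_output(self, response, result, spider):
        for entry in result:
            # Items are mapping-like; requests are passed through untouched.
            if isinstance(entry, Mapping):
                key = entry.get(self.key_field)
                if key in self.known_keys:
                    continue         # already stored: ignore it
                self.known_keys.add(key)
            yield entry


mw = IgnoreExistsMiddleWare(known_keys={"http://a.example/1"})
scraped = [
    {"url": "http://a.example/1"},
    {"url": "http://a.example/2"},
    {"url": "http://a.example/2"},
]
kept = list(mw.process_spider_output(response=None, result=scraped, spider=None))
print(kept)
```

In a Scrapy project such a class would be enabled through the SPIDER_MIDDLEWARES setting, for example under a hypothetical path like myproject.middlewares.IgnoreExistsMiddleWare.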
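Without a dedicated server or
VPS
• Problem: my server side runs on cPanel
web hosting => can’t deploy Scrapy there
• Solution:
– Build web services for syncing new record data:
• /get_head_revision
• /sync
– Scrapy runs on my PC, then syncs with the server.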
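The slides name two endpoints, /get_head_revision and /sync, but do not describe their behaviour. The sketch below shows one plausible reading, stated here as an assumption: the PC asks the server for the highest revision it already holds, then uploads only newer records. The in-memory class replaces the real HTTP service so the logic can be run locally.

```python
class InMemoryServer:
    """Stands in for the hosted web service (hypothetical behaviour)."""

    def __init__(self):
        self.records = []

    def get_head_revision(self):
        # Assumed meaning of /get_head_revision: highest revision stored so far.
        return max((r["revision"] for r in self.records), default=0)

    def sync(self, new_records):
        # Assumed meaning of /sync: store the uploaded records.
        self.records.extend(new_records)
        return len(new_records)


def push_new_records(local_records, server):
    """Upload only the records the server has not seen yet."""
    head = server.get_head_revision()
    pending = sorted(
        (r for r in local_records if r["revision"] > head),
        key=lambda r: r["revision"],
    )
    if pending:
        server.sync(pending)
    return [r["revision"] for r in pending]


server = InMemoryServer()
local = [{"revision": 1, "title": "House A"}, {"revision": 2, "title": "House B"}]
first = push_new_records(local, server)
local.append({"revision": 3, "title": "House C"})
second = push_new_records(local, server)
print(first, second)
```

Because the PC always initiates the exchange, the hosting account only has to serve two small HTTP endpoints and never has to run Scrapy itself.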