Python, web scraping and content management: Scrapy and Django
Transcript

  • 1. Python, Web Scraping and Content Management: Scrapy and Django Sammy Fung http://sammy.hk OpenSource.HK Workshop 2014.07.05
  • 2. Sammy Fung ● Perl → PHP → Python ● Linux → Open Source → Open Data ● Freelance → Startup ● http://sammy.hk ● sammy@sammy.hk
  • 3. Open Data
  • 4. Can a computer program read this?
  • 5. Is this UI easy to understand?
  • 6. Five Star Open Data 1. Make your stuff available on the Web (whatever format) under an open license. 2. Make it available as structured data (e.g., Excel instead of image scan of a table). 3. Use non-proprietary formats (e.g., CSV instead of Excel). 4. Use URIs to denote things, so that people can point at your stuff. 5. Link your data to other data to provide context. 5stardata.info by Tim Berners-Lee, the inventor of the Web.
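To make step 3 concrete, here is a minimal sketch of emitting three-star open data as plain CSV with only Python's standard library; the station codes, field names and values are invented for illustration:

```python
import csv
import io

# Hypothetical weather readings: station code, temperature (deg C), humidity (%)
rows = [
    ("HKO", 28.5, 78),
    ("SHA", 29.1, 74),
]

# Write structured data in a non-proprietary format (CSV)
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["station", "temperature", "humidity"])
writer.writerows(rows)
print(buf.getvalue())
```

Any program, in any language, can parse this back without proprietary software, which is exactly the point of stars two and three.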
  • 7. Open Data ● Data.One – Led by OGCIO of the Hong Kong Government. – Uses the term “public sector information” (PSI) instead of “open data”. – Much of the data is not available in a machine-readable format with a useful data structure. – A lot of data still requires web scraping with customized data extraction to collect useful machine-readable data.
  • 8. Web Scraping with Scrapy
  • 9. Web Scraping a computer software technique of extracting information from websites. (Wikipedia)
  • 10. Scrapy ● Python. ● Open source web scraping framework. ● Scrape websites and extract structured data. ● From data mining to monitoring and automated testing.
  • 11. Scrapy ● Define your own data structures. ● Write spiders to extract data. ● Built-in XPath selectors for extracting data. ● Built-in JSON, CSV, and XML output. ● Interactive shell console, telnet console, logging, and more.
  • 12. scrapyd ● Scrapy web service daemon. ● pip install scrapyd ● Web API with simple Web UI: – http://localhost:6800 ● Web API Documentation: – http://scrapyd.readthedocs.org/en/latest/api.html
  • 13. scrapyd ● Examples: – curl http://localhost:6800/listprojects.json – curl http://localhost:6800/listspiders.json?project=default ● e.g. {"status": "ok", "spiders": ["pollutant24", "aqhi24"]}
  • 14. Scrapy Installation $ apt-get install python python-virtualenv python-pip $ virtualenv env $ source env/bin/activate $ pip install scrapy
  • 15. Creating Scrapy Project $ scrapy startproject <new project name> newproject |-- newproject | |-- __init__.py | |-- items.py | |-- pipelines.py | |-- settings.py | |-- spiders | |-- __init__.py |-- scrapy.cfg
  • 16. Creating Scrapy Project ● Define your data structure ● Write your first spider – Test with scrapy shell console ● Output / Store collected data – Output with built-in supported formats – Store to database / object store.
  • 17. Define your data structure items.py
    class Hk0WeatherItem(Item):
        reporttime = Field()
        station = Field()
        temperture = Field()
        humidity = Field()
  • 18. Write your first spider ● Import a Class of your own data structure. – $ scrapy genspider -t basic <YOUR SPIDER NAME> <DOMAIN> – $ scrapy list ● Import any scrapy class which you required. – eg. Spider, XPath Selector ● Extend parse() function of a Spider class. ● Test with scrapy shell console – $ scrapy shell <URL>
  • 19. Output / Store collected data ● Use built-in JSON, CSV, XML output at command line. – $ scrapy crawl <Spider Name> -t json -o <Output File> ● pipelines.py – Import the class of your own data structure. – Extend the process_item() function. – Add to ITEM_PIPELINES in settings.
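A minimal sketch of such a pipeline; the in-memory list stands in for a real database connection, and the class and module names are illustrative:

```python
# pipelines.py (sketch): collect processed items; swap the list for real storage
class Hk0WeatherPipeline:
    def open_spider(self, spider):
        self.items = []  # stand-in for opening a database connection

    def process_item(self, item, spider):
        self.items.append(dict(item))  # validate/normalize here, then persist
        return item  # always return the item so later pipelines can run

# settings.py (sketch): enable the pipeline; the number sets its order (0-1000)
ITEM_PIPELINES = {"newproject.pipelines.Hk0WeatherPipeline": 300}
```

Returning the item from `process_item()` matters: pipelines run as a chain, and each stage hands the item to the next one.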
  • 20. Django web framework
  • 21. Creating django project $ pip install django $ django-admin.py startproject <Project name> myproject |-- manage.py |-- myproject |-- __init__.py |-- settings.py |-- urls.py |-- wsgi.py
  • 22. Creating django project ● Define django settings. – Create database, tables and first django user. ● Create your own django app. – or add existing django apps. – Create database tables. ● Activate django admin UI. – Add URL router to access admin UI.
  • 23. Creating django project ● settings.py – Define your database connection. – Add your own app to INSTALLED_APPS. – Define your own settings.
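For example, the relevant settings.py fragments might look like the sketch below; the engine, database name, app name and custom setting are example values, not requirements:

```python
# settings.py (fragment): database connection
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.sqlite3",  # example: file-based SQLite
        "NAME": "hk0weather.db",                 # hypothetical database name
    }
}

# settings.py (fragment): register your own app alongside the django ones
INSTALLED_APPS = (
    "django.contrib.admin",
    "django.contrib.auth",
    "django.contrib.contenttypes",
    "django.contrib.sessions",
    "myapp",  # your own app
)

# settings.py (fragment): your own settings are plain module-level names
SCRAPE_INTERVAL_MINUTES = 30  # hypothetical custom setting
```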
  • 24. Create django app $ cd <Project Name> $ python manage.py startapp <App Name> myproject |-- manage.py |-- myproject | |-- __init__.py | |-- settings.py | |-- urls.py | |-- wsgi.py |-- myapp |-- admin.py |-- __init__.py |-- models.py |-- tests.py |-- views.py
  • 25. Create django app ● Define your own data model. ● Define and activate your admin UI. ● Furthermore: – Define your data views. – Add URL routers to connect with data views.
  • 26. Define django data model ● Define at models.py. ● Import django data model base class. ● Define your own data model class. ● Create database table(s). – $ python manage.py syncdb
  • 27. Define django data model
    class WeatherData(models.Model):
        reporttime = models.DateTimeField()
        station = models.CharField(max_length=3)
        temperture = models.FloatField(null=True, blank=True)
        humidity = models.IntegerField(null=True, blank=True)
  • 28. Define django data model ● admin.py – Import admin class – Import your own data model class. – Extend admin class for your data model. – Register admin class ● with admin.site.register() function.
  • 29. Define django data model
    class WeatherDataAdmin(admin.ModelAdmin):
        list_display = ('reporttime', 'station', 'temperture', 'humidity', 'windspeed')
        list_filter = ['station']

    admin.site.register(WeatherData, WeatherDataAdmin)
  • 30. Enable django admin ui ● Adding to INSTALLED_APPS at settings.py – django.contrib.admin ● Adding URL router at urls.py – $ python manage.py runserver ● Access admin UI – http://127.0.0.1:8000/admin
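The URL router addition might look like this in urls.py, using the 1.6-era syntax current at the time of this talk (newer Django versions drop `patterns()` in favor of a plain list):

```python
# urls.py (django 1.6-era sketch)
from django.conf.urls import patterns, include, url
from django.contrib import admin

admin.autodiscover()  # find the admin.py modules in INSTALLED_APPS

urlpatterns = patterns('',
    url(r'^admin/', include(admin.site.urls)),
)
```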
  • 31. Scrapy + Django
  • 32. Scrapy + Django ● Define the django environment at scrapy settings. – Load the django configuration. ● Use the Scrapy DjangoItem class – Instead of the Item and Field classes. – Define which django data model it should be linked with. ● Query and insert data at scrapy pipelines.
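Wiring the two together might look like the sketch below. The project path is a placeholder, and the DjangoItem import reflects the Scrapy 0.24-era module layout; later Scrapy versions moved it to the separate scrapy_djangoitem package:

```python
# scrapy settings.py (sketch): make the django project importable and configured
import os
import sys

sys.path.append('/path/to/myproject')  # placeholder path to the django project
os.environ['DJANGO_SETTINGS_MODULE'] = 'myproject.settings'

# items.py (sketch): a DjangoItem gets its fields from the linked model
from scrapy.contrib.djangoitem import DjangoItem
from myapp.models import WeatherData

class Hk0WeatherItem(DjangoItem):
    django_model = WeatherData

# pipelines.py (sketch): a DjangoItem can be saved like a model instance
def store(item):
    item.save()  # inserts a WeatherData row via the django ORM
```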
  • 33. hk0weather
  • 34. hk0weather ● Weather Data Project. – https://github.com/sammyfung/hk0weather – convert weather information to JSON data from HKO webpages. – python + scrapy + django
  • 35. hk0weather ● Hong Kong Weather Data. – 20+ HKO weather stations in Hong Kong. – Regional weather data. – Rainfall data. – Weather forecast report.
  • 36. hk0weather ● Set up and activate a python virtual environment, and install scrapy and django with pip. ● Clone hk0weather from GitHub – $ git clone https://github.com/sammyfung/hk0weather.git ● Set up the database connection at Django and create the database, tables and first django user. ● Scrape regional weather data – $ scrapy crawl regionalwx -t json -o regional.json
  • 37. DEMO
  • 38. Thank you! http://sammy.hk
