Python, Web Scraping and
Content Management:
Scrapy and Django
Sammy Fung
http://sammy.hk
OpenSource.HK Workshop 2014.07.05
Sammy Fung
● Perl → PHP → Python
● Linux → Open Source → Open Data
● Freelance → Startup
● http://sammy.hk
● sammy@sammy.hk
Open Data
Can a computer program read this?
Is this UI easy to understand?
Five Star Open Data
1. make your stuff available on the Web (whatever format) under an open license.
2. make it available as structured data (e.g., Excel instead of image scan of a table)
3. use non-proprietary formats (e.g., CSV instead of Excel)
4. use URIs to denote things, so that people can point at your stuff.
5. link your data to other data to provide context.
5stardata.info by Tim Berners-Lee, the inventor of the Web.
Open Data
● Data.One
– Led by the OGCIO of the Hong Kong Government.
– Uses the term “public sector information” (PSI) instead of “open data”.
– Much of the data is not available in a machine-readable format with a useful data structure.
– A lot of data still requires web scraping with customized data extraction to produce useful machine-readable data.
Web Scraping
with Scrapy
Web Scraping
a computer software
technique of extracting
information from websites.
(Wikipedia)
Scrapy
● Python.
● Open source web scraping framework.
● Scrape websites and extract structured data.
● From data mining to monitoring and
automated testing.
Scrapy
● Define your own data structures.
● Write spiders to extract data.
● Built-in XPath selectors for extracting data.
● Built-in JSON, CSV, XML output.
● Interactive shell console, telnet console, logging, and more.
scrapyd
● Scrapy web service daemon.
● pip install scrapyd
● Web API with simple Web UI:
– http://localhost:6800
● Web API Documentation:
– http://scrapyd.readthedocs.org/en/latest/api.html
scrapyd
● Examples:
– curl http://localhost:6800/listprojects.json
– curl http://localhost:6800/listspiders.json?project=default
● e.g. {"status": "ok", "spiders": ["pollutant24", "aqhi24"]}
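The two curl calls above can also be made from Python. A minimal sketch, assuming scrapyd is listening on http://localhost:6800 as on the slide; the helper names api_url, spider_names and list_spiders are made up for illustration, and only the standard library is used.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SCRAPYD = "http://localhost:6800"  # assumed address, taken from the slide


def api_url(endpoint, **params):
    """Build a scrapyd JSON API URL such as .../listspiders.json?project=default."""
    url = "%s/%s" % (SCRAPYD, endpoint)
    if params:
        url += "?" + urlencode(sorted(params.items()))
    return url


def spider_names(body):
    """Return the spider list from a listspiders.json response body."""
    data = json.loads(body)
    if data.get("status") != "ok":
        raise RuntimeError("scrapyd reported an error: %r" % (data,))
    return data["spiders"]


def list_spiders(project="default"):
    """Fetch the spider names of one project from a running scrapyd."""
    with urlopen(api_url("listspiders.json", project=project)) as resp:
        return spider_names(resp.read().decode("utf-8"))
```

Only list_spiders() touches the network; the other two helpers can be checked against the sample response shown on the slide without a running daemon.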
Scrapy Installation
$ apt-get install python python-virtualenv python-pip
$ virtualenv env
$ source env/bin/activate
$ pip install scrapy
Creating Scrapy Project
$ scrapy startproject <new project name>
newproject
|-- newproject
|   |-- __init__.py
|   |-- items.py
|   |-- pipelines.py
|   |-- settings.py
|   |-- spiders
|       |-- __init__.py
|-- scrapy.cfg
Creating Scrapy Project
● Define your data structure
● Write your first spider
– Test with scrapy shell console
● Output / Store collected data
– Output with built-in supported formats
– Store to database / object store.
Define your data structure
items.py
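from scrapy.item import Item, Field

# One scraped weather reading; each Field() declares an item attribute.
class Hk0WeatherItem(Item):
    reporttime = Field()
    station = Field()
    temperture = Field()
    humidity = Field()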
Write your first spider
● Import the class that defines your data structure.
– $ scrapy genspider -t basic <YOUR SPIDER NAME> <DOMAIN>
– $ scrapy list
● Import any Scrapy classes you require.
– e.g. Spider, XPath Selector
● Override the parse() method in your Spider subclass.
● Test with scrapy shell console
– $ scrapy shell <URL>
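The parse() step boils down to selecting nodes and copying their text into the fields defined in items.py. The sketch below shows that idea without Scrapy installed: the standard library's ElementTree, which supports only a small XPath subset, stands in for Scrapy's selectors. The sample page, the table layout and the parse_rows name are invented for illustration and are not taken from a real HKO page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page; real-world HTML usually needs a
# forgiving HTML parser such as the one behind Scrapy's selectors.
SAMPLE = """
<html><body><table>
  <tr><th>Station</th><th>Temp</th><th>Humidity</th></tr>
  <tr><td>HKO</td><td>29.1</td><td>78</td></tr>
  <tr><td>SHA</td><td>30.4</td><td>70</td></tr>
</table></body></html>
"""


def parse_rows(html):
    """Yield one dict per data row, using the field names from items.py."""
    root = ET.fromstring(html.strip())
    for row in root.findall(".//tr"):
        cells = [td.text for td in row.findall("td")]
        if len(cells) != 3:  # skips the header row, which only has <th> cells
            continue
        station, temperture, humidity = cells
        yield {
            "station": station,
            "temperture": float(temperture),
            "humidity": int(humidity),
        }


for item in parse_rows(SAMPLE):
    print(item)
```

In a real spider the same loop would sit inside parse() and read from the response object instead of a string; scrapy shell is the place to try out the XPath expressions first.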
Output / Store collected data
● Use built-in JSON, CSV, XML output at the command line.
– $ scrapy crawl <Spider Name> -t json -o <Output File>
● pipelines.py
– Import the class that defines your data structure.
– Implement the process_item() method.
– Add the pipeline to ITEM_PIPELINES in settings.py.
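A pipeline is a plain Python class with a process_item() method, so it can be sketched and tried without a crawl. A minimal example, assuming items carry the station, temperture and humidity fields from items.py; the class name CleanWeatherPipeline and the cleaning rules are illustrative assumptions, not part of the original project.

```python
class CleanWeatherPipeline(object):
    """Normalise scraped strings before they are exported or stored."""

    def process_item(self, item, spider):
        # Scrapy calls this once per item; the returned item is handed
        # to the next pipeline listed in ITEM_PIPELINES.
        item["station"] = item["station"].strip().upper()
        for field, cast in (("temperture", float), ("humidity", int)):
            raw = item.get(field)
            try:
                item[field] = cast(raw)
            except (TypeError, ValueError):
                item[field] = None  # missing or unparsable reading
        return item


# settings.py (the module path is an assumption based on the project tree):
# ITEM_PIPELINES = {"newproject.pipelines.CleanWeatherPipeline": 300}

if __name__ == "__main__":
    sample = {"station": " hko ", "temperture": "29.1", "humidity": "N/A"}
    print(CleanWeatherPipeline().process_item(sample, spider=None))
```

The number in ITEM_PIPELINES sets the order in which pipelines run, lower values first.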
Django web
framework
Creating django project
$ pip install django
$ django-admin.py startproject <Project name>
myproject
|-- manage.py
|-- myproject
    |-- __init__.py
    |-- settings.py
    |-- urls.py
    |-- wsgi.py
Creating django project
● Define django settings.
– Create database, tables and first django user.
● Create your own django app.
– or add existing django apps.
– Create database tables.
● Activate django admin UI.
– Add URL router to access admin UI.
Creating django project
● settings.py
– Define your database connection.
– Add your own app to INSTALLED_APPS.
– Define your own settings.
Create django app
$ cd <Project Name>
$ python manage.py startapp <App Name>
myproject
|-- manage.py
|-- myproject
|   |-- __init__.py
|   |-- settings.py
|   |-- urls.py
|   |-- wsgi.py
|-- myapp
    |-- admin.py
    |-- __init__.py
    |-- models.py
    |-- tests.py
    |-- views.py
Create django app
● Define your own data model.
● Define and activate your admin UI.
● Furthermore:
– Define your data views.
– Add URL routers to connect with data views.
Define django data model
● Define at models.py.
● Import django data model base class.
● Define your own data model class.
● Create database table(s).
– $ python manage.py syncdb
Define django data model
from django.db import models

# One weather reading per station and report time.
class WeatherData(models.Model):
    reporttime = models.DateTimeField()
    station = models.CharField(max_length=3)
    temperture = models.FloatField(null=True, blank=True)
    humidity = models.IntegerField(null=True, blank=True)
Define django data model
● admin.py
– Import admin class
– Import your own data model class.
– Extend admin class for your data model.
– Register the admin class with the admin.site.register() function.
Define django data model
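from django.contrib import admin
from .models import WeatherData

class WeatherDataAdmin(admin.ModelAdmin):
    # 'windspeed' must exist on the model; the WeatherData class shown on
    # the model slide does not define it.
    list_display = ('reporttime', 'station', 'temperture', 'humidity', 'windspeed')
    list_filter = ['station']

admin.site.register(WeatherData, WeatherDataAdmin)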
Enable django admin ui
● Add to INSTALLED_APPS in settings.py
– django.contrib.admin
● Add the URL router in urls.py
● Start the development server
– $ python manage.py runserver
● Access admin UI
– http://127.0.0.1:8000/admin
Scrapy + Django
Scrapy + Django
● Define django environment at scrapy settings.
– Load django configuration.
● Use Scrapy DjangoItem class
– Instead of the Item and Field classes
– Define which django data model it should be linked with.
● Query and insert data at scrapy pipelines.
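Querying and inserting from a pipeline usually means "store the reading unless it is already there". A minimal sketch of that pattern, with the standard library's sqlite3 standing in for the Django ORM so that it runs on its own; the table name, the StoreWeatherPipeline class and the in-memory database are assumptions for illustration. With Django, the SELECT/INSERT pair would typically become a single get_or_create() call on the model's manager.

```python
import sqlite3


class StoreWeatherPipeline(object):
    """Insert each item once, keyed on (reporttime, station)."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS weatherdata ("
            "reporttime TEXT, station TEXT, temperture REAL, humidity INTEGER)"
        )

    def process_item(self, item, spider):
        key = (item["reporttime"], item["station"])
        # Query first: skip readings that are already stored.
        found = self.db.execute(
            "SELECT 1 FROM weatherdata WHERE reporttime = ? AND station = ?", key
        ).fetchone()
        if found is None:
            self.db.execute(
                "INSERT INTO weatherdata VALUES (?, ?, ?, ?)",
                key + (item.get("temperture"), item.get("humidity")),
            )
            self.db.commit()
        return item

    def count(self):
        """Number of stored rows (helper for trying the sketch out)."""
        return self.db.execute("SELECT COUNT(*) FROM weatherdata").fetchone()[0]
```

Feeding the same reading through process_item() twice leaves a single row, which is what makes repeated crawls safe to run.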
hk0weather
hk0weather
● Weather Data Project.
– https://github.com/sammyfung/hk0weather
– Converts weather information from HKO web pages to JSON data.
– python + scrapy + django
hk0weather
● Hong Kong Weather Data.
– 20+ HKO weather stations in Hong Kong.
– Regional weather data.
– Rainfall data.
– Weather forecast report.
hk0weather
● Set up and activate a Python virtual environment, and install scrapy and django with pip.
● Clone hk0weather from GitHub
– $ git clone https://github.com/sammyfung/hk0weather.git
● Set up the database connection in Django and create the database, tables and first django user.
● Scrape regional weather data
– $ scrapy crawl regionalwx -t json -o regional.json
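Once the crawl has written regional.json, any JSON-aware program can consume it. A minimal sketch, assuming the file holds a JSON array of objects with the station and temperture fields from the item definition in this deck; the real hk0weather output may use different field names, and load_items and warmest_station are hypothetical helpers.

```python
import json


def load_items(text):
    """Parse the JSON array that the crawl writes to its output file."""
    items = json.loads(text)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of items")
    return items


def warmest_station(items):
    """Return (station, temperature) for the highest non-missing reading."""
    readings = [
        (i["temperture"], i["station"])
        for i in items
        if i.get("temperture") is not None
    ]
    if not readings:
        return None
    temp, station = max(readings)
    return station, temp


if __name__ == "__main__":
    # Made-up sample data standing in for the contents of regional.json.
    sample = (
        '[{"station": "HKO", "temperture": 29.1, "humidity": 78},'
        ' {"station": "SHA", "temperture": 30.4, "humidity": 70},'
        ' {"station": "TC", "temperture": null, "humidity": 95}]'
    )
    print(warmest_station(load_items(sample)))
```

To use it on a real crawl, read the file contents with open("regional.json").read() and pass them to load_items().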
DEMO
Thank you!
http://sammy.hk

Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Visualising and forecasting stocks using Dash
Visualising and forecasting stocks using DashVisualising and forecasting stocks using Dash
Visualising and forecasting stocks using Dash
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 

Python, web scraping and content management: Scrapy and Django

  • 1. Python, Web Scraping and Content Management: Scrapy and Django Sammy Fung http://sammy.hk OpenSource.HK Workshop 2014.07.05
  • 2. Sammy Fung ● Perl → PHP → Python ● Linux → Open Source → Open Data ● Freelance → Startup ● http://sammy.hk ● sammy@sammy.hk
  • 4. Can a computer program read this?
  • 5. Is this UI easy to understand?
  • 9. Five Star Open Data
    1. make your stuff available on the Web (whatever format) under an open license.
    2. make it available as structured data (e.g., Excel instead of image scan of a table)
    3. use non-proprietary formats (e.g., CSV instead of Excel)
    4. use URIs to denote things, so that people can point at your stuff.
    5. link your data to other data to provide context.
    5stardata.info by Tim Berners-Lee, the inventor of the Web.
  • 10. Open Data
    ● Data.One
      – Led by the OGCIO of the Hong Kong Government.
      – Uses the term “public sector information” (PSI) instead of “open data”.
      – Much of the data is not available in a machine-readable format with a useful data structure.
      – A lot of data still requires web scraping with customized data extraction to collect useful machine-readable data.
  • 12. Web Scraping: a computer software technique of extracting information from websites. (Wikipedia)
  • 13. Scrapy
    ● Python.
    ● Open source web scraping framework.
    ● Scrape websites and extract structured data.
    ● From data mining to monitoring and automated testing.
  • 14. Scrapy
    ● Define your own data structures.
    ● Write spiders to extract data.
    ● Built-in XPath selectors for extracting data.
    ● Built-in JSON, CSV, XML output.
    ● Interactive shell console, telnet console, logging, and more.
  • 15. scrapyd
    ● Scrapy web service daemon.
    ● pip install scrapyd
    ● Web API with simple Web UI:
      – http://localhost:6800
    ● Web API Documentation:
      – http://scrapyd.readthedocs.org/en/latest/api.html
  • 16. scrapyd
    ● Examples:
      – curl http://localhost:6800/listprojects.json
      – curl http://localhost:6800/listspiders.json?project=default
    ● e.g. {"status": "ok", "spiders": ["pollutant24", "aqhi24"]}
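The two curl calls above can also be made from Python. The following is a minimal sketch using only the standard library; the helper names `build_url`, `parse_spiders` and `list_spiders` are illustrative assumptions, not part of scrapyd, and the base URL is the default address shown on the slide.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SCRAPYD_URL = "http://localhost:6800"  # default address from the slide


def build_url(endpoint, **params):
    """Return the full URL for a scrapyd JSON endpoint such as 'listspiders.json'."""
    query = "?" + urlencode(params) if params else ""
    return f"{SCRAPYD_URL}/{endpoint}{query}"


def parse_spiders(body):
    """Extract the spider names from a listspiders.json response body."""
    reply = json.loads(body)
    if reply.get("status") != "ok":
        raise RuntimeError(f"scrapyd returned status {reply.get('status')!r}")
    return reply["spiders"]


def list_spiders(project="default"):
    """Ask a running scrapyd instance for the spiders of one project."""
    with urlopen(build_url("listspiders.json", project=project)) as response:
        return parse_spiders(response.read().decode("utf-8"))
```

Calling `list_spiders()` requires a scrapyd daemon listening on port 6800; `parse_spiders` can be tried offline on the sample reply from the slide.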
  • 17. Scrapy Installation
      $ apt-get install python python-virtualenv python-pip
      $ virtualenv env
      $ source env/bin/activate
      $ pip install scrapy
  • 18. Creating Scrapy Project
      $ scrapy startproject <new project name>

      newproject
      |-- newproject
      |   |-- __init__.py
      |   |-- items.py
      |   |-- pipelines.py
      |   |-- settings.py
      |   |-- spiders
      |       |-- __init__.py
      |-- scrapy.cfg
  • 19. Creating Scrapy Project
    ● Define your data structure
    ● Write your first spider
      – Test with scrapy shell console
    ● Output / Store collected data
      – Output with built-in supported formats
      – Store to database / object store.
  • 20. Define your data structure (items.py)
      from scrapy.item import Item, Field

      class Hk0WeatherItem(Item):
          reporttime = Field()
          station = Field()
          temperture = Field()
          humidity = Field()
  • 21. Write your first spider
    ● Import the class of your own data structure.
      – $ scrapy genspider -t basic <YOUR SPIDER NAME> <DOMAIN>
      – $ scrapy list
    ● Import any Scrapy classes you require.
      – e.g. Spider, XPath Selector
    ● Extend the parse() method of a Spider class.
    ● Test with scrapy shell console
      – $ scrapy shell <URL>
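The extraction step inside parse() can be sketched without Scrapy. The example below swaps in the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset, in place of Scrapy's selectors. The sample markup, the station codes and the CSS class names are hypothetical; the yielded keys follow the Hk0WeatherItem fields from slide 20.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup; real pages will differ.
SAMPLE_PAGE = """
<html><body>
  <table id="weather">
    <tr><td class="station">HKO</td><td class="temp">29.5</td><td class="rh">78</td></tr>
    <tr><td class="station">SHA</td><td class="temp">30.1</td><td class="rh">74</td></tr>
  </table>
</body></html>
"""


def parse(page_source):
    """Yield one dict per table row, as a spider's parse() would yield items."""
    root = ET.fromstring(page_source.strip())
    for row in root.findall(".//table[@id='weather']/tr"):
        yield {
            "station": row.findtext("td[@class='station']"),
            "temperture": float(row.findtext("td[@class='temp']")),
            "humidity": int(row.findtext("td[@class='rh']")),
        }
```

In a real spider the same XPath expressions would be passed to Scrapy's selector on the response, and the dict would be replaced by an Hk0WeatherItem instance.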
  • 22. Output / Store collected data
    ● Use built-in JSON, CSV, XML output at command line.
      – $ scrapy crawl <Spider Name> -t json -o <Output File>
    ● pipelines.py
      – Import the class of your own data structure.
      – Extend the process_item() function.
      – Add the pipeline to ITEM_PIPELINES in settings.py.
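A pipeline is a plain Python class with a process_item() method, so the database route can be sketched with the standard library alone. The class name `SqlitePipeline`, the table name and the use of SQLite are assumptions made for illustration; the column names follow the item fields from slide 20.

```python
import sqlite3


class SqlitePipeline:
    """Hypothetical item pipeline that stores weather items in SQLite."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS weather ("
            "reporttime TEXT, station TEXT, temperture REAL, humidity INTEGER)"
        )

    def process_item(self, item, spider):
        """Called once per scraped item; returns the item so later pipelines see it."""
        self.db.execute(
            "INSERT INTO weather VALUES (?, ?, ?, ?)",
            (item["reporttime"], item["station"], item["temperture"], item["humidity"]),
        )
        self.db.commit()
        return item


# To enable it, settings.py would carry an entry along these lines
# (module path assumed from the project layout on slide 18):
#   ITEM_PIPELINES = {"newproject.pipelines.SqlitePipeline": 300}
```

The number in ITEM_PIPELINES sets the order in which pipelines run, lower values first.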
  • 24. Creating django project
      $ pip install django
      $ django-admin.py startproject <Project name>

      myproject
      |-- manage.py
      |-- myproject
          |-- __init__.py
          |-- settings.py
          |-- urls.py
          |-- wsgi.py
  • 25. Creating django project
    ● Define django settings.
      – Create database, tables and first django user.
    ● Create your own django app.
      – or add existing django apps.
      – Create database tables.
    ● Activate django admin UI.
      – Add URL router to access admin UI.
  • 26. Creating django project
    ● settings.py
      – Define your database connection.
      – Add your own app to INSTALLED_APPS.
      – Define your own settings.
  • 27. Create django app
      $ cd <Project Name>
      $ python manage.py startapp <App Name>

      myproject
      |-- manage.py
      |-- myproject
      |   |-- __init__.py
      |   |-- settings.py
      |   |-- urls.py
      |   |-- wsgi.py
      |-- myapp
          |-- admin.py
          |-- __init__.py
          |-- models.py
          |-- tests.py
          |-- views.py
  • 28. Create django app
    ● Define your own data model.
    ● Define and activate your admin UI.
    ● Furthermore:
      – Define your data views.
      – Add URL routers to connect with data views.
  • 29. Define django data model
    ● Define at models.py.
    ● Import django data model base class.
    ● Define your own data model class.
    ● Create database table(s).
      – $ python manage.py syncdb
  • 30. Define django data model (models.py)
      from django.db import models

      class WeatherData(models.Model):
          reporttime = models.DateTimeField()
          station = models.CharField(max_length=3)
          temperture = models.FloatField(null=True, blank=True)
          humidity = models.IntegerField(null=True, blank=True)
  • 31. Define django data model
    ● admin.py
      – Import admin class
      – Import your own data model class.
      – Extend admin class for your data model.
      – Register admin class
        ● with admin.site.register() function.
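  • 32. Define django data model (admin.py)
      from django.contrib import admin
      from .models import WeatherData

      class WeatherDataAdmin(admin.ModelAdmin):
          # Every name in list_display, including 'windspeed', must exist on WeatherData.
          list_display = ('reporttime', 'station', 'temperture', 'humidity', 'windspeed')
          list_filter = ['station']

      admin.site.register(WeatherData, WeatherDataAdmin)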
  • 33. Enable django admin UI
    ● Add to INSTALLED_APPS in settings.py
      – django.contrib.admin
    ● Add URL router in urls.py
      – $ python manage.py runserver
    ● Access admin UI
      – http://127.0.0.1:8000/admin
  • 35. Scrapy + Django
    ● Define the django environment in scrapy settings.
      – Load django configuration.
    ● Use the Scrapy DjangoItem class
      – Instead of the Item and Field classes
      – Define which django data model it should be linked with.
    ● Query and insert data in scrapy pipelines.
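Loading the Django configuration from the Scrapy settings module mostly comes down to two steps: making the Django project importable and naming its settings module. A minimal sketch follows; the helper name `configure_django`, the project path and the module name are all hypothetical placeholders.

```python
import os
import sys


def configure_django(project_dir, settings_module):
    """Make a Django project importable and select its settings module."""
    if project_dir not in sys.path:
        sys.path.insert(0, project_dir)
    os.environ["DJANGO_SETTINGS_MODULE"] = settings_module


# Hypothetical path and module name; adjust to the real project layout.
configure_django("/path/to/myproject", "myproject.settings")

# On Django 1.7 and later, the app registry must be loaded before any
# model is imported:
#   import django
#   django.setup()
```

Once the environment is configured, a DjangoItem subclass names its model through the `django_model` class attribute, and pipelines can query or save through the model as usual.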
  • 37. hk0weather
    ● Weather Data Project.
      – https://github.com/sammyfung/hk0weather
      – Converts weather information from HKO webpages to JSON data.
      – python + scrapy + django
  • 38. hk0weather
    ● Hong Kong Weather Data.
      – 20+ HKO weather stations in Hong Kong.
      – Regional weather data.
      – Rainfall data.
      – Weather forecast report.
  • 39. hk0weather
    ● Set up and activate a Python virtual environment, and install scrapy and django with pip.
    ● Clone hk0weather from GitHub
      – $ git clone https://github.com/sammyfung/hk0weather.git
    ● Set up the database connection in Django and create the database, tables and first django user.
    ● Scrape regional weather data
      – $ scrapy crawl regionalwx -t json -o regional.json
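Scrapy's JSON output is a single list of objects, so the resulting file can be read back with the standard library. The sketch below groups records by station; the helper name `load_by_station` is illustrative, and the `station` key is an assumption based on the item fields from slide 20, since the actual fields written by the regionalwx spider are not shown here.

```python
import json
from collections import defaultdict


def load_by_station(path):
    """Group scraped records from a Scrapy JSON output file by 'station'."""
    with open(path, encoding="utf-8") as handle:
        records = json.load(handle)  # the JSON exporter writes one list
    grouped = defaultdict(list)
    for record in records:
        # Records without the assumed 'station' key are grouped under None.
        grouped[record.get("station")].append(record)
    return dict(grouped)
```

For example, `load_by_station("regional.json")` would return a dict mapping each station code to its list of records.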
  • 40. DEMO