Building a Simple Web Crawler, and PTTcrawler

NCKU TechOrange /Tien Yang/吳典陽
Web crawlers, with PTTcrawler as the example

PTTcrawler
❖ git clone https://github.com/wy36101299/PTTcrawler.git
❖ Compare the data you downloaded with the data on the website

crawler SOP
1st: observe the structure of the web page
2nd: parse the part you want
3rd: save to JSON
final: tell your mom "I got it"

html
header
body
footer
div
p
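
To make the nesting of these tags concrete, here is a minimal sketch that uses only Python's standard-library html.parser to print one indented line per opening tag. The sample page and the names OutlinePrinter and outline are invented for illustration; they are not taken from the talk or from PTTcrawler.

```python
from html.parser import HTMLParser

# Hypothetical sample page, invented for illustration only.
SAMPLE_HTML = """
<html>
  <body>
    <header><p>site title</p></header>
    <div><p>first paragraph</p><p>second paragraph</p></div>
    <footer><p>contact</p></footer>
  </body>
</html>
"""


class OutlinePrinter(HTMLParser):
    """Records one indented line per opening tag to show how tags nest."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Indent by the current nesting depth, then go one level deeper.
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag returns us to the parent's level.
        self.depth -= 1


def outline(html):
    """Return the tag outline of an HTML string as a list of lines."""
    printer = OutlinePrinter()
    printer.feed(html)
    printer.close()
    return printer.lines


print("\n".join(outline(SAMPLE_HTML)))
```

Reading the printed outline is a quick way to do step 1 of the SOP: see which tag encloses the part of the page you want.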

Beautiful Soup is a Python library for pulling data out of HTML and XML files.
from bs4 import BeautifulSoup
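
As a rough sketch of step 2 (parse the part you want), the snippet below runs Beautiful Soup on a small inline HTML string. The HTML, the class names, and the variable names are assumptions made up for this example; the real test page may be laid out differently. It assumes the bs4 package is installed and uses Python's built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for illustration; not the talk's test page.
SAMPLE_HTML = """
<html>
  <head><title>crawler test</title></head>
  <body>
    <div id="main">
      <p class="author">someone</p>
      <p class="content">hello crawler</p>
    </div>
  </body>
</html>
"""

# Build the parse tree with the standard-library parser backend.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Text of the <title> tag.
page_title = soup.title.string

# Text of the first <p> whose class is "content".
content = soup.find("p", class_="content").get_text()

# Text of every <p> tag, in document order.
all_paragraphs = [p.get_text() for p in soup.find_all("p")]

print(page_title)
print(content)
print(all_paragraphs)
```

find returns the first match and find_all returns every match, so the usual pattern is to narrow down with find and then loop over find_all.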

http://hpdswy.ee.ncku.edu.tw/~wy/ncku_orange/crawler_test.html
What is the first thing you print?

Try parsing another part of the page
Print whichever part you are interested in

3rd: save to JSON
Why do we choose JSON?

3rd: save to JSON
import json
JSON (JavaScript Object Notation) is a lightweight data-interchange format.
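
A minimal sketch of step 3, using only the standard-library json module. The record and its field names are made up for illustration and are not PTTcrawler's actual output format; save_json and load_json are hypothetical helper names.

```python
import json
import os
import tempfile

# Hypothetical parsed record; the field names are assumptions.
article = {
    "title": "測試文章",
    "author": "someone",
    "content": "hello crawler",
}


def save_json(record, path):
    """Write a record to a UTF-8 JSON file, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, ensure_ascii=False, indent=2)


def load_json(path):
    """Read a JSON file back into Python objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Write to a temporary directory so the example leaves no stray files behind.
out_path = os.path.join(tempfile.mkdtemp(), "article.json")
save_json(article, out_path)

# The round trip gives back the same dictionary.
print(load_json(out_path) == article)
```

Passing ensure_ascii=False keeps Chinese text readable in the saved file instead of turning it into \uXXXX escapes, which makes it easier to compare the file with the original page.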

final: tell your mom "I got it"
You have become a data scientist!
