Python Web Crawling: From Shallow to Shallow

Web Crawling: From Shallow to Shallow
HST - PF

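Today I'm just sharing
a few lessons from hustling freelance gigs for money

Please don't expect any advanced techniques

You will only see some very dirty hacks

First, let's get
the web page downloaded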

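Style 1 - Just try fetching it first
import urllib2
data = urllib2.urlopen(url).read()  # got the data!
Whatever the site is, I always try urllib2 first to see whether it is
easy to scrape. The result decides how much I quote and how much technique the job needs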
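
The snippet above is Python 2. A minimal sketch of the same first attempt on Python 3, assuming `urllib.request` as the stand-in for `urllib2`; the `fetch` helper and its `timeout` default are illustrative names, not from the talk. The demo call uses a `data:` URL so it runs offline.

```python
import urllib.request


def fetch(url, timeout=10):
    """Return the raw response body for url as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


# A data: URL keeps the demo offline; swap in a real http(s) URL to probe a site.
data = fetch("data:text/plain,hello")
print(data)
```

If this plain request already returns the page you want, the site is probably an easy one.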

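Style 2 - Try a different position
import pycurl
import StringIO
curl = pycurl.Curl()
curl.fp = StringIO.StringIO()
curl.setopt(pycurl.URL, url)
curl.setopt(pycurl.WRITEFUNCTION, curl.fp.write)
curl.setopt(pycurl.DEBUGFUNCTION, show)  # important! otherwise the screen gets very noisy (show is a callback you define)
curl.perform()
data = curl.fp.getvalue()  # finally got the data
Some sites are a bit of a pain to scrape and need a cookie or a User-Agent before they will serve
the page. Use this when you are too lazy to use cookielib or craft headers by hand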
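
For comparison, the "cookielib plus hand-made headers" route the slide mentions can be sketched on Python 3 roughly as follows. This assumes `http.cookiejar` (the Python 3 name for `cookielib`); `make_opener` and the User-Agent string are hypothetical, chosen only for illustration.

```python
import http.cookiejar
import urllib.request


def make_opener(user_agent):
    """Build an opener that sends a custom User-Agent and remembers cookies."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # addheaders replaces the default "Python-urllib/x.y" User-Agent.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


opener, jar = make_opener("Mozilla/5.0 (compatible; demo-crawler)")
# data = opener.open(url).read()  # cookies set by the site are kept in `jar`
```

Reusing the same opener for every request is what lets a login cookie from one page carry over to the next.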

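Style 3 - Just open a browser and brute-force it!
from selenium import webdriver
browser = webdriver.Chrome()  # open a Chrome window
browser.get(url)
obj = browser.find_element_by_xx('button')  # pick the target (xx stands for id, name, xpath, ...)
obj.send_keys('HST')  # key in data
obj.click()  # press the button!
data = browser.page_source  # finally got the data
Made for nasty Ajax pages and other weird problems
When you run into trouble, just fire up selenium :)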

When you hit this kind of problem, it means you can't find
your target (haha, what a loser)

Did you think that solved everything?

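Style 4 - The world-ending, too-ugly-to-look-at style
import virtkey  # keystroke simulator
v = virtkey.virtkey()
v.press_unicode(ord(s))
v.release_unicode(ord(s))
from PIL import Image
from pytesser import image_to_string  # OCR
image = Image.open('fuck.png')
data = image_to_string(image)  # finally got the damn data
Hopefully you never get this far. Using this move may cause side effects such as
being unable to take care of yourself or despairing about life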

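Now that we can fetch the data,
fetching it one record at a time is way too slow

You might think: I'll just launch a bunch of .py scripts to fetch it
Never do that, it's dumb... and what you get probably won't be the result you wanted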

The right answer
import threading  # multithreading
import multiprocessing  # multiprocessing
import Queue
Multithreading or multiprocessing is your good friend

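Here is an example
class Worker(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            try:
                url = self.queue.get(timeout=1)
            except Queue.Empty:
                return  # queue drained, worker exits
            GetData(url)  # the actual fetch, defined elsewhere
            self.queue.task_done()

queue = Queue.Queue()
for url in urls:  # urls: the pages to fetch (assumed to be defined elsewhere)
    queue.put(url)
threads = []
for i in range(thread_num):
    worker = Worker(queue)
    worker.setDaemon(True)
    worker.start()
    threads.append(worker)
for worker in threads:
    worker.join()  # wait until every worker has drained the queue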
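
The same queue-and-workers pattern can be sketched on Python 3, where the module is named `queue`. This is an illustrative variant, not the author's code: `crawl`, `fetch`, and `num_threads` are hypothetical names, and the fetch function is passed in so the sketch can be tried without touching the network.

```python
import queue
import threading


def crawl(urls, fetch, num_threads=4):
    """Fetch every URL with a pool of worker threads; return {url: result}."""
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            try:
                result = fetch(url)
            except Exception as exc:  # keep the worker alive on a bad URL
                result = exc
            with lock:
                results[url] = result
            tasks.task_done()

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait for all workers to finish
    return results


# Stand-in fetch function; replace it with a real downloader.
print(crawl(["a", "b", "c"], fetch=str.upper, num_threads=2))
```

Filling the queue before starting the workers means an empty queue reliably signals "done", so no timeout is needed.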

What? You got blocked for scraping too fast?!

Here are the two methods I usually use

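Method 1 - Sleep comes first
import time
import urllib2
data = urllib2.urlopen(url1).read()
time.sleep(1000000000000)  # deliberately absurd; use a realistic delay in practice
data = urllib2.urlopen(url2).read()
Whatever can't be solved, come sleep on it with me first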
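
A slightly more practical take on the sleep trick, as a sketch only: pause a base delay plus a random jitter between requests so the traffic looks less mechanical. `fetch_politely`, `delay`, and `jitter` are hypothetical names and the default values are arbitrary; the `sleep` parameter exists only so the pauses can be observed without actually waiting.

```python
import random
import time


def fetch_politely(urls, fetch, delay=1.0, jitter=0.5, sleep=time.sleep):
    """Fetch URLs one by one, pausing delay + random jitter seconds between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:  # no need to wait before the first request
            sleep(delay + random.uniform(0, jitter))
        results.append(fetch(url))
    return results


# Stand-in fetch and a recording sleep, so the demo finishes instantly.
pauses = []
pages = fetch_politely(["a", "b", "c"], fetch=str.upper, sleep=pauses.append)
print(pages, len(pauses))
```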

Method 2 - Human wave tactics
import ProxyFinder
proxys = ProxyFinder.GetProxy()
If one IP can't fetch it, I still have thousands and thousands more
See ProxyFinder for details
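
How a proxy list might then be used is not shown in the talk. One possible sketch on Python 3, assuming the proxies are plain `http://host:port` strings (the format ProxyFinder actually returns is not stated here): cycle through the list and build one `urllib.request` opener per proxy. `proxy_openers` and the addresses are made up for illustration.

```python
import itertools
import urllib.request


def proxy_openers(proxies):
    """Yield (proxy, opener) pairs forever, cycling through the proxy list."""
    for proxy in itertools.cycle(proxies):
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        yield proxy, urllib.request.build_opener(handler)


# Placeholder addresses; real ones would come from a proxy source.
rotation = proxy_openers(["http://10.0.0.1:8080", "http://10.0.0.2:8080"])
proxy, opener = next(rotation)
# data = opener.open(url).read()  # this request would go out through `proxy`
print(proxy)
```

Calling `next(rotation)` before each request, or after each block, moves on to the next address.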

That's about it for today's little talk

If I've made it this simple and you still can't follow
