Useful Python Libraries

안성현 (SeongHyun Ahn, @SH84AHN)
An introduction to other useful libraries
requests, bs4, scrapy, pycrypto

Today's topics
Requests
Scrapy
pycrypto
BS4
Other useful sites

Requests
urllib's functionality was judged to be fragmented.
Requests is a URL request library that puts it into a more convenient form.
http://docs.python-requests.org/en/latest
There is a function for each HTTP method; just pass it the URL.
> pip install requests
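A rough sketch of the "one function per HTTP method" idea. The helper name send is a hypothetical name used only for illustration, and the example call is commented out because it needs network access.

```python
import requests

# requests exposes one module-level function per HTTP method.
HTTP_METHODS = ("get", "post", "put", "patch", "delete", "head", "options")


def send(method, url, **kwargs):
    """Look up requests.<method> by name and call it with the URL.

    `send` is a hypothetical helper for illustration only.
    """
    method = method.lower()
    if method not in HTTP_METHODS:
        raise ValueError("unsupported HTTP method: %s" % method)
    return getattr(requests, method)(url, timeout=10, **kwargs)


# Example call (needs network access, so it is left commented out):
# r = send("get", "http://docs.python-requests.org/en/latest")
# print(r.status_code)
```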

Requests
When a request needs parameters, pass them as a dictionary (dict(), {}) through the params argument.
http://apis.daum.net/search/blog?q=daum&apikey=DAUM_SEARCH_DEMO_APIKEY
For POST, pass the dictionary through the data parameter.
Adding headers
{'API_VERSION': '2.0', 'Content-Length': '0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'User-Agent': 'python-requests/2.3.0 CPython/2.7.6 Windows/7'}
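The parameter, data, and header handling above can be sketched without touching the network by preparing requests instead of sending them. The URL and API key are the demo values from the slide; using Request(...).prepare() for inspection is a choice made here, not something the slides show.

```python
import requests

BASE_URL = "http://apis.daum.net/search/blog"
params = {"q": "daum", "apikey": "DAUM_SEARCH_DEMO_APIKEY"}

# GET: the dict is encoded into the query string.
get_req = requests.Request("GET", BASE_URL, params=params).prepare()
print(get_req.url)

# POST: the same dict goes into the form-encoded body via data=,
# and extra headers are passed as another dict.
post_req = requests.Request(
    "POST", BASE_URL, data=params, headers={"API_VERSION": "2.0"}
).prepare()
print(post_req.body)
print(post_req.headers["API_VERSION"])
```

To actually send them, the equivalent calls would be requests.get(BASE_URL, params=params) and requests.post(BASE_URL, data=params, headers=...).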

Requests
r = Response object
<class 'requests.models.Response'>
['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getstate__', '__hash__', '__init__',
'__iter__', '__module__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__',
'__str__', '__subclasshook__', '__weakref__', '_content', '_content_consumed', 'apparent_encoding', 'close', 'connection',
'content', 'cookies', 'elapsed', 'encoding', 'headers', 'history', 'is_redirect', 'iter_content', 'iter_lines',
'json', 'links', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']

Requests
UTF-8
{'content-length': '6049', 'content-language': 'ko-KR', 'server': 'nginx/1.4.1', 'connection': 'keep-alive', 'cache-control': 'no-cache', 'date': 'Thu, 26 Jun 2014 05:02:18 GMT', 'content-type': 'text/json;charset=UTF-8'}
200
[]
False
r.text : the response body decoded to text with the response encoding
r.content : the response body as raw bytes, for when the body is binary
r.json() : when the response body is JSON, parses it into a dict
r.raw : the raw socket response; stream=True must be set on the request call
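A small sketch of reading those fields from a Response. The helper name summarize is hypothetical, and it assumes the body is JSON (r.json() raises otherwise).

```python
import requests


def summarize(r):
    """Collect the commonly used fields of a requests Response into a dict.

    `summarize` is a hypothetical helper; it assumes a JSON body.
    """
    return {
        "status_code": r.status_code,
        "encoding": r.encoding,
        "content_type": r.headers.get("content-type"),
        "text": r.text,        # body decoded to str
        "content": r.content,  # body as bytes
        "json": r.json(),      # body parsed from JSON
    }


# Example call (needs network access, so it is left commented out):
# r = requests.get("http://apis.daum.net/search/blog",
#                  params={"q": "daum", "apikey": "DAUM_SEARCH_DEMO_APIKEY"})
# print(summarize(r))
```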

Requests + Auth
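HTTP Basic Auth is supported out of the box.
Pass auth=HTTPBasicAuth(id, pw) to the request function.
HTTPBasicAuth can be omitted: auth=(id, pw) also works.
Previously you had to build an HTTPBasicAuthHandler() and install it before use, which is cumbersome.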
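A minimal sketch showing that the explicit HTTPBasicAuth form and the tuple shorthand produce the same Authorization header. The URL and credentials are made-up placeholders, and the requests are only prepared, not sent.

```python
import requests
from requests.auth import HTTPBasicAuth

URL = "http://example.com/protected"  # hypothetical URL

# Explicit auth object.
explicit = requests.Request("GET", URL, auth=HTTPBasicAuth("user", "pass")).prepare()
# Tuple shorthand: requests wraps it in HTTPBasicAuth for us.
shorthand = requests.Request("GET", URL, auth=("user", "pass")).prepare()

print(explicit.headers["Authorization"])
print(shorthand.headers["Authorization"] == explicit.headers["Authorization"])
```

Sending for real would be requests.get(URL, auth=("user", "pass")).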

Requests + Auth
Other authentication schemes are provided as well.
Digest authentication
OAuth1 authentication
- OAuth2 is not officially supported yet.
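For Digest authentication, a sketch along the same lines. The function name get_with_digest is hypothetical; HTTPDigestAuth is the class shipped in requests.auth.

```python
import requests
from requests.auth import HTTPDigestAuth


def get_with_digest(url, user, password):
    """GET a URL protected by HTTP Digest authentication.

    `get_with_digest` is a hypothetical helper for illustration.
    """
    return requests.get(url, auth=HTTPDigestAuth(user, password), timeout=10)


# Example call (needs network access and a real Digest-protected URL):
# r = get_with_digest("http://example.com/digest", "user", "pass")
```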

BS4(BeautifulSoup4)
HTML parsing library
http://coreapython.hosting.paran.com/etc/beautifulsoup4.html
- Problem with the existing libraries: the way you search for things
HTMLParser is event-based and inconvenient.
> pip install beautifulsoup4

BS4(BeautifulSoup4)
<title>The Dormouse's story</title>
title
The Dormouse's story
[u'title']
<a class="sister"
href="http://example.com/elsie"
id="link1">Elsie</a>
[[u'title'], [u'story'], [u'story']]
- Access the tag you are looking for as a member attribute.
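The output above looks like it comes from the "three sisters" document in the Beautiful Soup documentation; that is an assumption, since the slide's code did not survive. A shortened version of that document reproduces it (under Python 3 the u prefixes disappear).

```python
from bs4 import BeautifulSoup

# Shortened sample document (assumed, not taken from the slides).
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a></p>
<p class="story">...</p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

print(soup.title)         # <title>The Dormouse's story</title>
print(soup.title.name)    # title
print(soup.title.string)  # The Dormouse's story
print(soup.p["class"])    # ['title']
print(soup.a)             # the first <a> tag
print([p["class"] for p in soup.find_all("p")])  # [['title'], ['story'], ['story']]
```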

BS4(BeautifulSoup4)
- Creating from an HTML string
- Creating by reading from a URL
Fetch the HTML at the URL with urllib, requests, or similar.
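Both ways of building a soup might look roughly like this. The function names are hypothetical, and html.parser is picked here only because it ships with Python.

```python
import requests
from bs4 import BeautifulSoup


def soup_from_string(html):
    """Build a soup directly from an HTML string."""
    return BeautifulSoup(html, "html.parser")


def soup_from_url(url):
    """Fetch the page with requests, then hand the HTML text to BeautifulSoup."""
    r = requests.get(url, timeout=10)
    r.raise_for_status()
    return BeautifulSoup(r.text, "html.parser")


print(soup_from_string("<p>hello</p>").p.string)
# soup = soup_from_url("http://example.com")  # needs network access
```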

BS4(BeautifulSoup4)
[Diagram: soup, tag, name, property, NavigableString]
BeautifulSoup objects
- One soup holds many Tag-typed variables.
- A Tag variable is named after the tag itself. <b> => soup.b
- Each Tag has exactly one name, which is the tag name.
- Each Tag can have several attributes: class, id, style…
- NavigableString refers to the string inside a tag.
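The object model above can be poked at with a one-tag document. The sample HTML is made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup('<b class="boldest" id="x">Extremely bold</b>', "html.parser")

tag = soup.b             # Tag variable named after the tag
print(type(tag))         # a bs4 Tag
print(tag.name)          # the tag name: b
print(tag.attrs)         # all attributes as a dict
print(tag["id"])         # a single attribute
print(type(tag.string))  # the text inside the tag is a NavigableString
```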

BS4(BeautifulSoup4)
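- Access by tag name => convenient, but it only returns the first tag with that name!
Gets the first img tag and the first a tag in the current document.
- To get every matching tag, use find_all('tag name')
- Get all the links in the a tags of the current document.
find_all(name, attrs, recursive, text, limit, **kwargs)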
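The difference between attribute access and find_all, sketched on a made-up document:

```python
from bs4 import BeautifulSoup

# Made-up sample document.
html = """
<body>
<img src="/a.png"><img src="/b.png">
<a href="/1">one</a><a href="/2">two</a><a href="/3">three</a>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

first_img = soup.img["src"]  # only the first <img>
first_a = soup.a["href"]     # only the first <a>

all_links = [a["href"] for a in soup.find_all("a")]            # every <a>
first_two = [a["href"] for a in soup.find_all("a", limit=2)]   # stop after two matches

print(first_img, first_a)
print(all_links)
print(first_two)
```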

BS4(BeautifulSoup4)
- Print the meta tags inside head that have both a name and a content attribute.
HTML
Result
- Keyword arguments search by attribute; text searches by string.
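The slide's HTML and result were images and are lost, so this is a reconstruction on a made-up document. Two details are choices made here: the attribute filter goes through attrs because name is already find_all's first argument, and the string search uses string=, which is what newer Beautiful Soup releases (4.4.0 and later) call the text argument.

```python
from bs4 import BeautifulSoup

# Made-up sample document.
html = """<html><head>
<meta charset="utf-8">
<meta name="description" content="demo page">
<meta name="author">
<meta content="orphan">
<title>Demo</title>
</head><body><p>Hello</p><p>World</p></body></html>"""
soup = BeautifulSoup(html, "html.parser")

# True means "the attribute must exist, with any value".
metas = soup.head.find_all("meta", attrs={"name": True, "content": True})
pairs = [(m["name"], m["content"]) for m in metas]
print(pairs)

# Searching by string instead of by tag or attribute.
hello = soup.find_all(string="Hello")
print(hello)
```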

BS4(BeautifulSoup4) example
- Example: fetching a post's title, body, and images

- Use a Firefox tool to collect the tag information for each area

- Getting the post title
<h1 class="entry-title">
<a href="/137">
라디오스타, 김유정 너무 일찍 철 들어 버리다
</a>
</h1>
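Given the markup above, pulling out the title could look like this. It is a sketch that parses the snippet as a string; on the real page the soup would be built from the fetched HTML.

```python
from bs4 import BeautifulSoup

html = """<h1 class="entry-title">
<a href="/137">
라디오스타, 김유정 너무 일찍 철 들어 버리다
</a>
</h1>"""
soup = BeautifulSoup(html, "html.parser")

# class is a Python keyword, so bs4 uses class_ for the keyword form.
link = soup.find("h1", class_="entry-title").a
title = link.get_text(strip=True)  # strip the surrounding newlines
href = link["href"]

print(title)
print(href)
```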

- Getting the post body
<div class="tt_article_useless_p_margin">
<p style="TEXT-ALIGN: justify; LINE-HEIGHT: 2">
<span style="FONT-SIZE: 12pt; FONT-FAMILY: NanumGothic">
예능 프로그램 진짜사나이에 헨리가 나왔을 때 저는 솔직히 …
</span>
</p>
<p style="TEXT-ALIGN: justify; LINE-HEIGHT: 2"></p>
<p style="TEXT-ALIGN: justify; LINE-HEIGHT: 2"></p>
….
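A sketch of collecting the body paragraphs from that div. The paragraph text below is placeholder text, not the blog's content, and skipping empty paragraphs is a choice made here.

```python
from bs4 import BeautifulSoup

# Same structure as the markup above, with placeholder paragraph text.
html = """
<div class="tt_article_useless_p_margin">
<p style="TEXT-ALIGN: justify; LINE-HEIGHT: 2">
<span style="FONT-SIZE: 12pt; FONT-FAMILY: NanumGothic">
First paragraph (placeholder text)
</span>
</p>
<p style="TEXT-ALIGN: justify; LINE-HEIGHT: 2"></p>
<p style="TEXT-ALIGN: justify; LINE-HEIGHT: 2">
Second paragraph (placeholder text)
</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div", class_="tt_article_useless_p_margin")
paragraphs = [p.get_text(strip=True) for p in div.find_all("p")]
body = "\n".join(text for text in paragraphs if text)  # drop the empty <p> tags

print(body)
```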

- Saving the images in the body
{'content-length': '142068', 'via': '1.1 Wcache(1.1)', 'content-disposition': 'inline; filename="10225.jpg"', 'age': '36908', 'expires': 'Fri, 25 Jul 2014 22:52:11 GMT', 'server': 'Apache', 'last-modified': 'Wed, 25 Jun 2014 22:26:01 GMT', 'connection': 'keep-alive', 'date': 'Wed, 25 Jun 2014 22:52:11 GMT', 'content-type': 'image/jpeg'}
{'content-length': '139046', 'via': '1.1 Wcache(1.1)', 'content-disposition': 'inline; filename="12335.jpg"', 'age': '36985', 'expires': 'Fri, 25 Jul 2014 22:50:55 GMT', 'server': 'Apache', 'last-modified': 'Wed, 25 Jun 2014 22:26:02 GMT', 'connection': 'keep-alive', 'date': 'Wed, 25 Jun 2014 22:50:54 GMT', 'content-type': 'image/jpeg'}
{'content-length': '781', 'via': '1.1 Wcache(1.1)', 'accept-ranges': 'bytes', 'expires': 'Fri, 27 Jun 2014 02:30:54 GMT', 'server': 'dws', 'last-modified': 'Mon, 03 Nov 2008 07:05:34 GMT', 'connection': 'keep-alive', 'etag': '"1f04b02-30d-45ac392bfab80"', 'date': 'Thu, 26 Jun 2014 02:30:55 GMT', 'content-type': 'image/gif', 'age': '23785'}
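The slide's code is lost, so this is one possible way to save an image: download it with requests and take the file name from the content-disposition header when the server sends one, as in the first two responses above. Both function names and the fallback name are hypothetical.

```python
import os
import re

import requests


def filename_from_headers(headers, default="image.bin"):
    """Pull filename="..." out of a content-disposition header, if present."""
    disposition = headers.get("content-disposition", "")
    match = re.search(r'filename="([^"]+)"', disposition)
    # basename() guards against a server-supplied path such as ../x.jpg
    return os.path.basename(match.group(1)) if match else default


def save_image(url):
    """Download one image and write its bytes (r.content) to disk."""
    r = requests.get(url, timeout=10)
    r.raise_for_status()
    name = filename_from_headers(r.headers)
    with open(name, "wb") as f:
        f.write(r.content)
    return name


# for img in soup.find_all("img"):   # soup built from the blog page
#     save_image(img["src"])         # needs network access
```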

Scrapy
A web crawling framework: it crawls websites and extracts data from their pages.
http://doc.scrapy.org/en/latest/topics/leaks.html
> pip install Scrapy
- The steps of a single crawl

Scrapy
- Creating a project
scrapy startproject [project_name]

Scrapy
- Designing items.py
- A container for the scraped data, much like a Python dictionary.
- Item : scrapy.Item
- Item attributes : scrapy.Field
Modeling
- Choose the data to collect and model the item accordingly.
- From dmoz.org, collect title, link, and description.
/tutorial/tutorial/items.py

Scrapy
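- Creating a Spider
- spider : a user-written class used to scrape information from a domain.
- It defines the list of URLs to download, how to follow links, and how to extract the content of a page.
How to write one
- Subclass scrapy.Spider.
- Three required attributes/methods
Attribute/method : Description
name : identifier of the spider; must be unique, so give each spider a different name
start_urls : list of URLs
parse() : called with the Response object for each URL. It parses the response data and extracts data from it, converting the scraped data into item objects.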

Scrapy
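- Crawling
From the project directory:
scrapy crawl [spider_name]
- A scrapy Request(url, callback=self.parse) is created for each entry in start_urls.
- Each Request is executed.
- A Response object is returned.
- The Response is passed to parse().
- This is how it works internally.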

Scrapy
ash84 at ubuntu in ~/study/scrapy_test/tutorial
$ scrapy crawl dmoz
2014-06-27 11:30:17+0900 [dmoz] INFO: Spider opened
2014-06-27 11:30:17+0900 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-06-27 11:30:17+0900 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-06-27 11:30:17+0900 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-06-27 11:30:18+0900 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
filename : Resources
2014-06-27 11:30:18+0900 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
filename : Books
2014-06-27 11:30:18+0900 [dmoz] INFO: Closing spider (finished)
2014-06-27 11:30:18+0900 [dmoz] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 516,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 16515,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 6, 27, 2, 30, 18, 119539),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2014, 6, 27, 2, 30, 17, 59434)}
2014-06-27 11:30:18+0900 [dmoz] INFO: Spider closed (finished)
This is the part where the dmoz spider we wrote actually runs.
It crawls the URLs we specified.

Scrapy
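We managed to fetch the web pages we wanted. What next?
Keep only what we need from what was fetched.
Extracting data from a web page
Scrapy Selector : based on XPath and CSS
Example : Description
/html/head/title : the title tag inside head inside html
/html/head/title/text() : the text inside the title tag inside head inside html
//td : all td tags in the whole document
//div[@class="mine"] : among all div tags, only those whose class is mine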

Scrapy
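- The Selector class
- It does the actual parsing and also represents a node.
- Four basic methods
Method : Description
xpath() : returns a list of Selectors for the nodes selected by the XPath expression
css() : returns a list of Selectors for the nodes selected by the CSS expression
extract() : returns the selected data as unicode strings
re() : returns a list of unicode strings extracted by applying the regular expression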

Scrapy
Extracting data
- Use XPath on response.body to extract what you need.
- To work out the XPath, a person has to look at the HTML. Use a Firefox extension.
Example) link title, link url, text

Scrapy
In the parse() function of the existing dmoz_spider.py, replace the code that saves the HTML file
with parsing code that uses Selector.

Scrapy
Putting data into Item objects
- An Item object is a custom Python dictionary-like object.
- Access fields like item['title'].
Convert the data extracted in the spider into Item objects and return them.

Scrapy
scrapy crawl dmoz
{"desc": [], "link": ["/Computers/"], "title": ["Computers"]},
{"desc": [], "link": ["/Computers/Programming/"], "title": ["Programming"]},
{"desc": [], "link": ["/Computers/Programming/Languages/"], "title": ["Languages"]},
{"desc": [], "link": ["/Computers/Programming/Languages/Python/"], "title": ["Python"]},
scrapy crawl dmoz -o items.json
- Extracted data => item => items.json file
- For more complex projects, use an Item Pipeline.

pycrypto
A package that bundles secure hash functions (SHA256 and others) with various encryption algorithms (AES, DES, RSA, ...).
https://launchpad.net/products/pycrypto
> pip install pycrypto

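Summary
- Use requests rather than urllib2.
- OAuth2 is not supported yet; OAuth1, Basic, and Digest authentication are supported.
- HTML parsing: choose according to the target and nature of the parsing.
- Targets with different structures: BS4. ex) parsing data from each of several shopping malls
- Targets with the same structure: Scrapy. ex) parsing pages within one particular blog system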

http://www.pythonweekly.com

https://www.facebook.com/groups/pythonkorea

http://ask.python.kr/questions
