Web Scraping with Python
Softnix Technology
Chakrit Phain
Topic
• HTML parsing
• HTTP Programming: Methods, Cookie, Session
• HTTP Tools: Chrome Developer Tools, Postman
• Python Web Scraping: Regular Expression, DOM parsing
Web Scraping technique
• HTTP programming
• DOM parsing
• Text pattern matching (Regular Expression)
• Etc.
https://en.wikipedia.org/wiki/Web_scraping#HTTP_programming
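Of the techniques listed, text pattern matching is the quickest to try. The sketch below is a minimal illustration, not a robust HTML parser: it assumes the page has a simple `<title>` element and reuses the sample page from the HTTP response example in these slides. The name `extract_title` is hypothetical.

```python
import re

# Sample page, copied from the HTTP response example in these slides.
HTML = """<html>
<head>
<title>An Example Page</title>
</head>
<body> Hello World, this is a very simple HTML document. </body>
</html>"""


def extract_title(html):
    """Return the text inside the first <title> element, or None if absent."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


print(extract_title(HTML))
```

Regular expressions work for small, predictable fragments like this one; for nested or irregular markup, DOM parsing is usually the safer choice.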
HTTP Programming
Methods
• GET
• POST
Cookie and Session
https://en.wikipedia.org/wiki/Web_scraping#HTTP_programming
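The difference between the two methods can be sketched with the standard library: GET carries parameters in the URL, POST carries them in the request body. The endpoint and parameters below are hypothetical, and the requests are only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://www.example.com/search"   # hypothetical endpoint
params = {"q": "web scraping", "page": "1"}  # hypothetical parameters

# GET: parameters travel in the URL query string.
get_request = Request(BASE_URL + "?" + urlencode(params), method="GET")

# POST: the same parameters travel in the request body as form data.
post_request = Request(
    BASE_URL,
    data=urlencode(params).encode("ascii"),
    headers={"Content-Type": "application/x-www-form-urlencoded"},
    method="POST",
)

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.data)
```

Passing either object to `urllib.request.urlopen` would send it; that step is left out here so the sketch runs without network access.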
HTTP Request & Response
https://en.wikipedia.org/wiki/Web_scraping#HTTP_programming

Request
GET /index.html HTTP/1.1
Host: www.example.com

Response
HTTP/1.1 200 OK
Date: Mon, 23 May 2005 22:38:34 GMT
Content-Type: text/html; charset=UTF-8
Content-Encoding: UTF-8
Content-Length: 138
Last-Modified: Wed, 08 Jan 2003 23:11:55 GMT
Server: Apache/1.3.3.7 (Unix) (Red-Hat/Linux)
ETag: "3f80f-1b6-3e1cb03b"
Accept-Ranges: bytes
Connection: close

<html>
<head>
<title>An Example Page</title>
</head>
<body> Hello World, this is a very simple HTML document. </body>
</html>
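A response like the one above has three parts: a status line, headers, and a body separated from the headers by a blank line. The sketch below splits a shortened copy of that response by hand to make the structure visible. The function name `parse_response` is hypothetical, and real code would normally rely on an HTTP library instead.

```python
# Shortened copy of the example response; HTTP lines end with CRLF.
RAW_RESPONSE = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "Connection: close\r\n"
    "\r\n"
    "<html><head><title>An Example Page</title></head></html>"
)


def parse_response(raw):
    """Split a raw HTTP response into (status, reason, headers, body)."""
    # The first blank line separates the header section from the body.
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    # Status line: "<version> <status code> <reason phrase>".
    _version, status, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # header names are case-insensitive
    return int(status), reason, headers, body


status, reason, headers, body = parse_response(RAW_RESPONSE)
print(status, reason, headers["content-type"])
```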
Hands-On #1: HTTP over telnet (5 mins)
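The telnet exercise amounts to typing an HTTP request over a raw TCP connection. A rough Python equivalent, assuming plain HTTP on port 80, is sketched below; `build_get_request` and `fetch_raw` are hypothetical helper names.

```python
import socket


def build_get_request(host, path="/"):
    """Return the bytes one would type into telnet for a simple GET request."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",  # ask the server to close the socket when done
        "",                   # blank line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch_raw(host, path="/", port=80, timeout=10):
    """Send the request over a TCP socket and return the raw response bytes."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_get_request(host, path))
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:  # server closed the connection
                break
            chunks.append(chunk)
    return b"".join(chunks)


print(build_get_request("www.example.com", "/index.html").decode("ascii"))
```

Calling `fetch_raw("www.example.com", "/index.html")` needs network access and returns the status line, headers, and body exactly as telnet would display them.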
HTTP Components
Cookie & Session
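A session is usually tied to a browser through a cookie: the server sends an identifier in a `Set-Cookie` header, and the client returns it in a `Cookie` header on later requests. The sketch below parses a made-up `Set-Cookie` value with the standard library; the cookie name `sessionid` and its value are assumptions for illustration.

```python
from http.cookies import SimpleCookie

# A Set-Cookie header value as a server might send it (made-up example).
set_cookie_value = "sessionid=abc123; Path=/; HttpOnly"

# Parse the header into named cookies with their attributes.
cookie = SimpleCookie()
cookie.load(set_cookie_value)
session_id = cookie["sessionid"].value

# The client echoes the name=value pair back on every later request.
cookie_header = f"sessionid={session_id}"

print(session_id, cookie["sessionid"]["path"])
print("Cookie:", cookie_header)
```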
HTTP Tools
Hands-On #2: Session Hijack (5 mins)
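The idea behind the exercise is that a server typically recognises a session only by the identifier in the cookie, so any client presenting the same value is treated as the same user. The offline simulation below illustrates this with an in-memory dictionary standing in for a server-side session store; `SESSIONS`, `login`, and `whoami` are hypothetical names, and no real site is involved.

```python
import secrets

SESSIONS = {}  # session id -> user name (stand-in for a server-side store)


def login(user):
    """Create a session for the user and return its random identifier."""
    session_id = secrets.token_hex(16)
    SESSIONS[session_id] = user
    return session_id


def whoami(cookie_header):
    """Resolve the user from a Cookie header of the form 'sessionid=<id>'."""
    name, _, value = cookie_header.partition("=")
    if name.strip() != "sessionid":
        return None
    return SESSIONS.get(value.strip())


# The legitimate browser receives this cookie after logging in.
victim_cookie = "sessionid=" + login("alice")

# A second client that copies the same header is indistinguishable to the server.
copied_cookie = victim_cookie
print(whoami(victim_cookie), whoami(copied_cookie))
```

This is why session cookies are usually sent only over HTTPS and marked `Secure` and `HttpOnly`.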
Python Web Scraping
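For structured extraction, parsing the markup is more reliable than pattern matching. The sketch below uses the standard library's event-driven `html.parser` as a lightweight stand-in for a full DOM library; the class name `LinkCollector`, the helper `extract_links`, and the sample markup are assumptions for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return all link targets found in the given HTML text."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


SAMPLE = '<body><a href="/a">A</a> <a href="/b">B</a> <a>no href</a></body>'
print(extract_links(SAMPLE))
```

In a complete scraper, the HTML passed to `extract_links` would come from an HTTP response body rather than a literal string.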