Web data extraction with PHP
Chakrit Phain
Softnix Technology Co.,Ltd.
goo.gl/ytLxtS
Topics
HTTP Programming
• Methods
• Cookie
• Session
HTTP Tools
• Chrome Developer Tools
• Postman
HTTP access with PHP
• fopen()
• file_get_contents()
• cURL
HTML access with DOM
PHP I/O Stream
• Data extraction is the act or process of retrieving data out of (usually unstructured or poorly structured) data sources for further data processing or data storage (data migration).
https://en.wikipedia.org/wiki/Data_extraction
Techniques
• HTTP programming
• DOM parsing
• Text pattern matching (regular expressions)
• Etc.
https://en.wikipedia.org/wiki/Web_scraping
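As a small illustration of the text pattern matching technique, the sketch below pulls the page title out of an HTML string with preg_match(). The helper name extractTitle and the sample markup are assumptions made for this example. Regular expressions suit small, predictable fragments; DOM parsing is usually more robust for whole pages.

```php
<?php
// Hypothetical helper: return the text of the first <title> element, or null.
function extractTitle(string $html): ?string
{
    // "s" lets "." match newlines, "i" ignores case, ".*?" is non-greedy.
    if (preg_match('/<title[^>]*>(.*?)<\/title>/si', $html, $m)) {
        return trim(html_entity_decode($m[1]));
    }
    return null;
}

// Sample markup (an assumption for illustration).
$html = '<html><head><title>An Example Page</title></head><body></body></html>';
echo extractTitle($html), PHP_EOL; // An Example Page
```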
HTTP Programming
• Methods
  • GET
  • POST
• Cookie
• Session
HTTP Request & Response
Request:
GET /index.html HTTP/1.1
Host: www.example.com

Response:
HTTP/1.1 200 OK
Date: Mon, 23 May 2005 22:38:34 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 138
Last-Modified: Wed, 08 Jan 2003 23:11:55 GMT
Server: Apache/1.3.3.7 (Unix) (Red-Hat/Linux)
ETag: "3f80f-1b6-3e1cb03b"
Accept-Ranges: bytes
Connection: close

<html>
<head>
<title>An Example Page</title>
</head>
<body> Hello World, this is a very simple HTML document. </body>
</html>
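To make the structure of a response concrete, here is a minimal sketch that splits a raw HTTP response string into status line, headers, and body. The helper name parseHttpResponse and the shortened sample response are assumptions for illustration. The sketch ignores details such as repeated header names and chunked transfer encoding.

```php
<?php
// Hypothetical helper: split a raw HTTP response into status, headers, and body.
function parseHttpResponse(string $raw): array
{
    // The header block and the body are separated by the first empty line.
    [$head, $body] = array_pad(explode("\r\n\r\n", $raw, 2), 2, '');
    $lines = explode("\r\n", $head);

    // Status line, e.g. "HTTP/1.1 200 OK".
    $statusParts = explode(' ', array_shift($lines), 3);

    $headers = [];
    foreach ($lines as $line) {
        if (strpos($line, ':') === false) {
            continue; // skip malformed lines
        }
        [$name, $value] = explode(':', $line, 2);
        // Header names are case-insensitive; a repeated name overwrites here.
        $headers[strtolower(trim($name))] = trim($value);
    }

    return [
        'version' => $statusParts[0] ?? '',
        'status'  => (int) ($statusParts[1] ?? 0),
        'reason'  => $statusParts[2] ?? '',
        'headers' => $headers,
        'body'    => $body,
    ];
}

// Shortened sample response (an assumption for illustration).
$raw = "HTTP/1.1 200 OK\r\n"
     . "Content-Type: text/html; charset=UTF-8\r\n"
     . "Connection: close\r\n"
     . "\r\n"
     . "<html><body>Hello World</body></html>";

$response = parseHttpResponse($raw);
echo $response['status'], ' ', $response['headers']['content-type'], PHP_EOL;
```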
Hands-On #1: HTTP over telnet
Request:
GET / HTTP/1.1
Host: testing-ground.scraping.pro

Response:
HTTP/1.1 200 OK
Date: Fri, 18 May 2018 05:17:36 GMT
Server: Apache/2.2.22 (Debian)
X-Powered-By: PHP/5.4.4-14+deb7u12
Vary: Accept-Encoding
Content-Length: 3701
Content-Type: text/html
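The telnet exercise can also be reproduced from PHP with a plain TCP socket. The following is a sketch under assumptions: the helper names buildGetRequest and rawHttpGet are invented for illustration, and a Connection: close header (not on the slide) is added so the server ends the connection and reading can finish.

```php
<?php
// Build the same request text that would be typed into telnet.
function buildGetRequest(string $host, string $path = '/'): string
{
    return "GET {$path} HTTP/1.1\r\n"
         . "Host: {$host}\r\n"
         . "Connection: close\r\n" // assumption: added so the read loop can end
         . "\r\n";                 // an empty line ends the header block
}

// Hypothetical helper: send the request over a plain TCP socket on port 80.
function rawHttpGet(string $host, string $path = '/'): string
{
    $socket = fsockopen($host, 80, $errno, $errstr, 10);
    if ($socket === false) {
        throw new RuntimeException("Connect failed: {$errstr} ({$errno})");
    }
    fwrite($socket, buildGetRequest($host, $path));
    $response = stream_get_contents($socket); // status line + headers + body
    fclose($socket);
    return $response === false ? '' : $response;
}

// Print the request text only; this part needs no network access.
echo buildGetRequest('testing-ground.scraping.pro');

// To talk to the server for real (requires network access):
// echo rawHttpGet('testing-ground.scraping.pro');
```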
HTTP Component
https://www.gammon.com.au/forum/?id=12942
Cookie & Session
http://www.hackingarticles.in/beginner-guide-understand-cookies-session-management/
Tools
Hands-On #2: HTTP cookies
http://testing-ground.scraping.pro/login
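A server-side session is typically tied to a cookie that the client must send back on later requests. The sketch below shows that round trip on header lines alone: it collects name=value pairs from Set-Cookie lines and builds the matching Cookie request header. The helper names and the sample cookie values are made up for illustration and are not taken from the exercise site.

```php
<?php
// Hypothetical helper: collect name=value pairs from Set-Cookie header lines.
function collectCookies(array $headerLines): array
{
    $cookies = [];
    foreach ($headerLines as $line) {
        if (stripos($line, 'Set-Cookie:') !== 0) {
            continue; // not a Set-Cookie header
        }
        // Keep only "name=value"; attributes such as Path or HttpOnly follow ";".
        $pair = trim(explode(';', substr($line, strlen('Set-Cookie:')), 2)[0]);
        if (strpos($pair, '=') === false) {
            continue;
        }
        [$name, $value] = explode('=', $pair, 2);
        $cookies[trim($name)] = trim($value);
    }
    return $cookies;
}

// Build the Cookie request header that sends the stored cookies back.
function cookieHeader(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = "{$name}={$value}";
    }
    return 'Cookie: ' . implode('; ', $pairs);
}

// Sample response headers; the cookie names and values are made up.
$responseHeaders = [
    'HTTP/1.1 200 OK',
    'Set-Cookie: PHPSESSID=abc123; path=/; HttpOnly',
    'Set-Cookie: lang=en',
    'Content-Type: text/html',
];
echo cookieHeader(collectCookies($responseHeaders)), PHP_EOL;
// Cookie: PHPSESSID=abc123; lang=en
```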
How to get HTML content with PHP
• fopen()
• file_get_contents()
• cURL
fopen()
Hands-On #3: fopen()
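A minimal sketch of the fopen() approach: open the resource as a stream and read it in chunks until end of file. The helper name readWithFopen is an assumption for illustration. The demo reads a local temporary file so it runs offline; with allow_url_fopen enabled, the same function should accept an http:// URL.

```php
<?php
// Hypothetical helper: read a whole resource through fopen() in chunks.
function readWithFopen(string $url): string
{
    $handle = fopen($url, 'r'); // http:// URLs need allow_url_fopen=On
    if ($handle === false) {
        throw new RuntimeException("Cannot open {$url}");
    }
    $content = '';
    while (!feof($handle)) {
        $chunk = fread($handle, 8192); // read up to 8 KB per call
        if ($chunk === false) {
            break;
        }
        $content .= $chunk;
    }
    fclose($handle);
    return $content;
}

// Offline demo with a local temporary file; swap in a URL such as
// 'http://testing-ground.scraping.pro/' when network access is available.
$tmp = tempnam(sys_get_temp_dir(), 'html');
file_put_contents($tmp, '<html><body>Hello</body></html>');
echo readWithFopen($tmp), PHP_EOL;
unlink($tmp);
```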
file_get_contents()
Hands-On #4: file_get_contents()
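file_get_contents() fetches a resource in a single call, and a stream context can carry HTTP options such as the method, extra headers, and a timeout. The sketch below is one way to wrap it; the helper name fetchWithFileGetContents and the User-Agent value are assumptions for illustration. The demo uses a local file, for which the http options are simply not applied.

```php
<?php
// Hypothetical helper: fetch a page in one call, sending a custom User-Agent.
function fetchWithFileGetContents(string $url, string $userAgent = 'PHP-scraper-demo'): string
{
    // Options under the "http" key are used when $url is an http:// URL.
    $context = stream_context_create([
        'http' => [
            'method'  => 'GET',
            'header'  => "User-Agent: {$userAgent}\r\n",
            'timeout' => 10, // seconds
        ],
    ]);
    $html = file_get_contents($url, false, $context);
    if ($html === false) {
        throw new RuntimeException("Request failed: {$url}");
    }
    return $html;
}

// Offline demo with a local temporary file; pass a URL such as
// 'http://testing-ground.scraping.pro/' when network access is available.
$tmp = tempnam(sys_get_temp_dir(), 'html');
file_put_contents($tmp, '<html><body>Hello</body></html>');
echo fetchWithFileGetContents($tmp), PHP_EOL;
unlink($tmp);
```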
cURL
Hands-On #5: cURL
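cURL gives finer control than the stream functions, for example over redirects, cookies, and POST data. The following sketch assumes the cURL extension (ext-curl) is installed; the helper name fetchWithCurl and the form field names in the usage comment are hypothetical.

```php
<?php
// Hypothetical helper: fetch a URL with the cURL extension (ext-curl).
// Passing a non-empty $postFields array turns the request into a POST.
function fetchWithCurl(string $url, array $postFields = []): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 10,   // seconds
        CURLOPT_COOKIEFILE     => '',   // enable in-memory cookie handling
    ]);
    if ($postFields !== []) {
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postFields));
    }
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }
    return $body; // the handle is released when $ch goes out of scope
}

// Usage (requires network access and ext-curl):
// echo fetchWithCurl('http://testing-ground.scraping.pro/');
// The field names below are placeholders, not the site's real form fields:
// echo fetchWithCurl('http://testing-ground.scraping.pro/login', ['field' => 'value']);
```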
Document Object Model (DOM)
HTML DOM Access
• getElementById
• getElementsByTagName
• DOMXPath
DOMDocument::getElementById
DOMDocument::getElementsByTagName
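A short sketch of both access methods on an in-memory HTML string. The sample markup and the id value "headline" are assumptions for illustration; libxml_use_internal_errors() is used because real pages often contain markup that the parser would otherwise warn about.

```php
<?php
// Sample markup (an assumption for illustration).
$html = '<html><body>'
      . '<h1 id="headline">An Example Page</h1>'
      . '<a href="/one">One</a><a href="/two">Two</a>'
      . '</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

// getElementById returns a single DOMElement, or null when nothing matches.
$headline = $doc->getElementById('headline');
if ($headline !== null) {
    echo $headline->textContent, PHP_EOL;
}

// getElementsByTagName returns a DOMNodeList of all matching elements.
$links = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $links[] = $a->getAttribute('href');
}
echo implode(', ', $links), PHP_EOL;
```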
PHP DOM Class Types
• DOMDocument
  • Represents an entire HTML or XML document
• DOMNodeList
  • Represents an ordered list of nodes
• DOMNode
  • A single item of a DOMNodeList
• DOMElement
  • Extends DOMNode
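The relationship between these classes can be checked directly with instanceof. The sample markup is an assumption for illustration. Note that not every DOMNode is a DOMElement: the text inside an element is a DOMText node.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body><p>first</p><p>second</p></body></html>');

$list  = $doc->getElementsByTagName('p'); // DOMNodeList
$first = $list->item(0);                  // one entry of the list

var_dump($doc instanceof DOMDocument);    // bool(true)
var_dump($list instanceof DOMNodeList);   // bool(true)
var_dump($first instanceof DOMNode);      // bool(true)
var_dump($first instanceof DOMElement);   // bool(true): DOMElement extends DOMNode

// The text inside <p> is a DOMNode as well, but a DOMText, not a DOMElement.
$text = $first->firstChild;
var_dump($text instanceof DOMText);       // bool(true)
var_dump($text instanceof DOMElement);    // bool(false)

echo $list->length, ' ', $first->nodeName, ' ', $first->textContent, PHP_EOL; // 2 p first
```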
DOMXPath
Expression   Description
nodename     Selects all nodes with the name "nodename"
/            Selects from the root node
//           Selects nodes in the document from the current node that match the selection, no matter where they are
.            Selects the current node
..           Selects the parent of the current node
@            Selects attributes
DOMXPath
Path Expression       Description
bookstore             Selects all nodes with the name "bookstore"
/bookstore            Selects from the root node
bookstore/book        Selects all book elements that are children of bookstore
//book                Selects all book elements, no matter where they are in the document
bookstore//book       Selects all book elements that are descendants of the bookstore element, no matter where they are under the bookstore element
//@lang               Selects all attributes that are named lang
/bookstore/book[1]    Selects the first book element that is a child of the bookstore element
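The path expressions above can be tried with DOMXPath on a small document. The bookstore XML below is sample data invented for this sketch; only the expressions themselves come from the table.

```php
<?php
// Sample data (an assumption for illustration).
$xml = <<<XML
<bookstore>
  <book><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book><title lang="en">Learning XML</title><price>39.95</price></book>
</bookstore>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// //book: all book elements, wherever they are in the document.
$books = $xpath->query('//book');
echo $books->length, PHP_EOL; // 2

// /bookstore/book[1]: the first book child of bookstore (XPath counts from 1).
$firstTitle = $xpath->query('/bookstore/book[1]/title')->item(0);
echo $firstTitle->textContent, PHP_EOL; // Harry Potter

// //@lang: all attributes named lang.
$langs = [];
foreach ($xpath->query('//@lang') as $attr) {
    $langs[] = $attr->nodeValue;
}
echo implode(',', $langs), PHP_EOL; // en,en
```

The same DOMXPath object works on a document loaded with loadHTML(), which is how a longer expression would be applied to a scraped page.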
DOM Example
//article/div[2]/div[7]//a
Hands-On #6: IMDB
PHP I/O Stream
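PHP exposes its own I/O streams through the php:// wrapper (for example php://stdin, php://output, php://memory), and they are used with the same fopen()/fwrite() functions as files. As a sketch, the helper below writes extracted rows as CSV into a php://memory stream and reads the result back as a string. The helper name rowsToCsv and the sample rows are assumptions for illustration.

```php
<?php
// Hypothetical helper: serialize rows to CSV using an in-memory stream.
function rowsToCsv(array $rows): string
{
    $stream = fopen('php://memory', 'w+'); // read/write buffer held in RAM
    foreach ($rows as $row) {
        // Delimiter, enclosure, and escape character are passed explicitly.
        fputcsv($stream, $row, ',', '"', '\\');
    }
    rewind($stream);                       // back to the start before reading
    $csv = stream_get_contents($stream);
    fclose($stream);
    return $csv;
}

// Sample rows (made up for illustration).
$rows = [
    ['title', 'year'],
    ['Example', '2018'],
];
echo rowsToCsv($rows);
```

Replacing 'php://memory' with a file path or 'php://output' would send the same CSV to a file or to the response, without changing the writing code.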

Web scraping with PHP

Editor's Notes

  • #4 3-5 min
  • #5 Extract to CSV
  • #13 Review Developer Tools & Postman (5 min)