2. Don’t Expect Too Much ……
Internet Computing
ASP.NET
JSP
PHP
Website Design
3. Outline
1. Different Types of Websites
2. “Scraping” a Website ……
3. What is “VBA in Excel”?
4. Mechanism of applying MS Excel VBA in scraping Webpage Information
5. Scenarios
6. Other Alternatives
7. Pros and Cons
8. Q&A
4. Different Types of Websites
Static websites
• Simplest form of website.
• Site’s content is delivered without the use of server-side processing.
• Common use: Brochure / Advertisement sites.
• Limitation: Cannot provide complex user interactivity.
List of FEHD Public Markets and Cooked Food Markets/Centres
(www.fehd.gov.hk/english/pleasant_environment/tidy_market/Markets_CFC_list.html)
5. Dynamic websites
Different Types of Websites
• Rely on server-side scripting.
• Provision of advanced interactivity.
• Usually use a database to deliver content for individual pages.
• Advantage: An efficient way to manage a large-scale site.
• Limitation: Search Engine Optimisation (SEO) is more difficult to implement.
OpenRice
(www.openrice.com)
6. Content managed websites
Different Types of Websites
• Provide a password-protected interface to add, edit, and remove content from the site.
• Content Management System (CMS): benefits sites with numerous contributors, some of whom may be
working from remote locations.
Wikipedia
(en.wikipedia.org)
7. eCommerce websites
Different Types of Websites
• Dynamic website.
• Major functionality: to process financial transactions.
• Common modules include a shopping basket system, secure online payment system, etc.
• Also includes a content management system, so product details can be added / updated.
TaoBao
(www.taobao.com)
8. Flash websites
Different Types of Websites
• Flash: Software developed by Adobe (previously Macromedia).
• Widely used to generate complex animations, scripted with ActionScript ……
• Limitation: SEO is also impossible to carry out.
We choose the Moon
(www.wechoosethemoon.org)
9. “Scraping” a Website ……
Yellow Pages, Hong Kong
(www.yp.com.hk)
How can I save a copy with name,
address, telephone no., and link
into spreadsheet format?
10. “Scraping” a Website ……
Web scraping is the process of
automatically collecting
information from the World Wide
Web.
Web scraping (also called
web harvesting or web data
extraction) is a computer software
technique of extracting
information from websites.
Computer programs can
“crawl” or “spider”
through web sites so as to
pull out the data.
People often do this to build things like
comparison shopping engines, archive
web pages, or simply download text to a
spreadsheet so that it can be filtered and
analyzed.
What ???
MS Excel 2010
Visual Basic for Applications (VBA)
11. “Scraping” a Website ……
Technique Description
Human copy-and-paste Sometimes even the best web-scraping technology cannot replace a human’s manual
examination and copy-and-paste, and sometimes this may be the only workable solution
when the websites for scraping explicitly set up barriers to prevent machine automation.
Text grepping and regular
expression matching
A simple yet powerful approach to extract information from web pages can be based on
the UNIX grep command or regular expression matching facilities of programming
languages (for instance Perl or Python).
HTTP programming Static and dynamic web pages can be retrieved by posting HTTP requests to the remote
web server using socket programming.
Data mining algorithms Many websites have large collections of pages generated dynamically from an underlying
structured source like a database. Data of the same category are typically encoded into
similar pages by a common script or template. In data mining, a program that detects
such templates in a particular information source, extracts its content and translates it
into a relational form is called a wrapper. Wrapper generation algorithms assume that
input pages of a wrapper induction system conform to a common template and that they
can be easily identified in terms of a URL common scheme.
Document Object Model (DOM)
parsing
By embedding a full-fledged web browser, such as the Internet Explorer or the Mozilla
browser control, programs can retrieve the dynamic contents generated by client side
scripts. These browser controls also parse web pages into a DOM tree, based on which
programs can retrieve parts of the pages.
HTML parsers Some semi-structured data query languages, such as XQuery and HTQL, can be used
to parse HTML pages and to retrieve and transform page content.
Web-scraping software There are many software tools available that can be used to customize web-scraping
solutions. This software may attempt to automatically recognize the data structure of a
page or provide a recording interface that removes the necessity to manually write web-
scraping code, or some scripting functions that can be used to extract and transform
content, and database interfaces that can store the scraped data in local databases.
Semantic annotation recognizing The pages being scraped may embrace metadata or semantic markups and annotations,
which can be used to locate specific data snippets. If the annotations are embedded in
the pages, as Microformat does, this technique can be viewed as a special case of DOM
parsing. In another case, the annotations, organized into a semantic layer, are stored and
managed separately from the web pages, so the scrapers can retrieve data schema and
instructions from this layer before scraping the pages.
Various Techniques, extracted from Wikipedia
(en.wikipedia.org/wiki/Web_scraping)
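Of the techniques listed above, HTTP programming is the one closest to what this talk does with Excel VBA. A minimal sketch using the late-bound MSXML2.XMLHTTP object available on Windows; the URL is a placeholder, not a site from these slides:

```vba
' Sketch of the "HTTP programming" technique in Excel VBA.
' Uses late binding, so no extra references are required.
Sub FetchPage()
    Dim http As Object
    Set http = CreateObject("MSXML2.XMLHTTP")
    http.Open "GET", "http://www.example.com/", False   ' synchronous request
    http.send
    If http.Status = 200 Then
        ' Drop the raw HTML into cell A1 of the active sheet
        ActiveSheet.Range("A1").Value = http.responseText
    End If
End Sub
```

The raw HTML would then be parsed (e.g. via DOM parsing, also listed above) to pull out the wanted fields.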
12. What is “VBA in Excel”?
Workbook
Cell Range
Cell
Worksheets
Excel Object Model
Object Hierarchy
• Workbook contains worksheets
• Worksheet contains ranges
• Range contains cells
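The hierarchy above can be walked directly in VBA; a minimal sketch (the sheet index and range address are arbitrary examples):

```vba
' Walk the Excel Object Model from Workbook down to a Cell.
Sub ShowHierarchy()
    Dim wb As Workbook, ws As Worksheet, rng As Range
    Set wb = ThisWorkbook                ' Workbook
    Set ws = wb.Worksheets(1)            ' Worksheet inside the workbook
    Set rng = ws.Range("A1:B2")          ' Range inside the worksheet
    Debug.Print rng.Cells(1, 1).Address  ' Cell inside the range -> $A$1
End Sub
```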
13. What is “VBA in Excel”?
• Macros will be used to illustrate basic Excel VBA coding
Recording macros creates VBA code automatically
• This code can be studied
Macros are useful in developing the fundamental skills for reading, understanding, and
writing VBA code
• General actions in Excel VBA
Recording a macro
Writing simple VBA procedures
Creating event procedures
Assigning macros to drawing objects in Excel
• Macros are technically defined as units of VBA code
A macro automates a repetitive series of actions in an Excel spreadsheet application
Macros can be recorded in Excel or created directly by writing VBA code in the Visual Basic Editor (VBE)
• In VBA, macros are referred to as procedures
There are two types of procedures
Sub procedures
Function procedures
The macro recorder can only produce sub procedures
• To record a macro, we must know exactly the actions we wish to perform and then use the Macro Recorder
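The two procedure types can be sketched as follows; the names and bodies are illustrative only, not recorded output:

```vba
' A Sub procedure -- the kind the macro recorder produces:
Sub FormatHeader()
    Range("A1").Font.Bold = True
End Sub

' A Function procedure -- returns a value, cannot be produced by the recorder:
Function CellText(r As Range) As String
    CellText = Trim(r.Value)
End Function
```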
15. Mechanism of applying MS Excel VBA in scraping Webpage Information
List of Permitted Premises for the Sale of Restricted Foods (持許可證售賣限制出售食物的處所名單)
(www.fehd.gov.hk/english/licensing/licence-type-permit.html)
Consider this website ……
16. Mechanism of applying MS Excel VBA in scraping Webpage Information
While I click the button Submit ……
17. Mechanism of applying MS Excel VBA in scraping Webpage Information
Some questions ……
1. Can one click obtain the Full List?
Answer: No
2. Is the information shown on a Single Page?
Answer: No
• The list published is restricted to a specific type of permit, which the user must select via the list box.
• An establishment / outlet can obtain more than one type of permit.
……
• Pagination is adopted.
• Each page displays a maximum of 50 records.
• The expected number of pages is 9.
18. Mechanism of applying MS Excel VBA in scraping Webpage Information
Some questions ……
3. Does the search-result hyperlink follow a pattern?
Answer: Yes
http://www.fehd.gov.hk/cgi-bin/fehdnew/licence/ecsvread.pl?field1=Chinese+Herb+Tea+Permit&field2=&field3=&field4=&order_by=field4&order=abc&page=0
field1=Chinese+Herb+Tea+Permit → permit type; e.g. “Cut Fruit Permit” becomes Cut+Fruit+Permit
page=0 → page number; 0 = Page 1, 1 = Page 2, …… n = Page (n + 1)
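The pattern above lends itself to a simple VBA loop; a sketch that only prints the nine expected page URLs (the base URL is the one shown above):

```vba
' Build one URL per result page, following the page= pattern above.
Sub BuildPageUrls()
    Dim base As String, i As Integer
    base = "http://www.fehd.gov.hk/cgi-bin/fehdnew/licence/ecsvread.pl" & _
           "?field1=Chinese+Herb+Tea+Permit&field2=&field3=&field4=" & _
           "&order_by=field4&order=abc&page="
    For i = 0 To 8                  ' page=0 is Page 1, page=8 is Page 9
        Debug.Print base & i        ' each iteration yields one page URL
    Next i
End Sub
```

In the real program, each URL would be fetched and its result table copied into the worksheet.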
19. Mechanism of applying MS Excel VBA in scraping Webpage Information
Some questions ……
4. Does the search result contain the necessary information?
Answer: Yes
• Information: Shopsign (Registered Name / Trading Name), Address
5. Is the search result presented in tabular format?
Answer: Yes
20. Mechanism of applying MS Excel VBA in scraping Webpage Information
Conclusions …… not yet
1. Can one click obtain the Full List? Answer: No
2. Is the information shown on a Single Page? Answer: No
3. Does the search-result hyperlink follow a pattern? Answer: Yes
4. Does the search result contain the necessary information? Answer: Yes
5. Is the search result presented in tabular format? Answer: Yes
Many combinations of these answers are possible, depending on the scenario ……
21. Scenario 1: List of Licensed Hotels, Office of Licensing Authority, Home Affairs Department
Overview
Link:
http://www.hadla.gov.hk/en/hotels/search_h.html
22. Scenario 1: List of Licensed Hotels, Office of Licensing Authority, Home Affairs Department
While I click the button Search ……
Link:
http://www.hadla.gov.hk/cgi-bin/hadlanew/search.pl?client=1&searchtype=1&name=&address=&room=0&district=0&displaytype=2
1. Can one click obtain the Full List? Answer: Yes
2. Is the information shown on a Single Page? Answer: Yes
3. Does the search-result hyperlink follow a pattern? Answer: Yes
4. Does the search result contain the necessary information? Answer: Yes
5. Is the search result presented in tabular format? Answer: Yes
23. Scenario 1: List of Licensed Hotels, Office of Licensing Authority, Home Affairs Department
Layout Design in MS Excel ……
……
1. Create ONE sheet and name it Data.
2. Save as Scenario1.xlsm.
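One way the demonstration could be sketched: since a single URL returns the full list as one table, an Excel web query can pull it straight into the Data sheet. The URL is the Scenario 1 link above; the table index "1" is an assumption and may need adjusting for the actual page:

```vba
' Pull the full licensed-hotels list into the Data sheet via a web query.
Sub ScrapeScenario1()
    Dim conn As String
    conn = "URL;http://www.hadla.gov.hk/cgi-bin/hadlanew/search.pl" & _
           "?client=1&searchtype=1&name=&address=&room=0&district=0&displaytype=2"
    With Worksheets("Data").QueryTables.Add( _
            Connection:=conn, Destination:=Worksheets("Data").Range("A1"))
        .WebSelectionType = xlSpecifiedTables
        .WebTables = "1"                  ' assumed: result table is the first table
        .Refresh BackgroundQuery:=False   ' fetch synchronously
    End With
End Sub
```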
24. Scenario 1: List of Licensed Hotels, Office of Licensing Authority, Home Affairs Department
Demonstration
25. Scenario 2: Secondary School Profiles 2011/2012, Committee on Home-School Co-operation
Overview
Link:
http://www.chsc.hk/ssp/main.php?land_id=1
The Profiles aim at providing school information for P6 parents whose children are
going to participate in the Secondary School Places Allocation (SSPA) System. In order
to choose a secondary school for their children, parents may make reference to the
school information in the Profiles for application of discretionary places in January
each year and for choice-making in central allocation from late April to early May.
Information in the Profiles is provided and checked by schools with reference to their
situations as at September of the school year. Schools may update the web version of
the Profiles on or after mid-December of the school year.
26. Scenario 2: Secondary School Profiles 2011/2012, Committee on Home-School Co-operation
Overview
Link:
http://www.chsc.hk/ssp/sch_list.php?lang_id=1&search_mode=&frmMode=pagebreak&district_id=3&page=1
27. Scenario 2: Secondary School Profiles 2011/2012, Committee on Home-School Co-operation
Overview
(a) Unique hyperlink
(b) Further information by Sub-section
(c) Information for specific Sub-section presented as Table
(a)
(b)
(c)
28. Scenario 2: Secondary School Profiles 2011/2012, Committee on Home-School Co-operation
Overview
1. Can one click obtain the Full List? Answer: No (pagination adopted)
2. Is the information shown on a Single Page? Answer: No
3. Does the search-result hyperlink follow a pattern? Answer: Yes
4. Does the search result contain the necessary information? Answer: No (the school name is hyperlinked)
5. Is the search result presented in tabular format? Answer: No (nested)
29. Scenario 2: Secondary School Profiles 2011/2012, Committee on Home-School Co-operation
Obtain Syntax / Pattern of Hyperlink
1. Choose “Central & Western” from the “District” list box, then press the “Search” button.
2. As there is insufficient information to detect the hyperlink pattern, obtain the detailed hyperlink by pressing the hyperlinked text “2” to access the second page.
……
The hyperlink should contain district_id / sch_type / sch_gender.
30. Scenario 2: Secondary School Profiles 2011/2012, Committee on Home-School Co-operation
Obtain Syntax / Pattern of Hyperlink
3. The hyperlink contains sufficient information for parsing into Excel VBA. Amend the values after district_id= and page= to test whether the search-result page displays properly.
31. Scenario 2: Secondary School Profiles 2011/2012, Committee on Home-School Co-operation
Obtain Syntax / Pattern of Hyperlink
4. Test 1: district_id=1&page=1 (first search-result page for district “Central & Western”)
32. Scenario 2: Secondary School Profiles 2011/2012, Committee on Home-School Co-operation
Obtain Syntax / Pattern of Hyperlink
4. Test 2: district_id=13&page=3 (second search-result page for district “Sha Tin”)
33. Scenario 2: Secondary School Profiles 2011/2012, Committee on Home-School Co-operation
Obtain Syntax / Pattern of Hyperlink
4. Test 3: district_id=19&page=5 (expect nothing / an error) → result: NOTHING
34. Scenario 2: Secondary School Profiles 2011/2012, Committee on Home-School Co-operation
Obtain Syntax / Pattern of Hyperlink
5. The hyperlink is proved ready for use by amending the sch_type= parameter, e.g. sch_type=Aided, to display the list of aided secondary schools.
35. Scenario 2: Secondary School Profiles 2011/2012, Committee on Home-School Co-operation
Layout Design in MS Excel ……
……
1. Create SIX sheets and name them Working1, TotalRecord, Working2, Data1, Working3, Data2.
2. Save as Scenario2.xlsm.
Working1: obtains the total number of records and computes the number of pages per school type.
Working2: passes the figures from Working1 as Integer variables to the VBA program for looping; the string of each individual hyperlink is expected to be retrieved.
Working3: uses the School ID to drive the program loop over each individual webpage.
TotalRecord: pre-made; records the total number of records and assists in computing the number of pages.
Data1: simple list of secondary schools, ordered by school type and then school name; the school name is hyperlinked.
Data2: final dataset, ready for analysis.
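The Working1 → Working2 step can be sketched as a simple loop; the cell holding the page count (B1 on Working1) and the fixed district_id are assumptions for illustration:

```vba
' Read the page count computed on Working1, then loop the search-result
' URL over each page.
Sub LoopResultPages()
    Dim pages As Integer, p As Integer, url As String
    pages = Worksheets("Working1").Range("B1").Value   ' assumed cell
    For p = 1 To pages
        url = "http://www.chsc.hk/ssp/sch_list.php?lang_id=1&search_mode=" & _
              "&frmMode=pagebreak&district_id=3&page=" & p
        Debug.Print url   ' in the real program, each page would be fetched
    Next p
End Sub
```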
36. Scenario 2: Secondary School Profiles 2011/2012, Committee on Home-School Co-operation
Demonstration
39. Pros and Cons
Pros:
1. Cost-effective --> nearly free of charge --> embedded in MS Excel
2. Improved productivity
--> Secondary school profiles: around 400
--> human copy-and-paste: 2-3 days
--> Excel VBA: 5-10 minutes
--> if someone requests it again --> by hand, 2-3 days again
3. Easy to learn
--> Excel Object Model and Object Hierarchy
--> many books, tutorials, and Microsoft Most Valuable Professionals (MVPs) on the Internet
4. Performs the tasks without errors
--> human copy-and-paste: time wasted on proofing --> if two or three records are missed, how would anyone notice?
Cons:
1. Almost tailor-made for each individual website
2. Time must be spent studying the website's operation and source code ......
--> requires a very clear aim
3. Website revamps --> the VBA program must be rewritten
4. Unknown bugs may suddenly appear on other users' PCs even when MS Excel 2010 is installed