Scraping Webpage Information by using MS Excel VBA
HO Kwan-tai, Patrick
10 August 2012
Don’t Expect Too Much ……
Internet Computing
ASP.NET
JSP
PHP
Website Design
Outline
1. Different Types of Websites
2. “Scraping” onto Website ……
3. What is “VBA in Excel”?
4. Mechanism of applying MS Excel VBA in scraping Webpage Information
5. Scenarios
6. Other Alternatives
7. Pros and Cons
8. Q&A
Different Types of Websites
Static websites
• Simplest form of website.
• The site’s content is delivered without server-side processing.
• Common use: Brochure / Advertisement sites.
• Limitation: Cannot provide complex user interactivity.
List of FEHD Public Markets and Cooked Food Markets/Centres
(www.fehd.gov.hk/english/pleasant_environment/tidy_market/Markets_CFC_list.html)
Dynamic websites
• Rely on server-side scripting.
• Provide advanced interactivity.
• Usually use a database to deliver content for individual pages.
• Advantage: An efficient way to manage a large-scale site.
• Limitation: Search Engine Optimisation (SEO) is more difficult to implement.
OpenRice
(www.openrice.com)
Content-managed websites
• Provide a password-protected interface to add, edit and remove content from the site.
• Content Management System (CMS): Benefits sites with numerous contributors, some of whom may be working from remote locations.
Wikipedia
(en.wikipedia.org)
eCommerce websites
• Dynamic website.
• Major functionality: to process financial transactions.
• Common modules such as shopping basket system, secure online payment system, etc.
• Also includes a content management system, so product details can be added / updated.
TaoBao
(www.taobao.com)
Flash websites
• Flash: Software developed by Adobe (previously Macromedia).
• Widely used to generate complex animations, scripted with ActionScript ……
• SEO is also impossible to carry out.
We choose the Moon
(www.wechoosethemoon.org)
“Scraping” onto Website ……
Yellow Pages, Hong Kong
(www.yp.com.hk)
How can I save a copy with name, address, telephone no., and link into spreadsheet format?
Web scraping is the process of automatically collecting information from the World Wide Web.
Web scraping (also called web harvesting or web data extraction) is a computer software technique of extracting information from websites.
Computer programs can “crawl” or “spider” through web sites so as to pull out the data.
People often do this to build things like comparison shopping engines, archive web pages, or simply download text to a spreadsheet so that it can be filtered and analyzed.
What ???
MS Excel 2010
Visual Basic for Applications (VBA)
Technique: Description
Human copy-and-paste: Sometimes even the best web-scraping technology cannot replace a human’s manual examination and copy-and-paste, and sometimes this may be the only workable solution when the websites for scraping explicitly set up barriers to prevent machine automation.
Text grepping and regular expression matching: A simple yet powerful approach to extract information from web pages can be based on the UNIX grep command or regular expression matching facilities of programming languages (for instance Perl or Python).
HTTP programming: Static and dynamic web pages can be retrieved by posting HTTP requests to the remote web server using socket programming.
Data mining algorithms: Many websites have large collections of pages generated dynamically from an underlying structured source like a database. Data of the same category are typically encoded into similar pages by a common script or template. In data mining, a program that detects such templates in a particular information source, extracts its content and translates it into a relational form is called a wrapper. Wrapper generation algorithms assume that input pages of a wrapper induction system conform to a common template and that they can be easily identified in terms of a common URL scheme.
Document Object Model (DOM) parsing: By embedding a full-fledged web browser, such as the Internet Explorer or the Mozilla browser control, programs can retrieve the dynamic contents generated by client-side scripts. These browser controls also parse web pages into a DOM tree, based on which programs can retrieve parts of the pages.
HTML parsers: Some semi-structured data query languages, such as XQuery and the HTQL, can be used to parse HTML pages and to retrieve and transform page content.
Web-scraping software: There are many software tools available that can be used to customize web-scraping solutions. This software may attempt to automatically recognize the data structure of a page or provide a recording interface that removes the necessity to manually write web-scraping code, or some scripting functions that can be used to extract and transform content, and database interfaces that can store the scraped data in local databases.
Semantic annotation recognizing: The pages being scraped may embrace metadata or semantic markups and annotations, which can be used to locate specific data snippets. If the annotations are embedded in the pages, as Microformat does, this technique can be viewed as a special case of DOM parsing. In another case, the annotations, organized into a semantic layer, are stored and managed separately from the web pages, so the scrapers can retrieve data schema and instructions from this layer before scraping the pages.
Various Techniques, extracted from Wikipedia
(en.wikipedia.org/wiki/Web_scraping)
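As a rough illustration of the “text grepping and regular expression matching” technique listed above, the Python sketch below pulls link targets and link text out of an HTML fragment. Python is used only for illustration (the talk itself works in Excel VBA). The sample HTML, the pattern and the function name extract_links are assumptions made up for this example, and a pattern this simple will not cope with arbitrary real-world markup.

```python
import re

# Hypothetical HTML fragment; a real page would have to be downloaded first.
SAMPLE_HTML = """
<ul>
  <li><a href="/shop/101">Example Shop A</a></li>
  <li><a href="/shop/102">Example Shop B</a></li>
</ul>
"""

# Capture the href value and the visible text of each simple anchor tag.
LINK_PATTERN = re.compile(r'<a\s+href="([^"]+)">([^<]+)</a>')


def extract_links(html):
    """Return a list of (url, text) tuples for every anchor the pattern matches."""
    return LINK_PATTERN.findall(html)


if __name__ == "__main__":
    for url, text in extract_links(SAMPLE_HTML):
        print(url, text)
```

Each tuple maps naturally onto one spreadsheet row, with the link in one column and the name in another.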
What is “VBA in Excel”?
Workbook
Cell Range
Cell
Worksheets
Excel Object Model
Object Hierarchy
• Workbook contains worksheets
• Worksheet contains ranges
• Range contains cells
• Macros will be used to illustrate basic Excel VBA coding
  - Recording macros creates VBA code automatically
    - This code can be studied
  - Macros are useful in developing the fundamental skills for reading, understanding, and writing VBA code
• General actions in Excel VBA
  - Recording a macro
  - Writing simple VBA procedures
  - Creating event procedures
  - Assigning macros to drawing objects in Excel
• Macros are technically defined as units of VBA code
  - A macro automates a repetitive series of actions in an Excel spreadsheet application
  - Macros can be recorded in Excel or created directly by writing VBA code in the Visual Basic Editor (VBE)
• In VBA, macros are referred to as procedures
  - There are two types of procedures
    - Sub procedures
    - Function procedures
  - The macro recorder can only produce sub procedures
• To record a macro, we must know exactly the actions we wish to perform and then use the Macro Recorder
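Enabling the Developer tab
1. On the [File] tab, choose [Options] to open the [Excel Options] dialog box.
2. Click [Customize Ribbon] on the left side of the dialog box.
3. Under [Choose commands from] on the left side of the dialog box, select [Popular Commands].
4. Under [Customize the Ribbon] on the right side of the dialog box, select [Main Tabs], and then select the [Developer] check box.
5. Click [OK].
6. After Excel displays the [Developer] tab, note the location of the [Visual Basic], [Macros] and [Macro Security] buttons on the tab.
Getting Started with VBA in Excel 2010
(msdn.microsoft.com/library/office/ee814737.aspx)
Security issues
Click the [Macro Security] button to specify which macros can run and under what conditions. Although malicious macro code can seriously damage your computer, security settings may also prevent you from running some very useful macros and so greatly reduce your productivity.
When you open a workbook that contains macros, if the bar [Security Warning: Macros have been disabled] appears between the ribbon and the worksheet, you can click the [Enable Content] button to enable the macros.
In addition, for security reasons, do not save macros in the default Excel file format (.xlsx); always save them with the special file extension .xlsm.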
Mechanism of applying MS Excel VBA in scraping Webpage Information
List of Permitted Premises for the Sale of Restricted Foods (持許可證售賣限制出售食物的處所名單)
(www.fehd.gov.hk/english/licensing/licence-type-permit.html)
Consider this website ……
When I click the Submit button ……
Some questions ……
1. Can the full list be obtained with one click?
Answer: No
• The published list is restricted to one specific type of permit, which the user must select from the list box.
• An establishment / outlet can hold more than one type of permit.
2. Is the information shown on a single page?
Answer: No
• Pagination is used.
• Each page displays a maximum of 50 records.
• The expected number of pages is 9.
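The relationship between the record total and the number of result pages can be sketched as a ceiling division. The Python snippet below is only an illustration, not the VBA used in the talk. The 50-records-per-page limit comes from the slide, while the function name page_count and the record totals in the usage lines are hypothetical (the slide gives only the page count, 9, which corresponds to any total from 401 to 450).

```python
RECORDS_PER_PAGE = 50  # from the slide: each page displays at most 50 records


def page_count(total_records, per_page=RECORDS_PER_PAGE):
    """Return how many result pages are needed to show total_records."""
    if total_records < 0 or per_page <= 0:
        raise ValueError("total_records must be >= 0 and per_page must be > 0")
    # Ceiling division using integers only, so no rounding surprises.
    return (total_records + per_page - 1) // per_page


if __name__ == "__main__":
    # Hypothetical totals, chosen only to show the page boundaries.
    for total in (400, 401, 450):
        print(total, "records need", page_count(total), "pages")
```

A loop over pages can then run from 1 to page_count(total) instead of guessing where the list ends.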
Some questions ……
3. Is there a pattern in the hyperlink to the search results?
Answer: Yes
http://www.fehd.gov.hk/cgi-bin/fehdnew/licence/ecsvread.pl?field1=Chinese+Herb+Tea+Permit&field2=&field3=&field4=&order_by=field4&order=abc&page=0
field1=Chinese+Herb+Tea+Permit --> permit type, e.g. “Cut Fruit Permit” = Cut+Fruit+Permit
page=0 --> 0 = Page 1, 1 = Page 2, …… n = Page (n + 1)
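The URL pattern above can be generated rather than typed by hand. The following Python sketch (an illustration only; the talk builds its links in Excel VBA) assembles the query string from a permit type and a 1-based page number. The function name build_search_url is an assumption of this example; the parameter names and fixed values are copied from the hyperlink on the slide.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.fehd.gov.hk/cgi-bin/fehdnew/licence/ecsvread.pl"


def build_search_url(permit_type, page_number):
    """Build the search-result URL for a permit type and a 1-based page number."""
    if page_number < 1:
        raise ValueError("page_number is 1-based")
    params = [
        ("field1", permit_type),     # spaces are encoded as '+'
        ("field2", ""),
        ("field3", ""),
        ("field4", ""),
        ("order_by", "field4"),
        ("order", "abc"),
        ("page", page_number - 1),   # the site counts pages from 0
    ]
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url("Chinese Herb Tea Permit", 1))
    print(build_search_url("Cut Fruit Permit", 3))
```

Keeping the page offset inside one function means the rest of a scraper can think in ordinary page numbers.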
Some questions ……
4. Do the search results contain the necessary information?
Answer: Yes
• Information: Shopsign (Registered Name / Trading Name), Address
5. Are the search results presented in tabular format?
Answer: Yes
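When results are presented as a table, each table row can become one spreadsheet row. The Python sketch below shows one way to collect cell text row by row with the standard-library HTML parser. It is an illustration under assumptions, not the method used in the talk: the class and function names and the sample table (with a made-up shop and address) are hypothetical, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_table(html):
    """Return the table content as a list of rows, each a list of cell strings."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample resembling a Shopsign / Address result table.
SAMPLE_TABLE = """
<table>
  <tr><th>Shopsign</th><th>Address</th></tr>
  <tr><td>Example Herbal Tea Shop</td><td>1 Example Street, Kowloon</td></tr>
</table>
"""

if __name__ == "__main__":
    for row in parse_table(SAMPLE_TABLE):
        print(row)
```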
Conclusions …… not yet
1. Can the full list be obtained with one click? Answer: No
2. Is the information shown on a single page? Answer: No
3. Is there a pattern in the hyperlink to the search results? Answer: Yes
4. Do the search results contain the necessary information? Answer: Yes
5. Are the search results presented in tabular format? Answer: Yes
Many combinations of these answers are possible, depending on the scenario ……
Scenario 1: List of Licensed Hotels, Office of Licensing Authority, Home Affairs Department
Overview
Link:
http://www.hadla.gov.hk/en/hotels/search_h.html
When I click the Search button ……
Link:
http://www.hadla.gov.hk/cgi-bin/hadlanew/search.pl?client=1&searchtype=1&name=&address=&room=0&district=0&displaytype=2
1. Can the full list be obtained with one click? Answer: Yes
2. Is the information shown on a single page? Answer: Yes
3. Is there a pattern in the hyperlink to the search results? Answer: Yes
4. Do the search results contain the necessary information? Answer: Yes
5. Are the search results presented in tabular format? Answer: Yes
Layout Design in MS Excel ……
1 Create ONE sheet and name it Data.
2 Save as Scenario1.xlsm.
Demonstration
Scenario 2: Secondary School Profiles 2011/2012, Committee on Home-School Co-operation
Overview
Link:
http://www.chsc.hk/ssp/main.php?land_id=1
The Profiles aim at providing school information for P6 parents whose children are going to participate in the Secondary School Places Allocation (SSPA) System. In order to choose a secondary school for their children, parents may make reference to the school information in the Profiles for application of discretionary places in January each year and for choice-making in central allocation from late April to early May. Information in the Profiles is provided and checked by schools with reference to their situations as at September of the school year. Schools may update the web version of the Profiles on or after mid-December of the school year.
Link to a search result page:
http://www.chsc.hk/ssp/sch_list.php?lang_id=1&search_mode=&frmMode=pagebreak&district_id=3&page=1
(a) Unique hyperlink
(b) Further information by Sub-section
(c) Information for specific Sub-section presented as Table
1. Can the full list be obtained with one click? Answer: No (pagination is used)
2. Is the information shown on a single page? Answer: No
3. Is there a pattern in the hyperlink to the search results? Answer: Yes
4. Do the search results contain the necessary information? Answer: No (the school name is hyperlinked)
5. Are the search results presented in tabular format? Answer: No (nested)
Obtain Syntax / Pattern of Hyperlink
1 Choose “Central & Western” from the “District” list box, and then press the “Search” button.
2 The first result page does not give enough information to detect the hyperlink pattern, so click the hyperlinked text “2” to open the second page and obtain the detailed hyperlink.
The hyperlink should contain district_id / sch_type / sch_gender.
3 The hyperlink now contains sufficient information to be used in Excel VBA. Amend the values after district_id= and page= to test whether the search result page is displayed properly.
4 Test 1: district_id=1&page=1 (First search result page by district “Central & Western”)
Test 2: district_id=13&page=3 (Second search result page by district “Sha Tin”)
Test 3: district_id=19&page=5 (Expect nothing / error should be shown)
Result of Test 3: NOTHING
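The tests above suggest a simple loop: request page 1, 2, 3, …… for a district and stop when a page comes back empty. The Python sketch below illustrates that idea only; it is not the VBA from the demonstration. The URL parameters are copied from the search-result link shown earlier. The names list_page_url, collect_rows and fetch_rows are hypothetical, and fetch_rows stands for whatever caller-supplied routine downloads a page and returns its rows. It is also an assumption that an out-of-range page yields no rows, as Test 3 hints.

```python
from urllib.parse import urlencode

LIST_URL = "http://www.chsc.hk/ssp/sch_list.php"


def list_page_url(district_id, page):
    """Build the school-list URL for one district and one page (pages start at 1)."""
    params = [
        ("lang_id", 1),
        ("search_mode", ""),
        ("frmMode", "pagebreak"),
        ("district_id", district_id),
        ("page", page),
    ]
    return LIST_URL + "?" + urlencode(params)


def collect_rows(district_id, fetch_rows, max_pages=100):
    """Gather rows page by page until fetch_rows returns an empty result.

    fetch_rows is a caller-supplied function: URL in, list of rows out.
    max_pages is a safety limit so a misbehaving site cannot loop forever.
    """
    all_rows = []
    for page in range(1, max_pages + 1):
        rows = fetch_rows(list_page_url(district_id, page))
        if not rows:
            break
        all_rows.extend(rows)
    return all_rows
```

Passing the download step in as a function keeps the loop testable without touching the live website.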
5 The hyperlink is confirmed to be ready for use. The parameter can also be changed to sch_type=, e.g. sch_type=Aided, to display the list of aided secondary schools.
Layout Design in MS Excel ……
1 Create SIX sheets and name them Working1, TotalRecord, Working2, Data1, Working3, Data2.
Working1: obtains the total number of records and computes the number of pages by school type.
TotalRecord: pre-made; records the total number of records and helps compute the number of pages.
Working2: passes the figures on Working1 to the VBA program as Integer variables for looping. The string of each individual hyperlink is expected to be retrieved.
Data1: simple list of secondary schools, ordered by school type and then by school name. The school name is hyperlinked.
Working3: uses the School ID so that the program can loop through and access each individual webpage.
Data2: final dataset, ready for analysis ....
2 Save as Scenario2.xlsm.
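The step from Working2 to Working3 relies on pulling a School ID out of each retrieved hyperlink. The slides do not show the format of those links, so the Python sketch below is purely illustrative: the parameter name sch_id, the example.com addresses and the function name extract_id are all hypothetical stand-ins, and the real parameter has to be read from the site’s own hyperlinks.

```python
from urllib.parse import urlparse, parse_qs


def extract_id(href, param="sch_id"):
    """Return the value of `param` in the link's query string, or None if absent.

    `sch_id` is a made-up parameter name used only for this illustration.
    """
    query = parse_qs(urlparse(href).query)
    values = query.get(param)
    return values[0] if values else None


if __name__ == "__main__":
    # Hypothetical hyperlinks as they might be collected on a working sheet.
    links = [
        "http://example.com/detail.php?lang_id=1&sch_id=101",
        "http://example.com/detail.php?lang_id=1&sch_id=102",
    ]
    print([extract_id(link) for link in links])
```

The resulting list of IDs can then drive the loop that visits each individual school page.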
Demonstration
Other Alternatives
Pros and Cons
Pros:
1. Cost-effective --> nearly free of charge --> built into MS Excel
2. Improves productivity
--> Secondary school profiles: around 400
--> Human copy-and-paste: 2-3 days
--> Excel VBA: 5-10 minutes
--> If someone asks you to collect the data again --> another 2-3 days by hand
3. Easy to learn
--> Excel Object Model and Object Hierarchy
--> Many books, tutorials and Microsoft Most Valuable Professionals (MVPs) on the Internet
4. Performs the task without copy-and-paste errors
--> Human copy-and-paste: time is wasted on proofreading --> if two or three records are missed, how would you notice?
Cons:
1. Almost always tailor-made for an individual website
2. Time must be spent studying how the website operates and its source code ......
--> A very clear aim is needed
3. Website revamp --> the VBA program has to be rewritten
4. Unknown bugs can suddenly appear on other users' PCs even when MS Excel 2010 is already installed
Q&A
(THE END)

More Related Content

Similar to Scraping Webpage Information by using MS Excel VBA

Making Of PHP Based Web Application
Making Of PHP Based Web ApplicationMaking Of PHP Based Web Application
Making Of PHP Based Web ApplicationSachin Walvekar
 
SPSDenver - Wrapping Your Head Around the SharePoint Beast
SPSDenver - Wrapping Your Head Around the SharePoint BeastSPSDenver - Wrapping Your Head Around the SharePoint Beast
SPSDenver - Wrapping Your Head Around the SharePoint BeastMark Rackley
 
JOB PORTALProject SummaryTitle JOB-PORT.docx
JOB PORTALProject SummaryTitle    JOB-PORT.docxJOB PORTALProject SummaryTitle    JOB-PORT.docx
JOB PORTALProject SummaryTitle JOB-PORT.docxchristiandean12115
 
All About Asp Net 4 0 Hosam Kamel
All About Asp Net 4 0  Hosam KamelAll About Asp Net 4 0  Hosam Kamel
All About Asp Net 4 0 Hosam KamelHosam Kamel
 
Intro to .NET for Government Developers
Intro to .NET for Government DevelopersIntro to .NET for Government Developers
Intro to .NET for Government DevelopersFrank La Vigne
 
Information Management & Sharing in Digital Era
Information Management & Sharing in Digital Era Information Management & Sharing in Digital Era
Information Management & Sharing in Digital Era Liaquat Rahoo
 
Web development concepts using microsoft technologies
Web development concepts using microsoft technologiesWeb development concepts using microsoft technologies
Web development concepts using microsoft technologiesHosam Kamel
 
Crash Course HTML/Rails Slides
Crash Course HTML/Rails SlidesCrash Course HTML/Rails Slides
Crash Course HTML/Rails SlidesUdita Plaha
 
Introducing the JotSpot Data Model and API
Introducing the JotSpot Data Model and APIIntroducing the JotSpot Data Model and API
Introducing the JotSpot Data Model and APIScott McMullan
 
Technical SEO Audit – 15 Point Checklist
Technical SEO Audit – 15 Point ChecklistTechnical SEO Audit – 15 Point Checklist
Technical SEO Audit – 15 Point ChecklistNavneet Singh
 
Best Practices to SharePoint Architecture Fundamentals NZ & AUS
Best Practices to SharePoint Architecture Fundamentals NZ & AUSBest Practices to SharePoint Architecture Fundamentals NZ & AUS
Best Practices to SharePoint Architecture Fundamentals NZ & AUSguest7c2e070
 
Continental Airlines 2009 Microsoft SharePoint Conference Presentation
Continental Airlines 2009 Microsoft SharePoint Conference PresentationContinental Airlines 2009 Microsoft SharePoint Conference Presentation
Continental Airlines 2009 Microsoft SharePoint Conference PresentationDenise Wilson
 

Similar to Scraping Webpage Information by using MS Excel VBA (20)

Making Of PHP Based Web Application
Making Of PHP Based Web ApplicationMaking Of PHP Based Web Application
Making Of PHP Based Web Application
 
Tech talk php_cms
Tech talk php_cmsTech talk php_cms
Tech talk php_cms
 
Beyond The MVC
Beyond The MVCBeyond The MVC
Beyond The MVC
 
SPSDenver - Wrapping Your Head Around the SharePoint Beast
SPSDenver - Wrapping Your Head Around the SharePoint BeastSPSDenver - Wrapping Your Head Around the SharePoint Beast
SPSDenver - Wrapping Your Head Around the SharePoint Beast
 
What is web scraping?
What is web scraping?What is web scraping?
What is web scraping?
 
JOB PORTALProject SummaryTitle JOB-PORT.docx
JOB PORTALProject SummaryTitle    JOB-PORT.docxJOB PORTALProject SummaryTitle    JOB-PORT.docx
JOB PORTALProject SummaryTitle JOB-PORT.docx
 
CODE IGNITER
CODE IGNITERCODE IGNITER
CODE IGNITER
 
Resume
ResumeResume
Resume
 
ppt of MANOJ KUMAR.pptx
ppt of MANOJ KUMAR.pptxppt of MANOJ KUMAR.pptx
ppt of MANOJ KUMAR.pptx
 
All About Asp Net 4 0 Hosam Kamel
All About Asp Net 4 0  Hosam KamelAll About Asp Net 4 0  Hosam Kamel
All About Asp Net 4 0 Hosam Kamel
 
Intro to .NET for Government Developers
Intro to .NET for Government DevelopersIntro to .NET for Government Developers
Intro to .NET for Government Developers
 
Information Management & Sharing in Digital Era
Information Management & Sharing in Digital Era Information Management & Sharing in Digital Era
Information Management & Sharing in Digital Era
 
Web development concepts using microsoft technologies
Web development concepts using microsoft technologiesWeb development concepts using microsoft technologies
Web development concepts using microsoft technologies
 
Crash Course HTML/Rails Slides
Crash Course HTML/Rails SlidesCrash Course HTML/Rails Slides
Crash Course HTML/Rails Slides
 
Intro to Application Express
Intro to Application ExpressIntro to Application Express
Intro to Application Express
 
Introducing the JotSpot Data Model and API
Introducing the JotSpot Data Model and APIIntroducing the JotSpot Data Model and API
Introducing the JotSpot Data Model and API
 
Technical SEO Audit – 15 Point Checklist
Technical SEO Audit – 15 Point ChecklistTechnical SEO Audit – 15 Point Checklist
Technical SEO Audit – 15 Point Checklist
 
Best Practices to SharePoint Architecture Fundamentals NZ & AUS
Best Practices to SharePoint Architecture Fundamentals NZ & AUSBest Practices to SharePoint Architecture Fundamentals NZ & AUS
Best Practices to SharePoint Architecture Fundamentals NZ & AUS
 
Continental Airlines 2009 Microsoft SharePoint Conference Presentation
Continental Airlines 2009 Microsoft SharePoint Conference PresentationContinental Airlines 2009 Microsoft SharePoint Conference Presentation
Continental Airlines 2009 Microsoft SharePoint Conference Presentation
 
Srs documentation
Srs documentationSrs documentation
Srs documentation
 

Recently uploaded

Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
(ISHITA) Call Girls Service Hyderabad Call Now 8617697112 Hyderabad Escorts
(ISHITA) Call Girls Service Hyderabad Call Now 8617697112 Hyderabad Escorts(ISHITA) Call Girls Service Hyderabad Call Now 8617697112 Hyderabad Escorts
(ISHITA) Call Girls Service Hyderabad Call Now 8617697112 Hyderabad EscortsCall girls in Ahmedabad High profile
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 

Recently uploaded (20)

Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
(ISHITA) Call Girls Service Hyderabad Call Now 8617697112 Hyderabad Escorts
(ISHITA) Call Girls Service Hyderabad Call Now 8617697112 Hyderabad Escorts(ISHITA) Call Girls Service Hyderabad Call Now 8617697112 Hyderabad Escorts
(ISHITA) Call Girls Service Hyderabad Call Now 8617697112 Hyderabad Escorts
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 

Scraping Webpage Information by using MS Excel VBA

  • 1. Scraping Webpage Information by using MS Excel VBA HO Kwan-tai, Patrick 10 August 2012
  • 2. Don’t Expect Too Much …… Internet Computing ASP.NET JSP PHP Website Design
  • 3. Outline 1. Different Types of Websites 2. “Scraping” onto Website …… 3. What is “VBA in Excel”? 4. Mechanism of applying MS Excel VBA in scraping Webpage Information 5. Scenarios 6. Other Alternatives 7. Pros and Cons 8. Q&A
  • 4. Different Types of Websites Static websites • Simplest form of website. • Site’s content is delivered without use of server side processing. • Common use: Brochure / Advertisement sites. • Limitation: Cannot provide complex user interactivity. List of FEHD Public Markets and Cooked Food Markets/Centres (www.fehd.gov.hk/english/pleasant_environment/tidy_market/Markets_CFC_list.html)
  • 5. Dynamic websites Different Types of Websites • Reply on server side scripting. • Provision of advanced interactivity. • Usually use a database to deliver content for individual pages. • Advantage: Efficient way to manage a large-scale site. • Limitation: Search Engine Optimisation (SEO) more difficulty to implement. OpenRice (www.openrice.com)
  • 6. Content managed websites Different Types of Websites • Provide a password protected interface  add, edit and remove content from the site. • Content Management System (CMS): Benefit on numerous contributors and some may be working from remote locations. Wikipedia (en.wikipedia.org)
  • 7. eCommerce websites Different Types of Websites • Dynamic website. • Major functionality: to process financial transactions. • Common modules such as shopping basket system, secure online payment system, etc. • Also include a content management system, so product details can be added / updated. TaoBao (www.taobao.com)
  • 8. Flash websites Different Types of Websites • Flash: A software developed by Adobe (previously, Macromedia). • Widely used to generate complex animations  ActionScript …… • Also impossible to carry out SEO. We choose the Moon (www.wechoosethemoon.org)
  • 9. “Scraping” onto Website …… Yellow Pages, Hong Kong (www.yp.com.hk) How can I save a copy with name, address, telephone no., and link into spreadsheet format?
  • 10. “Scraping” onto Website …… Web scraping is the process of automatically collecting information from the World Wide Web. Web scraping (also called web harvesting or web data extraction) is a computer software technique of extracting information from websites. Computer programs can “crawl” or “spider” through web sites so as to pull out the data. People often do this to build things like comparison shopping engines, archive web pages, or simply download text to a spreadsheet so that it can be filtered and analyzed. With what ??? MS Excel 2010 Visual Basic for Applications (VBA)
  • 11. “Scraping” onto Website …… Various Techniques, extracted from Wikipedia (en.wikipedia.org/wiki/Web_scraping). Technique: Description
      • Human copy-and-paste: Sometimes even the best web-scraping technology cannot replace a human’s manual examination and copy-and-paste, and sometimes this may be the only workable solution when the websites for scraping explicitly set up barriers to prevent machine automation.
      • Text grepping and regular expression matching: A simple yet powerful approach to extract information from web pages can be based on the UNIX grep command or regular expression matching facilities of programming languages (for instance Perl or Python).
      • HTTP programming: Static and dynamic web pages can be retrieved by posting HTTP requests to the remote web server using socket programming.
      • Data mining algorithms: Many websites have large collections of pages generated dynamically from an underlying structured source like a database. Data of the same category are typically encoded into similar pages by a common script or template. In data mining, a program that detects such templates in a particular information source, extracts its content and translates it into a relational form is called a wrapper. Wrapper generation algorithms assume that input pages of a wrapper induction system conform to a common template and that they can be easily identified in terms of a URL common scheme.
      • Document Object Model (DOM) parsing: By embedding a full-fledged web browser, such as the Internet Explorer or the Mozilla browser control, programs can retrieve the dynamic contents generated by client side scripts. These browser controls also parse web pages into a DOM tree, based on which programs can retrieve parts of the pages.
      • HTML parsers: Some semi-structured data query languages, such as XQuery and the HTQL, can be used to parse HTML pages and to retrieve and transform page content.
      • Web-scraping software: There are many software tools available that can be used to customize web-scraping solutions. This software may attempt to automatically recognize the data structure of a page or provide a recording interface that removes the necessity to manually write web-scraping code, or some scripting functions that can be used to extract and transform content, and database interfaces that can store the scraped data in local databases.
      • Semantic annotation recognizing: The pages being scraped may embrace metadata or semantic markups and annotations, which can be used to locate specific data snippets. If the annotations are embedded in the pages, as Microformat does, this technique can be viewed as a special case of DOM parsing. In another case, the annotations, organized into a semantic layer, are stored and managed separately from the web pages, so the scrapers can retrieve data schema and instructions from this layer before scraping the pages.
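As a rough illustration of the "regular expression matching" technique, the Python sketch below pulls telephone-like numbers out of plain text. The sample text and the assumed 8-digit number layout are hypothetical and are not taken from any of the websites in these slides.

```python
import re

# Hypothetical listing text; the layout is an assumption for illustration only.
SAMPLE = """ABC Trading Co. Tel: 2345 6789
XYZ Restaurant Tel: 3123-4567
No phone listed here"""

# Eight digits, optionally split 4-4 by a single space or hyphen.
PHONE = re.compile(r"\b(\d{4})[ -]?(\d{4})\b")


def find_phones(text):
    """Return every phone-like number found in text, normalised to 'NNNN NNNN'."""
    return [f"{first} {second}" for first, second in PHONE.findall(text)]


if __name__ == "__main__":
    print(find_phones(SAMPLE))
```

A pattern like this is quick to write but brittle: any other 8-digit figure on the page would also match, so the results still need checking.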
  • 12. What is “VBA in Excel”? Excel Object Model: Object Hierarchy (diagram: Workbook, Worksheets, Range, Cell) • Workbook contains worksheets • Worksheet contains ranges • Range contains cells
  • 13. What is “VBA in Excel”?
      • Macros will be used to illustrate basic Excel VBA coding
          - Recording macros creates VBA code automatically
      • This code can be studied
          - Macros are useful in developing the fundamental skills for reading, understanding, and writing VBA code
      • General actions in Excel VBA
          - Recording a macro
          - Writing simple VBA procedures
          - Creating event procedures
          - Assigning macros to drawing objects in Excel
      • Macros are technically defined as units of VBA code
          - A macro automates a repetitive series of actions in an Excel spreadsheet application
          - Macros can be recorded in Excel or created directly by writing VBA code in the Visual Basic Editor (VBE)
      • In VBA, macros are referred to as procedures
          - There are two types of procedures: Sub procedures and Function procedures
          - The macro recorder can only produce sub procedures
      • To record a macro, we must know exactly the actions we wish to perform and then use the Macro Recorder
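  • 14. What is “VBA in Excel”? Enabling the Developer tab
      1. On the [File] tab, choose [Options] to open the [Excel Options] dialog box.
      2. Click [Customize Ribbon] on the left side of the dialog box.
      3. Under [Choose commands from] on the left side of the dialog box, select [Popular Commands].
      4. Under [Customize the Ribbon] on the right side of the dialog box, select [Main Tabs], and then select the [Developer] check box.
      5. Click [OK].
      6. After Excel displays the [Developer] tab, note the location of the [Visual Basic], [Macros], and [Macro Security] buttons on the tab.
    Getting Started with VBA in Excel 2010 (msdn.microsoft.com/library/office/ee814737.aspx)
    Security issues: Click the [Macro Security] button to specify which macros can run and under what conditions. Although malicious macro code can seriously damage your computer, security settings may also prevent you from running some very useful macros and greatly reduce your productivity. When you open a workbook that contains macros, if [Security Warning: Macros have been disabled] appears between the ribbon and the worksheet, you can click the [Enable Content] button to enable the macros. In addition, for security reasons, do not save macros in the default Excel file format (.xlsx); always save them with the special file extension .xlsm instead.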
  • 15. Mechanism of applying MS Excel VBA in scraping Webpage Information List of Permitted Premises for the Sale of Restricted Foods (持許可證售賣限制出售食物的處所名單) (www.fehd.gov.hk/english/licensing/licence-type-permit.html) Consider this website ……
  • 16. Mechanism of applying MS Excel VBA in scraping Webpage Information When I click the Submit button ……
  • 17. Mechanism of applying MS Excel VBA in scraping Webpage Information Some questions …… 1. Can the full list be obtained with one click? Answer: No • The published list is restricted to one specific permit type, which the user must select from the list box. • An establishment / outlet can hold more than one type of permit. 2. Is the information shown on a single page? Answer: No • Pagination is used. • Each page displays a maximum of 50 records. • The expected number of pages is 9.
  • 18. Mechanism of applying MS Excel VBA in scraping Webpage Information Some questions …… 3. Does the hyperlink to the search results follow a pattern? Answer: Yes http://www.fehd.gov.hk/cgi-bin/fehdnew/licence/ecsvread.pl?field1=Chinese+Herb+Tea+Permit&field2=&field3=&field4=&order_by=field4&order=abc&page=0 field1=Chinese+Herb+Tea+Permit --> Permit Type, e.g. “Cut Fruit Permit” = Cut+Fruit+Permit page=0 --> 0 = Page 1, 1 = Page 2, …… n = Page (n + 1)
  • 19. Mechanism of applying MS Excel VBA in scraping Webpage Information Some questions …… 4. Do the search results contain the necessary information? Answer: Yes • Information: Shopsign (Registered Name / Trading Name), Address 5. Are the search results presented in tabular format? Answer: Yes
  • 20. Mechanism of applying MS Excel VBA in scraping Webpage Information Conclusions …… not yet 1. Can the full list be obtained with one click? Answer: No 2. Is the information shown on a single page? Answer: No 3. Does the hyperlink to the search results follow a pattern? Answer: Yes 4. Do the search results contain the necessary information? Answer: Yes 5. Are the search results presented in tabular format? Answer: Yes Many combinations of these answers are possible, depending on the scenario ……
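The hyperlink pattern above can be sketched in code. The deck's own demonstration uses Excel VBA; the Python sketch below only illustrates the logic under stated assumptions. The function names `page_count` and `result_url` are hypothetical, and the record total passed to `page_count` is an invented figure chosen to reproduce the 9 pages mentioned earlier.

```python
import math
from urllib.parse import urlencode

# Base address and parameter names are copied from the hyperlink on the slide.
BASE = "http://www.fehd.gov.hk/cgi-bin/fehdnew/licence/ecsvread.pl"
PAGE_SIZE = 50  # each result page displays a maximum of 50 records


def page_count(total_records, page_size=PAGE_SIZE):
    """Number of result pages needed to show total_records records."""
    return math.ceil(total_records / page_size)


def result_url(permit_type, page_number):
    """Build the search-result URL for a permit type and a 1-based page number.

    The site's page parameter is 0-based (page=0 is Page 1), so 1 is subtracted.
    """
    query = urlencode({
        "field1": permit_type,  # spaces are encoded as '+'
        "field2": "",
        "field3": "",
        "field4": "",
        "order_by": "field4",
        "order": "abc",
        "page": page_number - 1,
    })
    return f"{BASE}?{query}"


if __name__ == "__main__":
    # 430 is a made-up record total used only to exercise the helper.
    for page in range(1, page_count(430) + 1):
        print(result_url("Chinese Herb Tea Permit", page))
```

Keeping the 1-based page number in the caller and converting to the 0-based parameter in one place avoids off-by-one mistakes when looping over pages.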
  • 21. Scenario 1: List of Licensed Hotels, Office of Licensing Authority, Home Affairs Department Overview Link: http://www.hadla.gov.hk/en/hotels/search_h.html
  • 22. Scenario 1: List of Licensed Hotels, Office of Licensing Authority, Home Affairs Department When I click the Search button …… Link: http://www.hadla.gov.hk/cgi-bin/hadlanew/search.pl?client=1&searchtype=1&name=&address=&room=0&district=0&displaytype=2 1. Can the full list be obtained with one click? Answer: Yes 2. Is the information shown on a single page? Answer: Yes 3. Does the hyperlink to the search results follow a pattern? Answer: Yes 4. Do the search results contain the necessary information? Answer: Yes 5. Are the search results presented in tabular format? Answer: Yes
  • 23. Scenario 1: List of Licensed Hotels, Office of Licensing Authority, Home Affairs Department Layout Design in MS Excel …… 1. Create ONE sheet and name it Data. 2. Save as Scenario1.xlsm.
  • 24. Scenario 1: List of Licensed Hotels, Office of Licensing Authority, Home Affairs Department Demonstration
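Scenario 1 is the simplest case: one page holding one table. As a hedged illustration of turning such a table into rows (the live demonstration used Excel VBA, which is not reproduced here), the Python sketch below parses an HTML table with the standard library. The sample markup and hotel details are hypothetical, not copied from the real page.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every table cell, grouped into rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace so each cell is one clean string.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_table(html):
    """Return the table rows found in html as lists of cell strings."""
    parser = TableRows()
    parser.feed(html)
    return parser.rows


# Hypothetical markup standing in for a search-result page.
SAMPLE_HTML = """<table>
<tr><th>Name</th><th>Address</th></tr>
<tr><td>Hotel A</td><td>1 Example Road,
 Kowloon</td></tr>
</table>"""

if __name__ == "__main__":
    for row in parse_table(SAMPLE_HTML):
        print(row)
```

Each returned row maps naturally onto one spreadsheet row, for example in the Data sheet described above.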
  • 25. Scenario 2: Secondary School Profiles 2011/2012, Committee on Home-School Co-operation Overview Link: http://www.chsc.hk/ssp/main.php?land_id=1 The Profiles aim at providing school information for P6 parents whose children are going to participate in the Secondary School Places Allocation (SSPA) System. In order to choose a secondary school for their children, parents may make reference to the school information in the Profiles for application of discretionary places in January each year and for choice-making in central allocation from late April to early May. Information in the Profiles is provided and checked by schools with reference to their situations as at September of the school year. Schools may update the web version of the Profiles on or after mid-December of the school year.
  • 26. Scenario 2: Secondary School Profiles 2011/2012, Committee on Home-School Co-operation Overview Link: http://www.chsc.hk/ssp/sch_list.php?lang_id=1&search_mode=&frmMode=pagebreak&district_id=3&page=1
  • 27. Scenario 2: Secondary School Profiles 2011/2012, Committee on Home-School Co-operation Overview (a) Unique hyperlink (b) Further information by Sub-section (c) Information for specific Sub-section presented as Table
  • 28. Scenario 2: Secondary School Profiles 2011/2012, Committee on Home-School Co-operation Overview 1. Can the full list be obtained with one click? Answer: No (pagination is used) 2. Is the information shown on a single page? Answer: No 3. Does the hyperlink to the search results follow a pattern? Answer: Yes 4. Do the search results contain the necessary information? Answer: No (the school name is hyperlinked) 5. Are the search results presented in tabular format? Answer: No (Nested)
  • 29. Scenario 2: Secondary School Profiles 2011/2012, Committee on Home-School Co-operation Obtain Syntax / Pattern of Hyperlink 1. Choose “Central & Western” from the “District” list box, and then press the “Search” button. 2. The resulting hyperlink carries insufficient information to detect the pattern, so press the hyperlinked text “2” to access the second page and obtain the detailed hyperlink. …… The hyperlink should contain district_id / sch_type / sch_gender
  • 30. Scenario 2: Secondary School Profiles 2011/2012, Committee on Home-School Co-operation Obtain Syntax / Pattern of Hyperlink 3. The hyperlink now contains sufficient information to pass into Excel VBA. Amend the values after district_id= and page= to test whether the search result page is displayed properly.
  • 31. Scenario 2: Secondary School Profiles 2011/2012, Committee on Home-School Co-operation Obtain Syntax / Pattern of Hyperlink 4. Test 1: district_id=1&page=1 (First search result page by district “Central & Western”)
  • 32. Scenario 2: Secondary School Profiles 2011/2012, Committee on Home-School Co-operation Obtain Syntax / Pattern of Hyperlink 4. Test 2: district_id=13&page=3 (Second search result page by district “Sha Tin”)
  • 33. Scenario 2: Secondary School Profiles 2011/2012, Committee on Home-School Co-operation Obtain Syntax / Pattern of Hyperlink 4. Test 3: district_id=19&page=5 (Expect nothing / an error to be shown) Result: NOTHING
  • 34. Scenario 2: Secondary School Profiles 2011/2012, Committee on Home-School Co-operation Obtain Syntax / Pattern of Hyperlink 5. The hyperlink is proven ready for use: amending the sch_type= parameter, e.g. sch_type=Aided, displays the list of aided secondary schools.
  • 35. Scenario 2: Secondary School Profiles 2011/2012, Committee on Home-School Co-operation Layout Design in MS Excel ……
      1. Create SIX sheets and name them Working1, TotalRecord, Working2, Data1, Working3, Data2.
      2. Save as Scenario2.xlsm.
      • Working1: obtains the total number of records and computes the number of pages by school type.
      • Working2: takes the figures from Working1 and passes them as Integer variables to the VBA program for looping. The hyperlink string of each individual school is expected to be retrieved.
      • Working3: uses the School ID and has the program loop in order to access each individual webpage.
      • TotalRecord: pre-made; records the total number of records and assists in computing the number of pages.
      • Data1: simple list of secondary schools, ordered by school type and then by school name. The school name is hyperlinked.
      • Data2: final dataset, ready for analysis.
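The tested pattern lends itself to a simple loop: request page after page for one district until a page comes back empty, as in Test 3. The Python sketch below illustrates this under several assumptions. Page numbering is taken to start at 1 (from Test 1), the helper names are hypothetical, and the page download is replaced by a stand-in function so the sketch runs without touching the website.

```python
from urllib.parse import urlencode

# Base address and parameter names are copied from the overview link.
BASE = "http://www.chsc.hk/ssp/sch_list.php"


def list_url(district_id, page):
    """Build the school-list URL for one district and one result page."""
    return BASE + "?" + urlencode({
        "lang_id": 1,
        "search_mode": "",
        "frmMode": "pagebreak",
        "district_id": district_id,
        "page": page,
    })


def collect_pages(district_id, fetch_rows, max_pages=100):
    """Gather records page by page until a page returns nothing.

    fetch_rows is any callable that takes a URL and returns a list of records;
    max_pages is a safety cap so that a misbehaving site cannot loop forever.
    """
    records = []
    for page in range(1, max_pages + 1):
        rows = fetch_rows(list_url(district_id, page))
        if not rows:  # empty page: past the last page of results
            break
        records.extend(rows)
    return records


def fake_fetch(url):
    """Stand-in for a real download; returns invented school names."""
    pages = {
        list_url(3, 1): ["School A", "School B"],
        list_url(3, 2): ["School C"],
    }
    return pages.get(url, [])


if __name__ == "__main__":
    print(collect_pages(3, fake_fetch))
```

Stopping at the first empty page removes the need to know the page count in advance, whereas the sheet layout in these slides computes the page count from the record total instead.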
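The Working2 step, retrieving the hyperlink behind each school name, can be illustrated as follows. This is a Python sketch, not the VBA used in the demonstration. The detail-page address `sch_detail.php` and the parameter name `sch_id` are hypothetical placeholders, since the slides do not show the actual link format.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse


class LinkCollector(HTMLParser):
    """Collect (link text, href) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the link being read, or None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((text, self._href))
            self._href = None


def school_links(html, id_param="sch_id"):
    """Return (name, school id, href) for links whose query carries id_param.

    id_param is an assumed parameter name; links without it are skipped.
    """
    parser = LinkCollector()
    parser.feed(html)
    found = []
    for name, href in parser.links:
        ids = parse_qs(urlparse(href).query).get(id_param)
        if ids:
            found.append((name, ids[0], href))
    return found


# Hypothetical markup standing in for one search-result page.
SAMPLE_HTML = (
    '<a href="main.php">Home</a>'
    '<a href="sch_detail.php?lang_id=1&amp;sch_id=101">Example College</a>'
)

if __name__ == "__main__":
    print(school_links(SAMPLE_HTML))
```

The extracted IDs play the role of the School ID list that Working3 loops over to visit each school's own page.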
  • 36. Scenario 2: Secondary School Profiles 2011/2012, Committee on Home-School Co-operation Demonstration
  • 39. Pros and Cons
      Pros:
      1. Cost-effective --> Nearly free of charge --> embedded in MS Excel
      2. Improves productivity --> Secondary school profiles: around 400 --> human copy-and-paste: 2-3 days --> Excel VBA: 5-10 minutes --> if someone asks you to get it again --> 2-3 days again
      3. Easy to learn --> Excel Object Model and Object Hierarchy --> Many books, tutorials and Microsoft Most Valuable Professionals (MVPs) on the Internet
      4. Performs the tasks without errors --> Human copy-and-paste: time wasted on proofreading --> if two or three records are missed, how would you notice?
      Cons:
      1. Almost always tailor-made for an individual website
      2. Time must be spent studying the website's operation and source code ...... --> Requires a very clear aim
      3. Website revamping --> Leads to rewriting the VBA program
      4. Unknown bugs may suddenly appear on other users' PCs even when MS Excel 2010 is already installed.
  • 40. Q&A
  • 41. Scraping Webpage Information by using MS Excel VBA HO Kwan-tai, Patrick 10 August 2012 (THE END)