eSUG fall 2011

My presentation on eSUG, Fall 2011

eSUG fall 2011 Presentation Transcript

  • Accessing and Extracting Data from Internet Using SAS
    George Zhu, Sunita Ghosh
    Alberta Health Services - Cancer Care
    Oct 26, 2011
    Edmonton SAS User Group (eSUG) Meeting
  • Outline
    1 Introduction
    2 Web Elements: URLs, HTML
    3 Tools: SAS Functions, SAS Statements, cURL, Perl/LWP
    4 Examples
      Example 1: Download .csv file
      Example 2: Get the list of eSUG presentations
      Example 3: Find out all registered clinical trials
      Example 4: Get and plot today's temperature
      Example 5: Download City of Edmonton job postings
  • Introduction
    Suppose you want to:
      • Download financial time series from FRED, Yahoo Finance, OANDA, etc.
      • Obtain a list of clinical trials conducted in Canada from the www.clinicaltrials.gov website
      • Download presentations from the eSUG webpage, or all proceedings collected on Lex Jansen's website
      • Monitor career websites and be notified when suitable positions become available
      • Get Twitter feeds on some hot topics and save them in data sets for further analysis or text mining
      • And many more...
    Question: Is it possible to do these using SAS?
  • Introduction
    Yes, mostly! However, not every kind of web access can be done with SAS alone.
  • Introduction
    SAS is capable of web mining:
      • SAS has its own web access tools: filename url, filename socket, etc.
      • SAS has data extraction tools: character string functions and the powerful Perl Regular Expression functions and CALL routines
      • SAS provides two mechanisms for integrating specialized web access programs to get the work done: the X statement and the filename pipe statement
  • Elements of Web Accessing
  • Elements of Web Accessing
    For web accessing, we need to know some basic elements of the Internet:
      • Where the information is located: URL (Uniform Resource Locator)
      • How the information is organized: HTML (Hypertext Markup Language)
      • How to communicate (send and receive) with the server holding the information: HTTP (Hypertext Transfer Protocol), HTTPS (secured HTTP), FTP, and more
      • Request header and response header: these carry information that is critical for more complicated websites, such as cookies and response status codes
  • URLs
    A URL is the address we type, or that is displayed, in the address bar of the web browser, such as:
      http://www.sas.com/offices/NA/canada/en/edmonton.html
      http://maps.google.com/maps?q=edmonton
    It usually has 3 components:
      • Transfer protocol: http, https, ftp, etc.
      • IP address or hostname: www.sas.com and maps.google.com
      • The path and file name of the webpage on the host (web server): /offices/NA/canada/en (path), edmonton.html (filename). The third part can also be a function name (maps) followed by parameter=value pairs (q=edmonton).
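As a small illustration of the three components, the two example URLs from the slide can be taken apart programmatically. This is only a sketch in Python using the standard `urllib.parse` module, not the SAS approach the talk uses; the helper name `describe` is made up for this example.

```python
from urllib.parse import urlsplit, parse_qs

def describe(url):
    """Split a URL into protocol, host, path and parameter=value pairs."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,          # e.g. http, https, ftp
        "host": parts.netloc,              # hostname or IP address
        "path": parts.path,                # path + file name, or a function name
        "params": parse_qs(parts.query),   # parameter=value pairs, if any
    }

# The two URLs shown on the slide.
static = describe("http://www.sas.com/offices/NA/canada/en/edmonton.html")
dynamic = describe("http://maps.google.com/maps?q=edmonton")

print(static)
print(dynamic)
```

The first URL has a path and file name but no parameters; the second has a function name (`/maps`) and one parameter=value pair.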
  • URLs
    Two types of URLs: static and dynamic
      • A static webpage is an HTML file that is already stored on the web server. The file edmonton.html is physically stored on the SAS web server, under the specified path:
        http://www.sas.com/offices/NA/canada/en/edmonton.html
      • A dynamic webpage is generated on request, depending on the information the server receives:
        http://maps.google.com/maps?q=edmonton
  • URLs
      • For all static webpages and some dynamic webpages, the URL (the one followed when you click a link) is usually given with <a href="url"> in the HTML file, so you can easily extract the next URL.
      • For many dynamic webpages, you need to determine the function name and which parameter/value pairs are needed in order to get the required results. The function name and parameters are usually specified in the current HTML with tags like <form action=[function]> and <input name=[param] value=["value"]>.
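The idea of pulling the next URL and the form parameters out of the page source can be sketched with regular expressions, in the same spirit as the Perl regular expression functions mentioned earlier. The sketch below is in Python, and the HTML fragment, the patterns, and the variable names are all invented for illustration; the form target is written with the standard HTML `action` attribute. Real pages vary in attribute order and quoting, so these patterns are a starting point rather than a general parser.

```python
import re

# A made-up HTML fragment standing in for a downloaded page.
SAMPLE_HTML = '''
<a href="/events/page2.html">Next</a>
<form action="/search" method="get">
  <input name="q" value="edmonton">
  <input name="lang" value="en">
</form>
'''

# Link targets: the value of href inside an <a ...> tag.
HREF_RE = re.compile(r'<a\s[^>]*href="([^"]*)"', re.IGNORECASE)
# Function name: the value of action inside a <form ...> tag.
FORM_RE = re.compile(r'<form\s[^>]*action="([^"]*)"', re.IGNORECASE)
# Parameter/value pairs: name and value inside an <input ...> tag.
INPUT_RE = re.compile(r'<input\s[^>]*name="([^"]*)"[^>]*value="([^"]*)"', re.IGNORECASE)

links = HREF_RE.findall(SAMPLE_HTML)
actions = FORM_RE.findall(SAMPLE_HTML)
params = dict(INPUT_RE.findall(SAMPLE_HTML))

print(links)    # link targets found in the page
print(actions)  # function name(s) the form submits to
print(params)   # parameter/value pairs the form would send
```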
  • URLs
      • URLs cannot contain spaces, and many special characters (such as # , ( ) " ' ; :) are reserved or not allowed, so they need to be encoded.
      • SAS provides the function urlencode for this purpose; the function urldecode does the opposite.
      • If there is more than one parameter=value pair in a dynamic URL, an & separates the pairs.
      • These special characters (especially &) cause trouble when constructing URLs in a SAS macro. Use macro quoting functions (such as %str, %quote, %nrquote, %superq) for the URL.
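To make the encoding step concrete, here is a minimal sketch of what encoding and decoding do to a string, written in Python with the standard `urllib.parse` functions rather than the SAS urlencode and urldecode functions named above. The sample text, the parameter names, and the host www.example.com are placeholders chosen for this example.

```python
from urllib.parse import quote, unquote, urlencode

# Text containing a comma, spaces, parentheses and a # sign.
raw = "Edmonton, AB (Canada) #1"

# Percent-encode every character that is not a letter, digit, or one of _ . - ~
encoded = quote(raw, safe="")
# Decoding reverses the encoding.
decoded = unquote(encoded)

# Several parameter=value pairs are joined with & (spaces become + here).
query = urlencode({"q": "cancer care", "country": "Canada"})
url = "http://www.example.com/search?" + query  # placeholder host

print(encoded)
print(decoded)
print(url)
```

The `&` that joins the pairs is exactly the character that needs macro quoting when the same URL is assembled inside a SAS macro.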
  • URLs
    Two methods of sending the required parameters to the server:
      • GET: through the static or dynamic URL; the parameters can be seen in the address bar
      • POST: through the request body; the parameters are not displayed in the address bar
      • The POST method can be used to transfer large amounts of parameters or sensitive data (such as a password)
      • In SAS, most GET requests can be made with the FILENAME URL statement
      • For POST requests, you can only use the FILENAME SOCKET statement, which is very complicated and not recommended. Use other tools (like cURL) instead.
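The difference between the two methods can be seen by building both kinds of request without sending them. The following is a sketch in Python using the standard `urllib.request.Request` class; it is not the FILENAME URL or cURL approach from the talk, and the host www.example.com and the parameter names are placeholders. In a GET request the parameters are part of the URL, while in a POST request they travel as request data and the URL stays clean.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"q": "edmonton", "lang": "en"}  # hypothetical parameters

# GET: the parameters are appended to the URL after a question mark.
get_req = Request("http://www.example.com/search?" + urlencode(params))

# POST: the same parameters are sent as request data, not in the URL.
post_req = Request(
    "http://www.example.com/search",
    data=urlencode(params).encode("ascii"),
)

# Nothing is sent over the network; we only inspect the request objects.
print(get_req.get_method(), get_req.full_url, get_req.data)
print(post_req.get_method(), post_req.full_url, post_req.data)
```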
  • HTML
      • The response from the server is usually an HTML file.
      • In a web browser, you can right-click and select View page source to see the HTML code of the webpage.
      • An HTML file is a text file, so it can be read into a SAS data set, with each line stored as one record.
      • In SAS, use the INFILE statement with varying-length input to read each line.
      • Be aware that SAS has a length limit of 32767: characters beyond this limit in a line can't be read into SAS.
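The one-line-per-record idea, and the effect of a per-record length limit, can be sketched as follows. This is an illustration in Python, not SAS code: the function `read_records` is hypothetical, and splitting an over-long line into several records is just one possible workaround assumed for this example, not something the slides prescribe.

```python
import io

MAX_LEN = 32767  # the per-line limit quoted on the slide

def read_records(stream, max_len=MAX_LEN):
    """Return one record per line; lines longer than max_len are split into pieces."""
    records = []
    for line in stream:
        text = line.rstrip("\r\n")
        if not text:
            records.append("")  # keep empty lines as empty records
            continue
        for start in range(0, len(text), max_len):
            records.append(text[start:start + max_len])
    return records

# A made-up page whose second line is longer than the limit.
html = "<html>\n<body>" + "x" * 40000 + "</body>\n</html>\n"
records = read_records(io.StringIO(html))

print(len(records))
print([len(r) for r in records])
```

Pages that are delivered as one very long line (minified HTML, for instance) are where this limit matters most.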
  • HTML
      • An HTML file consists of pairs of tags and the actual information.
      • Tags instruct the web browser how to display the response on the screen.
      • We can use tags to locate the required information and decide how to extract it: for example, where the URL for the next page is, where the data table is, and which values need to be extracted.
  • HTML
      - Examples of tags (they normally come in pairs):
          - <a href=...>: specifies a link (a URL)
          - <div id=...>: identifies a division or section in an HTML document
          - <table>, <th>, <tr>, <td>: identify the table, table header cell, table row, and table data cell
          - <form ... method="post">, <input ... name=** value=**>: specify which parameters (names) and values need to be sent to the server, and which method (GET or POST) is used to send the request
  • Web Accessing Tools
  • Tools for Web Accessing
      - Tools (software) we use for getting information from the Internet:
          - Web browsers: Internet Explorer, Firefox, Google Chrome, Safari, etc. - for browsing and clicking, not for automation.
          - Command line programs (cURL) and the LWP package for the Perl programming language - very powerful web access; they can replicate most web browser functionality.
          - SAS FILENAME statements: URL, SOCKET, FTP, EMAIL - basic web access for specific tasks; they may not be enough for accessing all websites.
  • SAS functions for text extraction
      - SAS string functions: FIND(), INDEX(), SCAN(), etc. These functions are useful for locating the target information.
      - Perl regular expression functions (Version 9): prxparse(), prxmatch(), prxchange(), and the PRX call routines: call prxchange(), call prxsubstr(), call prxnext(). These are very powerful for extracting the information.
      - Example: extract the URL in the record (variable name: line):
          url=prxchange('s/^.*?href="(.*?)".*?$/$1/',-1,line);
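The prxchange() call on the slide returns at most one URL per record. When a line may hold several links, one possible approach is to loop with call prxnext(). The following is only a sketch: the hard-coded test line, the data set name links, and the variable names are made up for illustration.

```sas
data links(keep=url);
  length line $500 url $200;
  /* Made-up test record containing two links */
  line = '<a href="a.pdf">A</a> and <a href="b.html">B</a>';
  /* Compile the pattern once; the capture group holds the href value */
  re = prxparse('/href="([^"]*)"/i');
  start = 1;
  stop = length(line);
  /* Find the first match between start and stop */
  call prxnext(re, start, stop, line, pos, len);
  do while (pos > 0);
    url = prxposn(re, 1, line);   /* text of capture group 1 */
    output;
    /* call prxnext advances start past the previous match */
    call prxnext(re, start, stop, line, pos, len);
  end;
run;
```

The intent is one output row per link found in the line, rather than one row per line.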
  • SAS Web Access Statements
      - SAS provides two main statements for web access:
          - FILENAME URL: for the GET request method (static or dynamic URL), for example:
              filename eSUG url "http://www.sas.com/offices/NA/canada/en/edmonton.html";
            Before 9.2, FILENAME URL did not fully support the HTTPS protocol; in 9.2 it seems to support HTTPS for secure transfer.
          - FILENAME SOCKET: two-way communication, for more complicated web requests (POST method, cookies, referer, etc.)
      - Other statements:
          - FILENAME EMAIL: for sending emails
          - FILENAME FTP: for transferring files to and from an FTP server
  • SAS Web Access Statements
      - Two SAS mechanisms for extending SAS with other software:
          - FILENAME PIPE statement: the output of the external command is treated as input to the SAS data step. For example:
              filename fileref pipe "<DOS Command>";
          - X statement: leave SAS temporarily, run the external program, and return to SAS after it finishes:
              X "<DOS Command>";
      - The X statement does not provide direct interaction between SAS and the external program.
      - The PIPE statement feeds the results directly to SAS.
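To illustrate how FILENAME PIPE feeds command output into a data step, here is a small sketch that assumes a Windows session with the XCMD system option enabled. The folder C:\temp, the fileref dirlist, and the data set name files are hypothetical.

```sas
/* Assumes Windows and that external commands are allowed (XCMD) */
filename dirlist pipe 'dir /b "C:\temp"';

data files;
  /* Each line printed by the command becomes one record */
  infile dirlist truncover;
  input fname $200.;
run;

filename dirlist clear;
```

The same command run through the X statement would execute, but its output would not come back into SAS unless it were redirected to a file and read separately.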
  • cURL Command Line Program
      - cURL is a command line tool for transferring data with URL syntax.
      - libcurl is a C library, and cURL is the command line program built on top of it.
      - Bindings for libcurl are available for many programming languages and software packages.
      - Basically anything you can do with a web browser can be replicated with cURL, including clicking links, downloading files (any file: music, video, pdf, etc.) and uploading.
      - Most importantly, it is FREE. It can be downloaded from: http://curl.haxx.se/
  • cURL Command Line Program
      - Basic usage of cURL:
          CURL --options urls
      - cURL has many options; with them it can handle virtually any request to any website.
      - With the FILENAME PIPE statement:
          filename eSUG pipe "CURL --options urls";
      - With the X statement:
          x "CURL --options urls";
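A concrete version of the FILENAME PIPE pattern might look like the sketch below. It assumes that curl is installed and on the system PATH and that the XCMD option is enabled. The URL is a placeholder, and the names page and page_lines are hypothetical.

```sas
/* --silent suppresses the progress meter; --location follows redirects */
filename page pipe "curl --silent --location http://www.example.com/";

data page_lines;
  /* Read whatever curl writes to standard output, one line per record */
  infile page lrecl=32767 length=len;
  input line $varying32767. len;
run;

filename page clear;
```

If the URL contains an ampersand, it needs to be protected from macro resolution inside the double-quoted string, as Example 4 does with %str(&).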
  • The LWP package with Perl
      - The Perl programming language is very powerful and flexible for parsing information from HTML or XML files.
      - The LWP (library for WWW in Perl) package is a web access package for Perl.
      - Go to http://www.perl.org/ for more information about Perl and its packages.
      - Web access and information parsing with LWP are much faster than with SAS.
      - You can write a Perl script and use the FILENAME PIPE statement or the X statement to let SAS run the script and return the results to SAS for further processing.
  • Examples
  • Example 1: Download .csv file
      - Download Moody’s seasoned AAA corporate bond yield from the FRED website.
      - The returned data is in .csv format, so no data extraction is needed; it is just like reading a local .csv file.
      SAS Codes:
        filename DAAA url "http://research.stlouisfed.org/fred2/series/DAAA/downloaddata/DAAA.csv";
        data BY_AAA(drop=dd);
          length dd $10;
          format date date9.;
          infile DAAA dlm="," dsd;
          input dd $ value;
          *the header row gives a missing date and is dropped below;
          date=input(dd,yymmdd10.);
          if not missing(date) then output;
        run;
        filename DAAA clear;
      [Figure: resulting data set]
  • Example 2: Get the list of eSUG presentations
      - Download the list of all presentations on the eSUG webpage.
      - All the information is on one webpage, so there is no need to look for the URL of a next page.
      - All presentations are .pdf files, so for data extraction we search for a string that is enclosed in quotation marks and ends with .pdf.
      SAS Codes:
        filename eSUG url "http://www.sas.com/offices/NA/canada/en/edmonton.html";
        data eSUG_achive(keep=pdffile);
          length pdffile $200;
          infile eSUG length=len lrecl=32767;
          input line $varying32767. len;
          *all the file names end with .pdf;
          if find(line,".pdf") then do;
            *get the string ending with .pdf and enclosed by quotation marks;
            pdffile=prxchange('s/^.*?"([^"]*?\.pdf)".*$/$1/i',-1,line);
            output;
          end;
        run;
        filename eSUG clear;
  • Example 2: Get the list of eSUG presentations
      [Figure: resulting data set, eSUG Archive]
  • Example 2: Get the list of eSUG presentations
      - How about downloading all the presentations?
      - FILENAME URL does not work here: it cannot save the files.
      - Use the cURL command line program with the SAS X statement.
      - The cURL option --output filename saves the result to the file specified by filename.
      SAS Codes:
        proc sql noprint;
          select pdffile into :pdffiles separated by "|" from eSUG_achive;
        quit;
        options noxwait;
        %macro downLoadFiles();
          %let eSUGFolder=H:eSUGPDFs;
          %let i=1;
          %let pdf1=%scan(&pdffiles.,&i.,"|");
          %do %while (%quote(&pdf1.)~=);
            /* only the file name is needed, not the path */
            %let filename=%scan(&pdf1.,-1,"/");
            /* the --output option saves the file */
            x "curl --output &eSUGFolder./&filename. &pdf1.";
            %let i=%eval(&i.+1);
            %let pdf1=%scan(&pdffiles.,&i.,"|");
          %end;
        %mend downLoadFiles;
        %downLoadFiles;
  • Example 3: Find out all clinical trials
      - Locate and extract information in the HTML using the tags.
      - Find the URL for the next webpage.
      SAS Codes:
        %macro GetTrials(startlink=, out=Trials);
          %let Continue=Yes;
          %let NextLink=&startlink.;
          %let i=0;
          %do %while(&Continue.=Yes);
            filename Trials url "http://www.clinicaltrial.gov/%superq(NextLink)" lrecl=8192;
            data _Trials_(drop=recordline record nextpage);
              length Status $25 Study $200 link $200;
              retain Rank Status Study link;
              /* recordLine=1 means this line may contain needed information */
              retain recordLine 0;
              infile Trials length=len;
              input record $varying8192. len;
              if prxmatch('/>Study<\/th>/i',record) then recordLine=1;
              if recordLine=1 and find(record,"/div") then recordLine=0;
              if (recordLine=1) then do;
                /* now get the related information */
                if prxmatch('/<span.*>.*?<\/span>\s*$/',record) then
                  status=prxchange('s/^.*?<span.*>(.*?)<\/span>\s*$/$1/',-1,record);
                if find(record,"href") then do;
                  Study=prxchange('s/^.*>(.*?)<.*$/$1/',-1,record);
                  link=prxchange('s/^.*?href="(.*?)".*$/$1/',-1,record);
                end;
                if find(record,"</table>") then do;
                  rank=input(scan(link,-1,"="),5.0);
                  output;
                end;
              end;
              if find(record,"Next Page") then do;
                if find(record,"href") then do;
                  nextpage=tranwrd(prxchange('s/^.*?href="(.*?)".*?$/$1/',-1,record),'&amp;','&');
                  call symput("nextlink",nextpage);
                end;
                else call symput("Continue","No");
              end;
            run;
            filename Trials clear;
            /* drop the last record of each page */
            data _Trials_;
              set _Trials_ end=last;
              if (last~=1) then output;
            run;
            %if (&i.=0) %then %do;
              data &out.; set _Trials_; run;
            %end;
            %else %do;
              data &out.; set &out. _Trials_; run;
            %end;
            %let i=%eval(&i.+1);
          %end;
        %mend GetTrials;
  • Example 3: Find out all clinical trials
      - This macro accepts a start link and then keeps following the next-page links to collect all the trials.
      - To get all trials in Alberta, use the start link ct2/results?state1=NA%3ACA%3AAB
      - Note: NA%3ACA%3AAB is the URL-encoded form of NA:CA:AB
      SAS Codes:
        %GetTrials(startlink=%str(ct2/results?state1=NA%3ACA%3AAB),out=AB_Trials);
      - For all trials in Canada:
      SAS Codes:
        %GetTrials(startlink=%str(ct2/results?state1=NA%3ACA),out=CA_Trials);
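The encoded string does not have to be typed by hand. As a sketch, and assuming the URLENCODE and URLDECODE functions are available in your SAS release, the encoding of NA:CA:AB could be checked as follows (the variable names are illustrative only):

```sas
data _null_;
  length raw encoded decoded $50;
  raw = "NA:CA:AB";
  /* strip() removes trailing blanks, which would otherwise be encoded too */
  encoded = urlencode(strip(raw));
  decoded = urldecode(strip(encoded));
  /* Write the three values to the log for comparison */
  put raw= encoded= decoded=;
run;
```

The encoded value written to the log can then be pasted into the startlink= argument of the macro.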
  • Example 4: Get and plot today’s temperature
      - From the Weather Underground website: www.wunderground.com
      - Although the URL can be specified easily, it is not the final link, so the SAS FILENAME URL statement cannot get the required results.
      - The cURL option --location follows the redirects to the final link and gets the resulting text file (basically a .csv file).
      - Use the FILENAME PIPE statement to read the output of cURL directly into SAS.
      SAS Codes:
        *There are 3 weather stations in Edmonton: IABEDMON10 IABEDMON12 IABEDMON13;
        %let WStation=IABEDMON13; *This is Edmonton Downtown weather station;
        *get the current date and obtain day, month and year. These are used to construct the url;
        %let Day=%substr(&sysdate9.,1,2);
        %let Month=%sysfunc(month(%sysfunc(today())));
        %let Year=%substr(&sysdate9.,6,4);
        %let EdmWeather="http://www.wunderground.com/weatherstation/WXDailyHistory.asp?ID=&WStation.%str(&)month=&month.%str(&)day=&Day.%str(&)year=&Year.%str(&)format=1";
        filename Weather pipe "curl --location %superq(EdmWeather)"; *FILENAME URL does not work;
        data weather(drop=line);
          infile Weather lrecl=2000 length=len;
          input line $varying2000. len;
          if find(line,",") then do;
            time=input(scan(line,1,",","m"),anydtdtm21.);
            format time datetime12.;
            Temp=input(scan(line,2,",","m"),5.0);
            Dewpoint=input(scan(line,3,",","m"),5.0);
            if not missing(time) then output;
          end;
        run;
        filename Weather clear;
  • Example 4: Get and plot today's temperature
    SAS Codes for Plotting the Temperature
      symbol1 i=j;
      legend1 label=(h=2 "Weather") value=(h=2);
      axis1 value=(h=2);
      title1 "Temperature on &sysdate9.";
      proc gplot data=weather;
        plot (temp Dewpoint)*time/haxis=axis1 vaxis=axis1 overlay legend=legend1;
      run;
      quit;
    Resulting Temperature Plot (figure: temperature and dewpoint plotted against time)
  • Example 5: Download City of Edmonton job postings
    Most of the web access in this example uses the POST method, so Perl/LWP is used both for web access and for data extraction.
    Then, in SAS, use the FILENAME PIPE statement to run the Perl program and read the extracted data into a data set.
    SAS Codes
      filename EdJobs pipe "C:\Perl\bin\perl C:\esug\edmjobs.pl";
      data Edm_Jobs;
        *16 variables in total, 14 of them are read in as character strings. Declare their lengths;
        length JobCode $10 Title $100 Description $10000 Qualification $10000 Salary $500;
        length Hours $500 PDate $20 CDate $20 JobType1 $100 JobType2 $100 Union $100;
        length Department $100 WorkLocation $200 WorkAddress $200;
        infile EdJobs lrecl=32767 length=len;
        *use multiple input statements because the final result from Perl is printed in multiple lines;
        input JobCode $varying10. len;
        input Title $varying100. len;
        input JobNum;
        input Description $varying10000. len;
        input Qualification $varying10000. len;
        input Salary $varying500. len;
        input Hours $varying500. len;
        input PDate $varying20. len;
        input CDate $varying20. len;
        input NumOfOpening;
        input JobType1 $varying100. len;
        input JobType2 $varying100. len;
        input Union $varying100. len;
        input Department $varying100. len;
        input WorkLocation $varying200. len;
        input WorkAddress $varying200. len;
      run;
      filename EdJobs clear;
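The idea of reading one record from several consecutive lines can be tried without Perl or a network connection. The sketch below uses in-stream data with made-up job codes and titles, and plain $CHAR informats with TRUNCOVER instead of the $VARYING informats used above, so it illustrates the same one-INPUT-statement-per-line pattern rather than reproducing the original step.

```sas
/* Sketch only: each observation is spread over three input lines.    */
/* The job codes, titles and numbers below are invented sample data.  */
data demo_jobs;
  length JobCode $10 Title $100;
  infile datalines truncover;
  input JobCode $char10.;    /* line 1 of the record: job code   */
  input Title   $char100.;   /* line 2 of the record: job title  */
  input JobNum;              /* line 3 of the record: job number */
datalines;
J0001
Analyst I
101
J0002
Analyst II
102
;
run;

proc print data=demo_jobs;
run;
```

Each iteration of the data step consumes three lines, so the six data lines should yield two observations.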
  • Example 5: Download City of Edmonton job postings
    Perl program: edmjobs.pl
      #=====================================================================================
      # Perl program to download the job postings from City of Edmonton
      # Written by George Zhu (george.zhu@albertahealthservices.ca)
      # for the Edmonton SAS User Group (eSUG) meeting, Fall 2011
      # Usage:
      # 1. Stand Alone:
      #    C:\Perl\bin\perl C:\esug\edmjobs.pl >[output file name]
      # 2. With SAS, use filename pipe statement:
      #    filename EdJobs pipe "C:\Perl\bin\perl C:\esug\edmjobs.pl"
      #    then in the data step, use infile statement:
      #    infile EdJobs lrecl=32767 length=len;
      # Note: change the folder name to where the program edmjobs.pl is located.
      #=====================================================================================
      use strict;
      use LWP::UserAgent;
      use XML::Parser;
      use URI::Escape;
      my $EdJobs='https://edmonton.taleo.net/careersection/2/moresearch.ftl';
      my $EdBrowser=LWP::UserAgent->new();
      $EdBrowser->agent('Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2');
      my $EdResponse=$EdBrowser->get($EdJobs);
      my $EdPage1=$EdResponse->content() if $EdResponse->is_success();
      ## obtain the form items using the XML Parser;
      my %inputAttrs; #for the POST data for the next query;
      my $EdParse=XML::Parser->new(Handlers=>{Start=>\&Input_start,});
  • Example 5: Download City of Edmonton job postings
    Perl program: edmjobs.pl - continued
      sub Input_start
      {
        my ($expat,$element,%attrs)=@_;
        if($element=~/input/i) # look for the input tag
        {
          my ($param,$pmval);
          while(my($key,$value)=each(%attrs))
          {
            if ($key=~/name/i) {$param=$value;}
            if ($key=~/value/i) {$value=~tr/ /+/; $pmval=$value;}
          }
          $inputAttrs{$param}=$pmval;
        }
      }
      $EdParse->parse($EdPage1);
      my %AttrsPage1=%inputAttrs;
      my %AttrsPost1=%inputAttrs;
      ## total number of postings;
      my $nJobs=$1 if $AttrsPage1{"initialHistory"}=~/listRequisition.nbElements!%7C!(\d+?)!%7C!/i;
      ## Number of postings per page;
      my $listSize=$1 if $AttrsPage1{"initialHistory"}=~/listRequisition.size!%7C!(\d+?)!%7C!/i;
  • Example 5: Download City of Edmonton job postings
    Perl program: edmjobs.pl - continued
      ## calculate number of pages (round up when the last page is only partly filled);
      my $nPages=$nJobs/$listSize;
      $nPages=int($nPages+1) if ($nPages>int($nPages));
      ## Obtain the pages of the postings, first set the parameters;
      $AttrsPage1{"countryPanelErrorDrawer.state"}="false";
      $AttrsPage1{"errorMessageDrawer.state"}="false";
      $AttrsPage1{"ftlcompclass"}="PagerComponent";
      $AttrsPage1{"ftlcallback"}="ftlPager_processResponse";
      $AttrsPage1{"ftlcompid"}="rlPager";
      $AttrsPage1{"ftlinterfaceid"}="requisitionListInterface";
      my $pp=1;
      my $i=0;
      my @JobCodes;
      my $jobDetails='https://edmonton.taleo.net/careersection/2/jobdetail.ftl';
      while ($pp<=$nPages)
      {
        $AttrsPage1{"rlPager.currentPage"}="$pp";
        my $JobPage=$EdBrowser->post($jobDetails,\%AttrsPage1,'Referer'=>$EdJobs,);
        my @fillList=split("','",$1) if $JobPage->content()=~/api.fillList(.*?);/i;
        shift(@fillList);
        shift(@fillList);
        my $nLine=3;
        my $listSize=@fillList;
  • Example 5: Download City of Edmonton job postings
    Perl program: edmjobs.pl - continued
        while($nLine<$listSize)
        {
          $JobCodes[$i]=$fillList[$nLine];
          $nLine+=33;
          $i++;
        }
        $pp++;
      }
      $AttrsPost1{"ftlcompid"}="actOpenRequisitionDescription";
      $AttrsPost1{"ftlinterfaceid"}="requisitionListInterface";
      foreach my $code (@JobCodes)
      {
        $AttrsPost1{"actOpenRequisitionDescription.requisitionNo"}="$code";
        # Use the POST method to get the detailed info about a posting
        my $Job1=$EdBrowser->post($jobDetails,\%AttrsPost1,'Referer'=>$EdJobs,);
        $EdParse->parse($Job1->content());
        $inputAttrs{"initialHistory"}=~tr/+/ /;
        my @infor=split("!%7C!",$inputAttrs{"initialHistory"});
        my $description=uri_unescape($infor[15]);
        $description=~s/(<.*?>)|(!\*!)//g;
        $description=~s/(&nbsp;)/ /g;
        $description=~s/(&#39;)/'/g;
        $description=~s/(\n)/ /g;
        my $qualification=uri_unescape($infor[16]);
        $qualification=~s/(<.*?>)|(!\*!)//g;
  • Example 5: Download City of Edmonton job postings
    Perl program: edmjobs.pl - continued
        $qualification=~s/(&nbsp;)/ /g;
        $qualification=~s/(&#39;)/'/g;
        my $salary=$1 if $qualification=~s/Salary Range: (.*?)\n//i;
        my $Hours=$1 if $qualification=~s/Hours of Work: (.*?)\n//i;
        $qualification=~s/(\n)/ /g;
        ## these are the extracted information from a job posting;
        print $code,"\n";          #jobcode
        print $infor[12],"\n";     #title
        print $infor[13],"\n";     #Job Number
        print $description,"\n";   #Description
        print $qualification,"\n"; #Qualification
        print $salary,"\n";        #Salary Range
        print $Hours,"\n";         #Hours of Work
        print $infor[20],"\n";     #Posting Date
        print $infor[22],"\n";     #Closing Date
        print $infor[23],"\n";     #Number of opening
        print $infor[25],"\n";     #Job type 1
        print $infor[26],"\n";     #Job type 2
        print $infor[27],"\n";     #Union
        print $infor[29],"\n";     #Department
        print $infor[32],"\n";     #Work location
        print $infor[33],"\n";     #Work location address
        #print "=================================\n\n";
      }
      #### End of program ####