Innoplexia DevTools to Crawl Webpages

Published in: Technology

  1. DevTools to Crawl Webpages
  2. DevTools (09.05.12, @chrschneider)
  3. DevTools: Apache HttpComponents
     "… Apache … toolset of low level Java components focused on HTTP and associated protocols."
     ● HttpComponents Core … is a set of low level HTTP transport components
     ● HttpComponents Client … provides reusable components for client-side ... HTTP connection management.
     ● HttpComponents AsyncClient (DEV) … ability to handle a great number of concurrent connections ... more ... performance in terms of a raw data throughput.
     ● Commons HttpClient (Legacy) … All users of Commons HttpClient 3.x are strongly encouraged to upgrade to HttpClient 4.1.
  4. DevTools: HttpComponents Client, example components
     ● Get, Post, Delete, … request objects
     ● Cookie manager
     ● SSL
     ● Content-encoding aware
     ● HTTP authentication (Basic, Digest, ...)
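  5. DevTools: HttpComponents Client, example

         import org.apache.http.client.HttpClient;
         import org.apache.http.client.ResponseHandler;
         import org.apache.http.client.methods.HttpGet;
         import org.apache.http.impl.client.BasicResponseHandler;
         import org.apache.http.impl.client.DefaultHttpClient;

         // The class name is illustrative; the slide shows only the main method.
         public class HttpClientExample {
             public static void main(final String[] args) throws Exception {
                 final HttpClient httpclient = new DefaultHttpClient();
                 try {
                     final HttpGet httpget = new HttpGet("http://www.google.com/");
                     System.out.println("executing request " + httpget.getURI());
                     // Create a response handler that returns the body as a String
                     final ResponseHandler<String> responseHandler = new BasicResponseHandler();
                     final String responseBody = httpclient.execute(httpget, responseHandler);
                     System.out.println("----------------------------------------");
                     System.out.println(responseBody);
                     System.out.println("----------------------------------------");
                 } finally {
                     // Release all connections held by the client
                     httpclient.getConnectionManager().shutdown();
                 }
             }
         }

     Source: http://hc.apache.org/httpcomponents-client-ga/examples.html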
  6. DevTools: HttpComponents Client, demo
  7. DevTools: Netty is an asynchronous event-driven network application framework for rapid development of maintainable high performance protocol servers & clients. See: http://netty.io/
  8. DevTools: HtmlUnit is a "GUI-Less browser for Java programs"
     Features (excerpt):
     ● Support for the HTTP and HTTPS protocols
     ● Support for cookies
     ● Ability to specify whether failing responses from the server should throw exceptions or should be returned as pages of the appropriate type (based on content type)
     ● Ability to customize the request headers being sent to the server
     ● Support for HTML responses
     ● Support for submitting forms
     ● Support for clicking links
     ● Support for walking the DOM model of the HTML document
     ● JavaScript support
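  9. DevTools: HtmlUnit is a "GUI-Less browser for Java programs"

         import static org.junit.Assert.assertEquals;
         import static org.junit.Assert.assertTrue;

         import org.junit.Test;
         import com.gargoylesoftware.htmlunit.WebClient;
         import com.gargoylesoftware.htmlunit.html.HtmlPage;

         // The class name is illustrative; the slide shows only the test method.
         public class HtmlUnitHomePageTest {
             @Test
             public void homePage() throws Exception {
                 final WebClient webClient = new WebClient();
                 final HtmlPage page = webClient.getPage("http://htmlunit.sourceforge.net");
                 System.out.println(page.getTitleText());
                 assertEquals("Welcome to HtmlUnit", page.getTitleText());

                 // The inner quotes must be escaped inside the Java string literal
                 final String pageAsXml = page.asXml();
                 assertTrue(pageAsXml.contains("<body class=\"composite\">"));

                 final String pageAsText = page.asText();
                 assertTrue(pageAsText.contains("Support for the HTTP and HTTPS protocols"));

                 webClient.closeAllWindows();
             }
         }

     Source: http://htmlunit.sourceforge.net/gettingStarted.html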
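  10. DevTools: HtmlUnit is a "GUI-Less browser for Java programs"

          import org.junit.Test;
          import com.gargoylesoftware.htmlunit.WebClient;
          import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
          import com.gargoylesoftware.htmlunit.html.HtmlDivision;
          import com.gargoylesoftware.htmlunit.html.HtmlPage;

          // The class name is illustrative; the slide shows only the test method.
          public class HtmlUnitElementsTest {
              @Test
              public void getElements() throws Exception {
                  final WebClient webClient = new WebClient();
                  final HtmlPage page = webClient.getPage("http://some_url");
                  // Look up elements directly by id or by anchor name
                  final HtmlDivision div = page.getHtmlElementById("some_div_id");
                  final HtmlAnchor anchor = page.getAnchorByName("anchor_name");
                  webClient.closeAllWindows();
              }
          }

      Luxury :)
      Note: HTML tables are supported as well. HtmlUnit provides simple wrapper classes to walk through them. Handy!
      http://htmlunit.sourceforge.net/table-howto.html
      http://htmlunit.sourceforge.net/gettingStarted.html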
  11. DevTools: Selenium automates browsers. That's it.
      Selenium-WebDriver supports the following browsers along with the operating systems these browsers are compatible with.
      ● Google Chrome 12.0.712.0+
      ● Internet Explorer 6, 7, 8, 9 - 32 and 64-bit where applicable
      ● Firefox 3.0, 3.5, 3.6, 4.0, 5.0, 6, 7
      ● Opera 11.5+
      ● HtmlUnit 2.9
      ● Android – 2.3+ for phones and tablets (devices & emulators)
      ● iOS 3+ for phones (devices & emulators) and 3.2+ for tablets (devices & emulators)
  12. DevTools: Selenium automates browsers. That's it.
      The Selenium Family:
      ● Selenium IDE
      ● Selenium WebDriver
      ● Selenium Grid
      Slide annotations: "Also C#, Python, Ruby, ..." and "Also on Windows and Mac"
  13. DevTools: Selenium automates browsers. That's it.
      The Selenium Family:
      ● Selenium IDE … create quick bug reproduction scripts … create scripts to aid in automation-aided exploratory testing
      ● Selenium WebDriver … create robust, browser-based regression automation
      ● Selenium Grid … scale and distribute scripts across many environments
      http://seleniumhq.org/
  14. DevTools: Requirements for Selenium WebDriver with Firefox (and HtmlUnit)
      ● Dependencies

          <dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-java</artifactId>
            <version>2.21.0</version>
          </dependency>
          <dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-htmlunit-driver</artifactId>
            <version>2.21.0</version>
          </dependency>
          <dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-firefox-driver</artifactId>
            <version>2.21.0</version>
          </dependency>

      ● Browser Binaries
      That's it.
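  15. DevTools: Basic Selenium example

          import org.junit.Test;
          import org.openqa.selenium.By;
          import org.openqa.selenium.WebDriver;
          import org.openqa.selenium.WebElement;
          import org.openqa.selenium.firefox.FirefoxDriver;

          // The class name is illustrative; the slide shows only the test method.
          public class SeleniumFirefoxTest {
              @Test
              public void testSeleniumWithFirefox() throws InterruptedException {
                  final WebDriver webDriver = new FirefoxDriver();
                  webDriver.get("http://www.majug.de");
                  // Find the "Veranstaltungen" (events) link by its visible text and click it
                  final WebElement veranstaltungenLink =
                          webDriver.findElement(By.linkText("Veranstaltungen"));
                  veranstaltungenLink.click();
                  // Keep the browser open for five seconds, then close it
                  Thread.sleep(5000);
                  webDriver.quit();
              }
          }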
  16. DevTools: Selenium WebDriver locator strategies
      It's also possible to call findElements(...) to get a List<> of WebElements:

          List<WebElement> hits = webDriver.findElements(By.tagName("a"));
  17. DevTools: Selenium WebDriver interactions
      Once you have a WebElement, you can:
      ● click it: webElement.click()
      ● send keys to it: webElement.sendKeys(...)
      ● submit it: webElement.submit()
      It is also possible to perform actions like DoubleClick, DragAndDrop, ClickAndHold, … with the "Actions" class.
  18. DevTools: Selenium WebDriver, demo
  19. DevTools: Selenium WebDriver pitfalls
      Newbie pitfalls:
      ● Selenium doesn't wait until the whole page is loaded (keyword: implicit wait).
      ● webElement.findElement(By.xpath("//...")) starts from the root of the DOM, not from the element (use ".//..." instead).
      ● Googling brings up "Selenium RC" solutions. That is the old Selenium project.
      ● A reference to a WebElement becomes invalid once the driver "moves" to another page.
      ● Firefox doesn't run on our CI because it is a headless system (try Xvfb).
      ● New XPath 2.0 functions (like ends-with(...)) fail, because Selenium uses the driver's native XPath engine. For Firefox this currently means XPath 1.0.
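
The two XPath pitfalls from this slide can be tried without a browser. The following is a minimal sketch that uses the JDK's built-in XPath 1.0 engine as a stand-in for a driver's native engine; it is not Selenium code. The sample markup, the class name XPathPitfalls and its helper methods are assumptions made for illustration. It shows that "//a" evaluated on an element still searches the whole document while ".//a" stays inside that element, and one common XPath 1.0 way to emulate ends-with(...) using substring and string-length.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathPitfalls {

    // Hypothetical sample page; kept well-formed so the JDK XML parser accepts it.
    static final String PAGE =
        "<html><body>"
        + "<div id='nav'><a href='/home.html'>Home</a><a href='/feed.xml'>Feed</a></div>"
        + "<div id='main'><a href='/talk.pdf'>Slides</a></div>"
        + "</body></html>";

    // XPath 1.0 emulation of ends-with(@href, '.pdf'):
    // compare the tail of @href, which has the same length as the suffix, with the suffix.
    static final String ENDS_WITH_PDF =
        "//a[substring(@href, string-length(@href) - string-length('.pdf') + 1) = '.pdf']";

    // Parses the markup into a DOM document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Returns the first node matching the expression, evaluated with 'context' as context node.
    static Node first(Object context, String expression) throws Exception {
        return (Node) XPathFactory.newInstance().newXPath()
                .evaluate(expression, context, XPathConstants.NODE);
    }

    // Counts the nodes matching the expression, evaluated with 'context' as context node.
    static int count(Object context, String expression) throws Exception {
        NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, context, XPathConstants.NODESET);
        return hits.getLength();
    }

    public static void main(String[] args) throws Exception {
        Document page = parse(PAGE);
        Node main = first(page, "//div[@id='main']");

        // Absolute path: ignores the context element and searches the whole document.
        System.out.println("\"//a\" on the main div:  " + count(main, "//a"));
        // Relative path: searches only below the context element.
        System.out.println("\".//a\" on the main div: " + count(main, ".//a"));
        // Links whose href ends with ".pdf", without the XPath 2.0 ends-with function.
        System.out.println("links ending in .pdf:   " + count(page, ENDS_WITH_PDF));
    }
}
```

With this sample markup the absolute expression matches all three links, while the relative one matches only the single link inside the main div. The same expressions can be passed to By.xpath(...) in WebDriver, with the caveat that each driver's engine may behave slightly differently.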
  20. Any questions? Thank you for your attention!
