Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)

Presented at Houston Hadoop Meetup in March '14

Published in: Technology

  1. Houston Hadoop Meetup 2/12/14
     Nutch + Hadoop with Selenium and Burp
     By Mark Kerzner, Elephant Scale
  2. Nutch story
     • Created by Doug Cutting to crawl the web
     • Not scalable
     • Enter HDFS
     • Nutch on HDFS
     • Nutch on Hadoop
     • Nutch 1.x, Nutch 2.x
  3. Nutch 1.x
     • Local or HDFS
     • Command-line
     • Crawl-db
  4. Configuring Nutch
     • Edit the file conf/regex-urlfilter.txt and replace the default rule
           # accept anything else
           +.
       with a regular expression matching the domain you wish to crawl.
     • For example, to crawl only the nutch.apache.org domain:
           +^http://([a-z0-9]*\.)*nutch\.apache\.org/
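
A URL filter rule like the one on slide 4 is easy to get wrong, so it can help to try it against a few sample URLs before starting a crawl. The following is a minimal sketch, not part of Nutch: the class name `UrlFilterCheck` and the sample URLs are made up for illustration, the leading `+` (the "accept" marker in regex-urlfilter.txt) is dropped, and the dots are escaped on the assumption that the slide meant literal dots. It also assumes the filter searches for the pattern inside the URL, which is equivalent to a prefix match here because the pattern starts with `^`.

```java
import java.util.regex.Pattern;

// Hypothetical helper for trying out an accept rule; not a Nutch class.
class UrlFilterCheck {
    // The rule from the slide without the leading '+', with dots escaped (assumption).
    static final Pattern ACCEPT =
            Pattern.compile("^http://([a-z0-9]*\\.)*nutch\\.apache\\.org/");

    // Returns true when the pattern is found in the URL.
    static boolean accepts(String url) {
        return ACCEPT.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://nutch.apache.org/downloads.html",
            "http://wiki.nutch.apache.org/page",
            "https://nutch.apache.org/",
            "http://example.com/"
        };
        for (String url : samples) {
            System.out.println(url + " -> " + accepts(url));
        }
    }
}
```

Note that the rule as written only covers `http://`; a site served over `https://` would need the scheme part of the pattern adjusted.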
  5. Nutch architecture
  6. Solr integration
  7. Solr Application (FreeEed, demo)
  8. Scaling Nutch
     • HDFS – scaling storage
     • MapReduce – scale crawling
     • Gora – scale back end
  9. Gora
     • Data persistence: persisting objects to column stores such as HBase, Cassandra, Hypertable, Voldemort, Redis, etc.; SQL databases such as MySQL and HSQLDB; flat files in the local file system or Hadoop HDFS
     • Data access: Java-friendly API for accessing the data regardless of its location
     • Indexing: Solr
     • Analysis: Apache Pig, Apache Hive and Cascading
     • MapReduce support
  10. Passwords? – Oops!
      1. Burp + HttpClient
      2. Selenium + Java
  11. Burp (with demo)
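  12. HttpClient

      // getUrl(), getHeaders(), getPostBody() and the response variable
      // are defined outside this excerpt
      CloseableHttpClient httpclient = HttpClients.createDefault();
      try {
          HttpPost httpPost = new HttpPost(getUrl());
          // put in all custom headers
          Map<String, String> headers = getHeaders();
          for (Map.Entry<String, String> header : headers.entrySet()) {
              httpPost.addHeader(header.getKey(), header.getValue());
          }
          // send the captured POST body as raw bytes
          HttpEntity entity = new ByteArrayEntity(getPostBody().getBytes("UTF-8"));
          httpPost.setEntity(entity);
          response = httpclient.execute(httpPost);
      } finally {
          // release the client's connections once the request is done
          httpclient.close();
      }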
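
The HttpClient slide relies on `getHeaders()` and `getPostBody()`, which the deck does not show. One plausible way to feed them, given the "Burp + HttpClient" pairing, is to copy the raw login request out of Burp and split it into a header map and a body. The sketch below is an assumption about that workflow, not the presenter's code: the class `RawRequest` and its field names are hypothetical. It skips `Content-Length` because HttpClient derives that header from the entity it sends.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for a raw HTTP request copied as text from a proxy tool.
class RawRequest {
    final String requestLine;                                   // e.g. "POST /login HTTP/1.1"
    final Map<String, String> headers = new LinkedHashMap<>();  // keeps original header order
    final String body;                                          // everything after the blank line

    RawRequest(String raw) {
        // Normalize CRLF line endings so the split logic only deals with '\n'.
        String text = raw.replace("\r\n", "\n");
        int split = text.indexOf("\n\n");
        String head = split >= 0 ? text.substring(0, split) : text;
        body = split >= 0 ? text.substring(split + 2) : "";

        String[] lines = head.split("\n");
        requestLine = lines[0];
        for (int i = 1; i < lines.length; i++) {
            int colon = lines[i].indexOf(':');
            if (colon <= 0) {
                continue; // not a "Name: value" line
            }
            String name = lines[i].substring(0, colon).trim();
            String value = lines[i].substring(colon + 1).trim();
            // Leave Content-Length out; the HTTP client computes it from the body.
            if (!name.equalsIgnoreCase("Content-Length")) {
                headers.put(name, value);
            }
        }
    }

    public static void main(String[] args) {
        RawRequest req = new RawRequest(
                "POST /login HTTP/1.1\r\n"
                + "Host: mysite.com\r\n"
                + "Content-Type: application/x-www-form-urlencoded\r\n"
                + "Content-Length: 13\r\n"
                + "\r\n"
                + "user=a&pass=b");
        System.out.println(req.requestLine);
        System.out.println(req.headers);
        System.out.println(req.body);
    }
}
```

With something like this, `getHeaders()` could return `req.headers` and `getPostBody()` could return `req.body`; session cookies and tokens in a captured request expire, so a replayed request may need a fresh capture.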
  13. Browser interaction? – Oops!
      • Selenium
      • Selenium + Java
  14. Selenium (with demo)

      WebDriver driver = new FirefoxDriver();
      // Go to the login page
      driver.get("https://mysite.com");
      // put in the username
      WebElement query = driver.findElement(By.name("username-element"));
      query.sendKeys("your-user-name");
      // put in the password
      query = driver.findElement(By.name("password-element"));
      query.sendKeys("real-password");
      ((JavascriptExecutor) driver).executeScript("javascript:whatever-login-script();");
