Mining the Web for Information using Hadoop







Mining the Web for Information using Hadoop: Presentation Transcript

  • Mining the web with Hadoop. Steve Watt, Emerging Technologies @ HP. (Photo: Someday Soon, Flickr)
  • (Photo: timsnell, Flickr)
  • Gathering Data: Data Marketplaces
  • Gathering Data: Apache Nutch (Web Crawler)
  • (Photo: Pascal Terjan, Flickr)
  • Using Apache Nutch: identify optimal seed URLs for a seed list and crawl to a depth of 2. For example:
    http://www.crunchbase.com/companies?c=a&q=private_held
    http://www.crunchbase.com/companies?c=b&q=private_held
    http://www.crunchbase.com/companies?c=c&q=private_held
    http://www.crunchbase.com/companies?c=d&q=private_held
    ...
    Crawl data is stored in sequence files in the segments dir on the HDFS.
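The seed-list step above can be sketched in Python. This is only an illustration: `seed_urls` is a hypothetical helper, not part of Nutch, and the letters passed in are an assumption (the slide shows a through d and elides the rest). The URL pattern is copied from the slide.

```python
from typing import Iterable, List

# URL pattern copied from the slide; only the value of `c` varies.
SEED_PATTERN = "http://www.crunchbase.com/companies?c={letter}&q=private_held"


def seed_urls(letters: Iterable[str]) -> List[str]:
    """Return one seed URL per letter (hypothetical helper, not a Nutch API)."""
    return [SEED_PATTERN.format(letter=letter) for letter in letters]


# The four letters shown on the slide; extend the string to cover more pages.
urls = seed_urls("abcd")
print("\n".join(urls))
```

Writing the printed lines to a text file, one URL per line, gives a seed list in the shape the slide describes.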
  • ALSO
  • Making the data STRUCTURED: retrieving HTML, prelim filtering on URL, Company POJO, then tab-delimited (\t) output
  • The Result? Tab Delimited Structured Data…
    Company        City       State  Country  Sector       Round  Day  Month  Year  Amount   Investors
    InfoChimps     Austin     TX     USA      Enterprise   Angel  14   9      2010  350000   Stage One Capital
    InfoChimps     Austin     TX     USA      Enterprise   A      7    11     2010  1200000  DFJ Mercury
    MassRelevance  Austin     TX     USA      Enterprise   A      20   12     2010  2200000  Floodgate, AV, etc
    Masher         Calabasas  CA     USA      Games_Video  Seed   0    2      2009  175000
    Masher         Calabasas  CA     USA      Games_Video  Angel  11   8      2009  300000   Tech Coast Angels
    Note: I dropped the ZipCode because it didn’t occur consistently.
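A minimal sketch of the "Company record, then tab-delimited out" idea follows. The talk mentions a Java POJO; the Python dataclass here is a stand-in for illustration, and the class layout and the `to_tsv` method name are assumptions. Field order follows the header row on the slide.

```python
from dataclasses import astuple, dataclass


@dataclass
class Company:
    # Field names and order follow the header row on the slide.
    company: str
    city: str
    state: str
    country: str
    sector: str
    round: str
    day: int
    month: int
    year: int
    amount: int
    investors: str = ""  # left empty when no investor is listed

    def to_tsv(self) -> str:
        """Serialize the record as one tab-delimited line (hypothetical helper)."""
        return "\t".join(str(value) for value in astuple(self))


# First data row from the slide.
row = Company("InfoChimps", "Austin", "TX", "USA", "Enterprise",
              "Angel", 14, 9, 2010, 350000, "Stage One Capital")
print(row.to_tsv())
```

Keeping an empty trailing field for missing investors means every line has the same number of columns, which simplifies later loading.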
  • Time to Analyze/Visualize the data… Step 1: Select the right visual encoding for your questions. Let’s start by asking questions and seeing what we can learn from some simple bar charts…
  • Total Tech Investments By Year
  • Investment Funding By Sector
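The numbers behind bar charts like these are grouped sums over the tab-delimited data. The sketch below is an assumption about how such totals could be computed, not the speaker's code; `total_by` is a hypothetical helper, and the sample lines are the five rows shown earlier in the transcript.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence

# Column positions in the tab-delimited output (Company is column 0).
SECTOR, YEAR, AMOUNT = 4, 8, 9


def total_by(rows: Iterable[Sequence[str]], key_col: int) -> Dict[str, int]:
    """Sum the Amount column per distinct value of the chosen key column."""
    totals: Dict[str, int] = defaultdict(int)
    for row in rows:
        totals[row[key_col]] += int(row[AMOUNT])
    return dict(totals)


SAMPLE = [
    "InfoChimps\tAustin\tTX\tUSA\tEnterprise\tAngel\t14\t9\t2010\t350000\tStage One Capital",
    "InfoChimps\tAustin\tTX\tUSA\tEnterprise\tA\t7\t11\t2010\t1200000\tDFJ Mercury",
    "MassRelevance\tAustin\tTX\tUSA\tEnterprise\tA\t20\t12\t2010\t2200000\tFloodgate, AV, etc",
    "Masher\tCalabasas\tCA\tUSA\tGames_Video\tSeed\t0\t2\t2009\t175000\t",
    "Masher\tCalabasas\tCA\tUSA\tGames_Video\tAngel\t11\t8\t2009\t300000\tTech Coast Angels",
]
rows = [line.split("\t") for line in SAMPLE]

by_year = total_by(rows, YEAR)
by_sector = total_by(rows, SECTOR)
print(by_year)
print(by_sector)
```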
  • Total Investments By Zip Code for all Sectors
    $1.2 Billion in Boston
    $7.3 Billion in San Francisco
    $2.9 Billion in Mountain View
    $1.7 Billion in Austin
  • Total Investments By Zip Code for Consumer Web
    $600 Million in Seattle
    $1.2 Billion in Chicago
    $1.7 Billion in San Francisco
  • Total Investments By Zip Code for BioTech
    $1.3 Billion in Cambridge
    $528 Million in Dallas
    $1.1 Billion in San Diego
  • Geospatial Encoding of Data (HP Confidential)
  • Steve’s Not so Excellent Adventure
    • Let’s try a Choropleth Encoding of the distribution of investment income by County
    • Wait, what is GeoJSON?
    • OK, the GeoJSON County is mapped to some code
    • Each County code has a value that corresponds to a palette color
    • So what are these codes? FIPS Codes? But Google returns 3 & 5 digit codes?!?
    • I found a 5 digit code list, it has A LOT of codes in it. I’m going to assume it’s correct because there is no way I can manually verify all of them
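The "each county code has a value that corresponds to a palette color" step can be sketched as a simple linear bucketing. Everything here is an assumption for illustration: the five-color palette, the function names, and the linear scale are not taken from the talk, and the two sample codes are the illustrative ones from the Pig RESULT on the next slide.

```python
from typing import Dict, List

# Hypothetical five-step palette, lightest to darkest.
PALETTE: List[str] = ["#eff3ff", "#bdd7e7", "#6baed6", "#3182bd", "#08519c"]


def color_for(value: float, max_value: float, palette: List[str] = PALETTE) -> str:
    """Pick a palette color by the linear position of value within [0, max_value]."""
    if max_value <= 0:
        return palette[0]
    fraction = min(max(value / max_value, 0.0), 1.0)
    # The top of the range would index one past the end, so clamp it.
    index = min(int(fraction * len(palette)), len(palette) - 1)
    return palette[index]


def colors_by_code(values: Dict[str, float]) -> Dict[str, str]:
    """Map each county code to a color, scaled against the largest value."""
    max_value = max(values.values(), default=0.0)
    return {code: color_for(value, max_value) for code, value in values.items()}


sample = {"51234": 5000000, "16234": 1234000}
colors = colors_by_code(sample)
print(colors)
```

Investment totals tend to be heavily skewed toward a few counties, so a logarithmic or quantile scale may spread the colors more usefully than the linear one sketched here.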
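  • Generating Investment Income By County
        -- Load the city-to-county lookup and the investment amounts (tab-delimited).
        FIPS = LOAD 'data/fips.txt' USING PigStorage('\t') AS (City, State, FIPSCode);
        Amt = LOAD 'data/equity.txt' USING PigStorage('\t') AS (City, State, Amount);
        -- Total the amounts per (City, State).
        AmtGroup = GROUP Amt BY (City, State);
        -- FLATTEN turns the group tuple back into named City and State fields for the join.
        SumGroup = FOREACH AmtGroup GENERATE FLATTEN(group) AS (City, State), SUM(Amt.Amount) AS Amount;
        JoinGroup = JOIN SumGroup BY (City, State), FIPS BY (City, State);
        Final = FOREACH JoinGroup GENERATE FIPS::FIPSCode, SumGroup::Amount;
    RESULT:
        51234    5000000
        16234    1234000
        (...)
    ALWAYS, ALWAYS check your output…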
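One way to follow the "always check your output" advice is to rerun the same group, sum, and join logic on a handful of rows whose answer is easy to work out by hand. The Python sketch below mirrors the Pig steps; the function names and the sample records are assumptions for illustration, not data from the talk's input files.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

CityKey = Tuple[str, str]  # (City, State)


def sum_by_city(equity: List[Tuple[str, str, int]]) -> Dict[CityKey, int]:
    """Equivalent of GROUP BY (City, State) followed by SUM(Amount)."""
    totals: Dict[CityKey, int] = defaultdict(int)
    for city, state, amount in equity:
        totals[(city, state)] += amount
    return dict(totals)


def join_fips(totals: Dict[CityKey, int],
              fips: List[Tuple[str, str, str]]) -> List[Tuple[str, int]]:
    """Inner join on (City, State): one output row per matching FIPS record."""
    return [(code, totals[(city, state)])
            for city, state, code in fips
            if (city, state) in totals]


# Illustrative sample records (assumed, not the talk's data files).
equity = [("Austin", "TX", 350000), ("Austin", "TX", 1200000),
          ("Austin", "TX", 2200000), ("Calabasas", "CA", 175000),
          ("Calabasas", "CA", 300000)]
fips = [("Austin", "TX", "48453"), ("Calabasas", "CA", "06037")]

result = join_fips(sum_by_city(equity), fips)
print(result)
```

Because the join is an inner join, a city missing from the lookup silently disappears from the result, which is one more reason to compare record counts before and after.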
  • But wait, why are there duplicate records? Apparently some cities can actually belong to two counties… I guess I’ll pick one.
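The duplicates arise because a city listed under two counties matches twice in the join. A minimal sketch of finding such cities and keeping a single county per city is shown below; the helper names, the keep-the-first-seen rule, and the sample lookup rows are all assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

CityKey = Tuple[str, str]          # (City, State)
FipsRow = Tuple[str, str, str]     # (City, State, FIPSCode)


def cities_with_multiple_counties(fips: List[FipsRow]) -> Dict[CityKey, List[str]]:
    """Return only the cities that map to more than one county code."""
    codes: Dict[CityKey, List[str]] = defaultdict(list)
    for city, state, code in fips:
        codes[(city, state)].append(code)
    return {key: found for key, found in codes.items() if len(found) > 1}


def pick_one_county(fips: List[FipsRow]) -> List[FipsRow]:
    """Keep the first code seen for each city (an arbitrary choice)."""
    seen = set()
    kept: List[FipsRow] = []
    for city, state, code in fips:
        if (city, state) not in seen:
            seen.add((city, state))
            kept.append((city, state, code))
    return kept


# Illustrative lookup rows (assumed): one city listed under two county codes.
lookup = [("Austin", "TX", "48453"), ("Austin", "TX", "48491"),
          ("Calabasas", "CA", "06037")]

print(cities_with_multiple_counties(lookup))
print(pick_one_county(lookup))
```

Picking one county attributes the whole city total to it, so listing the affected cities first makes it possible to judge how much that shortcut distorts the map.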
  • Yay, no duplicates. Let’s visualize this!
    • Wait, what happened to California?
    • Aaargh, I stored the FIPS codes in Pig as ints instead of chararrays, which trimmed off the leading zero. OK, I add them back. Voila! We have California.
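Adding the leading zeros back amounts to left-padding each code to five characters. A small Python sketch follows, assuming county codes are always five digits; `restore_fips` is a hypothetical helper name.

```python
def restore_fips(code: int) -> str:
    """Left-pad an integer county code with zeros to the 5-digit FIPS form."""
    return str(code).zfill(5)


# California's state prefix is 06, so its county codes lose a digit as integers.
print(restore_fips(6085))
print(restore_fips(48453))
```

Treating the codes as text from the start avoids the repair step, since identifiers like these are labels rather than quantities.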
  • On Error Checking…
    • Crowd sourced data has LOADS of errors in it, actually influencing your results. You need a good system that helps identify those errors. For example:
        Santa Clara, Ca
        Santa, Clara
        Santa, Clara CA
    • Track (count) input and output records. Examine the results. Something fishy?
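The counting advice can be sketched as a validation step that reports how many records went in, how many passed, and which ones were rejected. The expected "City, ST" shape encoded in the regular expression is an assumption for illustration, as are the names `CITY_STATE` and `split_valid`; the three malformed strings are the examples from the slide.

```python
import re
from typing import List, Tuple

# Assumed well-formed shape: city name, comma, space, two uppercase letters.
CITY_STATE = re.compile(r"^[A-Za-z .]+, [A-Z]{2}$")


def split_valid(records: List[str]) -> Tuple[List[str], List[str]]:
    """Separate records that match the expected shape from those that do not."""
    valid = [record for record in records if CITY_STATE.match(record)]
    rejected = [record for record in records if not CITY_STATE.match(record)]
    return valid, rejected


records = ["Santa Clara, CA", "Santa Clara, Ca", "Santa, Clara", "Santa, Clara CA"]
valid, rejected = split_valid(records)

# Track the counts: every input record should be accounted for.
print(f"in={len(records)} valid={len(valid)} rejected={len(rejected)}")
for record in rejected:
    print("rejected:", record)
```

Printing the rejected records rather than only counting them makes recurring error patterns visible, which helps decide whether to repair or drop them.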
  • Questions? Steve Watt, swatt@hp.com, @wattsteve, emergingafrican.com