Mining the Web for Information using Hadoop






  • Give a Nutch example


  Mining the web with Hadoop. Steve Watt, Emerging Technologies @ HP
  • Image credit: timsnell (Flickr)
  • Gathering Data: Apache Nutch (Web Crawler)
  • Using Apache Nutch: identify optimal seed URLs for a seed list and crawl to a depth of 2. For example:
      http://www.crunchbase.com/companies?c=a&q=private_held
      http://www.crunchbase.com/companies?c=b&q=private_held
      http://www.crunchbase.com/companies?c=c&q=private_held
      http://www.crunchbase.com/companies?c=d&q=private_held
      ...
    Crawl data is stored in sequence files in the segments dir on the HDFS.
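A seed list like the one above follows a simple pattern, so it can be generated rather than typed. The sketch below is only an illustration: the function names and the `seed.txt` file name are hypothetical, and it assumes the crawler reads a directory of plain-text files with one URL per line. Only the four letters shown on the slide are used.

```python
from pathlib import Path

# Listing-page URL pattern taken from the slide.
BASE = "http://www.crunchbase.com/companies"


def seed_urls(letters):
    """Build one listing-page URL per starting letter."""
    return [f"{BASE}?c={letter}&q=private_held" for letter in letters]


def write_seed_list(urls, seed_dir):
    """Write the URLs, one per line, to <seed_dir>/seed.txt (hypothetical name)."""
    seed_dir = Path(seed_dir)
    seed_dir.mkdir(parents=True, exist_ok=True)
    seed_file = seed_dir / "seed.txt"
    seed_file.write_text("\n".join(urls) + "\n")
    return seed_file


# Only the letters shown on the slide; extend the string for a wider crawl.
urls = seed_urls("abcd")
```

Calling `write_seed_list(urls, "urls")` would then produce the seed directory to hand to the crawler.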
  • ALSO
  • Making the data STRUCTURED: retrieve the HTML, do preliminary filtering on URL, populate a Company POJO, then write it out tab-delimited (\t)
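The structuring step can be sketched in a few lines. The talk used a Java POJO. The Python dataclass below is a stand-in, not the original code, and the `/company/` URL marker used for the preliminary filter is an assumption made for illustration. The HTML parsing itself is left out because the page layout is not shown in the deck.

```python
from dataclasses import dataclass, astuple


@dataclass
class CompanyRecord:
    """One funding round, with the columns used in the tab-delimited output."""
    company: str
    city: str
    state: str
    country: str
    sector: str
    round: str
    day: int
    month: int
    year: int
    amount: int
    investors: str = ""


def keep_url(url, marker="/company/"):
    """Preliminary filter: keep only pages that look like company pages.

    The marker is a hypothetical choice, not taken from the talk.
    """
    return marker in url


def to_tsv(record):
    """Serialize a record as one tab-delimited line (no trailing newline)."""
    return "\t".join(str(value) for value in astuple(record))


row = CompanyRecord("InfoChimps", "Austin", "TX", "USA", "Enterprise",
                    "Angel", 14, 9, 2010, 350000, "Stage One Capital")
line = to_tsv(row)
```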
  • The Result? Tab-delimited structured data…
      Company        City       State  Country  Sector       Round  Day  Month  Year  Amount   Investors
      InfoChimps     Austin     TX     USA      Enterprise   Angel  14   9      2010  350000   Stage One Capital
      InfoChimps     Austin     TX     USA      Enterprise   A      7    11     2010  1200000  DFJ Mercury
      MassRelevance  Austin     TX     USA      Enterprise   A      20   12     2010  2200000  Floodgate, AV, etc
      Masher         Calabasas  CA     USA      Games_Video  Seed   0    2      2009  175000
      Masher         Calabasas  CA     USA      Games_Video  Angel  11   8      2009  300000   Tech Coast Angels
    Note: I dropped the ZipCode because it didn’t occur consistently.
  • Time to Analyze/Visualize the data… Step 1: Select the right visual encoding for your questions. Let’s start by asking questions and seeing what we can learn from some simple Bar Charts…
  • Total Tech Investments By Year
  • Investment Funding By Sector
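The numbers behind bar charts like these are plain group-and-sum totals over the tab-delimited records. The sketch below runs on the five sample rows shown earlier in the deck. The function name is hypothetical, and the real analysis ran on Hadoop over the full data set rather than in a single Python process.

```python
import csv
import io
from collections import defaultdict

COLUMNS = ["Company", "City", "State", "Country", "Sector", "Round",
           "Day", "Month", "Year", "Amount", "Investors"]

# The five sample rows from the deck, tab-delimited.
SAMPLE = (
    "InfoChimps\tAustin\tTX\tUSA\tEnterprise\tAngel\t14\t9\t2010\t350000\tStage One Capital\n"
    "InfoChimps\tAustin\tTX\tUSA\tEnterprise\tA\t7\t11\t2010\t1200000\tDFJ Mercury\n"
    "MassRelevance\tAustin\tTX\tUSA\tEnterprise\tA\t20\t12\t2010\t2200000\tFloodgate, AV, etc\n"
    "Masher\tCalabasas\tCA\tUSA\tGames_Video\tSeed\t0\t2\t2009\t175000\n"
    "Masher\tCalabasas\tCA\tUSA\tGames_Video\tAngel\t11\t8\t2009\t300000\tTech Coast Angels\n"
)


def total_by(tsv_text, column):
    """Sum the Amount column grouped by the given column name."""
    reader = csv.DictReader(io.StringIO(tsv_text), fieldnames=COLUMNS,
                            delimiter="\t", quoting=csv.QUOTE_NONE)
    totals = defaultdict(int)
    for row in reader:
        totals[row[column]] += int(row["Amount"])
    return dict(totals)


by_year = total_by(SAMPLE, "Year")
by_sector = total_by(SAMPLE, "Sector")
```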
  • Total Investments By Zip Code for all Sectors: $1.2 Billion in Boston, $7.3 Billion in San Francisco, $2.9 Billion in Mountain View, $1.7 Billion in Austin
  • Total Investments By Zip Code for Consumer Web: $600 Million in Seattle, $1.2 Billion in Chicago, $1.7 Billion in San Francisco
  • Total Investments By Zip Code for BioTech: $1.3 Billion in Cambridge, $528 Million in Dallas, $1.1 Billion in San Diego
  • Steve’s Not so Excellent Adventure
      • Let’s try a Choropleth Encoding of the distribution of investment income by County
      • Wait, what is GeoJSON?
      • OK, the GeoJSON County is mapped to some code
      • Each County code has a value that corresponds to a palette color
      • So what are these codes? FIPS Codes? But Google returns 3 & 5 digit codes?!?
      • I found a 5 digit code list, it has A LOT of codes in it. I’m going to assume it’s correct because there is no way I can manually verify all of them
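The code-to-value-to-color chain can be sketched as follows. This is an illustration under stated assumptions, not the talk's visualization code: the palette and equal-width binning are hypothetical choices, and each GeoJSON feature is assumed to carry its county code in the optional `id` member. The two codes and amounts are the sample values from the deck's Pig output.

```python
# Hypothetical four-step palette, lightest to darkest.
PALETTE = ["#eff3ff", "#bdd7e7", "#6baed6", "#2171b5"]


def color_for(value, max_value, palette=PALETTE):
    """Map a value in [0, max_value] to one palette color (equal-width bins)."""
    if max_value <= 0:
        return palette[0]
    index = int(value / max_value * len(palette))
    return palette[min(index, len(palette) - 1)]


def color_features(geojson, amount_by_code):
    """Return {feature id: color}; counties with no data get the lightest color."""
    top = max(amount_by_code.values(), default=0)
    return {
        feature["id"]: color_for(amount_by_code.get(feature["id"], 0), top)
        for feature in geojson["features"]
    }


# Minimal FeatureCollection; geometry is omitted because only ids matter here.
counties = {"type": "FeatureCollection", "features": [
    {"type": "Feature", "id": "51234", "properties": {}, "geometry": None},
    {"type": "Feature", "id": "16234", "properties": {}, "geometry": None},
]}
colors = color_features(counties, {"51234": 5000000, "16234": 1234000})
```

Equal-width bins are the simplest option. With heavily skewed amounts, quantile or logarithmic bins may spread the colors more usefully.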
  • Generating Investment Income By County
      FIPS = LOAD 'data/fips.txt' USING PigStorage('\t') AS (City, State, FIPSCode);
      Amt = LOAD 'data/equity.txt' USING PigStorage('\t') AS (City, State, Amount);
      AmtGroup = GROUP Amt BY (City, State);
      -- FLATTEN turns the group tuple back into City and State so the join keys exist
      SumGroup = FOREACH AmtGroup GENERATE FLATTEN(group) AS (City, State), SUM(Amt.Amount) AS Amount;
      JoinGroup = JOIN SumGroup BY (City, State), FIPS BY (City, State);
      Final = FOREACH JoinGroup GENERATE FIPSCode, Amount;
    RESULT:
      51234  5000000
      16234  1234000
      (...)
    ALWAYS, ALWAYS check your output…
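For readers who do not know Pig, the same group, sum, and join logic can be written in plain Python. This is a single-machine sketch for checking the logic on a small sample, not a replacement for the Hadoop job. The function names are hypothetical. The amounts are the sample rows from earlier in the deck, and the two county codes were added here for illustration (Travis County, TX and Los Angeles County, CA).

```python
from collections import defaultdict


def sum_by_city(equity_rows):
    """GROUP BY (city, state) and SUM(amount)."""
    totals = defaultdict(int)
    for city, state, amount in equity_rows:
        totals[(city, state)] += int(amount)
    return dict(totals)


def join_fips(totals, fips_rows):
    """Inner join on (city, state): one output row per matching FIPS row."""
    joined = []
    for city, state, fips_code in fips_rows:
        key = (city, state)
        if key in totals:
            joined.append((fips_code, totals[key]))
    return joined


equity = [
    ("Austin", "TX", "350000"),
    ("Austin", "TX", "1200000"),
    ("Austin", "TX", "2200000"),
    ("Calabasas", "CA", "175000"),
    ("Calabasas", "CA", "300000"),
]
# County codes kept as strings so leading zeros survive.
fips = [("Austin", "TX", "48453"), ("Calabasas", "CA", "06037")]

result = join_fips(sum_by_city(equity), fips)
```

Because the join emits one row per matching FIPS row, a city listed under two counties shows up twice in the output.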
  • But wait, why are there duplicate records? Apparently some cities can actually belong to two counties… I guess I’ll pick one.
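"Pick one" can be made explicit by keeping the first county seen for each city. The sketch below is one hypothetical way to do that before the join. Which county to keep is an arbitrary choice here, as it was in the talk, and the function name is made up for illustration.

```python
def pick_one_county(fips_rows):
    """Keep the first FIPS code seen for each (city, state) pair."""
    chosen = {}
    for city, state, fips_code in fips_rows:
        # setdefault leaves an existing entry alone, so later duplicates are ignored.
        chosen.setdefault((city, state), fips_code)
    return [(city, state, code) for (city, state), code in chosen.items()]


# Austin, TX spans more than one county; both codes are real county codes.
rows = [
    ("Austin", "TX", "48453"),
    ("Austin", "TX", "48491"),
    ("Calabasas", "CA", "06037"),
]
deduped = pick_one_county(rows)
```

Keeping the first match silently attributes all of a city's funding to one county, which is worth remembering when reading the map.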
  • Yay, no duplicates. Let’s visualize this!
      • Wait, what happened to California?
      • Aaargh, I stored the FIPS codes in Pig as ints instead of chararrays, which trimmed off the leading zero. OK, I add them back. Voila! We have California.
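The leading-zero problem is easy to reproduce outside Pig: any integer conversion drops the zero, and padding back to five digits restores it. A minimal sketch follows, with a hypothetical helper name. `06037` is the code for Los Angeles County, California.

```python
def normalize_fips(code, width=5):
    """Return a county FIPS code as a zero-padded string of the given width."""
    return str(code).strip().zfill(width)


as_int = int("06037")             # the leading zero is lost here
restored = normalize_fips(as_int)  # padded back to five digits
```

Treating codes as strings from the start avoids the repair step entirely.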
  • On Error Checking…
      • Crowd-sourced data has LOADS of errors in it, enough to actually influence your results. You need a good system that helps identify those errors. For example, the same city appears as:
          • Santa Clara, Ca
          • Santa, Clara
          • Santa, Clara CA
      • Track (count) input and output records. Examine the results. Something fishy?
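Counting records in and out can be combined with a simple validity check so that bad rows are set aside instead of silently skewing totals. The sketch below is one hypothetical approach: it assumes locations arrive as a single "City, ST" string and accepts only a two-letter state after the last comma. It does not attempt to repair the bad variants.

```python
import re

STATE_RE = re.compile(r"^[A-Z]{2}$")


def parse_location(raw):
    """Return (city, state) for 'City, ST' input, or None if it looks malformed."""
    city, sep, state = raw.rpartition(",")
    city = city.strip()
    state = state.strip().upper()
    if not sep or not city or not STATE_RE.match(state):
        return None
    return city, state


def clean(records):
    """Split records into parsed rows and rejected raw strings."""
    kept, rejected = [], []
    for raw in records:
        parsed = parse_location(raw)
        if parsed is None:
            rejected.append(raw)
        else:
            kept.append(parsed)
    return kept, rejected


records = ["Santa Clara, Ca", "Santa, Clara", "Santa, Clara CA"]
kept, rejected = clean(records)
# Input and output counts should always reconcile.
counts = {"in": len(records), "kept": len(kept), "rejected": len(rejected)}
```

A high rejected count is the "something fishy" signal: it tells you how much of the data your totals are missing.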
  • Questions? Steve Watt, swatt@hp.com, @wattsteve, emergingafrican.com