1
– Someday Soon (Flickr)
Mining the web with Hadoop
Steve Watt Emerging Technologies @ HP
2
– timsnell (Flickr)
3
Gathering Data
Data Marketplaces
4
5
6
Gathering Data
Apache Nutch
(Web Crawler)
7
Pascal Terjan (Flickr)
8
9
10
Using Apache
Identify Optimal Seed URLs for a Seed List & Crawl to a depth of 2
For example:
http://www.crunchbase.com/...
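The seed list on this slide follows one URL pattern in which only the index letter changes. A minimal sketch of generating such a list is shown here in Python; it assumes the index runs from a to z (the slide only shows a through d), and the function names are hypothetical helpers, not part of Nutch.

```python
from string import ascii_lowercase

# Pattern copied from the example URLs in the deck; only the index letter varies.
SEED_PATTERN = "http://www.crunchbase.com/companies?c={letter}&q=private_held"


def build_seed_urls(letters=ascii_lowercase):
    """Return one seed URL per index letter (assumption: the index covers a-z)."""
    return [SEED_PATTERN.format(letter=letter) for letter in letters]


def write_seed_file(path, urls):
    """Write one URL per line, the usual plain-text layout for a crawler seed list."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(urls) + "\n")


seed_urls = build_seed_urls()
```

Calling `write_seed_file("urls/seed.txt", seed_urls)` would produce a file that can be handed to the crawler; the path is only an illustration.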
11
ALSO
12
Company POJO then \t Out
Prelim Filtering on URL
Making the data STRUCTURED
Retrieving HTML
13
Company City State Country Sector Round Day Month Year Amount Investors
InfoChimps Austin TX USA Enterprise Angel 14 9 ...
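The "POJO then tab-delimited out" step can be sketched as a small record type that serializes to one tab-separated line per funding round. This is a Python stand-in for the Java POJO the deck mentions; the class name, field names and methods are assumptions, and only the column order and the sample row come from the slides.

```python
from dataclasses import astuple, dataclass


@dataclass
class FundingRound:
    # Field order mirrors the column header on the slide.
    company: str
    city: str
    state: str
    country: str
    sector: str
    funding_round: str
    day: int
    month: int
    year: int
    amount: int
    investors: str

    def to_tsv(self) -> str:
        """Serialize the record as a single tab-delimited line."""
        return "\t".join(str(value) for value in astuple(self))

    @classmethod
    def from_tsv(cls, line: str) -> "FundingRound":
        """Parse a tab-delimited line; reject rows with the wrong field count."""
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 11:
            raise ValueError(f"expected 11 fields, got {len(fields)}")
        numbers = [int(value) for value in fields[6:10]]
        return cls(*fields[:6], *numbers, fields[10])


# Sample row taken from the deck's table.
row = FundingRound("InfoChimps", "Austin", "TX", "USA", "Enterprise",
                   "Angel", 14, 9, 2010, 350000, "Stage One Capital")
line = row.to_tsv()
```

Rejecting rows with an unexpected field count at parse time is one way to surface inconsistent source data early, which matters for the error-checking advice later in the deck.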
14
Time to Analyze/Visualize the data…
Step 1: Select the right visual encoding for your questions
Let's start by asking que...
*Total Tech Investments By Year
*Total Tech Investments By Year
*Total Tech Investments By Year
*Investment Funding By Sector
18
Total Investments By Zip Code for all Sectors
$7.3 Billion in San Francisco
$2.9 Billion in Mountain View
$1.2 Billion ...
19
Total Investments By Zip Code for all Sectors
$7.3 Billion in San Francisco
$2.9 Billion in Mountain View
$1.2 Billion ...
20
Total Investments By Zip Code for Consumer Web
$1.2 Billion in Chicago
$600 Million in Seattle
$1.7 Billion in San Fran...
21
Total Investments By Zip Code for BioTech
$1.3 Billion in Cambridge
$528 Million in Dallas
$1.1 Billion in San Diego
22
HP Confidential
Geospatial Encoding of Data
Steve’s Not so Excellent Adventure
23
• Let’s try a Choropleth Encoding of the distribution of investment income by County...
Generating Investment Income By County
24
FIPS = LOAD 'data/fips.txt' USING PigStorage('\t') AS (City, State, FIPSCode);
Am...
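The Pig script on this slide groups funding amounts by (City, State), sums them, and joins the totals to the FIPS list. The same logic can be sketched in plain Python to make the output easy to inspect. The function names are hypothetical, the amounts are the Austin and Calabasas rows from the deck's table, and the two FIPS codes are illustrative county codes rather than the author's data file.

```python
from collections import defaultdict


def sum_by_city(equity_rows):
    """GROUP BY (city, state) and SUM(amount)."""
    totals = defaultdict(int)
    for city, state, amount in equity_rows:
        totals[(city, state)] += amount
    return dict(totals)


def join_with_fips(totals, fips_rows):
    """Inner join on (city, state): one output row per matching FIPS row."""
    joined = []
    for city, state, fips_code in fips_rows:
        key = (city, state)
        if key in totals:
            joined.append((fips_code, totals[key]))
    return joined


equity = [
    ("Austin", "TX", 350000),
    ("Austin", "TX", 1200000),
    ("Austin", "TX", 2200000),
    ("Calabasas", "CA", 175000),
    ("Calabasas", "CA", 300000),
]
# FIPS codes are kept as strings so leading zeros survive.
fips = [("Austin", "TX", "48453"), ("Calabasas", "CA", "06037")]

result = join_with_fips(sum_by_city(equity), fips)
```

Because the join emits one row per matching FIPS row, a city listed under two counties appears twice in the output, which is the duplicate-record symptom the next slide describes.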
But wait, why are there duplicate records?
25
Apparently some cities can actually belong to two counties… I guess I’ll pic...
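One way to make the "pick one county" decision explicit is to list every city that maps to more than one FIPS code before choosing. The sketch below assumes rows of (city, state, FIPS code); the helper names, the lowest-code tie-break rule and the sample rows (including the made-up "Twin City" entries) are hypothetical.

```python
from collections import defaultdict


def find_multi_county_cities(fips_rows):
    """Return {(city, state): [codes]} for cities listed under several counties."""
    codes = defaultdict(set)
    for city, state, fips_code in fips_rows:
        codes[(city, state)].add(fips_code)
    return {key: sorted(value) for key, value in codes.items() if len(value) > 1}


def pick_one_county(fips_rows):
    """Keep exactly one code per city; the lowest code wins (an arbitrary rule)."""
    chosen = {}
    for city, state, fips_code in fips_rows:
        key = (city, state)
        if key not in chosen or fips_code < chosen[key]:
            chosen[key] = fips_code
    return chosen


# Made-up rows: "Twin City" is listed under two counties.
sample = [
    ("Twin City", "ZZ", "99003"),
    ("Twin City", "ZZ", "99001"),
    ("Solo Town", "ZZ", "99005"),
]
ambiguous = find_multi_county_cities(sample)
deduplicated = pick_one_county(sample)
```

Reporting the ambiguous cities alongside the de-duplicated mapping keeps a record of where an arbitrary choice was made.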
Yay, no duplicates. Let's visualize this!
26
• Wait, what happened to California?
• Aaargh, I stored the FIPS codes in Pig...
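The missing-California problem comes from treating FIPS codes as integers: California county codes start with "06", so the leading zero is lost and the code no longer matches the map data. A minimal Python illustration of the failure and the repair follows; `fips_to_str` is a hypothetical helper, and the code 06085 is used only as an example of a California county code.

```python
def fips_to_str(code, width=5):
    """Render a county FIPS code as a zero-padded, fixed-width string."""
    return str(code).strip().zfill(width)


# Parsing as an integer silently drops the leading zero.
as_int = int("06085")

# Padding back to five characters restores the original code.
repaired = fips_to_str(as_int)
```

Loading the column as text in the first place avoids the repair step entirely; padding afterwards is the fallback when the data has already been stored as numbers.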
On Error Checking…
27
• Crowdsourced data has LOADS of errors in it, and they actually influence your results. You need a good s...
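The advice to count records and look for suspicious variants can be sketched as follows. The three sample strings are the "Santa Clara" variants quoted in the deck; the normalization rules (lowercase, drop punctuation, drop a trailing state code) and the tiny `state_codes` set are assumptions for illustration, not a complete cleaning system.

```python
import re
from collections import Counter


def normalize_location(raw, state_codes=frozenset({"ca"})):
    """Lowercase, strip punctuation, and drop a trailing state code if present."""
    tokens = re.sub(r"[^a-z\s]", " ", raw.lower()).split()
    if len(tokens) > 1 and tokens[-1] in state_codes:
        tokens = tokens[:-1]
    return " ".join(tokens)


def summarize(raw_locations):
    """Track input count versus distinct normalized locations."""
    by_location = Counter(normalize_location(raw) for raw in raw_locations)
    return {
        "input_records": len(raw_locations),
        "distinct_locations": len(by_location),
        "by_location": dict(by_location),
    }


variants = ["Santa Clara, Ca", "Santa, Clara", "Santa, Clara CA"]
summary = summarize(variants)
```

A large gap between the raw distinct count and the normalized distinct count is a hint that spelling variants are splitting totals across several keys.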
28
HP Confidential
29
Questions?
Steve Watt swatt@hp.com
@wattsteve
emergingafrican.com
Mining the Web for Information using Hadoop
Published in: Technology, Business
  • Give a Nutch example
  • Mining the Web for Information using Hadoop

    1. Mining the web with Hadoop. Steve Watt, Emerging Technologies @ HP. Image credit: Someday Soon (Flickr)
    2. Image credit: timsnell (Flickr)
    3. Gathering Data: Data Marketplaces
    4.
    5.
    6. Gathering Data: Apache Nutch (Web Crawler)
    7. Image credit: Pascal Terjan (Flickr)
    8.
    9.
    10. Using Apache. Identify Optimal Seed URLs for a Seed List & Crawl to a depth of 2. For example:
        http://www.crunchbase.com/companies?c=a&q=private_held
        http://www.crunchbase.com/companies?c=b&q=private_held
        http://www.crunchbase.com/companies?c=c&q=private_held
        http://www.crunchbase.com/companies?c=d&q=private_held
        . . .
        Crawl data is stored in sequence files in the segments dir on the HDFS.
    11. ALSO
    12. Company POJO then \t Out. Prelim Filtering on URL. Making the data STRUCTURED. Retrieving HTML.
    13. The Result? Tab Delimited Structured Data…
        Company        City       State  Country  Sector       Round  Day  Month  Year  Amount   Investors
        InfoChimps     Austin     TX     USA      Enterprise   Angel  14   9      2010  350000   Stage One Capital
        InfoChimps     Austin     TX     USA      Enterprise   A      7    11     2010  1200000  DFJ Mercury
        MassRelevance  Austin     TX     USA      Enterprise   A      20   12     2010  2200000  Floodgate, AV, etc
        Masher         Calabasas  CA     USA      Games_Video  Seed   0    2      2009  175000
        Masher         Calabasas  CA     USA      Games_Video  Angel  11   8      2009  300000   Tech Coast Angels
        Note: I dropped the ZipCode because it didn’t occur consistently.
    14. Time to Analyze/Visualize the data… Step 1: Select the right visual encoding for your questions. Let’s start by asking questions & seeing what we can learn from some simple Bar Charts…
    15. *Total Tech Investments By Year
    16. *Total Tech Investments By Year
    17. *Investment Funding By Sector
    18. Total Investments By Zip Code for all Sectors: $7.3 Billion in San Francisco, $2.9 Billion in Mountain View, $1.2 Billion in Boston, $1.7 Billion in Austin
    19. Total Investments By Zip Code for all Sectors: $7.3 Billion in San Francisco, $2.9 Billion in Mountain View, $1.2 Billion in Boston, $1.7 Billion in Austin
    20. Total Investments By Zip Code for Consumer Web: $1.2 Billion in Chicago, $600 Million in Seattle, $1.7 Billion in San Francisco
    21. Total Investments By Zip Code for BioTech: $1.3 Billion in Cambridge, $528 Million in Dallas, $1.1 Billion in San Diego
    22. HP Confidential. Geospatial Encoding of Data
    23. Steve’s Not so Excellent Adventure
        • Let’s try a Choropleth Encoding of the distribution of investment income by County
        • Wait, what is GeoJSON?
        • OK, the GeoJSON County is mapped to some code
        • Each County code has a value that corresponds to a palette color
        • So what are these codes? FIPS Codes? But Google returns 3 & 5 digit codes?!?
        • I found a 5 digit code list, it has A LOT of codes in it. I’m going to assume it’s correct because there is no way I can manually verify all of them
    24. Generating Investment Income By County
        FIPS = LOAD 'data/fips.txt' USING PigStorage('\t') AS (City, State, FIPSCode);
        Amt = LOAD 'data/equity.txt' USING PigStorage('\t') AS (City, State, Amount);
        AmtGroup = GROUP Amt BY (City, State);
        SumGroup = FOREACH AmtGroup GENERATE FLATTEN(group) AS (City, State), SUM(Amt.Amount) AS Amount;
        JoinGroup = JOIN SumGroup BY (City, State), FIPS BY (City, State);
        Final = FOREACH JoinGroup GENERATE FIPSCode, Amount;
        RESULT:
        51234 5000000
        16234 1234000
        (...)
        ALWAYS, ALWAYS check your output…
    25. But wait, why are there duplicate records? Apparently some cities can actually belong to two counties… I guess I’ll pick one.
    26. Yay, no duplicates. Let’s visualize this!
        • Wait, what happened to California?
        • Aaargh, I stored the FIPS codes in Pig as ints instead of chararrays, which trimmed off the leading zero. OK, I add them back. Voila! We have California.
    27. On Error Checking…
        • Crowdsourced data has LOADS of errors in it, and they actually influence your results. You need a good system that helps identify those errors.
        • Santa Clara, Ca
        • Santa, Clara
        • Santa, Clara CA
        • Track (count) input and output records. Examine the results. Something fishy?
    28. HP Confidential
    29. Questions? Steve Watt, swatt@hp.com, @wattsteve, emergingafrican.com