Mining the Web for Information using Hadoop
Published in: Technology, Business
  • Give a Nutch example
  • Transcript

    • 1. Mining the web with Hadoop. Steve Watt, Emerging Technologies @ HP. Photo credit: Someday Soon (Flickr)
    • 2. Photo credit: timsnell (Flickr)
    • 3. Gathering Data: Data Marketplaces
    • 4.
    • 5.
    • 6. Gathering Data: Apache Nutch (Web Crawler)
    • 7. Photo credit: Pascal Terjan (Flickr)
    • 8.
    • 9.
    • 10. Using Apache Nutch: identify optimal seed URLs for a seed list and crawl to a depth of 2. For example:
          http://www.crunchbase.com/companies?c=a&q=private_held
          http://www.crunchbase.com/companies?c=b&q=private_held
          http://www.crunchbase.com/companies?c=c&q=private_held
          http://www.crunchbase.com/companies?c=d&q=private_held
          . . .
      Crawl data is stored in sequence files in the segments dir on HDFS.
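The seed URLs on slide 10 differ only in the `c` query parameter, so a seed list can be generated rather than typed by hand. The following is a minimal sketch in Python, not the author's method. The function name `build_seed_urls` is hypothetical, and the letters passed in below are only the four that appear on the slide; which further letters the elided part of the list covers is not stated.

```python
from urllib.parse import urlencode

# Base URL taken from the examples on slide 10.
BASE_URL = "http://www.crunchbase.com/companies"


def build_seed_urls(letters, query="private_held"):
    """Return one seed URL per starting letter, in the order given.

    Hypothetical helper: assumes every seed URL follows the
    ?c=<letter>&q=<query> pattern shown on the slide.
    """
    return [f"{BASE_URL}?{urlencode({'c': letter, 'q': query})}" for letter in letters]


# Only the four letters shown on the slide are used here.
seed_urls = build_seed_urls("abcd")
for url in seed_urls:
    print(url)
```

Written one URL per line to a text file, a list like this could serve as the crawler's seed list.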
    • 11. ALSO
    • 12. Making the data STRUCTURED: Retrieving HTML, Prelim Filtering on URL, Company POJO then \t Out
    • 13. The Result? Tab Delimited Structured Data…
          Company        City       State  Country  Sector       Round  Day  Month  Year  Amount   Investors
          InfoChimps     Austin     TX     USA      Enterprise   Angel  14   9      2010  350000   Stage One Capital
          InfoChimps     Austin     TX     USA      Enterprise   A      7    11     2010  1200000  DFJ Mercury
          MassRelevance  Austin     TX     USA      Enterprise   A      20   12     2010  2200000  Floodgate, AV, etc
          Masher         Calabasas  CA     USA      Games_Video  Seed   0    2      2009  175000
          Masher         Calabasas  CA     USA      Games_Video  Angel  11   8      2009  300000   Tech Coast Angels
      Note: I dropped the ZipCode because it didn’t occur consistently
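Slides 12 and 13 describe turning each crawled page into a company object and writing it out as one tab-delimited line. The talk's POJO is Java and is not shown, so the following is only a Python sketch of the same idea. The class name `FundingRound`, its field names, and the `to_tsv` method are hypothetical; the column order follows the table on slide 13.

```python
from dataclasses import dataclass, astuple


@dataclass
class FundingRound:
    """One funding round, with fields in the column order of slide 13.

    Hypothetical stand-in for the Company POJO mentioned on slide 12.
    """
    company: str
    city: str
    state: str
    country: str
    sector: str
    round_name: str
    day: int
    month: int
    year: int
    amount: int
    investors: str = ""  # some rows on the slide have no investor listed

    def to_tsv(self) -> str:
        """Serialize the record as a single tab-delimited line."""
        return "\t".join(str(value) for value in astuple(self))


# First row of the table on slide 13.
row = FundingRound("InfoChimps", "Austin", "TX", "USA", "Enterprise",
                   "Angel", 14, 9, 2010, 350000, "Stage One Capital")
line = row.to_tsv()
print(line)
```

Keeping the field order in one place means the reader of the file and the writer cannot drift apart on column positions.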
    • 14. Time to Analyze/Visualize the data… Step 1: Select the right visual encoding for your questions. Let’s start by asking questions and seeing what we can learn from some simple Bar Charts…
    • 15. *Total Tech Investments By Year
    • 16. *Total Tech Investments By Year
    • 17. *Investment Funding By Sector
    • 18. Total Investments By Zip Code for all Sectors: $7.3 Billion in San Francisco, $2.9 Billion in Mountain View, $1.2 Billion in Boston, $1.7 Billion in Austin
    • 19. Total Investments By Zip Code for all Sectors: $7.3 Billion in San Francisco, $2.9 Billion in Mountain View, $1.2 Billion in Boston, $1.7 Billion in Austin
    • 20. Total Investments By Zip Code for Consumer Web: $1.2 Billion in Chicago, $600 Million in Seattle, $1.7 Billion in San Francisco
    • 21. Total Investments By Zip Code for BioTech: $1.3 Billion in Cambridge, $528 Million in Dallas, $1.1 Billion in San Diego
    • 22. Geospatial Encoding of Data (HP Confidential)
    • 23. Steve’s Not so Excellent Adventure
          • Let’s try a Choropleth Encoding of the distribution of investment income by County
          • Wait, what is GeoJSON?
          • OK, the GeoJSON County is mapped to some code
          • Each County code has a value that corresponds to a palette color
          • So what are these codes? FIPS Codes? But Google returns 3 & 5 digit codes?!?
          • I found a 5 digit code list, it has A LOT of codes in it. I’m going to assume it’s correct because there is no way I can manually verify all of them
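Slide 23 says each county code has a value that corresponds to a palette color. The talk does not say how values were bucketed, so the following Python sketch assumes simple linear bucketing. The function name `palette_color` and the four hex colors are illustrative choices, not taken from the slides.

```python
def palette_color(value, max_value, palette):
    """Map a value in [0, max_value] to one of the palette's colors.

    Assumes equal-width (linear) buckets; the talk does not specify
    the scheme actually used.
    """
    if max_value <= 0:
        return palette[0]
    index = int(value / max_value * len(palette))
    # The maximum value would index one past the end, so clamp it.
    return palette[min(index, len(palette) - 1)]


# Illustrative light-to-dark palette (arbitrary hex colors).
PALETTE = ["#eff3ff", "#bdd7e7", "#6baed6", "#2171b5"]

print(palette_color(0, 100, PALETTE))    # lightest bucket
print(palette_color(50, 100, PALETTE))   # third bucket
print(palette_color(100, 100, PALETTE))  # darkest bucket
```

Investment totals are heavily skewed toward a few counties, so a real map might need logarithmic or quantile buckets instead; that choice is outside what the slides cover.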
    • 24. Generating Investment Income By County
          FIPS = LOAD 'data/fips.txt' USING PigStorage('\t') AS (City, State, FIPSCode);
          Amt = LOAD 'data/equity.txt' USING PigStorage('\t') AS (City, State, Amount);
          AmtGroup = GROUP Amt BY (City, State);
          -- Flatten the group key and name the sum so later steps can reference both
          SumGroup = FOREACH AmtGroup GENERATE FLATTEN(group) AS (City, State), SUM(Amt.Amount) AS Amount;
          JoinGroup = JOIN SumGroup BY (City, State), FIPS BY (City, State);
          Final = FOREACH JoinGroup GENERATE FIPS::FIPSCode, SumGroup::Amount;
          RESULT:
          51234 5000000
          16234 1234000
          (...)
      ALWAYS, ALWAYS check your output…
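One way to follow the advice to always check the output is to run the same group, sum, and join logic on a tiny sample whose answer can be worked out by hand. The following Python sketch does that for the pipeline on slide 24. The function name `investment_by_fips` is hypothetical, the amounts come from the table on slide 13, and the two county codes are illustrative sample values, not data from the talk.

```python
from collections import defaultdict


def investment_by_fips(equity_rows, fips_rows):
    """Sum amounts per (city, state), then inner-join to FIPS codes.

    Like the Pig JOIN, this emits one output row per matching FIPS row,
    so a city listed under two counties produces two rows.
    """
    totals = defaultdict(int)
    for city, state, amount in equity_rows:
        totals[(city, state)] += amount

    result = []
    for city, state, fips_code in fips_rows:
        if (city, state) in totals:
            result.append((fips_code, totals[(city, state)]))
    return result


# Amounts from slide 13; FIPS codes are illustrative sample values.
equity_rows = [
    ("Austin", "TX", 350000),
    ("Austin", "TX", 1200000),
    ("Austin", "TX", 2200000),
    ("Calabasas", "CA", 175000),
    ("Calabasas", "CA", 300000),
]
fips_rows = [("Austin", "TX", "48453"), ("Calabasas", "CA", "06037")]

result = investment_by_fips(equity_rows, fips_rows)
print(result)  # [('48453', 3750000), ('06037', 475000)]
```

The codes are kept as strings on purpose, so a leading zero survives the round trip.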
    • 25. But wait, why are there duplicate records? Apparently some cities can actually belong to two counties… I guess I’ll pick one.
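Slide 25 resolves cities that map to two counties by picking one. The slides do not say which one was picked, so this Python sketch assumes "keep the first code seen" and also reports what was dropped. The function name `pick_one_county` is hypothetical, and the codes in the sample are illustrative.

```python
def pick_one_county(fips_rows):
    """Keep the first FIPS code seen for each (city, state).

    Returns (chosen, duplicates): chosen maps each (city, state) to a
    single code; duplicates lists every code seen for the ambiguous keys.
    Keeping the first occurrence is an assumption, not the talk's rule.
    """
    chosen = {}
    duplicates = {}
    for city, state, fips_code in fips_rows:
        key = (city, state)
        if key in chosen:
            duplicates.setdefault(key, [chosen[key]]).append(fips_code)
        else:
            chosen[key] = fips_code
    return chosen, duplicates


# Illustrative sample: one city listed under two county codes.
sample = [
    ("Austin", "TX", "48453"),
    ("Austin", "TX", "48491"),
    ("Calabasas", "CA", "06037"),
]
chosen, duplicates = pick_one_county(sample)
print(chosen)
print(duplicates)
```

Reporting the dropped codes alongside the chosen ones makes the "I guess I'll pick one" decision visible instead of silent.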
    • 26. Yay, no duplicates. Let’s visualize this!
          • Wait, what happened to California?
          • Aaargh, I stored the FIPS codes in Pig as ints instead of chararrays, which trimmed off the leading zero. OK, I add them back. Voila! We have California.
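The fix described on slide 26, adding the leading zeros back, amounts to left-padding every code to five digits. A minimal Python sketch follows; the function name `restore_fips` is hypothetical, and the sample codes are illustrative.

```python
def restore_fips(code, width=5):
    """Return a FIPS code as a zero-padded string of the given width.

    Accepts an int or a string, so codes that lost their leading zero
    when stored as integers are repaired.
    """
    return str(code).zfill(width)


print(restore_fips(6037))     # 06037
print(restore_fips(48453))    # 48453
print(restore_fips("06037"))  # 06037
```

Padding repairs the damage after the fact; loading the column as text in the first place avoids it.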
    • 27. On Error Checking…
          • Crowd Sourced data has LOADS of errors in it. Actually influencing your results. You need a good system that helps identify those errors.
                • Santa Clara, Ca
                • Santa, Clara
                • Santa, Clara CA
          • Track (Count) input and output records. Examine the results. Something fishy?
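The advice on slide 27, identify malformed values and count records in and out, can be sketched as a small validation pass. This Python example assumes a well-formed location looks like "City, ST"; that pattern and the function name `clean_locations` are hypothetical, and the three inputs are the examples from the slide.

```python
import re

# Assumed shape of a well-formed location: "City Name, ST".
CITY_STATE = re.compile(r"^([A-Za-z .'-]+),\s*([A-Za-z]{2})$")


def clean_locations(raw_values):
    """Split raw location strings into cleaned (city, state) pairs and rejects."""
    cleaned, rejected = [], []
    for raw in raw_values:
        match = CITY_STATE.match(raw.strip())
        if match:
            cleaned.append((match.group(1).strip(), match.group(2).upper()))
        else:
            rejected.append(raw)
    return cleaned, rejected


raw_values = ["Santa Clara, Ca", "Santa, Clara", "Santa, Clara CA"]
cleaned, rejected = clean_locations(raw_values)

# Track input and output counts so dropped records are never silent.
print(f"in={len(raw_values)} cleaned={len(cleaned)} rejected={len(rejected)}")
print(cleaned)   # [('Santa Clara', 'CA')]
print(rejected)  # ['Santa, Clara', 'Santa, Clara CA']
```

Rejected values are kept rather than discarded so they can be inspected and, where the intent is obvious, corrected by hand.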
    • 28.
    • 29. Questions? Steve Watt, swatt@hp.com, @wattsteve, emergingafrican.com
