Mining the Web for Information using Hadoop
Published in: Technology, Business
  • Give a Nutch example
  • Transcript

    • 1. Mining the web with Hadoop. Steve Watt, Emerging Technologies @ HP (photo: Someday Soon, Flickr)
    • 2. (photo: timsnell, Flickr)
    • 3. Gathering Data: Data Marketplaces
    • 6. Gathering Data: Apache Nutch (Web Crawler)
    • 7. (photo: Pascal Terjan, Flickr)
    • 10. Using Apache Nutch: identify optimal seed URLs for a seed list and crawl to a depth of 2. For example:
          http://www.crunchbase.com/companies?c=a&q=private_held
          http://www.crunchbase.com/companies?c=b&q=private_held
          http://www.crunchbase.com/companies?c=c&q=private_held
          http://www.crunchbase.com/companies?c=d&q=private_held
          . . .
          Crawl data is stored in sequence files in the segments dir on the HDFS.
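A seed list like the one on this slide could be generated rather than typed by hand. The sketch below is a hypothetical helper, not part of Nutch. It assumes the listing pages are indexed by a single letter in the `c` parameter and that the pattern continues past the four letters shown on the slide; the function names and the output path `urls/seed.txt` are illustrative.

```python
import string
from pathlib import Path

# URL pattern copied from the slide; only the letter in `c=` varies.
SEED_TEMPLATE = "http://www.crunchbase.com/companies?c={letter}&q=private_held"


def build_seed_urls(letters=string.ascii_lowercase):
    """Return one seed URL per index letter.

    Assumption: the listing is indexed by the letters a-z. The slide only
    shows a-d, so pass a shorter `letters` string if that does not hold.
    """
    return [SEED_TEMPLATE.format(letter=letter) for letter in letters]


def write_seed_file(path, urls):
    """Write the URLs one per line into a plain-text seed file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(urls) + "\n", encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical output location; point it at your own seed directory.
    write_seed_file("urls/seed.txt", build_seed_urls())
```

The directory holding this file is what would be handed to the crawler as its seed directory; the exact crawl command depends on the Nutch version in use.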
    • 11. 11 ALSO
    • 12. Making the data STRUCTURED: Retrieving HTML, then Prelim Filtering on URL, then Company POJO, then \t (tab-delimited) out
    • 13. The Result? Tab Delimited Structured Data…
          Company        City       State  Country  Sector       Round  Day  Month  Year  Amount   Investors
          InfoChimps     Austin     TX     USA      Enterprise   Angel  14   9      2010  350000   Stage One Capital
          InfoChimps     Austin     TX     USA      Enterprise   A      7    11     2010  1200000  DFJ Mercury
          MassRelevance  Austin     TX     USA      Enterprise   A      20   12     2010  2200000  Floodgate, AV, etc
          Masher         Calabasas  CA     USA      Games_Video  Seed   0    2      2009  175000
          Masher         Calabasas  CA     USA      Games_Video  Angel  11   8      2009  300000   Tech Coast Angels
          Note: I dropped the ZipCode because it didn’t occur consistently
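The "POJO then tab-delimited out" step can be pictured with a small record type. The slides describe a Java Company POJO; what follows is only a Python stand-in under that assumption, and the class name `FundingRound` and its field names are hypothetical. It does no escaping, so a field that itself contains a tab would break the row.

```python
from dataclasses import dataclass, astuple


@dataclass
class FundingRound:
    """One funding round, with the same columns as the table on the slide."""
    company: str
    city: str
    state: str
    country: str
    sector: str
    funding_round: str
    day: int
    month: int
    year: int
    amount: int
    investors: str = ""  # some rounds list no investors

    def to_tsv(self):
        """Join the fields with tabs, in column order."""
        return "\t".join(str(value) for value in astuple(self))


# First data row from the slide.
row = FundingRound("InfoChimps", "Austin", "TX", "USA", "Enterprise",
                   "Angel", 14, 9, 2010, 350000, "Stage One Capital")
print(row.to_tsv())
```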
    • 14. Time to Analyze/Visualize the data… Step 1: Select the right visual encoding for your questions. Let's start by asking questions & seeing what we can learn from some simple Bar Charts…
    • 15. *Total Tech Investments By Year
    • 16. *Total Tech Investments By Year
    • 17. *Investment Funding By Sector
    • 18. Total Investments By Zip Code for all Sectors: $7.3 Billion in San Francisco; $2.9 Billion in Mountain View; $1.2 Billion in Boston; $1.7 Billion in Austin
    • 20. Total Investments By Zip Code for Consumer Web: $1.2 Billion in Chicago; $600 Million in Seattle; $1.7 Billion in San Francisco
    • 21. Total Investments By Zip Code for BioTech: $1.3 Billion in Cambridge; $528 Million in Dallas; $1.1 Billion in San Diego
    • 22. Geospatial Encoding of Data (HP Confidential)
    • 23. Steve’s Not so Excellent Adventure
      • Let’s try a Choropleth Encoding of the distribution of investment income by County
      • Wait, what is GeoJSON?
      • OK, the GeoJSON County is mapped to some code
      • Each County code has a value that corresponds to a palette color
      • So what are these codes? FIPS Codes? But Google returns 3 & 5 digit codes?!?
      • I found a 5 digit code list, it has A LOT of codes in it. I’m going to assume it’s correct because there is no way I can manually verify all of them
    • 24. Generating Investment Income By County
          -- Load the city-to-FIPS lookup and the funding amounts (both tab-delimited)
          FIPS = LOAD 'data/fips.txt' USING PigStorage('\t') AS (City, State, FIPSCode);
          Amt = LOAD 'data/equity.txt' USING PigStorage('\t') AS (City, State, Amount);
          -- Total the amounts per (City, State)
          AmtGroup = GROUP Amt BY (City, State);
          SumGroup = FOREACH AmtGroup GENERATE FLATTEN(group) AS (City, State), SUM(Amt.Amount) AS Amount;
          -- Attach the FIPS code to each total
          JoinGroup = JOIN SumGroup BY (City, State), FIPS BY (City, State);
          Final = FOREACH JoinGroup GENERATE FIPSCode, Amount;
          RESULT:
          51234 5000000
          16234 1234000
          (...)
          ALWAYS, ALWAYS check your output…
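One way to act on "check your output" is to run the same group, sum and join logic over a handful of rows in plain Python and compare the totals by eye. This is a sketch for spot-checking only, not a replacement for the Pig job; the function names and the sample rows are made up for illustration.

```python
from collections import defaultdict


def sum_by_city(amounts):
    """Total the amounts per (city, state), like the GROUP and SUM steps."""
    totals = defaultdict(int)
    for city, state, amount in amounts:
        totals[(city, state)] += amount
    return dict(totals)


def join_fips(totals, fips_rows):
    """Inner-join the totals to FIPS rows on (city, state).

    A city listed under two FIPS codes yields two output rows, as a JOIN does.
    """
    return [(code, totals[(city, state)])
            for city, state, code in fips_rows
            if (city, state) in totals]


# Illustrative sample rows, not real output from the talk.
amounts = [("Austin", "TX", 350000), ("Austin", "TX", 1200000),
           ("Calabasas", "CA", 175000)]
fips_rows = [("Austin", "TX", "48453"), ("Calabasas", "CA", "06037")]
print(join_fips(sum_by_city(amounts), fips_rows))
```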
    • 25. But wait, why are there duplicate records? Apparently some cities can actually belong to two counties… I guess I’ll pick one.
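"Pick one" can be made explicit by collapsing the lookup to a single county per city before the join. The sketch below keeps the first code it sees for each (city, state) pair, which is an arbitrary choice and an assumption here; the slides do not say which county was kept. The function name and the sample rows are hypothetical.

```python
def pick_one_county(fips_rows):
    """Keep the first FIPS code seen for each (city, state) pair."""
    chosen = {}
    for city, state, code in fips_rows:
        # setdefault leaves an existing entry untouched, so the first row wins.
        chosen.setdefault((city, state), code)
    return [(city, state, code) for (city, state), code in chosen.items()]


# Made-up rows: one city that appears under two county codes.
rows = [("Exampleville", "TX", "11111"),
        ("Exampleville", "TX", "22222"),
        ("Otherton", "CA", "33333")]
print(pick_one_county(rows))
```

Because the choice is arbitrary, all of a city's investment total lands in whichever county happens to come first in the lookup file.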
    • 26. Yay, no duplicates. Let's visualize this!
      • Wait, what happened to California?
      • Aaargh, I stored the FIPS codes in Pig as ints instead of chararrays, which trimmed off the leading zero. OK, I add them back. Voila! We have California.
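Adding the zeros back amounts to left-padding each code to five digits. California's county codes begin with 06, so a code read as an integer loses its first digit. A minimal sketch, assuming five-digit county FIPS codes; the function name `restore_fips` is hypothetical.

```python
def restore_fips(code):
    """Turn a county FIPS code that was read as an int back into 5 digits."""
    text = str(code).strip()
    if not text.isdigit() or len(text) > 5:
        raise ValueError(f"not a county FIPS code: {code!r}")
    # zfill pads on the left with zeros up to the requested width.
    return text.zfill(5)


print(restore_fips(6085))     # a California code that lost its leading zero
print(restore_fips("48453"))  # already five digits, returned unchanged
```

Loading the column as text in the first place avoids the problem; the padding is only a repair for data that has already been through an integer type.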
    • 27. On Error Checking…
      • Crowd Sourced data has LOADS of errors in it, and they actually influence your results. You need a good system that helps identify those errors.
        • Santa Clara, Ca
        • Santa, Clara
        • Santa, Clara CA
      • Track (count) input and output records. Examine the results. Something fishy?
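Both ideas on this slide, collapsing spelling variants and counting records in and out, can be sketched in a few lines. This is an illustration under stated assumptions, not the system used in the talk: the state is assumed to be known separately, the function names are hypothetical, and "Snata Clara" is a made-up typo added to show an unmatched record.

```python
import re
from collections import Counter


def normalize_city(raw, state="CA"):
    """Reduce spelling variants of a city name to one lowercase key.

    Commas and repeated whitespace are collapsed, and a trailing state
    abbreviation is dropped. Assumption: the state is known separately.
    """
    words = re.sub(r"[,\s]+", " ", raw).strip().lower().split()
    if words and words[-1] == state.lower():
        words = words[:-1]
    return " ".join(words)


def count_records(raw_cities, known_cities, state="CA"):
    """Count input records and how many match a known city after cleanup."""
    counts = Counter()
    for raw in raw_cities:
        counts["input"] += 1
        key = "matched" if normalize_city(raw, state) in known_cities else "unmatched"
        counts[key] += 1
    return counts


variants = ["Santa Clara, Ca", "Santa, Clara", "Santa, Clara CA"]
print({normalize_city(v) for v in variants})
print(count_records(variants + ["Snata Clara"], {"santa clara"}))
```

If the matched count comes out well below the input count, something fishy is going on upstream and the unmatched records are the place to look.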
    • 29. Questions? Steve Watt, swatt@hp.com, @wattsteve, emergingafrican.com