Big Tools for Big Data
Slide notes
  • Introduction: the problem of big data; Hadoop, Map/Reduce, HDFS; BigSheets; Pig; the open source stack. Analytics: the meta tag example. Data management: ARC to WARC, JHOVE, format migration (FLV to MPEG-4?). Simple examples: Iraq Inquiry video link extraction; slash page crawl (election sites extraction); newspapers. Back to analytics: the next generation access tool, targeted at researchers (Cooliris, network / swirl, spreadsheet, Seadragon).
  • Straw poll of how much archive material there is in the room. 3 petabytes.
  • Add diagram?
  • Add PIG
  • IBM insight engine
  • New York Times example
  • Seadragon notes: Review the current access tool: search by title, URL, or full text; browse by subject or special collection. With more websites, search results are already in the millions. Provide tools to mine the data (renewable resource?).
Transcript

    1. Big Tools for Big Data: Analytics and Management at web scale. IIPC General Assembly, Singapore, May 2010. Lewis Crawford, Web Archiving Programme Technical Lead, British Library
    2. Big Data “the Petabyte age”
       • Internet Archive stores about 2 Petabytes of data and grows at 20TB a month
       • Large Hadron Collider 15PB / year
       • At the BL
           • Selective Web Archive growing at 200GB a month
           • Conservative estimate for Domain Crawl is 100TB
    3. The problem of big data
       • We can process data very quickly but we can read/write it very slowly
       • 1990: 1 GB disk, 4.4MB/s, read whole disk in 5 mins
       • 2010: 1 TB disk, 100MB/s, read whole disk in 2.5 hours
    4. The solution!
       • Solution: parallel reads
       • 1 HDD = 100 MB/sec
       • 1000 HDDs = 100 GB/sec
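The arithmetic behind slides 3 and 4 can be checked with a small sketch. It assumes decimal units (1 TB = 10^12 bytes) and ideal striping across drives with no seek, network, or coordination overhead; the function name `read_time_seconds` is made up for this illustration. Under those assumptions the single-disk figures come out at roughly 4 minutes and 2.8 hours, close to the rounded numbers on the slide, and 1000 drives read in parallel bring the 1 TB case down to about 10 seconds.

```python
MB = 10**6
GB = 10**9
TB = 10**12


def read_time_seconds(capacity_bytes, throughput_bytes_per_s, drives=1):
    """Time to read `capacity_bytes` when the data is spread evenly over
    `drives` disks that are all read at the same time (idealised: no overhead)."""
    return capacity_bytes / (throughput_bytes_per_s * drives)


# Single-disk cases from slide 3.
t_1990 = read_time_seconds(1 * GB, 4.4 * MB)
t_2010 = read_time_seconds(1 * TB, 100 * MB)

# Slide 4: the same 1 TB spread over 1000 drives read in parallel.
t_parallel = read_time_seconds(1 * TB, 100 * MB, drives=1000)

print(f"1990, 1 GB at 4.4 MB/s: {t_1990 / 60:.1f} minutes")
print(f"2010, 1 TB at 100 MB/s: {t_2010 / 3600:.1f} hours")
print(f"2010, 1 TB over 1000 drives: {t_parallel:.0f} seconds")
```

Real clusters will not reach the ideal aggregate rate, but the sketch shows why spreading data over many disks is the lever that matters.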
    5. Hadoop
       • 2002: Nutch Crawler - Doug Cutting
       • 2003: GFS http://labs.google.com/papers/gfs.html
       • 2004: Map Reduce http://labs.google.com/papers/mapreduce.html
       • 2005: Nutch moves to Map Reduce model with NDFS
       • 2006: NDFS and Map Reduce model become Hadoop, under …
       • 2008: Top level project at Apache
       • 2009: 17 clusters with 24,000 nodes at Yahoo!
           • 1TB sorted in 62 seconds
           • 100TB sorted in 173 minutes
    6. Hadoop Users
       • Yahoo!
           • More than 100,000 CPUs in >25,000 computers running Hadoop
           • Our biggest cluster: 4000 nodes (2*4cpu boxes w 4*1TB disk & 16GB RAM)
               • Used to support research for Ad Systems and Web Search
               • Also used to do scaling tests to support development of Hadoop on larger clusters
       • Baidu - the leading Chinese language search engine
           • Hadoop used to analyze the log of search and do some mining work on web page database
               • We handle about 3000TB per week
               • Our clusters vary from 10 to 500 nodes
       • Facebook
           • Use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning.
           • Currently we have 2 major clusters:
               • A 1100-machine cluster with 8800 cores and about 12 PB raw storage.
               • A 300-machine cluster with 2400 cores and about 3 PB raw storage.
               • Each (commodity) node has 8 cores and 12 TB of storage.
       Source: http://wiki.apache.org/hadoop/PoweredBy
    7. Nutchwax!
    8. [email_address]
    9. IBM Digital Democracy for the BBC
    10. Bigsheets!
    11. BigSheets and the open source stack: Top level Apache Project; Yahoo! Contributed open source; IBM Research Licence; Insight Engine; Spreadsheet Paradigm; SQL ‘like’ programming language; Distributed processing and file system
    12. Analytics - the meta tag example.
        • Extract meta data tags from all html files in the 2005 General Election Collection
        • Extract ‘keywords’ from metatags
        • Record all html pages into three separate ‘bags’ where keywords contained:
            • Tory, Conservative
            • Labour
            • Liberal, Lib Dem, Liberal Democrat
        • Analyse single and pairs of words in each of those ‘bags’ of data
        • Generate Tag clouds from the 50 most common words.
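The steps on slide 12 can be sketched on a single machine with the Python standard library. This is an illustration of the idea only, not the BigSheets or Pig job used in the talk: the names `PARTY_TERMS`, `MetaKeywordParser`, `bags_for` and `tag_cloud_counts` are invented here, the party terms are copied from the slide, and word pairs are formed from adjacent words in the flattened keyword list, which is an assumption about what "pairs of words" meant.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Terms for the three "bags" named on the slide (lower-cased).
PARTY_TERMS = {
    "conservative": ("tory", "conservative"),
    "labour": ("labour",),
    "liberal": ("liberal", "lib dem", "liberal democrat"),
}


class MetaKeywordParser(HTMLParser):
    """Collects the comma-separated content of <meta name="keywords"> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "keywords":
            for keyword in attr.get("content", "").split(","):
                if keyword.strip():
                    self.keywords.append(keyword.strip().lower())


def extract_keywords(html):
    parser = MetaKeywordParser()
    parser.feed(html)
    return parser.keywords


def bags_for(keywords):
    """Names of every bag with at least one term present in the keywords."""
    text = " , ".join(keywords)
    return {
        bag
        for bag, terms in PARTY_TERMS.items()
        if any(re.search(rf"\b{re.escape(term)}\b", text) for term in terms)
    }


def tag_cloud_counts(pages, top_n=50):
    """pages: iterable of HTML strings. Returns {bag: [(term, count), ...]}."""
    counts = {bag: Counter() for bag in PARTY_TERMS}
    for html in pages:
        keywords = extract_keywords(html)
        words = [word for keyword in keywords for word in keyword.split()]
        pairs = [" ".join(pair) for pair in zip(words, words[1:])]
        for bag in bags_for(keywords):
            counts[bag].update(words)   # single words
            counts[bag].update(pairs)   # adjacent word pairs
    return {bag: counter.most_common(top_n) for bag, counter in counts.items()}
```

The `(term, count)` lists returned per bag are the input a tag cloud renderer would need; at archive scale the same per-page logic would run as the map step of a distributed job.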
    13. Data management
    14. robots.txt example
    15. Robots.txt continued…
    16. Data management
        • High level management tool – Spreadsheet paradigm
        • Clean User interface
        • Straightforward programming model (UDFs)
        • Use cases:
            • ARC to WARC migration
            • Information package generation (SIP)
            • CDX indexes / Lucene indexes
            • JHOVE object validation / verification
            • Object format migration.
    17. Slash Page crawl - election sites extraction
        • Slash page (home page) of known UK domains
            • Data discarded after processing
        • Generate list of election terms (Political parties, Mori election tags)
        • Extract text from html pages using an HTML tag density algorithm
        • Identify all web pages that contain these words
        • Identify sites that contain two or more of the terms
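The filtering step on slide 17 can be sketched as follows. Note the substitution: text is extracted here with a plain tag stripper built on `html.parser`, not with the HTML tag density algorithm mentioned on the slide, whose details are not given. The names `TextExtractor`, `matched_terms` and `candidate_sites`, and the term list in the usage example, are hypothetical.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Plain tag stripper that ignores the content of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def extract_text(html):
    """Visible text of a page, lower-cased, with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split()).lower()


def matched_terms(text, terms):
    """Set of terms that occur as whole words or phrases in `text`."""
    return {
        term for term in terms
        if re.search(rf"\b{re.escape(term.lower())}\b", text)
    }


def candidate_sites(slash_pages, terms, min_terms=2):
    """slash_pages: {url: html}. Keeps pages with >= min_terms distinct terms."""
    hits = {}
    for url, html in slash_pages.items():
        found = matched_terms(extract_text(html), terms)
        if len(found) >= min_terms:
            hits[url] = sorted(found)
    return hits


if __name__ == "__main__":
    # Illustrative input only; the real term list came from political
    # parties and Mori election tags.
    pages = {
        "http://a.example/": "<h1>General Election</h1><p>Vote Labour</p>",
        "http://b.example/": "<p>Gardening tips</p>",
    }
    print(candidate_sites(pages, ["election", "labour", "manifesto"]))
```

Requiring two or more distinct terms is a cheap way to cut false positives before the manual verification step listed later in the deck.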
    18. Slash Page Data
    19. Text Extracted Using Tag Density Algorithm
    20. Election Key Terms
    21. Results
    22. Pie Chart Visualization
    23. Seeds With 2 Or More Terms
    24. Manual Verification
    25. Other potential digital material
        • Digital Books
        • Datasets
        • 19th Century Newspapers
    26. Back to analytics and the next generation access tools
        • Automatic Classification – WebDewey, LOC Subject Headings
            • Machine learning
        • Faceted Lucene indexes for Advanced Search functionality
        • Engage directly with Higher Education community
        • Access tool – researcher focus?
            • BL 3 year Research Behaviour Study
    27. Thank you!
        • [email_address]
        • http://uk.linkedin.com/in/lewiscrawford
        • 3x30 Nehalem-based node grids, with 2x4 cores, 16GB RAM, 8x1TB storage using ZFS in a JBOD configuration.
        • Hadoop and Pig for discovering People You May Know and other fun facts.
