More than 100,000 CPUs in >25,000 computers running Hadoop
Our biggest cluster: 4000 nodes (2*4cpu boxes with 4*1TB disk & 16GB RAM)
Used to support research for Ad Systems and Web Search
Also used to do scaling tests to support development of Hadoop on larger clusters
Baidu - the leading Chinese language search engine
Hadoop is used to analyze search logs and to do mining work on the web page database
We handle about 3000TB per week
Our clusters vary from 10 to 500 nodes
We use Hadoop to store copies of internal log and dimension data sources, which then serve as a source for reporting/analytics and machine learning.
Currently we have 2 major clusters:
A 1100-machine cluster with 8800 cores and about 12 PB raw storage.
A 300-machine cluster with 2400 cores and about 3 PB raw storage.
Each (commodity) node has 8 cores and 12 TB of storage.
IBM Digital Democracy for the BBC
BigSheets and the open source stack
Top level Apache Project
Yahoo! Contributed open source
IBM Research Licence
Insight Engine
Spreadsheet Paradigm
SQL ‘like’ programming language
Distributed processing and file system
Analytics - the meta tag example.
Extract metadata tags from all HTML files in the 2005 General Election Collection
Extract ‘keywords’ from meta tags
Record all HTML pages into three separate ‘bags’ based on the keywords they contain:
Liberal, Lib Dem, Liberal Democrat
Analyse single and pairs of words in each of those ‘bags’ of data
Generate tag clouds from the 50 most common words.
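The meta tag steps above can be sketched in plain Python. This is only an illustration, not the BigSheets implementation: the class and function names are hypothetical, the sample pages are made up, and it assumes that "pairs of words" means keywords that co-occur on the same page.

```python
from collections import Counter
from html.parser import HTMLParser
from itertools import combinations


class MetaKeywordParser(HTMLParser):
    """Collects the content of <meta name="keywords" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "keywords":
            content = attr.get("content") or ""
            # Keywords are conventionally comma separated.
            self.keywords.extend(
                k.strip().lower() for k in content.split(",") if k.strip()
            )


def extract_keywords(html):
    """Return the lower-cased meta keywords of one HTML page."""
    parser = MetaKeywordParser()
    parser.feed(html)
    return parser.keywords


def in_bag(keywords, bag_terms):
    """True if any page keyword equals one of the bag's terms."""
    terms = {t.lower() for t in bag_terms}
    return any(k in terms for k in keywords)


def count_words_and_pairs(pages_keywords):
    """Count single keywords and co-occurring keyword pairs across pages."""
    singles, pairs = Counter(), Counter()
    for keywords in pages_keywords:
        unique = sorted(set(keywords))
        singles.update(unique)
        pairs.update(combinations(unique, 2))
    return singles, pairs


# Made-up sample pages; the bag terms are the ones listed on the slide.
LIB_DEM_TERMS = ["Liberal", "Lib Dem", "Liberal Democrat"]
pages = [
    '<head><meta name="keywords" content="Lib Dem, tax, schools"></head>',
    '<head><meta name="Keywords" content="Liberal Democrat, tax"></head>',
    '<head><meta name="keywords" content="football"></head>',
]
bag = [kw for kw in map(extract_keywords, pages) if in_bag(kw, LIB_DEM_TERMS)]
singles, pairs = count_words_and_pairs(bag)
top_50 = singles.most_common(50)  # input for a tag cloud
```

On a real collection the per-page extraction would run as the map step and the counting as the reduce step; here both run in one process for brevity.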
High level management tool – Spreadsheet paradigm
Clean User interface
Straightforward programming model (UDFs)
ARC to WARC migration
Information package generation (SIP)
CDX indexes / Lucene indexes
JHOVE object validation / verification
Object format migration.
Slash Page crawl - election sites extraction
Slash page (home page) of known UK domains
Data discarded after processing
Generate list of election terms (Political parties, MORI election tags)
Extract text from HTML pages using an HTML tag density algorithm
Identify all web pages that contain these words
Identify sites that contain two or more of the terms
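A rough sketch of the slash page steps above follows. The slides do not say which tag density algorithm was used, so this substitutes a simple line-based heuristic: a line is kept as text when tags make up at most half of its characters. All function names, the 0.5 threshold, the sample pages, and the two example terms are assumptions made for illustration.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

TAG_RE = re.compile(r"<[^>]*>")


def tag_density(line):
    """Fraction of the line's characters that belong to HTML tags."""
    if not line.strip():
        return 1.0  # treat blank lines as pure markup so they are dropped
    tag_chars = sum(len(m) for m in TAG_RE.findall(line))
    return tag_chars / len(line)


def extract_text(html, max_density=0.5):
    """Keep lines with low tag density and strip their remaining tags."""
    kept = []
    for line in html.splitlines():
        if tag_density(line) <= max_density:
            text = TAG_RE.sub(" ", line)
            kept.append(" ".join(text.split()))
    return "\n".join(t for t in kept if t)


def terms_found(text, terms):
    """Return the subset of terms that occur in the text as whole words."""
    lowered = text.lower()
    return {
        t for t in terms
        if re.search(r"\b" + re.escape(t.lower()) + r"\b", lowered)
    }


def sites_with_multiple_terms(pages, terms, minimum=2):
    """pages: iterable of (url, html). Returns {host: terms} for hosts
    whose pages contain at least `minimum` distinct terms."""
    per_site = defaultdict(set)
    for url, html in pages:
        host = urlparse(url).netloc
        per_site[host] |= terms_found(extract_text(html), terms)
    return {h: t for h, t in per_site.items() if len(t) >= minimum}


# Made-up slash pages and example terms.
sample_terms = ["Labour", "Conservative"]
sample_pages = [
    ("http://a.example/",
     "<div><a href='/'>x</a></div>\n"
     "<p>Vote Labour or Conservative in the election</p>"),
    ("http://b.example/", "<p>Labour news only, nothing else here today</p>"),
]
seeds = sites_with_multiple_terms(sample_pages, sample_terms)
```

The resulting host list corresponds to the "seeds with 2 or more terms"; the per-host term sets could feed a chart of term frequencies.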
Slash Page Data
Text Extracted Using Tag Density Algorithm
Election Key Terms
Pie Chart Visualization
Seeds With 2 Or More Terms
Other potential digital material
19th Century Newspapers
Back to analytics and the next generation access tools
Automatic Classification – WebDewey, LOC Subject Headings
Faceted Lucene indexes for Advanced Search functionality
Engage directly with Higher Education community
Access tool – researcher focus?
BL 3 year Research Behaviour Study
3x30 Nehalem-based node grids, with 2x4 cores, 16GB RAM, 8x1TB storage using ZFS in a JBOD configuration.
Hadoop and Pig for discovering People You May Know and other fun facts.