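21.
// File-level imports this class needs:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper: emits (word, 1) for every whitespace-separated token in an input line.
public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();  // reused for every token instead of reallocating

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}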
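22.
// File-level imports this class needs:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reducer: sums the counts received for one word and writes (word, total).
public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();  // reused output value

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}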
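The mapper and reducer on the two slides above depend on the Hadoop runtime, so they cannot be run on their own. As a rough illustration of the same data flow, here is a minimal sketch in plain Java, assuming a single process and an in-memory grouping step in place of Hadoop's shuffle. The class name LocalWordCount and its map, reduce, and run methods are hypothetical stand-ins, not part of Hadoop or of the original slides.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical single-process imitation of the word-count job; not Hadoop code.
class LocalWordCount {

    // Map step: emit (token, 1) for each whitespace-separated token,
    // mirroring what TokenizerMapper does for one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<>(itr.nextToken(), 1));
        }
        return out;
    }

    // Reduce step: sum all values seen for one key, mirroring IntSumReducer.
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: map every line, group values by key (an in-memory stand-in
    // for the shuffle), then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            counts.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {be=3, not=1, or=1, quick=1, to=2}
        System.out.println(run(Arrays.asList("to be or not to be", "be quick")));
    }
}
```

In a real job the grouping happens across machines and the reducer may see its values as a stream rather than a list, which is why the Hadoop version takes an Iterable.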
23.
24.
25.
26. Use a parallel database system
• eBay – 10PB on 256 nodes
Use a NoSQL system
• Facebook – 20PB on 2700 nodes
• Bing – 150PB on 40K nodes
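To make the contrast in these figures concrete, the sketch below divides each quoted capacity by its node count. It assumes decimal units (1 PB = 1000 TB) and treats the numbers as raw totals, ignoring replication and compression, which the slide does not specify. The class and method names are illustrative only.

```java
// Hypothetical helper for the per-node arithmetic; figures are the ones quoted on the slide.
class PerNodeStorage {

    // Terabytes stored per node, assuming 1 PB = 1000 TB.
    static double tbPerNode(double petabytes, int nodes) {
        return petabytes * 1000.0 / nodes;
    }

    public static void main(String[] args) {
        System.out.println("eBay (parallel DB): " + tbPerNode(10, 256) + " TB/node");
        System.out.println("Facebook (NoSQL):   " + tbPerNode(20, 2700) + " TB/node");
        System.out.println("Bing (NoSQL):       " + tbPerNode(150, 40000) + " TB/node");
    }
}
```

Under these assumptions the parallel database example holds roughly 39 TB per node, against about 7.4 TB and 3.75 TB per node for the two NoSQL examples.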
27.
28. Data Model: Example Stores (apologies to the ones I did not list)
• Simple Key-Value Pairs: Memcache, Redis, Dynamo, Voldemort, LevelDB, Azure Caching
• Wide Sparse Column Sets: HyperTable, Big Table, Cassandra, HBase, Hyperbase, Amazon DynamoDB, Windows Azure Tables, SQL Server/Azure Sparse columns
• BLOBs: Amazon S3, Oracle Berkeley NoSQL, Windows Azure Blob Store, SQL Server RBS/FileTable
• JSON Documents: MongoDB, CouchBase, Riak, RavenDB
• Graph: Neo4J, GraphDB, HypergraphDB, Stig, Intellidimension
• Objects and XML Documents: Versant, Oracle Berkeley NoSQL, MarkLogic, existDB, EMC HiveDB, SQL Server/Azure, Oracle, IBM DB2
• Extended Relational: Oracle, EMC SQLFire, IBM DB2, MySQL, Postgres, SQL Server/Azure
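As a loose illustration of how three of these data models differ in shape, the sketch below builds each one from ordinary Java collections. This is an assumption-laden analogy, not how any of the listed stores is implemented. The class name, method names, keys, and values are all made up for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory analogies for three data models; not any product's API.
class DataModelShapes {

    // Simple key-value pair: the store sees only a key and an opaque value.
    static Map<String, String> keyValue() {
        Map<String, String> kv = new HashMap<>();
        kv.put("session:42", "opaque-serialized-value");
        return kv;
    }

    // Wide sparse column set: each row key maps to its own named columns,
    // so two rows need not share any column.
    static Map<String, Map<String, String>> wideSparseColumns() {
        Map<String, Map<String, String>> rows = new TreeMap<>();
        rows.computeIfAbsent("user1", k -> new TreeMap<>()).put("email", "a@example.com");
        rows.computeIfAbsent("user1", k -> new TreeMap<>()).put("name", "Ann");
        rows.computeIfAbsent("user2", k -> new TreeMap<>()).put("city", "Oslo");
        return rows;
    }

    // JSON-style document: nested fields, lists, and sub-documents under one record.
    static Map<String, Object> document() {
        Map<String, Object> address = new HashMap<>();
        address.put("city", "Oslo");
        Map<String, Object> doc = new HashMap<>();
        doc.put("name", "Ann");
        doc.put("tags", Arrays.asList("admin", "beta"));
        doc.put("address", address);
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(keyValue());
        System.out.println(wideSparseColumns());
        System.out.println(document());
    }
}
```

The point of the comparison is what the store can see: nothing inside the value in the key-value case, named columns per row in the wide-column case, and the full nested structure in the document case.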
29.
30.
31.
32.
33.
34.
35.
36. [Roadmap figure. Timeline columns (CY): H2 2011, H1 2012, H2 2012. Tracks: DATA MANAGEMENT, DATA ENRICHMENT, INSIGHTS]
• Hadoop on Azure CTP2: more capacity, stability improvements
• Hadoop on Azure Private CTP
• Hadoop on Server Private TAP
• Hadoop Core & Common
• JavaScript Framework
• Hadoop on Azure GA: Portal Integration & Billing; Azure SDK integration
• Hadoop on Server GA: JavaScript, Pig, Hive, HBase; Active Directory Integration; System Center Integration
• Hive ODBC Driver
• Azure Labs
• Data Explorer
• Social Analytics
• Private Data Market
• Hadoop Connectors
• Azure Data Market
• Excel Integration
• Hive Add-in for Excel
• PowerPivot Add-in for Excel
• Power View for SharePoint Office 15 Integration
37. [Architecture diagram. Components: Browser (HTML Page, AJAX); Jetty Server; J2EE Servlets; Job Depot; Query Translator; Processes (hadoop, pig, hive); Web Resources; FsShell]