Galaxy of bits

Surviving the flood of information

Michał Żyliński, Microsoft
(michal.zylinski@microsoft.com)

In 2000 the Sloan Digital Sky Survey collected more data in its first
week than had been collected in the entire history of astronomy

By 2016 the new Large Synoptic Survey Telescope in Chile will
acquire 140 terabytes in 5 days, more than Sloan acquired in 10
years

The Large Hadron Collider at CERN generates 40 terabytes of data
every second

Sources: The Economist, Feb ‘10; IDC

Bing ingests more than 7 petabytes a month

The Twitter community generates over 1 terabyte of tweets every day

Cisco predicts that by 2013 annual internet traffic will reach 667
exabytes

Sources: The Economist, Feb ‘10; DBMS2; Microsoft Corp

1,800,000,000,000,000,000,000 bytes
The size of the Digital Universe in 2011

[Chart: size of the Digital Universe in ZB (axis 0 to 9) for 2010, 2011, 2012 and 2015]

Within 24 months the number of intelligent devices will exceed the number of traditional IT devices

In 2015 nearly 20% of the information will be touched by cloud
Sources: IDC Digital Universe Study 2011, Worldwide Big Data Technology and Services 2012–2015 Forecast

But... How real is it?

Financial Services
• Modeling True Risk
• Threat Analysis
• Fraud Detection
• Trade Surveillance
• Credit scoring and analysis

Retail
• Point of Sales Transaction Analysis
• Customer Churn Analysis
• Sentiment Analysis

E-Commerce
• Recommendation Engines
• Ad Targeting
• Search Quality
• Abuse and click fraud detection

Telecommunications
• Customer Churn Prevention
• Network Performance optimization
• Call Detail Record (CDR) Analysis
• Analyzing Network to Predict Failure

A day in the life of a typical e-commerce site

New exploratory e-commerce data flow

So how does it work?
FIRST, STORE THE DATA

So how does it work?
SECOND, TAKE THE PROCESSING TO THE DATA

// MapReduce functions in JavaScript (word count)

// Map: split each input line into words and emit (word, 1) for every non-empty word
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

// Reduce: sum the counts emitted for one word
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};
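
To try the word-count functions without a cluster, the sketch below runs them in a single process. The map and reduce functions mirror the ones on the slide; makeContext, makeIterator and runLocally are hypothetical helpers written for this illustration and only imitate the context.write and values.hasNext()/values.next() calls the slide relies on. They are not part of any Hadoop API.

```javascript
// Word-count map and reduce, mirroring the functions on the slide.
const map = function (key, value, context) {
    const words = value.split(/[^a-zA-Z]/);
    for (let i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

const reduce = function (key, values, context) {
    let sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};

// Hypothetical stand-in for the framework's context: collects written pairs.
function makeContext() {
    const pairs = [];
    return { pairs, write: (k, v) => pairs.push([k, v]) };
}

// Hypothetical stand-in for the framework's value iterator.
function makeIterator(items) {
    let index = 0;
    return { hasNext: () => index < items.length, next: () => items[index++] };
}

// Runs map over every line, groups values by key (the "shuffle"),
// then runs reduce once per key in sorted key order.
function runLocally(lines) {
    const mapContext = makeContext();
    lines.forEach((line, offset) => map(offset, line, mapContext));

    const groups = new Map();
    for (const [k, v] of mapContext.pairs) {
        if (!groups.has(k)) groups.set(k, []);
        groups.get(k).push(v);
    }

    const reduceContext = makeContext();
    for (const k of [...groups.keys()].sort()) {
        reduce(k, makeIterator(groups.get(k)), reduceContext);
    }
    return reduceContext.pairs;
}

const counts = runLocally(["Galaxy of bits", "bits of data"]);
console.log(counts);
```

The grouping step in runLocally is the part a real cluster performs between the map and reduce phases; here it is just an in-memory Map.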

Hadoop in detail
Analysis of semi-structured and unstructured data distributed across a commodity cluster

• Based on Google’s MapReduce paper and the Google File System (GFS)
• Programs = sequence of “map” and “reduce” tasks
• Simplifies writing distributed applications
• Highly fault tolerant – multiple copies
• Moves computation close to the data
• Implemented in Java and optimized for Linux
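
The idea that a program is a sequence of map and reduce tasks can be sketched with two chained jobs, where the output of the first becomes the input of the second. This is a toy, single-process illustration: runJob, the log lines and the field names are all assumptions made up for this example, and nothing here reflects how Hadoop itself schedules or distributes work.

```javascript
// Minimal in-memory MapReduce job: map every record, group by key, reduce each group.
// mapFn(record) returns an array of [key, value] pairs;
// reduceFn(key, values) returns one output record.
function runJob(records, mapFn, reduceFn) {
    const groups = new Map();
    for (const record of records) {
        for (const [key, value] of mapFn(record)) {
            if (!groups.has(key)) groups.set(key, []);
            groups.get(key).push(value);
        }
    }
    return [...groups.keys()]
        .sort()
        .map((key) => reduceFn(key, groups.get(key)));
}

// Hypothetical web log lines in the form "user url".
const logLines = [
    "ann /home", "bob /home", "ann /cart", "cy /home",
    "bob /cart", "cy /help", "ann /help",
];

// Job 1: count page views per URL.
const viewsPerUrl = runJob(
    logLines,
    (line) => [[line.split(" ")[1], 1]],
    (url, ones) => ({ url, views: ones.length })
);

// Job 2: consume job 1's output and group URLs by their view count.
const urlsPerCount = runJob(
    viewsPerUrl,
    (row) => [[row.views, row.url]],
    (views, urls) => ({ views, urls: urls.sort() })
);

console.log(urlsPerCount);
```

Each job only sees the records handed to it, which is what makes it possible to chain them into longer pipelines.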

             Traditional RDBMS         MapReduce
Data Size    Gigabytes (Terabytes)     Petabytes (Exabytes)
Access       Interactive and Batch     Batch
Updates      Read / Write many times   Write once, Read many times
Structure    Static Schema             Dynamic Schema
Integrity    High (ACID)               Low
Scaling      Nonlinear                 Linear
DBA Ratio    1:40                      1:3000

Hadoop Ecosystem
Traditional BI Tools

Oozie (Workflow)
HBase / Cassandra (Columnar NoSQL Databases)

Pig (Data Flow)
Hive (Warehouse and Data Access)
Karmasphere (Development Tool)
Apache Mahout
Flume
Sqoop

Zookeeper (Coordination)
Avro (Serialization)
HBase (Column DB)

MapReduce (Job Scheduling/Execution System)
HDFS (Hadoop Distributed File System)

Hadoop = MapReduce + HDFS
Hadoop + Microsoft
Our own distribution of Hadoop
• Submit changes back to the Apache Foundation
• Download for free

Optimized for Windows & Azure
• AD & System Center integration
• Hadoop-as-a-service-on-Azure

Focus on .NET Developers
• Integration with Visual Studio
• Support for C#

• Performance and Scale
• High Availability
• Ease of use

Why Hadoop as a Service?
• Task-based billing
• Easy admin
• Zero install
• Support a wide variety of job types
– Machine Learning (Mahout), Graph Mining (Pegasus), Hive, Pig, Java, JS, etc.
• Greatly simplified UI

cheap fast

UNIX Pipes
cat [input_file] | [mapper] | sort | [reducer] > [output_file]

Hadoop Streaming
hadoop jar libhadoop-streaming.jar
    -input [directory]
    -output [directory]
    -mapper [any script or executable]
    -reducer [any script or executable]
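
A streaming mapper or reducer is just a program that reads lines on standard input and writes lines on standard output, with the key and value separated by a tab by default. The sketch below shows that contract for word count as plain JavaScript functions and imitates the pipe from the previous slide in a single process. The function names and sample input are assumptions made for this illustration; to use them as real scripts, each function would still need to be wired to stdin and stdout.

```javascript
// Mapper: one input line in, zero or more "word<TAB>1" lines out.
function mapper(line) {
    return line
        .split(/[^a-zA-Z]/)
        .filter((word) => word !== "")
        .map((word) => word.toLowerCase() + "\t1");
}

// Reducer: receives lines already sorted by key, so equal keys are adjacent.
// It keeps a running total and emits it whenever the key changes.
function reducer(sortedLines) {
    const output = [];
    let currentKey = null;
    let total = 0;
    for (const line of sortedLines) {
        const [key, count] = line.split("\t");
        if (key !== currentKey) {
            if (currentKey !== null) output.push(currentKey + "\t" + total);
            currentKey = key;
            total = 0;
        }
        total += parseInt(count, 10);
    }
    if (currentKey !== null) output.push(currentKey + "\t" + total);
    return output;
}

// In-process equivalent of: cat input | mapper | sort | reducer
function pipeline(inputLines) {
    const mapped = inputLines.flatMap(mapper);
    mapped.sort();
    return reducer(mapped);
}

const result = pipeline(["Store the data", "Take the processing to the data"]);
console.log(result.join("\n"));
```

Because the reducer only relies on equal keys being adjacent, it never has to hold more than one key's total in memory, which is why the sort step sits between the two stages.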

Benefits
Key Features
Data Market integration

Some other fancy stuff...

Microsoft Codename "Social Analytics"

Benefits
Models augmented with publicly available data from social media sites

Key Features

Reality check A.D. 2012
ANALYTICS
SELF-SERVICE, MOBILE, OPERATIONAL, REAL-TIME, PREDICTIVE, COLLABORATIVE

MARKETPLACE
External Data and Services

DATA ENRICHMENT
DISCOVER AND RECOMMEND, TRANSFORM AND CLEAN, SHARE AND GOVERN

DATA MANAGEMENT
RELATIONAL, NON RELATIONAL, MULTIDIMENSIONAL, STREAMING

Use Case:

• Extremely large volume of unstructured web log analysis
• Ad hoc analysis of unstructured web logs to prototype patterns
• Hadoop data feeds large 24 TB Cube

[Diagram: Hadoop Distribution, 24 TB Cube, Microsoft BI Tools]

Michal.Zylinski@microsoft.com

Thank you!
