Microsoft Big Data @ SQLUG 2013

BIG DATA

Wesley Backelant
Technology Advisor
Microsoft
@WesleyBackelant

Nathan Bijnens
Big Data Consultant
DataCrunchers
@nathan_gs

AGENDA

• Big Data
• Hadoop (& Ecosystem)
• How does it fit in the Microsoft world?
• Demo
• Resources
• Q&A

TODAY A NEW SET OF QUESTIONS IS BEING ASKED OF
THE BUSINESS:

• What’s the social sentiment for my brand or products?
• How do I better predict future outcomes?
• How do I optimize my fleet based on weather and traffic patterns?

TRANSFORMATION OF ONLINE MARKETING

BLOGS.FORBES.COM/DAVEFEINLEIB

TRANSFORMATION OF OPERATIONS


TRANSFORMATION OF CUSTOMER SERVICE


TRANSFORMATION OF FRAUD DETECTION

Then… Now…

NEW HARDWARE APPROACH
Traditional                 Big Data
Exotic HW                   Commodity HW
• Big central servers       • Racks of pizza boxes
• SAN                       • Ethernet
• RAID                      • JBOD
Hardware reliability        Unreliable HW
Limited scalability         Scales further
                            Cost effective

NEW SOFTWARE APPROACH
Traditional                 Big Data
Monolithic                  Distributed
• Centralized               • Storage & compute nodes
• RDBMS                     • Raw data
Schema first
Proprietary

HADOOP & BIG DATA ECOSYSTEM

MapReduce

HDFS

HIVE

A data warehouse infrastructure built on top of
Hadoop for providing data summarization, query, and
analysis.
– Ideal for ad hoc querying
– Query execution via MapReduce.

Key Building Principles:
– SQL
– Extensibility
– Types
– Functions
– Scripts

HIVE

It supports many SQL features like:
– Data partitioning
– Aggregations
– Grouping
– Joins

HIVE

And it’s extendable using UDFs.
package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Simple Hive UDF that lower-cases a string value.
public final class Lower extends UDF {
    public Text evaluate(final Text s) {
        // Hive passes NULL column values as null; propagate them unchanged.
        if (s == null) { return null; }
        return new Text(s.toString().toLowerCase());
    }
}

There are many UDFs published by external parties, for:
- Loading / Saving (SerDe)
- Field Transformations

HADOOP PIG: INTRO

Pig is a high level data flow language.

HADOOP PIG: 3 COMPONENTS

• Pig Latin

• Grunt

• PigServer

HADOOP PIG

-- PigStorage splits on tabs by default; pass ',' to read a CSV file.
data = LOAD 'employee.csv' USING PigStorage(',') AS (
    first_name:chararray,
    last_name:chararray,
    age:int,
    wage:float,
    department:chararray
);

HADOOP PIG

grouped_by_department = GROUP data BY department;

-- After GROUP, each row holds the group key plus a bag named 'data';
-- fields inside that bag are projected with a dot (data.wage).
total_wage_by_department =
    FOREACH grouped_by_department
    GENERATE
        group AS department,
        COUNT(data) AS employee_count,
        SUM(data.wage) AS total_wage;

total_ordered = ORDER total_wage_by_department BY total_wage;

total_limited = LIMIT total_ordered 10;

HADOOP PIG

DUMP total_limited;

STORE total_limited INTO '/test/';

UDF

● Custom Load and Store classes.
● HBase
● ProtocolBuffers
● CombinedLog
● Custom extraction,
e.g. date, ...

● Take a look at the PiggyBank.

HBASE

A distributed, versioned, column-oriented
database.
• Main features:
• Horizontal scalability
• Machine failure tolerance
• Row-level atomic operations including compare-and-swap ops like
incrementing counters
• Augmented key-value schemas: the user can group columns into families, which
are configured independently
• Multiple clients like its native Java library, Thrift, and REST
• Upcoming Security

STORM

• Message passing.
• Distributed processing.
• Horizontally scalable.
• Incremental algorithms.
• Fast.

• Data in motion.

STORM

[Diagram: Nimbus and Zookeeper; three Supervisors, one per Worker Node, each with several Workers.]

STORM

• Tuple

• Stream

DATA IS MORE THAN INFORMATION

Not all information is equal.
Some information is derived from other pieces of information.

DATA IS MORE THAN INFORMATION

Eventually you will reach the most ‘raw’
form of information.
This is the information you hold true, simply because it exists.
Let’s call this ‘data’, very similar to ‘event’.

EVENTS
Everything we do generates events:
• Pay with Credit Card
• Commit to Git
• Click on a webpage
• Tweet

EVENTS - BEFORE

Events used to manipulate
the master data.

EVENTS - AFTER

Today, events are the master
data.

DATA SYSTEM

Let’s store everything.

EVENTS

Data is Immutable

EVENTS

Data is Time Based

CAPTURING CHANGE TRADITIONALLY

Person   Location           Person   Location
Nathan   Antwerp            Nathan   Ghent
Geert    Dendermonde        Geert    Dendermonde
John     Ghent              John     Ghent

CAPTURING CHANGE

Person   Location      Timestamp
Nathan   Antwerp       2005-01-01
Geert    Dendermonde   2011-10-08
John     Ghent         2010-05-02
Nathan   Ghent         2013-02-03
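
The timestamped table above can be read as an append-only log of facts. The following is a minimal sketch, assuming a hypothetical `LocationFact` record and a hypothetical `location_as_of` helper (neither is from the talk), of how a move is recorded as a new fact and how the location at any point in time stays answerable.

```python
from datetime import date
from typing import List, NamedTuple, Optional


class LocationFact(NamedTuple):
    """One immutable, timestamped fact (hypothetical record type)."""
    person: str
    location: str
    timestamp: date


# Append-only log: a move is recorded as a new fact, never as an update.
FACTS: List[LocationFact] = [
    LocationFact("Nathan", "Antwerp", date(2005, 1, 1)),
    LocationFact("Geert", "Dendermonde", date(2011, 10, 8)),
    LocationFact("John", "Ghent", date(2010, 5, 2)),
    LocationFact("Nathan", "Ghent", date(2013, 2, 3)),
]


def location_as_of(facts: List[LocationFact], person: str, when: date) -> Optional[str]:
    """Return the latest known location of `person` on or before `when`."""
    known = [f for f in facts if f.person == person and f.timestamp <= when]
    if not known:
        return None
    return max(known, key=lambda f: f.timestamp).location


print(location_as_of(FACTS, "Nathan", date(2012, 1, 1)))
print(location_as_of(FACTS, "Nathan", date(2013, 6, 1)))
```

Because nothing is overwritten, the older Antwerp fact remains available for historical queries after the move to Ghent is recorded.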

QUERY

The data you query is often
transformed, aggregated, ...
Rarely used in its original form.

QUERY

Query = function ( data )

NUMBER OF PEOPLE LIVING IN EACH CITY.

Person   Location   Time              Location      Count
Nathan   Antwerp    2005-01-01        Ghent         2
…                                     Dendermonde   1
Nathan   Ghent      2013-02-03
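
As an illustration of "Query = function ( data )", the people-per-city count can be sketched as a pure function over the full event list. This is only a sketch: the `people_per_city` name is hypothetical, and the event rows are the four shown on the "Capturing change" slide.

```python
from collections import Counter

# (person, location, ISO date) events, as on the "Capturing change" slide.
EVENTS = [
    ("Nathan", "Antwerp", "2005-01-01"),
    ("Geert", "Dendermonde", "2011-10-08"),
    ("John", "Ghent", "2010-05-02"),
    ("Nathan", "Ghent", "2013-02-03"),
]


def people_per_city(events):
    """Pure function of all data: count people by their most recent location."""
    latest = {}
    for person, location, timestamp in events:
        # ISO-8601 date strings sort chronologically, so string comparison works.
        if person not in latest or timestamp > latest[person][0]:
            latest[person] = (timestamp, location)
    return dict(Counter(location for _, location in latest.values()))


print(people_per_city(EVENTS))
```

The function never mutates its input, so running it again on the same events always yields the same view.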

QUERY

All Data → Query

QUERY: PRECOMPUTE

All Data → Precomputed View → Query

LAYERED ARCHITECTURE

Batch Layer

Speed Layer

Serving Layer

LAYERED ARCHITECTURE

[Diagram components: Incoming Data, HD Insight, Column Store, SQL, Query]

BATCH LAYER

[Diagram components: Incoming Data, HD Insight, Column Store]

BATCH LAYER

Unrestrained computation.

BATCH LAYER

Horizontally scalable.

BATCH LAYER

High Latency.
Let’s pretend temporarily that update latency
doesn’t matter.

BATCH LAYER

Stores master copy of data set...
append only.

BATCH: VIEW GENERATION

Master Dataset → MapReduce → View #1, View #2, View #3

MAPREDUCE

1. Take a large problem and divide it into sub-problems.

2. MAP: Perform the same function on all sub-problems.
   DoWork() DoWork() DoWork() …

3. REDUCE: Combine the output from all sub-problems.

Output
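
The three steps can be sketched in plain Python. This is a single-process illustration of the idea, not how Hadoop schedules work; the function names and the employee records are made up for the example.

```python
from collections import defaultdict

# Hypothetical (name, department, wage) records.
RECORDS = [
    ("Ann", "sales", 3000),
    ("Bob", "it", 4000),
    ("Cid", "sales", 2500),
    ("Dee", "hr", 2800),
    ("Eve", "it", 4200),
]


def split(records, parts):
    """Step 1: divide the problem into sub-problems."""
    return [records[i::parts] for i in range(parts)]


def map_chunk(chunk):
    """Step 2: the same function runs on every sub-problem, emitting (key, value) pairs."""
    return [(department, wage) for _name, department, wage in chunk]


def shuffle(mapped_chunks):
    """Group the emitted values by key so each key can be reduced on its own."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_group(key, values):
    """Step 3: combine all values that share a key."""
    return key, sum(values)


def total_wage_by_department(records, parts=3):
    mapped = [map_chunk(chunk) for chunk in split(records, parts)]
    return dict(reduce_group(key, values) for key, values in shuffle(mapped).items())


print(total_wage_by_department(RECORDS))
```

In a real cluster the chunks would be processed on different machines; the result does not depend on how the records are split.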

BATCH VIEW DATABASE

Read only database.
No random writes required.

BATCH LAYER

We are not done yet… Just a few hours of data.

[Timeline: data absorbed into Batch Views, followed by data not yet absorbed, up to Now.]

OVERVIEW

[Diagram components: Incoming Data, HD Insight, Column Store, SQL]

SPEED LAYER

Stream processing.

SPEED LAYER

Continuous computation.

SPEED LAYER

Transactional.

SPEED LAYER

Storing a limited window of data.
Compensating for the last few hours of data.

SPEED LAYER

All the complexity is isolated in the
Speed layer. If anything goes wrong,
it’s auto-corrected.

CAP

You have a choice between:
• Availability
• Queries are eventually consistent.
• Consistency
• Queries are consistent.

EVENTUAL ACCURACY

Some algorithms are hard to
implement in real time. For those
cases we could estimate the results.
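
One example of such an estimate is counting distinct items in a stream with bounded memory. The sketch below uses a K-minimum-values estimator, which is a choice made for this illustration and is not named in the talk; the function names and the value of `k` are assumptions.

```python
import hashlib
import heapq


def _unit_hash(item: str) -> float:
    """Map an item to a pseudo-uniform value in [0, 1)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def estimate_distinct(items, k=256):
    """Estimate the number of distinct items while keeping only k hash values."""
    heap = []        # max-heap (negated values) of the k smallest hashes seen
    members = set()  # same hashes, for duplicate detection
    for item in items:
        h = _unit_hash(item)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # Replace the current largest of the k smallest hashes.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct items: count is exact
    # If n distinct hashes are uniform on [0, 1), the k-th smallest is near k / n.
    return (k - 1) / -heap[0]


stream = [f"user-{i % 10000}" for i in range(30000)]
print(round(estimate_distinct(stream)))
```

The estimate is approximate by design; the batch layer can later replace it with an exact count.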

SPEED LAYER

Incoming Data → Real Time View 1, Real Time View 2

SPEED LAYER VIEWS
• The views are stored in a read & write database.
• MS SQL Server
• Column Store
• Cassandra
• …
• Much more complex than a read only view.

OVERVIEW

[Diagram components: Incoming Data, HD Insight, Column Store, SQL, Query]

SERVING LAYER

This layer queries the Batch & Real
Time views and merges them.

SERVING LAYER

Batch Views + Real Time Views → Merge
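
A minimal sketch of the merge step, assuming both views hold additive counts keyed the same way (the page names and numbers are made up). Views that are not additive would need a different merge rule.

```python
from collections import Counter


def merge_views(batch_view, realtime_view):
    """Combine a batch view with the real-time view covering the most recent data."""
    merged = Counter(batch_view)
    merged.update(realtime_view)  # adds counts key by key
    return dict(merged)


# Hypothetical page-view counts.
batch_view = {"/home": 120, "/pricing": 45}   # everything up to the last batch run
realtime_view = {"/home": 3, "/signup": 1}    # the last few hours only

print(merge_views(batch_view, realtime_view))
```

Once the next batch run has absorbed the recent data, the corresponding real-time counts can be discarded.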

SERVING LAYER

Polybase is a great fit.

LAMBDA ARCHITECTURE
• Can discard any view, batch and real time, and just recreate
everything from the master data.
• Mistakes are corrected via recomputation.
• Write bad data? Remove the data & recompute.
• Bug in view generation? Just recompute the view.
• Data storage is highly optimized.
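
The "remove the data & recompute" idea can be sketched as follows. The record layout, the helper names and the misspelled page are hypothetical; the point is only that a view is rebuilt from the master data rather than patched.

```python
from collections import Counter

# Hypothetical master data: (event_id, page) click events.
MASTER = [
    (1, "/home"),
    (2, "/pricing"),
    (3, "/hmoe"),   # bad record written by a buggy client
    (4, "/home"),
]


def build_view(master_data):
    """A view is a pure function of the master data, so it can always be rebuilt."""
    return dict(Counter(page for _event_id, page in master_data))


def remove_bad_records(master_data, bad_ids):
    """Return the master data without the records known to be wrong."""
    return [record for record in master_data if record[0] not in bad_ids]


stale_view = build_view(MASTER)
fixed_view = build_view(remove_bad_records(MASTER, bad_ids={3}))
print(stale_view)
print(fixed_view)
```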

WHAT IS MICROSOFT DOING ON
THE BI & DEVELOPMENT SIDE

INSIGHTS FROM ANY DATA, ANY SIZE, ANYWHERE


WE DELIVER INSIGHTS TO EVERYONE BY ENABLING BIG DATA
ANALYSIS WITH FAMILIAR END USER TOOLS
Benefits
• Interaction and analysis of unstructured data in Hadoop
Key Features
• Hive add-in for Excel

UNLOCKING IMMERSIVE INSIGHTS FROM ALL DATA
WITH MICROSOFT BI TOOLS
Benefits
• Familiar self service BI tools
Key Features
• Hive ODBC Driver integrates Hadoop to SQL Server Analysis Services,
PowerPivot, and Power View

WHILE DRAMATICALLY SIMPLIFYING PROGRAMMING
ON HADOOP

Benefits
• MapReduce programs in JavaScript
• Simplified programming
• Simplified deployment of MapReduce jobs
Key Features
• Integration with .NET and new JavaScript libraries for Hadoop
• Deploy JavaScript Hadoop jobs from a simple web browser on any supported device

WE MANAGE STREAMING DATA WITH STREAMINSIGHT
Benefits
Key Features

StreamInsight SQL StreamInsight

WHAT IS MICROSOFT DOING ON
THE HADOOP & INTEGRATION SIDE?

WE MANAGE RELATIONAL DATA WITH MICROSOFT
ENTERPRISE DATA WAREHOUSE SOLUTIONS
Reference Architectures
• Fast Track for
Appliances
• Dell Parallel Data Warehouse
• HP Enterprise Data Warehouse
• Dell Quickstart Data Warehouse
• HP Business Data Warehouse

INTRODUCING POLYBASE
Fundamental Breakthrough in Data Processing

Single Query; Structured and Unstructured
• Query and join Hadoop tables with Relational Tables
• Use Standard SQL language
• Select, From, Where

SQL Server 2012 PDW Powered by PolyBase

• Existing SQL Skillset
• No IT Intervention
• Save Time and Costs
• Analyze All Data Types

AND SUPPORT UNSTRUCTURED DATA WITH ENTERPRISE
CLASS HADOOP ON PREMISE AND IN THE CLOUD
Benefits
Key Features

MICROSOFT BRINGS THE SIMPLICITY AND MANAGEABILITY
OF WINDOWS AND SQL SERVER TO HADOOP
Benefits
Key Features

MICROSOFT DELIVERS BIG DATA THROUGH OPEN
PLATFORM AND A RICH PARTNER ECOSYSTEM
Benefits
Key Features

BIG DATA DEMO:
FROM DATA TO INSIGHTS!

• Simplicity
• Analysis with familiar tools
• Collaboration on insights

RESOURCES
• Microsoft Big Data Solution: www.microsoft.com/bigdata
• Windows Azure: www.windowsazure.com/en-us/home/scenarios/big-data
• Try Now: https://www.hadooponazure.com
• HDInsight For Windows Beta Download: http://hortonworks.com/download/
• HDInsight Services For Windows: http://social.technet.microsoft.com/wiki/contents/articles/6204.hdinsight-services-for-windows.aspx#videos
• Hadoop in PowerPivot: http://social.technet.microsoft.com/wiki/contents/articles/6294.how-to-connect-excel-powerpivot-to-hive-on-azure-via-hiveodbc.aspx
• Hadoop in SSIS: http://msdn.microsoft.com/en-us/library/jj720569.aspx
• Hurricane Sandy: http://sqlcat.com/sqlcat/b/msdnmirror/archive/2013/02/01/hurricane-sandy-mash-up-hive-sql-server-powerpivot-amp-power-view.aspx
• Hadoop PowerShell: http://blogs.msdn.com/b/cindygross/archive/2012/08/23/how-to-install-the-powershell-cmdlets-for-apache-hadoop-based-services-for-windows.aspx
• SQL Server BCP to Hive: http://blogs.msdn.com/b/cindygross/archive/2012/09/28/load-sql-server-bcp-data-to-hive.aspx
• Internal vs External Table Hive: http://blogs.msdn.com/b/cindygross/archive/2013/02/06/hdinsight-hive-internal-and-external-tables-intro.aspx
• Microsoft.NET SDK for Hadoop: http://hadoopsdk.codeplex.com/
• Twitter Analytics Example: http://twitterbigdata.codeplex.com/

DATACRUNCHERS

We enable companies in envisioning, defining and implementing a data
strategy.
A one-stop-shop for all your Big Data needs.

The first Big Data Consultancy agency in Belgium.
