SSIS & HDInsight
Tillmann Eitelberg
Oliver Engels
Who we are…
Tillmann Eitelberg
• CTO of oh22information services GmbH
• PASS Regional Mentor Germany
• Vice-President PASS Germany
• Chapter Leader Cologne/Bonn, Germany
• Microsoft MVP

Oliver Engels
• CEO of oh22data AG
• PASS Regional Mentor Germany
• President PASS Germany
• Chapter Leader Frankfurt, Germany
• Microsoft MVP
• Microsoft vTSP
Agenda
• Traditional ETL Process
• Challenges of Big Data and unstructured data
• Useful Apache Hadoop Components for ETL
• Some statements to be clarified...
• Using Apache Hadoop within the ETL process
• SSIS – not just a simple ETL Tool
• Tools to work with HDInsight
• Get started using Windows Azure HDInsight
• Use SQL Server Integration Services to …
Traditional ETL Process
• Extract data from different sources
  • different source systems
  • different data organization and/or format
  • (non-)relational databases, flat files
• Transform it to fit operational needs
  • translating coded values
  • encoding free-form values
  • deriving new calculated values
  • aggregation
  • data profiling, data quality
• Load it into the end target
  • database, data mart, data warehouse
Traditional ETL Process

[Diagram: data is extracted from CRM, ERP, and web site traffic sources, transformed, and loaded into a Data Warehouse that feeds OLAP analysis, data mining, and reporting.]
Traditional ETL Process

[Diagram: several ETL streams extract data from CRM, ERP, and web site traffic into a staging area (DBMS), then into the Data Warehouse and data marts, which feed OLAP analysis, data mining, and reporting.]
Traditional ETL Process (Microsoft Glasses)
• Control Flow
  • implementing repeating workflows
  • connecting containers and tasks into an ordered control flow by using precedence constraints
  • controlling external processes
  • loading meta objects and data containers
  • preparing data files
Traditional ETL Process (Microsoft Glasses)
• Data Flow
  • adding one or more sources to extract data from files and databases
  • adding the transformations that meet the business requirements
  • adding one or more destinations to load data into data stores such as files and databases
  • configuring error outputs on components to handle problems
Microsoft Big Data Solution
Challenges of Big Data
• large amounts of data from multiple sources
• data volumes reach into the terabytes, petabytes, and exabytes
• classic relational database systems, as well as statistical and visualization programs, are often unable to handle such large amounts of data
• according to calculations from 2011, the global volume of data doubles every two years
Challenges of unstructured data
• does not have a pre-defined data model or is not organized in a pre-defined manner
• typically text-heavy, but may also contain data such as dates, numbers, and facts
• structure, while not formally defined, can still be implied
• aggregates cannot be accessed by computer programs through a single interface
• examples: emails, untagged audio and video files, and contributions to media such as online forums or social-media platforms
Objectives of Big Data

Real-time tweets visualized on a map
HDInsight/Hadoop Eco-System
• Red = Core Hadoop
• Blue = Data processing
• Purple = Microsoft integration points and value adds
• Orange = Data movement
• Green = Packages
Useful Apache Hadoop Components (for ETL)
Apache Flume
• streams data from multiple sources into Hadoop for analysis
• a large-scale log aggregation framework
• collects high-volume web logs in real time
• insulates the destination from transient spikes when the rate of incoming data exceeds the rate at which it can be written
• guarantees data delivery
• scales horizontally to handle additional data volume

Apache Sqoop
• allows data imports from external datastores and enterprise data warehouses into Hadoop
• parallelizes data transfer for fast performance and optimal system utilization
• copies data quickly from external systems to Hadoop
• makes data analysis more efficient
• mitigates excessive loads on external systems
Useful Apache Hadoop Components (for ETL)
Apache Hive
• data warehouse infrastructure built on top of Hadoop
• supports analysis of large datasets stored in Hadoop's HDFS
• SQL-like language called HiveQL
• internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce jobs

Apache Pig
• platform for creating MapReduce programs
• language is called Pig Latin
• abstracts Java MapReduce jobs to something similar to SQL
• can use User Defined Functions written in Java, Python, JavaScript, Ruby or Groovy
• commonly used for ETL data pipelines
Useful Apache Hadoop Components (for ETL)
ODBC/JDBC Connectors
• Microsoft® Hive ODBC Driver
  http://www.microsoft.com/en-us/download/details.aspx?id=40886
• original: Apache Hive ODBC Driver provided by Simba
• transforms an application's SQL query into the equivalent form in HiveQL
• supports all major on-premise and cloud Hadoop/Hive distributions
• supports data types: TinyInt, SmallInt, Int, BigInt, Float, Double, Boolean, String, Decimal and TimeStamp

Apache Storm
• distributed real-time computation system for processing fast, large streams of data
• processes one million 100-byte messages per second per node
• scalable, with parallel calculations that run across a cluster of machines
• fault-tolerant – when workers die, Storm automatically restarts them; if a node dies, the worker is restarted on another node
• guarantees that each unit of data (tuple) is processed at least once or exactly once
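
Because the driver exposes Hive through standard ODBC, Hive tables can be queried from any .NET client, and therefore also from an SSIS Script Task or data flow. A minimal sketch, assuming a system DSN named HiveDSN that points at the cluster and the HDInsight sample table hivesampletable (DSN name and credentials are placeholders):

using System;
using System.Data.Odbc;

class HiveOdbcSample
{
    static void Main()
    {
        // Assumes a system DSN "HiveDSN" created for the Microsoft Hive ODBC Driver;
        // UID/PWD are the HDInsight cluster credentials (placeholders here).
        const string connectionString = "DSN=HiveDSN;UID=admin;PWD=<password>";

        using (var connection = new OdbcConnection(connectionString))
        using (var command = new OdbcCommand(
            "SELECT country, COUNT(*) FROM hivesampletable GROUP BY country",
            connection))
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    // Column 0: country, column 1: row count
                    Console.WriteLine("{0}\t{1}", reader.GetString(0), reader.GetValue(1));
                }
            }
        }
    }
}

The same DSN can also back an ODBC connection manager in SSIS, so Hive results land directly in a data flow.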
Some statements to be clarified...
• Hadoop will steal work from ETL solutions
• ETL runs faster on Hadoop
• Hadoop is not a data integration tool
• Hadoop is a batch processing system, and Hadoop jobs tend to have high latency
• Data integration solutions do not run natively in Hadoop
• Elephants do not live in isolation
• Hadoop is not a solution for data quality (and other specialized transformations)
Using Apache Hadoop within the ETL process

[Diagram: the traditional ETL architecture extended with big data sources – social media and sensor logs flow into Hadoop via Sqoop, Flume, and Storm; Hive and Pig process the data; results reach the staging area, Data Warehouse, and data marts via ODBC/JDBC and Sqoop, feeding OLAP analysis, data mining, reporting, and data science.]
SSIS – not just a simple ETL Tool
Use SQL Server Integration Services to…
• build complex workflows
• manage Windows Azure and HDInsight clusters
• load data into HDInsight/HDFS
• control jobs on HDInsight
• get data from Hive, Pig, …
• combine Hadoop with "traditional" ETL
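
Most of these scenarios end up in an SSIS Script Task that calls the HDInsight/Hadoop .NET APIs shown later in this deck. A minimal sketch of such a task, assuming the standard Script Task code template and hypothetical package variables User::SubscriptionId and User::ClusterName configured as ReadOnlyVariables:

using System;
using Microsoft.SqlServer.Dts.Runtime;

// Body of the ScriptMain class generated by the SSIS Script Task template.
// The surrounding partial class, the Dts object, and the ScriptResults enum
// are provided by that template.
public void Main()
{
    try
    {
        // Read configuration from SSIS package variables (hypothetical names).
        string subscriptionId = Dts.Variables["User::SubscriptionId"].Value.ToString();
        string clusterName = Dts.Variables["User::ClusterName"].Value.ToString();

        // ... call the Windows Azure HDInsight / Hadoop client APIs here,
        // e.g. create a cluster, upload blobs, or submit a Hive/MapReduce job
        // (see the SDK code on the following slides).

        Dts.TaskResult = (int)ScriptResults.Success;
    }
    catch (Exception ex)
    {
        Dts.Events.FireError(0, "HDInsight Script Task", ex.Message, string.Empty, 0);
        Dts.TaskResult = (int)ScriptResults.Failure;
    }
}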
Tools to work with HDInsight
• SSIS Tasks for HDInsight
http://www.youtube.com/watch?v=2Aj9_w3y9Xo&feature=player_embedded&list=PLoGAcXKPcRvbTr23ujEN953pLP_nDyZJC#t=2184

• Announced at PASS Summit 2013

• Experimental Release on Codeplex
• No timeline yet
Tools to work with HDInsight
• Azure Storage Explorer
  http://azurestorageexplorer.codeplex.com/
• CloudBerry Explorer for Azure Cloud Storage
  http://www.cloudberrylab.com/free-microsoft-azure-explorer.aspx
• Cerebrata Azure Management Studio
  http://www.cerebrata.com/
• Red Gate HDFS Explorer (beta)
  http://bigdata.red-gate.com/
Tools to work with HDInsight
• Microsoft .NET SDK For Hadoop (NuGet packages)
  • Windows Azure HDInsight: provides a .NET API for cluster management and job submission on the Windows Azure HDInsight service
  • Microsoft .NET Map Reduce API For Hadoop: provides a .NET API for the Map/Reduce functionality of Hadoop Streaming
  • Microsoft .NET API For Hadoop WebClient: provides a .NET API for the Hadoop WebClient
  • Microsoft .NET API for Hadoop: provides a .NET API for working with Hadoop clusters over HTTP
Tools to work with HDInsight
• some of the APIs require .NET 4.5
• by default, SSIS 2012 uses .NET 4.0
• use SSDT 2012 BI Edition (or higher) to work with .NET 4.5 in script tasks and components
Tools to work with HDInsight
• the NuGet Package Manager is not fully compatible with the SQL Server Integration Services Script Task
• NuGet packages (assemblies) must be installed in the global assembly cache, e.g.
  gacutil -i <assembly.dll>
• NuGet packages/assemblies must be installed on all servers that run the packages
• all assemblies need a strong name
Tools to work with HDInsight
• Adding a Strong Name to an existing Assembly
sn -k keyPair.snk
ildasm AssemblyName.dll /out:AssemblyName.il
ilasm AssemblyName.il /dll /key=keyPair.snk
Get started using Windows Azure HDInsight
• Create a Storage Account
  • define the name/URL of the storage account
  • define the location/affinity group, currently best set to "North Europe"
  • set replication; to avoid costs use "Locally Redundant"
• Create a container in the newly created storage account
• Manage Access Keys
  • get the Storage Account Name
  • get the Primary Access Key
Get started using Windows Azure HDInsight
• Create a Certificate
  makecert -sky exchange -r -n "CN=SQLKonferenz" -pe -a sha1 -len 2048 -ss My "SQLKonferenz.cer"
• Upload the certificate to Windows Azure
• Get the SubscriptionId
• Get the Thumbprint
Get started using Windows Azure HDInsight
Demo
Get started using Windows Azure HDInsight
Manage Your HDInsight Cluster
• Create a container in your Windows Azure Storage account
• Create the HDInsight cluster
  • storage container
  • authentication (username/password)
  • cluster size
• Delete the HDInsight cluster
• (Delete the corresponding container)
Manage Your HDInsight Cluster
// Get the certificate object from the certificate store using its thumbprint
var store = new X509Store();
store.Open(OpenFlags.ReadOnly);
var cert = store.Certificates.Cast<X509Certificate2>().First(
    item => item.Thumbprint == thumbprint);

// Create the HDInsightClient object using the factory method
var creds = new HDInsightCertificateCredential(
    new Guid(subscriptionId), cert);
var client = HDInsightClient.Connect(creds);
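
With the client object connected, creating and deleting a cluster (the operations listed on the previous slide) takes only a few more calls. A hedged sketch, assuming the Windows Azure HDInsight management package from the SDK; all names, keys, and sizes below are placeholders:

// Describe the cluster to be created (all values are placeholders).
var clusterInfo = new ClusterCreateParameters
{
    Name = "sqlkonferenz-cluster",
    Location = "North Europe",
    DefaultStorageAccountName = "mystorageaccount.blob.core.windows.net",
    DefaultStorageAccountKey = "<primary access key>",
    DefaultStorageContainer = "sqlkonferenz",
    UserName = "admin",
    Password = "<cluster password>",
    ClusterSizeInNodes = 4
};

// Provision the cluster; the call returns once the cluster is available.
var clusterDetails = client.CreateCluster(clusterInfo);

// ... submit jobs ...

// Delete the cluster again when the work is done (the storage container survives).
client.DeleteCluster(clusterInfo.Name);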
Demo
Upload data to HDInsight
// Connect to the cluster's default storage account
var storageCredentials = new StorageCredentials(
    defaultStorageAccountName,
    defaultStorageAccountKey);
var storageAccount = new CloudStorageAccount(storageCredentials, true);
var cloudBlobClient = storageAccount.CreateCloudBlobClient();
var cloudBlobContainer = cloudBlobClient.GetContainerReference(defaultStorageCont);

// Reference the target blob (path as shown on the slide) and upload the local file
var blockBlob = cloudBlobContainer.GetBlockBlobReference(
    @"example/data/gutenberg/");
using (var fileStream = System.IO.File.OpenRead(filename))
{
    blockBlob.UploadFromStream(fileStream);
}
Upload data to HDInsight
• ~300 MB in approx. 45 seconds
• uploaded from an Azure VM in the same region
Run a MapReduce Program
// Create Job Submission Client object
var creds = new JobSubmissionCertificateCredential(
new Guid(subscriptionId),
cert,
clusterName);
var jobClient = JobSubmissionClientFactory.Connect(creds);

// Create job object that captures details of the job
var mrJobDefinition = new MapReduceJobCreateParameters()
{
    JarFile = "wasb:///example/jars/hadoop-examples.jar",
    ClassName = "wordcount"
};
mrJobDefinition.Arguments.Add("wasb:///example/data/gutenberg/davinci.txt");
mrJobDefinition.Arguments.Add("wasb:///example/data/WordCountOutput");

// Submit job to the cluster
var jobResults = jobClient.CreateMapReduceJob(mrJobDefinition);
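
CreateMapReduceJob only submits the job, so the package typically polls for completion before moving on. A small sketch of that wait loop, reusing the jobClient and jobResults objects from above:

// Poll the job status until the job either completes or fails.
var jobInProgress = jobClient.GetJob(jobResults.JobId);
while (jobInProgress.StatusCode != JobStatusCode.Completed &&
       jobInProgress.StatusCode != JobStatusCode.Failed)
{
    System.Threading.Thread.Sleep(TimeSpan.FromSeconds(10));
    jobInProgress = jobClient.GetJob(jobInProgress.JobId);
}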
Demo
Run a Hive Query
• Hive query via the .NET Hadoop SDK (see the sketch below)
• download the result of a Hive query
• load the result of a Hive query directly into the data flow
• Microsoft® Hive ODBC Driver (available for x86 and x64)
  http://www.microsoft.com/en-us/download/confirmation.aspx?id=40886
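
A sketch of the first two bullets, reusing the jobClient created via JobSubmissionClientFactory.Connect on the MapReduce slide; job name, status folder, and query text are placeholders:

// Define the Hive job (name, status folder and query are placeholders).
var hiveJobDefinition = new HiveJobCreateParameters()
{
    JobName = "show tables",
    StatusFolder = "/ShowTableStatusFolder",
    Query = "show tables;"
};

// Submit the query; in a real package, poll for completion as shown for the MapReduce job.
var hiveResults = jobClient.CreateHiveJob(hiveJobDefinition);

// Once the job has finished, its standard output contains the query result.
using (var stream = jobClient.GetJobOutput(hiveResults.JobId))
using (var reader = new System.IO.StreamReader(stream))
{
    Console.WriteLine(reader.ReadToEnd());
}

Alternatively, the Hive ODBC driver shown earlier loads the result directly into an SSIS data flow.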
Demo
Complete HDInsight Package
Thank you very much!
Tillmann Eitelberg
t.eitelberg@oh22.net
Oliver Engels
o.engels@oh22.net