Big Data with SQL Server
Philly Code Camp 2013.1
May 2013
http://www.pssug.org
Mark Kromer
http://www.kromerbigdata.com
@kromerbigdata
@mssqldude
makromer@microsoft.com
‣ What is Big Data?
‣ The Big Data and Apache Hadoop environment
‣ Big Data Analytics
‣ SQL Server in the Big Data world
‣ Microsoft + Hortonworks (Yahoo!) = HDInsight
What we'll (try to) cover today
Big Data 101
‣ 3 V's
  ‣ Volume – terabytes of records, transactions, tables, files
  ‣ Velocity – batch, near-time, real-time (analytics), streams
  ‣ Variety – structured, unstructured, semi-structured, and all of the above in a mix
‣ Text Processing
  ‣ Techniques for processing and analyzing LARGE unstructured (and structured) files
‣ Analytics & Insights
‣ Distributed File System & Programming
‣ Batch Processing
‣ Commodity Hardware
‣ Data Locality, no shared storage
‣ Scales linearly
‣ Great for large text file processing, not so great on small files
‣ Distributed programming paradigm
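The point about small files can be made concrete with some rough arithmetic. The sketch below is a simplification, not Hadoop code: it assumes a 64 MB block size (a configurable setting), one map task per block, and an input format that does not combine files, so every non-empty file costs at least one task. The function name `input_splits` is hypothetical.

```python
import math

BLOCK_SIZE_MB = 64  # assumed HDFS block size; configurable per cluster


def input_splits(file_sizes_mb, block_size_mb=BLOCK_SIZE_MB):
    """Estimate map tasks: one per block, and at least one per non-empty file."""
    return sum(
        max(1, math.ceil(size / block_size_mb))
        for size in file_sizes_mb
        if size > 0
    )


if __name__ == "__main__":
    one_large_file = [10 * 1024]       # a single 10 GB file
    many_small_files = [1] * 10240     # the same 10 GB as 1 MB files
    print("large file :", input_splits(one_large_file), "map tasks")
    print("small files:", input_splits(many_small_files), "map tasks")
```

Under these assumptions the same volume of data needs far more tasks when it arrives as many small files, and each task carries its own scheduling overhead.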
Popular Hadoop Distributions
Hosted PaaS Hadoop platforms: Amazon EMR, Pivotal, Microsoft Hadoop on Azure
‣ Big Data ≠ NoSQL
  ‣ NoSQL shares the Internet-scale Web origins of the Hadoop stack (Yahoo!, Google, Facebook, et al.), but it is not the same thing
  ‣ Facebook, for example, uses HBase from the Hadoop stack
‣ Big Data ≠ Real Time
  ‣ Big Data is primarily about batch processing huge files in a distributed manner and analyzing data that was otherwise too complex to provide value
  ‣ Use in-memory analytics for real-time insights
‣ Big Data ≠ Data Warehouse
  ‣ I still refer to large multi-TB DWs as “VLDB”
  ‣ Big Data is about crunching stats in text files to discover new patterns and insights
  ‣ Use the DW to aggregate and store the summaries of those calculations for reporting
Mark's Big Data Myths
Big Data Analytics Web Platform - Example
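using Microsoft.Hadoop.MapReduce;
using System.Text.RegularExpressions;

public class TotalHitsForPageMap : MapperBase
{
    // The slide does not show these declarations; the values are assumptions.
    // Set expected and pagePos to match the #Fields line of your W3C log.
    private const int expected = 14;   // assumed number of fields per record
    private const int pagePos = 4;     // assumed index of the page field
    private const string hit = "1";    // each record counts as one hit

    public override void Map(string inputLine, MapperContext context)
    {
        context.Log(inputLine);
        // Split the log line on runs of whitespace.
        var parts = Regex.Split(inputLine, @"\s+");
        if (parts.Length != expected) // only take records with all values
        {
            return;
        }
        context.EmitKeyValue(parts[pagePos], hit);
    }
}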
MapReduce Framework (Map)
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Hadoop.MapReduce;

public class TotalHitsForPageReducerCombiner : ReducerCombinerBase
{
    // Sum the per-record hit counts emitted by the mapper for one page.
    public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
    {
        context.EmitKeyValue(key, values.Sum(e => long.Parse(e)).ToString());
    }
}

public class TotalHitsJob : HadoopJob<TotalHitsForPageMap, TotalHitsForPageReducerCombiner>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        // Input and output locations are read from environment variables.
        var retVal = new HadoopJobConfiguration();
        retVal.InputPath = Environment.GetEnvironmentVariable("W3C_INPUT");
        retVal.OutputFolder = Environment.GetEnvironmentVariable("W3C_OUTPUT");
        retVal.DeleteOutputFolder = true;
        return retVal;
    }
}
MapReduce Framework (Reduce & Job)
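For readers without a cluster at hand, the map, shuffle, and reduce steps that the C# classes plug into can be imitated locally. The following is a minimal Python sketch of the pattern, not the Microsoft.Hadoop.MapReduce API. The log lines, the five-field layout, and the names `EXPECTED_FIELDS`, `PAGE_POS`, `map_line`, `shuffle`, `reduce_hits`, and `total_hits` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical log layout: "date time method page status" (5 fields).
EXPECTED_FIELDS = 5
PAGE_POS = 3


def map_line(line):
    """Emit (page, "1") for each complete record, like the mapper on the slide."""
    parts = line.split()
    if len(parts) != EXPECTED_FIELDS:  # skip records that lack some values
        return []
    return [(parts[PAGE_POS], "1")]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_hits(key, values):
    """Sum the string counts for one page, like the reducer on the slide."""
    return key, str(sum(int(v) for v in values))


def total_hits(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of log lines."""
    pairs = [pair for line in lines for pair in map_line(line)]
    return dict(reduce_hits(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    sample_log = [
        "2013-05-07 12:11:00 GET /index.html 200",
        "2013-05-07 12:11:02 GET /about.html 200",
        "2013-05-07 12:11:05 GET /index.html 200",
        "malformed line",
    ]
    print(total_hits(sample_log))
```

On a real cluster the framework runs many mapper instances in parallel and performs the grouping itself; only the per-line map logic and the per-key reduce logic are supplied by the job author.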
‣ Linux-style shell commands to access data in HDFS
‣ Put file in HDFS: hadoop fs -put sales.csv /import/sales.csv
‣ List files in HDFS:
    c:\Hadoop>hadoop fs -ls /import
    Found 1 items
    -rw-r--r-- 1 makromer supergroup 114 2013-05-07 12:11 /import/sales.csv
‣ View file in HDFS:
    c:\Hadoop>hadoop fs -cat /import/sales.csv
    Kromer,123,5,55
    Smith,567,1,25
    Jones,123,9,99
    James,11,12,1
    Johnson,456,2,2.5
    Singh,456,1,3.25
    Yu,123,1,11
‣ Now we can work on the data with MapReduce, Hive, Pig, etc.
Get Data into Hadoop
create external table ext_sales
(
    lastname string,
    productid int,
    quantity int,
    sales_amount float
)
row format delimited fields terminated by ','
stored as textfile
location '/user/makromer/hiveext/input';

LOAD DATA INPATH '/user/makromer/import/sales.csv' OVERWRITE
INTO TABLE ext_sales;
Use Hive for Data Schema and Analysis
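Once the table exists, a typical next step is an aggregate such as total quantity and sales amount per product. The sketch below works through that calculation locally in Python over the seven sample rows of sales.csv, using the column types from the ext_sales definition. It illustrates the result one would expect from a GROUP BY on productid and is not Hive code; the function names `load_sales` and `totals_by_product` are hypothetical.

```python
import csv
import io
from collections import defaultdict

# The seven rows of sales.csv shown on the "Get Data into Hadoop" slide.
SALES_CSV = """Kromer,123,5,55
Smith,567,1,25
Jones,123,9,99
James,11,12,1
Johnson,456,2,2.5
Singh,456,1,3.25
Yu,123,1,11
"""


def load_sales(text):
    """Parse rows into the ext_sales column types: string, int, int, float."""
    rows = []
    for lastname, productid, quantity, sales_amount in csv.reader(io.StringIO(text)):
        rows.append((lastname, int(productid), int(quantity), float(sales_amount)))
    return rows


def totals_by_product(rows):
    """Return {productid: (total_quantity, total_sales_amount)}."""
    totals = defaultdict(lambda: [0, 0.0])
    for _lastname, productid, quantity, sales_amount in rows:
        totals[productid][0] += quantity
        totals[productid][1] += sales_amount
    return {pid: (qty, amount) for pid, (qty, amount) in totals.items()}


if __name__ == "__main__":
    totals = totals_by_product(load_sales(SALES_CSV))
    for productid, (qty, amount) in sorted(totals.items()):
        print(productid, qty, amount)
```

Checking a small sample by hand like this is a cheap way to validate a query before running it against the full data set.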
‣ sqoop import --connect jdbc:sqlserver://localhost --username sqoop --password password --table customers -m 1
‣ > hadoop fs -cat /user/mark/customers/part-m-00000
‣ > 5,Bob Smith
‣ sqoop export --connect jdbc:sqlserver://localhost --username sqoop --password password -m 1 --table customers --export-dir /user/mark/data/employees3
‣ 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Transferred 201 bytes in 32.6364 seconds (6.1588 bytes/sec)
‣ 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Exported 4 records.
Sqoop
Data transfer to & from Hadoop & SQL Server
SQL Server Big Data ā€“ Data Loading
Amazon HDFS & EMR Data Loading
Amazon S3 Bucket
Role of NoSQL in a Big Data Analytics Solution
‣ Use NoSQL to store data quickly without the overhead of an RDBMS
  ‣ HBase, Plain Old HDFS, Cassandra, MongoDB, Dynamo, just to name a few
‣ Why NoSQL?
  ‣ In the world of “Big Data”
  ‣ “Schema later”
  ‣ Ignore ACID properties
  ‣ Drop data into a key-value store quick & dirty
  ‣ Worry about query & read later
‣ Why NOT NoSQL?
  ‣ In the world of Big Data Analytics, you will need support from analytical tools with a SQL, SAS, or MR interface
‣ SQL Server and NoSQL
  ‣ Not a natural fit
  ‣ Use HDFS or your favorite NoSQL database
  ‣ Consider turning off SQL Server locking mechanisms
  ‣ Focus on writes, not reads (read uncommitted)
‣ SQL Server Database
  ‣ SQL 2012 Enterprise Edition
  ‣ Page Compression
  ‣ 2012 Columnar Compression on Fact Tables
  ‣ Clustered Index on all tables
  ‣ Auto-update statistics asynchronously
  ‣ Partition Fact Tables by month and archive data with the sliding window technique
  ‣ Drop all indexes before nightly ETL load jobs
  ‣ Rebuild all indexes when ETL completes
‣ SQL Server Analysis Services
  ‣ SSAS 2012 Enterprise Edition
  ‣ 2008 R2 OLAP cubes partition-aligned with DW
  ‣ 2012 in-memory tabular cubes
  ‣ All access through MSMDPUMP or SharePoint
SQL Server Big Data Environment
‣ Columnstore
‣ Sqoop adapter
‣ PolyBase
‣ Hive
‣ In-memory analytics
‣ Scale-out MPP
SQL Server Big Data Analytics Features
[Diagram labels: Sensors, Devices, Bots, Crawlers; ERP, CRM, LOB Apps; Unstructured and Structured Data; Hadoop on Windows Azure and Hadoop on Windows Server (petabytes of unstructured data); Connectors; Parallel Data Warehouse (hundreds of TB of structured data); BI Platform (SSRS, SSAS); Familiar End User Tools (Excel with PowerPivot, Embedded BI, Predictive Analytics); Data Market Place (Data Market)]
Microsoft's Data Solution – Big Data & PDW
MICROSOFT BIG DATA
[Diagram labels: Discover, Combine, Refine; Relational, Non-relational, Streaming; immersive data experiences; connecting with the world's data; any data, any size, anywhere; Self-Service, Collaboration, Corporate Apps, Devices; Analytical; Parallel Data Warehouse, Microsoft HDInsight Server, HDInsight Service, StreamInsight, PowerPivot, Power View]
Microsoft .NET Hadoop APIs
‣ WebHDFS
‣ LINQ to Hive
‣ MapReduce
‣ C#
‣ Java
‣ Hive
‣ Pig
‣ http://hadoopsdk.codeplex.com/
‣ SQL on Hadoop
  ‣ Cloudera Impala
  ‣ Teradata SQL-H
  ‣ Microsoft PolyBase
  ‣ Hadapt
Data Movement to the Cloud
‣ Use Windows Azure Blob Storage
  • Already stored in 3 copies
  • Hadoop can read from Azure Blob Storage
  • Allows you to upload while using no Hadoop network or CPU resources
‣ Compress files
  • Hadoop can read Gzip
  • Uses fewer network resources than uncompressed files
  • Lowers direct storage costs
  • Compress directories where source files are created as well
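To gauge how much compression might save before an upload, a file can be gzipped locally and the two sizes compared. The sketch below uses only the Python standard library; the helper name `gzip_file` and the synthetic, highly repetitive test data are assumptions for illustration, and real savings depend entirely on the data.

```python
import gzip
import os
import shutil
import tempfile


def gzip_file(src_path):
    """Write src_path + '.gz' next to the source and return the new path."""
    dst_path = src_path + ".gz"
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dst_path


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    csv_path = os.path.join(workdir, "sales.csv")
    # Synthetic data: one sample row repeated many times.
    with open(csv_path, "w") as f:
        for _ in range(10000):
            f.write("Kromer,123,5,55\n")
    gz_path = gzip_file(csv_path)
    print(os.path.getsize(csv_path), "bytes raw")
    print(os.path.getsize(gz_path), "bytes gzipped")
    shutil.rmtree(workdir)
```

One design point to weigh: a gzip file cannot be split, so a single large .gz file is read by one mapper. Compressing many moderately sized files keeps the job parallel.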
‣ What is a Big Data approach to Analytics?
  ‣ Massive scale
  ‣ Data discovery & research
  ‣ Self-service
  ‣ Reporting & BI
‣ Why do we take this Big Data Analytics approach?
  ‣ TBs of change data in each subject area
  ‣ The data in the sources is variable and unstructured
  ‣ SSIS ETL alone couldn't keep up or handle the complexity
‣ SQL Server 2012 columnstore and tabular SSAS 2012 are key to using SQL Server for Big Data
  ‣ With the configs mentioned previously, SQL Server works great
‣ Analytics on Big Data also requires Big Data Analytics tools
  ‣ Aster, Tableau, PowerPivot, SAS, Parallel Data Warehouse
Wrap-up

Philly Code Camp 2013 Mark Kromer Big Data with SQL Server

  • 1. Big Data with SQL Server Philly Code Camp 2013.1 May 2013 http://www.pssug.org Mark Kromer http://www.kromerbigdata.com @kromerbigdata @mssqldude makromer@microsoft.com
  • 2. ā€£What is Big Data? ā€£The Big Data and Apache Hadoop environment ā€£Big Data Analytics ā€£SQL Server in the Big Data world ā€£Microsoft + Hortonworks (Yahoo!) = HDInsights What weā€™ll (try) to cover today 2
  • 3. Big Data 101 ā€£ 3 Vā€™s ā€£ Volume ā€“ Terabyte records, transactions, tables, files ā€£ Velocity ā€“ Batch, near-time, real-time (analytics), streams. ā€£ Variety ā€“ Structures, unstructured, semi-structured, and all the above in a mix ā€£ Text Processing ā€£ Techniques for processing and analyzing unstructured (and structured) LARGE files ā€£ Analytics & Insights ā€£ Distributed File System & Programming
  • 4. ā€£ Batch Processing ā€£ Commodity Hardware ā€£ Data Locality, no shared storage ā€£ Scales linearly ā€£ Great for large text file processing, not so great on small files ā€£ Distributed programming paradigm
  • 5. Popular Hadoop Distributions Hosted PaaS Hadoop platforms: Amazon EMR, Pivotal, Microsoft Hadoop on Azure
  • 6. ā€£ Big Data ā‰  NoSQL ā€£ NoSQL has similar Internet-scale Web origins of Hadoop stack (Yahoo!, Google, Facebook, et al) but not the same thing ā€£ Facebook, for example, uses Hbase from the Hadoop stack ā€£ Big Data ā‰  Real Time ā€£ Big Data is primarily about batch processing huge files in a distributed manner and analyzing data that was otherwise too complex to provide value ā€£ Use in-memory analytics for real time insights ā€£ Big Data ā‰  Data Warehouse ā€£ I still refer to large multi-TB DWs as ā€œVLDBā€ ā€£ Big Data is about crunching stats in text files for discovery of new patterns and insights ā€£ Use the DW to aggregate and store the summaries of those calculations for reporting Markā€™s Big Data Myths
  • 7. Big Data Analytics Web Platform - Example
  • 8. using Microsoft.Hadoop.MapReduce; using System.Text.RegularExpressions; public class TotalHitsForPageMap : MapperBase { public override void Map(string inputLine, MapperContext context) { context.Log(inputLine); var parts = Regex.Split(inputLine, "s+"); if (parts.Length != expected) //only take records with all values { return; } context.EmitKeyValue(parts[pagePos], hit); } } MapReduce Framework (Map)
  • 9. MapReduce Framework (Reduce & Job)
      using System;                      // Environment
      using System.Collections.Generic;  // IEnumerable<T>
      using System.Linq;                 // Sum
      using Microsoft.Hadoop.MapReduce;

      public class TotalHitsForPageReducerCombiner : ReducerCombinerBase
      {
          // Sums the per-page hit counts emitted by the mapper.
          public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
          {
              context.EmitKeyValue(key, values.Sum(e => long.Parse(e)).ToString());
          }
      }

      public class TotalHitsJob : HadoopJob<TotalHitsForPageMap, TotalHitsForPageReducerCombiner>
      {
          // Input and output locations come from environment variables.
          public override HadoopJobConfiguration Configure(ExecutorContext context)
          {
              var retVal = new HadoopJobConfiguration();
              retVal.InputPath = Environment.GetEnvironmentVariable("W3C_INPUT");
              retVal.OutputFolder = Environment.GetEnvironmentVariable("W3C_OUTPUT");
              retVal.DeleteOutputFolder = true;
              return retVal;
          }
      }
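The mapper and reducer on the two slides above can be tried out locally without a cluster. The following is a minimal in-memory sketch in Python, not the .NET SDK itself. The log line layout, `EXPECTED_FIELDS`, `PAGE_POS`, and `HIT` are assumptions made for illustration, since the slides do not define `expected`, `pagePos`, or `hit`.

```python
from collections import defaultdict

# Hypothetical values: the slides do not define them.
EXPECTED_FIELDS = 4   # assumed number of whitespace-separated fields per log line
PAGE_POS = 2          # assumed index of the requested page within a line
HIT = "1"             # each complete record counts as one hit


def map_line(line):
    """Mapper: emit (page, "1") for complete records, nothing otherwise."""
    parts = line.split()
    if len(parts) != EXPECTED_FIELDS:
        return []
    return [(parts[PAGE_POS], HIT)]


def reduce_hits(key, values):
    """Reducer: sum the string counts collected for one page."""
    return key, str(sum(int(v) for v in values))


def run_job(lines):
    """Run map, shuffle (group values by key), and reduce in memory."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            groups[key].append(value)
    return dict(reduce_hits(k, v) for k, v in groups.items())


# Made-up sample log lines, including one incomplete record that is skipped.
sample_log = [
    "2013-05-07 12:11:00 /index.html 200",
    "2013-05-07 12:11:02 /about.html 200",
    "2013-05-07 12:11:05 /index.html 200",
    "malformed line",
]
totals = run_job(sample_log)
print(totals)
```

Values stay strings between the map and reduce steps, mirroring the `long.Parse(e)` and `ToString()` calls in the C# reducer.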
  • 10. Get Data into Hadoop
      ‣ Linux shell commands to access data in HDFS
      ‣ Put file in HDFS:
          hadoop fs -put sales.csv /import/sales.csv
      ‣ List files in HDFS:
          c:\Hadoop>hadoop fs -ls /import
          Found 1 items
          -rw-r--r-- 1 makromer supergroup 114 2013-05-07 12:11 /import/sales.csv
      ‣ View file in HDFS:
          c:\Hadoop>hadoop fs -cat /import/sales.csv
          Kromer,123,5,55
          Smith,567,1,25
          Jones,123,9,99
          James,11,12,1
          Johnson,456,2,2.5
          Singh,456,1,3.25
          Yu,123,1,11
      ‣ Now, we can work on the data with MapReduce, Hive, Pig, etc.
  • 11. Use Hive for Data Schema and Analysis
      create external table ext_sales
      (
          lastname string,
          productid int,
          quantity int,
          sales_amount float
      )
      row format delimited fields terminated by ','
      stored as textfile
      location '/user/makromer/hiveext/input';

      LOAD DATA INPATH '/user/makromer/import/sales.csv' OVERWRITE INTO TABLE ext_sales;
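To see what the `ext_sales` schema does to the comma-delimited file, here is a small Python sketch that applies the same four column types to the sample rows from the "Get Data into Hadoop" slide and totals `sales_amount` per `productid`. The aggregation is an illustrative choice, not a query from the deck; it corresponds roughly to `SELECT productid, SUM(sales_amount) FROM ext_sales GROUP BY productid` in Hive.

```python
import csv
import io
from collections import defaultdict

# Sample rows as shown by "hadoop fs -cat /import/sales.csv" on the earlier slide.
SALES_CSV = """Kromer,123,5,55
Smith,567,1,25
Jones,123,9,99
James,11,12,1
Johnson,456,2,2.5
Singh,456,1,3.25
Yu,123,1,11
"""

# Column names and types taken from the ext_sales table definition.
COLUMNS = [
    ("lastname", str),
    ("productid", int),
    ("quantity", int),
    ("sales_amount", float),
]


def read_sales(text):
    """Parse comma-delimited text into typed rows, as the Hive schema would."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        rows.append({name: cast(value) for (name, cast), value in zip(COLUMNS, record)})
    return rows


def sales_by_product(rows):
    """Total sales_amount for each productid (a GROUP BY in plain Python)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["productid"]] += row["sales_amount"]
    return dict(totals)


rows = read_sales(SALES_CSV)
totals = sales_by_product(rows)
print(totals)
```

The point of the schema-on-read approach is visible here: the file stays plain text, and types are applied only when the data is read.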
  • 12. Sqoop Data transfer to & from Hadoop & SQL Server
      ‣ sqoop import --connect jdbc:sqlserver://localhost --username sqoop --password password --table customers -m 1
      ‣ > hadoop fs -cat /user/mark/customers/part-m-00000
      ‣ > 5,Bob Smith
      ‣ sqoop export --connect jdbc:sqlserver://localhost --username sqoop --password password -m 1 --table customers --export-dir /user/mark/data/employees3
      ‣ 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Transferred 201 bytes in 32.6364 seconds (6.1588 bytes/sec)
      ‣ 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Exported 4 records.
  • 13. SQL Server Big Data – Data Loading ‣ Amazon HDFS & EMR Data Loading ‣ Amazon S3 Bucket
  • 14. Role of NoSQL in a Big Data Analytics Solution ‣ Use NoSQL to store data quickly without the overhead of an RDBMS ‣ HBase, Plain Old HDFS, Cassandra, MongoDB, Dynamo, just to name a few ‣ Why NoSQL? ‣ In the world of "Big Data" ‣ "Schema later" ‣ Ignore ACID properties ‣ Drop data into key-value store quick & dirty ‣ Worry about query & read later ‣ Why NOT NoSQL? ‣ In the world of Big Data Analytics, you will need support from analytical tools with a SQL, SAS, MR interface ‣ SQL Server and NoSQL ‣ Not a natural fit ‣ Use HDFS or your favorite NoSQL database ‣ Consider turning off SQL Server locking mechanisms ‣ Focus on writes, not reads (read uncommitted)
  • 15. SQL Server Big Data Environment ‣ SQL Server Database ‣ SQL 2012 Enterprise Edition ‣ Page Compression ‣ 2012 Columnar Compression on Fact Tables ‣ Clustered Index on all tables ‣ Auto-update Stats Async ‣ Partition Fact Tables by month and archive data with sliding window technique ‣ Drop all indexes before nightly ETL load jobs ‣ Rebuild all indexes when ETL completes ‣ SQL Server Analysis Services ‣ SSAS 2012 Enterprise Edition ‣ 2008 R2 OLAP cubes partition-aligned with DW ‣ 2012 in-memory tabular cubes ‣ All access through MSMDPUMP or SharePoint
  • 17. Microsoft's Data Solution – Big Data & PDW [Diagram labels: Sensors; Devices; Bots; Crawlers; ERP; CRM; LOB Apps; Unstructured and Structured Data; Parallel Data Warehouse; Hadoop on Windows Azure; Hadoop on Windows Server; Connectors; SSRS; SSAS; BI Platform; Familiar End User Tools; Excel with PowerPivot; Embedded BI; Predictive Analytics; Data Marketplace; Data Market; Petabytes of Data (Unstructured); Hundreds of TB of Data (Structured)]
  • 18. MICROSOFT BIG DATA [Diagram labels: Discover; Combine; Refine; Relational; Non-relational; Streaming; immersive data experiences; connecting with world's data; any data, any size, anywhere; Self-Service; Collaboration; Corporate Apps; Devices; Analytical; Parallel Data Warehouse; Microsoft HDInsight Server; HDInsight Service; StreamInsight; PowerPivot; Power View]
  • 20. Microsoft .NET Hadoop APIs ‣ WebHDFS ‣ Linq to Hive ‣ MapReduce ‣ C# ‣ Java ‣ Hive ‣ Pig ‣ http://hadoopsdk.codeplex.com/ ‣ SQL on Hadoop ‣ Cloudera Impala ‣ Teradata SQL-H ‣ Microsoft Polybase ‣ Hadapt
  • 21. Data Movement to the Cloud ‣ Use Windows Azure Blob Storage • Already stored in 3 copies • Hadoop can read from Azure blob storage • Allows you to upload while using no Hadoop network or CPU resources ‣ Compress files • Hadoop can read Gzip • Uses less network resources than uncompressed • Costs less for direct storage costs • Compress directories where source files are created as well.
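The "compress files" advice can be sketched with Python's standard library. This is a minimal example under stated assumptions: the file name and its repetitive CSV-like contents are made up for the demo, and the upload to blob storage is not shown.

```python
import gzip
import os
import shutil
import tempfile


def gzip_file(src_path):
    """Write a gzip copy next to src_path and return the new file's path."""
    dst_path = src_path + ".gz"
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dst_path


# Demo on a throwaway file with repetitive, CSV-like content (made up for this sketch).
work_dir = tempfile.mkdtemp()
source = os.path.join(work_dir, "sales.csv")
with open(source, "w") as f:
    f.write("Kromer,123,5,55\n" * 1000)

compressed = gzip_file(source)
original_size = os.path.getsize(source)
compressed_size = os.path.getsize(compressed)
print(original_size, compressed_size)
```

Repetitive text such as log and CSV files usually shrinks considerably under gzip, which is where the network and storage savings on the slide come from. The actual ratio depends on the data.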
  • 22. Wrap-up ‣ What is a Big Data approach to Analytics? ‣ Massive scale ‣ Data discovery & research ‣ Self-service ‣ Reporting & BI ‣ Why do we take this Big Data Analytics approach? ‣ TBs of change data in each subject area ‣ The data in the sources are variable and unstructured ‣ SSIS ETL alone couldn't keep up or handle complexity ‣ SQL Server 2012 columnstore and tabular SSAS 2012 are key to using SQL Server for Big Data ‣ With the configs mentioned previously, SQL Server works great ‣ Analytics on Big Data also requires Big Data Analytics tools ‣ Aster, Tableau, PowerPivot, SAS, Parallel Data Warehouse