Big Data with SQL Server


Philly Code Camp
November 2012



Mark Kromer
BI & Big Data Technology Director
http://www.kromerbigdata.com
@kromerbigdata
@mssqldude
What we’ll (try to) cover today

‣ What is Big Data?
‣ The Big Data and Apache Hadoop environment
‣ Big Data Analytics
‣ SQL Server in the Big Data world
‣ How we utilize Big Data @ Razorfish




Big Data 101

‣ 3 V’s
   ‣ Volume – Terabyte records, transactions, tables, files
   ‣ Velocity – Batch, near-time, real-time (analytics), streams
   ‣ Variety – Structured, unstructured, semi-structured, and a mix of all of the above
‣ Text Processing
   ‣ Techniques for processing and analyzing unstructured (and structured) LARGE files
‣ Analytics & Insights
‣ Distributed File System & Programming
   ‣ Batch Processing
   ‣ Commodity Hardware
   ‣ Data Locality, no shared storage
   ‣ Scales linearly
   ‣ Great for large text file processing, not so great on small files
   ‣ Distributed programming paradigm
MapReduce Framework (Map)

using System.Text.RegularExpressions;
using Microsoft.Hadoop.MapReduce;

public class TotalHitsForPageMap : MapperBase
{
    // Assumed constants for this demo's W3C log format; the original slide
    // defines these elsewhere, so the values here are illustrative only.
    private const int expected = 14;   // number of fields in a complete record
    private const int pagePos = 4;     // index of the page (cs-uri-stem) field
    private const string hit = "1";    // each record counts as one hit

    public override void Map(string inputLine, MapperContext context)
    {
        context.Log(inputLine);
        var parts = Regex.Split(inputLine, @"\s+");  // split on whitespace
        if (parts.Length != expected) // only take records with all values
        {
            return;
        }
        context.EmitKeyValue(parts[pagePos], hit);
    }
}
MapReduce Framework (Reduce & Job)

using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Hadoop.MapReduce;

public class TotalHitsForPageReducerCombiner : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
    {
        // Sum the per-map hit counts emitted for each page key
        context.EmitKeyValue(key, values.Sum(e => long.Parse(e)).ToString());
    }
}

public class TotalHitsJob : HadoopJob<TotalHitsForPageMap, TotalHitsForPageReducerCombiner>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        var retVal = new HadoopJobConfiguration();
        retVal.InputPath = Environment.GetEnvironmentVariable("W3C_INPUT");
        retVal.OutputFolder = Environment.GetEnvironmentVariable("W3C_OUTPUT");
        retVal.DeleteOutputFolder = true;
        return retVal;
    }
}
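
To see what the job computes without a cluster, here is a minimal, hypothetical sketch that simulates the same map/shuffle/reduce flow in-process with LINQ: map each line to its page field, group by page, and sum the hits. The sample lines and field positions are assumptions for illustration only.

using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TotalHitsLocalSim
{
    public static void Main()
    {
        // Hypothetical log lines; the real job reads W3C logs from W3C_INPUT.
        var lines = new[]
        {
            "2012-11-01 10:00:00 GET /default.aspx 200",
            "2012-11-01 10:00:01 GET /products.aspx 200",
            "2012-11-01 10:00:02 GET /default.aspx 200"
        };
        const int pagePos = 3;  // index of the page field in these sample lines

        var totals = lines
            .Select(line => Regex.Split(line, @"\s+"))            // Map: split fields
            .GroupBy(parts => parts[pagePos])                     // Shuffle: group by page
            .Select(g => new { Page = g.Key, Hits = g.Count() }); // Reduce: sum hits

        foreach (var t in totals)
            Console.WriteLine("{0}\t{1}", t.Page, t.Hits);
    }
}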
Mark’s Big Data Myths

‣ Big Data ≠ NoSQL
    ‣ NoSQL has similar Internet-scale Web origins to the Hadoop stack (Yahoo!,
      Google, Facebook, et al.), but it is not the same thing
    ‣ Facebook, for example, uses HBase from the Hadoop stack
‣ Big Data ≠ Real Time
    ‣ Big Data is primarily about batch processing huge files in a distributed manner
      and analyzing data that was otherwise too complex to provide value
    ‣ Use in-memory analytics for real-time insights
‣ Big Data ≠ Data Warehouse
    ‣ I still refer to large multi-TB DWs as “VLDB”
    ‣ Big Data is about crunching stats in text files for discovery of new patterns and
      insights
    ‣ Use the DW to aggregate and store the summaries of those calculations for
      reporting
Razorfish & Big Data

‣ Web Analytics
‣ Big Data Analytics
‣ Digital Marketing – Ad Server Analytics
‣ Multiple TBs of online data per client per year
‣ Elastic Web-scale MapReduce & Hadoop
‣ Increase ROI of digital marketing campaigns
Big Data Analytics Web Platform
In-Database Analytics (Teradata Aster)
•   Because of built-in analytics functions and big data performance, Aster becomes
    the data scientist’s sandbox and BI’s big data analytics processor.
•   Prepackaged Analytics Functions (including Attribution)
SQL Server Big Data – Data Loading

[Diagram: data loaded from Amazon HDFS & EMR through an Amazon S3 bucket into SQL Server]
Sqoop
Data transfer to & from Hadoop & SQL Server

‣ sqoop import --connect jdbc:sqlserver://localhost --username sqoop --password
  password --table customers -m 1

‣ > hadoop fs -cat /user/mark/customers/part-m-00000
‣ > 5,Bob Smith

‣ sqoop export --connect jdbc:sqlserver://localhost --username sqoop --password
  password -m 1 --table customers --export-dir /user/mark/data/employees3
‣ 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Transferred 201 bytes in
  32.6364 seconds (6.1588 bytes/sec)
‣ 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Exported 4 records.
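
As a quick round-trip sanity check, a minimal sketch like the following (assuming a local SQL Server instance; the database name and credentials here are hypothetical) can confirm the row count Sqoop reported:

using System;
using System.Data.SqlClient;

public static class SqoopRoundTripCheck
{
    public static void Main()
    {
        // Hypothetical connection string; adjust server, database, and credentials.
        var connStr = "Server=localhost;Database=sqoop_demo;User Id=sqoop;Password=password;";

        using (var conn = new SqlConnection(connStr))
        using (var cmd = new SqlCommand("SELECT COUNT(*) FROM customers", conn))
        {
            conn.Open();
            // Should line up with the record count logged by mapreduce.ExportJobBase above.
            Console.WriteLine("customers rows: {0}", (int)cmd.ExecuteScalar());
        }
    }
}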
SQL Server Big Data Environment

‣ SQL Server Database
   ‣   SQL Server 2008 R2 or 2012 Enterprise Edition
   ‣   Page Compression
   ‣   2012 Columnstore Compression on Fact Tables
   ‣   Clustered Index on all tables
   ‣   Auto-Update Statistics (Async)
   ‣   Partition Fact Tables by month and archive data with sliding window technique
   ‣   Drop all indexes before nightly ETL load jobs
   ‣   Rebuild all indexes when ETL completes
‣ SQL Server Analysis Services
   ‣   SSAS 2008 R2 or 2012 Enterprise Edition
   ‣   2008 R2 OLAP cubes partition-aligned with DW
   ‣   2012 cubes use the in-memory tabular model
   ‣   All access through MSMDPUMP or SharePoint
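
A minimal sketch of the drop/rebuild index pattern described above, kept in C# to match the rest of the examples; the table and index names are hypothetical, and the actual ETL load is elided:

using System;
using System.Data.SqlClient;

public static class NightlyEtlMaintenance
{
    // Hypothetical warehouse connection; adjust for your environment.
    private const string ConnStr = "Server=localhost;Database=BigDataDW;Integrated Security=true;";

    public static void Main()
    {
        // Disable each nonclustered index before the nightly load (disabling the
        // clustered index would take the whole table offline, so leave it alone).
        Exec("ALTER INDEX IX_FactPageHits_PageId ON dbo.FactPageHits DISABLE;");

        // ... nightly ETL load runs here ...

        // Rebuild everything once the load completes; REBUILD also re-enables
        // the disabled indexes and refreshes their statistics.
        Exec("ALTER INDEX ALL ON dbo.FactPageHits REBUILD;");
    }

    private static void Exec(string sql)
    {
        using (var conn = new SqlConnection(ConnStr))
        using (var cmd = new SqlCommand(sql, conn))
        {
            conn.Open();
            cmd.CommandTimeout = 0;  // index rebuilds on large fact tables can run long
            cmd.ExecuteNonQuery();
        }
    }
}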
SQL Server Big Data Analytics Features

‣ Columnstore
‣ Sqoop adapter
‣ PolyBase
‣ Hive
‣ In-memory analytics
‣ Scale-out MPP
Wrap-up

‣ What is a Big Data approach to Analytics?
   ‣ Massive scale
   ‣ Data discovery & research
   ‣ Self-service
   ‣ Reporting & BI
‣ Why did we take this Big Data Analytics approach?
   ‣ Each Web client produces an average of 6 TBs of ICA data in a year
   ‣ The data in the sources is variable and unstructured
   ‣ SSIS ETL alone couldn’t keep up or handle the complexity
   ‣ SQL Server 2012 columnstore and tabular SSAS 2012 were key to using SQL
     Server for Big Data
   ‣ With the configurations mentioned previously, SQL Server is working great
‣ Analytics on Big Data also requires Big Data Analytics tools
   ‣ Aster, Tableau, PowerPivot, SAS
