Big Data with SQL Server

My updated Big Data with SQL Server presentation from my Razorfish case study. Presented Nov 17 @ Philly Code Camp.


Big Data with SQL Server Presentation Transcript

  • Big Data with SQL Server
    Philly Code Camp, November 2012
    Mark Kromer, BI & Big Data Technology Director
    http://www.kromerbigdata.com
    @kromerbigdata @mssqldude
  • What we’ll (try to) cover today
    ‣ What is Big Data?
    ‣ The Big Data and Apache Hadoop environment
    ‣ Big Data Analytics
    ‣ SQL Server in the Big Data world
    ‣ How we utilize Big Data @ Razorfish
  • Big Data 101
    ‣ 3 V’s
      ‣ Volume – Terabyte records, transactions, tables, files
      ‣ Velocity – Batch, near-time, real-time (analytics), streams
      ‣ Variety – Structured, unstructured, semi-structured, and all of the above in a mix
    ‣ Text Processing
      ‣ Techniques for processing and analyzing unstructured (and structured) LARGE files
    ‣ Analytics & Insights
    ‣ Distributed File System & Programming
  • ‣ Batch Processing
    ‣ Commodity Hardware
    ‣ Data Locality, no shared storage
    ‣ Scales linearly
    ‣ Great for large text file processing, not so great on small files
    ‣ Distributed programming paradigm
  • MapReduce Framework (Map)
    using Microsoft.Hadoop.MapReduce;
    using System.Text.RegularExpressions;

    public class TotalHitsForPageMap : MapperBase
    {
        // expected, pagePos, and hit are not defined on the slide.
        public override void Map(string inputLine, MapperContext context)
        {
            context.Log(inputLine);
            // Split the log line on runs of whitespace.
            var parts = Regex.Split(inputLine, @"\s+");
            if (parts.Length != expected) // only take records with all values
            {
                return;
            }
            context.EmitKeyValue(parts[pagePos], hit);
        }
    }
  • MapReduce Framework (Reduce & Job)
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using Microsoft.Hadoop.MapReduce;

    public class TotalHitsForPageReducerCombiner : ReducerCombinerBase
    {
        // Sum the string-encoded hit counts emitted for each page.
        public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
        {
            context.EmitKeyValue(key, values.Sum(e => long.Parse(e)).ToString());
        }
    }

    public class TotalHitsJob : HadoopJob<TotalHitsForPageMap, TotalHitsForPageReducerCombiner>
    {
        // Input and output locations come from environment variables.
        public override HadoopJobConfiguration Configure(ExecutorContext context)
        {
            var retVal = new HadoopJobConfiguration();
            retVal.InputPath = Environment.GetEnvironmentVariable("W3C_INPUT");
            retVal.OutputFolder = Environment.GetEnvironmentVariable("W3C_OUTPUT");
            retVal.DeleteOutputFolder = true;
            return retVal;
        }
    }
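To make the map, shuffle, and reduce flow of the two slides above easier to follow, here is a minimal single-process sketch in Python. It is not the Hadoop SDK and runs nothing on a cluster. The field layout (`EXPECTED_FIELDS`, `PAGE_POS`) and the emitted value `"1"` are assumptions, since the slides do not show the values of `expected`, `pagePos`, or `hit`.

```python
import re
from collections import defaultdict

# Assumed layout of a whitespace-delimited log line (hypothetical; the slides
# do not show the values of `expected` or `pagePos`).
EXPECTED_FIELDS = 3
PAGE_POS = 1


def map_line(line):
    """Map step: emit (page, "1") for a complete record, nothing otherwise."""
    parts = re.split(r"\s+", line.strip())
    if len(parts) != EXPECTED_FIELDS:
        return []
    return [(parts[PAGE_POS], "1")]


def reduce_hits(key, values):
    """Reduce step: sum the string-encoded counts for one page."""
    return key, str(sum(int(v) for v in values))


def run_job(lines):
    """Run map, group by key (the shuffle), then reduce, in one process."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            grouped[key].append(value)
    return dict(reduce_hits(k, v) for k, v in sorted(grouped.items()))


if __name__ == "__main__":
    sample = [
        "2012-11-17 /index.html 200",
        "2012-11-17 /about.html 200",
        "2012-11-17 /index.html 304",
        "malformed line",
    ]
    print(run_job(sample))
```

In a real job the grouping step is done by the framework across machines; only the map and reduce functions are user code.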
  • Mark’s Big Data Myths
    ‣ Big Data ≠ NoSQL
      ‣ NoSQL shares the Internet-scale Web origins of the Hadoop stack (Yahoo!, Google, Facebook, et al.) but is not the same thing
      ‣ Facebook, for example, uses HBase from the Hadoop stack
    ‣ Big Data ≠ Real Time
      ‣ Big Data is primarily about batch processing huge files in a distributed manner and analyzing data that was otherwise too complex to provide value
      ‣ Use in-memory analytics for real-time insights
    ‣ Big Data ≠ Data Warehouse
      ‣ I still refer to large multi-TB DWs as “VLDB”
      ‣ Big Data is about crunching stats in text files for discovery of new patterns and insights
      ‣ Use the DW to aggregate and store the summaries of those calculations for reporting
  • Razorfish & Big Data
    ‣ Web Analytics
    ‣ Big Data Analytics
    ‣ Digital Marketing – Ad Server Analytics
    ‣ Multiple TBs of online data per client per year
    ‣ Elastic Web-scale MapReduce & Hadoop
    ‣ Increase ROI of digital marketing campaigns
  • Big Data Analytics Web Platform
  • In-Database Analytics (Teradata Aster)
    ‣ Because of built-in analytics functions and big data performance, Aster becomes the data scientist’s sandbox and BI’s big data analytics processor.
    ‣ Prepackaged Analytics Functions (including Attribution)
  • SQL Server Big Data – Data Loading
    Amazon HDFS & EMR Data Loading Amazon S3 Bucket
  • Sqoop
    Data transfer to & from Hadoop & SQL Server
    ‣ sqoop import --connect jdbc:sqlserver://localhost --username sqoop --password password --table customers -m 1
    ‣ > hadoop fs -cat /user/mark/customers/part-m-00000
    ‣ > 5,Bob Smith
    ‣ sqoop export --connect jdbc:sqlserver://localhost --username sqoop --password password -m 1 --table customers --export-dir /user/mark/data/employees3
    ‣ 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Transferred 201 bytes in 32.6364 seconds (6.1588 bytes/sec)
    ‣ 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Exported 4 records.
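The import above lands each table row in HDFS as a line of delimited text (the `5,Bob Smith` line shown by `hadoop fs -cat`). As a rough illustration of what a downstream consumer sees, the sketch below parses that text in Python. It assumes the default comma delimiter; the column names `id` and `name` are hypothetical, since the slide does not show the `customers` schema.

```python
import csv
import io


def parse_part_file(text, columns):
    """Turn comma-delimited part-file text into a list of dicts."""
    reader = csv.reader(io.StringIO(text))
    return [dict(zip(columns, row)) for row in reader if row]


if __name__ == "__main__":
    # Stand-in for the text printed by
    # `hadoop fs -cat /user/mark/customers/part-m-00000`.
    sample = "5,Bob Smith\n"
    # Column names are assumptions for illustration only.
    print(parse_part_file(sample, ["id", "name"]))
```

All values come back as strings; any type conversion is left to the caller.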
  • SQL Server Big Data Environment
    ‣ SQL Server Database
      ‣ SQL Server 2008 R2 or 2012 Enterprise Edition
      ‣ Page Compression
      ‣ 2012 Columnar Compression on Fact Tables
      ‣ Clustered Index on all tables
      ‣ Auto Update Statistics Asynchronously
      ‣ Partition Fact Tables by month and archive data with sliding window technique
      ‣ Drop all indexes before nightly ETL load jobs
      ‣ Rebuild all indexes when ETL completes
    ‣ SQL Server Analysis Services
      ‣ SSAS 2008 R2 or 2012 Enterprise Edition
      ‣ 2008 R2 OLAP cubes partition-aligned with DW
      ‣ 2012 in-memory tabular cubes
      ‣ All access through MSMDPUMP or SharePoint
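The monthly sliding window mentioned above comes down to date arithmetic: each month one new boundary is added at the front and the oldest boundary is merged away at the back. The Python sketch below computes those two boundaries and formats the matching `ALTER PARTITION FUNCTION` statements. The partition function name `pf_fact_month` and the retention length are hypothetical, and the sketch leaves out the steps a real slide also needs, such as switching the oldest partition out to an archive table and setting the next-used filegroup on the partition scheme.

```python
from datetime import date


def month_start(d, offset=0):
    """Return the first day of the month `offset` months away from d."""
    index = d.year * 12 + (d.month - 1) + offset
    return date(index // 12, index % 12 + 1, 1)


def sliding_window(today, months_kept):
    """Return (oldest boundary to merge away, new boundary to split in)."""
    oldest = month_start(today, -months_kept)
    newest = month_start(today, 1)
    return oldest, newest


def slide_statements(today, months_kept, function_name="pf_fact_month"):
    """Format the SPLIT and MERGE statements for one monthly slide.

    function_name is a hypothetical partition function name.
    """
    oldest, newest = sliding_window(today, months_kept)
    return [
        f"ALTER PARTITION FUNCTION {function_name}() "
        f"SPLIT RANGE ('{newest.isoformat()}');",
        f"ALTER PARTITION FUNCTION {function_name}() "
        f"MERGE RANGE ('{oldest.isoformat()}');",
    ]


if __name__ == "__main__":
    for statement in slide_statements(date(2012, 11, 17), months_kept=12):
        print(statement)
```

Generating the statements from dates rather than hard-coding them keeps the nightly or monthly job idempotent in spirit: the same run date always yields the same boundaries.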
  • SQL Server Big Data Analytics Features
    ‣ Columnstore
    ‣ Sqoop adapter
    ‣ PolyBase
    ‣ Hive
    ‣ In-memory analytics
    ‣ Scale-out MPP
  • Wrap-up
    ‣ What is a Big Data approach to Analytics?
      ‣ Massive scale
      ‣ Data discovery & research
      ‣ Self-service
      ‣ Reporting & BI
    ‣ Why did we take this Big Data Analytics approach?
      ‣ Each Web client produces an average of 6 TB of ICA data in a year
      ‣ The data in the sources are variable and unstructured
      ‣ SSIS ETL alone couldn’t keep up or handle complexity
      ‣ SQL Server 2012 columnstore and tabular SSAS 2012 were key to using SQL Server for Big Data
      ‣ With the configs mentioned previously, SQL Server is working great
    ‣ Analytics on Big Data also requires Big Data Analytics tools
      ‣ Aster, Tableau, PowerPivot, SAS