Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack


Published on

Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack - 24 Hours of PASS

Published in: Technology
1 Comment
No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack

  1. 1. Global Sponsor:Big Data on the Microsoft PlatformAndrew J. Brust, CEO, Blue Badge InsightsWith Hadoop, MS BI and the SQL Server stack
  2. 2. CEO and Founder, Blue Badge InsightsBig Data blogger for ZDNetMicrosoft Regional Director, MVPCo-chair VSLive! and 17 years as a speakerFounder, Microsoft BI User Group of NYC http://www.msbigdatanyc.comCo-moderator, NYC .NET Developers Group“Redmond Review” columnist forVisual Studio, Twitter: @andrewbrustMeet Andrew
  3. 3. Read all about it!
  4. 4. My New Blog (
  5. 5. AgendaBig Data, Hadoop and HDInsightMapReduceHive ODBC, BI StackHekaton, NoSQLSQL Server Parallel Data Warehouse, MPP, PolyBase
  6. 6. What is Big Data?100s of TB into PB and higherInvolving data from: financial data, sensors, web logs,social media, etc.Parallel processing often involved Hadoop is emblematic, but other technologies are Big Data tooProcessing of data sets too large for transactionaldatabases Analyzing interactions, rather than transactions The three V’s: Volume, Velocity, Variety•Big Data tech sometimes imposed on small data problems
  7. 7. What is Hadoop?Open source implementation of Google’s MapReduce andGFS (Google File System)Allows for scale-out processing of petabyte scale data 1 PB = 1,024 TBAlso distributed storageCommodity hardwareCan work against flat files, or certain database formatsNative processing involves imperative Java codeOther languages supported through “Streaming”7
  8. 8. What is HDInsight?Microsoft’s Hadoop distribution, on Windows Most other distros on LinuxBased on Hortonworks Data Platform (HDP)Runs on Azure, eventually on Windows Server, and assandbox on dev PCFor .NET devs: .NET SDK for Hadoop, LINQ provider8
  9. 9. Global Sponsor:DemoHDInsight
  10. 10. The Hadoop StackMapReduce, HDFSDatabaseRDBMS Import/ExportQuery: HiveQL and Pig LatinMachine Learning/Data MiningLog file integration
  11. 11. MapReduce, in a DiagrammappermappermappermappermappermapperInputreducerreducerreducerInputInputInputInputInputInputOutputOutputOutputOutputOutputOutputOutputInputInputInputK1K2K3OutputOutputOutput
  12. 12. A MapReduce Example• Count by suite, on each floor• Send per-suite, per platform totals to lobby• Sort totals by platform• Send two platform packets to 10th, 20th, 30th floor• Tally up each platform• Merge tallies into one spreadsheet• Collect the tallies
  13. 13. MapReduce OptionsPig, Hive, Sqoop, Mahout also generate MapReduce code13JavaJavaScript (“Rhino”)Other languages, especially Python, via StreamingC# via StreamingC# via .NET SDK
  14. 14. Hortonworks DataPlatform forWindowsMRLib (NuGetPackage)LINQ to HiveOdbcClient +Hive ODBCDriverDeploymentWebHDFSclientMR code in C#,HadoopJob,MapperBase,ReducerBaseAmenities forVisual Studio/.NET
  15. 15. Global Sponsor:DemoMapReduce
  16. 16. HiveBegan as Hadoop sub-project Now top-level Apache projectProvides a SQL-like (“HiveQL”) abstraction overMapReduceHas its own HDFS table file format (and it’s fully schema-bound)Can also work over HBaseActs as a bridge to many BI products which expect tabulardata
  17. 17. Hive ODBC Consumers17Excel 2010 or 2013 (including via add-in)PowerPivotSQL Server Analysis Services, Tabular ModeSQL Server Reporting ServicesADO.NET OdbcClient providerLINQ provider
  18. 18. xVelocity TechnologiesFormerly known as VertiPaqPowerPivot, SSAS Tabular, SQL Server columnar indexesImplements BI Semantic Model (BISM)Uses column store technology Compression In-memory SpeedNot a Big Data technology per se, but very useful foranalysis of job output18
  19. 19. Power ViewReports on BISM models (PowerPivot, SSAS Tabular)Hosted in SharePoint 2010, 2013 EnterpriseAlso Excel 2013 (but not on ARM/Windows RT)Interactive data exploration19
  20. 20. Global Sponsor:DemoHive ODBC + BI Stack
  21. 21. Project “Hekaton”In-memory engine for SQL Server transactional workloadsTables must be declared as in-memory explicitlyIn-memory and standard tables can coexist in same dbStored procs on in-mem tables are compiled to native codeHekaton and xVelocity are separateHekaton ≠ PowerPivot/SSAS TabularHekaton ≠ Columnstore indexesCompare to SAP HANA In-memory, transactional, analytical, column store21
  22. 22. NoSQLNoSQL databases are non-relational and non- or loosely-schematizedHBase is a NoSQL database, of the wide column variety Hive implements a SQL layer over it HBase not yet in HDInsightHBase table = HDFS fileThree other NoSQL categories Key-value store, document store, graph database Azure Table Storage is a key-value store NoSQL databaseSome of them aren’t really Big Data tools, but marketthemselves that way anyway22
  23. 23. SQL Parallel Data Warehouse (PDW)SQL PDW is a Massively Parallel Processing (MPP)databaseTeradata, IBM Netezza, HP Vertica also in this categoryIt’s an array/cluster of SQL servers made to look like oneSQL ServerAvailable as appliance only Purchase from HP, Dell Server, storage and network all pre-built and configuredMany other MPP products based on PostgreSQLPDW loosely based on acquired DATAllegro product Implemented MPP with Ingres, written in Java, running on Linux23
  24. 24. MapReduce versus MPP24MapReduce MPP Splits preprocessing amongst mappernodes and aggregation amongst reducers Scales infinitely on commodity hardware Uses locally attached commodity disks onnodes Uses imperative code Processes flat files, wide column tables(HBase), relational tables (Hive) Divide-and-conquer approach, parallel,distributed Splits query amongst nodes then unifiesresult sets Scales to high-end assets in theappliance cabinet Uses shared storage (can be morenetwork traffic) Uses SQL Works with relational tables only Divide-and-conquer approach, parallel,distributed
  25. 25. PolyBaseTo be included in next version of PDWMashup of SQL Server and HadoopEnables PDW to address Hadoop data nodes (HDFS)directlyParallelism managed by PDWTables are imported into SQL Server db They are EXTERNAL tables They can participate in joins with standard tables25
  26. 26. ResourcesMS Big Data/HDInsight Hadoop HBase PDW store View
  27. 27. Global Sponsor:Questions?
  28. 28. Global Sponsor:Thank You for Attending