Big Data on the Microsoft Platform


24 Hours of PASS - Big Data on the Microsoft Platform

  1. 1. Big Data on the Microsoft Platform: With Hadoop, MS BI and the SQL Server stack. Andrew J. Brust, CEO, Blue Badge Insights. Global Sponsor:
  2. 2. Meet Andrew: CEO and Founder, Blue Badge Insights; Big Data blogger for ZDNet; Microsoft Regional Director, MVP; Co-chair, VSLive!, and 17 years as a speaker; Founder, Microsoft BI User Group of NYC; Co-moderator, NYC .NET Developers Group; “Redmond Review” columnist for Visual Studio Magazine; Twitter: @andrewbrust
  3. 3. Read all about it!
  4. 4. My New Blog (
  5. 5. Agenda: Big Data, Hadoop and HDInsight; MapReduce; Hive ODBC, BI Stack; Hekaton, NoSQL; SQL Server Parallel Data Warehouse, MPP, PolyBase
  6. 6. What is Big Data? 100s of TB into PB and higher. Involving data from: financial data, sensors, web logs, social media, etc. Parallel processing often involved (Hadoop is emblematic, but other technologies are Big Data too). Processing of data sets too large for transactional databases: analyzing interactions, rather than transactions. The three V’s: Volume, Velocity, Variety. Big Data tech sometimes imposed on small data problems.
  7. 7. What is Hadoop? Open source implementation of Google’s MapReduce and GFS (Google File System). Allows for scale-out processing of petabyte-scale data (1 PB = 1,024 TB). Also distributed storage. Commodity hardware. Can work against flat files, or certain database formats. Native processing involves imperative Java code. Other languages supported through “Streaming”.
  8. 8. What is HDInsight? Microsoft’s Hadoop distribution, on Windows (most other distros on Linux). Based on Hortonworks Data Platform (HDP). Runs on Azure, eventually on Windows Server, and as sandbox on dev PC. For .NET devs: .NET SDK for Hadoop, LINQ provider.
  9. 9. Demo: HDInsight
  10. 10. The Hadoop Stack: Log file integration; Machine Learning/Data Mining; RDBMS Import/Export; Query: HiveQL and Pig Latin; Database; MapReduce, HDFS
  11. 11. MapReduce, in a Diagram [diagram: Input is split across several mappers; mapper Output is grouped under keys K1, K2 and K3 and passed as Input to the reducers; each reducer produces Output]
  12. 12. A MapReduce Example: • Count by suite, on each floor • Send per-suite, per-platform totals to lobby • Sort totals by platform • Send two platform packets to 10th, 20th, 30th floor • Tally up each platform • Collect the tallies • Merge tallies into one spreadsheet
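The building analogy above can be sketched as a tiny in-process MapReduce in Python. This is only an illustration under stated assumptions: the slide does not say what is being counted, so the sketch assumes devices tallied by platform, and the `observations` data and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical survey data: one (floor, suite, platform) tuple per device seen.
observations = [
    (1, "101", "iOS"),
    (1, "101", "Android"),
    (1, "102", "Android"),
    (2, "201", "Windows Phone"),
    (2, "201", "iOS"),
    (2, "202", "Android"),
]

def map_phase(records):
    """Map ("count by suite, on each floor"): emit (platform, 1) per device."""
    for _floor, _suite, platform in records:
        yield platform, 1

def shuffle(pairs):
    """Shuffle ("sort totals by platform"): group emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(groups):
    """Reduce ("tally up each platform"): sum the values for each key."""
    return {platform: sum(values) for platform, values in groups.items()}

# "Merge tallies into one spreadsheet": a single dict of platform totals.
tallies = reduce_phase(shuffle(map_phase(observations)))
print(tallies)
```

On a real cluster the map and reduce functions run on different nodes and the framework performs the shuffle; here everything runs in one process purely to show the data flow.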
  13. 13. MapReduce Options: Java; JavaScript (“Rhino”); Other languages, especially Python, via Streaming; C# via Streaming; C# via .NET SDK. Pig, Hive, Sqoop, Mahout also generate MapReduce code.
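As a rough illustration of the Streaming option, a mapper and reducer exchange tab-separated key/value lines over standard input and output, and the reducer receives the mapper output sorted by key. The sketch below is a word count written as plain Python generators so it can run locally; the function names and the `run_locally` helper are hypothetical, and on a cluster the two functions would normally live in separate scripts reading `sys.stdin`.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a Streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_lines):
    """Sum the counts for each key; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

def run_locally(lines):
    """Simulate map -> sort -> reduce in a single process (no Hadoop needed)."""
    return list(reducer(sorted(mapper(lines))))

if __name__ == "__main__":
    for out in run_locally(["big data on hadoop", "Hadoop and big data"]):
        print(out)
```

Keeping the mapper and reducer as functions over iterables of lines makes them easy to test without a cluster; wrapping each one around `sys.stdin` is then a one-line change.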
  14. 14. Amenities for Visual Studio/.NET: MRLib (NuGet Package) MR code in C#, HadoopJob, LINQ to Hive MapperBase, ReducerBase Hortonworks Data Platform for Windows OdbcClient + WebHDFS Hive ODBC client Driver Deployment
  15. 15. Demo: MapReduce
  16. 16. Hive: Began as Hadoop sub-project (now top-level Apache project). Provides a SQL-like (“HiveQL”) abstraction over MapReduce. Has its own HDFS table file format (and it’s fully schema-bound). Can also work over HBase. Acts as a bridge to many BI products which expect tabular data.
  17. 17. Hive ODBC Consumers: Excel 2010 or 2013 (including via add-in); PowerPivot; SQL Server Analysis Services, Tabular Mode; SQL Server Reporting Services; ADO.NET OdbcClient provider; LINQ provider
  18. 18. xVelocity Technologies: Formerly known as VertiPaq. PowerPivot, SSAS Tabular, SQL Server columnstore indexes. Implements BI Semantic Model (BISM). Uses column store technology: compression, in-memory, speed. Not a Big Data technology per se, but very useful for analysis of job output.
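To give a feel for why column stores compress well, here is a minimal Python sketch of two generic techniques, dictionary encoding followed by run-length encoding, applied to a single column. It is not a description of how xVelocity itself is implemented; the sample column and all function names are hypothetical.

```python
def dictionary_encode(column):
    """Replace each value with a small integer ID; return (dictionary, ids)."""
    dictionary = {}
    ids = []
    for value in column:
        # setdefault assigns the next free ID the first time a value is seen.
        ids.append(dictionary.setdefault(value, len(dictionary)))
    return dictionary, ids

def run_length_encode(ids):
    """Collapse consecutive repeats into (id, run_length) pairs."""
    runs = []
    for i in ids:
        if runs and runs[-1][0] == i:
            runs[-1][1] += 1
        else:
            runs.append([i, 1])
    return [tuple(run) for run in runs]

def decode(dictionary, runs):
    """Rebuild the original column from the dictionary and the runs."""
    reverse = {i: value for value, i in dictionary.items()}
    return [reverse[i] for i, length in runs for _ in range(length)]

# Hypothetical column of state codes, as it might appear in a fact table.
column = ["NY", "NY", "NY", "CA", "CA", "NY", "TX", "TX"]
dictionary, ids = dictionary_encode(column)
runs = run_length_encode(ids)
print(dictionary, runs)
```

Because a column holds values of one type with many repeats, eight strings shrink here to a three-entry dictionary plus four short runs; storing data row by row would interleave unrelated values and lose those runs.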
  19. 19. Power View: Reports on BISM models (PowerPivot, SSAS Tabular). Hosted in SharePoint 2010, 2013 Enterprise. Also Excel 2013 (but not on ARM/Windows RT). Interactive data exploration.
  20. 20. Demo: Hive ODBC + BI Stack
  21. 21. Project “Hekaton”: In-memory engine for SQL Server transactional workloads. Tables must be declared as in-memory explicitly. In-memory and standard tables can coexist in same db. Stored procs on in-mem tables are compiled to native code. Hekaton and xVelocity are separate: Hekaton ≠ PowerPivot/SSAS Tabular; Hekaton ≠ Columnstore indexes. Compare to SAP HANA: in-memory, transactional, analytical, column store.
  22. 22. NoSQL: NoSQL databases are non-relational and non- or loosely-schematized. HBase is a NoSQL database, of the wide column variety: Hive implements a SQL layer over it; HBase not yet in HDInsight. HBase table = HDFS file. Three other NoSQL categories: key-value store, document store, graph database (Azure Table Storage is a key-value store NoSQL database). Some of them aren’t really Big Data tools, but market themselves that way anyway.
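The four NoSQL categories named above differ mainly in the shape of the data they store. The following Python sketch models each shape with plain dictionaries; it is an analogy only, with hypothetical keys and values, and says nothing about how any particular product stores data on disk.

```python
# Key-value store: an opaque value looked up by a single key.
key_value = {"user:42": b'{"name": "Ada"}'}

# Wide column store: row key -> column family -> column -> value.
# Rows in the same table may carry different columns.
wide_column = {
    "row-42": {"profile": {"name": "Ada"}, "activity": {"last_login": "2013-01-31"}},
    "row-43": {"profile": {"name": "Bob", "city": "NYC"}},
}

# Document store: self-describing, possibly nested documents.
documents = [{"_id": 42, "name": "Ada", "tags": ["bi", "hadoop"]}]

# Graph database: nodes plus typed edges between them.
graph = {"nodes": {"ada", "bob"}, "edges": [("ada", "follows", "bob")]}

def columns_for(row_key):
    """List the (family, column) pairs actually present for one wide-column row."""
    row = wide_column[row_key]
    return sorted((family, col) for family, cols in row.items() for col in cols)

def followers_of(name):
    """Walk the edge list to find who follows the given node."""
    return [src for src, rel, dst in graph["edges"] if rel == "follows" and dst == name]

print(columns_for("row-42"), columns_for("row-43"), followers_of("bob"))
```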
  23. 23. SQL Parallel Data Warehouse (PDW): SQL PDW is a Massively Parallel Processing (MPP) database. Teradata, IBM Netezza, HP Vertica also in this category. It’s an array/cluster of SQL Servers made to look like one SQL Server. Available as appliance only: purchase from HP, Dell; server, storage and network all pre-built and configured. Many other MPP products based on PostgreSQL. PDW loosely based on acquired DATAllegro product, which implemented MPP with Ingres, written in Java, running on Linux.
  24. 24. MapReduce versus MPP. MapReduce: splits preprocessing amongst mapper nodes and aggregation amongst reducers; scales infinitely on commodity hardware; uses locally attached commodity disks on nodes; uses imperative code; processes flat files, wide column tables (HBase), relational tables (Hive); divide-and-conquer approach, parallel, distributed. MPP: splits query amongst nodes then unifies result sets; scales to high-end assets in the appliance cabinet; uses shared storage (can be more network traffic); uses SQL; works with relational tables only; divide-and-conquer approach, parallel, distributed.
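The MPP side of this comparison, "splits query amongst nodes then unifies result sets", can be sketched in Python as a scatter-gather aggregate. This is a simplified illustration, not PDW's actual execution engine: the rows, the round-robin distribution, and the function names are all hypothetical, and the nodes are simulated with lists in one process.

```python
from collections import Counter

def distribute(rows, node_count):
    """Spread rows across nodes round-robin (real systems often hash a key)."""
    nodes = [[] for _ in range(node_count)]
    for i, row in enumerate(rows):
        nodes[i % node_count].append(row)
    return nodes

def node_query(rows):
    """Each node runs the same aggregate on its own slice:
    roughly SELECT region, SUM(amount) ... GROUP BY region."""
    partial = Counter()
    for region, amount in rows:
        partial[region] += amount
    return partial

def coordinator(partials):
    """The control node unifies the per-node result sets into one answer."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts key by key
    return dict(total)

# Hypothetical (region, amount) sales rows.
rows = [("East", 10), ("West", 5), ("East", 7), ("North", 3), ("West", 8)]
partials = [node_query(node_rows) for node_rows in distribute(rows, node_count=2)]
result = coordinator(partials)
print(result)
```

The shape is close to the map and reduce steps of MapReduce, which is the slide's point about both being divide-and-conquer; the difference is that in MPP the user writes one SQL query and the database decides how to split and merge it.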
  25. 25. PolyBase: To be included in next version of PDW. Mashup of SQL Server and Hadoop. Enables PDW to address Hadoop data nodes (HDFS) directly. Parallelism managed by PDW. Tables are imported into SQL Server db: they are EXTERNAL tables; they can participate in joins with standard tables.
  26. 26. Resources: MS Big Data/HDInsight  Hadoop  HBase  PDW    memory.aspx Column store  View  data-HA102835634.aspx Hekaton  with-in-memory-technologies.aspx
  27. 27. Questions?
  28. 28. Thank You for Attending