SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2)

The second part of our Microsoft Big Data session covers how Big Data information can be made accessible through "classic" SQL, and how the new PolyBase engine makes it easy to join unstructured Hadoop data with relational data warehouse data.
In the Hadoop world, SQL access is provided by the Hive component.
Through the Microsoft Hive ODBC connector, the usual BI tools, such as PowerPivot, can use this access directly.
Finally, the PolyBase engine will become part of SQL Server 2012 Parallel Data Warehouse and allows transparent SQL access, no matter where the data resides.

Published in: Technology, Business

  1. SQLSaturday #230 Rheinland, 13.07.2013. Sascha Dittmann, Software Architect & Developer, Ernst & Young GmbH, www.sascha-dittmann.de. Georg Urban, Snr. Technology Solution Professional | Data Platform, georg.urban@microsoft.com
  2. HIVE IN A NUTSHELL
  3. Hadoop & Business Intelligence
     • Hadoop is great for storing & processing *large* amounts of data
     • (but) Map/Reduce jobs are kind of low level
     • (most) BI tools rely on relational or multidimensional data sources and declarative languages like SQL | MDX | DAX
  4. The Hive Project
     • Hive was started at Facebook (2008)
     • Goal: empower business users to query Hadoop clusters with standard tools & SQL
     • Famous paper at the VLDB conference 2009: http://www.vldb.org/pvldb/2/vldb09-938.pdf
     • In 2009, 700 TB of data already "lived" in Hive at Facebook: 5,000 queries a day from over 100 users
     • Hive is a "data warehouse" for Hadoop (a system for managing data structures built on top of Hadoop)
  5. Hive architecture
     • Query language: HiveQL (subset of SQL)
     • Uses Map/Reduce for execution
     • Rule-based optimizer
     • Components: Command Line Interface, Web Interface, Thrift Server, JDBC, ODBC, Driver (Compiler, Optimizer, Executor), Metastore
     • Performance is an issue: the Hortonworks Stinger initiative aims for "human-time use cases"
  6. Hive concepts
     • Well known: Databases | Tables | Rows & Columns
     • Table = (file or) directory, e.g. twitter_feeds -> /user/hive/warehouse/twitter_feeds
     • Storage: ORC (Optimized Row Columnar), Textfile, RCFile (Record Columnar File), etc.
     • Primitive types: integer, float, string, date, boolean
     • …plus arrays, maps, user-defined types
     • Partitions = subdirectories
     • Indexes = data subsets or bitmaps
     • HiveQL: SELECT…FROM…WHERE (incl. Joins, Aggregates, Union All, Subqueries)
     • Can embed M/R scripts
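A minimal Python sketch of the "table = directory, partitions = subdirectories" idea from the Hive concepts slide. The helper name `hive_path` and the partition column `dt` are hypothetical and chosen only for illustration; the warehouse root comes from the slide's example, and the `key=value` subdirectory naming follows Hive's usual partition layout. The function only builds strings and does not talk to Hive or HDFS.

```python
from posixpath import join

WAREHOUSE_ROOT = "/user/hive/warehouse"  # root taken from the slide's example


def hive_path(table, partitions=None):
    """Return the HDFS directory for a table, optionally for one partition.

    Hypothetical helper for illustration; it only builds path strings.
    """
    # Each partition column becomes one "key=value" subdirectory level.
    parts = [f"{key}={value}" for key, value in (partitions or {}).items()]
    return join(WAREHOUSE_ROOT, table, *parts)


print(hive_path("twitter_feeds"))
print(hive_path("twitter_feeds", {"dt": "2013-07-13"}))
```

A query that filters on the partition column only needs to read the matching subdirectory, which is why partitioning helps with large tables.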
  7. HiveQL Example
       CREATE TABLE logdata(
         logdate string, logtime string, … time_taken int)
       ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
       LOAD DATA INPATH '/w3c/input/data' OVERWRITE INTO TABLE logdata;
       SELECT logdate, logtime, time_taken FROM logdata LIMIT 200;
  8. Working with Hive
  9. BIG DATA & BI
  10. HDInsight in Action: Managing Big Data
  11. Enriching Big Data in PowerPivot
  12. Big Data Mashup with Power View = Insights
  13. BI on HDInsight
  14. Real World Big Data
      • Yahoo!: 180 PB raw data on > 40,000 computers (polystructured)*
      • Biggest Hadoop cluster: 4,500 nodes (2x4 CPUs, 4x1 TB disks, 16 GB RAM)
      • Page Impressions: cube with 207 Measures | 24 Dimensions | 247 Attributes
      • Desktop clients (MS & Tableau): < 6 s ad hoc query time
      • http://wiki.apache.org/hadoop/PoweredBy
  15. Real Life Example (Sensor Data Analytics), architecture diagram
      • Layers: Longterm Storage & Preprocessing | Integration, Analysis & Persistence | Publishing & Collaboration
      • Sources: XML, structured & unstructured files; ERP & other DBs
      • Components: Preprocessing (e.g. C24, …), HDInsight, Hive, Integration Services, Database Service, Analysis Services, SharePoint, Power View, more Microsoft… (Reporting Services, Performance Point, …)
      • Clients: Browsers, Excel & PowerPivot, 3rd Party, Mobile Clients, Analytical & DM Tools (R, SPSS, MS Data Mining, …), Self Service analytical Applications
  16. SQL SERVER PARALLEL DATA WAREHOUSE
  17. Parallel Data Warehouse Concepts
  18. PDW V1 (reference) vs. PDW V2, the basic full rack: 10X faster & 50% lower capital cost
      • V1: 160 cores on 10 compute nodes; 1.28 TB of RAM on compute; up to 30 TB of tempdb; up to 150 TB of user data; Infiniband & Ethernet plus Fiber Channel; estimated total HW list price: $1MM
      • V2: 128 cores on 8 compute nodes; 2 TB of RAM on compute; up to 168 TB of tempdb; up to 1 PB of user data; Infiniband & Ethernet; 70% more disk I/O bandwidth; estimated total HW list price: $500K
      • Diagram labels: Control Node, Mgmt. Node, Landing Zone, Backup Node, Compute Node, Storage, Compute & Storage
  19. SQL Server 2012 Parallel Data Warehouse
      • Development goals:
        • Up- & downscale: 2-56 compute nodes; unique standardized nodes: 256 GB; 1-6 racks
        • Compute & storage nodes are VMs: simple management; hardware abstraction; different workloads
        • ColumnStore v2 storage: high compression; updatable for incremental loads
      • Hardware architecture: FDR Infiniband; direct attached SAS
  20. Start small & grow: Dynamic Scale Up
      • Start small with a warehouse of a few terabytes
      • Add capacity up to 5 petabytes
      • Increments of 2-3 compute nodes plus storage
      • Enterprise warehouse from 0 TB to 5 PB
      • No downtime
  21. HP Configuration, rack diagram
      • Base Unit (6U): redundant Infiniband, redundant Ethernet, Mgmt & Control (active), rack failover node (passive)
      • Base Unit (7U), Extension Base Unit (7U) and each Scale Unit (7U): 2 HP 1U servers (16 cores each, total 32), JBOD 5U, 1 TB drives, user data capacity 75 TB
      • Extension Base Unit (5U) in racks 2 and 3: redundant Infiniband, redundant Ethernet, rack failover node (passive)
      • Customer Space (8U in rack 1, 9U in racks 2 and 3): ETL servers, backup servers, passive unit (additional spares)
      • Rack 1: Control Node, Failover Node, JBOD 1-4, Compute Nodes 1-8
      • Rack 2: Failover Node, JBOD 5-8, Compute Nodes 9-16
      • Rack 3: Failover Node, JBOD 9-12, Compute Nodes 17-24
      • Raw capacity steps: ¼ rack 15 TB, ½ rack 30 TB, full rack 60 TB, 1¼ racks 75.5 TB, 1½ racks 90.6 TB, 2 racks 120.8 TB, 3 racks 181.2 TB
      • Overall: 2 to 56 compute nodes; 1 to 7 racks; 1, 2, or 3 TB drives; 15.1 to 1268.4 TB raw; 53 to 6342 TB user data; up to 7 spare nodes available across the entire appliance
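The raw-capacity figures on the HP configuration slide can be cross-checked with a little arithmetic. The Python sketch below assumes (this is not stated on the slide) that raw capacity scales linearly with the number of two-node units and with drive size, starting from the 15.1 TB quoted for the smallest configuration with 1 TB drives. The function name `raw_capacity_tb` is hypothetical.

```python
RAW_TB_PER_UNIT_1TB = 15.1  # smallest configuration on the slide (2 compute nodes)
NODES_PER_UNIT = 2          # each base or scale unit holds 2 servers


def raw_capacity_tb(compute_nodes, drive_tb=1):
    """Estimate raw capacity in TB, assuming linear scaling (an assumption)."""
    if compute_nodes % NODES_PER_UNIT:
        raise ValueError("compute nodes come in pairs in this configuration")
    units = compute_nodes // NODES_PER_UNIT
    return round(units * RAW_TB_PER_UNIT_1TB * drive_tb, 1)


for nodes, drive in [(2, 1), (10, 1), (16, 1), (24, 1), (56, 3)]:
    print(f"{nodes} nodes, {drive} TB drives: {raw_capacity_tb(nodes, drive)} TB raw")
```

Under this assumption the results for 10, 16, 24 and 56 nodes reproduce the 75.5 TB, 120.8 TB, 181.2 TB and 1268.4 TB figures from the slide; the 15, 30 and 60 TB labels on the first rack look like rounded values of the same series.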
  22. Agility Due to Virtualization
      • VMs for different workloads (e.g. HDInsight zone)
      • Storage Spaces manage physical disks on JBOD(s)
      • 33 logical mirrored drives (66 drives & 4 hot spares)
      • Cluster Shared Volumes (CSV) allow all nodes to access the LUNs on the JBOD
      • One cluster across the whole appliance
      • VMs are automatically migrated on failure
      • Diagram labels: Host 0 to Host 3, JBOD, CTL, MAD, FAB AD, VMM, Compute 1, Compute 2
      • * 3 nodes per JBOD in Dell configuration
  23. xVelocity Columnstore as primary storage
      • Better IO & caching: columns stored independently; early segment elimination; aggressive read-ahead
      • Memory optimization: new Memory Broker; segments are loaded when needed …and stay as long as possible
      • Batch mode: max. parallelism; approx. 1,000 values per kernel; CPU time is reduced by a factor of 7 to 40
      • Diagram labels: columns C1 to C6; segments T.C1 to T.C4; query SELECT Region, SUM(Sales) … on T.C2 and T.C4; batch object with column vectors and bitmap of qualified rows
  24. In Memory Columnstore Index
  25. Compression Rates in the Demo (bar chart, sizes in GB)
      • Raw data: 82
      • Page compression: 34.2
      • Backup compression: 22.9
      • Columnstore index (disk): 11.2
      • Columnstore index (memory): 4
      • Compression ratio for *this* data: approx. 20
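The "approx. 20" on the compression slide follows directly from the chart values. A short Python sketch of the arithmetic, using the sizes read from the chart (labels translated from German); the variable names are illustrative only:

```python
# Sizes in GB as read from the demo chart on the slide.
sizes_gb = {
    "Raw data": 82.0,
    "Page compression": 34.2,
    "Backup compression": 22.9,
    "Columnstore index (disk)": 11.2,
    "Columnstore index (memory)": 4.0,
}

raw = sizes_gb["Raw data"]
# Compression ratio = uncompressed size / compressed size.
ratios = {name: raw / size for name, size in sizes_gb.items()}

for name, ratio in ratios.items():
    print(f"{name:28s} {ratio:5.1f}x")
```

The in-memory columnstore index comes out at 82 / 4 = 20.5, which matches the ratio quoted on the slide for this particular data set.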
  26. Columnstore: The Next Generation (PDW v2 & SQL Server 2014)
      • Columnstore becomes primary data structure (clustered index)
        • No need for base table
        • Allows updates & deletes (temporary row store)
        • Easy data management
      • Improvements:
        • Supports all (reasonable) data types
        • Supports more query operators
        • Statistics on partitioned tables
  27. Microsoft BI Stack Connectivity (support matrix, PDW tool vs. targeted driver)
      • Columns, each for 32 bit and 64 bit: SNAC11 OLEDB, SNAC11 ODBC, SNAC10 OLEDB, SNAC10 ODBC, .NET (sqlclient)
      • SSRS/Reporting Services
        • SSRS - SS2012 (report builder and SSDT): X X X X X X
        • SSRS - SS2008 R2 (report builder): X n/a X n/a X n/a
      • SSAS/Analysis Services
        • SSAS - SS2012: X X X X X X X X X X
        • SSAS - SS2008 R2: X X X X X X
      • Linked Server (DQ)
        • DQ - SS2008: X X X X
      • SSIS/Integration Services
        • SSIS - SS2012: X X X X X X X X X X
        • SSIS - SS2008 R2: X X X X X X
      • PowerPivot for Excel: n/a n/a X X n/a n/a X X
      • Power View SS2012 w/ and w/o direct query: X X X X X X
      • MS BI Direct Query
        • Excel: X X X X n/a n/a
        • Direct Query: n/a n/a X X n/a n/a
        • Access: n/a n/a X X n/a n/a
      • Master Data Services
        • SS2012: X X X X X X
        • SS2008: X X X X
      • Quality Services
        • SS2012: X X X X X X
        • SS2008: X X X X
  28. Monitoring
      • Built-in monitoring via GUI or Management Views (DMVs)
      • System Center Management Packs for PDW
  29. Simple Resource Management
      • Pre-built resource classes in PDW
      • Resource class = PDW concurrency slots in use, memory utilization, priority
      • DBA controls how requests are mapped to resource classes
      • PDW honors the resource class at run-time
      • The four pre-built classes:
        • 1 concurrency slot; memory: V1 HW ~200 MB, V2 HW ~400 MB; priority: Medium
        • 3 concurrency slots; memory: V1 HW 600 MB, V2 HW 1.2 GB; priority: Medium
        • 7 concurrency slots; memory: V1 HW 1.4 GB, V2 HW ~2.8 GB; priority: High
        • 21 concurrency slots; memory: V1 HW ~4.2 GB, V2 HW ~8.4 GB; priority: High
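The four resource classes on the slide follow a simple pattern: memory grows in proportion to the number of concurrency slots. The Python sketch below checks this with the slide's figures. The values are approximate on the slide (marked "~"), and converting GB to MB at 1 GB = 1000 MB is an assumption made here for simplicity; the tuple layout and function name are illustrative only.

```python
# (concurrency slots, memory in MB on V1 hardware, memory in MB on V2 hardware, priority)
resource_classes = [
    (1, 200, 400, "Medium"),
    (3, 600, 1200, "Medium"),
    (7, 1400, 2800, "High"),
    (21, 4200, 8400, "High"),
]


def memory_per_slot(classes):
    """Return the sets of MB-per-slot values seen on V1 and V2 hardware."""
    v1 = {mem_v1 / slots for slots, mem_v1, _, _ in classes}
    v2 = {mem_v2 / slots for slots, _, mem_v2, _ in classes}
    return v1, v2


v1, v2 = memory_per_slot(resource_classes)
print("MB per slot on V1 hardware:", v1)
print("MB per slot on V2 hardware:", v2)
```

Each set contains a single value, roughly 200 MB per slot on V1 hardware and 400 MB per slot on V2 hardware, so a request in a larger class simply occupies more slots and gets proportionally more memory.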
  30. Improve T-SQL Parity
      • T-SQL additions to increase compatibility with: SQL Server Data Tools; Microsoft BI tools; 3rd party tools, like Tableau
      • Dedicated PDW tools are deprecated
      • Catalog SP examples: sp_tables_rowset;2, sp_catalogs_rowset, sp_executesql
      • Built-in function examples: db_id, db_name, object_id
      • General T-SQL improvement examples: cross/outer apply, sp_prepare, sp_execute
      • Configuration functions: @@LANGUAGE, @@SPID
      • SET options: SET ROWCOUNT, SET FMTONLY
  31. POLYBASE
  32. Hadoop / Big Data Integration: Microsoft
      • T-SQL query engine for RDBMS & Hadoop
      • Cost-based optimizer decides on: moving HDFS data into RDBMS storage, or rendering operators in Map/Reduce jobs
      • HDFS Bridge for parallelized data transport
      • Diagram labels: Regular T-SQL, Results, PDW V2, HDFS Data Nodes
  33. External Tables (Hadoop Integration)
      • External tables are mapped to HDFS files
      • Fields in the file are defined as columns in the PDW external table
      • File characteristics are also provided during definition
      • This works for HDInsight, Hortonworks HDP & Cloudera
        CREATE EXTERNAL TABLE ClickEvent (
          url varchar(50),
          event_date date,
          user_IP varchar(50))
        WITH (LOCATION = 'hdfs://MyHadoop:5000/clickstream/click.txt',
              FORMAT_OPTIONS (FIELD_TERMINATOR = '|'));
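To make the "fields in the file are defined as columns" idea concrete, here is a small Python sketch that splits one pipe-delimited text record into the three columns declared for ClickEvent. It only illustrates the mapping and says nothing about how PolyBase reads HDFS internally. The sample record is hypothetical, since the slide does not show the file contents, and the ISO date format is an assumption.

```python
from datetime import date
from typing import NamedTuple


class ClickEvent(NamedTuple):
    """Mirrors the three columns of the external table on the slide."""
    url: str
    event_date: date
    user_ip: str


def parse_click_line(line, terminator="|"):
    """Split one text record into the declared columns (ISO dates assumed)."""
    url, event_date, user_ip = line.rstrip("\n").split(terminator)
    return ClickEvent(url, date.fromisoformat(event_date), user_ip)


# Hypothetical sample record; the real file contents are not shown on the slide.
sample = "http://example.com/home|2013-07-13|192.0.2.10\n"
print(parse_click_line(sample))
```

The field terminator plays the same role as FIELD_TERMINATOR in the table definition: it is the only piece of file-format knowledge needed to turn a line of text into typed columns.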
  34. Polybase: Creating an External Table
      • Familiar tooling: SSMS & Data Tools
  35. Polybase: A Really Simple Query Example
      • Here: external data movement is ExternalRoundRobinMove
      • Parallel HDFS readers will run on every data node (e.g. 10 nodes with 8 threads each)
  36. BI & Big Data Solution with SQL PDW v2 &
  37. PolyBase: Breakthrough in Data Processing
      • Single query; structured and unstructured
      • Query and join Hadoop tables with relational tables
      • Use standard SQL language
      • Benefits: existing SQL skillset; no IT intervention; save time and costs; analyze all data types
      • Diagram labels: Database, HDFS (Hadoop), SQL, SQL Server 2012 PDW powered by PolyBase
  38. Resources
      • SQL Server CAT-Blog: http://sqlcat.com
      • bwin Case Study: http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?casestudyid=4000001470
      • Microsoft Big Data Site: http://www.microsoft.com/en-us/sqlserver/solutions-technologies/business-intelligence/big-data.aspx
      • Introduction to Hadoop on Windows Azure: http://channel9.msdn.com/Events/windowsazure/learn/Introduction-to-Hadoop-on-Windows-Azure
      • SQL Server Team Blog: http://blogs.technet.com/b/dataplatforminsider
      • Microsoft YouTube Big Data Channel: http://www.youtube.com/playlist?list=PLD471EE01A293CC34
      • TechEd Sessions: http://channel9.msdn.com/Events/TechEd
      • Microsoft Connect (Product Feedback): http://connect.microsoft.com
  39. Many thanks to the volunteers!
  40. Big prize draw!
      • At the end of the event (approx. 18:00)
      • Win lots of prizes!
      • So: visit our sponsors!
  41. Our "You Rock!" sponsors
  42. Many thanks to all our sponsors! Gold | Silver | Bronze
  43. Media sponsors:
  44. Hands-on event: PASS Camp 2013!
