SQLRally Amsterdam 2013 - Hadoop

SQLRally Amsterdam 2013 presentation about Hadoop. Including HDFS, Hive and PolyBase.


  • DEMO: Upload a local file with hadoop fs -copyFromLocal
  • - Hadoop command - CloudXplorer
  • DEMO: Total hit count W3C logs
  • - TotalHits MR job
  • 'Of the 150k jobs Facebook runs daily, only 500 are MapReduce jobs. The rest is HiveQL'
  • Hive < 0.11: stores data in plain text files; no join optimization. A typical DWH query (star schema join) results in 6 MR jobs
  • Hive 0.11: introduces (O)RC files, loosely based on column store indexes, plus join optimization. The same DWH query results in 1 MR job
  • Hive 0.12: uses YARN and Tez, optimized for DWH queries with less overhead than MR
  • DEMO: Retrieving data via ODBC and Power Query in Excel
  • - Via Excel Data Explorer (Azure BLOB storage)

Transcript

  • 1. Henk van der Valk Technical Sales Professional Jan Pieter Posthuma Microsoft BI Consultant 7/11/2013 Hadoop
  • 2. Access to online training content JOIN THE PASS COMMUNITY Become a PASS member for free and join the world's biggest SQL Server Community. Join Local Chapters • Personalize your PASS website experience • Access to events at discounted rates • Join Virtual Chapters 2
  • 3. Agenda • Introduction • Hadoop • HDFS • Data access to HDFS • Map/Reduce • Hive • Data access from HDFS • SQL PDW PolyBase • Wrap up 3
  • 4. Introduction Henk • 10 years of Unisys EMEA Performance Center • 2002 - Largest SQL DWH in the world (SQL 2000) • Project Real (SQL 2005) • ETL WR - loading 1TB within 30 mins (SQL 2008) • Contributed to various SQL whitepapers • Schuberg Philis - 100% uptime for mission critical applications • Since April 1st, 2011 - Microsoft SQL PDW - Western Europe • SQLPass speaker & volunteer since 2005 4
  • 5. Introduction (architecture diagram: big data sources - sensors, fast devices, bots, crawlers - and source systems - ERP, CRM, LOB apps, Azure Market Place - feed SQL Server StreamInsight, HDInsight on Windows Azure / Windows Server and SQL Server Parallel Data Warehouse holding historical data beyond the active window; data is summarized, loaded and integrated/enriched with enterprise ETL - SSIS, DQS, MDS - into SQL Server FTDW data marts, then surfaced via SQL Server Reporting Services and Analysis Services as interactive reports, performance scorecards, alerts/notifications and business insights) 5
  • 6. Introduction Jan Pieter Posthuma Jan Pieter Posthuma • Technical Lead Microsoft BI and Big Data consultant • Inter Access, local consultancy firm in the Netherlands • Architect role at multiple projects • Analysis Service, Reporting Service, PerformancePoint Service, Big Data, HDInsight, Cloud BI http://twitter.com/jppp http://linkedin.com/jpposthuma jan.pieter.posthuma@interaccess.nl 6
  • 7. Hadoop • Hadoop is a collection of software to create a data-intensive distributed cluster running on commodity hardware • Original idea by Google (2003) • Widely accepted by database vendors as a solution for unstructured data • Microsoft partners with Hortonworks and delivers their Hadoop Data Platform as Microsoft HDInsight • Available as an Azure service and on-premises • Hortonworks Data Platform (HDP) is 100% open source! 7
  • 8. Hadoop (diagram: HDFS at the base, with Map/Reduce, HBase, Hive & Pig, Sqoop/PolyBase, Avro (serialization) and Zookeeper layered on top, connecting to BI, ETL, RDBMS and reporting tools) • HDFS - distributed, fault tolerant file system • MapReduce - framework for writing/executing distributed, fault tolerant algorithms • Hive & Pig - SQL-like declarative languages • Sqoop/PolyBase - packages for moving data between HDFS and relational DB systems • + others (Hadoop 2.0) 8
  • 9. HDFS (diagram: a large 6440MB file, color-coded and split into blocks - Block 1 through Block 100 of 64MB each, plus Block 101 of 40MB) Files are composed of a set of blocks • Typically 64MB in size • Each block is stored as a separate file in the local file system (e.g. NTFS) 9
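The block split on this slide can be checked with a short Python sketch (illustrative only; the 6440MB file size and 64MB block size are taken from the slide):

```python
# Sketch: how HDFS splits a file into fixed-size blocks,
# using the slide's numbers (6440 MB file, 64 MB blocks).
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the size of each block the file is split into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # last block holds the leftover bytes
    return sizes

blocks = split_into_blocks(6440)
print(len(blocks))   # 101 blocks
print(blocks[-1])    # the final block is only 40 MB
```

This matches the slide: 100 full 64MB blocks plus one 40MB tail block.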
  • 10. HDFS (diagram: NameNode with a BackupNode holding namespace backups; DataNodes write blocks to local disk) HDFS was designed with the expectation that failures (both hardware and software) would occur frequently. Hadoop 2.0 is more decentralized • Interaction between DataNodes • Less dependent on the primary NameNode (heartbeat, balancing, replication, etc.)
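The fault tolerance mentioned above comes from replicating each block across several DataNodes. A toy sketch (assumptions: replication factor 3, which is the HDFS default, and simple round-robin placement; real HDFS placement is rack-aware):

```python
# Toy sketch of block replication: assign each block to 3 of the
# available DataNodes. Round-robin placement is a simplification;
# a real NameNode considers racks, load and free space.
datanodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
REPLICATION = 3  # HDFS default replication factor

def place_block(block_id):
    """Pick REPLICATION distinct DataNodes for one block."""
    return [datanodes[(block_id + i) % len(datanodes)]
            for i in range(REPLICATION)]

print(place_block(0))  # ['dn1', 'dn2', 'dn3']
```

With three copies per block, any single DataNode can fail without data loss, which is why commodity hardware is acceptable.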
  • 11. Data access to HDFS • FTP - upload your data files • Streaming - via Avro (RPC) or Flume • Hadoop command - hadoop fs -copyFromLocal • Windows Azure BLOB storage - HDInsight Service (Azure) uses BLOB storage instead of local VM storage; data can be uploaded without a provisioned Hadoop cluster • PolyBase - feature of PDW 2012; direct read/write data access to the DataNodes 11
  • 12. Data access Hadoop command Demo 12
  • 13. 13
  • 14. Map/Reduce • MR: all functions in a batch oriented architecture • Map: apply the logic to the data, e.g. page hit counts • Reduce: reduce (aggregate) the results of the mappers to one • YARN: splits the JobTracker into a Resource Manager and a Node Manager; MR in Hadoop 2.0 uses YARN as its JobTracker 14
  • 15. Map/Reduce Total page hits 15
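The total-page-hits job on these two slides can be sketched in plain Python, no Hadoop required (the log lines are invented sample data; a real job would read the W3C logs from HDFS):

```python
# Minimal sketch of the map/reduce flow from the slides:
# count total page hits per URL from web log lines.
from collections import defaultdict

log_lines = [
    "GET /index.html", "GET /about.html", "GET /index.html",
    "GET /index.html", "GET /about.html",
]

# Map phase: emit a (key, 1) pair for every hit.
mapped = [(line.split()[1], 1) for line in log_lines]

# Shuffle/reduce phase: group by key and sum the counts.
hits = defaultdict(int)
for url, count in mapped:
    hits[url] += count

print(dict(hits))  # {'/index.html': 3, '/about.html': 2}
```

In Hadoop the map and reduce steps run in parallel on the DataNodes holding the blocks, with the framework handling the grouping (shuffle) between them.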
  • 16. Hive • Built for easy data retrieval • Uses Map/Reduce • Created by Facebook • HiveQL: SQL-like language • Stores data in tables, which are stored as HDFS file(s) • Only initial INSERT supported, no UPDATE or DELETE • External tables possible on existing (CSV) file(s) • Extra language options to use the benefits of Hadoop • Stinger initiative: Phase 1 (0.11) and Phase 2 (0.12), aiming to improve Hive 100x 16
  • 17. Hive - star schema join (based on TPC-DS Query 27): SELECT col5, avg(col6) FROM store_sales_fact ssf join item_dim on (ssf.col1 = item_dim.col1) join date_dim on (ssf.col2 = date_dim.col2) join custdmgrphcs_dim on (ssf.col3 = custdmgrphcs_dim.col3) join store_dim on (ssf.col4 = store_dim.col4) GROUP BY col5 ORDER BY col5 LIMIT 100; Table sizes: store_sales_fact 41 GB, item_dim 58 MB, date_dim 11 MB, custdmgrphcs_dim 80 MB, store_dim 106 KB. Cluster: 6 nodes (2 name, 4 compute - dual core, 14GB) 17
  • 18. Hive results (file type / # MR jobs / input size / # mappers / time): Text / Hive 0.10 - 5 jobs, 43.1 GB, 179 mappers, 21:00 min • Text / Hive 0.11 - 1 job, 38.0 GB, 151 mappers, 4:06 min • RC / Hive 0.11 - 1 job, 8.21 GB, 76 mappers, 2:16 min • ORC / Hive 0.11 - 1 job, 2.83 GB, 38 mappers, 1:44 min • RC / Hive 0.11, partitioned/bucketed - 1 job, 1.73 GB, 19 mappers, 1:44 min • ORC / Hive 0.11, partitioned/bucketed - 1 job, 687 MB, 27 mappers, 1:19 min. Data: ~64x less data; time: ~16x faster 18
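The "~64x less data, ~16x faster" headline follows directly from the slide's own numbers (best vs. worst row); a quick check:

```python
# Verify the slide's headline ratios from its own measurements:
# Text / Hive 0.10 read 43.1 GB in 21:00 min;
# ORC / Hive 0.11 (partitioned, bucketed) read 687 MB in 1:19 min.
data_ratio = (43.1 * 1024) / 687          # convert GB to MB, then divide
time_ratio = (21 * 60) / (1 * 60 + 19)    # both times in seconds

print(round(data_ratio))  # ~64x less data read
print(round(time_ratio))  # ~16x faster
```

The data reduction comes from ORC's columnar layout plus partition/bucket pruning, which lets the single MR job skip most of the input.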
  • 19. Data access from Hadoop • Excel • FTP • Hadoop command - hadoop fs -copyToLocal • ODBC [1] - via Hive (HiveQL) data can be extracted • Power Query [2] - capable of extracting data directly from HDFS or Azure BLOB storage • PolyBase - feature of PDW 2012; direct read/write data access to the DataNodes [1] http://www.microsoft.com/en-us/download/details.aspx?id=40886 [2] Power BI Excel add-in - http://www.powerbi.com 19
  • 20. Data access Excel 2013 Demo 20
  • 21. 21
  • 22. PDW - PolyBase (diagram: the PDW appliance's SQL Server compute nodes read/write the Hadoop cluster's DataNodes directly, with Sqoop shown as the alternative transfer path - "This is PDW!") 22
  • 23. PDW - External Tables • An external table is PDW's representation of data residing in HDFS • The "table" (metadata) lives in the context of a SQL Server database • The actual table data resides in HDFS • No support for DML operations • No concurrency control or isolation level guarantees CREATE EXTERNAL TABLE table_name ({<column_definition>} [,...n ]) {WITH (LOCATION ='<URI>', [FORMAT_OPTIONS = (<VALUES>)])} [;] - LOCATION is required to indicate the location of the Hadoop cluster; FORMAT_OPTIONS holds optional options associated with parsing the data from HDFS (e.g. field delimiters & reject-related thresholds) 23
  • 24. PDW - Hadoop use cases & examples [1] Retrieve data from HDFS with a PDW query - seamlessly join structured and semi-structured data: SELECT Username FROM ClickStream c, User u WHERE c.UserID = u.ID AND c.URL = 'www.bing.com'; [2] Import data from HDFS to PDW - parallelized CREATE TABLE AS SELECT (CTAS), external tables as the source, a PDW table (either replicated or distributed) as destination: CREATE TABLE ClickStreamInPDW WITH (DISTRIBUTION = HASH(URL)) AS SELECT URL, EventDate, UserID FROM ClickStream; [3] Export data from PDW to HDFS - parallelized CREATE EXTERNAL TABLE AS SELECT (CETAS), external table as the destination, creates a set of HDFS files: CREATE EXTERNAL TABLE ClickStream2 (URL, EventDate, UserID) WITH (LOCATION = 'hdfs://MyHadoop:5000/joe', FORMAT_OPTIONS (...)) AS SELECT URL, EventDate, UserID FROM ClickStreamInPDW;
  • 25. SQL Server 2012 PDW Polybase demo 25
  • 26. Wrap up • Hadoop: 'just another data source' @ your fingertips! • Batch processing large datasets before loading into your DWH • Offloading DWH data, but still accessible for analysis/reporting • Integrate Hadoop via Sqoop, ODBC (Hive) or PolyBase • Near future: deeper integration between Hadoop and SQL PDW • Try Hadoop / HDInsight yourself: Azure: http://www.windowsazure.com/en-us/pricing/free-trial/ Web PI: http://www.microsoft.com/web/downloads/platform.aspx 26
  • 27. Q&A 27
  • 28. References • Microsoft Big Data - http://www.microsoft.com/bigdata • Windows Azure HDInsight Service (3 months free trial) - http://www.windowsazure.com/en-us/services/hdinsight/ • SQL Server Parallel Data Warehouse (PDW) landing page - http://www.microsoft.com/PDW - http://www.upgradetopdw.com • Introduction to PolyBase - http://www.microsoft.com/en-us/sqlserver/solutionstechnologies/data-warehousing/polybase.aspx 28
  • 29. Thanks! 29