SQL Server Konferenz 2014 - SSIS & HDInsight


Published in: Technology

  1. SSIS & HDInsight (Tillmann Eitelberg, Oliver Engels)
  2. Who we are… Tillmann Eitelberg Oliver Engels • CTO of oh22information services GmbH • CEO of oh22data AG • PASS Regional Mentor Germany • PASS Regional Mentor Germany • Vice-president PASS Germany • President PASS Germany • Chapter Leader CologneBonn, Germany • Chapter Leader Frankfurt, Germany • Microsoft MVP • Microsoft MVP • Microsoft vTSP
  3. Agenda
     • Traditional ETL Process
     • Challenges of Big Data and unstructured data
     • Useful Apache Hadoop Components for ETL
     • Some statements to be clarified...
     • Using Apache Hadoop within the ETL process
     • SSIS – not just a simple ETL Tool
     • Tools to work with HDInsight
     • Get started using Windows Azure HDInsight
     • Use SQL Server Integration Services to …
  4. Traditional ETL Process
     • Extract data from different sources
         • different source systems
         • different data organization and/or format
         • (non-)relational databases, flat files
     • Transform it to fit operational needs
         • Translating coded values
         • Encoding free-form values
         • Deriving a new calculated value
         • Aggregation
         • Data profiling, data quality
     • Load it into the end target
         • database, data mart, data warehouse
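     The three ETL stages listed on this slide can be sketched in a few lines of Python. This is only an illustration, not the speakers' implementation: the sample data, the `COUNTRY_NAMES` lookup, the `orders` table and all function names are assumptions made for the example, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file source; a real package would read files or databases.
RAW_ORDERS = """order_id,country_code,quantity,unit_price
1,DE,2,10.00
2,FR,1,99.50
3,DE,5,4.20
"""

# Hypothetical lookup used for "translating coded values".
COUNTRY_NAMES = {"DE": "Germany", "FR": "France"}


def extract(text):
    """Extract: read rows from a flat-file source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: translate coded values and derive a calculated value."""
    result = []
    for row in rows:
        quantity = int(row["quantity"])
        unit_price = float(row["unit_price"])
        result.append((
            int(row["order_id"]),
            COUNTRY_NAMES.get(row["country_code"], "Unknown"),
            round(quantity * unit_price, 2),  # derived column: revenue
        ))
    return result


def load(rows, connection):
    """Load: write the transformed rows into the end target."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, country TEXT, revenue REAL)"
    )
    connection.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    connection.commit()


def revenue_by_country(connection):
    """Aggregate the loaded data, as a report on the target would."""
    cursor = connection.execute(
        "SELECT country, SUM(revenue) FROM orders "
        "GROUP BY country ORDER BY country"
    )
    return cursor.fetchall()


connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_ORDERS)), connection)
totals = revenue_by_country(connection)
```

     In SSIS terms, `extract`, `transform` and `load` roughly correspond to a source, a transformation and a destination inside a data flow.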
  5. Traditional ETL Process
     [Diagram: the sources CRM, ERP and Web Site Traffic are extracted, transformed and loaded into a Data Warehouse, which feeds OLAP Analysis, Data Mining and Reporting]
  6. Traditional ETL Process
     [Diagram: CRM, ERP and Web Site Traffic pass through ETL steps into a Staging Area (DBMS), a Data Warehouse and Data Marts, which feed OLAP Analysis, Data Mining and Reporting]
  7. Traditional ETL Process (Microsoft Glasses)
     • Control Flow
         • Implementing repeating workflows
         • Connecting containers and tasks into an ordered control flow by using precedence constraints
         • Controlling external processes
         • Loading meta objects and data containers
         • Preparing data files
  8. Traditional ETL Process (Microsoft Glasses)
     • Data Flow
         • Adding one or more sources to extract data from files and databases
         • Adding the transformations that meet the business requirements
         • Adding one or more destinations to load data into data stores such as files and databases
         • Configuring error outputs on components to handle problems
  9. Microsoft Big Data Solution
  10. Challenges of Big Data
     • Large amounts of data from multiple sources
     • Data volumes reach into the terabytes, petabytes and exabytes
     • Classic relational database systems, as well as statistical and visualization programs, are often unable to handle such large amounts of data
     • According to calculations from 2011, the global volume of data doubles every 2 years
  11. Challenges of unstructured data
     • Does not have a pre-defined data model or is not organized in a predefined manner
     • Typically text-heavy, but may also contain data such as dates, numbers, and facts
     • Structure, while not formally defined, can still be implied
     • Aggregates cannot be accessed by computer programs through a single interface
     • Examples: emails, audio and video files without tags, and contributions in different media such as online forums or social media platforms
  12. Objectives of Big data
  13. Objectives of Big data: real-time tweets visualized on a map
  14. HDInsight/Hadoop Eco-System
     [Diagram legend]
     • Red = Core Hadoop
     • Blue = Data processing
     • Purple = Microsoft integration points and value adds
     • Orange = Data Movement
     • Green = Packages
  15. Useful Apache Hadoop Components (for ETL)
     • Apache Flume
         • Streams data from multiple sources into Hadoop for analysis
         • A large-scale log aggregation framework
         • Collects high-volume Web logs in real time
         • Insulates itself from transient spikes when the rate of incoming data exceeds the rate at which data can be written to the destination
         • Guarantees data delivery
         • Scales horizontally to handle additional data volume
     • Apache Sqoop
         • Allows data imports from external datastores and enterprise data warehouses into Hadoop
         • Parallelizes data transfer for fast performance and optimal system utilization
         • Copies data quickly from external systems to Hadoop
         • Makes data analysis more efficient
         • Mitigates excessive loads to external systems
  16. Useful Apache Hadoop Components (for ETL)
     • Apache Hive
         • Data warehouse infrastructure built on top of Hadoop
         • Supports analysis of large datasets stored in Hadoop's HDFS
         • SQL-like language called HiveQL
         • Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce jobs
     • Apache Pig
         • Platform for creating MapReduce programs
         • Language is called Pig Latin
         • Abstracts Java MapReduce jobs to something similar to SQL
         • Can use User Defined Functions written in Java, Python, JavaScript, Ruby or Groovy
         • Pig is used for ETL
  17. Useful Apache Hadoop Components (for ETL)
     • ODBC/JDBC Connectors
         • Microsoft® Hive ODBC Driver http://www.microsoft.com/en-us/download/details.aspx?id=40886
         • Original: Apache Hive ODBC Driver provided by Simba
         • Transforms an application’s SQL query into the equivalent form in HiveQL
         • Supports all major on-premise and cloud Hadoop / Hive distributions
         • Supports data types: TinyInt, SmallInt, Int, BigInt, Float, Double, Boolean, String, Decimal and TimeStamp
     • Apache Storm
         • Distributed real-time computation system for processing fast, large streams of data
         • Processes one million 100-byte messages per second per node
         • Scalable, with parallel calculations that run across a cluster of machines
         • Fault-tolerant: when workers die, Storm will automatically restart them. If a node dies, the worker will be restarted on another node
         • Storm guarantees that each unit of data (tuple) will be processed at least once or exactly once
  18. Some statements to be clarified...
     • Hadoop will steal work from ETL solutions
     • ETL is running faster on Hadoop
     • Hadoop is not a data integration tool
     • Hadoop is a batch processing system and Hadoop jobs tend to have high latency
     • Data integration solutions do not run natively in Hadoop
     • Elephants do not live isolated
     • Hadoop is not a solution for data quality (and other specialized transformations)
  19. Using Apache Hadoop within the ETL process
     [Diagram: the traditional ETL landscape (CRM, ERP, Web Site Traffic; Staging Area (DBMS), Data Warehouse, Data Marts; OLAP Analysis, Data Mining, Reporting) extended with Social Media and Sensor Logs as sources, the Hadoop components Sqoop, Flume, Storm, Hive and Pig, ODBC/JDBC and Sqoop as connectors, and Data Science as an additional consumer]
  20. SSIS – not just a simple ETL Tool
  21. Use SQL Server Integration Services to…
     • build complex workflows
     • manage Windows Azure and HDInsight clusters
     • load data into HDInsight/HDFS
     • control jobs on HDInsight
     • get data from Hive, Pig, …
     • combine Hadoop with "traditional" ETL
  22. Tools to work with HDInsight
     • SSIS Tasks for HDInsight http://www.youtube.com/watch?v=2Aj9_w3y9Xo&feature=player_embedded&list=PLoGAcXKPcRvbTr23ujEN953pLP_nDyZJC#t=2184
         • Announced at PASS Summit 2013
         • Experimental release on Codeplex
         • No timeline yet
  23. Tools to work with HDInsight
  24. Tools to work with HDInsight
  25. Tools to work with HDInsight
     • Azure Storage Explorer http://azurestorageexplorer.codeplex.com/
     • CloudBerry Explorer for Azure Cloud Storage http://www.cloudberrylab.com/free-microsoft-azure-explorer.aspx
     • Cerebrata Azure Management Studio http://www.cerebrata.com/
     • Red Gate HDFS Explorer (beta) http://bigdata.red-gate.com/
  26. Tools to work with HDInsight
     • Microsoft .NET SDK For Hadoop (NuGet packages)
         • Windows Azure HDInsight: provides a .NET API for cluster management and job submission on the Windows Azure HDInsight service
         • Microsoft .NET Map Reduce API For Hadoop: provides a .NET API for the Map/Reduce functionality of Hadoop Streaming
         • Microsoft .NET API For Hadoop WebClient: provides a .NET API for WebClient
         • Microsoft .NET API for Hadoop: provides a .NET API for working with Hadoop clusters over HTTP
  27. Tools to work with HDInsight
     • Some APIs require .NET 4.5
     • By default, SSIS 2012 uses .NET 4.0
     • Use SSDT 2012 BI Edition (or higher) to work with .NET 4.5 in script tasks and script components
  28. Tools to work with HDInsight
     • The NuGet Package Manager is not fully compatible with the SQL Server Integration Services Script Task
     • NuGet packages (assemblies) must be installed in the global assembly cache
           gacutil -i <assembly.dll>
     • NuGet packages/assemblies must be installed on all servers that run the packages
     • All assemblies need a strong name
  29. Tools to work with HDInsight
     • Adding a strong name to an existing assembly
           sn -k keyPair.snk
           ildasm AssemblyName.dll /out:AssemblyName.il
           ilasm AssemblyName.il /dll /key=keyPair.snk
  30. Get started using Windows Azure HDInsight
     • Create a Storage Account
         • Define name/URL of the storage account
         • Define location/affinity group; best setting currently "North Europe"
         • Set replication; to avoid costs use "Locally Redundant"
     • Create a container in the newly created storage account
     • Manage Access Keys
         • Get Storage Account Name
         • Get Primary Access Key
  31. Get started using Windows Azure HDInsight
     • Create a Certificate
           makecert -sky exchange -r -n "CN=SQLKonferenz" -pe -a sha1 -len 2048 -ss My "SQLKonferenz.cer"
     • Upload Certificate to Windows Azure
     • Get SubscriptionId
     • Get Thumbprint
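     The thumbprint shown by Windows for a certificate is the SHA-1 hash of its DER-encoded bytes, written as hexadecimal. The sketch below illustrates that calculation and a defensive clean-up of a thumbprint pasted from a dialog; it is an illustration only, both function names are hypothetical, and it assumes the `.cer` file written by `makecert` is DER-encoded.

```python
import hashlib


def thumbprint(der_bytes):
    """Return the SHA-1 thumbprint of DER-encoded certificate bytes (upper-case hex)."""
    return hashlib.sha1(der_bytes).hexdigest().upper()


def normalize_thumbprint(text):
    """Drop spaces and invisible characters that copy-and-paste tends to add."""
    hex_digits = "0123456789abcdefABCDEF"
    return "".join(ch for ch in text if ch in hex_digits).upper()


# Usage with the certificate from the slide (assumed to be DER-encoded):
#     with open("SQLKonferenz.cer", "rb") as handle:
#         print(thumbprint(handle.read()))
```

     Normalizing the value before comparing it is worthwhile because a string comparison against a certificate's thumbprint fails on stray spaces or a difference in letter case.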
  32. Get started using Windows Azure HDInsight
  33. Demo: Get started using Windows Azure HDInsight
  34. Manage Your HDInsight Cluster
     • Create a container in your Windows Azure Storage account
     • Create HDInsight Cluster
         • Storage Container
         • Authentication (Username/Password)
         • Cluster Size
     • Delete HDInsight Cluster
         • (Delete corresponding container)
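  35. Manage Your HDInsight Cluster

           using System;
           using System.Linq;
           using System.Security.Cryptography.X509Certificates;
           using Microsoft.WindowsAzure.Management.HDInsight;

           // thumbprint and subscriptionId are strings defined elsewhere.
           // Get the certificate object from the certificate store using the thumbprint.
           // new X509Store() opens the current user's personal ("My") store.
           var store = new X509Store();
           store.Open(OpenFlags.ReadOnly);
           var cert = store.Certificates.Cast<X509Certificate2>().First(
               item => item.Thumbprint == thumbprint);

           // Create HDInsightClient object using factory method
           var creds = new HDInsightCertificateCredential(
               new Guid(subscriptionId), cert);
           var client = HDInsightClient.Connect(creds);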
  36. Demo
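  37. Upload data to HDInsight

           using Microsoft.WindowsAzure.Storage;
           using Microsoft.WindowsAzure.Storage.Auth;
           using Microsoft.WindowsAzure.Storage.Blob;

           // Authenticate against the storage account (true = use HTTPS)
           var storageCredentials = new StorageCredentials(
               defaultStorageAccountName, defaultStorageAccountKey);
           var storageAccount = new CloudStorageAccount(storageCredentials, true);

           // Get a reference to the container used by the cluster
           var cloudBlobClient = storageAccount.CreateCloudBlobClient();
           var cloudBlobContainer =
               cloudBlobClient.GetContainerReference(defaultStorageCont);

           // Blob name inside the container (as shown on the slide)
           var blockBlob = cloudBlobContainer.GetBlockBlobReference(
               @"example/data/gutenberg/");

           // Stream the local file into the blob
           using (var fileStream = System.IO.File.OpenRead(filename))
           {
               blockBlob.UploadFromStream(fileStream);
           }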
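     Data uploaded to the storage container is addressed from Hadoop jobs through `wasb` URIs. The following sketch shows how a blob name maps to the short form (relative to the cluster's default container) and to the fully qualified form. The function names and the container and account names in the usage lines are hypothetical; only the URI layout is taken from the Azure storage conventions.

```python
def default_wasb_uri(blob_path):
    """URI relative to the cluster's default container, e.g. wasb:///example/data/..."""
    return "wasb:///" + blob_path.lstrip("/")


def full_wasb_uri(container, account, blob_path, secure=False):
    """Fully qualified URI naming the container and the storage account."""
    scheme = "wasbs" if secure else "wasb"
    return "{}://{}@{}.blob.core.windows.net/{}".format(
        scheme, container, account, blob_path.lstrip("/")
    )


# Hypothetical container and account names, for illustration only.
short_form = default_wasb_uri("example/data/gutenberg/davinci.txt")
long_form = full_wasb_uri("mycontainer", "myaccount",
                          "example/data/gutenberg/davinci.txt")
```

     The short form is enough when the data lives in the container that was chosen when the cluster was created.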
  38. Upload data to HDInsight
     • ~300 MB in approx. 45 seconds
     • from an Azure VM in the same region
  39. Run a MapReduce Program

           using System;
           using Microsoft.Hadoop.Client;

           // Create Job Submission Client object
           var creds = new JobSubmissionCertificateCredential(
               new Guid(subscriptionId), cert, clusterName);
           var jobClient = JobSubmissionClientFactory.Connect(creds);

           // Create job object that captures details of the job
           var mrJobDefinition = new MapReduceJobCreateParameters()
           {
               JarFile = "wasb:///example/jars/hadoop-examples.jar",
               ClassName = "wordcount"
           };
           // First argument: input file, second argument: output folder
           mrJobDefinition.Arguments.Add("wasb:///example/data/gutenberg/davinci.txt");
           mrJobDefinition.Arguments.Add("wasb:///example/data/WordCountOutput");

           // Submit job to the cluster
           var jobResults = jobClient.CreateMapReduceJob(mrJobDefinition);
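     For readers new to MapReduce, the logic of a word count job can be sketched in plain Python. This is a simplified, single-process illustration of the map, shuffle and reduce phases, not the code inside `hadoop-examples.jar`; all function names are hypothetical and tokenization is plain whitespace splitting.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every whitespace-separated token."""
    for word in line.split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["to be or", "not to be"])
```

     On a cluster, the map and reduce calls run in parallel on different nodes and the shuffle moves data between them; the result per word is the same.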
  40. Demo
  41. Run a Hive Query
     • Hive query via the .NET Hadoop SDK
         • Download the result of the Hive query
     • Load the result of the Hive query directly into the data flow
         • Microsoft® Hive ODBC Driver http://www.microsoft.com/en-us/download/confirmation.aspx?id=40886 (available for x86 and x64)
  42. Demo
  43. Complete HDInsight Package
  44. Thank you! Tillmann Eitelberg (t.eitelberg@oh22.net), Oliver Engels (o.engels@oh22.net)