SQL SERVER 2012 AND BIG DATA
Hadoop Connectors for SQL Server
TECHNICALLY – WHAT IS HADOOP
• Hadoop consists of two key services:
  • Data storage using the Hadoop Distributed File System (HDFS)
  • High-performance parallel data processing using a technique called MapReduce
HADOOP IS AN ENTIRE ECOSYSTEM
• HBase as the database
• Hive as the data warehouse
• Pig as the query language
• All built on top of Hadoop and the MapReduce framework
HDFS
• HDFS is designed to scale seamlessly
  • That's its strength!
• Scaling horizontally is non-trivial in most cases
• HDFS scales by throwing more hardware at it
  • A lot of it!
• HDFS is asynchronous
  • This is what links Hadoop to cloud computing
DIFFERENCES
• How does this differ from SQL Server on Windows 2008 R2's NTFS?
  • Data is not stored in the traditional table/column format
  • HDFS supports only forward-only parsing
  • Databases built on HDFS don't guarantee ACID properties
  • Hadoop takes the code to the data
  • SQL Server scales better vertically
UNSTRUCTURED DATA
• Hadoop doesn't know/care about column names, column data types, column sizes or even the number of columns
• Data is stored in delimited flat files
• You're on your own with respect to data cleansing
• Data input in Hadoop is as simple as loading your data file into HDFS
  • It's very close to copying files on an OS
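For example, loading a delimited file into HDFS is a couple of shell commands (the directory and file names below are purely illustrative):

```
# Copy a local delimited file into HDFS (paths are hypothetical)
hadoop fs -mkdir -p /data/sales
hadoop fs -put sales_2012.csv /data/sales/

# Verify the file landed
hadoop fs -ls /data/sales
```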
NO SQL, NO TABLES, NO COLUMNS – NO DATA?
• Write code to do MapReduce
  • You have to write code to get data out
• The best way to get data:
  • Write code that calls the MapReduce framework to slice and dice the stored data
• Step 1 is Map and Step 2 is Reduce
MAP (REDUCE)
• Mapping
  • Pick your selection of keys from each record (typically a line of text)
  • Tell the framework what your key is and what values that key will hold
  • MapReduce will deal with the actual creation of the map
  • You control which keys to include and which values to filter out
  • You end up with a giant hashtable
(MAP) REDUCE
• Reducing data: once the map phase is complete, the code moves on to the reduce phase
• The reduce phase works on the mapped data and can do all the aggregation and summation activities
• Finally, you get a blob of the mapped and reduced data
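The two phases can be sketched in plain Python (this is a conceptual illustration of map/shuffle/reduce on a word count, not the Hadoop Java API; the input records are made up):

```python
from itertools import groupby
from operator import itemgetter

# Illustrative input records; in Hadoop each record would be a line in HDFS
records = ["big data big", "sql server", "big sql"]

# Map phase: emit a (key, value) pair per word in each record
mapped = [(word, 1) for line in records for word in line.split()]

# Shuffle: the framework groups all values by key (here: sort + groupby)
mapped.sort(key=itemgetter(0))

# Reduce phase: aggregate the values for each key
counts = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=itemgetter(0))}

print(counts)  # {'big': 3, 'data': 1, 'server': 1, 'sql': 2}
```

In real Hadoop the map and reduce steps run in parallel across the cluster, and the shuffle is handled by the framework between the two phases.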
JAVA… VS. PIG…
• Pig is a querying engine
  • Has a 'business-friendly' syntax
  • Generates MapReduce code for you
  • The syntax for Pig is called Pig Latin (don't ask)
  • Pig Latin is syntactically very similar to LINQ
• Pig converts a query into MapReduce, sends it off to Hadoop, then retrieves the results
• Roughly half the performance of hand-written MapReduce
• Roughly ten times faster to write
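A word count, which takes pages of Java, is a handful of lines in Pig Latin (the input and output paths here are hypothetical):

```
-- Word count in Pig Latin; 'input.txt' and 'wordcounts' are made-up paths
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;
STORE counts INTO 'wordcounts';
```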
HBASE
• HBase is a key-value store on top of HDFS
• This is the NoSQL database
• A very thin layer over raw HDFS
  • Data is grouped in a table that has rows of data
  • Each row can have multiple 'column families'
  • Each 'column family' contains multiple columns
  • Each column name is a key, and it has its corresponding column value
  • Rows don't all need to have the same number of columns
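The data model can be pictured as nested maps (a conceptual sketch in Python; the table, family and column names are invented for illustration):

```python
# HBase data model as nested dicts:
# row key -> column family -> column qualifier -> value
table = {
    "row-001": {
        "info":  {"name": "widget", "colour": "red"},
        "stats": {"sold": "42"},
    },
    # Rows need not share the same columns:
    "row-002": {
        "info": {"name": "gadget"},
    },
}

# Looking up one cell: table[row][family][qualifier]
value = table["row-001"]["info"]["colour"]
print(value)  # red
```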
HIVE
• Hive is a little closer to RDBMS systems
• A data warehouse (DWH) system on top of HDFS and HBase
  • Performs join operations between HBase tables
• Maintains a meta layer
  • Data summation, ad-hoc queries and analysis of large data stores in HDFS
• High-level language
  • Hive Query Language (HiveQL) looks like SQL but is restricted
  • No UPDATEs or DELETEs are allowed
  • Partitioning can be used to update information
    o Essentially re-writing a chunk of data
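A HiveQL sketch of a partitioned table and an ad-hoc query (table, column and path names are hypothetical); note that "updating" a day's data means overwriting its partition rather than issuing an UPDATE:

```sql
-- Hypothetical partitioned table
CREATE TABLE page_views (
  user_id STRING,
  url     STRING
)
PARTITIONED BY (view_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Re-write one partition's chunk of data instead of UPDATE/DELETE
LOAD DATA INPATH '/data/views/2012-05-01'
  OVERWRITE INTO TABLE page_views PARTITION (view_date = '2012-05-01');

-- Ad-hoc aggregation, familiar SQL shape
SELECT url, COUNT(*) AS hits
FROM page_views
WHERE view_date = '2012-05-01'
GROUP BY url;
```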
WINDOWS HADOOP – PROJECT ISOTOPE
• 2 flavours
  • Cloud
    o Azure CTP
  • On Premise
    o Integration of the Hadoop File System with Active Directory
    o Integration of System Center Operations Manager with Hadoop
    o BI integration
• These are not all that interesting in and of themselves, but the data and tools are
  o Sqoop – integration with SQL Server
  o Flume – access to lots of data
SQOOP
• A framework that facilitates transfer between an RDBMS and HDFS
• Uses MapReduce programs to import and export data
• Imports and exports are performed in parallel with fault tolerance
• Source/target files used by Sqoop can be:
  • Delimited text files
  • Binary SequenceFiles containing serialized record data
SQL SERVER – HORTONWORKS – HADOOP
• Hortonworks is a spin-off from Yahoo
• Will bridge the technological gaps between Hadoop and Windows Server
• CTP of the Hadoop-based distribution for Windows Server (expected sometime in 2012)
• Will work with Microsoft's business-intelligence tools, including:
  o Excel
  o PowerPivot
  o PowerView
WITH THE SQL SERVER-HADOOP CONNECTOR, YOU CAN:
• Sqoop-based connector
• Import
  • Tables in SQL Server to delimited text files on HDFS
  • Tables in SQL Server to SequenceFiles on HDFS
  • Tables in SQL Server to tables in Hive
  • Results of queries executed on SQL Server to delimited text files on HDFS
  • Results of queries executed on SQL Server to SequenceFiles on HDFS
  • Results of queries executed on SQL Server to tables in Hive
• Export
  • Delimited text files on HDFS to tables in SQL Server
  • SequenceFiles on HDFS to tables in SQL Server
  • Hive tables to tables in SQL Server
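Because the connector is Sqoop-based, each of these transfers boils down to one command line. A sketch of an import and an export (server, database, credentials, table and path names are all hypothetical):

```
# Import a SQL Server table into delimited text files on HDFS
sqoop import \
  --connect 'jdbc:sqlserver://dbserver:1433;database=Sales' \
  --username hadoop_etl -P \
  --table Orders \
  --target-dir /data/sales/orders \
  --as-textfile

# Export processed results from HDFS back into a SQL Server table
sqoop export \
  --connect 'jdbc:sqlserver://dbserver:1433;database=Sales' \
  --username hadoop_etl -P \
  --table OrderSummary \
  --export-dir /data/sales/summary
```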
SQL SERVER 2012 ALONGSIDE THE ELEPHANT
• PowerView utilizes its own class of apps, if you will, that Microsoft is calling "insights"
• SQL Server will extend insights to Hadoop data sets
• Interesting insights can be:
  • Brought into a SQL Server environment using the connectors
  • Used to drive analysis using the BI tools
WHY USE HADOOP WITH SQL SERVER
• Don't just think about big data as being large volumes
  • Analyze both structured and unstructured datasets
  • Think about workload, growth, accessibility and even location
  • Can the amount of data stored every day be reliably written to a traditional HDD?
• MapReduce is more complex than T-SQL
  • Many companies try to avoid writing Java for queries
  • Front ends are immature relative to the tooling available in the relational database world
• It's not going to replace your database, but your database isn't likely to replace Hadoop either
MICROSOFT AND HADOOP
• Broader access to Hadoop for:
  • End users
  • IT professionals
  • Developers
• Enterprise-ready Hadoop distribution with greater security, performance and ease of management
• Breakthrough insights through the use of familiar tools such as Excel, PowerPivot, SQL Server Analysis Services and Reporting Services
MICROSOFT ENTERPRISE HADOOP
• Machines in the Hadoop cluster must be running Windows Server 2008 or higher
• IPv4 networking enabled on all nodes
  • Deployment does not work on an IPv6-only network
• The ability to create a new user account called "Isotope"
  • Will be created on all nodes of the cluster
  • Used for running Hadoop daemons and running jobs
  • Must be able to copy and install the deployment binaries to each machine
• Windows File Sharing services must be enabled on each machine that will be joined to the Hadoop cluster
• .NET Framework 4 installed on all nodes
• Minimum of 10 GB free space on the C drive (JBOD HDFS configuration is supported)