1. Data is not stored in the traditional table-and-column format. At best, some of the database layers mimic this, but deep in the bowels of HDFS there are no tables, no primary keys, and no indexes. Everything is a flat file with predetermined delimiters. HDFS is optimized for the <Key, Value> mode of storage; everything maps down to <Key, Value> pairs.
2. HDFS supports forward-only parsing. You are either reading ahead or appending to the end. There is no concept of ‘Update’ or ‘Insert’.
3. Databases built on HDFS don’t guarantee ACID properties, especially ‘Consistency’. They offer what is called ‘Eventual Consistency’: data will be saved eventually, but because of the highly asynchronous nature of the file system you are not guaranteed at what point the write will finish. So HDFS-based systems are NOT ideal for OLTP systems. RDBMSs still rock there.
4. Taking code to the data. In traditional systems you fire a query to get data and then write code to manipulate it. In MapReduce, you write code, send it to Hadoop’s data store, and get back the manipulated data. Essentially you are sending code to the data.
5. Traditional databases like SQL Server scale better vertically, so more cores, more memory, and faster cores are the way to scale. Hadoop, however, scales horizontally by design. Keep throwing hardware at it and it will scale.
Mapping Data: If it is plain delimited text data, you have the freedom to pick keys and values from each record (remember, records are typically linefeed-separated) and tell the framework what your key is and what values that key will hold. MR will deal with the actual creation of the map. While the map is being created you can control which keys to include and which values to filter out. In the end you have a giant hashtable of filtered key-value pairs. Now what?
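The mapping step described above can be sketched in a few lines of plain Python. This is only a single-process illustration, not Hadoop code: the function name `map_records`, its parameters, and the sample data are all made up for the example, and in a real cluster the framework, not your code, groups the values by key.

```python
from collections import defaultdict

def map_records(text, key_index, value_index, delimiter=",", keep=None):
    """Build a key -> list-of-values table from linefeed-separated records."""
    mapped = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank records
        fields = line.split(delimiter)
        key, value = fields[key_index], fields[value_index]
        if keep is not None and not keep(key, value):
            continue  # filter out pairs we do not want in the map
        mapped[key].append(value)
    return dict(mapped)

# Hypothetical input: country, product, quantity
sample = "NL,apples,3\nBE,pears,5\nNL,pears,2\nDE,apples,0\n"

# Key = country (field 0), value = quantity (field 2); drop zero quantities.
pairs = map_records(sample, key_index=0, value_index=2,
                    keep=lambda k, v: int(v) > 0)
print(pairs)  # {'NL': ['3', '2'], 'BE': ['5']}
```

The `keep` callback is where the "what to include, what to filter out" decision lives; everything else is bookkeeping.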
Well, if you are that scared of Java, then you have Pig. No, I am not calling names here. Pig is a querying engine that has a more ‘business-friendly’ syntax but spits out MapReduce code in the backend and does all the dirty work for you. The syntax for Pig is called, of course, Pig Latin.
When you write queries in Pig Latin, Pig converts them into MapReduce jobs, sends them off to Hadoop, then retrieves the results and hands them back to you.
Analysis shows you get about half the performance of optimal hand-written MapReduce Java code, but that Java code takes more than 10 times as long to write as the equivalent Pig query.
If you are in the mood for a start-up idea, generating optimal MapReduce code from Pig Latin is a topic to consider …
For those in the .NET world, Pig Latin is very similar syntactically to LINQ.
HBase is a key-value store that sits on top of HDFS. It is a NoSQL database.
It is a very thin veneer over raw HDFS: it mandates that data is grouped in a Table that has rows of data.
Each row can have multiple ‘Column Families’ and each ‘Column Family’ can contain multiple columns.
Each column name is the key and it has its corresponding column value.
So a column of data can be represented as
row[family][column] = value
Rows need not all have the same number of columns. Think of each row as a horizontal linked list that links to a column family, and each column family in turn links to multiple columns as <Key, Value> pairs.
row1 -> family1 -> col A = val A
     -> family2 -> col B = val B
and so on.
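To make the row[family][column] = value shape concrete, here is a minimal sketch using nested Python dictionaries. It only models the layout described above; it is not the HBase client API, and `make_table` and `columns_in_row` are hypothetical helper names.

```python
from collections import defaultdict

def make_table():
    """row key -> column family -> column name -> value."""
    return defaultdict(lambda: defaultdict(dict))

def columns_in_row(table, row_key):
    """List the (family, column) pairs present in one row, sorted."""
    row = table.get(row_key, {})  # .get avoids creating a missing row
    return sorted((family, column)
                  for family, columns in row.items()
                  for column in columns)

table = make_table()
table["row1"]["family1"]["colA"] = "valA"
table["row1"]["family2"]["colB"] = "valB"
table["row2"]["family1"]["colC"] = "valC"  # rows need not share columns

print(columns_in_row(table, "row1"))
# [('family1', 'colA'), ('family2', 'colB')]
```

Note that row2 has a different set of columns from row1 and nothing complains, which is the point of the model.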
Hive is a little closer to traditional RDBMS systems. In fact it is a data warehousing system that sits on top of HDFS but maintains a meta layer that helps with data summation, ad-hoc queries and analysis of large data stores in HDFS.
Hive supports a high-level language called Hive Query Language, which looks like SQL but is restricted in a few ways; for example, no Updates or Deletes are allowed. However, Hive has the concept of partitioning, which can be used to update information. This essentially means re-writing a chunk of data whose granularity depends on the schema design.
Hive can actually sit on top of HBase and perform join operations between HBase tables.
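The "update by re-writing a partition" idea can be modelled roughly as follows. This is a plain Python analogy, not Hive Query Language: the table is just a dictionary keyed by partition, and the month-based partitioning, the `overwrite_partition` name, and the sample rows are assumptions made for the illustration.

```python
def overwrite_partition(table, partition_key, transform):
    """Return a new table in which one partition has been rewritten in full."""
    old_rows = table.get(partition_key, [])
    # Every row in the partition is re-written, changed or not.
    new_rows = [transform(dict(row)) for row in old_rows]
    updated = dict(table)          # other partitions are carried over untouched
    updated[partition_key] = new_rows
    return updated

# Hypothetical sales table partitioned by month.
sales = {
    "2012-01": [{"sku": "A", "qty": 3}, {"sku": "B", "qty": 1}],
    "2012-02": [{"sku": "A", "qty": 7}],
}

def fix_qty(row):
    """Correct one bad value; all other rows pass through unchanged."""
    if row["sku"] == "B":
        row["qty"] = 2
    return row

sales_v2 = overwrite_partition(sales, "2012-01", fix_qty)
print(sales_v2["2012-01"])
```

The cost of changing one value is rewriting the whole partition, which is why the partition granularity chosen in the schema matters.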
Isotope is more than the distributions that the Softies are building with Hortonworks. Isotope also refers to the whole “tool chain” of supporting big-data analytics offerings that Microsoft is packaging up around the distributions. Microsoft’s big-picture concept is that Isotope will give all kinds of users, from technical to “ordinary” productivity workers, access from inside data-analysis tools they know (like Microsoft’s own SQL Server Analysis Services, PowerPivot and Excel on their PCs) to data stored in Windows Servers and/or Windows Azure. (The Windows Azure Marketplace fits in here, as this is the place where third-party providers can publish free or paid collections of data which users will be able to download or buy.)

To accelerate its adoption in the Enterprise, Microsoft will make Hadoop Enterprise ready by:
• Active Directory Integration: Providing Enterprise-class security through integration of Hadoop with Active Directory
• High Performance: Boosting Hadoop performance to offer consistently high data throughput
• System Center Integration: Simplifying management of the Hadoop infrastructure through integration with Microsoft’s management tools such as System Center
• BI Integration: Enabling integration of relational and Hadoop data into Enterprise BI solutions with Hadoop connectors
• Flexibility and Choice with deployment options for Windows Server and Windows Azure, which offers customers:
  o Freedom to choose: More control, as they can choose which data to keep in-house instead of in the cloud.
  o Lower TCO: Cost saving, as fewer resources are required to run their Hadoop deployment in the cloud.
  o Elasticity to meet demand: Elasticity reduces your costs, since more nodes can be added to the Windows Azure deployment for more demanding workloads. In addition, the Azure deployment of Hadoop can be used to extend the on-premise solution in periods of high demand.
  o Increased Performance: Bringing computing closer to the data – our solution enables customers to process data closer to where data is born, whether on premise or in the cloud.

We do this while maintaining compatibility with existing Hadoop tools such as Pig, Hive, and Java. Our goal is to ensure that applications built on Apache Hadoop can be easily migrated to our distribution to run on Windows Azure or Windows Server.
Sqoop is an open source connectivity framework that facilitates transfer between multiple Relational Database Management Systems (RDBMS) and HDFS. Sqoop uses MapReduce programs to import and export data; the imports and exports are performed in parallel with fault tolerance. The Source / Target files being used by Sqoop can be delimited text files (for example, with commas or tabs separating each field), or binary SequenceFiles containing serialized record data. Please refer to section 7.2.7 in Sqoop User Guide for more details on supported file types. For information on SequenceFile format, please refer to Hadoop API page.
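As a rough picture of what "import in parallel to delimited text files" means, the sketch below splits a table's rows into slices and serializes each slice as delimited text on its own worker. It is not Sqoop and does not use Sqoop's API: threads stand in for the MapReduce tasks, in-memory strings stand in for the part files on HDFS, fault tolerance is not modelled, and every name here (`split_rows`, `write_part`, `import_table`) is hypothetical.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor

def split_rows(rows, num_splits):
    """Divide rows into roughly equal contiguous slices, one per worker."""
    size = max(1, -(-len(rows) // num_splits))  # ceiling division, at least 1
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def write_part(rows, delimiter):
    """Serialize one slice as delimited text (stands in for one part file)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()

def import_table(rows, num_splits=2, delimiter=","):
    """Serialize every slice on its own worker; results keep slice order."""
    slices = split_rows(rows, num_splits)
    with ThreadPoolExecutor(max_workers=num_splits) as pool:
        return list(pool.map(lambda part: write_part(part, delimiter), slices))

# Hypothetical source table: (id, name)
rows = [(1, "alice"), (2, "bob"), (3, "carol")]
parts = import_table(rows, num_splits=2, delimiter="\t")
print(parts)  # ['1\talice\n2\tbob\n', '3\tcarol\n']
```

Each element of `parts` corresponds to what one parallel task would write; choosing tab instead of comma is the "predetermined delimiter" decision.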
Companies do not have to be at Google scale to have data issues. Scalability issues occur with less than a terabyte of data. If a company works with relational databases and SQL, they can drown in complex data transformations and calculations that do not fit naturally into sequences of set operations. In that sense, the “big data” mantra is misguided at times… The big issue is not that everyone will suddenly operate at petabyte scale; a lot of folks do not have that much data. The more important topics are the specifics of the storage and processing infrastructure and what approaches best suit each problem.
… attack unstructured and semi-structured datasets without the overhead of an ETL step to insert them into a traditional relational database. From CSV to XML, we can load in a single step and begin querying.
Gain new insights from your data

Have you ever had trouble finding data you needed? Or combining data from different, incompatible sources? How about sharing the results with others in a web-friendly way? If so, we want you to try the Microsoft Codename “Data Explorer” Cloud service.

With "Data Explorer" you can:
• Identify the data you care about from the sources you work with (e.g. Excel spreadsheets, files, SQL Server databases).
• Discover relevant data and services via automatic recommendations from the Windows Azure Marketplace.
• Enrich your data by combining it and visualizing the results.
• Collaborate with your colleagues to refine the data.
• Publish the results to share them with others or power solutions.

In short, we help you harness the richness of data on the Web to generate new insights.
SQL SERVER 2012 AND BIG DATA
Hadoop Connectors for SQL Server
TECHNICALLY – WHAT IS HADOOP
• Hadoop consists of two key services:
  • Data storage using the Hadoop Distributed File System (HDFS)
  • High-performance parallel data processing using a technique called MapReduce.
HADOOP IS AN ENTIRE ECOSYSTEM
• HBase as database
• Hive as a Data Warehouse
• Pig as the query language
• Built on top of Hadoop and the MapReduce framework.
HDFS
• HDFS is designed to scale seamlessly
  • That's its strength!
• Scaling horizontally is non-trivial in most cases.
• HDFS scales by throwing more hardware at it.
  • A lot of it!
  • HDFS is asynchronous
  • Is what links Hadoop to Cloud computing.
DIFFERENCES
• SQL Server & Windows 2008 R2's NTFS?
  • Data is not stored in the traditional table column format.
  • HDFS supports forward-only parsing
  • Databases built on HDFS don't guarantee ACID properties
  • Taking code to the data
  • SQL Server scales better vertically
UNSTRUCTURED DATA
• Doesn't know/care about column names, column data types, column sizes or even number of columns.
• Data is stored in delimited flat files
• You're on your own with respect to data cleansing
• Data input in Hadoop is as simple as loading your data file into HDFS
  • It's very close to copying files on an OS.
NO SQL, NO TABLES, NO COLUMNS, NO DATA?
• Write code to do MapReduce
  • You have to write code to get data
• The best way to get data
  • Write code that calls the MapReduce framework to slice and dice the stored data
• Step 1 is Map and Step 2 is Reduce.
MAP (REDUCE)
• Mapping
  • Pick your selection of keys from each record (linefeed-separated)
  • Tell the framework what your Key is and what values that key will hold
  • MR will deal with actual creation of the Map
  • Control which keys to include or which values to filter out
  • End up with a giant hashtable
(MAP) REDUCE
• Reducing Data: Once the map phase is complete, the code moves on to the reduce phase. The reduce phase works on the mapped data and can potentially do all the aggregation and summation activities.
• Finally you get a blob of the mapped and reduced data.
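A minimal sketch of what the reduce phase does with already-mapped data, in plain Python rather than Hadoop's Java API. The `reduce_pairs` helper and the sample input are made up for the illustration; the input is assumed to be a key -> list-of-values table of the kind the map phase produces.

```python
def reduce_pairs(mapped, reducer):
    """Apply a reducer to the list of values collected for each key."""
    return {key: reducer(values) for key, values in mapped.items()}

# Hypothetical output of a map phase: quantities grouped by country.
mapped = {"NL": [3, 2], "BE": [5]}

totals = reduce_pairs(mapped, sum)   # summation per key
counts = reduce_pairs(mapped, len)   # number of records per key

print(totals)  # {'NL': 5, 'BE': 5}
print(counts)  # {'NL': 2, 'BE': 1}
```

Swapping the reducer function is all it takes to change the aggregation, which is why the same mapped data can feed several different summaries.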
JAVA… VS. PIG…
• Pig is a querying engine
  • Has a ‘business-friendly’ syntax
  • Spits out MapReduce code
  • Syntax for Pig is called: Pig Latin (Don't ask)
  • Pig Latin is very similar syntactically to LINQ.
• Pig converts queries into MapReduce and sends them off to Hadoop, then retrieves the results
• Half the performance
• 10 times faster to write
HBASE
• HBase is a key-value store on top of HDFS
• This is the NoSQL Database
• Very thin layer over raw HDFS
  • Data is grouped in a Table that has rows of data.
  • Each row can have multiple ‘Column Families’
  • Each ‘Column Family’ contains multiple columns.
  • Each column name is the key and it has its corresponding column value.
  • Each row doesn't need to have the same number of columns
HIVE
• Hive is a little closer to RDBMS systems
• Is a DWH system on top of HDFS and HBase
  • Performs join operations between HBase tables
• Maintains a meta layer
  • Data summation, ad-hoc queries and analysis of large data stores in HDFS
• High-level language
  • Hive Query Language, looks like SQL but restricted
  • No Updates or Deletes are allowed
  • Partitioning can be used to update information
    o Essentially re-writing a chunk of data.
WINDOWS HADOOP - PROJECT ISOTOPE
• 2 Flavours
  • Cloud
    o Azure CTP
  • On Premise
    o Integration of the Hadoop File System with Active Directory
    o Integrate System Center Operations Manager with Hadoop
    o BI Integration
  • Are not all that interesting in and of themselves, but data and tools are
    o Sqoop – Integration with SQL Server
    o Flume – Access to lots of data
SQOOP
• Is a framework that facilitates transfer between RDBMS and HDFS.
• Uses MapReduce programs to import and export data
• Imports and exports are performed in parallel with fault tolerance.
• Source / Target files being used by Sqoop can be:
  • Delimited text files
  • Binary SequenceFiles containing serialized record data.
SQL SERVER – HORTONWORKS – HADOOP
• Spin-off from Yahoo
• Bridge the technological gaps between Hadoop and Windows Server
• CTP of the Hadoop-based distribution for Windows Server (sometime in 2012)
• Will work with Microsoft's business-intelligence tools
  • Including
    o Excel
    o PowerPivot
    o PowerView
WITH SQL SERVER-HADOOP CONNECTOR, YOU CAN:
• Sqoop-based connector
• Import
  • Tables in SQL Server to delimited text files on HDFS
  • Tables in SQL Server to SequenceFiles on HDFS
  • Tables in SQL Server to tables in Hive
  • Results of queries executed on SQL Server to delimited text files on HDFS
  • Results of queries executed on SQL Server to SequenceFiles on HDFS
  • Results of queries executed on SQL Server to tables in Hive
• Export
  • Delimited text files on HDFS to SQL Server
  • SequenceFiles on HDFS to SQL Server
  • Hive tables to tables in SQL Server
SQL SERVER 2012 ALONGSIDE THE ELEPHANT
• PowerView utilizes its own class of apps, if you will, that Microsoft is calling insights.
• SQL Server will extend insights to Hadoop data sets
• Interesting insights can be
  • Brought into a SQL Server environment using connectors
  • Drive analysis across it using BI tools.
WHY USE HADOOP WITH SQL SERVER
• Don't just think about big data being large volumes
  • Analyze both structured and unstructured datasets
  • Think about workload, growth, accessibility and even location
  • Can the amount of data stored every day be reliably written to a traditional HDD?
• MapReduce is more complex than T-SQL
  • Many companies try to avoid writing Java for queries
  • Front ends are immature relative to the tooling available in the relational database world
  • It's not going to replace your database, but your database isn't likely to replace Hadoop either.
MICROSOFT AND HADOOP
• Broader access of Hadoop to:
  • End users
  • IT professionals
  • Developers
• Enterprise-ready Hadoop distribution with greater security, performance, ease of management.
• Breakthrough insights through the use of familiar tools such as Excel, PowerPivot, SQL Server Analysis Services and Reporting Services.
MICROSOFT ENTERPRISE HADOOP
• Machines in the Hadoop cluster must be running Windows Server 2008 or higher
• IPv4 network enabled on all nodes
  • Deployment does not work on an IPv6-only network.
• The ability to create a new user account called “Isotope”.
  • Will be created on all nodes of the cluster.
  • Used for running Hadoop daemons and running jobs.
  • Must be able to copy and install the deployment binaries to each machine
• Windows File Sharing services must be enabled on each machine that will be joined to the Hadoop cluster.
• .NET Framework 4 installed on all nodes.
• Minimum of 10 GB free space on the C drive (JBOD HDFS configuration is supported)