Big Data-BI Fusion: Microsoft HDInsight & MS BI
Level: Intermediate
March 28, 2013
Andrew Brust
CEO and Founder, Blue Badge Insights

Meet Andrew
• CEO and Founder, Blue Badge Insights
• Big Data blogger for ZDNet
• Microsoft Regional Director, MVP
• Co-chair VSLive! and 18 years as a speaker
• Founder, MS BI and Big Data User Group of NYC
  – http://www.msbigdatanyc.com
• Co-moderator, NYC .NET Developers Group
  – http://www.nycdotnetdev.com
• “Redmond Review” columnist for Visual Studio Magazine and Redmond Developer News
• brustblog.com, Twitter: @andrewbrust

What is Big Data?
• 100s of TB into PB and higher
• Involving data from: financial data, sensors, web logs, social media, etc.
• Parallel processing often involved
  – Hadoop is emblematic, but other technologies are Big Data too
• Processing of data sets too large for transactional databases
  – Analyzing interactions, rather than transactions
  – The three V’s: Volume, Velocity, Variety
• Big Data tech sometimes imposed on small data problems

The Hadoop Stack
• MapReduce, HDFS
• Database
• RDBMS Import/Export
• Query: HiveQL and Pig Latin
• Machine Learning/Data Mining
• Log file integration

What’s MapReduce?
• Divide and conquer approach to “Big” data processing
• Partition the data and send to mappers (nodes in cluster)
• Mappers pre-process into key-value pairs, then all output for (a) given key(s) goes to a reducer
• Reducer performs aggregations; one output per key, with value
• Map and Reduce code natively written as Java functions

MapReduce, in a Diagram
[Diagram: input partitions feed the mappers; mapper output is grouped by key (K1, K2, K3) and routed to the reducers; each reducer produces its output.]

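The map, group-by-key, and reduce steps described above can be sketched in a few lines of Python. This is only a single-process illustration of the data flow, not Hadoop code (which the slides note is natively written in Java). The word-count task, the function names, and the sample partitions are all assumptions made for the example.

```python
from collections import defaultdict

def mapper(chunk):
    """Map step: emit a (word, 1) pair for every word in one input partition."""
    for line in chunk:
        for word in line.lower().split():
            yield word, 1

def shuffle(pairs):
    """Group values by key so that all output for a given key reaches one reducer."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    """Reduce step: aggregate the values for one key into one output per key."""
    return key, sum(values)

def map_reduce(partitions):
    # In a real cluster each partition would be sent to a different mapper node;
    # here the partitions are simply processed one after another.
    mapped = [pair for chunk in partitions for pair in mapper(chunk)]
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())

# Hypothetical sample input: two partitions of one line each.
partitions = [
    ["big data is big"],
    ["hadoop handles big data"],
]
counts = map_reduce(partitions)
print(counts)
```

The point of the sketch is the shape of the pipeline: mappers never see each other's data, and a reducer only ever sees the values for its own key.
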
HDFS
• File system whose data gets distributed over commodity disks on commodity servers
• Data is replicated
• If one box goes down, no data lost
  – “Shared Nothing”
  – Except the name node
• BUT: Immutable
  – Files can only be written to once
  – So updates require drop + re-write (slow)
  – You can append though
  – Like a DVD/CD-ROM

HBase
• A Wide-Column Store, NoSQL database
• Modeled after Google BigTable
• HBase tables are HDFS files
  – Therefore, Hadoop-compatible
• Hadoop often used with HBase
  – But you can use either without the other
• HDInsight (more on next slide) does not (yet) include HBase

Azure HDInsight Provisioning (New!)
• HDInsight preview now public, so…
• Go to Windows Azure portal
• Sign up for the public preview
• Select HDInsight from left navbar
• Click “+ NEW” button at lower-left
• Specify cluster name, number of nodes, admin password, storage account
  – Credentials used for browser login, RDP and ODBC
  – During preview, you will be billed 50% of Azure compute rates for nodes in cluster. Will be 100% at GA.
• Click “CREATE HDINSIGHT CLUSTER”
• Wait for provisioning to complete
• Navigate to http://clustername.azurehdinsight.net

The “Data-Refinery” Idea
• Use Hadoop to “on-board” unstructured data, then extract manageable subsets
• Load the subsets into conventional DW/BI servers and use familiar analytics tools to examine them
• This is the current rationalization of Hadoop + BI tools’ coexistence
• Will it stay this way?

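As a rough illustration of the refinery idea, the Python sketch below boils raw log lines down to a small summary table that a conventional DW/BI server could load. It runs in a single process and stands in for what a Hadoop job would do at scale. The log layout, the sample lines, and the function names are assumptions made for the example, not anything prescribed by HDInsight.

```python
import csv
import io
from collections import Counter

# Hypothetical raw web log: date, time, method, path, status, space-separated.
raw_log = """\
2013-03-27 10:01:12 GET /products/42 200
2013-03-27 10:01:13 GET /products/42 200
garbage line that does not parse
2013-03-27 10:02:40 GET /cart 500
2013-03-28 08:15:02 GET /products/42 200
"""

def refine(lines):
    """Reduce raw log lines to hit counts per (date, path, status)."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 5:
            continue  # skip lines that do not match the assumed layout
        date, _time, _method, path, status = parts
        counts[(date, path, status)] += 1
    return counts

def to_csv(counts):
    """Render the refined subset as CSV text, ready for a warehouse bulk load."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["date", "path", "status", "hits"])
    for (date, path, status), hits in sorted(counts.items()):
        writer.writerow([date, path, status, hits])
    return buffer.getvalue()

summary = refine(raw_log.splitlines())
print(to_csv(summary))
```

The refined output has one row per date, path, and status, so it stays small no matter how many raw lines went in.
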
Hive
• Used by most BI products which connect to Hadoop
• Provides a SQL-like abstraction over Hadoop
  – Officially HiveQL, or HQL
• Works on own tables, but also on HBase
• Query generates MapReduce job, output of which becomes result set
• Microsoft has Hive ODBC driver
  – Connects Excel, Reporting Services, PowerPivot, Analysis Services Tabular Mode (only)

HDInsight Data Sources
• Files in HDFS
• Azure Blob Storage (Azure HDInsight only)
  – Use asv:// URLs (“Azure Storage Vault”)
• Hive tables
• HBase?

Just-in-time Schema
• When looking at unstructured data, schema is imposed at query time
• Schema is context specific
  – If scanning a book, are the values words, lines, or pages?
  – Are notes a single field, or is each word a value?
  – Are date and time two fields or one?
  – Are street, city, state, zip separate or one value?
  – Pig and Hive let you determine this at query time
  – So does the Map function in MapReduce code

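The questions above can be made concrete with a small Python sketch that reads the same raw record under two different schemas, chosen at read time. The pipe-delimited record layout and both function names are hypothetical, invented only to illustrate the idea. In practice Pig, Hive, or a Map function would play this role.

```python
# Hypothetical raw record: timestamp | address | free-text notes.
raw_records = [
    "2013-03-28 09:15:00|12 Main St, Springfield, IL 62701|great talk on Hive",
]

def coarse_schema(record):
    """Treat date and time as one field, the address as one value, notes as one field."""
    timestamp, address, notes = record.split("|")
    return {"timestamp": timestamp, "address": address, "notes": notes}

def fine_schema(record):
    """Split the same record into date, time, address parts, and individual words."""
    timestamp, address, notes = record.split("|")
    date, time = timestamp.split(" ")
    street, city, state_zip = [part.strip() for part in address.split(",")]
    state, zip_code = state_zip.split(" ")
    return {
        "date": date,
        "time": time,
        "street": street,
        "city": city,
        "state": state,
        "zip": zip_code,
        "words": notes.split(),
    }

# The stored data never changes; only the schema applied while reading does.
for record in raw_records:
    print(coarse_schema(record))
    print(fine_schema(record))
```

Neither schema is stored with the data, so a different query can impose a different one later without rewriting anything.
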
How Does MS BI Fit In?
• Excel, PowerPivot: can query via Hive ODBC driver
• Analysis Services (SSAS) Tabular Mode
  – Also compatible with Hive ODBC Driver
  – Multidimensional mode is not
• Power View
  – Works against PowerPivot and SSAS Tabular
• RDBMS + Parallel Data Warehouse (PDW)
  – Sqoop connectors
  – Columnstore Indexes (Enterprise Edition and PDW only)
• PDW: PolyBase

Excel, PowerPivot
• Excel and PowerPivot use the BI Semantic Model (BISM), which can query Hadoop via Hive and its ODBC driver
• Excel also features “Data Explorer” (currently in Beta) which can query HDFS directly and insert the results into a BISM repository
• Excel BISM accommodates millions of rows through compression. Not petabyte scale, but sufficient to store and analyze output of Hadoop queries.

PowerPivot, SSAS Tabular
• SQL Server Analysis Services Tabular mode is the enterprise server implementation of BISM
• Features partitioning and role-based security
• Can store billions of rows. So even better for Hadoop output analysis.
• Excel-based BISM repositories can be upsized to SSAS Tabular

Sqoop
• Name is short for “SQL to Hadoop”
• Essentially a technology for moving data between data warehouses and Hadoop
• Command line utility; allows specification of source/target HDFS file and relational server, database and table
• Sqoop connectors available for SQL Server and PDW
• Sqoop generates MapReduce job to extract data from, or insert data into, HDFS

PDW, PolyBase
• SQL Server Parallel Data Warehouse (PDW) is a Massively Parallel Processing (MPP) data warehouse appliance version of SQL Server
• MPP manages a grid of relational database servers for divide-and-conquer processing of large data sets.
• PDW v2 includes “PolyBase,” a component which allows PDW to query data in Hadoop directly.
  – Bypasses MapReduce; addresses data nodes directly and orchestrates parallelism itself

PolyBase Versus Hive, Sqoop
• Hive and Sqoop generate MapReduce jobs, and work in batch mode
• PolyBase addresses HDFS data itself
• This is true SQL over Hadoop.
• Competitors:
  – Cloudera Impala
  – Teradata Aster SQL-H
  – EMC/Greenplum Pivotal HD
  – Hadapt

Usability Impact
• PowerPivot makes analysis much easier, self-service
• Power View is great for discovery and visualization; also self-service
• Combine with the Hive ODBC driver and suddenly Hadoop is accessible to business users
• Caveats
  – Someone has to write the HiveQL
  – Can query Big Data, but must have a smaller result set

Resources
• Big On Data blog
  – http://www.zdnet.com/blog/big-data
• Apache Hadoop home page
  – http://hadoop.apache.org/
• Hive & Pig home pages
  – http://hive.apache.org/
  – http://pig.apache.org/
• Hadoop on Azure home page
  – https://www.hadooponazure.com/
• SQL Server 2012 Big Data
  – http://bit.ly/sql2012bigdata