Big Data and NoSQL in Microsoft-Land
SQL Server Live! Orlando 2012

Document Transcript

  • SQL Server Live! Orlando 2012 Big Data and NoSQL in Microsoft-Land Andrew Brust and Lynn Langit Blue Badge Insights & Data Wrangler Level: Intermediate Meet Andrew • CEO and Founder, Blue Badge Insights • Big Data blogger for ZDNet • Microsoft Regional Director, MVP • Co-chair VSLive! and 17 years as a speaker • Founder, Microsoft BI User Group of NYC – http://www.msbinyc.com • Co-moderator, NYC .NET Developers Group – http://www.nycdotnetdev.com • “Redmond Review” columnist for Visual Studio Magazine and Redmond Developer News • brustblog.com, Twitter: @andrewbrust SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 1
  • Andrew’s New Blog (bit.ly/bigondata) Meet Lynn • CEO and Founder, Lynn Langit consulting • Former Microsoft Evangelist (4 years) • Google Developer Expert • MongoDB Master • MCT 13 years – 7 certifications • Cloudera Certified Developer • MSDN Magazine articles – SQL Azure – Hadoop on Azure – MongoDB on Azure • www.LynnLangit.com • @LynnLangit
  • Lynn’s YouTube Channel • www.TeachingKidsProgramming.org • Free Courseware (recipes) • Do a Recipe → Teach a Kid (Ages 10 ++) • Java or Microsoft SmallBasic
  • Read all about it! Agenda • Overview / Landscape – Big Data, and Hadoop – NoSQL – The Big Data-NoSQL Intersection • Drilldown on Big Data • Drilldown on NoSQL
  • What is Big Data? • 100s of TB into PB and higher • Involving data from: financial data, sensors, web logs, social media, etc. • Parallel processing often involved – Hadoop is emblematic, but other technologies are Big Data too • Processing of data sets too large for transactional databases – Analyzing interactions, rather than transactions – The three V’s: Volume, Velocity, Variety • Big Data tech sometimes imposed on small data problems BigData = Exponentially More Data • Retail Example -> ‘Feedback Economy’ – Number of transactions – Number of behaviors (collected every minute)
  • BigData = ‘Next State’ Questions (Collecting Behavioral data) • What could happen? • Why didn’t this happen? • When will the next new thing happen? • What will the next new thing be? • What happens? What’s MapReduce? • “Big” input data as key-value pair series • Partition the data and send to mappers (nodes in cluster) • Mappers pre-process, put into key-value format, and send all output for a given (set of) key(s) to a reducer • Reducer aggregates; one output per key, with value • Map and Reduce code natively written as Java functions
  • MapReduce, in a Diagram [Diagram: inputs feed parallel mappers; mapper output is grouped by key (K1, K2, K3) and routed as input to reducers; each reducer emits one output] A MapReduce Example • Count by suite, on each floor • Send per-suite, per platform totals to lobby • Sort totals by platform • Send two platform packets to 10th, 20th, 30th floor • Tally up each platform • Collect the tallies • Merge tallies into one spreadsheet
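The building-survey example can be mirrored in a few lines of Python. This is only a single-process sketch of the map, shuffle and reduce steps, not Hadoop code; the record layout and the idea that "platform" means a phone platform are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(records):
    """Mapper: emit one (key, value) pair per input record."""
    for floor, suite, platform in records:
        yield platform, 1

def shuffle(pairs):
    """Group all values for a key together, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reducer: aggregate to one output value per key."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical survey data: (floor, suite, platform)
records = [
    (1, "101", "Android"),
    (1, "102", "iOS"),
    (2, "201", "Android"),
    (2, "202", "Windows Phone"),
    (3, "301", "Android"),
]

totals = reduce_phase(shuffle(map_phase(records)))
print(totals)  # {'Android': 3, 'iOS': 1, 'Windows Phone': 1}
```

In a real cluster the mappers run in parallel on different nodes and the shuffle moves data across the network; here all three steps run in one process.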
  • What’s a Distributed File System? • One where data gets distributed over commodity drives on commodity servers • Data is replicated • If one box goes down, no data lost – “Shared Nothing” • BUT: Immutable – Files can only be written to once – So updates require drop + re-write (slow) – You can append though – Like a DVD/CD-ROM Hadoop = MapReduce + HDFS • Modeled after Google MapReduce + GFS • Have more data? Just add more nodes to cluster. – Mappers execute in parallel – Hardware is commodity – “Scaling out” • Use of HDFS means data may well be local to mapper processing • So, not just parallel, but minimal data movement, which avoids network bottlenecks
  • Example Comparison: RDBMS vs. Hadoop (each row: Traditional RDBMS | Hadoop / MapReduce)
    – Data Size: Gigabytes (Terabytes) | Petabytes (Exabytes)
    – Access: Interactive and Batch | Batch – NOT Interactive
    – Updates: Read / Write many times | Write once, Read many times
    – Structure: Static Schema | Dynamic Schema
    – Integrity: High (ACID) | Low
    – Scaling: Nonlinear | Linear
    – Query Response Time: Can be near immediate | Has latency (due to batch processing)
    Just-in-time Schema • When looking at unstructured data, schema is imposed at query time • Schema is context specific – If scanning a book, are the values words, lines, or pages? – Are notes a single field, or is each word a value? – Are date and time two fields or one? – Are street, city, state, zip separate or one value? – Pig and Hive let you determine this at query time – So does the Map function in MapReduce code
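To make the just-in-time schema idea concrete, here is a small Python sketch in which the same raw lines are read under two different schemas. The sample lines, function names and field names are all made up for illustration. The point is only that the stored data never changes, while the reader decides what counts as a field.

```python
# Hypothetical raw lines, stored once with no declared schema
raw_lines = [
    "2012-12-10 09:15:02 Orlando FL 32819",
    "2012-12-11 14:40:37 Redmond WA 98052",
]

def as_two_fields(line):
    """Schema A: date and time are one field, the address is one value."""
    date, time, rest = line.split(" ", 2)
    return {"timestamp": f"{date} {time}", "location": rest}

def as_five_fields(line):
    """Schema B: date, time, city, state and zip are separate fields."""
    date, time, city, state, zip_code = line.split(" ")
    return {"date": date, "time": time, "city": city,
            "state": state, "zip": zip_code}

for line in raw_lines:
    print(as_two_fields(line))
    print(as_five_fields(line))
```

Each function plays the role that a LOAD ... AS clause, a Hive table definition or a Map function would play: the schema lives in the query, not in the file.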
  • What’s HBase? • A Wide-Column Store NoSQL database • Modeled after Google BigTable • Uses HDFS – Therefore, Hadoop-compatible • Hadoop often used with HBase – But you can use either without the other
  • NoSQL Confusion • Many ‘flavors’ of NoSQL data stores • Easiest to group by functionality, but… – Dividing lines are not clear or consistent • NoSQL choice(s) driven by many factors – Type of data – Quantity of data – Knowledge of technical staff – Product maturity – Tooling So much wrong information • Everything is ‘new’ • People are religious about data storage • Lots of incorrect information • ‘Try’ before you ‘buy’ (or use) • Watch out for over simplification • Confusion over vendor offerings
  • Common NoSQL Misconceptions • Problems – Everything is ‘new’ – People are religious about data storage – Open source is always cheaper – Cloud is always cheaper – Replace RDBMS with NoSQL • Solutions – ‘Try’ before you ‘buy’ (or use) – Leverage NoSQL communities – Add NoSQL to existing RDBMS solution NoSQL + Big Data • HBase and Cassandra work with Hadoop, are NoSQL databases • MongoDB brands itself a Big Data technology • Couchbase does too • Just-in-time schema • MapReduce in MongoDB, others • Hadoop and most NoSQL DBs are partitioned, scale-out technologies • It’s all about analytics on semi- or un-structured data
  • DRILLDOWN ON BIG DATA The Hadoop Stack • Log file integration • Machine Learning/Data Mining • RDBMS Import/Export • Query: HiveQL and Pig Latin • Database • MapReduce, HDFS
  • What’s Hive? • Began as Hadoop sub-project – Now top-level Apache project • Provides a SQL-like (“HiveQL”) abstraction over MapReduce • Has its own HDFS table file format (and it’s fully schema-bound) • Can also work over HBase • Acts as a bridge to many BI products which expect tabular data Hadoop Distributions • Cloudera • Hortonworks – HCatalog: Hive/Pig/MR Interop • MapR – Network File System replaces HDFS • IBM InfoSphere BigInsights – HDFS<->DB2 integration • And now Microsoft…
  • Microsoft HDInsight • Developed with Hortonworks and incorporates Hortonworks Data Platform (HDP) for Windows • Windows Azure HDInsight and Microsoft HDInsight (for Windows Server) – Single node preview runs on Windows client • Includes ODBC Driver for Hive – And Excel Add-In that uses it • JavaScript MapReduce framework • Contribute it all back to open source Apache Project Amenities for Visual Studio/.NET • MRLib (NuGet Package) – MR code in C#, HadoopJob, MapperBase, ReducerBase • LINQ to Hive – OdbcClient + Hive ODBC Driver • Debugging • Deployment • Hortonworks Data Platform for Windows
  • Some ways to work • Microsoft HDInsight – Cloud: go to www.hadooponazure.com, request invite – Local: Download Microsoft HDInsight; runs on just about anything, including Windows XP; get it via the Web Platform Installer (WebPI) – Both are free for now; Azure HDInsight will be fee-based when RTM • Amazon Web Services Elastic MapReduce – Create AWS account – Select Elastic MapReduce in Dashboard – Cheap for experimenting, but not free • Cloudera CDH VM image – Download as .tar.gz file – “Un-tar” (can use WinRAR, 7zip) – Run via VMWare Player or Virtual Box – Everything’s free Some ways to work: HDInsight, EMR, CDH 4
  • Microsoft HDInsight • Much simpler than the others • Browser-based portal – Launch MapReduce jobs – Azure: Provisioning cluster, managing ports, gather external data • Interactive JavaScript & Hive console – JS: HDFS, Pig, light data visualization – Hive commands and metadata discovery – New console coming • Desktop Shortcuts: – Command window, MapReduce, Name Node status in browser – Azure: from portal page you can RDP directly to Hadoop head node for these desktop shortcuts Windows Azure HDInsight
  • Amazon Elastic MapReduce • Lots of steps! • At a high level: – Set up AWS account and S3 “buckets” – Generate Key Pair and PEM file – Install Ruby and EMR Command Line Interface – Provision the cluster using CLI (a batch file can work very well here) – Set up and run SSH/PuTTY – Work interactively at command line Amazon EMR – Prep Steps • Create an AWS account • Create an S3 bucket for log storage – with list permissions for authenticated users • Create a Key Pair and save PEM file • Install Ruby • Install Amazon Web Services Elastic MapReduce Command Line Interface – aka AWS EMR CLI • Create credentials.json in EMR CLI folder – Associate with same region as where key pair created
  • Amazon – Security and Startup • Security – Download PuTTYgen and run it – Click Load and browse to PEM file – Save it in PPK format – Exit PuTTYgen • In a command window, navigate to EMR CLI folder and enter command: – ruby elastic-mapreduce --create --alive [--num-instances xx] [--pig-interactive] [--hive-interactive] [--hbase --instance-type m1.large] • In AWS Console, go to EC2 Dashboard and click Instances on left nav bar • Wait until instance is running and get its Public DNS name – Use Compatibility View in IE or copy may not work Connect! • Download and run PuTTY • Paste DNS name of EC2 instance into hostname field • In Treeview, drill down and navigate to Connection > SSH > Auth, browse to PPK file • Once EC2 instance(s) running, click Open • Click Yes to “The server’s host key is not cached in the registry…” PuTTY Security Alert • When prompted for user name, type “hadoop” and hit Enter • cd bin, then hive, pig, hbase shell • Right-click to paste from clipboard; option to go full-screen • (Kill EC2 instance(s) from Dashboard when done)
  • Amazon Elastic MapReduce Cloudera CDH4 Virtual Machine • Get it for free, in VMWare and Virtual Box versions. – VMWare player and Virtual Box are free too • Run it, and configure it to have its own IP on your network. Use ifconfig to discover IP. • Assuming IP of 192.168.1.59, open browser on your own (host) machine and navigate to: – http://192.168.1.59:8888 • Can also use browser in VM and hit: – http://localhost:8888 • Work in “Hue”…
  • Hue • Browser based UI, with front ends for: – HDFS (w/ upload & download) – MapReduce job creation and monitoring – Hive (“Beeswax”) • And in-browser command line shells for: – HBase – Pig (“Grunt”) Impala: What it Is • Distributed SQL query engine over Hadoop cluster • Announced at Strata/Hadoop World in NYC on October 24th • In Beta, as part of CDH 4.1 • Works with HDFS and Hive data • Compatible with HiveQL and Hive drivers – Query with Beeswax
  • Impala: What it’s Not • Impala is not Hive – Hive converts HiveQL to Java MapReduce code and executes it in batch mode – Impala executes query interactively over the data – Brings BI tools and Hadoop closer together • Impala is not an Apache Software Foundation project – It is open source and Apache-licensed, but it’s still incubated by Cloudera – Only in CDH Cloudera CDH4
  • Hadoop commands • HDFS – hadoop fs -filecommand – Create and remove directories: mkdir, rm, rmr – Upload and download files to/from HDFS: get, put – View directory contents: ls, lsr – Copy, move, view files: cp, mv, cat • MapReduce – Run a Java jar-file based job: hadoop jar jarname params Hadoop (directly)
  • HBase • Concepts: – Tables, column families – Columns, rows – Keys, values • Commands: – Definition: create, alter, drop, truncate – Manipulation: get, put, delete, deleteall, scan – Discovery: list, exists, describe, count – Enablement: disable, enable – Utilities: version, status, shutdown, exit – Reference: http://wiki.apache.org/hadoop/Hbase/Shell • Moreover, – Interesting HBase work can be done in MapReduce, Pig HBase Examples • create 't1', 'f1', 'f2', 'f3' • describe 't1' • alter 't1', {NAME => 'f1', VERSIONS => 5} • put 't1', 'r1', 'f1:c1', 'value' • get 't1', 'r1' • count 't1'
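The concepts behind those shell commands (tables, column families, row keys, versioned cell values) can be pictured with a toy in-memory model. The class below is a hypothetical teaching aid written for this transcript. It is not the HBase client API and it ignores storage, regions and timestamps.

```python
from collections import defaultdict

class ToyTable:
    """In-memory stand-in for one wide-column table (illustration only)."""

    def __init__(self, *families, versions=1):
        self.families = set(families)   # column families are fixed at create time
        self.versions = versions        # how many values to keep per cell
        # row key -> "family:qualifier" -> list of values, newest first
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        cell = self.rows[row][column]
        cell.insert(0, value)
        del cell[self.versions:]        # drop versions beyond the limit

    def get(self, row):
        """Return the newest value of every column in the row."""
        return {col: vals[0] for col, vals in self.rows.get(row, {}).items()}

    def count(self):
        return len(self.rows)

t1 = ToyTable("f1", "f2", "f3", versions=5)
t1.put("r1", "f1:c1", "value")
t1.put("r1", "f1:c1", "newer value")
print(t1.get("r1"))   # {'f1:c1': 'newer value'}
print(t1.count())     # 1
```

Column qualifiers such as c1 are created on the fly by put, while families must exist up front. This is what makes the rows wide and sparse.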
  • HBase Submitting, Running and Monitoring Jobs • Upload a JAR • Use Streaming – Use other languages (i.e. other than Java) to write MapReduce code – Python is a popular option – Any executable works, even C# console apps – On MS HDInsight, JavaScript works too – Still uses a JAR file: streaming.jar • Run at command line (passing JAR name and params) or use GUI
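Since Python is mentioned as a popular Streaming language, here is a minimal word-count sketch in the Streaming style: the mapper emits tab-separated key/value lines and the reducer consumes them sorted by key. The function names and the local driver are assumptions for illustration. In a real Streaming job the mapper and reducer would be separate executables reading standard input and writing standard output.

```python
from itertools import groupby

def mapper(lines):
    """Emit 'word<TAB>1' for every word, one output line per pair."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_lines):
    """Sum the counts for each key; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

def run_locally(lines):
    """Stand-in for the framework: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))

print(run_locally(["big data big", "data"]))  # ['big\t2', 'data\t2']
```

The sort between the two steps is what the framework provides for free; the reducer relies on it to see all lines for one key together.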
  • Running MapReduce Jobs Hive • Used by most BI products which connect to Hadoop • Provides a SQL-like abstraction over Hadoop – Officially HiveQL, or HQL • Works on own tables, but also on HBase • Query generates MapReduce job, output of which becomes result set • Microsoft has Hive ODBC driver – Connects Excel, Reporting Services, PowerPivot, Analysis Services Tabular Mode (only)
  • Hive, Continued • Load data from flat HDFS files – LOAD DATA [LOCAL] INPATH 'myfile' INTO TABLE mytable; • SQL Queries – CREATE, ALTER, DROP – INSERT OVERWRITE (creates whole tables) – SELECT, JOIN, WHERE, GROUP BY – SORT BY, but ordering data is tricky! – MAP/REDUCE/TRANSFORM…USING allows for custom map, reduce steps utilizing Java or streaming code Excel Add-In for Hive
  • Hive Pig • Instead of SQL, employs a language (“Pig Latin”) that accommodates data flow expressions – Do a combo of Query and ETL • “10 lines of Pig Latin ≈ 200 lines of Java.” • Works with structured or unstructured data • Operations – As with Hive, a MapReduce job is generated – Unlike Hive, output is only flat file to HDFS or text at command line console – With MS Hadoop, can easily convert to JavaScript array, then manipulate • Use command line (“Grunt”) or build scripts
  • Example • A = LOAD 'myfile' AS (x, y, z); B = FILTER A BY x > 0; C = GROUP B BY x; D = FOREACH C GENERATE group, COUNT(B); STORE D INTO 'output'; Pig Latin Examples • Imperative, file system commands – LOAD, STORE (schema specified on LOAD) • Declarative, query commands (SQL-like) – xxx = file or data set – FOREACH xxx GENERATE (SELECT…FROM xxx) – JOIN (WHERE/INNER JOIN) – FILTER xxx BY (WHERE) – ORDER xxx BY (ORDER BY) – GROUP xxx BY / GENERATE COUNT(xxx) (SELECT COUNT(*) GROUP BY) – DISTINCT (SELECT DISTINCT) • Syntax is assignment statement-based: – MyCusts = FILTER Custs BY SalesPerson == 15; • Access HBase – CpuMetrics = LOAD 'hbase://SystemMetrics' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:', '-loadKey -returnTuple');
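For readers who think in general-purpose code, the data flow of the five-statement example (load, keep rows with x > 0, group by x, count per group, store) looks roughly like this in plain Python. The input rows are invented, and the sketch runs in memory rather than as a MapReduce job.

```python
from collections import Counter

# Hypothetical rows standing in for the loaded file: (x, y, z)
rows = [(1, "a", 10), (1, "b", 20), (0, "c", 30), (2, "d", 40), (-3, "e", 50)]

# FILTER ... BY x > 0
positive = [row for row in rows if row[0] > 0]

# GROUP ... BY x, then COUNT per group
counts = Counter(x for x, _, _ in positive)

# STORE: here we just materialize (group, count) pairs in key order
result = sorted(counts.items())
print(result)  # [(1, 2), (2, 1)]
```

Each Pig alias corresponds to one intermediate variable here. This is the sense in which the syntax is "assignment statement-based".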
  • Pig Sqoop • sqoop import --connect "jdbc:sqlserver://<servername>.database.windows.net:1433;database=<dbname>;user=<username>@<servername>;password=<password>" --table <from_table> --target-dir <to_hdfs_folder> --split-by <from_table_column>
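Because the JDBC connection string must not contain stray spaces, it can help to assemble the import command programmatically. The helper below is a hypothetical convenience function that only builds the argument list, using the flags shown on the slide. All sample values are placeholders, and nothing is executed.

```python
import shlex

def sqoop_import_args(server, database, user, password,
                      table, target_dir, split_by):
    """Build the argument list for the sqoop import command shown above."""
    connect = (
        f"jdbc:sqlserver://{server}.database.windows.net:1433;"
        f"database={database};user={user}@{server};password={password}"
    )
    return [
        "sqoop", "import",
        "--connect", connect,
        "--table", table,
        "--target-dir", target_dir,
        "--split-by", split_by,
    ]

# Placeholder values, not real credentials
args = sqoop_import_args("myserver", "mydb", "me", "secret",
                         "Orders", "/data/orders", "OrderID")
print(shlex.join(args))  # shell-quoted form, suitable for pasting
```

Passing the list to a process launcher avoids shell-quoting problems with the semicolons in the connection string.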
  • Sqoop • sqoop export --connect "jdbc:sqlserver://<servername>.database.windows.net:1433;database=<dbname>;user=<username>@<servername>;password=<password>" --table <to_table> --export-dir <from_hdfs_folder> --input-fields-terminated-by "<delimiter>" Flume NG • Source – Avro (data serialization system – can read JSON-encoded data files, and can work over RPC) – Exec (reads from stdout of long-running process) • Sinks – HDFS, HBase, Avro • Channels – Memory, JDBC, file
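The source, channel and sink roles can be pictured with a queue: a source puts events into a channel, and a sink drains the channel into a destination. The following is a conceptual Python sketch with made-up function names. It says nothing about how Flume itself is implemented.

```python
from queue import Queue

def run_source(events, channel):
    """Source: accept events and hand them to the channel."""
    for event in events:
        channel.put(event)

def run_sink(channel, store):
    """Sink: drain the channel and write each event to the destination."""
    while not channel.empty():
        store.append(channel.get())

channel = Queue()          # plays the role of a memory channel
store = []                 # stands in for HDFS, HBase or a logger
events = ["GET /index.html", "GET /about.html", "POST /login"]

run_source(events, channel)
run_sink(channel, store)
print(store)
```

The channel decouples the two ends, so a slow sink does not block the source until the buffer fills.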
  • Flume NG (next generation) • Setup conf/flume.conf:
      # Define a memory channel called ch1 on agent1
      agent1.channels.ch1.type = memory
      # Define an Avro source called avro-source1 on agent1 and tell it
      # to bind to 0.0.0.0:41414. Connect it to channel ch1.
      agent1.sources.avro-source1.channels = ch1
      agent1.sources.avro-source1.type = avro
      agent1.sources.avro-source1.bind = 0.0.0.0
      agent1.sources.avro-source1.port = 41414
      # Define a logger sink that simply logs all events it receives
      # and connect it to the other end of the same channel.
      agent1.sinks.log-sink1.channel = ch1
      agent1.sinks.log-sink1.type = logger
      # Finally, now that we've defined all of our components, tell
      # agent1 which ones we want to activate.
      agent1.channels = ch1
      agent1.sources = avro-source1
      agent1.sinks = log-sink1
    • From the command line: flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1
    Mahout Algorithms • Recommendation – Your info + community info – Give users/items/ratings; get user-user/item-item – itemsimilarity • Classification/Categorization – Drop into buckets – Naïve Bayes, Complementary Naïve Bayes, Decision Forests • Clustering – Like classification, but with categories unknown – K-Means, Fuzzy K-Means, Canopy, Dirichlet, Mean-Shift
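As intuition for the item-item side of recommendation, the sketch below counts how often two items appear in the same user's history. Plain co-occurrence counting is a deliberately simpler stand-in chosen for illustration. It is not Mahout's implementation or any of its similarity measures, and the ratings data is invented.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (user, item) interactions
ratings = [
    ("alice", "A"), ("alice", "B"),
    ("bob", "A"), ("bob", "B"), ("bob", "C"),
    ("carol", "B"), ("carol", "C"),
]

# Collect the set of items each user touched
items_by_user = defaultdict(set)
for user, item in ratings:
    items_by_user[user].add(item)

# Count how many users have both items of each pair
cooccurrence = defaultdict(int)
for items in items_by_user.values():
    for a, b in combinations(sorted(items), 2):
        cooccurrence[(a, b)] += 1

print(dict(cooccurrence))  # {('A', 'B'): 2, ('A', 'C'): 1, ('B', 'C'): 2}
```

Pairs with higher counts are candidates for "users who liked A also liked B"; a statistical measure would additionally correct for items that are simply popular.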
  • Workflow, Syntax • Workflow – Run the job – Dump the output – Visualize, predict • mahout algorithm --input folderspec --output folderspec --param1 value1 --param2 value2 … • Example: – mahout itemsimilarity --input <input-hdfs-path> --output <output-hdfs-path> --tempDir <tmp-hdfs-path> -s SIMILARITY_LOGLIKELIHOOD The Truth About Mahout • Mahout is really just an algorithm engine • Its output is almost unusable by non-statisticians/non-data scientists • You need a staff or a product to visualize, or make into a usable prediction model • Investigate Predixion Software – CTO, Jamie MacLennan, used to lead SQL Server Data Mining team – Excel add-in can use Mahout remotely, visualize its output, run predictive analyses – Also integrates with SQL Server, Greenplum, MapReduce – http://www.predixionsoftware.com
  • The “Data-Refinery” Idea • Use Hadoop to “on-board” unstructured data, then extract manageable subsets • Load the subsets into conventional DW/BI servers and use familiar analytics tools to examine • This is the current rationalization of Hadoop + BI tools’ coexistence • Will it stay this way? DRILLDOWN ON NOSQL
  • Hitting (Relational) Walls • CA – Highly-available consistency • CP – Enforced consistency • AP – Eventual consistency The reality…two pivots • Storage Methods – SQL (RDBMS) – NoSQL • Storage Locations – On premises – Cloud-hosted
  • So many NoSQL options • More than just the Elephant in the room • More than 120 types of NoSQL databases Flavors of NoSQL
  • Graph Database • Use for data with – a lot of many-to-many relationships – recursive self-joins • Use when your primary objective is quickly finding connections, patterns and relationships between the objects within lots of data • Examples: Neo4J, FreeBase (Google) Column Database • Wide, sparse column sets • Schema-light • Examples: – Cassandra – HBase – BigTable – GAE HR DS
  • More about Column Databases • Type A – Column-families – Non-relational – Sparse – Examples: HBase, Cassandra, xVelocity (SQL 2012 BISM) • Type B – Column-stores – Relational – Dense – Example: SQL Server 2012 Columnstore index Demo - Document Database (MongoDB) • Use for data that is – document-oriented (collection of JSON documents) w/semi-structured data – Encodings include XML, YAML, JSON & BSON – binary forms (PDF, Microsoft Office documents: Word, Excel…) • Examples: MongoDB, CouchDB
  • Demo MongoDB Persistent Key / Value Database • Schema-less • State - Persistent • Examples – AWS DynamoDB – Azure Tables – Project Voldemort
  • Volatile Key / Value Database • Schema-less • State - Volatile • Examples – Redis – Memcached
    Which type of NoSQL for which type of data? (each row: Type of Data | Type of NoSQL solution | Example)
    – Log files | Wide Column | HBase
    – Product Catalogs | Key Value on disk | DynamoDB
    – User profiles | Key Value in memory | Redis
    – Startups | Document | MongoDB
    – Social media connections | Graph | Neo4j
    – LOB w/Transactions | NONE! Use RDBMS | SQL Server
  • What about the cloud? Cloud-hosted NoSQL up to 50x CHEAPER
  • Consumer Storage Buckets • Dropbox • Box • Windows SkyDrive • Google Drive • Amazon Cloud Drive • Apple iCloud Developer BLOB Storage Buckets • Amazon – S3 or Glacier • Google – Cloud Storage • Microsoft Azure BLOBS • Others
  • Cloud-hosted RDBMS • AWS RDS – SQL Server, MySQL, Oracle – Medium cost – Solid feature set, i.e. backup, snapshot – Use existing tooling • Google – MySQL – Lowest cost – Most limited RDBMS functionality • Microsoft – Windows Azure SQL Database – Highest cost – Azure VMs w/MySQL
  • Other cloud data services • Hosting public datasets – Pay to read – Earn revenue by offering for read • Cleaning / matching (your) data – ETL: Microsoft Data Explorer, Google Refine – Data Quality: Windows Azure Marketplace, InfoChimps, DataMarket.com
    Cloud – RDBMS, NoSQL & Hadoop (each row: AWS | Google | Microsoft)
    – Cloud RDBMS: SQL Server, Oracle | MySQL | SQL Azure / MySQL
    – NoSQL buckets: S3 or Glacier | Cloud Storage | Azure Storage
    – NoSQL databases: DynamoDB | H/R Datastore on GAE | Azure Tables
    – Streaming, Machine Learning: Custom EC2 | Prospective Search & Prediction API | StreamInsight & Mahout with Hadoop
    – Document or Graph: MongoDB on EC2 | Freebase (g) | MongoDB on Windows Azure
    – Hadoop: Elastic MapReduce using S3 & EC2 | MapR & GCE | Windows Azure HDInsight
    – Data sets & other: Karmasphere | Translation API, Full-text search | Azure DataMarket
  • Demo Amazon RDS Pick your mix and then… • Other – Use Cloud Data Markets – Use Cloud ETL Services • RDBMS – Host locally – Host in the Cloud • NoSQL – Host locally – Host in the Cloud
  • What about me? Common DBA Tasks in NoSQL (each row: RDBMS | NoSQL)
    – Import Data | Import Data
    – Setup Security | Setup Security
    – Perform a Backup | Make a copy of the data
    – Restore a Database | Move a copy to a location
    – Create an Index | Create an Index
    – Join Tables Together | Run MapReduce
    – Schedule a Job | Schedule a (Cron) Job
    – Run Database Maintenance | Monitor space and resources used
    – Send an Email from SQL Server | Set up resource threshold alerts
    – Search BOL | Interpret Documentation
  • Making Sense – Asking Questions Data Scientists…
  • Comparing… Karmasphere Studio for AWS
  • Google BigQuery w/Excel • Dremel-based service – For massive amounts of data – BigQuery currently has quota limits – SQL-like query language Demo Google BigQuery
  • NoSQL To-Do List Understand CAP & types of NoSQL databases • Use NoSQL when business needs designate • Use the right type of NoSQL for your business problem Try out NoSQL on the cloud • Quick and cheap for behavioral data • Mashup cloud datasets • Good for specialized use cases, e.g. dev, test, training environments Learn NoSQL access technologies • New query languages, e.g. MapReduce, R, Infer.NET • New query tools (vendor-specific) – Google Refine, Amazon Karmasphere, Microsoft Excel connectors, etc. The Changing Data Landscape: Other Services, RDBMS, NoSQL
  • NoSQL for .NET Developers • RavenDB • MongoDB C#/.NET Driver • MongoDB on Windows Azure • CouchBase .NET Client Library • Riak client for .NET • AWS Toolkit for Visual Studio • Google cloud APIs (REST-based)