This document provides an agenda and overview for a conference session on Big Data and NoSQL for database and BI professionals held from April 10-12 in Chicago, IL. The session will include an overview of big data and NoSQL technologies, then deeper dives into Hadoop, NoSQL databases like HBase, and tools like Hive, Pig, and Sqoop. There will also be demos of technologies like HDInsight, Elastic MapReduce, Impala, and running MapReduce jobs.
Cloud Computing and the Microsoft Developer - A Down-to-Earth Analysis (Andrew Brust)
Slides from my Keynote at Visual Studio Live Las Vegas 2011 (Day 2).
Closely compares Azure to AWS, and discusses Force.com, Google, Rackspace, VMWare and Red Hat.
Discussion includes capabilities, pricing, strategy.
Relational databases vs Non-relational databases (James Serra)
There is a lot of confusion about the place and purpose of the many recent non-relational database solutions ("NoSQL databases") compared to the relational database solutions that have been around for so many years. In this presentation I will first clarify what exactly these database solutions are, compare them, and discuss the best use cases for each. I'll discuss topics involving OLTP, scaling, data warehousing, polyglot persistence, and the CAP theorem. We will even touch on a new type of database solution called NewSQL. If you are building a new solution it is important to understand all your options so you take the right path to success.
An unprecedented amount of data is being created and is accessible. This presentation shows how to use the new NoSQL technologies to make sense of all this data.
This presentation provides an introduction to Azure DocumentDB. Topics include elastic scale, global distribution and guaranteed low latencies (with SLAs) - all in a managed document store that you can query using SQL and JavaScript. We also review common scenarios and advanced data science scenarios.
RDX Insights Presentation - Microsoft Business Intelligence (Christopher Foot)
May's RDX Insights Series Presentation focuses on Microsoft's BI products. We begin with an overview of Power BI, SSIS, SSAS and SSRS and how the products integrate with each other. The webinar continues with a detailed discussion on how to use Power BI to capture, model, transform, analyze and visualize key business metrics. We’ll finish with a Power BI demo highlighting some of its most beneficial and interesting features.
First introduced with the Analytics Platform System (APS), PolyBase simplifies management and querying of both relational and non-relational data using T-SQL. It is now available in both Azure SQL Data Warehouse and SQL Server 2016. The major features of PolyBase include the ability to do ad-hoc queries on Hadoop data and the ability to import data from Hadoop and Azure blob storage to SQL Server for persistent storage. A major part of the presentation will be a demo on querying and creating data on HDFS (using Azure Blobs). Come see why PolyBase is the “glue” to creating federated data warehouse solutions where you can query data as it sits instead of having to move it all to one data platform.
MongoDB and Hadoop: Driving Business Insights (MongoDB)
MongoDB and Hadoop can work together to solve big data problems facing today's enterprises. We will take an in-depth look at how the two technologies complement and enrich each other with complex analyses and greater intelligence. We will take a deep dive into the MongoDB Connector for Hadoop and how it can be applied to enable new business insights with MapReduce, Pig, and Hive, and demo a Spark application to drive product recommendations.
Big Data projects overview at EMC Labs China
• Introduction to Cloud Databases
• Data analytics in the cloud
– Parallel DBMS
– MapReduce
• FlexDB - A cloud-scale database engine based on Hadoop
Agenda
- What is NOSQL?
- Motivations for NOSQL?
- Brewer’s CAP Theorem
- Taxonomy of NOSQL databases
- Apache Cassandra
- Features
- Data Model
- Consistency
- Operations
- Cluster Membership
- What Does NOSQL Mean for RDBMS?
A practical introduction to Oracle NoSQL Database - OOW2014 (Anuj Sahni)
Not familiar with Oracle NoSQL Database yet? This great product introduction session discusses the primary functionality included with the product as well as integration with other Oracle products. It includes a live demo that illustrates installation and configuration as well as data modeling and sample NoSQL application development.
An Intro to NoSQL Databases -- NoSQL databases will not become the new dominators. Relational will still be popular, and used in the majority of situations. They, however, will no longer be the automatic choice. (source : http://martinfowler.com/)
Presented at Cassandra London (April 7, 2014); The challenges of time-series storage and analytics in OpenNMS, with an introduction to Newts, a new Cassandra-based time-series data store.
Operational Analytics Using Spark and NoSQL Data Stores (DATAVERSITY)
NoSQL data stores have emerged for scalable capture and real-time analysis of data. Apache Spark and Hadoop provide additional scalable analytics processing. This session looks at these technologies and how they can be used to support operational analytics to improve operational effectiveness. It also looks at an example of how operational analytics can be implemented in NoSQL environments using the Basho Data Platform with Apache Spark:
•The emergence of NoSQL, Hadoop and Apache Spark
•NoSQL Use Cases
•The need for operational analytics
•Types of operational analysis
•Key requirements for operational analytics
•Operational analytics using the Basho Data Platform with Apache Spark.
Overview of the different data models, mainly: flat file, hierarchical, network, relational, and object-oriented. CAP theorem, and NoSQL's four major models: document-oriented, column-oriented, key-value store, and graph. Followed by an overview of some of the famous NoSQL products: Redis, Cassandra, MongoDB, and Neo4j.
Time series data is proliferating with literally every step we take: think of Fitbit bracelets that track your every move, or financial trading data, all of it timestamped.
Time series data requires high-performance reads and writes even with a huge number of data sources. Both speed and scale are integral to success, which makes for a unique challenge for your database.
A time series NoSQL data model requires flexibility to support unstructured and semi-structured data, as well as the ability to write range queries to analyze your time series data. So how can you tackle speed, scale and flexibility all at once?
Join Professional Services Architect Drew Kerrigan and Developer Advocate Matt Brender for a discussion of:
Examples of time series data sets, from IoT to Finance to jet engines
What makes time series queries different from other database queries
How to model your dataset to answer the right questions about your data
How to store, query and analyze a set of time series data points
Learn how a NoSQL database model and Riak TS can help you address the unique challenges of time series data.
AWS is an incredibly popular environment for running MongoDB deployments. Today you have many choices about instance type, storage, network config, security, how you configure MongoDB processes, and more. In addition, you now have options when it comes to tooling to help you manage and operate your deployment. In this session, we’ll take a look at several recommendations that can help you get the best performance out of AWS.
Introduction to Microsoft's Big Data Platform and Hadoop Primer (Denny Lee)
This is my 24 Hours of PASS (September 2012) presentation, an introduction to Microsoft's Big Data Platform and Hadoop primer, also known as Project Isotope and HDInsight.
"NoSQL on the move" by Glynn Bird
Mobile-first web and app development is a solved problem, but how can your websites and apps continue to work with little or no internet connectivity? Discover how offline-first development allows apps to present an "always on" experience to their users.
How can .NET contribute to Data Science? What is .NET Interactive? What do notebooks have to do with it? And Apache Spark? And the Python world? And Azure? In this session we put these ideas in order.
AnalyticsConf2016 - Advanced analytics on the Azure HDInsight platform (Łukasz Grala)
A session on Microsoft's Big Data Analytics solution: Hortonworks (Hadoop, HBase, Storm, Spark) together with the high-performance R Server. Advanced analytics using RevoScaleR.
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da... (MongoDB)
Drawn from Think Big's experience on real-world client projects, Think Big Academy Director and Principal Architect Jeffrey Breen will review specific ways to integrate NoSQL databases into Hadoop-based Big Data systems: preserving state in otherwise stateless processes; storing pre-computed metrics and aggregates to enable interactive analytics and reporting; and building a secondary index to provide low-latency, random access to data stored on the high-latency HDFS. A working example of secondary indexing is presented in which MongoDB is used to index web site visitor locations from Omniture clickstream data stored on HDFS.
3. Meet Andrew
CEO and Founder, Blue Badge Insights
Big Data blogger for ZDNet
Microsoft Regional Director, MVP
Co-chair VSLive! and 17 years as a speaker
Founder, Microsoft BI User Group of NYC
• http://www.msbinyc.com
Co-moderator, NYC .NET Developers Group
• http://www.nycdotnetdev.com
“Redmond Review” columnist for Visual Studio Magazine and Redmond Developer News
brustblog.com, Twitter: @andrewbrust
5. Lynn Langit (in absentia)
CEO and Founder, Lynn Langit consulting
Former Microsoft Evangelist (4 years)
Google Developer Expert
MongoDB Master
MCT 13 years – 7 certifications
Cloudera Certified Developer
MSDN Magazine articles
• SQL Azure
• Hadoop on Azure
• MongoDB on Azure
www.LynnLangit.com
@LynnLangit
7. Agenda
Overview / Landscape
• Big Data, and Hadoop
• NoSQL
• The Big Data-NoSQL Intersection
Drilldown on Big Data
Drilldown on NoSQL
8. What is Big Data?
100s of TB into PB and higher
Involving data from: financial data, sensors, web logs, social media, etc.
Parallel processing often involved
Hadoop is emblematic, but other technologies are Big Data too
Processing of data sets too large for transactional databases
Analyzing interactions, rather than transactions
The three V's: Volume, Velocity, Variety
Big Data tech sometimes imposed on small data problems
9. Big Data = Exponentially More Data
Retail Example -> 'Feedback Economy'
• Number of transactions
• Number of behaviors (collected every minute)
10. Big Data = 'Next State' Questions
• What could happen?
• Why didn't this happen?
• When will the next new thing happen?
• What will the next new thing be?
• What happens?
Collecting behavioral data
11. My Data: An Example from Health Care
Medical records
• Regular
• Emergency
• Genetic data – 23andMe
Food data
• SparkPeople
Purchasing
• Grocery card
• credit card
Search – Google
Social media
• Twitter
• Facebook
Exercise
• Nike Fuel Band
• Kinect
• Location - phone
13. Big Data Considerations
• Collection – get the data
• Storage – keep the data
• Querying – make sense of the data
• Visualization – see the business value
14. Data Collection
Types of Data
• Structured, semi-structured, unstructured vs. data standards
• Behavioral vs. transactional data
Methods of collection
• Sensors everywhere
• Machine-2-Machine
• Public Datasets
• Freebase
• Azure DataMarket
• Hilary Mason's list
15. What's MapReduce?
Partition the bulk input data and send to mappers (nodes in cluster)
Mappers pre-process, put into key-value format, and send all output for a given (set of) key(s) to a reducer
Reducer aggregates; one output per key, with value
Map and Reduce code natively written as Java functions
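The map-shuffle-reduce flow described above can be sketched in a few lines of plain Python (a toy word-count simulation for illustration, not Hadoop itself):

```python
from collections import defaultdict

def mapper(line):
    # Pre-process each input line into (key, value) pairs
    return [(word.lower(), 1) for word in line.split()]

def shuffle(mapped):
    # Route all values for a given key to the same reducer
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Aggregate: one output per key, with value
    return key, sum(values)

lines = ["big data and hadoop", "hadoop and nosql"]
mapped = [kv for line in lines for kv in mapper(line)]
result = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(result)  # {'big': 1, 'data': 1, 'and': 2, 'hadoop': 2, 'nosql': 1}
```

In a real cluster the mapper calls run in parallel on many nodes and the shuffle moves data across the network; the logic, however, is exactly this shape.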
17. A MapReduce Example
• Count by suite, on each floor
• Send per-suite, per-platform totals to lobby
• Sort totals by platform
• Send two platform packets to 10th, 20th, 30th floors
• Tally up each platform
• Merge tallies into one spreadsheet
• Collect the tallies
18. What's a Distributed File System?
One where data gets distributed over commodity drives on commodity servers
Data is replicated
• If one box goes down, no data lost
• "Shared Nothing"
BUT: Immutable
• Files can only be written to once
• So updates require drop + re-write (slow)
• You can append though
• Like a DVD/CD-ROM
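The replication idea can be pictured with a toy in-memory model (plain Python, illustrative only; node names and replica count are made up, though 3 replicas is HDFS's default):

```python
import random

def place_blocks(blocks, nodes, replicas=3):
    # Each block is copied to `replicas` distinct nodes ("shared nothing")
    placement = {}
    for block in blocks:
        placement[block] = random.sample(nodes, replicas)
    return placement

def surviving_blocks(placement, failed_node):
    # A block survives as long as at least one replica is off the failed node
    return [b for b, locs in placement.items()
            if any(n != failed_node for n in locs)]

nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(["blk1", "blk2", "blk3"], nodes)
# With 3 replicas spread over 4 nodes, losing any single node loses no data
assert surviving_blocks(placement, "node1") == ["blk1", "blk2", "blk3"]
```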
19. Hadoop = MapReduce + HDFS
Modeled after Google MapReduce + GFS
Have more data? Just add more nodes to the cluster.
• Mappers execute in parallel
• Hardware is commodity
• "Scaling out"
Use of HDFS means data may well be local to mapper processing
• So, not just parallel, but minimal data movement, which avoids network bottlenecks
20. Comparison: RDBMS vs. Hadoop
                      Traditional RDBMS         Hadoop / MapReduce
Data Size             Gigabytes (Terabytes)     Petabytes (Exabytes)
Updates               Read / Write many times   Write once, Read many times
Integrity             High (ACID)               Low
Query Response Time   Can be near immediate     Has latency (due to batch processing)
21. Just-in-Time Schema
When looking at unstructured data, schema is imposed at query time
Schema is context specific
• If scanning a book, are the values words, lines, or pages?
• Are notes a single field, or is each word a value?
• Are date and time two fields or one?
• Are street, city, state, zip separate or one value?
• Pig and Hive let you determine this at query time
• So does the Map function in MapReduce code
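As a toy illustration of schema-at-query-time (plain Python with made-up field names, not Pig or Hive syntax), the same raw line can be given two different schemas at read time:

```python
raw = "2013-04-10 08:15:00,Chicago IL 60601"

# Schema A: date and time as one field, address as one value
record_a = {"timestamp": raw.split(",")[0], "address": raw.split(",")[1]}

# Schema B: date and time as two fields, address split into parts
date, time = raw.split(",")[0].split(" ")
city, state, zip_code = raw.split(",")[1].rsplit(" ", 2)
record_b = {"date": date, "time": time, "city": city,
            "state": state, "zip": zip_code}

print(record_a["timestamp"])  # 2013-04-10 08:15:00
print(record_b["zip"])        # 60601
```

The raw bytes never change; only the interpretation applied at query time does, which is exactly what Pig, Hive, and a hand-written Map function let you decide per query.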
22. What's HBase?
A Wide-Column Store NoSQL database
Modeled after Google BigTable
Uses HDFS
Therefore, Hadoop-compatible
Hadoop MapReduce often used with HBase
But you can use either without the other
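A wide-column store's data model can be pictured as a sorted map of maps: row key, then column family, then qualifier. A rough Python analogy (illustrative only, not the HBase API, and omitting HBase's per-cell timestamps):

```python
from collections import defaultdict

# rowkey -> column family -> qualifier -> value
table = defaultdict(lambda: defaultdict(dict))

table["row1"]["info"]["name"] = "widget"
table["row1"]["metrics"]["cpu"] = "0.75"
table["row2"]["info"]["name"] = "gadget"

# Rows are sparse: row2 simply has no "metrics" entries, and that costs nothing
print(table["row1"]["metrics"]["cpu"])  # 0.75
print(sorted(table.keys()))             # ['row1', 'row2']
```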
24. NoSQL Confusion
Many 'flavors' of NoSQL data stores
Easiest to group by functionality, but…
• Dividing lines are not clear or consistent
NoSQL choice(s) driven by many factors
• Type of data
• Quantity of data
• Knowledge of technical staff
• Product maturity
• Tooling
25. So much wrong information
• Everything is 'new'
• People are religious about data storage
• Lots of incorrect information
• 'Try' before you 'buy' (or use)
• Watch out for over-simplification
• Confusion over vendor offerings
26. Common NoSQL Misconceptions
Problems
• Everything is 'new'
• People are religious about data storage
• Open source is always cheaper
• Cloud is always cheaper
• Replace RDBMS with NoSQL
Solutions
• 'Try' before you 'buy' (or use)
• Leverage NoSQL communities
• Add NoSQL to existing RDBMS solution
28. The Hadoop Stack
MapReduce, HDFS
Database
RDBMS Import/Export
Query: HiveQL and Pig Latin
Machine Learning/Data Mining
Log file integration
29. What's Hive?
Began as Hadoop sub-project
Now top-level Apache project
Provides a SQL-like (“HiveQL”) abstraction over MapReduce
Has its own HDFS table file format (and it's fully schema-bound)
Can also work over HBase
Acts as a bridge to many BI products which expect tabular data
31. Microsoft HDInsight
Developed with Hortonworks and incorporates Hortonworks Data Platform (HDP) for Windows
Windows Azure HDInsight and Microsoft HDInsight (for Windows Server)
• Single node preview runs on Windows client
Includes ODBC Driver for Hive
JavaScript MapReduce framework
Contributes it all back to the open source Apache project
32. Hortonworks Data Platform for Windows
Amenities for Visual Studio/.NET:
• MRLib (NuGet Package)
• MR code in C#: HadoopJob, MapperBase, ReducerBase
• LINQ to Hive
• OdbcClient + Hive ODBC Driver
• Deployment
• Debugging
33. Some ways to work
Microsoft HDInsight
• Cloud: go to www.windowsazure.com, request a cluster
• Local: Download Microsoft HDInsight
• Runs on just about anything, including Windows XP
• Get it via the Web Platform installer (WebPI)
• Local version is free; cloud billed at 50% discount during preview
Amazon Web Services Elastic MapReduce
• Create AWS account
• Select Elastic MapReduce in Dashboard
• Cheap for experimenting, but not free
Cloudera CDH VM image
• Download as .tar.gz file
• “Un-tar” (can use WinRAR, 7zip)
• Run via VMWare Player or Virtual Box
• Everything’s free
35. Microsoft HDInsight
Much simpler than the others
Browser-based portal
• Launch MapReduce jobs
• Azure: Provisioning cluster, managing ports, gathering external data
Interactive JavaScript & Hive console
• JS: HDFS, Pig, light data visualization
• Hive commands and metadata discovery
• New console coming
Desktop Shortcuts:
• Command window, MapReduce, Name Node status in browser
• Azure: from the portal page you can RDP directly to the Hadoop head node for these desktop shortcuts
36. April 10-12 | Chicago, IL
Demo
Windows Azure HDInsight
37. Amazon Elastic MapReduce
Lots of steps!
At a high level:
• Setup AWS account and S3 “buckets”
• Generate Key Pair and PEM file
• Install Ruby and EMR Command Line Interface
• Provision the cluster using CLI
• A batch file can work very well here
• Setup and run SSH/PuTTY
• Work interactively at command line
38. Demo
Amazon Elastic MapReduce
39. Cloudera CDH4 Virtual Machine
Get it for free, in VMWare and Virtual Box versions
• VMWare Player and Virtual Box are free too
Run it, and configure it to have its own IP on your network. Use ifconfig to discover the IP.
Assuming an IP of 192.168.1.59, open a browser on your own (host) machine and navigate to:
• http://192.168.1.59:8888
Can also use the browser in the VM and hit:
• http://localhost:8888
Work in "Hue"…
40. Hue
Browser-based UI, with front ends for:
• HDFS (w/ upload & download)
• MapReduce job creation and monitoring
• Hive ("Beeswax")
And in-browser command line shells for:
• HBase
• Pig ("Grunt")
41. Impala: What it Is
Distributed SQL query engine over Hadoop cluster
Announced at Strata/Hadoop World in NYC on October 24th
In Beta, as part of CDH 4.1
Works with HDFS and Hive data
Compatible with HiveQL and Hive drivers
• Query with Beeswax
42. Impala: What it's Not
Impala is not Hive
• Hive converts HiveQL to Java MapReduce code and executes it in batch mode
• Impala executes the query interactively over the data
• Brings BI tools and Hadoop closer together
Impala is not an Apache Software Foundation project
• It is open source and Apache-licensed, but it's still incubated by Cloudera
• Only in CDH
43. Demo
Cloudera CDH4, Impala
44. Hadoop commands
HDFS
• hadoop fs -command
• Create and remove directories: mkdir, rm, rmr
• Upload and download files to/from HDFS: get, put
• View directory contents: ls, lsr
• Copy, move, view files: cp, mv, cat
MapReduce
• Run a Java jar-file based job: hadoop jar jarname params
49. Submitting, Running and Monitoring Jobs
Upload a JAR
Use Streaming
• Use other languages (i.e. other than Java) to write MapReduce code
• Python is a popular option
• Any executable works, even C# console apps
• On MS HDInsight, JavaScript works too
• Still uses a JAR file: streaming.jar
Run at command line (passing JAR name and params) or use GUI
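A minimal sketch of the streaming model in Python (a toy word count; real Hadoop Streaming would run the mapper and reducer as separate scripts reading stdin and writing tab-separated lines to stdout):

```python
from itertools import groupby

def map_lines(lines):
    # Mapper: emit one tab-separated "word\t1" pair per word
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reduce_pairs(pairs):
    # Hadoop sorts mapper output by key before the reducer runs;
    # groupby then aggregates each run of identical keys
    parsed = (p.split("\t") for p in sorted(pairs))
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    for out in reduce_pairs(map_lines(["hive and pig", "pig on hadoop"])):
        print(out)
```

Because everything passes through plain text streams, the same pattern works for any executable — which is why C# console apps and JavaScript fit the model too.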
50. Demo
Running MapReduce Jobs
51. Hive
Used by most BI products which connect to Hadoop
Provides a SQL-like abstraction over Hadoop
• Officially HiveQL, or HQL
Works on own tables, but also on HBase
Query generates a MapReduce job, the output of which becomes the result set
Microsoft has a Hive ODBC driver
• Connects Excel, Reporting Services, PowerPivot, Analysis Services Tabular Mode (only)
52. Hive, Continued
Load data from flat HDFS files
• LOAD DATA [LOCAL] INPATH 'myfile' INTO TABLE mytable;
SQL Queries
• CREATE, ALTER, DROP
• INSERT OVERWRITE (creates whole tables)
• SELECT, JOIN, WHERE, GROUP BY
• SORT BY, but ordering data is tricky!
• MAP/REDUCE/TRANSFORM…USING allows for custom map, reduce steps utilizing Java or streaming code
53. Data Explorer
• Beta add-in for Excel
• Acquire, transform data
• Data sources include Facebook, HDFS
• Visually- or script-driven
• Also includes Azure BLOB storage backing up HDInsight
55. Pig
Instead of SQL, employs a language ("Pig Latin") that accommodates data flow expressions
• Do a combo of Query and ETL
"10 lines of Pig Latin ≈ 200 lines of Java."
Works with structured or unstructured data
Operations
• As with Hive, a MapReduce job is generated
• Unlike Hive, output is only a flat file to HDFS or text at the command line console
• With HDInsight, can easily convert to JavaScript array, then manipulate
Use command line ("Grunt") or build scripts
56. Example
A = LOAD 'myfile'
AS (x, y, z);
B = FILTER A BY x > 0;
C = GROUP B BY x;
D = FOREACH C GENERATE
group, COUNT(B);
STORE D INTO 'output';
57. Pig Latin Examples
Imperative, file system commands
• LOAD, STORE
• Schema specified on LOAD
Declarative, query commands (SQL-like)
• xxx = file or data set
• FOREACH xxx GENERATE (SELECT…FROM xxx)
• JOIN (WHERE/INNER JOIN)
• FILTER xxx BY (WHERE)
• ORDER xxx BY (ORDER BY)
• GROUP xxx BY / GENERATE COUNT(xxx)
(SELECT COUNT(*) GROUP BY)
• DISTINCT (SELECT DISTINCT)
Syntax is assignment statement-based:
• MyCusts = FILTER Custs BY SalesPerson eq 15;
Access HBase
• CpuMetrics = LOAD 'hbase://SystemMetrics' USING
org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:','-loadKey -returnTuple');
61. Flume NG
Source
• Avro (data serialization system – can read json-encoded data files, and can
work over RPC)
• Exec (reads from stdout of long-running process)
Sinks
• HDFS, HBase, Avro
Channels
• Memory, JDBC, file
62. Flume NG (next generation)
Setup conf/flume.conf
# Define a memory channel called ch1 on agent1
agent1.channels.ch1.type = memory
# Define an Avro source called avro-source1 on agent1 and tell it
# to bind to 0.0.0.0:41414. Connect it to channel ch1.
agent1.sources.avro-source1.channels = ch1
agent1.sources.avro-source1.type = avro
agent1.sources.avro-source1.bind = 0.0.0.0
agent1.sources.avro-source1.port = 41414
# Define a logger sink that simply logs all events it receives
# and connect it to the other end of the same channel.
agent1.sinks.log-sink1.channel = ch1
agent1.sinks.log-sink1.type = logger
# Finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.
agent1.channels = ch1
agent1.sources = avro-source1
agent1.sinks = log-sink1
From the command line:
flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1
63. Mahout Algorithms
Recommendation
• Your info + community info
• Give users/items/ratings; get user-user/item-item
• itemsimilarity
Classification/Categorization
• Drop into buckets
• Naïve Bayes, Complementary Naïve Bayes, Decision Forests
Clustering
• Like classification, but with categories unknown
• K-Means, Fuzzy K-Means, Canopy, Dirichlet, Mean-Shift
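As a toy illustration of the clustering idea (not Mahout itself, which runs these algorithms at scale over Hadoop), here is a minimal 1-D k-means sketch; the function name and data are made up:

```python
# Minimal 1-D k-means sketch -- illustrative only. Mahout runs this kind
# of algorithm distributed over Hadoop; names and data here are made up.

def kmeans_1d(points, centers, iterations=10):
    """Assign each point to its nearest center, then recompute centers
    as cluster means, and repeat."""
    for _ in range(iterations):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        centers = [sum(ps) / len(ps) if ps else c
                   for c, ps in clusters.items()]
    return sorted(centers)

if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 9.0, 10.0, 11.0]
    print(kmeans_1d(data, centers=[0.0, 12.0]))  # converges to [2.0, 10.0]
```

Note the categories are not known in advance: the algorithm discovers the two groups itself, which is the difference from classification noted above.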
65. The Truth About Mahout
Mahout is really just an algorithm engine
Its output is almost unusable by non-statisticians/non-data scientists
You need a staff or a product to visualize, or make into a usable
prediction model
Investigate Predixion Software
• CTO, Jamie MacLennan, used to lead SQL Server Data Mining team
• Excel add-in can use Mahout remotely, visualize its output, run predictive
analyses
• Also integrates with SQL Server, Greenplum, MapReduce
• http://www.predixionsoftware.com
66. The “Data-Refinery” Idea
Use Hadoop to “on-board” unstructured data, then extract manageable
subsets
Load the subsets into conventional DW/BI servers and use familiar
analytics tools to examine them
This is the current rationalization of Hadoop + BI tools' coexistence
Will it stay this way?
67. Google BigQuery
Dremel-based service for massive amounts of data
Pay for query and storage
SQL-like query language
Has an Excel connector
71. “Web Scale”
This is the term used to justify NoSQL
Scenario is simple needs, but “made up for in volume”
• Millions of concurrent users
Think of sites like Amazon or Google
Think of non-transactional tasks like loading catalog data to display a
product page, or environment preferences
75. Consistency
CAP Theorem
• Databases may only excel at two of the following three attributes:
consistency, availability and partition tolerance
NoSQL does not offer “ACID” guarantees
• Atomicity, consistency, isolation and durability
Instead offers “eventual consistency”
Similar to DNS propagation
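The eventual-consistency idea can be sketched in a few lines of Python. This is a toy model with made-up names, not any real database's API: writes land on one replica and propagate later, so a read from another replica can briefly return stale data.

```python
# Toy sketch of eventual consistency (made-up names, not a real DB API):
# writes land on one replica and propagate asynchronously, so reads from
# another replica can briefly be stale -- much like DNS propagation.

class Replica:
    def __init__(self):
        self.data = {}

class EventuallyConsistentStore:
    def __init__(self, replica_count=2):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = []  # writes not yet propagated

    def write(self, key, value, replica=0):
        self.replicas[replica].data[key] = value
        self.pending.append((key, value))

    def read(self, key, replica=0):
        return self.replicas[replica].data.get(key)

    def propagate(self):
        """Anti-entropy step: push pending writes to every replica."""
        for key, value in self.pending:
            for r in self.replicas:
                r.data[key] = value
        self.pending.clear()

if __name__ == "__main__":
    store = EventuallyConsistentStore()
    store.write("stock:item42", 0, replica=0)      # Seattle sees depletion
    print(store.read("stock:item42", replica=1))   # NY still stale: None
    store.propagate()
    print(store.read("stock:item42", replica=1))   # now consistent: 0
```

The window between `write` and `propagate` is exactly the window the inventory example on the next slide worries about.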
76. Consistency
Things like inventory and account balances should be consistent
• Imagine updating a server in Seattle that stock was depleted
• Imagine not updating the server in NY
• A customer in NY goes to order 50 pieces of the item
• The order is processed even though there is no stock
Things like catalog information don't have to be, at least not immediately
• If a new item is entered into the catalog, it's OK for some customers to see it
even before the other customers' server knows about it
But catalog info must come up quickly
• Therefore, don't lock data in one location while waiting to update the other
Therefore, it's OK to sacrifice consistency for speed, in some cases
78. Indexing
Most NoSQL databases are indexed by key
Some allow so-called “secondary” indexes
Often the primary key indexes are clustered
HBase uses HDFS (the Hadoop Distributed File System), which is
append-only
• Writes are logged
• Logged writes are batched
• File is re-created and sorted
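The three bullets above describe a log-structured pattern. Here is an illustrative Python sketch of that pattern (not HBase's actual implementation; all names are invented): puts only append to a log, and a compaction step re-creates the sorted file.

```python
# Illustrative sketch of the append-only pattern described above (not
# HBase's real implementation): writes are logged, logged writes are
# batched, and the data file is re-created in sorted order on compaction.

class AppendOnlyStore:
    def __init__(self):
        self.log = []          # write-ahead log: appended (key, value) pairs
        self.sorted_file = []  # immutable, sorted (key, value) pairs

    def put(self, key, value):
        self.log.append((key, value))  # never update in place

    def compact(self):
        """Merge logged writes into a new sorted file; last write wins."""
        merged = dict(self.sorted_file)
        merged.update(dict(self.log))  # later log entries overwrite earlier
        self.sorted_file = sorted(merged.items())
        self.log = []

    def get(self, key):
        # Newest log entries shadow both older entries and the sorted file.
        for k, v in reversed(self.log):
            if k == key:
                return v
        return dict(self.sorted_file).get(key)
```

The append-only discipline is what makes the pattern a fit for HDFS, which does not support in-place updates.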
79. Queries
Typically no query language
Instead, create procedural program
Sometimes SQL is supported
Sometimes MapReduce code is used…
80. MapReduce
This is not Hadoop's MapReduce, but it's conceptually related
Map step: pre-processes data
Reduce step: summarizes/aggregates data
Will show a MapReduce code sample for Mongo soon
Will demo map code on CouchDB
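MongoDB expresses map/reduce in JavaScript; as a preview of the shape of that code, here is a Python analogue with made-up document data: map emits (key, value) pairs per document, reduce folds the values for each key.

```python
# Python analogue of document-store map/reduce (MongoDB uses JavaScript;
# the documents and function names here are invented for illustration).

def map_order(doc, emit):
    """Map step: pre-process one document, emitting (key, value) pairs."""
    emit(doc["customer"], doc["total"])

def reduce_totals(key, values):
    """Reduce step: aggregate all values emitted for one key."""
    return sum(values)

def map_reduce(docs, map_fn, reduce_fn):
    emitted = {}
    def emit(key, value):
        emitted.setdefault(key, []).append(value)
    for doc in docs:
        map_fn(doc, emit)
    return {k: reduce_fn(k, v) for k, v in emitted.items()}

orders = [
    {"customer": 101, "total": 300},
    {"customer": 202, "total": 2500},
    {"customer": 101, "total": 150},
]

if __name__ == "__main__":
    print(map_reduce(orders, map_order, reduce_totals))  # {101: 450, 202: 2500}
```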
82. Sharding
A partitioning pattern where separate servers store partitions
Fan-out queries supported
Partitions may be duplicated, so replication also provided
• Good for disaster recovery
Since “shards” can be geographically distributed, sharding can act like a
CDN
Good for keeping data close to processing
• Reduces network traffic when MapReduce splitting takes place
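A toy sketch of hash-based sharding with a fan-out query, using dicts as stand-ins for separate servers (all names invented; real systems typically use consistent hashing so shards can be added without rehashing everything):

```python
# Toy sketch of hash-based sharding with a fan-out query. Each shard is
# a dict standing in for a separate server; names are made up.

class ShardedStore:
    def __init__(self, shard_count=3):
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # Simple deterministic placement; real systems use consistent
        # hashing so adding a shard doesn't rehash every key.
        return self.shards[hash(key) % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key):
        # Single-key reads touch exactly one shard.
        return self._shard_for(key).get(key)

    def fan_out_query(self, predicate):
        """Ask every shard, then merge the partial results."""
        results = []
        for shard in self.shards:
            results.extend(v for v in shard.values() if predicate(v))
        return results
```

Note the asymmetry: `get` touches one shard, while `fan_out_query` must visit all of them, which is why key-based access is the cheap path in sharded stores.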
84. Key-Value Stores
The most common; not necessarily the most popular
Has rows, each with something like a big dictionary/associative array
• Schema may differ from row to row
Common on cloud platforms
• e.g. Amazon SimpleDB, Azure Table Storage
MemcacheDB, Voldemort, Couchbase, DynamoDB
(AWS), Dynomite, Redis and Riak
85. Key-Value Stores
Database
  Table: Customers
    Row ID: 101
      First_Name: Andrew
      Last_Name: Brust
      Address: 123 Main Street
      Last_Order: 1501
    Row ID: 202
      First_Name: Jane
      Last_Name: Doe
      Address: 321 Elm Street
      Last_Order: 1502
  Table: Orders
    Row ID: 1501
      Price: 300 USD
      Item1: 52134
      Item2: 24457
    Row ID: 1502
      Price: 2500 GBP
      Item1: 98456
      Item2: 59428
86. Wide Column Stores
Has tables with declared column families
• Each column family has “columns” which are KV pairs that can vary from row to row
These are the most foundational for large sites
• BigTable (Google)
• HBase (Originally part of Yahoo-dominated Hadoop project)
• Cassandra (Facebook)
• Calls column families “super columns” and tables “super column families”
They are the most “Big Data”-ready
• Especially HBase + Hadoop
87. Wide Column Stores
Table: Customers
  Row ID: 101
    Super Column: Name
      Column: First_Name: Andrew
      Column: Last_Name: Brust
    Super Column: Address
      Column: Number: 123
      Column: Street: Main Street
    Super Column: Orders
      Column: Last_Order: 1501
  Row ID: 202
    Super Column: Name
      Column: First_Name: Jane
      Column: Last_Name: Doe
    Super Column: Address
      Column: Number: 321
      Column: Street: Elm Street
    Super Column: Orders
      Column: Last_Order: 1502
Table: Orders
  Row ID: 1501
    Super Column: Pricing
      Column: Price: 300 USD
    Super Column: Items
      Column: Item1: 52134
      Column: Item2: 24457
  Row ID: 1502
    Super Column: Pricing
      Column: Price: 2500 GBP
    Super Column: Items
      Column: Item1: 98456
      Column: Item2: 59428
89. Document Stores
Have “databases,” which are akin to tables
Have “documents,” akin to rows
• Documents are typically JSON objects
• Each document has properties and values
• Values can be scalars, arrays, links to documents in other databases or sub-documents (i.e. contained
JSON objects - Allows for hierarchical storage)
• Can have attachments as well
Old versions are retained
• So Doc Stores work well for content management
Some view doc stores as specialized KV stores
Most popular with developers, startups, VCs
The biggies:
• CouchDB
• Derivatives
• MongoDB
90. Document Store Application Orientation
Documents can each be addressed by URIs
CouchDB supports full REST interface
Very geared towards JavaScript and JSON
• Documents are JSON objects
• CouchDB/MongoDB use JavaScript as native language
In CouchDB, “view functions” also have unique URIs and they return
HTML
• So you can build entire applications in the database
91. Document Stores
Database: Customers
  Document ID: 101
    First_Name: Andrew
    Last_Name: Brust
    Address:
      Number: 123
      Street: Main Street
    Orders:
      Most_recent: 1501
  Document ID: 202
    First_Name: Jane
    Last_Name: Doe
    Address:
      Number: 321
      Street: Elm Street
    Orders:
      Most_recent: 1502
Database: Orders
  Document ID: 1501
    Price: 300 USD
    Item1: 52134
    Item2: 24457
  Document ID: 1502
    Price: 2500 GBP
    Item1: 98456
    Item2: 59428
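The customer document above maps directly to a JSON object with nested sub-documents, which is how document stores persist it. A small sketch (the field names mirror the slide; the round trip is illustrative):

```python
# The customer above, modeled as a document-store document: nested
# sub-documents (Address, Orders) live inside one JSON object, giving
# the hierarchical storage described on the previous slides.
import json

customer_101 = {
    "_id": 101,
    "First_Name": "Andrew",
    "Last_Name": "Brust",
    "Address": {"Number": 123, "Street": "Main Street"},   # sub-document
    "Orders": {"Most_recent": 1501},                       # sub-document
}

if __name__ == "__main__":
    doc = json.dumps(customer_101)      # what the store would persist
    loaded = json.loads(doc)
    print(loaded["Address"]["Street"])  # Main Street
```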
94. Graph Databases
Great for social network applications and others where relationships are
important
Nodes and edges
• Edge like a join
• Nodes like rows in a table
Nodes can also have properties and values
Neo4j is a popular graph db
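The nodes-and-edges model can be sketched in a few lines of Python. This is an illustrative property-graph toy, not the Neo4j API; the node IDs and edge types are invented.

```python
# Minimal property-graph sketch (illustrative, not the Neo4j API):
# nodes carry properties; typed edges act like pre-materialized joins.

class Graph:
    def __init__(self):
        self.nodes = {}   # node_id -> properties dict
        self.edges = []   # (from_id, edge_type, to_id)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, from_id, edge_type, to_id):
        self.edges.append((from_id, edge_type, to_id))

    def neighbors(self, node_id, edge_type):
        """Follow edges of one type out of a node -- no join needed."""
        return [t for f, e, t in self.edges
                if f == node_id and e == edge_type]

g = Graph()
g.add_node("andrew", name="Andrew Brust")
g.add_node("jane", name="Jane Doe")
g.add_node("joe", name="Joe Smith")
g.add_edge("andrew", "friend_of", "jane")
g.add_edge("andrew", "friend_of", "joe")

if __name__ == "__main__":
    print(g.neighbors("andrew", "friend_of"))  # ['jane', 'joe']
```

Traversing `friend_of` edges here is the graph equivalent of the relational join the slide alludes to, which is why these databases shine for social-network queries.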
95. Graph Databases (diagram)
Nodes: Andrew Brust, Jane Doe, Joe Smith, George Washington; an Address (Street: 123 Main Street, City: New York, State: NY, Zip: 10014); an Order (ID: 252, Total Price: 300 USD); Item 1 (ID: 52134, Type: Dress, Color: Blue); Item 2 (ID: 24457, Type: Shirt, Color: Red)
Edges: Friend of, Sent invitation to, Commented on photo by, Placed order, Address
96. NoSQL on Windows Azure
Platform as a Service
• Cloudant: https://cloudant.com/azure/
• MongoDB (via MongoLab): http://blog.mongolab.com/2012/10/azure/
MongoDB, DIY:
• On an Azure Worker Role:
http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+Worker+Roles
• On a Windows VM:
http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+VM+-+Windows+Installer
• On a Linux VM:
http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+VM+-+Linux+Tutorial
http://www.windowsazure.com/en-us/manage/linux/common-tasks/mongodb-on-a-linux-vm/
98. NoSQL + BI
NoSQL databases are bad for ad hoc query and data warehousing
BI applications involve models; models rely on schema
Extract, transform and load (ETL) may be your friend
Wide-column stores, however, are good for “Big Data”
• See next slide
Wide-column stores and column-oriented databases are similar
technologically
99. NoSQL + Big Data
Big Data and NoSQL are interrelated
Typically, Wide-Column stores used in Big Data scenarios
Prime example:
• HBase and Hadoop
Why?
• Lack of indexing not a problem
• Consistency not an issue
• Fast reads very important
• Distributed file systems important too
• Commodity hardware and disk assumptions also important
• Not Web scale but massive scale-out, so similar concerns
101. Common DBA Tasks in NoSQL
RDBMS                         | NoSQL
------------------------------|---------------------------------
Import Data                   | Import Data
Setup Security                | Setup Security
Perform a Backup              | Make a copy of the data
Restore a Database            | Move a copy to a location
Create an Index               | Create an Index
Join Tables Together          | Run MapReduce
Schedule a Job                | Schedule a (Cron) Job
Run Database Maintenance      | Monitor space and resources used
Send an Email from SQL Server | Set up resource threshold alerts
Search BOL                    | Interpret Documentation
102. Which Type of NoSQL for Which Type of Data?
Type of Data             | Type of NoSQL Solution | Example
-------------------------|------------------------|-----------
Log files                | Wide Column            | HBase
Product catalogs         | Key-Value on disk      | DynamoDB
User profiles            | Key-Value in memory    | Redis
Startups                 | Document               | MongoDB
Social media connections | Graph                  | Neo4j
LOB w/Transactions       | NONE! Use RDBMS        | SQL Server
103. Relational vs. NoSQL
Line of Business -> Relational
Large, public (consumer)-facing sites -> NoSQL
Complex data structures -> Relational
Big Data -> NoSQL
Transactional -> Relational
Content Management -> NoSQL
Enterprise -> Relational
Consumer Web -> NoSQL
105. NoSQL To-Do List
Understand CAP & the types of NoSQL databases
• Use NoSQL when business needs dictate
• Use the right type of NoSQL for your business problem
Try out NoSQL in the cloud
• Quick and cheap for behavioral data
• Mash up cloud datasets
• Good for specialized use cases, e.g. dev, test, training environments
Learn NoSQL access technologies
• New query languages, e.g. MapReduce, R, Infer.NET
• New query tools (vendor-specific) – Google Refine, Amazon
Karmasphere, Microsoft Excel connectors, etc.
106. NoSQL for .NET Developers
RavenDB
MongoDB C#/.NET Driver
MongoDB on Windows Azure
CouchBase .NET Client Library
Riak client for .NET
AWS Toolkit for Visual Studio
Google cloud APIs (REST-based)
http://nosql-database.org/
http://hadoop.apache.org/ & http://www.mongodb.org/
Wikipedia - http://en.wikipedia.org/wiki/NoSQL
List of NoSQL databases - http://nosql-database.org/
The good, the bad - http://www.techrepublic.com/blog/10things/10-things-you-should-know-about-nosql-databases/1772
When the volume of data is too much for simple human interpretation -> Man PLUS Machine (Data Mining / Statistics)
About Data Science - http://www.romymisra.com/the-new-job-market-rulers-data-scientists/
R language - http://www.r-project.org/
Infer.NET - http://research.microsoft.com/en-us/um/cambridge/projects/infernet/
There are a plethora of languages to access, manipulate and process Big Data. These languages fall into a few categories:
RESTful - simple, standards-based
ETL - Pig (Hadoop) is an example
Query - Hive (again Hadoop), lots of *QL
Analyze - R, Mahout, Infer.NET, DMX, etc., applying statistical (data-mining) algorithms to the data output