April 10-12 | Chicago, IL
Big Data and NoSQL for
Database and BI Pros
Andrew J. Brust, Founder and CEO, Blue Badge Insights
Please silence
cell phones
Meet Andrew
CEO and Founder, Blue Badge Insights
Big Data blogger for ZDNet
Microsoft Regional Director, MVP
Co-chair VSLive! and 17 years as a speaker
Founder, Microsoft BI User Group of NYC
• http://www.msbinyc.com
Co-moderator, NYC .NET Developers Group
• http://www.nycdotnetdev.com
“Redmond Review” columnist for Visual Studio Magazine and Redmond Developer News
brustblog.com, Twitter: @andrewbrust
Andrew’s New Blog (bit.ly/bigondata)
Lynn Langit (in absentia)
CEO and Founder, Lynn Langit consulting
Former Microsoft Evangelist (4 years)
Google Developer Expert
MongoDB Master
MCT 13 years – 7 certifications
Cloudera Certified Developer
MSDN Magazine articles
• SQL Azure
• Hadoop on Azure
• MongoDB on Azure
www.LynnLangit.com
@LynnLangit
Read all about it!
Agenda
Overview / Landscape
• Big Data, and Hadoop
• NoSQL
• The Big Data-NoSQL Intersection
Drilldown on Big Data
Drilldown on NoSQL
What is Big Data?
100s of TB into PB and higher
Involving data from: financial data, sensors, web logs, social media, etc.
Parallel processing often involved
Hadoop is emblematic, but other technologies are Big Data too
Processing of data sets too large for transactional databases
Analyzing interactions, rather than transactions
The three V’s: Volume, Velocity, Variety
Big Data tech sometimes imposed on small data problems
Big Data = Exponentially More Data
Retail Example -> ‘Feedback Economy’
• Number of transactions
• Number of behaviors (collected every minute)
Big Data = ‘Next State’ Questions
• What could happen?
• Why didn’t this happen?
• When will the next new thing happen?
• What will the next new thing be?
• What happens?
Collecting behavioral data
My Data: An Example from Health Care
Medical records
• Regular
• Emergency
• Genetic data – 23andMe
Food data
• SparkPeople
Purchasing
• Grocery card
• credit card
Search – Google
Social media
• Twitter
• Facebook
Exercise
• Nike Fuel Band
• Kinect
• Location - phone
Big Data = More Data
Big Data Considerations
Collection – get the data
Storage – keep the data
Querying – make sense of the data
Visualization – see the business value
Data Collection
Types of Data
• Structured, semi-structured, unstructured vs. data standards
• Behavioral vs. transactional data
Methods of collection
• Sensors everywhere
• Machine-2-Machine
• Public Datasets
• Freebase
• Azure DataMarket
• Hilary Mason’s list
What’s MapReduce?
Partition the bulk input data and send to mappers (nodes in cluster)
Mappers pre-process, put into key-value format, and send all output for
a given (set of) key(s) to a reducer
Reducer aggregates; one output per key, with value
Map and Reduce code natively written as Java functions
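The partition / map / group-by-key / reduce flow described above can be sketched in a few lines of Python. This is an in-memory illustration only, not Hadoop code; the function names (mapper, shuffle, reducer, run_job) and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict


def mapper(line):
    """Pre-process one input record into (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Collect all values for a given key so one reducer call sees them together."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Aggregate: one output per key, with value."""
    return key, sum(values)


def run_job(lines):
    """Run the map, shuffle and reduce phases sequentially (hypothetical driver)."""
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())
```

On a real cluster the mapper calls would run in parallel on different nodes; here they run one after another so the data flow is easy to follow.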
MapReduce, in a Diagram
[Diagram: blocks of Input feed six mappers; mapper Output is routed by key to three reducers (K1, K4; K2, K5; K3, K6), and each reducer writes its own Output.]
A MapReduce Example
• Count by suite, on each floor
• Send per-suite, per platform totals to lobby
• Sort totals by platform
• Send two platform packets to 10th, 20th, 30th floor
• Tally up each platform
• Collect the tallies
• Merge tallies into one spreadsheet
What’s a Distributed File System?
One where data gets distributed over commodity drives on commodity
servers
Data is replicated
• If one box goes down, no data lost
• “Shared Nothing”
BUT: Immutable
• Files can only be written to once
• So updates require drop + re-write (slow)
• You can append though
• Like a DVD/CD-ROM
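The replication point above ("if one box goes down, no data lost") can be illustrated with a small sketch. It assumes a simple round-robin placement and a replication factor of 3; the function names are hypothetical, and a real distributed file system uses a more sophisticated placement policy.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (assumed policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


def surviving_blocks(placement, failed_node):
    """Return the blocks that still have at least one replica after a node fails."""
    return {
        block
        for block, holders in placement.items()
        if any(node != failed_node for node in holders)
    }
```

With three replicas per block, losing any single node still leaves two copies of every block it held.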
Hadoop = MapReduce + HDFS
Modeled after Google MapReduce + GFS
Have more data? Just add more nodes to cluster.
• Mappers execute in parallel
• Hardware is commodity
• “Scaling out”
Use of HDFS means data may well be local to mapper processing
• So, not just parallel, but minimal data movement, which avoids
network bottlenecks
Comparison: RDBMS vs. Hadoop
                      Traditional RDBMS         Hadoop / MapReduce
Data Size             Gigabytes (Terabytes)     Petabytes (Exabytes)
Updates               Read / Write many times   Write once, Read many times
Integrity             High (ACID)               Low
Query Response Time   Can be near immediate     Has latency (due to batch processing)
Just-in-Time Schema
When looking at unstructured data, schema is imposed at query time
Schema is context specific
• If scanning a book, are the values words, lines, or pages?
• Are notes a single field, or is each word a value?
• Are date and time two fields or one?
• Are street, city, state, zip separate or one value?
• Pig and Hive let you determine this at query time
• So does the Map function in MapReduce code
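As a rough illustration of schema imposed at query time, the sketch below reads the same raw record two ways: once with date/time and address as single values, once split into separate fields. The record layout (a pipe between timestamp and address, commas inside the address) and the function names are assumptions made up for this example.

```python
RAW = "2013-04-10 09:15:00|123 Main Street, New York, NY 10014"


def as_two_fields(record):
    """Schema A: timestamp is one field, address is one field."""
    stamp, address = record.split("|", 1)
    return {"timestamp": stamp, "address": address}


def as_separate_fields(record):
    """Schema B: date and time are split, and so are street, city, state, zip."""
    stamp, address = record.split("|", 1)
    date, time = stamp.split(" ")
    street, city, state_zip = [part.strip() for part in address.split(",")]
    state, zip_code = state_zip.split(" ")
    return {
        "date": date,
        "time": time,
        "street": street,
        "city": city,
        "state": state,
        "zip": zip_code,
    }
```

The stored data never changes; only the parsing function, chosen by the person asking the question, decides what counts as a field.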
What’s HBase?
A Wide-Column Store NoSQL database
Modeled after Google BigTable
Uses HDFS
Therefore, Hadoop-compatible
Hadoop MapReduce often used with HBase
But you can use either without the other
NoSQL Confusion
Many ‘flavors’ of NoSQL data stores
Easiest to group by functionality, but…
• Dividing lines are not clear or consistent
NoSQL choice(s) driven by many factors
• Type of data
• Quantity of data
• Knowledge of technical staff
• Product maturity
• Tooling
So much wrong information
• Everything is ‘new’
• People are religious about data storage
• Lots of incorrect information
• ‘Try’ before you ‘buy’ (or use)
• Watch out for oversimplification
• Confusion over vendor offerings
Common NoSQL Misconceptions
Problems
Everything is ‘new’
People are religious about data storage
Open source is always cheaper
Cloud is always cheaper
Replace RDBMS with NoSQL
Solutions
‘Try’ before you ‘buy’ (or use)
Leverage NoSQL communities
Add NoSQL to existing RDBMS solution
Drilldown on Big Data
The Hadoop Stack
MapReduce, HDFS
Database
RDBMS Import/Export
Query: HiveQL and Pig Latin
Machine Learning/Data Mining
Log file integration
What’s Hive?
Began as Hadoop sub-project
Now top-level Apache project
Provides a SQL-like (“HiveQL”) abstraction over MapReduce
Has its own HDFS table file format (and it’s fully schema-bound)
Can also work over HBase
Acts as a bridge to many BI products which expect tabular data
Hadoop Distributions
Cloudera
Hortonworks
HCatalog: Hive/Pig/MR Interop
MapR
Network File System replaces HDFS
IBM InfoSphere BigInsights
HDFS<->DB2 integration
And now Microsoft…
Microsoft HDInsight
Developed with Hortonworks and incorporates Hortonworks Data
Platform (HDP) for Windows
Windows Azure HDInsight and Microsoft HDInsight (for Windows
Server)
• Single node preview runs on Windows client
Includes ODBC Driver for Hive
JavaScript MapReduce framework
Contribute it all back to open source Apache Project
Hortonworks Data Platform
for Windows
Amenities for Visual Studio/.NET
• MRLib (NuGet Package)
• LINQ to Hive
• OdbcClient + Hive ODBC Driver
• Deployment
• Debugging
• MR code in C#: HadoopJob, MapperBase, ReducerBase
Some ways to work
Microsoft HDInsight
• Cloud: go to www.windowsazure.com, request a cluster
• Local: Download Microsoft HDInsight
• Runs on just about anything, including Windows XP
• Get it via the Web Platform installer (WebPI)
• Local version is free; cloud billed at 50% discount during preview
Amazon Web Services Elastic MapReduce
• Create AWS account
• Select Elastic MapReduce in Dashboard
• Cheap for experimenting, but not free
Cloudera CDH VM image
• Download as .tar.gz file
• “Un-tar” (can use WinRAR, 7zip)
• Run via VMWare Player or Virtual Box
• Everything’s free
Some ways to work
HDInsight EMR CDH 4
Microsoft HDInsight
Much simpler than the others
Browser-based portal
• Launch MapReduce jobs
• Azure: Provisioning cluster, managing ports, gather external data
Interactive JavaScript & Hive console
• JS: HDFS, Pig, light data visualization
• Hive commands and metadata discovery
• New console coming
Desktop Shortcuts:
• Command window, MapReduce, Name Node status in browser
• Azure: from portal page you can RDP directly to Hadoop head node for these
desktop shortcuts
Demo
Windows Azure HDInsight
Amazon Elastic MapReduce
Lots of steps!
At a high level:
• Set up AWS account and S3 “buckets”
• Generate Key Pair and PEM file
• Install Ruby and EMR Command Line Interface
• Provision the cluster using CLI
• A batch file can work very well here
• Set up and run SSH/PuTTY
• Work interactively at command line
Demo
Amazon Elastic MapReduce
Cloudera CDH4 Virtual Machine
Get it for free, in VMWare and Virtual Box versions.
• VMWare player and Virtual Box are free too
Run it, and configure it to have its own IP on your network. Use ifconfig to
discover IP.
Assuming IP of 192.168.1.59, open browser on your own (host) machine and
navigate to:
• http://192.168.1.59:8888
Can also use browser in VM and hit:
• http://localhost:8888
Work in “Hue”…
Hue
Browser based UI, with front
ends for:
HDFS (w/ upload & download)
MapReduce job creation and
monitoring
Hive (“Beeswax”)
And in-browser command line
shells for:
HBase
Pig (“Grunt”)
Impala: What it Is
Distributed SQL query engine over Hadoop cluster
Announced at Strata/Hadoop World in NYC on October 24th
In Beta, as part of CDH 4.1
Works with HDFS and Hive data
Compatible with HiveQL and Hive drivers
• Query with Beeswax
Impala: What it’s Not
Impala is not Hive
• Hive converts HiveQL to Java MapReduce code and executes it in batch
mode
• Impala executes query interactively over the data
• Brings BI tools and Hadoop closer together
Impala is not an Apache Software Foundation project
• It is open source and Apache-licensed, but it’s still incubated by Cloudera
• Only in CDH
Demo
Cloudera CDH4, Impala
Hadoop commands
HDFS
• hadoop fs -filecommand
• Create and remove directories
• -mkdir, -rm, -rmr
• Upload and download files to/from HDFS
• -get, -put
• View directory contents
• -ls, -lsr
• Copy, move, view files
• -cp, -mv, -cat
MapReduce
• Run a Java jar-file based job
• hadoop jar jarname params
Demo
Hadoop (directly)
HBase
Concepts:
• Tables, column families
• Columns, rows
• Keys, values
Commands:
• Definition: create, alter, drop, truncate
• Manipulation: get, put, delete, deleteall, scan
• Discovery: list, exists, describe, count
• Enablement: disable, enable
• Utilities: version, status, shutdown, exit
• Reference: http://wiki.apache.org/hadoop/Hbase/Shell
Moreover,
• Interesting HBase work can be done in MapReduce, Pig
HBase Examples
create 't1', 'f1', 'f2', 'f3'
describe 't1'
alter 't1', {NAME => 'f1', VERSIONS => 5}
put 't1', 'r1', 'f1:c1', 'value'
get 't1', 'r1'
count 't1'
Demo
HBase
Submitting, Running and Monitoring
Jobs
Upload a JAR
Use Streaming
• Use other languages (i.e. other than Java) to write MapReduce code
• Python is popular option
• Any executable works, even C# console apps
• On MS HDInsight, JavaScript works too
• Still uses a JAR file: streaming.jar
Run at command line (passing JAR name and params) or use GUI
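A streaming job exchanges plain text lines with the mapper and reducer over stdin and stdout, conventionally as tab-separated key/value pairs, and the reducer receives its input sorted by key. The sketch below shows a word-count mapper and reducer in Python written against that convention. The function names are hypothetical, and the functions take any iterable of lines so they can be tried without a cluster.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: input arrives sorted by key, so equal keys are adjacent."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"
```

In an actual streaming script, each function would be driven by sys.stdin and its results printed to stdout; sorting between the two steps is done by the framework, which the example imitates with sorted().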
Demo
Running MapReduce Jobs
Hive
Used by most BI products which connect to Hadoop
Provides a SQL-like abstraction over Hadoop
• Officially HiveQL, or HQL
Works on own tables, but also on HBase
Query generates MapReduce job, output of which becomes result set
Microsoft has Hive ODBC driver
• Connects Excel, Reporting Services, PowerPivot, Analysis Services Tabular
Mode (only)
Hive, Continued
Load data from flat HDFS files
• LOAD DATA [LOCAL] INPATH 'myfile'
INTO TABLE mytable;
SQL Queries
• CREATE, ALTER, DROP
• INSERT OVERWRITE (creates whole tables)
• SELECT, JOIN, WHERE, GROUP BY
• SORT BY, but ordering data is tricky!
• MAP/REDUCE/TRANSFORM…USING allows for custom map, reduce steps
utilizing Java or streaming code
Data Explorer
• Beta add-in for Excel
• Acquire, transform data
• Data sources include
Facebook, HDFS
• Visually- or script-driven
• Also includes Azure BLOB
storage backing up
HDInsight
Demo
Hive, Data Explorer
Pig
Instead of SQL, employs a language (“Pig Latin”) that accommodates data flow
expressions
• Do a combo of Query and ETL
“10 lines of Pig Latin ≈ 200 lines of Java.”
Works with structured or unstructured data
Operations
• As with Hive, a MapReduce job is generated
• Unlike Hive, output is only flat file to HDFS or text at command line console
• With HDInsight, can easily convert to JavaScript array, then manipulate
Use command line (“Grunt”) or build scripts
Example
A = LOAD 'myfile'
AS (x, y, z);
B = FILTER A BY x > 0;
C = GROUP B BY x;
D = FOREACH C GENERATE
group, COUNT(B);
STORE D INTO 'output';
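For readers more at home in a general-purpose language, the same data flow (load rows, keep those with x > 0, group by x, count each group) might look like this in Python. It is only an analogy for the logic, not how Pig executes it; the function name and the tuple layout (x, y, z) are assumptions.

```python
from collections import Counter


def count_positive_by_x(rows):
    """rows: iterable of (x, y, z) tuples, mirroring the AS (x, y, z) schema."""
    filtered = [row for row in rows if row[0] > 0]   # FILTER ... BY x > 0
    return Counter(row[0] for row in filtered)        # GROUP ... BY x, then COUNT
```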
Pig Latin Examples
Imperative, file system commands
• LOAD, STORE
• Schema specified on LOAD
Declarative, query commands (SQL-like)
• xxx = file or data set
• FOREACH xxx GENERATE (SELECT…FROM xxx)
• JOIN (WHERE/INNER JOIN)
• FILTER xxx BY (WHERE)
• ORDER xxx BY (ORDER BY)
• GROUP xxx BY / GENERATE COUNT(xxx)
(SELECT COUNT(*) GROUP BY)
• DISTINCT (SELECT DISTINCT)
Syntax is assignment statement-based:
• MyCusts = FILTER Custs BY SalesPerson == 15;
Access HBase
• CpuMetrics = LOAD 'hbase://SystemMetrics' USING
org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:','-loadKey -returnTuple');
Demo
Pig
Sqoop
sqoop import
--connect
"jdbc:sqlserver://<servername>.
database.windows.net:1433;
database=<dbname>;
user=<username>@<servername>;
password=<password>"
--table <from_table>
--target-dir <to_hdfs_folder>
--split-by <from_table_column>
Sqoop
sqoop export
--connect
"jdbc:sqlserver://<servername>.
database.windows.net:1433;
database=<dbname>;
user=<username>@<servername>;
password=<password>"
--table <to_table>
--export-dir <from_hdfs_folder>
--input-fields-terminated-by
"<delimiter>"
Flume NG
Source
• Avro (data serialization system – can read json-encoded data files, and can
work over RPC)
• Exec (reads from stdout of long-running process)
Sinks
• HDFS, HBase, Avro
Channels
• Memory, JDBC, file
Flume NG (next generation)
Set up conf/flume.conf
# Define a memory channel called ch1 on agent1
agent1.channels.ch1.type = memory
# Define an Avro source called avro-source1 on agent1 and tell it
# to bind to 0.0.0.0:41414. Connect it to channel ch1.
agent1.sources.avro-source1.channels = ch1
agent1.sources.avro-source1.type = avro
agent1.sources.avro-source1.bind = 0.0.0.0
agent1.sources.avro-source1.port = 41414
# Define a logger sink that simply logs all events it receives
# and connect it to the other end of the same channel.
agent1.sinks.log-sink1.channel = ch1
agent1.sinks.log-sink1.type = logger
# Finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.
agent1.channels = ch1
agent1.sources = avro-source1
agent1.sinks = log-sink1
From the command line:
flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1
Mahout Algorithms
Recommendation
• Your info + community info
• Give users/items/ratings; get user-user/item-item
• itemsimilarity
Classification/Categorization
• Drop into buckets
• Naïve Bayes, Complementary Naïve Bayes, Decision Forests
Clustering
• Like classification, but with categories unknown
• K-Means, Fuzzy K-Means, Canopy, Dirichlet, Mean-Shift
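To give a feel for the item-item idea behind recommendation (give users/items, get item-item relationships), here is a deliberately simplified sketch that counts how many users two items have in common. Plain co-occurrence counting is a stand-in chosen for brevity, not the similarity measure Mahout computes, and the function name and sample data are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def item_cooccurrence(ratings):
    """ratings: iterable of (user, item). Returns {(item_a, item_b): shared users}."""
    items_by_user = defaultdict(set)
    for user, item in ratings:
        items_by_user[user].add(item)

    counts = defaultdict(int)
    for items in items_by_user.values():
        # Sort so each pair is always keyed in the same order.
        for item_a, item_b in combinations(sorted(items), 2):
            counts[(item_a, item_b)] += 1
    return dict(counts)
```

Pairs with higher counts are candidates for "people who liked this also liked that"; a production engine would normalize these counts statistically.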
Workflow, Syntax
Workflow
• Run the job
• Dump the output
• Visualize, predict
mahout algorithm
--input folderspec
--output folderspec
--param1 value1
--param2 value2
…
Example:
• mahout itemsimilarity
--input <input-hdfs-path>
--output <output-hdfs-path>
--tempDir <tmp-hdfs-path>
-s SIMILARITY_LOGLIKELIHOOD
The Truth About Mahout
Mahout is really just an algorithm engine
Its output is almost unusable by non-statisticians/non-data scientists
You need a staff or a product to visualize, or make into a usable
prediction model
Investigate Predixion Software
• CTO, Jamie MacLennan, used to lead SQL Server Data Mining team
• Excel add-in can use Mahout remotely, visualize its output, run predictive
analyses
• Also integrates with SQL Server, Greenplum, MapReduce
• http://www.predixionsoftware.com
The “Data-Refinery” Idea
Use Hadoop to “on-board” unstructured data, then extract manageable
subsets
Load the subsets into conventional DW/BI servers and use familiar
analytics tool to examine
This is the current rationalization of Hadoop + BI tools’ coexistence
Will it stay this way?
Google BigQuery
Dremel-based service for massive amounts of data
Pay for query and storage
SQL-like query language
Has an Excel connector
Google BigQuery
Drilldown on NoSQL
NoSQL Data Fodder
Addresses Preferences
Notes
Friends, Followers
Documents
“Web Scale”
This is the term used to justify NoSQL
Scenario is simple needs but “made up for in
volume”
• Millions of concurrent users
Think of sites like Amazon or Google
Think of non-transactional tasks like loading
catalog data to display product page, or
environment preferences
NoSQL Common Traits
Non-relational
Non-schematized/schema-free
Open source
Distributed
Eventual consistency
“Web scale”
Developed at big Internet companies
More than just the Elephant in the room
More than 120 types of NoSQL databases
So many NoSQL options
Concepts
Consistency
CAP Theorem
Indexing
Queries
MapReduce
Sharding
Consistency
CAP Theorem
• Databases may only excel at two of the following three attributes:
consistency, availability and partition tolerance
NoSQL does not offer “ACID” guarantees
• Atomicity, consistency, isolation and durability
Instead offers “eventual consistency”
Similar to DNS propagation
Things like inventory, account balances should be consistent
• Imagine updating a server in Seattle that stock was depleted
• Imagine not updating the server in NY
• Customer in NY goes to order 50 pieces of the item
• Order processed even though no stock
Things like catalog information don’t have to be, at least not immediately
• If a new item is entered into the catalog, it’s OK for some customers to see it
even before the other customers’ server knows about it
But catalog info must come up quickly
• Therefore don’t lock data in one location while waiting to update the other
Therefore, OK to sacrifice consistency for speed, in some cases
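The Seattle / NY inventory scenario above can be played out in a small simulation. It assumes two replicas and an explicit propagate() step standing in for replication lag; the class and method names are hypothetical and do not correspond to any particular product.

```python
class EventuallyConsistentStore:
    """Two or more replicas; a write lands on one and reaches the others later."""

    def __init__(self, replica_names, initial_stock):
        self.replicas = {name: initial_stock for name in replica_names}
        self.pending = []  # (replica name, new stock) updates not yet delivered

    def write(self, at, stock):
        """Apply the write locally and queue it for every other replica."""
        self.replicas[at] = stock
        for name in self.replicas:
            if name != at:
                self.pending.append((name, stock))

    def read(self, at):
        """Read whatever the local replica currently believes."""
        return self.replicas[at]

    def propagate(self):
        """Deliver all queued updates (the 'eventual' part)."""
        for name, stock in self.pending:
            self.replicas[name] = stock
        self.pending.clear()
```

Until propagate() runs, a reader in NY still sees the old stock level, which is exactly the window in which the 50-piece order would be accepted.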
Consistency
CAP Theorem
[Diagram: the three CAP attributes (Consistency, Availability, Partition Tolerance), with Relational and NoSQL positioned among them]
Indexing
Most NoSQL databases are indexed by key
Some allow so-called “secondary” indexes
Often the primary key indexes are clustered
HBase uses HDFS (the Hadoop Distributed File System), which is
append-only
• Writes are logged
• Logged writes are batched
• File is re-created and sorted
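The logged / batched / re-created-and-sorted write path can be sketched as follows. This is a toy model of the general append-only pattern, not HBase's actual implementation or API; the class name, the flush threshold and the method names are all assumptions.

```python
class AppendOnlyTable:
    """Toy model: log every write, batch in memory, flush to immutable sorted files."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.log = []      # writes are logged
        self.buffer = {}   # logged writes are batched
        self.files = []    # each file is a sorted list of (key, value), never edited

    def put(self, key, value):
        self.log.append((key, value))
        self.buffer[key] = value
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the current batch out as a new sorted, immutable file."""
        if self.buffer:
            self.files.append(sorted(self.buffer.items()))
            self.buffer = {}

    def compact(self):
        """Re-create one sorted file; newer files override older ones."""
        merged = {}
        for file in self.files:
            merged.update(file)
        self.files = [sorted(merged.items())]

    def get(self, key):
        """Check the in-memory batch first, then files from newest to oldest."""
        if key in self.buffer:
            return self.buffer[key]
        for file in reversed(self.files):
            for stored_key, value in file:
                if stored_key == key:
                    return value
        return None
```

An "update" never modifies an existing file: the new value is appended, and the old one disappears only when the files are merged and rewritten.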
Queries
Typically no query language
Instead, create procedural program
Sometimes SQL is supported
Sometimes MapReduce code is used…
MapReduce
This is not Hadoop’s MapReduce, but it’s conceptually related
Map step: pre-processes data
Reduce step: summarizes/aggregates data
Will show a MapReduce code sample for Mongo soon
Will demo map code on CouchDB
Sharding
A partitioning pattern where separate servers store partitions
Fan-out queries supported
Partitions may be duplicated, so replication also provided
• Good for disaster recovery
Since “shards” can be geographically distributed, sharding can act like a
CDN
Good for keeping data close to processing
• Reduces network traffic when MapReduce splitting takes place
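A minimal sketch of the sharding pattern, assuming hash-based routing of keys to shards and a fan-out query that asks every shard and merges the answers. Each "server" is just an in-memory dict here, and the class and method names are hypothetical.

```python
import hashlib


class ShardedStore:
    """Each shard stands in for a separate server holding one partition."""

    def __init__(self, shard_count):
        self.shards = [dict() for _ in range(shard_count)]

    def _shard_for(self, key):
        """Route a key to a shard with a stable hash (assumed routing scheme)."""
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key, value):
        self.shards[self._shard_for(key)][key] = value

    def get(self, key):
        """Single-key lookups touch exactly one shard."""
        return self.shards[self._shard_for(key)].get(key)

    def fan_out(self, predicate):
        """Fan-out query: ask every shard, then merge the partial results."""
        results = []
        for shard in self.shards:
            results.extend((k, v) for k, v in shard.items() if predicate(k, v))
        return sorted(results, key=lambda pair: pair[0])
```

Key lookups stay cheap because only one shard is consulted; queries that cannot be answered from the key alone pay the cost of visiting all shards.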
NoSQL Categories
Graph
Wide Column
Document
Key/Value
Key-Value Stores
The most common; not necessarily the most popular
Has rows, each with something like a big dictionary/associative array
• Schema may differ from row to row
Common on cloud platforms
• e.g. Amazon SimpleDB, Azure Table Storage
MemcacheDB, Voldemort, Couchbase, DynamoDB
(AWS), Dynomite, Redis and Riak
Key-Value Stores
Table: Customers
Row ID: 101
First_Name: Andrew
Last_Name: Brust
Address: 123 Main Street
Last_Order: 1501
Row ID: 202
First_Name: Jane
Last_Name: Doe
Address: 321 Elm Street
Last_Order: 1502
Table: Orders
Row ID: 1501
Price: 300 USD
Item1: 52134
Item2: 24457
Row ID: 1502
Price: 2500 GBP
Item1: 98456
Item2: 59428
Database
Wide Column Stores
Has tables with declared column families
• Each column family has “columns” which are KV pairs that can vary from row to row
These are the most foundational for large sites
• BigTable (Google)
• HBase (Originally part of Yahoo-dominated Hadoop project)
• Cassandra (Facebook)
• Calls column families “super columns” and tables “super column families”
They are the most “Big Data”-ready
• Especially HBase + Hadoop
Wide Column Stores
Table: Customers
Row ID: 101
Super Column: Name
Column: First_Name: Andrew
Column: Last_Name: Brust
Super Column: Address
Column: Number: 123
Column: Street: Main Street
Super Column: Orders
Column: Last_Order: 1501
Row ID: 202
Super Column: Name
Column: First_Name: Jane
Column: Last_Name: Doe
Super Column: Address
Column: Number: 321
Column: Street: Elm Street
Super Column: Orders
Column: Last_Order: 1502
Table: Orders
Row ID: 1501
Super Column: Pricing
Column: Price: 300 USD
Super Column: Items
Column: Item1: 52134
Column: Item2: 24457
Row ID: 1502
Super Column: Pricing
Column: Price: 2500 GBP
Super Column: Items
Column: Item1: 98456
Column: Item2: 59428
Demo
Wide Column Stores
Document Stores
Have “databases,” which are akin to tables
Have “documents,” akin to rows
• Documents are typically JSON objects
• Each document has properties and values
• Values can be scalars, arrays, links to documents in other databases or sub-documents (i.e. contained
JSON objects - Allows for hierarchical storage)
• Can have attachments as well
Old versions are retained
• So Doc Stores work well for content management
Some view doc stores as specialized KV stores
Most popular with developers, startups, VCs
The biggies:
• CouchDB
• Derivatives
• MongoDB
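As a small illustration of documents as JSON objects with sub-documents, the sketch below models one customer as a nested Python dict and round-trips it through JSON. The field values are borrowed from the sample customer used elsewhere in this deck; the "_id" field name and the helper functions are assumptions for the example, not any product's API.

```python
import json

customer = {
    "_id": "101",
    "First_Name": "Andrew",
    "Last_Name": "Brust",
    "Address": {"Number": 123, "Street": "Main Street"},  # sub-document
    "Orders": {"Most_recent": "1501"},                     # link to another document
}


def to_json(document):
    """Serialize a document the way it might be sent to a document store."""
    return json.dumps(document, sort_keys=True)


def from_json(text):
    """Parse a JSON document back into nested Python objects."""
    return json.loads(text)


def lookup(document, path):
    """Follow a dotted path such as 'Address.Street' into sub-documents."""
    value = document
    for part in path.split("."):
        value = value[part]
    return value
```

The hierarchy lives inside the document itself, so reading a customer with its address needs no join.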
Document Store Application Orientation
Documents can each be addressed by URIs
CouchDB supports full REST interface
Very geared towards JavaScript and JSON
• Documents are JSON objects
• CouchDB/MongoDB use JavaScript as native language
In CouchDB, “view functions” also have unique URIs and they return
HTML
• So you can build entire applications in the database
Document Stores
Database: Customers
Document ID: 101
First_Name: Andrew
Last_Name: Brust
Address:
  Number: 123
  Street: Main Street
Orders:
  Most_recent: 1501
Document ID: 202
First_Name: Jane
Last_Name: Doe
Address:
  Number: 321
  Street: Elm Street
Orders:
  Most_recent: 1502
Database: Orders
Document ID: 1501
Price: 300 USD
Item1: 52134
Item2: 24457
Document ID: 1502
Price: 2500 GBP
Item1: 98456
Item2: 59428
Comparing…
Demo
Document Stores
Graph Databases
Great for social network applications and others where relationships are
important
Nodes and edges
• Edge like a join
• Nodes like rows in a table
Nodes can also have properties and values
Neo4j is a popular graph db
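The node / edge / property model can be sketched with two plain Python containers. This is a conceptual illustration, not Neo4j's API; the class, its methods, and the specific nodes and edges in the usage note are assumptions.

```python
class Graph:
    """Nodes carry properties; edges are labeled, directed relationships."""

    def __init__(self):
        self.nodes = {}  # node id -> dict of properties
        self.edges = []  # (source id, label, target id)

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def add_edge(self, source, label, target):
        """Relate two existing nodes; the edge plays the role of a join."""
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both nodes must exist before they can be related")
        self.edges.append((source, label, target))

    def neighbors(self, node_id, label):
        """Follow every edge with the given label that starts at node_id."""
        return [dst for src, lbl, dst in self.edges if src == node_id and lbl == label]
```

For example, after adding a person node and an order node, add_edge("Andrew Brust", "Placed order", "order-252") records the relationship, and neighbors("Andrew Brust", "Placed order") walks it without any table join.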
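Graph Databases
[Diagram: a graph database in which person nodes (Joe Smith, Jane Doe, Andrew Brust, George Washington) are linked by edges labeled “Sent invitation to”, “Commented on photo by” and “Friend of”. Other edges are labeled “Address”, “Placed order”, “Item 1” and “Item 2”. Nodes with properties:
• Address: Street: 123 Main Street, City: New York, State: NY, Zip: 10014
• Order: ID: 252, Total Price: 300 USD
• Item: ID: 52134, Type: Dress, Color: Blue
• Item: ID: 24457, Type: Shirt, Color: Red]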
NoSQL on Windows Azure
Platform as a Service
• Cloudant: https://cloudant.com/azure/
• MongoDB (via MongoLab): http://blog.mongolab.com/2012/10/azure/
MongoDB, DIY:
• On an Azure Worker Role:
http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+Worker+Roles
• On a Windows VM:
http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+VM+-+Windows+Installer
• On a Linux VM:
http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+VM+-+Linux+Tutorial
http://www.windowsazure.com/en-us/manage/linux/common-tasks/mongodb-on-a-linux-vm/
NoSQL on Windows Azure
Others, DIY (Linux VMs):
• Couchbase:
http://blog.couchbase.com/couchbase-server-new-windows-azure
• CouchDB: http://ossonazure.interoperabilitybridges.com/articles/couchdb-installer-for-windows-azure
• Riak:
http://basho.com/blog/technical/2012/10/09/Riak-on-Microsoft-Azure/
• Redis: http://blogs.msdn.com/b/tconte/archive/2012/06/08/running-redis-on-a-centos-linux-vm-in-windows-azure.aspx
• Cassandra: http://www.windowsazure.com/en-us/manage/linux/other-resources/how-to-run-cassandra-with-linux/
NoSQL + BI
NoSQL databases are bad for ad hoc query and data warehousing
BI applications involve models; models rely on schema
Extract, transform and load (ETL) may be your friend
Wide-column stores, however, are good for “Big Data”
• See next slide
Wide-column stores and column-oriented databases are similar
technologically
NoSQL + Big Data
Big Data and NoSQL are interrelated
Typically, Wide-Column stores used in Big Data scenarios
Prime example:
• HBase and Hadoop
Why?
• Lack of indexing not a problem
• Consistency not an issue
• Fast reads very important
• Distributed file systems important too
• Commodity hardware and disk assumptions also important
• Not Web scale but massive scale-out, so similar concerns
NoSQL Compromises
Eventual consistency
Write buffering
Only primary keys can be indexed
Queries must be written as programs
Tooling
• Productivity (= money)
Common DBA Tasks in NoSQL
RDBMS                            NoSQL
Import Data                      Import Data
Setup Security                   Setup Security
Perform a Backup                 Make a copy of the data
Restore a Database               Move a copy to a location
Create an Index                  Create an Index
Join Tables Together             Run MapReduce
Schedule a Job                   Schedule a (Cron) Job
Run Database Maintenance         Monitor space and resources used
Send an Email from SQL Server    Set up resource threshold alerts
Search BOL                       Interpret Documentation
Which Type of NoSQL for
Which Type of Data?
Type of Data                Type of NoSQL solution    Example
Log files                   Wide Column               HBase
Product Catalogs            Key Value on disk         DynamoDB
User profiles               Key Value in memory       Redis
Startups                    Document                  MongoDB
Social media connections    Graph                     Neo4j
LOB w/Transactions          NONE! Use RDBMS           SQL Server
Relational vs. NoSQL
Line of Business -> Relational
Large, public (consumer)-facing sites -> NoSQL
Complex data structures -> Relational
Big Data -> NoSQL
Transactional -> Relational
Content Management -> NoSQL
Enterprise -> Relational
Consumer Web -> NoSQL
Data Scientists…
NoSQL To-Do List
Understand CAP & types of NoSQL databases
• Use NoSQL when business needs dictate
• Use the right type of NoSQL for your business problem
Try out NoSQL on the cloud
• Quick and cheap for behavioral data
• Mash up cloud datasets
• Good for specialized use cases, e.g. dev, test, training environments
Learn NoSQL access technologies
• New query languages, e.g. MapReduce, R, Infer.NET
• New query tools (vendor-specific) – Google Refine, Amazon
Karmasphere, Microsoft Excel connectors, etc.
NoSQL for .NET Developers
RavenDB
MongoDB C#/.NET Driver
MongoDB on Windows Azure
CouchBase .NET Client Library
Riak client for .NET
AWS Toolkit for Visual Studio
Google cloud APIs (REST-based)
Thank You
• andrew.brust@bluebadgeinsights.com
• @andrewbrust on twitter
• Want to get on Blue Badge Insights’ list?
Text “bluebadge” to 22828

More Related Content

What's hot

Cloud Computing and the Microsoft Developer - A Down-to-Earth Analysis
Cloud Computing and the Microsoft Developer - A Down-to-Earth AnalysisCloud Computing and the Microsoft Developer - A Down-to-Earth Analysis
Cloud Computing and the Microsoft Developer - A Down-to-Earth Analysis
Andrew Brust
 
Relational databases vs Non-relational databases
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databases
James Serra
 
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big Data
Andrew Brust
 
Relational and non relational database 7
Relational and non relational database 7Relational and non relational database 7
Relational and non relational database 7
abdulrahmanhelan
 
Intro to Big Data and NoSQL
Intro to Big Data and NoSQLIntro to Big Data and NoSQL
Intro to Big Data and NoSQLDon Demcsak
 
NoSQL databases and managing big data
NoSQL databases and managing big dataNoSQL databases and managing big data
NoSQL databases and managing big data
Steven Francia
 
Nonrelational Databases
Nonrelational DatabasesNonrelational Databases
Nonrelational Databases
Udi Bauman
 
The Future of Analytics, Data Integration and BI on Big Data Platforms
The Future of Analytics, Data Integration and BI on Big Data PlatformsThe Future of Analytics, Data Integration and BI on Big Data Platforms
The Future of Analytics, Data Integration and BI on Big Data Platforms
Mark Rittman
 
Data Modeling for NoSQL
Data Modeling for NoSQLData Modeling for NoSQL
Data Modeling for NoSQL
Tony Tam
 
Schemaless Databases
Schemaless DatabasesSchemaless Databases
Schemaless Databases
Dan Gunter
 
Big Data: Guidelines and Examples for the Enterprise Decision Maker
Big Data: Guidelines and Examples for the Enterprise Decision MakerBig Data: Guidelines and Examples for the Enterprise Decision Maker
Big Data: Guidelines and Examples for the Enterprise Decision Maker
MongoDB
 
Rdbms vs. no sql
Rdbms vs. no sqlRdbms vs. no sql
Rdbms vs. no sql
Amar Jagdale
 
Introduction to Azure DocumentDB
Introduction to Azure DocumentDBIntroduction to Azure DocumentDB
Introduction to Azure DocumentDB
Denny Lee
 
Azure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkAzure Databricks is Easier Than You Think
Azure Databricks is Easier Than You Think
Ike Ellis
 
Sql vs NoSQL
Sql vs NoSQLSql vs NoSQL
Sql vs NoSQL
RTigger
 
RDX Insights Presentation - Microsoft Business Intelligence
RDX Insights Presentation - Microsoft Business IntelligenceRDX Insights Presentation - Microsoft Business Intelligence
RDX Insights Presentation - Microsoft Business Intelligence
Christopher Foot
 
Evolved BI with SQL Server 2012
Evolved BIwith SQL Server 2012Evolved BIwith SQL Server 2012
Evolved BI with SQL Server 2012Andrew Brust
 
Introduction to PolyBase
Introduction to PolyBaseIntroduction to PolyBase
Introduction to PolyBase
James Serra
 

What's hot (20)










Big Data and NoSQL for Database and BI Pros

  • 1. April 10-12 | Chicago, IL Big Data and NoSQL for Database and BI Pros Andrew J. Brust, Founder and CEO, Blue Badge Insights
  • 2. Please silence cell phones
  • 3. Meet Andrew: CEO and Founder, Blue Badge Insights; Big Data blogger for ZDNet; Microsoft Regional Director, MVP; Co-chair of VSLive! and 17 years as a speaker; Founder, Microsoft BI User Group of NYC (http://www.msbinyc.com); Co-moderator, NYC .NET Developers Group (http://www.nycdotnetdev.com); “Redmond Review” columnist for Visual Studio Magazine and Redmond Developer News; brustblog.com, Twitter: @andrewbrust
  • 4. Andrew's New Blog (bit.ly/bigondata)
  • 5. Lynn Langit (in absentia): CEO and Founder, Lynn Langit consulting; Former Microsoft Evangelist (4 years); Google Developer Expert; MongoDB Master; MCT 13 years – 7 certifications; Cloudera Certified Developer; MSDN Magazine articles (SQL Azure, Hadoop on Azure, MongoDB on Azure); www.LynnLangit.com; @LynnLangit
  • 7. Agenda: Overview / Landscape (Big Data and Hadoop; NoSQL; the Big Data-NoSQL intersection); Drilldown on Big Data; Drilldown on NoSQL
  • 8. What is Big Data? 100s of TB into PB and higher; involves financial data, sensor data, web logs, social media, etc.; parallel processing often involved; Hadoop is emblematic, but other technologies are Big Data too; processing of data sets too large for transactional databases; analyzing interactions, rather than transactions; the three V's: Volume, Velocity, Variety; Big Data tech sometimes imposed on small data problems
  • 9. Big Data = Exponentially More Data. Retail example -> 'Feedback Economy': number of transactions; number of behaviors (collected every minute)
  • 10. Big Data = 'Next State' Questions: What could happen? Why didn't this happen? When will the next new thing happen? What will the next new thing be? What happens? (Collecting behavioral data)
  • 11. My Data: An Example from Health Care. Medical records (regular; emergency; genetic data – 23andMe); Food data (SparkPeople); Purchasing (grocery card; credit card); Search (Google); Social media (Twitter; Facebook); Exercise (Nike Fuel Band; Kinect; location - phone)
  • 12. Big Data = More Data
  • 13. Big Data Considerations: Collection – get the data; Storage – keep the data; Querying – make sense of the data; Visualization – see the business value
  • 14. Data Collection. Types of data: structured, semi-structured, unstructured vs. data standards; behavioral vs. transactional data. Methods of collection: sensors everywhere; Machine-2-Machine; public datasets (Freebase; Azure DataMarket; Hilary Mason's list)
  • 15. What’s MapReduce? Partition the bulk input data and send to mappers (nodes in cluster) Mappers pre-process, put into key-value format, and send all output for a given (set of) key(s) to a reducer Reducer aggregates; one output per key, with value Map and Reduce code natively written as Java functions
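The partition, map, group-by-key and reduce steps on this slide can be sketched in a few lines of Python. This is a single-process illustration, not Hadoop code (the slide notes that native jobs are written in Java); the function names and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict

def mapper(chunk):
    """Pre-process one input partition into (key, value) pairs."""
    for word in chunk.lower().split():
        yield word, 1

def shuffle(mapped_pairs):
    """Send all values for a given key to the same reducer input."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    """Aggregate: one output per key, with value."""
    return key, sum(values)

def run_job(chunks):
    # On a cluster each chunk would go to a different mapper node;
    # here the "nodes" are just loop iterations.
    mapped = [pair for chunk in chunks for pair in mapper(chunk)]
    return dict(reducer(key, values) for key, values in shuffle(mapped).items())

counts = run_job(["big data big", "data stores", "big clusters"])
```

Each string passed to `run_job` stands in for one partition of the bulk input; `counts` maps every word to its total across all partitions.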
  • 16. MapReduce, in a Diagram [Diagram: input splits feed six mappers; mapper output is routed by key to three reducers, with keys grouped as K1, K4; K2, K5; K3, K6; each reducer writes its own output.]
  • 17. A MapReduce Example • Count by suite, on each floor • Send per-suite, per platform totals to lobby • Sort totals by platform • Send two platform packets to 10th, 20th, 30th floor • Tally up each platform • Collect the tallies • Merge tallies into one spreadsheet
  • 18. What’s a Distributed File System? One where data gets distributed over commodity drives on commodity servers Data is replicated • If one box goes down, no data lost • “Shared Nothing” BUT: Immutable • Files can only be written to once • So updates require drop + re-write (slow) • You can append though • Like a DVD/CD-ROM
  • 19. Hadoop = MapReduce + HDFS Modeled after Google MapReduce + GFS Have more data? Just add more nodes to cluster. • Mappers execute in parallel • Hardware is commodity • “Scaling out” Use of HDFS means data may well be local to mapper processing • So, not just parallel, but minimal data movement, which avoids network bottlenecks
  • 20. Comparison: RDBMS vs. Hadoop (Traditional RDBMS vs. Hadoop / MapReduce) • Data Size: Gigabytes (Terabytes) vs. Petabytes (Exabytes) • Updates: Read / Write many times vs. Write once, Read many times • Integrity: High (ACID) vs. Low • Query Response Time: Can be near immediate vs. Has latency (due to batch processing)
  • 21. Just-in-Time Schema When looking at unstructured data, schema is imposed at query time Schema is context specific • If scanning a book, are the values words, lines, or pages? • Are notes a single field, or is each word a value? • Are date and time two fields or one? • Are street, city, state, zip separate or one value? • Pig and Hive let you determine this at query time • So does the Map function in MapReduce code
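A minimal sketch of schema imposed at query time, assuming a made-up pipe-delimited log format (the sample rows and function names are hypothetical). The stored lines never change; each "query" decides whether date and time are one field or two, and whether the address is one value or four.

```python
# Raw, schema-less input: nothing here says what the fields are.
RAW_LINES = [
    "2013-04-10 09:15:00|123 Main Street, New York, NY 10014",
    "2013-04-11 17:40:00|321 Elm Street, Chicago, IL 60601",
]

def as_one_timestamp(line):
    """Schema A: date and time are a single field."""
    stamp, _address = line.split("|", 1)
    return {"timestamp": stamp}

def as_date_and_time(line):
    """Schema B: date and time are two fields."""
    stamp, _address = line.split("|", 1)
    date, time = stamp.split(" ")
    return {"date": date, "time": time}

def as_address_parts(line):
    """Schema C: street, city, state and zip are separate values."""
    _stamp, address = line.split("|", 1)
    street, city, state_zip = [part.strip() for part in address.split(",")]
    state, zip_code = state_zip.split(" ")
    return {"street": street, "city": city, "state": state, "zip": zip_code}

# The same stored bytes answer different questions under different schemas.
dates = [as_date_and_time(line)["date"] for line in RAW_LINES]
cities = [as_address_parts(line)["city"] for line in RAW_LINES]
```

In Pig or Hive the equivalent choice is made in the load or table definition; in hand-written MapReduce it is made inside the map function, as the slide says.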
  • 22. What’s HBase? A Wide-Column Store NoSQL database Modeled after Google BigTable Uses HDFS Therefore, Hadoop-compatible Hadoop MapReduce often used with HBase But you can use either without the other
  • 24. NoSQL Confusion Many ‘flavors’ of NoSQL data stores Easiest to group by functionality, but… • Dividing lines are not clear or consistent NoSQL choice(s) driven by many factors • Type of data • Quantity of data • Knowledge of technical staff • Product maturity • Tooling
  • 25. So much wrong information Everything is ‘new’ People are religious about data storage Lots of incorrect information ‘Try’ before you ‘buy’ (or use) Watch out for oversimplification Confusion over vendor offerings
  • 26. Common NoSQL Misconceptions Problems • Everything is ‘new’ • People are religious about data storage • Open source is always cheaper • Cloud is always cheaper • Replace RDBMS with NoSQL Solutions • ‘Try’ before you ‘buy’ (or use) • Leverage NoSQL communities • Add NoSQL to existing RDBMS solution
  • 27. Drilldown on Big Data
  • 28. The Hadoop Stack • MapReduce, HDFS • Database • RDBMS Import/Export • Query: HiveQL and Pig Latin • Machine Learning/Data Mining • Log file integration
  • 29. What’s Hive? Began as Hadoop sub-project Now top-level Apache project Provides a SQL-like (“HiveQL”) abstraction over MapReduce Has its own HDFS table file format (and it’s fully schema-bound) Can also work over HBase Acts as a bridge to many BI products which expect tabular data
  • 30. Hadoop Distributions • Cloudera • Hortonworks: HCatalog (Hive/Pig/MR Interop) • MapR: Network File System replaces HDFS • IBM InfoSphere BigInsights: HDFS<->DB2 integration • And now Microsoft…
  • 31. Microsoft HDInsight Developed with Hortonworks and incorporates Hortonworks Data Platform (HDP) for Windows Windows Azure HDInsight and Microsoft HDInsight (for Windows Server) • Single node preview runs on Windows client Includes ODBC Driver for Hive JavaScript MapReduce framework Contribute it all back to open source Apache Project
  • 32. Amenities for Visual Studio/.NET Hortonworks Data Platform for Windows MRLib (NuGet Package) LINQ to Hive OdbcClient + Hive ODBC Driver Deployment Debugging MR code in C#, HadoopJob, MapperBase, ReducerBase
  • 33. Some ways to work Microsoft HDInsight • Cloud: go to www.windowsazure.com, request a cluster • Local: Download Microsoft HDInsight • Runs on just about anything, including Windows XP • Get it via the Web Platform Installer (WebPI) • Local version is free; cloud billed at 50% discount during preview Amazon Web Services Elastic MapReduce • Create AWS account • Select Elastic MapReduce in Dashboard • Cheap for experimenting, but not free Cloudera CDH VM image • Download as .tar.gz file • “Un-tar” (can use WinRAR, 7-Zip) • Run via VMware Player or VirtualBox • Everything’s free
  • 34. Some ways to work HDInsight EMR CDH 4
  • 35. Microsoft HDInsight Much simpler than the others Browser-based portal • Launch MapReduce jobs • Azure: Provisioning cluster, managing ports, gather external data Interactive JavaScript & Hive console • JS: HDFS, Pig, light data visualization • Hive commands and metadata discovery • New console coming Desktop Shortcuts: • Command window, MapReduce, Name Node status in browser • Azure: from portal page you can RDP directly to Hadoop head node for these desktop shortcuts
  • 36. Demo Windows Azure HDInsight
  • 37. Amazon Elastic MapReduce Lots of steps! At a high level: • Set up AWS account and S3 “buckets” • Generate Key Pair and PEM file • Install Ruby and EMR Command Line Interface • Provision the cluster using CLI • A batch file can work very well here • Set up and run SSH/PuTTY • Work interactively at command line
  • 38. Demo Amazon Elastic MapReduce
  • 39. Cloudera CDH4 Virtual Machine Get it for free, in VMware and VirtualBox versions. • VMware Player and VirtualBox are free too Run it, and configure it to have its own IP on your network. Use ifconfig to discover IP. Assuming IP of 192.168.1.59, open browser on your own (host) machine and navigate to: • http://192.168.1.59:8888 Can also use browser in VM and hit: • http://localhost:8888 Work in “Hue”…
  • 40. Hue Browser based UI, with front ends for: HDFS (w/ upload & download) MapReduce job creation and monitoring Hive (“Beeswax”) And in-browser command line shells for: HBase Pig (“Grunt”)
  • 41. Impala: What it Is Distributed SQL query engine over Hadoop cluster Announced at Strata/Hadoop World in NYC on October 24th In Beta, as part of CDH 4.1 Works with HDFS and Hive data Compatible with HiveQL and Hive drivers • Query with Beeswax
  • 42. Impala: What it’s Not Impala is not Hive • Hive converts HiveQL to Java MapReduce code and executes it in batch mode • Impala executes query interactively over the data • Brings BI tools and Hadoop closer together Impala is not an Apache Software Foundation project • It is open source and Apache-licensed, but it’s still incubated by Cloudera • Only in CDH
  • 43. Demo Cloudera CDH4, Impala
  • 44. Hadoop commands HDFS • hadoop fs filecommand • Create and remove directories • mkdir, rm, rmr • Upload and download files to/from HDFS • get, put • View directory contents • ls, lsr • Copy, move, view files • cp, mv, cat MapReduce • Run a Java jar-file based job • hadoop jar jarname params
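The command shapes on this slide can be assembled programmatically. The sketch below only builds argument lists and prints them; it does not contact a cluster. Note that on the real command line each file command takes a leading dash (`hadoop fs -mkdir`, `hadoop fs -ls`, and so on). The helper names, paths and jar name are hypothetical.

```python
def hdfs_command(file_command, *args):
    """Build the argument list for `hadoop fs -<file_command> <args...>`."""
    return ["hadoop", "fs", "-" + file_command, *args]

def jar_command(jar_name, *params):
    """Build the argument list for `hadoop jar <jarname> <params...>`."""
    return ["hadoop", "jar", jar_name, *params]

# A typical session: make a directory, upload a file, inspect it, run a job.
session = [
    hdfs_command("mkdir", "/demo"),
    hdfs_command("put", "local.txt", "/demo/local.txt"),
    hdfs_command("ls", "/demo"),
    hdfs_command("cat", "/demo/local.txt"),
    jar_command("wordcount.jar", "/demo", "/demo-out"),
]

for argv in session:
    print(" ".join(argv))
```

On a machine with Hadoop installed, each list could be handed to `subprocess.run(argv, check=True)` instead of being printed.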
  • 45. Demo Hadoop (directly)
  • 46. HBase Concepts: • Tables, column families • Columns, rows • Keys, values Commands: • Definition: create, alter, drop, truncate • Manipulation: get, put, delete, deleteall, scan • Discovery: list, exists, describe, count • Enablement: disable, enable • Utilities: version, status, shutdown, exit • Reference: http://wiki.apache.org/hadoop/Hbase/Shell Moreover, • Interesting HBase work can be done in MapReduce, Pig
  • 47. HBase Examples
      create 't1', 'f1', 'f2', 'f3'
      describe 't1'
      alter 't1', {NAME => 'f1', VERSIONS => 5}
      put 't1', 'r1', 'f1:c1', 'value'
      get 't1', 'r1'
      count 't1'
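To make the data model behind these shell commands concrete, here is a toy in-memory stand-in, written under simplifying assumptions: versions are kept as a newest-first list rather than by timestamp, and nothing is persisted. `TinyTable` and its methods are hypothetical names, not part of any HBase client API. Column names take the form `family:qualifier`, and the family must be one declared when the table was created.

```python
from collections import defaultdict

class TinyTable:
    """In-memory illustration of an HBase-style table (not a real client)."""

    def __init__(self, *families, versions=1):
        self.families = set(families)   # declared up front, like `create`
        self.versions = versions        # how many old values each cell keeps
        # row key -> "family:qualifier" -> list of values, newest first
        self.rows = defaultdict(dict)

    def put(self, row, column, value):
        family, _, _qualifier = column.partition(":")
        if family not in self.families:
            raise KeyError("unknown column family: " + family)
        cell = self.rows[row].setdefault(column, [])
        cell.insert(0, value)
        del cell[self.versions:]        # keep only the newest N versions

    def get(self, row):
        """Return the newest value of every column present in the row."""
        return {col: cell[0] for col, cell in self.rows.get(row, {}).items()}

    def count(self):
        return len(self.rows)

t1 = TinyTable("f1", "f2", "f3", versions=5)
t1.put("r1", "f1:c1", "value")
t1.put("r1", "f1:c1", "newer value")
t1.put("r2", "f2:anything", "rows need not share columns")
```

The qualifier after the colon is free-form and can differ from row to row, which is what makes the store "wide-column".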
  • 48. Demo HBase
  • 49. Submitting, Running and Monitoring Jobs Upload a JAR Use Streaming • Use other languages (i.e. other than Java) to write MapReduce code • Python is popular option • Any executable works, even C# console apps • On MS HDInsight, JavaScript works too • Still uses a JAR file: streaming.jar Run at command line (passing JAR name and params) or use GUI
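A sketch of what streaming-style map and reduce steps look like in Python. Under Hadoop Streaming the mapper and reducer are ordinary executables that read lines on standard input and write tab-separated key/value lines on standard output, and the framework sorts mapper output by key before the reducer sees it. Here the two steps are plain functions over lines so they can be tried without a cluster; the function names and sample input are assumptions.

```python
from itertools import groupby

def map_lines(lines):
    """Mapper: emit one "word<TAB>1" line per word."""
    for line in lines:
        for word in line.split():
            yield word.lower() + "\t1"

def reduce_lines(sorted_lines):
    """Reducer: input is sorted by key, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word + "\t" + str(sum(int(count) for _, count in group))

# Local simulation of map -> sort -> reduce.
local_run = list(reduce_lines(sorted(map_lines(["Big data", "big clusters"]))))
```

To use these under streaming, each function would be wrapped in a small script that feeds it `sys.stdin` and prints what it yields, and the two scripts would be named in the job's `-mapper` and `-reducer` options alongside `-input` and `-output`.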
  • 50. Demo Running MapReduce Jobs
  • 51. Hive Used by most BI products which connect to Hadoop Provides a SQL-like abstraction over Hadoop • Officially HiveQL, or HQL Works on own tables, but also on HBase Query generates MapReduce job, output of which becomes result set Microsoft has Hive ODBC driver • Connects Excel, Reporting Services, PowerPivot, Analysis Services Tabular Mode (only)
  • 52. Hive, Continued Load data from flat HDFS files • LOAD DATA [LOCAL] INPATH 'myfile' INTO TABLE mytable; SQL Queries • CREATE, ALTER, DROP • INSERT OVERWRITE (creates whole tables) • SELECT, JOIN, WHERE, GROUP BY • SORT BY, but ordering data is tricky! • MAP/REDUCE/TRANSFORM…USING allows for custom map, reduce steps utilizing Java or streaming code
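One reason ordering is tricky: SORT BY orders rows within each reducer only, whereas ORDER BY gives a total order by funnelling everything through a single reducer. The sketch below imitates that difference in plain Python; the round-robin split of rows across reducers and the sample rows are assumptions made for illustration, not how Hive actually distributes data.

```python
ROWS = [("dress", 300), ("shirt", 40), ("hat", 25), ("coat", 180), ("belt", 15)]

def sort_by(rows, num_reducers):
    """Each reducer sorts only the rows it received (SORT BY-like)."""
    buckets = [rows[i::num_reducers] for i in range(num_reducers)]
    return [sorted(bucket, key=lambda row: row[1]) for bucket in buckets]

def order_by(rows):
    """A single reducer sees every row, so the order is total (ORDER BY-like)."""
    return sorted(rows, key=lambda row: row[1])

per_reducer = sort_by(ROWS, num_reducers=2)
# Gluing the per-reducer outputs together does not give a sorted result.
concatenated = [row for bucket in per_reducer for row in bucket]
totally_ordered = order_by(ROWS)
```

Every list in `per_reducer` is sorted by price, yet `concatenated` is not, which is why a query that needs a globally sorted result pays for the single-reducer bottleneck.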
  • 53. Data Explorer • Beta add-in for Excel • Acquire, transform data • Data sources include Facebook, HDFS • Visually- or script-driven • Also includes Azure BLOB storage backing up HDInsight
  • 54. Demo Hive, Data Explorer
  • 55. Pig Instead of SQL, employs a language (“Pig Latin”) that accommodates data flow expressions • Do a combo of Query and ETL “10 lines of Pig Latin ≈ 200 lines of Java.” Works with structured or unstructured data Operations • As with Hive, a MapReduce job is generated • Unlike Hive, output is only flat file to HDFS or text at command line console • With HDInsight, can easily convert to JavaScript array, then manipulate Use command line (“Grunt”) or build scripts
  • 56. Example
      A = LOAD 'myfile' AS (x, y, z);
      B = FILTER A BY x > 0;
      C = GROUP B BY x;
      D = FOREACH C GENERATE group, COUNT(B);
      STORE D INTO 'output';
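The data flow in this example (load, filter, group, count per group, store) reads roughly like the following Python, given here only as a mental model; Pig would compile the script to MapReduce jobs rather than run anything like this. The sample rows are made up.

```python
from collections import Counter

# Stand-in for LOAD 'myfile' AS (x, y, z); the rows are invented.
A = [(1, "a", 0.5), (-2, "b", 1.5), (1, "c", 2.5), (3, "d", 3.5)]

# FILTER ... BY x > 0: keep rows whose first field is positive.
B = [row for row in A if row[0] > 0]

# GROUP ... BY x followed by COUNT: one counter entry per distinct x.
C = Counter(row[0] for row in B)

# One (x, count) pair per group, which is what STORE would write out.
D = sorted(C.items())
```

Each assignment names an intermediate data set, which mirrors the assignment-statement style of Pig Latin.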
  • 57. Pig Latin Examples Imperative, file system commands • LOAD, STORE • Schema specified on LOAD Declarative, query commands (SQL-like) • xxx = file or data set • FOREACH xxx GENERATE (SELECT…FROM xxx) • JOIN (WHERE/INNER JOIN) • FILTER xxx BY (WHERE) • ORDER xxx BY (ORDER BY) • GROUP xxx BY / GENERATE COUNT(xxx) (SELECT COUNT(*) GROUP BY) • DISTINCT (SELECT DISTINCT) Syntax is assignment statement-based: • MyCusts = FILTER Custs BY SalesPerson == 15; Access HBase • CpuMetrics = LOAD 'hbase://SystemMetrics' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:','-loadKey -returnTuple');
  • 58. Demo Pig
  • 61. Flume NG Source • Avro (data serialization system – can read json-encoded data files, and can work over RPC) • Exec (reads from stdout of long-running process) Sinks • HDFS, HBase, Avro Channels • Memory, JDBC, file
  • 62. Flume NG (next generation) Setup
    conf/flume.conf:
      # Define a memory channel called ch1 on agent1
      agent1.channels.ch1.type = memory
      # Define an Avro source called avro-source1 on agent1 and tell it
      # to bind to 0.0.0.0:41414. Connect it to channel ch1.
      agent1.sources.avro-source1.channels = ch1
      agent1.sources.avro-source1.type = avro
      agent1.sources.avro-source1.bind = 0.0.0.0
      agent1.sources.avro-source1.port = 41414
      # Define a logger sink that simply logs all events it receives
      # and connect it to the other end of the same channel.
      agent1.sinks.log-sink1.channel = ch1
      agent1.sinks.log-sink1.type = logger
      # Finally, now that we've defined all of our components, tell
      # agent1 which ones we want to activate.
      agent1.channels = ch1
      agent1.sources = avro-source1
      agent1.sinks = log-sink1
    From the command line:
      flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1
  • 63. Mahout Algorithms Recommendation • Your info + community info • Give users/items/ratings; get user-user/item-item • itemsimilarity Classification/Categorization • Drop into buckets • Naïve Bayes, Complementary Naïve Bayes, Decision Forests Clustering • Like classification, but with categories unknown • K-Means, Fuzzy K-Means, Canopy, Dirichlet, Mean-Shift
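To give a feel for "give users/items; get item-item", here is a deliberately simple sketch that counts how often two items appear together in the same user's history. It is plain co-occurrence counting, a swapped-in stand-in and not the log-likelihood similarity Mahout computes; the user names, items and function name are made up.

```python
from collections import defaultdict
from itertools import combinations

# (user, item) preference pairs; invented sample data.
PREFERENCES = [
    ("ann", "dress"), ("ann", "shirt"),
    ("bob", "dress"), ("bob", "shirt"), ("bob", "hat"),
    ("cy", "hat"),
]

def item_cooccurrence(pairs):
    """Count, for every pair of items, how many users have both."""
    items_by_user = defaultdict(set)
    for user, item in pairs:
        items_by_user[user].add(item)
    counts = defaultdict(int)
    for items in items_by_user.values():
        # sorted() makes (a, b) and (b, a) land on the same key
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
    return dict(counts)

similar = item_cooccurrence(PREFERENCES)
```

Items that share many users get higher counts, which is the raw signal an item-item recommender starts from.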
  • 64. Workflow, Syntax Workflow • Run the job • Dump the output • Visualize, predict Syntax • mahout algorithm --input folderspec --output folderspec --param1 value1 --param2 value2 … Example: • mahout itemsimilarity --input <input-hdfs-path> --output <output-hdfs-path> --tempDir <tmp-hdfs-path> -s SIMILARITY_LOGLIKELIHOOD
  • 65. The Truth About Mahout Mahout is really just an algorithm engine Its output is almost unusable by non-statisticians/non-data scientists You need a staff or a product to visualize, or make into a usable prediction model Investigate Predixion Software • CTO, Jamie MacLennan, used to lead SQL Server Data Mining team • Excel add-in can use Mahout remotely, visualize its output, run predictive analyses • Also integrates with SQL Server, Greenplum, MapReduce • http://www.predixionsoftware.com
  • 66. The “Data-Refinery” Idea Use Hadoop to “on-board” unstructured data, then extract manageable subsets Load the subsets into conventional DW/BI servers and use familiar analytics tools to examine This is the current rationalization of Hadoop + BI tools’ coexistence Will it stay this way?
  • 67. Google BigQuery Dremel-based service for massive amounts of data Pay for query and storage SQL-like query language Has an Excel connector
  • 68. Google BigQuery
  • 69. Drilldown on NoSQL
  • 70. NoSQL Data Fodder • Addresses • Preferences • Notes • Friends, Followers • Documents
  • 71. “Web Scale” This is the term used to justify NoSQL Scenario is simple needs but “made up for in volume” • Millions of concurrent users Think of sites like Amazon or Google Think of non-transactional tasks like loading catalog data to display product page, or environment preferences
  • 72. NoSQL Common Traits Non-relational Non-schematized/schema-free Open source Distributed Eventual consistency “Web scale” Developed at big Internet companies
  • 73. So many NoSQL options More than just the Elephant in the room 120+ types of NoSQL databases
  • 75. Consistency CAP Theorem • Databases may only excel at two of the following three attributes: consistency, availability and partition tolerance NoSQL does not offer “ACID” guarantees • Atomicity, consistency, isolation and durability Instead offers “eventual consistency” Similar to DNS propagation
  • 76. Consistency Things like inventory, account balances should be consistent • Imagine updating a server in Seattle that stock was depleted • Imagine not updating the server in NY • Customer in NY goes to order 50 pieces of the item • Order processed even though no stock Things like catalog information don’t have to be, at least not immediately • If a new item is entered into the catalog, it’s OK for some customers to see it even before the other customers’ server knows about it But catalog info must come up quickly • Therefore don’t lock data in one location while waiting to update the other Therefore, OK to sacrifice consistency for speed, in some cases
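The Seattle/NY inventory scenario can be played out with two toy replicas. This is a sketch of eventual consistency under simplifying assumptions (one pending-update queue per replica, no conflict resolution); the `Replica` class and its methods are hypothetical.

```python
class Replica:
    """One server's copy of a stock level, updated asynchronously."""

    def __init__(self, name, stock):
        self.name = name
        self.stock = stock
        self.pending = []          # updates received but not yet applied

    def receive(self, new_stock):
        self.pending.append(new_stock)

    def sync(self):
        """Apply queued updates in arrival order."""
        while self.pending:
            self.stock = self.pending.pop(0)

    def can_order(self, quantity):
        return self.stock >= quantity

seattle = Replica("Seattle", stock=50)
new_york = Replica("NY", stock=50)

seattle.stock = 0                        # stock depleted, recorded in Seattle
new_york.receive(0)                      # update is in flight, not applied yet

stale_answer = new_york.can_order(50)    # NY still believes there is stock
new_york.sync()                          # replication catches up
fresh_answer = new_york.can_order(50)
```

Between the write in Seattle and the `sync()` in NY the two replicas disagree; that window is harmless for catalog data and costly for inventory.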
  • 78. Indexing Most NoSQL databases are indexed by key Some allow so-called “secondary” indexes Often the primary key indexes are clustered HBase uses HDFS (the Hadoop Distributed File System), which is append-only • Writes are logged • Logged writes are batched • File is re-created and sorted
  • 79. Queries Typically no query language Instead, create procedural program Sometimes SQL is supported Sometimes MapReduce code is used…
  • 80. MapReduce This is not Hadoop’s MapReduce, but it’s conceptually related Map step: pre-processes data Reduce step: summarizes/aggregates data Will show a MapReduce code sample for Mongo soon Will demo map code on CouchDB
  • 82. Sharding A partitioning pattern where separate servers store partitions Fan-out queries supported Partitions may be duplicated, so replication also provided • Good for disaster recovery Since “shards” can be geographically distributed, sharding can act like a CDN Good for keeping data close to processing • Reduces network traffic when MapReduce splitting takes place
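A minimal sketch of sharding with a fan-out query, assuming hash-based placement (real systems may shard by key range or by a directory instead). Each "server" is just a dictionary; `ShardedStore` and the sample orders are hypothetical.

```python
import zlib

class ShardedStore:
    """Keys are spread across several partitions ("shards")."""

    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key):
        # crc32 is stable across runs, unlike Python's built-in hash() for str
        index = zlib.crc32(key.encode("utf-8")) % len(self.shards)
        return self.shards[index]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key):
        """A key lookup touches exactly one shard."""
        return self._shard_for(key).get(key)

    def fan_out(self, predicate):
        """A non-key query asks every shard, then merges the partial results."""
        hits = {}
        for shard in self.shards:
            hits.update({k: v for k, v in shard.items() if predicate(v)})
        return hits

store = ShardedStore(num_shards=3)
for order_id, price in [("1501", 300), ("1502", 2500), ("1503", 40)]:
    store.put(order_id, price)

big_orders = store.fan_out(lambda price: price >= 300)
```

Lookups by key stay cheap as shards are added, while fan-out queries cost one request per shard, which is one reason NoSQL stores favor key-based access.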
  • 84. Key-Value Stores The most common; not necessarily the most popular Has rows, each with something like a big dictionary/associative array • Schema may differ from row to row Common on cloud platforms • e.g. Amazon SimpleDB, Azure Table Storage MemcacheDB, Voldemort, Couchbase, DynamoDB (AWS), Dynomite, Redis and Riak
  • 85. Key-Value Stores (diagram: one database, two tables) • Table: Customers • Row ID: 101 (First_Name: Andrew, Last_Name: Brust, Address: 123 Main Street, Last_Order: 1501) • Row ID: 202 (First_Name: Jane, Last_Name: Doe, Address: 321 Elm Street, Last_Order: 1502) • Table: Orders • Row ID: 1501 (Price: 300 USD, Item1: 52134, Item2: 24457) • Row ID: 1502 (Price: 2500 GBP, Item1: 98456, Item2: 59428)
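The two tables in this diagram map naturally onto nested dictionaries, which is a fair mental model for a key-value store. The sketch below uses the slide's data; the `Item3` entry and the `last_order` helper are hypothetical additions that illustrate rows with different schemas and the absence of joins.

```python
customers = {
    101: {"First_Name": "Andrew", "Last_Name": "Brust",
          "Address": "123 Main Street", "Last_Order": 1501},
    202: {"First_Name": "Jane", "Last_Name": "Doe",
          "Address": "321 Elm Street", "Last_Order": 1502},
}
orders = {
    1501: {"Price": "300 USD", "Item1": 52134, "Item2": 24457},
    1502: {"Price": "2500 GBP", "Item1": 98456, "Item2": 59428},
}

# Schema may differ from row to row: only this row gets a third item.
orders[1502]["Item3"] = 11111   # hypothetical value, not on the slide

def last_order(customer_id):
    """No join is available: the program follows the key itself."""
    return orders[customers[customer_id]["Last_Order"]]

andrews_order = last_order(101)
```

`Last_Order` is only a value that happens to match a key in the other table; nothing in the store enforces that relationship.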
  • 86. Wide Column Stores Has tables with declared column families • Each column family has “columns” which are KV pairs that can vary from row to row These are the most foundational for large sites • BigTable (Google) • HBase (Originally part of Yahoo-dominated Hadoop project) • Cassandra (Facebook) • Calls column families “super columns” and tables “super column families” They are the most “Big Data”-ready • Especially HBase + Hadoop
  • 87. Wide Column Stores (diagram: one database, two tables) • Table: Customers • Row ID: 101 • Super Column: Name (Column: First_Name: Andrew; Column: Last_Name: Brust) • Super Column: Address (Column: Number: 123; Column: Street: Main Street) • Super Column: Orders (Column: Last_Order: 1501) • Row ID: 202 • Super Column: Name (Column: First_Name: Jane; Column: Last_Name: Doe) • Super Column: Address (Column: Number: 321; Column: Street: Elm Street) • Super Column: Orders (Column: Last_Order: 1502) • Table: Orders • Row ID: 1501 • Super Column: Pricing (Column: Price: 300 USD) • Super Column: Items (Column: Item1: 52134; Column: Item2: 24457) • Row ID: 1502 • Super Column: Pricing (Column: Price: 2500 GBP) • Super Column: Items (Column: Item1: 98456; Column: Item2: 59428)
  • 88. Demo Wide Column Stores
  • 89. Document Stores Have “databases,” which are akin to tables Have “documents,” akin to rows • Documents are typically JSON objects • Each document has properties and values • Values can be scalars, arrays, links to documents in other databases or sub-documents (i.e. contained JSON objects - Allows for hierarchical storage) • Can have attachments as well Old versions are retained • So Doc Stores work well for content management Some view doc stores as specialized KV stores Most popular with developers, startups, VCs The biggies: • CouchDB • Derivatives • MongoDB
  • 90. Document Store Application Orientation Documents can each be addressed by URIs CouchDB supports full REST interface Very geared towards JavaScript and JSON • Documents are JSON objects • CouchDB/MongoDB use JavaScript as native language In CouchDB, “view functions” also have unique URIs and they return HTML • So you can build entire applications in the database
  • 91. Document Stores (diagram: two databases) • Database: Customers • Document ID: 101 (First_Name: Andrew, Last_Name: Brust, Address: {Number: 123, Street: Main Street}, Orders: {Most_recent: 1501}) • Document ID: 202 (First_Name: Jane, Last_Name: Doe, Address: {Number: 321, Street: Elm Street}, Orders: {Most_recent: 1502}) • Database: Orders • Document ID: 1501 (Price: 300 USD, Item1: 52134, Item2: 24457) • Document ID: 1502 (Price: 2500 GBP, Item1: 98456, Item2: 59428)
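A document from this diagram can be written as a JSON object with a sub-document for the address and a link to a document in the Orders database. This is a sketch of the shape of the data, not of any particular product's API; using `_id` for the document ID and an `Items` array (instead of `Item1`/`Item2`) are modelling assumptions.

```python
import json

customer_doc = {
    "_id": 101,
    "First_Name": "Andrew",
    "Last_Name": "Brust",
    "Address": {"Number": 123, "Street": "Main Street"},  # sub-document
    "Orders": {"Most_recent": 1501},                      # link into Orders
}

# The Orders "database", keyed by document ID.
order_docs = {
    1501: {"_id": 1501, "Price": "300 USD", "Items": [52134, 24457]},
}

# Documents travel as JSON text and come back as nested objects.
serialized = json.dumps(customer_doc, sort_keys=True)
restored = json.loads(serialized)

street = restored["Address"]["Street"]
linked_order = order_docs[restored["Orders"]["Most_recent"]]
```

The hierarchy survives the round trip through text, which is what lets a document store hold arrays and nested objects without extra tables.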
  • 93. Demo Document Stores
  • 94. Graph Databases Great for social network applications and others where relationships are important Nodes and edges • Edge like a join • Nodes like rows in a table Nodes can also have properties and values Neo4j is a popular graph db
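Nodes with properties and labelled edges can be sketched with two dictionaries. This toy `Graph` class is hypothetical and unrelated to Neo4j's API, and the particular edges chosen (who is a friend of whom, who placed which order) are assumptions for illustration.

```python
from collections import defaultdict

class Graph:
    def __init__(self):
        self.nodes = {}                   # node id -> properties
        self.edges = defaultdict(list)    # node id -> [(label, target id)]

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def add_edge(self, source, label, target):
        self.edges[source].append((label, target))

    def neighbors(self, source, label):
        """Follow every outgoing edge with the given label."""
        return [target for edge_label, target in self.edges.get(source, [])
                if edge_label == label]

g = Graph()
g.add_node("andrew", name="Andrew Brust")
g.add_node("jane", name="Jane Doe")
g.add_node("order252", total_price="300 USD")
g.add_edge("andrew", "FRIEND_OF", "jane")
g.add_edge("jane", "PLACED_ORDER", "order252")

# "What did my friends order?" is two hops along edges, with no join tables.
friends_orders = [order
                  for friend in g.neighbors("andrew", "FRIEND_OF")
                  for order in g.neighbors(friend, "PLACED_ORDER")]
```

Each hop is a direct lookup from a node to its edges, which is why relationship-heavy queries suit this model better than repeated joins.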
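  • 95. Graph Databases [Diagram: a database of person nodes (Joe Smith, Jane Doe, Andrew Brust, George Washington) linked by edges labelled “Sent invitation to”, “Commented on photo by” and “Friend of”; further edges labelled “Address”, “Placed order”, “Item 1” and “Item 2” lead to an address node (Street: 123 Main Street, City: New York, State: NY, Zip: 10014), an order node (ID: 252, Total Price: 300 USD) and two item nodes (ID: 52134, Type: Dress, Color: Blue; ID: 24457, Type: Shirt, Color: Red).]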
  • 96. NoSQL on Windows Azure Platform as a Service • Cloudant: https://cloudant.com/azure/ • MongoDB (via MongoLab): http://blog.mongolab.com/2012/10/azure/ MongoDB, DIY: • On an Azure Worker Role: http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+Worker+Roles • On a Windows VM: http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+VM+-+Windows+Installer • On a Linux VM: http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+VM+-+Linux+Tutorial http://www.windowsazure.com/en-us/manage/linux/common-tasks/mongodb-on-a-linux-vm/
  • 97. NoSQL on Windows Azure Others, DIY (Linux VMs): • Couchbase: http://blog.couchbase.com/couchbase-server-new-windows-azure • CouchDB: http://ossonazure.interoperabilitybridges.com/articles/couchdb-installer-for-windows-azure • Riak: http://basho.com/blog/technical/2012/10/09/Riak-on-Microsoft-Azure/ • Redis: http://blogs.msdn.com/b/tconte/archive/2012/06/08/running-redis-on-a-centos-linux-vm-in-windows-azure.aspx • Cassandra: http://www.windowsazure.com/en-us/manage/linux/other-resources/how-to-run-cassandra-with-linux/
  • 98. NoSQL + BI NoSQL databases are bad for ad hoc query and data warehousing BI applications involve models; models rely on schema Extract, transform and load (ETL) may be your friend Wide-column stores, however, are good for “Big Data” • See next slide Wide-column stores and column-oriented databases are similar technologically
  • 99. NoSQL + Big Data Big Data and NoSQL are interrelated Typically, Wide-Column stores used in Big Data scenarios Prime example: • HBase and Hadoop Why? • Lack of indexing not a problem • Consistency not an issue • Fast reads very important • Distributed file systems important too • Commodity hardware and disk assumptions also important • Not Web scale but massive scale-out, so similar concerns
  • 100. NoSQL Compromises Eventual consistency Write buffering Only primary keys can be indexed Queries must be written as programs Tooling • Productivity (= money)
  • 101. Common DBA Tasks in NoSQL (RDBMS task -> NoSQL equivalent) • Import Data -> Import Data • Set Up Security -> Set Up Security • Perform a Backup -> Make a copy of the data • Restore a Database -> Move a copy to a location • Create an Index -> Create an Index • Join Tables Together -> Run MapReduce • Schedule a Job -> Schedule a (Cron) Job • Run Database Maintenance -> Monitor space and resources used • Send an Email from SQL Server -> Set up resource threshold alerts • Search BOL -> Interpret Documentation
  • 102. Which Type of NoSQL for Which Type of Data? (type of data: type of NoSQL solution, example) • Log files: Wide Column, HBase • Product Catalogs: Key Value on disk, DynamoDB • User profiles: Key Value in memory, Redis • Startups: Document, MongoDB • Social media connections: Graph, Neo4j • LOB w/Transactions: NONE! Use RDBMS, SQL Server
  • 103. Relational vs. NoSQL • Line of Business -> Relational • Large, public (consumer)-facing sites -> NoSQL • Complex data structures -> Relational • Big Data -> NoSQL • Transactional -> Relational • Content Management -> NoSQL • Enterprise -> Relational • Consumer Web -> NoSQL
  • 105. NoSQL To-Do List Understand CAP & types of NoSQL databases • Use NoSQL when business needs designate • Use the right type of NoSQL for your business problem Try out NoSQL on the cloud • Quick and cheap for behavioral data • Mashup cloud datasets • Good for specialized use cases, e.g. dev, test, training environments Learn NoSQL access technologies • New query languages, e.g. MapReduce, R, Infer.NET • New query tools (vendor-specific) – Google Refine, Amazon Karmasphere, Microsoft Excel connectors, etc.
  • 106. NoSQL for .NET Developers • RavenDB • MongoDB C#/.NET Driver • MongoDB on Windows Azure • CouchBase .NET Client Library • Riak client for .NET • AWS Toolkit for Visual Studio • Google cloud APIs (REST-based)
  • 107. Thank You • andrew.brust@bluebadgeinsights.com • @andrewbrust on twitter • Want to get on Blue Badge Insights’ list? Text “bluebadge” to 22828
  • 108. Thank you! Diamond Sponsor

Editor's Notes

  1. http://www.chegg.com/textbooks/foundations-of-sql-server-2008-r2-business-intelligence-2nd-edition-9781430233244-1430233249 http://www.chegg.com/textbooks/smart-business-intelligence-solutions-with-microsoft-sql-server-2008-1st-edition-9780735625808-0735625808
  2. http://www.chantcafe.com/2010/10/reality-in-catholic-music-massive.html
  3. https://developers.google.com/bigquery/docs/browser_tool Dremel -- http://research.google.com/pubs/pub36632.html
  4. http://nosql-database.org/ http://hadoop.apache.org/ & http://www.mongodb.org/ Wikipedia - http://en.wikipedia.org/wiki/NoSQL List of NoSQL databases – http://nosql-database.org/ The good, the bad - http://www.techrepublic.com/blog/10things/10-things-you-should-know-about-nosql-databases/1772
  5. When the volume of data is too much for simple human interpretation -> Man PLUS Machine (Data Mining / Statistics)
  6. http://bigdatanerd.wordpress.com/2012/01/04/why-nosql-part-2-overview-of-data-modelrelational-nosql/ http://docs.jboss.org/hibernate/ogm/3.0/reference/en-US/html_single/
  7. http://rickosborne.org/download/SQL-to-MongoDB.pdf
  8. About Data Science -- http://www.romymisra.com/the-new-job-market-rulers-data-scientists/ R language - http://www.r-project.org/ Infer.NET - http://research.microsoft.com/en-us/um/cambridge/projects/infernet/ There is a plethora of languages to access, manipulate and process Big Data. These languages fall into a couple of categories: RESTful – simple, standards; ETL – Pig (Hadoop) is an example; Query – Hive (again Hadoop), lots of *QL; Analyze – R, Mahout, Infer.NET, DMX, etc. Applying statistical (data-mining) algorithms to the data output