SQL Server and Big Data Projects in the Real World

SQL Server and Big Data
Projects in the Real World
Mark Kromer
Pentaho Big Data Analytics Product Manager
@mssqldude
@kromerbigdata
http://www.kromerbigdata.com

What we’ll (try) to cover today
1. The Big Data Technology Landscape
2. Big Data Analytics
3. 3 Big Data Analytics Scenarios:
❯ Digital Marketing Analytics
• Hadoop, Aster Data, SQL Server
❯ Sentiment Analysis
• MongoDB, SQL Server
❯ Data Refinery
• Hadoop, MPP, SQL Server, Pentaho
4. SQL Server in the Big Data world

Big Data 101
3 V’s
❯ Volume – Terabyte records, transactions, tables, files
❯ Velocity – Batch, near-time, real-time (analytics), streams.
❯ Variety – Structures, unstructured, semi-structured, and all the above in a mix
Text Processing
❯ Techniques for processing and analyzing unstructured (and structured) LARGE files
Analytics & Insights
Distributed File System & Programming

• Batch Processing
• Commodity Hardware
• Data Locality, no shared
storage
• Scales linearly
• Great for large text file
processing, not so great on
small files
• Distributed programming
paradigm
Hadoop 1.x

Hadoop 1 vs Hadoop 2
HADOOP 1.0
MapReduce
(cluster resource management
& data processing)
© Hortonworks Inc. 2014
HDFS
(redundant, reliable storage)
HADOOP 2.0
YARN
MapReduce
(data processing)
Others
(cluster resource management)
HDFS2
(redundant, highly-available & reliable storage)
Single Use System
Batch Apps
Multi Purpose Platform
Batch, Interactive, Online, Streaming, …

YARN: Taking Hadoop Beyond Batch
Applications Run Natively in Hadoop
YARN (Cluster Resource Management)
HDFS2 (Redundant, Reliable Storage)
BATCH
(MapReduce)
INTERACTIVE
(Tez)
STREAMING
(Storm, S4,…)
GRAPH
(Giraph)
IN-MEMORY
(Spark)
HPC MPI
(OpenMPI)
ONLINE
(HBase)
OTHER
(Search)
(Weave…)
Store ALL DATA in one place…
Interact with that data in MULTIPLE WAYS
with Predictable Performance and Quality of Service

YARN Eco-system
Page 7
Applications Powered by YARN
Apache Giraph – Graph Processing
Apache Hama - BSP
Apache Hadoop MapReduce – Batch
Apache Tez – Batch/Interactive
Apache S4 – Stream Processing
Apache Samza – Stream Processing
Apache Storm – Stream Processing
Apache Spark – Iterative applications
Elastic Search – Scalable Search
Cloudera Llama – Impala on YARN
DataTorrent – Data Analysis
HOYA – HBase on YARN
Frameworks Powered By YARN
Apache Twill
REEF by Microsoft
Spring support for Hadoop 2

Apache Spark
High-Speed In-Memory Analytics over Hadoop
● Open Source
● Alternative to Map Reduce for certain applications
● A low latency cluster computing system
● For very large data sets
● May be 100 times faster than Map Reduce for
– Iterative algorithms
– Interactive data mining
● Used with Hadoop / HDFS
● Released under BSD License

Popular Hadoop Distributions
Hosted PaaS Hadoop platforms: Amazon EMR, Pivotal, Microsoft
Hadoop on Azure

Popular NoSQL Distributions
Transactional-based, not analytics schemas

Popular MPP Distributions
Big Data as distributed, scale-out, sharded data stores

Big Data Analytics Web Platform – RA 1

Sentiment Analysis
Reference Architecture 2
MongoDB
Hadoop
PDW
Big Data
Platforms
Social Media
Sources
Data
Orchestration
Data Mining
OLAP Cubes
Data Models
Analytical
Models
OLAP
Analytics
Tools,
Reporting
Tools,
Dashboards

Streamlined Data Refinery
Reference Architecture 3
Transactions – Batch
& Real-time
Enrollments &
Redemptions
Location, Email,
Other Data
Hadoop
Cluster
Analytics
Reports
Data
Orchestration

Big Data Analytics
Core Tenets
• Distributed Data (Data Locality)
❯ HDFS / MapReduce
❯ YARN / TEZ
❯ Replicated / Sharded Data
• MPP Databases
❯ Vertica, Aster, Microsoft, Greenplum … In-database analytics that can scale-out
with distributed processing across nodes
• Distributed Analytics
❯ SAS: Quickly solve complex problems using big data and sophisticated analytics in a
distributed, in-memory and parallel environment.”
http://www.sas.com/resources/whitepaper/wp_46345.pdf
• In-memory Analytics
❯ Microsoft PowerPivot (Tabular models)
❯ SAP HANA
❯ Tableau

SQL on Hadoop
Hortonworks and Cloudera DW Engine Approaches

SQL on Hadoop Landscape
Gartner Research on SQL on Hadoop
Not Quite Real Time
Many vendors market their SQL interfaces to Hadoop as providing so called "real-time access" to
data stored in a Hadoop cluster … SQL on Hadoop provides a purely interactive data query and data
manipulation experience — faster than batch, but not truly real time. In the case of Hadoop and the types
of tasks it performs, we define interactive time frames as between 30 milliseconds and 10 minutes.
If your usage truly needs realtime, a different set of technologies and vendors may be required.

SQL on Hadoop
Vendor Perspective: MapR
Batch SQL
Hive is used primarily for queries on very large data sets and large ETL jobs. The queries can take anywhere between a few minutes to several
hours depending on the complexity of the job. The Apache Tez project aims to provide targeted performance improvements for Hive to deliver
interactive query capabilities in future. MapR ships and supports Apache Hive today.
Interactive SQL
Technologies such as Impala and Apache Drill provide interactive query capabilities to enable traditional business intelligence and analytics on
Hadoop-scale datasets. The response times vary between milliseconds to minutes depending on the query complexity. Users expect SQL-on-
Hadoop technologies to support common BI tools such as Tableau and MicroStrategy (to name a couple) for reporting and ad-hoc queries. MapR
supports customers using Impala on the MapR distribution of Hadoop today. Apache Drill will be available Q2 2014.

MapReduce Framework (Map)
using Microsoft.Hadoop.MapReduce;
using System.Text.RegularExpressions;
public class TotalHitsForPageMap : MapperBase
{
public override void Map(string inputLine, MapperContext context)
{
context.Log(inputLine);
var parts = Regex.Split(inputLine, "s+");
if (parts.Length != expected) //only take records with all values
{
return;
}
context.EmitKeyValue(parts[pagePos], hit);
}
}

MapReduce Framework (Reduce & Job)
public class TotalHitsForPageReducerCombiner : ReducerCombinerBase
{
public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
{
context.EmitKeyValue(key, values.Sum(e=>long.Parse(e)).ToString());
}
}
public class TotalHitsJob : HadoopJob<TotalHitsForPageMap,TotalHitsForPageReducerCombiner>
{
public override HadoopJobConfiguration Configure(ExecutorContext context)
{
var retVal = new HadoopJobConfiguration();
retVal.InputPath = Environment.GetEnvironmentVariable("W3C_INPUT");
retVal.OutputFolder = Environment.GetEnvironmentVariable("W3C_OUTPUT");
retVal.DeleteOutputFolder = true;
return retVal;
}
}

Get Data into Hadoop
Linux shell commands to access data in HDFS
Put file in HDFS: hadoop fs -put sales.csv /import/sales.csv
List files in HDFS:
c:Hadoop>hadoop fs -ls /import
Found 1 items
-rw-r--r-- 1 makromer supergroup 114 2013-05-07 12:11 /import/sales.csv
View file in HDFS:
c:Hadoop>hadoop fs -cat /import/sales.csv
Kromer,123,5,55
Smith,567,1,25
Jones,123,9,99
James,11,12,1
Johnson,456,2,2.5
Singh,456,1,3.25
Yu,123,1,11
Now, we can work on the data with MapReduce, Hive, Pig, etc.

Use Hive for Data Schema and Analysis
create external table ext_sales
(
lastname string,
productid int,
quantity int,
sales_amount float
)
row format delimited fields terminated by ',' stored as textfile location
'/user/makromer/hiveext/input';
LOAD DATA INPATH '/user/makromer/import/sales.csv' OVERWRITE INTO TABLE ext_sales;

Sqoop
Data transfer to & from Hadoop & SQL Server
sqoop import –connect jdbc:sqlserver://localhost –username sqoop -password password –table customers -m 1
> hadoop fs -cat /user/mark/customers/part-m-00000
> 5,Bob Smith
sqoop export –connect jdbc:sqlserver://localhost –username sqoop -password password -m 1 –table customers –export-dir
/user/mark/data/employees3
12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Transferred 201 bytes in 32.6364 seconds (6.1588 bytes/sec)
12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Exported 4 records.

Role of NoSQL in a Big Data Analytics Solution
‣ Use NoSQL to store data quickly without the overhead of RDBMS
‣ Hbase, Plain Old HDFS, Cassandra, MongoDB, Dynamo, just to name a few
‣ Why NoSQL?
‣ In the world of “Big Data”
‣ “Schema later”
‣ Ignore ACID properties
‣ Drop data into key-value store quick & dirty
‣ Worry about query & read later
‣ Why NOT NoSQL?
‣ In the world of Big Data Analytics, you will need support from analytical tools with a SQL, SAS, MR interface
‣ SQL Server and NoSQL
‣ Not a natural fit
‣ Use HDFS or your favorite NoSQL database
‣ Consider turning off SQL Server locking mechanisms
‣ Focus on writes, not reads (read uncommitted)

MongoDB and Enterprise IT Stack
Applications
CRM, ERP, Collaboration, Mobile, BI
Data Management
Online Data Offline Data
Hadoop EDW
Management & Monitoring
Security & Auditing
RDBMS
RDBMS
Infrastructure
OS & Virtualization, Compute, Storage, Network

General document per customer per account
{
_id : ObjectId("4e2e3f92268cdda473b628f6"),
sourceIDs: {
ABCSystemIDPart1: 8397897,
ABCSystemIDPart2: 2937430,
ABCSystemIDPart3: 932018 }
accountType: “Checking”,
accountOwners: [
{ firstName : ”John",
lastName: “Smith”,
contactMethods: [
{ type: “phone”, subtype: “mobile”, number: 8743927394},
{ type: “mail”, address: “58 3rd St.”, city: …} ]
possibleMatchCriteria: {
govtID: 2938932432, fullName: “johnsmith”, dob: … } },
{ firstName : ”Anne",
maidenName: “Collins”,
lastName: “Smith”, …} ],
openDate: ISODate("2013-02-15 10:00”),
accountFeatures { Overdraft: true, APR: 20, … }
}
OR creditCardNumber: 8392384938391293
OR mortgageID: 2374389
OR policyID: 18374923

Text Search Example
(e.g. address typo so do fuzzy match)
// Text search for address filtered by first name and NY
> db.ticks.runCommand(
“text”,
{ search: “vanderbilt ave. vander bilt”,
filter: {name: “Smith”,
city: “New York”} })

Aggregate: Total Value of Accounts
//Find total value of each customer’s accounts for a given RM (or Agent) sorted by value
db.accts.aggregate(
{ $match: {relationshipManager: “Smith”}},
{ $group :
{ _id : “$ssn”,
totalValue: {$sum: ”$value”} }},
{ $sort: { totalValue: -1}} )

SQL Server Big Data – Data Loading
Amazon HDFS & EMR Data Loading
Amazon S3 Bucket

SQL Server Big Data Environment
SQL Server Database
❯ SQL 2012 Enterprise Edition
❯ Page Compression
❯ 2012 Columnar Compression on Fact Tables
❯ Clustered Index on all tables
❯ Auto-update Stats Asynch
❯ Partition Fact Tables by month and archive data with sliding window technique
❯ Drop all indexes before nightly ETL load jobs
❯ Rebuild all indexes when ETL completes
SQL Server Analysis Services
❯ SSAS 2012 Enterprise Edition
❯ 2008 R2 OLAP cubes partition-aligned with DW
❯ 2012 cubes in-memory tabular cubes
❯ All access through MSMDPUMP or SharePoint

SQL Server Big Data Analytics Features
Columnstore
Sqoop adapter
PolyBase
Hive
In-memory analytics
Scale-out MPP
SQL Server APS

Pentaho Big Data Analytics
DBA ETL/BI Developer Business Users & Executives Analysts & Data Scientists
Enterprise &
Interactive
Reporting
Pentaho Business Analytics
Interactive
Analysis
Dashboards Predictive
Analytics
DIRECT ACCESS
Data Integration
Instaview | Visual Map Reduce
OPERATIONAL DATA BIG DATA PUBLIC/PRIVATE CLOUDS DATA STREAM

Pentaho Big Data Analytics
Accelerate the time to big data value
• Full continuity from data
access to decisions –
complete data integration &
analytics for any big data
store
• Faster development,
faster runtime – visual
development, distributed
execution
• Instant and interactive
analysis – no coding and
no ETL required

Product Components
• Visual data exploration
• Ad hoc analysis
• Interactive charts & visualizations
Pentaho Data Integration
• Visual development for big data
• Broad connectivity
• Data quality & enrichment
• Integrated scheduling
• Security integration
Pentaho Dashboards
• Self-service dashboard builder
• Content linking & drill through
• Highly customized mash-ups
Pentaho Data Mining &
Predictive Analytics
• Model construction & evaluation
• Learning schemes
• Integration with 3rd part models
using PMML
Pentaho Enterprise &
Interactive Reports
• Both ad hoc & distributed reporting
• Drag & drop interactive reporting
• Pixel-perfect enterprise reports
Pentaho for Big Data
MapReduce & Instaview
• Visual Interface for Developing
MR
• Self-service big data discovery
• Big data access to Data Analysts
Pentaho Analyzer

Pentaho Interactive Analysis & Data Discovery
Highly Flexible Advanced Visualizations
❯ Simple, easy-to-use visual data exploration
❯ Web-based thin client; in-memory caching
❯ Rich library of interactive visualizations
• Geo-mapping, heat grids, scatter plots, bubble
charts, line over bar and more
• Pluggable visualizations
❯ Java ROLAP engine to analyze structured and
unstructured data, with SQL dialects for querying
data from RDBMs
❯ Pluggable cache integrating with leading caching
architectures: Infinispan (JBoss Data Grid) &
Memcached

SQL Server and Big Data Projects in the Real World

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to SQL Server and Big Data Projects in the Real World

Similar to SQL Server and Big Data Projects in the Real World (20)

More from Mark Kromer

More from Mark Kromer (20)

Recently uploaded

Recently uploaded (20)

SQL Server and Big Data Projects in the Real World