Big Data Analytics Projects
in the Real World
Mark Kromer
Pentaho Big Data Analytics Product Manager
@mssqldude
@kromerbigdata
http://www.kromerbigdata.com
1. The Big Data Technology Landscape
2. Big Data Analytics
3. Big Data Analytics Scenarios:
❯ Digital Marketing Analytics
• Hadoop, Aster Data, SQL Server
❯ Sentiment Analysis
• MongoDB, SQL Server
❯ Data Refinery
• Hadoop, MPP, SQL Server, Pentaho
4. SQL Server in the Big Data world (Quasi-Real World)
What we’ll (try to) cover today
Big Data 101
3 V’s
❯ Volume – Terabyte records, transactions, tables, files
❯ Velocity – Batch, near-time, real-time (analytics), streams.
❯ Variety – Structured, unstructured, semi-structured, and all of the above in a mix
Text Processing
❯ Techniques for processing and analyzing unstructured (and structured) LARGE files
Analytics & Insights
Distributed File System & Programming
Big Data ≠ NoSQL
❯ NoSQL shares the Internet-scale Web origins of the Hadoop stack (Yahoo!, Google, Facebook, et al.) but is not the same thing
❯ Facebook, for example, uses HBase from the Hadoop stack
❯ NoSQL does not have to be Big Data
Big Data ≠ Real Time
❯ Big Data is primarily about batch processing huge files in a distributed manner and analyzing data that was
otherwise too complex to provide value
❯ Use in-memory analytics for real time insights
Big Data ≠ Data Warehouse
❯ I still refer to large multi-TB DWs as “VLDB”
❯ Big Data is about crunching stats in text files for discovery of new patterns and insights
❯ Use the DW to aggregate and store the summaries of those calculations for reporting
Mark’s Big Data Myths
• Batch Processing
• Commodity Hardware
• Data Locality, no shared
storage
• Scales linearly
• Great for large text file
processing, not so great on
small files
• Distributed programming
paradigm
Hadoop 1.x
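The batch, distributed programming paradigm listed above can be illustrated with a minimal single-process sketch of the map, shuffle and reduce phases. This is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, run_job) and the sample input are assumptions made purely for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values seen for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    # Each line is mapped independently of the others, which is what lets
    # a real cluster spread the work across nodes holding different blocks.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical sample input standing in for a large text file.
sample = ["big data big files", "big clusters"]
counts = run_job(sample)
print(counts)
```

In a real cluster the map calls run on the nodes that already hold the file blocks (data locality), and only the grouped intermediate pairs move across the network.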
© Hortonworks Inc. 2014
Hadoop 1 vs Hadoop 2
HADOOP 1.0: Single Use System (Batch Apps)
• MapReduce (cluster resource management & data processing)
• HDFS (redundant, reliable storage)
HADOOP 2.0: Multi Purpose Platform (Batch, Interactive, Online, Streaming, …)
• MapReduce (data processing), Others
• YARN (cluster resource management)
• HDFS2 (redundant, highly-available & reliable storage)
YARN: Taking Hadoop Beyond Batch
Applications Run Natively in Hadoop
• BATCH (MapReduce)
• INTERACTIVE (Tez)
• STREAMING (Storm, S4, …)
• GRAPH (Giraph)
• IN-MEMORY (Spark)
• HPC MPI (OpenMPI)
• ONLINE (HBase)
• OTHER (Search, Weave, …)
YARN (Cluster Resource Management)
HDFS2 (Redundant, Reliable Storage)
Store ALL DATA in one place…
Interact with that data in MULTIPLE WAYS
with Predictable Performance and Quality of Service
Tez – Introduction
1. Distributed execution framework targeted towards data-processing applications.
2. Based on expressing a computation as a dataflow graph.
3. Highly customizable to meet a broad spectrum of use cases.
4. Built on top of YARN – the resource management framework for Hadoop.
5. Open source Apache incubator project and Apache licensed.
Tez – Deep Dive – DAG API
DAG dag = new DAG();
Vertex map1 = new Vertex(MapProcessor.class);
Vertex map2 = new Vertex(MapProcessor.class);
Vertex reduce1 = new Vertex(ReduceProcessor.class);
Vertex reduce2 = new Vertex(ReduceProcessor.class);
Vertex join1 = new Vertex(JoinProcessor.class);
…….
Edge edge1 = new Edge(map1, reduce1, SCATTER_GATHER, PERSISTED, SEQUENTIAL,
    MOutput.class, RInput.class);
Edge edge2 = new Edge(map2, reduce2, SCATTER_GATHER, PERSISTED, SEQUENTIAL,
    MOutput.class, RInput.class);
Edge edge3 = new Edge(reduce1, join1, SCATTER_GATHER, PERSISTED, SEQUENTIAL,
    MOutput.class, RInput.class);
Edge edge4 = new Edge(reduce2, join1, SCATTER_GATHER, PERSISTED, SEQUENTIAL,
    MOutput.class, RInput.class);
…….
dag.addVertex(map1).addVertex(map2)
.addVertex(reduce1).addVertex(reduce2)
.addVertex(join1)
.addEdge(edge1).addEdge(edge2)
.addEdge(edge3).addEdge(edge4);
[Figure: DAG with vertices map1, map2, reduce1, reduce2 and join1; edges labeled Scatter_Gather, Bipartite, Sequential]
Simple DAG definition API
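To make the dataflow-graph idea concrete, here is a small sketch that models the same five vertices as a dependency graph and derives a valid execution order. It uses Python's standard graphlib module (Python 3.9 or later), not the Tez API; the dictionary layout and the execution_order helper are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# Each vertex maps to the set of vertices whose output it consumes.
dag = {
    "map1": set(),
    "map2": set(),
    "reduce1": {"map1"},
    "reduce2": {"map2"},
    "join1": {"reduce1", "reduce2"},
}


def execution_order(graph):
    """Return one valid order in which the vertices could be scheduled."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(dag)
print(order)
```

Any order in which every vertex appears after its inputs is valid, which is why the two map/reduce branches can run in parallel before join1.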
YARN Eco-system
Applications Powered by YARN
Apache Giraph – Graph Processing
Apache Hama – BSP
Apache Hadoop MapReduce – Batch
Apache Tez – Batch/Interactive
Apache S4 – Stream Processing
Apache Samza – Stream Processing
Apache Storm – Stream Processing
Apache Spark – Iterative applications
Elasticsearch – Scalable Search
Cloudera Llama – Impala on YARN
DataTorrent – Data Analysis
HOYA – HBase on YARN
Frameworks Powered By YARN
Apache Twill
REEF by Microsoft
Spring support for Hadoop 2
Apache Spark
High-Speed In-Memory Analytics over Hadoop
● Open source
● Alternative to MapReduce for certain applications
● A low-latency cluster computing system
● For very large data sets
● Can be up to 100 times faster than MapReduce for
– Iterative algorithms
– Interactive data mining
● Used with Hadoop / HDFS
● Released under the Apache License 2.0 (early releases were BSD-licensed)
Popular Hadoop Distributions
Popular NoSQL Distributions
Transactional-based, not analytical schemas
Popular MPP Distributions
Big Data as distributed, scale-out, sharded data stores
Big Data Analytics Web Platform – RA 1
Sentiment Analysis
Reference Architecture 2
• Big Data Platforms: Hadoop, PDW, MongoDB
• Social Media Sources
• Data Orchestration
• Data Models
• Analytical Models: OLAP Cubes, Data Mining
• OLAP Analytics Tools, Reporting Tools, Dashboards
Big Data Analytics
• Distributed Data (Data Locality)
❯ HDFS / MapReduce
❯ YARN / TEZ
❯ Replicated / Sharded Data
• MPP Databases
❯ Vertica, Aster, PDW, Greenplum … In-database analytics that can scale out with distributed processing across nodes
• Distributed Analytics
❯ SAS: “Quickly solve complex problems using big data and sophisticated analytics in a distributed, in-memory and parallel environment.”
http://www.sas.com/resources/whitepaper/wp_46345.pdf
• In-memory Analytics
❯ Microsoft PowerPivot (Tabular models)
❯ SAP HANA
❯ Tableau
Big Data Analytics
Core Tenets
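using Microsoft.Hadoop.MapReduce;
using System.Text.RegularExpressions;

public class TotalHitsForPageMap : MapperBase
{
    // expected, pagePos and hit are not defined on the slide; they are assumed
    // to be constants declared elsewhere (field count, index of the page
    // field, and the value emitted per hit).
    public override void Map(string inputLine, MapperContext context)
    {
        context.Log(inputLine);
        // Split the log line on runs of whitespace.
        var parts = Regex.Split(inputLine, @"\s+");
        if (parts.Length != expected) // only take records with all values
        {
            return;
        }
        context.EmitKeyValue(parts[pagePos], hit);
    }
}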
MapReduce Framework (Map)
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Hadoop.MapReduce;

public class TotalHitsForPageReducerCombiner : ReducerCombinerBase
{
    // Sum the hit counts emitted by the mapper for one page.
    public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
    {
        context.EmitKeyValue(key, values.Sum(e => long.Parse(e)).ToString());
    }
}

public class TotalHitsJob : HadoopJob<TotalHitsForPageMap, TotalHitsForPageReducerCombiner>
{
    // Input and output locations are read from environment variables.
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        var retVal = new HadoopJobConfiguration();
        retVal.InputPath = Environment.GetEnvironmentVariable("W3C_INPUT");
        retVal.OutputFolder = Environment.GetEnvironmentVariable("W3C_OUTPUT");
        retVal.DeleteOutputFolder = true;
        return retVal;
    }
}
MapReduce Framework (Reduce & Job)
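The mapper and reducer above can be traced locally with a small sketch that follows the same steps: split each line on whitespace, drop incomplete records, emit the page with a hit value, then sum per page. The constants EXPECTED_FIELDS and PAGE_POS and the three-field sample log are assumptions, since the slides do not show the record layout.

```python
import re
from collections import defaultdict

EXPECTED_FIELDS = 3  # assumed number of fields in a complete record
PAGE_POS = 1         # assumed index of the page field


def map_line(line):
    """Mirror of Map: emit (page, "1") for complete records only."""
    parts = re.split(r"\s+", line.strip())
    if len(parts) != EXPECTED_FIELDS:
        return []
    return [(parts[PAGE_POS], "1")]


def reduce_hits(key, values):
    """Mirror of Reduce: parse the string values and sum them."""
    return key, str(sum(int(v) for v in values))


def total_hits(lines):
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            grouped[key].append(value)
    return dict(reduce_hits(k, vs) for k, vs in grouped.items())


# Hypothetical log lines; the second-to-last record is incomplete and skipped.
log = [
    "2013-05-07 /index.html 200",
    "2013-05-07 /about.html 200",
    "malformed-record",
    "2013-05-08 /index.html 200",
]
hits = total_hits(log)
print(hits)
```

Values are kept as strings between the two phases because the C# framework shown above passes keys and values as strings.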
Linux-style shell commands to access data in HDFS
Put file in HDFS: hadoop fs -put sales.csv /import/sales.csv
List files in HDFS:
c:\Hadoop>hadoop fs -ls /import
Found 1 items
-rw-r--r-- 1 makromer supergroup 114 2013-05-07 12:11 /import/sales.csv
View file in HDFS:
c:\Hadoop>hadoop fs -cat /import/sales.csv
Kromer,123,5,55
Smith,567,1,25
Jones,123,9,99
James,11,12,1
Johnson,456,2,2.5
Singh,456,1,3.25
Yu,123,1,11
Now, we can work on the data with MapReduce, Hive, Pig, etc.
Get Data into Hadoop
create external table ext_sales
(
    lastname string,
    productid int,
    quantity int,
    sales_amount float
)
row format delimited fields terminated by ','
stored as textfile
location '/user/makromer/hiveext/input';

LOAD DATA INPATH '/user/makromer/import/sales.csv' OVERWRITE INTO TABLE ext_sales;
Use Hive for Data Schema and Analysis
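As a rough illustration of what the Hive schema buys you, the sketch below applies the same four typed columns to the sample sales.csv rows and totals sales_amount per productid, roughly what a GROUP BY query over ext_sales would return. The query itself is not shown on the slides, and the helper names (read_rows, sales_by_product) are assumptions.

```python
import csv
import io
from collections import defaultdict

# Sample rows from sales.csv as shown on the earlier slide.
SALES_CSV = """Kromer,123,5,55
Smith,567,1,25
Jones,123,9,99
James,11,12,1
Johnson,456,2,2.5
Singh,456,1,3.25
Yu,123,1,11
"""

# Column names and types taken from the ext_sales table definition.
COLUMNS = [("lastname", str), ("productid", int),
           ("quantity", int), ("sales_amount", float)]


def read_rows(text):
    """Apply the schema to each comma-delimited record."""
    for record in csv.reader(io.StringIO(text)):
        yield {name: cast(value) for (name, cast), value in zip(COLUMNS, record)}


def sales_by_product(rows):
    """Total sales_amount per productid."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["productid"]] += row["sales_amount"]
    return dict(totals)


totals = sales_by_product(read_rows(SALES_CSV))
print(totals)
```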
sqoop import --connect jdbc:sqlserver://localhost --username sqoop --password password --table customers -m 1
> hadoop fs -cat /user/mark/customers/part-m-00000
> 5,Bob Smith
sqoop export --connect jdbc:sqlserver://localhost --username sqoop --password password -m 1 --table customers --export-dir /user/mark/data/employees3
12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Transferred 201 bytes in 32.6364 seconds (6.1588 bytes/sec)
12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Exported 4 records.
Sqoop
Data transfer to & from Hadoop & SQL Server
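When these transfers are scripted, it can help to build the Sqoop argument list in one place. The sketch below only assembles the arguments and does not run anything; the sqoop_command helper is hypothetical, and it uses Sqoop's -P option (prompt for the password on the console) instead of putting the password on the command line.

```python
def sqoop_command(direction, connect, username, table, mappers=1, export_dir=None):
    """Build the argument list for a Sqoop import or export (not executed)."""
    if direction not in ("import", "export"):
        raise ValueError("direction must be 'import' or 'export'")
    args = [
        "sqoop", direction,
        "--connect", connect,
        "--username", username,
        "-P",                      # prompt for the password at run time
        "--table", table,
        "-m", str(mappers),        # number of parallel map tasks
    ]
    if direction == "export":
        if export_dir is None:
            raise ValueError("export requires export_dir")
        args += ["--export-dir", export_dir]
    return args


cmd = sqoop_command(
    "export",
    connect="jdbc:sqlserver://localhost",
    username="sqoop",
    table="customers",
    export_dir="/user/mark/data/employees3",
)
print(" ".join(cmd))
```

The resulting list could be handed to subprocess.run on a machine where Sqoop is installed.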
Role of NoSQL in a Big Data Analytics Solution
‣ Use NoSQL to store data quickly without the overhead of an RDBMS
  ‣ HBase, plain old HDFS, Cassandra, MongoDB, Dynamo, just to name a few
‣ Why NoSQL?
  ‣ In the world of “Big Data”
    ‣ “Schema later”
    ‣ Ignore ACID properties
    ‣ Drop data into a key-value store quick & dirty
    ‣ Worry about query & read later
‣ Why NOT NoSQL?
  ‣ In the world of Big Data Analytics, you will need support from analytical tools with a SQL, SAS, or MR interface
‣ SQL Server and NoSQL
  ‣ Not a natural fit
  ‣ Use HDFS or your favorite NoSQL database
  ‣ Consider turning off SQL Server locking mechanisms
  ‣ Focus on writes, not reads (read uncommitted)
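The "schema later" idea above can be sketched with an in-memory dictionary standing in for a key-value store: writes keep the raw document untouched, and a schema is applied only when the data is read. The store, the key names and the sample documents are all hypothetical.

```python
import json

# A plain dict stands in for the key-value store.
store = {}


def put(key, raw_json):
    """Write path: keep the raw text as-is, with no validation or schema."""
    store[key] = raw_json


def read_with_schema(key, fields):
    """Read path: parse the document and project the requested fields.

    Fields missing from a document come back as None instead of failing.
    """
    doc = json.loads(store[key])
    return {field: doc.get(field) for field in fields}


# Two documents with different shapes are accepted without complaint.
put("tweet:1", '{"user": "mark", "text": "big data", "lang": "en"}')
put("tweet:2", '{"user": "anne", "text": "hadoop"}')

rows = [read_with_schema(key, ["user", "lang"]) for key in sorted(store)]
print(rows)
```

The trade-off is visible in the read path: writes are cheap, but every query has to cope with whatever shapes were stored.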
MongoDB and Enterprise IT Stack
• Applications: CRM, ERP, Collaboration, Mobile, BI
• Data Management:
❯ Online Data: RDBMS
❯ Offline Data: EDW, Hadoop
• Infrastructure: OS & Virtualization, Compute, Storage, Network
• Across all layers: Management & Monitoring, Security & Auditing
{
  _id: ObjectId("4e2e3f92268cdda473b628f6"),
  sourceIDs: {
    ABCSystemIDPart1: 8397897,
    ABCSystemIDPart2: 2937430,
    ABCSystemIDPart3: 932018 },
  accountType: "Checking",
  accountOwners: [
    { firstName: "John",
      lastName: "Smith",
      contactMethods: [
        { type: "phone", subtype: "mobile", number: 8743927394 },
        { type: "mail", address: "58 3rd St.", city: … } ],
      possibleMatchCriteria: {
        govtID: 2938932432, fullName: "johnsmith", dob: … } },
    { firstName: "Anne",
      maidenName: "Collins",
      lastName: "Smith", … } ],
  openDate: ISODate("2013-02-15 10:00"),
  accountFeatures: { Overdraft: true, APR: 20, … }
}
General document per customer per account
OR creditCardNumber: 8392384938391293
OR mortgageID: 2374389
OR policyID: 18374923
Text Search Example
(e.g. address typo so do fuzzy match)
// Text search for address filtered by first name and NY
> db.ticks.runCommand(
    "text",
    { search: "vanderbilt ave. vander bilt",
      filter: { name: "Smith",
                city: "New York" } })
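For intuition about matching an address despite typos, here is a small sketch that scores candidates by string similarity. It uses Python's difflib.SequenceMatcher, which is a different technique from MongoDB's text search; the 0.8 threshold, the helper names and the sample addresses are assumptions.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and drop spaces and punctuation before comparing."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def similarity(a, b):
    """Similarity ratio between 0.0 and 1.0 on the normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def fuzzy_find(query, candidates, threshold=0.8):
    """Return candidates at or above the threshold, best match first."""
    scored = [(similarity(query, c), c) for c in candidates]
    return [c for score, c in sorted(scored, reverse=True) if score >= threshold]


# Hypothetical addresses: one clean, one split, one misspelled, one unrelated.
addresses = [
    "12 Vanderbilt Ave.",
    "12 Vander Bilt Ave",
    "12 Vanderbuilt Ave",
    "58 3rd St.",
]
matches = fuzzy_find("12 vanderbilt ave", addresses)
print(matches)
```

Normalizing first makes "Vander Bilt" and "Vanderbilt" identical, so the similarity score only has to absorb genuine misspellings.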
// Find total value of each customer's accounts for a given RM (or Agent), sorted by value
db.accts.aggregate(
    { $match: { relationshipManager: "Smith" } },
    { $group:
        { _id: "$ssn",
          totalValue: { $sum: "$value" } } },
    { $sort: { totalValue: -1 } } )
Aggregate: Total Value of Accounts
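The three pipeline stages map onto a filter, a grouped sum and a sort. The sketch below reproduces that logic over a list of dictionaries so each stage can be followed by hand; the sample account documents and the total_value_by_customer helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical account documents with the fields used by the pipeline.
accts = [
    {"ssn": "111", "relationshipManager": "Smith", "value": 100},
    {"ssn": "111", "relationshipManager": "Smith", "value": 250},
    {"ssn": "222", "relationshipManager": "Smith", "value": 500},
    {"ssn": "333", "relationshipManager": "Jones", "value": 900},
]


def total_value_by_customer(docs, manager):
    totals = defaultdict(int)
    for doc in docs:
        if doc["relationshipManager"] == manager:   # $match
            totals[doc["ssn"]] += doc["value"]      # $group with $sum
    # $sort: { totalValue: -1 } means descending by total value.
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [{"_id": ssn, "totalValue": total} for ssn, total in ranked]


result = total_value_by_customer(accts, "Smith")
print(result)
```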
SQL Server Big Data – Data Loading
Amazon HDFS & EMR Data Loading
Amazon S3 Bucket
SQL Server Database
❯ SQL 2012 Enterprise Edition
❯ Page Compression
❯ 2012 Columnar Compression on Fact Tables
❯ Clustered Index on all tables
❯ Auto-update statistics asynchronously
❯ Partition Fact Tables by month and archive data with sliding window technique
❯ Drop all indexes before nightly ETL load jobs
❯ Rebuild all indexes when ETL completes
SQL Server Analysis Services
❯ SSAS 2012 Enterprise Edition
❯ 2008 R2 OLAP cubes partition-aligned with DW
❯ 2012 cubes as in-memory tabular models
❯ All access through MSMDPUMP or SharePoint
SQL Server Big Data Environment
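The monthly sliding-window technique mentioned above comes down to deciding, each month, which partitions stay online and which are archived. The sketch below shows only that decision logic, not the T-SQL partition switching itself; the three-month retention window, the partition naming and the helper names are assumptions.

```python
from datetime import date


def month_key(d):
    """Number months consecutively so they can be compared and subtracted."""
    return d.year * 12 + (d.month - 1)


def partition_name(d):
    return f"{d.year:04d}-{d.month:02d}"


def sliding_window(fact_dates, today, months_to_keep):
    """Split monthly partitions into those kept online and those to archive."""
    cutoff = month_key(today) - months_to_keep + 1
    keep, archive = set(), set()
    for d in fact_dates:
        target = keep if month_key(d) >= cutoff else archive
        target.add(partition_name(d))
    return sorted(keep), sorted(archive)


# Hypothetical fact-row dates spanning five months.
dates = [date(2013, 1, 15), date(2013, 2, 3), date(2013, 4, 20), date(2013, 5, 7)]
keep, archive = sliding_window(dates, today=date(2013, 5, 31), months_to_keep=3)
print(keep, archive)
```

With a three-month window ending in May 2013, March through May stay online, so the January and February partitions are the ones to archive.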
SQL Server Big Data Analytics Features
Users: DBA, ETL/BI Developer, Business Users & Executives, Analysts & Data Scientists
Pentaho Business Analytics: Enterprise & Interactive Reporting, Interactive Analysis, Dashboards, Predictive Analytics
Data Integration: Instaview | Visual MapReduce
DIRECT ACCESS
Data sources: OPERATIONAL DATA, BIG DATA, DATA STREAM, PUBLIC/PRIVATE CLOUDS
Pentaho Big Data Analytics
Pentaho Big Data Analytics
Accelerate the time to big data value
• Full continuity from data access to decisions – complete data integration & analytics for any big data store
• Faster development, faster runtime – visual development, distributed execution
• Instant and interactive analysis – no coding and no ETL required
Product Components
Pentaho Data Integration
• Visual development for big data
• Broad connectivity
• Data quality & enrichment
• Integrated scheduling
• Security integration
Pentaho Analyzer
• Visual data exploration
• Ad hoc analysis
• Interactive charts & visualizations
Pentaho Dashboards
• Self-service dashboard builder
• Content linking & drill through
• Highly customized mash-ups
Pentaho Data Mining & Predictive Analytics
• Model construction & evaluation
• Learning schemes
• Integration with 3rd-party models using PMML
Pentaho Enterprise & Interactive Reports
• Both ad hoc & distributed reporting
• Drag & drop interactive reporting
• Pixel-perfect enterprise reports
Pentaho for Big Data MapReduce & Instaview
• Visual interface for developing MR
• Self-service big data discovery
• Big data access to Data Analysts
❯ Simple, easy-to-use visual data exploration
❯ Web-based thin client; in-memory caching
❯ Rich library of interactive visualizations
• Geo-mapping, heat grids, scatter plots, bubble charts, line over bar and more
• Pluggable visualizations
❯ Java ROLAP engine to analyze structured and unstructured data, with SQL dialects for querying data from RDBMSs
❯ Pluggable cache integrating with leading caching architectures: Infinispan (JBoss Data Grid) & Memcached
Pentaho Interactive Analysis & Data Discovery
Highly Flexible Advanced Visualizations
