© 2014 MapR Technologies
Hadoop and NoSQL Joining Forces
Topics 
Big Data, Hadoop, and NoSQL 
The In-Hadoop Advantage 
NoSQL-on-Hadoop in Action 
Other In-Hadoop Examples 
Integrating with SQL
Big Data is Overwhelming Traditional Systems 
Enterprise Data Architecture
OPERATIONAL SYSTEMS: production requirements
• Mission-critical reliability
• Transaction guarantees
• Deep security
• Real-time performance
• Backup and recovery
ANALYTICAL SYSTEMS: production requirements
• Interactive SQL
• Rich analytics
• Workload management
• Data governance
• Backup and recovery
ENTERPRISE USERS
OUTSIDE SOURCES
Scaling on Traditional Technologies
• Data volume, velocity (low to high): scale up to bigger, faster machines
• Data variety (low to high): extensive data modeling and ETL
Scaling on Newer Technologies
• Data volume, velocity (low to high): scale out with commodity hardware
• Data variety (low to high): use the right tool for unstructured, multi-structured, semi-structured, non-relational data
Chart label: NoSQL
Hadoop and NoSQL Relieve the Pressure from Enterprise Systems 
Hadoop + NoSQL
• Data staging
• Archive
• Data transformation
• Data exploration
• Streaming, interactions
OPERATIONAL SYSTEMS
ANALYTICAL SYSTEMS
ENTERPRISE USERS
Keys for Production Success
1 Reliability and DR
2 Interoperability
3 High performance
4 Supports operations and analytics
You Already Know:
• NoSQL is a class of databases that specialize in: 
– Scale-out on commodity servers – no application-level sharding 
– Flexible data models – no fixed schema required 
• Hadoop is a distributed platform designed for: 
– Storing/processing huge volumes of data cost-effectively 
– Spreading work across many servers (“divide and conquer”) 
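As a rough illustration of the "divide and conquer" idea and the flexible data model mentioned above, the following Python sketch spreads schema-free records across two stand-in "servers", computes a partial count on each, and merges the partial results. The function names (`partition`, `map_partition`, `reduce_counts`) and the sample records are hypothetical and are not part of Hadoop or any NoSQL product.

```python
from collections import Counter
from functools import reduce

# Records with a flexible data model: not every record has the same fields.
records = [
    {"user": "a", "page": "/home"},
    {"user": "b", "page": "/cart", "referrer": "ad"},
    {"user": "a", "page": "/cart"},
    {"user": "c", "page": "/home", "device": "mobile"},
]

def partition(items, n):
    """Spread items across n partitions by position (a stand-in for servers)."""
    parts = [[] for _ in range(n)]
    for i, item in enumerate(items):
        parts[i % n].append(item)
    return parts

def map_partition(part):
    """Each 'server' counts page views in its own partition only."""
    return Counter(r["page"] for r in part)

def reduce_counts(a, b):
    """Merge two partial counts into one."""
    return a + b

partials = [map_partition(p) for p in partition(records, 2)]
page_views = reduce(reduce_counts, partials, Counter())
print(dict(page_views))
```

Each partition is processed independently, so adding servers in this toy model only means raising `n`; real systems also have to handle data placement and failures, which this sketch leaves out.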
Before we continue, let’s take a quick look back:
What Would (Did) Google Do?
Google’s operational data store (BigTable) has enabled multiple revolutions within the company:
• The explosion in operational applications
• The transition from batch to real-time
Timeline:
• 2003: GFS
• 2004: MapReduce
• 2006: BigTable
• 2004: Web index is batch (GFS/MapReduce)
• 2010: Web index is real-time (BigTable)
Operations Vs. Analytics 
Operations (Databases) 
• Real-time 
• Reads/writes/updates 
• Current/recent data 
• Updated regularly 
• Fast inserts/updates 
• Large volumes of data 
Analytics (Hadoop) 
• Batch 
• Reports/Computations 
• Historical data 
• Generally non-volatile 
• Fast retrievals 
• Even larger volumes of data 
But is the data different?
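One way to picture the question above is a small Python sketch in which a single shared data set serves both an operational access pattern (point lookup and update) and an analytical one (a scan that produces a report). The customer records and function names are invented for illustration only.

```python
from statistics import mean

# One shared data set (hypothetical customer records keyed by id).
customers = {
    "c1": {"plan": "basic", "monthly_spend": 20.0},
    "c2": {"plan": "premium", "monthly_spend": 55.0},
    "c3": {"plan": "basic", "monthly_spend": 25.0},
}

def update_spend(customer_id, amount):
    """Operational access: fast update of a single current record."""
    customers[customer_id]["monthly_spend"] = amount

def lookup(customer_id):
    """Operational access: fast read of a single record."""
    return customers[customer_id]

def average_spend_by_plan():
    """Analytical access: scan all records to compute a report."""
    by_plan = {}
    for row in customers.values():
        by_plan.setdefault(row["plan"], []).append(row["monthly_spend"])
    return {plan: mean(values) for plan, values in by_plan.items()}

update_spend("c1", 30.0)
report = average_spend_by_plan()
print(lookup("c1"), report)
```

The report reflects the update immediately because both access patterns read the same records; with separate systems, the analytical side would see the change only after an export.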
Handling Multiple Workloads
• Operational: mobile application server and web application server, backed by an operational NoSQL DBMS
• Analytics: Hadoop, serving data exploration (SQL), the Customer 360 dashboard, and churn analysis (predictive analytics)
• Batch import/export between the operational and analytics sides
Mobile 
application server 
Product/service 
optimization and 
personalization 
Data exploration 
(SQL) 
Customer 360 
dashboard 
Churn analysis 
(predictive analytics) 
• Single cluster 
• High performance, low latency
• Large-scale analytics 
• Enterprise-grade HA/DR 
• Unified file and table administration
Real-time ad 
targeting 
Real-Time and Operational 
Actionable 
Analytics 
Web 
application server 
In-Hadoop Databases
Separate Clusters Versus Single Cluster 
Separate Hadoop and Database 
• Delays analyzing live data 
• Network traffic 
– Heavy bandwidth usage 
– Heavy cleanup upon error 
• Complexity 
– Higher maintenance, risk of error 
– More HA/DR administration 
– Risk to SLAs 
• Unnecessarily duplicated resources
Consolidated Deployment 
• Real-time analysis/computation 
• Data locality 
– Reduced bandwidth utilization 
– Efficient divide-and-conquer analysis 
• Architectural simplicity 
– Lower risk of error 
– Lower administrative overhead 
• No unnecessary data/hardware duplication (except for HA/DR)
Databases on Direct Attached Storage (DAS) 
Advantages 
• Fast local file access 
• Lower cost vs. SAN/NAS 
Databases on Networked Storage (SAN/NAS) 
Advantages 
• Snapshot/backup 
• Easy capacity expansion 
• Disaster recovery 
• Improved disk utilization 
• Seamless maintenance 
• Reliable 
Databases on Hadoop (“In-Hadoop”) 
Advantages 
• Benefits of DAS 
• Reduced complexity vs. SAN
• Lower operational cost 
• Faster local file access 
• Easy capacity expansion 
• Dynamic storage utilization 
Hadoop
Lambda Architecture (lambda-architecture.net)
• NEW DATA STREAM: feeds both the batch layer and the speed layer
• BATCH LAYER (Hadoop): all data (HDFS); batch recompute precomputes views (MapReduce), producing batch views
• SPEED LAYER (Storm): process stream and increment views, producing real-time views from real-time data (partial aggregates, real-time increment)
• SERVING LAYER (HBase): merge batch views and real-time views into the merged view
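The interplay of the batch, speed, and serving layers named on this slide can be sketched in a few lines of Python. This is a toy model under stated assumptions: plain in-memory counters stand in for Hadoop, Storm, and HBase, and the function names and event data are hypothetical.

```python
from collections import Counter

def batch_recompute(all_data):
    """Batch layer: recompute the view from the full data set."""
    return Counter(event["key"] for event in all_data)

def realtime_increment(view, event):
    """Speed layer: fold one new event into the real-time view."""
    view[event["key"]] += 1
    return view

def merged_view(batch_view, realtime_view):
    """Serving layer: merge the partial aggregates at query time."""
    return batch_view + realtime_view

all_data = [{"key": "ad-1"}, {"key": "ad-2"}, {"key": "ad-1"}]
new_data = [{"key": "ad-2"}, {"key": "ad-3"}]

batch_view = batch_recompute(all_data)
realtime_view = Counter()
for event in new_data:
    realtime_increment(realtime_view, event)

view = merged_view(batch_view, realtime_view)
print(dict(view))
```

In this model the real-time view covers only events that arrived after the last batch run, so the merge is a simple sum; a fuller treatment would also reset the real-time view once a new batch recompute has absorbed those events.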
Enterprise Data Hub Architecture
Steps: Load more data sources; Enrich data in Hadoop; Analyze; Offload / Enrich / Reload
Sources:
• Relational, SaaS, mainframe
• Documents, emails
• Blogs, tweets, link data
• Log files, clickstreams
Ingest paths: parse, profile, ETL; load; replicate, CDC; streaming; cleanse, match; high-speed streaming
MapR Distribution Including Hadoop:
• MapR Control System (MCS)
• Hadoop User Experience (HUE)
• Batch processing: MR, YARN, Hive, Pig, etc.
• Interactive querying: Drill, Impala, Presto, etc.
• HBase, other data stores
• MapR Data Platform: MapR-DB tables
Destinations: data marts, data warehouse, BI reports and applications
Customer data, network security event data
Anomaly detection on large volumes of security event data, analytics on customer data to enable incremental sales
Industry data analysis, 
SaaS-based reporting 
Advertising 
Automation 
Cloud 
Buyers 
Cloud 
Sales performance 
management data 
combined with fast 
responsiveness SaaS-delivered 
reports
Customer profile data, customer behavior data
Analytics on customer behavior for better recommendations
Telecommunications Company
MapR Overview 
BIG 
DATA 
BEST 
PRODUCT 
BUSINESS 
IMPACT 
Hadoop 
Top Ranked 
Production 
Success
The Power of the Open Source Community 
Provisioning & coordination
Savannah* 
Workflow & Data Governance
Data Integration & Access
Hue 
HttpFS 
Flume Knox* Falcon* 
Management
APACHE HADOOP AND OSS ECOSYSTEM 
Streaming 
Storm* 
NoSQL & Search
Solr 
MapR Data Platform 
Security 
SQL 
Drill* 
Shark 
Impala 
YARN 
Batch 
Spark 
Cascading 
Pig 
Spark Streaming
HBase 
Juju 
ML, Graph 
GraphX 
MLLib 
Mahout 
MapReduce v1 & v2
EXECUTION ENGINES DATA GOVERNANCE AND OPERATIONS 
Tez* 
Accumulo* 
Hive 
Sqoop Sentry* Oozie ZooKeeper 
MapR-DB MapR-FS 
* Certification/support planned for 2014
MapR-DB: Powerful NoSQL Integrated with Hadoop 
Benefit: Features
• High Performance: over 1 million ops/sec with 10 nodes, in-memory processing
• Continuous Low Latency: no I/O storms, no compaction delays
• 24x7 Applications: instant recovery, online schema modification, snapshots, mirroring
• Consistency: strong data consistency, row-level ACID transactions
• Simplified Database Administration: no processes to manage, automated splits, self-tuning
• High Scalability: 1 trillion tables, trillions of rows, millions of columns
• Low TCO: files and tables on one platform, more work with fewer nodes
Categories: Performance, Reliability, Easy Administration
MapR-DB (in MapR Enterprise Database Edition) 
MapR-DB
• NoSQL Table-Style Store
• Apache HBase API
• In-Hadoop Database
Storage stack, top to bottom:
• Other Distros: HBase, JVM, HDFS, JVM, ext3/ext4, Disks
• MapR: Tables/Files, Disks
Fast, scalable, reliable. 
HBase API, in-memory option, Hadoop integration.
Consistent, Low Read Latency 
Chart legend: MapR-DB read latency vs. others' read latency
Other In-Hadoop Database Technologies 
• Databases in Hadoop 
– Apache HBase 
– Apache Accumulo 
– Splice Machine 
– MarkLogic 
• Data Warehouses on Hadoop 
– HP Vertica 
– Pivotal HAWQ
What Other Trends? 
• SQL query engines 
– Apache Drill 
– Impala 
– Presto 
– Etc. 
• In-memory processing 
– GridGain 
– Apache Spark 
– HAMRTech
SQL Query Engines for Hadoop and NoSQL Together 
Impala
APACHE DRILL
• Pioneering Data Agility for Hadoop
• Apache open source project
• Scale-out execution engine for low-latency queries
• Unified SQL-based API for analytics & operational applications
Vibrant Community
• 40+ contributors
• 150+ years of experience building databases and distributed systems
Q & A
Engage with us! 
@mapr maprtech 
dalekim@mapr.com 
MapR 
maprtech 
mapr-technologies

Hadoop and NoSQL joining forces by Dale Kim of MapR
