The document discusses operationalizing data lakes by integrating MongoDB with Hadoop to enable both real-time and batch processing. MongoDB powers operational applications with low-latency access to analytics models generated from raw data stored in Hadoop, while Hadoop continues to provide batch processing and analytics over large datasets. By combining the two technologies, companies can unlock insights from their data lakes and avoid joining the roughly 70% of Hadoop projects that fail to meet their objectives because of skills and integration challenges.
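As a rough illustration of the serving side of this pattern (the collection, field names, and connection string are invented for the example, not taken from the document), an operational application reads a model that a Hadoop batch job has already written into MongoDB, so each request is just a low-latency indexed lookup:

```python
# Hypothetical sketch: serve batch-computed analytics from MongoDB with an indexed read.
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
scores = client.datalake.customer_scores

# One-time: index the lookup key so reads stay in the low-millisecond range.
scores.create_index([("customer_id", ASCENDING)])

def get_recommendations(customer_id):
    """Return the batch-computed recommendations for one customer, if any."""
    doc = scores.find_one(
        {"customer_id": customer_id},
        {"_id": 0, "recommended_products": 1, "churn_score": 1},
    )
    return doc or {}

print(get_recommendations("C-1042"))
```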
Creating a Modern Data Architecture for Digital Transformation (MongoDB)
By managing Data in Motion, Data at Rest, and Data in Use differently, modern Information Management Solutions are enabling a whole range of architecture and design patterns that allow enterprises to fully harness the value in data flowing through their systems. In this session we explored some of the patterns (e.g. operational data lakes, CQRS, microservices and containerisation) that enable CIOs, CDOs and senior architects to tame the data challenge, and start to use data as a cross-enterprise asset.
Webinar: Data Streaming with Apache Kafka & MongoDB (MongoDB)
A new generation of technologies is needed to consume and exploit today's real time, fast moving data sources. Apache Kafka, originally developed at LinkedIn, has emerged as one of these key new technologies.
The importance of efficient data management for Digital Transformation (MongoDB)
Digital transformation involves profoundly transforming business activities, processes, competencies, and models to leverage changes from digital technologies strategically. It requires new capabilities and data management maturity. There are three areas of data management: data in motion which involves transferring data between systems; data at rest which refers to how data is stored; and data in use which is about extracting, transforming and analyzing data. A modern data platform uses cloud native technologies to manage data in real-time across all three areas at massive scales.
Webinar: 10-Step Guide to Creating a Single View of your Business (MongoDB)
Organizations have long seen the value in aggregating data from multiple systems into a single, holistic, real-time representation of a business entity. That entity is often a customer. But the benefits of a single view in enhancing business visibility and operational intelligence can apply equally to other business contexts. Think products, supply chains, industrial machinery, cities, financial asset classes, and many more.
However, for many organizations, delivering a single view to the business has been elusive, impeded by a combination of technology and governance limitations.
MongoDB has been used in many single view projects across enterprises of all sizes and industries. In this session, we will share the best practices we have observed and institutionalized over the years. By attending the webinar, you will learn:
- A repeatable, 10-step methodology for successfully delivering a single view
- The required technology capabilities and tools to accelerate project delivery
- Case studies from customers who have built transformational single view applications on MongoDB.
- MongoDB is a document database management system that is recognized as a leader by Gartner. It has over 520 employees, 2500+ customers, and offices globally.
- MongoDB ranked 4th in database mindshare according to DB-Engines. It has seen 172% growth in the last 20 months.
- Several companies such as a quantitative investment manager, an insurance company, a telecommunications company, and an ecommerce company migrated their systems to MongoDB and saw benefits like 100x faster data retrieval, 50% lower costs, and being able to build applications faster.
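A minimal sketch of the single-view pattern described above, assuming two hypothetical source collections (`crm_customers` and `policy_system`) that share a `customer_id`; records from each source are folded into one customer document by upsert:

```python
# Illustrative only: merge records from two source systems into a single-view collection.
from pymongo import MongoClient, UpdateOne

db = MongoClient("mongodb://localhost:27017").single_view

ops = []
for crm in db.crm_customers.find():
    ops.append(UpdateOne(
        {"_id": crm["customer_id"]},
        {"$set": {"name": crm["name"], "email": crm["email"]}},
        upsert=True))
for pol in db.policy_system.find():
    ops.append(UpdateOne(
        {"_id": pol["customer_id"]},
        {"$addToSet": {"policies": {"policy_no": pol["policy_no"],
                                    "premium": pol["premium"]}}},
        upsert=True))

if ops:
    # One bulk round trip builds or enriches each customer_360 document.
    db.customer_360.bulk_write(ops)
```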
Big Data Paris - A Modern Enterprise Architecture (MongoDB)
Since the 1980s, the volume of data produced and the risk attached to that data have exploded. 90% of the data that exists today was created in the last two years, and 80% of it is unstructured. With more users and a need for always-on availability, the risks are much higher.
Which database parameters should a decision-maker take into account when deploying innovative applications?
MongoDB Europe 2016 - The Rise of the Data Lake (MongoDB)
The document discusses the rise of data lakes and how MongoDB can be used to build modern data management architectures. It provides examples of how companies like a Spanish bank and an insurance leader used MongoDB to create a single customer view across siloed data sources and improve customer experiences. The document also outlines common data processing patterns and how to choose the best data store for different parts of the data pipeline.
In this slidedeck, Infochimps Director of Product, Tim Gasper, discusses how Infochimps tackles business problems for customers by deploying a comprehensive Big Data infrastructure in days; sometimes in just hours. Tim unlocks how Infochimps is now taking that same aggressive approach to deliver faster time to value by helping customers develop analytic applications with impeccable speed.
This document summarizes a presentation about big data analytics solutions from Think Big Analytics and Infochimps. It discusses using their platforms together to power applications with next-generation big data stacks. It highlights case studies, architecture diagrams, and polls to demonstrate how their services can accelerate time to value through a combination of data science, engineering, strategy, and hands-on training and education.
The document outlines an agenda for a MongoDB event in Frankfurt on November 30th 2017. The agenda includes introductions, implementing a cloud-based data strategy, best practices for migrating from RDBMS to MongoDB, how MongoDB can provide support, and a Q&A session. It also lists the speakers which include representatives from MongoDB and Bosch Software Innovations.
This document discusses different architectures for big data systems, including traditional, streaming, lambda, kappa, and unified architectures. The traditional architecture focuses on batch processing stored data using Hadoop. Streaming architectures enable low-latency analysis of real-time data streams. Lambda architecture combines batch and streaming for flexibility. Kappa architecture avoids duplicating processing logic. Finally, a unified architecture trains models on batch data and applies them to real-time streams. Choosing the right architecture depends on use cases and available components.
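A toy illustration of the lambda idea described above (pure Python, no external services, all numbers invented): the batch view holds counts recomputed periodically from the master dataset, the speed layer holds increments seen since the last batch run, and a query merges the two.

```python
# Lambda-style serving query: combine the batch view with real-time increments.
batch_view = {"page_a": 10_000, "page_b": 4_200}   # produced by the batch layer
speed_view = {"page_a": 37, "page_c": 5}           # increments from the streaming layer

def page_views(page):
    """Serve an up-to-date count by merging batch and speed layers."""
    return batch_view.get(page, 0) + speed_view.get(page, 0)

print(page_views("page_a"))  # 10037
```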
Entity Resolution Service - Bringing Petabytes of Data Online for Instant Access (DataWorks Summit)
2.5B+ IDs, 2 ms latency, 15K+ TPS and petabytes of data. These numbers outline the challenges of eBay’s Entity Resolution Service (ERS). ERS provides a temporal map between anyid-anyid. The ERS technology stack uses Hadoop as the batch layer, Couchbase as the cache layer, Spring Batch to load data into Couchbase, and a REST API at the service layer. In our presentation we will take you through the journey from concept to production release. It’s a great story and we would like to share it with you!
Join CIGNEX Datamatics, Alfresco’s Global Platinum Partner, as they share the case study experience of a leading global online university. Together we’ll take a close look at their document management and web portal solution and their integrations with Alfresco ECM, Liferay Portal and Moodle Learning Management System.
MongoDB and RDBMS: Using Polyglot Persistence at Equifax (MongoDB)
MongoDB and RDBMS: Using Polyglot Persistence at Equifax. Presented by Michael Lawrence, Pariveda Solutions on behalf of Equifax at MongoDB Evenings Atlanta on September 24, 2015.
Real-time analytics is a beautiful thing, especially if you can build it in a quick, scalable and robust way. We built a digital command center for our marketing team that provided real-time analytics on social media, clickstream, and Google search terms, in the span of a couple of months. The solution was built entirely on open source technologies, using a combination of Apache NiFi, Elasticsearch, and Hadoop. Simple but very effective. In this presentation I would like to share the architecture, lessons learned, and business benefits of this solution.
InfoSphere BigInsights - Analytics power for Hadoop - field experience (Wilfried Hoge)
This document provides an overview and summary of InfoSphere BigInsights, an analytics platform for Hadoop. It discusses key features such as real-time analytics, storage integration, search, data exploration, predictive modeling, and application tooling. Case studies are presented on analyzing binary data and developing applications for transformation and analysis. Partnerships and certifications with other vendors are also mentioned. The document aims to demonstrate how BigInsights brings enterprise-grade features to Apache Hadoop and provides analytics capabilities for business users.
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020 (Mariano Gonzalez)
Modernizing analytics data pipelines to get the most out of your data while optimizing costs can be challenging. However, cloud providers today offer a good set of services that can help with this endeavor. In this hands-on session we will tour several GCP services, using Dataflow (Apache Beam) as the backbone of a modern analytics pipeline that wires them all together.
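A small sketch of the kind of Beam pipeline the session describes; the GCS paths are placeholders, and the same code runs locally with the default DirectRunner before being submitted to Dataflow.

```python
# Hypothetical example: count events per type from CSV-style lines.
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | "Read"   >> beam.io.ReadFromText("gs://my-bucket/raw/events-*.csv")
     | "Parse"  >> beam.Map(lambda line: line.split(",")[0])   # keep the event type column
     | "Pair"   >> beam.Map(lambda event_type: (event_type, 1))
     | "Count"  >> beam.CombinePerKey(sum)
     | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
     | "Write"  >> beam.io.WriteToText("gs://my-bucket/out/event_counts"))
```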
Big data expert and Infochimps CEO, Jim Kaskade presents the Infinite Monkey Theorem at CloudCon Expo. He provides an energetic, inspiring, and practical perspective on why Big Data is disrupting. It’s more than historic data analyzed on Hadoop. It’s also more than real-time streaming data stored and queried using NoSQL. Learn more at www.Infochimps.com
Big data, agile development, and cloud computing are driving new requirements for database management systems. These requirements are in turn driving the next phase of growth in the database industry, mirroring the evolution of the OLAP industry. This document describes this evolution, the new application workload, and how MongoDB is uniquely suited to address these challenges.
Learn why 451 Research believes Infochimps is well-positioned with an easy-to-consume managed service for those without Hadoop expertise, as well as a stack of technologically interesting projects for the 'devops' crowd.
Opening with a market positioning statement and ending with a competitive and SWOT analysis, Matt Aslett provides a comprehensive impact report.
This document discusses combining Apache Spark and MongoDB for real-time analytics. It describes how MongoDB provides rich analytics capabilities through queries, aggregations, and indexing. Apache Spark can further extend MongoDB's analytics by offering additional processing capabilities. Together, Spark and MongoDB enable organizations to perform real-time analytics directly on operational data without needing separate analytics infrastructure.
Big Data & Analytics - Use Cases in Mobile, E-commerce, Media and more (Amazon Web Services)
This document discusses how companies can use Amazon Web Services (AWS) big data and analytics services like Amazon Elastic MapReduce (EMR), Amazon Redshift, Amazon DynamoDB, and Amazon Kinesis to gain insights from massive amounts of data. It provides examples of how companies in various industries like mobile, e-commerce, media, and gaming use these AWS services for use cases like recommendations, targeted advertising, fraud detection, and real-time analytics. The document also compares different AWS analytics services and discusses best practices for deploying big data solutions on AWS.
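As a hedged example of the real-time ingestion step mentioned above, a producer can push clickstream events into Amazon Kinesis with boto3; the stream name and event shape are assumptions, and boto3 must be configured with valid AWS credentials.

```python
# Hypothetical sketch: send one clickstream event to a Kinesis data stream.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

event = {"user_id": "u-17", "action": "click", "item": "sku-9981"}
kinesis.put_record(
    StreamName="clickstream-events",            # assumed stream name
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],              # keeps one user's events in order
)
```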
Hybrid Transactional/Analytics Processing: Beyond the Big Database Hype (Ali Hodroj)
This presentation discusses hybrid transactional/analytical processing (HTAP) and the GigaSpaces solution. HTAP aims to support both real-time transactions and complex analytics by combining transaction processing and data warehousing capabilities. However, analytics needs have evolved faster than databases to include real-time streaming and predictive analytics. The GigaSpaces solution advocates a polyglot approach using Spark for analytics combined with an in-memory data grid for transactional storage and processing to better support insight-driven applications. Case studies demonstrate how the architecture provides unified low-latency access to data, distributed analytics, and triggered actions.
Spark DC Interactive Meetup: HTAP with Spark and In-Memory Data Grids (Ali Hodroj)
This document discusses hybrid transactional/analytical processing (HTAP) with Apache Spark and in-memory data grids. It describes how combining an in-memory data grid for low-latency transactions with Spark enables real-time analytics over both historical and streaming data at scale. The approach integrates Spark and the data grid through connectors to provide a unified API, push down predicates from Spark to the grid for efficient processing, and leverage data locality. This hybrid model supports various data types and provides a scale-out, unified data store to meet the needs of Internet of Things and omni-channel applications.
Real-time Microservices and In-Memory Data Grids (Ali Hodroj)
How in-memory data grids enable a real-time microservices architecture while diminishing the accidental complexity of persistence, orchestration, and fragmentation of scale.
2014 10 09 Top reasons to use IBM BigInsights as your Big Data Hadoop system (Toby Woolfe)
The document discusses why manufacturers should use IBM BigInsights as their Hadoop platform. It outlines 10 key reasons, including IBM's experience in the automotive industry, the capabilities BigInsights adds to open source Hadoop like performance and security features, IBM's commitment and track record of large Hadoop deployments, and case studies of manufacturers like General Motors that have successfully used BigInsights.
MphasiS provides various big data offerings including analytics on unstructured data like text, social media, images and logs. It also offers solutions to integrate structured and unstructured data for 360-degree insights. MphasiS has experience applying advanced analytics techniques like data mining and predictive modeling to solve problems in optimization, employee retention, and fraud prevention. It can help clients migrate to big data platforms like Hadoop, Hive, HBase, Vertica, and SAP HANA.
My other computer is a datacentre - 2012 edition (Steve Loughran)
An updated version of the "my other computer is a datacentre" talk, presented at the Bristol University HPC talk.
Because it is targeted at universities, it emphasises some of the interesting problems - the classic CS ones of scheduling, new ones of availability and failure handling within what is now a single computer, and emergent problems of power and heterogeneity. It also includes references, all of which are worth reading, and, being mostly Google and Microsoft papers, are free to download without needing ACM or IEEE library access.
Comments welcome.
"Adoption Tactics; Why Your End Users and Project Managers Will Rave Over Sha...Gina Montgomery, V-TSP
This document provides an agenda for a presentation on adopting SharePoint 2013. The presentation will include a survey, discussing why SharePoint intranets fail without proper adoption plans, components of a good end user adoption plan, what governance and information architecture are, what gamification is, a demo of SharePoint 2013, and a review with questions. The presentation aims to help organizations maximize adoption of SharePoint 2013 by their end users and project managers.
Amazon QuickSight is a fast, cloud-powered business intelligence service that reduces the time and cost of traditional BI software. It requires no IT effort to set up, auto-discovers AWS data sources, and reduces time to first visualization to just one minute. QuickSight uses a parallel, in-memory calculation engine called SPICE to provide fast query response times in milliseconds. It connects to various AWS and third-party data sources and applications and allows easy data visualization, dashboard creation, and report sharing.
Groovy Domain Specific Languages - SpringOne2GX 2012 (Guillaume Laforge)
Paul King, Andrew Eisenberg, and Guillaume Laforge present on the implementation of Domain-Specific Languages in Groovy at the SpringOne2GX 2012 conference in Washington DC.
The Rise of Microservices - Containers and Orchestration (MongoDB)
The document discusses microservices and containers. It defines microservices as small, independent services with well-defined interfaces that allow for decentralized control and independent deployments. Containers are presented as a way to package and run microservices using technologies like Docker. Orchestration with systems like Kubernetes and Mesos is described as a way to automate deployment, linking, and maintenance of multiple containers across infrastructure. MongoDB is discussed as a good fit for microservices due to its flexibility, redundancy, scalability, and simplicity.
Past, Present and Future of Data Processing in Apache Hadoop (Codemotion)
MongoDB scales easily to store mass volumes of data. However, when it comes to making sense of it all what options do you have? In this talk, we’ll take a look at 3 different ways of aggregating your data with MongoDB, and determine the reasons why you might choose one way over another. No matter what your big data needs are, you will find out how MongoDB the big data store is evolving to help make sense of your data.
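One of the aggregation options the abstract alludes to is the aggregation pipeline; the sketch below runs a filter-group-sort pipeline over an invented `orders` collection (collection and field names are assumptions, not from the talk).

```python
# Illustrative aggregation pipeline: top customers by shipped-order revenue.
from pymongo import MongoClient

orders = MongoClient().shop.orders

pipeline = [
    {"$match": {"status": "shipped"}},                 # filter first so indexes can help
    {"$group": {"_id": "$customer_id",
                "total_spent": {"$sum": "$amount"},
                "orders": {"$sum": 1}}},
    {"$sort": {"total_spent": -1}},
    {"$limit": 10},
]
for doc in orders.aggregate(pipeline):
    print(doc)
```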
This presentation covers practical implementation of Lambda with different patterns. It also explains how to achieve continuous deployment using lambda.
Airbus and Boeing have been locked in a fierce duopoly in the large jet airliner market since the 1990s. Airbus began as a European consortium, while the American Boeing absorbed its former arch-rival, McDonnell Douglas, in a 1997 merger.
Manufacturers like Lockheed Martin, Convair and Fairchild Aircraft in the United States and British Aerospace and Fokker in Europe withdrew from the market as they were no longer in a position to compete effectively.
Over the years, competition has been intense; each company regularly accuses the other of receiving unfair state aid from their respective governments.
Based on http://www.slideshare.net/arjunparekh/duopoly-boeing-versus-airbus?qid=90919b4f-b341-4d82-8f75-3474f9f15e57&v=&b=&from_search=16
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications! (Tugdual Grall)
Lambda Architecture is a useful framework for thinking about the design of big data applications. The framework was initially built at Twitter. In this presentation you will learn, based on concrete examples, how to build and deploy scalable and fault-tolerant applications, with a focus on Big Data and Hadoop.
This presentation was delivered at the OOP conference, Munich, Feb 2016
Learn how to build new classes of sophisticated, real-time analytics by combining Apache Spark, the industry's leading data processing engine, with MongoDB, the industry’s fastest growing database.
We live in a world of “big data.” But it isn’t just the data itself that is valuable – it’s the insight it can generate. How quickly an organization can unlock and act on that insight has become a major source of competitive advantage. Collecting data in operational systems and then relying on nightly batch extract, transform, load (ETL) processes to update the enterprise data warehouse (EDW) is no longer sufficient.
In this live session, we show you how MongoDB and Spark work together and provide examples using the new Spark Connector for MongoDB.
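A rough sketch of reading MongoDB data into a Spark DataFrame with the connector; note that the format string and configuration keys shown here follow the newer connector conventions and differ between connector releases, so treat them as assumptions to verify against the documentation for your versions.

```python
# Hypothetical example: load a MongoDB collection into Spark and aggregate it.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("mongo-spark-demo")
         .config("spark.mongodb.read.connection.uri",
                 "mongodb://localhost:27017/shop.orders")
         .getOrCreate())

orders = spark.read.format("mongodb").load()     # schema inferred by sampling documents
(orders.filter(orders.status == "shipped")
       .groupBy("customer_id")
       .count()
       .show())
```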
This session was sponsored by Stratio & Paradigma.
L’architettura di Classe Enterprise di Nuova Generazione (The Next-Generation Enterprise-Class Architecture) (MongoDB)
This document discusses using MongoDB as part of an enterprise data management architecture. It begins by describing the rise of data lakes to manage growing and diverse data volumes. Traditional EDWs struggle with this new data variety and volume. The document then provides an overview of MongoDB's features like flexible schemas, secondary indexes, and aggregation capabilities that make it suitable for building different layers of an EDM pipeline for tasks like raw data storage, transformation, analysis, and serving data to downstream systems. Example use cases are presented for building a single customer view and for replacing Oracle with MongoDB.
MongoDB Europe 2016 - Who’s Helping Themselves To Your Data? Demystifying Mon... (MongoDB)
The document discusses MongoDB's security features including authentication, authorization, encryption, and auditing. It emphasizes that MongoDB's security features have minimal dependencies and keep the path to secure success clear. The key features are authentication using passwords, LDAP, certificates or Kerberos; role-based authorization; encryption of data in transit using TLS and at rest using the encrypted storage engine; and auditing of operations to a configurable destination.
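A small sketch of the transport-security and authentication pieces mentioned above: an authenticated PyMongo connection over TLS with a CA file. The host, user, password, and file path are placeholders.

```python
# Hypothetical example: authenticated connection over TLS.
from pymongo import MongoClient

client = MongoClient(
    "mongodb://app_user:s3cret@db.example.com:27017/?authSource=admin",
    tls=True,
    tlsCAFile="/etc/ssl/mongodb-ca.pem",
)

# Authorization is enforced server side: this role-restricted user can read the
# reporting data it was granted, and writes outside its roles are rejected.
print(client.reporting.daily_kpis.count_documents({}))
```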
MongoDB Europe 2016 - Big Data meets Big Compute (MongoDB)
- The document discusses how Spark can be used to connect MongoDB for analytics and processing large datasets. Spark is a fast, general engine for large-scale data processing.
- It provides an overview of Spark including its programming model using resilient distributed datasets (RDDs), built-in fault tolerance, and libraries for SQL, streaming, machine learning and graphs.
- The new MongoDB connector for Spark allows seamless integration between MongoDB and Spark. It supports DataFrames and Datasets with automatic schema inference and conversion. Proper configuration and partitioning strategies are important to optimize performance and data locality.
- A demo is presented using Spark and the connector to solve a "traveling salesman problem" of planning efficient travel routes between Europe
MongoDB Europe 2016 - ETL for Pros – Getting Data Into MongoDB The Right Way (MongoDB)
The document discusses best practices for extracting, transforming, and loading (ETL) large amounts of data into MongoDB. It describes common mistakes made in ETL processes, such as performing nested queries to retrieve and assemble documents, and building documents within the database itself using update operations. The presentation provides a case study comparing these inefficient approaches to loading order, item, and tracking data from relational tables into MongoDB documents.
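A minimal sketch of the more efficient approach the summary contrasts with the anti-patterns: join the relational rows in the application, build each complete order document once, then insert in batches, instead of issuing per-item update operations against the database. Table and field names are invented.

```python
# Illustrative ETL pattern: assemble documents in the application, then bulk insert.
from collections import defaultdict
from pymongo import MongoClient

orders_coll = MongoClient().etl_target.orders

# Pretend these rows came from the relational extract.
order_rows = [{"order_id": 1, "customer": "ACME", "total": 120.0}]
item_rows  = [{"order_id": 1, "sku": "A-1", "qty": 2},
              {"order_id": 1, "sku": "B-7", "qty": 1}]

items_by_order = defaultdict(list)
for item in item_rows:
    items_by_order[item["order_id"]].append({"sku": item["sku"], "qty": item["qty"]})

docs = [{**order, "items": items_by_order[order["order_id"]]} for order in order_rows]
orders_coll.insert_many(docs)      # one round trip per batch, not one update per item
```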
Unlocking Operational Intelligence from the Data Lake (MongoDB)
The document discusses unlocking operational intelligence from data lakes using MongoDB. It begins by describing how digital transformation is driving changes in data volume, velocity, and variety. It then discusses how MongoDB can help operationalize data lakes by providing real-time access and analytics on data stored in data lakes, while also integrating batch processing capabilities. The document provides an example reference architecture of how MongoDB can be used with a data lake (Hadoop) and stream processing framework (Kafka) to power operational applications and machine learning models with both real-time and batch data and analytics.
The document provides a case study on the lessons learned from Boeing's 787 Dreamliner project. It summarizes that Boeing aimed to cut costs and development time through an unconventional supply chain model where 70% of the work was outsourced. However, this resulted in the project being over budget by $11 billion and 4 years delayed. Key lessons identified include: assembling a management team with supply chain expertise, improving supply chain visibility, fully understanding all underlying project costs before estimating, improving supplier training and selection processes, proactively managing labor unions, and implementing risk-sharing contracts with incentives and penalties for partners.
This document provides an overview of big data concepts and technologies for managers. It discusses problems with relational databases for large, unstructured data and introduces NoSQL databases and Hadoop as solutions. It also summarizes common big data applications, frameworks like MapReduce, Spark, and Flink, and different NoSQL database categories including key-value, column-family, document, and graph stores.
Webinar: Enterprise Data Management in the Era of MongoDB and Data Lakes (MongoDB)
1. The document discusses using MongoDB and data lakes for enterprise data management. It outlines the current issues with relational databases and how MongoDB addresses challenges like flexibility, scalability and performance.
2. Various architectures for enterprise data management with MongoDB are presented, including using it for raw, transformed and aggregated data stores.
3. The benefits of combining MongoDB and Hadoop in a data lake are greater agility, insight from handling different data structures, scalability and low latency for real-time decisions.
Lambda architecture for real time big data (Trieu Nguyen)
- The document discusses the Lambda Architecture, a system designed by Nathan Marz for building real-time big data applications. It is based on three principles: human fault-tolerance, data immutability, and recomputation.
- The document provides two case studies of applying Lambda Architecture - at Greengar Studios for API monitoring and statistics, and at eClick for real-time data analytics on streaming user event data.
- Key lessons discussed are keeping solutions simple, asking the right questions to enable deep analytics and profit, using reactive and functional approaches, and turning data into useful insights.
Using real time big data analytics for competitive advantage (Amazon Web Services)
Many organisations find it challenging to successfully perform real-time data analytics using their own on premise IT infrastructure. Building a system that can adapt and scale rapidly to handle dramatic increases in transaction loads can potentially be quite a costly and time consuming exercise.
Most of the time, infrastructure is under-utilised and it’s near impossible for organisations to forecast the amount of computing power they will need in the future to serve their customers and suppliers.
To overcome these challenges, organisations can instead utilise the cloud to support their real-time data analytics activities. Scalable, agile and secure, cloud-based infrastructure enables organisations to quickly spin up infrastructure to support their data analytics projects exactly when it is needed. Importantly, they can ‘switch off’ infrastructure when it is not.
BluePi Consulting and Amazon Web Services (AWS) are giving you the opportunity to discover how organisations are using real time data analytics to gain new insights from their information to improve the customer experience and drive competitive advantage.
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa... (MSAdvAnalytics)
Lance Olson. Cortana Analytics is a fully managed big data and advanced analytics suite that helps you transform your data into intelligent action. Come to this two-part session to learn how you can do "big data" processing and storage in Cortana Analytics. In the first part, we will provide an overview of the processing and storage services. We will then talk about the patterns and use cases which make up most big data solutions. In the second part, we will go hands-on, showing you how to get started today with writing batch/interactive queries, real-time stream processing, or NoSQL transactions all over the same repository of data. Crunch petabytes of data by scaling out your computation power to any sized cluster. Store any amount of unstructured data in its native format with no limits to file or account size. All of this can be done with no hardware to acquire or maintain and minimal time to setup giving you the value of "big data" within minutes. Go to https://channel9.msdn.com/ to find the recording of this session.
Key Data Management Requirements for the IoT (MongoDB)
The document discusses key data management requirements for Internet of Things (IoT) applications. It notes that IoT will generate massive amounts of structured and unstructured data from a large number of connected devices and sensors. This data must be managed in a way that allows for rich applications, a unified view of data, real-time operational insights, business agility, and continuous innovation. It argues that traditional relational databases may not be well-suited for IoT data management and that NoSQL databases can provide scalability, flexibility, analytics and a unified view of data from multiple sources.
MongoDB Breakfast Milan - Mainframe Offloading Strategies (MongoDB)
The document summarizes a MongoDB event focused on modernizing mainframe applications. The event agenda includes presentations on moving from mainframes to operational data stores, demo of a mainframe offloading solution from Quantyca, and stories of mainframe modernization. Benefits of using MongoDB for mainframe modernization include 5-10x developer productivity and 80% reduction in mainframe costs.
Webinar: Data Streaming with Apache Kafka & MongoDB (MongoDB)
This document summarizes a webinar about integrating Apache Kafka and MongoDB for data streaming. The webinar covered:
- An overview of Apache Kafka and how it can be used for data transport and integration as well as real-time stream processing.
- How MongoDB can be used as both a Kafka producer, to stream data into Kafka topics, and as a Kafka consumer, to retrieve streamed data from Kafka for storage, querying, and analytics in MongoDB (a minimal consumer sketch follows this list).
- Various use cases for integrating Kafka and MongoDB, including handling real-time updates, storing raw and processed event data, and powering real-time applications with analytics models built from streamed data.
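The "MongoDB as a Kafka consumer" flow referenced above can be sketched with kafka-python and PyMongo; the topic name, broker address, and collection are assumptions for illustration.

```python
# Hypothetical example: consume JSON events from Kafka and persist them in MongoDB.
import json
from kafka import KafkaConsumer
from pymongo import MongoClient

events = MongoClient().streaming.events
consumer = KafkaConsumer(
    "clickstream",                                   # assumed topic name
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    # Store each streamed event so it can be queried and analysed in MongoDB.
    events.insert_one(message.value)
```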
Real life use cases from across Europe (Walid Aoudi - Cognizant)
This presentation shares the experiences of several Cognizant Big Data clients in continental Europe and the UK. The main focus is on use cases, presented through the business drivers behind these projects. Key highlights of the big data architectures and solution approaches will be presented. Finally, the business outcomes, in terms of the ROI delivered by the implemented solutions, will be discussed.
Accelerating a Path to Digital With a Cloud Data Strategy (MongoDB)
The document describes a conference on accelerating a path to digital transformation with a cloud data strategy. It provides an agenda for the conference including speakers on executing a cloud data strategy, customer stories from De Persgroep and Toyota Motor Europe, and a session on landing in the cloud with MongoDB Atlas. The document also provides background on the speakers and their companies.
IBM's Big Data platform provides tools for managing and analyzing large volumes of structured, unstructured, and streaming data. It includes Hadoop for storage and processing, InfoSphere Streams for real-time streaming analytics, InfoSphere BigInsights for analytics on data at rest, and PureData System for Analytics (formerly Netezza) for high performance data warehousing. The platform enables businesses to gain insights from all available data to capitalize on information resources and make data-driven decisions.
IBM's Big Data platform provides tools for managing and analyzing large volumes of data from various sources. It allows users to cost effectively store and process structured, unstructured, and streaming data. The platform includes products like Hadoop for storage, MapReduce for processing large datasets, and InfoSphere Streams for analyzing real-time streaming data. Business users can start with critical needs and expand their use of big data over time by leveraging different products within the IBM Big Data platform.
1. The document discusses how organizations can leverage data, analytics, and insights to fundamentally change and pioneer new business models.
2. It emphasizes that data analytics cannot be accomplished in a silo and must involve the entire organization. Modern cloud platforms, software methodologies, and data tools are needed.
3. Examples are provided of how various organizations have used tools like Pivotal Greenplum to gain insights from data to solve problems in areas like predictive maintenance, risk management, and national security.
Businesses are generating more data than ever before.
Doing real time data analytics requires IT infrastructure that often needs to be scaled up quickly and running an on-premise environment in this setting has its limitations.
Organisations often require a massive amount of IT resources to analyse their data and the upfront capital cost can deter them from embarking on these projects.
What’s needed is scalable, agile and secure cloud-based infrastructure at the lowest possible cost so they can spin up servers that support their data analysis projects exactly when they are required. This infrastructure must enable them to create proof-of-concepts quickly and cheaply – to fail fast and move on.
Data Streaming with Apache Kafka & MongoDB (confluent)
Explore the use-cases and architecture for Apache Kafka, and how it integrates with MongoDB to build sophisticated data-driven applications that exploit new sources of data.
Tapdata provides a smart data as a service platform that offers:
1) Real-time data collection and synchronization from various sources like databases, files, and streaming data.
2) Data modeling and governance capabilities like data validation, quality checks, and AI-assisted cataloging.
3) Scalable data storage across TBs to PBs of data using a distributed database.
4) A code-less API publishing module to quickly build and deploy RESTful APIs for internal and external users.
Big Data: Its Characteristics And Architecture Capabilities (Ashraf Uddin)
This document discusses big data, including its definition, characteristics, and architecture capabilities. It defines big data as large datasets that are challenging to store, search, share, visualize, and analyze due to their scale, diversity and complexity. The key characteristics of big data are described as volume, velocity and variety. The document then outlines the architecture capabilities needed for big data, including storage and management, database, processing, data integration and statistical analysis capabilities. Hadoop and MapReduce are presented as core technologies for storage, processing and analyzing large datasets in parallel across clusters of computers.
Big Data, IoT, data lake, unstructured data, Hadoop, cloud, and massively parallel processing (MPP) are all just fancy words unless you can find uses cases for all this technology. Join me as I talk about the many use cases I have seen, from streaming data to advanced analytics, broken down by industry. I’ll show you how all this technology fits together by discussing various architectures and the most common approaches to solving data problems and hopefully set off light bulbs in your head on how big data can help your organization make better business decisions.
Girish Juneja - Intel Big Data & Cloud Summit 2013 (IntelAPAC)
This document discusses big data trends such as the growth of networked sensors, connected devices, and smartphone users. It then summarizes Intel's investments in big data technologies, including their software, processors, networking, storage and memory products. The document promotes Intel's Distribution for Apache Hadoop software and how it provides security, performance optimizations and support for workloads like data mining, graph analytics and full text search. Real-world customer examples are provided that demonstrate gains in performance, cost savings and new analytics capabilities.
Hadoop 2.0: YARN to Further Optimize Data Processing (Hortonworks)
Data is exponentially increasing in both types and volumes, creating opportunities for businesses. Watch this video and learn from three Big Data experts: John Kreisa, VP Strategic Marketing at Hortonworks, Imad Birouty, Director of Technical Product Marketing at Teradata and John Haddad, Senior Director of Product Marketing at Informatica.
Multiple systems are needed to exploit the variety and volume of data sources, including a flexible data repository. Learn more about:
- Apache Hadoop 2 and YARN
- Data Lakes
- Intelligent data management layers needed to manage metadata and usage patterns as well as track consumption across these data platforms.
Overcoming Today's Data Challenges with MongoDB (MongoDB)
The document outlines an agenda for an event on overcoming data challenges with MongoDB. The event will feature speakers from MongoDB and Bosch discussing how the world has changed since relational databases were invented, how to radically transform IT environments with MongoDB, MongoDB and blockchain, and MongoDB for multiple use cases. The agenda includes presentations on these topics as well as a Q&A session and conclusion.
Data Streaming with Apache Kafka & MongoDB - EMEA (Andrew Morgan)
A new generation of technologies is needed to consume and exploit today's real time, fast moving data sources. Apache Kafka, originally developed at LinkedIn, has emerged as one of these key new technologies.
This webinar explores the use-cases and architecture for Kafka, and how it integrates with MongoDB to build sophisticated data-driven applications that exploit new sources of data.
Similar to Unlocking Operational Intelligence from the Data Lake
MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas (MongoDB)
This presentation discusses migrating data from other data stores to MongoDB Atlas. It begins by explaining why MongoDB and Atlas are good choices for data management. Several preparation steps are covered, including sizing the target Atlas cluster, increasing the source oplog, and testing connectivity. Live migration, mongomirror, and dump/restore options are presented for migrating between replicasets or sharded clusters. Post-migration steps like monitoring and backups are also discussed. Finally, migrating from other data stores like AWS DocumentDB, Azure CosmosDB, DynamoDB, and relational databases are briefly covered.
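As a very small dump-and-restore style sketch (not a substitute for mongomirror or the live migration tooling the talk describes), one collection can be copied from a source deployment into an Atlas cluster with PyMongo; both connection strings are placeholders.

```python
# Hypothetical example: copy a collection to Atlas in batches.
from pymongo import MongoClient

source = MongoClient("mongodb://source-host:27017")
target = MongoClient("mongodb+srv://user:pass@cluster0.example.mongodb.net")

batch = []
for doc in source.appdb.customers.find():
    batch.append(doc)
    if len(batch) == 1000:                 # insert in batches to limit round trips
        target.appdb.customers.insert_many(batch)
        batch = []
if batch:
    target.appdb.customers.insert_many(batch)
```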
MongoDB SoCal 2020: Go on a Data Safari with MongoDB Charts! (MongoDB)
These days, everyone is expected to be a data analyst. But with so much data available, how can you make sense of it and be sure you're making the best decisions? One great approach is to use data visualizations. In this session, we take a complex dataset and show how the breadth of capabilities in MongoDB Charts can help you turn bits and bytes into insights.
MongoDB SoCal 2020: Using MongoDB Services in Kubernetes: Any Platform, Devel... (MongoDB)
MongoDB Kubernetes operator and MongoDB Open Service Broker are ready for production operations. Learn about how MongoDB can be used with the most popular container orchestration platform, Kubernetes, and bring self-service, persistent storage to your containerized applications. A demo will show you how easy it is to enable MongoDB clusters as an External Service using the Open Service Broker API for MongoDB
MongoDB SoCal 2020: A Complete Methodology of Data Modeling for MongoDB (MongoDB)
Are you new to schema design for MongoDB, or are you looking for a more complete or agile process than what you are following currently? In this talk, we will guide you through the phases of a flexible methodology that you can apply to projects ranging from small to large with very demanding requirements.
MongoDB SoCal 2020: From Pharmacist to Analyst: Leveraging MongoDB for Real-T... (MongoDB)
Humana, like many companies, is tackling the challenge of creating real-time insights from data that is diverse and rapidly changing. This is our journey of how we used MongoDB to combine traditional batch approaches with streaming technologies to provide continuous alerting capabilities from real-time data streams.
MongoDB SoCal 2020: Best Practices for Working with IoT and Time-series Data (MongoDB)
Time series data is increasingly at the heart of modern applications - think IoT, stock trading, clickstreams, social media, and more. With the move from batch to real time systems, the efficient capture and analysis of time series data can enable organizations to better detect and respond to events ahead of their competitors or to improve operational efficiency to reduce cost and risk. Working with time series data is often different from regular application data, and there are best practices you should observe.
This talk covers:
Common components of an IoT solution
The challenges involved with managing time-series data in IoT applications
Different schema designs, and how these affect memory and disk utilization – two critical factors in application performance.
How to query, analyze and present IoT time-series data using MongoDB Compass and MongoDB Charts
At the end of the session, you will have a better understanding of key best practices in managing IoT time-series data with MongoDB.
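One of the schema-design families the talk compares can be sketched as a bucketing pattern: one document per device per hour, with readings appended to an array. The names and the one-hour bucket size are assumptions made for illustration, not the talk's specific recommendation.

```python
# Hypothetical example: hourly bucketing of sensor readings.
from datetime import datetime, timezone
from pymongo import MongoClient

readings = MongoClient().iot.sensor_readings

def record(device_id, value, ts=None):
    ts = ts or datetime.now(timezone.utc)
    bucket_start = ts.replace(minute=0, second=0, microsecond=0)
    readings.update_one(
        {"device_id": device_id, "bucket_start": bucket_start},
        {"$push": {"samples": {"t": ts, "v": value}},
         "$inc": {"count": 1}},
        upsert=True,                       # create the hourly bucket on the first sample
    )

record("thermostat-7", 21.4)
```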
Join this talk and test session with a MongoDB Developer Advocate where you'll go over the setup, configuration, and deployment of an Atlas environment. Create a service that you can take back in a production-ready state and prepare to unleash your inner genius.
MongoDB .local San Francisco 2020: Powering the new age data demands [Infosys] (MongoDB)
Our clients have unique use cases and data patterns that mandate the choice of a particular strategy. To implement these strategies, it is mandatory that we unlearn a lot of relational concepts while designing and rapidly developing efficient applications on NoSQL. In this session, we will talk about some of our client use cases, the strategies we have adopted, and the features of MongoDB that assisted in implementing these strategies.
MongoDB .local San Francisco 2020: Using Client Side Encryption in MongoDB 4.2MongoDB
Encryption is not a new concept to MongoDB. Encryption may occur in-transit (with TLS) and at-rest (with the encrypted storage engine). But MongoDB 4.2 introduces support for Client Side Encryption, ensuring the most sensitive data is encrypted before ever leaving the client application. Even full access to your MongoDB servers is not enough to decrypt this data. And better yet, Client Side Encryption can be enabled at the "flick of a switch".
This session covers using Client Side Encryption in your applications. This includes the necessary setup, how to encrypt data without sacrificing queryability, and what trade-offs to expect.
MongoDB .local San Francisco 2020: Using MongoDB Services in Kubernetes: any ...MongoDB
MongoDB Kubernetes operator is ready for prime time. Learn about how MongoDB can be used with the most popular orchestration platform, Kubernetes, and bring self-service, persistent storage to your containerized applications.
MongoDB .local San Francisco 2020: Go on a Data Safari with MongoDB Charts!MongoDB
These days, everyone is expected to be a data analyst. But with so much data available, how can you make sense of it and be sure you're making the best decisions? One great approach is to use data visualizations. In this session, we take a complex dataset and show how the breadth of capabilities in MongoDB Charts can help you turn bits and bytes into insights.
MongoDB .local San Francisco 2020: From SQL to NoSQL -- Changing Your MindsetMongoDB
When you need to model data, is your first instinct to start breaking it down into rows and columns? Mine used to be too. When you want to develop apps in a modern, agile way, NoSQL databases can be the best option. Come to this talk to learn how to take advantage of all that NoSQL databases have to offer and discover the benefits of changing your mindset from the legacy, tabular way of modeling data. We’ll compare and contrast the terms and concepts in SQL databases and MongoDB, explain the benefits of using MongoDB compared to SQL databases, and walk through data modeling basics so you feel confident as you begin using MongoDB.
MongoDB .local San Francisco 2020: MongoDB Atlas JumpstartMongoDB
Join this talk and test session with a MongoDB Developer Advocate where you'll go over the setup, configuration, and deployment of an Atlas environment. Create a service that you can take back in a production-ready state and prepare to unleash your inner genius.
MongoDB .local San Francisco 2020: Tips and Tricks++ for Querying and Indexin...MongoDB
The document discusses guidelines for ordering fields in compound indexes to optimize query performance. It recommends the E-S-R approach: placing equality fields first, followed by sort fields, and range fields last. This allows indexes to leverage equality matches, provide non-blocking sorts, and minimize scanning. Examples show how indexes ordered by these guidelines can support queries more efficiently by narrowing the search bounds.
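As a quick, hedged illustration of the E-S-R ordering described above, here is a minimal pymongo sketch; the orders collection and its status, order_date and amount fields are hypothetical stand-ins.

```python
from pymongo import ASCENDING, DESCENDING, MongoClient

db = MongoClient()["shop"]  # hypothetical connection and database

# E-S-R: equality field first, then the sort field, then the range field.
db.orders.create_index([
    ("status", ASCENDING),       # Equality: status == "shipped"
    ("order_date", DESCENDING),  # Sort: newest first, served by the index (non-blocking)
    ("amount", ASCENDING),       # Range: amount > 100 narrows the index bounds last
])

# A query shaped to use the index above.
recent_big_orders = (
    db.orders.find({"status": "shipped", "amount": {"$gt": 100}})
    .sort("order_date", DESCENDING)
)
```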
MongoDB .local San Francisco 2020: Aggregation Pipeline Power++MongoDB
Aggregation pipeline has been able to power your analysis of data since version 2.2. In 4.2 we added more power and now you can use it for more powerful queries, updates, and outputting your data to existing collections. Come hear how you can do everything with the pipeline, including single-view, ETL, data roll-ups and materialized views.
MongoDB .local San Francisco 2020: A Complete Methodology of Data Modeling fo...MongoDB
The document describes a methodology for data modeling with MongoDB. It begins by recognizing the differences between document and tabular databases, then outlines a three-step methodology: 1) describe the workload by listing queries, 2) identify and model relationships between entities, and 3) apply relevant patterns when modeling for MongoDB. The document uses examples around modeling a coffee shop franchise to illustrate modeling approaches and techniques.
MongoDB .local San Francisco 2020: MongoDB Atlas Data Lake Technical Deep DiveMongoDB
MongoDB Atlas Data Lake is a new service offered by MongoDB Atlas. Many organizations store long term, archival data in cost-effective storage like S3, GCP, and Azure Blobs. However, many of them do not have robust systems or tools to effectively utilize large amounts of data to inform decision making. MongoDB Atlas Data Lake is a service allowing organizations to analyze their long-term data to discover a wealth of information about their business.
This session will take a deep dive into the features that are currently available in MongoDB Atlas Data Lake and how they are implemented. In addition, we'll discuss future plans and opportunities and offer ample Q&A time with the engineers on the project.
MongoDB .local San Francisco 2020: Developing Alexa Skills with MongoDB & GolangMongoDB
Virtual assistants are becoming the new norm when it comes to daily life, with Amazon’s Alexa being the leader in the space. As a developer, not only do you need to make web and mobile compliant applications, but you need to be able to support virtual assistants like Alexa. However, the process isn’t quite the same between the platforms.
How do you handle requests? Where do you store your data and work with it to create meaningful responses with little delay? How much of your code needs to change between platforms?
In this session we’ll see how to design and develop applications known as Skills for Amazon Alexa powered devices using the Go programming language and MongoDB.
MongoDB .local Paris 2020: Realm : l'ingrédient secret pour de meilleures app...MongoDB
… to Core Data, appreciated by hundreds of thousands of developers. Learn what makes Realm special and how it can be used to build better applications, faster.
MongoDB .local Paris 2020: Upply @MongoDB : Upply : Quand le Machine Learning...MongoDB
It has never been easier to order online and be delivered in under 48 hours, very often for free. This ease of use hides a complex market worth more than $8 trillion.
Data is well known in the supply chain world (routes, information on goods, customs, ...), but the value of this operational data remains largely untapped. By combining business expertise with data science, Upply is redefining the fundamentals of the supply chain, enabling every market player to overcome the volatility and inefficiency of the market.
2. The World is Changing
Digital Natives & Digital Transformation
• Volume, Velocity, Variety
• Iterative, Agile, Short Cycles
• Always On, Secure, Global
• Open-Source, Cloud, Commodity
(Chart labels: Data, Time, Risk, Cost)
6. "Big Data" is More than Just Hadoop
• 24% CAGR: Hadoop, Spark & Streaming
• 18% CAGR: Databases
• Databases are key components within the big data landscape
9. How to Avoid Being in the 70%?
1. Unify data lake analytics with the operational applications
2. Create smart, contextually aware, data-driven apps & insights
3. Integrate a database layer with the data lake
10. Why a Database + Hadoop?
Distributed Processing & Analytics (HDFS):
• Data stored as large files (64MB-128MB blocks). No indexes
• Write-once-read-many, append-only
• Designed for high-throughput scans across TB/PB of data
• Multi-minute latency
Common Attributes:
• Schema-on-read
• Multiple replicas
• Horizontal scale
• High throughput
• Low TCO
11. Why a Database + Hadoop?
Operational database (MongoDB):
• Random access to subsets of data
• Millisecond latency
• Expressive querying, rich aggregations & flexible indexing
• Update fast-changing data; avoid re-writing / re-computing the entire data set
Distributed Processing & Analytics (HDFS):
• Data stored as large files (64MB-128MB blocks). No indexes
• Write-once-read-many, append-only
• Designed for high-throughput scans across TB/PB of data
• Multi-minute latency
Common Attributes:
• Schema-on-read
• Multiple replicas
• Horizontal scale
• High throughput
• Low TCO
12. MongoDB & Hadoop: What's Common
Distributed Processing & Analytics
Common Attributes:
• Schema-on-read
• Multiple replicas
• Horizontal scale
• High throughput
• Low TCO
13. Bringing it Together
Online services powered by MongoDB:
• User account & personalization
• Product catalog
• Session management & shopping cart
• Recommendations
Back-end machine learning powered by Hadoop:
• Customer classification & clustering
• Basket analysis
• Brand sentiment
• Price optimization
Connected by the MongoDB Connector for Hadoop
14. Design Pattern: Operationalized Data Lake
• Data sources (Sensors, User Data, Clickstreams, Logs) flow into a Message Queue, which routes Raw Data and Processed Events.
• Distributed Processing Frameworks generate analytics models: Churn Analysis, Enriched Customer Profiles, Risk Modeling, Predictive Analytics.
• Real-Time Access (database): millisecond latency; expressive querying & flexible indexing against subsets of data; updates-in-place; in-database aggregations & transformations.
• Batch Processing, Batch Views (data lake): multi-minute latency with scans across TB/PB of data; no indexes; data stored in 128MB blocks; write-once-read-many & append-only storage model.
• Consuming applications: Customer Data Mgmt, Mobile App, IoT App, Live Dashboards.
15. Design Pattern: Operationalized Data Lake (repeats the slide 14 architecture). Callout: configure where to land incoming data.
16. Design Pattern: Operationalized Data Lake (repeats the slide 14 architecture). Callout: raw data is processed to generate analytics models.
17. Design Pattern: Operationalized Data Lake (repeats the slide 14 architecture). Callout: MongoDB exposes the analytics models to operational apps and handles real-time updates.
18. Design Pattern: Operationalized Data Lake (repeats the slide 14 architecture). Callout: new models are computed against MongoDB & HDFS.
19. Operational Database Requirements
1. "Smart" integration with the data lake
2. Powerful real-time analytics
3. Flexible, governed data model
4. Scale with the data lake
5. Sophisticated management & security
21. UK's Leading Price Comparison Site
Out-pacing internet search giants with a continuous delivery pipeline powered by microservices & Docker running MongoDB, Kafka and Hadoop in the cloud
Problem:
• Existing EDW with nightly batch loads
• No real-time analytics to personalize the user experience
• Application changes broke the ETL pipeline
• Unable to scale as services expanded
Solution:
• Microservices architecture running on AWS
• All application events written to a Kafka queue, routed to MongoDB and Hadoop
• Events that personalize the real-time experience (e.g. triggering an email send, additional questions, offers) written to MongoDB
• All event data aggregated with other data sources and analyzed in Hadoop; updated customer profiles written back to MongoDB
Results:
• 2x faster delivery of new services after migrating to the new architecture
• Enabled continuous delivery: pushing new features every day
• Personalized user experience, plus higher uptime and scalability
22. Customer Data Management: Leading Global Airline
Single view and real-time analytics with MongoDB, Spark & Hadoop
Problem:
• Customer data scattered across 100+ different systems
• Poor customer experience: no personalization, no consistent experience across brands or devices
• No way to analyze customer behavior to deliver targeted offers
Solution:
• Selected MongoDB over HBase for schema flexibility and rich query support
• MongoDB stores all customer profiles, served to web, mobile & call-center apps
• Distributed across multiple regions for DR and data locality
• All customer interactions stored in MongoDB, loaded into Hadoop for customer segmentation
• Unified processing pipeline with Spark running across MongoDB and Hadoop
Results:
• Single profile created for each customer, personalizing the experience in real time
• Revenue optimization by calculating the best ticket prices
• Reduced competitive pressure by identifying gaps in product offerings
23. World's Most Sophisticated Traveler Safety Platform
Analyzing PBs of data with MongoDB, Hadoop, Apache NiFi & SAP HANA
Problem:
• Commercialize a national security platform
• Massive volumes of multi-structured data: news, RSS & social feeds, geospatial, geological, health & crime stats
• Requires complex analysis, delivered in real time, always on
Solution:
• Apache NiFi for data ingestion, routing & metadata management
• Hadoop for text analytics
• HANA for geospatial analytics
• MongoDB correlates analytics with user profiles & location data to deliver real-time alerts to corporate security teams & individual travelers
Results:
• Enables Prescient to uniquely blend big data technology with its security IP developed in government
• Dynamic data model supports indexing 38k data sources, growing at 200 per day
• 24x7 continuous availability
• Scalability to PBs of data
24. Powering Global Threat Intelligence
Cloud-based real-time analytics with MongoDB & Hadoop
Problem:
• Requirement to analyze data over many different dimensions to detect real-time threat profiles
• HBase unable to query data beyond primary key lookups
• Lucene search unable to scale with growth in data
Solution:
• MongoDB + Hadoop to collect and analyze data from internet sensors in real time
• MongoDB's dynamic schema enables sensor data to be enriched with geospatial tags
• Auto-sharding to scale as data volumes grow
Results:
• Run complex, real-time analytics on live data
• Improved query performance by over 3x
• Scale to support a doubling of data volume every 24 months
• Deploy across global data centers for a low-latency user experience
• Engineering teams have more time to develop new features
26. Conclusion
1. Data lakes enable enterprises to affordably capture & analyze more data
2. Operational and analytical workloads are converging
3. MongoDB is the key technology to operationalize the data lake
27. MongoDB Enterprise Advanced
• MongoDB Enterprise Server: authentication, authorization, auditing, encryption (in flight & at rest)
• MongoDB Compass: schema visualization, data exploration, ad-hoc queries
• MongoDB Connector for BI: visualization, analysis, reporting
• MongoDB Ops Manager: monitoring & alerting, query optimization, backup & recovery, automation & configuration, REST API
• Commercial terms & support: 24x7 support (1-hour SLA), commercial license (no AGPL copyleft restrictions), platform certifications, emergency patches, Customer Success Program, on-demand online training, warranty, limitation of liability, indemnification
28. Resources to Learn More
• Guide: Operational Data Lake
• Whitepaper: Real-Time Analytics with Apache Spark & MongoDB
30. For More Information
• Case Studies: mongodb.com/customers
• Presentations: mongodb.com/presentations
• Free Online Training: education.mongodb.com
• Webinars and Events: mongodb.com/events
• Documentation: docs.mongodb.org
• MongoDB Downloads: mongodb.com/download
• Additional Info: info@mongodb.com
31. One of the World's Largest Banks
Creating new customer insights with MongoDB & Spark
Problem:
• System failures in online banking systems creating customer satisfaction issues
• No personalization experience across channels
• No enrichment of user data with social media chatter
Solution:
• Apache Flume to ingest log data & social media streams, Apache Spark to process log events
• MongoDB to persist log data and KPIs, and to immediately rebuild user sessions when a service fails
• Integration with the MongoDB query language and secondary indexes to selectively filter and query data in real time
Results:
• Improved user experience, with more customers using online, self-service channels
• Improved services following a deeper understanding of how users interact with systems
• Greater user insight by adding social media insights
32. Fare Calculation Engine
One of the world's largest airlines migrates from Oracle to MongoDB and Apache Spark to support a 100x performance improvement
Problem:
• China Eastern targeting 130,000 seats sold every day across its web and mobile channels
• New fare calculation engine needed to support 20,000 search queries per second, but the current Oracle platform supported only 200 per second
Solution:
• Apache Spark used for fare calculations, using business rules stored in MongoDB
• Fare calculations written to MongoDB for access by the search application
• MongoDB Connector for Apache Spark allows seamless integration with data locality awareness across the cluster
Results:
• Cluster of fewer than 20 API, Spark & MongoDB nodes supports 180m fare calculations & 1.6 billion searches per day
• Each node delivers 15x higher performance and 10x lower latency than the existing Oracle servers
• MongoDB Enterprise Advanced provided Ops Manager for operational automation and access to expert technical support
33. MongoDB Connector for Apache Spark
• Native Scala connector, certified by Databricks
• Exposes all Spark APIs & libraries
• Efficient data filtering with predicate pushdown, secondary indexes, & in-database aggregations
• Locality awareness to reduce data movement
"We reduced 100+ lines of integration code to just a single line after moving to the MongoDB Spark connector." (Early Access Tester, Multi-National Banking Group)
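As a rough illustration of that "single line" of integration and the predicate pushdown described above, here is a PySpark sketch. It assumes a 2.x-era connector package (e.g. org.mongodb.spark:mongo-spark-connector) is on the classpath; the URI, database, collection and field names are hypothetical.

```python
from pyspark.sql import SparkSession

# Assumes the job is submitted with something like:
#   --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0   (version illustrative)
spark = (
    SparkSession.builder
    .appName("mongodb-spark-example")
    .config("spark.mongodb.input.uri", "mongodb://localhost/retail.customers")
    .config("spark.mongodb.output.uri", "mongodb://localhost/retail.segments")
    .getOrCreate()
)

# Load a MongoDB collection as a DataFrame; the connector infers the schema by sampling.
customers = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()

# Filters and projections are pushed down to MongoDB, so only matching
# documents and fields cross the network.
uk_segments = (
    customers.filter(customers.country == "UK")
    .groupBy("segment")
    .count()
)

# Write the aggregated results back to MongoDB for the operational application.
uk_segments.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()
```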
35. Query and Data Model
• Rich query language & secondary indexes: MongoDB Yes; Relational Yes; Column Family (e.g. HBase) requires integration with a separate Spark/Hadoop cluster
• In-database aggregations & search: MongoDB Yes; Relational Yes; Column Family requires integration with a separate Spark/Hadoop cluster
• Dynamic schema: MongoDB Yes; Relational No; Column Family Partial
• Data validation: MongoDB Yes; Relational Yes; Column Family app-side code
Why it matters:
• Query & Aggregations: rich, real-time analytics against operational data
• Dynamic Schema: manage multi-structured data
• Data Validation: enforce data governance between the data lake & operational apps
36. Data Lake Integration
• Hadoop + secondary indexes: MongoDB Yes; Relational Yes (expensive); Column Family (e.g. HBase) no secondary indexes
• Spark + secondary indexes: MongoDB Yes; Relational Yes (expensive); Column Family no secondary indexes
• Native BI connectivity: MongoDB Yes; Relational Yes; Column Family 3rd-party connectors
• Workload isolation: MongoDB Yes; Relational Yes (expensive); Column Family load data to a separate Spark/Hadoop cluster
Why it matters:
• Hadoop + Spark: efficient data movement between the data lake, processing layer & database
• Native BI Connectivity: visualizing operational data
• Workload Isolation: separation between operational and analytical workloads
37. Operationalizing for Scale & Security
• Robust security controls: MongoDB Yes; Relational Yes; Column Family (e.g. HBase) Yes
• Scale-out on commodity hardware: MongoDB Yes; Relational No; Column Family Yes
• Sophisticated management platform: MongoDB Yes; Relational Yes; Column Family monitoring only
Why it matters:
• Security: data protection for regulatory compliance
• Scale-Out: grow with the data lake
• Management: reduce TCO with platform automation, monitoring, disaster recovery
We've seen rapid growth in adoption of the data lake – a centralized repository for the many new data sources organizations are now collecting.
But it's not without challenges – the primary challenge is how to make the analytics generated by the data lake available to our real-time, operational apps.
So we are going to cover:
The rise of the data lake
The challenges in getting the most business value out of the data lake
The role that databases play, and the requirements on them
Case studies of companies unlocking insight from the data lake
As enterprises bring more products and services online as part of digital transformation initiatives, the one thing they don't lack today is data – from streams of sensor readings, to social sentiment, to machine logs, mobile apps, and more.
Analysts estimate volumes growing at 40% per annum, with 80% of all data unstructured.
At the same time, we see more pressure on time to market, on exposing apps to global audiences, and on reducing the cost of delivering new services.
These trends fundamentally change how enterprises build and run modern apps.
With all of this new data available, we are creating an insight economy.
Uncovering new insights by collecting and analyzing this data carries the promise of competitive advantage and efficiency savings. Better understand customers by predicting what they might buy based on behavior and demographics; optimize the supply chain with better or faster routes; reduce the risk of fraud by identifying suspicious behavior – it's all about that data.
Those that don't harness data are at a major disadvantage.
Understand the past, monitor the present, and predict the future.
MIT: data-driven decision environments have 5% higher productivity, 6% higher profit and up to 50% higher market value than other businesses.
Traditionally, data from operational apps has flowed into the DW, which takes all this data in and then creates analytics from it.
However, the traditional Enterprise Data Warehouse (EDW) is straining under the load, overwhelmed by the sheer volume and variety of data pouring into the business. Costs run hundreds to thousands of dollars per TB, versus tens to hundreds in commodity systems.
Because of these challenges many organizations have turned to Hadoop as a centralized repository for this new data, creating what many call a data lake. Not a replacement but an adjunct – it stores all the new data and applies new analytics, which are combined with the traditional reporting coming from the DW.
Gartner estimates around 50% of enterprises have rolled out, or are in the process of rolling out, data lakes.
When we think about data lakes, we think about big data, and big data is often associated with Hadoop – but the reality is more than just Hadoop.
Market growth forecast by Wikibon: "big data revenues" growing from $19bn in 2016 to $92bn in 2026, with software outpacing hardware and professional services. IDC forecasts just under $50bn by 2019, a 23% CAGR, with software growing fastest.
Leading the charge are Hadoop and Spark, closely followed by databases – a key part of the big data landscape, because they operationalize the data lake: the link between the back-end data lake and the front-end apps that consume analytics to make those apps smarter.
Hadoop is well established – it celebrates its 10th anniversary this year.
It has grown from HDFS and MapReduce into dozens of projects – Gartner identifies 19 common projects supported by the 4 leading distros. The average distro has many more: processing frameworks, search, provisioning and management, security, file formats, integration.
Each project is developed independently – its own roadmap, its own dependencies – incredible complexity.
HDFS is the common storage layer, against which the processing frameworks run to produce the outputs you see on the slide.
While something like 50% of enterprises either have or are evaluating Hadoop to create new classes of app, it is not without its challenges.
This appears in a number of Gartner analyses, and in the press.
One of the fundamental integration challenges is how to integrate the data lake with your operational systems.
Operational apps run the business – how do you expose the analytics created in the data lake to better serve customers with more relevant products and offers, or to drive efficiency savings from an IoT-enabled smart factory?
Unify data lake analytics with the operational applications
Enables you to create smart, contextually aware, data-driven apps
Integrated database layer operationalizes the data lake
The differences come in how data is stored, accessed and updated. Hadoop is a file system – it stores data in files in blocks, has no knowledge of the underlying data, and has no indexes. If you want to access a specific record, you scan all the data stored in the file where the record is located – which could be tens of MBs.
HDFS characteristics:
WORM – i.e. to update customer data, you rewrite all of that customer data, not just the individual customer's record.
Hadoop excels at generating analytics models by scanning and processing large datasets, but is not designed to provide real-time, random access for operational applications.
the time to read the whole dataset is more important than the latency in reading the first record.
http://stackoverflow.com/questions/15675312/why-hdfs-is-write-once-and-read-multiple-times/37300268#37300268
But MongoDB is more than just a filesystem. It is a full database, so it gives you a whole set of things HDFS doesn't:
Millisecond-latency query responsiveness.
Random access to indexed subsets of data.
Expressive querying & flexible indexing: supporting complex queries and aggregations against the data in real time, making online applications smarter and contextual.
Updating fast-changing data in real time as users interact with online applications, without having to rewrite the entire data set (see the update sketch below).
Fine-grained access with complex filtering logic.
Use distributed processing libraries against it – a MongoDB collection or document looks like an input or output in HDFS. Rather than loading a file, you load a DataFrame. Hive sees MongoDB as a table.
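A minimal pymongo sketch of that update-in-place point: fold new model output into one customer document without rewriting or recomputing the data set. The collection, fields and values here are hypothetical.

```python
from datetime import datetime, timezone
from pymongo import MongoClient

customers = MongoClient()["retail"]["customers"]  # hypothetical deployment

# Apply the latest model output to a single profile, in place.
customers.update_one(
    {"_id": "cust-42"},
    {"$set": {
        "recommendations": ["sku-1001", "sku-2040"],
        "segment": "frequent-buyer",
        "model_updated_at": datetime.now(timezone.utc),
    }},
    upsert=True,
)
```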
Longer jobs
Batch analytics
Append only files
Great for scanning all data or large subsets in files
The obvious question is why we need a database when we have Hadoop. It comes down to how each platform persists and accesses data. HDFS is a file system – it accesses data in batches of 128MB blocks. MongoDB is a database which provides fine-grained access to data at the level of individual records – giving each system very different properties – talk through.
Despite those differences, there are lots of similarities in how we process data – MapReduce, Spark. These are unopinionated about the underlying persistence layer – it could be HDFS, it could be MongoDB. That means you can unify analytics across the data lake and your database.
Both MongoDB and HDFS provide common attributes: schema-on-read, multiple replicas for fault tolerance, horizontal scale, low TCO.
But they have different characteristics in how they store and access data – meaning they are suited to different parts of the data lake deployment.
When you bring the database and the data lake together, you can build powerful, data driven apps
Take a real-life example – the data lake of a large retailer.
The online storefront and e-commerce engine is powered by MongoDB – handling customer profiles, sessions, baskets, product catalogs – presenting recommendations and offers.
As customers browse the site, all of their activity is written back to Hadoop, blending it with other data sources – social feeds, demographics, market data, credit scores, currency feeds – to segment and cluster customers.
These segments can then be exposed to MongoDB, so when customers come back they are presented with a personalized experience – based on what they have browsed before and what they are likely to want to purchase next.
You could not serve that operational app, dealing with individual customers, from HDFS – it's not real time, there are no indexes to access just the customer details you need, and there is no way of updating a customer record – everything is rewritten and recomputed.
Regression and classification for customer clustering
Let's go deeper and wider.
This is a design pattern for the data lake – multiple components that collectively handle ingest, storage, processing and analysis of data, then serve it to consuming operational apps.
Step through it.
Data ingestion: data streams are ingested into a pub/sub message queue, which routes all raw data into HDFS.
Often there is also event processing running against the queue to find interesting events that need to be consumed by the operational apps immediately – displaying an offer to a user browsing a product page, or alarms generated against vehicle telemetry from an IoT app. These events are routed to MongoDB for immediate consumption by operational applications (see the routing sketch below).
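A minimal routing sketch under stated assumptions: kafka-python consuming the queue and pymongo writing the interesting events to MongoDB. The topic, event types and collection names are hypothetical, and the raw stream would be landed in HDFS by a separate consumer.

```python
import json

from kafka import KafkaConsumer  # kafka-python
from pymongo import MongoClient

events = MongoClient()["ops"]["events"]  # hypothetical operational collection

consumer = KafkaConsumer(
    "app-events",                                    # hypothetical topic
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

# Route only the events the operational apps need right away into MongoDB;
# everything else continues to the data lake via a separate consumer.
for message in consumer:
    event = message.value
    if event.get("type") in {"offer_view", "telemetry_alarm"}:
        events.insert_one(event)
```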
Raw data is loaded into the data lake, where we can use Hadoop jobs – MapReduce or Spark – to generate analytics models from the raw data; see the examples in the layer above HDFS.
MongoDB exposes these models to the operational processes, serving indexed queries and updates against them with real-time latency
The distributed processing frameworks can re-compute analytics models, against data stored in either HDFS or MongoDB, continuously flowing updates from the operational database to analytics models
We'll look at some examples of users who have deployed this type of design pattern a little later.
Beyond low-latency performance, there are specific requirements. You need much more than just a datastore: a fully featured database serving as a system of record for online applications.
Tight integration between MongoDB and the data lake – minimize data movement between them and fully exploit the native capabilities of each part of the system.
You need to be able to serve operational workloads and run analytics against live operational data – e.g. the top trending articles right now so I know where to place my ads, or how many widgets coming off my production line are failing QA and whether that is up or down versus previous trends. Gartner calls it HTAP (Hybrid Transactional and Analytical Processing), Forrester calls it translytics. To do that, you need a powerful query language, secondary indexes, and aggregations & transformations all within the database – not ETL into a warehouse (see the aggregation sketch below).
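A minimal pymongo sketch of the "top trending articles right now" example, run as an in-database aggregation against live operational data rather than ETL into a warehouse; the collection and field names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from pymongo import MongoClient

pageviews = MongoClient()["analytics"]["pageviews"]  # hypothetical collection

one_hour_ago = datetime.now(timezone.utc) - timedelta(hours=1)

# The pipeline runs inside the database; a secondary index on ts keeps the $match cheap.
top_articles = list(pageviews.aggregate([
    {"$match": {"ts": {"$gte": one_hour_ago}}},
    {"$group": {"_id": "$article_id", "views": {"$sum": 1}}},
    {"$sort": {"views": -1}},
    {"$limit": 10},
]))
```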
Workload isolation: operational & analytics workloads must not contend for the same resources.
A flexible schema to handle multi-structured data, but with the ability to enforce governance on that data.
Secure access to the data: the operational DB is typically accessed by a much broader audience than Hadoop, so security controls are critical – robust access controls – LDAP, Kerberos, RBAC.
Auditing of all events for regulatory compliance. Encryption of data in motion and at rest, all built into the database.
It needs to scale as the data lake scales – which means scaling out on commodity hardware, often across geographic regions.
To simplify the environment, you need sophisticated management tools: to automate database deployment, scaling, monitoring and alerting, and disaster recovery.
Tight integration: it is not enough just to move data between the analytics and operational layers – you need to move it efficiently. Connectors should allow selective filtering by using secondary indexes to extract and process only the range of data needed – for example, retrieving all customers located in a specific geography. This is very different from other databases that do not support secondary indexes. In those cases, Spark and Hadoop jobs are limited to extracting all data based on a simple primary key, even if only a subset of that data is required for the query. That means more processing overhead, more hardware, and longer time-to-insight for the user.
Workload isolation: provision database clusters with dedicated analytics nodes, allowing users to simultaneously run real-time analytics and reporting queries against live data without impacting the nodes servicing the operational application (one way to do this is sketched below).
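One common way to get that isolation is replica-set member tags plus read preferences: analytics reads go to members tagged for that workload, leaving the operational members alone. A minimal pymongo sketch, assuming hypothetical hosts and a hypothetical {"workload": "analytics"} tag on the analytics secondaries (a hidden-member setup, as mentioned later, is another option).

```python
from pymongo import MongoClient
from pymongo.read_preferences import Secondary

client = MongoClient(
    "mongodb://host1:27017,host2:27017,host3:27017/?replicaSet=rs0"  # hypothetical hosts
)

# Route reporting/analytics reads to secondaries tagged for that workload,
# so the members serving the operational application are not impacted.
analytics_db = client.get_database(
    "retail",
    read_preference=Secondary(tag_sets=[{"workload": "analytics"}]),
)

regional_revenue = list(analytics_db.orders.aggregate([
    {"$group": {"_id": "$region", "revenue": {"$sum": "$total"}}},
]))
```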
A flexible data model to store data of any structure and easily evolve the model to capture new attributes – e.g. enriching user profiles with geospatial data. You also need to ensure data quality by enforcing validation rules against the data – to ensure it is appropriately typed and contains all the attributes the app needs (see the validator sketch below).
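A minimal sketch of enforcing those rules with MongoDB document validation (query-expression validators since 3.2; 3.6+ also offers $jsonSchema); the collection and field names are hypothetical.

```python
from pymongo import MongoClient
from pymongo.errors import CollectionInvalid, WriteError

db = MongoClient()["singleview"]  # hypothetical database

try:
    # Every document must carry an integer customer_id and an email containing "@".
    db.create_collection(
        "customers",
        validator={
            "customer_id": {"$type": "int"},
            "email": {"$regex": "@"},
        },
    )
except CollectionInvalid:
    pass  # collection already exists; validation rules stay as configured

try:
    db.customers.insert_one({"customer_id": "not-an-int", "email": "nobody"})
except WriteError as err:
    print("rejected by the validator:", err)
```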
Expressive queries allow developers to build applications that can query and analyze the data in multiple ways – by single keys, ranges, text search, and geospatial queries through to complex aggregations and MapReduce jobs, returning responses in milliseconds. Complex queries are executed natively in the database without having to use additional analytics frameworks or tools, avoiding the latency that comes from moving data between operational and analytical engines. Secondary indexes give you the ability to filter data in any way you need – key for low-latency operational queries (see the query sketch below).
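A small pymongo sketch of that range of query shapes, combining a range filter with a geospatial predicate, both served by secondary indexes; the collection and field names are hypothetical.

```python
from pymongo import ASCENDING, GEOSPHERE, MongoClient

readings = MongoClient()["iot"]["readings"]  # hypothetical collection

# Secondary indexes supporting the filters below.
readings.create_index([("location", GEOSPHERE)])
readings.create_index([("temperature", ASCENDING)])

# Readings above 90 degrees within 5 km of a depot, straight from the operational data.
hot_nearby = readings.find({
    "temperature": {"$gt": 90},
    "location": {
        "$near": {
            "$geometry": {"type": "Point", "coordinates": [-0.1276, 51.5072]},
            "$maxDistance": 5000,  # metres
        }
    },
})
for doc in hot_nearby:
    print(doc.get("sensor_id"), doc["temperature"])
```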
Robust security controls: govern access, provide audit trails, and encrypt data in flight and at rest.
Scale-out: match the scale-out of the data lake – as it grows, add new nodes to service higher data volumes or user load.
Advanced management platform. To reduce data lake TCO and risk of application downtime, powerful tooling to automate database deployment, scaling, monitoring and alerting, and disaster recovery.
Let's look at some examples in action.
CTM – the UK's leading price comparison site – moved from an on-prem, RDBMS-based monolithic app to a microservices architecture powered by MongoDB, with Hadoop at the back end providing analytics – enabling them to better personalize the customer experience and deepen relationships.
Read through bullets
The second example is a leading global airline. Through M&A it has multiple brands to serve different countries and market sectors, but customer data was spread across 100 different systems.
By using Hadoop and Spark, they brought that data together to create a single view, which is loaded into MongoDB to power the online apps – web and mobile, as well as the call center – so users get a consistent experience however they interact. All user data and ticket data is stored in MongoDB, then written back into Hadoop to run advanced analytics that allow ticket price optimization and identify offers and gaps in the product portfolio.
Read bullets
They provide a traveler safety platform for corporate customers – if there is a natural disaster or security incident while a traveler is away on business, they are able to send real-time alerts and advice on how to get to safety.
The platform was built for national governments and has now been launched for commercial use – analyzing PBs of data with MongoDB, Hadoop, Apache NiFi & SAP HANA.
Read bullets
McAfee built its cloud-based threat intelligence platform on MongoDB. The platform monitors threat activity for clients in real time – it identifies attacks as they take place and identifies when users may be interacting with insecure or suspicious sites.
All real-time activity is captured in MongoDB – providing alerting to security teams – and sent to Hadoop for further back-end analytics, with updated threat profiles written back to MongoDB.
MongoDB is open source – we also provide Enterprise Advanced: a collection of software and support to run in production at scale.
The Stratio Apache Spark-certified Big Data (BD) platform is used by an impressive client list including BBVA, Just Eat, Santander, SAP, Sony, and Telefonica. The company has implemented a unified real-time monitoring platform for a multinational banking group operating in 31 countries with 51 million clients all over the world. The bank wanted to ensure a high quality of service and personalized experience across its online channels, and needed to continuously monitor client activity to check service response times and identify potential issues. The application was built on a modern technology foundation including:
Apache Flume to aggregate log data
Apache Spark to process log events in real time
MongoDB to persist log data, processed events and Key Performance Indicators (KPIs).
The aggregated KPIs, stored by MongoDB enable the bank to analyze client and systems behavior in real time in order to improve the customer experience. Collecting raw log data allows the bank to immediately rebuild user sessions if a service fails, with analysis generated by MongoDB and Spark providing complete traceability to quickly identify the root cause of any issue.
The project required a database that provided always-on availability, high performance, and linear scalability. In addition, a fully dynamic schema was needed to support high volumes of rapidly changing semi-structured and unstructured JSON data being ingested from a variety of logs, clickstreams, and social networks. After evaluating the project’s requirements, Stratio concluded MongoDB was the best fit. With MongoDB’s query projections and secondary indexes, analytic processes run by the Stratio BD platform avoid the need to scan the entire data set, which is not the case with other databases.
China Eastern
Industry: Travel and Hospitality, Airline
Use Case: Search
While it's important to provide low-latency access to data, it's not enough to just support simple key-value lookups – the demand is to get insights from data faster. This is the role of real-time analytics: track in real time where the vehicles in your fleet are, what the social sentiment is to an announcement you've just made, correlate patterns of real-time fraud attempts against specific domains – this is where an expressive query language, secondary indexes, and in-database aggregations are valuable.
MongoDB and RDBMSs both have strong features here – the RDBMS is further ahead – while column family stores are little more than key-value. You need to move data out to other query frameworks or analytics nodes to get any intelligence – which adds latency and complexity – more moving parts.
The RDBMS is good in many areas, but it lacks the data model flexibility needed to handle rapidly changing, multi-structured data; that is where it falls down.
Column family offers more schema flexibility than relational, but you still need to pre-define columns, which restricts the speed at which you can evolve apps.
Data validation – applying rules to the data structures the operational database stores. Say an app creates a single view of your customer – data may be spread across many repositories, loaded into the data lake to create the single view, then loaded into MongoDB to serve operational apps – it needs to ensure documents contain mandatory fields such as unique customer identifiers, typed and formed in a specific way, e.g. the ID is always an integer, the email address always contains an @. Document validation in MongoDB enables you to do this. The RDBMS has full schema validation, so it is a little ahead; in a column family database you have to enforce governance in application code.
Looking at the aggregated scores: relational and MongoDB are evenly matched, with column family – a much simpler datastore – a long way behind.
Hadoop and Spark integration: you need to do more than just move vast amounts of data between each layer of the stack – you need intelligent connectors that can push down predicates and filter data with secondary indexes, e.g. to access all customers in a specific geography. Without access to the DB's secondary indexes, and without pre-aggregating data, you end up moving a ton of data back and forth – more processing cycles, longer latency.
The MongoDB connectors for Hadoop and for Spark both support these capabilities. Column family doesn't offer secondary indexes or aggregations, so there is nothing to filter the data.
The RDBMS offers these capabilities in its connectors, but they are generally only available as expensive add-ons, hence downgraded.
Workload isolation – the ability to perform real-time analytics on live operational data without interfering with operational apps. You don't want an aggregation counting how many deliveries your fleet of trucks has made to interfere with how quickly you can detect from sensor data that a vehicle has developed a fault. The key to doing this is distributing queries to dedicated nodes in the database cluster – some provisioned for operational workloads, replicating to nodes dedicated to analytics. MongoDB supports up to 50 members in a single replica set, and analytics members can be configured as hidden so they are never hit by operational queries. Column family is restricted to just 3 data replicas – there for HA, not for separation of different workloads. For the RDBMS it is an expensive add-on.
Native BI connectivity – may not be relevant in all cases, but many organizations want to be able to create live dashboards reporting the current state of operational systems. MongoDB has a native BI connector that exposes the database as an ODBC data source – visualize it in anything from Tableau to BusinessObjects to Excel. There is rich tooling in the relational world. For column family, connectors exist but they are 3rd party and don't push queries down to the database; instead they extract all the data – so powering dashboards is more computationally and network intensive.
Security: data from the operational database is exposed to apps and potentially millions of users – you need to provide robust access controls, which may include integration with LDAP, Kerberos, PKI environments and RBAC to tightly segregate who can do what in the DB. Encrypt data in flight and at rest, and maintain a log of activity in the DB for forensic analysis.
All solutions do well here – there is big investment in the Hadoop ecosystem, which is rapidly gaining ground on the RDBMS, but at much lower cost.
Scale-out: you need to be able to scale as the data lake scales and as more digital services are opened up to users – a core strength of non-relational databases. The fundamental challenge is that the RDBMS requires scale-up: limited headroom and very expensive proprietary hardware.
Management: Hadoop is complex and its management tools are still primitive. For the operational database, you need a platform that provides powerful tooling to automate database deployment, scaling, fine-grained monitoring and alerting, and disaster recovery with point-in-time backups and automated restores. There is rich tooling in the relational world – and big investment from MongoDB to close that gap.
Left-hand side – MongoDB maintains the attributes of relational databases, blended with innovation from NoSQL.
This uniquely differentiates MongoDB from its peers in the non-relational DB market.
Invest in technology that has production-proven deployments and broad skills availability.
With availability of Hadoop skills cited by Gartner analysts as a top challenge, it is essential you choose an operational database with a large available talent pool. This enables you to find staff who can rapidly build differentiated big data applications. Across multiple measures, including DB Engines Rankings, 451 Group NoSQL Skills Index and the Gartner Magic Quadrant for Operational Databases, MongoDB is the leading non-relational database.