A new generation of technologies is needed to consume and exploit today's real-time, fast-moving data sources. Apache Kafka, originally developed at LinkedIn, has emerged as one of these key new technologies.
This webinar explores the use cases and architecture for Kafka, and how it integrates with MongoDB to build sophisticated data-driven applications that exploit new sources of data.
11. What is Streaming Data?
• Synchronous request/response: 0 to 100s of ms
• Near real time: > 100s of ms
• Offline batch: > 1 hour
[Diagram] Kafka as a stream data platform, feeding search, RDBMS, app monitoring, real-time analytics, NoSQL, and stream processing systems, alongside a Hadoop data lake (Impala, DWH, Hive, Spark, Map-Reduce).
13. Confluent Platform: It’s Kafka ++

| Feature | Benefit |
| --- | --- |
| Apache Kafka | High-throughput, low-latency, highly available, secure distributed message system |
| Kafka Connect | Advanced framework for connecting external sources/destinations into Kafka |
| Kafka Streams | Simple library that enables streaming application development within the Kafka framework |
| Additional Clients | Supports non-Java clients: C, C++, Python, etc. |
| REST Proxy | Provides universal access to Kafka from any network-connected device via HTTP |
| Schema Registry | Central registry for the format of Kafka data – guarantees all data is always consumable |
| Pre-Built Connectors | HDFS, JDBC, Elasticsearch and other connectors, fully certified and fully supported by Confluent |
| Confluent Control Center | Enables easy connector management and stream monitoring |
| Data Center & Cloud | MDC replication, auto-data balancing |
| Support | Enterprise-class support to keep your Kafka environment running at top performance |

Editions: Apache Kafka (community support, free), Confluent Open Source (community support, free), Confluent Enterprise (24x7x365 support, subscription).
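To give a feel for the REST Proxy, here is a minimal sketch (assuming a proxy on localhost:8082 and an illustrative topic named `demo`) that publishes a JSON record over plain HTTP from Python:

```python
# Sketch: produce a JSON record to Kafka through the Confluent REST Proxy.
import requests

resp = requests.post(
    "http://localhost:8082/topics/demo",  # illustrative topic name
    headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
    json={"records": [{"value": {"sensor": "s1", "temp": 21.5}}]},
)
resp.raise_for_status()
print(resp.json())  # partitions/offsets assigned to the produced records
```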
14. Common Kafka Use Cases
Data transport and integration
• Log data
• Database changes
• Sensors and device data
• Monitoring streams
• Call data records
• Stock ticker data
Real-time stream processing
• Monitoring
• Asynchronous applications
• Fraud and security
15. Kafka Adoption in Large Enterprises
• 6 of the top 10 travel companies
• 8 of the top 10 insurance companies
• 7 of the top 10 global banks
• 9 of the top 10 telecom companies
16. People Using Kafka Today
• Financial Services
• Entertainment & Media
• Consumer Tech
• Travel & Leisure
• Enterprise Tech
• Telecom
• Retail
30. Design Pattern: Operationalized Data Lake
[Diagram] Sensors, user data, clickstreams, and logs feed a message queue (Kafka). Raw data lands in HDFS while processed events land in MongoDB. Distributed processing frameworks and Kafka Streams generate analytics models from the raw data: churn analysis, enriched customer profiles, risk modeling, and predictive analytics. MongoDB provides real-time access for customer data management, mobile apps, IoT apps, and live dashboards; HDFS provides batch processing and batch views.
• MongoDB: millisecond latency; expressive querying & flexible indexing against subsets of data; updates in place; in-database aggregations & transformations.
• HDFS: multi-minute latency with scans across TB/PB of data; no indexes; data stored in 128MB blocks; write-once-read-many & append-only storage model.
31. Design Pattern: Operationalized Data Lake (diagram as slide 30)
Callout: configure where to land incoming data.
32. Design Pattern: Operationalized Data Lake (diagram as slide 30)
Callout: raw data is processed to generate analytics models.
33. Design Pattern: Operationalized Data Lake (diagram as slide 30)
Callout: MongoDB exposes analytics models to operational apps and handles real-time updates.
34. Design Pattern: Operationalized Data Lake (diagram as slide 30)
Callout: compute new models against MongoDB & HDFS.
42. MongoDB Atlas
Database as a service for MongoDB
MongoDB Atlas is…
• Automated: The easiest way to build, launch, and scale apps on MongoDB
• Flexible: The only database as a service with all you need for modern applications
• Secured: Multiple levels of security available to give you peace of mind
• Scalable: Deliver massive scalability with zero downtime as you grow
• Highly available: Your deployments are fault-tolerant and self-healing by default
• High performance: The performance you need for your most demanding workloads
43. MongoDB Atlas Features
Database as a service for MongoDB

Run for You
• Spin up a cluster in minutes
• Replicated & always-on deployments
• Fully elastic: scale out or up in a few clicks with zero downtime
• Automatic patches & simplified upgrades for the newest MongoDB features

Safe & Secure
• Authenticated & encrypted
• Continuous backup with point-in-time recovery
• Fine-grained monitoring & custom alerts

No Lock-In
• On-demand pricing model; billed by the hour
• Multi-cloud support (AWS available, with others coming soon)
• Part of a suite of products & services designed for all phases of your app; migrate easily to different environments (private cloud, on-prem, etc.) when needed
44. MongoDB Enterprise Advanced
Tooling
• MongoDB Ops Manager or MongoDB Cloud Manager Premium
• MongoDB Compass
• MongoDB Connector for BI
• Cloud Foundry Integration

Security
• Encrypted Storage Engine
• LDAP / Kerberos Integration
• DDL & DML Auditing
• FIPS 140-2 Support

Support
• 24 x 7 Support
• 1 hr SLA
• Emergency Patches
• Customer Success Program
• On-Demand Training

License
• Commercial License
45. Resources
• Data Streaming with Apache Kafka & MongoDB
https://www.mongodb.com/collateral/data-streaming-with-apache-kafka-and-mongodb
• Implementing a Kafka Consumer for MongoDB
https://www.mongodb.com/blog/post/mongodb-and-data-streaming-implementing-a-mongodb-kafka-consumer
• Tailing the Oplog on a sharded MongoDB Cluster
https://www.mongodb.com/blog/post/tailing-mongodb-oplog-sharded-clusters
A lot of people expect us to come in and bash relational databases or say we don't think they're good. And that's simply not true.
Relational databases have laid the foundation for what you'd want out of a database, and we absolutely think there are capabilities that remain critical today.
Expressive queries. Users should be able to access and manipulate their data in sophisticated ways – and you need a query language that lets you do all that out of the box. Indexes are a critical part of providing efficient access to data. We believe these are table stakes for a database.
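A minimal sketch of what that looks like in practice, using MongoDB's Python driver (the `shop` database, `orders` collection, and field names are all illustrative):

```python
# Illustrative only: an expressive query backed by a secondary index in PyMongo.
from pymongo import ASCENDING, DESCENDING, MongoClient

db = MongoClient("mongodb://localhost:27017")["shop"]

# A compound secondary index so the query below doesn't scan the collection.
db.orders.create_index([("customer_id", ASCENDING), ("created_at", DESCENDING)])

# Filter, project, sort, and limit in one expressive query – out of the box.
recent = (
    db.orders.find(
        {"customer_id": 42, "total": {"$gte": 100}},
        {"_id": 0, "total": 1, "created_at": 1},
    )
    .sort("created_at", DESCENDING)
    .limit(10)
)

for order in recent:
    print(order)
```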
Strong consistency. Strong consistency has become second nature for how we think about building applications, and for good reason. The database should always provide access to the most up-to-date copy of the data. Strong consistency is the right way to design a database.
Enterprise Management and Integrations. Finally, databases are just one piece of the puzzle, and they need to fit into the enterprise IT stack. Organizations need a database that can be secured, monitored, automated, and integrated with their existing IT infrastructure and staff, such as operations teams, DBAs, and data analysts.
But of course the world has changed a lot since the 1980s when the relational database first came about.
First of all, data and risk are significantly up.
In terms of data:
- 90% of all data was created in the last 2 years - think about that for a moment: of all the data ever created, 90% of it was in the last 2 years
- 80% of enterprise data is unstructured - this is data that doesn't fit into the neat tables of a relational database
- Unstructured data is growing at 2X the rate of structured data
At the same time, risks of running a database are higher than ever before. You are now faced with:
More users - Apps have shifted from small internal departmental systems with thousands of users to large external audiences with millions of users
No downtime - It’s no longer the case that apps only need to be available during standard business hours. They must be up 24/7.
All across the globe - your users are everywhere, in multiple timezones and they are always connected
On the other hand, time and costs are way down.
There’s less time to build apps than ever before. You’re being asked to:
Ship apps in a few months not years - Development methods have shifted from a waterfall process to an iterative process that ships new functionality in weeks and in some cases multiple times per day at companies like Facebook and Amazon.
And costs are way down too. Companies want to:
- Pay for value over time - Companies have shifted to open-source business and SaaS models that allow them to pay for value over time
- They use cloud and commodity resources - to reduce the time to provision their infrastructure, and to lower their total cost of ownership
Because the relational database was not designed for modern applications, starting about 10 years ago a number of companies began to build their own databases that are fundamentally different. The market calls these NoSQL.
NoSQL databases were designed for this new world…
Flexibility. All of them have some kind of flexible data model to allow for faster iteration and to accommodate the data we see dominating modern applications. While they all have different approaches, what they have in common is they want to be more flexible.
Scalability + Performance. Similarly, they were all built with a focus on scalability, so they all include some form of sharding or partitioning. And they're all designed to deliver great performance. Some are better at reads, some are better at writes, but more or less they all strive to have better performance and scalability than a relational database.
Always-On Global Deployments. Lastly, NoSQL databases are designed for highly available systems that provide a consistent, high quality experience for users all over the world. They are designed to run on many computers, and they include replication to automatically synchronize the data across servers, racks, and data centers.
However, when you take a closer look at these NoSQL systems, it turns out they have thrown out the baby with the bathwater. They have sacrificed the core database capabilities you’ve come to expect and rely on in order to build fully functional apps, like rich querying and secondary indexes, strong consistency, and enterprise management.
MongoDB was built to address the way the world has changed while preserving the core database capabilities required to build modern applications.
Our vision is to leverage the work that Oracle and others have done over the last 40 years to make relational databases what they are today, and to take the reins from here. We pick up where they left off, incorporating the work that internet pioneers like Google and Amazon did to address the requirements of modern applications.
MongoDB is the only database that harnesses the innovations of NoSQL and maintains the foundation of relational databases – and we call this our Nexus Architecture.
When using any database as a producer, it's necessary to capture any database changes so that they can be written to Kafka. With MongoDB this can be achieved by monitoring its oplog.
The oplog (operations log) is a special capped collection that keeps a rolling record of all operations that modify the data stored in your database. Tailable cursors are then used on that collection.
Tailable cursors have many uses, such as real-time notifications of all the changes to your database. A tailable cursor is conceptually similar to the Unix `tail -f` command: once you've reached the end of the result set, the cursor is not closed; rather, it continues to wait for new data and, as it arrives, returns that too.
MongoDB replication is implemented using the oplog and tailable cursors; the primary node records all write operations in its oplog. The secondary members then asynchronously fetch and then apply those operations.
By using a tailable cursor on the oplog, an application receives all changes that are made to the database in near real-time.
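A minimal sketch of that idea with PyMongo, assuming a replica set on localhost (the connection details are illustrative):

```python
# Sketch: watch the oplog with a tailable cursor and print each change.
from pymongo import CursorType, MongoClient

oplog = MongoClient("mongodb://localhost:27017").local.oplog.rs

# Start after the most recent entry already in the oplog.
ts = next(oplog.find().sort("$natural", -1).limit(1))["ts"]

while True:
    # TAILABLE_AWAIT keeps the cursor open at the end of the collection,
    # blocking until new entries arrive – much like `tail -f`.
    cursor = oplog.find({"ts": {"$gt": ts}}, cursor_type=CursorType.TAILABLE_AWAIT)
    for entry in cursor:
        ts = entry["ts"]
        print(entry["op"], entry["ns"])  # operation type and namespace
```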
A producer can be written to propagate all MongoDB writes to Kafka by tailing the oplog in the same way. The logic is more complex when using a sharded cluster:
- The oplog for each shard must be tailed.
- The MongoDB shard balancer occasionally moves documents from one shard to another, causing *deletes* to be written to the originating shard's oplog and *inserts* to that of the receiving shard; those internal operations must be filtered out (see the sketch below).
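Extending the sketch above, a hedged version of such a producer might forward each entry to Kafka with the kafka-python client, filtering out the balancer's internal operations via the `fromMigrate` flag that MongoDB stamps on migration-generated oplog entries (the topic name and connection details are illustrative):

```python
# Sketch: tail one shard's oplog and forward entries to Kafka,
# skipping internal chunk-migration operations.
from bson import json_util
from kafka import KafkaProducer
from pymongo import CursorType, MongoClient

oplog = MongoClient("mongodb://localhost:27017").local.oplog.rs
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    # json_util serializes BSON types (Timestamps, ObjectIds) that plain json cannot.
    value_serializer=lambda doc: json_util.dumps(doc).encode("utf-8"),
)

ts = next(oplog.find().sort("$natural", -1).limit(1))["ts"]
while True:
    cursor = oplog.find(
        {"ts": {"$gt": ts}, "fromMigrate": {"$exists": False}},
        cursor_type=CursorType.TAILABLE_AWAIT,
    )
    for entry in cursor:
        ts = entry["ts"]
        producer.send("mongodb.oplog", entry)  # illustrative topic name
```

A real producer would also persist the last-seen `ts` so it can resume after a restart, and would run one such loop per shard.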
This is a design pattern for the data lake – multiple components that collectively ingest, store, process and analyse data, then serve it to consuming operational apps.
Stepping through the diagram:
Data ingestion: Data streams are ingested to a pub/sub message queue, which routes all raw data into HDFS.
You often also have event processing running against the queue to find interesting events that need to be consumed by the operational apps immediately – an offer displayed to a user browsing a product page, or an alarm generated against vehicle telemetry from an IoT app. These events are routed to MongoDB for immediate consumption by operational applications.
Raw data is loaded into the HDFS data lake where we can use Hadoop jobs or Spark to generate analytics models from the raw data – see examples in the layer above HDFS
MongoDB exposes these models to the operational processes, serving indexed queries and updates against them with real-time latency
The distributed processing frameworks can re-compute analytics models, against data stored in either HDFS or MongoDB, continuously flowing updates from the operational database to analytics models.
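To make the MongoDB-facing steps concrete, here is a minimal sketch of an event consumer, assuming the kafka-python client (the topic, database, collection, and `event_id` field are all illustrative), that lands processed events in MongoDB where the operational apps can query them:

```python
# Sketch: consume processed events from Kafka and land them in MongoDB
# for millisecond-latency access by the operational applications.
import json

from kafka import KafkaConsumer
from pymongo import MongoClient

events = MongoClient("mongodb://localhost:27017")["lake"]["processed_events"]

consumer = KafkaConsumer(
    "processed.events",  # illustrative topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for message in consumer:
    # Upsert keyed on the event id, so replayed messages don't create duplicates.
    events.replace_one({"_id": message.value["event_id"]}, message.value, upsert=True)
```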
**Comparethemarket.com**: One of the UK’s leading price comparison providers, and one of the country’s best known household brands. Comparethemarket.com uses MongoDB as the default operational database across its microservices architecture. Its online comparison systems need to collect customer details efficiently and then securely send them to a number of different providers. Once the insurers' systems respond, Comparethemarket.com can aggregate and display prices for consumers. At the same time, MongoDB generates real-time analytics to personalize the customer experience across the company's web and mobile properties.
As Comparethemarket.com transitioned to microservices, the data warehousing and analytics stack was also modernized. While each microservice uses its own MongoDB database, the company needs to maintain synchronization between services, so every application event is written to a Kafka topic. Event processing runs against the topic to identify relevant events that can then trigger specific actions – for example customizing customer questions, firing off emails, presenting new offers and more. Relevant events are written to MongoDB, enabling the user experience to be personalized in real time as customers interact with the service.
**Man AHL**: Man is one of the largest hedge fund managers in the world; AHL is a subsidiary focused on systematic trading, and it has been moving all of its data to MongoDB.
AHL needs to analyse data from a large number of disparate data sources, and retrieving that data is 100x faster than when it used flat files and an RDBMS.
MongoDB is used for futures and single-stock forecasting. The Kafka use case is "tick data" – every change in the price of a stock – around 400,000,000 messages per day.
The source of the data is 3rd-party commercial feeds into a 3rd-party message bus. 150K ticks/sec are written to Kafka, which buffers, batches, and can replay ticks in the event of a problem; the data is then written to MongoDB. Each database holds a year's worth of ticks.
The result: 25x greater tick throughput with just 2 machines – 250M per second – and a 40x cost saving.
Quant = Quantitative analyst
**State**: State is an intelligent opinion network, connecting people with similar beliefs who want to join forces and make waves. User and opinion data is written to MongoDB and then the oplog is tailed so that all changes are written to user and opinion topics in Kafka, where they are consumed by the user recommendation engine.
Details on State's use of MongoDB and Kafka can be found in this [presentation](http://www.slideshare.net/danharvey/change-data-capture-with-mongodb-and-kafka "Use of MongoDB and Kafka in the State social network").
Built and managed by the same team that builds the database, MongoDB Atlas provides the features of MongoDB without the operational heavy lifting, enabling you to focus on what you do best.
MongoDB Enterprise Advanced provides everything you need to [insert relevant value driver. Draw from relevant bullets below to support this claim]
MongoDB Ops Manager or Cloud Manager Premium – a full management platform to de-risk MongoDB in production:
- Monitor the health of your system
- Visual query profiler to identify slow-running queries
- Index suggestions and automated index rollouts
- Automate deployment, configuration, maintenance, upgrades and scaling
- Back up and restore to any point in time (standard network-mountable filesystems supported)
- APM integration with enhanced drivers
- Ops Manager runs behind your firewall
MongoDB Compass – schema and data visualization: understand the data stored in your database with no knowledge of the MongoDB query language, and run ad hoc queries with a few clicks of your mouse
BI Connector – Visualize and analyze the multi-structured data stored in MongoDB using SQL-based BI tools such as Tableau, Qlikview, Spotfire and more
Enterprise-grade, follow-the-sun support with a 1-hour SLA:
- Not just break/fix support
- Direct access to industry best practices
On-Demand Training – access to our online courses at your own pace to get team members up to speed
Advanced Security:
- Encrypted Storage Engine for end-to-end database encryption
- LDAP and Kerberos integration with your existing authentication and authorization infrastructure
- Auditing of all database operations for compliance
Commercial license – meets the needs of organizations that have policies against using open source (AGPL) software
Platform Certification – tested and certified for stability and performance on Windows, Red Hat/CentOS, Ubuntu, and Amazon Linux, plus IBM Power & zSeries
Beta access to the In-Memory storage engine for your most demanding, ultra-high-throughput apps: in-memory computing without sacrificing data durability