The document discusses operationalizing data lakes by integrating MongoDB with Hadoop to enable both real-time and batch processing. MongoDB powers operational applications with low-latency access to analytics models generated from raw data stored in Hadoop, while Hadoop continues to provide batch processing and analytics over large datasets. By combining the two technologies, companies can unlock insights from their data lakes and avoid joining the roughly 70% of Hadoop projects that fail to meet their objectives because of skills and integration challenges.
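As a rough illustration of the serving side of this pattern (the collection, field names, and connection string are invented for the example, not taken from the document), an operational application reads a model that a Hadoop batch job has already written into MongoDB, so each request is just a low-latency indexed lookup:

```python
# Hypothetical sketch: serve batch-computed analytics from MongoDB with an indexed read.
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
scores = client.datalake.customer_scores

# One-time: index the lookup key so reads stay in the low-millisecond range.
scores.create_index([("customer_id", ASCENDING)])

def get_recommendations(customer_id):
    """Return the batch-computed recommendations for one customer, if any."""
    doc = scores.find_one(
        {"customer_id": customer_id},
        {"_id": 0, "recommended_products": 1, "churn_score": 1},
    )
    return doc or {}

print(get_recommendations("C-1042"))
```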
Creating a Modern Data Architecture for Digital Transformation (MongoDB)
By managing Data in Motion, Data at Rest, and Data in Use differently, modern Information Management Solutions are enabling a whole range of architecture and design patterns that allow enterprises to fully harness the value in data flowing through their systems. In this session we explored some of the patterns (e.g. operational data lakes, CQRS, microservices and containerisation) that enable CIOs, CDOs and senior architects to tame the data challenge, and start to use data as a cross-enterprise asset.
Webinar: Data Streaming with Apache Kafka & MongoDB (MongoDB)
A new generation of technologies is needed to consume and exploit today's real time, fast moving data sources. Apache Kafka, originally developed at LinkedIn, has emerged as one of these key new technologies.
The importance of efficient data management for Digital Transformation (MongoDB)
Digital transformation involves profoundly transforming business activities, processes, competencies, and models to leverage changes from digital technologies strategically. It requires new capabilities and data management maturity. There are three areas of data management: data in motion which involves transferring data between systems; data at rest which refers to how data is stored; and data in use which is about extracting, transforming and analyzing data. A modern data platform uses cloud native technologies to manage data in real-time across all three areas at massive scales.
Webinar: 10-Step Guide to Creating a Single View of your Business (MongoDB)
Organizations have long seen the value in aggregating data from multiple systems into a single, holistic, real-time representation of a business entity. That entity is often a customer. But the benefits of a single view in enhancing business visibility and operational intelligence can apply equally to other business contexts. Think products, supply chains, industrial machinery, cities, financial asset classes, and many more.
However, for many organizations, delivering a single view to the business has been elusive, impeded by a combination of technology and governance limitations.
MongoDB has been used in many single view projects across enterprises of all sizes and industries. In this session, we will share the best practices we have observed and institutionalized over the years. By attending the webinar, you will learn:
- A repeatable, 10-step methodology for successfully delivering a single view
- The required technology capabilities and tools to accelerate project delivery
- Case studies from customers who have built transformational single view applications on MongoDB.
- MongoDB is a document database management system that is recognized as a leader by Gartner. It has over 520 employees, 2500+ customers, and offices globally.
- MongoDB ranked 4th in database mindshare according to DB-Engines. It has seen 172% growth in the last 20 months.
- Several companies such as a quantitative investment manager, an insurance company, a telecommunications company, and an ecommerce company migrated their systems to MongoDB and saw benefits like 100x faster data retrieval, 50% lower costs, and being able to build applications faster.
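A minimal sketch of the single-view pattern described above, assuming two hypothetical source collections (`crm_customers` and `policy_system`) that share a `customer_id`; records from each source are folded into one customer document by upsert:

```python
# Illustrative only: merge records from two source systems into a single-view collection.
from pymongo import MongoClient, UpdateOne

db = MongoClient("mongodb://localhost:27017").single_view

ops = []
for crm in db.crm_customers.find():
    ops.append(UpdateOne(
        {"_id": crm["customer_id"]},
        {"$set": {"name": crm["name"], "email": crm["email"]}},
        upsert=True))
for pol in db.policy_system.find():
    ops.append(UpdateOne(
        {"_id": pol["customer_id"]},
        {"$addToSet": {"policies": {"policy_no": pol["policy_no"],
                                    "premium": pol["premium"]}}},
        upsert=True))

if ops:
    # One bulk round trip builds or enriches each customer_360 document.
    db.customer_360.bulk_write(ops)
```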
Big Data Paris - A Modern Enterprise Architecture (MongoDB)
Since the 1980s, the volume of data produced and the risk attached to that data have exploded. 90% of the data that exists today was created in the last two years, and 80% of it is unstructured. With more users and a need for always-on availability, the risks are much higher.
Which database parameters should a decision-maker take into account when deploying innovative applications?
MongoDB Europe 2016 - The Rise of the Data Lake (MongoDB)
The document discusses the rise of data lakes and how MongoDB can be used to build modern data management architectures. It provides examples of how companies like a Spanish bank and an insurance leader used MongoDB to create a single customer view across siloed data sources and improve customer experiences. The document also outlines common data processing patterns and how to choose the best data store for different parts of the data pipeline.
In this slidedeck, Infochimps Director of Product, Tim Gasper, discusses how Infochimps tackles business problems for customers by deploying a comprehensive Big Data infrastructure in days; sometimes in just hours. Tim unlocks how Infochimps is now taking that same aggressive approach to deliver faster time to value by helping customers develop analytic applications with impeccable speed.
This document summarizes a presentation about big data analytics solutions from Think Big Analytics and Infochimps. It discusses using their platforms together to power applications with next-generation big data stacks. It highlights case studies, architecture diagrams, and polls to demonstrate how their services can accelerate time to value through a combination of data science, engineering, strategy, and hands-on training and education.
The document outlines an agenda for a MongoDB event in Frankfurt on November 30th 2017. The agenda includes introductions, implementing a cloud-based data strategy, best practices for migrating from RDBMS to MongoDB, how MongoDB can provide support, and a Q&A session. It also lists the speakers which include representatives from MongoDB and Bosch Software Innovations.
This document discusses different architectures for big data systems, including traditional, streaming, lambda, kappa, and unified architectures. The traditional architecture focuses on batch processing stored data using Hadoop. Streaming architectures enable low-latency analysis of real-time data streams. Lambda architecture combines batch and streaming for flexibility. Kappa architecture avoids duplicating processing logic. Finally, a unified architecture trains models on batch data and applies them to real-time streams. Choosing the right architecture depends on use cases and available components.
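A toy illustration of the lambda idea described above (pure Python, no external services, all numbers invented): the batch view holds counts recomputed periodically from the master dataset, the speed layer holds increments seen since the last batch run, and a query merges the two.

```python
# Lambda-style serving query: combine the batch view with real-time increments.
batch_view = {"page_a": 10_000, "page_b": 4_200}   # produced by the batch layer
speed_view = {"page_a": 37, "page_c": 5}           # increments from the streaming layer

def page_views(page):
    """Serve an up-to-date count by merging batch and speed layers."""
    return batch_view.get(page, 0) + speed_view.get(page, 0)

print(page_views("page_a"))  # 10037
```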
Entity Resolution Service - Bringing Petabytes of Data Online for Instant Access (DataWorks Summit)
2.5B+ IDs, 2 ms latency, 15K+ TPS and petabytes of data. These numbers outline the challenges of eBay’s Entity Resolution Service (ERS). ERS provides a temporal map between anyid-anyid. The ERS technology stack uses Hadoop as the batch layer, Couchbase as the cache layer, Spring Batch to load data into Couchbase, and a REST API at the service layer. In our presentation we will take you through the journey from concept to production release. It’s a great story and we would like to share it with you!
Join CIGNEX Datamatics, Alfresco’s Global Platinum Partner, as they share the case study experience of a leading global online university. Together we’ll take a close look at their document management and web portal solution and their integrations with Alfresco ECM, Liferay Portal and Moodle Learning Management System.
MongoDB and RDBMS: Using Polyglot Persistence at Equifax (MongoDB)
MongoDB and RDBMS: Using Polyglot Persistence at Equifax. Presented by Michael Lawrence, Pariveda Solutions on behalf of Equifax at MongoDB Evenings Atlanta on September 24, 2015.
Real-time analytics is a beautiful thing, especially if you can build it in a quick, scalable and robust way. We built a digital command center for our marketing team that provided real-time analytics on social media, clickstream, and Google search terms, in the span of a couple of months. The solution was built entirely on open source technologies, using a combination of Apache NiFi, Elasticsearch, and Hadoop. Simple but very effective. In this presentation I would like to share the architecture, lessons learned, and business benefits of this solution.
InfoSphere BigInsights - Analytics power for Hadoop - field experience (Wilfried Hoge)
This document provides an overview and summary of InfoSphere BigInsights, an analytics platform for Hadoop. It discusses key features such as real-time analytics, storage integration, search, data exploration, predictive modeling, and application tooling. Case studies are presented on analyzing binary data and developing applications for transformation and analysis. Partnerships and certifications with other vendors are also mentioned. The document aims to demonstrate how BigInsights brings enterprise-grade features to Apache Hadoop and provides analytics capabilities for business users.
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020 (Mariano Gonzalez)
Modernizing analytics data pipelines to get the most out of your data while optimizing costs can be challenging. However, cloud providers today offer a good set of services that can help with this endeavor. In this hands-on session we will tour several GCP services, using Dataflow (Apache Beam) as the backbone of a modern analytics pipeline that wires them all together.
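A small sketch of the kind of Beam pipeline the session describes; the GCS paths are placeholders, and the same code runs locally with the default DirectRunner before being submitted to Dataflow.

```python
# Hypothetical example: count events per type from CSV-style lines.
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | "Read"   >> beam.io.ReadFromText("gs://my-bucket/raw/events-*.csv")
     | "Parse"  >> beam.Map(lambda line: line.split(",")[0])   # keep the event type column
     | "Pair"   >> beam.Map(lambda event_type: (event_type, 1))
     | "Count"  >> beam.CombinePerKey(sum)
     | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
     | "Write"  >> beam.io.WriteToText("gs://my-bucket/out/event_counts"))
```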
Big data expert and Infochimps CEO, Jim Kaskade presents the Infinite Monkey Theorem at CloudCon Expo. He provides an energetic, inspiring, and practical perspective on why Big Data is disrupting. It’s more than historic data analyzed on Hadoop. It’s also more than real-time streaming data stored and queried using NoSQL. Learn more at www.Infochimps.com
Big data, agile development, and cloud computing are driving new requirements for database management systems. These requirements are in turn driving the next phase of growth in the database industry, mirroring the evolution of the OLAP industry. This document describes this evolution, the new application workload, and how MongoDB is uniquely suited to address these challenges.
Learn why 451 Research believes Infochimps is well-positioned with an easy-to-consume managed service for those without Hadoop expertise, as well as a stack of technologically interesting projects for the 'devops' crowd.
Opening with a market positioning statement and ending with a competitive and SWOT analysis, Matt Aslett provides a comprehensive impact report.
This document discusses combining Apache Spark and MongoDB for real-time analytics. It describes how MongoDB provides rich analytics capabilities through queries, aggregations, and indexing. Apache Spark can further extend MongoDB's analytics by offering additional processing capabilities. Together, Spark and MongoDB enable organizations to perform real-time analytics directly on operational data without needing separate analytics infrastructure.
Big Data & Analytics - Use Cases in Mobile, E-commerce, Media and more (Amazon Web Services)
This document discusses how companies can use Amazon Web Services (AWS) big data and analytics services like Amazon Elastic MapReduce (EMR), Amazon Redshift, Amazon DynamoDB, and Amazon Kinesis to gain insights from massive amounts of data. It provides examples of how companies in various industries like mobile, e-commerce, media, and gaming use these AWS services for use cases like recommendations, targeted advertising, fraud detection, and real-time analytics. The document also compares different AWS analytics services and discusses best practices for deploying big data solutions on AWS.
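As a hedged example of the real-time ingestion step mentioned above, a producer can push clickstream events into Amazon Kinesis with boto3; the stream name and event shape are assumptions, and boto3 must be configured with valid AWS credentials.

```python
# Hypothetical sketch: send one clickstream event to a Kinesis data stream.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

event = {"user_id": "u-17", "action": "click", "item": "sku-9981"}
kinesis.put_record(
    StreamName="clickstream-events",            # assumed stream name
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],              # keeps one user's events in order
)
```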
Hybrid Transactional/Analytics Processing: Beyond the Big Database Hype (Ali Hodroj)
This presentation discusses hybrid transactional/analytical processing (HTAP) and the GigaSpaces solution. HTAP aims to support both real-time transactions and complex analytics by combining transaction processing and data warehousing capabilities. However, analytics needs have evolved faster than databases to include real-time streaming and predictive analytics. The GigaSpaces solution advocates a polyglot approach using Spark for analytics combined with an in-memory data grid for transactional storage and processing to better support insight-driven applications. Case studies demonstrate how the architecture provides unified low-latency access to data, distributed analytics, and triggered actions.
Spark DC Interactive Meetup: HTAP with Spark and In-Memory Data Grids (Ali Hodroj)
This document discusses hybrid transactional/analytical processing (HTAP) with Apache Spark and in-memory data grids. It describes how combining an in-memory data grid for low-latency transactions with Spark enables real-time analytics over both historical and streaming data at scale. The approach integrates Spark and the data grid through connectors to provide a unified API, push down predicates from Spark to the grid for efficient processing, and leverage data locality. This hybrid model supports various data types and provides a scale-out, unified data store to meet the needs of Internet of Things and omni-channel applications.
Real-time Microservices and In-Memory Data Grids (Ali Hodroj)
How in-memory data grids enable a real-time microservices architecture while diminishing the accidental complexity of persistence, orchestration, and fragmentation of scale.
2014 10 09 Top reasons to use IBM BigInsights as your Big Data Hadoop system (Toby Woolfe)
The document discusses why manufacturers should use IBM BigInsights as their Hadoop platform. It outlines 10 key reasons, including IBM's experience in the automotive industry, the capabilities BigInsights adds to open source Hadoop like performance and security features, IBM's commitment and track record of large Hadoop deployments, and case studies of manufacturers like General Motors that have successfully used BigInsights.
MphasiS provides various big data offerings including analytics on unstructured data like text, social media, images and logs. It also offers solutions to integrate structured and unstructured data for 360-degree insights. MphasiS has experience applying advanced analytics techniques like data mining and predictive modeling to solve problems in optimization, employee retention, and fraud prevention. It can help clients migrate to big data platforms like Hadoop, Hive, HBase, Vertica, and SAP HANA.
My other computer is a datacentre - 2012 edition (Steve Loughran)
An updated version of the "my other computer is a datacentre" talk, presented at the Bristol University HPC talk.
Because it is targeted at universities, it emphasises some of the interesting problems - the classic CS ones of scheduling, new ones of availability and failure handling within what is now a single computer, and emergent problems of power and heterogeneity. It also includes references, all of which are worth reading, and, being mostly Google and Microsoft papers, are free to download without needing ACM or IEEE library access.
Comments welcome.
"Adoption Tactics; Why Your End Users and Project Managers Will Rave Over Sha...Gina Montgomery, V-TSP
This document provides an agenda for a presentation on adopting SharePoint 2013. The presentation will include a survey, discussing why SharePoint intranets fail without proper adoption plans, components of a good end user adoption plan, what governance and information architecture are, what gamification is, a demo of SharePoint 2013, and a review with questions. The presentation aims to help organizations maximize adoption of SharePoint 2013 by their end users and project managers.
Amazon QuickSight is a fast, cloud-powered business intelligence service that reduces the time and cost of traditional BI software. It requires no IT effort to set up, auto-discovers AWS data sources, and reduces time to first visualization to just one minute. QuickSight uses a parallel, in-memory calculation engine called SPICE to provide fast query response times in milliseconds. It connects to various AWS and third-party data sources and applications and allows easy data visualization, dashboard creation, and report sharing.
Groovy Domain Specific Languages - SpringOne2GX 2012 (Guillaume Laforge)
Paul King, Andrew Eisenberg, and Guillaume Laforge present on the implementation of Domain-Specific Languages in Groovy at the SpringOne2GX 2012 conference in Washington DC.
The Rise of Microservices - Containers and Orchestration (MongoDB)
The document discusses microservices and containers. It defines microservices as small, independent services with well-defined interfaces that allow for decentralized control and independent deployments. Containers are presented as a way to package and run microservices using technologies like Docker. Orchestration with systems like Kubernetes and Mesos is described as a way to automate deployment, linking, and maintenance of multiple containers across infrastructure. MongoDB is discussed as a good fit for microservices due to its flexibility, redundancy, scalability, and simplicity.
Past, Present and Future of Data Processing in Apache Hadoop (Codemotion)
MongoDB scales easily to store mass volumes of data. However, when it comes to making sense of it all what options do you have? In this talk, we’ll take a look at 3 different ways of aggregating your data with MongoDB, and determine the reasons why you might choose one way over another. No matter what your big data needs are, you will find out how MongoDB the big data store is evolving to help make sense of your data.
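One of the aggregation options the abstract alludes to is the aggregation pipeline; the sketch below runs a filter-group-sort pipeline over an invented `orders` collection (collection and field names are assumptions, not from the talk).

```python
# Illustrative aggregation pipeline: top customers by shipped-order revenue.
from pymongo import MongoClient

orders = MongoClient().shop.orders

pipeline = [
    {"$match": {"status": "shipped"}},                 # filter first so indexes can help
    {"$group": {"_id": "$customer_id",
                "total_spent": {"$sum": "$amount"},
                "orders": {"$sum": 1}}},
    {"$sort": {"total_spent": -1}},
    {"$limit": 10},
]
for doc in orders.aggregate(pipeline):
    print(doc)
```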
This presentation covers practical implementation of Lambda with different patterns. It also explains how to achieve continuous deployment using lambda.
Airbus and Boeing have been locked in a fierce duopoly in the large jet airliner market since the 1990s. Airbus began as a European consortium, while the American Boeing absorbed its former arch-rival, McDonnell Douglas, in a 1997 merger.
Manufacturers like Lockheed Martin, Convair and Fairchild Aircraft in the United States and British Aerospace and Fokker in Europe withdrew from the market as they were no longer in a position to compete effectively.
Over the years, competition has been intense; each company regularly accuses the other of receiving unfair state aid from their respective governments.
Based on http://www.slideshare.net/arjunparekh/duopoly-boeing-versus-airbus?qid=90919b4f-b341-4d82-8f75-3474f9f15e57&v=&b=&from_search=16
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications! (Tugdual Grall)
Lambda Architecture is a useful framework for thinking about the design of big data applications. The framework was initially built at Twitter. In this presentation you will learn, based on concrete examples, how to build and deploy scalable and fault-tolerant applications, with a focus on Big Data and Hadoop.
This presentation was delivered at the OOP conference, Munich, Feb 2016
Learn how to build new classes of sophisticated, real-time analytics by combining Apache Spark, the industry's leading data processing engine, with MongoDB, the industry’s fastest growing database.
We live in a world of “big data.” But it isn’t just the data itself that is valuable – it’s the insight it can generate. How quickly an organization can unlock and act on that insight has become a major source of competitive advantage. Collecting data in operational systems and then relying on nightly batch extract, transform, load (ETL) processes to update the enterprise data warehouse (EDW) is no longer sufficient.
In this live session, we show you how MongoDB and Spark work together and provide examples using the new Spark Connector for MongoDB.
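A rough sketch of reading MongoDB data into a Spark DataFrame with the connector; note that the format string and configuration keys shown here follow the newer connector conventions and differ between connector releases, so treat them as assumptions to verify against the documentation for your versions.

```python
# Hypothetical example: load a MongoDB collection into Spark and aggregate it.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("mongo-spark-demo")
         .config("spark.mongodb.read.connection.uri",
                 "mongodb://localhost:27017/shop.orders")
         .getOrCreate())

orders = spark.read.format("mongodb").load()     # schema inferred by sampling documents
(orders.filter(orders.status == "shipped")
       .groupBy("customer_id")
       .count()
       .show())
```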
This session was sponsored by Stratio & Paradigma.
L’architettura di Classe Enterprise di Nuova Generazione (The Next-Generation Enterprise-Class Architecture) (MongoDB)
This document discusses using MongoDB as part of an enterprise data management architecture. It begins by describing the rise of data lakes to manage growing and diverse data volumes. Traditional EDWs struggle with this new data variety and volume. The document then provides an overview of MongoDB's features like flexible schemas, secondary indexes, and aggregation capabilities that make it suitable for building different layers of an EDM pipeline for tasks like raw data storage, transformation, analysis, and serving data to downstream systems. Example use cases are presented for building a single customer view and for replacing Oracle with MongoDB.
MongoDB Europe 2016 - Who’s Helping Themselves To Your Data? Demystifying Mon... (MongoDB)
The document discusses MongoDB's security features including authentication, authorization, encryption, and auditing. It emphasizes that MongoDB's security features have minimal dependencies and keep the path to secure success clear. The key features are authentication using passwords, LDAP, certificates or Kerberos; role-based authorization; encryption of data in transit using TLS and at rest using the encrypted storage engine; and auditing of operations to a configurable destination.
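A small sketch of the transport-security and authentication pieces mentioned above: an authenticated PyMongo connection over TLS with a CA file. The host, user, password, and file path are placeholders.

```python
# Hypothetical example: authenticated connection over TLS.
from pymongo import MongoClient

client = MongoClient(
    "mongodb://app_user:s3cret@db.example.com:27017/?authSource=admin",
    tls=True,
    tlsCAFile="/etc/ssl/mongodb-ca.pem",
)

# Authorization is enforced server side: this role-restricted user can read the
# reporting data it was granted, and writes outside its roles are rejected.
print(client.reporting.daily_kpis.count_documents({}))
```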
MongoDB Europe 2016 - Big Data meets Big Compute (MongoDB)
- The document discusses how Spark can be used to connect MongoDB for analytics and processing large datasets. Spark is a fast, general engine for large-scale data processing.
- It provides an overview of Spark including its programming model using resilient distributed datasets (RDDs), built-in fault tolerance, and libraries for SQL, streaming, machine learning and graphs.
- The new MongoDB connector for Spark allows seamless integration between MongoDB and Spark. It supports DataFrames and Datasets with automatic schema inference and conversion. Proper configuration and partitioning strategies are important to optimize performance and data locality.
- A demo is presented using Spark and the connector to solve a "traveling salesman problem" of planning efficient travel routes between Europe
MongoDB Europe 2016 - ETL for Pros – Getting Data Into MongoDB The Right Way (MongoDB)
The document discusses best practices for extracting, transforming, and loading (ETL) large amounts of data into MongoDB. It describes common mistakes made in ETL processes, such as performing nested queries to retrieve and assemble documents, and building documents within the database itself using update operations. The presentation provides a case study comparing these inefficient approaches to loading order, item, and tracking data from relational tables into MongoDB documents.
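A minimal sketch of the more efficient approach the summary contrasts with the anti-patterns: join the relational rows in the application, build each complete order document once, then insert in batches, instead of issuing per-item update operations against the database. Table and field names are invented.

```python
# Illustrative ETL pattern: assemble documents in the application, then bulk insert.
from collections import defaultdict
from pymongo import MongoClient

orders_coll = MongoClient().etl_target.orders

# Pretend these rows came from the relational extract.
order_rows = [{"order_id": 1, "customer": "ACME", "total": 120.0}]
item_rows  = [{"order_id": 1, "sku": "A-1", "qty": 2},
              {"order_id": 1, "sku": "B-7", "qty": 1}]

items_by_order = defaultdict(list)
for item in item_rows:
    items_by_order[item["order_id"]].append({"sku": item["sku"], "qty": item["qty"]})

docs = [{**order, "items": items_by_order[order["order_id"]]} for order in order_rows]
orders_coll.insert_many(docs)      # one round trip per batch, not one update per item
```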
Unlocking Operational Intelligence from the Data Lake (MongoDB)
The document discusses unlocking operational intelligence from data lakes using MongoDB. It begins by describing how digital transformation is driving changes in data volume, velocity, and variety. It then discusses how MongoDB can help operationalize data lakes by providing real-time access and analytics on data stored in data lakes, while also integrating batch processing capabilities. The document provides an example reference architecture of how MongoDB can be used with a data lake (Hadoop) and stream processing framework (Kafka) to power operational applications and machine learning models with both real-time and batch data and analytics.
The document provides a case study on the lessons learned from Boeing's 787 Dreamliner project. It summarizes that Boeing aimed to cut costs and development time through an unconventional supply chain model where 70% of the work was outsourced. However, this resulted in the project being over budget by $11 billion and 4 years delayed. Key lessons identified include: assembling a management team with supply chain expertise, improving supply chain visibility, fully understanding all underlying project costs before estimating, improving supplier training and selection processes, proactively managing labor unions, and implementing risk-sharing contracts with incentives and penalties for partners.
This document provides an overview of big data concepts and technologies for managers. It discusses problems with relational databases for large, unstructured data and introduces NoSQL databases and Hadoop as solutions. It also summarizes common big data applications, frameworks like MapReduce, Spark, and Flink, and different NoSQL database categories including key-value, column-family, document, and graph stores.
Webinar: Enterprise Data Management in the Era of MongoDB and Data Lakes (MongoDB)
1. The document discusses using MongoDB and data lakes for enterprise data management. It outlines the current issues with relational databases and how MongoDB addresses challenges like flexibility, scalability and performance.
2. Various architectures for enterprise data management with MongoDB are presented, including using it for raw, transformed and aggregated data stores.
3. The benefits of combining MongoDB and Hadoop in a data lake are greater agility, insight from handling different data structures, scalability and low latency for real-time decisions.
Lambda architecture for real time big data (Trieu Nguyen)
- The document discusses the Lambda Architecture, a system designed by Nathan Marz for building real-time big data applications. It is based on three principles: human fault-tolerance, data immutability, and recomputation.
- The document provides two case studies of applying Lambda Architecture - at Greengar Studios for API monitoring and statistics, and at eClick for real-time data analytics on streaming user event data.
- Key lessons discussed are keeping solutions simple, asking the right questions to enable deep analytics and profit, using reactive and functional approaches, and turning data into useful insights.
Using real time big data analytics for competitive advantage (Amazon Web Services)
Many organisations find it challenging to successfully perform real-time data analytics using their own on premise IT infrastructure. Building a system that can adapt and scale rapidly to handle dramatic increases in transaction loads can potentially be quite a costly and time consuming exercise.
Most of the time, infrastructure is under-utilised and it’s near impossible for organisations to forecast the amount of computing power they will need in the future to serve their customers and suppliers.
To overcome these challenges, organisations can instead utilise the cloud to support their real-time data analytics activities. Scalable, agile and secure, cloud-based infrastructure enables organisations to quickly spin up infrastructure to support their data analytics projects exactly when it is needed. Importantly, they can ‘switch off’ infrastructure when it is not.
BluePi Consulting and Amazon Web Services (AWS) are giving you the opportunity to discover how organisations are using real time data analytics to gain new insights from their information to improve the customer experience and drive competitive advantage.
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa... (MSAdvAnalytics)
Lance Olson. Cortana Analytics is a fully managed big data and advanced analytics suite that helps you transform your data into intelligent action. Come to this two-part session to learn how you can do "big data" processing and storage in Cortana Analytics. In the first part, we will provide an overview of the processing and storage services. We will then talk about the patterns and use cases which make up most big data solutions. In the second part, we will go hands-on, showing you how to get started today with writing batch/interactive queries, real-time stream processing, or NoSQL transactions all over the same repository of data. Crunch petabytes of data by scaling out your computation power to any sized cluster. Store any amount of unstructured data in its native format with no limits to file or account size. All of this can be done with no hardware to acquire or maintain and minimal time to setup giving you the value of "big data" within minutes. Go to https://channel9.msdn.com/ to find the recording of this session.
Key Data Management Requirements for the IoT (MongoDB)
The document discusses key data management requirements for Internet of Things (IoT) applications. It notes that IoT will generate massive amounts of structured and unstructured data from a large number of connected devices and sensors. This data must be managed in a way that allows for rich applications, a unified view of data, real-time operational insights, business agility, and continuous innovation. It argues that traditional relational databases may not be well-suited for IoT data management and that NoSQL databases can provide scalability, flexibility, analytics and a unified view of data from multiple sources.
MongoDB Breakfast Milan - Mainframe Offloading Strategies (MongoDB)
The document summarizes a MongoDB event focused on modernizing mainframe applications. The event agenda includes presentations on moving from mainframes to operational data stores, demo of a mainframe offloading solution from Quantyca, and stories of mainframe modernization. Benefits of using MongoDB for mainframe modernization include 5-10x developer productivity and 80% reduction in mainframe costs.
Webinar: Data Streaming with Apache Kafka & MongoDB (MongoDB)
This document summarizes a webinar about integrating Apache Kafka and MongoDB for data streaming. The webinar covered:
- An overview of Apache Kafka and how it can be used for data transport and integration as well as real-time stream processing.
- How MongoDB can be used as both a Kafka producer, to stream data into Kafka topics, and as a Kafka consumer, to retrieve streamed data from Kafka for storage, querying, and analytics in MongoDB (a minimal consumer sketch follows this list).
- Various use cases for integrating Kafka and MongoDB, including handling real-time updates, storing raw and processed event data, and powering real-time applications with analytics models built from streamed data.
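The "MongoDB as a Kafka consumer" flow referenced above can be sketched with kafka-python and PyMongo; the topic name, broker address, and collection are assumptions for illustration.

```python
# Hypothetical example: consume JSON events from Kafka and persist them in MongoDB.
import json
from kafka import KafkaConsumer
from pymongo import MongoClient

events = MongoClient().streaming.events
consumer = KafkaConsumer(
    "clickstream",                                   # assumed topic name
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    # Store each streamed event so it can be queried and analysed in MongoDB.
    events.insert_one(message.value)
```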
Real life use cases from across Europe (Walid Aoudi - Cognizant)
This presentation shares the experiences of several Cognizant Big Data clients in continental Europe and the UK. The main focus is on use cases, presented through the business drivers behind these projects. Key highlights of the big data architectures and solution approaches will be presented. Finally, the business outcomes, in terms of the ROI delivered by the implemented solutions, will be discussed.
Accelerating a Path to Digital With a Cloud Data Strategy (MongoDB)
The document describes a conference on accelerating a path to digital transformation with a cloud data strategy. It provides an agenda for the conference including speakers on executing a cloud data strategy, customer stories from De Persgroep and Toyota Motor Europe, and a session on landing in the cloud with MongoDB Atlas. The document also provides background on the speakers and their companies.
IBM's Big Data platform provides tools for managing and analyzing large volumes of structured, unstructured, and streaming data. It includes Hadoop for storage and processing, InfoSphere Streams for real-time streaming analytics, InfoSphere BigInsights for analytics on data at rest, and PureData System for Analytics (formerly Netezza) for high performance data warehousing. The platform enables businesses to gain insights from all available data to capitalize on information resources and make data-driven decisions.
IBM's Big Data platform provides tools for managing and analyzing large volumes of data from various sources. It allows users to cost effectively store and process structured, unstructured, and streaming data. The platform includes products like Hadoop for storage, MapReduce for processing large datasets, and InfoSphere Streams for analyzing real-time streaming data. Business users can start with critical needs and expand their use of big data over time by leveraging different products within the IBM Big Data platform.
1. The document discusses how organizations can leverage data, analytics, and insights to fundamentally change and pioneer new business models.
2. It emphasizes that data analytics cannot be accomplished in a silo and must involve the entire organization. Modern cloud platforms, software methodologies, and data tools are needed.
3. Examples are provided of how various organizations have used tools like Pivotal Greenplum to gain insights from data to solve problems in areas like predictive maintenance, risk management, and national security.
Businesses are generating more data than ever before.
Doing real time data analytics requires IT infrastructure that often needs to be scaled up quickly and running an on-premise environment in this setting has its limitations.
Organisations often require a massive amount of IT resources to analyse their data and the upfront capital cost can deter them from embarking on these projects.
What’s needed is scalable, agile and secure cloud-based infrastructure at the lowest possible cost so they can spin up servers that support their data analysis projects exactly when they are required. This infrastructure must enable them to create proof-of-concepts quickly and cheaply – to fail fast and move on.
Data Streaming with Apache Kafka & MongoDB (confluent)
Explore the use-cases and architecture for Apache Kafka, and how it integrates with MongoDB to build sophisticated data-driven applications that exploit new sources of data.
Tapdata provides a smart data as a service platform that offers:
1) Real-time data collection and synchronization from various sources like databases, files, and streaming data.
2) Data modeling and governance capabilities like data validation, quality checks, and AI-assisted cataloging.
3) Scalable data storage across TBs to PBs of data using a distributed database.
4) A code-less API publishing module to quickly build and deploy RESTful APIs for internal and external users.
Big Data: Its Characteristics And Architecture Capabilities (Ashraf Uddin)
This document discusses big data, including its definition, characteristics, and architecture capabilities. It defines big data as large datasets that are challenging to store, search, share, visualize, and analyze due to their scale, diversity and complexity. The key characteristics of big data are described as volume, velocity and variety. The document then outlines the architecture capabilities needed for big data, including storage and management, database, processing, data integration and statistical analysis capabilities. Hadoop and MapReduce are presented as core technologies for storage, processing and analyzing large datasets in parallel across clusters of computers.
Big Data, IoT, data lake, unstructured data, Hadoop, cloud, and massively parallel processing (MPP) are all just fancy words unless you can find uses cases for all this technology. Join me as I talk about the many use cases I have seen, from streaming data to advanced analytics, broken down by industry. I’ll show you how all this technology fits together by discussing various architectures and the most common approaches to solving data problems and hopefully set off light bulbs in your head on how big data can help your organization make better business decisions.
Girish Juneja - Intel Big Data & Cloud Summit 2013 (IntelAPAC)
This document discusses big data trends such as the growth of networked sensors, connected devices, and smartphone users. It then summarizes Intel's investments in big data technologies, including their software, processors, networking, storage and memory products. The document promotes Intel's Distribution for Apache Hadoop software and how it provides security, performance optimizations and support for workloads like data mining, graph analytics and full text search. Real-world customer examples are provided that demonstrate gains in performance, cost savings and new analytics capabilities.
Hadoop 2.0: YARN to Further Optimize Data Processing (Hortonworks)
Data is exponentially increasing in both types and volumes, creating opportunities for businesses. Watch this video and learn from three Big Data experts: John Kreisa, VP Strategic Marketing at Hortonworks, Imad Birouty, Director of Technical Product Marketing at Teradata and John Haddad, Senior Director of Product Marketing at Informatica.
Multiple systems are needed to exploit the variety and volume of data sources, including a flexible data repository. Learn more about:
- Apache Hadoop 2 and YARN
- Data Lakes
- Intelligent data management layers needed to manage metadata and usage patterns as well as track consumption across these data platforms.
Overcoming Today's Data Challenges with MongoDB (MongoDB)
The document outlines an agenda for an event on overcoming data challenges with MongoDB. The event will feature speakers from MongoDB and Bosch discussing how the world has changed since relational databases were invented, how to radically transform IT environments with MongoDB, MongoDB and blockchain, and MongoDB for multiple use cases. The agenda includes presentations on these topics as well as a Q&A session and conclusion.
Data Streaming with Apache Kafka & MongoDB - EMEA (Andrew Morgan)
A new generation of technologies is needed to consume and exploit today's real time, fast moving data sources. Apache Kafka, originally developed at LinkedIn, has emerged as one of these key new technologies.
This webinar explores the use-cases and architecture for Kafka, and how it integrates with MongoDB to build sophisticated data-driven applications that exploit new sources of data.
Similar to Unlocking Operational Intelligence from the Data Lake
MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas (MongoDB)
This presentation discusses migrating data from other data stores to MongoDB Atlas. It begins by explaining why MongoDB and Atlas are good choices for data management. Several preparation steps are covered, including sizing the target Atlas cluster, increasing the source oplog, and testing connectivity. Live migration, mongomirror, and dump/restore options are presented for migrating between replicasets or sharded clusters. Post-migration steps like monitoring and backups are also discussed. Finally, migrating from other data stores like AWS DocumentDB, Azure CosmosDB, DynamoDB, and relational databases are briefly covered.
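As a very small dump-and-restore style sketch (not a substitute for mongomirror or the live migration tooling the talk describes), one collection can be copied from a source deployment into an Atlas cluster with PyMongo; both connection strings are placeholders.

```python
# Hypothetical example: copy a collection to Atlas in batches.
from pymongo import MongoClient

source = MongoClient("mongodb://source-host:27017")
target = MongoClient("mongodb+srv://user:pass@cluster0.example.mongodb.net")

batch = []
for doc in source.appdb.customers.find():
    batch.append(doc)
    if len(batch) == 1000:                 # insert in batches to limit round trips
        target.appdb.customers.insert_many(batch)
        batch = []
if batch:
    target.appdb.customers.insert_many(batch)
```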
MongoDB SoCal 2020: Go on a Data Safari with MongoDB Charts! (MongoDB)
These days, everyone is expected to be a data analyst. But with so much data available, how can you make sense of it and be sure you're making the best decisions? One great approach is to use data visualizations. In this session, we take a complex dataset and show how the breadth of capabilities in MongoDB Charts can help you turn bits and bytes into insights.
MongoDB SoCal 2020: Using MongoDB Services in Kubernetes: Any Platform, Devel... (MongoDB)
MongoDB Kubernetes operator and MongoDB Open Service Broker are ready for production operations. Learn about how MongoDB can be used with the most popular container orchestration platform, Kubernetes, and bring self-service, persistent storage to your containerized applications. A demo will show you how easy it is to enable MongoDB clusters as an External Service using the Open Service Broker API for MongoDB
MongoDB SoCal 2020: A Complete Methodology of Data Modeling for MongoDB (MongoDB)
Are you new to schema design for MongoDB, or are you looking for a more complete or agile process than what you are following currently? In this talk, we will guide you through the phases of a flexible methodology that you can apply to projects ranging from small to large with very demanding requirements.
MongoDB SoCal 2020: From Pharmacist to Analyst: Leveraging MongoDB for Real-T... (MongoDB)
Humana, like many companies, is tackling the challenge of creating real-time insights from data that is diverse and rapidly changing. This is our journey of how we used MongoDB to combine traditional batch approaches with streaming technologies to provide continuous alerting capabilities from real-time data streams.
MongoDB SoCal 2020: Best Practices for Working with IoT and Time-series Data (MongoDB)
Time series data is increasingly at the heart of modern applications - think IoT, stock trading, clickstreams, social media, and more. With the move from batch to real time systems, the efficient capture and analysis of time series data can enable organizations to better detect and respond to events ahead of their competitors or to improve operational efficiency to reduce cost and risk. Working with time series data is often different from regular application data, and there are best practices you should observe.
This talk covers:
Common components of an IoT solution
The challenges involved with managing time-series data in IoT applications
Different schema designs, and how these affect memory and disk utilization – two critical factors in application performance.
How to query, analyze and present IoT time-series data using MongoDB Compass and MongoDB Charts
At the end of the session, you will have a better understanding of key best practices in managing IoT time-series data with MongoDB.
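One of the schema-design families the talk compares can be sketched as a bucketing pattern: one document per device per hour, with readings appended to an array. The names and the one-hour bucket size are assumptions made for illustration, not the talk's specific recommendation.

```python
# Hypothetical example: hourly bucketing of sensor readings.
from datetime import datetime, timezone
from pymongo import MongoClient

readings = MongoClient().iot.sensor_readings

def record(device_id, value, ts=None):
    ts = ts or datetime.now(timezone.utc)
    bucket_start = ts.replace(minute=0, second=0, microsecond=0)
    readings.update_one(
        {"device_id": device_id, "bucket_start": bucket_start},
        {"$push": {"samples": {"t": ts, "v": value}},
         "$inc": {"count": 1}},
        upsert=True,                       # create the hourly bucket on the first sample
    )

record("thermostat-7", 21.4)
```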
Join this talk and test session with a MongoDB Developer Advocate where you'll go over the setup, configuration, and deployment of an Atlas environment. Create a service that you can take back in a production-ready state and prepare to unleash your inner genius.
MongoDB .local San Francisco 2020: Powering the new age data demands [Infosys] (MongoDB)
Our clients have unique use cases and data patterns that mandate the choice of a particular strategy. To implement these strategies, it is mandatory that we unlearn a lot of relational concepts while designing and rapidly developing efficient applications on NoSQL. In this session, we will talk about some of our client use cases, the strategies we have adopted, and the features of MongoDB that assisted in implementing these strategies.
MongoDB .local San Francisco 2020: Using Client Side Encryption in MongoDB 4.2MongoDB
Encryption is not a new concept to MongoDB. Encryption may occur in-transit (with TLS) and at-rest (with the encrypted storage engine). But MongoDB 4.2 introduces support for Client Side Encryption, ensuring the most sensitive data is encrypted before ever leaving the client application. Even full access to your MongoDB servers is not enough to decrypt this data. And better yet, Client Side Encryption can be enabled at the "flick of a switch".
This session covers using Client Side Encryption in your applications. This includes the necessary setup, how to encrypt data without sacrificing queryability, and what trade-offs to expect.
MongoDB .local San Francisco 2020: Using MongoDB Services in Kubernetes: any ...MongoDB
MongoDB Kubernetes operator is ready for prime time. Learn about how MongoDB can be used with the most popular orchestration platform, Kubernetes, and bring self-service, persistent storage to your containerized applications.
MongoDB .local San Francisco 2020: Go on a Data Safari with MongoDB Charts!MongoDB
These days, everyone is expected to be a data analyst. But with so much data available, how can you make sense of it and be sure you're making the best decisions? One great approach is to use data visualizations. In this session, we take a complex dataset and show how the breadth of capabilities in MongoDB Charts can help you turn bits and bytes into insights.
MongoDB .local San Francisco 2020: From SQL to NoSQL -- Changing Your MindsetMongoDB
When you need to model data, is your first instinct to start breaking it down into rows and columns? Mine used to be too. When you want to develop apps in a modern, agile way, NoSQL databases can be the best option. Come to this talk to learn how to take advantage of all that NoSQL databases have to offer and discover the benefits of changing your mindset from the legacy, tabular way of modeling data. We’ll compare and contrast the terms and concepts in SQL databases and MongoDB, explain the benefits of using MongoDB compared to SQL databases, and walk through data modeling basics so you feel confident as you begin using MongoDB.
MongoDB .local San Francisco 2020: MongoDB Atlas JumpstartMongoDB
Join this talk and test session with a MongoDB Developer Advocate where you'll go over the setup, configuration, and deployment of an Atlas environment. Create a service that you can take back in a production-ready state and prepare to unleash your inner genius.
MongoDB .local San Francisco 2020: Tips and Tricks++ for Querying and Indexin...MongoDB
The document discusses guidelines for ordering fields in compound indexes to optimize query performance. It recommends the E-S-R approach: placing equality fields first, followed by sort fields, and range fields last. This allows indexes to leverage equality matches, provide non-blocking sorts, and minimize scanning. Examples show how indexes ordered by these guidelines can support queries more efficiently by narrowing the search bounds.
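As a quick, hedged illustration of the E-S-R ordering described above, here is a minimal pymongo sketch; the orders collection and its status, order_date and amount fields are hypothetical stand-ins.

```python
from pymongo import ASCENDING, DESCENDING, MongoClient

db = MongoClient()["shop"]  # hypothetical connection and database

# E-S-R: equality field first, then the sort field, then the range field.
db.orders.create_index([
    ("status", ASCENDING),       # Equality: status == "shipped"
    ("order_date", DESCENDING),  # Sort: newest first, served by the index (non-blocking)
    ("amount", ASCENDING),       # Range: amount > 100 narrows the index bounds last
])

# A query shaped to use the index above.
recent_big_orders = (
    db.orders.find({"status": "shipped", "amount": {"$gt": 100}})
    .sort("order_date", DESCENDING)
)
```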
MongoDB .local San Francisco 2020: Aggregation Pipeline Power++MongoDB
Aggregation pipeline has been able to power your analysis of data since version 2.2. In 4.2 we added more power and now you can use it for more powerful queries, updates, and outputting your data to existing collections. Come hear how you can do everything with the pipeline, including single-view, ETL, data roll-ups and materialized views.
MongoDB .local San Francisco 2020: A Complete Methodology of Data Modeling fo...MongoDB
The document describes a methodology for data modeling with MongoDB. It begins by recognizing the differences between document and tabular databases, then outlines a three-step methodology: 1) describe the workload by listing queries, 2) identify and model relationships between entities, and 3) apply relevant patterns when modeling for MongoDB. The document uses examples around modeling a coffee shop franchise to illustrate modeling approaches and techniques.
MongoDB .local San Francisco 2020: MongoDB Atlas Data Lake Technical Deep DiveMongoDB
MongoDB Atlas Data Lake is a new service offered by MongoDB Atlas. Many organizations store long term, archival data in cost-effective storage like S3, GCP, and Azure Blobs. However, many of them do not have robust systems or tools to effectively utilize large amounts of data to inform decision making. MongoDB Atlas Data Lake is a service allowing organizations to analyze their long-term data to discover a wealth of information about their business.
This session will take a deep dive into the features that are currently available in MongoDB Atlas Data Lake and how they are implemented. In addition, we'll discuss future plans and opportunities and offer ample Q&A time with the engineers on the project.
MongoDB .local San Francisco 2020: Developing Alexa Skills with MongoDB & GolangMongoDB
Virtual assistants are becoming the new norm when it comes to daily life, with Amazon’s Alexa being the leader in the space. As a developer, not only do you need to make web and mobile compliant applications, but you need to be able to support virtual assistants like Alexa. However, the process isn’t quite the same between the platforms.
How do you handle requests? Where do you store your data and work with it to create meaningful responses with little delay? How much of your code needs to change between platforms?
In this session we’ll see how to design and develop applications known as Skills for Amazon Alexa powered devices using the Go programming language and MongoDB.
MongoDB .local Paris 2020: Realm : l'ingrédient secret pour de meilleures app...MongoDB
… to Core Data, appreciated by hundreds of thousands of developers. Learn what makes Realm special and how it can be used to build better applications, faster.
MongoDB .local Paris 2020: Upply @MongoDB : Upply : Quand le Machine Learning...MongoDB
It has never been easier to order online and be delivered in under 48 hours, very often for free. This ease of use hides a complex market worth more than $8 trillion.
Data is well known in the supply chain world (routes, information on goods, customs, ...), but the value of this operational data remains largely untapped. By combining business expertise with data science, Upply is redefining the fundamentals of the supply chain, enabling every market player to overcome the volatility and inefficiency of the market.
2. The World is Changing
Digital Natives & Digital Transformation
• Volume, Velocity, Variety
• Iterative, Agile, Short Cycles
• Always On, Secure, Global
• Open-Source, Cloud, Commodity
(Chart labels: Data, Time, Risk, Cost)
6. "Big Data" is More than Just Hadoop
• 24% CAGR: Hadoop, Spark & Streaming
• 18% CAGR: Databases
• Databases are key components within the big data landscape
9. How to Avoid Being in the 70%?
1. Unify data lake analytics with the operational applications
2. Create smart, contextually aware, data-driven apps & insights
3. Integrate a database layer with the data lake
10. Why a Database + Hadoop?
Distributed Processing & Analytics (HDFS):
• Data stored as large files (64MB-128MB blocks). No indexes
• Write-once-read-many, append-only
• Designed for high-throughput scans across TB/PB of data
• Multi-minute latency
Common Attributes:
• Schema-on-read
• Multiple replicas
• Horizontal scale
• High throughput
• Low TCO
11. Why a Database + Hadoop?
Operational database (MongoDB):
• Random access to subsets of data
• Millisecond latency
• Expressive querying, rich aggregations & flexible indexing
• Update fast-changing data; avoid re-writing / re-computing the entire data set
Distributed Processing & Analytics (HDFS):
• Data stored as large files (64MB-128MB blocks). No indexes
• Write-once-read-many, append-only
• Designed for high-throughput scans across TB/PB of data
• Multi-minute latency
Common Attributes:
• Schema-on-read
• Multiple replicas
• Horizontal scale
• High throughput
• Low TCO
12. MongoDB & Hadoop: What's Common
Distributed Processing & Analytics
Common Attributes:
• Schema-on-read
• Multiple replicas
• Horizontal scale
• High throughput
• Low TCO
13. Bringing it Together
Online services powered by MongoDB:
• User account & personalization
• Product catalog
• Session management & shopping cart
• Recommendations
Back-end machine learning powered by Hadoop:
• Customer classification & clustering
• Basket analysis
• Brand sentiment
• Price optimization
Connected by the MongoDB Connector for Hadoop
14. Design Pattern: Operationalized Data Lake
• Data sources (Sensors, User Data, Clickstreams, Logs) flow into a Message Queue, which routes Raw Data and Processed Events.
• Distributed Processing Frameworks generate analytics models: Churn Analysis, Enriched Customer Profiles, Risk Modeling, Predictive Analytics.
• Real-Time Access (database): millisecond latency; expressive querying & flexible indexing against subsets of data; updates-in-place; in-database aggregations & transformations.
• Batch Processing, Batch Views (data lake): multi-minute latency with scans across TB/PB of data; no indexes; data stored in 128MB blocks; write-once-read-many & append-only storage model.
• Consuming applications: Customer Data Mgmt, Mobile App, IoT App, Live Dashboards.
15. Design Pattern: Operationalized Data Lake (repeats the slide 14 architecture). Callout: configure where to land incoming data.
16. Design Pattern: Operationalized Data Lake (repeats the slide 14 architecture). Callout: raw data is processed to generate analytics models.
17. Design Pattern: Operationalized Data Lake (repeats the slide 14 architecture). Callout: MongoDB exposes the analytics models to operational apps and handles real-time updates.
18. Design Pattern: Operationalized Data Lake (repeats the slide 14 architecture). Callout: new models are computed against MongoDB & HDFS.
19. Operational Database Requirements
1. "Smart" integration with the data lake
2. Powerful real-time analytics
3. Flexible, governed data model
4. Scale with the data lake
5. Sophisticated management & security
21. UK's Leading Price Comparison Site
Out-pacing internet search giants with a continuous delivery pipeline powered by microservices & Docker running MongoDB, Kafka and Hadoop in the cloud
Problem:
• Existing EDW with nightly batch loads
• No real-time analytics to personalize the user experience
• Application changes broke the ETL pipeline
• Unable to scale as services expanded
Solution:
• Microservices architecture running on AWS
• All application events written to a Kafka queue, routed to MongoDB and Hadoop
• Events that personalize the real-time experience (e.g. triggering an email send, additional questions, offers) written to MongoDB
• All event data aggregated with other data sources and analyzed in Hadoop; updated customer profiles written back to MongoDB
Results:
• 2x faster delivery of new services after migrating to the new architecture
• Enabled continuous delivery: pushing new features every day
• Personalized user experience, plus higher uptime and scalability
22. Customer Data Management: Leading Global Airline
Single view and real-time analytics with MongoDB, Spark & Hadoop
Problem:
• Customer data scattered across 100+ different systems
• Poor customer experience: no personalization, no consistent experience across brands or devices
• No way to analyze customer behavior to deliver targeted offers
Solution:
• Selected MongoDB over HBase for schema flexibility and rich query support
• MongoDB stores all customer profiles, served to web, mobile & call-center apps
• Distributed across multiple regions for DR and data locality
• All customer interactions stored in MongoDB, loaded into Hadoop for customer segmentation
• Unified processing pipeline with Spark running across MongoDB and Hadoop
Results:
• Single profile created for each customer, personalizing the experience in real time
• Revenue optimization by calculating the best ticket prices
• Reduced competitive pressure by identifying gaps in product offerings
23. World's Most Sophisticated Traveler Safety Platform
Analyzing PBs of data with MongoDB, Hadoop, Apache NiFi & SAP HANA
Problem:
• Commercialize a national security platform
• Massive volumes of multi-structured data: news, RSS & social feeds, geospatial, geological, health & crime stats
• Requires complex analysis, delivered in real time, always on
Solution:
• Apache NiFi for data ingestion, routing & metadata management
• Hadoop for text analytics
• HANA for geospatial analytics
• MongoDB correlates analytics with user profiles & location data to deliver real-time alerts to corporate security teams & individual travelers
Results:
• Enables Prescient to uniquely blend big data technology with its security IP developed in government
• Dynamic data model supports indexing 38k data sources, growing at 200 per day
• 24x7 continuous availability
• Scalability to PBs of data
24. Powering Global Threat Intelligence
Cloud-based real-time analytics with MongoDB & Hadoop
Problem:
• Requirement to analyze data over many different dimensions to detect real-time threat profiles
• HBase unable to query data beyond primary key lookups
• Lucene search unable to scale with growth in data
Solution:
• MongoDB + Hadoop to collect and analyze data from internet sensors in real time
• MongoDB's dynamic schema enables sensor data to be enriched with geospatial tags
• Auto-sharding to scale as data volumes grow
Results:
• Run complex, real-time analytics on live data
• Improved query performance by over 3x
• Scale to support a doubling of data volume every 24 months
• Deploy across global data centers for a low-latency user experience
• Engineering teams have more time to develop new features
26. Conclusion
1. Data lakes enable enterprises to affordably capture & analyze more data
2. Operational and analytical workloads are converging
3. MongoDB is the key technology to operationalize the data lake
27. MongoDB Enterprise Advanced
• MongoDB Enterprise Server: authentication, authorization, auditing, encryption (in flight & at rest)
• MongoDB Compass: schema visualization, data exploration, ad-hoc queries
• MongoDB Connector for BI: visualization, analysis, reporting
• MongoDB Ops Manager: monitoring & alerting, query optimization, backup & recovery, automation & configuration, REST API
• Commercial terms & support: 24x7 support (1-hour SLA), commercial license (no AGPL copyleft restrictions), platform certifications, emergency patches, Customer Success Program, on-demand online training, warranty, limitation of liability, indemnification
28. Resources to Learn More
• Guide: Operational Data Lake
• Whitepaper: Real-Time Analytics with Apache Spark & MongoDB
30. For More Information
• Case Studies: mongodb.com/customers
• Presentations: mongodb.com/presentations
• Free Online Training: education.mongodb.com
• Webinars and Events: mongodb.com/events
• Documentation: docs.mongodb.org
• MongoDB Downloads: mongodb.com/download
• Additional Info: info@mongodb.com
31. One of the World's Largest Banks
Creating new customer insights with MongoDB & Spark
Problem:
• System failures in online banking systems creating customer satisfaction issues
• No personalization experience across channels
• No enrichment of user data with social media chatter
Solution:
• Apache Flume to ingest log data & social media streams, Apache Spark to process log events
• MongoDB to persist log data and KPIs, and to immediately rebuild user sessions when a service fails
• Integration with the MongoDB query language and secondary indexes to selectively filter and query data in real time
Results:
• Improved user experience, with more customers using online, self-service channels
• Improved services following a deeper understanding of how users interact with systems
• Greater user insight by adding social media insights
32. Fare Calculation Engine
One of the world's largest airlines migrates from Oracle to MongoDB and Apache Spark to support a 100x performance improvement
Problem:
• China Eastern targeting 130,000 seats sold every day across its web and mobile channels
• New fare calculation engine needed to support 20,000 search queries per second, but the current Oracle platform supported only 200 per second
Solution:
• Apache Spark used for fare calculations, using business rules stored in MongoDB
• Fare calculations written to MongoDB for access by the search application
• MongoDB Connector for Apache Spark allows seamless integration with data locality awareness across the cluster
Results:
• Cluster of fewer than 20 API, Spark & MongoDB nodes supports 180m fare calculations & 1.6 billion searches per day
• Each node delivers 15x higher performance and 10x lower latency than the existing Oracle servers
• MongoDB Enterprise Advanced provided Ops Manager for operational automation and access to expert technical support
33. MongoDB Connector for Apache Spark
• Native Scala connector, certified by Databricks
• Exposes all Spark APIs & libraries
• Efficient data filtering with predicate pushdown, secondary indexes, & in-database aggregations
• Locality awareness to reduce data movement
"We reduced 100+ lines of integration code to just a single line after moving to the MongoDB Spark connector." (Early Access Tester, Multi-National Banking Group)
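As a rough illustration of that "single line" of integration and the predicate pushdown described above, here is a PySpark sketch. It assumes a 2.x-era connector package (e.g. org.mongodb.spark:mongo-spark-connector) is on the classpath; the URI, database, collection and field names are hypothetical.

```python
from pyspark.sql import SparkSession

# Assumes the job is submitted with something like:
#   --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0   (version illustrative)
spark = (
    SparkSession.builder
    .appName("mongodb-spark-example")
    .config("spark.mongodb.input.uri", "mongodb://localhost/retail.customers")
    .config("spark.mongodb.output.uri", "mongodb://localhost/retail.segments")
    .getOrCreate()
)

# Load a MongoDB collection as a DataFrame; the connector infers the schema by sampling.
customers = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()

# Filters and projections are pushed down to MongoDB, so only matching
# documents and fields cross the network.
uk_segments = (
    customers.filter(customers.country == "UK")
    .groupBy("segment")
    .count()
)

# Write the aggregated results back to MongoDB for the operational application.
uk_segments.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()
```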
35. Query and Data Model
• Rich query language & secondary indexes: MongoDB Yes; Relational Yes; Column Family (e.g. HBase) requires integration with a separate Spark/Hadoop cluster
• In-database aggregations & search: MongoDB Yes; Relational Yes; Column Family requires integration with a separate Spark/Hadoop cluster
• Dynamic schema: MongoDB Yes; Relational No; Column Family Partial
• Data validation: MongoDB Yes; Relational Yes; Column Family app-side code
Why it matters:
• Query & Aggregations: rich, real-time analytics against operational data
• Dynamic Schema: manage multi-structured data
• Data Validation: enforce data governance between the data lake & operational apps
36. Data Lake Integration
• Hadoop + secondary indexes: MongoDB Yes; Relational Yes (expensive); Column Family (e.g. HBase) no secondary indexes
• Spark + secondary indexes: MongoDB Yes; Relational Yes (expensive); Column Family no secondary indexes
• Native BI connectivity: MongoDB Yes; Relational Yes; Column Family 3rd-party connectors
• Workload isolation: MongoDB Yes; Relational Yes (expensive); Column Family load data to a separate Spark/Hadoop cluster
Why it matters:
• Hadoop + Spark: efficient data movement between the data lake, processing layer & database
• Native BI Connectivity: visualizing operational data
• Workload Isolation: separation between operational and analytical workloads
37. Operationalizing for Scale & Security
• Robust security controls: MongoDB Yes; Relational Yes; Column Family (e.g. HBase) Yes
• Scale-out on commodity hardware: MongoDB Yes; Relational No; Column Family Yes
• Sophisticated management platform: MongoDB Yes; Relational Yes; Column Family monitoring only
Why it matters:
• Security: data protection for regulatory compliance
• Scale-Out: grow with the data lake
• Management: reduce TCO with platform automation, monitoring, disaster recovery
We've seen rapid growth in adoption of the data lake – a centralized repository for the many new data sources organizations are now collecting.
But it's not without challenges – the primary challenge is how to make the analytics generated by the data lake available to our real-time, operational apps.
So we are going to cover:
The rise of the data lake
The challenges in getting the most business value out of the data lake
The role that databases play, and the requirements on them
Case studies of companies unlocking insight from the data lake
As enterprises bring more products and services online as part of digital transformation initiatives, the one thing they don't lack today is data – from streams of sensor readings, to social sentiment, to machine logs, mobile apps, and more.
Analysts estimate volumes growing at 40% per annum, with 80% of all data unstructured.
At the same time, we see more pressure on time to market, on exposing apps to global audiences, and on reducing the cost of delivering new services.
These trends fundamentally change how enterprises build and run modern apps.
With all of this new data available, we are creating an insight economy.
Uncovering new insights by collecting and analyzing this data carries the promise of competitive advantage and efficiency savings. Better understand customers by predicting what they might buy based on behavior and demographics; optimize the supply chain with better or faster routes; reduce the risk of fraud by identifying suspicious behavior – it's all about that data.
Those that don't harness data are at a major disadvantage.
Understand the past, monitor the present, and predict the future.
MIT: data-driven decision environments have 5% higher productivity, 6% higher profit and up to 50% higher market value than other businesses.
Traditionally, data from operational apps has flowed into the DW, which takes all this data in and then creates analytics from it.
However, the traditional Enterprise Data Warehouse (EDW) is straining under the load, overwhelmed by the sheer volume and variety of data pouring into the business. Costs run hundreds to thousands of dollars per TB, versus tens to hundreds in commodity systems.
Because of these challenges many organizations have turned to Hadoop as a centralized repository for this new data, creating what many call a data lake. Not a replacement but an adjunct – it stores all the new data and applies new analytics, which are combined with the traditional reporting coming from the DW.
Gartner estimates around 50% of enterprises have rolled out, or are in the process of rolling out, data lakes.
When we think about data lakes, we think about big data, and big data is often associated with Hadoop – but the reality is more than just Hadoop.
Market growth forecast by Wikibon: "big data revenues" growing from $19bn in 2016 to $92bn in 2026, with software outpacing hardware and professional services. IDC forecasts just under $50bn by 2019, a 23% CAGR, with software growing fastest.
Leading the charge are Hadoop and Spark, closely followed by databases – a key part of the big data landscape, because they operationalize the data lake: the link between the back-end data lake and the front-end apps that consume analytics to make those apps smarter.
Hadoop is well established – it celebrates its 10th anniversary this year.
It has grown from HDFS and MapReduce into dozens of projects – Gartner identifies 19 common projects supported by the 4 leading distros. The average distro has many more: processing frameworks, search, provisioning and management, security, file formats, integration.
Each project is developed independently – its own roadmap, its own dependencies – incredible complexity.
HDFS is the common storage layer, against which the processing frameworks run to produce the outputs you see on the slide.
While something like 50% of enterprises either have or are evaluating Hadoop to create new classes of app, it is not without its challenges.
This appears in a number of Gartner analyses, and in the press.
One of the fundamental integration challenges is how to integrate the data lake with your operational systems.
Operational apps run the business – how do you expose the analytics created in the data lake to better serve customers with more relevant products and offers, or to drive efficiency savings from an IoT-enabled smart factory?
Unify data lake analytics with the operational applications
Enables you to create smart, contextually aware, data-driven apps
Integrated database layer operationalizes the data lake
The differences come in how data is stored, accessed and updated. Hadoop is a file system – it stores data in files in blocks, has no knowledge of the underlying data, and has no indexes. If you want to access a specific record, you scan all the data stored in the file where the record is located – which could be tens of MBs.
HDFS characteristics:
WORM – i.e. to update customer data, you rewrite all of that customer data, not just the individual customer's record.
Hadoop excels at generating analytics models by scanning and processing large datasets, but is not designed to provide real-time, random access for operational applications.
the time to read the whole dataset is more important than the latency in reading the first record.
http://stackoverflow.com/questions/15675312/why-hdfs-is-write-once-and-read-multiple-times/37300268#37300268
But MongoDB is more than just a filesystem. It is a full database, so it gives you a whole set of things HDFS doesn't:
Millisecond-latency query responsiveness.
Random access to indexed subsets of data.
Expressive querying & flexible indexing: supporting complex queries and aggregations against the data in real time, making online applications smarter and contextual.
Updating fast-changing data in real time as users interact with online applications, without having to rewrite the entire data set (see the update sketch below).
Fine-grained access with complex filtering logic.
Use distributed processing libraries against it – a MongoDB collection or document looks like an input or output in HDFS. Rather than loading a file, you load a DataFrame. Hive sees MongoDB as a table.
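A minimal pymongo sketch of that update-in-place point: fold new model output into one customer document without rewriting or recomputing the data set. The collection, fields and values here are hypothetical.

```python
from datetime import datetime, timezone
from pymongo import MongoClient

customers = MongoClient()["retail"]["customers"]  # hypothetical deployment

# Apply the latest model output to a single profile, in place.
customers.update_one(
    {"_id": "cust-42"},
    {"$set": {
        "recommendations": ["sku-1001", "sku-2040"],
        "segment": "frequent-buyer",
        "model_updated_at": datetime.now(timezone.utc),
    }},
    upsert=True,
)
```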
Longer jobs
Batch analytics
Append only files
Great for scanning all data or large subsets in files
The obvious question is why we need a database when we have Hadoop. It comes down to how each platform persists and accesses data. HDFS is a file system – it accesses data in batches of 128MB blocks. MongoDB is a database which provides fine-grained access to data at the level of individual records – giving each system very different properties – talk through.
Despite those differences, there are lots of similarities in how we process data – MapReduce, Spark. These are unopinionated about the underlying persistence layer – it could be HDFS, it could be MongoDB. That means you can unify analytics across the data lake and your database.
Both MongoDB and HDFS provide common attributes: schema-on-read, multiple replicas for fault tolerance, horizontal scale, low TCO.
But they have different characteristics in how they store and access data – meaning they are suited to different parts of the data lake deployment.
When you bring the database and the data lake together, you can build powerful, data driven apps
Take a real-life example – the data lake of a large retailer.
The online storefront and e-commerce engine is powered by MongoDB – handling customer profiles, sessions, baskets, product catalogs – presenting recommendations and offers.
As customers browse the site, all of their activity is written back to Hadoop, blending it with other data sources – social feeds, demographics, market data, credit scores, currency feeds – to segment and cluster customers.
These segments can then be exposed to MongoDB, so when customers come back they are presented with a personalized experience – based on what they have browsed before and what they are likely to want to purchase next.
You could not serve that operational app, dealing with individual customers, from HDFS – it's not real time, there are no indexes to access just the customer details you need, and there is no way of updating a customer record – everything is rewritten and recomputed.
Regression and classification for customer clustering
Let's go deeper and wider.
This is a design pattern for the data lake – multiple components that collectively handle ingest, storage, processing and analysis of data, then serve it to consuming operational apps.
Step through it.
Data ingestion: data streams are ingested into a pub/sub message queue, which routes all raw data into HDFS.
Often there is also event processing running against the queue to find interesting events that need to be consumed by the operational apps immediately – displaying an offer to a user browsing a product page, or alarms generated against vehicle telemetry from an IoT app. These events are routed to MongoDB for immediate consumption by operational applications (see the routing sketch below).
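A minimal routing sketch under stated assumptions: kafka-python consuming the queue and pymongo writing the interesting events to MongoDB. The topic, event types and collection names are hypothetical, and the raw stream would be landed in HDFS by a separate consumer.

```python
import json

from kafka import KafkaConsumer  # kafka-python
from pymongo import MongoClient

events = MongoClient()["ops"]["events"]  # hypothetical operational collection

consumer = KafkaConsumer(
    "app-events",                                    # hypothetical topic
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

# Route only the events the operational apps need right away into MongoDB;
# everything else continues to the data lake via a separate consumer.
for message in consumer:
    event = message.value
    if event.get("type") in {"offer_view", "telemetry_alarm"}:
        events.insert_one(event)
```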
Raw data is loaded into the data lake, where we can use Hadoop jobs – MapReduce or Spark – to generate analytics models from the raw data; see the examples in the layer above HDFS.
MongoDB exposes these models to the operational processes, serving indexed queries and updates against them with real-time latency
The distributed processing frameworks can re-compute analytics models, against data stored in either HDFS or MongoDB, continuously flowing updates from the operational database to analytics models
We'll look at some examples of users who have deployed this type of design pattern a little later.
Beyond low-latency performance, there are specific requirements. You need much more than just a datastore: a fully featured database serving as a system of record for online applications.
Tight integration between MongoDB and the data lake – minimize data movement between them and fully exploit the native capabilities of each part of the system.
You need to be able to serve operational workloads and run analytics against live operational data – e.g. the top trending articles right now so I know where to place my ads, or how many widgets coming off my production line are failing QA and whether that is up or down versus previous trends. Gartner calls it HTAP (Hybrid Transactional and Analytical Processing), Forrester calls it translytics. To do that, you need a powerful query language, secondary indexes, and aggregations & transformations all within the database – not ETL into a warehouse (see the aggregation sketch below).
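A minimal pymongo sketch of the "top trending articles right now" example, run as an in-database aggregation against live operational data rather than ETL into a warehouse; the collection and field names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from pymongo import MongoClient

pageviews = MongoClient()["analytics"]["pageviews"]  # hypothetical collection

one_hour_ago = datetime.now(timezone.utc) - timedelta(hours=1)

# The pipeline runs inside the database; a secondary index on ts keeps the $match cheap.
top_articles = list(pageviews.aggregate([
    {"$match": {"ts": {"$gte": one_hour_ago}}},
    {"$group": {"_id": "$article_id", "views": {"$sum": 1}}},
    {"$sort": {"views": -1}},
    {"$limit": 10},
]))
```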
Workload isolation: operational & analytics workloads must not contend for the same resources.
A flexible schema to handle multi-structured data, but with the ability to enforce governance on that data.
Secure access to the data: the operational DB is typically accessed by a much broader audience than Hadoop, so security controls are critical – robust access controls – LDAP, Kerberos, RBAC.
Auditing of all events for regulatory compliance. Encryption of data in motion and at rest, all built into the database.
It needs to scale as the data lake scales – which means scaling out on commodity hardware, often across geographic regions.
To simplify the environment, you need sophisticated management tools: to automate database deployment, scaling, monitoring and alerting, and disaster recovery.
Tight integration: it is not enough just to move data between the analytics and operational layers – you need to move it efficiently. Connectors should allow selective filtering by using secondary indexes to extract and process only the range of data needed – for example, retrieving all customers located in a specific geography. This is very different from other databases that do not support secondary indexes. In those cases, Spark and Hadoop jobs are limited to extracting all data based on a simple primary key, even if only a subset of that data is required for the query. That means more processing overhead, more hardware, and longer time-to-insight for the user.
Workload isolation: provision database clusters with dedicated analytics nodes, allowing users to simultaneously run real-time analytics and reporting queries against live data without impacting the nodes servicing the operational application (one way to do this is sketched below).
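One common way to get that isolation is replica-set member tags plus read preferences: analytics reads go to members tagged for that workload, leaving the operational members alone. A minimal pymongo sketch, assuming hypothetical hosts and a hypothetical {"workload": "analytics"} tag on the analytics secondaries (a hidden-member setup, as mentioned later, is another option).

```python
from pymongo import MongoClient
from pymongo.read_preferences import Secondary

client = MongoClient(
    "mongodb://host1:27017,host2:27017,host3:27017/?replicaSet=rs0"  # hypothetical hosts
)

# Route reporting/analytics reads to secondaries tagged for that workload,
# so the members serving the operational application are not impacted.
analytics_db = client.get_database(
    "retail",
    read_preference=Secondary(tag_sets=[{"workload": "analytics"}]),
)

regional_revenue = list(analytics_db.orders.aggregate([
    {"$group": {"_id": "$region", "revenue": {"$sum": "$total"}}},
]))
```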
A flexible data model to store data of any structure and easily evolve the model to capture new attributes – e.g. enriching user profiles with geospatial data. You also need to ensure data quality by enforcing validation rules against the data – to ensure it is appropriately typed and contains all the attributes the app needs (see the validator sketch below).
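A minimal sketch of enforcing those rules with MongoDB document validation (query-expression validators since 3.2; 3.6+ also offers $jsonSchema); the collection and field names are hypothetical.

```python
from pymongo import MongoClient
from pymongo.errors import CollectionInvalid, WriteError

db = MongoClient()["singleview"]  # hypothetical database

try:
    # Every document must carry an integer customer_id and an email containing "@".
    db.create_collection(
        "customers",
        validator={
            "customer_id": {"$type": "int"},
            "email": {"$regex": "@"},
        },
    )
except CollectionInvalid:
    pass  # collection already exists; validation rules stay as configured

try:
    db.customers.insert_one({"customer_id": "not-an-int", "email": "nobody"})
except WriteError as err:
    print("rejected by the validator:", err)
```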
Expressive queries allow developers to build applications that can query and analyze the data in multiple ways – by single keys, ranges, text search, and geospatial queries through to complex aggregations and MapReduce jobs, returning responses in milliseconds. Complex queries are executed natively in the database without having to use additional analytics frameworks or tools, avoiding the latency that comes from moving data between operational and analytical engines. Secondary indexes give you the ability to filter data in any way you need – key for low-latency operational queries (see the query sketch below).
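A small pymongo sketch of that range of query shapes, combining a range filter with a geospatial predicate, both served by secondary indexes; the collection and field names are hypothetical.

```python
from pymongo import ASCENDING, GEOSPHERE, MongoClient

readings = MongoClient()["iot"]["readings"]  # hypothetical collection

# Secondary indexes supporting the filters below.
readings.create_index([("location", GEOSPHERE)])
readings.create_index([("temperature", ASCENDING)])

# Readings above 90 degrees within 5 km of a depot, straight from the operational data.
hot_nearby = readings.find({
    "temperature": {"$gt": 90},
    "location": {
        "$near": {
            "$geometry": {"type": "Point", "coordinates": [-0.1276, 51.5072]},
            "$maxDistance": 5000,  # metres
        }
    },
})
for doc in hot_nearby:
    print(doc.get("sensor_id"), doc["temperature"])
```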
Robust security controls: govern access, provide audit trails, and encrypt data in flight and at rest.
Scale-out: match the scale-out of the data lake – as it grows, add new nodes to service higher data volumes or user load.
Advanced management platform. To reduce data lake TCO and risk of application downtime, powerful tooling to automate database deployment, scaling, monitoring and alerting, and disaster recovery.
Let's look at some examples in action.
CTM – the UK's leading price comparison site – moved from an on-prem, RDBMS-based monolithic app to a microservices architecture powered by MongoDB, with Hadoop at the back end providing analytics – enabling them to better personalize the customer experience and deepen relationships.
Read through bullets
The second example is a leading global airline. Through M&A it has multiple brands to serve different countries and market sectors, but customer data was spread across 100 different systems.
By using Hadoop and Spark, they brought that data together to create a single view, which is loaded into MongoDB to power the online apps – web and mobile, as well as the call center – so users get a consistent experience however they interact. All user data and ticket data is stored in MongoDB, then written back into Hadoop to run advanced analytics that allow ticket price optimization and identify offers and gaps in the product portfolio.
Read bullets
They provide a traveler safety platform for corporate customers – if there is a natural disaster or security incident while a traveler is away on business, they are able to send real-time alerts and advice on how to get to safety.
The platform was built for national governments and has now been launched for commercial use – analyzing PBs of data with MongoDB, Hadoop, Apache NiFi & SAP HANA.
Read bullets
McAfee built its cloud-based threat intelligence platform on MongoDB. The platform monitors threat activity for clients in real time – it identifies attacks as they take place and identifies when users may be interacting with insecure or suspicious sites.
All real-time activity is captured in MongoDB – providing alerting to security teams – and sent to Hadoop for further back-end analytics, with updated threat profiles written back to MongoDB.
MongoDB is open source – we also provide Enterprise Advanced: a collection of software and support to run in production at scale.
The Stratio Apache Spark-certified Big Data (BD) platform is used by an impressive client list including BBVA, Just Eat, Santander, SAP, Sony, and Telefonica. The company has implemented a unified real-time monitoring platform for a multinational banking group operating in 31 countries with 51 million clients all over the world. The bank wanted to ensure a high quality of service and personalized experience across its online channels, and needed to continuously monitor client activity to check service response times and identify potential issues. The application was built on a modern technology foundation including:
Apache Flume to aggregate log data
Apache Spark to process log events in real time
MongoDB to persist log data, processed events and Key Performance Indicators (KPIs).
The aggregated KPIs, stored by MongoDB enable the bank to analyze client and systems behavior in real time in order to improve the customer experience. Collecting raw log data allows the bank to immediately rebuild user sessions if a service fails, with analysis generated by MongoDB and Spark providing complete traceability to quickly identify the root cause of any issue.
The project required a database that provided always-on availability, high performance, and linear scalability. In addition, a fully dynamic schema was needed to support high volumes of rapidly changing semi-structured and unstructured JSON data being ingested from a variety of logs, clickstreams, and social networks. After evaluating the project’s requirements, Stratio concluded MongoDB was the best fit. With MongoDB’s query projections and secondary indexes, analytic processes run by the Stratio BD platform avoid the need to scan the entire data set, which is not the case with other databases.
China Eastern
Industry: Travel and Hospitality, Airline
Use Case: Search
While it's important to provide low-latency access to data, it's not enough to just support simple key-value lookups – the demand is to get insights from data faster. This is the role of real-time analytics: track in real time where the vehicles in your fleet are, what the social sentiment is to an announcement you've just made, correlate patterns of real-time fraud attempts against specific domains – this is where an expressive query language, secondary indexes, and in-database aggregations are valuable.
MongoDB and RDBMSs both have strong features here – the RDBMS is further ahead – while column family stores are little more than key-value. You need to move data out to other query frameworks or analytics nodes to get any intelligence – which adds latency and complexity – more moving parts.
The RDBMS is good in many areas, but it lacks the data model flexibility needed to handle rapidly changing, multi-structured data; that is where it falls down.
Column family offers more schema flexibility than relational, but you still need to pre-define columns, which restricts the speed at which you can evolve apps.
Data validation – applying rules to the data structures the operational database stores. Say an app creates a single view of your customer – data may be spread across many repositories, loaded into the data lake to create the single view, then loaded into MongoDB to serve operational apps – it needs to ensure documents contain mandatory fields such as unique customer identifiers, typed and formed in a specific way, e.g. the ID is always an integer, the email address always contains an @. Document validation in MongoDB enables you to do this. The RDBMS has full schema validation, so it is a little ahead; in a column family database you have to enforce governance in application code.
Looking at the aggregated scores: relational and MongoDB are evenly matched, with column family – a much simpler datastore – a long way behind.
Hadoop and Spark integration: you need to do more than just move vast amounts of data between each layer of the stack – you need intelligent connectors that can push down predicates and filter data with secondary indexes, e.g. to access all customers in a specific geography. Without access to the DB's secondary indexes, and without pre-aggregating data, you end up moving a ton of data back and forth – more processing cycles, longer latency.
The MongoDB connectors for Hadoop and for Spark both support these capabilities. Column family doesn't offer secondary indexes or aggregations, so there is nothing to filter the data.
The RDBMS offers these capabilities in its connectors, but they are generally only available as expensive add-ons, hence downgraded.
Workload isolation – the ability to perform real-time analytics on live operational data without interfering with operational apps. You don't want an aggregation counting how many deliveries your fleet of trucks has made to interfere with how quickly you can detect from sensor data that a vehicle has developed a fault. The key to doing this is distributing queries to dedicated nodes in the database cluster – some provisioned for operational workloads, replicating to nodes dedicated to analytics. MongoDB supports up to 50 members in a single replica set, and analytics members can be configured as hidden so they are never hit by operational queries. Column family is restricted to just 3 data replicas – there for HA, not for separation of different workloads. For the RDBMS it is an expensive add-on.
Native BI connectivity – may not be relevant in all cases, but many organizations want to be able to create live dashboards reporting the current state of operational systems. MongoDB has a native BI connector that exposes the database as an ODBC data source – visualize it in anything from Tableau to BusinessObjects to Excel. There is rich tooling in the relational world. For column family, connectors exist but they are 3rd party and don't push queries down to the database; instead they extract all the data – so powering dashboards is more computationally and network intensive.
Security: data from the operational database is exposed to apps and potentially millions of users – you need to provide robust access controls, which may include integration with LDAP, Kerberos, PKI environments and RBAC to tightly segregate who can do what in the DB. Encrypt data in flight and at rest, and maintain a log of activity in the DB for forensic analysis.
All solutions do well here – there is big investment in the Hadoop ecosystem, which is rapidly gaining ground on the RDBMS, but at much lower cost.
Scale-out: you need to be able to scale as the data lake scales and as more digital services are opened up to users – a core strength of non-relational databases. The fundamental challenge is that the RDBMS requires scale-up: limited headroom and very expensive proprietary hardware.
Management: Hadoop is complex and its management tools are still primitive. For the operational database, you need a platform that provides powerful tooling to automate database deployment, scaling, fine-grained monitoring and alerting, and disaster recovery with point-in-time backups and automated restores. There is rich tooling in the relational world – and big investment from MongoDB to close that gap.
Left-hand side – MongoDB maintains the attributes of relational databases, blended with innovation from NoSQL.
This uniquely differentiates MongoDB from its peers in the non-relational DB market.
Invest in technology that has production-proven deployments and broad skills availability.
With availability of Hadoop skills cited by Gartner analysts as a top challenge, it is essential you choose an operational database with a large available talent pool. This enables you to find staff who can rapidly build differentiated big data applications. Across multiple measures, including DB Engines Rankings, 451 Group NoSQL Skills Index and the Gartner Magic Quadrant for Operational Databases, MongoDB is the leading non-relational database.