This webinar discusses Amazon DynamoDB, a highly scalable, SSD-based, zero-administration NoSQL database service in the AWS Cloud. We explain how DynamoDB works and walk through best practices and tips to get the most out of the service.
Jane Uyvova
Senior Solutions Architect, MongoDB
March 21, 2017
MongoDB Evenings San Francisco
Learn how easy it is to set up, operate, and scale your MongoDB deployments in the cloud with MongoDB Atlas.
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc... (Databricks)
Spark SQL is a highly scalable and efficient relational processing engine with easy-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate and analyze data from diverse data sources (e.g., Hive, Cassandra, Kafka and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk dives into the technical details of Spark SQL across the entire lifecycle of a query execution, giving the audience a deeper understanding of Spark SQL and how to tune its performance.
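To make the kind of workload described above concrete, here is a minimal PySpark sketch (not taken from the talk; the file path and column names are hypothetical) that loads one of the supported file formats and queries it with SQL:

```python
# Minimal Spark SQL sketch: load a Parquet file and query it with SQL.
# The path and column names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-sketch").getOrCreate()

orders = spark.read.parquet("/data/orders.parquet")   # also works with json/csv/orc
orders.createOrReplaceTempView("orders")

top_customers = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_spend
    FROM orders
    GROUP BY customer_id
    ORDER BY total_spend DESC
    LIMIT 10
""")
top_customers.show()
```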
This developer-focused webinar will explain how to use the Cypher graph query language. Cypher, a query language designed specifically for graphs, allows for expressing complex graph patterns using simple ASCII art-like notation and offers a simple but expressive approach for working with graph data.
During this webinar you'll learn:
-Basic Cypher syntax
-How to construct graph patterns using Cypher
-Querying existing data (see the driver sketch after this list)
-Data import with Cypher
-Using aggregations such as statistical functions
-Extending the power of Cypher using procedures and functions
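The webinar content itself is not reproduced here, so the following is only a rough sketch of what querying existing data with Cypher can look like from Python using the official neo4j driver; the connection URI, credentials, labels, and property names are all assumptions:

```python
# Hypothetical example of issuing a Cypher query from Python via the neo4j driver.
# URI, credentials, labels and properties are placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

cypher = """
MATCH (p:Person {name: $name})-[:KNOWS]->(friend:Person)
RETURN friend.name AS friend
ORDER BY friend
"""

with driver.session() as session:
    for record in session.run(cypher, name="Alice"):
        print(record["friend"])

driver.close()
```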
CDC Stream Processing with Apache Flink (Timo Walther)
An instant world requires instant decisions at scale. This includes the ability to digest and react to changes in real-time. Thus, event logs such as Apache Kafka can be found in almost every architecture, while databases and similar systems still provide the foundation. Change Data Capture (CDC) has become popular for propagating changes. Nevertheless, integrating all these systems, which often have slightly different semantics, can be a challenge.
In this talk, we highlight what it means for Apache Flink to be a general data processor that acts as a data integration hub. Looking under the hood, we demonstrate Flink's SQL engine as a changelog processor that ships with an ecosystem tailored to processing CDC data and maintaining materialized views. We will discuss the semantics of different data sources and how to perform joins or stream enrichment between them. This talk illustrates how Flink can be used with systems such as Kafka (for upsert logging), Debezium, JDBC, and others.
Ozone is an object store for Hadoop. Ozone solves the small file problem of HDFS, allowing users to store trillions of files in Ozone and access them as if they were on HDFS. Ozone plugs into existing Hadoop deployments seamlessly, and programs like Hive, LLAP, and Spark work without any modifications. This talk looks at the architecture, reliability, and performance of Ozone.
In this talk, we will also explore the Hadoop distributed storage layer, the block storage layer that makes this scaling possible, and how we plan to use it to scale HDFS.
We will demonstrate how to install an Ozone cluster, how to create volumes, buckets, and keys, how to run Hive and Spark against HDFS and Ozone file systems using federation, so that users don’t have to worry about where the data is stored. In other words, a full user primer on Ozone will be part of this talk.
Speakers
Anu Engineer, Software Engineer, Hortonworks
Xiaoyu Yao, Software Engineer, Hortonworks
Data Build Tool (DBT) is an open source technology for setting up your data lake using best practices from software engineering. This SQL-first technology is a great marriage between Databricks and Delta, and it lets you maintain high-quality data and documentation during the entire data lake lifecycle. In this talk I'll give an introduction to DBT and show how we can leverage Databricks to do the actual heavy lifting. Next, I'll present how DBT supports Delta to enable upserting using SQL. Then we show how we integrate DBT+Databricks into the Azure cloud. Finally, we show how we emit the pipeline metrics to Azure Monitor to make sure that you have observability over your pipeline.
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake (Databricks)
Change Data Capture (CDC) is a typical use case in real-time data warehousing. It tracks the change log (binlog) of a relational (OLTP) database and replays those changes in a timely manner to an external store, such as Delta or Kudu, for real-time OLAP. To implement a robust CDC streaming pipeline, many factors must be considered, such as how to ensure data accuracy, how to handle schema changes in the OLTP source, and whether the pipeline is easy to build for a variety of databases with little code.
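To illustrate the upsert step that such a pipeline typically ends with, here is a minimal PySpark sketch that merges a micro-batch of CDC rows into a Delta table; the paths, column names, and the `op` column convention are hypothetical, and it assumes the Delta Lake (delta-spark) package is available to the Spark session:

```python
# Minimal sketch: applying a micro-batch of CDC rows to a Delta table with MERGE.
# Paths, columns and the `op` convention are hypothetical placeholders.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cdc-upsert-sketch")
         .getOrCreate())

target = DeltaTable.forPath(spark, "/data/warehouse/customers")   # existing Delta table
updates = spark.read.json("/data/staging/customers_cdc")          # CDC rows with an `op` column

(target.alias("t")
 .merge(updates.alias("s"), "t.id = s.id")
 .whenMatchedDelete(condition="s.op = 'delete'")
 .whenMatchedUpdateAll(condition="s.op = 'update'")
 .whenNotMatchedInsertAll(condition="s.op = 'insert'")
 .execute())
```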
Cosco: An Efficient Facebook-Scale Shuffle Service (Databricks)
Cosco is an efficient shuffle-as-a-service that powers Spark (and Hive) jobs at Facebook warehouse scale. It is implemented as a scalable, reliable and maintainable distributed system. Cosco is based on the idea of partial in-memory aggregation across a shared pool of distributed memory. This provides vastly improved efficiency in disk usage compared to Spark's built-in shuffle. Long term, we believe the Cosco architecture will be key to efficiently supporting jobs at ever larger scale. In this talk we'll take a deep dive into the Cosco architecture and describe how it's deployed at Facebook. We will then describe how it's integrated to run shuffle for Spark, and contrast it with Spark's built-in sort-based shuffle mechanism and SOS (presented at Spark+AI Summit 2018).
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy... (Databricks)
A traditional data team has roles including data engineer, data scientist, and data analyst. However, many organizations are finding success by integrating a new role – the analytics engineer. The analytics engineer develops a code-based data infrastructure that can serve both analytics and data science teams. He or she develops re-usable data models using the software engineering practices of version control and unit testing, and provides the critical domain expertise that ensures that data products are relevant and insightful. In this talk we’ll talk about the role and skill set of the analytics engineer, and discuss how dbt, an open source programming environment, empowers anyone with a SQL skillset to fulfill this new role on the data team. We’ll demonstrate how to use dbt to build version-controlled data models on top of Delta Lake, test both the code and our assumptions about the underlying data, and orchestrate complete data pipelines on Apache Spark™.
Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J... (Flink Forward)
Netflix processes trillions of events and petabytes of data a day in the Keystone data pipeline, which is built on top of Apache Flink. As Netflix has scaled up original productions annually enjoyed by more than 150 million global members, data integration across the streaming service and the studio has become a priority. Scalably integrating data across hundreds of different data stores in a way that enables us to holistically optimize cost, performance and operational concerns presented a significant challenge. Learn how we expanded the scope of the Keystone pipeline into the Netflix Data Mesh, our real-time, general-purpose, data transportation platform for moving data between Netflix systems. The Keystone Platform’s unique approach to declarative configuration and schema evolution, as well as our approach to unifying batch and streaming data and processing will be covered in depth.
Troubleshooting Kerberos in Hadoop: Taming the Beast (DataWorks Summit)
Kerberos is the ubiquitous authentication mechanism when it comes to securing Hadoop services. With recent updates in Hadoop core and various Apache Hadoop components, inherent Kerberos support has matured and come a long way.
Understanding and configuring Kerberos is still a challenge, but even more painful and frustrating is troubleshooting a Kerberos issue. There are a lot of things (small and big) that can go wrong (and will go wrong!). This talk covers Kerberos debugging in detail and discusses the tools and tricks that can be used to narrow down any Kerberos issue.
Rather than listing issues and their resolutions, we will focus on how to approach a Kerberos problem and the do's and don'ts when working with Kerberos. This talk will provide a step-by-step guide that will equip the audience for troubleshooting future Kerberos problems.
The agenda covers:
- Systematic approach to Kerberos troubleshooting
- Kerberos Tools available in Hadoop arsenal
- Tips & Tricks to narrow down Kerberos issues quickly
- Some nasty Kerberos issues from Support trenches
Some prior knowledge of Kerberos basics is appreciated but not a prerequisite.
Speaker:
Vipin Rathor, Sr. Product Specialist (HDP Security), Hortonworks
What is Apache Kafka and What is an Event Streaming Platform? (Confluent)
Speaker: Gabriel Schenker, Lead Curriculum Developer, Confluent
Streaming platforms have emerged as a popular, new trend, but what exactly is a streaming platform? Part messaging system, part Hadoop made fast, part fast ETL and scalable data integration. With Apache Kafka® at the core, event streaming platforms offer an entirely new perspective on managing the flow of data. This talk will explain what an event streaming platform such as Apache Kafka is and some of the use cases and design patterns around its use—including several examples of where it is solving real business problems. New developments in this area such as KSQL will also be discussed.
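As a simplified illustration of the publish/subscribe core of such a platform, here is a short Python sketch using the confluent-kafka client; the broker address, topic name, and payloads are placeholders, not anything from the talk:

```python
# Minimal produce/consume sketch with the confluent-kafka Python client.
# Broker, topic and payloads are hypothetical placeholders.
from confluent_kafka import Producer, Consumer

producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("page-views", key="user-42", value='{"page": "/pricing"}')
producer.flush()  # block until the message is delivered

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "analytics-demo",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["page-views"])

msg = consumer.poll(5.0)               # wait up to 5 s for one record
if msg is not None and msg.error() is None:
    print(msg.key(), msg.value())
consumer.close()
```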
Meta/Facebook's database serving social workloads runs on top of MyRocks (MySQL on RocksDB). This means our performance and reliability depend a lot on RocksDB. Beyond MyRocks, we have other important systems running on top of RocksDB as well. We have learned many lessons from operating and debugging RocksDB at scale.
In this session, we will offer an overview of RocksDB, key differences from InnoDB, and share a few interesting lessons learned from production.
A Thorough Comparison of Delta Lake, Iceberg and Hudi (Databricks)
Recently, a set of modern table formats such as Delta Lake, Hudi, and Iceberg has sprung up. Along with the Hive Metastore, these table formats are trying to solve problems that have stood in traditional data lakes for a long time, with declared features like ACID transactions, schema evolution, upsert, time travel, incremental consumption, etc.
Exactly-once Stream Processing with Kafka Streams (Guozhang Wang)
I will present the recent additions to Kafka to achieve exactly-once semantics (0.11.0) within its Streams API for stream processing use cases. This is achieved by leveraging the underlying idempotent and transactional client features. The main focus will be the specific semantics that Kafka distributed transactions enable in Streams and the underlying mechanics to let Streams scale efficiently.
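Kafka Streams itself is a Java library, but the idempotent and transactional producer features it builds on can be sketched from Python with the confluent-kafka client; the transactional id, topic, and records below are assumptions, and error handling (aborts, retries) is omitted:

```python
# Sketch of the transactional producer features that underpin exactly-once processing.
# Transactional id, topic and records are hypothetical; aborts/retries are omitted.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "transactional.id": "order-enricher-1",
    "enable.idempotence": True,
})

producer.init_transactions()
producer.begin_transaction()
for order_id in ("o-1", "o-2", "o-3"):
    producer.produce("enriched-orders", key=order_id, value=b"...")
producer.commit_transaction()   # all three records become visible atomically
```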
Modern architectures are moving away from a "one size fits all" approach. We are well aware that we need to use the best tools for the job. Given the large selection of options available today, chances are that you will end up managing data in MongoDB for your operational workload and with Spark for your high speed data processing needs.
Description: When we model documents or data structures, there are key aspects that need to be examined not only for functional and architectural purposes, but also to take into account the distribution of data nodes, streaming capabilities, aggregation and queryability options, and how we can integrate data processing software, like Spark, that can benefit from subtle but substantial model changes. A clear example is the choice between embedding and referencing documents, and its implications for high-speed processing.
Over the course of this talk we will detail the benefits of a good document model for the operational workload, as well as what type of transformations we should incorporate in our document model to suit the high-speed processing capabilities of Spark.
We will look into the different options that we have to connect these two different systems, how to model according to different workloads, what kind of operators we need to be aware of for top performance and what kind of design and architectures we should put in place to make sure that all of these systems work well together.
Over the course of the talk we will showcase different libraries that enable the integration between Spark and MongoDB, such as the MongoDB Hadoop Connector, the Stratio Connector and the MongoDB Spark Native Connector.
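As a rough illustration of what such an integration looks like in code, the sketch below loads a MongoDB collection into a Spark DataFrame with the MongoDB Spark Connector; note that the format name and option keys shown follow the 10.x connector and differ in older releases, and the URI, database, collection, and fields are placeholders:

```python
# Hypothetical read of a MongoDB collection into a Spark DataFrame
# using the MongoDB Spark Connector (option names follow the 10.x connector).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("mongodb-spark-sketch")
         .getOrCreate())

ratings = (spark.read.format("mongodb")
           .option("connection.uri", "mongodb://localhost:27017")
           .option("database", "movies")
           .option("collection", "ratings")
           .load())

ratings.groupBy("movie_id").avg("rating").show()
```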
By the end of the talk I expect the attendees to have an understanding of:
How they connect their MongoDB clusters with Spark
Which use cases show a net benefit for connecting these two systems
What kind of architecture design should be considered for making the most of Spark + MongoDB
How documents can be modeled for better performance and operational process, while processing these data sets stored in MongoDB.
The talk is suitable for:
Developers that want to understand how to leverage Spark
Architects that want to integrate their existing MongoDB cluster and have real time high speed processing needs
Data scientists that know about Spark, are playing with Spark and want to integrate with MongoDB for their persistency layer
Migration from SQL to MongoDB - A Case Study at TheKnot.com (MongoDB)
8 out of 10 couples use TheKnot.com to help plan their wedding. A key part of planning involves selecting articles, photographs, and other resources and storing these in the user's Favorites. Recently we migrated major parts of our technology stack to open source technologies. As part of our migration strategy, we zeroed in on MongoDB, since it better suited our requirements for speed and data structure as well as eliminating the need for a caching layer. The transition required a period in which both our legacy and new APIs were working concurrently, with data being persisted in both databases (SQL and Mongo) and all records being synced on every request. We resorted to many strategies and applications to achieve this goal, including Pentaho, AWS SQS and SNS, a queue messaging system, and some proprietary Ruby gems. In this session we will review our strategy and some of the lessons we learned about successfully migrating with zero downtime.
Learn how to build new classes of sophisticated, real-time analytics by combining Apache Spark, the industry's leading data processing engine, with MongoDB, the industry’s fastest growing database.
We live in a world of “big data.” But it isn’t just the data itself that is valuable – it’s the insight it can generate. How quickly an organization can unlock and act on that insight has become a major source of competitive advantage. Collecting data in operational systems and then relying on nightly batch extract, transform, load (ETL) processes to update the enterprise data warehouse (EDW) is no longer sufficient.
In this live session, we show you how MongoDB and Spark work together and provide examples using the new Spark Connector for MongoDB.
This session was sponsored by Stratio & Paradigma.
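The full examples live in the session itself; as a stand-in, the sketch below aggregates a DataFrame in Spark and writes the result back to MongoDB, again assuming the 10.x MongoDB Spark Connector's format name and options, with all names being placeholders:

```python
# Hypothetical example: aggregate in Spark, then write the result back to MongoDB.
# Format name and options assume the 10.x MongoDB Spark Connector; names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-to-mongodb").getOrCreate()

clicks = spark.read.json("/data/clickstream")                         # placeholder source
daily = clicks.groupBy("user_id", F.to_date("ts").alias("day")).count()

(daily.write.format("mongodb")
 .option("connection.uri", "mongodb://localhost:27017")
 .option("database", "analytics")
 .option("collection", "daily_clicks")
 .mode("append")
 .save())
```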
How Financial Services Organizations Use MongoDB (MongoDB)
MongoDB is the alternative that allows you to efficiently create and consume data, rapidly and securely, no matter how it is structured across channels and products, and makes it easy to aggregate data from multiple systems, while lowering TCO and delivering applications faster.
Learn how Financial Services Organizations are Using MongoDB with this presentation.
Real World MongoDB: Use Cases from Financial Services by Daniel Roberts (MongoDB)
Huge upheaval in the finance industry has led to a major strain on existing IT infrastructure and systems. New finance industry regulation has meant increased volume, velocity and variability of data. This coupled with cost pressures from the business has led these institutions to seek alternatives. In this session learn how FS companies are using MongoDB to solve their problems. The use cases are specific to FS but the patterns of usage - agility, scale, global distribution - will be applicable across many industries.
2017-02-07 - Elastic & Spark: Building a Search Geo Locator (Alberto Paro)
Using Elasticsearch in a Big Data environment is very simple. In this talk, we analyse what Big Data is and show how easy it is to integrate Elasticsearch with Apache Spark.
Big Data Testing: Ensuring MongoDB Data Quality (RTTS)
You've made the move to MongoDB for its flexible schema and querying capabilities in order to enhance agility and reduce costs for your business. Shouldn't your data quality process be just as organized and efficient?
Using QuerySurge for testing your MongoDB data as part of your quality effort will increase your testing speed, boost your testing coverage (up to 100%), and improve the level of quality within your Big Data store. QuerySurge will help you keep your team organized and on track too!
To learn more about QuerySurge, visit www.QuerySurge.com
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E... (Spark Summit)
Elasticsearch provides native integration with Apache Spark through ES-Hadoop. However, especially during development, it is at best cumbersome to have Elasticsearch running on a separate machine or instance. Leveraging Spark Cluster with Elasticsearch Inside, it is possible to run an embedded instance of Elasticsearch in the driver node of a Spark cluster. This opens up new opportunities to develop cutting-edge applications. One such application is Dataset Search.
Oscar will give a demo of a Dataset Search Engine built on Spark Cluster with Elasticsearch Inside. Motivation is that once Elasticsearch is running on Spark it becomes possible and interesting to have the Elasticsearch in-memory instance join an (existing) Elasticsearch cluster. And this in turn enables indexing of Datasets that are processed as part of Data Pipelines running on Spark. Dataset Search and Data Management are R&D topics that should be of interest to Spark Summit East attendees who are looking for a way to organize their Data Lake and make it searchable.
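As a rough sketch of the ES-Hadoop integration mentioned above, the snippet below indexes a small Spark DataFrame into Elasticsearch; it assumes the elasticsearch-spark (ES-Hadoop) package is on the Spark classpath, and the node address and index name are placeholders:

```python
# Hypothetical indexing of a DataFrame into Elasticsearch via ES-Hadoop.
# Requires the elasticsearch-spark package on the classpath; names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-es-sketch").getOrCreate()

datasets = spark.createDataFrame(
    [("ds-1", "taxi trips 2016"), ("ds-2", "weather observations")],
    ["id", "description"],
)

(datasets.write
 .format("org.elasticsearch.spark.sql")
 .option("es.nodes", "localhost")
 .option("es.port", "9200")
 .mode("append")
 .save("datasets"))          # target index name
```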
(BDT310) Big Data Architectural Patterns and Best Practices on AWS (Amazon Web Services)
The world is producing an ever increasing volume, velocity, and variety of big data. Consumers and businesses are demanding up-to-the-second (or even millisecond) analytics on their fast-moving data, in addition to classic batch processing. AWS delivers many technologies for solving big data problems. But what services should you use, why, when, and how? In this session, we simplify big data processing as a data bus comprising various stages: ingest, store, process, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. Finally, we provide reference architecture, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost.
What to Expect for Big Data and Apache Spark in 2017 (Databricks)
Big data remains a rapidly evolving field with new applications and infrastructure appearing every year. In this talk, Matei Zaharia will cover new trends in 2016 / 2017 and how Apache Spark is moving to meet them. In particular, he will talk about work Databricks is doing to make Apache Spark interact better with native code (e.g. deep learning libraries), support heterogeneous hardware, and simplify production data pipelines in both streaming and batch settings through Structured Streaming.
Speaker: Matei Zaharia
Video: http://go.databricks.com/videos/spark-summit-east-2017/what-to-expect-big-data-apache-spark-2017
This talk was originally presented at Spark Summit East 2017.
Big Data is an evolution of Business Intelligence (BI). Whereas traditional BI relies on data warehouses limited in size (a few terabytes) that struggle to manage unstructured data and real-time analysis, the era of Big Data opens up a new technological period offering advanced architectures and infrastructures that allow sophisticated analyses taking into account these new data integrated into the business ecosystem. In this article, we present the results of an experimental study on the performance of the leading Big Analytics framework (Spark) with the most popular NoSQL databases, MongoDB and Hadoop. The objective of this study is to determine the software combination that allows sophisticated analysis in real time.
MONGODB VS MYSQL: A COMPARATIVE STUDY OF PERFORMANCE IN SUPER MARKET MANAGEME... (ijcsity)
A database is a collection of information organized in tables so that it can easily be accessed, managed, and updated. It is the collection of tables, schemas, queries, reports, views and other objects. The data are typically organized to model processes that require information, such as finding a hotel with available rooms, so that people can easily locate hotels with vacancies. There are two common kinds of databases: relational and non-relational. Relational databases usually work with structured data, while non-relational databases work with semi-structured data. In this paper, a performance evaluation of MySQL and MongoDB is performed, where MySQL is an example of a relational database and MongoDB is an example of a non-relational database. A relational database is a data structure that allows you to connect information from different 'tables', or different types of data buckets. A non-relational database stores data without explicit and structured mechanisms to link data from different buckets to one another. This paper discusses the performance of MongoDB and MySQL in the field of supermarket management systems. A supermarket is a larger form of the traditional grocery store: a self-service shop offering a wide variety of food and household products, organized in a systematic manner, with a wider selection than a traditional grocery store.
What are the major components of MongoDB and the major tools used in it.docx (Technogeeks)
MongoDB, a renowned NoSQL database, comprises key components like databases, collections, documents, indexes, replica sets, and sharding, enabling flexible and scalable data management. Major tools include the Mongo Shell, MongoDB Compass, MongoDB Atlas, and Mongoose, facilitating database administration, monitoring, and development tasks. MongoDB's optimization strategies involve indexing, efficient querying, projection, aggregation, and sharding to enhance query performance. Capped collections offer a specialized solution for managing time-ordered data with predictable sizes, ensuring high performance and simplicity for specific use cases like event logging. Understanding MongoDB's components, utilizing its tools, and implementing optimization strategies empower developers to build modern, scalable, and efficient applications tailored to their needs.
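To ground two of the concepts listed above, indexes and capped collections, here is a small PyMongo sketch; the connection string, collection names, and sizes are placeholders rather than anything prescribed by the original post:

```python
# PyMongo sketch: create a capped collection for event logging and an index for queries.
# Connection string, names and sizes are hypothetical.
from pymongo import MongoClient, DESCENDING

client = MongoClient("mongodb://localhost:27017")
db = client["appdb"]

# Capped collection: fixed size, insertion-ordered, old documents age out automatically.
if "events" not in db.list_collection_names():
    db.create_collection("events", capped=True, size=16 * 1024 * 1024, max=100_000)

db.events.insert_one({"type": "login", "user": "alice"})

# Secondary index to speed up per-user lookups.
db.events.create_index([("user", DESCENDING)])
print(db.events.count_documents({"user": "alice"}))
```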
Hands on Big Data Analysis with MongoDB - Cloud Expo Bootcamp NYC (Laura Ventura)
One of the most popular NoSQL databases, MongoDB is one of the building blocks for big data analysis. MongoDB can store unstructured data and makes it easy to analyze files by commonly available tools. This session will go over how big data analytics can improve sales outcomes in identifying users with a propensity to buy by processing information from social networks. All attendees will have a MongoDB instance on a public cloud, plus sample code to run Big Data Analytics.
Organizations adopt different databases for big data, which is huge in volume and has varied data models. Querying big data is challenging yet crucial for any business. Data warehouses traditionally built with On-line Transaction Processing (OLTP)-centric technologies must be modernized to scale to the ever-growing demand for data. With rapidly changing requirements, it is important to get near-real-time responses from the big data gathered, so that business decisions needed to address new challenges can be made in a timely manner. The main focus of our research is to improve the performance of query execution for big data.
how_can_businesses_address_storage_issues_using_mongodb.pptx (sarah david)
MongoDB enables seamless data storage and performance. Explore our blog to learn how MongoDB handles storage issues for startups and large-scale enterprises. Discover how to optimize MongoDB performance using open-source database storage.
SQL vs NoSQL, an experiment with MongoDB (Marco Segato)
A simple experiment with MongoDB compared to Oracle classic RDBMS database: what are NoSQL databases, when to use them, why to choose MongoDB and how we can play with it.
New generations of database technologies are allowing organizations to build applications never before possible, at a speed and scale that were previously unimaginable. MongoDB is the fastest growing database on the planet, and the new 3.2 release will bring the benefits of modern database architectures to an ever broader range of applications and users.
In-Memory Storage Engine (beta)
WiredTiger as the default storage engine
Advanced security (encryption at rest)
Document Validation (see the sketch after this list)
Advanced full text
Dynamic Lookups
BI Connector (Tableau, Qlikview, Cognos, BusinessObjects, etc...)
Database GUI with MongoDB Compass
And more...
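As a rough illustration of two of the 3.2 features listed above, document validation and dynamic lookups with $lookup, here is a PyMongo sketch; the collection names, fields, and validation rule are hypothetical:

```python
# Hypothetical PyMongo sketch of MongoDB 3.2 document validation and $lookup.
# Collection names, fields and the validation rule are placeholders.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["shop"]

# Document validation (3.2 style: query-operator expressions).
db.create_collection("products", validator={"price": {"$gte": 0}, "name": {"$type": "string"}})
db.products.insert_one({"name": "keyboard", "price": 49})

# Dynamic lookup: left outer join orders -> products in the aggregation pipeline.
pipeline = [
    {"$lookup": {
        "from": "products",
        "localField": "product_name",
        "foreignField": "name",
        "as": "product",
    }},
    {"$limit": 5},
]
for doc in db.orders.aggregate(pipeline):
    print(doc)
```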
Presentation Material for NoSQL Indonesia "October MeetUp".
This slide talks about basic schema design and some examples in applications already on production.
We are a company that delivers value to our customers by lowering costs with digital marketing and increasing the efficiency of campaigns and their conversions. Using the most advanced artificial intelligence models from a neuro-marketing perspective, we have been able to predict the effectiveness of a marketing campaign before it is published. After publication, we evaluate the campaign, segmenting the audience according to the pattern extracted from each market segment, delivering information for strategic and efficient management.
High-Performance Applications with JHipster Full Stack (João Gabriel Lima)
Talk presented at the Sou Java Campinas community meetup about JHipster, demystifying many assumptions and validating the best of what the Java technology market has to offer.
Talk presented at the September FEMUG-PE meetup! I show the ARKit framework and some very interesting applications of augmented reality. Finally, I present React-Native-ArKit, a library that lets you, the React Native developer, use ARKit in your projects in an easy and very practical way.
With the growing wave of generated data, it is increasingly clear that Big Data preparation and processing technologies need to rely on Artificial Intelligence. In this talk I present the state of the art in Big Data and AI, clearly show the relationship between these topics, and give direction on how these concepts should be applied. A case study is presented of Operação Serenata de Amor, proposed by data scientists and journalists to fight corruption in Brazil.
The regression model is then used to predict the outcome of an unknown dependent variable, given the values of the independent variables.
In this lesson, I show a step-by-step theoretical and practical approach to performing linear regression using WEKA.
In this presentation, the main cases that occurred between 2015 and 2016 are discussed, detailing how each was carried out, the techniques used and, above all, tips on how to protect yourself from them.
Data Mining with RapidMiner - A Case Study on Churn Rate in... (João Gabriel Lima)
In this talk, we work through a step-by-step approach to building a classification model to identify the patterns of customers of a telephone company who cancelled the service, so that the carrier can predict the risk of cancellation and start work to keep it from happening.
Data Mining with RapidMiner + WEKA - Clustering (João Gabriel Lima)
In this presentation, I give a practical step-by-step guide on how to cluster and, more importantly, how to interpret the results and apply them to support decision-making.
At the end there is a very interesting practice exercise that gives us the opportunity to apply the knowledge acquired.
jgabriel.ufpa@gmail.com
In this presentation I present both architectures and show that, instead of choosing between one and the other, we can take the best of each and use them in a clean, simple and objective way.
Game of Data - Prediction and Analysis of the Game of Thrones Series Using... (João Gabriel Lima)
In this presentation I show a study carried out by the University of Munich that aims to predict the probability of a character dying in the next season based on 24 pre-selected characteristics.
Presentation about the e-Trânsito Cidadão app: https://play.google.com/store/apps/details?id=com.huddle3.etranstitocidadaov2
It contains news and provides lookups of IPVA (vehicle tax) information.
[Estácio - IESAM] Automating Tasks with Gulp.js (João Gabriel Lima)
A tutorial on Gulp.js
Web Development Specialization - Instituto de Estudos Superiores da Amazônia
In this tutorial I present the convenience provided by task automators and cover [Gulp.js](gulpjs.com) specifically.
Talk presented at JsDay Recife 2015, giving an overview of the Internet of Things landscape with JavaScript. I first highlight the general concepts, then justify the use of JavaScript, and show the main tools, libraries and APIs. I cite the main projects in the area and present a hands-on project implemented in JavaScript, using Bluetooth technology to build smart homes, providing communication between the controller device and the user's smartphone.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 (Albert Hoitingh)
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024 (Neo4j)
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Transcript: Selling digital books in 2024: Insights from industry leaders - T... (BookNet Canada)
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... (DanBrown980551)
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Elevating Tactical DDD Patterns Through Object Calisthenics (Dorra BARTAGUIZ)
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Securing your Kubernetes cluster: a step-by-step guide to success! (KatiaHIMEUR1)
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Pushing the limits of ePRTC: 100ns holdover for 100 days (Adtran)
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
UiPath Test Automation using UiPath Test Suite series, part 5 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with devops.
Topics covered:
CI/CD within UiPath
End-to-end overview of CI/CD pipeline with Azure devops
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Epistemic Interaction - tuning interfaces to provide information for AI support (Alan Dix)
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many of those features provide convenience and capability at the expense of security. This best practices guide outlines steps users can take to better protect personal devices and information.
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by me and Rik Marselis from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps looks like. We ended with a lovely workshop in which participants tried to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
The Art of the Pitch: WordPress Relationships and Sales (Laura Byrne)
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Apache Spark and MongoDB: Turning Analytics into Real-Time Action
A MongoDB White Paper
Table of Contents
Introduction
Rich Analytics in MongoDB
Apache Spark: Extending MongoDB Analytics
Configuring & Running Spark with MongoDB
Integrating MongoDB & Spark with BI & Hadoop
MongoDB & Spark in the Wild
Conclusion
We Can Help
Introduction
We live in a world of “big data”. But it isn’t only the data
itself that is valuable – it is the insight it can generate. That
insight can help designers better predict new products that
customers will love. It can help manufacturing companies
model failures in critical components, before costly recalls.
It can help financial institutions detect fraud. It can help
retailers better forecast supply and demand. Marketing
departments can deliver more relevant recommendations
to their customers. The list goes on.
How quickly an organization can unlock and act on that
insight is becoming a major source of competitive
advantage. Collecting data in operational systems and then
relying on nightly batch ETL (Extract Transform Load)
processes to update the Enterprise Data Warehouse
(EDW) is no longer sufficient. Speed-to-insight is critical,
and so analytics against live operational data to drive
real-time action is fast becoming a necessity, enabled by
technologies like MongoDB and Apache Spark.
Why Apache Spark and MongoDB?
Apache Spark is one of the fastest-growing big data
projects in the history of the Apache Software Foundation.
With its memory-oriented architecture, flexible processing
libraries and ease-of-use, Spark has emerged as a leading
distributed computing framework for real-time analytics.
With over 10 million downloads, MongoDB is the most
popular non-relational database, counting more than one
third of the Fortune 100 as customers. Its flexible
JSON-based document data model, dynamic schema and
automatic scaling on commodity hardware make MongoDB
an ideal fit for modern, always-on applications that must
manage high volumes of rapidly changing, multi-structured
data. Internet of Things (IoT), mobile apps, social
engagement, customer data and content management
systems are prime examples of MongoDB use cases.
Combining the leading analytics processing engine with
the fastest-growing database enables organizations to
realize real-time analytics. Spark jobs can be executed
directly against operational data managed by MongoDB
without the time and expense of ETL processes. MongoDB
can then efficiently index and serve analytics results back
into live, operational processes. This approach offers many
benefits to teams tasked with delivering modern,
data-driven applications:
• Developers can build more functional applications
faster, using a single database technology.
• Operations teams eliminate the requirement for
shuttling data between separate operational and
analytics infrastructure, each with its own unique
configuration, maintenance and management
requirements.
• CIOs deliver faster time-to-insight for the business, with
lower cost and risk.
This whitepaper will discuss the analytics capabilities
offered by MongoDB and Apache Spark, before providing
an overview of how to configure and combine them into a
real-time analytics engine. It will conclude with example
use cases.
Rich Analytics in MongoDB
Unlike NoSQL databases that offer little more than basic
key-value query operations, developers and data scientists
can use MongoDB’s native query processing capabilities to
generate many classes of analytics, before needing to
adopt dedicated frameworks such as Spark or Hadoop for
more specialized tasks.
Leading organizations including Salesforce, McAfee,
Buzzfeed, Intuit, City of Chicago, Amadeus, Buffer and
many others rely on MongoDB’s powerful query
functionality, aggregations and indexing to generate
analytics in real-time directly against their live, operational
data.
In this section of the paper, we provide more detail on
MongoDB's native analytics features, before then exploring
the additional capabilities Apache Spark can bring.
MongoDB Query Model
MongoDB users have access to a broad array of query,
projection and update operators supporting analytical
queries against live operational data:
• Aggregation and MapReduce queries, discussed in
more detail below
• Range queries returning results based on values
defined as inequalities (e.g., greater than, less than or
equal to, between)
• Geospatial queries returning results based on proximity
criteria, intersection and inclusion as specified by a
point, line, circle or polygon
• Text search queries returning results in relevance order
based on text arguments using Boolean operators (e.g.,
AND, OR, NOT)
• Key-value queries returning results based on any field in
the document, often the primary key
• Native BI and Hadoop integration, for deep, offline
analytics.
With the combination of MongoDB’s dynamic document
model and comprehensive query framework, users are able
to store data before knowing all of the questions they will
need to ask of it.
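For illustration, the following sketch uses the PyMongo driver against hypothetical databases, collections and fields to show range, geospatial and text search queries issued directly against operational data:

# A minimal sketch using the PyMongo driver; database, collection and
# field names are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client.shop.orders

# Range query: orders valued between 50 and 200.
mid_value = orders.find({"total": {"$gte": 50, "$lte": 200}})

# Geospatial query: stores within 5 km of a point (requires a 2dsphere index).
nearby = client.shop.stores.find({"location": {
    "$near": {"$geometry": {"type": "Point", "coordinates": [-122.4, 37.77]},
              "$maxDistance": 5000}}})

# Text search: products matching a phrase, in relevance order
# (requires a text index on the searched fields).
hits = client.shop.products.find(
    {"$text": {"$search": "wireless headphones"}},
    {"score": {"$meta": "textScore"}},
).sort([("score", {"$meta": "textScore"})])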
Data Aggregation
The MongoDB Aggregation Framework is similar in
concept to the SQL GROUP BY statement, enabling users
to generate aggregations of values returned by the query
(e.g., count, minimum, maximum, average, intersections)
that can be used to power reports, dashboards and
visualizations.
Using the Aggregation Framework, documents in a
MongoDB collection (analogous to a table in a relational
database) pass through an aggregation pipeline, where
they are processed by operators. Expressions produce
output documents based on calculations performed on the
input documents. The accumulator expressions used in the
$group operator maintain state (e.g., totals, mins, maxs,
averages) as documents progress through the pipeline.
The aggregation pipeline enables multi-step data
enrichment and transformations to be performed directly in
the database with a simple declarative interface, allowing
processes such as lightweight ETL to be performed within
MongoDB.
Result sets from the aggregation pipeline can be written to
a named collection with no limit to the output size (subject
to the underlying storage system). Existing collections can
be replaced with new results while maintaining previously
defined indexes to ensure queries can always be returned
efficiently over rapidly changing data.
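As a brief sketch (hypothetical clickstream collection and field names, again using the PyMongo driver), the pipeline below filters, groups and materializes per-country click counts into a named collection that dashboards can query directly:

# A minimal sketch of an aggregation pipeline; collection and field
# names are hypothetical.
from pymongo import MongoClient

clicks = MongoClient("mongodb://localhost:27017").analytics.clicks

pipeline = [
    # Filter to the documents of interest before aggregating.
    {"$match": {"article_id": {"$exists": True}}},
    # Accumulate per-article, per-country click counts in $group.
    {"$group": {"_id": {"article": "$article_id", "country": "$country"},
                "clicks": {"$sum": 1}}},
    {"$sort": {"clicks": -1}},
    # Write the result set to a named collection for reports to query.
    {"$out": "clicks_by_country"},
]
clicks.aggregate(pipeline)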
Incremental MapReduce within MongoDB
MongoDB provides native support for MapReduce,
enabling complex data processing that is expressed in
JavaScript and executed across data in the database.
Google’s V8 JavaScript engine, which is integrated into
MongoDB, allows multiple MapReduce jobs to be executed
simultaneously across both single servers and sharded
collections.
The incremental MapReduce operation uses a temporary
collection during processing so it can be run periodically
over the same target collection without affecting
intermediate states. This mode is useful when continuously
updating statistical output collections used by analytics
dashboards and visualizations.
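The sketch below illustrates the idea with the mapReduce database command and a merge output collection (collection and field names are hypothetical; newer MongoDB releases favor the aggregation pipeline for this kind of work):

# A minimal sketch of an incremental MapReduce run that merges new
# totals into an existing output collection; names are hypothetical.
from datetime import datetime, timedelta

from bson.code import Code
from bson.son import SON
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017").analytics
last_run = datetime.utcnow() - timedelta(hours=1)

map_fn = Code("function () { emit(this.page, 1); }")
reduce_fn = Code("function (key, values) { return Array.sum(values); }")

db.command(SON([
    ("mapReduce", "pageviews"),
    ("map", map_fn),
    ("reduce", reduce_fn),
    # Only process documents added since the previous run.
    ("query", {"ts": {"$gte": last_run}}),
    # Merge the new results into the statistics collection used by dashboards.
    ("out", {"merge": "pageview_totals"}),
]))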
Optimizing Analytics Queries with MongoDB Indexes
MongoDB provides a number of different index types to
optimize performance of real-time and ad-hoc analytics
queries across highly variable, fast moving data sets. These
same indexes can be used by Apache Spark to filter only
interesting subsets of data against which analytics can be
run.
Indexes can be created on any field within a document or
sub-document. In addition to supporting single field and
compound indexes, MongoDB also supports indexes of
arrays with multiple values, short-lived data (i.e., Time To
Live), sparse, geospatial and text data, and can enforce
constraints with unique indexes. Index intersection allows
indexes to be dynamically combined at runtime to
efficiently answer explorative questions of the data. Refer
to the documentation for the full list of index types.
The MongoDB query optimizer selects the best index to
use by running alternate query plans and selecting the
index with the best response time for each query type. The
results of this empirical test are stored as a cached query
plan and are updated periodically.
MongoDB also supports covered queries – where returned
results contain only indexed fields, without having to
access and read from source documents. With the
appropriate indexes, analytics workloads can be optimized
to use predominantly covered queries.
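As an example of the pattern (hypothetical collection and fields), the compound index below supports both the filter and the sort, and the projection excludes _id so the query can be covered entirely by the index:

# A minimal sketch of index creation and a covered analytics query;
# collection and field names are hypothetical.
from pymongo import MongoClient, ASCENDING, DESCENDING

orders = MongoClient("mongodb://localhost:27017").shop.orders

# Compound index supporting queries that filter by country and sort by total.
orders.create_index([("country", ASCENDING), ("total", DESCENDING)])

# Covered query: both the filter and the projected fields are in the index,
# and _id is excluded, so results can be served from the index alone.
top_orders = orders.find(
    {"country": "DE"},
    {"country": 1, "total": 1, "_id": 0},
).sort("total", DESCENDING)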
Building upon MongoDB, Apache Spark can offer
additional analytics capabilities to serve real time,
operational processes.
Apache Spark: Extending MongoDB's Analytics Capabilities
Apache Spark is a powerful open source processing
engine designed around speed, ease of use, and
sophisticated analytics. Originally developed at UC
Berkeley in 2009, Spark has seen rapid adoption by
enterprises across a range of industries.
Spark is a general-purpose framework used for many types
of data processing. Spark comes packaged with support
for machine learning, interactive queries (SQL), statistical
queries with R, ETL, and streaming. For loading and storing
data, Spark integrates with a number of storage systems
including Amazon S3, HDFS, RDBMS’ and MongoDB.
Additionally, Spark supports a variety of popular
development languages including Java, Python and Scala.
Apache Spark Benefits
Spark was initially designed for interactive queries and
iterative algorithms, as these were two major use cases not
well served by batch frameworks such as Hadoop’s
MapReduce. Consequently Spark excels in scenarios that
require fast performance, such as iterative processing,
interactive querying, large-scale batch operations,
streaming, and graph computations.
Developers and data scientists typically deploy Spark for:
• Simplicity. Easy-to-use APIs for operating on large
datasets. This includes a collection of sophisticated
operators for transforming and manipulating
semi-structured data.
• Speed. By exploiting in-memory optimizations, Spark
has shown up to 100x higher performance than
MapReduce running on Hadoop.
• Unified Framework. Packaged with higher-level libraries,
including support for SQL queries, machine learning,
stream and graph processing. These standard libraries
increase developer productivity and can be combined to
create complex workflows.

Figure 1: Apache Spark Ecosystem
Spark allows programmers to develop complex, multi-step
data pipelines using a directed acyclic graph (DAG)
pattern. It also supports in-memory data sharing across
DAGs, so that different jobs can work with the same data.
Spark takes MapReduce to the next level with less
expensive shuffles during data processing. Spark holds
intermediate results in memory rather than writing them to
disk, which is especially useful when a process needs to
iteratively work on the same dataset. Spark is designed as
an execution engine that works with data both in-memory
and on-disk. Spark operators perform external operations
when data does not fit in memory. Spark can be used for
processing datasets that are larger than the aggregate
memory in a cluster.
The Resilient Distributed Dataset (RDD) is the core data
structure used in the Spark framework. RDD is analogous
to a table in a relational database or a collection in
MongoDB. RDDs are immutable – they can be modified
with a transformation, but the transformation returns a new
RDD while the original RDD remains unchanged.
RDDs support two types of operations: Transformations
and Actions. Transformations return a new RDD after
processing it with a function such as map, filter, flatMap,
groupByKey, reduceByKey, aggregateByKey, pipe, or
coalesce.
The Action operation evaluates the RDD and returns a new
value. When an Action function is called on an RDD object,
all the processing queries are computed and the result
value is returned. Some of the Action operations are
reduce, collect, count, first, take, countByKey, and foreach.
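A short PySpark sketch makes the distinction concrete (illustrative data only): transformations build up a lineage of RDDs lazily, and an action triggers the computation:

# A minimal sketch of RDD transformations (lazy) and actions (eager).
from pyspark import SparkContext

sc = SparkContext(appName="rdd-basics")

words = sc.parallelize(["spark", "mongodb", "spark", "analytics"])

# Transformations return new RDDs; nothing executes yet.
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# Actions evaluate the lineage and return values to the driver.
print(counts.collect())  # e.g. [('spark', 2), ('mongodb', 1), ('analytics', 1)]
print(words.count())     # 4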
When to Use Spark with MongoDB
While MongoDB natively offers rich analytics capabilities,
there are situations where integrating the Spark engine
can extend the real-time processing of operational data
managed by MongoDB, and allow users to operationalize
results generated from Spark within real-time business
processes supported by MongoDB.
Spark can take advantage of MongoDB’s rich secondary
indexes to extract and process only the range data it needs
– for example, analyzing all customers located in a specific
geography. This is very different from other databases that
either do not offer, or do not recommend the use of
secondary indexes. In these cases, Spark would need to
extract all data based on a simple primary key, even if only
a subset of that data is required for the Spark process. This
means more processing overhead, more hardware, and
longer time-to-insight for the analyst.
Examples of where it is useful to combine Spark and
MongoDB include the following.
Rich Operators & Algorithms
Spark supports over 100 different operators and
algorithms for processing data. Developers can use these
to perform advanced computations that would otherwise
require more programmatic effort combining the MongoDB
aggregation framework with application code. For example,
Spark offers native support for advanced machine learning
algorithms including k-means clustering and Gaussian
mixture models.
Consider a web analytics platform that uses the MongoDB
aggregation framework to maintain a real time dashboard
displaying the number of clicks on an article by country;
how often the article is shared across social media; and the
number of shares by platform. With this data, analysts can
quickly gain insight into how content is performing,
optimizing the user experience for posts that are trending,
with the ability to deliver critical feedback to the editors and
ad-tech team.
Spark’s machine learning algorithms can also be applied to
the log, clickstream and user data stored in MongoDB to
build precisely targeted content recommendations for its
readers. Multi-class classifications are run to divide articles
into granular sub-categories, before applying logistic
regression and decision tree methods to match readers’
interests to specific articles. The recommendations are
then served back to users through MongoDB, as they
browse the site.
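As a rough sketch of this kind of workflow, Spark MLlib's k-means clustering can segment readers in a few lines; the feature vectors here are illustrative stand-ins for activity data that would, in practice, be loaded from MongoDB via the connector:

# A minimal sketch of k-means clustering with Spark MLlib; the feature
# vectors are hypothetical reader-activity data.
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="reader-segmentation")

features = sc.parallelize([
    [12.0, 3.0, 0.0],
    [1.0, 0.0, 7.0],
    [11.0, 4.0, 1.0],
])

model = KMeans.train(features, k=2, maxIterations=10)
print(model.clusterCenters)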
Processing Paradigm
Many programming languages can use their own
MongoDB drivers to execute queries against the database,
returning results to the application where additional
analytics can be run using standard machine learning and
statistics libraries. For example, a developer could use the
MongoDB Python or R drivers to query the database,
loading the result sets into the application tier for additional
processing.
However, this starts to become more complex when an
analytical job in the application needs to be distributed
across multiple threads and nodes. While MongoDB can
service thousands of connections in parallel, the application
would need to partition the data, distribute the processing
across the cluster, and then merge results. Spark makes
this kind of distributed processing easier and faster to
develop. MongoDB exposes operational data to Spark’s
distributed processing layer to provide fast, real-time
analysis. Combining Spark queries with MongoDB indexes
allows data to be filtered, avoiding full collection scans and
delivering low-latency responsiveness with minimal
hardware and database overhead.
Skills Re-Use
With libraries for SQL, machine learning and others –
combined with programming in Java, Scala and Python –
developers can leverage existing skills and best practices
to build sophisticated analytics workflows on top of
MongoDB.
Configuring & Running Spark
with MongoDB
There are currently two packaged connectors to integrate
MongoDB and Spark:
• The MongoDB Connector for Hadoop and Spark
• The Spark-MongoDB Connector, developed by Stratio
MongoDB Connector for Hadoop & Spark
The MongoDB Connector is a plugin for both Hadoop and
Spark that provides the ability to use MongoDB as an input
source and/or an output destination for jobs running in
both environments. Note that the connector directly
integrates Spark with MongoDB and has no dependency
on also having a Hadoop cluster running.
Input and output classes are provided allowing users to
read and write against both live MongoDB collections and
against BSON (Binary JSON) files that are used to store
MongoDB snapshots. Standard MongoDB connection
strings, including authentication credentials and read
preferences, are used to specify collections against which
Spark queries are run, and where results are written back
to. JSON formatted queries and projections can be used to
filter the input collection, which uses a method in the
connector to create a Spark RDD from the MongoDB
collection.
There are only two dependencies for installing the
connector:
• Download the MongoDB Connector for Hadoop and
Spark. For Spark, all you need is the "core" jar.
• Download the jar for the MongoDB Java Driver.
The documentation provides a worked example using
MongoDB as both a source and sink for Spark jobs,
including how to create and save Spark’s Resilient
Distributed Dataset (RDD) to MongoDB collections.
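As a rough sketch of that pattern (the database, collection and query shown are hypothetical, and the connector and Java driver jars are assumed to be on the Spark classpath), a PySpark job can read a filtered MongoDB collection as an RDD and write results back:

# A rough sketch assuming the MongoDB Connector for Hadoop and Spark
# ("core" jar) and the MongoDB Java driver are on the Spark classpath;
# URIs, database, collection and query are hypothetical.
from pyspark import SparkContext

sc = SparkContext(appName="mongo-connector-example")

read_conf = {
    "mongo.input.uri": "mongodb://localhost:27017/marketdata.trades",
    # A JSON query pushed down to MongoDB so only matching documents are read.
    "mongo.input.query": '{"symbol": "MDB"}',
}
trades = sc.newAPIHadoopRDD(
    inputFormatClass="com.mongodb.hadoop.MongoInputFormat",
    keyClass="org.apache.hadoop.io.Text",
    valueClass="org.apache.hadoop.io.MapWritable",
    conf=read_conf,
)

# ...Spark transformations on `trades` would go here; the (key, document)
# pairs in `results` are then written back to a MongoDB collection.
results = trades
write_conf = {"mongo.output.uri": "mongodb://localhost:27017/marketdata.summaries"}
results.saveAsNewAPIHadoopFile(
    "file:///unused",  # the path is ignored by the MongoDB output format
    outputFormatClass="com.mongodb.hadoop.MongoOutputFormat",
    keyClass="org.apache.hadoop.io.Text",
    valueClass="org.apache.hadoop.io.MapWritable",
    conf=write_conf,
)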
Databricks (founded by the creators of Spark) and
MongoDB have also collaborated in creating a
demonstration app that documents how to configure and
use Spark to execute analysis of market trading data
managed by MongoDB.

Table 1: Comparing Spark Connectors for MongoDB

Feature | MongoDB Connector for Hadoop | Stratio Spark-MongoDB Connector
Machine Learning | Yes | Yes
SQL | Not currently | Yes
DataFrames | Not currently | Yes
Streaming | Not currently | Not currently
Python | Yes | Yes, using SparkSQL syntax
Use MongoDB secondary indexes to filter input data | Yes | Yes
Compatibility with MongoDB replica sets and sharding | Yes | Yes
MongoDB support | Yes, read and write | Yes, read and write
HDFS support | Yes, read and write | Partial, write only
Support for MongoDB BSON files | Yes | No
Commercial support | Yes, with MongoDB Enterprise Advanced | Yes, provided by Stratio
Spark-MongoDB Connector
The Spark-MongoDB Connector is a library that allows the
user to read and write data to MongoDB with Spark,
accessible from the Python, Scala and Java APIs. The
Connector is developed by Stratio and distributed under
the Apache Software License.
The MongoDB SparkSQL data source implements the
Spark DataFrame API and is fully implemented in Scala.
The connector allows integration between multiple data
sources that implement the same API for Spark, in addition
to using the SQL syntax to provide higher-level
abstractions for complex analytics. It also includes an easy
way to integrate MongoDB with Spark Python API.
Stratio's SparkSQL MongoDB connector implements the
PrunedFilteredScan API instead of the TableScan API.
Therefore, in addition to MongoDB’s query projections, the
connector can avoid scanning the entire collection, which is
not the case with other databases.
The connector supports the Spark Catalyst optimizer for
both rule-based and cost-based query optimization. To
operate against multi-structured data, the connector infers
the schema by sampling documents from the MongoDB
collection. This process is controlled by the samplingRatio
parameter. If the schema is known, the developer can
provide it to the connector, avoiding the need for any
inference. Once data is stored in MongoDB, Stratio
provides an ODBC/JDBC connector for integrating results
with any BI tool, such as Tableau and QlikView.
The connector can be downloaded from the community
Spark Packages repository.
Installation is simple – the connector can be included in a
Spark application with a single command.
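For illustration only (package coordinates, the data source name and option keys vary across connector versions, so check Stratio's documentation), launching Spark with the connector and registering a MongoDB collection as a SparkSQL table might look like this:

# Illustrative only: package coordinates, data source name and option
# keys depend on the Spark-MongoDB connector version.
#
# Launch with the connector included, for example:
#   pyspark --packages com.stratio.datasource:spark-mongodb_2.10:<version>
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="stratio-mongodb-example")
sqlContext = SQLContext(sc)

# Register a MongoDB collection as a temporary SparkSQL table.
sqlContext.sql("""
  CREATE TEMPORARY TABLE trades
  USING com.stratio.datasource.mongodb
  OPTIONS (host 'localhost:27017', database 'marketdata', collection 'trades')
""")

# Query with SQL; filters can be pushed down to MongoDB rather than
# scanning the whole collection.
sqlContext.sql("SELECT symbol, price FROM trades WHERE price > 100").show()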
You can learn more about the Spark-MongoDB connector
from Stratio’s GitHub page.
Operations: Separating Operational from Analytical Workloads
Spark can be run on any physical node within a MongoDB
replica set. However, executing heavyweight analytics jobs
on a host that is also serving requests from an operational
application can cause contention for system resources.
Similarly the working set containing the indexes and most
frequently accessed data in memory is overwritten by data
needed to process analytics queries. Whether running
MongoDB aggregations or Spark processes, it is a best
practice to isolate competing traffic from the operational
and analytics workloads.
Using native replication, MongoDB maintains multiple
copies of data called replica sets. Replica sets are primarily
designed to provide high availability and disaster recovery,
and to isolate different workloads running over the same
data set.
By default, all read and write operations are routed to the
primary replica set member, thus ensuring strong
consistency of data being read and written to the database.
Applications can also be configured to read from
secondary replica set members, where data is eventually
consistent by default. Reads from secondaries can be
useful in scenarios where it is acceptable for data to be
several seconds (often much less) behind the live data,
which is typical for many analytics and reporting
applications. To ensure a secondary node serving analytics
queries is never considered for election to the primary
replica set member, it should be configured with a priority
of 0.
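A brief sketch of the read-routing side (host, replica set and database names are hypothetical; the priority 0 setting itself is applied in the replica set configuration): the application keeps its default, primary-based connection for operational traffic, while analytics code opens a handle that reads from secondaries:

# A minimal sketch: direct analytics reads to secondary replica set
# members so they do not compete with operational traffic. Host names,
# replica set name and database are hypothetical.
from pymongo import MongoClient, ReadPreference

client = MongoClient("mongodb://node1,node2,node3/?replicaSet=rs0")

# Operational reads and writes use the primary (the default).
ops_db = client.appdata

# Analytics reads go to secondaries and accept eventually consistent
# data that may lag the primary by a few seconds.
analytics_db = client.get_database(
    "appdata", read_preference=ReadPreference.SECONDARY)

clicks_today = analytics_db.events.count_documents({"type": "click"})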
Unlike other databases that have to offload Spark to
dedicated analytics nodes that are running an entirely
different configuration from the database nodes, all
members of a MongoDB replica set share the same
availability, scalability and security mechanisms. This
approach reduces operational complexity and risk.
Integrating MongoDB & Spark
Analytics with BI Tools &
Hadoop
Real time analytics generated by MongoDB and Spark can
serve both online operational applications and offline
reporting systems, where it can be blended with historical
data and analytics from other data sources. To power
dashboards, reports and visualizations MongoDB offers
integration with more of the leading BI and Analytics
platforms than any other non-relational database, including
tools from:
• Actuate
• Alteryx
• Informatica
• Jaspersoft
• Logi Analytics
• MicroStrategy
• Pentaho
• Qliktech
• SAP Lumira
• Talend
A range of ODBC/JDBC connectors for MongoDB from
the likes of Progress Software and Simba Technologies
provide integration with additional analytics and
visualization platforms, including Tableau and others.
At the most fundamental level, each connector provides
read and write access to MongoDB. The connector
enables the BI platform to access MongoDB documents,
parse them, before then blending them with other data.
Result sets can also be written back to MongoDB if
required.
More advanced functionality available in some BI
connectors includes integration with the MongoDB
aggregation framework for in-database analytics and
summarization, schema discovery and intelligent query
routing within a replica set. You can read more about these
integrations and connectors in the Bringing Online Big
Data to BI and Analytics whitepaper.
Figure 2: Using MongoDB Replica Sets to Isolate Analytics from Operational Workloads
MongoDB Connector for BI and Visualization
To extend integration, MongoDB has recently previewed a
new connector that integrates MongoDB with
industry-standard BI and data visualization tools that, until
now, have only been able to analyze data in SQL-based
relational databases. By presenting MongoDB as a regular
ODBC data source, the connector is designed to work with
every SQL-compliant data analysis tool on the market,
including Tableau, SAP Business Objects, Qlik, IBM
Cognos Business Intelligence, and even Microsoft Excel.
By leveraging MongoDB to process queries, the connector
provides a highly scalable analytics solution for fast moving,
real-time data, while giving millions of BI & Analytics users
access to data in MongoDB using the tools they already
know.
The BI Connector is currently in preview release and
expected to become generally available in the fourth
quarter of 2015. More information on the connector is
available here.
Hadoop Integration
Like MongoDB, Hadoop is seeing growing adoption across
industry and government, becoming an adjunct to – and in
some cases a replacement of – the traditional Enterprise
Data Warehouse.
Many organizations are harnessing the power of Hadoop
and MongoDB together to create complete big data
applications:
• MongoDB powers the online, real time operational
application, serving business processes and end-users
• Hadoop consumes data from MongoDB, blending it
with data from other operational systems to fuel
sophisticated analytics and machine learning. Results
are loaded back to MongoDB to serve smarter
operational processes – i.e., delivering more relevant
offers, faster identification of fraud, improved customer
experience.
Organizations such as eBay, Orbitz, Pearson, Square Enix
and SFR are using MongoDB alongside Hadoop to
operationalize big data.
As well as providing integration for Spark, the MongoDB
Connector for Hadoop and Spark is also the foundation for
MongoDB’s integration with Hadoop. A single extensible
connector supporting both Spark and Hadoop makes it
much simpler for developers and operations teams to build
and run powerful big data pipelines.

Figure 3: Integrating MongoDB & Spark with BI Tools
The MongoDB Connector for Hadoop and Spark presents
MongoDB as a Hadoop data source allowing Hadoop jobs
to read data from MongoDB directly without first copying it
to HDFS, thereby eliminating the need to move terabytes of data
between systems. Hadoop jobs can pass queries as filters,
thereby avoiding the need to scan entire collections and
speeding up processing; they can also take advantage of
MongoDB’s indexing capabilities, including text and
geospatial indexes.
As well as reading from MongoDB, the Connector also
allows results of Hadoop jobs to be written back out to
MongoDB, to support real-time operational processes and
ad-hoc querying by both MongoDB and Spark query
engines.
The Connector supports MapReduce, Spark on HDFS, Pig,
Hadoop Streaming (with Node.js, Python or Ruby) and
Flume jobs. Support is also available for SQL queries from
Apache Hive to be run across MongoDB data sets. The
connector is certified on the leading Hadoop distributions
from Cloudera, Hortonworks and MapR.
To learn more about running MongoDB with Hadoop,
download the Big Data: Examples and Guidelines
whitepaper.
MongoDB & Spark in the Wild
Many organizations are already combining MongoDB and
Spark to unlock real-time analytics from their operational
data:
• A global manufacturing company has built a pilot project
to estimate warranty returns by analyzing material
samples from production lines. The collected data
enables them to build predictive failure models using
Spark Machine Learning and MongoDB.
• A video sharing website is using Spark with MongoDB
to place relevant advertisements in front of users as
they browse, view and share videos.
A multinational banking group operating in 31 countries
with 51 million clients implemented a unified real-time
monitoring application with the Stratio Big Data (BD)
platform, running Apache Spark and MongoDB. The bank
wanted to ensure a high quality of service across its online
channels, and so needed to continuously monitor client
activity to check service response times and identify
potential issues. The technology stack included:
• Apache Flume to aggregate log data
• Apache Spark to process log events in real time
• MongoDB to persist log data, processed events and
Key Performance Indicators (KPIs).
The aggregated KPIs enable the bank to meet its objective
of analyzing client and systems behavior in order to
improve customer experience. Collecting raw log data
allows the bank to immediately rebuild user sessions if a
service fails, with the analysis providing complete
traceability to identify the root cause of the issue.
MongoDB was selected as the database layer of the
application because of its always-on availability, high
performance and linear scalability. The application needed
a database with a fully dynamic schema to support high
volumes of rapidly changing semi-structured and
unstructured data being ingested from the variety of logs,
clickstreams and social networks. Operational efficiency
was also important. The bank wanted to automate
orchestrating the application flow and have deep visibility
into the database with proactive monitoring. MongoDB
Cloud Manager provided the required management
platform.
Conclusion
MongoDB natively provides a rich analytics framework
within the database. Multiple connectors are also available
to integrate Spark with MongoDB to enrich analytics
capabilities by enabling analysts to apply libraries for
machine learning, streaming and SQL to MongoDB data.
Together MongoDB and Apache Spark are already
enabling developers and data scientists to turn analytics
into real-time action.
We Can Help
We are the MongoDB experts. Over 2,000 organizations
rely on our commercial products, including startups and
more than a third of the Fortune 100. We offer software
and services to make your life easier:
MongoDB Enterprise Advanced is the best way to run
MongoDB in your data center. It’s a finely-tuned package
of advanced software, support, certifications, and other
services designed for the way you do business.
MongoDB Cloud Manager is the easiest way to run
MongoDB in the cloud. It makes MongoDB the system you
worry about the least and like managing the most.
Production Support helps keep your system up and
running and gives you peace of mind. MongoDB engineers
help you with production issues and any aspect of your
project.
Development Support helps you get up and running quickly.
It gives you a complete package of software and services
for the early stages of your project.