Sharding is a technique for partitioning and distributing data across multiple servers to enable scaling to large data volumes and workloads. It involves defining a shard key to partition data into chunks that are distributed across shards. The document discusses different types of sharding strategies like range, hash, and tag-aware sharding and how they apply to different use cases around scale, geo-distribution, and hardware optimization. It also covers best practices for building a sharded cluster like pre-splitting data, capacity planning, and using tools like MongoDB Management Service for production operations.
2. Agenda
Overview
• What is sharding?
• Why and what should I use sharding for?
Building your First Sharded Cluster
• What do I need to know to succeed with sharding?
Q&A
3. What is Sharding?

Sharding is a means of partitioning data across servers to enable:

• Scale, needed by modern applications to support massive workloads and data volumes.
• Geo-locality, to support geographically distributed deployments and optimal UX for customers across vast geographies.
• Hardware optimizations, trading off performance vs. cost.
• Lower recovery times, to make Recovery Time Objectives (RTO) feasible.
4. What is Sharding?

Sharding involves a shard key, defined by a data modeler, that describes the partition space of a data set.

MongoDB provides 3 types of sharding strategies:
• Ranged
• Hashed
• Tag-aware

Data is partitioned into chunks by the shard key, and these chunks are distributed evenly across shards that reside across many physical servers.
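The strategy is fixed when a collection is sharded. A minimal mongo shell sketch, with illustrative database, collection and field names (tag-aware tagging is shown on the following slides):

    // The database must have sharding enabled before any of its
    // collections can be sharded.
    sh.enableSharding("mydb")

    // Ranged sharding: chunks cover contiguous ranges of the key's values.
    sh.shardCollection("mydb.users", { uId: 1 })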
6. Hash Sharding

Hash sharding is a subset of range sharding.

[Diagram: adjacent hash chunks with boundaries at …3333/…3334, …8000/…8001, …AAAA/…AAAB, …DDDD/…DDDF]

MongoDB applies an MD5 hash on the key when a hash shard key is used:

Hash Shard Key(deviceId) = MD5(deviceId)

This ensures data is distributed randomly within the range of MD5 values.
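A hedged shell sketch of declaring a hashed shard key (namespace and field are illustrative):

    // Documents are placed by the hash of deviceId, spreading inserts
    // pseudo-randomly across the hashed-value range instead of
    // clustering them in deviceId order.
    sh.shardCollection("telemetry.devices", { deviceId: "hashed" })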
7. Tag-aware Sharding

Tag-aware sharding allows subsets of shards to be tagged and assigned to a sub-range of the shard key.

Example: sharding user data belonging to users from 100 "regions". Regions 1-50 are assigned to the West, and regions 51-100 to the East.

Collection: Users, Shard Key: {uId, regionCode}

Tag  | Start          | End
West | MinKey, MinKey | MaxKey, 50
East | MinKey, 50     | MaxKey, MaxKey

[Diagram: Shard1 (Tag=West), Shard2 (Tag=West), Shard3 (Tag=East), Shard4 (Tag=East), each a replica set with secondaries]
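A hedged sketch of the corresponding shell commands (shard names are illustrative; the sketch leads the shard key with regionCode, as the multi-active data center example later in the deck does, so region ranges map cleanly onto chunk ranges):

    // Tag shards by region.
    sh.addShardTag("shard0001", "West")
    sh.addShardTag("shard0002", "West")
    sh.addShardTag("shard0003", "East")
    sh.addShardTag("shard0004", "East")

    // Pin regions 1-50 to West and 51-100 to East. Tag-range upper
    // bounds are exclusive, so 51 keeps region 50 in the West.
    sh.addTagRange("mydb.users",
        { regionCode: MinKey, uId: MinKey },
        { regionCode: 51, uId: MinKey }, "West")
    sh.addTagRange("mydb.users",
        { regionCode: 51, uId: MinKey },
        { regionCode: MaxKey, uId: MaxKey }, "East")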
8. Applying Sharding

Usage                 | Required Strategy
Scale                 | Range or Hash
Geo-Locality          | Tag-aware
Hardware Optimization | Tag-aware
Lower Recovery Times  | Range or Hash
10. Typical Small Deployment

A replica set is highly available but not scalable.

• Writes: limited by the capacity of the primary's host.
• Reads, when immediate consistency matters: limited by the capacity of the primary's host.
• Reads, when eventual consistency is acceptable: limited by the capacity of the available replica set members.
11. Sharded Architecture

• Horizontal scalability: load is distributed and resources are pooled across commodity servers, increasing read/write capacity.
• Auto-balancing: data is partitioned based on a shard key and automatically balanced across shards by MongoDB.
• Query routing: database operations are transparently routed across the cluster through a routing proxy process (software).
• Decoupling for development and operational simplicity: Ops can add capacity without app dependencies and Dev involvement.
12. Value of Scale-out Architecture

[Chart: System Capacity vs. $. Scaling up hits limits at low capacity per dollar ("Apps of the Past"); the modern-app trend is to scale out to 100s-1000s of servers, optimize on capacity/$, and scale down to re-allocate resources.]
13. Sharding for Geo-Locality

Adobe Cloud Services, among other popular consumer and enterprise services, uses sharding to run servers across multiple data centers across geographies. Network latency from West to East is ~80 ms, and latency is costly:

• Amazon: every 1/10-second delay resulted in a 1% loss of sales.
• Google: a half-second delay caused a 20% drop in traffic.
• Aberdeen Group: a 1-second delay in page-load time means 11% fewer page views, a 16% decrease in customer satisfaction, and a 7% loss in conversions.
14. Multi-Active DCs via Tag-aware Sharding

Collection: Users, Shard Key: {regionCode, uId}

Tag  | Start          | End
West | MinKey, MinKey | MaxKey, 50
East | MinKey, 50     | MaxKey, MaxKey

[Diagram: shards tagged West keep their primaries (Priority=10) in the WEST data center with secondaries (Priority=5) in EAST, and vice versa for East-tagged shards. Updates route to the owning shard's primary; queries are served locally (e.g. Read Preference = Nearest).]
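In the shell, the diagram's local reads can be sketched as follows (collection name is illustrative; drivers expose the same read-preference option):

    // Route reads to the nearest replica set member; writes still go to
    // the primary of the shard owning the key range.
    db.getMongo().setReadPref("nearest")
    db.users.find({ regionCode: 42, uId: 1234 })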
15. Optimizing Latency and Cost

Magnitudes of difference in speed:

Event      | Latency | Normalized to 1 s
RAM access | 120 ns  | 6 min
SSD access | 150 µs  | 6 days
HDD access | 10 ms   | 12 months

Magnitudes of difference in cost:

Storage Type | Avg. Cost ($/GB) | Cost at 100 TB ($)
RAM          | 5.50             | 550K
SSD          | 0.50-1.00        | 50K to 100K
HDD          | 0.03             | 3K
16. Optimizing Latency and Cost

Use case: sensor data collected from millions of devices. Data is used for real-time decision automation, real-time monitoring, and historical reporting.

Data Type               | Description                                         | Latency SLA             | Data Volume
Meta Data               | Fast look-ups to drive real-time decisions         | 95th percentile < 1 ms  | < 1 TB
Last 90 days of Metrics | 95+% of data reported and monitored                 | 95th percentile < 30 ms | < 10 TB
Historic                | Used for historic reporting; accessed infrequently | 95th percentile < 2 s   | > 100 TB
17. Hardware Optimizations

Collection | Shard Key
Meta       | DeviceId
Metrics    | DeviceId, Timestamp

Collection | Tag     | Start                | End
Meta       | Cache   | MinKey               | MaxKey
Metrics    | Flash   | MinKey, MinKey       | MaxKey, 90 days ago
Metrics    | Archive | MinKey, >90 days ago | MaxKey, MaxKey

Hardware tiers:
• Tag: Cache — high memory ratio, fast cores, HDD
• Tag: Flash — medium memory ratio, high compute, SSDs
• Tag: Archive — low memory ratio, medium compute, HDD
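A hedged sketch of the tier tagging (shard names are illustrative). One deliberate deviation: for the 90-day split to align with chunk ranges, the timestamp must lead the Metrics shard key, so this sketch assumes { ts: 1, deviceId: 1 } rather than the table's { DeviceId, Timestamp }:

    // One shard per tier shown for brevity.
    sh.addShardTag("shard-cache01", "Cache")      // high memory ratio, fast cores
    sh.addShardTag("shard-flash01", "Flash")      // SSD-backed
    sh.addShardTag("shard-archive01", "Archive")  // HDD-backed

    // The Meta collection lives entirely on the Cache tier.
    sh.addTagRange("iot.meta",
        { deviceId: MinKey }, { deviceId: MaxKey }, "Cache")

    // Metrics: last ~90 days on Flash, everything older on Archive.
    var cutoff = new Date(Date.now() - 90 * 24 * 3600 * 1000)
    sh.addTagRange("iot.metrics",
        { ts: MinKey, deviceId: MinKey },
        { ts: cutoff, deviceId: MinKey }, "Archive")
    sh.addTagRange("iot.metrics",
        { ts: cutoff, deviceId: MinKey },
        { ts: MaxKey, deviceId: MaxKey }, "Flash")
    // The cutoff is static once declared; an operational job has to
    // re-issue these ranges as the 90-day window advances.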
18. Restoration Times

Scenario: an application bug causes logical corruption of the data, and the database needs to be rolled back to a previous point in time (PIT). What RTO does your business require in this event?

Total DB snapshot size = 100 TB:
• N = 10 shards: 10 x 10 TB snapshots generated and/or transferred in parallel.
• N = 100 shards: 100 x 1 TB snapshots generated and/or transferred in parallel — potentially a 10x faster restoration time.

[Diagram: per-shard tar.gz snapshot archives restored in parallel]
20. Life-cycle of Sharding

Design & Development → Test/QA → Pre-Production → Production

Product definition: starts with an idea to build something big!

Predictive Maintenance Platform: a cloud platform for building predictive maintenance applications and services — such as a service that monitors various vehicle components by collecting data from sensors and automatically prescribes actions to take.

• Allow tenants to register, ingest and modify data collected by sensors
• Define and apply workflows and business rules
• Publish and subscribe to notifications
• Data access API for things like reporting
21. Life-cycle of Sharding

Design & Development → Test/QA → Pre-Production → Production

Data modeling: do I need to shard?

• Throughput: data from millions of sensors, updated in real time.
• Latency: the values of certain attributes need to be accessed with 95th percentile < 10 ms to support real-time decisions and automation.
• Volume: 1 TB of data collected per day, retained for 5 years.
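Back-of-envelope arithmetic from these requirements (raw volume only, before indexes, replication or compression; purely illustrative):

    var tbPerDay = 1, retentionYears = 5
    var totalTB = tbPerDay * 365 * retentionYears   // 1825 TB, ~1.8 PB raw
    print(totalTB + " TB retained -- far beyond a single server, so shard")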
22. Life-cycle of Sharding

Design & Development → Test/QA → Pre-Production → Production

Data modeling: select a good shard key. This is a critical step:

• Sharding is only as effective as the shard key.
• Shard key attributes are immutable.
• Re-sharding is non-trivial: it requires re-partitioning the data.
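A hedged example for the sensor workload described earlier (namespace is illustrative): a compound key keeps per-device queries targeted while preserving time locality.

    sh.shardCollection("iot.metrics", { deviceId: 1, ts: 1 })

    // Because shard key fields are immutable and re-sharding means
    // re-partitioning, verify the resulting spread early:
    db.metrics.getShardDistribution()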
31. Index Locality

[Diagram: a random-access index (Key = Hash(DeviceId)) scatters Devices 1…N across the whole MD5 range, so the working set spans the entire index. A right-balanced index (Key = DeviceId, Timestamp) clusters recent entries (day 0, day -1, day -2, …) at its right edge.]

A right-balanced index may only need to be partially in RAM to be effective.
33. Life-cycle of Sharding

Design & Development → Test/QA → Pre-Production → Production

Performance testing: avoid pitfalls. Splits happen on demand as chunks grow, and migrations happen when an imbalance is detected — so a naive load test can conclude that "sharding results in massive performance degradation!"

Best practices:
• Pre-split:
  1. Hash shard key: specify numInitialChunks (http://docs.mongodb.org/manual/reference/command/shardCollection/)
  2. Custom shard key: create a pre-split script (http://docs.mongodb.org/manual/tutorial/create-chunks-in-sharded-cluster/)
• Run mongos (the query router) on the app server if possible.
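A hedged pre-split sketch covering both cases (namespaces and counts are illustrative):

    // Hash shard key: request initial chunks at shardCollection time.
    db.adminCommand({
        shardCollection: "iot.metrics_hashed",
        key: { deviceId: "hashed" },
        numInitialChunks: 1024
    })

    // Custom shard key: declare split points before loading data,
    // e.g. one chunk per 1,000 device ids.
    for (var id = 1000; id < 100000; id += 1000) {
        sh.splitAt("iot.metrics", { deviceId: id, ts: MinKey })
    }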
34. Life-cycle of Sharding

Design & Development → Test/QA → Pre-Production → Production

Capacity planning: how many shards do I need?

Sizing:
• What are the total resources required for your initial deployment?
• What are the ideal hardware specs, and the number of shards necessary?

Capacity planning: create a model to scale MongoDB for a specific app.
• How do I determine when more shards need to be added?
• How much capacity do I gain from adding a shard?
35. How Many Servers?

Strategy                 | Accuracy                                                         | Level of Effort | Feasibility of Early Project Analysis
Domain Expert            | High to Low: inversely related to complexity of the application | Low             | Yes
Empirical (Load Testing) | High                                                             | High            | Unlikely
36. Domain Expert

Business Solution Analysis → Model and Load Definition → Resource Analysis → Hardware Specification

Model and load definition:
• What is the document model? Collections, documents, indexes.
• What are the major operations? Throughput and latency.
• What is the working set? E.g. the last 90 days of orders.

Normally performed by a MongoDB Solutions Architect: http://bit.ly/1rkXcfN
37. Domain Expert

Business Solution Analysis → Model and Load Definition → Resource Analysis → Hardware Specification

Resource | Methodology
RAM      | Standard: working set + indexes. Adjust up or down depending on latency vs. cost requirements. Very large clusters should account for connection pooling/thread overhead (1 MB per active thread).
IOPs     | Primarily based on throughput requirements: writes plus an estimate of query page faults. Assume random IO; account for replication, journal and log (note: sequential IO). Ideally estimated empirically through prototype testing; experts can use experience from similar applications as an estimate. Spot testing may be needed.
Storage  | Estimate using throughput, document and index size approximations, and retention requirements. Account for overhead like fragmentation if applicable.
CPU      | Rarely the bottleneck; far less CPU-intensive than RDBMSs. Current commodity CPU specs will suffice.
Network  | Estimate using throughput and document size approximations.
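As a worked example of the RAM row (all numbers are assumptions, not from the deck):

    var workingSetGB = 200, indexesGB = 50, activeThreads = 2000
    var threadOverheadGB = activeThreads * 1 / 1024   // ~1 MB per active thread
    var ramGB = workingSetGB + indexesGB + threadOverheadGB
    print("Approximate cluster RAM target: " + ramGB.toFixed(0) + " GB")  // ~252 GB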
38. Sizing by Empirical Testing

• Sizing can be obtained more accurately by prototyping your application and performing load tests on selected hardware.
• Capacity planning can be accomplished simultaneously through load testing.
• Past webinars: http://www.mongodb.com/presentations/webinar-capacity-planning

Strategy:
1. Implement a prototype that can at least simulate the major workloads.
2. Select an economical server that you plan to scale out on.
3. Saturate a single replica set or shard (maintaining latency SLAs as needed). Address bottlenecks, optimize, and repeat.
4. Add an additional shard (as well as mongos instances and clients as needed). Saturate and confirm roughly linear scaling.
5. Repeat step 4 until you are able to model capacity gains (throughput + latency) versus the number of physical servers.
39. Operational Scale

Design & Development → Test/QA → Pre-Production → Production

Business-critical operations: how do I manage 100s to 1000s of nodes?

MongoDB Management Service (MMS): https://mms.mongodb.com
• Real-time monitoring and visualization of cluster health
• Alerting
• Automated cluster provisioning
• Automation of daily operational tasks, like no-downtime upgrades
• Centralized configuration management
• Automated PIT snapshotting of clusters
• PITR support for sharded clusters
42. Get Expert Advice on Scaling. For Free.

For a limited time, if you're considering a commercial relationship with MongoDB, you can sign up for a free one-hour consult about scaling with one of our MongoDB engineers. Sign up: http://bit.ly/1rkXcfN
EA Sports FIFA: the world's best-selling sports video game franchise. User data and game state for millions of players.

Yandex: the largest search engine in Russia uses MongoDB to manage all user data and metadata for its file-sharing service. MongoDB has scaled to support tens of billions of objects and TBs of data, growing at 10 million new file uploads per day.

eBay

Foursquare: Foursquare is used by over 50 million people worldwide, who have checked in over 6 billion times, with millions more added every day. MongoDB is Foursquare's main database, supporting hundreds of thousands of operations per second and storing all check-ins and history, user and venue data, along with reviews.

AHL: a part of Man Group plc, AHL is a quantitative investment manager based in London and Hong Kong, with over $11.3 billion in assets under management. After evaluating multiple technology options, AHL used MongoDB to replace its relational and specialised "tick" databases. MongoDB supports 250 million ticks per second, at 40x lower cost than the legacy technologies it replaced.

Adobe: many of the world's most recognizable brands use Adobe Experience Manager to accelerate development of digital experiences that increase customer loyalty, engagement and demand. Adobe uses MongoDB to store the petabytes of data in the large-scale content repositories underpinning Experience Manager. MongoDB MMS Backup: 2 PB.

McAfee: MongoDB powers McAfee Global Threat Intelligence (GTI), a cloud-based intelligence service that correlates data from millions of sensors around the globe. Billions of documents are stored and analyzed in MongoDB to deliver real-time threat intelligence to other McAfee end-client products.

Carfax: CARFAX relies on its Vehicle History database to connect potential buyers with used vehicles in their area, and for analytics to guide the business. To improve customer experience, CARFAX migrated to MongoDB, which now manages over 13 billion documents, before replication across multiple data centers.

Limited capacity: a monolithic architecture, siloed data access, or complex application-level sharding that is difficult and expensive to implement successfully.

Scale-down: systems with seasonal load (e.g. online games, popular from the start and fading over time). Sharding enables scaling down to re-allocate resources on bare-metal servers; you can't do the same with a database appliance.

A hash key isn't always a good key. It can be good for the most simplistic key-value access patterns, but any query retrieving multiple documents, such as range-based queries, can lead to scatter-gather queries, which have high overhead and impair the ability to scale linearly.