Media owners are turning to MongoDB to drive social interaction with their published content. The way customers consume information has changed, and passive communication is no longer enough. They want to comment, share and engage with publishers and their community through a range of media types and via multiple channels, whenever and wherever they are. There are serious challenges in taking this semi-structured and unstructured data and making it work in a traditional relational database. This webinar looks at how MongoDB's schemaless design and document orientation gives organisations like the Guardian the flexibility to aggregate social content and scale out.
CouchDB is a document-oriented database that uses JSON documents, has a RESTful HTTP API, and is queried using map/reduce views. Each of these properties alone, especially MapReduce views, may seem foreign to developers more familiar with relational databases. This tutorial will teach web developers the concepts they need to get started using CouchDB in their projects. CouchDB’s RESTful HTTP API makes it suitable for interfacing with any programming language. CouchDB libraries are available for many programming languages and we will take a look at some of the more popular ones.
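Since map/reduce views are the least familiar of these properties, a toy illustration helps. The sketch below emulates, in plain Python, what CouchDB does when it builds a view: run a map function over every document, collect the emitted key/value rows in key order, then optionally reduce them. Real CouchDB map functions are JavaScript executed by the server, and the document shape here is hypothetical.

```python
# Plain-Python emulation of CouchDB's view model. Real map functions are
# JavaScript run by the server; these documents are hypothetical.

def map_by_author(doc):
    """Like a CouchDB map function: emit zero or more (key, value) rows."""
    if "author" in doc:
        yield (doc["author"], 1)

def build_view(docs, map_fn):
    """A view is the collection of emitted rows, sorted by key."""
    rows = [{"key": k, "value": v} for doc in docs for k, v in map_fn(doc)]
    return sorted(rows, key=lambda r: r["key"])

def reduce_count(rows):
    """Like CouchDB's built-in _count reduce function."""
    return sum(r["value"] for r in rows)

docs = [
    {"_id": "a", "author": "alice"},
    {"_id": "b", "author": "bob"},
    {"_id": "c", "author": "alice"},
]
view = build_view(docs, map_by_author)
print([r["key"] for r in view])  # ['alice', 'alice', 'bob']
print(reduce_count(view))        # 3
```

Querying a real view over the RESTful API is then just an HTTP GET against the view's URL with optional `key`/`startkey` parameters.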
Intro to MongoDB
Get a jumpstart on MongoDB, use cases, and next steps for building your first app with Buzz Moschetti, MongoDB Enterprise Architect.
@BuzzMoschetti
MongoDB Days Silicon Valley: Introducing MongoDB 3.2 (MongoDB)
Presented by:
Eliot Horowitz, CTO and Co-Founder, MongoDB
Richard Kreuter, VP of Professional Services, MongoDB
Andrew Erlichson, VP of Engineering, Developer Experience, MongoDB
IBM Cloudant is a NoSQL Database-as-a-Service. Discover how you can outsource the data layer of your mobile or web application to Cloudant to provide high availability, scalability and tools to take you to the next level.
MongoDB Days Silicon Valley: Jumpstart: Ops/Admin 101 (MongoDB)
Presented by Achille Brighton, Principal Consulting Engineer, MongoDB
Experience level: Introductory
New to MongoDB? We'll provide an overview of installation, high availability through replication, scale out through sharding, and options for monitoring and backup. No prior knowledge of MongoDB is assumed. This session will jumpstart your knowledge of MongoDB operations, providing you with context for the rest of the day's content.
Socialite, the Open Source Status Feed Part 2: Managing the Social Graph (MongoDB)
There are many possible approaches to storing and querying relationships between users in social networks. This section will dive into the details of storing a social user graph in MongoDB. It will cover the various schema designs for storing the follower networks of users and propose an optimal design for insert and query performance, as well as looking at performance differences between them.
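To make the design space concrete, here is a minimal sketch of the two classic options for a follower network: embedding follower ids in the user document, versus storing one edge document per relationship. Plain Python dicts stand in for MongoDB documents, and the field names are illustrative, not Socialite's actual schema.

```python
# Two schema options for a follower graph, with dicts standing in for
# MongoDB documents. Field names are illustrative.

# Option A: embed follower ids in the user document. One read gets
# everything, but the array grows without bound for popular users.
user_embedded = {"_id": "u1", "name": "ann", "followers": ["u2", "u3"]}

# Option B: one edge document per follow. Inserts are cheap and bounded;
# answering "who follows X?" needs an index on the 'to' field.
edges = [
    {"from": "u2", "to": "u1"},
    {"from": "u3", "to": "u1"},
    {"from": "u1", "to": "u3"},
]

def followers_of(user_id, edges):
    return sorted(e["from"] for e in edges if e["to"] == user_id)

def following_of(user_id, edges):
    return sorted(e["to"] for e in edges if e["from"] == user_id)

print(followers_of("u1", edges))  # ['u2', 'u3']
print(following_of("u1", edges))  # ['u3']
```

The edge-per-document design favors insert throughput; the embedded design favors read latency, which is the performance trade-off the session compares.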
How Thermo Fisher Is Reducing Mass Spectrometry Experiment Times from Days to... (MongoDB)
Mass spectrometry is the gold standard for determining chemical compositions, with spectrometers often measuring the mass of a compound down to a single electron. This level of granularity produces an enormous amount of hierarchical data that doesn't fit well into rows and columns. In this talk, learn how Thermo Fisher is using MongoDB Atlas on AWS to allow their users to get near real-time insights from mass spectrometry experiments—a process that used to take days. We also share how the underlying database service used by Thermo Fisher was built on AWS.
Agenda:
MongoDB Overview/History
Workshop
1. How to perform operations in MongoDB – Workshop
2. Using MongoDB in your Java application
Advanced usage of MongoDB
1. Performance measurement comparison – real-life use cases
2. Cluster setup
3. Drawbacks of MongoDB compared with other document-oriented DBs
4. Map-reduce / aggregation overview
Workshop prerequisites
1. All participants must bring their laptops.
2. Example code: https://github.com/geek007/mongdb-examples
3. Software prerequisites
a. Java version 1.6+
b. Your favourite IDE; IntelliJ IDEA preferred (http://www.jetbrains.com/idea/download/)
c. MongoDB server version 2.6.3 (http://www.mongodb.org/downloads – 64-bit version)
d. Participants may also install a MongoDB client – http://robomongo.org/
About Speaker:
Akbar Gadhiya works at Ishi Systems as a Programmer Analyst. He previously worked with PMC, Baroda and HCL Technologies.
MongoDB and Hadoop: Driving Business Insights (MongoDB)
MongoDB and Hadoop can work together to solve big data problems facing today's enterprises. We will take an in-depth look at how the two technologies complement and enrich each other with complex analyses and greater intelligence. We will take a deep dive into the MongoDB Connector for Hadoop and how it can be applied to enable new business insights with MapReduce, Pig, and Hive, and demo a Spark application to drive product recommendations.
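Whatever the execution engine (MapReduce, Pig, Hive, or Spark), the processing the connector feeds boils down to a map phase, a shuffle, and a reduce phase over MongoDB documents. A pure-Python sketch of that shape, with hypothetical order documents:

```python
# Map/shuffle/reduce over (hypothetical) MongoDB order documents,
# in plain Python to show the shape of the computation.
from collections import defaultdict

orders = [
    {"customer": "c1", "product": "book", "qty": 2},
    {"customer": "c2", "product": "book", "qty": 1},
    {"customer": "c1", "product": "pen", "qty": 5},
]

def map_phase(doc):
    # Emit one (key, value) pair per document.
    yield (doc["product"], doc["qty"])

def shuffle(pairs):
    # Group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    return key, sum(values)

pairs = [kv for doc in orders for kv in map_phase(doc)]
totals = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(totals)  # {'book': 3, 'pen': 5}
```

The connector's job is to supply the map phase with document splits read directly from MongoDB and to write the reduced output back, instead of staging everything through HDFS.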
OUG Scotland 2014 - NoSQL and MySQL - The best of both worlds (Andrew Morgan)
Understand how you can get the benefits you're looking for from NoSQL data stores without sacrificing the power and flexibility of the world's most popular open source database - MySQL.
How Sitecore depends on MongoDB for scalability and performance, and what it... (Antonios Giannopoulos)
Percona Live 2017 - How Sitecore depends on MongoDB for scalability and performance, and what it can teach you, by Antonios Giannopoulos and Grant Killian
Webinar: Best Practices for Getting Started with MongoDB (MongoDB)
MongoDB adoption continues to grow at a record pace due to the significant enhancements in developer productivity and scalability that the database provides. Occasionally, however, organizations new to the technology make mistakes that limit their ability to leverage the significant advantages MongoDB provides. This webinar will discuss some of the common mistakes made by users when they first start working with MongoDB, how to identify when you've made those mistakes, and how to resolve them.
Elasticsearch: what is it, and how can I use it in my stack? I will explain how to set up a working environment with Elasticsearch. The slides are in English.
This is an assignment from a High Technology Entrepreneurship course: pick a new venture and analyse its business model, SWOT, and six-force model. That analysis went into this presentation.
These slides use concepts from my (Jeff Funk) course entitled Biz Models for Hi-Tech Products to analyze the business model for 23andMe's personal genomics service. 23andMe provides personal health care analysis and ancestry information to individuals, based on a partial analysis of an individual's DNA data. Driven by the falling cost of DNA sequencing, 23andMe provides this information for $99 and a small amount of saliva. While this business model may succeed, particularly if the FDA eventually approves the health care portion of the service, we recommend that 23andMe develop a new business model: offer the service for free to individuals, particularly those who frequent gyms and nutrition stores, and sell information about potential athletes to sports teams.
Big data solutions for advanced marketing analytics (Natalino Busa)
Our retail banking market demands now more than ever that we stay close to our customers, and that we carefully understand which services, products, and wishes are relevant for each customer at any given time. This sort of marketing research is often beyond the capacity of traditional BI reporting frameworks. In this talk, we illustrate how we team up data scientists and big data engineers in order to create and scale distributed analyses on a big data platform. By using Hadoop and open source statistical languages and tools such as R and Python, we can execute a variety of machine learning algorithms and scale them out on a distributed computing framework.
AWS Webcast - Managing Big Data in the AWS Cloud_20140924 (Amazon Web Services)
This presentation deck will cover specific services such as Amazon S3, Kinesis, Redshift, Elastic MapReduce, and DynamoDB, including their features and performance characteristics. It will also cover architectural designs for the optimal use of these services based on dimensions of your data source (structured or unstructured data, volume, item size and transfer rates) and application considerations - for latency, cost and durability. It will also share customer success stories and resources to help you get started.
Polyglot persistence for Java developers: time to move out of the relational ... (Chris Richardson)
Relational databases have long been considered the one true way to persist enterprise data. Even today, they are an excellent choice for many applications. But for some applications NoSQL databases are a viable alternative. They can simplify the persistence of complex data models and offer significantly better scalability, and performance. But using NoSQL databases is very different than the ACID/SQL/JDBC/JPA world that we have become accustomed to. They have different and unfamiliar APIs and a very different and usually limited transaction model. So what’s a Java developer to do?
AWS re:Invent 2016: ElastiCache Deep Dive: Best Practices and Usage Patterns ... (Amazon Web Services)
In this session, we provide a peek behind the scenes to learn about Amazon ElastiCache's design and architecture. See common design patterns with our Redis and Memcached offerings and how customers have used them for in-memory operations to reduce latency and improve application throughput. During this session, we review ElastiCache best practices, design patterns, and anti-patterns.
The millions of people that use Spotify each day generate a lot of data, roughly a few terabytes per day. What does it take to handle datasets of that scale, and what can be done with them? I will briefly cover how Spotify uses data to provide a better music listening experience and to strengthen its business. Most of the talk will be spent on our data processing architecture, and how we leverage state-of-the-art data processing and storage tools such as Hadoop, Cassandra, Kafka, Storm, Hive, and Crunch. Lastly, I'll present observations and thoughts on innovation in the data processing (aka Big Data) field.
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum and Amazon Athena to process and understand data.
Speakers:
Neel Mitra - Solutions Architect, AWS
Roger Dahlstrom - Solutions Architect, AWS
Big Data and New Challenges for DBAs (Michael Naumov, LivePerson)
Hadoop has become a popular platform for managing large datasets of structured and unstructured data. It does not replace existing infrastructures, but instead augments them. Most companies will still use relational databases for transactional processing and low-latency queries, but can benefit from Hadoop for reporting, machine learning or ETL. This session will cover:
What is Hadoop and why do I care?
What do people do with Hadoop?
How can SQL Server DBAs add Hadoop to their architecture?
This presentation was inspired by reading "Time Series Databases" by Ted Dunning & Ellen Friedman.
I have tried to summarize a lot of the previous benchmarks. Hope others find it useful. The slides were compiled in early 2015, so some of the results might have changed, but the core literature should still hold.
Data Analytics Week at the San Francisco Loft
Using Data Lakes
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum and Amazon Athena to process and understand data.
Speakers:
John Mallory - Principal Business Development Manager Storage (Object), AWS
Hemant Borole - Sr. Big Data Consultant, AWS
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum and Amazon Athena to process and understand data.
Level: Intermediate
Speakers:
Tony Nguyen - Senior Consultant, ProServe, AWS
Hannah Marlowe - Consultant - Federal, AWS
History of NoSQL and Azure DocumentDB feature set (Soner Altin)
Short history of database systems from DBMS, RDBMS to NoSQL solutions. Introduction to SQL query support of Azure DocumentDB and integrating DocumentDB with simple Java application from Maven repository.
AWS APAC Webinar Week - Big Data on AWS: Redshift, EMR, & IoT (Amazon Web Services)
The world is producing an ever-increasing volume, velocity, and variety of data, including data from devices. As we step into the era of the Internet of Things (IoT), batch analytics is no longer enough for many consumers; they need sub-second analysis on fast-moving data. AWS delivers many technologies for solving big data and IoT problems. But which services should you use, why, when, and how? In this webinar, we simplify big data processing as a pipeline comprising various stages: ingest, store, process, analyze and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, and durability. Finally, we provide a reference architecture, design patterns, and best practices for assembling these technologies to solve your big data problems.
This session is recommended for anyone interested in understanding how to use AWS big data services to develop real-time analytics applications. In this session, you will get an overview of a number of Amazon's big data and analytics services that enable you to build highly scaleable cloud applications that immediately and continuously analyze large sets of distributed data. We'll explain how services like Amazon Kinesis, EMR and Redshift can be used for data ingestion, processing and storage to enable real-time insights and analysis into customer, operational and machine generated data and log files. We'll explore system requirements, design considerations, and walk through a specific customer use case to illustrate the power of real-time insights on their business.
Listen to the keynote address and hear about the latest developments from Rachana Ananthakrishnan and Ian Foster who review the updates to the Globus Platform and Service, and the relevance of Globus to the scientific community as an automation platform to accelerate scientific discovery.
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient... (Mind IT Systems)
Healthcare providers often struggle with the complexities of chronic conditions and remote patient monitoring, as each patient requires personalized care and ongoing monitoring. Off-the-shelf solutions may not meet these diverse needs, leading to inefficiencies and gaps in care. This is where custom healthcare software offers a tailored solution, ensuring improved care and effectiveness.
First Steps with Globus Compute Multi-User Endpoints (Globus)
In this presentation we will share our experiences around getting started with the Globus Compute multi-user endpoint. Working with the Pharmacology group at the University of Auckland, we have previously written an application using Globus Compute that can offload computationally expensive steps in the researchers' workflows, which they wish to manage from their familiar Windows environments, onto the NeSI (New Zealand eScience Infrastructure) cluster. Among the challenges we encountered were that each researcher had to set up and manage their own single-user Globus Compute endpoint, and that the workloads had varying resource requirements (CPUs, memory and wall time) between different runs. We hope that the multi-user endpoint will help to address these challenges, and we share an update on our progress here.
Globus Connect Server Deep Dive - GlobusWorld 2024 (Globus)
We explore the Globus Connect Server (GCS) architecture and experiment with advanced configuration options and use cases. This content is targeted at system administrators who are familiar with GCS and currently operate—or are planning to operate—broader deployments at their institution.
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc. (Juraj Vysvader)
In 2015, I wrote extensions for Joomla, WordPress, phpBB3, and others. I didn't get rich from it, but they reached 63K downloads (possibly powering tens of thousands of websites).
Globus Compute with IRI Workflows - GlobusWorld 2024 (Globus)
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of this work, the team is investigating ways to speed up the time to solution for many different parts of the DIII-D workflow, including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks, and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
In the ever-evolving landscape of technology, enterprise software development is undergoing a significant transformation. Traditional coding methods are being challenged by innovative no-code solutions, which promise to streamline and democratize the software development process.
This shift is particularly impactful for enterprises, which require robust, scalable, and efficient software to manage their operations. In this article, we will explore the various facets of enterprise software development with no-code solutions, examining their benefits, challenges, and the future potential they hold.
Large Language Models and the End of Programming (Matt Welsh)
Talk by Matt Welsh at Craft Conference 2024 on the impact that Large Language Models will have on the future of software development. In this talk, I discuss the ways in which LLMs will impact the software industry, from replacing human software developers with AI, to replacing conventional software with models that perform reasoning, computation, and problem-solving.
Enhancing Research Orchestration Capabilities at ORNL (Globus)
Cross-facility research orchestration comes with ever-changing constraints regarding the availability and suitability of various compute and data resources. In short, a flexible data and processing fabric is needed to enable the dynamic redirection of data and compute tasks throughout the lifecycle of an experiment. In this talk, we illustrate how we easily leveraged Globus services to instrument the ACE research testbed at the Oak Ridge Leadership Computing Facility with flexible data and task orchestration capabilities.
Top Features to Include in Your Winzo Clone App for Business Growth (rickgrimesss22)
Discover the essential features to incorporate in your Winzo clone app to boost business growth, enhance user engagement, and drive revenue. Learn how to create a compelling gaming experience that stands out in the competitive market.
Software Engineering, Software Consulting, Tech Lead.
Spring Boot, Spring Cloud, Spring Core, Spring JDBC, Spring Security,
Spring Transaction, Spring MVC,
Log4j, REST/SOAP WEB-SERVICES.
Developing Distributed High-performance Computing Capabilities of an Open Sci... (Globus)
COVID-19 had an unprecedented impact on scientific collaboration. The pandemic and its broad response from the scientific community has forged new relationships among public health practitioners, mathematical modelers, and scientific computing specialists, while revealing critical gaps in exploiting advanced computing systems to support urgent decision making. Informed by our team’s work in applying high-performance computing in support of public health decision makers during the COVID-19 pandemic, we present how Globus technologies are enabling the development of an open science platform for robust epidemic analysis, with the goal of collaborative, secure, distributed, on-demand, and fast time-to-solution analyses to support public health.
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart... (Globus)
The Earth System Grid Federation (ESGF) is a global network of data servers that archives and distributes the planet’s largest collection of Earth system model output for thousands of climate and environmental scientists worldwide. Many of these petabyte-scale data archives are located in proximity to large high-performance computing (HPC) or cloud computing resources, but the primary workflow for data users consists of transferring data, and applying computations on a different system. As a part of the ESGF 2.0 US project (funded by the United States Department of Energy Office of Science), we developed pre-defined data workflows, which can be run on-demand, capable of applying many data reduction and data analysis to the large ESGF data archives, transferring only the resultant analysis (ex. visualizations, smaller data files). In this talk, we will showcase a few of these workflows, highlighting how Globus Flows can be used for petabyte-scale climate analysis.
May Marketo Masterclass, London MUG May 22 2024 (Adele Miller)
Can't make Adobe Summit in Vegas? No sweat because the EMEA Marketo Engage Champions are coming to London to share their Summit sessions, insights and more!
This is a MUG with a twist you don't want to miss.
Code reviews are vital for ensuring good code quality. They serve as one of our last lines of defense against bugs and subpar code reaching production.
Yet, they often turn into annoying tasks riddled with frustration, hostility, unclear feedback and lack of standards. How can we improve this crucial process?
In this session we will cover:
- The Art of Effective Code Reviews
- Streamlining the Review Process
- Elevating Reviews with Automated Tools
By the end of this presentation, you'll have the knowledge to organize and improve your code review process.
How Recreation Management Software Can Streamline Your Operations (wottaspaceseo)
Recreation management software streamlines operations by automating key tasks such as scheduling, registration, and payment processing, reducing manual workload and errors. It provides centralized management of facilities, classes, and events, ensuring efficient resource allocation and facility usage. The software offers user-friendly online portals for easy access to bookings and program information, enhancing customer experience. Real-time reporting and data analytics deliver insights into attendance and preferences, aiding in strategic decision-making. Additionally, effective communication tools keep participants and staff informed with timely updates. Overall, recreation management software enhances efficiency, improves service delivery, and boosts customer satisfaction.
Top 7 Unique WhatsApp API Benefits | Saudi Arabia (Yara Milbes)
Discover the transformative power of the WhatsApp API in our latest SlideShare presentation, "Top 7 Unique WhatsApp API Benefits." In today's fast-paced digital era, effective communication is crucial for both personal and professional success. Whether you're a small business looking to enhance customer interactions or an individual seeking seamless communication with loved ones, the WhatsApp API offers robust capabilities that can significantly elevate your experience.
In this presentation, we delve into the top 7 distinctive benefits of the WhatsApp API, provided by the leading WhatsApp API service provider in Saudi Arabia. Learn how to streamline customer support, automate notifications, leverage rich media messaging, run scalable marketing campaigns, integrate secure payments, synchronize with CRM systems, and ensure enhanced security and privacy.
4. Personal profiles in family trees
Family trees: a complex network of people, each with personal info, life events, and connections to relatives.
6. Personal profiles in family trees – Sharding MySQL
Family trees: a complex network of people, each with personal info, life events, and connections to relatives. Many interconnected MySQL tables. Millions of daily updates.
[Diagram: per-site MySQL schema, repeated for each family site, with tables Event, Individual, Child, InFamily, Family, Tags, Photos.]
7. Personal profiles in family trees – Sharding MySQL
Family trees: a complex network of people, each with personal info, life events, and connections to relatives. Many interconnected MySQL tables. Millions of daily updates.
Good response time for single-family-site access, using MySQL database sharding. Over 650 shards, on >20 physical hosts, growing.
[Diagram: family sites distributed across shards 1–650 on partitioned hosts, each shard holding the tables Event, Individual, Child, InFamily, Family, Tags, Photos.]
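Routing in such a setup can be as simple as two deterministic lookups: family site to shard, then shard to physical host. The sketch below mirrors the numbers on the slide (650 shards, 20+ hosts), but the modulo-based mapping and host names are illustrative assumptions, not the production scheme.

```python
# Two deterministic lookups route a family-site request to its data.
# Shard/host counts mirror the slide; the modulo mapping is illustrative.

NUM_SHARDS = 650
NUM_HOSTS = 20

def shard_for_site(site_id: int) -> int:
    """Assign a family site to one of the 650 shards."""
    return site_id % NUM_SHARDS

def host_for_shard(shard: int) -> str:
    """Pack the shards onto the physical hosts."""
    return f"db-host-{shard % NUM_HOSTS}"

shard = shard_for_site(123456)
print(shard, host_for_shard(shard))  # 606 db-host-6
```

In practice a lookup table is often used instead of a pure modulo, so shards can be rebalanced as the host count grows.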
8. The issue with RDBMS sharding
Problematic when multiple shards are needed at once, for example to display search results and profile matches coming from many family trees. Costly to scale for more readers.
Options:
• Build a custom parallel-fetch aggregator service
• NoSQL
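The first option, a custom parallel-fetch aggregator, can be sketched with a thread pool that queries every relevant shard concurrently and merges the rows. Here `fetch_from_shard` and the in-memory `FAKE_SHARDS` data are hypothetical stand-ins for real per-shard MySQL queries:

```python
# Scatter-gather aggregator: query all shards in parallel, merge results.
# fetch_from_shard stands in for a real per-shard MySQL query.
from concurrent.futures import ThreadPoolExecutor

FAKE_SHARDS = {
    0: [{"name": "Ann Smith"}],
    1: [{"name": "Anna Schmidt"}, {"name": "Annie Smith"}],
    2: [],
}

def fetch_from_shard(shard_id, query):
    # Stand-in for running `query` against the MySQL shard `shard_id`.
    return [row for row in FAKE_SHARDS[shard_id] if query in row["name"]]

def aggregate_search(query, shard_ids):
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(fetch_from_shard, s, query) for s in shard_ids]
        results = []
        for f in futures:
            results.extend(f.result())
    return sorted(r["name"] for r in results)

print(aggregate_search("Ann", [0, 1, 2]))
```

The latency is bounded by the slowest shard rather than the sum of all of them, but every extra reader still multiplies load on every shard, which is the scaling cost the slide points at.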
9. Cassandra to the rescue
Cassandra recap:
• Key-value store
• Ring-based consistent hashing cluster
• Support for clusters split between data centers
• Data redundancy and consistency at a user-controlled level
• Append-only high write throughput
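The ring-based consistent hashing in the recap can be sketched in a few lines: each node owns a point on a hash ring, a key belongs to the first node at or after its hash, and replicas are the next nodes clockwise. Real Cassandra uses Murmur3 tokens and virtual nodes; the md5-based ring below is only an illustration.

```python
# Minimal consistent-hash ring: nodes own points on the ring, a key
# belongs to the first node at or after its hash, and replicas follow
# clockwise. md5 stands in for Cassandra's Murmur3 partitioner.
import bisect
import hashlib

def ring_hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.points = sorted((ring_hash(n), n) for n in nodes)

    def replicas_for(self, key: str, rf: int):
        """The rf nodes that hold `key`, walking clockwise on the ring."""
        i = bisect.bisect(self.points, (ring_hash(key), "")) % len(self.points)
        return [self.points[(i + k) % len(self.points)][1] for k in range(rf)]

    def node_for(self, key: str) -> str:
        return self.replicas_for(key, 1)[0]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.node_for("site:42"), ring.replicas_for("site:42", 2))
```

The replication factor in `replicas_for` is what gives the user-controlled redundancy the recap mentions: the same key is written to rf consecutive nodes on the ring.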
11. PeopleStore: Overview
• Store 2.6 billion profiles (and growing by over a million a day)
• Provide very fast read access
• Shadows the MySQL source of truth (at least for the foreseeable future)
• Data consistency is critical
• Store each person as one aggregated record in Cassandra, including ALL info for typical uses, to minimize nested/follow-up queries: get all information needed at once
• Decision point: replicate relatives, or point to their record?
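The "one aggregated record" idea can be illustrated by collapsing normalized per-person rows into a single document whose plural parts are JSON text blobs, so one read serves a typical page view. The field names here are illustrative, not the production schema.

```python
# Collapse normalized per-person rows into one aggregated record whose
# plural parts are JSON text blobs. Field names are hypothetical.
import json

events = [
    {"individual_id": 7, "type": "birth", "year": 1902},
    {"individual_id": 7, "type": "marriage", "year": 1925},
]
relatives = [
    {"individual_id": 7, "rel": "spouse", "to_id": 9, "name": "May"},
]

def aggregate_person(individual_id, name, events, relatives):
    return {
        "individual_id": individual_id,
        "name": name,
        # Plural data serialized as JSON text so the record stays flat.
        "events": json.dumps(
            [e for e in events if e["individual_id"] == individual_id]),
        # Only minimal relative info (id + name) is replicated; full
        # relative data would take a follow-up fetch of that record.
        "relatives": json.dumps(
            [{"id": r["to_id"], "name": r["name"]}
             for r in relatives if r["individual_id"] == individual_id]),
    }

record = aggregate_person(7, "John Doe", events, relatives)
print(json.loads(record["events"])[0]["type"])  # birth
```

The relatives field illustrates one side of the decision point: pointing to the relative's record keeps writes cheap, at the cost of an extra read for full relative data.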
13. PeopleStore: Architecture
[Diagram: Web servers (PHP) issue synchronous updates to the highly sharded MySQL source of truth, and multi-item fetches to the PeopleStore microservices fronting the Cassandra cluster; a Hadoop cluster performs mass loading (batch first load / reload) alongside the online flows.]
14. PeopleStore: Schema
CREATE TABLE peoplestore.people (
    site_id int,
    tree_id int,
    individual_id int,
    adopted_child_in_family_id int,
    child_in_family_id int,
    foster_child_in_family_id int,
    gender text,
    is_alive boolean,
    privacy_level int,
    last_update int,
    loading_mode int,
    loading_time timestamp,
    thumbnail text,
    name text,
    events text,
    photos text,
    relatives text,
    PRIMARY KEY (site_id, tree_id, individual_id)
) WITH …
    compaction = {'class': '...LeveledCompactionStrategy'};
6 hosts, RF=3. The columns fall into three groups: IDs, metadata, and JSON blobs.
• JSON: flexibility of structure (plain text columns, not the 2.2 JSON support)
• Split fields: flexibility to fetch only the fields needed
• Not using a Collection for plural fields, due to a Cassandra limitation on using an IN clause on a table with Collection fields (a non-issue for us)
• Future: use User Defined Types
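The split-field design above (scalar metadata columns plus plural fields serialized as JSON text) can be sketched like this. The function name and serialization details are illustrative; only the column names come from the schema.

```python
import json

def build_person_row(site_id, tree_id, individual_id, name,
                     events, photos, relatives):
    """Assemble one aggregated person record: scalar metadata columns
    plus plural fields (events, photos, relatives) serialized into
    separate JSON text blobs, so each can be fetched independently.
    Hypothetical helper; the real loader's code is not shown in the deck."""
    return {
        "site_id": site_id,
        "tree_id": tree_id,
        "individual_id": individual_id,
        "name": name,
        # Plural fields as JSON text columns, not Cassandra Collections.
        "events": json.dumps(events),
        "photos": json.dumps(photos),
        "relatives": json.dumps(relatives),
    }

row = build_person_row(
    1, 2, 3, "Ada",
    events=[{"type": "birth", "year": 1815}],
    photos=[],
    relatives=[{"id": 7, "name": "Byron"}],
)
```

Keeping each plural field in its own column is what gives the "fetch only the fields needed" flexibility the slide mentions.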
15. PeopleStore: Schema
(Same schema as above.)
The relatives column holds only minimal relative info: ID + name. Full relative data requires another fetch.
16. PeopleStore: Schema
(Same schema as above.)
Started with Size Tiered Compaction, which generated thousands of SSTables and slowed query time. Moving to Leveled Compaction solved the issue.
17. PeopleStore: microservice
Clients
• Control exposure, read/write per flow
• Discover services by listing DNS SRV records
• Clients round-robin across these services
Services
• A Spring Boot Java REST server
• Deployed as a Docker container managed by Mesos & Marathon
• Mesos manages DNS entries
• Mesos monitors service health
• Metrics sent to JMX
Failure recovery needed despite redundancy
• On write for consistency; on read for availability
[Diagram: PHP web servers, with write-failure recovery, calling PeopleStore microservices (Java) registered in DNS via Mesos + Marathon]
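The client-side pattern above (discover endpoints, then round-robin requests across them) can be sketched as follows. The DNS SRV lookup is stubbed with a static list, and the endpoint names are made up; only the round-robin idea comes from the slide.

```python
import itertools

class RoundRobinClient:
    """Minimal sketch of client-side round-robin over service endpoints.
    In the real system the endpoint list would come from listing DNS SRV
    records published by Mesos; here it is passed in directly."""

    def __init__(self, endpoints):
        if not endpoints:
            raise ValueError("need at least one endpoint")
        # itertools.cycle yields endpoints in order, forever.
        self._cycle = itertools.cycle(endpoints)

    def next_endpoint(self):
        """Return the endpoint the next request should go to."""
        return next(self._cycle)

client = RoundRobinClient(["svc-a:8080", "svc-b:8080", "svc-c:8080"])
picks = [client.next_endpoint() for _ in range(6)]
```

A real client would also re-list the SRV records periodically and skip endpoints that fail health checks; this sketch shows only the rotation.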
18. PeopleStore: Mass Loading
To boot the system, and in case of major schema/logic changes, we had to load 2.2 billion person profiles at once.
Evaluated:
• Cassandra's sstableloader tool
• hdfs2cass from Spotify
Cons:
• Uses SSTableSimpleWriter and Cassandra streaming
• Very sensitive to the C* version
Selected: Hadoop + online Cassandra updates
19. PeopleStore: Mass Loading with Hadoop
Extract and aggregate: a MySQL extractor + Pig flow, producing Avro on the Hadoop cluster.
Load: Crunch + the Cassandra driver.
• Tested logged and unlogged BATCH writes; they do NOT help performance
• Had to implement write retries to reach 0 failures
• Collect stats into Hadoop counters
• Load time: 2.2 billion items, 6 Hadoop nodes, 6 C* nodes; ~30k writes per second; ~17 hours of loading plus hours of compaction time; impact on read latency very reasonable
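The "write retries to reach 0 failures" bullet can be sketched as a bounded-retry wrapper. The function name, attempt count, and backoff values are illustrative, not the actual Crunch job's settings.

```python
import time

def write_with_retries(write_fn, row, max_attempts=5, base_delay=0.1):
    """Retry a single write until it succeeds or max_attempts is reached,
    backing off exponentially between attempts. Sketch only; the real
    loader wraps Cassandra driver writes inside a Crunch pipeline."""
    for attempt in range(1, max_attempts + 1):
        try:
            return write_fn(row)
        except Exception:
            if attempt == max_attempts:
                raise  # give up: surface the failure to the job
            time.sleep(base_delay * 2 ** (attempt - 1))

# Usage with a deliberately flaky write function that fails twice:
attempts = {"n": 0}
def flaky_write(row):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise IOError("transient write timeout")
    return "ok"

result = write_with_retries(flaky_write, {"id": 1}, base_delay=0.001)
```

At ~30k writes per second a small transient failure rate is inevitable, which is why the loader needed retries rather than treating every timeout as fatal.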
20. PeopleStore: Mass loading with online updates
Mass loading takes time; in the meanwhile, we have online updates. The batch load must not overwrite newer online updates.
Tested: lightweight transactions:
INSERT ... IF NOT EXISTS / UPDATE ... IF update_time < <value>
Result: major slowdown, due to massive read-before-write.
Solution: an updated_people table: a small table indicating only the people that changed online while the batch load is running. Read-before-write is viable because the table is small; >99% of queries return an empty set. Insignificant slowdown.
[Diagram: the Hadoop cluster and the PeopleStore microservices (Java) both writing to Cassandra, with MySQL as the online source]
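The updated_people guard described above can be sketched like this, with the table stood in for by an in-memory set. The function and variable names are illustrative.

```python
def batch_write(person_id, batch_row, updated_people, apply_write):
    """Guarded batch write: before writing a batch-loaded row, check the
    small updated_people set. If the person changed online while the
    batch load was running, skip the write so the fresher online version
    wins. Because >99% of lookups miss, this read-before-write is cheap,
    unlike per-row lightweight transactions."""
    if person_id in updated_people:
        return False  # skip: an online update is newer than the batch data
    apply_write(batch_row)
    return True

# Usage: person 2 was updated online mid-load, so its batch row is skipped.
applied = []
updated_people = {2, 3}
wrote_ada = batch_write(1, {"name": "Ada"}, updated_people, applied.append)
wrote_eve = batch_write(2, {"name": "Eve"}, updated_people, applied.append)
```

The design choice mirrors the slide: move the read-before-write from every row (the LWT approach that caused a major slowdown) to a tiny side table that almost always answers "no conflict".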
21. PeopleStore: JVM Tuning
Experienced long GC pauses on the Cassandra nodes.
• Upgrade from Java 1.7 to 1.8.0_65
• Switch from the CMS to the G1 garbage collector: major improvement (this is the default in Cassandra 3.0)
• Tune JVM params (/etc/cassandra/conf/cassandra-env.sh); see https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
# highlights:
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -Xms16G -Xmx16G"
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"
JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=16"
JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=16"
22. PeopleStore: Other issues
Experienced unexplained missing rows on read (CASSANDRA-10801). We upgraded the Cassandra nodes from 2.1.11 to 2.1.12 and the Java driver from 2.1.5 to 2.1.9, which solved the issue.
Cassandra driver: Spring @Query annotations cannot handle "IN" queries. Instead, we used CassandraTemplate to build a native query.
23. PeopleStore: Results
Reduced latency:
• Matches page: over 50% reduction in load time
• Search results page: 40% reduction in load time
• 90% of microservice calls < 100ms
Reduced load on the MySQL databases:
• From hundreds of queries per page to just a few
25. AccountStore needs: fast user properties and counters
EVERY page on myheritage.com needs access to:
• Summarized user (account) information from multiple sources: used for marketing tracking, affiliate programs, and retargeting; includes properties and counters coming from various sources
• A/B test data: participation and variant selection, for guests and registered members
Latency: less than 10ms slowdown for any page. Data must be fresh. We store data for guests too, so there is a lot of it, and it must be available to BI systems.
Aggregating the data at runtime is too slow, so we must maintain live aggregated data, at a high update rate.
Example:
var gtmDataLayer = [{
    "site_plan": "premium-plus",
    "data_subscription": "no-data-subscription",
    "active_paying": "not-actively-paying",
    "site_visits": 3509,
    "last_mobile_sighting": "2016-02-07 11:10:25",
    ...
}];
26. AccountStore: Overview
Use Cassandra to "store it as you read it": updated aggregate information and counters on users and guests.
Event subscribers update the aggregate data online as it changes, in two tables: data and counters (a C* limitation).
For example, num_individuals_in_trees changes online as a family tree is modified, and subscription_expiration_date changes as a user becomes a paying subscriber.
A separate Cassandra table maps guests to users as they convert and register.
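The event-subscriber flow above can be sketched as a small dispatcher that maintains the two tables separately (Cassandra does not allow counter and non-counter columns in the same table, which is the "C* limitation" noted). The event shape and the dict-backed tables are illustrative; the column names come from the slides.

```python
def apply_event(event, data, counters):
    """Route one event to the right table: counter-like facts go to the
    counters table, plain properties to the data table. `data` and
    `counters` stand in for the two Cassandra tables."""
    uid = event["account_uid"]
    if event["type"] == "visit":
        # Counter column: incremented, never set directly.
        counters.setdefault(uid, {}).setdefault("num_visits", 0)
        counters[uid]["num_visits"] += 1
    elif event["type"] == "subscription":
        # Plain property: overwritten with the latest value.
        data.setdefault(uid, {})["subscription_expiration_date"] = event["expires"]

data, counters = {}, {}
apply_event({"type": "visit", "account_uid": "u1"}, data, counters)
apply_event({"type": "visit", "account_uid": "u1"}, data, counters)
apply_event({"type": "subscription", "account_uid": "u1",
             "expires": "2017-01-01"}, data, counters)
```

Keeping the aggregates live on every event is what makes the <10ms page-read budget feasible, since reads never have to aggregate.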
27. AccountStore and A/B test cluster topology
Requirement: allow BI systems to collect data without putting BI load on the production cluster.
Solution: create a fictitious data center in the cluster, within the same physical datacenter.
[Diagram: application clients hitting the App Cassandra data center; the BI system reading from a separate BI Cassandra data center in the same cluster]
28. AccountStore: Schema
Using secondary indexes for non-typical flows. Converted guests keep their UUID, plus a mapping to/from account_id.
CREATE TABLE accounts.account_store_data (
    account_uid uuid PRIMARY KEY,
    creation_time timestamp,
    device_types set<text>,
    highest_site_plan int,
    last_visit timestamp,
    . . .
) WITH ...;
CREATE TABLE accounts.account_id_guest_id (
    account_id int,
    guest_id ascii,
    guest_creation_time timestamp,
    updated_at timestamp,
    uuid uuid,
    PRIMARY KEY ((account_id, guest_id))
) WITH ...;
CREATE INDEX account_id_guest_id_updated_at_idx ON accounts.account_id_guest_id (updated_at);
CREATE INDEX account_id_guest_id_uuid_idx ON accounts.account_id_guest_id (uuid);
CREATE TABLE accounts.account_store_counters (
    account_uid uuid PRIMARY KEY,
    num_individuals_in_all_trees counter,
    num_visits counter,
    . . .
) WITH ...;
29. A/B tests
Scale: millions of active users and hundreds of active experiments: billions of rows.
Latency: must not slow down the application; many pages have multiple experiments active on them.
Must allow time-based collection into BI systems.
Classic implementation: sharded MySQL. We already have a cluster sharded by family site ID, and we do not want another MySQL cluster sharded by user ID.
Decision: a natural addition to the AccountStore Cassandra cluster.
30. AccountStore: A/B tests schema
CREATE TABLE ab_test.member_to_experiment_ts (
    uuid_bucket int,
    day int,
    hour int,
    experiment_id int,
    uuid uuid,
    created_at timestamp,
    variant_id int,
    PRIMARY KEY ((uuid_bucket, day, hour), experiment_id, uuid)
) WITH ...;
Preventing hotspots in the time-based index: uuid bucketing spreads each hour's writes across partitions. Reading requires going over all buckets.
CREATE TABLE ab_test.member_to_experiment (
    account_uid uuid,
    experiment_id int,
    created_at timestamp,
    created_at_ts bigint,
    variant_id int,
    PRIMARY KEY (account_uid, experiment_id)
) WITH ...;
Simple lookup of the experiment variant for a user.
CREATE INDEX member_to_experiment_experiment_id_idx ON ab_test.member_to_experiment (experiment_id);
Secondary lookup by experiment.
Full dump: using sstable2json.
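The uuid_bucket trick in member_to_experiment_ts can be sketched as follows. The bucket count and the modulo-over-the-UUID derivation are assumptions; the deck states only that uuid bucketing is used to ensure partitioning.

```python
import uuid

NUM_BUCKETS = 16  # illustrative; the real bucket count is not stated

def partition_key(account_uid, day, hour):
    """Build the (uuid_bucket, day, hour) partition key. Deriving the
    bucket from the uuid spreads one hour's writes across NUM_BUCKETS
    partitions instead of hammering a single (day, hour) partition."""
    bucket = account_uid.int % NUM_BUCKETS
    return (bucket, day, hour)

u = uuid.UUID("12345678-1234-5678-1234-567812345678")
key = partition_key(u, 20160207, 11)
```

The trade-off is exactly the one the slide notes: writes avoid a hotspot, but a time-ranged read (e.g. a BI collection job for one hour) must query all NUM_BUCKETS partitions and merge the results.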