For our June meetup, we'll have our local friends at www.vast.com presenting some of their current use cases for Cassandra. Additionally, Vast will be talking about a non-blocking Scala client that they have developed in house.
The document discusses key-value stores as options for scaling the backend of a Facebook game. It describes Redis, Cassandra, and Membase and evaluates them as potential solutions. Redis is selected for its simplicity and ability to handle the expected write-heavy workload using just one or two servers initially. The game has since launched and is performing well with the Redis implementation.
Michael DelNegro is a principal database administrator at AOL who has been administering MongoDB databases for applications including MapQuest and Patch. He provided an overview of MongoDB and tips for administering MongoDB databases effectively, as well as resources for further learning.
MapReduce with Apache Hadoop is a framework for distributed processing of large datasets across clusters of computers. It allows for parallel processing of data, fault tolerance, and scalability. The framework includes the Hadoop Distributed File System (HDFS) for reliable storage and MapReduce for distributed computing. MapReduce programs can be written in various languages, and frameworks like Pig and Hive provide higher-level interfaces.
Oracle GoldenGate for MySQL provides real-time data replication to capture and deliver data for MySQL databases. This is an overview of the product.
Slides from my talk at ACCU2011 in Oxford on 16th April 2011. A whirlwind tour of the non-relational database families, with a little more detail on Redis, MongoDB, Neo4j and HBase.
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big... - rhatr
You’ve got your Hadoop cluster, you’ve got your petabytes of unstructured data, you run mapreduce jobs and SQL-on-Hadoop queries. Something is still missing though. After all, we are not expected to enter SQL queries while looking for information on the web. Altavista and Google solved it for us ages ago. Why are we still requiring SQL or Java certification from our enterprise bigdata users? In this talk, we will look into how integration of SolrCloud into Apache Bigtop is now enabling building bigdata indexing solutions and ingest pipelines. We will dive into the details of integrating full-text search into the lifecycle of your bigdata management applications and exposing the power of Google-in-a-box to all enterprise users, not just a chosen few data scientists.
The document provides an introduction to Hadoop. It discusses how Google developed its own infrastructure using Google File System (GFS) and MapReduce to power Google Search due to limitations with databases. Hadoop was later developed based on these Google papers to provide an open-source implementation of GFS and MapReduce. The document also provides overviews of the HDFS file system and MapReduce programming model in Hadoop.
This document provides an overview of parallel processing and Hadoop. It discusses how Hadoop uses HDFS for distributed storage and MapReduce for parallel processing. An example application calculates maximum temperatures by year from climate data to demonstrate how Hadoop can process large datasets in parallel across multiple machines.
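A minimal sketch of what such a max-temperature job might look like with Hadoop's Java MapReduce API; the comma-separated "year,temperature" input format is an assumption made for the example:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MaxTemperature {
        // Emits (year, temperature) for each input line of the form "1950,22"
        public static class TempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                String[] parts = value.toString().split(",");
                ctx.write(new Text(parts[0]), new IntWritable(Integer.parseInt(parts[1].trim())));
            }
        }

        // Keeps only the maximum temperature seen for each year
        public static class MaxReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text year, Iterable<IntWritable> temps, Context ctx)
                    throws IOException, InterruptedException {
                int max = Integer.MIN_VALUE;
                for (IntWritable t : temps) max = Math.max(max, t.get());
                ctx.write(year, new IntWritable(max));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "max temperature");
            job.setJarByClass(MaxTemperature.class);
            job.setMapperClass(TempMapper.class);
            job.setReducerClass(MaxReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }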
This presentation can help you apply partitioning when appropriate, and avoid problems when using it. The one-liner is: Simple Works Best. The illustrating demos are on Postgres 12 (maybe 13 by the time of presenting) and show some of the problems and solutions that partitioning can provide. Some of this “experience” is quite old, and the demo runs near-identically on Oracle…
These problems are the same on any database.
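For a taste of the declarative partitioning such demos rely on (available in Postgres since version 10), here is a minimal sketch over JDBC; the table, the yearly range, and the connection details are invented for illustration:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class PartitionDemo {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:5432/demo", "demo", "demo");
                 Statement st = conn.createStatement()) {
                // The parent table is partitioned by a date range and holds no rows itself
                st.execute("CREATE TABLE measurements (ts timestamptz NOT NULL, val numeric)"
                         + " PARTITION BY RANGE (ts)");
                // One child partition per year; rows are routed automatically on INSERT
                st.execute("CREATE TABLE measurements_2020 PARTITION OF measurements"
                         + " FOR VALUES FROM ('2020-01-01') TO ('2021-01-01')");
            }
        }
    }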
Efficient in situ processing of various storage types on Apache Tajo - Hyunsik Choi
The document discusses Apache Tajo, an open source data warehouse system that supports efficient in-situ processing of various storage types. It describes Tajo's architecture, how it supports different storage backends like HDFS, S3, HBase and data formats. The key points are:
1) Tajo provides a unified interface to integrate and process data from various storage systems and formats like HDFS, S3, HBase, in a single system.
2) It uses a pluggable storage and data format architecture with tablespaces to abstract different physical storage configurations.
3) Operations can be pushed down to underlying storages for optimization during query execution.
4) Currently supported storages include HDFS, S3, and HBase.
TiDB is a distributed, horizontally scalable SQL database that is compatible with MySQL. It separates processing and storage into independent scalable components - the TiDB SQL layer and the TiKV storage foundation. TiDB uses a multi-version concurrency control approach based on Google's Spanner/F1 databases. It has been used in large-scale production deployments containing over 30 TB of data per day. Benchmarks show it can scale linearly with additional nodes. While aiming to be compatible with MySQL features, it does not support some like stored procedures and triggers.
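Because TiDB speaks the MySQL wire protocol, any stock MySQL client can connect to it; a minimal JDBC sketch, assuming a local TiDB listening on its default SQL port 4000 with default credentials:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class TiDBHello {
        public static void main(String[] args) throws Exception {
            // An ordinary MySQL JDBC driver talks to TiDB's SQL layer
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://127.0.0.1:4000/test", "root", "");
                 Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SELECT tidb_version()")) {
                while (rs.next()) System.out.println(rs.getString(1));
            }
        }
    }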
Best practices for MySQL/MariaDB Server/Percona Server High Availability - Colin Charles
Best practices for MySQL/MariaDB Server/Percona Server High Availability - presented at Percona Live Amsterdam 2016. The focus is on picking the right High Availability solution, discussing replication, handling failure (yes, you can achieve a quick automatic failover), proxies (there are plenty), HA in the cloud/geographical redundancy, sharding solutions, how newer versions of MySQL help you, and what to watch for next.
M|18 How to use MyRocks with MariaDB Server - MariaDB plc
This presentation summarizes MyRocks, a storage engine for MariaDB that is based on RocksDB. It discusses how MyRocks addresses some of the limitations of InnoDB, such as high write and space amplification. It provides details on installing and using MyRocks, including data loading techniques, tuning considerations, and replication support. Parallel replication is supported, but the highest isolation level is repeatable read, and row-based replication must be used.
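A minimal sketch of enabling and using MyRocks on MariaDB over JDBC; the plugin name ('ha_rocksdb') and engine name (ROCKSDB) follow the MariaDB documentation, while the table and connection details are invented:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class MyRocksDemo {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:mariadb://localhost:3306/test", "root", "");
                 Statement st = conn.createStatement()) {
                // Load the MyRocks storage engine plugin (done once per server)
                st.execute("INSTALL SONAME 'ha_rocksdb'");
                // Individual tables opt into MyRocks with ENGINE=ROCKSDB
                st.execute("CREATE TABLE events (id BIGINT PRIMARY KEY, payload TEXT)"
                         + " ENGINE=ROCKSDB");
            }
        }
    }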
The Hive Think Tank: Rocking the Database World with RocksDB - The Hive
Dhruba Borthakur, Facebook
Dhruba Borthakur is an engineer at Facebook. He has been one of the founding engineers of RocksDB, an open-source key-value store optimized for storing data in flash and main-memory storage. He has been one of the founding architects of the Apache Hadoop Distributed File System and has been instrumental in scaling Facebook's Hadoop cluster to multiples of petabytes. Dhruba has contributed code to the Apache HBase project. Earlier, he contributed to the development of the Andrew File System (AFS). He has an M.S. in Computer Science from the University of Wisconsin, Madison and a B.S. in Computer Science from BITS Pilani, India.
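RocksDB itself is an embeddable library; here is a minimal put/get through its Java binding (rocksdbjni), with the database path chosen arbitrarily for the sketch:

    import org.rocksdb.Options;
    import org.rocksdb.RocksDB;

    public class RocksHello {
        public static void main(String[] args) throws Exception {
            RocksDB.loadLibrary(); // load the native library once per process
            try (Options opts = new Options().setCreateIfMissing(true);
                 RocksDB db = RocksDB.open(opts, "/tmp/rocks-demo")) {
                db.put("hello".getBytes(), "world".getBytes()); // write a key-value pair
                System.out.println(new String(db.get("hello".getBytes())));
            }
        }
    }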
An introduction to MongoDB from an experienced MySQL user and developer. There are differences, and we go through the What/Why/Who/Where of MongoDB, the "similarities" to the MySQL world like storage engines, how replication is a little more interesting with built-in sharding and automatic failover, backups, monitoring, DBaaS, going to production, and finding more resources.
The relational database model was designed to solve the problems of yesterday’s data storage requirements. The massively connected world of today presents different problems and new challenges. We’ll explore the NoSQL philosophy, before comparing and contrasting the strengths and weaknesses of the relational model versus the NoSQL model. While stepping through real-world scenarios, we’ll discuss the reasons for choosing one solution over the other.
To complete this session, let’s demonstrate our findings with an application written with a NoSQL storage layer and explain the advantages that accrue from that decision. By taking a look at the new challenges we face with our data storage needs, we’ll examine why the principles behind NoSQL make it a better candidate as a solution, than yesterday’s relational model.
MariaDB 10 Tutorial - 13.11.11 - Percona Live London - Ivan Zoratti
This document provides an overview and summary of MariaDB 10 features presented by Ivan Zoratti. It discusses new features in MariaDB 10 like storage engines, administration improvements, and replication capabilities. The document also summarizes optimization enhancements in MariaDB 10 like the new optimizer, improved indexing techniques, and subquery optimizations. Various agenda topics are outlined for the MariaDB 10 tutorial.
Apachecon Europe 2012: Operating HBase - Things you need to know - Christian Gügi
This document provides an overview of important concepts for operating HBase, including:
- HBase stores data in column families stored as files on disk and writes to memory before flushing to disk.
- Manual and automatic splitting of regions is covered, as well as the challenges of improper splitting (see the pre-split sketch after this list).
- Tools for monitoring, debugging, and visualizing HBase operations are discussed.
- Key lessons focus on proper data modeling, extensive monitoring, and understanding the whole Hadoop ecosystem.
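As a concrete instance of the region-splitting point, tables are often pre-split at creation time so early writes don't hammer a single region; a rough sketch against the HBase 2.x Admin API, with table name, column family, and split keys invented:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitTable {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Admin admin = conn.getAdmin()) {
                // Explicit split keys spread writes across regions from the start
                byte[][] splits = { Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("t") };
                admin.createTable(
                    TableDescriptorBuilder.newBuilder(TableName.valueOf("events"))
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
                        .build(),
                    splits);
            }
        }
    }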
The document summarizes the evolution of Flipkart's website architecture from 2007 to 2012. Key issues addressed included slow website performance due to slow database queries, isolating reads from writes, isolating production traffic from analytics jobs, implementing caching which introduced complexity, isolating the impact of slow external services, handling spikes in traffic, and separating systems to isolate internal from external requests. The evolution involved learning lessons around scaling databases, isolating systems, managing caching complications, and ensuring systems are not overloaded.
- Apache Geode is a distributed, memory-based data management platform for data-intensive applications that require high performance, scalability, resiliency, and continuous availability.
- It is used by over 1000 companies including many large financial institutions, government organizations, and telecommunications providers for applications that involve large datasets, high throughput, and low latency access to data.
- Geode provides in-memory data storage and management, data replication, load balancing and partitioning across multiple servers in a distributed cluster to enable fast access to critical data sets.
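A minimal Geode client sketch; the locator port is Geode's default, while the region name is invented and assumed to have been created on the servers:

    import org.apache.geode.cache.Region;
    import org.apache.geode.cache.client.ClientCache;
    import org.apache.geode.cache.client.ClientCacheFactory;
    import org.apache.geode.cache.client.ClientRegionShortcut;

    public class GeodeClientDemo {
        public static void main(String[] args) {
            // Connect to the cluster through a locator; the data lives on the servers
            ClientCache cache = new ClientCacheFactory()
                .addPoolLocator("localhost", 10334)
                .create();
            // PROXY means no local storage: every operation goes to the cluster
            Region<String, String> region = cache
                .<String, String>createClientRegionFactory(ClientRegionShortcut.PROXY)
                .create("example");
            region.put("greeting", "hello"); // routed to the server owning the key
            System.out.println(region.get("greeting"));
            cache.close();
        }
    }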
Improving Apache Spark by Taking Advantage of Disaggregated Architecture - Databricks
Shuffle in Apache Spark is an intermediate phase that redistributes data across computing units; an important primitive is that shuffle data is persisted on local disks. This architecture suffers from some scalability and reliability issues. Moreover, the assumption of collocated storage does not always hold in today’s data centers. The hardware trend is moving to a disaggregated storage and compute architecture for better cost efficiency and scalability.
To address the issues of Spark shuffle and support disaggregated storage and compute architecture, we implemented a new remote Spark shuffle manager. This new architecture writes shuffle data to a remote cluster with different Hadoop-compatible filesystem backends.
Firstly, the failure of compute nodes will no longer cause shuffle data recomputation. Spark executors can also be allocated and recycled dynamically which results in better resource utilization.
Secondly, for most customers currently running Spark with collocated storage, it is usually challenging to upgrade the disks on every node to the latest hardware like NVMe SSDs and persistent memory, because of cost considerations and system compatibility. With this new shuffle manager, they are free to build a separate cluster that stores and serves the shuffle data, leveraging the latest hardware to improve performance and reliability.
Thirdly, in the HPC world, more customers are trying Spark as their high-performance data analytics tool, while storage and compute in HPC clusters are typically disaggregated. This work will make their lives easier.
In this talk, we will present an overview of the issues of the current Spark shuffle implementation, the design of new remote shuffle manager, and a performance study of the work.
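Spark's shuffle implementation is pluggable through the spark.shuffle.manager property; the sketch below only illustrates the wiring, and the RemoteShuffleManager class name is a placeholder, not the actual class from this work:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class RemoteShuffleExample {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                .setAppName("remote-shuffle-demo")
                .setMaster("local[2]")
                // Point Spark at a custom ShuffleManager implementation
                // (class name below is a placeholder for illustration)
                .set("spark.shuffle.manager", "org.example.shuffle.RemoteShuffleManager");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                long distinct = sc.parallelize(java.util.Arrays.asList(1, 2, 2, 3, 3, 3))
                    .distinct() // triggers a shuffle
                    .count();
                System.out.println(distinct);
            }
        }
    }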
Configuring workload-based storage and topologies - MariaDB plc
This document discusses configuring workload-based storage and topologies in MariaDB. It introduces several MariaDB storage engines including InnoDB, MyRocks, Aria, Spider, and ColumnStore. For each engine, it provides an overview of use cases, key configuration parameters, and recommendations on when to use each engine. It also provides an example of using different engines like MyRocks, InnoDB and Spider across multiple microservices databases based on the workload. The document aims to help users choose the right storage engine for their specific workload needs.
Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on Hadoop - Gruter
Apache Tajo is an open source big data warehouse system on Hadoop. This slide deck was presented at Big Data Camp LA 2014. It introduces Apache Tajo and the current status of the project, including cost-based optimization and the currently supported SQL feature set.
Apache Geode is an open source in-memory data grid that provides data distribution, replication and high availability. It can be used for caching, messaging and interactive queries. The presentation discusses Geode concepts like cache, region and member. It provides examples of how large companies use Geode for applications requiring real-time response, high concurrency and global data visibility. Geode's performance comes from minimizing data copying and contention through flexible consistency and partitioning. The project is now hosted by Apache and the community is encouraged to get involved through mailing lists, code contributions and example applications.
Swiss Big Data User Group - Introduction to Apache Drill - MapR Technologies
This document provides an introduction and overview of Apache Drill, an open source distributed SQL query engine designed for interactive analysis of large-scale datasets. It describes Drill's architecture as being inspired by Google's Dremel, with support for standard SQL queries, pluggable data sources, and schema flexibility. Drill distributes query execution across multiple nodes to maximize data locality and parallelism. Key features highlighted include full ANSI SQL support, support for nested data, optional schemas, and extensibility points.
Apache Geode Meetup, Cork, Ireland at CIT - Apache Geode
This document provides an introduction to Apache Geode (incubating), including:
- A brief history of Geode and why it was developed
- An overview of key Geode concepts such as regions, caching, and functions
- Examples of interesting large-scale use cases from companies like Indian Railways
- A demonstration of using Geode with Apache Spark and Spring XD for a stock prediction application
- Information on how to get involved with the Geode open source project community
This document summarizes a presentation about MongoDB best practices and lessons learned. The presentation covers: an introduction to MongoDB and how it compares to SQL databases; supported platforms and drivers; common use cases and misuses; best practices including configuration options, indexing, and operations; lessons learned around updates, sharding, and embedding data; and resources for learning more including books, training, forums, and future MongoDB releases.
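On the indexing best practice, here is a minimal sketch with the official MongoDB Java driver; the database, collection, and field names are invented for the example:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Indexes;
    import org.bson.Document;

    public class MongoIndexDemo {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> users =
                    client.getDatabase("app").getCollection("users");
                // Without this index, a query filtering on user_id scans the collection
                users.createIndex(Indexes.ascending("user_id"));
            }
        }
    }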
The document summarizes the history and current state of the MySQL database server ecosystem. It discusses the origins and development of MySQL, MariaDB, Percona Server, and other related projects. It also describes some of the key features and innovations in recent versions of these database servers. The ecosystem is very active with contributions from many organizations and the future remains promising with ongoing work.
BDM8 - Near-realtime Big Data Analytics using Impala - David Lauzon
A quick overview of all the information I've gathered on Cloudera Impala. It describes use cases for Impala and what not to use Impala for. Presented at Big Data Montreal #8 at the RPM Startup Center.
Updated version of my talk about Hadoop 3.0 with the newest community updates.
Talk given at the codecentric Meetup Berlin on 31.08.2017 and on Data2Day Meetup on 28.09.2017 in Heidelberg.
Presentation about the Spil Storage Platform (SSP) written in Erlang. This talk was first given at the Erlang User Group Netherlands in July 2012 hosted at Spilgames in Hilversum.
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra - DataStax Academy
Evan Chan from Ooyala presents on integrating Apache Spark and Apache Cassandra for interactive analytics. He discusses how Ooyala uses Cassandra for analytics and is becoming a major Spark user. The talk focuses on using Spark to generate dynamic queries over Cassandra data, as precomputing all possible aggregates is infeasible at Ooyala's scale. Chan describes Ooyala's architecture that uses Spark to generate materialized views from Cassandra for fast querying, and demonstrates running queries over a Spark/Cassandra dataset.
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It provides reliable data storage, parallel processing, and fault tolerance. Key components of Hadoop include HDFS for distributed file storage, MapReduce for distributed processing, Hive for data warehousing, and HBase for NoSQL database access. Hadoop has seen widespread adoption for applications such as log analysis, data warehousing, and machine learning due to its scalability, low costs, and fault tolerance.
Severalnines Training: MySQL® Cluster - Part IX - Severalnines
This document discusses best practices for designing a MySQL Cluster database infrastructure. It recommends dedicating instances for data and API nodes and not co-locating them. The number of nodes depends on storage, throughput and redundancy requirements. Hardware recommendations include fast CPUs, RAM sized for the dataset, and SSDs or RAID for storage. Performance planning requires benchmarking typical workloads to determine if resources need scaling. The document provides formulas and tools to help calculate storage and memory needs.
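To give the flavour of the sizing formulas mentioned, here is a deliberately simplified back-of-the-envelope calculation; every input below is an assumed number, and real NDB sizing must also budget for indexes, per-row overhead, and operational headroom:

    public class ClusterSizing {
        public static void main(String[] args) {
            long rows = 100_000_000L;  // expected row count (assumed)
            long avgRowBytes = 200L;   // average in-memory row size (assumed)
            int replicas = 2;          // NoOfReplicas
            int dataNodes = 4;         // number of data nodes

            long totalBytes = rows * avgRowBytes * replicas; // data stored once per replica
            long perNodeBytes = totalBytes / dataNodes;      // spread across node groups
            System.out.printf("Rough DataMemory needed per node: ~%d GB%n",
                perNodeBytes / (1024L * 1024 * 1024));
        }
    }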
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb... - Amazon Web Services
Get a look under the hood: understand how to take advantage of Amazon Redshift's columnar technology and parallel processing capabilities to improve the delivery of your queries and overall database performance. You’ll also hear about how the University of Technology Sydney (UTS) is using Redshift. UTS will describe how utilizing Amazon Redshift enabled agility in dealing with data quality, a capacity to scale when required, and optimized development processes through rapid provisioning of data warehouse environments.
Speaker: Ganesh Raja, Solutions Architect, Amazon Web Services with Susan Gibson, Manager, Data and Business Intelligence, UTS
Level: 300
This document discusses large scale computing with MapReduce. It provides background on the growth of digital data, noting that by 2020 there will be over 5,200 GB of data for every person on Earth. It introduces MapReduce as a programming model for processing large datasets in a distributed manner, describing the key aspects of Map and Reduce functions. Examples of MapReduce jobs are also provided, such as counting URL access frequencies and generating a reverse web link graph.
This document provides an introduction to using Spring Data to simplify development of NoSQL applications. It discusses why NoSQL databases emerged as alternatives to relational databases, gives an overview of popular NoSQL databases like Redis, MongoDB, Neo4j and their features. It then introduces Spring Data and how it provides common APIs and conventions to work with various NoSQL databases. Specific database APIs for MongoDB, HyperSQL and Neo4j are also covered along with how Spring Data supports cross-store persistence across SQL and NoSQL databases in a single transaction.
MongoDB has taken a clear lead in adoption among the new generation of databases, including the enormous variety of NoSQL offerings. A key reason for this lead has been a unique combination of agility and scalability. Agility provides business units with a quick start and flexibility to maintain development velocity, despite changing data and requirements. Scalability maintains that flexibility while providing fast, interactive performance as data volume and usage increase. We'll address the key organizational, operational, and engineering considerations to ensure that agility and scalability stay aligned at increasing scale, from small development instances to web-scale applications. We will also survey some key examples of highly-scaled customer applications of MongoDB.
Slides for the Apache Geode Hands-on Meetup and Hackathon Announcement - VMware Tanzu
This document provides an agenda for a hands-on introduction and hackathon kickoff for Apache Geode. The agenda includes details about the hackathon, an introduction to Apache Geode including its history and key features, a hands-on lab to build, run, and use Geode, and a Q&A session. It also outlines how to contribute to the Geode project through code, documentation, issue tracking, and mailing lists.
Joel Jacobson (Datastax) - Diagnosing Cassandra Problems in Production - Outlyer
This session covers diagnosing and solving common problems encountered in production, using performance profiling tools. We’ll also give a crash course in basic JVM garbage collection tuning. Attendees will leave with a better understanding of what to look for when they encounter problems with their in-production Cassandra cluster.
Video: https://www.youtube.com/watch?v=9XrHoAxd0Is
Join DevOps Exchange London here: http://www.meetup.com/DevOps-Exchange-London
Follow DOXLON on twitter http://www.twitter.com/doxlon
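As a taste of the JVM garbage collection tuning such a session covers, these are representative options one might set in Cassandra's jvm.options file (cassandra-env.sh on older releases); the 8 GB heap is an arbitrary example, not a recommendation:

    # Fixed, equal heap bounds avoid resize pauses (8G is just an example)
    -Xms8G
    -Xmx8G
    # G1 with a pause-time goal, instead of the legacy CMS defaults
    -XX:+UseG1GC
    -XX:MaxGCPauseMillis=200
    # GC logging makes pauses diagnosable after the fact
    -Xloggc:/var/log/cassandra/gc.log
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps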
Apache Drill [1] is a distributed system for interactive analysis of large-scale datasets, inspired by Google’s Dremel technology. It is a design goal to scale to 10,000 servers or more and to process petabytes of data and trillions of records in seconds. Since its inception in mid-2012, Apache Drill has gained widespread interest in the community. In this talk we focus on how Apache Drill enables interactive analysis and query at scale. First we walk through typical use cases and then delve into Drill's architecture, the data flow, and the query languages, as well as the data sources supported.
[1] http://incubator.apache.org/drill/
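Drill ships a JDBC driver, so the interactive-analysis claim is easy to try; this sketch runs an embedded Drillbit (zk=local) and queries the employee.json sample bundled on Drill's classpath, so no cluster is needed:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class DrillQuery {
        public static void main(String[] args) throws Exception {
            // zk=local starts an embedded Drillbit, handy for experiments
            try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local");
                 Statement st = conn.createStatement();
                 // `cp` is Drill's classpath storage plugin; employee.json is a bundled sample
                 ResultSet rs = st.executeQuery(
                     "SELECT full_name FROM cp.`employee.json` LIMIT 3")) {
                while (rs.next()) System.out.println(rs.getString("full_name"));
            }
        }
    }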
Similar to Austin Cassandra Users 6/19: Apache Cassandra at Vast (20)
Forrester CXNYC 2017 - Delivering great real-time cx is a true craft - DataStax Academy
Companies today are innovating with real-time data to deliver truly amazing customer experiences in the moment. Real-time data management for real-time customer experience is core to staying ahead of the competition and driving revenue growth. Join Trays to learn how Comcast is differentiating itself from its own historical reputation with Customer Experience strategies.
Introduction to DataStax Enterprise Graph Database - DataStax Academy
DataStax Enterprise (DSE) Graph is built to manage, analyze, and search highly connected data. DSE Graph, built on the NoSQL database Apache Cassandra, delivers continuous uptime along with predictable performance and scales for modern systems dealing with complex and constantly changing data.
Download DataStax Enterprise: Academy.DataStax.com/Download
Start free training for DataStax Enterprise Graph: Academy.DataStax.com/courses/ds332-datastax-enterprise-graph
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra - DataStax Academy
DataStax Enterprise Advanced Replication supports one-way distributed data replication from remote database clusters that might experience periods of network or internet downtime, benefiting use cases that require a 'hub and spoke' architecture.
Learn more at http://www.datastax.com/2016/07/stay-100-connected-with-dse-advanced-replication
Advanced Replication docs – https://docs.datastax.com/en/latest-dse/datastax_enterprise/advRep/advRepTOC.html
This document discusses using Docker containers to run Cassandra clusters at Walmart. It proposes transforming existing Cassandra hardware into containers to better utilize unused compute. It also suggests building new Cassandra clusters in containers and migrating old clusters to double capacity on existing hardware and save costs. Benchmark results show Docker containers outperforming virtual machines on OpenStack and Azure in terms of reads, writes, throughput and latency for an in-house application.
The document discusses the evolution of Cassandra's data modeling capabilities over different versions of CQL. It covers features introduced in each version such as user defined types, functions, aggregates, materialized views, and storage attached secondary indexes (SASI). It provides examples of how to create user defined types, functions, materialized views, and SASI indexes in CQL. It also discusses when each feature should and should not be used.
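As a flavour of two of those features, here is roughly what a user-defined type and a materialized view look like in CQL, issued through the DataStax Java driver 3.x; the keyspace, and a pre-existing users table with city and user_id columns, are assumptions of the sketch:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class CqlFeaturesDemo {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("shop")) {
                // A user-defined type groups related fields into one column value
                session.execute("CREATE TYPE IF NOT EXISTS address "
                    + "(street text, city text, zip text)");
                // A materialized view maintains the same rows keyed for a different query
                session.execute("CREATE MATERIALIZED VIEW IF NOT EXISTS users_by_city AS "
                    + "SELECT * FROM users "
                    + "WHERE city IS NOT NULL AND user_id IS NOT NULL "
                    + "PRIMARY KEY (city, user_id)");
            }
        }
    }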
Cisco has a large global IT infrastructure supporting many applications, databases, and employees. The document discusses Cisco's existing customer service and commerce systems (CSCC/SMS3) and some of the performance, scalability, and user experience issues. It then presents a proposed new architecture using modern technologies like Elasticsearch, Cassandra, and microservices to address these issues and improve agility, performance, scalability, uptime, and the user interface.
Data Modeling is one of the first things to sink your teeth into when trying out a new database. That's why we are going to cover this foundational topic in enough detail for you to get dangerous. Data Modeling for relational databases is more than a touch different from the way it's approached with Cassandra. We will address the quintessential query-driven methodology through a couple of different use cases, including working with time series data for IoT. We will also demo a new tool to get you bootstrapped quickly with MovieLens sample data. This talk should give you the basics you need to get serious with Apache Cassandra.
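A classic query-driven time-series table of the sort such a session builds, with day bucketing to keep partitions bounded; all names are invented, and a local node with an existing keyspace is assumed:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Session;

    public class TimeSeriesModel {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("iot")) {
                // Partition by (sensor, day) so one day's readings live together,
                // clustered newest-first for "latest N readings" queries
                session.execute("CREATE TABLE IF NOT EXISTS readings ("
                    + "sensor_id text, day date, ts timestamp, value double, "
                    + "PRIMARY KEY ((sensor_id, day), ts)) "
                    + "WITH CLUSTERING ORDER BY (ts DESC)");
                ResultSet rs = session.execute(
                    "SELECT ts, value FROM readings "
                    + "WHERE sensor_id = 's-42' AND day = '2014-06-19' LIMIT 10");
                rs.forEach(row -> System.out.println(
                    row.getTimestamp("ts") + " " + row.getDouble("value")));
            }
        }
    }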
Hear about how Coursera uses Cassandra as the core of its scalable online education platform. I'll discuss the strengths of Cassandra that we leverage, as well as some limitations that you might run into as well in practice.
In the second part of this talk, we'll dive into how best to effectively use the DataStax Java driver. We'll dig into how the driver is architected, and use this understanding to develop best practices to follow. I'll also share a couple of interesting bugs we've run into at Coursera.
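In that spirit, prepared statements and asynchronous execution are the usual first two driver best practices; a minimal 3.x-style sketch with an invented keyspace and table:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Session;

    public class DriverBestPractices {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("school")) {
                // Prepare once: the server parses the statement a single time
                PreparedStatement insert = session.prepare(
                    "INSERT INTO enrollments (course_id, user_id) VALUES (?, ?)");
                // Async execution pipelines requests instead of blocking per write
                ResultSetFuture f1 = session.executeAsync(insert.bind("c101", "alice"));
                ResultSetFuture f2 = session.executeAsync(insert.bind("c101", "bob"));
                f1.getUninterruptibly(); // wait for completion before shutdown
                f2.getUninterruptibly();
            }
        }
    }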
This document promotes DataStax Academy and certification resources for learning Cassandra, including a three-step process of learning Cassandra, getting certified, and profiting. It lists community evangelists like Luke Tillman, Patrick McFadin, Jon Haddad, and Duy Hai Doan who can provide help and resources.
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python - DataStax Academy
This document summarizes three presentations from a Cassandra Meetup:
1. Jason Cacciatore discussed monitoring Cassandra health at scale across hundreds of clusters and thousands of nodes using the reactive stream processing system Mantis.
2. Minh Do explained how Cassandra uses the gossip protocol for tasks like discovering cluster topology and sharing load information. Gossip also has limitations and race conditions that can cause problems.
3. Chris Kalantzis presented Cassandra Tickler, an open source tool he created to help repair operations that get stuck by running lightweight consistency checks on an old Cassandra version or a node with space issues.
Cassandra @ Sony: The good, the bad, and the ugly part 1 - DataStax Academy
This talk covers scaling Cassandra to a fast growing user base. Alex and Isaias will cover new best practices and how to work with the strengths and weaknesses of Cassandra at large scale. They will discuss how to adapt to bottlenecks while providing a rich feature set to the playstation community.
Cassandra @ Sony: The good, the bad, and the ugly part 2 - DataStax Academy
The document discusses Cassandra's use by Sony Network Entertainment to handle the large amount of user and transaction data from the growing PlayStation Network. It describes how the relational database they previously used did not scale sufficiently, so they transitioned to using Cassandra in a denormalized and customized way. Some of the techniques discussed include caching user data locally on application servers, secondary indexing, and using a real-time indexer to enable personalized search by friends.
This document provides guidance on setting up server monitoring, application metrics, log aggregation, time synchronization, replication strategies, and garbage collection for a Cassandra cluster. Key recommendations include:
1. Use monitoring tools like Monit, Munin, Nagios, or OpsCenter to monitor processes, disk usage, and system performance. Aggregate all logs centrally with tools like Splunk, Logstash, or Graylog.
2. Install NTP to synchronize server times which are critical for consistency.
3. Use the NetworkTopologyStrategy replication strategy and avoid SimpleStrategy for production (see the keyspace sketch after this list).
4. Avoid shared storage and focus on low latency and high throughput using multiple local disks.
5. Understand
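The NetworkTopologyStrategy keyspace from point 3 above, as a minimal sketch via the DataStax Java driver; the datacenter names and replication factors are illustrative:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class KeyspaceSetup {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect()) {
                // A replica count is set per datacenter, which SimpleStrategy cannot express
                session.execute("CREATE KEYSPACE IF NOT EXISTS app WITH replication = "
                    + "{'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3}");
            }
        }
    }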
This document discusses real time analytics using Spark and Spark Streaming. It provides an introduction to Spark and highlights limitations of Hadoop for real-time analytics. It then describes Spark's advantages like in-memory processing and rich APIs. The document discusses Spark Streaming and the Spark Cassandra Connector. It also introduces DataStax Enterprise which integrates Spark, Cassandra and Solr to allow real-time analytics without separate clusters. Examples of streaming use cases and demos are provided.
Introduction to Data Modeling with Apache Cassandra - DataStax Academy
This document provides an introduction to data modeling with Apache Cassandra. It discusses how Cassandra data models are designed based on the queries an application will perform, unlike relational databases which are designed based on normalization rules. Key aspects covered include avoiding joins by denormalizing data, using a partition key to group related data on nodes, and controlling the clustering order of columns. The document provides examples of modeling time series and tag data in Cassandra.
The document discusses different data storage options for small, medium, and large datasets. It argues that relational databases do not scale well for large datasets due to limitations with replication, normalization, sharding, and high availability. The document then introduces Apache Cassandra as a fast, distributed, highly available, and linearly scalable database that addresses these limitations through its use of a hash ring architecture and tunable consistency levels. It describes Cassandra's key features including replication, compaction, and multi-datacenter support.
Enabling Search in your Cassandra Application with DataStax Enterprise - DataStax Academy
This document provides an overview of using Datastax Enterprise (DSE) Search to enable full-text search capabilities in Cassandra applications. It discusses how DSE Search integrates Solr/Lucene indexing with the Cassandra database to allow searching of application data without requiring a separate search cluster, external ETL processes, or custom application code for data management. The document also includes examples of different types of searches that can be performed, such as filtering, faceting, geospatial searches, and joins. It concludes with basic steps for getting started with DSE Search such as creating a Solr core and executing search queries using CQL.
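DSE Search surfaces Solr through a special solr_query pseudo-column in CQL; a rough sketch of such a query from the Java driver, assuming DSE Search is enabled and a search core exists for the (invented) products table:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class DseSearchQuery {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("catalog")) {
                // solr_query passes a Lucene/Solr query string to the search index
                for (Row row : session.execute(
                        "SELECT id, name FROM products WHERE solr_query = 'name:chrome*'")) {
                    System.out.println(row.getString("id") + " " + row.getString("name"));
                }
            }
        }
    }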
The document discusses common bad habits that can occur when working with Apache Cassandra and provides recommendations to avoid them. Specifically, it addresses issues like sliding back into a relational mindset when the data model is different, improperly benchmarking Cassandra systems, having slow client performance, and neglecting important operations tasks. The presentation provides guidance on how to approach data modeling, querying, benchmarking, driver usage, and operations management in a Cassandra-oriented way.
This document provides an overview and examples of modeling data in Apache Cassandra. It begins with an introduction to thinking about data models and queries before modeling, and emphasizes that Cassandra requires modeling around queries due to its limitations on joins and indexes. The document then provides examples of modeling user, video, and other entity data for a video sharing application to support common queries. It also discusses techniques for handling queries that could become hotspots, such as bucketing or adding random values. The examples illustrate best practices for data duplication, materialized views, and time series data storage in Cassandra.
The document discusses best practices for using Apache Cassandra, including:
- Topology considerations like replication strategies and snitches
- Booting new datacenters and replacing nodes
- Security techniques like authentication, authorization, and SSL encryption
- Using prepared statements for efficiency
- Asynchronous execution for request pipelining
- Batch statements and their appropriate uses (see the sketch after this list)
- Improving performance through techniques like the new row cache
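The batch-statement sketch referenced in the list above, using the DataStax Java driver 3.x with an invented keyspace and table; logged batches buy atomicity for related writes, not bulk-load speed:

    import com.datastax.driver.core.BatchStatement;
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Session;

    public class BatchDemo {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("app")) {
                PreparedStatement ps = session.prepare(
                    "INSERT INTO user_events (user_id, ts, event) VALUES (?, ?, ?)");
                // LOGGED batch: either all statements eventually apply, or none do
                BatchStatement batch = new BatchStatement(BatchStatement.Type.LOGGED);
                batch.add(ps.bind("alice", new java.util.Date(), "login"));
                batch.add(ps.bind("alice", new java.util.Date(), "view"));
                session.execute(batch);
            }
        }
    }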
How to Get CNIC Information System with Paksim Ga.pptx - danishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Introduction of Cybersecurity with OSS at Code Europe 2024 - Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
What do a Lego brick and the XZ backdoor have in common? - Speck&Tech
ABSTRACT: At first glance, a Lego brick and the XZ backdoor might seem to have in common only the fact that both are building blocks, or dependencies, of creative and software projects. In reality, a Lego brick and the XZ backdoor case share much more than that.
Join the presentation to immerse yourself in a story of interoperability, standards, and open formats, and then discuss the important role contributors play in a sustainable open source community.
BIO: An advocate of free software and of standard, open formats. She has been an active member of the Fedora and openSUSE projects and co-founded the LibreItalia Association, where she was involved in several events, migrations, and training activities related to LibreOffice. Previously she worked on LibreOffice migrations and training courses for several public administrations and private companies. Since January 2020 she has worked at SUSE as a Software Release Engineer for Uyuni and SUSE Manager, and when she is not following her passion for computers and for Geeko she cultivates her curiosity about astronomy (from which her nickname deneb_alpha derives).
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
Fueling AI with Great Data with Airbyte Webinar - Zilliz
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to productionalization.
Project Management Semester Long Project - Acuity - jpupo2018
Acuity is an innovative learning app designed to transform the way you engage with knowledge. Powered by AI technology, Acuity takes complex topics and distills them into concise, interactive summaries that are easy to read & understand. Whether you're exploring the depths of quantum mechanics or seeking insight into historical events, Acuity provides the key information you need without the burden of lengthy texts.
UiPath Test Automation using UiPath Test Suite series, part 6 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series, part 6. In this session, we will cover Test Automation with generative AI and OpenAI.
The UiPath Test Automation with generative AI and OpenAI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into integrating generative AI, as a test automation solution, with OpenAI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and OpenAI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A... - Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on integration of Salesforce with Bonterra Impact Management.
Interested in deploying an integration with Salesforce for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Monitoring and Managing Anomaly Detection on OpenShift.pdf - Tosin Akinosho
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system (see the metrics sketch after this list).
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
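For the Prometheus topic (item 8), application metrics are usually exposed by a client library on an HTTP endpoint that Prometheus scrapes; a minimal sketch with the Java simpleclient, where the metric name and port are arbitrary choices:

    import io.prometheus.client.Counter;
    import io.prometheus.client.exporter.HTTPServer;

    public class MetricsDemo {
        public static void main(String[] args) throws Exception {
            // Exposes /metrics on :9091 for Prometheus to scrape
            HTTPServer server = new HTTPServer(9091);
            Counter anomalies = Counter.build()
                .name("anomalies_detected_total")
                .help("Number of anomalies flagged by the detector")
                .register();
            anomalies.inc(); // call this wherever the detector flags an anomaly
            Thread.currentThread().join(); // keep serving metrics
        }
    }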
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack - shyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxSitimaJohn
Ocean Lotus cyber threat actors represent a sophisticated, persistent, and politically motivated group that poses a significant risk to organizations and individuals in the Southeast Asian region. Their continuous evolution and adaptability underscore the need for robust cybersecurity measures and international cooperation to identify and mitigate the threats posed by such advanced persistent threat groups.
Main news related to the CCS TSI 2023 (2023/1695)Jakub Marek
An English 🇬🇧 translation of a presentation to the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on Communications and signalling systems on Railways, which was held in Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). Attended by around 500 participants and 200 on-line followers.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceIndexBug
Imagine a world where machines not only perform tasks but also learn, adapt, and make decisions. This is the promise of Artificial Intelligence (AI), a technology that's not just enhancing our lives but revolutionizing entire industries.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features
available on those devices, but many of the features provide convenience and capability but sacrifice security. This best practices guide outlines steps the users can take to better protect personal devices and information.
Your One-Stop Shop for Python Success: Top 10 US Python Development Providersakankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
Webinar: Designing a schema for a Data WarehouseFederico Razzoli
Are you new to data warehouses (DWH)? Do you need to check whether your data warehouse follows the best practices for a good design? In both cases, this webinar is for you.
A data warehouse is a central relational database that contains all measurements about a business or an organisation. This data comes from a variety of heterogeneous data sources, which includes databases of any type that back the applications used by the company, data files exported by some applications, or APIs provided by internal or external services.
But designing a data warehouse correctly is a hard task, which requires gathering information about the business processes that need to be analysed in the first place. These processes must be translated into so-called star schemas, which means, denormalised databases where each table represents a dimension or facts.
We will discuss these topics:
- How to gather information about a business;
- Understanding dictionaries and how to identify business entities;
- Dimensions and facts;
- Setting a table granularity;
- Types of facts;
- Types of dimensions;
- Snowflakes and how to avoid them;
- Expanding existing dimensions and facts.
Austin Cassandra Users 6/19: Apache Cassandra at Vast
1. June 19, 2014
Cassandra at Vast
Graham Sanderson - CTO, David Pratt - Director of Applications
2. June 19, 2014
Introduction
• Don’t want this to be a data modeling talk
• We aren't experts - we are learning as we go
• Hopefully this will be useful to both you and us
• Informal, questions as we go
• We will share our experiences so far moving to Cassandra
• We are working on a bunch of existing and new projects
• We'll talk about 2 1/2 of them
• Some dev stuff, some ops stuff
• Some thoughts for the future
• Athena Scala Driver
3. June 19, 2014
Who is Vast?
• Vast operates white-label, performance-based marketplaces for publishers, and delivers big data mobile applications for automotive and real estate sales professionals
• “Big Data for Big Purchases”
• Marketplaces
• Large partner sites, including AOL, CARFAX, TrueCar, Realogy, USAA, Yahoo
• Hundreds of smaller partner sites
• Analytics
• Strong team of scarily smart data scientists
• Integrating analytics everywhere
5. June 19, 2014
Data Flow
• Flows between different data store types (many include historical data too)
• Systems of Record (SOR)
• Both root nodes and leaf nodes
• Derived data stores (mostly MVCC) for:
• Real time customer facing queries
• Real time analytics
• Alerting
• Offline analytics
• Reporting
• Debugging
• Mixture of dumps and deltas
• We have derived SORs
• Cached smaller subset of records/fields for a specific purpose
• SORs in multiple data centers - some derived SORs shared
• Data flow is a graph, not a tree - there is feedback
6. June 19, 2014
Goals
• Reduce latency to <15 mins for customer-facing data
• Reduce copying and duplication of data
• Network/Storage/Time costs
• More streaming & deltas, fewer dumps and derived SORs
• Want multi-purpose, multi-tenant central store
• Something rock solid
• Something that can handle lots of data fast
• Something that can do random access and bulk operations
• Use for all data store types on previous slide
• (Over?)build it; they will come
• Consolidate the rest on
• HDFS, Vertica, Postgres, S3, Glacier, SOLR/Lucene
7. June 19, 2014
Why Cassandra?
• Regarded as rock solid
• No single point of failure
• Active development & open source Java
• Good fit for the type of data we wanted to store
• Ease of configuration; all nodes are the same
• Easily tunable consistency at application level
• Easy control of sharding at application level
• Drivers for all our languages (we're mostly JVM but also node)
• Data locality with other tools
• Good cross data center support
8. June 19, 2014
Evolution
• July 2013 (alpha on C* 1.1)
• September 2013 (MTC-1 on C* 2.0.0)
• First use case (a nasty one) - we’ll talk about it later
• Stress/Destructive testing
• Found and helped fix a few bugs along the way
• Learned a lot about tuning and operations
• Half the nodes down at one point
• Corrupted SSTables on one node
• We’ve been cautious
• Started with internal facing only use (don’t need 100% uptime)
• Moved to external facing use but with ability to fall back off C* in minutes
• Getting braver
• C* is only SOR and real time customer facing store for some cases now
• We have on occasion custom built C* with cherry-picked patches
9. June 19, 2014
HW Specs MTC-1
• Remember we want to build for the C* future
• 6 nodes
• 16x cores (Sandy Bridge)
• 256G RAM
• Lots of disk cache and mem-mapped NIO buffers
• 7x 1.2TB 10K RPM JBOD (4.2ms latency, 200MB/sec sequential each)
• 1x SSD commit volume (~100K IOPS, 550MB/sec sequential)
• RAID1 OS drives
• 4x gigabit ethernet
10. June 19, 2014
SW Specs MTC-1
• CentOS 6.5
• Cassandra 2.0.5
• JDK 1.7.0_60-b19
• 8 gig young generation / 6.4 gig eden
• 16 gig old generation
• Parallel new collector
• CMS collector
• Sounds like overkill but we are multi-tenant and have spiky loads
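As a sketch, the heap numbers above map onto JVM flags along these lines (a plausible cassandra-env.sh fragment; the deck gives the sizes but not the exact flags, so treat these as assumptions):

# 24G heap total: 8G young gen (6.4G eden + two 0.8G survivors), 16G old gen
JVM_OPTS="$JVM_OPTS -Xms24G -Xmx24G -Xmn8G"
# eden : survivor = 6.4 : 0.8
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
# parallel new collector for the young gen, CMS for the old gen
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC -XX:+UseConcMarkSweepGC"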
11. June 19, 2014
General
• LOCAL_QUORUM for reads and writes
• Use LZ4 compression
• Use key cache (not row cache)
• Some SizeTiered, some Leveled CompactionStrategy
• Drivers
• Athena (Scala / binary)
• Astyanax 1.56.48 (Java / thrift)
• node-cassandra-cql (Node / binary)
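For illustration only, here is how those defaults look when wired up from Scala with the DataStax Java driver (a sketch; the deck's actual drivers are Athena, Astyanax and node-cassandra-cql, and my_ks.my_table is a placeholder):

import com.datastax.driver.core.{Cluster, ConsistencyLevel, QueryOptions}

object TableDefaults {
  def main(args: Array[String]): Unit = {
    // LOCAL_QUORUM as the driver-wide default for reads and writes
    val cluster = Cluster.builder()
      .addContactPoint("127.0.0.1") // placeholder contact point
      .withQueryOptions(new QueryOptions().setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM))
      .build()
    val session = cluster.connect()

    // LZ4 on disk, key cache only (no row cache), leveled compaction
    session.execute(
      "ALTER TABLE my_ks.my_table " +
        "WITH compression = {'sstable_compression': 'LZ4Compressor'} " +
        "AND compaction = {'class': 'LeveledCompactionStrategy'} " +
        "AND caching = 'keys_only'")

    cluster.close()
  }
}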
12. June 19, 2014
Use Case 1 - Search API - Problem
• 40 million records (including duplicates per VIN) in HDFS
• Map/Reduce to 7 million SOLR XML updates in HDFS
• Not delta today because of map/reduce-like business rules
• Export to SOLR XML from HDFS to local FS
• Re-index via SOLR
• 40 gig SOLR index - at least 3 slaves
• OKish every few hours, not every 15 minutes
• Even though we made a very fast parallel indexer
• The % of stored data read per indexing run is getting smaller
13. June 19, 2014
Use Case 1 - Search API - Solution
• Indexing in hadoop
• SOLR(Lucene) segments created (no stored fields)
• Job option for fallback to stored fields in SOLR index
• Stored fields go to C* as JSON directly from hadoop
• Astyanax - 1MB batches - LOCAL_QUORUM
• Periodically create new table (CF) with full data baseline (clustering) column
• 200MB/s 3 replicas continuously for one to two minutes
• 40000 partition keys/s (one per record)
• Periodically add new (clustering) column to table with deltas from latest dump
• Delta data size is 100x smaller and hits many fewer partition keys
• Keep multiple recent tables for rollback (for bad data more than for recovery)
• 2 gig SOLR index (20x smaller)
14. June 19, 2014
Use Case 1 - Search API - Solution
• Very bare bones - not even any metadata :-(
• Thrift style
• Note we use blob
• Everything is UTF-8
• Avro - Utf8
• Hadoop - Text
• Astyanax - ByteBuffer
• Most JVM drivers try to convert text to String
CREATE TABLE "20140618084015_20140618_081920_1403072360" (!
key text,!
column1 blob,!
value blob,!
PRIMARY KEY (key, column1)!
) WITH COMPACT STORAGE;
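Because every value in this table is UTF-8 text stored in a blob, a JVM-side write is just ByteBuffers. A minimal sketch with the DataStax Java driver (the real pipeline used Astyanax 1MB batches from Hadoop; the keyspace and sample values here are made up):

import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets.UTF_8
import com.datastax.driver.core.Cluster

object StoredFieldsWrite {
  def utf8(s: String): ByteBuffer = ByteBuffer.wrap(s.getBytes(UTF_8))

  def main(args: Array[String]): Unit = {
    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect("search") // placeholder keyspace

    // key = record id, column1 = (clustering) column for this baseline/delta,
    // value = the stored-fields JSON; everything is UTF-8 inside blobs
    val insert = session.prepare(
      "INSERT INTO \"20140618084015_20140618_081920_1403072360\" " +
        "(key, column1, value) VALUES (?, ?, ?)")

    session.execute(insert.bind("somevin123", utf8("20140618084015"), utf8("{\"make\":\"Honda\"}")))
    cluster.close()
  }
}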
15. June 19, 2014
Use Case 1 - Search API - Solution
• Stored fields cached in SOLR JVM (verification/warm up tests)
• MVCC to prevent read-from-future
• Single clustering key limit for the SOLR core
• Reads fall back from LOCAL_QUORUM to LOCAL_ONE (see the sketch after this slide)
• Better to return something, even a subset of results
• Never happened in production though
• Issues
• Don’t recreate table/CF until C* 2.1
• Early 2.0.x and Astyanax don’t like schema changes
• Create new tables via CQL3 via Astyanax
• Monitoring harder since we now use UUID for table name
• Full (non delta) index write rate strains GC and causes some hinting
• C* remains rock solid
• We can constrain by mapper/reducer count, and will probably add a zookeeper mutex
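The LOCAL_QUORUM-to-LOCAL_ONE read fallback above can be as simple as one retry at the weaker consistency level. A hedged sketch (DataStax Java driver shown for brevity, not the Astyanax code actually in use):

import com.datastax.driver.core.{ConsistencyLevel, ResultSet, Session, SimpleStatement}
import com.datastax.driver.core.exceptions.{ReadTimeoutException, UnavailableException}

object FallbackRead {
  // Try LOCAL_QUORUM first; on failure, better to return a possibly
  // stale subset of results than nothing at all
  def readWithFallback(session: Session, cql: String): ResultSet =
    try {
      session.execute(new SimpleStatement(cql).setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM))
    } catch {
      case _: UnavailableException | _: ReadTimeoutException =>
        session.execute(new SimpleStatement(cql).setConsistencyLevel(ConsistencyLevel.LOCAL_ONE))
    }
}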
16. June 19, 2014
Use Case 1.5 - RESA
• Newer version of the real estate pipeline
• Fully streaming delta pipeline (RabbitMQ)
• Field-level SOLR index updates (including latest timestamp)
• C* row with JSON delta for that timestamp
• History is used in customer-facing features (read sketch after the schema)
• Note this is really the same table shape as the thrift one
CREATE TABLE for_sale (
  id text,
  created_date timestamp,
  delta_json text,
  PRIMARY KEY (id, created_date)
);
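Reading a listing's history back for those customer-facing features is then one partition slice in timestamp order; a sketch (DataStax Java driver for illustration):

import scala.collection.JavaConverters._
import com.datastax.driver.core.Session

object ListingHistory {
  // Every (created_date, delta_json) pair for one listing, oldest first
  def history(session: Session, id: String): Seq[(java.util.Date, String)] =
    session
      .execute("SELECT created_date, delta_json FROM for_sale WHERE id = ?", id)
      .all().asScala
      .map(r => (r.getDate("created_date"), r.getString("delta_json")))
}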
17. June 19, 2014
Use Case 2 - Feed Management - Problem
• Thousands of feeds of different size and frequency
• Incoming feeds must be “polished”
• Geocoding must be done
• Images must be made available in S3
• Need to reprocess individual feeds
• Full output records are munged from asynchronously updated parts
• Previously huge HDFS job
• 300M inputs for 70M full output records
• Records need all data to be “ready” for full output
• Silly because most work is redundant from the previous run
• The only partitioning help is brittle HDFS directory structures
18. June 19, 2014
Use Case 2 - Feed Management - Solution
• Scala & Akka & Athena (large throughput - high parallelism)
• Compound partition key (2^n shards per feed)
• Spreads data - limits partition “row” length
• Read entire feed without key scan - small IN clause (sketch after this slide)
• Random access writes
• Any sub-field may be updated asynchronously
• Munged record emitted to HDFS whenever “ready”
CREATE TABLE feed_state (
  feed_name text,
  feed_record_id_shard int,
  record_id uuid,
  raw_record text,
  polished_data text,
  geocode_data text,
  image_status text,
  ...
  PRIMARY KEY ((feed_name, feed_record_id_shard), record_id)
)
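A sketch of the sharding arithmetic and the read path: writes hash the record id into one of 2^n shards, and a whole-feed read enumerates every shard in one small IN clause instead of scanning keys. Names and the shard count are illustrative; the real pipeline is Scala/Akka on Athena:

import java.util.UUID

object FeedShards {
  val ShardBits = 3                  // assumption: 2^3 = 8 shards per feed; n is tunable
  val NumShards = 1 << ShardBits

  // Stable shard for a record: spreads one feed's rows across NumShards
  // partitions and bounds each partition "row" length
  def shardOf(recordId: UUID): Int =
    (recordId.hashCode & Int.MaxValue) % NumShards

  // Whole-feed read without a key scan: one small IN over the shards
  // (string interpolation is for readability here; use bind parameters in real code)
  def wholeFeedCql(feedName: String): String = {
    val shards = (0 until NumShards).mkString(", ")
    s"SELECT * FROM feed_state WHERE feed_name = '$feedName' AND feed_record_id_shard IN ($shards)"
  }
}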
19. June 19, 2014
Monitoring
• OpsCenter
• log4j/syslog/graylog
• Email alerts
• nagios/zabbix
• Graphite (autogen graph pages)
• Machine stats via collectl, JVM from codahale
• Cassandra stats from codahale
• Suspect a possible issue with hadoop using the same coordinator nodes
• GC logs
• VisualVM
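The codahale-to-Graphite plumbing is roughly this (a sketch with the Dropwizard Metrics API; the host, port and prefix are placeholders, not Vast's actual setup):

import java.net.InetSocketAddress
import java.util.concurrent.TimeUnit
import com.codahale.metrics.MetricRegistry
import com.codahale.metrics.graphite.{Graphite, GraphiteReporter}

object Metrics {
  val registry = new MetricRegistry

  // Ship machine/JVM/Cassandra-client metrics to Graphite once a minute
  def start(): Unit = {
    val graphite = new Graphite(new InetSocketAddress("graphite.example.com", 2003))
    GraphiteReporter.forRegistry(registry)
      .prefixedWith("cassandra-clients") // hypothetical prefix
      .convertRatesTo(TimeUnit.SECONDS)
      .convertDurationsTo(TimeUnit.MILLISECONDS)
      .build(graphite)
      .start(1, TimeUnit.MINUTES)
  }
}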
20. June 19, 2014
General Issues / Lessons Learned
• GC issues
• Old generation fragmentation causes eventual promotion failure
• Usually of 1MB Memtable “slabs” - These can be off heap in C* 2.1 :-)
• Thrift API with bulk load probably not helping, but fragmentation is inevitable
• Some slow initial mark and remark STW pauses
• We do have a big young gen - New -XX:+ flags in 1.7.0_60 :-)
• As said, we aim to be multi-tenant
• Avoid client stupidity, but otherwise accommodate any client behavior
• GC now well tuned
• 1 compacting GC per day at off-peak times, very rare 1 sec pauses, a handful >0.5 sec per day
• Cassandra and its own dog food
• Can’t wait for hints to be commit log style regular file (C* 3.0)
• The compactions_in_progress table
• OpsCenter rollup - turned off for search api tables
21. June 19, 2014
General Issues / Lessons Learned
• Don’t repair things that don’t need it
• We also run -pr -par repair on each node (example below)
• Beware when not following the rules
• We were knowingly running on potentially buggy minor versions
• If you don’t know what you’re doing you will likely screw up
• Fortunately for us C* has always kept running fine
• It is usually pretty easy to fix with some googling
• Deleting data is counter-intuitively often a good fix!
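For reference, that per-node primary-range parallel repair invocation is (keyspace name is a placeholder; omit it to repair everything):

nodetool repair -pr -par my_keyspace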
22. June 19, 2014
Future
• Upgrade 2.0.x to use static columns
• User defined types :-)
• De-duplicate data into shared storage in C*
• Analytics via data-locality
• Hadoop, Pig, Spark/Scalding, R
• More cross data center
• More tuning
• Full streaming pipeline with C* as side state store
23. June 19, 2014
Athena
• Why would we do such an obviously crazy thing?
• Need to support async, reactive applications across different problem domains
• Real-time API used by several disparate clients (iOS, Node.js, …)
• Ground-up implementation of the CQL 2.0 binary protocol
• Scala 2.10/2.11
• Akka 2.3.x
• Fully async, nonblocking API
• Has obvious advantages but requires a different paradigm
• Implemented as an extension for Akka-IO
• Low-level actor based abstraction
• Cluster, Host and Connection actors
• Reasonably stable
• High-level streaming Session API
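To make "fully async, nonblocking" concrete, a hypothetical session interaction might read as below. These traits are illustrative only, not Athena's actual (and still in flux) API - see the GitHub repo for the real thing:

import scala.concurrent.{ExecutionContext, Future}

// Hypothetical shapes for illustration
trait Row { def string(name: String): String }
trait CqlSession {
  def execute(cql: String, params: Any*)(implicit ec: ExecutionContext): Future[Seq[Row]]
}

object AsyncExample {
  // No thread blocks waiting on Cassandra; the continuation runs
  // whenever the result arrives
  def vinFor(session: CqlSession, id: String)(implicit ec: ExecutionContext): Future[String] =
    session.execute("SELECT vin FROM records WHERE id = ?", id).map(_.head.string("vin"))
}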
24. June 19, 2014
Athena
• Next steps
• Move off of Play Iteratees and onto Akka Reactive Streams
• Token based routing
• Client API very much in flux - suggestions are welcome!
• https://github.com/vast-engineering/athena
• Release of first beta milestone to Sonatype Maven repository imminent
• Pull requests welcome!