The document discusses different types of databases including relational, column-oriented, document-oriented, and graph databases. It explains key concepts such as ACID vs BASE, CAP theorem, isolation levels, indexes, sharding, and provides descriptions and comparisons of each database type.
In this second part, we continue the review of Spark and introduce Spark SQL, which allows you to use data frames in Python, Java, and Scala; read and write data in a variety of structured formats; and query big data with SQL.
This document provides an overview comparison of SAS and Spark for analytics. SAS is commercial software, while Spark is an open-source framework. SAS uses datasets that reside in memory, while Spark uses resilient distributed datasets (RDDs) that can scale across clusters. Both support SQL queries, but Spark SQL allows querying distributed data lazily. Spark also provides machine learning APIs through MLlib that can perform tasks like classification, clustering, and recommendation at scale.
What is Distributed Computing, Why we use Apache Spark - Andy Petrella
In this talk we introduce the notion of distributed computing and then cover Spark's advantages.
The Spark core content is deliberately brief because the whole explanation was done live using a Spark Notebook (https://github.com/andypetrella/spark-notebook/blob/geek/conf/notebooks/Geek.snb).
This talk was given jointly by @xtordoir and myself at the University of Liège, Belgium.
This document provides an overview of Apache Spark, including:
- The big data problems that Spark addresses, such as large volumes of data from various sources.
- A comparison of Spark to existing techniques like Hadoop, noting Spark allows for better developer productivity and performance.
- An overview of the Spark ecosystem and how Spark can integrate with an existing enterprise.
- Details about Spark's programming model including its RDD abstraction and use of transformations and actions.
- A discussion of Spark's execution model involving stages and tasks.
This document discusses scaling machine learning using Apache Spark. It covers several key topics:
1) Parallelizing machine learning algorithms and neural networks to distribute computation across clusters. This includes data, model, and parameter server parallelism.
2) Apache Spark's Resilient Distributed Datasets (RDDs) programming model which allows distributing data and computation across a cluster in a fault-tolerant manner.
3) Examples of very large neural networks trained on clusters, such as a Google face detection model using 1,000 servers and an IBM brain-inspired chip model using 262,144 CPUs.
This document provides an overview of installing and deploying Apache Spark, including:
1. Spark can be installed via prebuilt packages or by building from source.
2. Spark runs in local, standalone, YARN, or Mesos cluster modes and the SparkContext is used to connect to the cluster.
3. Jobs are deployed to the cluster using the spark-submit script which handles building jars and dependencies.
AWS Big Data Demystified #2 | Athena, Spectrum, EMR, Hive - Omid Vahdaty
This document provides an overview of various AWS big data services including Athena, Redshift Spectrum, EMR, and Hive. It discusses how Athena allows users to run SQL queries directly on data stored in S3 using Presto. Redshift Spectrum enables querying data in S3 using standard SQL from Amazon Redshift. EMR is a managed Hadoop framework that can run Hive, Spark, and other big data applications. Hive provides a SQL-like interface to query data stored in various formats like Parquet and ORC on distributed storage systems. The document demonstrates features and provides best practices for working with these AWS big data services.
Powering a Graph Data System with Scylla + JanusGraph - ScyllaDB
Key Value and Column Stores are not the only two data models Scylla is capable of. In this presentation learn the What, Why and How of building and deploying a graph data system in the cloud, backed by the power of Scylla.
Making Sense of Spark Performance (Kay Ousterhout, UC Berkeley) - Spark Summit
This document summarizes the key findings from a study analyzing the performance bottlenecks in Spark data analytics frameworks. The study used three different workloads run on Spark and found that: network optimizations provided at most a 2% reduction in job completion time; CPU was often the main bottleneck rather than disk or network I/O; optimizing disk performance reduced completion time by less than 19%; and many straggler causes could be identified and addressed to improve performance. The document discusses the methodology used to measure bottlenecks and blocked times, limitations of the study, and reasons why the results differed from assumptions in prior works.
Resilient Distributed Datasets - Apache Spark - Taposh Roy
RDDs (Resilient Distributed Datasets) provide a fault-tolerant abstraction for data reuse across jobs in distributed applications. They allow data to be persisted in memory and manipulated using transformations like map and filter. This enables efficient processing of iterative algorithms. RDDs achieve fault tolerance by logging the transformations used to build a dataset rather than the actual data, enabling recovery of lost partitions through recomputation.
What is Mesos? How does it work? In the following slides we offer an interesting review of this open-source software project for managing computer clusters.
A comprehensive introduction to NoSQL solutions inside the big data landscape. Graph store? Column store? Key-value store? Document store? Redis or Memcached? DynamoDB? MongoDB? HBase? Cloud or open source?
Presentation slides for the paper on Resilient Distributed Datasets, written by Matei Zaharia et al. at the University of California, Berkeley.
The paper is not my work.
These slides were made for the course on Advanced Distributed Systems held by prof. Bratsberg at NTNU (Norwegian University of Science and Technology, Trondheim, Norway).
Distributed Stream Processing - Spark Summit East 2017 - Petr Zapletal
The document discusses distributed stream processing frameworks. It provides an overview of frameworks like Storm, Spark Streaming, Samza, Flink, and Kafka Streams. It compares aspects of different frameworks like programming models, delivery guarantees, fault tolerance, and state management. General guidelines are given for choosing a framework based on needs like latency requirements and state needs. Storm and Trident are recommended for low latency tasks while Spark Streaming and Flink are more full-featured but have higher latency. The document provides code examples for word count in different frameworks.
This document discusses alternatives to Hadoop for big data analytics. It introduces the Berkeley data analytics stack, including Spark, and compares the performance of iterative machine learning algorithms between Spark and Hadoop. It also discusses using Twitter's Storm for real-time analytics and compares the performance of Mahout and R/ML over Storm. The document provides examples of using Spark for logistic regression and k-means clustering and discusses how companies like Ooyala and Conviva have benefited from using Spark.
Adding Complex Data to Spark Stack by Tug Grall - Spark Summit
This document discusses adding complex data to the Spark stack using Apache Drill. It provides an overview of Drill, how to integrate it with Spark, the current status and next steps. Drill allows SQL-based querying of structured, semi-structured and unstructured data across various data sources. It can be used as an input to Spark jobs and to query Spark RDDs. The integration provides benefits like flexibility, rich storage support and efficient distributed processing while combining Drill and Spark's capabilities.
This document provides an introduction and overview of Apache Spark. It discusses why Spark is useful, describes some Spark basics including Resilient Distributed Datasets (RDDs) and DataFrames, and gives a quick tour of Spark Core, SQL, and Streaming functionality. It also provides some tips for using Spark and describes how to set up Spark locally. The presenter is introduced as a data engineer who uses Spark to load data from Kafka streams into Redshift and Cassandra. Ways to learn more about Spark are suggested at the end.
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala... - Martin Zapletal
The document discusses distributed machine learning and data processing. It covers several topics including reasons for using distributed machine learning, different distributed computing architectures and primitives, distributed data stores and analytics tools like Spark, streaming architectures like Lambda and Kappa, and challenges around distributed state management and fault tolerance. It provides examples of failures in distributed databases and suggestions to choose the appropriate tools based on the use case and understand their internals.
Interactive Graph Analytics with Spark (Daniel Darabos, Lynx Analytics) - Spark Summit
This document summarizes Daniel Darabos' talk about the design and implementation of the LynxKite graph analytics application. The key ideas discussed are: (1) using column-based attributes to avoid processing unused data, (2) making joins fast through co-located loading of sorted RDDs, (3) not reading or computing all the data through techniques like prefix sampling, and (4) using binary search for lookups instead of filtering for small key sets. Examples are provided to illustrate how these techniques improve performance and user experience of interactive graph analytics on Spark.
This document summarizes a system using Cassandra, Spark, and ELK (Elasticsearch, Logstash, Kibana) for processing streaming data. It describes how the Spark Cassandra Connector is used to represent Cassandra tables as Spark RDDs and write RDDs back to Cassandra. It also explains how data is extracted from Cassandra into RDDs based on token ranges, transformed using Spark, and indexed into Elasticsearch for visualization and analysis in Kibana. Recommendations are provided for improving performance of the Cassandra to Spark data extraction.
Strata NYC 2015 - Supercharging R with Apache Spark - Databricks
R is the favorite language of many data scientists. In addition to a language and runtime, R is a rich ecosystem of libraries for a wide range of use cases, from statistical inference to data visualization. However, handling large or distributed data with R is challenging, so R is used along with other frameworks and languages by most data scientists. In this mode most of the friction is at the interface of R and the other systems. For example, when data is sampled by a big data platform, results need to be transferred to and imported in R as native data structures. In this talk we show an alternative, and complementary, approach to SparkR for integrating Spark and R.
Since SparkR was released in version 1.4 of Apache Spark, distributed data remains inside the JVM instead of individual R processes running on workers. This approach is more convenient when dealing with external data sources such as Cassandra, Hive, and Spark’s own distributed DataFrames. We show two specific techniques to remove the data transfer friction between R and the JVM: collecting Spark DataFrames as R data frames, and user space filesystems. We think this model complements and improves the day-to-day workload of many data scientists who use R. Spark’s interactive query processing, especially with in-memory datasets, closely matches the R interactive session model. When integrated together, Spark and R can provide state-of-the-art tools for the entire end-to-end data science pipeline. We will show how such a pipeline works in real-world use cases in a live demo at the end of the talk.
Unlocking Your Hadoop Data with Apache Spark and CDH5 - SAP Concur
The Spark/Mesos Seattle Meetup group shares the latest presentation from their recent meetup event, showcasing real-world implementations of working with Spark within the context of your big data infrastructure.
Sessions are demo-heavy and slide-light, focusing on getting your development environment up and running, configuration issues, SparkSQL vs. Hive, etc.
To learn more about the Seattle meetup: http://www.meetup.com/Seattle-Spark-Meetup/members/21698691/
Optimizing Delta/Parquet Data Lakes for Apache Spark - Databricks
Matthew Powers gave a talk on optimizing data lakes for Apache Spark. He discussed community goals like standardizing method signatures. He advocated for using Spark helper libraries like spark-daria and spark-fast-tests. Powers explained how to build better data lakes using techniques like partitioning data on relevant fields to skip data and speed up queries significantly. He also covered modern Scala libraries, incremental updates, compacting small files, and using Delta Lakes to more easily update partitioned data lakes over time.
This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.
This document discusses time series databases and the Apache Parquet columnar storage format. It notes that time series databases store data for each point in time, such as weather or stock price data. Storage is optimized to minimize input/output by reading the minimum number of records. Apache Parquet provides a columnar storage format that allows for better compression, reduced input/output by scanning a subset of columns, and encoding of data types. It discusses Parquet terminology, encodings, and techniques for query optimization such as projection and predicate push-down, and choosing an appropriate Parquet block size.
End-to-end Data Pipeline with Apache Spark - Databricks
This document discusses Apache Spark, a fast and general cluster computing system. It summarizes Spark's capabilities for machine learning workflows, including feature preparation, model training, evaluation, and production use. It also outlines new high-level APIs for data science in Spark, including DataFrames, machine learning pipelines, and an R interface, with the goal of making Spark more similar to single-machine libraries like SciKit-Learn. These new APIs are designed to make Spark easier to use for machine learning and interactive data analysis.
Apache Cassandra Lunch #70: Basics of Apache Cassandra - Anant Corporation
In Cassandra Lunch #70, we discuss the basics of Apache Cassandra and set up a stand-alone Apache Cassandra instance.
Accompanying Blog: https://blog.anant.us/cassandra-launch-70-basics-of-apache-cassandra
Accompanying YouTube: https://youtu.be/o-yU0mi4nzc
Quality Assurance has undergone an evolution comparable to that of the human species. This presentation explains the growing importance of quality management systems in software development.
Created at the University of California, Berkeley, Apache Spark combines a distributed computing system running on computer clusters with a simple and elegant way of writing programs. Spark is considered the first open-source software that makes distributed programming truly accessible to data scientists. Here you can find an introduction and basic concepts.
DC/OS: The definitive platform for modern apps - Datio Big Data
DC/OS is an open source platform that provides container orchestration and management using Mesos. It allows running applications and services across data center infrastructure including bare metal, VMs, and cloud. DC/OS provides services like Marathon for container orchestration, security, monitoring, load balancing and service discovery. It has features like high resource utilization, mixed workload support, elastic scalability, high availability and zero downtime upgrades.
This document discusses security and governance considerations for big data. It notes that while many businesses use big data, they may not have sufficient access controls or security practices in place. Big data breaches can be large, making security critical. It then outlines some risks like insufficiently hardened systems, uncontrolled access, and unfulfilled regulatory requirements. The document introduces GOSEC as a centralized security component that manages access control across big data services like HDFS and applications. GOSEC allows setting access policies for users and groups for resources like files and topics. It covers authentication, authorization, and auditing. The document stresses the need to integrate GOSEC with the organization's identity provider and security strategies to prevent leakage of internal and external data.
Kafka Connect allows data ingestion into Kafka from external systems by using connectors. It provides scalability, fault tolerance, and exactly-once semantics. Connectors are run as tasks within workers that can run in either standalone or distributed mode. The Schema Registry works with Kafka Connect to handle schema validation and evolution.
The real purpose of any career plan should be to improve the skills of the person owning it, to discover his/her strong points, and to find out the things they need help with, eventually becoming a better professional and a more self-assured individual. Instead, then, we should start looking for a Personal Development Plan.
Comparative study of NoSQL document, column store databases and evaluation o... - ijdms
In the last decade, rapid growth in mobile applications, web technologies, and social media generating unstructured data has led to the advent of various NoSQL data stores. The demands of web scale are increasing every day, and NoSQL databases are evolving to meet stern big data requirements. The purpose of this paper is to explore NoSQL technologies and present a comparative study of document and column store NoSQL databases such as Cassandra, MongoDB and HBase across various attributes of relational and distributed database system principles. A detailed study and analysis of the architecture and internal workings of Cassandra, MongoDB and HBase is done theoretically and the core concepts are depicted. This paper also presents an evaluation of Cassandra for an industry-specific use case, and the results are published.
In this video, we explain the problems that led to the emergence of this type of database, the kinds of projects it can be used in, and a brief overview of its history, advantages, and disadvantages.
https://youtu.be/I9zgrdCf0fY
NOSQL in big data is the not only structure langua.pdf - ajajkhan16
This presentation discusses the limitations of relational database management systems (RDBMS) in handling large datasets and introduces NoSQL databases as an alternative. It begins by defining RDBMS and describing issues with scaling RDBMS to big data through techniques like master-slave architecture and sharding. It then defines NoSQL databases, explaining why they emerged and classifying them into key-value, columnar, document, and graph models. The presentation concludes that both RDBMS and NoSQL databases have advantages, suggesting a polyglot approach is optimal to handle different data storage needs.
This presentation is about NoSQL, which means Not Only SQL. It covers the aspects of using NoSQL for big data and the differences from RDBMS.
SURVEY ON IMPLEMENTATION OF COLUMN ORIENTED NOSQL DATA STORES (BIGTABLE & CA... - IJCERT JOURNAL
NoSQL is a database approach that provides a mechanism for the storage and retrieval of data modeled for the huge amounts of data used in big data and cloud computing. NoSQL systems are also called "Not only SQL" to emphasize that they may support SQL-like query languages. A basic classification of NoSQL is based on the data model: column, document, key-value, etc. The objective of this paper is to study and compare the implementation of various column-oriented data stores like Bigtable and Cassandra.
Modern databases and their challenges (SQL, NoSQL, NewSQL) - Mohamed Galal
Nowadays the amount of data has become very large; every organization produces a huge amount of data daily. Thus we need new technology to help store and query huge amounts of data in acceptable time. The old relational model may help with consistency, but it was not designed to deal with the big data problem. In these slides, I describe the relational model, the NoSQL models, and the NewSQL models with some examples.
This document discusses NoSQL databases and compares MongoDB and Cassandra. It begins with an introduction to NoSQL databases and why they were created. It then describes the key features and data models of NoSQL databases including key-value, column-oriented, document, and graph databases. Specific details are provided about MongoDB and Cassandra, including their data structure, query operations, examples of usage, and enhancements. The document provides an in-depth overview of NoSQL databases and a side-by-side comparison of MongoDB and Cassandra.
This document provides an introduction to NoSQL databases. It discusses that NoSQL databases are non-relational, do not require a fixed table schema, and do not require SQL for data manipulation. It also covers characteristics of NoSQL such as not using SQL for queries, partitioning data across machines so JOINs cannot be used, and following the CAP theorem. Common classifications of NoSQL databases are also summarized such as key-value stores, document stores, and graph databases. Popular NoSQL products including Dynamo, BigTable, MongoDB, and Cassandra are also briefly mentioned.
The document discusses NoSQL databases as an alternative to traditional SQL databases. It provides an overview of NoSQL databases, including their key features, data models, and popular examples like MongoDB and Cassandra. Some key points:
- NoSQL databases were developed to overcome limitations of SQL databases in handling large, unstructured datasets and high volumes of read/write operations.
- NoSQL databases come in various data models like key-value, column-oriented, and document-oriented. Popular examples discussed are MongoDB and Cassandra.
- MongoDB is a document database that stores data as JSON-like documents. It supports flexible querying. Cassandra is a column-oriented database developed by Facebook that is highly scalable
Here is my seminar presentation on NoSQL databases. It includes all the types of NoSQL databases, the merits & demerits of NoSQL databases, examples of NoSQL databases, etc.
For a seminar report on NoSQL databases please contact me: ndc@live.in
A NoSQL (often interpreted as Not Only SQL) database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.
Data management in cloud: study of existing systems and future opportunities - Editor Jacotech
This document discusses data management in cloud computing and provides an overview of existing NoSQL database systems and their advantages over traditional SQL databases. It begins by defining cloud computing and the need for scalable data storage. It then discusses key goals for cloud data management systems including availability, scalability, elasticity and performance. Several popular NoSQL databases are described, including BigTable, MongoDB and Dynamo. The advantages of NoSQL systems like elastic scaling and easier administration are contrasted with some limitations like limited transaction support. The document concludes by discussing opportunities for future research to improve scalability and queries in cloud data management systems.
The document provides an overview of NoSQL databases and MongoDB. It discusses:
- What NoSQL is and why it was created
- The different categories of NoSQL databases, including key-value stores, document databases, column family stores, and graph databases
- MongoDB specifically, including its flexible schema, horizontal scalability, replication support, and data modeling approach
- Comparisons between relational and NoSQL databases
The document provides an overview of NoSQL and MongoDB. It discusses that NoSQL databases were built for large datasets and cloud applications. It covers some of the main types of NoSQL databases like document stores, key-value stores, and column family stores. The document also compares NoSQL to SQL/relational databases, discussing how NoSQL is more flexible and scales horizontally. MongoDB is presented as a popular document-oriented NoSQL database, covering its flexible schema, horizontal scaling, and replication features.
This document discusses NoSQL databases and compares them to relational databases. It begins by explaining that NoSQL databases were developed to address scalability issues in relational databases. The document then categorizes NoSQL databases into four main types: key-value stores, column-oriented databases, document stores, and graph databases. For each type, popular examples are provided (e.g. DynamoDB, Cassandra, MongoDB) along with descriptions and use cases. The advantages of NoSQL databases over relational databases are also briefly touched on.
This document discusses emerging trends in databases, including NoSQL databases and object-oriented databases. It provides information on the characteristics, categories, advantages, and disadvantages of NoSQL databases. It also compares relational databases to object-oriented databases and discusses object-relational mapping.
The rising interest in NoSQL technology over the last few years resulted in an increasing number of evaluations and comparisons among competing NoSQL technologies. From this survey we create a concise and up-to-date comparison of NoSQL engines, identifying their most beneficial use from the software engineer's point of view.
DATABASE MANAGEMENT SYSTEM - MRS. LAXMI B PANDYA FOR 25TH AUGUST, 2022.pptx - Laxmi Pandya
The document discusses database management systems and provides examples of different types of databases including relational, non-relational, centralized, distributed and object-oriented databases. It describes key components of databases like fields, records, tables and the core functions of adding, deleting, modifying and retrieving records. The document also explains concepts like database languages, database models, database examples, database features and integrity constraints.
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING - ijiert bestjournal
This document summarizes a research paper that evaluates Cassandra and MongoDB NoSQL databases for processing unstructured data using Hadoop streaming. It proposes a system with three stages: data preparation where data is downloaded from Cassandra servers to file systems; data transformation where JSON data is converted to other formats using MapReduce; and data processing where non-Java executables run on the transformed data. The document reviews related work on Cassandra and Hadoop performance and discusses the data models of key-value, document, column-oriented, and graph databases. It concludes that comparing Cassandra and MongoDB can help process unstructured data and outline new approaches.
How to work with Python 3: how to create virtual environments, install libraries, create code skeletons, and more.
Maybe an IDE for Python is right for you. If you are familiar with IntelliJ, then PyCharm is your option. There are other options such as Visual Studio Code, PyDev, and Spyder, so you can choose the one you like the most.
And now you have no excuse not to start your first Python project.
How to document without dying in the attempt - Datio Big Data
The document discusses the importance of documentation. It notes that documenting knowledge allows future generations to understand the past and envision the future. From a practical perspective, documentation provides advantages such as allowing others to use knowledge, standardizing processes, and producing documentation easily. The document then provides guidelines for technical writing such as using an active voice and focusing on the document's goal.
Testing is important for project quality and reliability. Unit tests check individual code units for correct functionality, and are fast and independent. Behavior-driven development (BDD) captures business needs through scenarios defined in Gherkin. Test doubles (dummies, fakes, stubs and mocks) replace real code so that objects can be tested independently. Integration tests check the connections between components, while acceptance tests check overall project behavior from the user's perspective, ensuring the expected functionality, but they are slower and require more maintenance.
The document describes the characteristics and capabilities of Ceph, an open-source distributed object storage system. Ceph can scale horizontally to the exabyte level, has no single points of failure, and runs on commodity hardware. It uses a decentralized, object-based design that can expose storage as objects, files, or blocks.
Datio's Atlantis infrastructure is based on OpenStack. It consists of 3 controllers, 16 compute nodes, and distributed Ceph storage. At the physical level it includes servers, blades, and storage. At the logical level it defines 4 isolated networks and deploys the main OpenStack services, such as Keystone, Glance, Cinder and Neutron, in high availability. Users can manage the infrastructure through Horizon, the OpenStack API, or Ansible.
This document discusses data integration and architectures for processing both batch and streaming data. It covers topics like data ingestion using tools like Flume, Sqoop and Kafka to move data into data lakes and warehouses. It also discusses batch processing using MapReduce on Hadoop and stream processing using real-time technologies like Kafka and architectures like lambda and kappa for serving queries on both real-time and batch-processed views of the data.
How we have designed and applied the gamification at Datio.
See more about this topic in our blog:
http://www.datio.com/corporate/gwc-2016-towards-the-high-level-engagement/
http://www.datio.com/corporate/beginning-a-gamification-project/
Pandas: High Performance Structured Data Manipulation - Datio Big Data
A brief theoretical introduction to Pandas, the Python library used primarily for data manipulation and analysis. Here we explain when it is convenient to work with it, its main characteristics, and the operations it allows.
The Building Blocks of QuestDB, a Time Series Database - javier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data - Kiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W... - Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... - sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
End-to-end pipeline agility - Berlin Buzzwords 2024 - Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
Databases and how to choose them - January 2017

4. ACID vs BASE
● ACID:
● Atomicity. It embodies the concept of a transaction: a group of tasks that must be performed against a database as a unit. If one element of a transaction fails, the entire transaction fails.
● Consistency. This is usually defined as the property that guarantees that a transaction brings the database from one valid state (in a formal sense, not in a functional one) to another. In ACID, consistency just implies compliance with the defined rules: constraints, triggers, etc.
● Isolation. Each transaction must be independent in itself, meaning that it should not “see” the effects of other concurrent operations.
● Durability. This property ensures that once a transaction is complete, it will survive system failure, power loss and other types of system breakdown.
● BASE:
● Basically Available. This property states that the system ensures the availability of the data in a weak sense: there will be a response to any request (it could be inconsistent data or even an error).
● Soft state. On the way from eventual consistency to actual consistency, the state of the system can change over time, even while there are no input operations on the database. Thus, the state of the system is called “soft”.
● Eventual consistency. After the system stops receiving input, once the data has been propagated to every node, the system will eventually become consistent.
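To ground the ACID side, here is a minimal sketch of atomicity using Python's built-in sqlite3 module (the deck names no specific engine; the accounts table, its CHECK constraint and the amounts are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 0)")
conn.commit()

try:
    with conn:  # opens a transaction: commit on success, rollback on any exception
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 150 WHERE id = 2")
except sqlite3.IntegrityError:
    pass  # the whole transfer was rolled back, not just the failing statement

print(conn.execute("SELECT id, balance FROM accounts").fetchall())  # [(1, 100), (2, 0)]
```

The failed UPDATE does not leave a half-applied transfer behind: either both updates commit, or neither does.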
5. CAP THEOREM
● CAP:
● Consistency. C in CAP actually means “linearizability”, which is a very specific and strong notion of consistency that has nothing to do with the C in ACID (it has more to do with Atomicity and Isolation, in fact). A typical way to define it is: “if operation B started after operation A successfully completed, B must see the system in the same state as it was on completion of operation A, or a newer state”. Thus, a system is consistent if an update is applied to all nodes at the same time.
(Diagram: a client writes name=Alice to one node; a subsequent read of name from any other node must return Alice.)
6. CAP THEOREM
● CAP:
● Availability. A in CAP is defined as “every request received by a non-failing database node must result in a non-error response”. This is both a strong and a weak requirement: 100% of the requests must return a response, but the response can take an unbounded (yet finite) amount of time. Since people tend to care about latency, a very slow response usually makes a system “not available” for its users in practice.
7. CAP THEOREM
● CAP:
● Partition Tolerance. P in CAP means… well, it is not entirely clear. Some definitions state that the system keeps working even if some nodes, or the connection between two of them, fail. This kind of definition is what leads people to apply the CAP theorem to monolithic, single-node relational databases (which would qualify as CA). A multi-node system not requiring partition tolerance would have to run on a network that never drops messages and whose nodes can never fail. Since no such system exists, P in CAP can’t simply be traded away by design.
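To see what trading consistency for availability looks like, here is a toy illustration (not from the deck) of two replicas that keep accepting writes during a partition and later converge with a last-write-wins rule, i.e. eventual consistency rather than linearizability:

```python
import time

# Two replicas; each value is stored as a (timestamp, value) pair.
replica_a, replica_b = {}, {}

def write(replica, key, value):
    replica[key] = (time.time(), value)

# During a partition each side accepts writes independently...
write(replica_a, "name", "Alice")
write(replica_b, "name", "Bob")
print(replica_a["name"][1], replica_b["name"][1])  # Alice Bob  (inconsistent)

# ...and when the partition heals, an anti-entropy pass converges both
# sides on the newest write (last-write-wins).
for key in set(replica_a) | set(replica_b):
    newest = max(replica_a.get(key, (0, None)), replica_b.get(key, (0, None)))
    replica_a[key] = replica_b[key] = newest

print(replica_a["name"][1], replica_b["name"][1])  # Bob Bob  (converged)
```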
9. Isolation
● Isolation.
In database systems, isolation determines how and when the changes made by one transaction become visible to other concurrent users and systems. Though it is often treated in a relaxed way, this ACID property of a DBMS (Database Management System) is an important part of any transactional system.
Acquiring locks on data is the usual way to achieve a given isolation level: the more locks an executing transaction takes, the higher the isolation level. On the other hand, locks have an impact on performance.
10. Isolation
● Isolation levels.
Each isolation level is defined by which concurrency phenomena it allows:

ISOLATION LEVEL     DIRTY READS   UNREPEATABLE READS   PHANTOM READS
READ UNCOMMITTED    possible      possible             possible
READ COMMITTED      prevented     possible             possible
REPEATABLE READS    prevented     prevented            possible
SERIALIZABLE        prevented     prevented            prevented
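A minimal sketch of the no-dirty-reads guarantee, again using sqlite3 (two connections to the same file stand in for two concurrent sessions; the users table and its values are invented):

```python
import sqlite3, tempfile, os

# Two connections to the same database file act as two concurrent sessions.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer = sqlite3.connect(path, isolation_level=None)  # manage transactions by hand
reader = sqlite3.connect(path, isolation_level=None)

writer.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, age INTEGER)")
writer.execute("INSERT INTO users VALUES (1, 23)")

writer.execute("BEGIN")
writer.execute("UPDATE users SET age = 99 WHERE id = 1")  # not committed yet

# The reader cannot see the uncommitted change: no dirty read.
print(reader.execute("SELECT age FROM users WHERE id = 1").fetchall())  # [(23,)]

writer.execute("COMMIT")
print(reader.execute("SELECT age FROM users WHERE id = 1").fetchall())  # [(99,)]
```

An engine running at READ UNCOMMITTED would return 99 in the first read as well.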
11. Indexes
● Indexes
A database index is a data structure that improves the speed of searches on a database table, with the trade-off of slower write performance, due to the additional writes and storage space needed to maintain the index data structure. Indexes are used to quickly locate data without having to search every row in a database table.
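As an illustration, the effect of an index is easy to observe in SQLite via EXPLAIN QUERY PLAN (a minimal sketch; the users table and the idx_users_age index are invented, and the exact plan wording varies between SQLite versions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users (name, age) VALUES (?, ?)",
                 [(f"user{i}", i % 80) for i in range(10_000)])

# Without an index: a full table scan.
print(conn.execute("EXPLAIN QUERY PLAN SELECT * FROM users WHERE age = 42").fetchall())
# -> plan reads roughly "SCAN users"

conn.execute("CREATE INDEX idx_users_age ON users (age)")

# With the index: the planner seeks directly into the index structure.
print(conn.execute("EXPLAIN QUERY PLAN SELECT * FROM users WHERE age = 42").fetchall())
# -> plan reads roughly "SEARCH users USING INDEX idx_users_age (age=?)"
```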
12. Indexes
● Inverted indexes
An inverted index is a data structure that maps content to its locations in a database file (in contrast to a forward index, which maps from documents to content). The purpose of an inverted index is to allow fast full-text searches, at the cost of increased processing and intensive use of resources.
(Example documents: 1: “This document can be stored in ElasticSearch.” 2: “ElasticSearch is a document oriented database.”)
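A toy inverted index over those two example documents might look like this (a sketch only; real engines add tokenization rules, stemming and positional information):

```python
from collections import defaultdict

docs = {
    1: "This document can be stored in ElasticSearch.",
    2: "ElasticSearch is a document oriented database.",
}

# Map every term to the set of documents that contain it.
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted[term.strip(".,")].add(doc_id)

print(sorted(inverted["document"]))       # [1, 2]
print(sorted(inverted["elasticsearch"]))  # [1, 2]
print(sorted(inverted["database"]))       # [2]
```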
13. Sharding
● Sharding
Shards are partitions of data within a database. Since each partition is smaller than the whole database, a query using the shard key (the field that determines the partition) avoids a full scan, so there is a dramatic improvement in search performance.
On the other hand, sharding implies a strong dependency on the network, with higher latency when querying several shards, as well as consistency concerns when data is replicated among several shards (as it should be, for high-availability needs).
It also introduces additional complexity in design (the partition key must be carefully chosen) and development (load balancing, replication, failover, etc.).
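A minimal sketch of key-based shard routing (the shard names and the hash-modulo scheme are illustrative assumptions, not from the deck):

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def shard_for(key: str) -> str:
    """Route a record to a shard by hashing its shard key."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Every session routes the same key to the same shard, so a query that
# includes the shard key touches exactly one partition; a query without
# it must be fanned out to all shards (scatter-gather).
print(shard_for("user:1"), shard_for("user:2"))
```

Note that a plain hash-modulo scheme reshuffles most keys when the number of shards changes, which is why production systems usually prefer consistent hashing or range partitioning.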
14. Databases and how to choose them - January 2017
Database types
15. Databases and how to choose them - January 2017
Database types
● Database types
As a first approach, we can distinguish the following kinds of databases:
● Relational
● Key-value, column-oriented
● Document-oriented
● Graph
We deliberately exclude the pure key-value type (beyond the column-oriented variant above) because of the naive approach of its main
players to several production use cases, and because some of its features overlap with the types already listed.
16. Databases and how to choose them - January 2017
Database types
● Relational columnar storage.
The concept of relational databases is widely known and involves some of the topics already treated in this document, especially ACID.
Recently, RDBMSs have also covered the schema-less need, so their strengths are consistency under heavy read and write loads and the
widespread knowledge of both their design and their query language.
Columnar storage can be seen as a transposition of the common row storage, meaning that a users table with the columns id, name,
surname and age would be laid out as:
1, 2, 3; Alice, Bob, Charles; Adams, Brown, Cooper; 23, 42, 34
Columnar models are very useful for some use cases. A common example is selecting a single field, or computing an average: instead of
going through every row to access the field age, a columnar model reads exactly the area where age is stored.
These models are still relational (thus, ACID), and they are suitable for use cases needing very good read performance up to a certain
volume limit (say, under one terabyte).
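The transposition is easy to see in a few lines of Python: the same three rows in row layout and in columnar layout, where the average of age only touches one contiguous sequence:

```python
# The same three rows in row layout and in its columnar transposition;
# the average over "age" only touches one contiguous sequence.
rows = [
    (1, "Alice",   "Adams",  23),
    (2, "Bob",     "Brown",  42),
    (3, "Charles", "Cooper", 34),
]

# Transpose: one sequence per column, as in the layout above.
ids, names, surnames, ages = map(list, zip(*rows))

print(sum(ages) / len(ages))  # 33.0 -- no need to walk whole rows
```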
17. Databases and how to choose them - January 2017
Database types
● Column-oriented databases.
● A common misunderstanding conflates columnar storage in relational databases with column-oriented databases, such as Cassandra.
● Column-oriented databases store data in column families: rows that have many columns associated with a row key. Column
families are groups of related data that are often accessed together.
● Each column family can be compared to a container of rows in an RDBMS table, where the key identifies the row and the row consists
of multiple columns (and here is where the key-value concept appears).
● These databases depend strongly on design, since they are meant to be accessed by key. Secondary indexes are
allowed, but they do not bring good enough performance for operational needs.
users (column family)
row key 1 → “Name”: “Alice”   “Surname”: “Adams”   “age”: “23”
row key 2 → “Name”: “Bob”     “Surname”: “Brown”   “age”: “42”
row key 3 → “Name”: “Charles” “Surname”: “Cooper”  “age”: “34”
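A plain-Python sketch of the users column family above, to make the access pattern concrete; this models the structure only, not a real Cassandra client:

```python
# The "users" column family as a row key -> named columns map.
users = {
    "1": {"Name": "Alice",   "Surname": "Adams",  "age": "23"},
    "2": {"Name": "Bob",     "Surname": "Brown",  "age": "42"},
    "3": {"Name": "Charles", "Surname": "Cooper", "age": "34"},
}

print(users["2"]["Name"])  # fast: access by row key, as designed

# Finding rows by a non-key column means scanning every row -- the
# weak spot that secondary indexes only partially fix.
print([k for k, cols in users.items() if cols["age"] == "34"])
```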
18. Databases and how to choose them - January 2017
Database types
● Document-oriented databases.
● Just like it sounds, document-oriented databases store documents, typically in a JSON format. They are a certain kind of key-value
storage, with the nuance of having an internal structure that their engines use to query the data.
● The way of viewing the data resembles the relational one, except that no schema or relational constraints are required.
● The main difference between the two worlds is the ACID vs. BASE distinction, which translates into horizontal scaling capabilities.
● Thus, these systems can offer good performance operating with several terabytes.
● ElasticSearch is a peculiar example of a document-oriented database. It is very suitable for Full Text Search, and its capabilities
(making use of the aforementioned inverted indexes) allow solving ad-hoc, non-predefined searches within operational-use-case
response times.
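A hedged sketch with pymongo (MongoDB) showing the schema-less side: the database, collection, and field names are hypothetical, and a local mongod is assumed:

```python
# Schema-less JSON-like documents, inserted and queried without DDL.
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
people = client["demo"]["people"]

# No schema declared: each document carries its own structure.
people.insert_one({"name": "Alice", "age": 23, "tags": ["admin"]})
people.insert_one({"name": "Bob", "age": 42})  # a different shape is fine

print(people.find_one({"age": {"$gt": 30}}))  # query by inner fields
```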
19. Databases and how to choose them - January 2017
Database types
● Graph databases.
● Graph databases use the mathematical concept of a graph to store data. A graph consists of nodes, edges and properties, which are
used to query for the desired information.
● The main advantage of these systems is their high performance in use cases involving many SQL joins, since those cases are really
about following the relations between nodes.
● Write performance (and read performance without joins) is below what other systems offer, so these databases are
quite polarized towards their use case.
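A hedged sketch with the official Neo4j Python driver: the "join" becomes a path pattern in Cypher. The URI, credentials, and the User/Product model are hypothetical:

```python
# Friends-of-friends who bought a product: several SQL joins, but a
# single path traversal in a graph.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "secret"))

query = """
MATCH (me:User {name: $name})-[:FRIEND]->()-[:FRIEND]->(fof:User),
      (fof)-[:BOUGHT]->(p:Product)
RETURN DISTINCT p.name
"""

with driver.session() as session:
    for record in session.run(query, name="Alice"):
        print(record["p.name"])
driver.close()
```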
20. Databases and how to choose them - January 2017
Database types
● High-level comparison.

|                   | Relational (row-based)                  | Relational (columnar)          | Document-oriented                                    | Key-value / column-oriented                                  | Graph                                                       |
|-------------------|-----------------------------------------|--------------------------------|------------------------------------------------------|--------------------------------------------------------------|-------------------------------------------------------------|
| Basic description | Data structured in rows                 | Data stored in columns         | Data stored in (potentially) unstructured documents  | Data structured as key-value maps                            | Data structured as nodes and edges (graphs) with relations  |
| Strengths         | ACID; good performance; low complexity  | ACID; good read performance    | Scalability; good read performance                   | Scalability; good write performance                          | ACID; good read performance                                 |
| Weaknesses        | Scalability                             | Scalability; counter-intuitive | Consistency; complexity                              | Strong design dependency; use-case polarization              | Scalability; complexity; counter-intuitive                  |
| Typical use cases | Online operational with ACID needs      | Read-only without scaling out  | Heavy reads with a high volume of records            | Heavy writes with a high volume of records and reads by key  | SQL joins (relations)                                       |
| Key players       | PostgreSQL                              | PostgreSQL                     | ElasticSearch, MongoDB                               | Cassandra                                                    | Neo4J                                                       |
21. Databases and how to choose them - January 2017
Database types
● Radar graph.
23. Databases and how to choose them - January 2017
Use cases
● CRUD over an entity
● For typical CRUD operations (and, maybe, listing) over a certain entity, in a RESTful way, the very first option should be an
RDBMS. They provide:
○ good write and read performance
○ (typically) lots of features
○ (typically) the advantages of SQL modeling and the SQL language, which qualify them for straightforward usage.
● Note that CRUD over an entity usually implies accessing data by a unique key, which would be the entity id. Accessing one
entity, or several (listing), by other fields would require creating indexes.
● Both scenarios fit well in an RDBMS as long as the WHERE clause fields are known in advance, but the possibility of scaling out
has to be considered. If the volume of data may grow too much, a document-oriented database could be the logical alternative.
● In particular, MongoDB covers essentially the same use cases as PostgreSQL, the former being chosen when
volume is (or could become) high, and the latter when ACID capabilities are more important.
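A minimal sketch of CRUD by entity id, using the stdlib sqlite3 module as a stand-in for a full RDBMS; the entity table is hypothetical:

```python
# CRUD by entity id against a (stand-in) relational database.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entity (id INTEGER PRIMARY KEY, name TEXT)")

db.execute("INSERT INTO entity (name) VALUES (?)", ("widget",))          # Create
print(db.execute("SELECT * FROM entity WHERE id = ?", (1,)).fetchone())  # Read
db.execute("UPDATE entity SET name = ? WHERE id = ?", ("gadget", 1))     # Update
db.execute("DELETE FROM entity WHERE id = ?", (1,))                      # Delete

# Listing by a non-id field would require its own index:
# CREATE INDEX idx_entity_name ON entity (name)
```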
24. Databases and how to choose them - January 2017
Use cases
● FTS or searching by any field
● Performing searches by any field involves creating lots of indexes, in the way PostgreSQL or MongoDB treat them.
● Instead, using ElasticSearch would be much more effective. The same logic applies to Full Text Search, where Elastic's
inverted indexes are the solution.
● ElasticSearch's intensive use of resources prevents it from being used for other use cases, such as the aforementioned CRUD
over an entity or more targeted accesses (by id or by known fields).
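A hedged sketch with the elasticsearch-py client (exact signatures vary across client versions); the index and field names are hypothetical and a local node is assumed:

```python
# A full-text "match" query served by the inverted index.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

es.index(index="docs", id=1,
         body={"text": "This document can be stored in ElasticSearch."})
es.index(index="docs", id=2,
         body={"text": "ElasticSearch is a document oriented database"})

hits = es.search(index="docs",
                 body={"query": {"match": {"text": "oriented database"}}})
for hit in hits["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["text"])
```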
25. Databases and how to choose them - January 2017
Use cases
● High-volume loads
● Cassandra is the system that provides the best write performance and scalability.
● A typical use case could be a logging system, as long as it is accessed only by date or by component name.
● If there is a high volume of online writes, but access cannot be done by a unique field, then we can choose among other
products, according to the previous considerations.
● It is important to know that reindexing operations have a big impact on performance. If it is not possible to
switch off the indexes while writing (as in a typical online operation), MongoDB and PostgreSQL could be worse options than
ElasticSearch.
● On the other hand, in scenarios with heavy writes and heavy reads, consistency becomes relevant, so PostgreSQL may have the edge.
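A hedged sketch with the DataStax cassandra-driver: writes keyed by day so that later reads by date hit a single partition. The keyspace and table are hypothetical and assumed to exist:

```python
# High-volume writes via a prepared statement.
import datetime
import uuid

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("logs_ks")

# Prepared statements are the efficient path for high-volume writes.
insert = session.prepare(
    "INSERT INTO logs (day, ts, component, message) VALUES (?, ?, ?, ?)")

# Partition key = day: later reads by date touch one partition only.
session.execute(insert, (datetime.date.today(), uuid.uuid1(),
                         "auth", "login ok"))
cluster.shutdown()
```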
26. Databases and how to choose them - January 2017
Use cases
● Relations
● Fraud detection or a recommendation engine are typical cases needing a lot of SQL joins, since they are all about
querying several entities of the same type by a variety of fields, possibly combined with entities of a different type.
● In a graph, that amounts to following a path among several nodes, so a graph database is natively more efficient for it.
● Scalability or consistency could be concerns in those cases.
27. Databases and how to choose them - January 2017
Use cases
● Analytics
● Analytics use cases usually involve:
○ a huge volume of data
○ much more relaxed processing times
○ a much lower level of concurrency.
● For those cases, batch jobs accessing a DFS (distributed file system) can be enough.
28. Databases and how to choose them - January 2017
Best and bad practices
29. Databases and how to choose them - January 2017
Best and bad practices
● Best practices:
● Choose the right database for each use case.
● A new “materialized view” is better than fighting against problems: there is no silver bullet.
● Avoid BLOBs.
● Schemas are good: they keep order and are intuitive.
● Mind the CAP.
30. Databases and how to choose them - January 2017
Best and bad practices
● Bad practices:
● Over…
○ indexing
○ normalization
○ provisioning of resources
● Relational mindset (forcing every problem into tables and joins)
● Split brain (ignoring what happens when the cluster partitions)
● Fashion victim (choosing a database because it is trendy)