Apache CarbonData is a fully indexed, columnar, Hadoop-native data store for processing heavy analytical workloads and detailed queries on big data. In customer benchmarks, CarbonData has managed petabytes of data on very low-cost hardware and answered queries around 10 times faster than current open-source column-oriented SQL-on-Hadoop data stores.
Apache CarbonData: An Indexed Columnar File Format for Interactive Query with... (Spark Summit)
Real-time analytics over large datasets has become an increasingly widespread demand. Over the past several years the Hadoop ecosystem has evolved continuously: even complex queries over large datasets can now be answered interactively with distributed processing frameworks like Apache Spark, and new storage paradigms have been introduced to support them, such as Apache Parquet and ORC, which provide fast scans over columnar data, and Apache HBase, which offers fast ingest and millisecond-scale random access.
In this talk, we outline Apache CarbonData, a new addition to the open-source Hadoop ecosystem: an indexed columnar file format aimed at bridging the gap to fully enable real-time analytics. It is deeply integrated with Spark SQL and dramatically accelerates query processing by leveraging efficient encoding/compression and effective predicate pushdown through CarbonData's multi-level index.
2. 2
Outline
Use Case & Motivation: Why introduce a new file format?
CarbonData File Format Deep Dive
Frameworks Integrated with CarbonData
Performance
Demo
Future Plan
3. 3
Use case: Sequential scan
Full table scan
Big scan & fast batch processing
Fetches only a few columns of the table
Common usage scenarios:
ETL jobs
Log analysis
[Figure: table grid of columns C1–C7 and rows R1–R10; a few columns are scanned across all rows]
5. 5
Use case: Random Access
Predicate filtering on a range of columns
Full row-key or key-range lookups
Narrow scan, but might fetch all columns
Requires second or sub-second latency
Common usage scenarios:
Operational queries
User profiling
[Figure: table grid of columns C1–C7 and rows R1–R10; a few full rows are accessed]
7. 7
Design Goals
Low latency for various data access patterns
Allow fast queries on fast data
Ensure space efficiency
A general format available across the Hadoop ecosystem
CarbonData:
Read-optimized columnar storage
Multi-level indexes for low latency
Column groups to retain the benefits of row-based storage
Dictionary encoding enabling deferred decoding for aggregation
Optimized streaming-ingestion support
Broad integration across the Hadoop ecosystem
8. 8
Outline
Use Case & Motivation: Why introduce a new file format?
CarbonData File Format Deep Dive
Frameworks Integrated with CarbonData
Performance
Demo
Future Plan
9. 9
CarbonData File Structure
Blocklet: a set of rows in columnar format
• Default blocklet size: ~120K rows
• Balances efficient scan against compression
Column chunk: data for one column or column group in a blocklet
• Multiple columns may form a column group, stored together in row-based format
• Column data is stored as a sorted index
Footer: metadata information
• File-level metadata & statistics
• Schema
• Blocklet index & blocklet-level metadata
[Figure: Carbon file layout: Blocklet 1 … Blocklet N, each containing column chunks (Col1, Col2, …) and column-group chunks (Colgroup1, Colgroup2, …), followed by the Footer]
10. 10
Carbon Data File Format
Blocklet 1 … Blocklet N
• Column chunks (Column 1 Chunk, Column 2 Chunk, …) and column-group chunks (ColumnGroup 1 Chunk, ColumnGroup 2 Chunk, …)
File Footer
• File metadata: version, number of rows, segment info, …
• Schema: schema for each column
• Blocklet Index: one index node per blocklet, each holding a min-max index (min, max) and a multi-dimensional index (startKey, endKey)
• Blocklet Info: one entry per blocklet, with chunk info for each column and column group (compression scheme, column format, column ID list, column chunk length, column chunk offset)
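To make the footer layout concrete, here is a minimal Scala model of the metadata hierarchy described above. The names and field types are illustrative assumptions for exposition, not CarbonData's actual format definitions.

case class ChunkInfo(
  compressionScheme: String,   // e.g. "SNAPPY"
  columnFormat: String,
  columnIds: Seq[Int],         // member columns of the chunk
  chunkLength: Long,           // bytes occupied by the chunk
  chunkOffset: Long)           // chunk position within the file

case class BlockletIndexNode(
  min: Array[Byte], max: Array[Byte],          // min-max index
  startKey: Array[Byte], endKey: Array[Byte])  // multi-dimensional key range

case class BlockletInfo(chunks: Seq[ChunkInfo])

case class FileFooter(
  version: Int,
  numRows: Long,
  segmentInfo: String,
  columnSchemas: Seq[String],            // schema for each column
  blockletIndex: Seq[BlockletIndexNode],
  blockletInfo: Seq[BlockletInfo])

A reader needs only this footer to decide which blocklets to visit and where each needed column chunk sits, which is what enables the pruning and column-wise reads described later.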
11. 11
Years Quarters Months Territory Country Quantity Sales
2003 QTR1 Jan EMEA Germany 142 11,432
2003 QTR1 Jan APAC China 541 54,702
2003 QTR1 Jan EMEA Spain 443 44,622
2003 QTR1 Feb EMEA Denmark 545 58,871
2003 QTR1 Feb EMEA Italy 675 56,181
2003 QTR1 Mar APAC India 52 9,749
2003 QTR1 Mar EMEA UK 570 51,018
2003 QTR1 Mar Japan Japan 561 55,245
2003 QTR2 Apr APAC Australia 525 50,398
2003 QTR2 Apr EMEA Germany 144 11,532
Encoding (blocklet logical view):
[1,1,1,1,1] : [142,11432]
[1,1,1,3,2] : [541,54702]
[1,1,1,1,3] : [443,44622]
[1,1,2,1,4] : [545,58871]
[1,1,2,1,5] : [675,56181]
[1,1,3,3,6] : [52,9749]
[1,1,3,1,7] : [570,51018]
[1,1,3,2,8] : [561,55245]
[1,2,4,3,9] : [525,50398]
[1,2,4,1,1] : [144,11532]
Sort (MDK index):
[1,1,1,1,1] : [142,11432]
[1,1,1,1,3] : [443,44622]
[1,1,1,3,2] : [541,54702]
[1,1,2,1,4] : [545,58871]
[1,1,2,1,5] : [675,56181]
[1,1,3,1,7] : [570,51018]
[1,1,3,2,8] : [561,55245]
[1,1,3,3,6] : [52,9749]
[1,2,4,1,1] : [144,11532]
[1,2,4,3,9] : [525,50398]
Within a blocklet:
• Data are sorted along the MDK (multi-dimensional key)
• Data are stored as an index in columnar format
[Figure: the sorted blocklet laid out column by column (C1–C7); each dimension column stores its run of dictionary codes in MDK order (e.g. C1 is all 1s, C2 ends in 2,2), and the measure columns store the Quantity values (142, 443, 541, …) and Sales values (11432, 44622, 54702, …)]
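The encode-then-sort step above is easy to sketch in Scala. The snippet below is a toy illustration under our own naming, not CarbonData internals: it dictionary-encodes the dimension columns, forms the multi-dimensional key, and sorts rows by it. Codes are assigned here in order of first appearance, so the exact code values will differ from the slide, which uses a prebuilt dictionary.

import scala.collection.mutable

case class Row(dims: Seq[String], measures: Seq[Long])

def mdkEncodeAndSort(rows: Seq[Row]): Seq[(Seq[Int], Seq[Long])] = {
  // One dictionary per dimension column; codes start at 1.
  val dicts = Array.fill(rows.head.dims.size)(mutable.LinkedHashMap[String, Int]())
  val encoded = rows.map { r =>
    val mdk = r.dims.zipWithIndex.map { case (v, i) =>
      dicts(i).getOrElseUpdate(v, dicts(i).size + 1)
    }
    (mdk, r.measures)
  }
  // Lexicographic sort along the multi-dimensional key.
  val ord = new Ordering[Seq[Int]] {
    def compare(a: Seq[Int], b: Seq[Int]): Int =
      a.zip(b).collectFirst { case (x, y) if x != y => x.compare(y) }
        .getOrElse(a.length.compare(b.length))
  }
  encoded.sortBy(_._1)(ord)
}

Feeding the ten rows of the sales table above through this function produces tuples of the [1,1,1,1,1] : [142,11432] shape, sorted as in the "Sort (MDK index)" listing.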
14. 14
Column Group
• Multiple columns may form a column group
• Stored as a single column chunk in row-based format
• Suitable for sets of columns frequently fetched together
• Saves stitching cost when reconstructing rows
Sample blocklet data:
C1  C2  C3  C4  C5  C6
10  2   23  23  38  15.2
10  2   50  15  29  18.5
10  3   51  18  52  22.8
11  6   60  29  16  32.9
12  8   68  32  18  21.6
[Figure: Blocklet 1 stores C1–C4 as individual column chunks, and C5–C6 together as one row-based column-group chunk]
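A column group trades pure columnar layout for cheaper row reconstruction. The toy sketch below (hypothetical names, not CarbonData code) shows the difference: grouped columns are written interleaved row by row, so fetching a row's C5 and C6 together takes one chunk read and no stitching.

case class Chunk(values: Vector[Any])

// Pure columnar: one chunk per column; reconstructing a row must
// stitch value i from every chunk.
def asColumnar(c5: Vector[Int], c6: Vector[Double]): Seq[Chunk] =
  Seq(Chunk(c5), Chunk(c6))

// Column group: C5 and C6 interleaved in a single row-based chunk,
// so each row's (c5, c6) pair is stored adjacently.
def asColumnGroup(c5: Vector[Int], c6: Vector[Double]): Chunk =
  Chunk(c5.zip(c6).flatMap { case (a, b) => Vector(a, b) })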
15. 15
Nested Data Type Representation
Arrays
• Represented as a composite of two columns
• One column for the element values
• One column for the start_index & length of each array
Structs
• Represented as a composite of a finite number of columns
• Each struct element is a separate column
Example (array column):
Name   Array<Ph_Number>          Name   Array[start,len]   Ph_Number
John   [192,191]                 John   0,2                192
Sam    [121,345,333]             Sam    2,3                191
Bob    [198,787]                 Bob    5,2                121
                                                           345
                                                           333
                                                           198
                                                           787
Example (struct column):
Name   Info Struct<age,gender>   Name   Info.age   Info.gender
John   [31,M]                    John   31         M
Sam    [45,F]                    Sam    45         F
Bob    [16,M]                    Bob    16         M
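The array decomposition above is a classic offsets-plus-values layout. A minimal Scala sketch, with hypothetical names:

// Flatten an array column into (start, length) pairs plus one flat
// value column, as in the Name/Ph_Number example. Illustrative only.
def flattenArrayColumn[T](rows: Seq[Seq[T]]): (Seq[(Int, Int)], Seq[T]) = {
  val starts = rows.scanLeft(0)(_ + _.length)   // running start index
  (starts.zip(rows.map(_.length)), rows.flatten)
}

// flattenArrayColumn(Seq(Seq(192, 191), Seq(121, 345, 333), Seq(198, 787)))
// => (Seq((0,2), (2,3), (5,2)), Seq(192, 191, 121, 345, 333, 198, 787))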
16. 16
Encoding & Compression
• Efficient encoding schemes supported:
• DELTA, RLE, BIT_PACKED
• Dictionary:
• medium-to-high cardinality: file-level dictionary
• very low cardinality: table-level global dictionary
• CUSTOM
• Compression scheme: Snappy
Big wins of dictionary encoding:
• Speeds up aggregation
• Reduces runtime memory footprint
• Enables deferred decoding
• Enables fast distinct count
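To see why dictionary encoding is a big win, consider aggregation: grouping can run entirely on the small integer codes, and only the final result is decoded. A toy Scala illustration (our own naming, not CarbonData internals; codes assumed 0-based):

// Aggregate on dictionary codes, decode once at the end.
def sumByGroup(codes: Array[Int],      // dictionary-encoded group column
               measure: Array[Long],   // measure values, same row order
               dict: Array[String])    // code -> original string
    : Map[String, Long] = {
  val sums = new Array[Long](dict.length)
  var i = 0
  while (i < codes.length) {           // tight loop over ints, no strings
    sums(codes(i)) += measure(i)
    i += 1
  }
  // Deferred decoding: only |dict| lookups instead of one per row.
  dict.zipWithIndex.map { case (v, c) => v -> sums(c) }.toMap
}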
17. 17
Outline
Use Case & Motivation: Why introduce a new file format?
CarbonData File Format Deep Dive
Frameworks Integrated with CarbonData
Performance
Demo
Future Plan
19. 19
Spark Integration
• Query CarbonData tables
• DataFrame API
• Spark SQL statements
• Support schema evolution of a Carbon table via ALTER TABLE
• Add, delete or rename columns
• Schema update only; data stored on disk is untouched

CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  STORED BY 'org.carbondata.hive.CarbonHandler'
  [TBLPROPERTIES (property_name=property_value, ...)]
  [AS select_statement];
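As a hedged sketch of the two query paths above (shown with a Spark 2.x-style session for brevity, although the deck predates Spark 2.0; table, database, and column names are placeholders): the table can be queried with SQL once registered, or loaded through the DataFrame API using the same data source format string the ingestion slide uses for writing.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("carbon-demo").getOrCreate()

// 1. Spark SQL statement against a registered Carbon table.
spark.sql("SELECT PROD_BRAND_NAME, SUM(STR_ORD_QTY) " +
          "FROM tbl1 GROUP BY PROD_BRAND_NAME").show()

// 2. DataFrame API over the same source.
val df = spark.read
  .format("org.apache.spark.CarbonSource")
  .options(Map("dbName" -> "db1", "tableName" -> "tbl1"))
  .load()
df.groupBy("PROD_BRAND_NAME").sum("STR_ORD_QTY").show()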
20. 20
Spark Integration
Query optimization:
• Vectorized record reading
• Predicate pushdown leveraging the multi-level index
• Column pruning
• Deferred decoding for aggregation
[Figure: a table is a set of blocks; each block contains blocklets plus a footer with index; a table-level MDK tree index spans the blocks, and inverted indexes cover the column chunks (C1–C9)]
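Here is a minimal sketch of the predicate-pushdown idea under assumed names: before touching data, the reader consults each blocklet's min-max index and skips blocklets whose range cannot satisfy the filter.

// Toy min-max pruning over blocklet index entries; illustrative names,
// not CarbonData's reader API.
case class BlockletMinMax(id: Int, min: Long, max: Long)

// Keep only blocklets whose [min, max] range can contain a value
// satisfying lo <= v <= hi; the rest are never read from disk.
def pruneForRange(index: Seq[BlockletMinMax], lo: Long, hi: Long): Seq[Int] =
  index.filter(b => b.max >= lo && b.min <= hi).map(_.id)

// pruneForRange(Seq(BlockletMinMax(0, 1, 99), BlockletMinMax(1, 200, 450)),
//               100, 150)  // => Seq()  -- both blocklets skipped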
21. 21
Data Ingestion
• Bulk data ingestion
• CSV file conversion
• MDK clustering level: load level vs. node level
• Save a Spark DataFrame as a Carbon data file:

df.write
  .format("org.apache.spark.CarbonSource")
  .options(Map("dbName" -> "db1", "tableName" -> "tbl1"))
  .mode(SaveMode.Overwrite)
  .save("/path")

LOAD DATA [LOCAL] INPATH 'folder path' [OVERWRITE]
INTO TABLE tablename
OPTIONS(property_name=property_value, ...)

INSERT INTO TABLE tablename AS select_statement1 FROM table1;
22. 22
Data Compaction
• Data compaction is used to merge small files
• Re-clustering across loads
• Two types of compaction:
- Minor compaction
• Compacts adjacent files into a single big file (~HDFS block size)
- Major compaction
• Reorganizes adjacent loads to achieve better clustering along the MDK index
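For illustration, a sketch of the minor-compaction grouping decision under hypothetical names: adjacent small files (in load order) are batched until a batch approaches the HDFS block size, and each batch is then rewritten as a single file.

import scala.collection.mutable.ArrayBuffer

case class DataFile(path: String, bytes: Long)

// Toy minor-compaction planner; illustrative, not CarbonData code.
def planMinorCompaction(files: Seq[DataFile],
                        targetBytes: Long = 128L * 1024 * 1024): Seq[Seq[DataFile]] = {
  val batches = ArrayBuffer(ArrayBuffer.empty[DataFile])
  var batchSize = 0L
  for (f <- files) {
    if (batchSize + f.bytes > targetBytes && batches.last.nonEmpty) {
      batches += ArrayBuffer.empty[DataFile]   // start a new output file
      batchSize = 0L
    }
    batches.last += f
    batchSize += f.bytes
  }
  batches.filter(_.nonEmpty).map(_.toSeq).toSeq
}

Each resulting batch would then be merge-sorted along the MDK index and written out as one file of roughly HDFS block size.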
23. 23
Outline
Use Case & Motivation: Why introduce a new file format?
CarbonData File Format Deep Dive
Frameworks Integrated with CarbonData
Performance
Demo
Future Plan
25. 25
Performance Comparison: Observations
High Throughput / Full Scan Query: 1.4 to 6 times faster
• Deferred decoding enables faster aggregation on the fly
OLAP / Interactive Query: 20 to 33 times faster
• MDK, min-max and inverted indices enable block pruning
• Deferred decoding enables faster aggregation on the fly
Random Access Query: 26 to 688 times faster
• Inverted index enables faster row reconstruction
• Column groups eliminate implicit joins during row reconstruction
26. 26
Outline
Motivation: Why introduce a new file format?
CarbonData File Format Deep Dive
Frameworks Integrated with CarbonData
Performance
Demo
Future Plan
27. 27
Live Demo
Demo environment:
Number of nodes   5 VMs (AWS r3.4xlarge)
vCPUs             80 (16/node)
Memory            500 GiB (100 GiB/node)
#Columns          300
Data size         600 GB
#Records          300M

High Throughput / Full Scan Query:
SELECT PROD_BRAND_NAME, SUM(STR_ORD_QTY) FROM oscon_demo
GROUP BY PROD_BRAND_NAME;

OLAP / Interactive Query:
SELECT PROD_COLOR, SUM(STR_ORD_QTY) FROM oscon_demo
WHERE CUST_COUNTRY = 'New Zealand' AND CUST_CITY = 'Auckland'
  AND PRODUCT_NAME = 'Huawei Honor 4X'
GROUP BY PROD_COLOR;

Random Access Query:
SELECT * FROM oscon_demo
WHERE CUST_PRFRD_FLG = 'Y' AND PROD_BRAND_NAME = 'Huawei'
  AND PROD_COLOR = 'BLACK' AND CUST_LAST_RVW_DATE = '2015-12-11 00:00:00'
  AND CUST_COUNTRY = 'New Zealand' AND CUST_CITY = 'Auckland'
  AND PRODUCT_NAME = 'Huawei Honor 4X';
28. 28
Outline
Motivation: Why introduce a new file format?
CarbonData File Format Deep Dive
Frameworks Integrated with CarbonData
Performance
Demo
Future Plan
29. 29
Future Plan
• Upgrade to Spark 2.0
• Add append support
• Support pre-aggregated table
• Enable offline IUD (insert/update/delete) support
• Broader Integration across Hadoop-ecosystem
30. 30
Community
• CarbonData is open source and will become an Apache Incubator project
• Contributions are welcome on our GitHub:
https://github.com/HuaweiBigData/carbondata
• Main Contributors:
• Jihong MA, Vimal, Raghu, Ramana, Ravindra, Vishal, Aniket, Liang Chenliang, Jacky Likun,
Jarry Qiuheng, David Caiqiang, Eason Linyixin, Ashok, Sujith, Manish, Manohar, Shahid,
Ravikiran, Naresh, Krishna, Babu, Ayush, Santosh, Zhangshunyu, Liujunjie, Zhujing (Huawei)
• Jean-Baptiste Onofre (Talend, ASF member), Henry Saputra (eBay, ASF member),
Uma Maheswara Rao G(Intel, Hadoop PMC)