ClickHouse Features for Advanced Users, by Aleksei Milovidov (Altinity Ltd)
This document summarizes key features for advanced users of ClickHouse, an open-source column-oriented database management system. It describes sample keys that can be defined in MergeTree tables to generate instant reports on large customer data. It also summarizes intermediate aggregation states, consistency modes, and tools for processing data without a server like clickhouse-local.
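The intermediate aggregation states mentioned above can be sketched outside ClickHouse. The snippet below is a minimal Python illustration of the idea behind ClickHouse's -State/-Merge combinators (the function names here are invented for the sketch): a partial average is kept as a mergeable (sum, count) pair rather than a finished number, so shards can pre-aggregate independently and a coordinator can combine the states without re-reading raw rows.

```python
# Sketch of mergeable aggregate states, in the spirit of ClickHouse's
# avgState / avgMerge combinators. Each partial state is a (sum, count)
# pair that can be combined without access to the raw rows.

def avg_state(values):
    """Build a partial state from a block of raw values."""
    return (sum(values), len(values))

def merge_states(a, b):
    """Combine two partial states; associative and commutative."""
    return (a[0] + b[0], a[1] + b[1])

def avg_merge(state):
    """Finalize a state into the actual average."""
    total, count = state
    return total / count if count else None

# Two "shards" each pre-aggregate their own rows...
s1 = avg_state([10, 20, 30])
s2 = avg_state([40, 50])

# ...and the coordinator merges states instead of re-reading raw data.
print(avg_merge(merge_states(s1, s2)))  # 30.0
```

Because the merge is associative and commutative, states can be combined in any order, which is what makes distributed and incremental aggregation possible.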
ClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEO (Altinity Ltd)
From webinar on December 3, 2019
New users of ClickHouse love the speed but may run into a few surprises when designing applications. Column storage turns classic SQL design precepts on their heads. This talk shares our favorite tricks for building great applications. We'll talk about fact tables and dimensions, materialized views, codecs, arrays, and skip indexes, to name a few of our favorites. We'll show examples of each and also reserve time to handle questions. Join us to take your next step to ClickHouse guruhood!
Speaker Bio:
Robert Hodges is CEO of Altinity, which offers enterprise support for ClickHouse. He has over three decades of experience in data management spanning 20 different DBMS types. ClickHouse is his current favorite. ;)
The document discusses configuring and monitoring buffer pools and memory settings for a DB2 database instance and partitions. It includes commands to:
- Show buffer pool information and alter a buffer pool size
- View tablespace to buffer pool mappings
- Check database and instance configuration parameters for memory settings
- List instances and reset the monitoring
- View buffer pool snapshots
ClickHouse Unleashed 2020: Our Favorite New Features for Your Analytical Appl... (Altinity Ltd)
Robert Hodges is the Altinity CEO with over 30 years of experience in DBMS, virtualization, and security. ClickHouse is the 20th DBMS he has worked with. Alexander Zaitsev is the Altinity CTO and founder with decades of experience designing and operating petabyte-scale analytic systems. Vitaliy Zakaznikov is the QA Architect with over 13 years of testing hardware and software and is the author of the TestFlows open source testing tool.
This document discusses big data technologies including Hadoop, MapReduce, HDFS, and data querying tools like Apache Pig and Hive. It then summarizes Presto, an open source distributed SQL query engine, including its architecture, how it compares to MapReduce, optimization techniques like vectorized reading and predicate pushdown, columnar storage, benchmarks, and usage at Netflix.
ClickHouse Capacity Planning for OLAP Workloads, by Mik Kocikowski of CloudFlare (Altinity Ltd)
Presented at the ClickHouse Meetup on December 3, 2019.
Concrete findings and "best practices" from building a cluster sized for 150 analytic queries per second on 100TB of http logs. Topics covered: hardware, clients (http vs native), partitioning, indexing, SELECT vs INSERT performance, replication, sharding, quotas, and benchmarking.
Webinar: Secrets of ClickHouse Query Performance, by Robert Hodges (Altinity Ltd)
From webinars September 11 and September 17, 2019
ClickHouse is famous for speed. That said, you can almost always make it faster! This webinar uses examples to teach you how to deduce what queries are actually doing by reading the system log and system tables. We'll then explore standard ways to increase query speed: data types and encodings, filtering, join reordering, skip indexes, materialized views, session parameters, to name just a few. In each case we'll circle back to query plans and system metrics to demonstrate changes in ClickHouse behavior that explain the boost in performance. We hope you'll enjoy the first step to becoming a ClickHouse performance guru!
Speaker Bio:
Robert Hodges is CEO of Altinity, which offers enterprise support for ClickHouse. He has over three decades of experience in data management spanning 20 different DBMS types. ClickHouse is his current favorite. ;)
Ceph Object Storage Performance Secrets and Ceph Data Lake Solution, by Karan Singh
In this presentation, I explain how Ceph Object Storage performance can be improved drastically, along with some object storage best practices, recommendations, and tips. I also cover the Ceph Shared Data Lake, which is becoming very popular.
A lot of data is best represented as time series: Operational data, financial data, and even in data warehouses the dominant dimension is often time. We present Chronix, a time series database based on Apache Solr and Spark which is able to handle trillions of time series data points and perform interactive queries. Chronix Spark is open source software and battle-proven at a German car manufacturer and an international telco.
We demonstrate several use cases of Chronix from real-life. Afterwards we lift the curtain and deep-dive into the Chronix architecture esp. how we're using Solr to store time series data and how we've hooked up Solr with Spark. We provide some benchmarks showing how Chronix has outperformed other time series databases in both performance and storage-efficiency.
Chronix is open source under the Apache License (http://chronix.io).
A Practical Introduction to Handling Log Data in ClickHouse, by Robert Hodges... (Altinity Ltd)
This document discusses using ClickHouse to manage log data. It begins with an introduction to ClickHouse and its features. It then covers different ways to model log data in ClickHouse, including storing logs as JSON blobs or converting them to a tabular format. The document demonstrates using materialized views to ingest logs into ClickHouse tables in an efficient manner, extracting values from JSON and converting to columns. It shows how this approach allows flexible querying of log data while scaling to large volumes.
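The JSON-to-columns step described above can be illustrated with a small Python sketch (the field names ts, level, and msg and the sample records are invented for the example); a ClickHouse materialized view would do the equivalent extraction at insert time with functions such as JSONExtractString.

```python
import json

# Hypothetical raw log lines stored as JSON blobs; the field names
# ("ts", "level", "msg") are illustrative, not a fixed schema.
raw_logs = [
    '{"ts": "2020-01-01T00:00:00", "level": "ERROR", "msg": "disk full"}',
    '{"ts": "2020-01-01T00:00:05", "level": "INFO", "msg": "retrying"}',
]

# Extract selected JSON fields into parallel columns, roughly what a
# ClickHouse materialized view does with JSONExtractString on ingest.
columns = {"ts": [], "level": [], "msg": []}
for line in raw_logs:
    record = json.loads(line)
    for name in columns:
        columns[name].append(record.get(name, ""))

print(columns["level"])  # ['ERROR', 'INFO']
```

Keeping the raw JSON blob alongside the extracted columns preserves flexibility for ad hoc fields while the hot columns stay cheap to filter and aggregate.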
Cloud native applications are popular these days – applications that run in the cloud reliably and scale almost arbitrarily. They follow three key principles: they are built and composed as microservices, they are packaged and distributed in containers, and the containers are executed dynamically in the cloud. In this hands-on session we will show how to build, package and deploy cloud native Java EE applications on top of DC/OS - fully automated with Gradle using cloud native infrastructure like Consul, Fabio, Hystrix and Prometheus. And for the fun of it we will be using an off-the-shelf DJ pad, programmed with nothing other than the Java Sound API, to demonstrate the core concepts and to visualize and remote control DC/OS.
Tiered storage intro. By Robert Hodges, Altinity CEO (Altinity Ltd)
The document discusses ClickHouse's tiered storage feature, which allows data to be stored across multiple storage devices based on access patterns and policies. It provides an overview of ClickHouse's storage architecture and how data is normally stored in a single directory. It then introduces storage configurations that define policies and mappings to different disks/volumes. Examples are given of using policies to control where newly inserted or optimized data is stored. This allows fast and less frequently accessed data to be tiered across storage with different performance characteristics.
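A storage configuration of the kind described above might look like the following server config fragment (the disk names, paths, and policy name here are illustrative, not taken from the deck):

```xml
<storage_configuration>
  <disks>
    <fast_ssd>
      <path>/mnt/ssd/clickhouse/</path>
    </fast_ssd>
    <slow_hdd>
      <path>/mnt/hdd/clickhouse/</path>
    </slow_hdd>
  </disks>
  <policies>
    <tiered>
      <volumes>
        <hot>
          <disk>fast_ssd</disk>
        </hot>
        <cold>
          <disk>slow_hdd</disk>
        </cold>
      </volumes>
    </tiered>
  </policies>
</storage_configuration>
```

A table opts in to such a policy with `SETTINGS storage_policy = 'tiered'` in its CREATE TABLE statement; new parts land on the first volume and can later be moved to colder volumes.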
ClickHouse materialized views - a secret weapon for high performance analytic... (Altinity Ltd)
ClickHouse materialized views allow you to precompute aggregates and transform data to improve query performance. Materialized views can store precomputed aggregates from a source table to speed up aggregation queries over 100x. They can also retrieve the last data point for each item over 100x faster than scanning the raw data table. Materialized views provide a way to optimize data storage layout and indexing to improve query efficiency.
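The "last point" speedup mentioned above comes from maintaining a small per-item state on insert instead of scanning all raw rows at query time. The Python sketch below illustrates the pattern (the sensor names and readings are invented); in ClickHouse this incremental update would be done by a materialized view feeding an aggregating table.

```python
# Sketch of the "last point" pattern: instead of scanning every raw row
# to find each item's latest reading, keep a per-item state that is
# updated incrementally as rows arrive, as a ClickHouse materialized
# view over an aggregating target table would.

raw_rows = [
    ("sensor_a", 1, 20.0),
    ("sensor_b", 1, 18.5),
    ("sensor_a", 2, 21.5),  # newer reading for sensor_a
]

last_point = {}  # item -> (latest_time, value), updated on "insert"
for item, ts, value in raw_rows:
    if item not in last_point or ts > last_point[item][0]:
        last_point[item] = (ts, value)

# A last-point query now touches len(last_point) entries,
# not len(raw_rows) rows.
print(last_point["sensor_a"])  # (2, 21.5)
```

The speedup grows with the ratio of raw rows to distinct items, which is why the deck can claim over-100x gains on large tables.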
Testing MultiOutputFormat-based MapReduce, by Ashok Agarwal
The document describes using a MultiOutputFormat in MapReduce to generate separate output files for each stock price based on input that contains stock price data. It includes code for a mapper that extracts the stock name and price from each input record and a reducer that writes these values to individual files for each stock name. Unit tests are also described to test the reducer by mocking the MultipleOutputs class and validating that the output files contain the expected stock price values.
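The per-stock routing described above can be sketched in a few lines of Python (the records are invented sample data, and outputs are kept in memory to stay self-contained; a real MapReduce job using MultipleOutputs would write one file per stock name instead).

```python
# Sketch of the MultipleOutputs idea: route each record to a per-key
# output instead of a single reducer file. In-memory here; a real job
# would open one output file per stock name.

records = [
    "AAPL,120.5",
    "GOOG,1500.0",
    "AAPL,121.0",
]

outputs = {}  # stock name -> list of prices written to "its" output
for record in records:
    stock, price = record.split(",")
    outputs.setdefault(stock, []).append(float(price))

print(sorted(outputs))   # ['AAPL', 'GOOG']
print(outputs["AAPL"])   # [120.5, 121.0]
```

This is also roughly what the unit tests in the deck validate: that each key's output contains exactly that key's values.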
Cascading provides a simpler way to write MapReduce programs through data flows. It uses a pipe and tap metaphor where data flows through pipes and is read from or written to taps. This allows assembling MapReduce jobs as data flow graphs in a more logical way compared to the traditional MapReduce API.
PL/CUDA allows running CUDA C code directly in PostgreSQL user-defined functions. This allows advanced analytics and machine learning algorithms to be run directly in the database.
The gstore_fdw foreign data wrapper allows data to be stored directly in GPU memory, accessed via SQL, eliminating the overhead of copying data between CPU and GPU memory for each query.
Integrating PostgreSQL with GPU computing and machine learning frameworks allows for fast data exploration and model training by combining flexible SQL queries with high-performance analytics directly on the data.
ClickHouse Materialized Views: The Magic Continues (Altinity Ltd)
Slides for the webinar, presented on February 26, 2020
By Robert Hodges, Altinity CEO
Materialized views are the killer feature of ClickHouse, and the Altinity 2019 webinar on how they work was very popular. Join this updated webinar to learn how to use materialized views to speed up queries hundreds of times. We'll cover basic design, last point queries, using TTLs to drop source data, counting unique values, and other useful tricks. Finally, we'll cover recent improvements that make materialized views more useful than ever.
The Best and Worst of Cassandra-stress Tool (Christopher Batey, The Last Pick...) (DataStax)
Making sure your Data Model will work on the production cluster after 6 months as well as it does on your laptop is an important skill. It's one that we use every day with our clients at The Last Pickle, and one that relies on tools like the cassandra-stress. Knowing how the data model will perform under stress once it has been loaded with data can prevent expensive re-writes late in the project.
In this talk Christopher Batey, Consultant at The Last Pickle, will shed some light on how to use the cassandra-stress tool to test your own schema, graph the results, and even extend the tool for your own use cases. While this might be called premature optimisation for an RDBMS, a successful Cassandra project depends on its data model.
About the Speaker
Christopher Batey Consultant / Software Engineer, The Last Pickle
Christopher (@chbatey) is a part-time consultant at The Last Pickle, where he works with clients to help them succeed with Apache Cassandra, as well as a freelance software engineer working in London. Likes: Scala, Haskell, Java, the JVM, Akka, distributed databases, XP, TDD, Pairing. Hates: Untested software, code ownership. You can check out his blog at: http://www.batey.info
The document discusses various topics related to optimizing performance for PostgreSQL including:
- Indexes and how to use EXPLAIN and EXPLAIN ANALYZE to analyze query performance. Conditional, functional and concurrent indexes are covered.
- Connection pooling options for Django like django-postgrespool to improve connection management.
- Replication options such as Slony, Bucardo, pgpool, WAL-E and Barman for high availability.
- Backup strategies including logical backups with pg_dump and physical backups using base backups. When each approach is best to use.
Gruter_TECHDAY_2014_04_TajoCloudHandsOn (in Korean) (Gruter)
This document discusses setting up and using Tajo, an Apache Hadoop-based data warehousing system, on AWS. It provides instructions on using Tajo Cloud to easily configure a Tajo cluster on AWS. Examples show how to connect external data from S3, perform queries, and analyze customer cohort data to understand purchase patterns over time. Tajo allows direct access to data in S3 and dynamic scaling of worker nodes, and its connector enables remote querying from SQL clients, Excel, and R.
Available at: https://github.com/dbsmasters/bdsmasters
The current project is implemented in the context of the course "Big Data Management Systems" taught by Prof. Chatziantoniou in the Department of Management Science and Technology (AUEB). The aim of the project is to familiarize the students with big data management systems such as Hadoop, Redis, MongoDB and Azure Stream Analytics.
Stefan Hochdörfer - The NoSQL Store everyone ignores: PostgreSQL - NoSQL matt... (NoSQLmatters)
PostgreSQL is well known as an object-relational database management system. At its core, PostgreSQL is schema-aware, dealing with fixed database tables and column types. However, recent versions of PostgreSQL have made it possible to deal with schema-free data. Learn which new features PostgreSQL supports and how to use them in your application.
The document discusses caching concepts in Java, including the JSR107 caching API. It covers cache configuration, events, computations using entry processors, and annotations for CDI integration. The presentation includes code examples for basic caching, events handling, and invoking entry processors on cache entries.
Advanced Apache Cassandra Operations with JMX (zznate)
Nodetool is a command line interface for managing a Cassandra node. It provides commands for node administration, cluster inspection, table operations and more. The nodetool info command displays node-specific information such as status, load, memory usage and cache details. The nodetool compactionstats command shows compaction status including active tasks and progress. The nodetool tablestats command displays statistics for a specific table including read/write counts, space usage, cache usage and latency.
Webinar slides: MORE secrets of ClickHouse Query Performance. By Robert Hodge... (Altinity Ltd)
The document provides an overview of ClickHouse and techniques for optimizing performance. It discusses how the ClickHouse query log can help understand query execution and bottlenecks. Methods covered for improving performance include adding indexes, optimizing data layout through partitioning and ordering, using encodings to reduce data size, and materialized views. Storage optimizations like multi-disk volumes and tiered storage are also introduced.
The document discusses the glance-replicator tool in OpenStack. Glance-replicator allows replication of images between two glance servers. It can replicate images and also import and export images. The document provides examples of using glance-replicator commands like compare, livecopy to replicate images between two devstack all-in-one OpenStack environments. It demonstrates the initial state with only one environment having images and after replication both environments having the same set of images.
Building a Fast, Resilient Time Series Store with Cassandra (Alex Petrov, Dat...) (DataStax)
Cassandra is awesome for many things. One of the things it's awesome for is Time Series. Combining the power of Cassandra with APIs of existing Time Series tools, such as Graphite can yield interesting results.
Cyanite is a Time Series aggregator and store built on top of Cassandra. It's fully compatible with Graphite, can serve as a plug-in replacement for Graphite and Graphite web.
Cyanite uses SASI indexes for glob metric path queries, and can query, aggregate, store, display, and analyse metrics from hundreds and thousands of servers.
We'll also cover which data modelling practices work best for time series, and which new Cassandra features you can use to make your time series analysis better.
About the Speaker
Alex Petrov Software Engineer, DataStax
Polyglot programmer. Interested in algorithms, distributed systems, algebra and high performance solutions.
The document summarizes Cassandra developments over the past 5 years, including keynote details from Jonathan Ellis on Cassandra 1.2 and 2.0. Some highlights include improvements to scalability, performance and reliability in Cassandra 1.2, and the introduction of new features in Cassandra 2.0 like lightweight transactions (CAS), improved compaction, and experimental triggers. The keynote outlines changes and removals between the two versions to ease the transition for developers and operators.
This document discusses using sampling to diagnose buffer busy wait issues in an Oracle database. It provides an example of using the v$session_wait view to identify the specific buffer busy wait type, file, and block number involved. This allows finding the impacted object and SQL statement. The example identifies an insert statement on a table with a single freelist as the cause. It recommends adding more freelists to improve concurrency for inserts on that table.
The document discusses dynamic C++ and the POCO library. It introduces the problem of accessing data in different formats and proposes POCO as a solution. POCO provides classes like RecordSet and Row that allow dynamically binding data and generating output in different formats like XML. It discusses the implementation details of how POCO achieves dynamic and type-safe behavior through templates and classes like Poco::Dynamic::Var.
Oracle Open World Thursday 230 ashmastersKyle Hailey
This document discusses database performance tuning using Oracle's ASH (Active Session History) feature. It provides examples of ASH queries to identify top wait events, long running SQL statements, and sessions consuming the most CPU. It also explains how to use ASH data to diagnose specific problems like buffer busy waits and latch contention by tracking session details over time.
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016DataStax
Most web applications start out with a Postgres database and it serves the application very well for an extended period of time. Based on type of application, the data model of the app will have a table that tracks some kind of state for either objects in the system or the users of the application. Names for this table include logs, messages or events. The growth in the number of rows in this table is not linear as the traffic to the app increases, it's typically exponential.
Over time, the state table will increasingly become the bulk of the data volume in Postgres, think terabytes, and become increasingly hard to query. This use case can be characterized as the one-big-table problem. In this situation, it makes sense to move that table out of Postgres and into Cassandra. This talk will walk through the conceptual differences between the two systems, a bit of data modeling, as well as advice on making the conversion.
About the Speaker
Rimas Silkaitis Product Manager, Heroku
Rimas currently runs Product for Heroku Postgres and Heroku Redis but the common thread throughout his career is data. From data analysis, building data warehouses and ultimately building data products, he's held various positions that have allowed him to see the challenges of working with data at all levels of an organization. This experience spans the smallest of startups to the biggest enterprises.
The document describes Active Session History (ASH), a new methodology for performance tuning introduced by Kyle Hailey. ASH simplifies performance tuning by using statistical sampling to provide a multidimensional view of sessions, SQL, objects, users, and other database components over time. This allows identification of top resource consumers using less data collection than traditional methods.
Ash masters : advanced ash analytics on Oracle Kyle Hailey
The document discusses database performance tuning. It recommends using Active Session History (ASH) and sampling sessions to identify the root causes of performance issues like buffer busy waits. ASH provides key details on sessions, SQL statements, wait events, and durations to understand top resource consumers. Counting rows in ASH approximates time spent and is important for analysis. Sampling sessions in real-time can provide the SQL, objects, and blocking sessions involved in issues like buffer busy waits.
The document describes Active Session History (ASH), a new methodology for performance tuning introduced by Oracle. ASH simplifies performance tuning by using statistical sampling to collect session state and resource usage data over time. This provides a multidimensional view of sessions, SQL, users, objects, and waits consuming resources. ASH replaces previous methods that collected complete data at infrequent intervals, obscuring problems. Its sampling approach is cheaper, faster, and provides a good representation of the workload. The document outlines how ASH data can be used to identify top resource consumers and troubleshoot issues.
BlueStore: a new, faster storage backend for CephSage Weil
BlueStore is a new storage backend for Ceph that provides faster performance compared to the existing FileStore backend. BlueStore stores metadata in RocksDB and data directly on block devices, avoiding double writes and improving transaction performance. It supports multiple storage tiers by allowing different components like the RocksDB WAL, database and object data to be placed on SSDs, HDDs or NVRAM as appropriate.
BlueStore is a new storage backend for Ceph that stores data directly on block devices rather than using a file system. It uses RocksDB to store metadata like a key-value database and pluggable block allocation policies to improve performance. BlueStore aims to provide more natural transaction support without double writes by using a write-ahead log. It also supports multiple storage devices to optimize placement of metadata, data and write-ahead logs.
This document provides an overview of Oracle's Active Session History (ASH) feature. ASH samples database sessions every second to capture session states and activity. It stores this data in an in-memory circular buffer and periodically writes samples to disk for analysis. ASH data provides insights into database time usage, top SQL, wait events, and blocking issues. It can be used for performance analysis by aggregating and analyzing ASH dimensions like SQL_ID, event, and wait class over time.
Parallel computing allows breaking problems into independent pieces that can be computed simultaneously across multiple processors. The document discusses using the snow package in R to set up a simple parallel cluster on a single machine and perform operations like bootstrapping in parallel. It also mentions more advanced high performance computing techniques for large memory, compiled code, profiling, and batch scheduling.
The document discusses techniques for hacking into Microsoft SQL and Oracle databases. It begins by outlining scanning and enumeration methods for MSSQL databases using Metasploit modules like mssql_login to identify accessible databases. It then discusses gaining access to databases by exploiting blank passwords or known vulnerabilities. The document continues explaining how to escalate privileges within databases and then moves on to discuss the "Operation CloudBurst" attack and references.
The document describes processing raw system log files by loading them into Hive tables. It creates a rawlog table to load the raw data, then cleans the data by removing rows with null values and loads it into a cleanlog table. The cleanlog table is partitioned by year and month into a partitionedlog table for improved query performance on specific date ranges. Queries are shown to count page hits from the partitioned data for a given year and month.
The document discusses direct access to Oracle's shared memory (SGA) using C code. It describes the main regions of the SGA, how information is used automatically and by queries, and reasons for direct access such as reading hidden information or during database hangs. It outlines examining the SGA contents through externalized X$ tables, with most structures in the SGA not directly visible. The document provides a procedure to summarize waits from the V$SESSION_WAIT view by mapping it to the underlying X$KSUSECST table fields and offsets.
This document discusses using Active Session History (ASH) to analyze and troubleshoot performance issues in an Oracle database. It provides an example of using ASH to identify the top CPU-consuming session over the last 5 minutes. It shows how to group and count ASH data to calculate metrics like average active sessions (AAS) and percentage of time spent on CPU. The document also discusses using ASH to identify top waiting sessions and analyze specific wait events like buffer busy waits.
This document provides an agenda and overview of Big Data Analytics using Spark and Cassandra. It discusses Cassandra as a distributed database and Spark as a data processing framework. It covers connecting Spark and Cassandra, reading and writing Cassandra tables as Spark RDDs, and using Spark SQL, Spark Streaming, and Spark MLLib with Cassandra data. Key capabilities of each technology are highlighted such as Cassandra's tunable consistency and Spark's fault tolerance through RDD lineage. Examples demonstrate basic operations like filtering, aggregating, and joining Cassandra data with Spark.
Cassandra is a distributed database designed to handle large amounts of data across many servers. It provides high availability with no single point of failure and linear scalability. Data is distributed across nodes and replicated for fault tolerance. Writes are fast by using an append-only commit log and can be configured for different consistency levels.
Scylla Summit 2018: From SAP to Scylla - Tracking the Fleet at GPS InsightScyllaDB
Originally using SAP Adaptive Server Enterprise (ASE), the GPS Insight team soon found that relational databases simply aren’t a match for high volume machine data. To top it off, SAP ASE’s clustering technology proved cumbersome to manage and operate. In this presentation, you’ll learn about GPS Insight’s hybrid Scylla deployment that runs on-premises and on AWS datacenter. GPS Insight relies on Scylla to capture and analyze GPS data, offloading data from RDBMS to Scylla for hybrid analytics approach.
Similar to financial analytics of AAPL_stock markets (20)
The Ipsos - AI - Monitor 2024 Report.pdfSocial Samosa
According to Ipsos AI Monitor's 2024 report, 65% Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
End-to-end pipeline agility - Berlin Buzzwords 2024Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataKiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
Global Situational Awareness of A.I. and where its headedvikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be un-leashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Untitled7
In [45]:
library(readxl)
library(ggplot2)
library(tidyr)
library(lattice)
library(lubridate)
library(quantmod)
In [46]:
ss <- read.csv('C:\\Users\\vinod\\Downloads\\SHAREit\\LeX507file\\stock_AAPL.csv',  # adjust path as needed
               header = TRUE)
stock_AAPL <- as.data.frame(ss)
colnames(stock_AAPL) <- c('Date', 'Symbol', 'Open', 'High', 'Low', 'Close', 'Volume',
                          'Macd', 'Mfi', 'Rsi', 'William_r', 'Stochastic_fast',
                          'Stochastic_slow', 'Bollinger_bands', 'Chaikin_money_flow',
                          'Obv', 'Log_timestamp', 'Datasource')
stock_AAPL$Date <- as.integer(gsub('-', '', stock_AAPL$Date))
ss <- stock_AAPL
ss
First rows of the stock_AAPL data frame (OHLCV columns shown):

Date      Symbol    Open    High     Low   Close    Volume
20160628  AAPL     92.90   93.66   92.14   93.59  40444914
20160629  AAPL     93.97   94.55   93.63   94.40  36531006
20160630  AAPL     94.44   95.77   94.30   95.60  35836356
20160701  AAPL     95.49   96.46   95.33   95.89  26026540
20160705  AAPL     95.39   95.40   94.46   94.99  27705210
20160706  AAPL     94.60   95.66   94.37   95.53  30949090
20160707  AAPL     95.70   96.50   95.62   95.94  25139558
20160708  AAPL     96.49   96.89   96.05   96.68  28912103
20160711  AAPL     96.75   97.65   96.73   96.98  23794945
20160712  AAPL     97.17   97.70   97.12   97.42  24167463
20160713  AAPL     97.41   97.67   96.84   96.87  25892171
20160714  AAPL     97.39   98.99   97.32   98.79  38918997
20160715  AAPL     98.92   99.30   98.50   98.78  30136990
20160718  AAPL     98.70  100.13   98.60   99.83  36493867
20160719  AAPL     99.56  100.00   99.34   99.87  23779924
20160720  AAPL    100.00  100.46   99.74   99.96  26275968
20160721  AAPL     99.83  101.00   99.13   99.43  32702028
20160722  AAPL     99.26   99.30   98.31   98.66  28313669
20160725  AAPL     98.25   98.84   96.92   97.34  40382921
20160726  AAPL     96.82   97.97   96.42   96.67  56239822
20160727  AAPL    104.26  104.35  102.75  102.95  92344820
20160728  AAPL    102.83  104.45  102.82  104.34  39869839
20160729  AAPL    104.19  104.55  103.68  104.21  27733688
20160801  AAPL    104.41  106.15  104.41  106.05  38167871
...

The indicator columns (Macd, Mfi, Rsi, William_r, Stochastic_fast, Stochastic_slow, Bollinger_bands, Chaikin_money_flow, Obv) are NA for the early rows and begin to populate from 2016-07-27, once enough history has accumulated (e.g. Mfi 64.34, Rsi 75.31, William_r -0.17, Chaikin_money_flow 0.67, Obv 248242740 on that date); Log_timestamp is 2017-12-28 06:09:09.664752 for every row.
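Before plotting, it can help to sanity-check the parse. A minimal check, assuming the stock_AAPL frame built above:

```r
str(stock_AAPL[, c('Date', 'Open', 'Close', 'Volume')])  # Date should now be an integer yyyymmdd
range(stock_AAPL$Date)                                   # span of the loaded dates
colSums(is.na(stock_AAPL[, 8:16]))                       # indicator columns (Macd..Obv) start as NA
```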
In [58]:
library(dplyr)  # provides %>%
stock_AAPL_filtered <- stock_AAPL %>% dplyr::filter(Date > 20170000)
ggplot(stock_AAPL_filtered, aes(x = Date)) +
  geom_histogram(aes(fill = ..count..), col = "grey") +
  ggtitle('Histogram of dates after filtering for 2018')
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
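The `stat_bin()` message suggests choosing a bin width explicitly. Because Date is stored as an integer like 20180103, a binwidth of 100 groups roughly one calendar month per bar; a minimal sketch reusing the stock_AAPL_filtered frame from the cell above:

```r
ggplot(stock_AAPL_filtered, aes(x = Date)) +
  geom_histogram(aes(fill = ..count..), binwidth = 100, col = "grey") +  # ~1 month per bin on yyyymmdd integers
  ggtitle('Histogram of dates with an explicit binwidth')
```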
TRADING DECISION
Black in the plot indicates a day where the closing price was higher than the open (GAIN).
Red indicates a day where the open was higher than the close (LOSS).
Let us first do the analysis with 2018.
In [39]:
plot(AAPL_s[, 'AAPL.Close'], main = 'AAPL_s')  # AAPL_s: an xts price series (e.g. from quantmod::getSymbols)
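The GAIN/LOSS colouring described above can be sketched directly from the data frame; this is a minimal illustration assuming the stock_AAPL frame loaded earlier (quantmod's candleChart applies the same convention to an xts object):

```r
# Label each day, then colour daily close-minus-open bars by the label.
stock_AAPL$Move <- ifelse(stock_AAPL$Close > stock_AAPL$Open, 'GAIN', 'LOSS')
barplot(stock_AAPL$Close - stock_AAPL$Open,
        col = ifelse(stock_AAPL$Move == 'GAIN', 'black', 'red'),
        main = 'Daily close - open (black = GAIN, red = LOSS)')

# For an xts series such as AAPL_s:
# candleChart(AAPL_s, up.col = 'black', dn.col = 'red')
```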