This document discusses approaches to improving the speed and efficiency of forensic imaging workflows. It identifies current forensic image formats as a bottleneck, with linear hashing and compression algorithms not scaling well to modern multi-core systems. The Advanced Forensic Format (AFF4) is presented as a solution, using block-based hashing and faster compression to fully utilize available I/O throughput. Optimizing storage interfaces, filesystem choices, and system configuration can further increase acquisition speeds.
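The scaling idea behind block-based hashing is straightforward: hash fixed-size blocks independently so the work spreads across cores instead of flowing through one linear hash. A minimal sketch in Python (the block size and function names are illustrative choices, not part of the AFF4 specification):

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 1024 * 1024  # 1 MiB blocks, an illustrative choice

def hash_block(block: bytes) -> str:
    # Each block is hashed independently, so blocks can be processed
    # concurrently instead of through one serial hash stream.
    # (hashlib releases the GIL for large buffers, so threads help here.)
    return hashlib.sha256(block).hexdigest()

def block_hashes(data: bytes, workers: int = 4) -> list[str]:
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(hash_block, blocks))
```

Because each block hash is independent, acquisition can also verify or resume individual blocks without re-reading the whole image.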
AFF4: The new standard in forensic imaging and why you should care (Bradley Schatz)
This seminar will outline why a new forensic container standard is needed and describe recent efforts to standardize the Advanced Forensic Format 4 (AFF4) forensic container. Originally proposed in 2009 by Michael Cohen, Simson Garfinkel, and Bradley Schatz, the AFF4 forensic container supports a range of next-generation forensic image features such as storage virtualisation, arbitrary metadata, and partial, non-linear, and discontinuous images. Current AFF4 implementations include Rekall, the Pmem suite of memory acquisition tools, Evimetry Wirespeed, and Google Rapid Response.
The seminar will present an introduction to the format, outline the current state of adoption within the forensic ecosystem, and announce the availability of open source implementations.
- The document describes a media player plugin that can choose between the IPv4 and IPv6 protocols for streaming video chunks, based on which delivers the faster connection.
- The plugin modifies an existing media player (Hls.js) to measure download speeds for each video chunk delivered over IPv4 or IPv6, and then selects the preferable protocol.
- Statistics from over 950 streaming sessions in Japan show IPv6 speeds are generally faster than IPv4, especially during night hours, though IPv4 can be faster in some cases like with older IPv6 tunneling.
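The selection logic the plugin applies can be sketched generically: time a few chunk downloads over each address family and stick with the faster one. The fetch callables below are hypothetical stand-ins for chunk downloads, not Hls.js APIs:

```python
import time
from statistics import mean

def measure(fetch, n_chunks=3):
    """Time n_chunks downloads with the given fetch callable and
    return the mean throughput in bytes per second."""
    speeds = []
    for _ in range(n_chunks):
        start = time.perf_counter()
        data = fetch()
        elapsed = time.perf_counter() - start
        speeds.append(len(data) / max(elapsed, 1e-9))
    return mean(speeds)

def pick_protocol(fetch_v4, fetch_v6):
    # Sample both address families, then prefer the faster one,
    # as the plugin described above does per streaming session.
    return "IPv6" if measure(fetch_v6) >= measure(fetch_v4) else "IPv4"
```

A real player would keep re-sampling during playback, since (as the statistics above show) the faster protocol can change with time of day and access technology.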
The document compares the performance of a Ceph storage cluster using TCP and RDMA (XIO) as the transport mechanism. It finds that XIO provides around 30-50% higher IOPS and bandwidth than TCP on the same hardware setup. However, TCP performance is improving and catching up to XIO as the number of OSDs increases. While XIO offers better CPU utilization, it requires over 2x more memory than TCP. Scaling out to multiple nodes shows TCP scaling better than XIO; XIO performance is also less stable, and its connection startup times are longer.
Vijayendra Shamanna from SanDisk presented on optimizing the Ceph distributed storage system for all-flash architectures. Some key points:
1) Ceph is an open-source distributed storage system that provides file, block, and object storage interfaces. It operates by spreading data across multiple commodity servers and disks for high performance and reliability.
2) SanDisk has optimized various aspects of Ceph's software architecture and components like the messenger layer, OSD request processing, and filestore to improve performance on all-flash systems.
3) Testing showed the optimized Ceph configuration delivering over 200,000 IOPS and low latency with random 8K reads on an all-flash setup.
Decentralized storage systems like IPFS and Swarm allow users to store and access files in a decentralized peer-to-peer manner without relying on centralized servers. IPFS in particular aims to build a better web by making files addressable through content hashes rather than locations and improving availability, security, and cost efficiency compared to HTTP. It works by breaking files into chunks that are distributed across the network and retrieved by hash rather than location. Basic IPFS commands demonstrated include adding files, pinning for local access, and downloading content from the decentralized network.
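The chunk-and-hash idea can be sketched as follows. This is a simplified stand-in for IPFS's real chunking and CID scheme, not compatible with it; the chunk size and hash choice here are illustrative assumptions:

```python
import hashlib

CHUNK = 256 * 1024  # illustrative chunk size; real IPFS defaults differ

def chunk_ids(data: bytes) -> list[str]:
    # Each chunk is addressed by the hash of its content, so any peer
    # holding a matching chunk can serve it, regardless of location.
    return [hashlib.sha256(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)]

def root_id(data: bytes) -> str:
    # A file's identifier is derived from its chunk hashes (a simplified
    # stand-in for IPFS's Merkle-DAG of CIDs).
    return hashlib.sha256("".join(chunk_ids(data)).encode()).hexdigest()
```

The key property is that identical content always produces the same identifier, which is what makes location-independent retrieval and deduplication possible.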
Ceph optimized Storage / Global HW solutions for SDS (David Alvarez, Ceph Community)
This document discusses Supermicro's portfolio of scale-out optimized storage nodes and Ceph-ready hardware solutions. It presents several models of storage nodes that support high density and ultra dense storage and are optimized for the Red Hat Ceph storage platform. The document also covers Supermicro's modular LAN switching I/O modules that provide flexible networking connectivity including 10GbE, 25GbE, and InfiniBand options.
The document provides an overview of DNS history and requirements for maintaining a DNS infrastructure. It discusses how DNS has evolved since 1983 to support features like load balancing, geobalancing, failover, and security protocols. When choosing a DNS software product or service provider, key considerations include scalability, supported features, dynamic configuration, failover capabilities, and protection against DDoS attacks. Maintaining DNS with multiple service providers can improve performance and reliability compared to a single provider.
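The multi-provider idea boils down to failover across independent resolvers. A minimal sketch, where the provider callables are hypothetical stand-ins for queries against separate DNS services:

```python
def resolve(name, providers):
    """Try each provider in order; return the first successful answer.

    `providers` is a list of callables mapping a name to a list of IPs,
    a simplified stand-in for querying independent DNS services.
    """
    last_err = None
    for query in providers:
        try:
            answer = query(name)
            if answer:
                return answer
        except OSError as err:
            # A failed or unreachable provider just means we move on
            # to the next one; the outage stays invisible to clients.
            last_err = err
    raise RuntimeError(f"all providers failed for {name}") from last_err
```

Real deployments usually publish both providers' name servers in the delegation so resolvers fail over automatically, but the principle is the same: no single provider is a single point of failure.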
This presentation is from Gophercon-India, where we talked about how to design a concurrent, high-performance database client in the Go language. We covered how we use goroutines and channels to our advantage, and how to use pools for efficient memory utilization.
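The pooling idea is language-independent: reuse a fixed set of expensive connection objects instead of creating one per request. A minimal sketch using a thread-safe queue (the talk presents the same idea with Go channels; the class and names here are illustrative):

```python
import queue

class ConnPool:
    """Fixed-size pool that hands out and reclaims connection objects.

    Bounding the pool caps memory use and concurrent connections,
    which is the efficiency point the talk makes.
    """
    def __init__(self, factory, size=4):
        self._q = queue.Queue(maxsize=size)
        for _ in range(size):
            self._q.put(factory())  # pre-create all connections up front

    def acquire(self):
        return self._q.get()        # blocks when the pool is exhausted

    def release(self, conn):
        self._q.put(conn)           # return the connection for reuse
```

In Go the same shape falls out naturally from a buffered channel of connections, with `<-ch` as acquire and `ch <-` as release.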
Vikram Hosakote gave a presentation on using the Bullseye code coverage tool to generate code coverage numbers in Cisco NXOS. Bullseye is used to capture coverage of C and C++ code and provide a ratio of tested vs total lines of code. The presentation covered building a Bullseye NXOS image, running tests to generate coverage files, processing the files on a Linux server, and viewing coverage reports in Bullseye's GUI or merged across devices. Automation ideas and integration with eARMS testing were also discussed.
Ceph Object Storage Performance Secrets and Ceph Data Lake Solution (Karan Singh)
In this presentation, I explain how Ceph object storage performance can be improved drastically, together with some object storage best practices, recommendations, and tips. I also cover the Ceph shared Data Lake, which is becoming very popular.
Automation of Hadoop cluster operations in Arm Treasure Data (Yan Wang)
This talk focuses on the journey the Arm Treasure Data Hadoop team is on to simplify and automate how we deploy Hadoop. Until recently we were running Hadoop clusters in two clouds. With the rapid increase of deployments into more sites, the overhead of manual operations started to strain us, so last year we began a project to automate and simplify our deployments using tools like AWS Auto Scaling groups. Steps taken so far include modernizing and standardizing instance types, moving from manually executed deployment scripts to API-triggered workflows, and actively working to deprecate Chef in favor of Debian packages and AWS CodeDeploy. We have also started to automate many operations that until recently were manual, such as scaling clusters in and out and routing traffic between clusters, and have begun simplifying health checks and node snapshotting. Our goal for the year is close-to-fully-automated cluster operations.
The document provides an overview of the Aerospike architecture, including the client, cluster, storage, primary and secondary indexes, RAM, flash storage, and cross datacenter replication (XDR). The Aerospike architecture aims to handle extremely high read/write rates over persistent data at low latency while ensuring consistency and scalability across datacenters with no downtime.
High Performance, Scalable MongoDB in a Bare Metal Cloud (MongoDB)
High-performance MongoDB deployments require dedicated hardware resources to avoid I/O bottlenecks. Testing showed the MongoDB cloud subscriptions on bare metal outperformed shared virtual instances by 6-93% for read/write operations due to optimized configurations of SSDs, disks, CPUs and tuning of OS parameters. For best results, deploy MongoDB on dedicated bare metal servers from the cloud provider rather than in virtual machines.
This document discusses using Ceph object storage to replace an NFS-based email storage solution for Deutsche Telekom's mail platform. It describes developing a Ceph plugin called librados mailbox (librmb) to directly store emails in RADOS objects while keeping metadata and indexes in CephFS. The hybrid approach is open sourced to avoid vendor lock-in. A proof-of-concept deployment is testing the solution across two data centers before potential migration of over 100 million email accounts to the Ceph-based storage.
Ceph Day Beijing: Big Data Analytics on Ceph Object Store (Ceph Community)
Big Data Analytics on Ceph Object Storage
The document discusses using Ceph object storage for big data analytics workloads on OpenStack. It covers deployment considerations for analytics clusters using options like VMs, containers, or bare metal. It details the design of using the Ceph RADOS Gateway (RGW) with an SSD cache tier for storage, and the development of an RGW file system adapter and a proxy for scheduling. Sample performance testing showed container overhead of 1.46x and VM overhead of 2.19x compared to bare metal. The next steps are to complete development and performance testing of the Ceph/RGW solution.
One of the most important things you can do to improve the performance of your flash/SSDs with Aerospike is to properly prepare them. This Presentation goes through how to select, test, and prepare the drives so that you will get the best performance and lifetime out of them.
Slides from my talk at Cassandra Summit 2016 on troubleshooting Cassandra. This is a reprise of my popular talk from last summit, reorganized, expanded, and updated for Cassandra 3.0. In it I share the secrets I've learned in four years of supporting hundreds of customers using Apache Cassandra and DataStax Enterprise. Be sure to check out presenter notes for additional tips and links to further resources.
Brent Compton and Kyle Bader of Red Hat took the stage at Red Hat Storage Day New York on 1/19/16 to share with attendees best practices and lessons learned for architecting solutions with Red Hat Ceph Storage.
The document discusses improving the efficiency of PostgreSQL vacuums. It proposes performing vacuums in parallel using multiple worker processes to shorten execution time. It also suggests deferring index vacuums by spooling dead tuple IDs up to a threshold, reducing the number of expensive index vacuum operations. Range vacuums are proposed to vacuum specific blocks, minimizing disruption to transactions while still making progress.
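The deferral idea can be modeled simply: accumulate dead tuple IDs and run one bulk index pass per threshold crossing, so many heap vacuums share a single index scan. This is an illustrative model of the proposal, not PostgreSQL source:

```python
class DeferredIndexVacuum:
    """Spool dead tuple IDs and trigger the expensive index pass
    only when a threshold is reached."""
    def __init__(self, threshold):
        self.threshold = threshold
        self.dead_ids = []
        self.index_passes = 0

    def record_dead(self, tid):
        self.dead_ids.append(tid)
        if len(self.dead_ids) >= self.threshold:
            self._vacuum_indexes()

    def _vacuum_indexes(self):
        # One bulk index scan removes all spooled dead IDs at once,
        # instead of one index pass per small batch of dead tuples.
        self.index_passes += 1
        self.dead_ids.clear()
```

The trade-off is that dead index entries linger until the threshold is hit, which is why the threshold must be bounded (e.g. by memory for the spooled IDs).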
Cephalocon APAC 2018
March 22-23, 2018 - Beijing, China
Lars Marowsky-Brée SUSE Distinguished Engineer, Ceph Advisory Board member
Marc Koderer, SAP OpenStack Evangelist
Hot Cloud'16: An Experiment on Bare-Metal BigData Provisioning (Ata Turk)
An Experiment on Bare-Metal BigData Provisioning: Many BigData customers use on-demand platforms in the cloud, where they can get a dedicated virtual cluster in a couple of minutes and pay only for the time they use. Increasingly, there is demand for bare-metal big data solutions for applications that cannot tolerate the unpredictability and performance degradation of virtualized systems. Existing bare-metal solutions can introduce delays of tens of minutes to provision a cluster by installing operating systems and applications on the local disks of servers, which has motivated recent research into sophisticated mechanisms to optimize this installation. These approaches assume that using network-mounted boot disks incurs unacceptable run-time overhead. Our analysis suggests that while this assumption is true for application data, it is incorrect for operating systems and applications: network mounting the boot disk and applications results in negligible run-time impact while leading to faster provisioning times.
This presentation breaks down the Aerospike Key Value Data Access. It covers the topics of Structured vs Unstructured Data, Database Hierarchy & Definitions as well as Data Patterns.
Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...) (DataStax)
Customizing JVM settings for the needs of an application can be a tricky business, especially when running externally developed software such as Cassandra. In this talk I will share our experiences and the procedure we have used to test and validate changes with Java tuning. We'll explore two recent experiences: changes to and monitoring of G1 garbage collection, and moving buffer objects off the heap.
For the talk, I'll discuss our tuning process at Knewton. I will share some of the challenges that we faced while identifying what we expected to learn. I'll discuss how we isolated and minimized variables across tests, the importance of the duration of these tests, and how we try to separate correlation from causation. I will demonstrate how to use and interpret the results of the custom scripts that we were driven to develop to gain visibility into our G1GC processes; these scripts will be open sourced.
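A script in the spirit described above might extract pause durations from GC log lines and summarize them. The sample lines and regex below are illustrative only, since real G1 log formats vary by JVM version and logging flags:

```python
import re

# Illustrative lines in the shape of G1 GC log output; real formats
# vary by JVM version and -Xlog:gc / -XX:+PrintGCDetails settings.
SAMPLE_LOG = """\
[GC pause (G1 Evacuation Pause) (young) 0.0123 secs]
[GC pause (G1 Evacuation Pause) (mixed) 0.0456 secs]
[GC pause (G1 Evacuation Pause) (young) 0.0089 secs]
"""

PAUSE_RE = re.compile(r"GC pause .*? ([0-9.]+) secs")

def pause_stats(log_text):
    """Summarize stop-the-world pause durations found in a GC log."""
    pauses = [float(m) for m in PAUSE_RE.findall(log_text)]
    return {"count": len(pauses), "max": max(pauses),
            "total": round(sum(pauses), 4)}
```

Tracking the maximum and total pause time per interval, rather than a single average, is what makes regressions after a tuning change visible.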
About the Speaker
Carlos Monroy Senior Software Engineer, Knewton
Carlos Monroy is a senior engineer on the database team at Knewton, an education company that created an adaptive learning platform. Carlos has been developing software professionally since 1998. His experience holding multiple roles across the software lifecycle gives him a holistic approach. Having used over half a dozen relational database engines, he has recently come over to the NoSQL side, working first with HBase and, for the last three years, Cassandra.
Accelerating forensic and incident response workflow: the case for a new stan... (Bradley Schatz)
Today’s forensic processes are mired by practices carried over from a pre-networked world. Practitioners and responders are faced with the unsatisfactory choice of either forensically preserving only a limited amount of evidence while accepting the risk of missing relevant information (triage), or delaying analysis while waiting for full forensic preservation. This seminar will examine the role of existing forensic imaging formats in creating such an environment, and examine how an improved forensic image format (the AFF4 forensic container format) enables practitioners to perform forensic analysis without the delays imposed by current approaches.
AC&NC provides a full product lineup of Network Attached Storage (NAS) systems, all built for reliability and ease of use. AC&NC also offers combined NAS and Storage Area Network (SAN) systems in a single unit, allowing for a consolidated storage and network environment.
Focused intently on storage, without the distractions of tape backup or bundled servers, AC&NC manufactures in-house and delivers complete solutions in 24-48 hours from in-stock JetStor RAID, iSCSI, FC, NAS/Unified, All-Flash, and JBOD SAS systems that set the bar for performance.
Technologies for working with disk storage and file systems in Windows Serve... (Vitaly Starodubtsev)
- What is Storage Replica
- Architecture and scenarios
- Synchronous and asynchronous replication
- Disk-to-disk, server-to-server, intra-cluster, and cluster-to-cluster replication
- Storage Replica design and planning
- What's new in Windows Server 2016 TP5
- The management GUI and other features: demo and development plans
- Storage Replica integration with Storage Spaces Direct
The document summarizes SAN (storage area network) technology and its advantages over direct-attached storage. It discusses how SANs allow servers to share storage, provide better infrastructure for features like multipathing, and vastly increase scalability. Diagrams show how SANs consolidate storage from a past environment of isolated systems to a future environment with centralized shared storage. The document also compares the IBM DS3400 and EMC CX3-40s storage arrays and provides an overview of SAN delivery and support options available to customers.
Oracle Open World 2014: Lies, Damned Lies, and I/O Statistics [CON3671] (Kyle Hailey)
The document discusses analyzing I/O performance and summarizing lessons learned. It describes common tools used to measure I/O like moats.sh, strace, and ioh.sh. It also summarizes the top 10 anomalies encountered like caching effects, shared drives, connection limits, I/O request consolidation and fragmentation over NFS, and tiered storage migration. Solutions provided focus on avoiding caching, isolating workloads, proper sizing of NFS parameters, and direct I/O.
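The caching anomaly is a good example of why averages lie: a bimodal mix of cache hits and real disk reads produces a mean latency that describes neither population. A small sketch of separating the two (the cutoff value is an illustrative assumption, not a universal constant):

```python
from statistics import mean

def latency_summary(latencies_ms, cache_cutoff_ms=0.5):
    """Split a latency sample into likely cache hits vs. physical reads.

    A single overall average hides the bimodal shape; reporting the
    populations separately shows what the storage is really doing.
    """
    hits = [l for l in latencies_ms if l < cache_cutoff_ms]
    disk = [l for l in latencies_ms if l >= cache_cutoff_ms]
    return {
        "overall_avg": mean(latencies_ms),
        "cache_hits": len(hits),
        "disk_reads": len(disk),
        "disk_avg": mean(disk) if disk else 0.0,
    }
```

With 90% sub-millisecond cache hits and 10% slow disk reads, the overall average looks healthy while every uncached read is an order of magnitude slower, which is exactly the kind of anomaly the talk catalogs.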
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ... (Databricks)
This document summarizes a presentation about using the Crail distributed storage system to improve Spark performance on high-performance computing clusters with RDMA networking and NVMe flash storage. The key points are:
1) Traditional Spark storage and networking APIs do not bypass the operating system kernel, limiting performance on modern hardware.
2) The Crail system provides user-level APIs for RDMA networking and NVMe flash to improve Spark shuffle, join, and sorting workloads by 2-10x on a 128-node cluster.
3) Crail allows Spark workloads to fully utilize high-speed networks and disaggregate memory and flash storage across nodes without performance penalties.
This document provides information about IBM's tape storage solutions. It begins with an overview of how tape saves on costs, energy usage, data storage, and protects companies. It then discusses specific uses of tape for backup and archiving large amounts of unstructured or cold data. The document highlights the growing capacity of tape drives and declining capacity growth of hard disk drives. It argues that tape is well-suited for storing cold or inactive data long-term in a cost effective manner. The document also emphasizes how tape provides an "air gap" to protect against ransomware and software bugs by keeping backup data completely offline.
This document describes the JetStor NAS 1600S, an all-in-one NAS/iSCSI/Fibre Channel RAID storage system. It has a wizard interface and supports thin provisioning for iSCSI/Fibre/NAS volumes, high-availability clustering, iSCSI and Fibre Channel targets, volume cloning, and multiple backup solutions. It scales to 96 HDDs through its 16 bays plus expansion units, has mobile apps, and is suitable for use in research, education, telecom, and server virtualization applications.
The document discusses using flash storage to accelerate application performance. It describes how flash provides faster data transfer rates, IOPS and lower latency compared to HDDs. It outlines different ways flash can be used, including as a host-side PCIe device, array-based caching, or within an all-flash array optimized for flash. The Whiptail storage system is highlighted as providing high throughput, IOPS and endurance while reducing power, space and cooling needs compared to HDD solutions. It can support multiple workloads on a single system.
Tape continues to be an important storage solution due to its low cost and high capacity. It plays a key role in backup and archiving, especially for large amounts of cold data. Recent technology demonstrations by IBM show that tape capacity can continue to increase significantly at around 40% per year, with a demonstration of 220TB per cartridge. Tape provides unmatched security as an offline storage medium and is much more reliable than disk storage. With innovations like LTFS, tape is also easier to use for a wider range of applications.
Optimizing the Upstreaming Workflow: Flexibly Scale Storage for Seismic Proce...Avere Systems
Avere Systems provides a solution to optimize seismic data processing workflows by flexibly scaling performance and reducing costs. Their solution improves throughput by 50% while reducing storage footprint by 50% using flash storage and auto-tiering. It simplifies workflows by eliminating unnecessary data copies between specialty storage silos and provides a unified storage system. This allows for faster time to results, lower costs, and easier management compared to existing solutions from NetApp, EMC Isilon, Panasas, and Lustre/DDN.
What is the average rotational latency of this disk drive? What seek.docxajoy21
SSDs have advantages over HDDs like faster speeds without seek times, but are more expensive. Various caching methods leverage SSD speed while retaining HDD capacity. DM-cache, Flashcache, Bcache, and EnhanceIO all use SSDs to cache hot data for faster access without extra storage management. Each has advantages like transparency or ability to cache partitions, but DM-cache may have metadata limits while Bcache uses system memory. A hybrid system provides the best features of both SSDs and HDDs.
This document discusses disk I/O performance testing tools. It introduces SQLIO and IOMETER for measuring disk throughput, latency, and IOPS. Examples are provided for running SQLIO tests and interpreting the output, including metrics like throughput in MB/s, latency in ms, and I/O histograms. Other disk performance factors discussed include the number of outstanding I/Os, block size, and sequential vs random access patterns.
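The factors this summary names (throughput, latency, block size, sequential vs. random access) can be illustrated with a minimal Python sketch; it is no substitute for SQLIO or IOMETER, and the 16 MiB test file and 64 KiB I/O size are arbitrary assumptions. On a file this small the OS page cache dominates, so the numbers reflect cached reads rather than raw disk speed:

```python
import os
import random
import tempfile
import time

BLOCK = 64 * 1024             # 64 KiB per I/O (assumption)
FILE_SIZE = 16 * 1024 * 1024  # 16 MiB test file (assumption)

# Create a temporary test file filled with random data.
fd, path = tempfile.mkstemp()
os.write(fd, os.urandom(FILE_SIZE))
os.close(fd)

def bench(offsets):
    """Read BLOCK bytes at each offset; return (MB/s, avg latency in ms)."""
    fd = os.open(path, os.O_RDONLY)
    start = time.perf_counter()
    for off in offsets:
        os.pread(fd, BLOCK, off)
    elapsed = time.perf_counter() - start
    os.close(fd)
    megabytes = len(offsets) * BLOCK / 1e6
    return megabytes / elapsed, elapsed / len(offsets) * 1000

# Same set of offsets, issued in order vs. shuffled.
seq = list(range(0, FILE_SIZE, BLOCK))
rnd = seq[:]
random.shuffle(rnd)

seq_mbps, seq_lat = bench(seq)
rnd_mbps, rnd_lat = bench(rnd)
print(f"sequential: {seq_mbps:8.1f} MB/s, {seq_lat:.3f} ms/IO")
print(f"random:     {rnd_mbps:8.1f} MB/s, {rnd_lat:.3f} ms/IO")

os.unlink(path)
```

On a spinning disk with the cache bypassed, the gap between the two runs would expose seek and rotational latency; the dedicated tools add the outstanding-I/O (queue depth) dimension that a simple serial loop like this cannot.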
Thunderstorm is a PCIe flash storage card that provides:
1) Fast parallel access to flash memory for up to 80 Gbps throughput and 10x faster boot times compared to SAS.
2) Compatibility without requiring driver software and support for open standards and virtualization.
3) An economical solution without vendor lock-in and with field serviceable flash modules.
VirtualStor Extreme - Software Defined Scale-Out All Flash StorageGIGABYTE Technology
VirtualStor is a software-defined storage platform that aggregates and optimizes all storage resources to provide flexible storage solutions for any environment or application. It uses a scale-out architecture to deliver up to 10 million IOPS and 1PB of storage. VirtualStor offers high performance with sub-millisecond latency, low write amplification to extend SSD life, and the ability to consolidate and seamlessly migrate data from existing storage.
The FalconStor Network Storage Server (NSS) is a storage virtualization and data protection appliance. It provides virtualization and thin provisioning of storage for efficient utilization. The NSS also includes features for data replication, snapshots, and centralized management.
Measuring Storage Performance
Course practice
Presented by Valerian Ceaus
The document discusses using SQLIO to test the input/output capacity of a disk subsystem. It provides guidance on running SQLIO tests with different I/O types, sizes, and durations. The document also discusses interpreting SQLIO results and monitoring I/O performance using Windows Performance Monitor and Resource Monitor. Key factors that influence I/O performance like outstanding I/Os, queue depth, throughput, and latency are explained.
This document discusses IBM's tape storage solutions and the future of tape technology. It notes that tape is very energy efficient, secure, reliable, and cost-effective for archival storage. The document summarizes IBM's recent demonstration of a tape technology that achieved an areal recording density of 123 Gb/in², showing tape has potential for significant future capacity increases. It also reviews challenges with hard disk drive and flash storage scaling and how tape compares favorably due to its larger physical bit cells.
Ceph Day New York 2014: Ceph, a physical perspective Ceph Community
The document summarizes the results of testing a Ceph storage cluster configuration using Supermicro hardware. Key findings include:
- Using SSDs for journals improved sequential write bandwidth significantly.
- Erasure coded pools provided reasonable performance at a lower cost compared to replicated pools.
- A single client could saturate the network connection with two 36-bay OSD nodes.
- Network performance was critical as the cluster scaled to support more clients and objects.
- Further testing was needed on erasure coded performance under failure conditions and using newer Ceph and Linux versions.
Storage and performance- Batch processing, WhiptailInternet World
Batch processing allows jobs to run without manual intervention by shifting processing to less busy times. It avoids idling computing resources and allows higher overall utilization. Batch processing provides benefits like prioritizing batch and interactive work. The document then discusses different approaches to batch processing like dedicating all resources to it or sharing resources. It outlines challenges like systems being unavailable during batch processing. The rest of the document summarizes Whiptail's flash storage solutions for accelerating workloads and reducing costs and resources compared to HDDs.
Similar to Accelerating forensic and incident response workflow: the case for a new standard in forensic imaging - HTCIA 2016 (20)
End-to-end pipeline agility - Berlin Buzzwords 2024Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long does it take for all downstream pipelines to be adapted to an upstream change?", the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Global Situational Awareness of A.I. and where its headedvikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
State of Artificial intelligence Report 2023kuntobimo2016
Artificial intelligence (AI) is a multidisciplinary field of science and engineering whose goal is to create intelligent machines.
We believe that AI will be a force multiplier on technological progress in our increasingly digital, data-driven world. This is because everything around us today, ranging from culture to consumer products, is a product of intelligence.
The State of AI Report is now in its sixth year. Consider this report as a compilation of the most interesting things we’ve seen with a goal of triggering an informed conversation about the state of AI and its implication for the future.
We consider the following key dimensions in our report:
Research: Technology breakthroughs and their capabilities.
Industry: Areas of commercial application for AI and its business impact.
Politics: Regulation of AI, its economic implications and the evolving geopolitics of AI.
Safety: Identifying and mitigating catastrophic risks that highly-capable future AI systems could pose to us.
Predictions: What we believe will happen in the next 12 months and a 2022 performance review to keep us honest.
The Ipsos - AI - Monitor 2024 Report.pdfSocial Samosa
According to Ipsos AI Monitor's 2024 report, 65% of Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
39. Test standard composition
Stored block size vs. LBA address:
- Windows 8.1: 10.2 GB
- Govdocs1 (1-75, 1-40): 59.8 GB
- /dev/random: 38.4 GB
- Empty space (zeros)
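A mixed test-target composition like this (OS data, a document corpus, random data, and zero fill) exercises very different paths in an imager: all-zero regions can be recognised and recorded symbolically rather than read, hashed and compressed as ordinary data. A minimal sketch of that block classification, using a hypothetical 32 KiB block size (not an AFF4 default):

```python
import hashlib

def classify_blocks(data: bytes, block: int = 32 * 1024):
    """Yield (index, kind, sha256) per fixed-size block. Blocks of all
    zeros are candidates for symbolic storage in the container; the rest
    are ordinary data blocks that must be compressed and stored."""
    zero = b"\x00" * block
    for i in range(0, len(data), block):
        chunk = data[i:i + block].ljust(block, b"\x00")  # pad final block
        kind = "zero" if chunk == zero else "data"
        yield i // block, kind, hashlib.sha256(chunk).hexdigest()

# Demo image: one 32 KiB block of data followed by 64 KiB of zeros.
image = b"A" * (32 * 1024) + b"\x00" * (64 * 1024)
blocks = list(classify_blocks(image))
print([(i, k) for i, k, _ in blocks])  # → [(0, 'data'), (1, 'zero'), (2, 'zero')]
```

Skipping the hash-and-compress work for zero-filled regions is one reason sparse portions of a target image far faster than populated ones.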
40. Block-based hashing beats linear stream hashing with low-powered multicore CPUs
[Chart: dual-core i5-3337U 1.8 GHz; sparse-data regions reach maximum CPU hash throughput, while the remainder of the acquisition is read-I/O limited]
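The claim on this slide comes down to parallelism: a linear hash over the whole stream is inherently serial and bound to one core, while fixed-size blocks can be hashed on all cores at once. A minimal sketch of the difference, where the 1 MiB block size and worker count are assumptions rather than AFF4 defaults:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

BLOCK = 1024 * 1024  # 1 MiB blocks (an assumption, not an AFF4 default)

def linear_hash(data: bytes) -> str:
    """Conventional single-stream hash: one core, no parallelism possible."""
    return hashlib.sha256(data).hexdigest()

def block_hashes(data: bytes, workers: int = 4) -> list[str]:
    """Hash fixed-size blocks concurrently. CPython's hashlib releases the
    GIL while digesting large buffers, so threads genuinely use multiple
    cores; the per-block digests form a block hash map instead of one
    stream hash."""
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda b: hashlib.sha256(b).hexdigest(), blocks))

data = bytes(8 * BLOCK)  # 8 MiB of zeros as demo input
print(linear_hash(data)[:16])
hashes = len(block_hashes(data))
print(hashes, "block hashes")
```

On a low-powered dual core, the serial hash caps acquisition at single-core digest speed; spreading block hashes across cores lifts that ceiling until the read I/O path becomes the limit, which is the behaviour the chart summarises.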