Covers different types of big data benchmarking, various benchmark suites, a detailed look at TeraSort, and a demo with TPCx-HS.
Meetup details of the presentation:
http://www.meetup.com/lspe-in/events/203918952/
This document classifies and describes different types of database users: end users/novice users who interact directly with applications; online users who communicate directly with the database through an interface or application; application programmers who develop applications; database administrators who are responsible for designing, maintaining, and securing the database; and database implementers who build database management system software.
This document summarizes a presentation about software defined storage using the open source Gluster file system. It begins with an overview of storage concepts like reliability, performance, and scaling. It then discusses the history and types of storage and provides case studies of proprietary storage systems. The presentation introduces software defined storage and Gluster, describing its modular design, use in cloud computing, pros and cons. Key Gluster concepts are defined and its distributed and replicated volume types are explained. The presentation concludes with instructions for setting up and using Gluster.
This document discusses Oracle database backup and recovery. It covers the need for backups, different types of backups including full, incremental, physical and logical. It describes user-managed backups and RMAN-managed backups. For recovery, it discusses restoring from backups and applying redo logs to recover the database to a point in time. Flashback recovery is also mentioned.
This is the complete information about data replication you need; it focuses on these topics:
What is replication?
Who uses it?
Types of replication
Implementation methods
- Oracle Database is a comprehensive, integrated database management system that provides an open approach to information management.
- The Oracle architecture includes database structures like data files, control files, and redo log files as well as memory structures like the system global area (SGA) and process global area (PGA).
- Key components of the Oracle architecture include the database buffer cache, shared pool, redo log buffer, and background processes that manage instances.
The Message Passing Interface (MPI) in Layman's Terms (Jeff Squyres)
An introduction to the basic concepts of the Message Passing Interface (MPI), and a brief overview of Open MPI, an open source software implementation of the MPI specification.
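To give a concrete feel for the message-passing model the talk describes, here is a minimal sketch using mpi4py, the Python bindings for MPI. The choice of Python bindings is an assumption for illustration only; the example presumes an MPI implementation such as Open MPI plus the mpi4py package are installed.

```python
# Minimal MPI example using mpi4py (assumes Open MPI + mpi4py are installed).
# Run with: mpiexec -n 4 python mpi_hello.py
from mpi4py import MPI

comm = MPI.COMM_WORLD          # default communicator spanning all processes
rank = comm.Get_rank()         # this process's id within the communicator
size = comm.Get_size()         # total number of processes

if rank == 0:
    # Rank 0 sends a greeting to every other rank.
    for dest in range(1, size):
        comm.send(f"hello, rank {dest}", dest=dest, tag=0)
    print(f"rank 0 of {size} sent greetings")
else:
    msg = comm.recv(source=0, tag=0)
    print(f"rank {rank} received: {msg}")
```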
Neural networks are mathematical models inspired by biological neural networks. They are useful for pattern recognition and data classification through a learning process of adjusting synaptic connections between neurons. A neural network maps input nodes to output nodes through an arbitrary number of hidden nodes. It is trained by presenting examples to adjust weights using methods like backpropagation to minimize error between actual and predicted outputs. Neural networks have advantages like noise tolerance and not requiring assumptions about data distributions. They have applications in finance, marketing, and other fields, though designing optimal network topology can be challenging.
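As a toy illustration of the training loop described above (a forward pass, then backpropagating the error to adjust the weights), here is a self-contained sketch. The network size, learning rate, and XOR task are arbitrary choices for the example, not anything taken from the slides.

```python
# Toy single-hidden-layer network trained with backpropagation (illustrative only).
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(1)
# XOR training set: 2 inputs -> 1 output.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
H, lr = 3, 0.5
w1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(H)]  # input (+bias) -> hidden
w2 = [random.uniform(-1, 1) for _ in range(H + 1)]                  # hidden (+bias) -> output

# Convergence depends on the random seed; increase the epoch count if needed.
for _ in range(10000):
    for x, t in data:
        xi = x + [1.0]                                    # append bias input
        h = [sigmoid(sum(w * v for w, v in zip(ws, xi))) for ws in w1]
        hb = h + [1.0]                                    # append bias unit
        y = sigmoid(sum(w * v for w, v in zip(w2, hb)))
        dy = (y - t) * y * (1 - y)                        # output-layer delta
        for j in range(H):                                # hidden deltas + updates
            dh = dy * w2[j] * h[j] * (1 - h[j])
            for i in range(3):
                w1[j][i] -= lr * dh * xi[i]
        for j in range(H + 1):
            w2[j] -= lr * dy * hb[j]

for x, t in data:
    xi = x + [1.0]
    h = [sigmoid(sum(w * v for w, v in zip(ws, xi))) for ws in w1] + [1.0]
    print(x, t, round(sigmoid(sum(w * v for w, v in zip(w2, h))), 2))
```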
The document discusses managing users, roles, and privileges in Oracle databases. It covers creating, altering, and dropping users, viewing user information, predefined user accounts, different types of privileges including system privileges and object privileges, and user roles. It provides examples and descriptions of commands for working with users, roles, and privileges in Oracle databases.
Deadlock Detection in Distributed Systems (DHIVYADEVAKI)
The document discusses deadlocks in computing systems. It defines deadlocks and related concepts like livelock and starvation. It presents various approaches to deal with deadlocks including detection and recovery, avoidance through runtime checks, and prevention by restricting resource requests. Graph-based algorithms are described for detecting and preventing deadlocks by analyzing resource allocation graphs. The Banker's algorithm is introduced as a static prevention method. Finally, it discusses ways to eliminate the conditions required for deadlocks, like mutual exclusion, hold-and-wait, and circular wait.
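To make the graph-based detection approach concrete, here is a small illustrative sketch (mine, not from the slides) that reports a deadlock whenever the wait-for graph contains a cycle:

```python
# Deadlock detection as cycle detection in a wait-for graph (illustrative sketch).
# Nodes are processes; an edge p -> q means "p is waiting for a resource held by q".

def find_cycle(wait_for):
    """Return one deadlocked cycle of processes, or None if the graph is acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {p: WHITE for p in wait_for}
    stack = []

    def dfs(p):
        color[p] = GRAY
        stack.append(p)
        for q in wait_for.get(p, []):
            if color.get(q, WHITE) == GRAY:          # back edge -> cycle found
                return stack[stack.index(q):] + [q]
            if color.get(q, WHITE) == WHITE:
                cycle = dfs(q)
                if cycle:
                    return cycle
        color[p] = BLACK
        stack.pop()
        return None

    for p in list(wait_for):
        if color[p] == WHITE:
            cycle = dfs(p)
            if cycle:
                return cycle
    return None

# P1 waits on P2, P2 waits on P3, P3 waits on P1: a classic circular wait.
print(find_cycle({"P1": ["P2"], "P2": ["P3"], "P3": ["P1"]}))  # ['P1', 'P2', 'P3', 'P1']
print(find_cycle({"P1": ["P2"], "P2": [], "P3": ["P1"]}))      # None
```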
This presentation is for people who want to understand how PostgreSQL shares information among processes using shared memory. Topics covered include the internal data page format, usage of the shared buffers, locking methods, and various other shared memory data structures.
This document discusses online analytical processing (OLAP) and related concepts. It defines data mining, data warehousing, OLTP, and OLAP. It explains that a data warehouse integrates data from multiple sources and stores historical data for analysis. OLAP allows users to easily extract and view data from different perspectives. The document also discusses OLAP cube operations like slicing, dicing, drilling, and pivoting. It describes different OLAP architectures like MOLAP, ROLAP, and HOLAP and data warehouse schemas and architecture.
The document is a question bank for the cloud computing course CS8791. It contains 26 multiple choice or short answer questions related to key concepts in cloud computing including definitions of cloud computing, characteristics of clouds, deployment models, service models, elasticity, horizontal and vertical scaling, live migration techniques, and dynamic resource provisioning.
Handling Schema Changes Using pt-online-schema-change (Mydbops)
This document provides information about the online schema change tool pt-online-schema-change. It lists the tool's website and contact email and describes how the tool performs schema changes on live databases without blocking queries or DDL statements, with minimal impact on performance and downtime.
The document discusses data intensive computing frameworks. It provides background on big data and how it is stored and processed at scale. Specifically, it discusses distributed file systems like HDFS, databases including relational and NoSQL approaches, and data processing frameworks like MapReduce. It aims to explain the challenges in big data and how these tools address issues like scalability, fault tolerance, and distributed computation.
Overview of the Talk:
Introduction to the Subject
Database
Relational Database
Object-Relational Database
Database Management System
History
Programming
SQL
Connecting Java and Matlab to a Database
Advanced DBMS
Data Grid
BigTable
Demo
Products
MySQL, SQLite, Oracle, DB2, Microsoft Access, Microsoft SQL Server
Product Comparison
Data pre-processing is a data mining technique that transforms raw data into an understandable format. Data cleansing, or data cleaning, is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database; it involves identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.
This presentation covers data cleaning and pre-processing.
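A tiny sketch of the kind of cleaning steps described above; the record layout and validation rules are hypothetical, invented purely for the example:

```python
# Illustrative data-cleaning sketch: detect and fix incomplete/invalid records.
# The field names and rules are hypothetical, chosen only for demonstration.
records = [
    {"name": "Alice",  "age": "34",  "email": "alice@example.com"},
    {"name": "  bob ", "age": "",    "email": "bob@example.com"},
    {"name": "Carol",  "age": "-5",  "email": "not-an-email"},
]

def clean(rec):
    rec = dict(rec)
    rec["name"] = rec["name"].strip().title()          # normalize formatting
    age = rec["age"].strip()
    rec["age"] = int(age) if age.lstrip("-").isdigit() else None
    if rec["age"] is not None and rec["age"] < 0:      # drop impossible values
        rec["age"] = None
    if "@" not in rec["email"]:                        # flag invalid emails
        rec["email"] = None
    return rec

cleaned = [clean(r) for r in records]
complete = [r for r in cleaned if all(v is not None for v in r.values())]
print(cleaned)
print(f"{len(complete)} of {len(records)} records are complete after cleaning")
```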
Database Models, Client-Server Architecture, Distributed Database and Classif... (Rubal Sagwal)
Introduction to Data Models
- Hierarchical Model
- Network Model
- Relational Model
- Client/Server Architecture
Introduction to Distributed Database
Classification of DBMS
The document discusses Liquibase, an open source tool for tracking and applying database schema changes. It describes how Liquibase allows developers to define database changes in change logs, tracks which changes have been applied, and facilitates rolling changes back if needed. Key features highlighted include supporting multiple formats, automatic rollback capabilities, and integration with development workflows.
Difference between Homogeneous and Heterogeneous (Faraz Qaisrani)
Muhammad Faraz Qaisrani from the 2nd Batch at Benazir Bhutto Shaheed University discusses types of distributed database management systems (DDBMS). There are two main types: homogeneous, where all data centers use the same software, and heterogeneous, where different data centers may use different database products. Homogeneous systems are easier to design and manage but can be difficult for organizations to implement uniformly. Heterogeneous systems allow integration of existing databases but require translations between different hardware and software.
OLAP (online analytical processing) allows users to easily extract and view data from different perspectives. It was invented by Edgar Codd in the 1980s and uses multidimensional data structures called cubes to store and analyze data. OLAP utilizes either a multidimensional (MOLAP), relational (ROLAP), or hybrid (HOLAP) approach to store cube data in databases and provide interactive analysis of data.
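To illustrate the cube operations named above (slice, dice, roll-up) on a toy fact table; the dimensions and figures here are invented for the example:

```python
# Illustrative OLAP-style slice, dice, and roll-up over a tiny fact table (made-up data).
facts = [
    {"year": 2013, "region": "EU", "product": "A", "sales": 10},
    {"year": 2013, "region": "US", "product": "A", "sales": 7},
    {"year": 2014, "region": "EU", "product": "B", "sales": 12},
    {"year": 2014, "region": "US", "product": "B", "sales": 9},
]

def slice_(cube, **fixed):
    """Slice: fix one dimension to a single value (e.g. year=2014)."""
    return [f for f in cube if all(f[k] == v for k, v in fixed.items())]

def dice(cube, **allowed):
    """Dice: restrict several dimensions to sets of allowed values."""
    return [f for f in cube if all(f[k] in vs for k, vs in allowed.items())]

def rollup(cube, dim):
    """Roll up: aggregate the sales measure along one dimension."""
    totals = {}
    for f in cube:
        totals[f[dim]] = totals.get(f[dim], 0) + f["sales"]
    return totals

print(slice_(facts, year=2014))                      # one 'plane' of the cube
print(dice(facts, region={"EU"}, year={2013, 2014}))
print(rollup(facts, "region"))                       # {'EU': 22, 'US': 16}
```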
A distributed system is a collection of independent computers that appear as a single coherent system to users. Middleware acts as a bridge between operating systems and applications, especially over a network. Examples of distributed systems include the World Wide Web, the internet, and intranets within organizations. Distributed systems provide benefits like increased reliability, scalability, performance, and flexibility compared to centralized systems. However, they also present challenges around security, software complexity, and system failures.
The document discusses optimizing performance in MapReduce jobs. It covers understanding bottlenecks through metrics and logs, tuning parameters to reduce spills during the map task sort and spill phase like io.sort.mb and io.sort.record.percent, and tips for reducer fetch tuning. The goal is to help developers understand and address bottlenecks in their MapReduce jobs to improve performance.
This document discusses data mining techniques, including the data mining process and common techniques like association rule mining. It describes the data mining process as involving data gathering, preparation, mining the data using algorithms, and analyzing and interpreting the results. Association rule mining is explained in detail, including how it can be used to identify relationships between frequently purchased products. Methods for mining multilevel and multidimensional association rules are also summarized.
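To ground the association-rule idea in the summary above, here is a small illustrative computation of support and confidence over made-up shopping baskets:

```python
# Support and confidence for an association rule (illustrative, made-up baskets).
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent): how often the rule holds when it applies."""
    return support(antecedent | consequent) / support(antecedent)

rule = (frozenset({"diapers"}), frozenset({"beer"}))
print(f"support = {support(rule[0] | rule[1]):.2f}")   # 3/5 = 0.60
print(f"confidence = {confidence(*rule):.2f}")         # 0.60/0.80 = 0.75

# Frequent pairs at minimum support 0.4: the first step of Apriori-style mining.
items = set().union(*transactions)
pairs = [set(p) for p in combinations(sorted(items), 2) if support(set(p)) >= 0.4]
print(pairs)
```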
This document provides an overview of the Hadoop MapReduce Fundamentals course. It discusses what Hadoop is, why it is used, common business problems it can address, and companies that use Hadoop. It also outlines the core parts of Hadoop distributions and the Hadoop ecosystem. Additionally, it covers common MapReduce concepts like HDFS, the MapReduce programming model, and Hadoop distributions. The document includes several code examples and screenshots related to Hadoop and MapReduce.
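For a concrete taste of the MapReduce programming model mentioned above, here is the classic word-count pattern written Hadoop Streaming-style. This is a sketch; the command-line invocation convention is an assumption of the example, not taken from the course.

```python
# Word-count mapper and reducer, Hadoop Streaming style (illustrative sketch).
# Hadoop Streaming pipes input splits to the mapper on stdin and sorts the
# mapper's key/value lines by key before feeding them to the reducer.
import sys

def mapper(lines):
    for line in lines:
        for word in line.split():
            print(f"{word}\t1")                 # emit (word, 1) per occurrence

def reducer(lines):
    current, count = None, 0
    for line in lines:                          # input arrives sorted by key
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")    # key changed: flush the total
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")            # flush the last key

if __name__ == "__main__":
    # Hypothetical invocation: "python wc.py map" or "python wc.py reduce".
    mapper(sys.stdin) if sys.argv[1:] == ["map"] else reducer(sys.stdin)
```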
This document discusses shared-memory parallel programming using OpenMP. It begins with an overview of OpenMP and the shared-memory programming model. It then covers key OpenMP constructs for parallelizing loops, including the parallel for pragma and clauses for declaring private variables. It also discusses managing shared data with critical sections and reductions. The document provides several techniques for improving performance, such as loop inversions, if clauses, and dynamic scheduling.
This document discusses I/O virtualization and GPU virtualization. It covers:
- Two approaches to I/O virtualization: hosted and device driver approaches. Hosted has lower engineering cost but lower performance.
- Methods to optimize para-virtualized I/O including split-driver models, reducing data copy costs, and hardware supports like IOMMU and SR-IOV.
- Challenges of GPU virtualization including whether to take a low-level virtualization or high-level API remoting approach. API remoting is preferred due to closed and evolving GPU hardware.
- Hardware pass-through of GPUs for high performance but low scalability. Industry solutions for remote desktop
This document provides an overview of Boyce-Codd normal form (BCNF) which is a type of database normalization. It explains that BCNF was developed in 1974 and aims to eliminate redundant data and ensure data dependencies make logical sense. The document outlines the five normal forms including 1NF, 2NF, 3NF, BCNF, and 4NF. It provides examples of converting non-BCNF tables into BCNF by identifying and removing overlapping candidate keys and grouping remaining items into separate tables based on functional dependencies.
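To make the BCNF condition concrete: a relation is in BCNF when, for every nontrivial functional dependency X -> Y, X is a superkey. A small illustrative checker follows; the schema and dependencies are a textbook-style example, not from the document.

```python
# BCNF check via attribute closure (illustrative example, made-up schema).
# FDs are (lhs, rhs) pairs of attribute sets; X -> Y violates BCNF if X is
# not a superkey, i.e. the closure of X does not cover all attributes.

def closure(attrs, fds):
    """All attributes functionally determined by `attrs` under `fds`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def bcnf_violations(relation, fds):
    return [(lhs, rhs) for lhs, rhs in fds
            if not rhs <= lhs                      # nontrivial dependency
            and closure(lhs, fds) != relation]     # lhs is not a superkey

# Classic example: R(student, course, teacher) with
#   {student, course} -> {teacher}  and  {teacher} -> {course}.
R = {"student", "course", "teacher"}
fds = [({"student", "course"}, {"teacher"}),
       ({"teacher"}, {"course"})]
print(bcnf_violations(R, fds))   # [({'teacher'}, {'course'})] -> decompose on it
```

Splitting R on the violating dependency yields (teacher, course) and (student, teacher), each of which is in BCNF.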
This document discusses benchmarking Hadoop and big data systems. It provides an overview of common Hadoop benchmarks including microbenchmarks like TestDFSIO, TeraSort, and NNBench which test individual Hadoop components. It also describes BigBench, a benchmark modeled after TPC-DS that aims to test a more complete big data analytics workload using techniques like MapReduce, Hive, and Mahout across structured, semi-structured, and unstructured data. The document emphasizes using Hadoop distributions for administration and both microbenchmarks and full benchmarks like BigBench for evaluation.
This tutorial was held at IEEE BigData '14 on October 29, 2014 in Bethesda, MD, USA.
Presenters: Chaitan Baru and Tilmann Rabl
More information available at:
http://msrg.org/papers/BigData14-Rabl
Summary:
This tutorial will introduce the audience to the broad set of issues involved in defining big data benchmarks, for creating auditable industry-standard benchmarks that consider performance as well as price/performance. Big data benchmarks must capture the essential characteristics of big data applications and systems, including heterogeneous data, e.g. structured, semi-structured, unstructured, graphs, and streams; large-scale and evolving system configurations; varying system loads; processing pipelines that progressively transform data; and workloads that include queries as well as data mining and machine learning operations and algorithms. Different benchmarking approaches will be introduced, from micro-benchmarks to application-level benchmarking.
Since May 2012, five workshops have been held on Big Data Benchmarking, with participation from industry and academia. One of the outcomes of these meetings has been the creation of industry's first big data benchmark, viz. TPCx-HS, the Transaction Processing Performance Council's benchmark for Hadoop Systems. During these workshops, a number of other proposals have been put forward for more comprehensive big data benchmarking. The tutorial will present and discuss salient points and essential features of such benchmarks that have been identified in these meetings by experts in big data as well as benchmarking. Two key approaches are now being pursued: one, called BigBench, is based on extending the TPC Decision Support (TPC-DS) benchmark with big data application characteristics; the other, called the Deep Analytics Pipeline, is based on modeling processing that is routinely encountered in real-life big data applications. Both will be discussed.
We conclude with a discussion of a number of future directions for big data benchmarking.
The document summarizes the TeraSort algorithm used in Hadoop. It describes:
1) TeraSort uses MapReduce to sort very large datasets (e.g., 100 TB in under 3 hours) by leveraging thousands of nodes.
2) TeraGen first generates the input data. TeraSort then samples the input keys to compute partition boundaries that divide the key space among the reducers.
3) Each reducer locally sorts its partition, so the entire dataset is sorted when the reducer outputs are concatenated in order.
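The heart of TeraSort is that sampled range partitioning: pick split points from a sample of the keys so each reducer receives a contiguous, roughly equal slice of the key space. A minimal sketch of the idea follows; it is illustrative only, not Hadoop's actual implementation (which uses an InputSampler and a total-order partitioner with binary search):

```python
# Sampled range partitioning, the core idea behind TeraSort (illustrative sketch).
import random

def choose_cutpoints(sample, num_reducers):
    """Pick num_reducers-1 split keys from a sorted sample of the input."""
    sample = sorted(sample)
    step = len(sample) // num_reducers
    return [sample[(i + 1) * step] for i in range(num_reducers - 1)]

def partition(key, cutpoints):
    """Route a key to the first range whose cutpoint exceeds it (the real
    implementation uses binary search; a linear scan is fine for a sketch)."""
    for i, cut in enumerate(cutpoints):
        if key < cut:
            return i
    return len(cutpoints)

random.seed(42)
keys = [random.randrange(10**6) for _ in range(100_000)]   # stand-in for input records
cuts = choose_cutpoints(random.sample(keys, 1000), num_reducers=4)

buckets = [[] for _ in range(4)]
for k in keys:                      # the "map" side routes each key to a reducer
    buckets[partition(k, cuts)].append(k)
for b in buckets:                   # each "reducer" sorts its own range locally
    b.sort()

merged = [k for b in buckets for k in b]   # concatenation in bucket order is sorted
print(merged == sorted(keys), [len(b) for b in buckets])
```

Because every key in bucket i is smaller than every key in bucket i+1, no global merge is needed; sorting each bucket independently sorts the whole dataset.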
This document discusses HiBench, a benchmark suite for Hadoop. It provides an overview of HiBench and how it can be used to characterize and evaluate Hadoop deployments. Evaluation results using HiBench show that a newer Intel Xeon server platform provides up to 86% more throughput and is up to 56% faster than an older platform. Evaluations between Hadoop versions 0.19.1 and 0.20.0 show that improvements in the newer version help reduce job completion times. The document concludes by providing suggestions for optimizing Hadoop deployments through hardware and software configurations.
Performance Testing of Big Data Applications - Impetus Webcast (Impetus Technologies)
Impetus webcast "Performance Testing of Big Data Applications" available at http://lf1.me/cqb/
This Impetus webcast talks about:
• A solution approach to measure performance and throughput of Big Data applications
• Insights into areas to focus for increasing the effectiveness of Big Data performance testing
• Tools available to address Big Data specific performance related challenges
This presentation was held at ISC 2014 on June 26, 2014 in Leipzig, Germany.
More information available at:
http://msrg.org/papers/ISC2014-Rabl
Abstract:
The Workshops for Big Data Benchmarking (http://clds.sdsc.edu/bdbc/workshops), which have been underway since May 2012, have identified a set of characteristics of big data applications that apply to industry as well as scientific application scenarios involving pipelines of processing with steps that include aggregation, cleaning, and annotation of large volumes of data; filtering, integration, fusion, subsetting, and compaction of data; and subsequent analysis, including visualization, data mining, predictive analytics and, eventually, decision making. One of the outcomes of the WBDB workshops has been the formation of a Transaction Processing Performance Council (TPC) subcommittee on Big Data, which is initially defining a Hadoop systems benchmark, TPCx-HS, based on TeraSort. TPCx-HS would be a simple, functional benchmark that would assist in determining basic resiliency and scalability features of large-scale systems. Other proposals are also actively under development, including BigBench, which extends the TPC-DS benchmark for big data scenarios; the Big Decision Benchmark from HP; HiBench from Intel; and the Deep Analytics Pipeline (DAP), which defines a sequence of end-to-end processing steps consisting of some of the operations mentioned above. Pipeline benchmarks reveal the need for different processing modalities and system characteristics for different steps in the pipeline. For example, early processing steps may process very large volumes of data and may benefit from a Hadoop and MapReduce-style of computing, while later steps may operate on more structured data and may require, say, SMP-style architectures or very large memory systems. This talk will provide an overview of these benchmark activities and discuss opportunities for collaboration and future work with industry partners.
YCSB++ is a benchmarking tool that provides extensions to Yahoo!'s Cloud Serving Benchmark (YCSB) to test advanced features of scalable table stores. It allows for distributed, coordinated testing across client nodes using ZooKeeper. It also enables fine-grained, correlated monitoring of systems using the OTUS monitor. The tool is useful for understanding performance problems and debugging complex interactions between components in table stores. Two illustrative examples show how YCSB++ can analyze the tradeoff between fast inserts and weak consistency using batch writing, as well as benchmark features for high-speed ingest like bulk loading and table pre-splitting.
This document outlines the Linux I/O stack as of kernel version 3.3. It shows the path that I/O requests take from applications through the various layers including direct I/O, the page cache, block I/O layer, I/O scheduler, storage devices, filesystems, and network filesystems. Optional components are shown that can be stacked on top of the basic I/O stack like LVM, device mapper targets, multipath, and network transports.
Hadoop in the Clouds, Virtualization and Virtual Machines (DataWorks Summit)
Hadoop can virtualize cluster resources across tenants through abstractions like YARN application containers and HDFS files. For public clouds, Hadoop is often run on VMs for strong isolation, but the main challenge is persisting data when clusters are created and destroyed. For private clouds, Hadoop on VMs works well for test and development clusters, while Hadoop alone provides good multi-tenancy for production. If using VMs in production, understand the motivations and follow guidelines like allocating local disk and avoiding storage fragmentation.
Business Intelligence on Hadoop Benchmark (AtScale)
AtScale, the first company to provide business users with speed, security, and simplicity for BI on Hadoop, shares here the results of a comprehensive Business Intelligence benchmark for SQL-on-Hadoop engines.
The goal of the “Business Intelligence for Hadoop” benchmark is to help technology evaluators select the best SQL-on-Hadoop technology for their use cases.
The benchmark tested the industry's top SQL-on-Hadoop engines on key Business Intelligence (BI) use case queries; it rates the strengths and weaknesses of the engines and reveals which ones are ideally suited to various scenarios.
To learn more about how AtScale can help you make BI work on Hadoop in your enterprise, visit www.atscale.com.
This document provides guidance on sizing and configuring Apache Hadoop clusters. It recommends separating master nodes, which run processes like the NameNode and JobTracker, from slave nodes, which run DataNodes, TaskTrackers and RegionServers. For medium to large clusters it suggests 4 master nodes and the remaining nodes as slaves. The document outlines factors to consider for optimizing performance and cost like selecting balanced CPU, memory and disk configurations and using a "shared nothing" architecture with 1GbE or 10GbE networking. Redundancy is more important for master than slave nodes.
This document summarizes the results of a benchmark comparing the performance of several cloud database systems, including Cassandra, HBase, Sherpa, and MySQL. The benchmark uses a standard workload and measures key metrics like latency and throughput. Overall, Cassandra showed strong write performance but weaker reads. Sherpa delivered good read and write latency as well as high throughput. HBase read latency was poor. Later versions of Cassandra showed performance improvements over earlier versions.
The Enterprise Data Lake has become the de facto repository of both structured and unstructured data within an enterprise. Being able to discover information across both structured and unstructured data using search is a key capability of an enterprise data lake. In this workshop, we will provide an in-depth overview of HDP Search with a focus on configuration, sizing, and tuning. We will also deliver a working example to showcase the usage of HDP Search along with the rest of the platform's capabilities to deliver a real-world solution.
Spark Summit EU talk by Berni Schiefer (Spark Summit)
This document summarizes experiences using the TPC-DS benchmark with Spark SQL 2.0 and 2.1 on a large cluster designed for Spark. It describes the configuration of the "F1" cluster including its hardware, operating system, Spark, and network settings. Initial results show that Spark SQL 2.0 provides significant improvements over earlier versions. While most queries completed successfully, some queries failed or ran very slowly, indicating areas for further optimization.
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce (Hadoop User Group)
The document discusses the Public Terabyte Dataset Project which aims to create a large crawl of top US domains for public use on Amazon's cloud. It describes how the project uses various Amazon Web Services like Elastic MapReduce and SimpleDB along with technologies like Hadoop, Cascading, and Tika for web crawling and data processing. Common issues encountered include configuration problems, slow performance from fetching all web pages or using Tika language detection, and generating log files instead of results.
We present a software model built on the Apache software stack (ABDS) that is widely used in modern cloud computing, which we enhance with HPC concepts to derive HPC-ABDS.
We discuss layers in this stack
We give examples of integrating ABDS with HPC
We discuss how to implement this in a world of multiple infrastructures and evolving software environments for users, developers and administrators
We present Cloudmesh as supporting Software-Defined Distributed System as a Service or SDDSaaS with multiple services on multiple clouds/HPC systems.
We explain the functionality of Cloudmesh as well as the 3 administrator and 3 user modes supported
The document provides an agenda for a DevOps advanced class on Spark being held in June 2015. The class will cover topics such as RDD fundamentals, Spark runtime architecture, memory and persistence, Spark SQL, PySpark, and Spark Streaming. It will include labs on DevOps 101 and 102. The instructor has over 5 years of experience providing Big Data consulting and training, including over 100 classes taught.
sudoers: Benchmarking Hadoop with ALOJA (Nicolas Poggi)
Presentation for the sudoers Barcelona group, Oct 06 2015, on benchmarking Hadoop with the ALOJA open source benchmarking platform. The presentation was mostly a live demo; these slides are posted for the people who could not attend.
http://lanyrd.com/2015/sudoers-barcelona-october/
Testing Big Data: Automated Testing of Hadoop with QuerySurge (RTTS)
Are You Ready? Stepping Up To The Big Data Challenge In 2016 - Learn why Testing is pivotal to the success of your Big Data Strategy.
According to a new report by analyst firm IDG, 70% of enterprises have either deployed or are planning to deploy big data projects and programs this year due to the increase in the amount of data they need to manage.
The growing variety of new data sources is pushing organizations to look for streamlined ways to manage complexities and get the most out of their data-related investments. The companies that do this correctly are realizing the power of big data for business expansion and growth.
Learn why testing your enterprise's data is pivotal for success with big data and Hadoop. Learn how to increase your testing speed, boost your testing coverage (up to 100%), and improve the level of quality within your data - all with one data testing tool.
This document summarizes a presentation on implementing an ETL project using Apache Impala on the Hadoop platform. It discusses the components of a Hadoop ETL project, including data storage, sources, ETL processing, metadata, and targets. It then shares details about a customer case that involved replacing an SQL reporting system with Impala to access standardized data for reporting and analytics within required timeframes. The presentation emphasizes that successful Hadoop ETL projects require data engineering methodologies around organization, modularization, logging, and testing approaches.
Introduction to Mahout and Machine Learning (Varad Meru)
This presentation gives an introduction to Apache Mahout and Machine Learning. It presents some of the important Machine Learning algorithms implemented in Mahout. Machine Learning is a vast subject; this presentation is only an introductory guide to Mahout and does not go into lower-level implementation details.
This talk was given by Sean Owen at the 10th meeting, on February 3rd 2014.
Having collected Big Data, organizations are now keen on data science and “Big Learning”. Much of the focus has been on data science as exploratory analytics: offline, in the lab. However, building from that a production-ready large-scale operational analytics system remains a difficult and ad-hoc endeavor, especially when real-time answers are required. Design patterns for effective implementations are emerging, which take advantage of relaxed assumptions, adopt a new tiered "lambda" architecture, and pick the right scale-friendly algorithms to succeed. Drawing on experience from customer problems and the open source Oryx project at Cloudera, this session will provide examples of operational analytics projects in the field, and present a reference architecture and algorithm design choices for a successful implementation.
Building a Data Pipeline from Scratch - Joe Crobak (Hakka Labs)
A data pipeline is a unified system for capturing events for analysis and building products. It involves capturing user events from various sources, storing them in a centralized data warehouse, and performing analysis and building products using tools like Hadoop. Key components of a data pipeline include an event framework, message bus, data serialization, data persistence, workflow management, and batch processing. A Lambda architecture allows for both batch and real-time processing of data captured by the pipeline.
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ... (Big Data Value Association)
The main goal of the session is to showcase approaches that greatly simplify the work of a data analyst when performing data analytics, or when employing machine learning algorithms, over Big Data. The session will include presentations on
(a) How data analytics workflows can be easily and graphically composed, and then optimized for execution,
(b) How raw data with great variety can be easily queried using SQL interfaces, and
(c) How complex machine learning operations can be performed efficiently in distributed settings.
After these presentations, the speakers will participate in a discussion with the audience, in order to discuss further tools that could make the work of a data analyst more simple.
The document discusses the evolution of router architectures away from traditional router designs. It argues that routers should move from being chassis-based systems running proprietary operating systems to being more modular, microservices-based architectures using open standards like Linux. Key points of the new model outlined include using many small independent software and hardware units for increased resilience, running software in containers, and having a database-driven management and control plane. The document suggests this type of architecture could make routers more programmable, scalable, and adaptable to changing technology needs over time.
Testing Big Data: Automated ETL Testing of Hadoop (Bill Hayduk)
Learn why testing your enterprise's data is pivotal for success with Big Data and Hadoop. See how to increase your testing speed, boost your testing coverage (up to 100%), and improve the level of quality within your data warehouse - all with one ETL testing tool.
Agile Big Data Analytics Development: An Architecture-Centric Approach (SoftServe)
Presented at The Hawaii International Conference on System Sciences by Hong-Mei Chen and Rick Kazman (University of Hawaii), Serge Haziyev (SoftServe).
The document summarizes two use cases for Hadoop in biotech companies. The first case discusses a large biotech firm "N" that implemented Hadoop to improve their drug development workflow using next generation DNA sequencing. Hadoop reduced the workflow from 6 weeks to 2 days. The second case discusses challenges at another biotech firm "M" around scaling genomic data analysis and Hadoop's role in addressing those challenges through improved data ingestion, storage, querying and analysis capabilities.
The document summarizes research done at the Barcelona Supercomputing Center on evaluating Hadoop platforms as a service (PaaS) compared to infrastructure as a service (IaaS). Key findings include:
- Provider (Azure HDInsight, Rackspace CBD, etc.) did not significantly impact performance of wordcount and terasort benchmarks.
- Data size and number of datanodes were more important factors, with diminishing returns on performance from adding more nodes.
- PaaS can save on maintenance costs compared to IaaS but may be more expensive depending on workload and VM size needed. Tuning may still be required with PaaS.
Matching Data Intensive Applications and Hardware/Software Architectures (Geoffrey Fox)
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development. However the same is not so true for data intensive problems even though commercial clouds presumably devote more resources to data analytics than supercomputers devote to simulations. We try to establish some principles that allow one to compare data intensive architectures and decide which applications fit which machines and which software.
We use a sample of over 50 big data applications to identify characteristics of data intensive applications and propose a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks. We consider hardware from clouds to HPC. Our software analysis builds on the Apache software stack (ABDS) that is widely used in modern cloud computing, which we enhance with HPC concepts to derive HPC-ABDS.
We illustrate issues with examples including kernels like clustering, and multi-dimensional scaling; cyberphysical systems; databases; and variants of image processing from beam lines, Facebook and deep-learning.
Matching Data Intensive Applications and Hardware/Software Architectures (Geoffrey Fox)
This document discusses matching data intensive applications to hardware and software architectures. It provides examples of over 50 big data applications and analyzes their characteristics to identify common patterns. These patterns are used to propose a "big data version" of the Berkeley dwarfs and NAS parallel benchmarks for evaluating data-intensive systems. The document also analyzes hardware architectures from clouds to HPC and proposes integrating HPC concepts into the Apache software stack to develop an HPC-ABDS software stack for high performance data analytics. Key aspects of applications, hardware, and software architectures are illustrated with examples and diagrams.
Michael Stack - The State of Apache HBase (hdhappy001)
The document provides an overview of Apache HBase, an open source, distributed, scalable, big data non-relational database. It discusses that HBase is modeled after Google's Bigtable and built on Hadoop for storage. It also summarizes that HBase is used by many large companies for applications such as messaging, real-time analytics, and search indexing. The project is led by an active community of committers and sees steady improvements and new features with each monthly release.
Orca: A Modular Query Optimizer Architecture for Big Data (EMC)
This document describes Orca, a new query optimizer architecture developed by Pivotal for its data management products. Orca is designed to be modular and portable, allowing it to optimize queries for both massively parallel processing (MPP) databases and Hadoop systems. The key features of Orca include its use of a memo structure to represent the search space of query plans, a job scheduler to efficiently explore the search space in parallel, and an extensible framework for property enforcement during query optimization. Performance tests showed that Orca provided query speedups of 10x to 1000x over previous optimization systems.
Apache Spark presentation at HasGeek Fifth Elephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covers a Big Data overview, a Spark overview, Spark internals, and Spark's supported libraries.
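For a flavor of the Spark programming model covered in these slides, here is a minimal PySpark word count. It assumes a local Spark installation, and the input path is a placeholder:

```python
# Minimal PySpark word count (illustrative; assumes Spark is installed locally).
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount")          # run with all local cores

counts = (sc.textFile("input.txt")                  # placeholder input path
            .flatMap(lambda line: line.split())     # one record per word
            .map(lambda word: (word, 1))            # pair each word with 1
            .reduceByKey(lambda a, b: a + b))       # sum the counts per word

for word, n in counts.takeOrdered(10, key=lambda kv: -kv[1]):
    print(word, n)
sc.stop()
```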
An overview of Docker detailing its introduction, architecture, components, and orchestration.
Meetup details of my presentation:
http://www.meetup.com/DevOps-Meetup/events/222569192/
http://www.meetup.com/Scale-Warriors-of-Bangalore/events/223008532/
Introduction to Flocker, a lightweight volume and container manager.
Meetup details of my presentation:
http://www.meetup.com/Docker-Bangalore/events/222476025/
Go is a compiled, garbage-collected programming language that supports concurrent programming through lightweight threads called goroutines and communication between goroutines via channels. It aims to provide both high-level and low-level programming with a clean syntax and minimal features. The document discusses Go's concurrency model, syntax, goroutines, channels, and use cases including cloud infrastructure, mobile development, and audio synthesis.
Kubernetes is an open-source system for managing containerized applications across multiple hosts. It groups related containers into pods that are scheduled together on the same host. Key components include the master node for managing the cluster, minion nodes for hosting pods, and kubelet software for running pods and managing containers. Pods allow tight coupling of related containers, while labels provide loose organization of cooperating pods.
A detailed presentation giving an overview of SDN (Software Defined Networking). It covers basics like the different controllers and touches upon some technical details.
Covers the terminology used, OpenFlow, controllers, OpenDaylight, Cisco ONE, Google B4, NFV, etc.
This document provides an overview of Docker, including what it is, how it compares to virtual machines and containers, its architecture and features. It discusses that Docker virtualizes using lightweight Linux containers rather than full virtual machines, and how this provides benefits like smaller size and faster performance compared to VMs. It also covers Docker's components like the Docker Engine, Hub and images, and how Docker can be used to develop, ship and run applications on any infrastructure.
The presentation provides an introduction to and detailed explanation of Java 8 lambdas and streams. The lambda coverage includes method references and default methods; the streams coverage includes stream operations, types of streams, and collectors. Streams are further elaborated with parallel streams and a benchmarking comparison of sequential and parallel streams.
Additional slides cover Optional, Spliterators, and certain projects based on lambdas and streams.
A detailed presentation on the capabilities of in-memory analytics using Apache Spark: an overview of Apache Spark covering the programming model, cluster mode with Mesos, supported operations, and a comparison with Hadoop MapReduce. It elaborates on the expansion of the Apache Spark stack, including Shark, Spark Streaming, MLlib, and GraphX.
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you... (Zilliz)
Join us to introduce Milvus Lite, a vector database that can run on notebooks and laptops, share the same API with Milvus, and integrate with every popular GenAI framework. This webinar is perfect for developers seeking easy-to-use, well-integrated vector databases for their GenAI apps.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many of those features provide convenience and capability while sacrificing security. This best practices guide outlines steps users can take to better protect their personal devices and information.
What do a Lego brick and the XZ backdoor have in common? (Speck&Tech)
ABSTRACT: At first glance, what a Lego brick and the XZ backdoor have in common might be that both are building blocks, or dependencies, of creative projects and software. In reality, a Lego brick and the XZ backdoor case have much more in common than that.
Join the presentation to immerse yourself in a story of interoperability, standards, and open formats, and then discuss the important role that contributors play in a sustainable open source community.
BIO: An advocate of free software and of standard, open formats. She has been an active member of the Fedora and openSUSE projects and co-founded the LibreItalia Association, where she was involved in several LibreOffice-related events, migrations, and training courses. She previously worked on LibreOffice migrations and training for several public administrations and private companies. Since January 2020 she has worked at SUSE as a Software Release Engineer for Uyuni and SUSE Manager; when not pursuing her passion for computers and for Geeko, she cultivates her curiosity about astronomy (hence her nickname, deneb_alpha).
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
A tale of scale & speed: How the US Navy is enabling software delivery from l... (sonjaschweigert1)
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATOs (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
UiPath Test Automation using UiPath Test Suite series, part 5DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series, part 5. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of a CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Enhancing adoption of Open Source Libraries. A case study on Albumentations.AIVladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speed up fuzzing campaigns by pinpointing and eliminating uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries: Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns, and DIAR helps you find such seeds.
- These are the slides of a talk given at the IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), 2022.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
van Emden shows how Nx can simplify the developer's life and facilitate a rapid transition from concept to production-ready applications. He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Building RAG with self-deployed Milvus vector database and Snowpark Container...Zilliz
This talk will give hands-on advice on building RAG applications with an open-source Milvus database deployed as a docker container. We will also introduce the integration of Milvus with Snowpark Container Services.
2. 2
Why ..
• Evaluating the effect of a hardware/software upgrade:
• OS, Java VM, ...
• Hadoop, Cloudera CDH, Pig, Hive, Impala, ...
• Debugging: compare with other clusters or published results
• Performance tuning
3. 3
Industry Standard benchmarking organizations
• TPC - Transaction Processing Performance Council (http://www.tpc.org/)
• SPEC - The Standard Performance Evaluation Corporation (https://www.spec.org/)
• CLDS - Center for Large-scale Data Systems Research (http://clds.sdsc.edu/bdbc)
• Top Outcomes
• BigData Top100 - an end-to-end, application-layer benchmark for big data applications
• Terasort - functional benchmark focusing on the Sort function (quicksort using MapReduce)
• HiBench - Sort, machine learning (K-means clustering, classification)
4. 4
Types of Benchmark
• Micro-benchmarks: evaluate specific lower-level system operations
• E.g., A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters, Panda et al., OSU
• Functional / component benchmarks: a specific high-level function
• E.g., sorting: Terasort
• E.g., basic SQL: individual SQL operations such as Select, Project, Join, Order-By, ...
• Application-level benchmarks: measure system performance (hardware and software) for a given application scenario, with given data and workload
5. 5
Terasort using Hadoop
Terasort comprises 3 MapReduce applications:
• Teragen – generates the input data
• Terasort – samples the input data and sorts it with MapReduce
• Teravalidate – validates that the output data is sorted
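In practice all three phases are launched as jobs from the stock Hadoop examples jar. Below is a minimal driver sketch in Python, assuming a hadoop client on the PATH; the jar location, HDFS paths, and row count are illustrative assumptions, not part of any benchmark specification.

import subprocess

# Assumed location of the stock examples jar; adjust for your distribution.
EXAMPLES_JAR = "/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar"
ROWS = 10_000_000_000  # 10B rows x 100 bytes/row, roughly 1 TB of input

def run_job(job, *args):
    # Launch one example MapReduce job and fail fast on a non-zero exit.
    subprocess.run(["hadoop", "jar", EXAMPLES_JAR, job, *args], check=True)

run_job("teragen", str(ROWS), "/benchmarks/tsort-in")                          # generate input
run_job("terasort", "/benchmarks/tsort-in", "/benchmarks/tsort-out")           # sample + sort
run_job("teravalidate", "/benchmarks/tsort-out", "/benchmarks/tsort-report")   # verify ordering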
7. 7
MapReduce model: a closer look at MapReduce's implementation model
Source: http://developer.yahoo.com/hadoop/tutorial/module4.html
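The way Terasort maps onto this model (sample the keys, range-partition the records, sort each partition) can be sketched in a few lines of plain Python. This is a toy, single-process illustration of the map/shuffle/reduce flow, not Hadoop's implementation; the partition count and sample size are arbitrary choices.

import random

def terasort_sketch(records, num_partitions=4, sample_size=100):
    # "Sample" phase: pick cut points so each reducer owns one key range.
    sample = sorted(random.sample(records, min(sample_size, len(records))))
    cuts = [sample[(i + 1) * len(sample) // num_partitions - 1]
            for i in range(num_partitions - 1)]

    # "Map/shuffle" phase: route each record to the partition owning its range.
    partitions = [[] for _ in range(num_partitions)]
    for r in records:
        idx = sum(r > c for c in cuts)  # index of the first range that can hold r
        partitions[idx].append(r)

    # "Reduce" phase: each partition sorts locally; since the ranges are
    # ordered, concatenating the sorted partitions is a global sort.
    return [r for part in partitions for r in sorted(part)]

data = [random.randrange(1_000_000) for _ in range(10_000)]
assert terasort_sketch(data) == sorted(data)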
8. 8
Benchmarking Suites
• HiBench, Yan Li, Intel (https://github.com/intel-hadoop/HiBench)
• YCSB - Yahoo! Cloud Serving Benchmark, Brian Cooper, Yahoo! (https://github.com/brianfrankcooper/YCSB/)
• Berkeley Big Data Benchmark, Pavlo et al., AMPLab (https://amplab.cs.berkeley.edu/benchmark/)
• BigDataBench, Jianfeng Zhan, Chinese Academy of Sciences (http://prof.ict.ac.cn/BigDataBench/)
• GridMix (http://hadoop.apache.org/docs/r1.2.1/gridmix.html)
• BigBench (https://github.com/intel-hadoop/Big-Bench)
• TPCx-HS (http://www.tpc.org/tpcx-hs/)
9. 9
TPCx-HS benchmarks
X: Express, H: Hadoop, S: Sort
• TPCx-HS was developed to provide an objective measure of hardware, operating system, and commercial Apache Hadoop File System API-compatible software distributions, and to provide the industry with verifiable performance, price-performance, and availability metrics.
• http://www.tpc.org/tpcx-hs/
11. 11
TPCx-HS benchmarks
Scale Factor
TPCx-HS follows a stepped-size model. The scale factor (SF) for the test dataset must be chosen from a fixed set:
• 1TB, 3TB, 10TB, 30TB, 100TB, 300TB, 1000TB, 3000TB, 10000TB
• The corresponding record counts are 10B, 30B, 100B, 300B, 1000B, 3000B, 10000B, 30000B, 100000B, where each record is 100 bytes generated by HSGen.
• http://www.tpc.org/tpcx-hs/
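The record counts follow directly from the 100-byte record size: 1 TB is 10^12 bytes, so a scale factor of N TB means N x 10^10 (N x 10 billion) records. A quick sanity check of the table above:

RECORD_BYTES = 100  # fixed HSGen record size
for sf_tb in (1, 3, 10, 30, 100, 300, 1000, 3000, 10000):
    records = sf_tb * 10**12 // RECORD_BYTES   # 1 TB = 10**12 bytes
    print(f"SF {sf_tb:>5} TB -> {records // 10**9:>6}B records")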
17. 17
Spark sorted the same data 3X faster using 10X fewer machines. All the sorting took place on disk (HDFS), without using Spark's in-memory cache.
TPCx-HS FDR, 11 January 2015. Measured configuration:
• Total nodes: 16 (cluster)
• CPU type: Intel Xeon E5-2660, 2.20 GHz
• Total processors/cores/threads: 32/320/640
• Total memory: 4,096 GB
• Total number of storage drives/devices: 384
• Total storage capacity: 384 TB
• Data generation time (hours): 0.23
• Data sort time (hours): 1.29
• Data validation time (hours): 0.22
• Total storage / database size ratio: 38.40
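For context, a hedged sketch of how phase times like these roll up into the benchmark's headline number. Assuming the published TPCx-HS metric is HSph@SF = SF / (T / 3600), with T the elapsed seconds of the full run (generate + sort + validate), the times quoted above combine as follows; the 10 TB scale factor in the example is an assumption, since the excerpt does not state it.

# Hedged sketch of the TPCx-HS headline metric, assuming the definition
# HSph@SF = SF / (T / 3600), where T is the elapsed time in seconds of a
# full performance run (HSGen + HSSort + HSValidate).
def hsph(scale_factor_tb: float, elapsed_seconds: float) -> float:
    """Scale factor units processed per hour for one performance run."""
    return scale_factor_tb / (elapsed_seconds / 3600.0)

elapsed_hours = 0.23 + 1.29 + 0.22            # phase times from the result above
print(f"HSph ~ {hsph(10, elapsed_hours * 3600):.2f}")  # assumed SF of 10 TB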
MVAPICH2 is an open source implementation of Message Passing Interface (MPI) that delivers the best performance, scalability and fault tolerance for high-end computing systems and servers using InfiniBand, 10GigE/iWARP and RoCE networking technologies.