This document discusses different methods of parallelism for data warehousing including data parallelism, temporal parallelism (pipelining), and their advantages. Data parallelism involves executing a single query across partitions of data using multiple query servers coordinated by a query coordinator. Temporal parallelism breaks a task into independent subtasks that can execute concurrently in a pipeline. Pipelining aims to increase throughput rather than decrease subtask time. The document also covers partitioning strategies like round robin, hash, and range partitioning and their suitability for different types of queries.
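To make the three partitioning strategies named above concrete, here is a minimal, hedged Python sketch of round robin, hash, and range partitioning; the sample rows, key column, partition count, and range boundaries are invented for illustration and are not taken from the document.

```python
# Minimal sketch of round-robin, hash, and range partitioning over toy rows.
NUM_PARTITIONS = 4

def round_robin_partition(rows, n=NUM_PARTITIONS):
    """Spread rows evenly regardless of their values: good for full scans."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def hash_partition(rows, key, n=NUM_PARTITIONS):
    """Co-locate equal keys: good for equality (point) queries and joins."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts

def range_partition(rows, key, boundaries):
    """Keep contiguous key ranges together: good for range queries."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        idx = sum(row[key] >= b for b in boundaries)  # boundaries must be sorted
        parts[idx].append(row)
    return parts

if __name__ == "__main__":
    sales = [{"order_id": i, "amount": a} for i, a in enumerate([5, 42, 17, 99, 63, 8])]
    print(round_robin_partition(sales))
    print(hash_partition(sales, key="order_id"))
    print(range_partition(sales, key="amount", boundaries=[20, 60]))
```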
This document discusses parallelism in data warehousing. It explains that parallelism can improve performance for large table scans, joins, indexing, and data loading/modification operations. It also discusses Amdahl's law, which shows that the potential speedup from parallelism is limited by the percentage of sequential operations. Additionally, the document provides an overview of different parallel hardware architectures like SMP, distributed memory, and NUMA systems and software architectures like shared disk, shared nothing, and shared everything.
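As a worked illustration of Amdahl's law mentioned above, the short Python calculation below shows how the sequential fraction caps achievable speedup; the 10% sequential fraction is an assumed example, not a figure from the document.

```python
# Amdahl's law: speedup(N) = 1 / (s + (1 - s) / N), where s is the sequential fraction.
def amdahl_speedup(seq_fraction, processors):
    return 1.0 / (seq_fraction + (1.0 - seq_fraction) / processors)

# Example: with 10% sequential work, even unlimited processors cannot exceed 10x.
for n in (2, 4, 16, 256):
    print(n, round(amdahl_speedup(0.10, n), 2))
print("upper bound:", 1.0 / 0.10)
```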
IRJET: Hadoop based Frequent Closed Item-Sets for Association Rules from ... by IRJET Journal
This document summarizes a research paper that proposes a Hadoop-based approach called FiDoop-DP for efficiently mining frequent closed itemsets from big data using parallel computing. FiDoop-DP is a data partitioning technique that aims to improve the performance of parallel frequent itemset mining on Hadoop clusters by reducing redundant data transmission between nodes. It does this by grouping highly related transactions together in partitions based on transaction correlations to minimize redundant transactions. The paper describes how FiDoop-DP was implemented and evaluated on a 24-node Hadoop cluster using various datasets. Experimental results showed that FiDoop-DP significantly improved performance over existing parallel frequent pattern mining algorithms, by an average of 18-31%, by reducing computing loads.
The document provides an overview of big data, analytics, Hadoop, and related concepts. It discusses what big data is and the challenges it poses. It then describes Hadoop as an open-source platform for distributed storage and processing of large datasets across clusters of commodity hardware. Key components of Hadoop introduced include HDFS for storage, MapReduce for parallel processing, and various other tools. A word count example demonstrates how MapReduce works. Common use cases and companies using Hadoop are also listed.
This document discusses modeling algorithms using the MapReduce framework. It outlines types of learning that can be done in MapReduce, including parallel training of models, ensemble methods, and distributed algorithms that fit the statistical query model (SQM). Specific algorithms that can be implemented in MapReduce are discussed, such as linear regression, naive Bayes, logistic regression, and decision trees. The document provides examples of how these algorithms can be formulated and computed in a MapReduce paradigm by distributing computations across mappers and reducers.
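As a hedged sketch of how an SQM-style algorithm maps onto mappers and reducers, the following pure-Python example fits ordinary least squares by having each "mapper" compute the sufficient statistics X^T X and X^T y over its data split and a "reducer" sum them and solve the normal equations. The synthetic data, the four pretend map tasks, and the coefficient values are illustrative assumptions, not the document's own example.

```python
# Hedged sketch: linear regression in the statistical query model.
# Each mapper emits the sufficient statistics (X^T X, X^T y) for its split;
# the reducer sums them and solves the normal equations. Data is synthetic.
import numpy as np

def mapper(x_split, y_split):
    return x_split.T @ x_split, x_split.T @ y_split

def reducer(partials):
    xtx = sum(p[0] for p in partials)
    xty = sum(p[1] for p in partials)
    return np.linalg.solve(xtx, xty)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.01, size=1000)

splits = np.array_split(np.arange(1000), 4)          # pretend these are 4 map tasks
partials = [mapper(X[idx], y[idx]) for idx in splits]
print(reducer(partials))                              # close to [2.0, -1.0, 0.5]
```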
WANdisco is a provider of non-stop software for global enterprises to meet the challenges of Big Data and distributed software development.
KEY HIGHLIGHTS, Session 1: Tuesday, Feb. 26, 5:15 p.m.-6 p.m.
Hadoop and HBase on the Cloud: A Case Study on Performance and Isolation
Cloud infrastructure is a flexible tool for orchestrating multiple Hadoop and HBase clusters, providing strict isolation of data and compute resources for multiple customers. Most importantly, our benchmarks show that a virtualized environment allows for higher average utilization of per-node resources. For more session information, visit http://na.apachecon.com/schedule/presentation/131/.
CO-PRESENTERS, Dr. Konstantin V. Shvachko, Chief Architect, Big Data, WANdisco and Jagane Sundar, CTO/VP Engineering, Big Data, WANdisco
A veteran Hadoop developer and respected author, Konstantin Shvachko is a technical expert specializing in efficient data structures and algorithms for large-scale distributed storage systems. Konstantin joined WANdisco through the AltoStor acquisition and before that he was founder and Chief Scientist at AltoScale, a Hadoop and HBase-as-a-Platform company acquired by VertiCloud. Konstantin played a lead architectural role at eBay, building two generations of the organization's Hadoop platform. At Yahoo!, he worked on the Hadoop Distributed File System (HDFS). Konstantin has dozens of publications and presentations to his credit and is currently a member of the Apache Hadoop PMC. Konstantin has a Ph.D. in Computer Science and M.S. in Mathematics from Moscow State University, Russia.
Jagane Sundar has extensive big data, cloud, virtualization, and networking experience and joined WANdisco through its AltoStor acquisition. Before AltoStor, Jagane was founder and CEO of AltoScale, a Hadoop and HBase-as-a-Platform company acquired by VertiCloud. His experience with Hadoop began as Director of Hadoop Performance and Operability at Yahoo! Jagane has such accomplishments to his credit as the creation of Livebackup, development of a user mode TCP Stack for Precision I/O, development of the NFS and PPP clients and parts of the TCP stack for JavaOS for Sun MicroSystems, and more. Jagane received his B.E. in Electronics and Communications Engineering from Anna University.
About WANdisco
WANdisco (LSE: WAND) is a provider of enterprise-ready, non-stop software solutions that enable globally distributed organizations to meet today's data challenges of secure storage, scalability and availability. WANdisco's products are differentiated by the company's patented, active-active data replication technology, serving crucial high availability (HA) requirements, including Hadoop Big Data and Application Lifecycle Management (ALM). Fortune Global 1000 companies including AT&T, Motorola, Intel and Halliburton rely on WANdisco for performance, reliability, security and availability. For additional information, please visit www.wandisco.com.
It will cover features of the HDF5 library for achieving better I/O performance and efficient storage. The following HDF5 features will be discussed: chunked storage layout.
This tutorial is for persons who are already familiar with HDF5 and wish to take advantage of some of its advanced features.
This document provides an overview of topics to be covered in a Big Data training. It will discuss uses of Big Data, Hadoop, HDFS architecture, MapReduce algorithm, WordCount example, tips for MapReduce, and distributing Twitter data for testing. Key concepts that will be covered include what Big Data is, how HDFS is architected, the MapReduce phases of map, sort, shuffle, and reduce, and how WordCount works as a simple MapReduce example. The goal is to introduce foundational Big Data and Hadoop concepts.
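A minimal, hedged Python imitation of the WordCount phases described above (map, sort/shuffle by key, reduce); it simulates the MapReduce dataflow in a single process rather than using Hadoop itself, and the sample lines are made up.

```python
# WordCount as a tiny in-process imitation of the MapReduce phases.
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # Emit (word, 1) for every word in the input line.
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(word, counts):
    # Sum the counts that the shuffle grouped under one key.
    return word, sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
intermediate = [pair for line in lines for pair in map_phase(line)]
intermediate.sort(key=itemgetter(0))                       # sort/shuffle by key
result = [reduce_phase(k, [c for _, c in grp])
          for k, grp in groupby(intermediate, key=itemgetter(0))]
print(result)   # e.g. [('brown', 1), ('dog', 1), ('fox', 2), ...]
```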
Apache Hadoop is a framework for distributed computation and storage of very large data sets on computer clusters. Hadoop began as a project to implement Google's MapReduce programming model and has become synonymous with a rich ecosystem of related technologies, including Apache Pig, Apache Hive, Apache Spark, Apache HBase, and others.
The document discusses MapReduce, including its programming model, internal framework, and improvements. It describes MapReduce as a programming model and framework that allows parallel processing of large datasets across commodity machines. The map function processes input key-value pairs to generate intermediate pairs, and the reduce function combines values for each key. The framework automatically parallelizes jobs and provides fault tolerance.
Hadoop Institutes: Kelly Technologies is one of the best Hadoop training institutes in Hyderabad, providing Hadoop training by real-time faculty.
http://www.kellytechno.com/Hyderabad/Course/Hadoop-Training
This document provides an overview of big data concepts, technologies, and data scientists. It discusses how big data has outpaced traditional data warehousing and business intelligence technologies due to the increasing volumes, varieties, and velocities of data. It introduces Hadoop as an open source framework for distributed storage and processing of large datasets across clusters of commodity hardware. Key components of Hadoop like HDFS and MapReduce are explained at a high level. The document also discusses related open source projects that extend Hadoop's capabilities.
CBO choice between Index and Full Scan: the good, the bad and the ugly param... by Franck Pachot
Usually, the conclusion comes at the end. But here I will clearly state my goal: I wish I will never see the optimizer_index_cost_adj parameter again. Especially when going to 12c, where Adaptive Join can be completely fooled because of it. Choosing between index access and full table scan is a key point when optimizing a query, and historically the CBO came with several ways to influence that choice. But on some systems, the workarounds have accumulated one on top of the other, completely biasing the CBO estimations. And we see nested loops on huge numbers of rows because of those false estimations.
This document discusses Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It describes how Hadoop uses HDFS for distributed storage and fault tolerance, YARN for resource management, and MapReduce for parallel processing of large datasets. It provides details on the architecture of HDFS including the name node, data nodes, and clients. It also explains the MapReduce programming model and job execution involving map and reduce tasks. Finally, it states that as data volumes continue rising, Hadoop provides an affordable solution for large-scale data handling and analysis through its distributed and scalable architecture.
A SQL implementation on the MapReduce framework by eldariof
Tenzing is a SQL query engine built on top of MapReduce for analyzing large datasets at Google. It provides low latency queries over heterogeneous data sources at petabyte scale. Tenzing supports a full SQL implementation with extensions and optimizations to achieve high performance comparable to commercial parallel databases by leveraging MapReduce and traditional database techniques. It is used by over 1,000 Google employees to query over 1.5 petabytes of data and serve 10,000 queries daily.
Accompanying slides for the class “Introduction to Hadoop” at the PRACE Autumn school 2020 - HPC and FAIR Big Data organized by the faculty of Mechanical Engineering of the University of Ljubljana (Slovenia).
These are slides from our recent HadoopIsrael meetup, dedicated to a comparison of the Spark and Tez frameworks.
At the end of the meetup there is a small update about our ImpalaToGo project.
PRACE Autumn school 2021 - Big Data with Hadoop and Keras
27-30 September 2021
Fakulteta za strojništvo (Faculty of Mechanical Engineering)
Europe/Ljubljana
Data and scripts are available at: https://www.events.prace-ri.eu/event/1226/timetable/
The document contains 31 questions and answers related to Hadoop concepts. It covers topics like common input formats in Hadoop, differences between TextInputFormat and KeyValueInputFormat, what are InputSplits and how they are created, how partitioning, shuffling and sorting occurs after the map phase, what is a combiner, functions of JobTracker and TaskTracker, how speculative execution works, using distributed cache and counters, setting number of mappers/reducers, writing custom partitioners, debugging Hadoop jobs, and failure handling processes for production Hadoop jobs.
This document discusses efficient analysis of big data using the MapReduce framework. It introduces the challenges of analyzing large and complex datasets, and describes how MapReduce addresses these challenges through its map and reduce functions. MapReduce allows distributed processing of big data across clusters of computers using a simple programming model.
The document discusses the Hadoop ecosystem. It provides an overview of Hadoop and its core components HDFS and MapReduce. HDFS is the storage component that stores large files across nodes in a cluster. MapReduce is the processing framework that allows distributed processing of large datasets in parallel. The document also discusses other tools in the Hadoop ecosystem like Hive, Pig, and Hadoop distributions from companies. It provides examples of running MapReduce jobs and accessing HDFS from the command line.
Less is More: 2X Storage Efficiency with HDFS Erasure Coding by Zhe Zhang
Ever since its creation, HDFS has been relying on data replication to shield against most failure scenarios. However, with the explosive growth in data volume, replication is getting quite expensive: the default 3x replication scheme incurs a 200% overhead in storage space and other resources (e.g., network bandwidth when writing the data). Erasure coding (EC) uses far less storage space while still providing the same level of fault tolerance. Under typical configurations, EC reduces the storage cost by ~50% compared with 3x replication.
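The storage figures quoted above follow from simple arithmetic. The hedged snippet below compares 3x replication with a Reed-Solomon (6,3) layout (assumed here as the EC scheme; it is a common HDFS erasure coding policy), treating each as the ratio of raw bytes stored to logical bytes.

```python
# Storage overhead comparison: 3x replication vs Reed-Solomon(6,3) erasure coding.
def overhead(raw_per_logical):
    """Extra storage beyond the logical data, as a percentage."""
    return (raw_per_logical - 1.0) * 100

replication_raw = 3.0                 # 3 full copies of every block
rs_data, rs_parity = 6, 3             # RS(6,3): 6 data blocks + 3 parity blocks
ec_raw = (rs_data + rs_parity) / rs_data

print(f"3x replication overhead: {overhead(replication_raw):.0f}%")             # 200%
print(f"RS(6,3) overhead:        {overhead(ec_raw):.0f}%")                      # 50%
print(f"space saved by EC:       {(1 - ec_raw / replication_raw) * 100:.0f}%")  # 50%
```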
Radiant IT online training provides online training for all software and networking courses; we specialize in Hadoop online training and provide live projects during the course.
This document provides a summary of improvements made to Hive's performance through the use of Apache Tez and other optimizations. Some key points include:
- Hive was improved to use Apache Tez as its execution engine instead of MapReduce, reducing latency for interactive queries and improving throughput for batch queries.
- Statistics collection was optimized to gather column-level statistics from ORC file footers, speeding up statistics gathering.
- The cost-based optimizer Optiq was added to Hive, allowing it to choose better execution plans.
- Vectorized query processing, broadcast joins, dynamic partitioning, and other optimizations improved individual query performance by over 100x in some cases.
SAP technical deep dive in a column-oriented in-memory database by Alexander Talac
The document describes a lecture on column-oriented in-memory databases. The lecture covers the status quo of enterprise computing, database storage techniques like row and column storage layouts, in-memory database operators like scanning and aggregation, and advanced storage techniques like dictionary encoding and tuple reconstruction. The goal is to provide a deep technical understanding of column-oriented in-memory databases and their application in enterprise systems.
Parallel Data Processing with MapReduce: A Survey by Kyong-Ha Lee
This document summarizes a survey on parallel data processing with MapReduce. It provides an overview of the MapReduce framework, including its architecture, key concepts of Map and Reduce functions, and how it handles parallel processing. It also discusses some inherent pros and cons of MapReduce, such as its simplicity but also performance limitations. Finally, it outlines approaches studied in recent literature to improve and optimize the MapReduce framework.
• What is MapReduce?
• What are MapReduce implementations?
Facing these questions, I did some personal research and put together a synthesis, which helped me clarify some ideas. The attached presentation does not intend to be exhaustive on the subject, but could perhaps bring you some useful insights.
The Hadoop platform uses the Hadoop Distributed File System (HDFS) to reliably store large files across thousands of nodes. It requires minimum levels of computing power, memory, storage, and network bandwidth per node. A recommended cluster size depends on roughly linear relationships between resources and efficiency. Dashboards can be created using data extracted from HDFS into SQL for analytics. The Hadoop architecture is designed to scale easily by adding more servers as data and workloads increase.
ADBS: Parallel Databases in Advanced DBMS by chandugoswami
This document discusses parallel database architecture. It covers various types of parallelism including I/O parallelism, inter-query parallelism, and intra-query parallelism. It describes techniques for partitioning relations across multiple disks to enable I/O parallelism, including round robin, hash, and range partitioning. It also addresses issues like skew in partitioning and techniques to handle skew like virtual processor partitioning and histograms.
A quick review and demonstration of how to get started with parallel computing in R. Includes an example of a SNOW cluster set up in the departmental lab.
Data Analytics and Simulation in Parallel with MATLAB* by Intel® Software
This talk covers the current parallel capabilities in MATLAB*. Learn about its parallel language and distributed and tall arrays. Interact with GPUs both on the desktop and in the cluster. Combine this information into an interesting algorithmic framework for data analysis and simulation.
Data structures assignment week4b.pdf (CI583 Data Structure) by OllieShoresna
The document discusses resiliency in file systems. Early file systems could become corrupted after crashes, losing unwritten file modifications or damaging unrelated files. This was because data structures describing the entire file system, like the superblock and inode list, could become corrupted. Modern file systems employ techniques like journaling and shadow paging to ensure all changes are written atomically and the file system remains consistent even after crashes.
Basics in algorithms and data structures by Eman Magdy
The document discusses data structures and algorithms. It notes that good programmers focus on data structures and their relationships, while bad programmers focus on code. It then provides examples of different data structures like trees and binary search trees, and algorithms for searching, inserting, deleting, and traversing tree structures. Key aspects covered include the time complexity of different searching algorithms like sequential search and binary search, as well as how to implement operations like insertion and deletion on binary trees.
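As a hedged illustration of the complexity gap mentioned above, here is a small Python sketch contrasting sequential search (O(n)) with binary search on sorted data (O(log n)); the example data is invented.

```python
# Sequential search is O(n); binary search on sorted data is O(log n).
def sequential_search(items, target):
    for i, value in enumerate(items):
        if value == target:
            return i
    return -1

def binary_search(items, target):
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

data = sorted(range(0, 1000, 3))          # sorted example data
print(sequential_search(data, 300), binary_search(data, 300))   # same index, fewer steps
```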
Parallel Computing 2007: Bring your own parallel application by Geoffrey Fox
This document discusses parallelizing several algorithms and applications including k-means clustering, frequent itemset mining, integer programming, computer chess, and support vector machines (SVM). For k-means and frequent itemset mining, the algorithms can be parallelized by partitioning the data across processors and performing partial computations locally before combining results with an allreduce operation. Computer chess can be parallelized by exploring different game tree branches simultaneously on different processors. SVM problems involve large dense matrices that are difficult to solve in parallel directly due to their size exceeding memory; alternative approaches include solving smaller subproblems independently.
This document provides an introduction to data structures and algorithms. It discusses arrays, stacks, queues and their applications. It also covers time and space complexity analysis of algorithms. Arrays are introduced as a linear data structure for storing similar data elements. Array elements can be accessed using an index or subscript. Multi-dimensional arrays can also be implemented for storing data in rows and columns.
Spark Summit EU talk by Sameer Agarwal (Spark Summit)
This document discusses Project Tungsten, which aims to substantially improve the memory and CPU efficiency of Spark. It describes how Spark has optimized IO but the CPU has become the bottleneck. Project Tungsten focuses on improving execution performance through techniques like explicit memory management, code generation, cache-aware algorithms, whole-stage code generation, and columnar in-memory data formats. It shows how these techniques provide significant performance improvements, such as 5-30x speedups on operators and 10-100x speedups on radix sort. Future work includes cost-based optimization and improving performance on many-core machines.
The document provides an overview of asynchronous processing and how it relates to scalability and performance. It discusses key topics like sync vs async, scheduling, latency measurement, concurrent vs lock-free vs wait-free data structures, I/O models like IO, AIO, NIO, zero-copy, and sorting algorithms. It emphasizes picking the right tools for the job and properly benchmarking and measuring performance.
The document discusses algorithms and their analysis. It defines an algorithm as a sequence of unambiguous steps to solve a problem within a finite time. Characteristics of algorithms include being unambiguous, having inputs/outputs, and terminating in finite time. Algorithm analysis involves determining theoretical and empirical time and space complexity as input size increases. Time complexity is analyzed by counting basic operations, while space complexity considers fixed and variable memory usage. Worst, best, and average cases analyze how efficiency varies with different inputs. Asymptotic analysis focuses on long-term growth rates to compare algorithms.
Cooperative Task Execution for Apache Spark by Databricks
Apache Spark has enabled a vast assortment of users to express batch, streaming, and machine learning computations, using a mixture of programming paradigms and interfaces. Lately, we observe that different jobs are often implemented as part of the same application to share application logic, state, or to interact with each other. Examples include online machine learning, real-time data transformation and serving, low-latency event monitoring and reporting. Although the recent addition of Structured Streaming to Spark provides the programming interface to enable such unified applications over bounded and unbounded data, the underlying execution engine was not designed to efficiently support jobs with different requirements (i.e., latency vs. throughput) as part of the same runtime. It therefore becomes particularly challenging to schedule such jobs to efficiently utilize the cluster resources while respecting their requirements in terms of task response times. Scheduling policies such as FAIR could alleviate the problem by prioritizing critical tasks, but the challenge remains, as there is no way to guarantee no queuing delays. Even though preemption by task killing could minimize queuing, it would also require task resubmission and loss of progress, leading to wasted cluster resources. In this talk, we present Neptune, a new cooperative task execution model for Spark with fine-grained control over resources such as CPU time. Neptune utilizes Scala coroutines as a lightweight mechanism to suspend task execution with sub-millisecond latency and introduces new scheduling policies that respect diverse task requirements while efficiently sharing the same runtime. Users can directly use Neptune for their continuous applications as it supports all existing DataFrame, DataSet, and RDD operators. We present an implementation of the execution model as part of Spark 2.4.0 and describe the observed performance benefits from running a number of streaming and machine learning workloads on an Azure cluster.
Speaker: Konstantinos Karanasos
A Tale of Data Pattern Discovery in Parallel by Jenny Liu
In the era of IoT and A.I., distributed and parallel computing is embracing big-data-driven and algorithm-focused applications and services. Even with rapid progress in parallel frameworks, algorithms, and accelerated computing capacity, it remains challenging to deliver an efficient and scalable data analysis solution. This talk shares a research experience on data pattern discovery in domain applications. In particular, the research scrutinizes key factors in analysis workflow design and data parallelism improvement on the cloud.
Uwe Friedrichsen – Extreme availability and self-healing data with CRDTs - No... by NoSQLmatters
Uwe Friedrichsen – Extreme availability and self-healing data with CRDTs
There are several challenges in the NoSQL world. Especially if you have very high availability requirements you have to accept temporal inconsistencies which you need to resolve explicitly. This is usually a tough job which requires implementing case by case business logic or even bothering the users to decide about the correct state of your data. Wouldn't it be great if we could solve this conflict resolution and data reconciliation process in a generic way at a pure technical level? That's exactly what CRDTs (Conflict-free Replicated Data Types) are about. CRDTs are data structures that are guaranteed to converge to a desired state while enabling extreme availability of the datastore. In this session you will learn what CRDTs are, how to design them, what you can do with them, what their limitations and tradeoffs are – of course garnished with lots of tips and tricks. Get ready to push the availability of your datastore to the max!
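A hedged, minimal sketch of one of the simplest CRDTs, a grow-only counter (G-Counter): each replica increments only its own slot and merging takes the element-wise maximum, so concurrent updates converge regardless of message ordering. The replica identifiers and increments are invented for illustration.

```python
# G-Counter: a grow-only counter CRDT. Each replica increments only its own slot;
# merge takes the element-wise maximum, so concurrent updates always converge.
class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}                       # replica_id -> local count

    def increment(self, amount=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

a, b = GCounter("replica-a"), GCounter("replica-b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
print(a.value(), b.value())   # both replicas converge to 5
```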
TensorFlow & TensorFrames with Apache Spark, presented by Marco Saviano. It discusses numerical computing with Apache Spark and Google TensorFlow. TensorFrames allows manipulating Spark DataFrames with TensorFlow programs. It provides most operations in row-based and block-based versions. Row-based processing handles rows individually, while block-based processing handles blocks of rows together for better efficiency. Reduction operations coalesce rows until one row remains. Future work may improve communication between Spark and TensorFlow through direct memory copying and using columnar storage formats.
The document discusses various performance measures for parallel computing including speedup, efficiency, Amdahl's law, and Gustafson's law. Speedup is defined as the ratio of sequential to parallel execution time. Efficiency is defined as speedup divided by the number of processors. Amdahl's law provides an upper bound on speedup based on the fraction of sequential operations, while Gustafson's law estimates speedup based on the fraction of time spent in serial code for a fixed problem size on varying processors. Other topics covered include performance bottlenecks, data races, data race avoidance techniques, and deadlock avoidance using virtual channels.
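To complement the Amdahl's law example earlier, here is a hedged sketch of Gustafson's law as described above, using the scaled-speedup form S(N) = N - alpha(N - 1), where alpha is the serial fraction; the 10% value is an assumed example.

```python
# Gustafson's law: scaled speedup S(N) = N - alpha * (N - 1),
# where alpha is the fraction of time spent in serial code.
def gustafson_speedup(alpha, processors):
    return processors - alpha * (processors - 1)

for n in (2, 4, 16, 256):
    print(n, gustafson_speedup(0.10, n))   # grows nearly linearly with N
```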
This document provides an introduction and overview of key concepts in software development and data structures. It discusses the software development process, performance analysis using Big O notation, abstract data types, and introduces common data structures. Some key topics covered include specification and design of problems, implementation principles, testing and debugging, complexity analysis, preconditions and postconditions, and object-oriented programming as it relates to data structures.
C++ is one of the most widely used programming languages. Here are the complete presentation (PPT) notes for the C++ programming language; I hope they will be helpful to you.
This document discusses challenges faced in implementing Presto, an open source distributed SQL query engine, for targeted audience delivery at TiVo. It describes choosing appropriate instance types for Presto worker nodes based on memory needs. It also addresses scaling the Presto cluster elastically to handle query concurrency and maturity issues with the Presto software. The document provides insights on testing Presto using Docker containers and connecting to mocked tables.
This document provides an overview and introduction to the concepts taught in a data structures and algorithms course. It discusses the goals of reinforcing that every data structure has costs and benefits, learning commonly used data structures, and understanding how to analyze the efficiency of algorithms. Key topics covered include abstract data types, common data structures, algorithm analysis techniques like best/worst/average cases and asymptotic notation, and examples of analyzing the time complexity of various algorithms. The document emphasizes that problems can have multiple potential algorithms and that problems should be carefully defined in terms of inputs, outputs, and resource constraints.
This document provides an overview of the Python programming language. It begins with an introduction to running Python code and output. It then covers Python's basic data types like integers, floats, strings, lists, tuples and dictionaries. The document explains input and file I/O in Python as well as common control structures like if/else statements, while loops and for loops. It also discusses functions as first-class objects in Python that can be defined and passed as parameters. The document provides examples of higher-order functions like map, filter and reduce. Finally, it notes that functions can be defined inside other functions in Python.
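A short, hedged example of the higher-order functions named above (map, filter, reduce) and of defining a function inside another function; it uses only the standard library and invented sample data.

```python
# Higher-order functions: map, filter, and reduce, plus a nested function definition.
from functools import reduce

numbers = [1, 2, 3, 4, 5, 6]

squares = list(map(lambda x: x * x, numbers))          # [1, 4, 9, 16, 25, 36]
evens   = list(filter(lambda x: x % 2 == 0, numbers))  # [2, 4, 6]
total   = reduce(lambda a, b: a + b, numbers)          # 21

def make_adder(n):
    def add(x):               # functions can be defined inside other functions
        return x + n
    return add

add_ten = make_adder(10)
print(squares, evens, total, add_ten(5))
```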
This document provides an introduction to the Python programming language. It discusses Python versions and distributions, development environments, the Python interactive shell, basic Python data types like lists and strings, and gives examples of working with these data types through indexing, slicing and built-in methods. It also briefly introduces dictionaries and tuples. The document aims to provide newcomers to Python with essential information on the language and getting started.
This document provides an overview of interaction design beyond human-computer interaction. It discusses novel forms of interactive products that are embedded with computational power, such as refrigerators that provide recipes based on stored food. It also discusses augmented reality technologies that combine virtual and physical worlds. Examples of direct manipulation virtual environments and interactive virtual worlds are provided. The document discusses how representations can be dynalinked so that changes in one are reflected in another. It provides examples of aesthetically pleasing interactive products and virtual characters. Overall, the document gives a broad introduction to emerging areas of interaction design beyond traditional human-computer interaction, using examples of novel interactive products and technologies.
The document discusses an introduction to formal methods course. It provides biographical information about the instructor, Dr. Naveed Riaz, and outlines some of the key topics to be covered in the course, including formal languages, modeling systems using logic and set theory, and solving logic puzzles using propositional calculus and truth tables. It also gives an example of solving a logic problem about a person on an island of knights and knaves who says "If I am a knight then I will eat my hat".
This document discusses distance measures and scales of measurement that are important for k-nearest neighbor classification. It covers two major classes of distance measures - Euclidean and non-Euclidean. It also describes three major scales of measurement for data - nominal, ordinal, and interval scales. It provides examples of different distance functions like Euclidean, Manhattan, cosine, and edit distances. It discusses how the choice of distance measure depends on the type of data and its scale of measurement.
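A hedged Python sketch of three of the distance measures mentioned above (Euclidean, Manhattan, and cosine distance); the sample vectors are invented, and edit distance is omitted for brevity.

```python
# Euclidean, Manhattan, and cosine distances between two numeric vectors.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

u, v = [1.0, 2.0, 3.0], [2.0, 0.0, 4.0]
print(euclidean(u, v), manhattan(u, v), round(cosine_distance(u, v), 4))
```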
Frequent itemset mining using the pattern growth method by Shani729
The document discusses the FP-growth algorithm for mining frequent patterns without candidate generation. It begins with an overview of the performance bottlenecks of the Apriori algorithm and introduces the FP-growth approach. The key steps of FP-growth include compressing the transaction database into a frequent-pattern tree (FP-tree) structure, and then mining the FP-tree to find all frequent patterns. The mining process recursively constructs conditional FP-trees to decompose the problem into smaller sub-problems without candidate generation. Examples are provided to illustrate the FP-tree construction and pattern mining.
The document discusses data mining techniques for association rule mining. It defines association rules as relationships of the form A implies B, where A and B are itemsets in transactional data. The key steps are finding frequent itemsets that meet a minimum support threshold, and generating association rules from those itemsets that meet a minimum confidence threshold. The Apriori algorithm is described as the most common approach for efficiently finding all frequent itemsets in an iterative way by pruning itemsets with subsets that are not frequent.
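A hedged sketch of the support and confidence computation that underlies association rule generation; the toy transactions are invented, and the snippet counts itemsets directly rather than implementing the full Apriori candidate pruning.

```python
# Support and confidence for an association rule A -> B over toy transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Estimated P(consequent | antecedent) from the transaction data.
    return support(antecedent | consequent) / support(antecedent)

A, B = {"bread"}, {"milk"}
print("support of A and B together:", support(A | B))   # 0.6
print("confidence of A -> B:", confidence(A, B))         # 0.75
```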
The document describes a 4-step process for dimensional modeling: 1) choose a business process, 2) choose the grain, 3) choose facts, and 4) choose dimensions. It discusses important dimensional modeling concepts like grain, facts, dimensions, and hierarchies. It also covers considerations for choosing an appropriate grain like granularity, expressiveness, and hardware trade-offs. Aggregation can hide crucial facts but works for repetitive queries if the grain is justified. Dimensions should be single-valued attributes during transactions but can sometimes be multi-valued, requiring handling of problems. Dimensions must be applicable to the chosen grain.
This document discusses the need for dimensional modeling (DM) as a way to simplify complex entity-relationship (ER) data models optimized for online transaction processing (OLTP). ER modeling results in many normalized tables that are difficult for users to understand and query across. DM addresses this by collapsing dimensions into single tables and representing all data in a star schema with a central fact table linked to dimensional tables. This star schema structure is simpler for users to understand and allows for faster querying of data.
This document discusses five principal techniques for de-normalizing data in a data warehouse: collapsing tables, handling many-to-many relationships, splitting tables horizontally and vertically, pre-joining tables, and adding redundant columns. It provides examples and discusses trade-offs for each technique in terms of storage, performance, ease of use, and maintenance. De-normalization can improve query performance but also increases data storage requirements and maintenance effort. Careful consideration of each case is needed to determine if de-normalization provides overall benefits.
The document discusses why organizations implement data warehouses. It provides several key reasons:
1) Data recording and storage is growing exponentially in almost every industry, creating huge amounts of operational data. A data warehouse provides a platform for consolidated historical data analysis to support strategic decision making.
2) Knowledge workers want to analyze available data to extract useful information and insights to support decision making. This requires intelligent decision support capabilities.
3) In today's competitive environment, a data warehouse is a valuable marketing and business intelligence tool that can help organizations better understand customer needs and behavior to improve customer retention.
This document outlines the approach and summary of a course on data warehousing and mining. The course aims to develop an understanding of relational database concepts and how they apply and break down in very large databases and data warehousing. It will cover topics like online analytical processing, dimensional modeling, extract-transform-load, data quality management, and data mining concepts. The course also provides an overview of important reference books on data warehousing.
This document provides an introduction to databases, covering what a database is, what online transaction processing (OLTP) systems are and their strengths and weaknesses, an introduction to different keys used in databases, an overview of the entity-relationship (ER) model, and database normalization.
The document discusses the process of dimensional modeling from ER diagrams to dimensional models. It outlines a 4 step method: 1) Choose the business process, 2) Choose the grain, 3) Choose the facts, and 4) Choose the dimensions. It describes each step in detail, including choosing an appropriate grain, identifying additive vs. non-additive facts, handling slowly changing dimensions, and the tradeoffs of different approaches to modeling dimensions.
This document discusses data warehousing and online analytical processing (OLAP). It begins by explaining the relationship between data warehouses and OLAP, noting that OLAP supports analysis using data stored in a data warehouse. It then discusses different implementations of OLAP, including MOLAP (multidimensional OLAP using pre-aggregated data cubes), ROLAP (relational OLAP using relational databases), and HOLAP (hybrid OLAP combining aspects of MOLAP and ROLAP). It also addresses challenges in implementing OLAP and balancing performance and storage needs.
This document discusses data quality issues related to missing and noisy data. It introduces the concept of data imputation, which is a technique used to replace missing values in a dataset. Common imputation methods discussed include replacing missing numeric values with the mean, median, or mode of existing values for that attribute. The document also discusses different types of missing data (MCAR, MAR, NMAR) and techniques for handling missing data such as discarding records or attributes with many missing values.
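A hedged sketch of the mean, median, and mode imputation methods described above, applied to a small numeric attribute with missing values; the data is invented and only the standard library is used.

```python
# Mean, median, and mode imputation for missing values in a numeric attribute.
import statistics

raw = [4.0, None, 7.0, 7.0, None, 10.0, 3.0]          # None marks a missing value
observed = [x for x in raw if x is not None]

fills = {
    "mean":   statistics.mean(observed),
    "median": statistics.median(observed),
    "mode":   statistics.mode(observed),
}

for method, value in fills.items():
    imputed = [x if x is not None else value for x in raw]
    print(method, value, imputed)
```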
The document discusses the need to create a centralized data warehouse (DWH) to store student and course registration data from multiple campuses of a university. It describes issues with the current ad hoc data storage approaches. Sample data is presented from Lahore, Karachi, and Islamabad campuses stored in different formats like text files, Excel, and Access. The data shows inconsistencies in attributes and structures across campuses. It also presents data from Peshawar campus stored in text files with some missing attributes. The problem is to normalize and integrate this heterogeneous multi-source data into a single DWH for analysis.
This document provides an introduction to Data Transformation Services (DTS) in Microsoft SQL Server. It discusses key DTS concepts like packages, tasks, transformations, and connections. DTS allows extracting, transforming, and consolidating data from disparate sources into a centralized location. The document demonstrates how to create, edit, and execute DTS packages using the Import/Export wizard, DTS Designer tool, and programming interfaces. It also covers tasks, transformations, connections, workflows, scheduling, versioning and other features of DTS packages.
This document describes a case study of building an agricultural data warehouse. It discusses the steps taken which include data acquisition, cleansing issues, transforming the data, deploying the data warehouse and using it for various purposes like data validation, generating reports, analyzing spray dates and sowing dates. It finds that ETL is a major challenge due to issues with data collection from farmers. The data warehouse helps enable decision making down to the extension worker level.
This document discusses the development of an agricultural data warehouse (Agri-DW) to store and analyze agricultural data from the Multan region of Pakistan. It describes the key players and pests/predators in the agricultural sector. It also discusses the need for using information technology and data warehousing in agriculture to help monitor factors like pest populations and economic threshold levels. Finally, it outlines a 12-step approach for developing the Agri-DW, including determining user needs, data modeling, and constructing a metadata repository.
Levelised Cost of Hydrogen (LCOH) Calculator Manual by Massimo Talia
The aim of this manual is to explain the methodology behind the Levelized Cost of Hydrogen (LCOH) calculator. Moreover, this manual also demonstrates how the calculator can be used for estimating the expenses associated with hydrogen production in Europe using low-temperature electrolysis, considering different sources of electricity.
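As a hedged illustration only (a generic levelized-cost formula, not necessarily the exact methodology of this particular calculator), LCOH is commonly expressed as the ratio of discounted lifetime costs to discounted lifetime hydrogen output:

```latex
\[
\mathrm{LCOH} \;=\;
\frac{\displaystyle\sum_{t=0}^{N} \frac{\mathrm{CAPEX}_t + \mathrm{OPEX}_t + C_{\mathrm{elec},t}}{(1+r)^t}}
     {\displaystyle\sum_{t=0}^{N} \frac{M_{\mathrm{H_2},t}}{(1+r)^t}}
\]
% CAPEX_t and OPEX_t: capital and operating costs in year t (assumed cost categories)
% C_{elec,t}: electricity cost in year t; M_{H2,t}: hydrogen produced in year t
% r: discount rate; N: project lifetime in years
```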
Build the Next Generation of Apps with the Einstein 1 Platform.
Join Philippe Ozil for a workshop session that will guide you through the details of the Einstein 1 platform, the importance of data for building artificial intelligence applications, and the different tools and technologies Salesforce offers to bring you all the benefits of AI.
Null Bangalore | Pentester's Approach to AWS IAM by Divyanshu
# Abstract:
- Learn more about the real-world methods for auditing AWS IAM (Identity and Access Management) as a pentester. So let us proceed with a brief discussion of IAM as well as some typical misconfigurations and their potential exploits in order to reinforce the understanding of IAM security best practices.
- Gain actionable insights into AWS IAM policies and roles, using a hands-on approach.
# Prerequisites:
- Basic understanding of AWS services and architecture
- Familiarity with cloud security concepts
- Experience using the AWS Management Console or AWS CLI.
- For hands on lab create account on [killercoda.com](https://killercoda.com/cloudsecurity-scenario/)
# Scenario Covered:
- Basics of IAM in AWS
- Implementing IAM Policies with Least Privilege to Manage S3 Bucket
- Objective: Create an S3 bucket with least privilege IAM policy and validate access.
- Steps:
- Create S3 bucket.
- Attach least privilege policy to IAM user.
- Validate access.
- Exploiting IAM PassRole Misconfiguration
-Allows a user to pass a specific IAM role to an AWS service (ec2), typically used for service access delegation. Then exploit PassRole Misconfiguration granting unauthorized access to sensitive resources.
- Objective: Demonstrate how a PassRole misconfiguration can grant unauthorized access.
- Steps:
- Allow user to pass IAM role to EC2.
- Exploit misconfiguration for unauthorized access.
- Access sensitive resources.
- Exploiting IAM AssumeRole Misconfiguration with Overly Permissive Role
- An overly permissive IAM role configuration can lead to privilege escalation by creating a role with administrative privileges and allow a user to assume this role.
- Objective: Show how overly permissive IAM roles can lead to privilege escalation.
- Steps:
- Create role with administrative privileges.
- Allow user to assume the role.
- Perform administrative actions.
- Differentiation between PassRole vs AssumeRole
Try at [killercoda.com](https://killercoda.com/cloudsecurity-scenario/)
Determination of Equivalent Circuit parameters and performance characteristic...pvpriya2
Includes the testing of induction motor to draw the circle diagram of induction motor with step wise procedure and calculation for the same. Also explains the working and application of Induction generator
Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...Transcat
Join us for this solutions-based webinar on the tools and techniques for commissioning and maintaining PV Systems. In this session, we'll review the process of building and maintaining a solar array, starting with installation and commissioning, then reviewing operations and maintenance of the system. This course will review insulation resistance testing, I-V curve testing, earth-bond continuity, ground resistance testing, performance tests, visual inspections, ground and arc fault testing procedures, and power quality analysis.
Fluke Solar Application Specialist Will White is presenting on this engaging topic:
Will has worked in the renewable energy industry since 2005, first as an installer for a small east coast solar integrator before adding sales, design, and project management to his skillset. In 2022, Will joined Fluke as a solar application specialist, where he supports their renewable energy testing equipment like IV-curve tracers, electrical meters, and thermal imaging cameras. Experienced in wind power, solar thermal, energy storage, and all scales of PV, Will has primarily focused on residential and small commercial systems. He is passionate about implementing high-quality, code-compliant installation techniques.
Height and depth gauge linear metrology.pdfq30122000
Height gauges may also be used to measure the height of an object by using the underside of the scriber as the datum. The datum may be permanently fixed or the height gauge may have provision to adjust the scale, this is done by sliding the scale vertically along the body of the height gauge by turning a fine feed screw at the top of the gauge; then with the scriber set to the same level as the base, the scale can be matched to it. This adjustment allows different scribers or probes to be used, as well as adjusting for any errors in a damaged or resharpened probe.
We have designed & manufacture the Lubi Valves LBF series type of Butterfly Valves for General Utility Water applications as well as for HVAC applications.
Applications of artificial Intelligence in Mechanical Engineering.pdfAtif Razi
Historically, mechanical engineering has relied heavily on human expertise and empirical methods to solve complex problems. With the introduction of computer-aided design (CAD) and finite element analysis (FEA), the field took its first steps towards digitization. These tools allowed engineers to simulate and analyze mechanical systems with greater accuracy and efficiency. However, the sheer volume of data generated by modern engineering systems and the increasing complexity of these systems have necessitated more advanced analytical tools, paving the way for AI.
AI offers the capability to process vast amounts of data, identify patterns, and make predictions with a level of speed and accuracy unattainable by traditional methods. This has profound implications for mechanical engineering, enabling more efficient design processes, predictive maintenance strategies, and optimized manufacturing operations. AI-driven tools can learn from historical data, adapt to new information, and continuously improve their performance, making them invaluable in tackling the multifaceted challenges of modern mechanical engineering.
Applications of artificial Intelligence in Mechanical Engineering.pdf
Lecture 25
1. Data Warehousing
Data Warehousing
Lecture-25
Need for Speed: Parallelism Methodologies
Virtual University of Pakistan
Ahsan Abdullah
Assoc. Prof. & Head
Center for Agro-Informatics Research
www.nu.edu.pk/cairindex.asp
National University of Computers & Emerging Sciences, Islamabad
Email: ahsan1010@yahoo.com
2. Data Warehousing
Motivation
There would be no need for parallelism if we had the perfect computer:
- a single, infinitely fast processor
- with infinite memory and infinite memory bandwidth
- and infinitely cheap too (free!)
Technology is not delivering (the going-to-the-Moon analogy).
The challenge is to build:
- an infinitely fast processor out of infinitely many processors of finite speed
- an infinitely large memory with infinite memory bandwidth out of infinitely many finite storage units of finite speed
3. Data Warehousing
Data Parallelism: Concept
Parallel execution of a single data manipulation task across multiple partitions of data.
Partitions may be static or dynamic.
Tasks are executed almost independently across partitions.
A "query coordinator" must coordinate between the independently executing processes.
4. Data Warehousing
Data Parallelism: Example
[Figure: the Emp table is split into Partition-1 ... Partition-k; Query Server-1 ... Query Server-k each scan one partition and return a partial count (62, 440, ..., 1,123) to the Query Coordinator.]
Query:
Select count(*)
from Emp
where age > 50
AND sal > 10,000;
Ans = 62 + 440 + ... + 1,123 = 99,000
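To make the coordinator/server split concrete, here is a minimal Python sketch (not part of the original lecture; partition contents and field names are illustrative) in which each "query server" counts its own in-memory partition and the "query coordinator" sums the partial counts:

```python
from multiprocessing import Pool

def count_partition(partition):
    # Each "query server" applies the same filter to its own partition.
    return sum(1 for emp in partition if emp["age"] > 50 and emp["sal"] > 10_000)

def parallel_count(partitions):
    # The "query coordinator" fans the work out and adds up the partial counts.
    with Pool(processes=len(partitions)) as pool:
        partial_counts = pool.map(count_partition, partitions)
    return sum(partial_counts)

if __name__ == "__main__":
    # Tiny fabricated partitions, just to exercise the flow.
    partitions = [
        [{"age": 55, "sal": 12_000}, {"age": 40, "sal": 9_000}],
        [{"age": 60, "sal": 20_000}],
    ]
    print(parallel_count(partitions))  # -> 2
```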
5. Data Warehousing
Data Parallelism: Ensuring Speed-Up
To get a speed-up of N with N partitions, it must be ensured that:
- There are enough computing resources.
- The query coordinator is very fast compared to the query servers.
- The work done in each partition is almost the same, to avoid performance bottlenecks.
- Having the same number of records in each partition would not suffice; records must be uniformly distributed with respect to the filter criterion across partitions.
6. Data Warehousing
Temporal Parallelism (Pipelining)
Involves taking a complex task and breaking it down into independent subtasks for parallel execution on a stream of data inputs.
[Figure: a task with execution time T is split into a three-stage pipeline, each stage taking T/3, with data items streaming through the stages.]
7. Data Warehousing
Pipelining: Time Chart
[Figure: time chart of the three-stage pipeline at T = 0, 1, 2, 3; each stage takes T/3, and once the pipeline is full a new item completes at every stage interval.]
8. Data Warehousing
Pipelining: Speed-Up Calculation
Time for sequential execution of 1 task = T
Time for sequential execution of N tasks = N × T
(Ideal) time for pipelined execution of one task using an M-stage pipeline = T
(Ideal) time for pipelined execution of N tasks using an M-stage pipeline = T + ((N-1) × (T/M))
Speed-up (S) = (N × T) / (T + ((N-1) × (T/M))) = (N × M) / (M + N - 1)
Pipeline parallelism focuses on increasing the throughput of task execution, NOT on decreasing sub-task execution time.
9. Data Warehousing
Pipelining: Speed-Up Example
Example: Bottling soft drinks in a factory
10 CRATE LOADS OF BOTTLES
- Sequential execution = 10 × T
- Fill bottle, Seal bottle, Label bottle pipeline = T + T × (10-1)/3 = 4 × T
- Speed-up = 2.50
20 CRATE LOADS OF BOTTLES
- Sequential execution = 20 × T
- Fill bottle, Seal bottle, Label bottle pipeline = T + T × (20-1)/3 = 7.3 × T
- Speed-up = 2.72
40 CRATE LOADS OF BOTTLES
- Sequential execution = 40 × T
- Fill bottle, Seal bottle, Label bottle pipeline = T + T × (40-1)/3 = 14.0 × T
- Speed-up = 2.85
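As a quick sanity check of the speed-up formula above (a throwaway sketch, not part of the lecture), the same numbers fall out of a few lines of Python, and pushing N higher shows the speed-up creeping toward the 3-stage limit:

```python
def pipeline_speedup(n_tasks, n_stages):
    # S = (N * T) / (T + (N - 1) * T / M), taking T = 1
    sequential = n_tasks
    pipelined = 1 + (n_tasks - 1) / n_stages
    return sequential / pipelined

for n in (10, 20, 40, 1000):
    print(n, round(pipeline_speedup(n, 3), 2))
# -> 2.5, 2.73, 2.86, 2.99
# (matches the crate examples up to rounding and approaches the
#  asymptotic limit of 3 for a 3-stage pipeline)
```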
10. Data Warehousing
Pipelining: Input vs Speed-Up
[Figure: plot of speed-up S against input size N for the 3-stage pipeline; the curve rises from 1 and flattens out below 3 as N grows.]
Asymptotic limit on speed-up for an M-stage pipeline is M.
The speed-up will NEVER be M, as initially filling the pipeline took T time units.
11. Data Warehousing
Pipelining: Limitations
Relational pipelines are rarely very long; even a chain of length ten is unusual.
Some relational operators do not produce their first output until they have consumed all of their inputs.
- Aggregate and sort operators have this property; one cannot pipeline these operators.
Often, the execution cost of one operator is much greater than that of the others, hence skew.
- e.g. Sum() or Count() vs Group-By() or Join.
12. Data Warehousing
Partitioning & Queries
Let's evaluate how well different partitioning techniques support the following types of data access:
Full Table Scan: Scanning the entire relation.
Point Queries: Locating a tuple, e.g. where r.A = 313.
Range Queries: Locating all tuples such that the value of a given attribute lies within a specified range, e.g. where 313 ≤ r.A < 786.
13. Data Warehousing
Partitioning & Queries: Round Robin
Advantages
- Best suited for a sequential scan of the entire relation on each query.
- All disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.
Disadvantages
- Range queries are difficult to process.
- No clustering -- tuples are scattered across all disks.
14. Data Warehousing
Partitioning & Queries: Hash Partitioning
Good for sequential access
- With uniform hashing, and using the partitioning attributes as a key, tuples will be equally distributed between disks.
Good for point queries on the partitioning attribute
- Can look up a single disk, leaving the others available for answering other queries.
- An index on the partitioning attribute can be local to the disk, making lookups and updates very efficient, even for joins.
Range queries are difficult to process
- No clustering -- tuples are scattered across all disks.
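As a small illustration of the point-query benefit (a sketch with made-up names, using Python's built-in hash() in place of a real partitioning hash function): the hash of the partitioning attribute identifies the single partition that can hold the tuple, so only that partition is probed.

```python
NUM_PARTITIONS = 4

def partition_of(key):
    # hash() stands in for the DBMS's partitioning hash function.
    return hash(key) % NUM_PARTITIONS

# Route rows to partitions at load time...
partitions = [dict() for _ in range(NUM_PARTITIONS)]
for emp_id, name in [(313, "Aslam"), (786, "Bashir"), (42, "Chandni")]:
    partitions[partition_of(emp_id)][emp_id] = name

# ...and probe exactly one partition for a point query such as r.A = 313,
# leaving the remaining partitions free to serve other queries.
print(partitions[partition_of(313)].get(313))  # -> Aslam
```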
15. Data Warehousing
Partitioning & Queries: Range Partitioning
Provides data clustering by partitioning attribute value.
Good for sequential access.
Good for point queries on the partitioning attribute: only one disk needs to be accessed.
For range queries on the partitioning attribute, one or a few disks may need to be accessed:
- Remaining disks are available for other queries.
- Good if the result tuples come from one to a few blocks.
- If many blocks are to be fetched, they are still fetched from one to a few disks, and the potential parallelism in disk access is wasted.
16. Data Warehousing
Parallel Sorting
Scan in parallel, and range partition on the go.
As partitioned data becomes available, perform "local" sorting.
The resulting data is sorted and again range partitioned.
Problem: skew or "hot spot".
Solution: Sample the data at the start to determine the partition points.
[Figure: data range-partitioned across processors P1-P5; an uneven split (1, 4, 1, 2, 1) creates a hot spot on one processor.]
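A minimal sketch of the idea (illustrative only; a real system works on disk pages across many processors): sample the data to pick range-partition points, range-partition the rows, then sort each partition "locally" in parallel and concatenate.

```python
import random
from multiprocessing import Pool

def choose_partition_points(data, num_partitions, sample_size=1000):
    # Sampling up front guards against skew / hot spots in the partitions.
    sample = sorted(random.sample(data, min(sample_size, len(data))))
    step = len(sample) // num_partitions
    return [sample[i * step] for i in range(1, num_partitions)]

def range_partition(data, points):
    partitions = [[] for _ in range(len(points) + 1)]
    for value in data:
        idx = sum(value >= p for p in points)  # index of the target range
        partitions[idx].append(value)
    return partitions

if __name__ == "__main__":
    data = [random.randint(0, 10_000) for _ in range(50_000)]
    points = choose_partition_points(data, num_partitions=4)
    parts = range_partition(data, points)
    with Pool(processes=4) as pool:
        sorted_parts = pool.map(sorted, parts)  # "local" sorts in parallel
    globally_sorted = [v for part in sorted_parts for v in part]
    assert globally_sorted == sorted(data)
```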
17. Data Warehousing
Skew in Partitioning
The distribution of tuples to disks may be skewed, i.e. some disks have many tuples, while others have fewer.
Types of skew:
Attribute-value skew
- Some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition.
- Can occur with range-partitioning and hash-partitioning.
Partition skew
- With range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others.
- Less likely with hash-partitioning if a good hash function is chosen.
18. Data Warehousing
Handling Skew in Range-Partitioning
To create a balanced partitioning vector:
- Sort the relation on the partitioning attribute.
- Construct the partition vector by scanning the relation in sorted order as follows: after every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector (n denotes the number of partitions to be constructed).
- Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
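The same procedure in a few lines of Python (a sketch over an in-memory list; a real DBMS scans sorted runs on disk):

```python
def build_partition_vector(values, n):
    """Return n-1 split points so each range holds roughly 1/n of the tuples."""
    ordered = sorted(values)               # sort on the partitioning attribute
    chunk = len(ordered) // n              # ~1/n of the relation per partition
    # After every 1/n-th of the relation, record the next tuple's value.
    return [ordered[i * chunk] for i in range(1, n)]

values = [5, 7, 7, 7, 9, 12, 15, 15, 20, 31, 40, 55]
print(build_partition_vector(values, n=4))  # -> [7, 15, 31]
# Note: duplicates (like the three 7s) can still yield duplicate or
# imbalanced split points -- exactly the caveat on the slide.
```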
19. Data Warehousing
Barriers to Linear Speedup & Scale-up
Amdahl's Law
Startup
- Time needed to start a large number of processors.
- Increases with an increase in the number of individual processors.
- May also include time spent opening files, etc.
Interference
- The slowdown that each processor imposes on all the others when sharing a common pool of resources (e.g. memory).
Skew
- Variance dominating the mean.
- The service time of the job is the service time of its slowest component.
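For reference, Amdahl's Law named above bounds the speed-up by the sequential fraction of the work; the snippet below uses the standard textbook formula, which is not spelled out on this slide:

```python
def amdahl_speedup(p, n):
    # S(N) = 1 / ((1 - p) + p / N), where p is the parallelizable fraction.
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% of the work parallelized, 100 processors give only ~17x.
for n in (10, 100, 1000):
    print(n, round(amdahl_speedup(0.95, n), 1))  # -> 6.9, 16.8, 19.6
```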
20. Data Warehousing
Comparison of Partitioning Techniques
Shared disk/memory is less sensitive to partitioning; shared nothing can benefit from good partitioning.
Range: good for equijoins, range queries, and group-by clauses; can result in "hot spots".
Round Robin: good for load balancing, but impervious to the nature of the queries.
Hash: good for equijoins; can result in uneven data distribution.
[Figure: users issuing queries against partitions A…E, F…J, K…N, O…S, T…Z under each scheme.]
21. Data Warehousing
Parallel Aggregates
For each aggregate function, we need a decomposition into sub-aggregates that can be combined:
- Count(S) = count(s1) + count(s2) + ...
- Avg(S) = (sum(s1) + sum(s2) + ...) / (count(s1) + count(s2) + ...)
  (the partial averages themselves cannot simply be added; partial sums and counts must be combined)
For groups:
- Distribute the data using hashing.
- Sub-aggregate groups close to the source.
- Pass each sub-aggregate to its group's site.
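A minimal sketch of that decomposition (illustrative; partition contents are made up): each partition returns a partial (sum, count), and the coordinator combines them, so Count and Avg come out exact rather than as an average of averages.

```python
from multiprocessing import Pool

def sub_aggregate(partition):
    # Computed "close to the source", on one partition of the data.
    return sum(partition), len(partition)

if __name__ == "__main__":
    partitions = [[10, 20, 30], [40], [50, 60]]
    with Pool(processes=len(partitions)) as pool:
        partials = pool.map(sub_aggregate, partitions)
    total_sum = sum(s for s, _ in partials)
    total_count = sum(c for _, c in partials)
    print("Count =", total_count)              # 6
    print("Avg   =", total_sum / total_count)  # 35.0
```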
22. Data Warehousing
When to use which partitioning technique?
- When to use Range Partitioning?
- When to use Hash Partitioning?
- When to use List Partitioning?
- When to use Round-Robin Partitioning?
23. Data Warehousing
Parallelism Goals and Metrics
Speedup: The Good, The Bad & The Ugly
- Speedup = OldTime / NewTime
- [Figure: the ideal speed-up curve is linear in the number of processors & discs.]
Scale-up:
- Transactional scale-up: fit for OLTP systems.
- Batch scale-up: fit for data warehouses and OLAP.
[Figure: two bad speed-up curves against processors & discs -- one non-linear with minimal parallelism benefit, and one degraded by the 3 factors: startup, interference, and skew.]