Generate-Test-and-Aggregate is a class of algorithms from which efficient MapReduce programs can be derived automatically.
MapReduce is a useful and popular programming model for large-scale parallel processing. However, for many complex problems it is not easy to develop efficient parallel algorithms that fit the MapReduce paradigm well.
The generator-based parallelization approach was developed to simplify parallel programming through an automatic generation and optimization mechanism. Efficient parallel algorithms are derived from users' naive but correct programs by generators that exploit optimization theorems from the field of skeletal parallel programming. The resulting algorithms are in a form well suited to implementation with MapReduce.
With this approach, a large class of generate-and-test computations can be programmed and executed efficiently over MapReduce, so a novel programming interface and framework can be built on top of MapReduce that addresses the difficulties of both programmability and efficiency. This paper introduces such a framework: users need only concentrate on writing naive but correct programs, and we show that many generate-and-test computations can be implemented easily and efficiently with it over MapReduce.
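As a toy illustration of the naive generate-and-test style of specification (the problem and all names below are invented for illustration, not taken from the framework described above), the following sketch generates all candidates, tests them against a predicate, and aggregates the survivors:

```python
from functools import reduce

# Hypothetical toy problem: the heaviest ascending subsequence of a list.
# A naive generate-and-test spec is exponential; the framework's point is
# that generators can transform such a spec into an efficient MapReduce form.

def sublists(xs):
    """Generate every subsequence of xs (exponentially many)."""
    result = [[]]
    for x in xs:
        result += [s + [x] for s in result]
    return result

def is_ascending(s):
    return all(a < b for a, b in zip(s, s[1:]))

def solve(xs):
    candidates = sublists(xs)                                  # generate
    good = [s for s in candidates if is_ascending(s)]          # test
    return reduce(lambda a, b: b if sum(b) > sum(a) else a,
                  good, [])                                    # aggregate

print(solve([3, 1, 4, 1, 5]))  # heaviest ascending subsequence
```

The naive spec is clearly correct but exponential; the framework's generators are what would turn such a specification into an efficient parallel program.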
The problem considered is that of finding frequent subpaths in a database of paths over a fixed undirected graph. This problem arises in applications such as predicting congestion in computer and vehicular traffic networks. An algorithm called AFS is developed, based on the classic frequent-itemset mining algorithm Apriori, but with efficiency improved from exponential in the transaction size to quadratic by exploiting the underlying graph structure. This improvement makes AFS feasible for practical input path sizes. It is also proved that a natural generalization of the frequent-subpaths problem admits no solution faster than Apriori.
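The AFS algorithm itself is not reproduced here, but the structural observation it exploits can be sketched: because candidate patterns are contiguous subpaths, a path of length n contributes only O(n) candidates per level, not exponentially many subsets. A minimal level-wise counter under that assumption (data invented):

```python
from collections import Counter

def frequent_subpaths(paths, min_support):
    """Level-wise counting of contiguous subpaths of increasing length."""
    frequent, k = {}, 1
    while True:
        counts = Counter()
        for p in paths:
            # contiguous windows of length k; a set avoids double-counting
            windows = {tuple(p[i:i + k]) for i in range(len(p) - k + 1)}
            counts.update(windows)
        level = {w: c for w, c in counts.items() if c >= min_support}
        if not level:
            return frequent
        frequent.update(level)
        k += 1

db = [[1, 2, 3, 4], [2, 3, 4], [1, 2, 4]]
print(frequent_subpaths(db, 2))
```

Each level scans the database once and, per path, generates a number of windows linear in the path length, which is the source of the polynomial (rather than exponential) behaviour.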
A Comparative Study of AI Techniques for Solving Hybrid Flow Shop (HFS) Sched... - Majid_ahmed
This document summarizes a study that compares AI techniques for solving hybrid flow shop scheduling problems, specifically genetic algorithm (GA), simulated annealing (SA), and tabu search (TS). It first explains the components and concepts of each technique. Then it shows how they are applied to solve hybrid flow shop scheduling problems. Experimental results using benchmark problems show that TS generated the best results, finding acceptable solutions in 6 of 12 problem sets, while SA found solutions in 3 sets and GA in 3 sets. The best GA results used specific crossover operators. Increasing the number of inner steps in TS to generate neighborhoods also improved results.
GAN Explained Simply (What is this? Gum? It's GAN.) - Hansol Kang
The document discusses generative adversarial networks (GANs). It begins with an introduction to GANs, describing their concept and training process. It then reviews a seminal GAN paper, discussing its mathematical formulation of GAN training as a minimax game and theoretical results showing global optimality can be achieved. The document concludes by outlining the configuration, implementation, and flowchart for a GAN experiment.
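The minimax formulation reviewed in that paper is the standard GAN objective, in which the discriminator D and the generator G play a two-player game:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The global-optimality result referred to above states that this game has its optimum when the generator's distribution equals the data distribution, at which point the optimal discriminator outputs 1/2 everywhere.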
The document discusses Adaptable Constrained Genetic Programming (ACGP), which aims to automate the discovery of heuristics to guide the genetic programming search. It describes how ACGP develops first-order and second-order heuristics based on patterns observed in high-performing individuals, and uses these heuristics to bias mutation, crossover and regrowth. Experimental results on a target equation with explicit second-order structure show that ACGP with second-order heuristics outperforms both standard GP and ACGP with only first-order heuristics. The document concludes that ACGP is effective at discovering and exploiting problem structure through its adaptive heuristic approach.
This document discusses efficient solving techniques for answer set programming (ASP). It begins with an introduction to ASP, including its declarative programming paradigm based on stable model semantics. Computational tasks for ASP like model generation, optimum answer set search, and cautious reasoning are described along with their complexities. The document outlines the architecture of an ASP solver, covering input preprocessing, propagation methods, and learning heuristics. Model-guided and core-guided algorithms for optimum answer set search are also summarized.
Using Simulation to Investigate Requirements Prioritization Strategies - CS, NcState
The document summarizes a study that used simulation to investigate different requirements prioritization strategies, including plan-based (PB), agile-based (AG), and hybrid approaches. The simulation found that a hybrid strategy that prioritizes requirements based on value/cost, pruning low-value items, performed best overall. It dominated other strategies in terms of achieving the optimal frontier and balancing benefits and costs. The study concludes that combinations of strategies may work better than extreme plan-based or agile-based approaches alone.
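The value/cost idea with low-value pruning can be sketched in miniature (the data, names, and threshold below are invented for illustration; this is not the study's simulation model):

```python
# Hybrid prioritization sketch: prune low-value requirements,
# then order the rest by value-to-cost ratio.

def prioritize(requirements, min_value):
    kept = [r for r in requirements if r["value"] >= min_value]  # prune
    return sorted(kept, key=lambda r: r["value"] / r["cost"], reverse=True)

reqs = [
    {"name": "login", "value": 8, "cost": 2},
    {"name": "export", "value": 3, "cost": 3},
    {"name": "themes", "value": 1, "cost": 4},
]
order = prioritize(reqs, min_value=2)
print([r["name"] for r in order])
```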
The document discusses ggplot2, a grammar of graphics plotting package for R. It introduces key concepts of ggplot2 including the layered grammar of graphics model and its components. These components - data, aesthetic mappings, statistical transformations, geometric objects, scales, coordinates, and faceting - provide flexibility to build complex plots from data. The document provides examples using ggplot2 to visualize birth and death rate data and explore the diamonds dataset.
Scalable frequent itemset mining using heterogeneous computing par apriori a... - ijdpsjournal
Association rule mining is one of the principal tasks of data mining: it finds frequent itemsets in large volumes of data in order to produce summarized models of mined rules. These models are extended to generate association rules in applications such as e-commerce, bioinformatics, associations between image content and non-image features, and analysis of sales effectiveness in the retail industry. With rapidly growing databases, the major challenge is mining frequent itemsets in a very short time; ideally, the processing time should remain nearly constant as the data grows. Because high-performance computing offers many processors and many cores, consistent runtime performance for association rule mining over very large databases becomes achievable, so we must rely on high-performance parallel and/or distributed computing. In our literature survey, we studied sequential Apriori algorithms and identified the fundamental problems in both the sequential and parallel settings. We propose ParApriori, a parallel algorithm for GPGPUs, and analyze its results. We find that the proposed algorithm improves computing time and maintains consistent performance under increasing load. The empirical analysis also verifies the algorithm's efficiency and scalability over a series of datasets on a many-core GPU platform.
This slide was used in the "Mathematics of Logistics" seminar at Nishinari Laboratory, Faculty of Engineering, the University of Tokyo.
references:
1. Mikio Kubo (2007). The Mathematics of Logistics (ロジスティクスの数理). Kyoritsu Shuppan.
2. Dimitri P. Bertsekas (2005). Dynamic Programming and Optimal Control, Vols. 1 and 2, 4th edition. Athena Scientific.
Nowadays an enormous amount of data is generated through the Internet of Things (IoT) as technologies advance and people use them in day-to-day activities; such data is termed Big Data, with its own characteristics and challenges. Frequent itemset mining algorithms aim to discover frequent itemsets in a transactional database, but as the dataset size increases, it can no longer be handled by traditional frequent itemset mining. The MapReduce programming model handles large datasets, but its large communication cost reduces execution efficiency. This work proposes a new k-means pre-processing technique applied to the BigFIM algorithm. ClustBigFIM uses a hybrid approach: k-means clustering to generate clusters from huge datasets, then Apriori and Eclat to mine frequent itemsets from the generated clusters using the MapReduce programming model. Results show that the execution efficiency of the ClustBigFIM algorithm is increased by applying the k-means clustering algorithm before BigFIM as a pre-processing step.
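The k-means pre-clustering step can be sketched in miniature (a toy one-dimensional version with invented data; ClustBigFIM itself runs clustering over MapReduce on transactional data):

```python
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Lloyd's algorithm on 1-D points; returns sorted cluster centers."""
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)
        # keep the old center if a cluster empties out
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
print(kmeans_1d(data, 2))
```

Once the data is partitioned this way, each cluster can be mined independently, which is the communication-saving idea behind the pre-processing step.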
The document summarizes several improved algorithms that aim to address the drawbacks of the Apriori algorithm for association rule mining. It discusses six different approaches: 1) An intersection and record filter approach that counts candidate support only in transactions of sufficient length and uses set intersection; 2) An approach using set size and frequency to prune insignificant candidates; 3) An approach that reduces the candidate set and memory usage by only searching frequent itemsets once to delete candidates; 4) A partitioning approach that divides the database; 5) An approach using vertical data format to reduce database scans; and 6) A distributed approach to parallelize the algorithm across machines.
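Approach 5, the vertical data format, can be illustrated with a minimal sketch (data invented): each item maps to the set of transaction ids (its tidset) that contain it, and support counting becomes set intersection instead of a repeated database scan:

```python
# Vertical representation: item -> set of transaction ids (tidset).
transactions = {1: {"a", "b"}, 2: {"a", "c"}, 3: {"a", "b", "c"}}

def tidsets(db):
    vertical = {}
    for tid, items in db.items():
        for item in items:
            vertical.setdefault(item, set()).add(tid)
    return vertical

v = tidsets(transactions)
support_ab = len(v["a"] & v["b"])  # transactions containing both a and b
print(support_ab)
```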
This document outlines an introduction to R graphics using ggplot2 presented by the Harvard MIT Data Center. The presentation introduces key concepts in ggplot2 including geometric objects, aesthetic mappings, statistical transformations, scales, faceting, and themes. It uses examples from the built-in mtcars dataset to demonstrate how to create common plot types like scatter plots, box plots, and regression lines. The goal is for students to be able to recreate a sample graphic by the end of the workshop.
This document describes a multi-level reduced order modeling approach with robust error bounds. It discusses applying dimensionality reduction algorithms to extract active subspaces from reduced complexity models, then equipping the reduced model with an error bound. A case study applies this approach to a nuclear reactor assembly model by extracting active subspaces from individual pin cell models to build a reduced order model in a more computationally efficient way than using the full assembly model.
The document discusses different string matching algorithms:
1. The naive string matching algorithm compares characters in the text and pattern sequentially to find matches.
2. The Rabin-Karp algorithm uses hashing to quickly rule out positions where the pattern cannot occur, performing a full character comparison only when the hashes match.
3. The finite automata approach models the pattern as the states of an automaton so the text can be scanned efficiently for matches.
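A compact sketch of the Rabin-Karp idea (a standard textbook formulation, with an arbitrary small modulus chosen for illustration): compare rolling hashes first, and fall back to a character comparison only on a hash match:

```python
def rabin_karp(text, pattern, base=256, mod=101):
    """Return the starting indices of all occurrences of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)          # weight of the leading character
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    hits = []
    for i in range(n - m + 1):
        # verify with a real comparison only when hashes collide
        if p_hash == t_hash and text[i:i + m] == pattern:
            hits.append(i)
        if i < n - m:                     # roll the window one character
            t_hash = ((t_hash - ord(text[i]) * high) * base
                      + ord(text[i + m])) % mod
    return hits

print(rabin_karp("ababcabc", "abc"))
```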
A PREFIXED-ITEMSET-BASED IMPROVEMENT FOR APRIORI ALGORITHM - csandit
Association rule mining is a very important part of data mining, used to find interesting patterns in transaction databases. The Apriori algorithm is one of the most classical algorithms for association rules, but it has an efficiency bottleneck. In this article we propose a prefixed-itemset-based data structure for candidate itemset generation; with the help of this structure we improve the efficiency of the classical Apriori algorithm.
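The candidate-generation step that the prefixed-itemset structure accelerates can be sketched as follows (this is the standard Apriori prefix join with subset pruning, not the paper's data structure; data invented):

```python
from itertools import combinations

def candidate_gen(frequent_k):
    """Join sorted frequent k-itemsets sharing a (k-1)-prefix into (k+1)-candidates."""
    frequent_k = sorted(frequent_k)
    known = set(frequent_k)
    candidates = set()
    for a, b in combinations(frequent_k, 2):
        if a[:-1] == b[:-1]:                      # common (k-1)-prefix
            cand = tuple(sorted(set(a) | set(b)))
            # Apriori pruning: every k-subset must itself be frequent
            if all(tuple(s) in known for s in combinations(cand, len(a))):
                candidates.add(cand)
    return candidates

f2 = [("a", "b"), ("a", "c"), ("b", "c")]
print(candidate_gen(f2))
```

Grouping itemsets by their shared prefix, as the proposed structure does, makes this join cheap because only itemsets under the same prefix need to be paired.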
Talk on Optimization for Deep Learning, which gives an overview of gradient descent optimization algorithms and highlights some current research directions.
Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduc... - Ruairi de Frein
An article from the Telecommunications Software & Systems Group, Waterford Institute of Technology, Ireland describing algorithms for distributed Formal Concept Analysis
ABSTRACT
While many existing formal concept analysis algorithms are efficient, they are typically unsuitable for distributed implementation. Taking the MapReduce (MR) framework as our inspiration we introduce a distributed approach for performing formal concept mining. Our method has its novelty in that we use a light-weight MapReduce runtime called Twister which is better suited to iterative algorithms than recent distributed approaches. First, we describe the theoretical foundations underpinning our distributed formal concept analysis approach. Second, we provide a representative exemplar of how a classic centralized algorithm can be implemented in a distributed fashion using our methodology: we modify Ganter's classic algorithm by introducing a family of MR* algorithms, namely MRGanter and MRGanter+ where the prefix denotes the algorithm's lineage. To evaluate the factors that impact distributed algorithm performance, we compare our MR* algorithms with the state-of-the-art. Experiments conducted on real datasets demonstrate that MRGanter+ is efficient, scalable and an appealing algorithm for distributed problems.
Accepted for publication at the International Conference for Formal Concept Analysis 2012.
Project participants: Biao Xu, Ruairí de Fréin, Eric Robson, Mícheál Ó Foghlú
Ruairí de Fréin: rdefrein (at) gmail (dot) com
bibtex:
@incollection{
year={2012},
isbn={978-3-642-29891-2},
booktitle={Formal Concept Analysis},
volume={7278},
series={Lecture Notes in Computer Science},
editor={Domenach, Florent and Ignatov, Dmitry I. and Poelmans, Jonas},
doi={10.1007/978-3-642-29892-9_26},
title={Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduce Framework},
url={http://dx.doi.org/10.1007/978-3-642-29892-9_26},
publisher={Springer Berlin Heidelberg},
keywords={Formal Concept Analysis; Distributed Mining; MapReduce},
author={Xu, Biao and de Fréin, Ruairí and Robson, Eric and Ó Foghlú, Mícheál},
pages={292-308}
}
DOWNLOAD
The article on arXiv: http://arxiv.org/abs/1210.2401
Dual-time Modeling and Forecasting in Consumer Banking (2016) - Aijun Zhang
Longitudinal and survival data are naturally observed with multiple origination dates. They form a dual-time data structure with horizontal axis representing the calendar time and the vertical axis representing the lifetime. In this talk we discuss how to model dual-time data based on a decomposition strategy and how to forecast over the time horizon. Various statistical techniques are used for treating fixed and random effects.
Among other fields, we share the potential applications in quantitative risk management, and demonstrate a large-scale credit risk analysis powered by big data computing.
This document summarizes an article from the International Journal of Electronics and Communication Engineering & Technology. The article presents a new optimization method called Self Accelerated Smart Particle Swarm Optimization (SASPSO) for solving nonlinear programming problems. SASPSO updates particle positions based on the personal best and global best positions without requiring velocity equations. This reduces parameters and computational cost compared to standard PSO. The SASPSO method is tested on benchmark nonlinear problems and shows improved accuracy and number of optimal solutions compared to genetic algorithms.
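For contrast with SASPSO's velocity-free update, here is a minimal sketch of standard PSO with velocities (the parameters and test function are arbitrary choices for illustration, not taken from the article):

```python
import random

random.seed(0)  # deterministic run for illustration

def pso(f, n_particles=10, iters=50, w=0.5, c1=1.5, c2=1.5):
    """Minimize f over one dimension with standard velocity-based PSO."""
    xs = [random.uniform(-10, 10) for _ in range(n_particles)]
    vs = [0.0] * n_particles
    pbest = xs[:]                 # personal best positions
    gbest = min(xs, key=f)        # global best position
    for _ in range(iters):
        for i in range(n_particles):
            r1, r2 = random.random(), random.random()
            # inertia + cognitive pull toward pbest + social pull toward gbest
            vs[i] = (w * vs[i]
                     + c1 * r1 * (pbest[i] - xs[i])
                     + c2 * r2 * (gbest - xs[i]))
            xs[i] += vs[i]
            if f(xs[i]) < f(pbest[i]):
                pbest[i] = xs[i]
                if f(xs[i]) < f(gbest):
                    gbest = xs[i]
    return gbest

best = pso(lambda x: x * x)
print(best)
```

SASPSO, as summarized above, drops the velocity recurrence and moves particles directly from the personal-best and global-best positions, which removes the inertia and velocity parameters from this update.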
This document describes a multi-level reduced order modeling approach with robust error bounds. It discusses applying dimensionality reduction algorithms to extract active subspaces from reduced complexity models, then equipping the reduced model with an error bound. It presents a case study applying this approach to a 7x7 nuclear fuel assembly benchmark model, extracting active subspaces from individual fuel pin cell models to build a reduced order model in a more computationally efficient way.
In this work, we propose to apply trust region optimization to deep reinforcement learning using a recently proposed Kronecker-factored approximation to the curvature. We extend the framework of natural policy gradient and propose to optimize both the actor and the critic using Kronecker-factored approximate curvature (K-FAC) with trust region; hence we call our method Actor Critic using Kronecker-Factored Trust Region (ACKTR). To the best of our knowledge, this is the first scalable trust region natural gradient method for actor-critic methods. It is also a method that learns non-trivial tasks in continuous control as well as discrete control policies directly from raw pixel inputs. We tested our approach across discrete domains in Atari games as well as continuous domains in the MuJoCo environment. With the proposed methods, we are able to achieve higher rewards and a 2- to 3-fold improvement in sample efficiency on average, compared to previous state-of-the-art on-policy actor-critic methods. Code is available at https://github.com/openai/baselines.
Asynchronous parallel algorithms are developed to solve massive optimization problems in distributed data systems; they run in parallel on multiple nodes with little or no synchronization, and they have recently been implemented successfully for a range of difficult practical problems. However, existing theories mostly rest on fairly restrictive assumptions on the delays and cannot explain the convergence and speedup properties of such algorithms. This talk gives an overview of distributed optimization and discusses new theoretical results on the convergence of the asynchronous parallel stochastic gradient algorithm with unbounded delays. Simulated and real data are used to demonstrate the practical implications of these theoretical results.
Analyzing high-frequency time series is increasingly useful with the current explosion in the availability of these data in several application areas, including but not limited to, climate, finance, health analytics, transportation, etc. This talk will give an overview of two statistical frameworks that could be useful for analyzing high-frequency financial time series leading to quantification of financial risk. These include a distribution free approach using penalized estimating functions for modeling inter-event durations and an approximate Bayesian approach for modeling counts of events in regular intervals. A few other potentially useful lines of research in this area will also be introduced.
Machine learning Algorithms with a Sagemaker demo - Hridyesh Bisht
An algorithm is a set of steps to solve a problem. Supervised learning uses labeled training data to teach models patterns which they can then use to predict labels for new unlabeled data. Unsupervised learning uses clustering and pattern detection to analyze and group unlabeled data. SageMaker is a fully managed service that allows users to build, train and deploy machine learning models and includes components for managing notebooks, labeling data, and deploying models through endpoints.
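The supervised-learning idea above (labeled training data used to predict labels for new, unlabeled points) can be illustrated with a toy one-nearest-neighbour rule; no SageMaker API is involved and all data below is invented:

```python
def nearest_neighbor(train, point):
    """train: list of (feature_vector, label) pairs; returns the predicted label."""
    def dist(a, b):
        # squared Euclidean distance; no square root needed for comparison
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(train, key=lambda pair: dist(pair[0], point))
    return label

train = [((0.0, 0.0), "cat"), ((5.0, 5.0), "dog")]
print(nearest_neighbor(train, (4.0, 4.5)))
```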
THE NEW HYBRID COAW METHOD FOR SOLVING MULTI-OBJECTIVE PROBLEMS - ijfcstjournal
In this article, the hybrid COAW algorithm, which combines the Cuckoo Optimization Algorithm with the simple additive weighting method, is presented to solve multi-objective problems. The cuckoo algorithm is an efficient and structured method for solving nonlinear continuous problems. The Pareto frontiers created by the proposed COAW algorithm are exact and well dispersed; the method finds Pareto frontiers quickly and identifies their beginning and end points properly. To validate the proposed algorithm, several experimental problems were analyzed, and the results indicate the effectiveness of the COAW algorithm for solving multi-objective problems.
The document summarizes several advanced policy gradient methods for reinforcement learning, including trust region policy optimization (TRPO), proximal policy optimization (PPO), and using the natural policy gradient with the Kronecker-factored approximation (K-FAC). TRPO frames policy optimization as solving a constrained optimization problem to limit policy updates, while PPO uses a clipped objective function as a pessimistic bound. Both methods improve upon vanilla policy gradients. K-FAC provides an efficient way to approximate the natural policy gradient using the Fisher information matrix. The document reviews the theory and algorithms behind these methods.
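The clipped objective that PPO maximizes in place of TRPO's explicit KL constraint is, in standard notation:

```latex
L^{\mathrm{CLIP}}(\theta) =
  \hat{\mathbb{E}}_t\!\left[
    \min\!\Big( r_t(\theta)\,\hat{A}_t,\;
      \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big)
  \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

Here the probability ratio r_t measures how far the new policy has moved from the old one, and the min with the clipped term makes the objective a pessimistic bound that removes any incentive to move the ratio outside [1-ε, 1+ε].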
The president of the National Banking and Securities Commission (CNBV) reported that 93% of the savers affected by the Ficrea case requested payment of the deposit insurance before the June 17 deadline. Although some details remain to be settled, such as the bankruptcy (concurso mercantil) process, this high participation rate means that a large part of the conflict generated by the Ficrea fraud has concluded. However, some dissatisfied savers plan to demonstrate to demand a
The document discusses various tests used to measure the workability of concrete, including slump tests, compacting factor tests, and Vee-Bee tests. The slump test measures how much the concrete sinks in a standard cone after it is removed. The compacting factor test measures the ratio of partially compacted concrete weight to fully compacted concrete weight. The Vee-Bee test indirectly measures workability for concretes that cannot be tested by the slump test.
The document discusses ggplot2, a grammar of graphics plotting package for R. It introduces key concepts of ggplot2 including the layered grammar of graphics model and its components. These components - data, aesthetic mappings, statistical transformations, geometric objects, scales, coordinates, and faceting - provide flexibility to build complex plots from data. The document provides examples using ggplot2 to visualize birth and death rate data and explore the diamonds dataset.
Scalable frequent itemset mining using heterogeneous computing par apriori a...ijdpsjournal
Association Rule mining is one of the dominant tasks of data mining, which concerns in finding frequent
itemsets in large volumes of data in order to produce summarized models of mined rules. These models are
extended to generate association rules in various applications such as e-commerce, bio-informatics,
associations between image contents and non image features, analysis of effectiveness of sales and retail
industry, etc. In the vast increasing databases, the major challenge is the frequent itemsets mining in a
very short period of time. In the case of increasing data, the time taken to process the data should be
almost constant. Since high performance computing has many processors, and many cores, consistent runtime
performance for such very large databases on association rules mining is achieved. We, therefore,
must rely on high performance parallel and/or distributed computing. In literature survey, we have studied
the sequential Apriori algorithms and identified the fundamental problems in sequential environment and
parallel environment. In our proposed ParApriori, we have proposed parallel algorithm for GPGPU, and
we have also done the results analysis of our GPU parallel algorithm. We find that proposed algorithm
improved the computing time, consistency in performance over the increasing load. The empirical analysis
of the algorithm also shows that efficiency and scalability is verified over the series of datasets
experimented on many core GPU platform.
This slide was used in the "Mathematics of Logistics" seminar at Nishinari Laboratory, Faculty of Engineering, the University of Tokyo.
references:
1.久保幹雄 (2007) 『ロジスティクスの数理』 共立出版
2.Dimitri P. Bertsekas (2005). Dynamic Programming and Optimal Control. Athena Scientific. Vol 1,2. 4th edition.
Now a day enormous amount of data is getting explored through Internet of Things (IoT) as technologies
are advancing and people uses these technologies in day to day activities, this data is termed as Big Data
having its characteristics and challenges. Frequent Itemset Mining algorithms are aimed to disclose
frequent itemsets from transactional database but as the dataset size increases, it cannot be handled by
traditional frequent itemset mining. MapReduce programming model solves the problem of large datasets
but it has large communication cost which reduces execution efficiency. This proposed new pre-processed
k-means technique applied on BigFIM algorithm. ClustBigFIM uses hybrid approach, clustering using kmeans
algorithm to generate Clusters from huge datasets and Apriori and Eclat to mine frequent itemsets
from generated clusters using MapReduce programming model. Results shown that execution efficiency of
ClustBigFIM algorithm is increased by applying k-means clustering algorithm before BigFIM algorithm as
one of the pre-processing technique.
The document summarizes several improved algorithms that aim to address the drawbacks of the Apriori algorithm for association rule mining. It discusses six different approaches: 1) An intersection and record filter approach that counts candidate support only in transactions of sufficient length and uses set intersection; 2) An approach using set size and frequency to prune insignificant candidates; 3) An approach that reduces the candidate set and memory usage by only searching frequent itemsets once to delete candidates; 4) A partitioning approach that divides the database; 5) An approach using vertical data format to reduce database scans; and 6) A distributed approach to parallelize the algorithm across machines.
This document outlines an introduction to R graphics using ggplot2 presented by the Harvard MIT Data Center. The presentation introduces key concepts in ggplot2 including geometric objects, aesthetic mappings, statistical transformations, scales, faceting, and themes. It uses examples from the built-in mtcars dataset to demonstrate how to create common plot types like scatter plots, box plots, and regression lines. The goal is for students to be able to recreate a sample graphic by the end of the workshop.
This document describes a multi-level reduced order modeling approach with robust error bounds. It discusses applying dimensionality reduction algorithms to extract active subspaces from reduced complexity models, then equipping the reduced model with an error bound. A case study applies this approach to a nuclear reactor assembly model by extracting active subspaces from individual pin cell models to build a reduced order model in a more computationally efficient way than using the full assembly model.
The document discusses different string matching algorithms:
1. The naive string matching algorithm compares characters in the text and pattern sequentially to find matches.
2. The Robin-Karp algorithm uses hashing to quickly determine if the pattern is present in the text before doing full comparisons.
3. Finite automata models the pattern as states in an automaton to efficiently search the text for matches.
A PREFIXED-ITEMSET-BASED IMPROVEMENT FOR APRIORI ALGORITHMcsandit
Association rules is a very important part of data mining. It is used to find the interesting patterns from transaction databases. Apriori algorithm is one of the most classical algorithms
of association rules, but it has the bottleneck in efficiency. In this article, we proposed a prefixed-itemset-based data structure for candidate itemset generation, with the help of the structure we managed to improve the efficiency of the classical Apriori algorithm.
Talk on Optimization for Deep Learning, which gives an overview of gradient descent optimization algorithms and highlights some current research directions.
Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduc...Ruairi de Frein
An article from the Telecommunications Software & Systems Group, Waterford Institute of Technology, Ireland describing algorithms for distributed Formal Concept Analysis
ABSTRACT
While many existing formal concept analysis algorithms are efficient, they are typically unsuitable for distributed implementation. Taking the MapReduce (MR) framework as our inspiration we introduce a distributed approach for performing formal concept mining. Our method has its novelty in that we use a light-weight MapReduce runtime called Twister which is better suited to iterative algorithms than recent distributed approaches. First, we describe the theoretical foundations underpinning our distributed formal concept analysis approach. Second, we provide a representative exemplar of how a classic centralized algorithm can be implemented in a distributed fashion using our methodology: we modify Ganter's classic algorithm by introducing a family of MR* algorithms, namely MRGanter and MRGanter+ where the prefix denotes the algorithm's lineage. To evaluate the factors that impact distributed algorithm performance, we compare our MR* algorithms with the state-of-the-art. Experiments conducted on real datasets demonstrate that MRGanter+ is efficient, scalable and an appealing algorithm for distributed problems.
Accepted for publication at the International Conference for Formal Concept Analysis 2012.
Project participants: Biao Xu, Ruairí de Fréin, Eric Robson, Mícheál Ó Foghlú
Ruairí de Fréin: rdefrein (at) gmail (dot) com
bibtex:
@incollection{
year={2012},
isbn={978-3-642-29891-2},
booktitle={Formal Concept Analysis},
volume={7278},
series={Lecture Notes in Computer Science},
editor={Domenach, Florent and Ignatov, Dmitry I. and Poelmans, Jonas},
doi={10.1007/978-3-642-29892-9_26},
title={Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduce Framework},
url={http://dx.doi.org/10.1007/978-3-642-29892-9_26},
publisher={Springer Berlin Heidelberg},
keywords={Formal Concept Analysis; Distributed Mining; MapReduce},
author={Xu, Biao and de Fréin, Ruairí and Robson, Eric and Ó Foghlú, Mícheál},
pages={292-308}
}
Download the article from arXiv: http://arxiv.org/abs/1210.2401
Dual-time Modeling and Forecasting in Consumer Banking (2016) (Aijun Zhang)
Longitudinal and survival data are naturally observed with multiple origination dates. They form a dual-time data structure with horizontal axis representing the calendar time and the vertical axis representing the lifetime. In this talk we discuss how to model dual-time data based on a decomposition strategy and how to forecast over the time horizon. Various statistical techniques are used for treating fixed and random effects.
Among other fields, we share the potential applications in quantitative risk management, and demonstrate a large-scale credit risk analysis powered by big data computing.
This document summarizes an article from the International Journal of Electronics and Communication Engineering & Technology. The article presents a new optimization method called Self Accelerated Smart Particle Swarm Optimization (SASPSO) for solving nonlinear programming problems. SASPSO updates particle positions based on the personal best and global best positions without requiring velocity equations. This reduces parameters and computational cost compared to standard PSO. The SASPSO method is tested on benchmark nonlinear problems and shows improved accuracy and number of optimal solutions compared to genetic algorithms.
This document describes a multi-level reduced order modeling approach with robust error bounds. It discusses applying dimensionality reduction algorithms to extract active subspaces from reduced complexity models, then equipping the reduced model with an error bound. It presents a case study applying this approach to a 7x7 nuclear fuel assembly benchmark model, extracting active subspaces from individual fuel pin cell models to build a reduced order model in a more computationally efficient way.
In this work, we propose to apply trust region optimization to deep reinforcement learning using a recently proposed Kronecker-factored approximation to the curvature. We extend the framework of natural policy gradient and propose to optimize both the actor and the critic using Kronecker-factored approximate curvature (K-FAC) with trust region; hence we call our method Actor Critic using Kronecker-Factored Trust Region (ACKTR). To the best of our knowledge, this is the first scalable trust region natural gradient method for actor-critic methods. It is also a method that learns non-trivial tasks in continuous control as well as discrete control policies directly from raw pixel inputs. We tested our approach across discrete domains in Atari games as well as continuous domains in the MuJoCo environment. With the proposed methods, we are able to achieve higher rewards and a 2- to 3-fold improvement in sample efficiency on average, compared to previous state-of-the-art on-policy actor-critic methods. Code is available at https://github.com/openai/baselines.
Asynchronous parallel algorithms are developed to solve massive optimization problems in distributed data systems; they can run in parallel on multiple nodes with little or no synchronization. Recently they have been implemented successfully to solve a range of difficult practical problems. However, existing theories are mostly based on fairly restrictive assumptions about the delays and cannot explain the convergence and speedup properties of such algorithms. In this talk we give an overview of distributed optimization and discuss new theoretical results on the convergence of the asynchronous parallel stochastic gradient algorithm with unbounded delays. Simulated and real data are used to demonstrate the practical implications of these theoretical results.
Analyzing high-frequency time series is increasingly useful with the current explosion in the availability of these data in several application areas, including but not limited to, climate, finance, health analytics, transportation, etc. This talk will give an overview of two statistical frameworks that could be useful for analyzing high-frequency financial time series leading to quantification of financial risk. These include a distribution free approach using penalized estimating functions for modeling inter-event durations and an approximate Bayesian approach for modeling counts of events in regular intervals. A few other potentially useful lines of research in this area will also be introduced.
Machine Learning Algorithms with a SageMaker Demo (Hridyesh Bisht)
An algorithm is a set of steps to solve a problem. Supervised learning uses labeled training data to teach models patterns which they can then use to predict labels for new unlabeled data. Unsupervised learning uses clustering and pattern detection to analyze and group unlabeled data. SageMaker is a fully managed service that allows users to build, train and deploy machine learning models and includes components for managing notebooks, labeling data, and deploying models through endpoints.
The New Hybrid COAW Method for Solving Multi-Objective Problems (ijfcstjournal)
In this article, using the Cuckoo Optimization Algorithm and the simple additive weighting method, the hybrid COAW algorithm is presented to solve multi-objective problems. The cuckoo algorithm is an efficient and structured method for solving nonlinear continuous problems. The Pareto frontiers created by the proposed COAW algorithm are exact and have good dispersion. The method finds Pareto frontiers quickly and identifies their beginning and end points properly. To validate the proposed algorithm, several experimental problems were analyzed; the results indicate the effectiveness of the COAW algorithm for solving multi-objective problems.
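The quality of a Pareto frontier rests on the dominance relation. A minimal sketch of dominance filtering for minimization problems (illustrative only, not the COAW implementation):

```python
def dominates(p, q):
    """p dominates q (minimization): no worse in every objective, better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def pareto_front(points):
    """Keep only the points no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

pts = [(1, 5), (2, 3), (3, 4), (4, 1)]
print(pareto_front(pts))   # -> [(1, 5), (2, 3), (4, 1)]
```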
The document summarizes several advanced policy gradient methods for reinforcement learning, including trust region policy optimization (TRPO), proximal policy optimization (PPO), and using the natural policy gradient with the Kronecker-factored approximation (K-FAC). TRPO frames policy optimization as solving a constrained optimization problem to limit policy updates, while PPO uses a clipped objective function as a pessimistic bound. Both methods improve upon vanilla policy gradients. K-FAC provides an efficient way to approximate the natural policy gradient using the Fisher information matrix. The document reviews the theory and algorithms behind these methods.
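For concreteness, PPO's clipped surrogate objective can be sketched per sample as follows (an illustrative sketch, not the summarized document's code):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for one sample:
    min(r * A, clip(r, 1 - eps, 1 + eps) * A).
    Taking the min makes the bound pessimistic, discouraging large policy steps."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

print(ppo_clip_objective(1.5, 1.0))    # positive advantage: gain is capped
print(ppo_clip_objective(0.5, -1.0))   # negative advantage: loss is not hidden
```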
The president of the National Banking and Securities Commission reported that 93% of the savers affected by the Ficrea case requested payment of the deposit insurance before the June 17 deadline. Although some details remain to be defined, such as the bankruptcy proceedings, this high participation rate means that much of the conflict generated by the Ficrea fraud has concluded. However, some dissatisfied savers plan to demonstrate to demand a…
The document discusses various tests used to measure the workability of concrete, including slump tests, compacting factor tests, and Vee-Bee tests. The slump test measures how much the concrete sinks in a standard cone after it is removed. The compacting factor test measures the ratio of partially compacted concrete weight to fully compacted concrete weight. The Vee-Bee test indirectly measures workability for concretes that cannot be tested by the slump test.
The document discusses calculating reactions to loads applied to beams. It begins by defining key terms like beams, loads, forces, and equilibrium. It explains that reactions must be calculated to balance applied loads and achieve static equilibrium. The document then provides examples of calculating reactions on simple beams using free body diagrams and the principles of moment and force equilibrium. Reactions are found by taking moments and forces around supports and setting equations equal to zero.
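The moment-and-force procedure it describes can be sketched for the simplest case, a simply supported beam with point loads (function name and example values are illustrative):

```python
def simple_beam_reactions(span, loads):
    """Reactions of a simply supported beam under point loads.

    loads: list of (magnitude, distance-from-left-support) pairs.
    Taking moments about the left support: R_right * span = sum(P * x);
    vertical equilibrium then gives R_left."""
    r_right = sum(p * x for p, x in loads) / span
    r_left = sum(p for p, _ in loads) - r_right
    return r_left, r_right

# a 10 kN load placed 4 m along a 10 m beam
print(simple_beam_reactions(10.0, [(10.0, 4.0)]))   # -> (6.0, 4.0)
```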
Aggregate Impact Value: Calculation and Uses (Shahryar Amin)
This document describes a test to determine the aggregate impact value (AIV) of coarse aggregates. The AIV test measures the percentage of fines created when aggregates are subjected to a specified amount of impact. The test involves sieving aggregates into different sizes, filling a metal cylinder 1/3 full with coarse aggregates, subjecting it to hammer blows, then determining the weight of particles that pass through a 2.36mm sieve. The AIV percentage is calculated using the weights before and after impact. An AIV below 50% indicates aggregates suitable for construction, while above 50% suggests poorer quality.
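The final calculation reduces to a ratio of weights before and after impact. A minimal sketch, with illustrative example weights:

```python
def aggregate_impact_value(w_sample, w_fines):
    """AIV (%) = weight of fines passing the 2.36 mm sieve after impact,
    divided by the original sample weight, times 100.
    Lower values indicate tougher aggregate."""
    return 100.0 * w_fines / w_sample

# e.g. 52.5 g of fines from a 350 g sample
print(aggregate_impact_value(w_sample=350.0, w_fines=52.5))   # -> 15.0
```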
This document discusses viscosity testing for bitumen used in road pavements. It defines viscosity as the resistance to flow and explains that viscosity testing determines the consistency and strength of bitumen at different temperatures. The document outlines different types of viscometers used to measure the time required for bitumen to flow through an orifice at standardized temperatures, and how the results are interpreted to select bitumen with an appropriate viscosity for use in road construction and maintenance.
This test determines the resistance of coarse aggregate to sudden shock or impact: a cup is filled with aggregate, tamped, and struck; the sample is then sieved, and the percentage that breaks is calculated from the weights retained on and passing specific sieves. A lower percentage indicates greater resistance to impact.
Indian highways use road signs to convey important safety information to drivers. Signs communicate speed limits, road conditions, directions to cities and towns, and other driving rules. Uniform signage across India's vast highway system helps drivers safely navigate regardless of their starting point or destination.
This presentation contains the IS concrete mix design method and the basics of concrete design mix. It covers: objectives of mix design; grades of concrete; nominal mix and design mix; factors affecting choice of mix design; methods of concrete mix design; and the IS method of design.
Class 1: Moisture Content and Specific Gravity (Geotechnical Engineering) (Hossam Shafiq I)
This document provides an introduction to a geotechnical engineering laboratory course at Texas Tech University. It includes information about the course syllabus, schedule, report format, and the objectives and procedures for the first lab which involves determining the moisture content, unit weight, and specific gravity of a soil sample. The significance of understanding soil properties for civil engineers is discussed. Key relationships between the weight and volume of the solid, water, and air phases in a soil sample are also explained.
Class 3 (b): Soil Classification (Geotechnical Engineering) (Hossam Shafiq I)
The document discusses soil classification systems used in civil engineering, focusing on the Unified Soil Classification System (USCS). It describes the components and process of the USCS, including determining the percentages of gravel, sand, and fines based on sieve analysis. For soils with over 5% fines, the Atterberg limits test is used to classify on the plasticity chart. An example classification problem is worked through, showing a soil classified as SP-SM based on its grain size distribution and plasticity characteristics.
A PowerPoint Presentation on Superstructure (kuntansourav)
The document discusses different types of stone and brick masonry, including rubble masonry, ashlar masonry, and classifications within each. It also covers topics like doors, louvers, glazing, windows, ventilation, staircases, scaffolding, and shoring. Stone masonry uses stone units bonded with mortar, while brick masonry uses individual bricks laid in a pattern. Staircases require specific widths, heights, materials and other design elements to be safe and functional. Scaffolding and shoring are used to support structures during construction.
Introduction to the System of Coplanar Forces (Engineering Mechanics) (mashnil Gaddapawar)
This document provides an overview of engineering mechanics. It discusses three main classifications of mechanics: mechanics of deformable bodies, mechanics of fluids, and mechanics of rigid bodies. Mechanics of deformable bodies deals with how forces are distributed inside bodies and cause stresses and deformations. Mechanics of fluids concerns liquids and gases and their applications in engineering. Mechanics of rigid bodies examines bodies that do not deform under forces. The document also outlines fundamental concepts in mechanics like length, time, displacement, velocity, and acceleration. It introduces important mechanical laws developed by Sir Isaac Newton like Newton's three laws of motion and Newton's law of universal gravitation. Other topics covered include units of measurement, force, characteristics and classification of forces, and resolution
Friction is the force that opposes the motion of objects in contact with one another. It causes objects to slow down and stop moving even without an apparent force being applied. Friction occurs due to bumps and hollows between surfaces and is greater on rougher surfaces, causing slower motion. While friction has disadvantages like causing wear and reducing efficiency, it also has advantages such as enabling brakes to stop moving vehicles and allowing objects to be gripped. Friction can be reduced by smoothing surfaces or adding lubricants between surfaces.
Nearly all water in the world contains contaminants, even in the absence of nearby pollution-causing activities
Many dissolved minerals, carbon compounds, and microbes find their way into drinking water as it comes in contact with air and soil
When pollutant and contaminant levels in drinking water are high, they may affect household routines and be detrimental to human health
The only way to ensure that your water supply is safe is to have a periodic laboratory water quality analysis done on your drinking water. Hach India is the leading provider of high-end water quality analysis equipment in India.
1. Superstructure construction includes column, beam, floor, wall and roof located above ground level. Materials used are timber, steel and concrete.
2. Timber floor construction involves plank wood supported by timber joists and beams. Reinforced concrete uses column and beam construction with formwork, steel bar installation and concrete pouring.
3. Load bearing walls support loads and transfer to foundation, with minimum thickness of one brick. Non-load bearing walls only support own weight and are half brick thickness.
This document discusses different types of forces. It begins by explaining that moving objects are said to be in motion. It then states that a push or pull acting on an object is called a force. The document goes on to list and briefly describe four main types of forces: gravitational force, magnetic force, nuclear force, and muscular force.
Water quality can be assessed through various physical, chemical, and biological indicators. It depends on factors like geology, ecosystem, and human activities. Standards are set based on intended uses like drinking, industrial, or environmental. Water is sampled and tested using on-site or laboratory methods to monitor these indicators. Maintaining adequate water quality is important for public health and ecosystem protection.
Friction opposes the motion of objects and is caused by bumps on surfaces sticking together when they touch. There are three main types of friction: static friction between non-moving surfaces, sliding friction between surfaces moving past each other, and rolling friction between rolling objects and surfaces. Adding sand to tires increases rolling friction and helps cars move on slippery surfaces by providing more traction between the tires and the ground. This relates to Newton's Second Law, as increasing friction generates a greater net force to overcome inertia according to the formula F=ma.
Automated Machine Learning via Sequential Uniform Designs (Aijun Zhang)
This document introduces automated machine learning (AutoML) and sequential uniform design-based hyperparameter optimization (SeqUDHO). It discusses existing hyperparameter optimization methods and proposes using sequential uniform design. Numerical experiments demonstrate that SeqUDHO outperforms other methods like random search, Bayesian optimization, and grid search on both simulated complex surfaces and real-world classification tasks with SVM, XGBoost, and CNN algorithms. Future work is outlined to improve the approach.
The document proposes a hybrid algorithm combining genetic algorithm and cuckoo search optimization to solve job shop scheduling problems. It aims to minimize makespan (completion time of all jobs) by scheduling jobs on machines. The genetic algorithm is used to explore the search space but can get trapped in local optima. Cuckoo search optimization performs local search faster than genetic algorithm and helps avoid local optima. Experimental results on benchmark problems show the hybrid algorithm yields better solutions in terms of makespan and runtime compared to genetic algorithm and ant colony optimization algorithms.
Comparison Between the Genetic Algorithms Optimization and Particle Swarm Optimization (IAEME Publication)
Close-range photogrammetry network design refers to the process of placing a set of cameras in order to achieve photogrammetric tasks. The main objective of this paper is to find the best locations for two or three camera stations. Genetic algorithm optimization and Particle Swarm Optimization are developed to determine the optimal camera stations for computing three-dimensional coordinates. In this research, a mathematical model representing genetic algorithm optimization and Particle Swarm Optimization for the close-range photogrammetry network is developed. The paper also gives the sequence of field operations and computational steps for this task. A test field is included to reinforce the theoretical aspects.
Comparison between the genetic algorithms optimization and particle swarm optimization (IAEME Publication)
The document compares the genetic algorithms optimization and particle swarm optimization methods for designing close range photogrammetry networks. It presents the genetic algorithm and particle swarm optimization as two popular meta-heuristic algorithms inspired by natural evolution and collective animal behavior, respectively. The document develops mathematical models representing the genetic algorithm and particle swarm optimization for close range photogrammetry network design and evaluates them in a test field to reinforce the theoretical aspects.
Multiprocessor Scheduling and Performance Evaluation Using Elitist Non-dominated Sorting Genetic Algorithm (ijcsa)
Task scheduling plays an important part in the improvement of parallel and distributed systems. The task scheduling problem has been shown to be NP-hard, and deterministic techniques are too time-consuming to solve it. Algorithms have been developed to schedule tasks in distributed environments, but they focus on a single objective; the problem becomes more complex when two objectives are considered. This paper presents a bi-objective independent task scheduling algorithm using the elitist Non-dominated Sorting Genetic Algorithm (NSGA-II) to minimize makespan and flowtime. The algorithm generates Pareto-optimal solutions for this bi-objective task scheduling problem. NSGA-II is evaluated on a set of benchmark instances; the experimental results show that it generates efficient optimal schedules.
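Both objectives are straightforward to compute for a given schedule: makespan is the latest machine finish time, while flowtime sums every task's completion time. A minimal sketch (illustrative, not the paper's code):

```python
def makespan_and_flowtime(assignment, times):
    """assignment: machine index per task; times: processing time per task.
    Tasks on a machine run back-to-back in the given order; flowtime
    accumulates each task's finish time."""
    finish = {}
    flowtime = 0.0
    for task, m in enumerate(assignment):
        finish[m] = finish.get(m, 0.0) + times[task]
        flowtime += finish[m]
    return max(finish.values()), flowtime

# 4 tasks scheduled on 2 machines
print(makespan_and_flowtime([0, 1, 0, 1], [3.0, 2.0, 1.0, 4.0]))  # -> (6.0, 15.0)
```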
Performance Analysis of Genetic Algorithm as a Stochastic Optimization Tool i… (paperpublications3)
Abstract: Engineering design problems are complex by nature because of their critical objective functions involving many variables and constraints. Engineers have to ensure compatibility with the imposed specifications while keeping manufacturing costs low. Moreover, the methodology may vary according to the design problem.
The main issue is to choose the proper tool for optimization. In earlier days, a design problem was optimized by conventional techniques such as gradient search, evolutionary optimization, and random search, known as classical methods.
The method must be chosen properly depending on the nature of the problem; an incorrect choice may fail to give the optimal solution, so these methods are less robust.
Nowadays soft-computing techniques are widely used for optimizing a function, and they are more robust. The genetic algorithm is one such method: an effective tool in the realm of stochastic (non-classical) optimization. The algorithm produces many strings and generations to reach the optimal point.
The main objective of the paper is to optimize engineering design problems using the genetic algorithm and to analyze how effectively and closely the algorithm reaches the optimum. We choose a mathematical expression for the objective function in terms of the design variables and optimize it under given constraints using GA.
Chap. 8: Optimization for Training Deep Models (Young-Geun Choi)
Internal lab seminar material, summarizing and excerpting Chapter 8 of Goodfellow et al. (2016), Deep Learning, MIT Press. It introduces the optimization methods commonly used to minimize the objective function when training deep neural networks.
The document discusses building human-based software estimation models that are accurate, intuitive, and easy to understand. It presents an approach using correlation and scale factors between estimated and actual effort. Experiments on a dataset of 178 samples show that combining correlation and scale factors into a decision tree achieves up to 93.3% accuracy. The resulting model bridges expert and algorithmic estimation methods.
A Tabu Search Heuristic for the Generalized Assignment Problem (Sandra Long)
This document proposes a Tabu search heuristic for solving the generalized assignment problem (GAP), which is an NP-hard combinatorial optimization problem. The algorithm uses a relaxed formulation of the GAP that allows for infeasible solutions by including a penalty term for capacity violations. It employs simple neighborhood structures and dynamically adjusts the penalty weighting over time using both recent and medium-term memory. Computational experiments show that the algorithm provides good quality solutions efficiently compared to other heuristic methods for the GAP.
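The relaxed objective can be sketched as assignment cost plus a weighted capacity-violation penalty, so the search can pass through infeasible solutions (names and example values are illustrative, not the paper's formulation):

```python
def gap_penalized_cost(assign, cost, demand, capacity, alpha):
    """Relaxed GAP objective: assignment cost plus alpha per unit of
    agent-capacity violation, keeping infeasible moves explorable.
    cost[m][j], demand[m][j]: cost/resource of task j on agent m."""
    total = sum(cost[m][j] for j, m in enumerate(assign))
    load = [0.0] * len(capacity)
    for j, m in enumerate(assign):
        load[m] += demand[m][j]
    violation = sum(max(0.0, l - c) for l, c in zip(load, capacity))
    return total + alpha * violation

# two tasks both forced onto agent 0, which only has room for one
print(gap_penalized_cost([0, 0],
                         cost=[[2.0, 3.0], [4.0, 1.0]],
                         demand=[[5.0, 5.0], [5.0, 5.0]],
                         capacity=[5.0, 5.0],
                         alpha=10.0))   # -> 55.0
```

In the heuristic described above, the weight alpha would be adjusted dynamically as the search proceeds.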
Two-Stage Eagle Strategy with Differential Evolution (Xin-She Yang)
The document describes a two-stage optimization strategy called the Eagle Strategy (ES) that combines global and local search algorithms to improve search efficiency. It evaluates applying ES to differential evolution (DE), a popular evolutionary algorithm. ES first uses randomization like Levy flights for global exploration, then switches to DE for intensive local search around promising solutions. The authors validate ES-DE on test functions, finding it requires only 9.7-24.9% of the function evaluations of pure DE. They also apply it to real-world pressure vessel and gearbox design problems, achieving solutions with 14.9-17.7% fewer function evaluations than pure DE.
Accelerated life testing plans are designed under multiple-objective considerations, with the resulting Pareto-optimal solutions classified and reduced using a neural network and data envelopment analysis, respectively.
This document provides an introduction to Bayesian optimization and techniques used by SigOpt to optimize machine learning models and simulations. It discusses how Bayesian optimization uses a probabilistic model and acquisition function to efficiently search parameter spaces to find optimal configurations. Key aspects covered include Gaussian process and random forest regression models, expected improvement acquisition functions, and software packages that employ these methods like Spearmint, Hyperopt, and SMAC.
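The expected-improvement acquisition mentioned above has a closed form under a Gaussian posterior. A minimal sketch for minimization (illustrative, not SigOpt's implementation):

```python
import math

def expected_improvement(mu, sigma, best):
    """EI for minimization under a Gaussian posterior N(mu, sigma^2):
    EI = (best - mu) * Phi(z) + sigma * phi(z), with z = (best - mu) / sigma."""
    if sigma <= 0.0:
        return max(0.0, best - mu)
    z = (best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # Phi(z)
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # phi(z)
    return (best - mu) * cdf + sigma * pdf

# at the current best (mu = best), EI is pure exploration: sigma * phi(0)
print(round(expected_improvement(0.0, 1.0, 0.0), 4))
```

Candidates maximizing this quantity balance exploiting low predicted means against exploring high-variance regions.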
International Refereed Journal of Engineering and Science (IRJES) (irjes)
The core of the IRJES vision is to disseminate new knowledge and technology for the benefit of all, from academic research and professional communities to industry professionals, across a range of topics in computer science and engineering. It also provides a venue for high-caliber researchers, practitioners, and PhD students to present ongoing research and development in these areas.
On the Effect of Exploration Strategies on Maintenanc… (Zephyrin Soh, Ptidej Team, 2013-03-21)
This document presents an empirical study that investigates developers' program exploration strategies. The goal is to understand how developers navigate through a program's entities in order to help them more efficiently. The study analyzes developers' interaction histories to identify common exploration strategies and examines relationships between strategies and other factors like task type and expertise level. The results could help evaluate developer performance, improve comprehension models, and guide less experienced developers.
A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce (Yu Liu)
This document describes a homomorphism-based framework for systematic parallel programming with MapReduce. The framework introduces a systematic approach to automatically generate fully parallelized and scalable MapReduce programs. It provides algorithmic programming interfaces that allow users to focus on the algebraic properties of problems, hiding the details of MapReduce. The framework was implemented on top of Hadoop and evaluated on several test problems, demonstrating good scalability and parallelism. Future work could decrease system overhead, optimize performance further, and extend the framework to more complex data structures like trees and graphs.
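The framework's central algebraic idea, that a list homomorphism splits into a per-chunk map phase and an associative reduce phase, can be sketched as follows (an illustrative sketch, not the framework's actual API):

```python
from functools import reduce

def hom(f, op, chunks):
    """A list homomorphism h with h(x ++ y) = h(x) `op` h(y) and h([a]) = f(a).
    With an associative op, each chunk is processed independently (the 'map'
    phase) and partial results combine in any grouping (the 'reduce' phase)."""
    partial = [reduce(op, map(f, c)) for c in chunks]   # map phase, per split
    return reduce(op, partial)                          # reduce phase

# sum of squares as a homomorphism, computed over two independent 'splits'
print(hom(lambda x: x * x, lambda a, b: a + b, [[1, 2], [3, 4]]))  # -> 30
```

Associativity of the combining operator is exactly what makes the reduce phase safe to parallelize and regroup arbitrarily across workers.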
This document describes AURA, a hybrid approach to identify framework evolution. AURA uses call dependency graphs and code element similarity to identify replacement rules between framework versions. It aims to identify one-to-many, many-to-one, deleted, and cascading replacement rules automatically without requiring framework developer involvement or thresholds. The approach is inspired by and improves upon previous work on identifying framework evolution rules.
Similar to Implementing Generate-Test-and-Aggregate Algorithms on Hadoop (20)
A TPC Benchmark of Hive LLAP and Comparison with Presto (Yu Liu)
This is a TPC-H/DS benchmark of both Hive LLAP (Low Latency Analytical Processing) and Presto, comparing the two popular big-data query engines. The results show significant advantages for Hive LLAP in performance and durability.
Cloud Era Transactional Processing: Problems, Strategies and Solutions (Yu Liu)
The document discusses challenges and solutions for transactional processing in the cloud era. It covers modeling transactional consistency constraints, choosing appropriate consistency models like causal consistency, and state-of-the-art academic research in coordination avoidance, consistency models, and hardware efforts to improve transaction processing performance. The document provides definitions of consistency models and isolation levels and compares different approaches.
The document discusses natural language processing (NLP) for medical documents, specifically retrieving International Classification of Diseases (ICD) codes from free-text medical reports. It summarizes a medical NLP shared task called MedNLPDoc that aimed to retrieve information from Japanese medical reports. The highest performing system used a rule-based approach, showing rules can still outperform machine learning for medical NLP. Collaboration between researchers and enterprises was encouraged to resolve gaps between academic research and real-world requirements.
Survey on Parallel/Distributed Search Engines (Yu Liu)
This document summarizes a survey on parallel and distributed search engines. It discusses how web search tasks like crawling billions of documents, indexing terabytes of data, and responding to thousands of queries simultaneously require a parallel or distributed approach. It then provides examples of distributed search engines and technologies like MapReduce, and discusses challenges in distributed search like resource representation, selection, and result merging. Finally, it surveys parallel implementations of clustering algorithms and challenges in parallelizing hierarchical agglomerative clustering with MapReduce.
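The indexing task mentioned above is the canonical MapReduce example: mappers emit (term, document) pairs and reducers group them into postings lists. A miniature in-process simulation (illustrative only):

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit a (term, doc_id) pair for each distinct term in a document."""
    return [(term, doc_id) for term in set(text.split())]

def reduce_phase(pairs):
    """Reduce: group postings by term into the inverted index."""
    index = defaultdict(set)
    for term, doc in pairs:
        index[term].add(doc)
    return index

docs = {1: "big data search", 2: "distributed search engines"}
pairs = [p for d, t in docs.items() for p in map_phase(d, t)]
index = reduce_phase(pairs)
print(sorted(index["search"]))   # -> [1, 2]
```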
Paper Introduction: Combinatorial Optimization on Graphs of Bounded Treewidth (Yu Liu)
These slides introduce the paper: H. L. Bodlaender and A. M. C. A. Koster, "Combinatorial Optimization on Graphs of Bounded Treewidth," Comput. J., vol. 51, no. 3, pp. 255-269, Nov. 2007.
Paper Introduction: Combinatorial Model and Bounds for Target Set Selection (Yu Liu)
The paper Combinatorial Model and Bounds for Target Set Selection, by Eyal Ackerman, Oren Ben-Zwi, and Guy Wolfovitz, presents:
1. a combinatorial model for the dynamic activation process in influence networks;
2. representations of the Perfect Target Set Selection Problem and its variants as linear integer programs;
3. combinatorial lower and upper bounds on the size of the minimum Perfect Target Set.
An Accumulative Computation Framework on MapReduce (PPL 2013) (Yu Liu)
The document discusses an accumulative computation framework on MapReduce clusters. It presents examples of accumulative computation programs and benchmarks their performance on MapReduce. The experiments show the framework can process large datasets in a reasonable time and achieves near-linear speedup when increasing CPUs, demonstrating the efficiency and scalability of the approach. The accumulative computation pattern and framework simplify parallelizing problems that have data dependencies and allow encoding many parallel computations.
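A classic accumulative computation threads an accumulation parameter through the list; elimSmallers, which drops every element smaller than some earlier element, is a standard example. A sequential sketch (illustrative; the framework parallelizes this pattern because the running maximum combines associatively):

```python
def elim_smallers(xs):
    """Drop every element smaller than some element before it.
    The running maximum 'acc' is the accumulation parameter threaded
    left to right through the list."""
    out, acc = [], float("-inf")
    for x in xs:
        if x >= acc:
            out.append(x)
        acc = max(acc, x)
    return out

print(elim_smallers([3, 1, 4, 1, 5, 9, 2, 6]))   # -> [3, 4, 5, 9]
```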
An Introduction of Recent Research on MapReduce (2011) (Yu Liu)
This document summarizes recent research on MapReduce. It outlines papers presented at the MAPREDUCE11 conference and Hadoop World 2010, including papers on resource attribution in data clusters, shared-memory MapReduce implementations, static type checking of MapReduce programs, QR factorizations, genome indexing, and optimizing data selection. It also summarizes talks and lists several interesting papers on topics like distributed data processing.
Introduction of A Lightweight Stage-Programming Framework (Yu Liu)
The lightweight stage-programming framework introduced in these slides can be used to build efficient parallel DSLs that can be transformed into MapReduce programs. To understand these slides, please first read http://www.slideshare.net/YuLiu19/a-generatetestaggregate-parallel-programming-library-on-spark.
Start From A MapReduce Graph Pattern-recognize AlgorithmYu Liu
This document summarizes a presentation on developing a MapReduce algorithm to recognize patterns in large graphs by finding connected components. It discusses:
- Motivation to study parallel graph algorithms and frameworks like MapReduce and Pregel
- The problem of finding link patterns in graphs by extracting connected components
- Background on semantic web and linked open data modeled as RDF graphs
- A naive O(2Ck)-iteration MapReduce algorithm to find connected components between pairs of datasets
- Examples and analysis of the algorithm's complexity and communication costs
Introduction of the Design of A High-level Language over MapReduce -- The Pig...Yu Liu
Pig is a platform for analyzing large datasets that uses Pig Latin, a high-level language, to express data analysis programs. Pig Latin programs are compiled into MapReduce jobs and executed on Hadoop. Pig Latin provides data manipulation constructs like SQL as well as user-defined functions. The Pig system compiles programs through optimization, code generation, and execution on Hadoop. Future work focuses on additional optimizations, non-Java UDFs, and interfaces like SQL.
On Extending MapReduce - Survey and ExperimentsYu Liu
It talks a survey and my experiments on extending MapReduce programming model. A BSP-based MapReduce interface was implemented and evaluated, which shows dramatically improvement on performance.
Introduction to Ultra-succinct representation of ordered trees with applicationsYu Liu
The document summarizes a paper on ultra-succinct representations of ordered trees. It introduces tree degree entropy, a new measure of information in trees. It presents a succinct data structure that uses nH*(T) + O(n log log n / log n) bits to represent an ordered tree T with n nodes, where H*(T) is the tree degree entropy. This representation supports computing consecutive bits of the tree's DFUDS representation in constant time. It also supports computing operations like lowest common ancestor, depth, and level-ancestor in constant time using an auxiliary structure of O(n(log log n)2 / log n) bits.
On Implementation of Neuron Network(Back-propagation)Yu Liu
This document outlines Yu Liu's work implementing and comparing different parallel versions of a neural network using backpropagation. It discusses motivations for parallel programming practice and library study. It provides an introduction to neural networks and backpropagation algorithms. Three implementations are compared: sequential C++ STL, Skelton library, and Intel TBB. Benchmark results show improved speedups from parallel versions. Remaining challenges are also noted, like addressing local minima problems and testing on larger data.
ScrewDriver Rebirth: Generate-Test-and-Aggregate Framework on HadoopYu Liu
This document describes Yu Liu's ScrewDriver Rebirth framework for implementing the generate-test-aggregate algorithm on Hadoop. The framework uses semiring structures to represent the generate, test, and aggregate functions. It defines Generator and Aggregater classes to implement generation and aggregation. The framework allows fusing operations by lifting semirings and defining new generators. Examples show various generators, tests, and aggregators run on Hadoop to evaluate performance improvements over the previous version.
A Homomorphism-based MapReduce Framework for Systematic Parallel ProgrammingYu Liu
The document outlines a homomorphism-based framework for parallel programming on MapReduce. It introduces homomorphisms and theorems about them. The framework represents lists as sets of key-value pairs distributed across nodes. Functions are implemented using this representation and MapReduce, allowing easy parallelization of problems like maximum prefix sum that are otherwise complex on MapReduce.
Towards Systematic Parallel Programming over MapReduceYu Liu
This document discusses programming with MapReduce and proposes a calculational approach for systematic parallel programming over MapReduce. It begins with background on MapReduce and examples of MapReduce programming. It then discusses issues with directly mapping sequential algorithms to MapReduce. The document proposes expressing computations as list homomorphisms, which can be automatically implemented as MapReduce jobs. It presents an interface for defining sequential functions as fold and unfold and discusses implementing list homomorphisms in MapReduce by representing lists and intermediate data. It evaluates the performance of the homomorphism-based approach.
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
Open Source Contributions to Postgres: The Basics POSETTE 2024ElizabethGarrettChri
Postgres is the most advanced open-source database in the world and it's supported by a community, not a single company. So how does this work? How does code actually get into Postgres? I recently had a patch submitted and committed and I want to share what I learned in that process. I’ll give you an overview of Postgres versions and how the underlying project codebase functions. I’ll also show you the process for submitting a patch and getting that tested and committed.
End-to-end pipeline agility - Berlin Buzzwords 2024Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
The Ipsos - AI - Monitor 2024 Report.pdfSocial Samosa
According to Ipsos AI Monitor's 2024 report, 65% Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Kaxil Naik
Navigating today's data landscape isn't just about managing workflows; it's about strategically propelling your business forward. Apache Airflow has stood out as the benchmark in this arena, driving data orchestration forward since its early days. As we dive into the complexities of our current data-rich environment, where the sheer volume of information and its timely, accurate processing are crucial for AI and ML applications, the role of Airflow has never been more critical.
In my journey as the Senior Engineering Director and a pivotal member of Apache Airflow's Project Management Committee (PMC), I've witnessed Airflow transform data handling, making agility and insight the norm in an ever-evolving digital space. At Astronomer, our collaboration with leading AI & ML teams worldwide has not only tested but also proven Airflow's mettle in delivering data reliably and efficiently—data that now powers not just insights but core business functions.
This session is a deep dive into the essence of Airflow's success. We'll trace its evolution from a budding project to the backbone of data orchestration it is today, constantly adapting to meet the next wave of data challenges, including those brought on by Generative AI. It's this forward-thinking adaptability that keeps Airflow at the forefront of innovation, ready for whatever comes next.
The ever-growing demands of AI and ML applications have ushered in an era where sophisticated data management isn't a luxury—it's a necessity. Airflow's innate flexibility and scalability are what makes it indispensable in managing the intricate workflows of today, especially those involving Large Language Models (LLMs).
This talk isn't just a rundown of Airflow's features; it's about harnessing these capabilities to turn your data workflows into a strategic asset. Together, we'll explore how Airflow remains at the cutting edge of data orchestration, ensuring your organization is not just keeping pace but setting the pace in a data-driven future.
Session in https://budapestdata.hu/2024/04/kaxil-naik-astronomer-io/ | https://dataml24.sessionize.com/session/667627
Build applications with generative AI on Google CloudMárton Kodok
We will explore Vertex AI - Model Garden powered experiences, we are going to learn more about the integration of these generative AI APIs. We are going to see in action what the Gemini family of generative models are for developers to build and deploy AI-driven applications. Vertex AI includes a suite of foundation models, these are referred to as the PaLM and Gemini family of generative ai models, and they come in different versions. We are going to cover how to use via API to: - execute prompts in text and chat - cover multimodal use cases with image prompts. - finetune and distill to improve knowledge domains - run function calls with foundation models to optimize them for specific tasks. At the end of the session, developers will understand how to innovate with generative AI and develop apps using the generative ai industry trends.
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
Implementing Generate-Test-and-Aggregate Algorithms on Hadoop
Yu Liu (The Graduate University for Advanced Studies), Sebastian Fischer (National Institute of Informatics), Kento Emoto (University of Tokyo), and Zhenjiang Hu (National Institute of Informatics)
September 28, 2011

Outline
Background
Motivation and Objective
Design and implementation
Performance test
Conclusion and future work
MapReduce
Computation proceeds in three phases: map, shuffle, and reduce.
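The three phases can be sketched in plain Java as an in-memory word count. This is only an illustration of the model, not the Hadoop API; the class and method names are ours:

```java
import java.util.*;

// Minimal in-memory sketch of MapReduce's three phases (word count).
// Illustrates the model only; it does not use the Hadoop API.
public class MiniMapReduce {
    public static Map<String, Integer> wordCount(List<String> lines) {
        // map: emit a (word, 1) pair for every word in every input line
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String w : line.split("\\s+"))
                if (!w.isEmpty())
                    pairs.add(new AbstractMap.SimpleEntry<>(w, 1));

        // shuffle: group all values belonging to the same key together
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());

        // reduce: sum the grouped counts per key
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            int sum = 0;
            for (int v : g.getValue()) sum += v;
            result.put(g.getKey(), sum);
        }
        return result;
    }
}
```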
Programming with MapReduce
Programmers need to implement Mapper and Reducer classes (in Hadoop).
The main difficulties of MapReduce programming:
Nontrivial problems are usually difficult to compute in a divide-and-conquer fashion.
Efficiency of parallel algorithms is difficult to obtain.
Generate-Test-and-Aggregate Algorithms
A Generate-Test-and-Aggregate (GTA for short) algorithm consists of:
generate, which generates all possible solution candidates;
test, which filters the intermediate data;
aggregate, which computes a summary of the valid intermediate data.
GTA is a very useful and common strategy for a large class of problems.
An Example: Knapsack Problem
Fill a knapsack with items, each of a certain value and weight, such that the total value of the packed items is maximal while adhering to the weight restriction of the knapsack.
picture from Wikipedia
A knapsack program (GTA algorithm):
knapsack = maxvalue ◦ filter ◦ sublists
E.g., there are 3 items: (1kg, $1), (1kg, $2), (2kg, $2)
sublists [(1kg, $1), (1kg, $2), (2kg, $2)]
= [ ], [(1kg, $1)], [(1kg, $1), (1kg, $2)], [(1kg, $1), (1kg, $2), (2kg, $2)],
[(1kg, $1), (2kg, $2)], [(1kg, $2)], [(1kg, $2), (2kg, $2)], [(2kg, $2)]
Suppose the capacity of the knapsack is 2 kg.
filter [ ], [(1kg, $1)], [(1kg, $1), (1kg, $2)], [(1kg, $1), (1kg, $2), (2kg, $2)],
[(1kg, $1), (2kg, $2)], [(1kg, $2)], [(1kg, $2), (2kg, $2)], [(2kg, $2)]
= [ ], [(1kg, $1)], [(1kg, $1), (1kg, $2)], [(2kg, $2)], [(1kg, $2)]
maxvalue [ ], [(1kg, $1)], [(1kg, $1), (1kg, $2)], [(2kg, $2)], [(1kg, $2)]
= $3
This program is simple but inefficient, because it generates exponentially many (2^n) intermediate candidates.
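The naive composition maxvalue ◦ filter ◦ sublists can be written down directly as a sequential program. A sketch with our own names (an item is a {weight, value} pair), which indeed materializes all 2^n sublists:

```java
import java.util.*;

// Naive sequential GTA knapsack: maxvalue . filter . sublists.
// Deliberately inefficient: O(2^n) candidates; item = {weight, value}.
public class NaiveKnapsack {
    // generate: all sublists (order-preserving subsets) of the item list
    static List<List<int[]>> sublists(List<int[]> items) {
        List<List<int[]>> result = new ArrayList<>();
        result.add(new ArrayList<>());          // the empty candidate
        for (int[] item : items) {
            List<List<int[]>> extended = new ArrayList<>();
            for (List<int[]> s : result) {
                List<int[]> t = new ArrayList<>(s);
                t.add(item);
                extended.add(t);
            }
            result.addAll(extended);            // keep both: with and without item
        }
        return result;
    }

    // test: keep candidates whose total weight fits the capacity
    static boolean fits(List<int[]> s, int capacity) {
        int w = 0;
        for (int[] item : s) w += item[0];
        return w <= capacity;
    }

    // aggregate: maximum total value among the valid candidates
    public static int knapsack(List<int[]> items, int capacity) {
        int best = 0;
        for (List<int[]> s : sublists(items))
            if (fits(s, capacity)) {
                int v = 0;
                for (int[] item : s) v += item[1];
                best = Math.max(best, v);
            }
        return best;
    }
}
```

On the 3-item example above with capacity 2 kg this yields $3, matching the hand computation.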
Theorems for Generating Efficient Parallel GTA Programs
Efficient parallel programs can be derived from users' naive but correct programs written in terms of a generate, a test, and an aggregate function [Emoto et al., 2011]:
aggregate ◦ test ◦ generate ⇒ list homomorphism
List homomorphisms are a class of recursive functions that match very well with the divide-and-conquer paradigm [Bird, 87; Cole, 95].
Emoto's theorem holds under the following assumptions:
aggregate is a semiring homomorphism.
test is a list homomorphism.
generate is polymorphic over semiring structures.
Motivation and Objective
Emoto's fusion theorem shows a possible way to systematically implement efficient parallel programs from GTA algorithms.
We need to evaluate this approach by implementing a practical library, which should:
have an easy-to-use programming interface that helps users design GTA algorithms;
be able to generate efficient parallel programs on MapReduce (Hadoop).
System Overview
Implementation on Hadoop
MapReducer is an interface for list homomorphisms:
h [ ] = id⊕
h [a] = f a
h (x ++ y) = h x ⊕ h y

public interface MapReducer<Elem, Val, Res> {
    public Val identity();
    public Val element(Elem elem);
    public Val combine(Val left, Val right);
    public Res postprocess(Val val);
}
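As an illustration, the list homomorphism sum can be expressed against this interface. SumMapReducer and its sequential run method are our own sketch, not part of the library:

```java
import java.util.List;

// The interface from the slide, restated here so the example is self-contained.
interface MapReducer<Elem, Val, Res> {
    Val identity();
    Val element(Elem elem);
    Val combine(Val left, Val right);
    Res postprocess(Val val);
}

// sum as a list homomorphism: h [] = 0, h [a] = a, h (x ++ y) = h x + h y.
public class SumMapReducer implements MapReducer<Integer, Integer, Integer> {
    public Integer identity() { return 0; }
    public Integer element(Integer elem) { return elem; }
    public Integer combine(Integer left, Integer right) { return left + right; }
    public Integer postprocess(Integer val) { return val; }

    // Sequential reference semantics of the homomorphism (a fold);
    // on a cluster, combine is applied across per-node partial results.
    public Integer run(List<Integer> xs) {
        Integer acc = identity();
        for (Integer x : xs) acc = combine(acc, element(x));
        return postprocess(acc);
    }
}
```

Because combine is associative with identity(), partial sums computed on different nodes can be merged in any grouping.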
Aggregator defines a semiring homomorphism (A, ⊕, ⊗) → (S, ⊕′, ⊗′):

public interface Aggregator<A, S> {
    public S zero();
    public S one();
    public S singleton(A a);
    public S plus(S left, S right);
    public S times(S left, S right);
}
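For example, the knapsack aggregate maxvalue can be viewed as a homomorphism into the max-plus semiring: plus picks the better of two alternative candidates, times combines values within one candidate. The MaxValue class below is our illustrative sketch, not the library's:

```java
// The interface from the slide, restated here so the example is self-contained.
interface Aggregator<A, S> {
    S zero();
    S one();
    S singleton(A a);
    S plus(S left, S right);
    S times(S left, S right);
}

// maxvalue for knapsack as a homomorphism into the max-plus semiring
// (Int ∪ {-∞}, max, +): zero = -∞ (no candidate), one = 0 (the empty
// candidate), plus = max over alternatives, times = + within a candidate.
public class MaxValue implements Aggregator<Integer, Long> {
    static final long NEG_INF = Long.MIN_VALUE / 2;  // halved to avoid overflow in times
    public Long zero() { return NEG_INF; }
    public Long one() { return 0L; }
    public Long singleton(Integer value) { return value.longValue(); }
    public Long plus(Long left, Long right) { return Math.max(left, right); }
    public Long times(Long left, Long right) { return left + right; }
}
```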
Test is almost a list homomorphism; it inherits MapReducer:

public interface Test<Elem, Key> extends MapReducer<Elem, Key, Boolean> {}
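For instance, the knapsack weight restriction can be phrased as such a Test: the homomorphism part sums the weights (the Key), and postprocess compares the total against the capacity. WeightTest is our own sketch, not the library's:

```java
// The interfaces from the slides, restated here so the example is self-contained.
interface MapReducer<Elem, Val, Res> {
    Val identity();
    Val element(Elem elem);
    Val combine(Val left, Val right);
    Res postprocess(Val val);
}
interface Test<Elem, Key> extends MapReducer<Elem, Key, Boolean> {}

// Knapsack weight test: the homomorphism sums item weights into the Key,
// and postprocess turns that key into the pass/fail verdict.
public class WeightTest implements Test<int[], Integer> {  // item = {weight, value}
    private final int capacity;
    public WeightTest(int capacity) { this.capacity = capacity; }
    public Integer identity() { return 0; }
    public Integer element(int[] item) { return item[0]; }
    public Integer combine(Integer left, Integer right) { return left + right; }
    public Boolean postprocess(Integer totalWeight) { return totalWeight <= capacity; }
}
```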
Generator implements a MapReducer:
polymorphic over semirings: the constructor takes an instance of Aggregator
filter embedding: the embed function returns a new generator

public abstract class Generator<Elem, Single, Val, Res>
        implements MapReducer<Elem, Val, Res> {
    // The constructor takes an instance of Aggregator
    public Generator(Aggregator<Single, Val> aggregator) { ... }

    // take an instance of Test and return a new instance of Generator
    public <Key> Generator<Elem, Single, WritableMap<Key, Val>, Res>
            embed(final Test<Single, Key> test) {
        final Generator<Elem, Single, Val, Res> base = this;
        return new Generator<Elem, Single, WritableMap<Key, Val>, Res>
            (new Aggregator<Single, WritableMap<Key, Val>>() { ... });
    }

    public Val process(List<Elem> list) { ... }
    ...
}
1. Users make their own Generator, Test, and Aggregator by extending/implementing the ones provided by the library (the library supplies commonly used Generators and Aggregators).
2. An instance of Generator, which is itself an efficient list homomorphism, is created at run time on each worker node.
3. This list homomorphism is then executed by Hadoop in parallel.
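To see what filter embedding buys, the fused knapsack computation can be sketched sequentially: the lifted semiring works on maps from the test's key (accumulated weight) to the best value reaching that weight, so intermediate data stays polynomial in the input rather than exponential. All names here are ours, not the library's:

```java
import java.util.*;

// Sequential sketch of the fused GTA knapsack after filter embedding:
// the lifted value is a map from accumulated weight (the test's key,
// capped at capacity) to the best total value reaching that weight.
public class FusedKnapsack {
    public static int knapsack(List<int[]> items, int capacity) {  // item = {weight, value}
        // lifted "one": only the empty candidate, with weight 0 and value 0
        Map<Integer, Integer> best = new HashMap<>();
        best.put(0, 0);
        for (int[] item : items) {
            Map<Integer, Integer> next = new HashMap<>(best);  // candidates without item
            for (Map.Entry<Integer, Integer> e : best.entrySet()) {
                int w = e.getKey() + item[0];
                if (w > capacity) continue;          // the test, folded into generation
                int v = e.getValue() + item[1];
                next.merge(w, v, Math::max);         // plus = max in the value semiring
            }
            best = next;
        }
        // postprocess: best value over all weights that pass the test
        int answer = 0;
        for (int v : best.values()) answer = Math.max(answer, v);
        return answer;
    }
}
```

At any point the map holds at most capacity + 1 entries, so the whole run is O(n · capacity) instead of O(2^n).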
Java Code
Let’s have a look at the actual implementation of GTA Knapsack...
Performance Evaluation
Environment (hardware): we configured clusters with 2, 4, 8, 16, and 32 nodes (virtual machines); each computing/data node has one CPU (VM, Xeon E5530 @ 2.4 GHz, 1 core) and 3 GB memory.
Test data: 10^2 × 2^20 (≈ 10^8) knapsack items (3.2 GB). Each item's weight is between 0 and 10, and the capacity of the knapsack is 100.
Evaluation on Hadoop
The knapsack program scales well as nodes are added to the cluster.
Conclusion
The implementation of the GTA library on Hadoop can:
hide the technical details of MapReduce (Hadoop);
automatically perform parallelization and optimization;
generate MapReduce programs with good scalability;
make coding, testing, and code reuse much simpler.
Future Work
Optimization of the current framework for better performance
Extension of the current framework
Other approaches to systematic parallel programming
Thanks
Questions?
The project is hosted on
http://screwdriver.googlecode.com
Appendix: Computation over Semirings
Definition (Semiring)
Given a set S and two binary operations ⊕ and ⊗, the triple (S, ⊕, ⊗) is called a semiring if and only if:
(S, ⊕) is a commutative monoid with identity element id⊕;
(S, ⊗) is a monoid with identity element id⊗;
⊗ is associative and distributes over ⊕;
id⊕ is a zero of ⊗: id⊕ ⊗ a = a ⊗ id⊕ = id⊕.
(Int, +, ×) is a semiring; (Int ∪ {−∞}, max, +) is another (the max-plus semiring).
Definition (Semiring homomorphism)
Given two semirings (S, ⊕, ⊗) and (S′, ⊕′, ⊗′), a function hom : S → S′ is a semiring homomorphism from (S, ⊕, ⊗) to (S′, ⊕′, ⊗′) iff it is a monoid homomorphism from (S, ⊕) to (S′, ⊕′) and also a monoid homomorphism from (S, ⊗) to (S′, ⊗′).
Theorem (Filter-Embedding Fusion)
Given a set A, a finite monoid (M, ⊙), a monoid homomorphism hom from ([A], ++) to (M, ⊙), a semiring (S, ⊕, ⊗), a semiring homomorphism aggregate from ({[A]}, ∪, ×++) to (S, ⊕, ⊗), a function ok : M → Bool, and a polymorphic semiring generator generate, the following equation holds:

aggregate ◦ filter (ok ◦ hom) ◦ generate(∪, ×++) (λx → {[x]})
= postprocess(M, ok) ◦ generate(⊕M, ⊗M) (λx → aggregate(M) {[x]})

The result of the fusion is an efficient algorithm in the form of a list homomorphism.
List Homomorphism
List homomorphisms [Bird, 87; Cole, 95] are a class of recursive functions.
Definition of List Homomorphism
A function h is a list homomorphism if there is an associative operator ⊙ such that for any lists x and y:
h (x ++ y) = h(x) ⊙ h(y)
where ++ is list concatenation, h [a] = f a, and h(x) ⊙ id = h(x) with id the identity element of ⊙.
Instance of a list homomorphism:
sum [a] = a
sum (x ++ y) = sum x + sum y
A list homomorphism can be automatically parallelized by MapReduce [Liu et al., Euro-Par 2011].
Evaluation on Hadoop
We test 3.2GB data on {2 , 4, 8, 16, 32} nodes clusters and 32
GB data on {32, 64} nodes clusters
2 nodes 4 nodes 8 nodes 16 nodes 32 nodes 64 nodes
time(sec.) 1602 882 482 317 961 511
speedup – × 1.82 × 1.83 × 1.52 – × 1.88