Haskell-awk (Hawk) is a text processing tool that treats input streams as sequences of records similar to awk, but uses the Haskell programming language instead of Awk's programming language. It allows defining programs as Haskell expressions that can be lazily evaluated and composed using functions. Hawk aims to be fast by caching the compiled Haskell context and only recompiling when needed. It uses the haskell-src-exts and hint libraries to parse, interpret, and run Haskell programs on input streams.
Scalding is a Scala library built on top of Cascading that simplifies the process of defining MapReduce programs. It uses a functional programming approach where data flows are represented as chained transformations on TypedPipes, similar to operations on Scala iterators. This avoids some limitations of the traditional Hadoop MapReduce model by allowing for more flexible multi-step jobs and features like joins. The Scalding TypeSafe API also provides compile-time type safety compared to Cascading's runtime type checking.
"HFSP: Size-based Scheduling for Hadoop" presentation for BigData 2014Mario Pastorelli
The document proposes a new scheduler called Hadoop Fair Sojourn Protocol (HFSP) that prioritizes jobs based on estimated job size to improve system response times for short jobs. HFSP estimates job sizes both offline using prior information and online by monitoring early task performance. It uses a "virtual size" that decreases over time to avoid starvation of large jobs. Experiments show HFSP reduces average job completion times by 16% compared to the default Hadoop scheduler.
Accumulo is an open-source implementation of Google's BigTable distributed storage system. It was developed to store large amounts of structured data across commodity hardware. Accumulo allows for fast retrieval of data through its use of composite keys and indexes while also being scalable. Some key features include support for range queries, fast query speeds with the right schema, and built-in caching. The document provides an example of how tweets could be stored in Accumulo, either in a denormalized format to retrieve a user's timeline or across different tables to support different types of analyses.
Better Living Through Analytics - Strategies for Data Decisions (Product School)
Data is king! Get ready to understand how a successful analytics team can empower managers from product, marketing, and other areas to make effective, data-driven decisions.
Louis Cialdella, a data scientist at ZipRecruiter, shared some case studies and successful strategies that he has used at ZipRecruiter as well as previous experiences. The purpose of this data talk was to enlighten people on how to make sure that analysts can successfully partner with other departments and get them the information they need to do great things.
This document discusses various techniques for estimating project duration for software projects, including top-down, bottom-up, expert judgement, historical comparison, function points, object points, the critical path method (CPM), and the program evaluation and review technique (PERT). It provides details on each technique, such as how top-down estimation treats effort as a function of project size, while bottom-up estimation involves participation from those doing the work in setting estimates. CPM and PERT are discussed in more detail, for example how CPM captures activities and their relationships in a graph. The document aims to help determine the best technique for estimating the duration of a given software project.
Story Points considered harmful – a new look at estimation techniques (Vasco Duarte)
The document discusses alternatives to using story points for estimating work in agile software development projects. It analyzes claims made about the benefits of story points and finds limited evidence to support these claims. The document proposes using the number of completed backlog items per sprint as a simpler and more efficient metric for planning, tracking progress, and estimating release dates. Correlation data from multiple projects shows story points and number of items completed tend to provide similar information about the amount of work done.
These slides were prepared for a talk I presented at Eindhoven University of Technology, Ghent University, and KU Leuven in June 2019. The main thesis is that project activities are distributed as per the lognormal, but various complications may mask that. To resolve these complications we may need to partition the data, account for the Parkinson effect (early completions may be hidden), and account for rounding. It is also important to note that even under similar conditions some projects are slower on average than others, thus implying that we cannot use the ubiquitous independence assumption. Instead, the simplest model we can recommend is that projects are subject to linear association. Linear Association posits that there is a random bias element representing the between-projects variation. For prediction, we must take into account both the between-projects and within-project variation. If we do all that, we can correct one of the major shortcomings of PERT, namely its reliance on the invalid beta distribution and the independence assumption.
Data Science vs. Business Analytics: Business Analytics is the statistical study of business data to gain insights and uses mostly structured data, whereas Data Science is the study of data using statistics, algorithms, and technology and uses both structured and unstructured data.
From Lab to Factory: Creating value with data (Peadar Coyle)
The document discusses lessons learned in developing data products and deploying data science projects from lab to production. It covers challenges such as integrating with other teams, managing stakeholders, monitoring models in production, and ensuring projects are supported by the necessary infrastructure, tools, and culture. Recommendations include focusing on clean, small data problems first, adopting DevOps practices like monitoring and pipelines, and improving communication between data scientists and other roles.
Weapons of Math Instruction: Evolving from Data-Driven to Science-Driven (indeedeng)
Donal McMahon, Director of Data Science at Indeed, presented how to transition from data-driven to science-driven product development. You’ll make better business decisions. It’s provable!
Data Science for Business Managers - An intro to ROI for predictive analytics (Akin Osman Kazakci)
This module addresses critical business aspects of launching a predictive analytics project. It discusses how to tie the project to business KPIs, introduces the notion of a data hunt for planning and acquiring external data to improve predictions, and explains model quality and its role in the ROI of data and prediction tasks. The module concludes with a glimpse of how collaborative data challenges can quickly improve predictive model quality.
Revisiting Size-Based Scheduling with Estimated Job Sizes (Matteo Dell'Amico)
We study size-based schedulers, and focus on the impact of inaccurate job size information on response time and fairness. Our intent is to revisit previous results, which allude to performance degradation for even small errors on job size estimates, thus limiting the applicability of size-based schedulers.
We show that scheduling performance is tightly connected to workload characteristics: in the absence of large skew in the job size distribution, even extremely imprecise estimates suffice to outperform size-oblivious disciplines. Instead, when job sizes are heavily skewed, known size-based disciplines suffer.
In this context, we show -- for the first time -- the dichotomy of over-estimation versus under-estimation. The former is, in general, less problematic than the latter, as its effects are localized to individual jobs. Instead, under-estimation leads to severe problems that may affect a large number of jobs.
We present an approach to mitigate these problems: our technique requires no complex modifications to the original scheduling policies and performs very well. To support our claim, we proceed with a simulation-based evaluation that covers an unprecedentedly large parameter space and takes into account a variety of synthetic and real workloads.
As a consequence, we show that size-based scheduling is practical and outperforms alternatives in a wide array of use-cases, even in presence of inaccurate size information.
SE - Lecture 11 - Software Project Estimation.pptx (TangZhiSiang)
This document discusses software project estimation. It begins by outlining the major activities of software project planning, which includes estimation. It then describes the estimation process, which involves predicting time, cost, and resources required. Several estimation techniques are discussed, including using historical metrics, task breakdown, size estimates, and automated tools. Accuracy depends on properly defining scope, available metrics, and team abilities. The document provides examples of using lines of code and function point approaches to estimate effort and cost.
From Lab to Factory: Or how to turn data into value (Peadar Coyle)
We've all heard of 'big data' and data science, but how do we convert these trends into actual business value? I share case studies and technology tips, and talk about the challenges of the data science process. This is all based on two years of in-the-field research on deploying models and going from prototypes to production.
These are slides from my talk at PyCon Ireland 2015
Symposium 2019: Gestion de projet en Intelligence Artificielle (PMI-Montréal)
The objective of a project involving artificial intelligence is to speed up decision making, or even to automate the actions that must be carried out as part of a task. The main difficulty is that it is impossible to know in advance which AI method will achieve the objective. Managing such a project is often atypical and requires flexibility while still respecting budget constraints. For this reason a waterfall approach should be avoided; we will see, however, that it can be useful in certain phases of the project.
In this presentation we will cover the three phases of the project: prototyping the solution, putting it into production, and strategies for maintaining the solution over the longer term.
Dr. Nathanael Weill
Open & reproducible research - What can we do in practice? (Felix Z. Hoffmann)
- There is a reproducibility crisis in computational research even when code is made available. Out of 206 computational studies in Science magazine since a policy change mandating sharing, only 26 directly provided their code and data. Of those judged potentially reproducible when code was available, more than half still required significant effort to reproduce.
- Making research fully reproducible requires addressing issues like difficult computational environments, long run times, dependency on previous results, and clarity on what is required to reproduce a single finding. Following principles like ensuring code is re-runnable, repeatable, reproducible, reusable, and replicable can help achieve reproducibility. Publishing code on platforms like Zenodo and OSF can also aid reproducibility.
This document discusses concepts related to estimation and velocity in Scrum projects. It describes how to estimate product backlog items using story points or ideal days with relative sizing. Velocity is defined as the amount of work completed each sprint by totaling the sizes of completed backlog items. A team's velocity range is used for planning and process improvement. Planning poker is presented as a consensus-based technique for sizing items through discussion.
The document proposes a new preemption and priority-based scheduling algorithm for Hadoop Distributed File System (HDFS). It begins with an introduction to scheduling in Hadoop and describes existing scheduling algorithms like FIFO, fair, and capacity schedulers. It then discusses the limitations of these schedulers in handling priorities and preemption. The proposed algorithm allows the scheduler to make more efficient decisions by prioritizing jobs with high priority and preempting low priority jobs. Finally, a comparison table summarizes the different scheduling strategies for HDFS in terms of their scheduling methodology, benefits, limitations, and behaviors with priority and non-priority tasks.
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc... (Sarah Aerni)
Slides from the Pivotal Open Source Hub Meetup
"Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science!"
As the need for data science as a key differentiator grows in all industries, from large corporations to startups, the need to get to results quickly is enabled by sharing ideas and methods in the community. The data science team at Pivotal leverages and contributes to this community of publicly available and open source technologies as part of their practice. We will share the resources we use by highlighting specific toolkits for building models (e.g. MADlib, R) and visualization (e.g. Gephi and Circos) along with their benefits and limitations by sharing examples from Pivotal's data science engagements. At the end of this session we hope to have answered the questions: Where can I get started with Data Science? Which toolkit is most appropriate for building a model with my dataset? How can I visualize my results to have the greatest impact?
Bio: Sarah Aerni is a member of the Pivotal Data Science team with a focus on healthcare and life science. She has a background in the field of Bioinformatics, developing tools to help biomedical researchers understand their data. She holds a B.S. In Biology with a specialization in Bioinformatics and minor in French Literature from UCSD, and an M.S. and Ph.D in Biomedical Informatics from Stanford University. During her time as a researcher she focused on the interface between machine learning and biology, building computational models enabling research for a broad range of fields in biomedicine. She also co-founded a start-up providing informatics services to researchers and small companies. At Pivotal she works with customers in life science and healthcare building models to derive insight and business value from their data.
Machine learning at b.e.s.t. summer university (László Kovács)
Machine learning involves using patterns in data to make predictions without being explicitly programmed. This document provides an introduction to machine learning concepts through a real-world project example. It discusses what data scientists do, including prediction, anomaly detection, gaining insights, and decision making. The document then demonstrates machine learning applications in areas like predicting flight delays or employee attrition. It also covers important steps like data preprocessing, feature engineering, and building predictive models using decision trees.
Effort estimation for software development (Spyros Ktenas)
Software effort estimation has been an important issue for almost everyone in software industry at some point. Below I will try to give some basic details on methods, best practices, common mistakes and available tools.
You may also check a tool implementing methods for estimation at http://effort-estimation.gatory.com/
Spyros Ktenas
http://open-works.org/profiles/spyros-ktenas
Afternoons with Azure - Azure Machine Learning (CCG)
Journey through programming languages such as R and Python that can be used for machine learning. Next, explore Azure Machine Learning Studio and see the interconnectivity.
For more information about Microsoft Azure, call (813) 265-3239 or visit www.ccganalytics.com/solutions
Early Analysis and Debugging of Linked Open Data Cubes (Enrico Daga)
The release of the Data Cube Vocabulary specification introduces a standardised method for publishing statistics following the linked data principles. However, a statistical dataset can be very complex, and so understanding how to get value out of it may be hard. Analysts need the ability to quickly grasp the content of the data to be able to make use of it appropriately. In addition, while remodelling the data, data cube publishers need support to detect bugs and issues in the structure or content of the dataset. There are several aspects of RDF, the Data Cube vocabulary and linked data that can help with these issues however, including that they make the data "self-descriptive". Here, we attempt to answer the question "How feasible is it to use this feature to give an overview of the data in a way that would facilitate debugging and exploration of statistical linked open data?" We present a tool that automatically builds interactive facets as diagrams out of a Data Cube representation without prior knowledge of the data content to be used for debugging and early analysis. We show how this tool can be used on a large, complex dataset and we discuss the potential of this approach.
This document provides an overview of big data analytics. It discusses challenges of big data like increased storage needs and handling varied data formats. The document introduces Hadoop and Spark as approaches for processing large, unstructured data at scale. Descriptive and predictive analytics are defined, and a sample use case of sentiment analysis on Twitter data is presented, demonstrating data collection, modeling, and scoring workflows. Finally, the author's skills in areas like Java, Python, SQL, Hadoop, and predictive analytics tools are outlined.
Literature Review Basics and Understanding Reference Management.pptx (Dr Ramhari Poudyal)
A three-day training on academic research, focusing on analytical tools, at United Technical College, supported by the University Grants Commission, Nepal. 24-26 May 2024
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM (HODECEDSIET)
Time Division Multiplexing (TDM) is a method of transmitting multiple signals over a single communication channel by dividing the signal into many segments, each having a very short duration of time. These time slots are then allocated to different data streams, allowing multiple signals to share the same transmission medium efficiently. TDM is widely used in telecommunications and data communication systems.
### How TDM Works
1. **Time Slots Allocation**: The core principle of TDM is to assign distinct time slots to each signal. During each time slot, the respective signal is transmitted, and then the process repeats cyclically. For example, if there are four signals to be transmitted, the TDM cycle will divide time into four slots, each assigned to one signal.
2. **Synchronization**: Synchronization is crucial in TDM systems to ensure that the signals are correctly aligned with their respective time slots. Both the transmitter and receiver must be synchronized to avoid any overlap or loss of data. This synchronization is typically maintained by a clock signal that ensures time slots are accurately aligned.
3. **Frame Structure**: TDM data is organized into frames, where each frame consists of a set of time slots. Each frame is repeated at regular intervals, ensuring continuous transmission of data streams. The frame structure helps in managing the data streams and maintaining the synchronization between the transmitter and receiver.
4. **Multiplexer and Demultiplexer**: At the transmitting end, a multiplexer combines multiple input signals into a single composite signal by assigning each signal to a specific time slot. At the receiving end, a demultiplexer separates the composite signal back into individual signals based on their respective time slots.
### Types of TDM
1. **Synchronous TDM**: In synchronous TDM, time slots are pre-assigned to each signal, regardless of whether the signal has data to transmit or not. This can lead to inefficiencies if some time slots remain empty due to the absence of data.
2. **Asynchronous TDM (or Statistical TDM)**: Asynchronous TDM addresses the inefficiencies of synchronous TDM by allocating time slots dynamically based on the presence of data. Time slots are assigned only when there is data to transmit, which optimizes the use of the communication channel.
### Applications of TDM
- **Telecommunications**: TDM is extensively used in telecommunication systems, such as in T1 and E1 lines, where multiple telephone calls are transmitted over a single line by assigning each call to a specific time slot.
- **Digital Audio and Video Broadcasting**: TDM is used in broadcasting systems to transmit multiple audio or video streams over a single channel, ensuring efficient use of bandwidth.
- **Computer Networks**: TDM is used in network protocols and systems to manage the transmission of data from multiple sources over a single network medium.
### Advantages of TDM
- **Efficient Use of Bandwidth**: TDM all
Introduction: e-waste definition, sources of e-waste, hazardous substances in e-waste, effects of e-waste on the environment and human health, need for e-waste management, e-waste handling rules, waste minimization techniques for managing e-waste, recycling of e-waste, disposal and treatment methods of e-waste, mechanism of extraction of precious metals from leaching solution, global scenario of e-waste, e-waste in India, case studies.
International Conference on NLP, Artificial Intelligence, Machine Learning an... (gerogepatton)
International Conference on NLP, Artificial Intelligence, Machine Learning and Applications (NLAIM 2024) offers a premier global platform for exchanging insights and findings in the theory, methodology, and applications of NLP, Artificial Intelligence, Machine Learning, and their applications. The conference seeks substantial contributions across all key domains of NLP, Artificial Intelligence, Machine Learning, and their practical applications, aiming to foster both theoretical advancements and real-world implementations. With a focus on facilitating collaboration between researchers and practitioners from academia and industry, the conference serves as a nexus for sharing the latest developments in the field.
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions (Victor Morales)
K8sGPT is a tool that analyzes and diagnoses Kubernetes clusters. This presentation was used to share the requirements and dependencies to deploy K8sGPT in a local environment.
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw... (IJECEIAES)
Medical image analysis has witnessed significant advancements with deep learning techniques. In the domain of brain tumor segmentation, the ability to precisely delineate tumor boundaries from magnetic resonance imaging (MRI) scans holds profound implications for diagnosis. This study presents an ensemble convolutional neural network (CNN) with transfer learning, integrating the state-of-the-art Deeplabv3+ architecture with the ResNet18 backbone. The model is rigorously trained and evaluated, exhibiting remarkable performance metrics, including an impressive global accuracy of 99.286%, a high-class accuracy of 82.191%, a mean intersection over union (IoU) of 79.900%, a weighted IoU of 98.620%, and a Boundary F1 (BF) score of 83.303%. Notably, a detailed comparative analysis with existing methods showcases the superiority of our proposed model. These findings underscore the model's competence in precise brain tumor localization, underscoring its potential to revolutionize medical image analysis and enhance healthcare outcomes. This research paves the way for future exploration and optimization of advanced CNN models in medical imaging, emphasizing addressing false positives and resource efficiency.
Using recycled concrete aggregates (RCA) for pavements is crucial to achieving sustainability. Implementing RCA for new pavement can minimize carbon footprint, conserve natural resources, reduce harmful emissions, and lower life cycle costs. Compared to natural aggregate (NA), RCA pavement has fewer comprehensive studies and sustainability assessments.
A review on techniques and modelling methodologies used for checking electrom... (nooriasukmaningtyas)
The proper function of the integrated circuit (IC) in an inhibiting electromagnetic environment has always been a serious concern throughout the decades of revolution in the world of electronics, from disjunct devices to today’s integrated circuit technology, where billions of transistors are combined on a single chip. The automotive industry and smart vehicles in particular, are confronting design issues such as being prone to electromagnetic interference (EMI). Electronic control devices calculate incorrect outputs because of EMI and sensors give misleading values which can prove fatal in case of automotives. In this paper, the authors have non exhaustively tried to review research work concerned with the investigation of EMI in ICs and prediction of this EMI using various modelling methodologies and measurement setups.
Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapte... (University of Maribor)
Slides from talk presenting:
Aleš Zamuda: Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapter and Networking.
Presentation at IcETRAN 2024 session:
"Inter-Society Networking Panel GRSS/MTT-S/CIS
Panel Session: Promoting Connection and Cooperation"
IEEE Slovenia GRSS
IEEE Serbia and Montenegro MTT-S
IEEE Slovenia CIS
11TH INTERNATIONAL CONFERENCE ON ELECTRICAL, ELECTRONIC AND COMPUTING ENGINEERING
3-6 June 2024, Niš, Serbia
Size-Based Disciplines for Job Scheduling in Data-Intensive Scalable Computing Systems
1. Size-Based Disciplines for Job Scheduling
in Data-Intensive Scalable Computing
Systems
Mario Pastorelli
Jury:
Prof. Ernst BIERSACK
Prof. Guillaume URVOY-KELLER
Prof. Giovanni CHIOLA
Dr. Patrick BROWN
Supervisor:
Prof. Pietro MICHIARDI
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 1
2. Context 1/3
In 2004, Google presented MapReduce, a system used to process
large quantities of data. The key ideas are:
Client-Server architecture
Move the computation, not the data
Programming model inspired by Lisp's list functions:
map : (k1, v1) → [(k2, v2)]
reduce : (k2, [v2]) → [(k3, v3)]
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 2
3. Context 1/3
In 2004, Google presented MapReduce, a system used to process
large quantities of data. The key ideas are:
Client-Server architecture
Move the computation, not the data
Programming model inspired by Lisp's list functions:
map : (k1, v1) → [(k2, v2)]
reduce : (k2, [v2]) → [(k3, v3)]
Hadoop, the main open-source implementation of MapReduce, is
released one year later. It is widely adopted and used by many
important companies (Facebook, Twitter, Yahoo, IBM, Microsoft. . . )
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 2
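To make the programming model concrete, here is a minimal word-count sketch in Python (my own illustration, not part of the slides); the mapper and reducer follow the map : (k1, v1) → [(k2, v2)] and reduce : (k2, [v2]) → [(k3, v3)] signatures above, and the shuffle step is emulated with an in-memory sort.

```python
from itertools import groupby
from operator import itemgetter

# Word count in the MapReduce programming model.
# map    : (k1, v1)   -> [(k2, v2)]   here: (filename, line)   -> [(word, 1)]
# reduce : (k2, [v2]) -> [(k3, v3)]   here: (word, [1, 1, ...]) -> [(word, count)]
def map_fn(filename, line):
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    return [(word, sum(counts))]

def run_mapreduce(records):
    # map phase
    intermediate = [kv for key, value in records for kv in map_fn(key, value)]
    # "shuffle": group the intermediate pairs by key (an in-memory sort stands in for it)
    intermediate.sort(key=itemgetter(0))
    grouped = ((k, [v for _, v in group])
               for k, group in groupby(intermediate, key=itemgetter(0)))
    # reduce phase
    return [kv for k, values in grouped for kv in reduce_fn(k, values)]

print(run_mapreduce([("doc1", "move the computation not the data")]))
# [('computation', 1), ('data', 1), ('move', 1), ('not', 1), ('the', 2)]
```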
4. Context 2/3
In MapReduce, the Scheduling Policy is fundamental
Complexity of the system
Distributed resources
Multiple jobs running in parallel
Jobs are composed of two sequential phases, the map and the
reduce phase
Each phase is composed of multiple tasks, where each task runs on a
slot of a client
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 3
5. Context 2/3
In MapReduce, the Scheduling Policy is fundamental
Complexity of the system
Distributed resources
Multiple jobs running in parallel
Jobs are composed of two sequential phases, the map and the
reduce phase
Each phase is composed of multiple tasks, where each task runs on a
slot of a client
Heterogeneous workloads
Big differences in job sizes
Interactive jobs (e.g. data exploration, algorithm tuning,
orchestration jobs. . . ) must run as soon as possible. . .
. . . without impacting batch jobs too much
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 3
6. Context 3/3
Schedulers (strive to) optimize one or more metrics. For example:
Fairness: how a job is treated compared to the others
Mean response time of jobs: indicates the responsiveness of the system
. . .
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 4
7. Context 3/3
Schedulers (strive to) optimize one or more metrics. For example:
Fairness: how a job is treated compared to the others
Mean response time of jobs: indicates the responsiveness of the system
. . .
Schedulers for Hadoop, e.g. the Fair Scheduler, focus on fairness
rather than other metrics
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 4
8. Context 3/3
Schedulers (strive to) optimize one or more metrics. For example:
Fairness: how a job is treated compared to the others
Mean response time of jobs: indicates the responsiveness of the system
. . .
Schedulers for Hadoop, e.g. the Fair Scheduler, focus on fairness
rather than other metrics
Short response times are very important! Usually there are one or
more system administrators doing manual, ad-hoc configuration
Fine-tuning of the scheduler parameters
Configuration of pools of jobs with priorities
Complex, error-prone and difficult to adapt to workload/cluster
changes
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 4
9. Motivations
Size-based schedulers are more efficient than other schedulers (in
theory). . .
Job priority based on the job size
Focus resources on a few jobs instead of splitting them among many
jobs
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 5
10. Motivations
Size-based schedulers are more efficient than other schedulers (in
theory). . .
Job priority based on the job size
Focus resources on a few jobs instead of splitting them among many
jobs
. . . but (in practice) they are not adopted in real systems
Job size is unknown
No studies on applicability to distributed systems
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 5
11. Motivations
Size-based schedulers are more efficient than other schedulers (in
theory). . .
Job priority based on the job size
Focus resources on a few jobs instead of splitting them among many
jobs
. . . but (in practice) they are not adopted in real systems
Job size is unknown
No studies on applicability to distributed systems
MapReduce is suitable for size-based scheduling
We don’t have the job size but we have the time to estimate it
No perfect estimation is required . . .
. . . as long as very different jobs are sorted correctly
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 5
12. Size-Based Schedulers: Example
Job Arrival Time Size
job1 0s 30s
job2 10s 10s
job3 15s 10s
Processor Sharing
Shortest Remaining
Processing Time
(SRPT)
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 6
13. Size-Based Schedulers: Example
Job Arrival Time Size
job1 0s 30s
job2 10s 10s
job3 15s 10s
Scheduler AVG sojourn time
Processor Sharing 35s
SRPT 25s
Processor Sharing
Shortest Remaining
Processing Time
(SRPT)
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 6
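As a sanity check on the numbers above, the following minimal Python sketch (mine, using a simple fixed-step simulation rather than any real scheduler code) replays the three jobs under Processor Sharing and SRPT and reproduces the 35 s and 25 s average sojourn times.

```python
# Fixed-step simulation of the example above: three jobs on one machine,
# served either with Processor Sharing or with SRPT.
JOBS = {"job1": (0.0, 30.0), "job2": (10.0, 10.0), "job3": (15.0, 10.0)}  # arrival, size

def mean_sojourn(jobs, policy, dt=0.001):
    pending = dict(jobs)      # jobs that have not arrived yet
    remaining = {}            # job -> remaining work (seconds)
    finish = {}               # job -> completion time
    t = 0.0
    while len(finish) < len(jobs):
        for name, (arrival, size) in list(pending.items()):
            if arrival <= t:
                remaining[name] = size
                del pending[name]
        active = [n for n, r in remaining.items() if r > 1e-9]
        if active:
            if policy == "PS":         # capacity split equally among active jobs
                served, share = active, dt / len(active)
            else:                      # SRPT: all capacity to the shortest remaining job
                served, share = [min(active, key=remaining.get)], dt
            for n in served:
                remaining[n] -= share
                if remaining[n] <= 1e-9:
                    finish[n] = t + dt
        t += dt
    return sum(finish[n] - jobs[n][0] for n in jobs) / len(jobs)

print(round(mean_sojourn(JOBS, "PS"), 1))    # ~35.0 s
print(round(mean_sojourn(JOBS, "SRPT"), 1))  # ~25.0 s
```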
14. Challenges
Job sizes are unknown: how do you obtain an approximation of a
job size while the job is running?
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 7
15. Challenges
Job sizes are unknown: how do you obtain an approximation of a
job size while the job is running?
Estimation errors: how do you cope with an approximated size?
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 7
16. Challenges
Job sizes are unknown: how do you obtain an approximation of a
job size while the job is running?
Estimation errors: how do you cope with an approximated size?
Scheduler for real and distributed systems: can we design a
size-based scheduler that works for existing systems?
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 7
17. Challenges
Job sizes are unknown: how do you obtain an approximation of a
job size while the job is running?
Estimation errors: how do you cope with an approximated size?
Scheduler for real and distributed systems: can we design a
size-based scheduler that works for existing systems?
Job preemption: preemption is fundamental for scheduling, but
current systems support it only partially. Can we improve that?
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 7
18. The Hadoop Fair Sojourn Protocol
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 8
19. Hadoop Fair Sojourn Protocol [BIGDATA 2013]
Size-based scheduler for Hadoop that is fair and achieves small response
times
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 9
20. Hadoop Fair Sojourn Protocol [BIGDATA 2013]
Size-based scheduler for Hadoop that is fair and achieves small response
times
The map and the reduce phases are treated independently and thus
a job has two sizes
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 9
21. Hadoop Fair Sojourn Protocol [BIGDATA 2013]
Size-based scheduler for Hadoop that is fair and achieves small response
times
The map and the reduce phases are treated independently and thus
a job has two sizes
Size estimation is done in two steps by the Estimation Module
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 9
22. Hadoop Fair Sojourn Protocol [BIGDATA 2013]
Size-based scheduler for Hadoop that is fair and achieves small response
times
The map and the reduce phases are treated independently and thus
a job has two sizes
Size estimation is done in two steps by the Estimation Module
Estimated sizes are then given as input to the Aging Module, which
converts them into virtual sizes to avoid starvation
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 9
23. Hadoop Fair Sojourn Protocol [BIGDATA 2013]
Size-based scheduler for Hadoop that is fair and achieves small response
times
The map and the reduce phases are treated independently and thus
a job has two sizes
Size estimation is done in two steps by the Estimation Module
Estimated sizes are then given as input to the Aging Module, which
converts them into virtual sizes to avoid starvation
Schedule jobs with the smallest virtual sizes
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 9
24. Estimation Module
Two ways to estimate a job size:
Offline: based on the information available a priori (num tasks, block
size, past history . . . ):
available from job submission, but not very precise
Online: based on the performance of a subset of t tasks:
needs time for training, but more precise
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 10
25. Estimation Module
Two ways to estimate a job size:
Offline: based on the information available a priori (num tasks, block
size, past history . . . ):
available from job submission, but not very precise
Online: based on the performance of a subset of t tasks:
needs time for training, but more precise
We need both:
Offline estimation for the initial size, because jobs need a size from the
moment they are submitted
Online estimation because it is more precise: when it completes, the
job size is updated to the final size
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 10
26. Estimation Module
Two ways to estimate a job size:
Offline: based on the information available a priori (num tasks, block
size, past history . . . ):
available from job submission, but not very precise
Online: based on the performance of a subset of t tasks:
needs time for training, but more precise
We need both:
Offline estimation for the initial size, because jobs need a size from the
moment they are submitted
Online estimation because it is more precise: when it completes, the
job size is updated to the final size
Tiny Jobs: jobs with fewer than t tasks are considered tiny and have
the highest priority possible
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 10
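The following sketch illustrates how the two estimates and the tiny-job rule could fit together. It is only an illustration under assumptions of mine (a single a-priori per-task duration AVG_TASK_TIME, a training threshold TRAINING_TASKS standing in for t, and a simple mean-based online estimate); it is not the HFSP code.

```python
TRAINING_TASKS = 5          # t: tasks sampled for the online estimate (assumed value)
AVG_TASK_TIME = 30.0        # seconds per task, e.g. from past history (assumed value)

def offline_size(num_tasks):
    # Available at submission time: a coarse guess from a-priori information.
    return num_tasks * AVG_TASK_TIME

def online_size(num_tasks, sampled_durations):
    # Available once t training tasks have finished: scale their mean duration
    # by the total number of tasks in the phase.
    mean_task = sum(sampled_durations) / len(sampled_durations)
    return num_tasks * mean_task

def phase_size(num_tasks, sampled_durations):
    if num_tasks <= TRAINING_TASKS:
        return 0.0                               # tiny job: highest possible priority
    if len(sampled_durations) < TRAINING_TASKS:
        return offline_size(num_tasks)           # initial (imprecise) estimate
    return online_size(num_tasks, sampled_durations)   # final (refined) estimate

print(phase_size(100, []))                     # 3000.0 -> offline guess at submission
print(phase_size(100, [12, 15, 11, 14, 13]))   # 1300.0 -> refined after training
```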
27. Aging Module 1/2
Aging: the longer a job stays in the queue, the higher its priority becomes
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 11
28. Aging Module 1/2
Aging: the longer a job stays in the queue, the higher its priority becomes
A technique used in the literature to age jobs is the Virtual Size
Each job is run in a simulation using processor sharing
The output of the simulation is the job virtual size, that is the job size
aged by the amount of time the job has spent in the simulation
Jobs are sorted by remaining virtual size and resources are assigned to
the job with the smallest remaining virtual size
[Figure: two plots over time (s) for Job 1, Job 2 and Job 3: job virtual time under the Virtual Size (Simulation), and job size under the Real Size (Real Scheduling).]
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 11
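A compact illustration (my own sketch, reduced to a single shared resource rather than a full cluster) of the virtual-size idea: a processor-sharing simulation ages all queued jobs, and the job with the smallest remaining virtual size is the one the real scheduler serves next.

```python
def advance_virtual_time(remaining_virtual, elapsed):
    """Age all queued jobs by running a processor-sharing simulation for
    `elapsed` seconds of virtual capacity, split equally among the jobs
    still alive in the simulation."""
    while elapsed > 1e-12:
        alive = [j for j, r in remaining_virtual.items() if r > 1e-12]
        if not alive:
            break
        share = elapsed / len(alive)
        # Serve at most until the smallest job would finish, then redistribute.
        step = min(share, min(remaining_virtual[j] for j in alive))
        for j in alive:
            remaining_virtual[j] -= step
        elapsed -= step * len(alive)
    return remaining_virtual

# Estimated sizes of three queued jobs (seconds of work).
virtual = {"job1": 30.0, "job2": 10.0, "job3": 10.0}
advance_virtual_time(virtual, 24.0)    # 24 s of simulated capacity, 8 s per job
print(virtual)                         # {'job1': 22.0, 'job2': 2.0, 'job3': 2.0}
print(min(virtual, key=virtual.get))   # job2: next to be served in the real cluster
```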
29. Aging Module 2/2
In HFSP the estimated sizes are converted into virtual sizes by the
Aging Module
The simulation is run in a virtual cluster that has the same resources
of the real one
Simulating Processor Sharing with Max-Min Fair Sharing
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 12
30. Aging Module 2/2
In HFSP the estimated sizes are converted into virtual sizes by the
Aging Module
The simulation is run in a virtual cluster that has the same resources
of the real one
Simulating Processor Sharing with Max-Min Fair Sharing
The number of tasks of a job determines how fast it can age
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 12
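For the slot-based setting, max-min fair sharing can be sketched as a simple water-filling loop (again my own illustration, not the HFSP implementation): slots are handed out in rounds of equal shares, and capacity a job cannot use because it has too few runnable tasks is redistributed to the others.

```python
def max_min_fair_slots(total_slots, demands):
    """demands: job -> number of runnable tasks; returns job -> allocated slots."""
    allocation = {job: 0 for job in demands}
    unsatisfied = {job for job, d in demands.items() if d > 0}
    slots_left = total_slots
    while slots_left > 0 and unsatisfied:
        fair_share = max(slots_left // len(unsatisfied), 1)
        for job in sorted(unsatisfied):
            if slots_left == 0:
                break
            grant = min(fair_share, demands[job] - allocation[job], slots_left)
            allocation[job] += grant
            slots_left -= grant
        unsatisfied = {j for j in unsatisfied if allocation[j] < demands[j]}
    return allocation

# 40 map slots shared by jobs with very different numbers of pending tasks:
# the small jobs are fully satisfied, the big one gets the leftover capacity.
print(max_min_fair_slots(40, {"jobA": 100, "jobB": 10, "jobC": 4}))
# {'jobA': 26, 'jobB': 10, 'jobC': 4}
```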
31. Task Scheduling Policy
When a job is submitted
If it is tiny, then assign it a final size of 0
Else
assign an initial size to it based on its number of tasks
mark the job as in training stage and select t training tasks
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 13
32. Task Scheduling Policy
When a job is submitted
If it is tiny, then assign it a final size of 0
Else
assign an initial size to it based on its number of tasks
mark the job as in training stage and select t training tasks
When a resource becomes available
If there are jobs in the training stage then assign a task from the job
with the smallest initial virtual size
Else assign a task from the job with the smallest final virtual size
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 13
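A schematic rendering of this dispatch logic in Python, with hypothetical names of mine (Job, TINY_THRESHOLD, AVG_TASK_TIME) rather than Hadoop or HFSP identifiers, and with the aging of sizes omitted for brevity:

```python
from dataclasses import dataclass

TINY_THRESHOLD = 5      # t: jobs with at most this many tasks are "tiny" (assumed value)
AVG_TASK_TIME = 30.0    # a-priori per-task duration used for the initial size (assumed)

@dataclass
class Job:
    name: str
    num_tasks: int
    training: bool = False
    initial_virtual_size: float = 0.0   # aged offline (initial) size
    final_virtual_size: float = 0.0     # aged online (final) size

def on_job_submission(job):
    if job.num_tasks <= TINY_THRESHOLD:
        job.final_virtual_size = 0.0     # tiny job: highest possible priority
        job.training = False
    else:
        job.initial_virtual_size = job.num_tasks * AVG_TASK_TIME
        job.training = True              # its t training tasks will refine the size

def pick_job(queue):
    """Called when a slot frees up: which job receives the next task?"""
    in_training = [j for j in queue if j.training]
    if in_training:
        # Some job is still being profiled: use the (aged) initial sizes.
        return min(in_training, key=lambda j: j.initial_virtual_size)
    return min(queue, key=lambda j: j.final_virtual_size)

queue = [Job("huge", 500), Job("medium", 40)]
for job in queue:
    on_job_submission(job)
print(pick_job(queue).name)   # "medium": smallest initial virtual size among training jobs
```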
34. Experimental Setup
20 TaskTrackers (MapReduce clients) for a total of 40 map and 20
reduce slots
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 15
35. Experimental Setup
20 TaskTrackers (MapReduce clients) for a total of 40 map and 20
reduce slots
Three kinds of workloads inspired by existing traces
Bin | Dataset Size | Avg. num. Map Tasks | DEV | TEST | PROD
1   | 1 GB         | < 5                 | 65% | 30%  |  0%
2   | 10 GB        | 10 - 50             | 20% | 40%  | 10%
3   | 100 GB       | 50 - 150            | 10% | 10%  | 60%
4   | 1 TB         | > 150               |  5% | 20%  | 30%
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 15
36. Experimental Setup
20 TaskTrackers (MapReduce clients) for a total of 40 map and 20
reduce slots
Three kinds of workloads inspired by existing traces
Bin | Dataset Size | Avg. num. Map Tasks | DEV | TEST | PROD
1   | 1 GB         | < 5                 | 65% | 30%  |  0%
2   | 10 GB        | 10 - 50             | 20% | 40%  | 10%
3   | 100 GB       | 50 - 150            | 10% | 10%  | 60%
4   | 1 TB         | > 150               |  5% | 20%  | 30%
Each experiment is composed by 100 jobs taken from PigMix and
has been executed 5 times
Mario Pastorelli (EURECOM) Ph.D. Thesis Defense 18 July 2014 15
37. Experimental Setup
20 TaskTrackers (MapReduce clients) for a total of 40 map and 20
reduce slots
Three kinds of workloads inspired by existing traces
Bin  Dataset Size  Avg. num. Map Tasks  DEV   TEST  PROD
1    1 GB          < 5                  65%   30%    0%
2    10 GB         10 - 50              20%   40%   10%
3    100 GB        50 - 150             10%   10%   60%
4    1 TB          > 150                 5%   20%   30%
Each experiment is composed of 100 jobs taken from PigMix and
was executed 5 times
HFSP is compared to the Fair Scheduler
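Purely as an illustration of the workload mix in the table above, a hypothetical way to draw a 100-job workload from the bin frequencies (the helper names and the seed are assumptions):

    import random

    BINS = [1, 2, 3, 4]
    MIX = {
        "DEV":  [0.65, 0.20, 0.10, 0.05],
        "TEST": [0.30, 0.40, 0.10, 0.20],
        "PROD": [0.00, 0.10, 0.60, 0.30],
    }

    def draw_workload(kind, num_jobs=100, seed=0):
        rng = random.Random(seed)
        return rng.choices(BINS, weights=MIX[kind], k=num_jobs)

    print(draw_workload("PROD")[:10])  # PROD runs are dominated by bins 3 and 4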
39. Performance Metrics
Mean Response Time
A job's response time is the time elapsed between its submission and
its completion
The mean of the response times of all jobs indicates the
responsiveness of the system under that scheduling policy
Fairness
A common approach is to use the job slowdown, i.e. the ratio
between a job's response time and its size, to indicate how fair the
scheduler has been with that job
In the literature, a scheduler with the same or smaller slowdowns than
Processor Sharing is considered fair
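Both metrics are straightforward to compute; a minimal sketch, assuming jobs are represented as (submission, completion[, size]) tuples:

    def mean_response_time(jobs):
        # jobs: list of (submission_time, completion_time) pairs
        return sum(done - sub for sub, done in jobs) / len(jobs)

    def slowdowns(jobs_with_sizes):
        # jobs_with_sizes: list of (submission_time, completion_time, size);
        # slowdown = response time / size; a scheduler is considered fair if
        # its slowdowns are no worse than those of Processor Sharing
        return [(done - sub) / size for sub, done, size in jobs_with_sizes]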
41. Results: Mean Response Time
[Figure: mean response time (s) per workload, HFSP vs. Fair - DEV: 25 vs. 38 (-34%), TEST: 28 vs. 38 (-26%), PROD: 109 vs. 163 (-33%)]
Overall, HFSP decreases the mean
response time by ∼30%
Tiny jobs (bin 1) are treated the same
way by the two schedulers: they run as
soon as possible
Medium, large and huge jobs (bins 2, 3
and 4) are consistently treated better
by HFSP thanks to its size-based
sequential nature
42. Results: Fairness
[Figure: ECDF of response time / isolation runtime for HFSP and Fair, under the DEV, TEST and PROD workloads]
HFSP is globally fairer to jobs than the Fair Scheduler
The “heavier” the workload is, the better HFSP treats jobs compared
to the Fair Scheduler
For the PROD workload, the gap between the median under HFSP
and the one under Fair is one order of magnitude
43. Impact of the Errors
45. Task Times and Estimation Errors
Task durations within a single job are stable
Even a small number of
training tasks is enough to
estimate the phase size
[Figure: ECDF of task time / mean task time, for map and reduce tasks]
[Figure: ECDF of the estimation error using 5 samples, for the map and reduce phases]
error = estimated size / real size
error > 1 ⇒ estimated size is bigger
than the real one (over-estimation)
error < 1 ⇒ estimated size is smaller
than the real one (under-estimation)
The biggest errors are over-estimations of the
map phase
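In code form, the error metric above (the names are illustrative):

    def estimation_error(estimated_size, real_size):
        # > 1 means over-estimation, < 1 means under-estimation
        return estimated_size / real_size

    print(estimation_error(120.0, 100.0))  # 1.2: the size was over-estimated by 20%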
46. Estimation Errors: Job Sizes and Phases
[Figure: distribution of the estimation error per bin (bins 2, 3 and 4), for the Map Phase and the Reduce Phase]
The majority of estimated sizes are close to the correct ones
Tendency to over-estimate in all the bins
Smaller errors on medium jobs (bin 2) compared to large and huge
ones (bins 3 and 4)
Job order switches caused by estimation errors are highly unlikely
49. FSP with Estimation Errors
Our experiments show that, in Hadoop, the estimation errors do not
hurt the performance of our size-based scheduler
Can we abstract from Hadoop and extract a general rule on the
applicability of size-based scheduling policies?
Simulation-based approach: simulations are fast, making it possible to
try different workloads, job arrival times and errors
Our results show that size-based schedulers, like FSP and SRPT, are
tolerant to errors in many cases
We created FSP+PS, which tolerates even more “extreme” conditions
[MASCOTS 2014]
53. Task Preemption in HFSP
In theory
Preemption consists of removing resources from a running job and
granting them to another one
Without knowledge of the workload, preemptive schedulers outmatch
their non-preemptive counterparts
In practice
Preemption is difficult to implement
In Hadoop
Task preemption is supported through the kill primitive: it removes
resources from a task by killing it ⇒ all of its work is lost!
The disadvantages of kill are well known, so it is usually disabled or used
very carefully
HFSP is a preemptive scheduler and supports the task kill primitive
54. Results: Kill Preemption
[Figure: ECDF of job slowdown (s) and of sojourn time (s) under kill and wait]
Kill improves fairness and the response times of small and medium
jobs. . .
. . . but heavily impacts the response times of large jobs
57. OS-Assisted Preemption
Kill preemption is non-optimal: it preempts running tasks but has a
high cost
Can we build a mechanism that is closer to ideal preemption?
Idea . . .
Instead of killing a task, we can suspend it where it is running
When the task should run again, we can resume it where it was
running
. . . but how can it be implemented?
Operating Systems know very well how to suspend and resume
processes
At low level, tasks are processes
Exploit OS capabilities to get a new preemption primitive: Task
Suspension [DCPERF 2014]
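On POSIX systems the new primitive boils down to sending SIGSTOP and SIGCONT to the task's process; a minimal sketch of the general OS mechanism (not the actual Hadoop patch):

    import os
    import signal

    def suspend_task(pid):
        # Stop the process; the OS can later swap out its memory if other
        # tasks need it.
        os.kill(pid, signal.SIGSTOP)

    def resume_task(pid):
        # Continue the process exactly where it was stopped.
        os.kill(pid, signal.SIGCONT)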
62. Conclusion
Size-based schedulers with estimated (imprecise) sizes can
outperform non-size-based schedulers in real systems
We showed this by designing the Hadoop Fair Sojourn Protocol, a
size-based scheduler for a real, distributed system such as
Hadoop
HFSP is fair and achieves a small mean response time
It can also use Hadoop's preemption mechanism to improve the fairness
and response times of small jobs, but this affects the performance
of large and huge jobs
63. Future Work
HFSP + Suspension: adding the suspension mechanism to HFSP
raises many challenges, such as the eviction policy and reduce
locality
Recurring Jobs: exploit the past runs of recurring jobs to obtain an
almost perfect estimation from the moment of their submission.
Complex Jobs: high-level languages and libraries push the scheduling
problem from simple jobs to complex jobs, i.e. chains of simple
jobs. Can we adapt HFSP to such jobs?
65. Size-Based Scheduling with Estimated Sizes
66. Impact of Over-estimation and Under-estimation
[Figure: remaining size over time for jobs J1, J2, J3 under over-estimation and jobs J4, J5, J6 under under-estimation]
Over-estimating a job affects only that job. Other jobs in the queue are
not affected
Under-estimating a job can affect other jobs in the queue
67. FSP+PS
In FSP, under-estimated jobs can complete in the virtual system
before they do in the real system. We call them late jobs
When a job is late, it should not prevent other jobs from running
FSP+PS solves the problem by scheduling late jobs using processor
sharing
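A minimal sketch of the FSP+PS idea under an assumed slot-based model: late jobs share the slots processor-sharing style, and the remaining slots go to the job FSP would pick.

    def assign_slots(jobs, slots):
        # jobs: dict job_id -> {'virtual': remaining virtual size,
        #                       'real': remaining real work}
        alloc = {j: 0 for j in jobs}
        late = [j for j, s in jobs.items() if s["virtual"] <= 0 and s["real"] > 0]
        pending = {j: s["virtual"] for j, s in jobs.items() if s["virtual"] > 0}
        if late:
            share = slots // len(late)      # processor sharing among late jobs
            for j in late:
                alloc[j] = share
            slots -= share * len(late)
        if pending and slots > 0:
            alloc[min(pending, key=pending.get)] += slots   # FSP picks the smallest
        return alloc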
73. OS-Assisted Task Preemption
The kill preemption primitive has many drawbacks; can we do better?
At low level, tasks are processes, and processes can be suspended
and resumed by the Operating System
We exploit this mechanism by enabling task suspension and
resumption
No need to change existing jobs! It is done at low level and is transparent
to the user
Bonus: the operating system manages the memory of processes
The memory of suspended tasks can be granted to other (running) tasks by
the OS. . .
. . . and because the OS knows how much memory each process needs,
only the memory required is taken from the suspended task
75. OS-Assisted Task Preemption: Thrashing
Thrashing: when data is continuously read from and written to swap
space, the machine's performance is degraded to the point that
the machine doesn't work properly anymore
Thrashing occurs when the working set is larger than
the physical memory of the system
In Hadoop this doesn't happen because:
Running tasks per machine are limited
Heap space per task is limited
76. OS-Assisted Task Preemption: Experiments
Test the worst case for suspension, that is, when the jobs allocate all
the memory
Two jobs, th and tl, allocating 2 GB of memory
[Figure: sojourn time of th (s) and makespan (s) vs. tl progress at the launch of th (%), for wait, kill and susp]
Our primitive outperforms kill and wait
The swapping overhead doesn't affect the jobs too much
79. OS-Assisted Task Preemption: Conclusions
Task Suspension/Resume outperforms current preemption
implementations. . .
. . . but it raises new challenges, e.g. state locality for suspended tasks
With a good scheduling policy (and eviction policy), OS-assisted
preemption can replace the current preemption mechanism