The proof that the best response time in queuing systems is obtained by scheduling the jobs with the shortest remaining processing time dates back to 1966; since then, other size-based scheduling protocols that pair near-optimal response times with strong fairness guarantees have been proposed. Yet, despite these very desirable properties, size-based scheduling policies are almost never used in practice: a key reason is that, in real systems, it is prohibitive to know a priori exact job sizes.
In this talk, I will first describe our efforts to put into practice concepts coming from theory, developing HFSP: a size-based scheduler for Hadoop MapReduce that uses estimations rather than exact size information. We obtained results that were surprisingly good even with very inaccurate size estimations: this motivated us to return to theory and perform an in-depth study of scheduling based on estimated sizes. The results are very promising: for a large class of workloads, size-based scheduling performs well even with very rough size estimations; for the remaining workloads, simple modifications to the existing scheduling protocols are sufficient to greatly enhance performance.
Optimizing Performance - Clojure Remote - Nikola Peric
When a project approaches production, questions about performance always surface. This talk tackles several real-world problems that have occurred while bringing a data-driven project to production, and walks through the problem-solving approach to each.
Cloud computing is one of the emerging techniques for processing big data, i.e., very large collections or volumes of data. Processing big data such as MRI and DICOM images normally takes more time than other data. The main tasks involved in handling big data can be addressed using Hadoop concepts, and enhancing Hadoop helps users process large sets of images or data. The Advanced Hadoop Distributed File System (AHDF) and MapReduce are the two main functions used in this enhancement: HDFS is Hadoop's file storage system, used for storing and retrieving data, while MapReduce combines two functions, map and reduce. Map splits the input, and reduce integrates the output of the map phase. Medical applications have recently experienced problems such as machine failure and fault tolerance while processing results for scanned data. A unique optimized time-scheduling algorithm, the Advanced Dynamic Handover Reduce Function (ADHRF), is introduced in the reduce function. Enhancing Hadoop and the cloud with ADHRF helps overcome these processing risks and yields optimized results with less waiting time and a lower error percentage in the output image.
When the number of data elements gets large - thousands to billions or more data points - standard visual representations and interaction techniques break down. In this talk, we will survey methods for scaling interactive visualizations to data sets too large to process or explore using traditional means. I will compare data reduction techniques such as sampling, aggregation and model fitting, as well as interesting hybrid approaches, and discuss their trade-offs. I will also describe methods to enable real-time interactive exploration within standards-compliant web browsers. Attendees will learn effective visualization techniques and interaction methods that are applicable to billion+ element databases.
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS - csandit
The ability to automatically mine and extract useful information from large datasets has been a common concern for organizations over the last few decades. Data on the internet is growing rapidly, and so is the capacity to collect and store very large datasets. Existing clustering algorithms are not always efficient and accurate when solving clustering problems for large datasets, and the development of accurate and fast data classification algorithms for very-large-scale datasets remains a challenge. In this paper, various algorithms and techniques, in particular an approach based on a non-smooth optimization formulation of the clustering problem, are proposed for solving minimum sum-of-squares clustering problems in very large datasets. This research also develops an accurate and real-time L2-DC algorithm, based on an incremental approach, to solve the minimum sum-of-squares clustering problem.
Time Series Analysis: Basic Stochastic Signal Recovery - Daniel Cuneo
A simple case of recovering a stochastic signal from a time series containing a linear combination of nuisance signals.
Errata:
Corrected an error in the Gaussian fit.
Corrected the jackknife example and un-centered the data.
Corrected the significant-figures language and rationale.
Removed the jackknife calculation of the mean; reformatted cells.
STRIP: Stream Learning of Influence Probabilities - Albert Bifet
Influence-driven diffusion of information is a fundamental process in social networks. Learning the latent variables of such a process, i.e., the influence strength along each link, is a central question towards understanding the structure and function of complex networks, modeling information cascades, and developing applications such as viral marketing.
Motivated by modern microblogging platforms, such as Twitter, in this paper we study the problem of learning influence probabilities in a data-stream scenario, in which the network topology is relatively stable and the challenge for a learning algorithm is to keep up with a continuous stream of tweets using a small amount of time and memory. Our contribution is a number of randomized approximation algorithms, categorized according to the available space (superlinear, linear, and sublinear in the number of nodes n) and according to different models (landmark and sliding window). Among several results, we show that we can learn influence probabilities with one pass over the data, using O(n log n) space, in both the landmark model and the sliding-window model, and we further show that our algorithm is within a logarithmic factor of optimal.
For truly large graphs, when one needs to operate with sublinear space, we show that we can still learn influence probabilities in one pass, provided that we restrict our attention to the most active users.
Our thorough experimental evaluation on large social graphs demonstrates that the empirical performance of our algorithms agrees with that predicted by the theory.
MapReduce Workloads: Dynamic Job Ordering and Slot Configurations - dbpublications
MapReduce is a popular parallel computing paradigm for large-scale data processing in clusters and data centers. A MapReduce workload generally contains a set of jobs, each of which consists of multiple map tasks followed by multiple reduce tasks. Because 1) map tasks can only run in map slots and reduce tasks can only run in reduce slots, and 2) map tasks are generally executed before reduce tasks, different job execution orders and map/reduce slot configurations for a MapReduce workload yield significantly different performance and system utilization. This survey proposes two classes of algorithms to minimize the makespan and the total completion time for an offline MapReduce workload. Our first class of algorithms focuses on job ordering optimization for a MapReduce workload under a given map/reduce slot configuration. In contrast, our second class of algorithms considers the scenario in which we can also optimize the map/reduce slot configuration for a MapReduce workload. We perform simulations as well as experiments on Amazon EC2 and show that our proposed algorithms produce results that are 15 to 80 percent better than unoptimized Hadoop, leading to significant reductions in running time in practice.
Presenter: Hwanjun Song (PhD student, KAIST)
Presentation date: August 2018
(Parallel Clustering Algorithm Optimization for Large-Scale Data Analytics)
Clustering is one of the most widely used methods in data analysis: it partitions a given dataset into several groups based on similarity. However, because of its high computational complexity, clustering is rarely applied to very large datasets. To address this, much recent work applies distributed computing frameworks such as Hadoop and Spark, but optimizing existing clustering algorithms for distributed environments is not easy. In particular, trading accuracy for efficiency and load imbalance among workers are two representative problems that arise when parallelizing these algorithms. This seminar focuses on the challenges that arise when parallelizing DBSCAN, a representative clustering algorithm, and presents a new solution to them. In practice, this method improves performance by up to 180x over state-of-the-art approaches without any loss of accuracy.
This seminar covers the following paper, presented at SIGMOD 2018:
Song, H. and Lee, J., "RP-DBSCAN: A Superfast Parallel DBSCAN Algorithm Based on Random Partitioning," In Proc. 2018 ACM Int'l Conf. on Management of Data (SIGMOD), Houston, Texas, pp. 1173-1187, June 2018
1. Background
- Concept of Clustering
- Concept of Distributed Processing (MapReduce)
- Clustering Algorithms (Focus on DBSCAN)
2. Challenges of Parallel Clustering
- Parallelization of Clustering Algorithm (Focus on DBSCAN)
- Existing Work
- Challenges
3. Our Approach
- Key Idea and Key Contribution
- Overview of Random Partitioning-DBSCAN
4. Experimental Results
5. Conclusions
Revisiting Size-Based Scheduling with Estimated Job Sizes - Matteo Dell'Amico
We study size-based schedulers, and focus on the impact of inaccurate job size information on response time and fairness. Our intent is to revisit previous results, which allude to performance degradation for even small errors on job size estimates, thus limiting the applicability of size-based schedulers.
We show that scheduling performance is tightly connected to workload characteristics: in the absence of large skew in the job size distribution, even extremely imprecise estimates suffice to outperform size-oblivious disciplines. Instead, when job sizes are heavily skewed, known size-based disciplines suffer.
In this context, we show -- for the first time -- the dichotomy of over-estimation versus under-estimation. The former is, in general, less problematic than the latter, as its effects are localized to individual jobs. Instead, under-estimation leads to severe problems that may affect a large number of jobs.
We present an approach to mitigate these problems: our technique requires no complex modifications to the original scheduling policies and performs very well. To support our claim, we proceed with a simulation-based evaluation that covers an unprecedentedly large parameter space and takes into account a variety of synthetic and real workloads.
As a consequence, we show that size-based scheduling is practical and outperforms alternatives in a wide array of use cases, even in the presence of inaccurate size information.
MapReduce is a popular programming model for running data-intensive applications on the cloud. In Hadoop, jobs are scheduled in FIFO order by default, yet many MapReduce applications require strict deadlines, and the Hadoop framework does not provide a scheduler with deadline constraints. Existing schedulers do not guarantee that a job will be completed by a specific deadline; some address the issue of deadlines but focus more on improving system utilization. We have proposed an algorithm that lets the user specify a job's deadline and evaluates whether the job can be finished before that deadline. The scheduler with deadlines for Hadoop ensures that only jobs whose deadlines can be met are scheduled for execution. If a submitted job cannot satisfy the specified deadline, physical or virtual nodes can be added dynamically to complete the job within the deadline [8].
This work introduces a new task preemption primitive for Hadoop that allows tasks to be suspended and resumed by exploiting existing memory management mechanisms readily available in modern operating systems. Our technique fills the gap between the two extreme cases of killing tasks (which wastes work) and waiting for their completion (which introduces latency): experimental results indicate superior performance and very small overheads compared to existing alternatives.
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters - Xiao Qin
An increasing number of popular applications become data-intensive in nature. In the past decade, the World Wide Web has been adopted as an ideal platform for developing data-intensive applications, since the communication paradigm of the Web is sufficiently open and powerful. Data-intensive applications like data mining and web indexing need to access ever-expanding data sets ranging from a few gigabytes to several terabytes or even petabytes. Google leverages the MapReduce model to process approximately twenty petabytes of data per day in a parallel fashion. In this talk, we introduce Google's MapReduce framework for processing huge datasets on large clusters. We first outline the motivations of the MapReduce framework. Then, we describe the dataflow of MapReduce. Next, we show a couple of example applications of MapReduce. Finally, we present our research project on the Hadoop Distributed File System.
The current Hadoop implementation assumes that computing nodes in a cluster are homogeneous in nature. Data locality has not been taken into account for launching speculative map tasks, because it is assumed that most maps are data-local. Unfortunately, both the homogeneity and data locality assumptions are not satisfied in virtualized data centers. We show that ignoring the data-locality issue in heterogeneous environments can noticeably reduce MapReduce performance. In this paper, we address the problem of how to place data across nodes in a way that each node has a balanced data processing load. Given a data-intensive application running on a Hadoop MapReduce cluster, our data placement scheme adaptively balances the amount of data stored in each node to achieve improved data-processing performance. Experimental results on two real data-intensive applications show that our data placement strategy can always improve MapReduce performance by rebalancing data across nodes before a data-intensive application is run in a heterogeneous Hadoop cluster.
2. Credits
Joint work with:
Pietro Michiardi, Mario Pastorelli (EURECOM)
Antonio Barbuzzi (ex EURECOM, now @VisualDNA, UK)
Damiano Carra (University of Verona, Italy)
3. Outline
1. Big Data and MapReduce
2. Size-Based Scheduling for MapReduce
3. Size-Based Scheduling With Errors
4. Big Data and MapReduce
Outline
1. Big Data and MapReduce
2. Size-Based Scheduling for MapReduce
3. Size-Based Scheduling With Errors
5-7. Big Data and MapReduce: Big Data
Big Data: Definition
Data that is too big for you to handle the way you normally do.
The 3 (+2) Vs
Volume, Velocity, Variety
… plus Veracity and Value
…But Still…
Why is everybody talking about Big Data now?
8-9. Big Data and MapReduce: Big Data
Big Data: Why Now?
1991: Maxtor 7040A
40 MB, 600-700 KB/s: one minute to read it all
Now: Western Digital Caviar
4 TB, 128 MB/s: 9 hours to read it all
10-11. Big Data and MapReduce: Big Data
Moore and His Brothers
Moore's Law: processing power doubles every 18 months
Kryder's Law: storage capacity doubles every year
Nielsen's Law: bandwidth doubles every 21 months
Storage is cheap: we never throw away anything
Processing all that data is expensive
Moving it around is even worse
12-13. Big Data and MapReduce: MapReduce
MapReduce
Bring the computation to the data: the input is split in blocks across a cluster.
Map
One task per block
Hadoop filesystem (HDFS): 64 MB blocks by default
Stores key-value pairs locally
e.g., for word count: [(red, 15), (green, 7), ...]
Reduce
Number of tasks set by the programmer
Mapper output is partitioned by key and pulled from the "mappers"
The Reduce function operates on all values for a single key
e.g., (green, [7, 42, 13, ...])
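To make the two phases concrete, here is a minimal word-count sketch in plain Python that mimics the map and reduce roles described above; it illustrates the programming model only and is not Hadoop code.

from collections import defaultdict

def map_phase(block):
    # one map task per input block: emit (word, count) pairs for that block
    counts = defaultdict(int)
    for word in block.split():
        counts[word] += 1
    return list(counts.items())

def reduce_phase(pairs):
    # the framework groups pairs by key; for brevity we aggregate all keys here
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

blocks = ["red green red", "green red blue"]
intermediate = [pair for block in blocks for pair in map_phase(block)]
print(reduce_phase(intermediate))   # {'red': 3, 'green': 2, 'blue': 1}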
14. Big Data and MapReduce: MapReduce
The Problem With Scheduling
Current Workloads
Huge job size variance
Running time: seconds to hours
I/O: KBs to TBs
[Chen et al., VLDB '12; Ren et al., VLDB '13; Appuswamy et al., SOCC '13]
Consequence
Interactive jobs are delayed by long ones
In smaller clusters, long queues exacerbate the problem
15. Size-Based Scheduling for MapReduce
Outline
1. Big Data and MapReduce
2. Size-Based Scheduling for MapReduce
3. Size-Based Scheduling With Errors
16-17. Size-Based Scheduling for MapReduce: Size-Based Scheduling
Shortest Remaining Processing Time
[Figure: cluster usage (%) over time (s) for three jobs, comparing a sharing schedule with SRPT; under SRPT the shorter jobs complete earlier.]
18. Size-Based Scheduling for MapReduce: Size-Based Scheduling
Size-Based Scheduling
Shortest Remaining Processing Time (SRPT)
Minimizes average sojourn time (the time between job submission and completion)
Fair Sojourn Protocol (FSP)
Jobs are scheduled in the order in which they would complete under Processor Sharing (PS)
Avoids starving large jobs
Fairness: jobs are guaranteed to complete no later than under Processor Sharing
[Friedman & Henderson, SIGMETRICS '03]
Unknown Job Size
…and what if we can only estimate job size?
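As a toy illustration of SRPT (written for this summary; it is not the talk's simulator), the following single-server Python sketch computes the mean sojourn time of jobs given as (arrival time, size) pairs, always running the job with the least remaining work:

import heapq

def srpt_mean_sojourn(jobs):
    jobs = sorted(jobs)                       # (arrival, size), sorted by arrival
    t, i, sojourns = 0.0, 0, []
    ready = []                                # heap of (remaining work, arrival)
    while i < len(jobs) or ready:
        if not ready:                         # server idle: jump to the next arrival
            t = max(t, jobs[i][0])
        while i < len(jobs) and jobs[i][0] <= t:
            heapq.heappush(ready, (jobs[i][1], jobs[i][0]))
            i += 1
        remaining, arrival = heapq.heappop(ready)
        # run the shortest job until it finishes or the next job arrives
        next_arrival = jobs[i][0] if i < len(jobs) else float("inf")
        run = min(remaining, next_arrival - t)
        t += run
        remaining -= run
        if remaining <= 1e-12:
            sojourns.append(t - arrival)      # sojourn = completion - arrival
        else:
            heapq.heappush(ready, (remaining, arrival))
    return sum(sojourns) / len(sojourns)

print(srpt_mean_sojourn([(0, 10), (1, 2), (2, 1)]))   # ~5.67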
19. Size-Based Scheduling for MapReduce: Size-Based Scheduling
Multi-Processor Size-Based Scheduling
[Figure: cluster usage (%) over time (s) for three jobs scheduled across multiple task slots, under two scheduling orders.]
20-21. Size-Based Scheduling for MapReduce: HFSP Implementation
HFSP In A Nutshell
Job Size Estimation
Naive estimation at first
After the first s "training" tasks have run, we update it (s = 5 by default)
On t task slots, we give priority to training tasks (t avoids starving "old" jobs)
"Shortcut" for very small jobs
Scheduling Policy
We treat Map and Reduce phases as separate jobs
Virtual time: per-job simulated completion time
When a task slot frees up, we schedule a task from the job that completes earliest in the virtual time
22. Size-Based Scheduling for MapReduce: HFSP Implementation
Job Size Estimation
Initial Estimation
k · l, where
k: number of tasks
l: average size of past Map/Reduce tasks
Second Estimation
After the s sample tasks have run, compute l′ as the average size of the sample tasks
Timeout (60 s by default): if sample tasks have not completed by then, use their progress %
Predicted job size: k · l′
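The two-stage estimate sketched above can be written in a few lines of Python; the names below are illustrative and do not reflect HFSP's actual API.

def initial_estimate(num_tasks, avg_past_task_size):
    # k * l: number of tasks times the average size of previously observed tasks
    return num_tasks * avg_past_task_size

def refined_estimate(num_tasks, samples, timeout=60.0):
    # samples: (elapsed seconds, progress fraction) for each of the s training tasks;
    # a task that finished within the timeout has progress == 1.0
    sizes = []
    for elapsed, progress in samples:
        if progress >= 1.0:
            sizes.append(elapsed)                         # completed: its runtime is its size
        else:
            sizes.append(timeout / max(progress, 1e-6))   # timed out: extrapolate from progress
    l_prime = sum(sizes) / len(sizes)
    return num_tasks * l_prime                            # k * l'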
23. Size-Based Scheduling for MapReduce: HFSP Implementation
Virtual Time
The estimated job size is in a "serialized", single-machine format
HFSP simulates a processor-sharing cluster to compute completion times, based on
the number of tasks per job
the available task slots in the real cluster
The simulation is updated when
new jobs arrive
tasks complete
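A rough sketch of the virtual-time computation, under the simplifying assumption that every active job gets an equal share of the task slots (HFSP's real simulation also accounts for each job's task count):

def ps_completion_order(remaining_work, slots):
    # remaining_work: job id -> estimated remaining work, in slot-seconds
    rem = dict(remaining_work)
    order = []
    while rem:
        share = slots / len(rem)                    # equal share of the cluster per active job
        dt = min(w / share for w in rem.values())   # time until the next virtual completion
        for job in list(rem):
            rem[job] -= share * dt
            if rem[job] <= 1e-9:
                order.append(job)
                del rem[job]
    return order

# ps_completion_order({"j1": 120.0, "j2": 30.0}, slots=10) -> ["j2", "j1"]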
24. Size-Based Scheduling for MapReduce: Experiments
Experimental Setup
Platform
36 machines with 4 CPUs and 16 GB RAM each
Workloads
Generated with the PigMix benchmark: realistic operations on synthetic data
Data sizes inspired by known measurements [Chen et al., VLDB '12; Ren et al., VLDB '13]
Configuration
We compare to Hadoop's FAIR scheduler (similar to processor sharing)
Delay scheduling enabled for both FAIR and HFSP
25. Size-Based Scheduling for MapReduce: Experiments
Sojourn Time
[Figure: ECDFs of sojourn time (s) for HFSP and FAIR on the two workloads.]
"Small" workload: ~16% better; "large" workload: ~75% better
Sojourn time: the time that passes between the moment a job is submitted and the moment it terminates
With higher load, the scheduler becomes decisive
Analogous results on a different platform and a different workload
26. Size-Based Scheduling for MapReduce: Experiments
Job Size Estimation
[Figure: ECDF of the estimation error for Map and Reduce tasks.]
Error: real size / estimated size
Fits a log-normal distribution
The estimation isn't even that good! Why does HFSP work that well?
27. Size-Based Scheduling With Errors
Outline
1. Big Data and MapReduce
2. Size-Based Scheduling for MapReduce
3. Size-Based Scheduling With Errors
28. Size-Based Scheduling With Errors: Scheduling Simulation
Scheduling Simulation
How does size-based scheduling behave in the presence of errors?
Lu et al. (MASCOTS 2004) suggest much worse results
We wrote a simulator to understand this better, using Hadoop-like workloads [Chen et al., VLDB '12]
Written in Python: efficient and easy to prototype new schedulers
29. Size-Based Scheduling With Errors: Scheduling Simulation
Log-Normal Error Distribution
[Figure: PDFs of the log-normal error distribution for sigma = 0.125, 0.25, 1, and 4.]
Error: real size / estimated size
30. Size-Based Scheduling With Errors: Scheduling Simulation
Weibull Job Size Distribution
[Figure: PDFs of the Weibull job size distribution for shape = 0.125, 1, 2, and 4.]
Interpolates between
heavy-tailed job size distributions (shape < 1)
the exponential distribution (shape = 1)
bell-shaped distributions (shape > 1)
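In the same spirit, the simulator's inputs can be generated with NumPy: Weibull-distributed job sizes whose shape parameter controls the tail, and multiplicative log-normal noise whose sigma controls how inaccurate the size estimates are. This is a sketch under those assumptions, not the actual simulator code.

import numpy as np

rng = np.random.default_rng(42)

def sample_job_sizes(n, shape):
    # shape < 1: heavy-tailed; shape = 1: exponential; shape > 1: bell-shaped
    return rng.weibull(shape, size=n)

def perturb_sizes(real_sizes, sigma):
    # multiplicative log-normal error; a larger sigma means wilder over/under-estimations
    return real_sizes * rng.lognormal(mean=0.0, sigma=sigma, size=len(real_sizes))

real_sizes = sample_job_sizes(10_000, shape=0.25)
estimated_sizes = perturb_sizes(real_sizes, sigma=0.5)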
31. Size-Based Scheduling With Errors: Scheduling Simulation
Size-Based Scheduling With Errors
[Figure: heatmaps of mean sojourn time normalized to processor sharing, MST/MST(PS), as a function of job size shape and error sigma, for SRPT and FSP.]
Problems arise for heavy-tailed job size distributions
Otherwise, size-based scheduling works very well
32. Size-Based Scheduling With Errors: Scheduling Simulation
Over-Estimations and Under-Estimations
[Figure: remaining-size vs. time diagrams for jobs J1-J6, contrasting an over-estimated job, which mainly delays itself, with an under-estimated job, which delays the jobs queued behind it.]
Under-estimations can wreak havoc with heavy-tailed workloads
33-34. Size-Based Scheduling With Errors: Scheduling Simulation
FSP + PS
Idea
Without errors, real jobs always complete before virtual ones
When they don't (they are "late"), there has been an estimation error
The scheduler can realize this and take corrective action
Realization
To avoid late jobs blocking the system, just do processor sharing between them instead of scheduling the "most late" one
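The corrective action can be expressed as a small dispatch rule; this is a sketch of the idea above with illustrative names, not the authors' implementation.

def jobs_to_serve(active_jobs, virtual_completion, now):
    # active_jobs: ids of jobs with work left; virtual_completion: id -> virtual finish time
    late = [j for j in active_jobs if virtual_completion[j] <= now]
    if late:
        # estimation error detected: share the cluster among all late jobs (the PS part)
        return late
    # otherwise plain FSP: serve the job that finishes first in the virtual schedule
    return [min(active_jobs, key=lambda j: virtual_completion[j])]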
36. Size-Based Scheduling With Errors: Scheduling Simulation
Take-Home Messages
Size-based scheduling on Hadoop is viable, and particularly appealing for companies with (semi-)interactive jobs and smaller clusters
Schedulers like HFSP (in practice) and FSP+PS (in theory) are robust with respect to errors; therefore, simple rough estimations are sufficient
HFSP is available as free software at http://github.com/bigfootproject/hfsp
Scheduling simulator at https://bitbucket.org/bigfootproject/schedsim
HFSP: published at IEEE BigData 2013
Scheduling simulator and FSP+PS: under submission, available at http://arxiv.org/abs/1403.5996
37. Bonus Content: Comparison with SRPT
Schedulers vs. SRPT
[Figure: mean sojourn time normalized to SRPT with exact sizes, MST/MST(SRPT), versus job size shape, for SRPTE, FSPE, FSPE+PS, PS, LAS, and FIFO.]
39. Bonus Content: Real Workloads
Web Cache
[Figure: MST/MST(SRPT) versus error sigma for SRPTE, FSPE, FSPE+PS, PS, LAS (and FIFO), on a synthetic workload (shape = 0.177) and on the IRCache web cache trace.]
40-41. Bonus Content: Job Preemption
Job Preemption
Supported in Hadoop
Kill running tasks: wastes work
Wait for them to finish: may take long
Our Choice
Map tasks: Wait (generally small)
For Reduce tasks, we implemented Suspend and Resume (avoids the drawbacks of both Wait and Kill)
42-45. Bonus Content: Job Preemption
Job Preemption: Suspend and Resume
Our Solution
We delegate to the OS: SIGSTOP and SIGCONT
The OS will swap suspended tasks if and when memory is needed
No risk of thrashing: swapped data is loaded only when resuming
Configurable maximum number of suspended tasks
If it is reached, switch to Wait
This puts a hard limit on memory allocated to suspended tasks
Among preemptable running tasks, suspend the youngest
Likely to finish later
May have a smaller memory footprint
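On a POSIX system, the suspend/resume primitive boils down to the two signals named above; a minimal sketch, where pid is the (hypothetical) process id of a running reduce task:

import os, signal

def suspend(pid):
    os.kill(pid, signal.SIGSTOP)   # freeze the task; the OS may swap its memory out if needed

def resume(pid):
    os.kill(pid, signal.SIGCONT)   # continue where it left off; swapped pages fault back in lazily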