Comparing Open Source implementations of Pregel and Related Systems.
Installation of Hadoop and the Pregel Related Systems.
Worked with datasets of varying sizes, from very small to very large; the largest had around 30 million vertices and 50 million edges.
Worked on 1-, 4-, and 8-node Amazon EC2 clusters.
Four algorithms: PageRank, Shortest Path, K-Means, Collaborative Filtering.
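As a concrete illustration of the vertex-centric model these systems share, here is a minimal, single-process sketch of Pregel-style PageRank in Python. The graph, damping factor, and superstep count are illustrative only; real Pregel systems partition vertices across workers and exchange messages over the network.

```python
from collections import defaultdict

def pregel_pagerank(edges, num_vertices, supersteps=20, damping=0.85):
    """Toy, single-process sketch of Pregel-style PageRank: in each
    superstep, every vertex sums its incoming messages, updates its
    rank, and sends rank/out_degree along its out-edges."""
    out_edges = defaultdict(list)
    for src, dst in edges:
        out_edges[src].append(dst)

    rank = {v: 1.0 / num_vertices for v in range(num_vertices)}
    for _ in range(supersteps):
        # message-passing phase
        inbox = defaultdict(float)
        for v in range(num_vertices):
            targets = out_edges[v]
            if targets:
                share = rank[v] / len(targets)
                for t in targets:
                    inbox[t] += share
        # compute phase
        rank = {v: (1 - damping) / num_vertices + damping * inbox[v]
                for v in range(num_vertices)}
    return rank

# tiny 3-vertex cycle: ranks stay uniform at 1/3
ranks = pregel_pagerank([(0, 1), (1, 2), (2, 0)], 3)
```

On a cycle the rank mass is preserved each superstep, which makes the example easy to check by hand.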
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ... (Databricks)
Apache Spark 2.2 shipped with a state-of-the-art cost-based optimization framework that collects and leverages a variety of per-column data statistics (e.g., cardinality, number of distinct values, NULL values, max/min, avg/max length, etc.) to improve the quality of query execution plans. Skewed data distributions are often inherent in many real-world applications. In order to deal with skewed distributions effectively, we added equal-height histograms to Apache Spark 2.3. Leveraging reliable statistics and histograms helps Spark make better decisions in picking the optimal query plan for real-world scenarios.
In this talk, we’ll take a deep dive into how Spark’s cost-based optimizer estimates the cardinality and size of each database operator. Specifically, for skewed-distribution workloads such as TPC-DS, we will show the histogram’s impact on query plan changes and the resulting performance gains.
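To illustrate the idea (not Spark's actual implementation), here is a small Python sketch of building an equal-height histogram and using it to estimate the selectivity of a range predicate; the function names and bucket count are hypothetical.

```python
def equi_height_histogram(values, num_buckets):
    """Build an equal-height (equi-depth) histogram: each bucket covers
    roughly the same number of rows, so skewed values do not distort
    range-selectivity estimates the way fixed-width buckets do."""
    data = sorted(values)
    n = len(data)
    bounds = [data[min(b * n // num_buckets, n - 1)]
              for b in range(1, num_buckets + 1)]
    return data[0], bounds  # (min value, upper bound of each bucket)

def estimate_le_selectivity(hist, x):
    """Estimate the selectivity of `col <= x` from bucket boundaries,
    interpolating linearly inside the bucket that contains x."""
    lo, bounds = hist
    num_buckets = len(bounds)
    prev = lo
    for i, hi in enumerate(bounds):
        if x < hi:
            frac = (x - prev) / (hi - prev) if hi > prev else 1.0
            return (i + frac) / num_buckets
        prev = hi
    return 1.0

# heavily skewed column: many rows with value 1, plus a long tail
values = [1] * 900 + list(range(100, 1100))
hist = equi_height_histogram(values, 10)
sel = estimate_le_selectivity(hist, 1)   # true selectivity is 900/1900 ~ 0.47
```

Because the skewed value fills several equal-height buckets, the estimate stays close to the true selectivity, which a fixed-width histogram would badly underestimate.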
Graphite is often regarded as very slow and not easily scalable. As a data-driven company, we couldn't give up the statistical functions of Graphite. In this talk we show how SimilarWeb scaled its Graphite stack to meet the demand.
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ... (Databricks)
Explore the trade-offs of performing linear algebra for data analysis and machine learning using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Apache Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks.
This session will examine three widely used and important matrix factorizations: NMF (for physical plausibility), PCA (for its ubiquity) and CX (for data interpretability). Learn how these methods are applied to terabyte-sized problems in particle physics, climate modeling and bioimaging, as use cases where interpretable analytics is of interest. The data matrices are tall-and-skinny, which enables the algorithms to map conveniently into Spark’s data-parallel model. We perform scaling experiments on up to 1600 Cray XC40 nodes, describe the sources of slowdowns and provide tuning guidance to obtain high performance. Based on joint work with Alex Gittens and many others.
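To see why a tall-and-skinny matrix maps well onto a data-parallel engine, here is a hedged NumPy sketch of PCA computed from per-partition Gram-matrix contributions. The block structure stands in for map/reduce steps; the talk's actual work uses Spark and MPI at far larger scale.

```python
import numpy as np

def tall_skinny_pca(row_blocks, k):
    """PCA for a tall-and-skinny matrix in a data-parallel style:
    each partition ('row block') contributes partial sums that are
    combined (the reduce step), and the small d x d eigenproblem is
    solved on a single node."""
    d = row_blocks[0].shape[1]
    n = sum(block.shape[0] for block in row_blocks)
    # pass 1: column means from per-block partial sums
    mean = sum(block.sum(axis=0) for block in row_blocks) / n
    # pass 2: accumulate the centered Gram matrix block by block
    gram = np.zeros((d, d))
    for block in row_blocks:
        centered = block - mean
        gram += centered.T @ centered
    eigvals, eigvecs = np.linalg.eigh(gram / (n - 1))
    order = np.argsort(eigvals)[::-1][:k]
    return eigvals[order], eigvecs[:, order]

# small example: two "partitions" of a 200 x 5 matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
top_vals, top_vecs = tall_skinny_pca([X[:100], X[100:]], k=2)
```

Only the d x d Gram matrix crosses partition boundaries, which is exactly what makes the tall-and-skinny case cheap to distribute.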
Slides of the Apache Omid presentation at Hadoop Summit 2016 in San Jose, CA. Omid is a flexible, reliable, high-performance and scalable transaction manager for HBase.
HDF4 and HDF-EOS format reading has recently been added to the NetCDF-Java 4.0 library, while HDF5 / NetCDF-4 format reading has been improved. This talk will summarize the status of reading the HDF family of formats through the NetCDF-Java library, with particular attention to the mapping between these formats and the Common Data Model.
Clustering has been one of the most widely studied topics in data mining, and k-means is one of the most popular clustering algorithms. K-means requires several passes over the entire dataset, which can make it very expensive for large disk-resident datasets. In view of this, a lot of work has been done on various approximate versions of k-means, which require only one or a small number of passes over the entire dataset.

In this paper, we present a new algorithm, called Fast and Exact K-means Clustering (FEKM), which typically requires only one or a small number of passes over the entire dataset, and provably produces the same cluster centers as reported by the original k-means algorithm. The algorithm uses sampling to create initial cluster centers, and then takes one or more passes over the entire dataset to adjust these cluster centers. We provide theoretical analysis to show that the cluster centers thus reported are the same as the ones computed by the original k-means algorithm. Experimental results from a number of real and synthetic datasets show speedups between a factor of 2 and 4.5, as compared to k-means.

This paper also describes and evaluates a distributed version of FEKM, which we refer to as DFEKM. This algorithm is suitable for analyzing data that is distributed across loosely coupled machines. Unlike previous work in this area, DFEKM provably produces the same results as the original k-means algorithm. Our experimental results show that DFEKM is clearly better than two other possible options for exact clustering on distributed data: downloading all the data and running sequential k-means, or running parallel k-means on a loosely coupled configuration. Moreover, even in a tightly coupled environment, DFEKM can outperform parallel k-means if there is a significant load imbalance.
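The overall sample-then-adjust structure can be sketched as follows. Note that this simplified illustration omits the boundary-point bookkeeping that gives FEKM its exactness guarantee, so it is only an approximation of the paper's algorithm, shown here on 1-D data for brevity.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means (run on a small sample to seed centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            clusters[j].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

def sample_then_one_pass(points, k, sample_size=100, seed=0):
    """Simplified illustration of the FEKM structure: run k-means on a
    sample to get initial centers, then make one pass over the full
    dataset to adjust them. (The real FEKM additionally tracks points
    near cluster boundaries so the result provably matches exact
    k-means; that bookkeeping is omitted here.)"""
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    centers = kmeans(sample, k, seed=seed)
    sums = [0.0] * k
    counts = [0] * k
    for p in points:                      # the single full-data pass
        j = min(range(k), key=lambda c: (p - centers[c]) ** 2)
        sums[j] += p
        counts[j] += 1
    return [sums[j] / counts[j] if counts[j] else centers[j]
            for j in range(k)]
```

The expensive part (the full-data scan) happens once, which is where the paper's reported 2-4.5x speedups come from.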
On Extending MapReduce - Survey and Experiments (Yu Liu)
This talk presents a survey and my experiments on extending the MapReduce programming model. A BSP-based MapReduce interface was implemented and evaluated, showing dramatic performance improvements.
The Parquet Format and Performance Optimization Opportunities (Databricks)
The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads.
As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into specifics of the Parquet format: representation on disk, physical data organization (row-groups, column-chunks and pages) and encoding schemes. Now equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is ‘many small files’, and will discuss the open-source Delta Lake format in relation to this and Parquet in general.
This talk serves both as an approachable refresher on columnar storage as well as a guide on how to leverage the Parquet format for speeding up analytical workloads in Spark using tangible tips and tricks.
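The min/max skipping mentioned above can be illustrated with a small pure-Python model. This is not actual Parquet reader code; the function names and row-group size are made up, but the mechanism is the one the talk describes.

```python
def split_into_row_groups(rows, group_size):
    """Mimic a Parquet writer: chunk rows into row groups and record
    per-group min/max statistics for the column of interest."""
    groups = []
    for i in range(0, len(rows), group_size):
        chunk = rows[i:i + group_size]
        groups.append({"rows": chunk, "min": min(chunk), "max": max(chunk)})
    return groups

def read_with_pushdown(groups, lo, hi):
    """Mimic a reader evaluating `lo <= col <= hi`: row groups whose
    [min, max] range cannot overlap the predicate are skipped without
    being decoded -- the essence of min/max predicate pushdown."""
    scanned = 0
    out = []
    for g in groups:
        if g["max"] < lo or g["min"] > hi:
            continue                      # skipped: stats prove no match
        scanned += 1
        out.extend(v for v in g["rows"] if lo <= v <= hi)
    return out, scanned

# sorted data clusters values, so most row groups become skippable
groups = split_into_row_groups(list(range(1000)), group_size=100)
values, scanned = read_with_pushdown(groups, 250, 260)
```

With sorted (well-clustered) data, only one of the ten row groups is actually decoded, which is why sort order and partitioning schemes matter so much for pushdown effectiveness.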
Dynamically Optimizing Queries over Large Scale Data Platforms (INRIA-OAK)
Enterprises are adopting large-scale data processing platforms, such as Hadoop, to gain actionable insights from their "big data". Query optimization is still an open challenge in this environment due to the volume and heterogeneity of data, comprising both structured and un/semi-structured datasets. Moreover, it has become common practice to push business logic close to the data via user-defined functions (UDFs), which are usually opaque to the optimizer, further complicating cost-based optimization. As a result, classical relational query optimization techniques do not fit well in this setting, while at the same time, suboptimal query plans can be disastrous with large datasets. In this talk, I will present new techniques that take into account UDFs and correlations between relations for optimizing queries running on large scale clusters. We introduce "pilot runs", which execute part of the query over a sample of the data to estimate selectivities, and employ a cost-based optimizer that uses these selectivities to choose an initial query plan. Then, we follow a dynamic optimization approach, in which plans evolve as parts of the queries get executed. Our experimental results show that our techniques produce plans that are at least as good as, and up to 2x (4x) better for Jaql (Hive) than, the best hand-written left-deep query plans.
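The pilot-run idea can be sketched in a few lines of Python; the UDF, table, and sample size below are hypothetical stand-ins, but they show how running an opaque predicate over a sample yields a selectivity the optimizer can plug into plan costing.

```python
import random

def pilot_run_selectivity(table, predicate, sample_size=1000, seed=42):
    """'Pilot run' in the spirit of the talk: execute a (possibly
    opaque, UDF-based) predicate over a small sample of the data and
    use the observed pass rate as the selectivity estimate for a
    cost-based optimizer."""
    rng = random.Random(seed)
    sample = rng.sample(table, min(sample_size, len(table)))
    passed = sum(1 for row in sample if predicate(row))
    return passed / len(sample)

# hypothetical opaque UDF the optimizer cannot reason about statically
def udf(row):
    return row % 7 == 0

table = list(range(100_000))
sel = pilot_run_selectivity(table, udf)   # true selectivity is ~1/7
```

Because the predicate is executed rather than analyzed, this works even when the UDF is a black box to the optimizer.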
Real-Time Analysis of Streaming Synchrotron Data: SCinet SC19 Technology Chall... (Globus)
This project, which involved streaming light source data from the SC19 show floor to Argonne’s Leadership Computing Facility (ALCF) outside Chicago, won the top prize at the inaugural SCinet Technology Challenge at SC19 in Denver, CO.
These slides are from a recent talk I gave at Lawrence Livermore Labs.
The talk gives an architectural outline of the MapR system and then discusses how this architecture facilitates large scale machine learning algorithms.
We live in an era where the atomic building elements of silicon computers, e.g., transistors and wires, are no longer visible using traditional optical microscopes and their sizes are measured in just tens of Angstroms. In addition, power dissipation per unit volume is bounded by the laws of physics, which, among other things, has resulted in stagnating processor clock frequencies. Adding more and more processor cores that perform simpler and simpler tasks, in an attempt to efficiently fill the available on-chip area, seems to be the current trend taken by the industry.
Data Analytics and Simulation in Parallel with MATLAB* (Intel® Software)
This talk covers the current parallel capabilities in MATLAB*. Learn about its parallel language and distributed and tall arrays. Interact with GPUs both on the desktop and in the cluster. Combine this information into an interesting algorithmic framework for data analysis and simulation.
Optimization of Continuous Queries in Federated Database and Stream Processin... (Zbigniew Jerzak)
The constantly increasing number of connected devices and sensors results in increasing volume and velocity of sensor-based streaming data. Traditional approaches for processing high-velocity sensor data rely on stream processing engines. However, the increasing complexity of continuous queries executed on top of high-velocity data has resulted in growing demand for federated systems composed of data stream processing engines and database engines. One of the major challenges for such systems is to devise the optimal query execution plan to maximize the throughput of continuous queries.
In this paper we present a general framework for federated database and stream processing systems, and introduce the design and implementation of a cost-based optimizer for optimizing relational continuous queries in such systems. Our optimizer uses characteristics of continuous queries and source data streams to devise an optimal placement for each operator of a continuous query. This fine level of optimization, combined with the estimation of the feasibility of query plans, allows our optimizer to devise query plans which result in 8 times higher throughput as compared to the baseline approach which uses only stream processing engines. Moreover, our experimental results showed that even for simple queries, a hybrid execution plan can result in 4 times and 1.6 times higher throughput than a pure stream processing engine plan and a pure database engine plan, respectively.
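A toy version of the operator-placement search can be sketched as follows. The engines, cost numbers, and transfer penalty are invented for illustration, and the paper's optimizer additionally estimates plan feasibility against stream rates; this only shows the shape of the search space for hybrid plans.

```python
from itertools import product

def best_placement(operators, cost, transfer_cost):
    """Exhaustive search over hybrid plans: assign each operator of a
    continuous query to either the stream engine ('SPE') or the
    database engine ('DB'), charging per-operator execution cost plus
    a transfer penalty whenever consecutive operators change engines."""
    engines = ("SPE", "DB")
    best = None
    for plan in product(engines, repeat=len(operators)):
        total = sum(cost[op][eng] for op, eng in zip(operators, plan))
        total += transfer_cost * sum(
            1 for a, b in zip(plan, plan[1:]) if a != b)
        if best is None or total < best[0]:
            best = (total, plan)
    return best

# hypothetical per-operator costs on each engine
ops = ["filter", "window_agg", "join"]
cost = {"filter": {"SPE": 1, "DB": 5},
        "window_agg": {"SPE": 2, "DB": 8},
        "join": {"SPE": 9, "DB": 3}}
total, plan = best_placement(ops, cost, transfer_cost=2)
```

Even in this tiny example, the cheapest plan is a hybrid (early operators on the stream engine, the join on the database engine), echoing the paper's finding that hybrid plans can beat either engine alone.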
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez (DataWorks Summit)
Last year at Yahoo, we invested significant effort in scaling and stabilizing Pig on Tez and making it production ready, and by the end of the year we retired running Pig jobs on MapReduce. This talk will detail the performance and resource utilization improvements Yahoo achieved after migrating all Pig jobs to run on Tez.
After the successful migration and the improved performance, we shifted our focus to addressing some of the bottlenecks we identified and to new optimization ideas for making it go even faster. We will go over the new features and work done in Tez to make that happen, such as a custom YARN ShuffleHandler, reworked DAG scheduling order, serialization changes, etc.
We will also cover exciting new features that were added to Pig for performance such as bloom join and byte code generation. A distributed bloom join that can create multiple bloom filters in parallel was straightforward to implement with the flexibility of Tez DAGs. It vastly improved performance and reduced disk and network utilization for our large joins. Byte code generation for projection and filtering of records is another big feature that we are targeting for Pig 0.17 which will speed up processing by reducing the virtual function calls.
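The core of a bloom join can be sketched in plain Python. This is a simplified single-process stand-in for Pig's implementation, which builds the filters in parallel across Tez vertices, but it shows why the technique cuts disk and network traffic: most non-matching rows are dropped by a cheap bit-array probe before the real join.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash probes into a bit array; it may
    yield false positives but never false negatives."""
    def __init__(self, num_bits=8192, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

def bloom_join(big, small):
    """Sketch of the idea behind Pig's bloom join: build a filter over
    the small side's keys, prune the big side with it, and only pay
    full join cost for rows that might match."""
    bf = BloomFilter()
    small_by_key = {}
    for key, val in small:
        bf.add(key)
        small_by_key.setdefault(key, []).append(val)
    out = []
    for key, val in big:
        if bf.might_contain(key):                   # cheap pre-filter
            for sval in small_by_key.get(key, []):  # exact join
                out.append((key, val, sval))
    return out
```

False positives from the filter are harmless here because the exact lookup still runs; the filter only saves work, never changes the result.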
Stream processing from single node to a cluster (Gal Marder)
Building data pipelines shouldn't be so hard; you just need to choose the right tools for the task.
We will review Akka and Spark Streaming: how they work, how to use them, and when.
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around Comes Around (Reynold Xin)
Introduction to MapReduce, GFS, HDFS, Spark, and differences between "Big Data" and database systems.
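The canonical MapReduce example, word count, fits in a few lines of Python and shows the map/shuffle/reduce split the lecture introduces (a single-process sketch; the real frameworks distribute each phase across machines):

```python
from collections import defaultdict
from itertools import chain

def map_phase(documents):
    """Map: emit (word, 1) pairs from each input document."""
    return chain.from_iterable(
        ((word, 1) for word in doc.split()) for doc in documents)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: aggregate each key's list of values."""
    return {word: sum(counts) for word, counts in grouped.items()}

# word count, the "hello world" of MapReduce
docs = ["big data systems", "big data", "database systems"]
counts = reduce_phase(shuffle(map_phase(docs)))
```

The contrast with database systems that the lecture draws is that here the grouping and aggregation logic is user code rather than a declarative query the engine can optimize.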
Java Thread and Process Performance for Parallel Machine Learning on Multicor... (Saliya Ekanayake)
The growing use of Big Data frameworks on large machines highlights the importance of performance issues and the value of High Performance Computing (HPC) technology. This paper looks carefully at three major frameworks, Spark, Flink and Message Passing Interface (MPI), both in scaling across nodes and internally over the many cores inside modern nodes. We focus on the special challenges of the Java Virtual Machine (JVM) using an Intel Haswell HPC cluster with 24 cores per node. Two parallel machine learning algorithms, K-Means clustering and Multidimensional Scaling (MDS), are used in our performance studies. We identify three major issues – thread models, affinity patterns, and communication mechanisms – that affect performance by large factors, and show how to optimize them so that Java can match the performance of traditional HPC languages like C. Further, we suggest approaches that preserve the user interface and elegant dataflow approach of Flink and Spark but modify the runtime so that these Big Data frameworks can achieve excellent performance and realize the goals of HPC-Big Data convergence.
Thorny path to the Large-Scale Graph Processing (Highload++, 2014) (Alexey Zinoviev)
Alexey Zinoviev presented this talk at the Highload++ conference: http://www.highload.ru/2014/abstracts/1516.html
This talk covers the following topics: Pregel, Graph Theory, Giraph, Okapi, GraphX, GraphChi, Spark, Shortest Path Problem, Road Network, Road Graph.
Network-aware Data Management for Large Scale Distributed Applications, IBM R... (balmanme)
IBM Research – Talk – June 24, 2015
Title:
Network-aware Data Management for Large Scale Distributed Applications
Abstract:
As current technology enables faster storage devices and larger interconnect bandwidth, there is a substantial need for novel system design and middleware architecture to address increasing latency, scalability, and throughput requirements. In this talk, I will outline network-aware data management and present solutions based on my past experience in large-scale data migration between remote repositories.
I will first describe my experience in the initial evaluation of 100Gbps network as a part of the Advance Network Initiative project. We needed intense fine-tuning in network, storage, and application layers, to take advantage of the higher network capacity. End-system bottlenecks and system performance play an important role especially in many-core platforms. I will introduce a special data movement prototype, successfully tested in one of the first 100Gbps demonstrations, in which applications map memory blocks for remote data, in contrast to the send/receive semantics. This prototype was used to stream climate data over wide-area for in-memory application processing and visualization.
Within this scope, I will introduce a flexible network reservation algorithm for on-demand bandwidth-guaranteed virtual circuit services. Flexible reservations find the best path in a time-dependent dynamic network topology to support predictable application performance. I will then present a data-scheduling model with advance provisioning, in which data movement operations are defined with earliest start and latest completion times.
I will conclude my talk with a very brief overview of my other related projects on performance engineering, hyper-converged virtual storage, and optimization in control and data path for virtualized environments.
A Lightweight Infrastructure for Graph Analytics (Donald Nguyen)
Several domain-specific languages (DSLs) for parallel graph analytics have been proposed recently. In this paper, we argue that existing DSLs can be implemented on top of a general-purpose infrastructure that (i) supports very fine-grain tasks, (ii) implements autonomous, speculative execution of these tasks, and (iii) allows application-specific control of task scheduling policies. To support this claim, we describe such an implementation called the Galois system.
We demonstrate the capabilities of this infrastructure in three ways. First, we implement more sophisticated algorithms for some of the graph analytics problems tackled by previous DSLs and show that end-to-end performance can be improved by orders of magnitude even on power-law graphs, thanks to the better algorithms facilitated by a more general programming model. Second, we show that, even when an algorithm can be expressed in existing DSLs, the implementation of that algorithm in the more general system can be orders of magnitude faster when the input graphs are road networks and similar graphs with high diameter, thanks to more sophisticated scheduling. Third, we implement the APIs of three existing graph DSLs on top of the common infrastructure in a few hundred lines of code and show that even for power-law graphs, the performance of the resulting implementations often exceeds that of the original DSL systems, thanks to the lightweight infrastructure.
Similar to Partitioning SKA Dataflows for Optimal Graph Execution:
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a... (Ana Luísa Pinho)
Functional Magnetic Resonance Imaging (fMRI) provides means to characterize brain activations in response to behavior. However, cognitive neuroscience has been limited to group-level effects referring to the performance of specific tasks. To obtain the functional profile of elementary cognitive mechanisms, the combination of brain responses to many tasks is required. Yet, to date, both structural atlases and parcellation-based activations do not fully account for cognitive function and still present several limitations. Further, they do not adapt overall to individual characteristics. In this talk, I will give an account of deep-behavioral phenotyping strategies, namely data-driven methods in large task-fMRI datasets, to optimize functional brain-data collection and improve inference of effects-of-interest related to mental processes. Key to this approach is the employment of fast multi-functional paradigms rich in features that can be well parametrized and, consequently, facilitate the creation of psycho-physiological constructs to be modelled with imaging data. Particular emphasis will be given to music stimuli when studying high-order cognitive mechanisms, due to their ecological nature and their capacity to enable complex behavior compounded by discrete entities. I will also discuss how deep-behavioral phenotyping and individualized models applied to neuroimaging data can better account for the subject-specific organization of domain-general cognitive systems in the human brain. Finally, the accumulation of functional brain signatures brings the possibility to clarify relationships among tasks and create a univocal link between brain systems and mental functions through: (1) the development of ontologies proposing an organization of cognitive processes; and (2) brain-network taxonomies describing functional specialization.
To this end, tools to improve commensurability in cognitive science are necessary, such as public repositories, ontology-based platforms and automated meta-analysis tools. I will thus discuss some brain-atlasing resources currently under development, and their applicability in cognitive as well as clinical neuroscience.
Introduction:
RNA interference (RNAi) or Post-Transcriptional Gene Silencing (PTGS) is an important biological process for modulating eukaryotic gene expression.
It is a highly conserved process of post-transcriptional gene silencing in which double-stranded RNA (dsRNA) causes sequence-specific degradation of mRNA sequences.
dsRNA-induced gene silencing (RNAi) has been reported in a wide range of eukaryotes, including worms, insects, mammals and plants.
This process mediates resistance to both endogenous parasitic and exogenous pathogenic nucleic acids, and regulates the expression of protein-coding genes.
What are small ncRNAs?
micro RNA (miRNA)
short interfering RNA (siRNA)
Properties of small non-coding RNA:
Involved in silencing mRNA transcripts.
Called “small” because they are usually only about 21-24 nucleotides long.
Synthesized by first cutting up longer precursor sequences (like the 61nt one that Lee discovered).
Silence an mRNA by base pairing with some sequence on the mRNA.
Discovery of siRNA?
The first small RNA:
In 1993, Rosalind Lee (Victor Ambros lab) was studying a non-coding gene in C. elegans, lin-4, that was involved in silencing of another gene, lin-14, at the appropriate time in the development of the worm.
Two small transcripts of lin-4 (22 nt and 61 nt) were found to be complementary to a sequence in the 3' UTR of lin-14.
Because lin-4 encoded no protein, she deduced that it must be these transcripts that were causing the silencing through RNA-RNA interactions.
Types of RNAi (non-coding RNAs):
miRNA: 23-25 nt long; trans-acting; binds its target mRNA with mismatches; inhibits translation.
siRNA: 21 nt long; cis-acting; binds its target mRNA at a perfectly complementary sequence.
piRNA: 25-36 nt long; expressed in germ cells; regulates transposon activity.
MECHANISM OF RNAi:
First the double-stranded RNA teams up with a protein complex named Dicer, which cuts the long RNA into short pieces.
Then another protein complex called RISC (RNA-induced silencing complex) discards one of the two RNA strands.
The RISC-docked, single-stranded RNA then pairs with the homologous mRNA and destroys it.
THE RISC COMPLEX:
RISC is a large (>500 kD) multi-protein RNA-binding complex that triggers degradation of the target mRNA.
The double-stranded siRNA is unwound by an ATP-independent helicase.
The active component of RISC is the Argonaute (Ago) protein, an endonuclease that cleaves the target mRNA.
DICER: an endonuclease (RNase III family).
Argonaute: the central component of the RNA-Induced Silencing Complex (RISC).
One strand of the dsRNA produced by Dicer is retained in the RISC complex in association with Argonaute.
ARGONAUTE PROTEIN :
1.PAZ(PIWI/Argonaute/ Zwille)- Recognition of target MRNA
2.PIWI (p-element induced wimpy Testis)- breaks Phosphodiester bond of mRNA.)RNAse H activity.
miRNA:
Double-stranded RNAs are naturally produced in eukaryotic cells during development and play a key role in regulating gene expression.
3. SKA Science Data Processor (SDP) high-level dataflow
[Dataflow figure] Data ingestion at 0.5 TB/s per site → Data management → Data processing at 130 PFlops per site → Data analysis and visualisation
4. SKA Data Challenges
• Multiple concurrent observing projects
• Data sharing between projects
• Capital and operational budget limited
• Power, Cooling
• Acquisition, maintenance & software development costs
• Throughput: produce ~0.2-10 Tera Voxels/second
• Automatic 24/7 type of operation
• Data parallelism: Millions of related tasks on thousands of nodes
5. Data deluge

Telescope | Raw Data Rate | Archive Growth
MWA       | 1.4 TB/hour   | 5 PB/year
LSST      | 1.5 TB/hour   | 6 PB/year
ASKAP     | 9 TB/hour     | 5.5 PB/year
SKA1-Low  | 1,400 TB/hour | 150 PB/year
arxiv.org/abs/1702.07617
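The gap between raw rate and archive growth is worth a quick back-of-the-envelope check (the calculation below assumes continuous observing; the actual duty cycle is not stated here):

```python
# Back-of-the-envelope: raw data per year vs. archived data, using the
# rates in the table above. Assumes observing runs nonstop (an
# assumption; real duty cycles are lower).
HOURS_PER_YEAR = 24 * 365  # 8760

def raw_pb_per_year(tb_per_hour):
    """Raw data volume in PB/year if the instrument ran continuously."""
    return tb_per_hour * HOURS_PER_YEAR / 1000.0

# SKA1-Low: 1,400 TB/hour raw, but only 150 PB/year archived
raw = raw_pb_per_year(1400)   # roughly 12,000 PB/year of raw data
reduction = raw / 150         # ~80x reduction needed before archiving
```

Even with a generous duty-cycle discount, the data must be reduced by well over an order of magnitude on the fly, which is what motivates the 130 PFlops processing stage.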
6. DALiuGE
• Defined once, executed anywhere (well)
– Separation
– Coherence
• Work with existing software components
• Extended dataflow model
– Unlock "hidden" parallelism
– Data is given autonomy
• Decentralised execution via event propagation
• Built-in data lifecycle management
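The "decentralised execution via event propagation" idea can be sketched in a few lines (this is an illustration of the concept, not the DALiuGE API): each node fires as soon as all of its parents have delivered completion events, so no central scheduler loop is needed.

```python
# Minimal sketch of event-propagation dataflow execution (illustrative,
# not the DALiuGE API): a node runs once all parents have completed,
# then notifies its children -- there is no central scheduler loop.
class Drop:
    def __init__(self, name, parents=()):
        self.name = name
        self.pending = len(parents)   # completion events still awaited
        self.children = []
        for p in parents:
            p.children.append(self)

    def on_event(self, log):
        self.pending -= 1
        if self.pending == 0:         # all inputs ready: fire
            self.run(log)

    def run(self, log):
        log.append(self.name)         # stand-in for real work
        for child in self.children:   # propagate completion event
            child.on_event(log)

# Diamond graph: A -> (B, C) -> D
log = []
a = Drop("A")
b = Drop("B", [a]); c = Drop("C", [a]); d = Drop("D", [b, c])
a.run(log)   # triggering the source ripples through the whole graph
```

Note that D runs exactly once, only after both B and C have completed; this is the "data is given autonomy" idea in miniature.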
7. Related work
• Dataflow (DAG) computation model [7]
– Unlock "hidden" parallelism
• DAG mapping (QAP) is a hard problem [5]
• Exact solutions
– Assignment graph [2], allocation graph [19] (max flow)
– O(|V| * P) → works on small graphs on small clusters
• Heuristics
– One-phase (HEFT) [18]
• Direct mapping from Ranked List A to Ranked List B
– Two-phase [13, 16]:
• (1) Partitioning (offline)
• (2) Mapping (online)
8. Related work
• Resource Demand Abstraction (RDA)
– Aggregated workload “per partition”
– Estimates and capacity planning
• Existing two-phase methods mostly
– multi-processors on a single node
– We need multi-level scheduling/mapping
• Goal ≠ Maximum parallelism
– Resource footprint vs. execution latency
• Graph partitioning vs. Dataflow partitioning
– [1, 5, 20] vs. [16],…
– dataflows vs. long running MPI processes
[Figure: example dataflow DAG with vertices A-H]
10. Partition problem
M(·) is a function that outputs the number of partitions M, given a PGT and a solution p.
T(·) is a function that outputs the completion time T, given a PGT and a partition solution p.
Ri(t) denotes the aggregated resource demand from all running Drops in partition i at time t.
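Putting these definitions together, the partitioning problem can plausibly be stated as the following optimisation (a reconstruction from the slide's definitions; $C_i$, the resource capacity of partition $i$, is an assumed symbol not named on the slide):

```latex
\min_{p} \; T(\mathrm{PGT}, p)
\quad \text{subject to} \quad
R_i(t) \le C_i , \qquad
\forall\, i \in \{1, \dots, M(\mathrm{PGT}, p)\},\ \forall\, t .
```

That is: choose the partition solution p that minimises completion time, while the aggregated demand of the Drops running in each partition never exceeds that partition's capacity.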
12. Stochastic Local Search Heuristics
– Meta-Heuristics
• Particle Swarm Optimisation
• Genetic algorithm
– Statistical mechanics
• Simulated annealing (MCMC)
• Mean field annealing
• Constraints-based Local Search
• Reinforcement learning (MDP)
– Monte Carlo Tree Search
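To give a flavour of the meta-heuristics listed above, here is a minimal simulated-annealing sketch applied to a toy balanced-partitioning cost (a generic illustration with an invented cost function, not the algorithm evaluated on the next slide):

```python
import math
import random

def simulated_annealing(cost, neighbour, state, t0=1.0, cooling=0.995, steps=2000):
    """Generic simulated annealing: always accept improving moves,
    accept worsening moves with probability exp(-delta/T), and cool T
    geometrically. Returns the best state seen and its cost."""
    rng = random.Random(0)            # fixed seed for reproducibility
    best, best_cost = state, cost(state)
    cur, cur_cost, t = state, best_cost, t0
    for _ in range(steps):
        cand = neighbour(cur, rng)
        delta = cost(cand) - cur_cost
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            cur, cur_cost = cand, cur_cost + delta
            if cur_cost < best_cost:
                best, best_cost = cur, cur_cost
        t *= cooling
    return best, best_cost

# Toy use: split 8 weighted tasks into 2 partitions, minimising the
# imbalance between the two partitions' total weights.
weights = [5, 9, 1, 7, 3, 8, 2, 6]

def cost(s):  # s is a tuple of 0/1 partition labels
    side1 = sum(w for w, p in zip(weights, s) if p)
    return abs(sum(weights) - 2 * side1)

def neighbour(s, rng):  # flip one task to the other partition
    i = rng.randrange(len(s))
    return s[:i] + (1 - s[i],) + s[i + 1:]

best, best_cost = simulated_annealing(cost, neighbour, (0,) * 8)
```

The same skeleton applies to graph partitioning by swapping in a cost that measures completion time and a neighbour move that migrates a Drop between partitions.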
Comparison on LOFAR Imaging (no deadline, DoP = 4):

Method                                      | Min Cost | # Parts | Run Time
Direct heuristics (edge zeroing)            | 403      | 50      | 3
Particle Swarm Optimisation                 | 423      | 57      | 5
Simulated annealing                         | 713      | 73      | 64
Monte Carlo Tree Search (250 ms "thinking") | 403      | 51      | 57
Monte Carlo Tree Search (150 ms "thinking") | 408      | 52      | 35

(cf. AlphaGO)
Partition algorithm (WIP: less greedy)
13. Partitioning constraint (DoP)
• How to preserve constraints → graph theory to the rescue!
– Brute force does not work well due to the huge number of antichains
– Dilworth's theorem (normal antichain)
• Let bpg = bipartite_graph(DAG)
• DoP == poset width == len(max_antichain) == len(min_num_chain) == cardinality(dag) - len(max_matching(bpg))
– Maximum weighted k-families (weighted antichain)
• Split graph → admissible graph → residual graph (using maxflow) → Pi
• Drops that satisfy a Pi equation are in the maximum weighted antichain
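The Dilworth route above can be made concrete in a few lines (an illustrative pure-Python sketch, not the production code): the DoP (poset width) equals |V| minus the size of a maximum matching in the bipartite graph built on the DAG's transitive closure.

```python
# Illustrative: DoP (poset width) of a DAG via Dilworth's theorem.
# width = |V| - max matching in the bipartite "split graph" whose
# edges are the reachability relation (transitive closure) of the DAG.
def transitive_closure(n, edges):
    reach = [set() for _ in range(n)]
    for u, v in edges:
        reach[u].add(v)
    for k in range(n):               # Warshall-style closure
        for u in range(n):
            if k in reach[u]:
                reach[u] |= reach[k]
    return reach

def max_bipartite_matching(n, adj):
    """Kuhn's augmenting-path algorithm; adj[u] = right nodes of u."""
    match = [-1] * n                 # match[v] = left node matched to v

    def augment(u, seen):
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                if match[v] == -1 or augment(match[v], seen):
                    match[v] = u
                    return True
        return False

    return sum(augment(u, set()) for u in range(n))

def dag_width(n, edges):
    """Maximum antichain size (== DoP) of a DAG on vertices 0..n-1."""
    reach = transitive_closure(n, edges)
    return n - max_bipartite_matching(n, reach)
```

For the diamond DAG A→{B,C}→D this gives width 2 ({B, C} is the largest antichain), matching the len(max_antichain) identity on the slide.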
20. Graph execution on Tianhe-2
70K Drops running on 500 compute nodes at the Tianhe-2 supercomputer for a "simulated" LOFAR imaging run.
Legend: Gray – Drops not yet started; Yellow – Drops being executed; Green – Drops with completed executions; Red – Drops failed.
21. Summary
• SKA Dataflows
• Related work
• Graph execution engine → DALiuGE
• Partitioning problem
• Partitioning algorithm (current + WIP)
• Partitioning constraint → DoP
• Case study and preliminary results