This document discusses parallel computing and cloud computing. It describes how compute-intensive bioinformatics tasks like OTU picking that take weeks on a desktop can be accelerated by distributing the workload across many processors. Cloud computing provides pay-as-you-go access to large compute clusters without the overhead of maintaining physical hardware. Public clouds like Amazon allow users to provision virtual machines for running analyses and terminate them when finished.
Slides for my talk at SkyCon'12 in Limerick.
Here I've squeezed four talks into one, covering a lot of ground quickly, so I've included links to more detailed presentations and other resources.
Unraveling mysteries of the Universe at CERN, with OpenStack and Hadoop (Piotr Turek)
I will talk about the challenges faced, lessons learned, and fun I had while reinventing the way offline data analysis is done at one of the LHC (Large Hadron Collider) experiments: a journey that took us to another land, that of the contemporary Big Data stack, and finally married the two. Did it make any sense in the end? Come and you will find out.
Among other things you will learn:
• the why, what and how of data analysis at CERN
• why latency variability in large distributed systems matters (literally ;))
• why using C++ as a scripting language is both the best and the worst idea ever
• how to implement a reliable Hadoop cluster provisioning mechanism on OpenStack
• how to marry a huge data analysis framework written in C++, with Hadoop 2
• what is the moral of this story
Building a queueing system in MongoDB and monitoring your cluster. Presentation by David Mytton at MongoSF May 2011 and MongoDB London User Group July 2011.
(PFC302) Performance Benchmarking on AWS | AWS re:Invent 2014 (Amazon Web Services)
In this session, we explain how to measure the key performance-impacting metrics in a cloud-based application, and best practices for a reliable benchmarking process. Measuring the performance of applications correctly can be challenging, and there are many tools available to measure and track performance. This session will provide you with specific examples of good and bad tests. We make it clear how to get reliable measurements and how to map benchmark results to your application. We also cover the importance of selecting tests wisely, repeating tests, and measuring variability. In addition, a customer will provide real-life examples of how they developed their application testing stack, use it for repeatable testing, and identify bottlenecks.
Down to Stack Traces, up from Heap Dumps (Andrei Pangin)
Deeper than stack traces, wider than heap dumps
A stack trace and a heap dump are not just debugging tools; they are hidden doors into the deepest internals of the Java virtual machine. This talk covers little-known features of the JDK related, one way or another, to heap walking and thread stacks.
We will cover:
- how to take dumps in production without side effects;
- how the jmap and jstack utilities work internally, and what the trick behind forced mode is;
- why all profilers lie, and how to fight it;
- the new Stack-Walking API in Java 9;
- how to scan the heap using JVMTI;
- undocumented HotSpot functions and other interesting things.
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop (Databricks)
Tech-talk at Bay Area Apache Spark Meetup.
Apache Spark 2.0 will ship with the second-generation Tungsten engine. Building upon ideas from modern compilers and MPP databases, and applying them to data processing queries, we have started an ongoing effort to dramatically improve Spark’s performance and bring execution closer to bare metal. In this talk, we’ll take a deep dive into Apache Spark 2.0’s execution engine and discuss a number of architectural changes around whole-stage code generation and vectorization that have been instrumental in improving CPU efficiency and gaining performance.
FPGA-based 10G Performance Tester for HW OpenFlow Switch (Yutaka Yasuda)
SDN operators need to measure the performance of OpenFlow hardware switches on their own sites, because latency can differ by a factor of 1000 depending on the specified flow entry: an ASIC can forward in a few microseconds, but the software (CPU) path may take milliseconds. To protect yourself from unexpected performance drops, monitor the health of your switches on site.
Columnar processing for SQL-on-Hadoop: The best is yet to come (Wang Zuo)
Apache Parquet is an open-source file-format which arranges all of its data into columns – this is distinct from the traditional row-oriented layout, which stores entire rows consecutively. Columnar data offers lots of advantages to modern data engines – like Impala, Apache Spark, and Apache Flink – in terms of IO efficiency, but the full benefits of the format are yet to be realized.
We have been working with Intel to apply modern CPU instruction sets to the common programming tasks associated with querying data in Parquet format: decompression, predicate evaluation, and row-reconstruction. Our work has yielded significant speedups in standard query benchmarks running on Cloudera’s Impala SQL query engine, and very high speedups in targeted microbenchmarks.
In this talk we’ll describe the symbiosis between modern CPU architectures and the requirements of columnar data processing. We’ll show how vectorization – processing many items with a single instruction – is a widely applicable technique that can provide real performance benefits to all application frameworks that use columnar formats. We’ll present the changes that we have made to Impala’s ‘scanner,’ which reads Parquet data, and map out even more future enhancements.
This talk will be of interest to audiences interested in the internals of big data processing engines, or the impact of recent advances in modern CPU architectures.
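The vectorization idea the talk describes — processing many items with a single operation over a contiguous column — can be sketched with NumPy. This is an illustrative stand-in only: Impala's scanner does this with SIMD intrinsics in C++, not Python, and the data here is invented.

```python
import numpy as np

# Row-oriented layout: one Python object per row.
rows = [{"id": i, "price": float(i % 10)} for i in range(1000)]
row_total = sum(r["price"] for r in rows)       # one item at a time

# Columnar layout: the 'price' column is one contiguous array,
# so the sum is a single vectorized operation over many items.
prices = np.array([r["price"] for r in rows])
col_total = prices.sum()

assert row_total == col_total
```

The columnar version does the same arithmetic, but the contiguous array lets the runtime use tight loops (and, in compiled engines, SIMD instructions) instead of per-row object access.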
These days fast code needs to operate in harmony with its environment. At the deepest level this means working well with hardware: RAM, disks and SSDs. A unifying theme is treating memory access patterns in a uniform and predictable way that is sympathetic to the underlying hardware. For example writing to and reading from RAM and Hard Disks can be significantly sped up by operating sequentially on the device, rather than randomly accessing the data. In this talk we’ll cover why access patterns are important, what kind of speed gain you can get and how you can write simple high level code which works well with these kind of patterns.
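A toy sketch of the sequential-versus-random access point in Python. This is illustrative only: the hardware effect is far larger in languages with direct memory control, and Python's object overhead blurs it, so no timings are claimed here — the point is that the work is identical and only the access pattern differs.

```python
import random

data = list(range(1_000_000))

# Sequential access: indices visit the data in order, which is
# friendly to hardware prefetching and caches.
seq_sum = sum(data[i] for i in range(len(data)))

# Random access: same work, but the indices jump around.
random_indices = list(range(len(data)))
random.shuffle(random_indices)
rand_sum = sum(data[i] for i in random_indices)

# Same result either way; only the access pattern differs,
# and on real hardware that pattern dominates the run time.
assert seq_sum == rand_sum
```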
3. Robust searches
• Sometimes your queries will fail
– Won’t produce output (good)
– Will produce incorrect output (bad)
• Fail loudly! Produce a (useful) error message on failure.
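A minimal sketch of failing loudly in Python. The function name, the zero-record check, and the message are illustrative assumptions, not from the slides:

```python
def count_fasta_records(path):
    """Count FASTA records; fail loudly instead of returning a bogus answer."""
    count = 0
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                count += 1
    if count == 0:
        # Fail loudly with a useful message rather than silently returning 0.
        raise ValueError(f"No FASTA records found in {path}: "
                         "expected at least one line starting with '>'")
    return count
```

A silent `return 0` here would be the "incorrect output (bad)" case; the exception makes the failure impossible to miss.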
4. Designing robust searches
• Make assumptions explicit
– If you’re assuming that your records start with ‘>’, search for ‘^>’ to avoid matching ‘>’ characters that show up in other places
• Match full lines by including ^ and $ in your search query
• Check the number of matches that were made: is it reasonable?
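The anchoring advice can be sketched with Python's `re` module. The header lines are taken from the OTU-picking example later in these slides; the mid-line `>` is invented for illustration:

```python
import re

lines = [
    ">PC.634_1 FLP3FBN01ELBSX",   # a real header line
    "ACGT>ACGT",                  # '>' appearing mid-line (should not match)
    ">PC.634_2 FLP3FBN01EG8AX",
]

# Naive search: matches any '>' anywhere, including the bogus line.
naive = [l for l in lines if re.search(">", l)]

# Anchored search: '^>' makes the assumption explicit --
# headers start with '>'.
anchored = [l for l in lines if re.search("^>", l)]

# Sanity-check the number of matches: is it reasonable?
assert len(anchored) == 2
```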
5. Testing of software
• Start thinking about what positive and negative controls for these terms might look like. Software testing is something we’ll be discussing regularly throughout the semester.
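One way to read "positive and negative controls" in code: a test case known to satisfy the condition (positive control) and one known not to (negative control). A minimal sketch; the helper function is invented for illustration:

```python
import re

def is_fasta_header(line):
    """Return True if the line looks like a FASTA header (starts with '>')."""
    return re.search("^>", line) is not None

# Positive control: input known to satisfy the condition must pass.
assert is_fasta_header(">PC.634_1 FLP3FBN01ELBSX")

# Negative control: input known NOT to satisfy it must fail.
assert not is_fasta_header("ACGT>ACGT")
```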
7. Cluster computing
• Many computers connected to one another to serve as a larger compute resource.
• Compute-intensive jobs can be split over many systems and run in parallel.
• Similar to desktop compute hardware, but different casing, no (or only a few) displays/keyboards directly connected.
• Owned and maintained “in-house”.
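The same split-and-run-in-parallel idea can be sketched on a single machine with Python's multiprocessing; the GC-content task is an invented stand-in for a real compute-intensive job, and on a cluster the workers would be separate machines rather than local processes:

```python
from multiprocessing import Pool

def gc_content(seq):
    """Fraction of G/C bases in one sequence (one independent job)."""
    return (seq.count("G") + seq.count("C")) / len(seq)

if __name__ == "__main__":
    reads = ["ACGT", "GGCC", "ATAT", "GCGC"]
    # Split the work across worker processes and run the jobs in parallel.
    with Pool(processes=2) as pool:
        results = pool.map(gc_content, reads)
    assert results == [0.5, 1.0, 0.0, 1.0]
```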
8. Why is parallel computing important in bioinformatics?

Platform | Read length (bases) | Number of reads | Maximum number of samples per run | Sequences per $1 (sequencing costs only)
Sanger | ~1000 | 96 or 384 | n/a | 0.44
454 (Titanium) | ~400 | ~1,000,000 | 1000 | 100
Illumina Genome Analyzer II | 150 (single end) | ~100,000,000 | 12,000 (barcode-limited) | 5000
Illumina HiSeq2000 | 100 (single end) | ~1,600,000,000 | 24,000 (barcode-limited) | 200,000
Illumina MiSeq | 150 (single end); 250 soon | ~10,000,000 | 2500 (barcode-limited) | 12,500
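Reading two rows of the table together gives a feel for the scale involved. This back-of-the-envelope arithmetic is mine, not from the slides:

```python
# Values taken from the table above (approximate).
hiseq_reads = 1_600_000_000      # reads per HiSeq2000 run
hiseq_seqs_per_dollar = 200_000  # sequences per $1, sequencing costs only

# Approximate sequencing cost of one full HiSeq2000 run.
cost_per_run = hiseq_reads / hiseq_seqs_per_dollar
assert cost_per_run == 8000.0    # roughly $8,000 of sequencing per run
```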
10. OTU picking: example of a compute-intensive process

What taxa are represented in each sample?

[Figure: per-sample sequence reads are BLASTed against a reference tree of non-redundant full-length sequences. Outputs: clusters of “Operational Taxonomic Units” (OTUs); per-sample hits on the reference tree; taxonomic assignments.]

Example input reads:
>PC.634_1 FLP3FBN01ELBSX
CTGGGCCGTGTCTCAGTCCCAATGTGGCCGTTTACCCTCTCAGGCCGG
CTACGCATCATCGCCTTGGTGGGCCGTTACCTCACCAACTAGCTAATG
CGCCGCAGGTCCATCCATGTTCACGCCTTGATGGGCGCTTTAATATAC
TGAGCATGCGCTCTGTATACCTATCCGGTTTTAGCTACCGTTTCCAGC
AGTTATCCCGGACACATGGGCTAGG
>PC.634_2 FLP3FBN01EG8AX
TTGGACCGTGTCTCAGTTCCAATGTGGGGGCCTTCCTCTCAGAACCCC
TATCCATCGAAGGCTTGGTGGGCCGTTACCCCGCCAACAACCTAATGG
AACGCATCCCCATCGATGACCGAAGTTCTTTAATAGTTCTACCATGCG
GAAGAACTATGCCATCGGGTATTAATCTTTCTTTCGAAAGGCTATCCC
CGAGTCATCGGCAGGTTGGATACGTGTTACTCACCCGTGCGCCGGT
>PC.354_3 FLP3FBN01EEWKD
TTGGGCCGTGTCTCAGTCCCAATGTGGCCGATCAGTCTCTTAACTCGG
CTATGCATCATTGCCTTGGTAAGCCGTTACCTTACCAACTAGCTAATG
CACCGCAGGTCCATCCAAGAGTGATAGCAGAACCATCTTTCAAACTCT
AGACATGCGTCTAGTGTTGTTATCCGGTATTAGCATCTGTTTCCAGGT
GTTATCCCAGTCTCTTGGG
...
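A minimal sketch of reading records like those above in Python. This is a hand-rolled parser for illustration; real pipelines such as QIIME use library parsers:

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)   # emit the previous record
            header, seq = line[1:], []
        elif line:
            seq.append(line)                 # sequences may span lines
    if header is not None:
        yield header, "".join(seq)           # emit the final record

records = list(parse_fasta([
    ">PC.634_1 FLP3FBN01ELBSX",
    "CTGGGCCGTGTCTCAGTCC",
    "AATGTGGCC",
    ">PC.634_2 FLP3FBN01EG8AX",
    "TTGGACCGTGTCTCAGTT",
]))
assert [h.split()[0] for h, s in records] == ["PC.634_1", "PC.634_2"]
```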
12. OTU Picking
• For 1 billion sequence reads, the initial step ran for ~116 hours on 110 processors, requiring 4GB of RAM per job for workers and 64GB of RAM for the master job.
• So, on a single-processor desktop with 64GB of RAM… 12,760 hours, or 532 days!
• One HiSeq2000 generates this data in a week!
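The slide's arithmetic checks out; a quick verification of the numbers above:

```python
hours, processors = 116, 110

serial_hours = hours * processors   # total processor-hours of work
serial_days = serial_hours / 24     # the same work on one processor

assert serial_hours == 12760
assert round(serial_days) == 532    # ~1.5 years for one week of sequencer output
```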
15. Cloud computing
• Implemented on a cluster (or grid), but compute power is rented as a service to support arbitrary applications.
16. Maintaining hardware is expensive
• Temperature (redundant cooling systems)
• Redundant network connections
• Hardware maintenance (e.g., replacing hard drives)
• Fire suppression
• Back-up power
• System administrator ($$)
17. Pay-as-you-go compute power
• Public clouds (e.g., Amazon) rent compute resources
• Log in, boot virtual machine image, run analyses, and terminate instance.
• Cheaper for many tasks than buying, maintaining, and supporting a compute cluster.
18. Types of cloud offerings
• Applications/SaaS (e.g., Google Docs, gmail, Dropbox, iCloud)
• Computing platform/PaaS (e.g., Google App Engine)
• Raw compute resources/IaaS (e.g., Amazon Elastic Compute Cloud (EC2))
19. Cloud computing options
• Amazon Elastic Compute Cloud (EC2)
• Magellan – Argonne’s DOE Cloud Computing
• Data Intensive Academic Grid (DIAG) – Institute for Genome Sciences (IGS), University of Maryland School of Medicine (UMSOM)
20. Interacting with the Amazon Cloud
• Boot virtual machine image via web interface (or a third-party tool like StarCluster).
• Log in and work via terminal (or via web interface with IPython Notebook)
• Move data back and forth via sftp/scp or a graphical sftp client (e.g., Cyberduck [free/cross-platform])
21. Virtual machines
• A “guest” operating system running within a “host” operating system
• A software implementation of a computer that operates like a physical computer.
• A developer can create a virtual machine image which contains their tools pre-installed. Users can then instantiate that image to work with those tools.
Browse this page: http://en.wikipedia.org/wiki/Virtual_machine
22. Benefits that virtual machines offer bioinformatics
• Reproducibility: can publish protocols with a virtual machine instance id.
• Updates are the burden of the developer, not the user.
• Coupled with cloud computing, it’s the perfect model for users with sporadic compute needs.
24. QIIME virtual machine
• The QIIME package distributes an EC2 virtual machine with QIIME and its (many) dependencies pre-installed.
• Dependencies include commonly used tools like BLAST, muscle, FastTree, uclust, IPython, and a lot more. A partial list is available here: http://qiime.org/install/install.html
• Latest machine identifier can always be found at: http://qiime.org/home_static/dataFiles.html
25. “I think there is a world market for maybe five computers.”
- Thomas Watson, IBM Founder, 1943

26. [Chart: computer units sold by year; all figures are in units of 1000. http://jeremyreimer.com/postman/node/329]
27. The democratization of DNA sequencing
Affordable sequencing + Cloud computing + Open-source software
28. This work is licensed under the Creative Commons Attribution 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA. Feel free to use or modify these slides, but please credit me by placing the following attribution information where you feel that it makes sense: Greg Caporaso, www.caporaso.us.