Step‐by‐Step Evolution of Vertebrate Blood Coagulation Madhusudan Katti
Department of Biology, Consortium for Evolutionary Studies & Tri Beta Biological Honor Society, California State University, Fresno present:
Step‐by‐Step Evolution of Vertebrate Blood Coagulation
by
Dr. Russell F. Doolittle
Dept. Chemistry & Biochemistry and Molecular Biology , University of California, San Diego
Abstract
The availability of whole genome sequences for a variety of vertebrates is making it possible to reconstruct the step-by-step evolution of complex phenomena like blood coagulation, an event that in mammals involves the interplay of more than two dozen genetically encoded factors. Gene inventories for different organisms are revealing when during vertebrate evolution certain factors first made their appearance or, on occasion, disappeared from some lineages. The whole genome sequence databases of two protochordates and seven non-mammalian vertebrates were examined in search of some 20 genes known to be associated with blood clotting in mammals. No genuine orthologs were found in the protochordate genomes (sea squirt and amphioxus). As for vertebrates, although the jawless fish have genes for generating the thrombin-catalyzed conversion of fibrinogen to fibrin, they lack several clotting factors, including two thought to be essential for the activation of thrombin in mammals. Fish in general lack genes for the “contact factor” proteases, the predecessor forms of which make their first appearance in tetrapods. The full complement of factors known to be operating in humans doesn’t occur until pouched marsupials (opossum), at least one key factor still being absent in egg-laying mammals like platypus.
On: Friday, January 29, 2010
At: 3:00-4:00 PM
In: Science II, Room 109
Computational approaches to study Genetics | Arithmer Inc.
Slides for an Arithmer Seminar given by Dr. Jeffrey Fawcett (RIKEN) at Arithmer Inc.
The topic is how data science is used in genetics, especially in analyzing the thoroughbred gene pool.
The "Arithmer Seminar" is held weekly; professionals from within and outside our company lecture on their areas of expertise.
These slides were made by a lecturer from outside our company and are shared here with his/her permission.
Arithmer is a mathematics company that originated at the University of Tokyo Graduate School of Mathematical Sciences. We apply modern mathematics to bring new, advanced AI systems to solutions in a wide range of fields. Our job is to work out how to use AI well to make work more efficient and to produce results that are useful to people.
Comparative genomic analysis in Zingiberales: what can we learn from banana to enable Ensete and Boesenbergia to reach their potential?
Talk for Plant and Animal Genomics XXV 25 - San Diego January 2017
Trude Schwarzacher, Jennifer A. Harikrishna and Pat Heslop-Harrison, University of Leicester and University of Malaya
phh(a)molcyt.com
Within the Zingiberales there are many orphan crops that are grown in Africa and Asia where recently started genomic efforts will have an impact for the future understanding and breeding of these crops. Advanced genomics and genome knowledge of the taxonomically closely related genus Musa will help identify genes and their function. We will discuss relevant recent work with Musa and results from DNA sequencing, examinations of diversity and studies of genome structure, gene expression and epigenetic control in Boesenbergia and ensete. Ensete is an important starch staple food in Ethiopia. It is harvested just as the monocarpic plant starts to flower, a few years after planting, and the stored starch extracted from the pseudo-stem and corm. A genome sequence has been published, but there is little genomics. Characterization of the diversity in the species and understanding of the differences to Musa will enable selection and breeding for crop improvement to meet the requirements of increasing populations, climate change and environmental sustainability. Boesenbergia rotunda is widely used in traditional medicine in Asia and has been shown to produce secondary metabolites with antiviral activity. For high throughput propagation and metabolite production in vitro culture is employed; embryogenic calli of B. rotunda in vitro are able to regenerate into plants but lose this ability after prolonged periods in cell suspension media. Epigenetic factors, including histone modifications and DNA methylation are likely to play crucial roles in the regulation of genes involved in totipotency and plant regeneration. These findings are also relevant to other crops within the Zingiberales. Further details will be given at www.molcyt.com
I investigated the assumption that race and ancestry can be determined using DNA sequence analysis. I was able to present the results of my senior project at Luther College Research Symposium in April 2010.
A short presentation about Silva rRNA database. Silva provides comprehensive, quality checked and regularly updated datasets of aligned small (16S/18S, SSU) and large subunit (23S/28S, LSU) ribosomal RNA (rRNA) sequences for all three domains of life (Bacteria, Archaea and Eukarya).
Lightning fast genomics with Spark, Adam and Scala | Andy Petrella
We are at a time when biotech allows us to get personal genomes for $1000. Tremendous progress has been made in DNA sequencing since the 70s, e.g. more samples per experiment and more genomic coverage at higher speeds. Genomic analysis standards that have been developed over the years weren't designed with scalability and adaptability in mind. In this talk, we’ll present a game-changing technology in this area, ADAM, initiated by the AMPLab at Berkeley. ADAM is a framework based on Apache Spark and the Parquet storage format. We’ll see how it can speed up a sequence reconstruction by a factor of 150.
Human genome project and methods of sequencing the human genome; rice genome project and its post-genome-sequencing era; Arabidopsis genome project: why were rice and Arabidopsis chosen for genome projects?
Project Idea For our project, we will be focusing on the wastewat.docx | briancrawford30935
Project Idea
For our project, we will be focusing on the wastewater issues in Rio de Janeiro, Brazil. Rio de Janeiro was chosen to host the summer 2016 Olympics. Making sure that the water is safe enough for human contact will affect the athletes not only while they compete, but also during their stay in the city. Guanabara Bay, where the water events will take place, contains 84% of the city’s untreated sewage. Tests done by the International Olympic Committee (IOC) showed that the hazardous waste in Guanabara Bay is 1.7 million times more hazardous than what is considered hazardous in California. The International Olympic Committee will use the World Health Organization (WHO) guidelines to determine if the Guanabara Bay will be safe enough to host the water events or if a new venue will have to be used. In order to fix this problem in time for the Olympics, the water quality standards set by the World Health Organization need to be enforced and more funding from the government is needed.
Introduction to Animal Genetics
Animal Sciences 121
Origins of the Science of
Genetics
Earliest Theories
Pangenesis
Hippocrates and Aristotle,
The organism formed through sexual reproduction
“substance” from the egg and “form” from the seminal
fluid. Sperm and Egg from all parts of the body
each giving its own traits
Accepted by many scientists into late 19th Cent
(including Charles Darwin)
Darwin’s Idea
Origins of the Science of
Genetics
Earliest Theories
Preformationism
1694 Nicolaas Hartsoeker
Postulated the theory of
“Homunculi”
Completely formed miniature individual inside
sperm and egg cells (he never claimed to have actually
seen these “little men”)
Origins of the Science of
Genetics
Earliest Theories
Acquired Characteristics
Jean-Baptiste de Lamarck, 1800s
Use or disuse of organs, limbs, or other parts controlled
whether they were passed to offspring
Related to Pangenesis
http://upload.wikimedia.org/wikipedia/commons/f/f0/Jean-baptiste_lamarck2.jpg
Earliest Theories
Germplasm
late 1800s August Weismann
Sex cells are fundamentally different than
body cells – somatoplasm
1st major scientific challenge to Pangenesis
Mouse tails – cut off – offspring had normal tails
Origins of the Science of
Genetics
Genetics, as the
study of heredity and
its application to
animal agriculture,
had its practical
origin in peas.
Born 1822, in what is now the Czech Republic
Augustinian friar
1856 began experiments,
results presented to the
Brünn Society for Natural
History in 1865,
and published, 1866.
Gregor Johann
Mendel
Mendel’s Pea Plant Traits
3:1
http://upload.wikimedia.org/wikipedia/commons/1/15/August_Weismann.jpg
Mendel’s Laws
Segregation –
Each trait has two possibilities which
pass to offspring in a random but
predictable way 9:3:3:1
Mendel’s L.
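The 3:1 and 9:3:3:1 ratios quoted on these slides can be reproduced by enumerating a Punnett square. The sketch below is only an illustration: the genotype strings ("Aa", "AaBb") and the helper names are assumptions, not part of the slides. It treats an uppercase letter as the dominant allele and assumes the two genes assort independently, which is the situation in which the dihybrid 9:3:3:1 ratio is normally derived.

```python
from collections import Counter
from itertools import product


def gametes(genotype: str) -> list[str]:
    """List the gametes of a genotype such as "Aa" or "AaBb" (two alleles per gene)."""
    genes = [genotype[i:i + 2] for i in range(0, len(genotype), 2)]
    # One allele is drawn from each gene; genes are assumed to assort independently.
    return ["".join(combo) for combo in product(*genes)]


def cross(parent1: str, parent2: str) -> Counter:
    """Count offspring phenotypes over every cell of the Punnett square."""
    counts = Counter()
    for g1, g2 in product(gametes(parent1), gametes(parent2)):
        # A gene shows the dominant phenotype if either inherited allele is uppercase.
        phenotype = "".join(
            a.upper() if (a.isupper() or b.isupper()) else a.lower()
            for a, b in zip(g1, g2)
        )
        counts[phenotype] += 1
    return counts


monohybrid = cross("Aa", "Aa")      # one trait
dihybrid = cross("AaBb", "AaBb")    # two traits
print(dict(monohybrid))
print(dict(dihybrid))
```

Crossing two "Aa" parents gives three dominant phenotypes for every recessive one, and crossing two "AaBb" parents splits the sixteen cells of the square 9:3:3:1.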
Chromosomes, Crops and Superdomestication - Pat Heslop-Harrison Malaysia | Pat (JS) Heslop-Harrison
PUBLIC SEMINAR At Agro-Biotechnology Institute, ABI Serdang
Prof J. S. “Pat” Heslop-Harrison,
University of Leicester
Academic Icon, University of Malaya
Chromosomes, Crops and Superdomestication
Crop improvement is reliant on the exploitation of new biodiversity and new combinations of diversity. I will discuss our work on genome structure and evolution, involving processes including polyploidy, introgression, recombination and repetitive DNA changes. Identification and measurement of diversity and relationships assists in use of new gene combinations or new crops, through synthesizing new hybrid species, by chromosome engineering or by transgenic strategies. We are studying crops including wheat, Brassica and banana, using genome sequencing, repetitive sequence comparison, and cytogenetics. Plants, pathogens and farmers have been involved in a three-way fight since the start of agriculture, and the concept of superdomestication involves systematic identification of needs from crops, only then followed by finding appropriate characters and bringing them together in new varieties. Crops will continue to deliver the products needed for food, fuel and fibre in an increasingly sustainable and safe manner.
Forrester CXNYC 2017 - Delivering great real-time cx is a true craft | DataStax Academy
Companies today are innovating with real-time data to deliver truly amazing customer experiences in the moment. Real-time data management for real-time customer experience is core to staying ahead of competition and driving revenue growth. Join Trays to learn how Comcast is differentiating itself from its own historical reputation with Customer Experience strategies.
Introduction to DataStax Enterprise Graph Database | DataStax Academy
DataStax Enterprise (DSE) Graph is built to manage, analyze, and search highly connected data. DSE Graph, built on NoSQL Apache Cassandra, delivers continuous uptime along with predictable performance, and scales for modern systems dealing with complex and constantly changing data.
Download DataStax Enterprise: Academy.DataStax.com/Download
Start free training for DataStax Enterprise Graph: Academy.DataStax.com/courses/ds332-datastax-enterprise-graph
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra | DataStax Academy
DataStax Enterprise Advanced Replication supports one-way distributed data replication from remote database clusters that might experience periods of network or internet downtime. This benefits use cases that require a 'hub and spoke' architecture.
Learn more at http://www.datastax.com/2016/07/stay-100-connected-with-dse-advanced-replication
Advanced Replication docs – https://docs.datastax.com/en/latest-dse/datastax_enterprise/advRep/advRepTOC.html
Data Modeling is one of the first things to sink your teeth into when trying out a new database. That's why we are going to cover this foundational topic in enough detail for you to get dangerous. Data Modeling for relational databases is more than a touch different from the way it's approached with Cassandra. We will address the quintessential query-driven methodology through a couple of different use cases, including working with time series data for IoT. We will also demo a new tool to get you bootstrapped quickly with MovieLens sample data. This talk should give you the basics you need to get serious with Apache Cassandra.
Hear about how Coursera uses Cassandra as the core of its scalable online education platform. I'll discuss the strengths of Cassandra that we leverage, as well as some limitations that you might run into in practice.
In the second part of this talk, we'll dive into how best to use the Datastax Java drivers. We'll dig into how the driver is architected, and use this understanding to develop best practices to follow. I'll also share a couple of interesting bugs we've run into at Coursera.
Cassandra @ Sony: The good, the bad, and the ugly part 1 | DataStax Academy
This talk covers scaling Cassandra to a fast growing user base. Alex and Isaias will cover new best practices and how to work with the strengths and weaknesses of Cassandra at large scale. They will discuss how to adapt to bottlenecks while providing a rich feature set to the PlayStation community.
Cassandra @ Sony: The good, the bad, and the ugly part 2 | DataStax Academy
This talk covers scaling Cassandra to a fast growing user base. Alex and Isaias will cover new best practices and how to work with the strengths and weaknesses of Cassandra at large scale. They will discuss how to adapt to bottlenecks while providing a rich feature set to the PlayStation community.
The Art of the Pitch: WordPress Relationships and Sales | Laura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
GraphRAG is All You need? LLM & Knowledge Graph | Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf | Peter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
Communications Mining Series - Zero to Hero - Session 1 | DianaGray10
This session provides an introduction to UiPath Communication Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Removing Uninteresting Bytes in Software Fuzzing | Aftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speed up fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... | DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... | James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024 | Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Pushing the limits of ePRTC: 100ns holdover for 100 days | Adtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
DevOps and Testing slides at DASA Connect | Kari Kakkonen
Slides by Rik Marselis and me from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We closed with a lovely workshop in which the participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
GridMate - End to end testing is a critical piece to ensure quality and avoid... | ThomasParaiso2
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"
1. blueplastic.com/dna.pdf
Analyzing the human
genome/DNA with
Cassandra
BY SAMEER FAROOQUI
linkedin.com/in/blueplastic/
SAMEER@BLUEPLASTIC.COM
@blueplastic
http://youtu.be/ziqx2hJY8Hg
3. 3.3 billion base pairs
Thymine Cytosine
T A A C C C N A A C . . .
A T T G G G N T T G
Adenine Guanine
98.5 % of genome identical
(3 in 10,000 bases differ)
1 Human: 3 GB (uncompressed), 900 MB (compressed gz)
US population: 900 PB (uncompressed)
Earth population: 2.8 exabytes (uncompressed)
1 Exabyte = 1 million TB
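The per-person and US figures above follow from simple arithmetic, sketched below. The one-byte-per-base text encoding and the round figure of 300 million people are assumptions made for this illustration; the two-bit packing is shown only as a point of comparison and is not something the slides describe.

```python
BASE_PAIRS = 3_300_000_000      # 3.3 billion base pairs, from the slide
US_POPULATION = 300_000_000     # assumption: roughly 300 million people
GB_PER_HUMAN = 3                # the slide's rounded uncompressed figure


def bytes_as_text(n_bases: int) -> int:
    """One ASCII character (one byte) per base, as in an uncompressed text file."""
    return n_bases


def bytes_two_bit_packed(n_bases: int) -> int:
    """Two bits per base (A, C, G, T only; this packing cannot represent N)."""
    return (n_bases * 2 + 7) // 8


one_human_bytes = bytes_as_text(BASE_PAIRS)
packed_bytes = bytes_two_bit_packed(BASE_PAIRS)
# 1 PB = 1,000,000 GB in decimal units.
us_total_pb = GB_PER_HUMAN * US_POPULATION / 1_000_000

print(f"one human as text:   {one_human_bytes / 1e9:.1f} GB")
print(f"one human, 2 bits:   {packed_bytes / 1e6:.0f} MB")
print(f"US population total: {us_total_pb:.0f} PB")
```

Under these assumptions one genome takes about 3.3 GB as text, which the slide rounds to 3 GB, and 300 million such genomes come to 900 PB.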
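4. Humans: 46 Chromosomes
3.3 billion bp
[Figure: chromosomes inherited from Mom and Dad (XX / XY), with per-chromosome file-size labels 242 MB, 178 MB, 154 MB, 61 MB and 59 MB, and a label "4 billion bp"]
Y chromosome: 58 million base pairs (2% of total DNA)
X chromosome: 155 million base pairs (5% of total DNA)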
5. Why Chromosomes ??
Fruit Fly: 8 Ch, 165 million bp
Garden Snail: 54 Ch, 2 billion bp
Adder's Tongue Fern: 1,200 Chromosomes
Gorilla: 48 Ch, 3.4 billion bp
Elephant: 56 Ch, 5.8 billion bp
Onion: 16 Ch, ~18 billion bp, highly repetitive
6. Human Genome Project vs 1000 Genomes Project
Human Genome Project:
- ~15 year project: 1989 – 2003
- Sequenced 99% of the genome (400 gaps)
- >70% of the genome came from an anonymous male donor from Buffalo, New York (code name RP11)
- Cost about $3 billion
1000 Genomes Project:
- Launched Jan 2008
- Oct 2012: 1092 human genomes complete from 14 populations
- Goal: 2,500 sequences from 26 specific populations like: Han Chinese, Japanese, British, Colombian, Maratha/India, Punjabi/Pakistan, Finnish, African Americans
- Work done by 111 global institutions
- Cost about $40 million ($16,000 per person)
Download @ http://www.1000genomes.org/
7. - In 2010: 179 human genomes
- Discussed DNA from 2 families of:
Mother / Father / Child
- One of biology’s most cited papers in 2011
Link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3042601/
8. Download at : http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes
- Feb 2009 assembly of one human
genome (hg19)
- One gzip FASTA file per
chromosome
1) rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/ .
2) gunzip <file>.fa.gz
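Once a chromosome file has been downloaded, it can also be read without unpacking it first. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and it assumes the usual FASTA layout of a header line starting with ">" followed by lines of sequence.

```python
import gzip
from collections import Counter
from typing import Iterable


def count_bases(lines: Iterable[str]) -> Counter:
    """Count each base in FASTA-formatted lines, skipping '>' header lines."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            continue
        # Soft-masked (lowercase) bases are folded into their uppercase form.
        counts.update(line.strip().upper())
    return counts


def count_bases_in_gzip(path: str) -> Counter:
    """Stream a gzip-compressed FASTA file from disk without unpacking it."""
    with gzip.open(path, "rt") as handle:
        return count_bases(handle)


# Tiny in-memory example; a real file would go through count_bases_in_gzip(path).
example = [">chrExample\n", "TAACCCna\n", "acg\n"]
print(count_bases(example))
```

For a downloaded file, `count_bases_in_gzip` would be called with the path of one of the `<file>.fa.gz` files from the rsync step.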
9. Exploring DNA from your browser…
Chromosome 2
Gene: MCM6
SNP: rs4988235
Position: 136,608,646 bp from pter
A:T - Can digest milk
G:C - Lactose intolerance
http://useast.ensembl.org/Homo_sapiens/Location/Genome
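As a small illustration of how the SNP above could be read off a sequence, the sketch below looks up one base by its 1-based position and maps the resulting base pair to the two outcomes listed on the slide. The helper names are hypothetical, and the sketch ignores strand orientation and differences between reference assemblies, both of which matter in real genotyping.

```python
SNP_ID = "rs4988235"          # from the slide
POSITION_BP = 136_608_646     # from the slide, counted from pter on chromosome 2

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

# Base pairs are stored as unordered sets so that "A:T" and "T:A" look the same.
PHENOTYPE_BY_PAIR = {
    frozenset("AT"): "can digest milk",
    frozenset("GC"): "lactose intolerance",
}


def base_at(sequence: str, position_1_based: int) -> str:
    """Return the base at a 1-based position, as positions are quoted on the slide."""
    return sequence[position_1_based - 1].upper()


def describe_base(base: str) -> str:
    """Map the base observed at the SNP to the outcome named on the slide."""
    base = base.upper()
    pair = frozenset((base, COMPLEMENT[base]))
    return PHENOTYPE_BY_PAIR[pair]


print(SNP_ID, "A ->", describe_base("A"))
print(SNP_ID, "G ->", describe_base("G"))
```

With a full chromosome 2 sequence loaded as a string, `base_at(sequence, POSITION_BP)` would supply the base to pass to `describe_base`.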
10. T A A C C C T A A C C C T A A C C C T A A C C C T A A C C C
A T T G G G A T T G G G A T T G G G A T T G G G A T T G G G
Chromosome #1 : 250 million base pairs (across both C-pairs)
(8% of total DNA)
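[Figure: chromosome 1 ideogram running from pter along the P (short arm), through the centromere, along the Q (long arm) to qter; band numbers run from 36 down to 0 on the short arm and from 0 up to 43 on the long arm; labeled bands: 1p36.32, 1p31.1, 1q12, 1q42.2, 1q43]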
4,316 known genes
11. Compound Keys
Partition key : remaining keys
humanID:cell_type:parent
24 Column Families (22 pairs + X + Y): Chrom-1, Chrom-2, Chrom-3, ... Chrom-Y
humanID      cell_type  parent
595-36-0000  normal     mother
595-36-0000  normal     father
595-36-0000  cancer     mother
595-36-0000  cancer     father
595-36-1111  normal     mother
595-36-1111  normal     father
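The compound key on this slide can be pictured as a partition key (humanID) plus the remaining key parts (cell_type, parent), which order rows inside a partition. The sketch below models that in plain Python; it does not call Cassandra, and the class and function names are illustrative only.

```python
from collections import defaultdict
from typing import NamedTuple


class RowKey(NamedTuple):
    human_id: str    # partition key: decides which node owns the row
    cell_type: str   # remaining key, part 1
    parent: str      # remaining key, part 2


def partition_key(key: RowKey) -> str:
    return key.human_id


def remaining_key(key: RowKey) -> tuple[str, str]:
    return (key.cell_type, key.parent)


rows = [
    RowKey("595-36-0000", "normal", "mother"),
    RowKey("595-36-0000", "normal", "father"),
    RowKey("595-36-0000", "cancer", "mother"),
    RowKey("595-36-0000", "cancer", "father"),
    RowKey("595-36-1111", "normal", "mother"),
    RowKey("595-36-1111", "normal", "father"),
]

# Group rows by partition key, then keep each partition sorted by the remaining keys.
partitions: dict[str, list[tuple[str, str]]] = defaultdict(list)
for key in rows:
    partitions[partition_key(key)].append(remaining_key(key))
for entries in partitions.values():
    entries.sort()

for human_id, entries in partitions.items():
    print(human_id, entries)
```

The six rows collapse into two partitions, one per humanID, which is what lets all of one person's rows live together.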
12. XY
XX
Chrom-1 Chrom-2 Chrom-X Chrom-Y
humanID cell parent 1 2
595-36-0000 normal mother
595-36-0000 normal father
595-36-1111 normal mother
595-36-1111 normal father
Chrom-1 Column Family on disk
595-36-0000: [normal, mother, 1]: TAG | [normal, mother, 2]: GCC | [normal, father, 1]: TAG | [normal, father, 2]: GCC
595-36-1111: [normal, mother, 1]: TAG | [normal, mother, 2]: GCC | [normal, father, 1]: TAG | [normal, father, 2]: GCC
13. Partition based on humanID w/ Murmur3Partitioner
[Figure: token ring divided into ranges A, B, C and D]
Chrom-1 ... Chrom-Y
humanID      cell_type  parent
595-36-0000  normal     mother
595-36-0000  normal     father
595-36-0000  cancer     mother
595-36-0000  cancer     father
(all 595-36-0000 rows: send to range A)
595-36-1111  normal     mother
595-36-1111  normal     father
(all 595-36-1111 rows: send to range D)
Now it’s possible to do row range scans down the same humanID…
… and get all the DNA for human #1000
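The idea of hashing the partition key to pick a range can be sketched as follows. Note the assumptions: Python's standard library has no Murmur3, so MD5 from `hashlib` stands in for the hash function, and the four equal-sized ranges A to D are a simplification of a real ring. The range a given humanID lands in will therefore differ from a real cluster; the point is only that every row with the same humanID lands in the same range.

```python
import hashlib
from bisect import bisect_left

RANGES = ["A", "B", "C", "D"]
TOKEN_SPACE = 2 ** 64
# Upper token bound of each range, splitting the token space evenly.
UPPER_BOUNDS = [TOKEN_SPACE * (i + 1) // len(RANGES) for i in range(len(RANGES))]

def token_for(partition_key):
    """Hash a partition key to a 64-bit token (MD5 as a stand-in for Murmur3)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def range_for(partition_key):
    """Return the name of the range that owns this partition key."""
    return RANGES[bisect_left(UPPER_BOUNDS, token_for(partition_key))]

rows = [
    ("595-36-0000", "normal", "mother"),
    ("595-36-0000", "cancer", "father"),
    ("595-36-1111", "normal", "mother"),
]
# Only humanID is hashed, so cell_type and parent never change the placement.
placement = {row: range_for(row[0]) for row in rows}
```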
15. Conditions related to genes on Chromosome 1
- Alzheimer disease
- Neuroblastoma
- breast cancer
- color vision deficiency
- early-onset glaucoma
- Emery-Dreifuss muscular dystrophy
- Parkinson disease
17. What read queries do we want to perform?
Write once, read many times type of database
1) Give me the PSEN2 gene (25,000 sequential bp) for 2,000 people w/ Alzheimer's
2) Give me all of the humans who have the lactose intolerance SNP on Chrom-2
18. Translation: DNA -> Proteins
DNA (one codon = 3 bases):
T A A C C C T A A C C C T A A A C T
A T T G G G A T T G G G A T T T G A
Amino acids: Isoleucine, Glycine, Isoleucine, Glycine, Isoleucine, STOP
Protein: I G I G I (built from 20 different types of amino acids)
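Translation can be sketched in a few lines of Python. The table below is the standard genetic code written in the conventional T, C, A, G ordering, with `*` marking a stop codon; the `translate` function name is chosen here for illustration, and the sketch reads the strand exactly as given, without any strand or reading-frame detection.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate DNA codon by codon until a stop codon or the end of input."""
    dna = dna.replace(" ", "").upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":   # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)

# Bottom strand from the slide: ATT GGG ATT GGG ATT TGA
protein = translate("A T T G G G A T T G G G A T T T G A")
```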
20. Chrom-1: 125 million bp = 41 m cols
Columns run from the short arm (P, bands 36 to 0) through the centromere to the long arm (Q, bands 0 to 43), e.g. 1q43.7492932

humanID      cell_type  parent  1p36  1p35  ...  1p1  1p0  1q0      1q1  ...  1q43.7
595-36-0000  normal     mother  TAG   GCC        CAG  CAG  TCA      CTG  NNN  GAT
595-36-0000  normal     father  TAG   GCC        CAG  CAG  TAA      CTG  NNN  GAT
595-36-0000  cancer     mother  TAG   GCC        CAG  CAG  TCA      CTG  NNN  GAT
595-36-0000  cancer     father  TAG   GCC        CAG  CAG  ___      CTG  NNN  GAT
595-36-1111  normal     mother  TAG   GCC        CAG  CAG  TCC TCA  CTG  NNN  GAT
595-36-1111  normal     father  TAG   GCC        CAG  CAG  TCA      CTG  NNN  GAT
595-36-1111  normal     3rd     (Chrom-21)

Mutation type         Reference                  Mutated
(SNP) Point Mutation  TAG GCC CAG CAG TCA CTG    TAG GCC CAG CAG TAA CTG
Deletion Mutation     TAG GCC CAG CAG TCA CTG    TAG GCC CAG CAG ___ CTG
Insertion Mutation    TAG GCC CAG CAG TCA CTG    TAG GCC CAG CAG TCC TCA CTG
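The three mutation types above can be told apart with a simple codon-level comparison. This is a toy sketch under stated assumptions: sequences are space-separated codons, `___` marks a deleted codon as on the slide, and length differences alone decide between insertion and deletion. Real variant calling relies on sequence alignment, which this does not attempt; the function names are hypothetical.

```python
def codons(text):
    """Split a space-separated codon string, dropping '___' deletion markers."""
    return [codon for codon in text.split() if codon != "___"]

def classify_mutation(reference, sample):
    """Classify a sample against a reference by codon count and content."""
    ref, alt = codons(reference), codons(sample)
    if len(alt) < len(ref):
        return "deletion"
    if len(alt) > len(ref):
        return "insertion"
    differing = sum(1 for r, a in zip(ref, alt) if r != a)
    if differing == 0:
        return "no change"
    if differing == 1:
        return "point mutation (SNP)"
    return "multiple differences"

REFERENCE = "TAG GCC CAG CAG TCA CTG"
results = {
    sample: classify_mutation(REFERENCE, sample)
    for sample in (
        "TAG GCC CAG CAG TAA CTG",
        "TAG GCC CAG CAG ___ CTG",
        "TAG GCC CAG CAG TCC TCA CTG",
    )
}
```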
21. Excellent candidate for compression!
4x reduction in total data size + 35% faster reads
To detect SNPs: Create Secondary Index
Chrom-1 (same table as on slide 20)
To get all of the people with the SNP:
cqlsh:dna_table> SELECT humanID FROM chrom_1
WHERE "1q0" = 'TAA';
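What a secondary index buys can be sketched with an in-memory inverted index: instead of scanning every row, the value in the indexed column maps straight to the matching humanIDs. This is a plain-Python stand-in, not Cassandra's implementation; the row data is a toy subset of the 1q0 column from the Chrom-1 table, and the `build_index` name is hypothetical.

```python
from collections import defaultdict

# (humanID, cell_type, parent, codon stored in column 1q0)
rows = [
    ("595-36-0000", "normal", "mother", "TCA"),
    ("595-36-0000", "normal", "father", "TAA"),
    ("595-36-0000", "cancer", "mother", "TCA"),
    ("595-36-1111", "normal", "father", "TCA"),
]

def build_index(rows):
    """Map each codon value in the indexed column to the humanIDs that carry it."""
    index = defaultdict(set)
    for human_id, _cell_type, _parent, codon in rows:
        index[codon].add(human_id)
    return index

index = build_index(rows)
# Equivalent of: SELECT humanID ... WHERE "1q0" = 'TAA'
carriers = sorted(index.get("TAA", set()))
```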
22. Query: Give me the X gene for 2 people
Chrom-1 (same table as on slide 20)
cqlsh:dna_table> SELECT "1q0", "1q1" FROM chrom_1
WHERE humanID IN ('595-36-0000', '595-36-1111');
23. Storing the total USA population Genome in Cassandra (314 million people)
- 24 column families: Chrom-1, Chrom-2, Chrom-3, ..., Chrom-Y
- P-key: SS:CT:M and SS:CT:F (one mother row and one father row per person)
- 630 million rows (2 for each person)
- 3 billion cols; column families hold between 9 million and 41 million columns
- 125 MB of data per row in Chrom-1, 1.5 GB per row in total
- No Replication: 630 million rows x 1.5 GB = 900 PB = 46,000 nodes (20 TB each)
- 1000 Genomes Project, Oct 2012: 1092 genomes sequenced, 3.2 TB data total
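The sizing on this slide can be checked with a few lines of arithmetic. The sketch below assumes binary units (1 TB = 1024 GB, 1 PB = 1024 TB), which is an assumption on my part: it is the reading under which the slide's rounded figures of 900 PB and 46,000 nodes come out closely.

```python
PEOPLE = 314_000_000        # USA population used on the slide
ROWS_PER_PERSON = 2         # one mother row and one father row
GB_PER_ROW = 1.5            # data per row across all column families
NODE_CAPACITY_TB = 20       # storage per node

rows = PEOPLE * ROWS_PER_PERSON          # 628 million rows (slide rounds to 630 million)
total_tb = rows * GB_PER_ROW / 1024      # assuming 1 TB = 1024 GB
total_pb = total_tb / 1024               # assuming 1 PB = 1024 TB
nodes = total_tb / NODE_CAPACITY_TB      # nodes needed with no replication

print(f"{rows:,} rows, {total_pb:,.0f} PB, {nodes:,.0f} nodes")
```

Any replication factor multiplies both the storage and the node count accordingly.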
24. Cost per Human Genome sequence
[Chart: cost per genome by year, 2001 to 2012; linear scale from $0 to $120,000,000 in $20 million increments. Annotation: "Huh ?"]
25. Cost per Human Genome sequence
[Chart: cost per genome by year, 2001 to 2012; logarithmic scale from $100 to $100,000,000 in 10x increments. Series: Genome Sequencing, Moore's Law. Annotations: "Super Logarithmic Scale!"; "Jan 2008: Switched to next-gen sequencing"]
27. 8% of human DNA
(98,000 fragments)
T A A C C C T A A C C C T A A C C C T A A C C C T A A C C C
A T T G G G A T T G G G A T T G G G A T T G G G A T T G G G
HIV-1 virus genome: https://www.ncbi.nlm.nih.gov/nuccore/9629357?report=fasta
28. Get Ubuntu 12.10
http://www.ubuntu.com/download
(note CentOS/Red Hat has install issues with Biopython)
DataStax Community Edition of Cassandra + OpsCenter
http://www.datastax.com/download/community
Free Python tools for biological computation
http://biopython.org
Cassandra Python client library
Pycassa https://github.com/pycassa/pycassa
29. blueplastic.com/dna.pdf
Polychaos dubium
620 billion bp (200x humans)
Sameer Farooqui
sameer@blueplastic.com
- Freelance Big Data consultant and trainer
- Taught 50+ courses on Hadoop, HBase, Cassandra and OpenStack
Ex: Hortonworks, Accenture R&D, Symantec
- Co-author on v2 of Cassandra book (coming late 2013)
linkedin.com/in/blueplastic/
@blueplastic
http://youtu.be/ziqx2hJY8Hg
30. James Watson: How we discovered DNA
http://www.ted.com/talks/james_watson_on_how_he_discovered_dna.html
Juan Enriquez: The life-code that will reshape the future
http://www.ted.com/talks/juan_enriquez_on_genomics_and_our_future.html