1. Single Cell RNA-seq reveals ectopic and aberrant lung
resident cell populations in IPF
UAB Division of Pulmonary
and Critical Care Medicine
Special Journal Club
September 26th, 2019
Presenter: Thi Nguyen
Dr. Duncan’s lab
bioRxiv preprint Sep. 6, 2019
2. Outline
• Background (authors + 10X chromium)
• Figures
• Summary of findings
• Discussion
• Comparative analyses of scRNAseq studies in IPF
3. Background
Naftali Kaminski, MD
Section Chief
Pulmonary, Critical Care
& Sleep Medicine
Yale School of Medicine
Ivan O. Rosas
Associate Professor
Harvard Medical School
Main research interests:
• pioneer in high-throughput genomic approaches to elucidate mechanisms and
improve IPF diagnosis/treatment
• integrate ‘omics’ data with clinical information for personalized medicine
Single Cell RNA-seq for IPF project:
• April 27th 2017, Three Lakes Partners, in collaboration with MATTER, announced the
$1 million IPF Catalyst Challenge
• November 9-11th, PFFSUMMIT2017, Nashville, TN: Taylor Adams presented a
poster with promising results from scRNAseq (5 IPF + 5 Ct)
• 2018, their team won the cash prize, proposed to sequence 100 lungs within
3-4 months and promised to share results in short order.
• Sept. 2019, published a bioRxiv preprint at the same time as Kropski’s group.
18. Summary
• single cell atlas of IPF lungs
• found the aberrant basaloid cells in IPF
• found an ectopic VE cell population
• lineage analysis: Fb and myoFb are independent cell
types that become invasive and fibrotic in IPF
• the IPF GRN is shifted from a balanced, diverse
network to a more fragmented and modular one.
19. Discussion
• What is the origin of the aberrant basaloid cells in IPF?
• Where do the COL15A1+ VE cells come from?
• Are myoFb just differentiated Fb or are they from an
independent lineage?
• Can we target these cells to cure IPF?
• How does knowledge of the IPF cell atlas advance IPF research?
What are the next questions that scRNAseq technology can help
us answer?
21. Earlier scRNAseq studies on IPF
• FACS sort for CD45-CD31-CD326+
HTII-280+ AT2 cells
• 3 IPF (325 cells) + 3 Ct lungs (215 cells)
• IPF AT2 cells coexpress AT1, AT2 and conducting
airway selective markers -> indeterminate state
of differentiation not seen in normal lung
• 8 Ct + 4 IPF + 2 SS + 1 polymyositis + 1 HP
(biopsies)
• 76,070 cells
• distinct population of alveolar MΦ with high
expression of profibrotic genes
• found KRT5+TP63+ SOX2+ cells in both
normal and fibrotic lungs
24. scRNAseq isn’t the answer to everything
1. cells need to be dissociated into single cells
2. can scRNAseq recover every cell? How representative are 10^4 cells out of 10^12 cells?
3. FACS sorting and freezing cells can cause artifacts
4. low capture efficiency/high dropout -> unable to detect low-abundance transcripts
5. information about cells’ original spatial context is lost
6. low starting material -> data are noisier, more variable than bulk
7. curse of dimensionality
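Point 2 can be made concrete with a little arithmetic. This is a minimal sketch, assuming cells are sampled independently and uniformly, which tissue dissociation does not guarantee; the function name and the example frequencies are illustrative and not from the paper.

```python
def prob_missed(frequency: float, n_sampled: int) -> float:
    """Probability that a cell type present at `frequency` is completely
    absent from a random sample of `n_sampled` cells (independent draws)."""
    return (1.0 - frequency) ** n_sampled


# 10^4 profiled cells out of roughly 10^12 cells, as on the slide.
sampling_fraction = 1e4 / 1e12

for freq in (1e-2, 1e-3, 1e-4, 1e-5):
    expected = freq * 1e4  # expected number of such cells among 10^4
    print(f"frequency {freq:g}: expected cells {expected:g}, "
          f"P(missed entirely) = {prob_missed(freq, 10_000):.3f}")
```

Under these assumptions, a population at 1 in 10,000 cells would be missed entirely more than a third of the time, which is one reason rare cell types need either more cells or enrichment.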
Editor's Notes
This study was a collaboration between Kaminski’s group at Yale, a pioneer in high-throughput genomic approaches to studying IPF, and Ivan Rosas’s group, who supplied the clinical lung specimens.
With a strong interest in integrating high-throughput ‘omics’ data to generate tools for precision medicine in IPF, Kaminski started doing single-cell RNA-seq back in 2017, but with limited samples. In April 2017, Three Lakes Partners, a venture philanthropy with a mission to end IPF, in collaboration with MATTER, the healthcare technology incubator and innovation hub, announced a $1 million cash award for the IPF Catalyst Challenge.
At the PFF Summit 2017 in Nashville, a postbaccalaureate researcher in Kaminski’s lab, Taylor Adams, presented the team’s first single-cell data in a poster with promising results from scRNAseq of only 5 IPF and 5 control lungs. This research attracted Three Lakes’ attention.
In 2018, their team won the cash prize, proposed to sequence 100 lungs within 3-4 months and promised to share results in short order. This funding enabled their team to accelerate scRNAseq at an unprecedented scale.
Just earlier this month, they published a bioRxiv preprint at the same time as Kropski’s group from Vanderbilt.
Dr. Kaminski has a strong interest in integrating high throughput ‘omics’ data, such as genome scale DNA variants, coding and non-coding RNAs, microbiome and metabolome information with clinical information to generate tools for personalized medicine of lung diseases that are significantly more precise, predictive and patient centered than anything that is currently available.
Three Lakes Partners, a venture philanthropy committed to ending idiopathic pulmonary fibrosis (IPF), in collaboration with MATTER, the healthcare technology incubator and innovation hub, announced its $1 million IPF Catalyst Challenge this evening during a gathering of some of the world's foremost IPF experts and healthcare thought leaders at MATTER's headquarters in Chicago's Merchandise Mart.
In fact, a postbaccalaureate researcher in Kaminski’s lab, Taylor Adams, presented the team’s first single-cell data in a poster at last year’s Pulmonary Fibrosis Foundation Summit in Nashville. That presentation, Kaminski believes, first attracted Three Lakes’ attention.
Using single-cell transcriptomics, Kaminski and his team plan to sequence the RNA (ribonucleic acid) of every cell in more than 100 donor lungs affected by IPF and other lung diseases. Ivan O. Rosas, MD, a physician at Brigham and Women’s Hospital in Boston and associate professor of pulmonary and critical care at Harvard Medical School, is providing the lungs for this research as part of an ongoing collaboration.
Both the Kaminski group and the Kropski group used 10X Genomics' Chromium technology, which partitions single cells, or sometimes nuclei, into nanoliter-scale oil droplets. Each droplet contains a uniquely barcoded gel bead; these droplets are called gel beads-in-emulsion (GEMs).
Inside the droplet, the cell is lysed and its mRNA is captured on the uniquely barcoded bead. The mRNA is then reverse-transcribed into cDNA, PCR-amplified, pooled and sequenced on a high-throughput platform.
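Why the barcode matters can be shown with a toy counting example. This is only a sketch of the idea, not the actual 10X pipeline: the reads, barcodes and the use of a per-molecule tag (UMI) to collapse PCR duplicates are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical reads: (cell barcode, UMI, gene). Real data are FASTQ files
# handled by a dedicated pipeline; this only shows the counting idea.
reads = [
    ("AAAC", "u1", "SFTPC"),
    ("AAAC", "u1", "SFTPC"),  # PCR duplicate: same barcode, UMI and gene
    ("AAAC", "u2", "SFTPC"),
    ("AAAC", "u3", "KRT17"),
    ("GGGT", "u1", "COL15A1"),
]


def count_molecules(reads):
    """Collapse PCR duplicates, then count unique UMIs per cell and gene."""
    seen = defaultdict(set)
    for barcode, umi, gene in reads:
        seen[(barcode, gene)].add(umi)
    counts = defaultdict(dict)
    for (barcode, gene), umis in seen.items():
        counts[barcode][gene] = len(umis)
    return dict(counts)


matrix = count_molecules(reads)
print(matrix)
```

Because every read carries its cell barcode, the reads can be pooled for sequencing and still be assigned back to the cell they came from.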
Using this technology, many novel cell types have been discovered. A notable example is the discovery of the pulmonary ionocyte back in 2018, which was published in Nature.
Back to the IPF study. Here is an overview of the experimental design. They profiled a total of 79 human lungs: 32 IPF lung explants, 18 COPD lungs and 29 control lungs from unused donor lungs.
The lungs were dissociated into single-cell suspensions and stored in liquid nitrogen in Ivan Rosas’s group before being handed over to the Kaminski group. They then used the single-cell barcoding technology to capture each cell’s mRNA, make cDNA, PCR-amplify it and sequence it.
Next came data processing, exploratory analysis and then validation using IHC.
In total, they successfully sequenced 312,928 cells from the distal lung parenchyma, as shown in the UMAP representation here.
UMAP (uniform manifold approximation and projection) is a newer non-linear dimension reduction technique. It runs faster, is more reproducible and preserves global structure better than older techniques such as t-SNE.
In this UMAP, each dot represents one cell, and each cell's relationship to the other cells is defined in the multidimensional space of gene expression. Human cells have about 20,000 genes, so with such high dimensionality we need a dimension reduction technique to visualize the data. Typically scRNAseq captures only about 15% of total transcripts, so each cell has about 3,000 gene expression values, which correspond to 3,000 dimensions.
What it does is take a high-dimensional dataset and reduce it to a low-dimensional one while retaining much of the information in the original dataset, in such a way that the clustering of the cells in the high-dimensional space is preserved.
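One simple way to check whether neighborhoods survive a dimension reduction is to compare each cell's nearest neighbors before and after. The sketch below does not run UMAP; it only scores a given embedding, and the toy coordinates and function names are made up for illustration.

```python
import math


def knn(points, k):
    """Return, for each point, the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbors.append({j for _, j in dists[:k]})
    return neighbors


def neighbor_preservation(high_dim, low_dim, k):
    """Mean fraction of each cell's k nearest neighbors kept after embedding."""
    hi, lo = knn(high_dim, k), knn(low_dim, k)
    return sum(len(a & b) / k for a, b in zip(hi, lo)) / len(high_dim)


# Toy data: two groups of "cells" in 4 dimensions and a made-up 2-D embedding.
high = [(0, 0, 0, 0), (0, 1, 0, 0), (1, 0, 0, 0),
        (9, 9, 9, 9), (9, 8, 9, 9), (8, 9, 9, 9)]
low = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (5.0, 5.0), (5.0, 4.0), (4.0, 5.0)]
score = neighbor_preservation(high, low, k=2)
print(score)
```

A score near 1 means cells that were close in gene-expression space are still close on the 2-D plot; a low score would mean the clusters on the plot should not be trusted.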
These cells are grouped into 38 discrete cell types, shown color-coded here, which fall into 4 broader cell categories.
High-dimensional data, meaning data that requires more than two or three dimensions to represent, can be difficult to interpret. One approach to simplification is to assume that the data of interest lie on an embedded non-linear manifold within the higher-dimensional space. If the manifold is of low enough dimension, the data can be visualised in the low-dimensional space.
That manifold is, of course, just the low-dimensional Euclidean space we are trying to embed into.
Figure C shows the heatmap of marker gene expression values for these 38 discrete cell types in 4 broader groups: epithelial, stromal, myeloid and lymphoid. For each cell type, they show only the top 5 genes most differentially expressed between that cell type and the rest of the cell types in the same category. Each column shows the average expression for one subject.
They also do hierarchical clustering by disease status, shown at the top. They forgot to include a legend for the disease color code, but I guess blue is normal, yellow is COPD and red is IPF.
Overall, this figure is meant to show us the validity of their cell classification.
Heatmap of 235 marker genes for all 38 identified cell types, categorized into 4 broad cell categories. Each cell type is represented by the top 5 genes ranked by false-detection-rate adjusted p-value of a Wilcoxon rank-sum test between the average expression per subject value for each cell type against the other average subject expression of the other cell types in their respective grouping. Each column represents the average expression value for one subject, hierarchically grouped by disease status and cell type. Gene expression values are unity normalized from 0 to 1.
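The two operations named in this legend can be sketched in a few lines. This is an illustration only: it uses a normal approximation for the rank-sum p-value, leaves out the multiple-testing adjustment, and the function names and example values are assumptions rather than the authors' code.

```python
import math


def unity_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_sum_test(x, y):
    """Wilcoxon rank-sum (Mann-Whitney U) with a normal approximation.
    Ties get average ranks; no tie or continuity correction is applied."""
    pooled = sorted((v, g) for g, grp in enumerate((x, y)) for v in grp)
    ranks_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        ranks_x += avg_rank * sum(1 for _, g in pooled[i:j] if g == 0)
        i = j
    n1, n2 = len(x), len(y)
    u = ranks_x - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return u, p


# Made-up per-subject average expression of one gene in one cell type (x)
# versus the other cell types of the same category (y).
u_stat, p_value = rank_sum_test([5.0, 6.0, 7.0, 8.0], [1.0, 2.0, 3.0, 4.0])
print(u_stat, p_value, unity_normalize([2.0, 4.0, 6.0]))
```

Ranking genes by this kind of p-value and keeping the top 5 per cell type is what produces the block-diagonal look of the heatmap.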
UMAP of all epithelial cells labeled by cell type on the left, by disease on the right, or by subject.
Boxplots show the distribution of the proportion of each cell type among all epithelial cells per subject, stratified by disease group. You can see that the epithelial cell repertoire of the IPF lung has an increased proportion of airway epithelial cells and a decreased proportion of alveolar epithelial cells. They also mention that there are profound changes in gene expression of epithelial cells in IPF lung compared with COPD or control (IPF cell atlas data mining site).
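The numbers behind such boxplots are just per-subject proportions. Here is a minimal sketch with made-up annotations; the subject IDs, cell counts and function name are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical annotations: (subject, disease, cell type) per epithelial cell.
cells = [
    ("s1", "IPF", "Basal"), ("s1", "IPF", "Basal"), ("s1", "IPF", "AT2"),
    ("s1", "IPF", "Ciliated"),
    ("s2", "Control", "AT2"), ("s2", "Control", "AT2"),
    ("s2", "Control", "AT1"), ("s2", "Control", "Basal"),
]


def proportions_per_subject(cells):
    """Fraction of each cell type among all listed cells of each subject."""
    per_subject = defaultdict(Counter)
    for subject, _disease, cell_type in cells:
        per_subject[subject][cell_type] += 1
    return {
        subject: {ct: n / sum(counts.values()) for ct, n in counts.items()}
        for subject, counts in per_subject.items()
    }


props = proportions_per_subject(cells)
print(props["s1"]["Basal"], props["s2"]["Basal"])
```

Each subject contributes one proportion per cell type, and the boxplot then shows how those proportions are distributed within each disease group.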
Among the epithelial cells, they identified a population that was transcriptionally distinct from any epithelial cell type previously described, which they called aberrant basaloid cells.
Heat map of average gene expression and predicted transcription factor activity per subject across each epithelial cell type. These columns are also grouped by disease status and cell type.
They also zoomed in to provide more information for annotating the aberrant basaloid cells.
These cells are transcriptionally distinct from other epithelial cells. In addition to epithelial markers, they express basal cell markers such as TP63, KRT17, LAMB3 and LAMC2, but do not express well-established basal markers such as KRT5 and KRT15. They express markers of EMT such as VIM, CDH2, FN1, COL1A1, TNC and HMGA2, and senescence-related genes such as CDKN1A, CDKN2A and cyclin D genes (CCND). These cells also express the highest levels of IPF-related molecules such as MMP7, αVβ6 integrin subunits and EPHB2. They are predicted to express the TF SOX9, which is important for distal airway development, repair and oncogenesis.
These cells were not found in control lungs.
IHC of aberrant basaloid cells in IPF lungs: epithelial cells covering fibroblast foci are p63+KRT17+ basaloid cells staining COX2, p21 and HMGA2 positive, while basal cells in bronchi do not.
To validate their results, they reanalyzed the IPF single-cell data by Reyfman published earlier this year.
Correlation matrix of color-coded Spearman rho correlation coefficients, showing good correlation between analogous cell subsets in their study and the Reyfman study. Hierarchical clustering is also applied to the cell populations to show the hierarchical relationship among the different epithelial cell subsets; it shows that the aberrant basaloid cells most closely resemble basal cells.
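Spearman rho is simply the Pearson correlation computed on ranks, which makes it insensitive to differences in scale between two datasets. A small sketch follows; the function names and the example values are illustrative.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and values[order[j]] == values[order[i]]:
            j += 1
        for k in range(i, j):
            ranks[order[k]] = (i + 1 + j) / 2
        i = j
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up average expression of four genes in one cell type in two studies.
study_a = [1.0, 2.0, 3.0, 4.0]
study_b = [1.0, 4.0, 9.0, 16.0]  # same ordering, very different scale
rho = spearman(study_a, study_b)
print(rho)
```

Because only the ordering of genes matters, two studies with different depth or normalization can still show a high rho for the same cell type.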
Next they want to explore the endothelial cell repertoire in IPF lung.
Cluster analysis of VE cells shows 4 populations characterized as capillary, arterial or venous VE. They also discovered an abnormal, ectopic fifth population of VE cells that express COL15A1.
Using the Human Protein Atlas, they found that COL15A1+ VE cells are restricted to vasculature near major airways. Therefore they named this population peribronchial VE.
UMAPs of endothelial and mesenchymal cells labelled by cell type, disease status and subject. In the subject plot, each subject is represented by a unique color.
B. Heatmap showing characteristics of the 5 subtypes of VE. Each column is one individual cell, grouped by disease and by subject in the rows on top.
Boxplots show the percent makeup distributions of each VE cell-type amongst all VE cells within each disease group.
You can see that peribronchial VE cells are found in all disease states but are substantially more abundant in IPF.
So they localize these endothelial cells in the lungs by staining for CD31, a pan-endothelial cell marker, and COL15A1.
In control lungs, these cells are confined to the bronchial vasculature surrounding large proximal airways, but in IPF they can be found in the distal lung at the edge of fibroblastic foci.
Violin plots of expression of pan-VE markers and peribronchial VE-specific markers across VE cells from distal
and airway lung samples from an independent dataset.
These confirm that the COL15A1 + VE cells are an ectopic VE population in the distal lung in IPF.
A violin plot is similar to a box plot, with the addition of a rotated kernel density plot on each side.
Violin plots are similar to box plots, except that they also show the probability density of the data at different values.
A violin plot is more informative than a plain box plot. While a box plot only shows summary statistics such as mean/median and interquartile ranges, the violin plot shows the full distribution of the data. The difference is particularly useful when the data distribution is multimodal (more than one peak).
A violin plot is a hybrid of a box plot and a kernel density plot, which shows peaks in the data.
On each side of the gray line is a kernel density estimation to show the distribution shape of the data. Wider sections of the violin plot represent a higher probability that members of the population will take on the given value; the skinnier sections represent a lower probability.
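The width of a violin at a given height is a kernel density estimate. This is a minimal sketch using a Gaussian kernel; the bandwidth and the expression values are made up, and plotting libraries choose the bandwidth automatically.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of `samples` at a point x."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Made-up bimodal expression values: one group near 0, another near 4.
values = [0.0, 0.1, 0.2, 0.3, 3.8, 3.9, 4.0, 4.1]
density = gaussian_kde(values, bandwidth=0.3)
for x in (0.15, 2.0, 3.95):
    print(x, round(density(x), 3))
```

In this toy example the median falls between the two groups, where the density is lowest, so a box plot would put its center line where almost no cells are, while the violin shows both peaks.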
They also characterize the Fb and myoFb in IPF. They define these mesenchymal cells by selecting PDGFRB+ cells that are negative for known smooth muscle cell markers.
This strategy leads them to identify 2 distinct stromal populations, as shown in the heat map in A.
Fb are defined as cells that express CD34 and ECM proteins such as FBN1, FBLN2 and VIT, while myoFb are cells that express high levels of cytoskeleton markers such as MYLK, NEBL, MYO10, etc.
Each column is the average expression per cell type for one subject, and the columns are grouped according to disease type. You can see that in the IPF group the myoFb have higher expression of genes such as COL8A1 and ACTA2.
B. UMAPs of myoFb and Fb color-coded by cell type, disease and unsupervised Louvain sub-clusters. Louvain is an unsupervised clustering method (PCA first, then a k-nearest-neighbor graph) in which a resolution setting controls how many clusters you get, so you can cluster the dataset more coarsely or more finely. Here they end up with 8 sub-clusters. Unsupervised means the clusters come from the relationships among the data points themselves, not from predetermined classes or groups.
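Louvain works by searching for the partition of the neighbor graph that maximizes a score called modularity. The sketch below only computes that score for a given partition of a toy graph; it is not the Louvain search itself, and the graph and function name are made up for illustration.

```python
def modularity(edges, community):
    """Newman modularity Q of a partition of an undirected, unweighted graph.
    edges: list of (u, v) pairs; community: dict mapping node -> label."""
    m = len(edges)
    internal = {}    # number of edges inside each community
    degree_sum = {}  # total degree of the nodes in each community
    for u, v in edges:
        for node in (u, v):
            c = community[node]
            degree_sum[c] = degree_sum.get(c, 0) + 1
        if community[u] == community[v]:
            c = community[u]
            internal[c] = internal.get(c, 0) + 1
    return sum(
        internal.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


# Toy neighbor graph: two triangles of cells joined by a single edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {n: 0 for n in "abcdef"}
q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
print(q_two, q_one)
```

Splitting the graph into its two triangles scores higher than lumping everything together, which is why the algorithm would report two clusters here.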
Next, they apply a lineage reconstruction technique called PAGA to these sub-clusters of Fb and myoFb. Basically, this technique converts the clusters and their relationships into a graph representation of nodes and edges. A node represents a sub-cluster and an edge represents its connectivity with another sub-cluster in the so-called phenotype space.
The strength of connectivity among the nodes is calculated and denoted here as edge confidence. You can see that the connectivity among the Fb sub-clusters is much stronger than between an Fb sub-cluster and a myoFb sub-cluster.
To analyze the lineage trajectory among Fb and myoFb, they implemented the DPT algorithm, which attempts to use scRNAseq data to reconstruct the developmental progression of the cells.
Then again they use the UMAP dimensionality reduction technique after the DPT to obtain figure C, again labeled in colors by cell type, disease status and subject.
D and E are heatmaps of Fb and myoFb ordered by DPT distance along UMAP manifolds that transition from the control-enriched region towards the IPF-enriched archetype. The color encodes the distance.
These heat maps show the continuous developmental trajectory of both the Fb and myoFb lineages from normal to IPF.
Next they perform gene regulatory network analysis. They apply the bigSCale approach to control and IPF cells, but exclude the COPD samples.
In this method, cells are recursively clustered down to sub-clusters. Z-scores are calculated from differential expression (DE) between sub-clusters. Then they construct a gene correlation matrix using the Pearson correlation coefficient and cosine distance, and keep nodes that pass a cosine-correlation filter and have a GO annotation as a gene regulator.
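The step from a correlation matrix to a network can be sketched as thresholding pairwise correlations. This is a simplified illustration using Pearson correlation only; the gene names, Z-score profiles and the 0.9 cutoff are hypothetical, and the cosine and GO-annotation filters described above are not reproduced.

```python
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def correlation_network(profiles, threshold):
    """Edges between genes whose |Pearson r| is at least `threshold`.
    profiles: dict mapping gene -> list of Z-scores (one per comparison)."""
    edges = {}
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        if abs(r) >= threshold:
            edges[(a, b)] = r
    return edges


# Hypothetical Z-scores of three genes across four sub-cluster comparisons.
profiles = {
    "G1": [1.0, 2.0, 3.0, 4.0],
    "G2": [2.0, 4.0, 6.0, 8.5],
    "G3": [4.0, 1.0, 3.0, 2.0],
}
network = correlation_network(profiles, threshold=0.9)
print(network)
```

Genes whose Z-scores rise and fall together across sub-clusters end up connected, and everything below the cutoff is dropped to keep the network sparse.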
Doing this, they constructed a network of 13,000 nodes in control and 12,427 in IPF, with about 300,000 edges. In this network, nodes are genes and an edge is the correlation of a regulatory relationship.
Node size = PageRank centrality.
PageRank is an algorithm the Google search engine uses to rank web pages. It measures the importance of a node in the network based on the assumption that the most important node will have the most connections (links) coming from other nodes.
The largest clusters are color-coded.
The top cell types in each cluster are highlighted, and clusters are color-coded according to their most dominant cell type.
Overall, you can see that the IPF network is denser and has more discrete, more isolated clusters than the control network.
The control network is more diverse and its cell types are more spread out than in IPF, both within and across clusters.
Also, in the IPF network the aberrant basaloid cell cluster is very dense and sits near the epithelial cell cluster, which is very isolated from the rest of the cell types.
PageRank (PR) is an algorithm used by Google Search to rank web pages in their search engine results.
PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.
Next, they use the PageRank algorithm to rank the genes that most influence other genes in the network. Basically, this algorithm assumes that the genes with the most connections (edges) to other genes are the most influential genes in the network.
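PageRank itself is a short iterative calculation. The sketch below is the textbook power-iteration version on a toy directed network; the damping factor of 0.85 is the commonly quoted default, and the gene names and link structure are made up, so this is not the authors' implementation.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank.
    links: dict mapping each node to the list of nodes it points to."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n
            for node in nodes
        }
        for source, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


# Toy directed network: three genes all point to "hub".
links = {"g1": ["hub"], "g2": ["hub"], "g3": ["hub"], "hub": ["g1"]}
ranks = pagerank(links)
top_gene = max(ranks, key=ranks.get)
print(top_gene, ranks)
```

Note that a node's score depends on the scores of the nodes linking to it, not just on how many links it has: here g1 ranks well above g2 and g3 only because the hub points to it.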
Cartoon illustrating the basic principle of PageRank. The size of each face is proportional to the total size of the other faces which are pointing to it.
Using this PageRank approach, they identify 300 nodes (genes), highlighted in red, that most influence the other genes in the network.
Here is the same GRN as before, with the top 300 nodes highlighted in red; these differ most between the control and IPF networks based on their PageRank centrality. The node size corresponds to the PageRank centrality. You can see that the genes most influential in driving the difference between the IPF and control networks belong to the BMP/WNT signaling pathways.
E. They did gene set enrichment on these 300 genes and show results related to cellular aging, response to TGFbeta1, epithelial tube formation and SMC differentiation.
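Gene set enrichment of a gene list is often scored with a hypergeometric test, which asks how surprising the overlap with an annotated set is. The notes do not say which test the authors used, so treat this as an assumption; the function name and the numbers are illustrative.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn without replacement
    from `population` genes, of which `annotated` belong to the gene set."""
    upper = min(annotated, selected)
    total = comb(population, selected)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 20 genes in total, 5 annotated to a term,
# 4 genes selected, 3 of which carry the annotation.
p_enriched = hypergeom_enrichment_p(20, 5, 4, 3)
print(p_enriched)
```

A small p-value means the selected genes hit the annotated set more often than random picks would; in practice the p-values are then adjusted for the many gene sets tested.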
discover the shift in alveolar epithelial cells gene expression -> airway EC
-> aberrant basaloid cells
CD326 (EPCAM)
HTII-280 marker for type II AT cells
The pseudostratified epithelium of the mouse trachea and human airways contains a population of basal cells expressing Trp-63 (p63) and cytokeratins 5 (Krt5) and Krt14. Using a KRT5-CreER(T2) transgenic mouse line for lineage tracing, we show that basal cells generate differentiated cells during postnatal growth and in the adult during both steady state and epithelial repair. We have fractionated mouse basal cells by FACS and identified 627 genes preferentially expressed in a basal subpopulation vs. non-BCs. Analysis reveals potential mechanisms regulating basal cells and allows comparison with other epithelial stem cells. To study basal cell behaviors, we describe a simple in vitro clonal sphere-forming assay in which mouse basal cells self-renew and generate luminal cells, including differentiated ciliated cells, in the absence of stroma. The transcriptional profile identified 2 cell-surface markers, ITGA6 and NGFR, which can be used in combination to purify human lung basal cells by FACS. Like those from the mouse trachea, human airway basal cells both self-renew and generate luminal daughters in the sphere-forming assay.
Idiopathic pulmonary fibrosis is a common form of interstitial lung disease resulting in alveolar remodeling and progressive loss of pulmonary function because of chronic alveolar injury and failure to regenerate the respiratory epithelium. Histologically, fibrotic lesions and honeycomb structures expressing atypical proximal airway epithelial markers replace alveolar structures, the latter normally lined by alveolar type 1 (AT1) and AT2 cells. Bronchial epithelial stem cells (BESCs) can give rise to AT2 and AT1 cells or honeycomb cysts following bleomycin-mediated lung injury. However, little is known about what controls this binary decision or whether this decision can be reversed. Here we report that inactivation of Fgfr2b in BESCs impairs their contribution to both alveolar epithelial regeneration and honeycomb cysts after bleomycin injury. By contrast overexpression of Fgf10 in BESCs enhances fibrosis resolution by favoring the more desirable outcome of alveolar epithelial regeneration over the development of pathologic honeycomb cysts.