Identification and comparison of orthologous cell types from primate embryoid bodies shows limits of marker gene transferability

Published: 2026-04-07. Source: eLife

Abstract

The identification of cell types remains a major challenge. Even after a decade of single-cell RNA sequencing (scRNA-seq), reasonable cell type annotations almost always include manual non-automated steps. The identification of orthologous cell types across species complicates matters even more, but at the same time strengthens the confidence in the assignment. Here, we generate and analyze a dataset consisting of embryoid bodies (EBs) derived from induced pluripotent stem cells (iPSCs) of four primate species: humans, orangutans, cynomolgus, and rhesus macaques. This kind of data includes a continuum of developmental cell types, multiple batch effects (i.e. species and individuals) and uneven cell type compositions and hence poses many challenges. We developed a semi-automated computational pipeline combining classification and marker-based cluster annotation to identify orthologous cell types across primates. This approach enabled the investigation of cross-species conservation of gene expression. Consistent with previous studies, our data confirm that broadly expressed genes are more conserved than cell type-specific genes, raising the question of how conserved, inherently cell type-specific, marker genes are. Our analyses reveal that human marker genes are less effective in macaques and vice versa, highlighting the limited transferability of markers across species. Overall, our study advances the identification of orthologous cell types across species, provides a well-curated cell type reference for future in vitro studies and informs the transferability of marker genes across species.

Introduction

Cell types are a central concept in biology but, like other concepts such as species, are difficult to identify in practice. Theoretically, one would consider all stable, irreversible states on a directed developmental trajectory as cell types. In practice, we are limited by our experimental possibilities. Historically, cell type definitions hinged on observations of cell morphology in a tissue context, which were later combined with immunofluorescence analyses of marker genes (

Bakken et al., 2017

). A lot of the functional knowledge that we have about cell types today is based on such visual and marker-based cell type definitions. With single-cell sequencing, our capabilities to characterize and identify new cell types have radically changed (

The Tabula Muris Consortium et al., 2018

;

Regev et al., 2017

). Clustering cells by their expression profiles enables a more systematic and higher-resolution identification of groups of cells that are then interpreted as cell types. However, distinguishing them from cell states or technical artifacts is not straightforward. A key criterion for defining a true cell type is its reproducibility across experiments, individuals, or even species.

Hence, identifying the same, i.e., orthologous, cell types across individuals and species is crucial. There are three principal strategies to match cell types from scRNA-seq data. (1) One is to integrate all cells prior to performing a cell type assignment on a shared embedding (

Song et al., 2023

). (2) The second approach is to consider cell types from one species as the reference and transfer these annotations to the other species using classification methods (

Liu et al., 2023

). (3) The third strategy is to assign clusters and match them across species, which has the advantage of not requiring data integration of multiple species or an annotated reference (

Castro-Mondragon et al., 2022

;

Bakken et al., 2021

;

Suresh et al., 2023

).

Furthermore, established marker genes are still heavily used to validate and interpret clusters identified by scRNA-seq data (

Zhang et al., 2019b

;

Guo and Li, 2021

;

Ianevski et al., 2022

). Together with newly identified transcriptomic markers for human and mouse, they are collected in databases (

Franzén et al., 2019

;

Zhang et al., 2019a

) and provide the basis for follow-up studies using spatial transcriptomics and/or immunofluorescence approaches. However, previous studies have shown that the same cell types may be defined by different marker genes in different species (

Hodge et al., 2019

;

Bakken et al., 2021

). For example,

Krienen et al., 2020

found that only a modest fraction of interneuron subtype-specific genes overlapped between primates and even less between primate and rodent species.

To better understand how gene expression in general and the expression of marker genes in particular evolve across closely related species, we used induced pluripotent stem cells (iPSCs) and their derived cell types from humans and non-human primates (NHPs). One fairly straightforward way to obtain diverse cell types from iPSCs is to generate embryoid bodies (EBs). EBs are the simplest type of iPSC-derived organoid: they contain a dynamic mix of cell types from all three germ layers and result from spontaneous differentiation upon withdrawal of key pluripotency factors (

Brickman and Serup, 2017

;

Itskovitz-Eldor et al., 2000

;

Rhodes et al., 2022

;

Guo et al., 2019

;

Han et al., 2018

).

EBs and brain organoids from humans and chimpanzees have, for example, been used to infer human-specific gene regulation in brain organoids (

Kanton et al., 2019

) or to investigate mechanisms of gene expression evolution (

Barr et al., 2023

).

Here, we explore to what extent levels of cell type specificity of marker genes are conserved in primates. We generated scRNA-seq data of 8- and 16-day-old EBs from human, orangutan (

Pongo abelii

), cynomolgus (

Macaca fascicularis

), and rhesus macaque (

Macaca mulatta

) iPSCs. Using this data, we established an analysis pipeline to identify and assign orthologous cell types. With this annotation, we provide a well-curated cell type reference for in vitro studies of early primate development. Moreover, it allowed us to assess the cell type specificity and expression conservation of genes across species. We find that even though the cell type-specificity of a marker gene remains similar across species, its discriminatory power still decreases with phylogenetic distance.

Results

Generation of embryoid bodies from iPSCs of different primate species

We generated EBs from iPSCs across multiple primate species: two human iPSC clones (from two individuals), two orangutan clones (from one individual), three cynomolgus clones (from two individuals), and three rhesus clones (from one individual) (

Geuder et al., 2021

;

Jocher et al., 2024

;

Edenhofer et al., 2024

). To optimize conditions for generating a sufficient number of cells from all three germ layers across these four species, we tested combinations of two culturing media (‘EB-medium’ and ‘DFK20,’ see Methods) and two EB-differentiation conditions (‘single-cell seeding’ and ‘clump seeding,’ see Methods). After 7 days of differentiation, germ layer composition was analyzed by flow cytometry (

Figure 1—figure supplement 1A, B and C

). Among the four tested protocols, culture in DFK20 medium with clump seeding resulted in the most balanced representation of all germ layers, yielding a substantial number of cells from each layer across all species (

Figure 1—figure supplement 1D

).

Under these conditions, we established an EB formation protocol based on 8 days of floating culture in dishes, followed by 8 days of attached culture (

Figure 1A

). This results in the formation of cells from all three germ layers, as confirmed by immunofluorescence staining for AFP (endoderm), β-III-tubulin (ectoderm) and α-SMA (mesoderm) (

Figure 1B

). To generate scRNA-seq data, we dissociated 8- or 16-day-old EBs into single cells and pooled cells from all four species to minimize batch effects (

Figure 1C

). We performed the experiment in three independent replicates, generating a total of four lanes and six lanes of 10x Genomics scRNA-seq at day 8 and day 16, respectively (

Figure 1—figure supplement 2A

). This resulted in a dataset comprising over 85,000 cells after filtering and doublet removal, distributed fairly equally over time points, species, and clones (

Figure 1—figure supplement 2B–D

).

Figure 1

with 3 supplements


Generation of primate embryoid bodies.

(

A

) Overview of the embryoid body (EB) differentiation workflow of the four primate species human (

*Homo sapiens*

), orangutan (

*Pongo abelii*

), cynomolgus (

*Macaca fascicularis*

), and rhesus (

*Macaca mulatta*

), including their phylogenetic relationship. Scale bar represents 500 µm. (

B

) Immunofluorescence staining of day 16 EBs using α-fetoprotein (AFP), β-III-tubulin, and α-smooth muscle actin (α-SMA). Scale bar represents 100 µm. (

C

) Schematic overview of the sampling and processing steps prior to 10x scRNA-seq. (

D

) UMAP representation of the whole scRNA-seq dataset, integrated across all four species with Harmony. Single cells are colored by the expression of known marker genes for the three germ layers and undifferentiated cells. (

E

) UMAP representation, colored by assigned germ layers, split by species. Created with

.

In agreement with the immunofluorescence staining, we detected well-established marker genes of pluripotent cells and of all three germ layers (

Ludwig et al., 2023

) in the scRNA-seq data:

SOX2

,

SOX10

, and

STMN4

expression was used to label ectodermal cells,

APOA1

and

EPCAM

for endodermal cells,

COL1A1

and

ACTA2

(α-SMA) for mesodermal cells, and

POU5F1

and

NANOG

for pluripotent cells (

Figure 1D

). Expression of these marker genes corresponded well with a classification based on a published scRNA-seq dataset from 21-day-old human EB (

Rhodes et al., 2022

). This initial, rough germ layer assignment shows that our differentiation protocol generates EBs with the expected germ layers and cell type diversity from all four species (

Figure 1E

,

Figure 1—figure supplement 3A

).

Assignment of orthologous cell types

Many integration methods encounter difficulties when they are applied to data from multiple species and uneven cell type compositions (

Song et al., 2023

). Indeed, when comparing clusters derived from an integrated embedding across all species (

Hie et al., 2019

;

Korsunsky et al., 2019

) to the aforementioned preliminary cell type assignments, we observed signs of overfitting. For instance, a cluster predominantly containing cells classified as neurons in humans, cynomolgus, and rhesus macaques consisted mainly of early ectoderm and mesoderm cells in orangutans (

Figure 1—figure supplement 3B and C

). To address this issue, we developed an approach that assigns orthologous cell types without a common embedding space in an interactive shiny app (

;

Figure 2A and B

):

Figure 2

with 6 supplements


Assignment of orthologous cell types across species.

(

A

) Schematic overview of the pipeline to match clusters between species and assign orthologous cell types. (

B

) Sankey plot visualizing the intermediate steps of the cell type assignment pipeline. Each line represents a cell, colored by its species of origin on the left and by its current cell type assignment during the annotation procedure on the right. An initial set of 118 high-resolution clusters (HRCs), 25–35 per species, was combined into 26 orthologous cell type clusters (OCCs). Similar cell type clusters were merged and, after further manual refinement, provided the basis for final orthologous cell type assignments. (

C

) Fraction of annotated cell types per species. (

D

) UMAPs for each species colored by cell type. (

E

) To validate our cell type assignments, we selected three marker genes per cell type that exhibit a similar expression pattern across all four species and have been reported to be specific for this cell type in both human and mouse (

Appendix 1—table 1

). The heatmap depicts the fraction of cells of a cell type in which the respective gene was detected for cell types present in at least three species.

First, we assign cells to clusters separately for each species. To avoid losing rare cell types, we aim to obtain at least twice as many high-resolution clusters (HRCs) per species as expected cell types. We then use the HRCs of one species as a reference to classify the cells of the other species using SingleR (

Aran et al., 2019

). These pairwise comparisons are done reciprocally for each species and, via a cross-validation approach, also within each species (see Methods). For each comparison, we average the two values for the fraction of cells annotated as the other HRC. For example, a perfect ‘reciprocal best-hit’ between HRC-A in human and HRC-B in rhesus would have all cells of HRC-B assigned to HRC-A when using the human as a reference and reciprocally all cells in HRC-A assigned to HRC-B when using the rhesus as a reference. Next, we use the resulting distance matrix as input for hierarchical clustering to find orthologous clusters across species and merge similar clusters within species. Here, the user can choose and adjust the final number of cell type clusters. This allows us to identify orthologous cell type clusters (OCCs) across all four species, while retaining species-specific clusters when no matching cluster is identified.
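
The reciprocal scoring and cluster matching described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the pipeline's implementation: the conversion of the averaged fraction into a distance (1 minus similarity), the average-linkage rule, the function names, and the toy fractions are all hypothetical.

```python
from itertools import combinations

def build_distance(clusters, directed_fraction):
    """directed_fraction[(query, ref)] is the fraction of cells of cluster
    `query` that were classified as cluster `ref`. The two directions are
    averaged; distance = 1 - averaged fraction (an assumption of this sketch)."""
    dist = {}
    for a, b in combinations(clusters, 2):
        sim = (directed_fraction.get((a, b), 0.0)
               + directed_fraction.get((b, a), 0.0)) / 2.0
        dist[frozenset((a, b))] = 1.0 - sim
    return dist

def average_linkage(clusters, dist, n_groups):
    """Agglomerative clustering: repeatedly merge the two closest groups
    (mean pairwise distance) until the user-chosen number of groups remains."""
    groups = [[c] for c in clusters]

    def group_dist(g1, g2):
        pairs = [dist[frozenset((x, y))] for x in g1 for y in g2]
        return sum(pairs) / len(pairs)

    while len(groups) > n_groups:
        i, j = min(combinations(range(len(groups)), 2),
                   key=lambda ij: group_dist(groups[ij[0]], groups[ij[1]]))
        merged = groups[i] + groups[j]
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
    return groups

# Toy input (invented numbers): two HRCs per species, two species.
hrcs = ["human_A", "rhesus_B", "human_C", "rhesus_D"]
fractions = {
    ("human_A", "rhesus_B"): 0.9, ("rhesus_B", "human_A"): 1.0,
    ("human_C", "rhesus_D"): 0.8, ("rhesus_D", "human_C"): 0.7,
    ("human_A", "rhesus_D"): 0.1, ("rhesus_D", "human_A"): 0.2,
    ("human_C", "rhesus_B"): 0.0, ("rhesus_B", "human_C"): 0.0,
    ("human_A", "human_C"): 0.05, ("human_C", "human_A"): 0.05,
    ("rhesus_B", "rhesus_D"): 0.0, ("rhesus_D", "rhesus_B"): 0.1,
}
distances = build_distance(hrcs, fractions)
occs = average_linkage(hrcs, distances, n_groups=2)
```

With these toy fractions, the two near-reciprocal best hits end up in the same group, and the number of groups plays the role of the user-adjustable cluster number.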

In the last steps, OCCs are manually further refined by merging neighboring OCCs with similar marker gene and transcriptome profiles (see Methods). To avoid bias, we first identify marker genes independently for each species solely based on scRNA-seq expression data (

Hao et al., 2021

). We then intersect those lists to identify the top-ranking marker genes with consistently good specificity across all species. The final set of conserved marker genes then serves us to derive cell type labels by searching the literature as well as databases of known marker genes (

Figure 2E

). If the marker-gene-based cell type assignment reveals cluster inconsistencies, they can be marked for further splitting. This feature is of particular importance for rare cell types. For example, we separated a cluster of early progenitor cells into iPSCs, cardiac progenitors, and early epithelial cells.

Suresh et al., 2023

devised a conceptually similar approach to ours to identify orthologous cell types across species. The main difference is that they used scores from MetaNeighbor (

Crow et al., 2018

), whereas we use SingleR to measure distances between HRCs. In essence, both scores are based on rank correlations, so it is not surprising that both scoring systems yield consistent cluster groupings that show high replicability across species. However, using our SingleR-based scores to compare OCCs across species may yield more clearly defined correspondences than MetaNeighbor scores (

Figure 2—figure supplements 1

and

2

).

Overall, we are confident that our approach yields meaningful orthologous cell type assignments, without requiring a prior annotation per species or a reference dataset. Moreover, the necessary fine-tuning of the cell type clusters by the expert user is facilitated by an interactive app.

Many cell types are shared between day 8 and day 16 EBs

Using the strategy described in the previous section, we detected a total of 15 reproducible cell types from the three germ layers, all of which were detected in at least three cell lines in three independent replicates. Among these, we identified four cell types that represent the latest time points along ectodermal developmental lineages (astrocyte progenitor, granule precursor, neurons, neural crest II), four that represent the latest time points along mesodermal lineages (fibroblasts, smooth muscle cells, cardiac endothelial cells, cardiac fibroblasts), and two that represent the latest detected time points along endodermal lineages (epithelial cells, hepatocytes). Many of these cell types were present at both sampling times (

Figure 2—figure supplement 3C

). The most notable exception is that orangutan EBs lost the majority of ectodermal cells at the later time point. Aside from this technical deviation, which was likely caused by the additional handling step (see previous chapter), some more differentiated cell types only appear at appreciable frequencies at day 16. This is most pronounced for smooth muscle cells in all species, but also holds for neuron-like cells in humans. Overall, this leads to an increase in the observed cell type diversity over time.

To further evaluate differences between the two sampling time points, we performed pseudotime analyses (

Street et al., 2018

) on the experiments integrated per species and germ layer, defining iPSCs as the origin and the differentiated cell types listed above as the endpoints of the developmental trajectories (

Figure 2—figure supplements 4

–

6

). As expected, day 16 cells generally occupy later positions along the trajectories than day 8 cells, yet the distributions overlap: iPSCs and precursor states, such as early ectoderm, are still detectable in the day 16 EBs, albeit at lower frequency. Still, the few states that are confined to one of the two time points improve cross-species comparability when both are considered jointly. Integrating day 8 and day 16 increased the overlap in detected cell types between species: for example, human neural cells were only observed at day 16, whereas they were already present at day 8 in macaques. We therefore used the combined data from both time points for downstream analyses.

Overall, 9 of the 15 cell types were detected in at least 3 species, and 7 cell types were reproducibly detected in all four species (

Figure 2C and D

;

Figure 2—figure supplement 3

). These 7 cell types consisted of iPSCs, two cell types representing ectoderm: early ectoderm and neural crest, two cell types of mesodermal origin: smooth muscle cells and cardiac fibroblasts, and two endodermal cell types: epithelial cells and hepatocytes (

Figure 2C and E

) and are used for the analysis of pleiotropy and marker genes in the remainder of this manuscript.

Cell type-specific genes have less conserved expression levels

Based on the premise that it is not necessarily the expression level, but rather the expression breadth that determines expression conservation (

Duret and Mouchiroud, 2000

), we developed a method to call a gene ‘expressed’ or not that considers the expression variance across the cells of one type, which we then used to score cell type-specificity and expression conservation (

Figure 3B

); see Methods.

Figure 3

with 1 supplement


Effect of cell type specificity on expression conservation.

(

A

) UMAP visualizations depicting expression patterns of selected example genes:

*SOX10*

(conserved cell type-specific expression in neural crest cells),

*ESRG*

(species-specific and cell type-specific expression in human iPSCs), and

*RPL22*

(conserved, broad expression). (

B

) For each gene, expression was summarized per species and cell type as the expression fraction and binarized into ‘not expressed’/‘expressed’ (black frame) based on cell type-specific thresholds. The same example genes as in (

A

) are shown here. iPSCs: induced pluripotent stem cells, EE: early ectoderm, NC: neural crest, SMC: smooth muscle cells, CFib: cardiac fibroblasts, EC: epithelial cells, Hepa: hepatocytes. (

C

) Boxplot of expression conservation of genes according to the number of different cell types in which a gene is expressed in humans (cell type specificity). (

D

) Boxplot of the fraction of coding sequence sites that were found to evolve under constraint based on a 43 primate phylogeny (

Sullivan et al., 2023

), stratified by human cell type specificity.

For example, we find that the neural crest marker

SOX10

(

Mollaaghababa and Pavan, 2003

) is cell type-specific and conserved, the lncRNA

ESRG

is iPSC- and human-specific; in contrast,

RPL22

, a gene that encodes a protein of the large ribosomal subunit, is broadly expressed and conserved (

Figure 3A

). Overall, we find on average ∼15% of genes to be cell type-specific, i.e., our score determined them to be expressed in only one cell type, while ∼40% of genes were found to be broadly expressed in all seven cell types (

Figure 3—figure supplement 1A

).
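
As a rough illustration of how a binarized expression call translates into a cell type-specificity class, the sketch below counts the cell types in which a gene is called ‘expressed’. The thresholds, function names, and expression fractions are hypothetical placeholders: the actual thresholds are derived per cell type as described in the Methods and are not reproduced here.

```python
# The seven shared cell types, abbreviated as in the figure legends.
CELL_TYPES = ["iPSCs", "EE", "NC", "SMC", "CFib", "EC", "Hepa"]

def binarize(expr_fraction, thresholds):
    """Call a gene 'expressed' in a cell type if its expression fraction
    reaches that cell type's threshold (placeholder thresholds here)."""
    return {ct: expr_fraction[ct] >= thresholds[ct] for ct in CELL_TYPES}

def specificity(calls):
    """Return (number of cell types with expression, descriptive label)."""
    n = sum(calls.values())
    if n == 0:
        label = "not expressed"
    elif n == 1:
        label = "cell type-specific"
    elif n == len(calls):
        label = "broad"
    else:
        label = "intermediate"
    return n, label

# Hypothetical thresholds and toy expression fractions (not from the data).
toy_thresholds = {ct: 0.1 for ct in CELL_TYPES}
gene_specific = {ct: 0.01 for ct in CELL_TYPES}
gene_specific["NC"] = 0.8                      # detected mainly in one cell type
gene_broad = {ct: 0.9 for ct in CELL_TYPES}    # detected everywhere

specific_result = specificity(binarize(gene_specific, toy_thresholds))
broad_result = specificity(binarize(gene_broad, toy_thresholds))
```

In this toy example, the first gene is called in a single cell type and the second in all seven, which correspond to the two extremes discussed in the text.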

Additionally, we obtained a measure of expression conservation, which quantifies the consistency of the cell type expression score across species. We found that broadly expressed genes present in all cell types exhibited high expression conservation, whereas cell type-specific genes tended to be more species-specific (

Figure 3C

;

Figure 3—figure supplement 1B

).

Unsurprisingly, broadly expressed genes also showed higher average expression levels (

Kliesmete et al., 2024

;

Figure 3—figure supplement 1D

). To ensure that the observed relationship between expression breadth and conservation in our data is not solely due to expression level differences, we sub-sampled genes from all cell type-specificity levels to obtain comparable mean expression. This did not change the pattern: even broadly expressed genes with a low mean expression level are highly conserved across species (

Figure 3—figure supplement 1E and F

). Moreover, the coding sequences of broadly expressed genes show higher levels of constraint than those of more cell type-specific genes, thus supporting the notion that the higher conservation of the expression pattern that we observed here is due to evolutionarily stable functional constraints on this set of genes (

Figure 3D

;

Figure 3—figure supplement 1C

).

Marker gene conservation

Building on our previous observation that cell type-specific genes are less conserved across species, we investigated the conservation and transferability of marker genes, which are, by definition, cell type-specific, in greater detail. To this end, we call marker genes for all cell types and species, using a combination of differential expression analysis and a quantile rank-score based test for differential distribution detection (

Ling et al., 2021

). Additionally, we define a good marker gene as one that is upregulated and expressed in a higher fraction of cells compared to the rest. To prioritize marker genes, we rank them based on the difference in the detection fraction: the proportion of cells of a given type in which a gene is detected compared to its detection rate in all other cells.
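
The ranking step described above, the difference in detection fraction between the cell type of interest and all other cells, can be illustrated as follows. This sketch covers only the ranking, not the preceding differential expression and differential distribution tests; the data layout (one dictionary of counts per cell), the function names, and the toy counts are assumptions made for illustration.

```python
def detection_fraction(cells, gene):
    """Fraction of cells in which `gene` has a non-zero count."""
    if not cells:
        return 0.0
    return sum(1 for cell in cells if cell.get(gene, 0) > 0) / len(cells)

def rank_markers(cells_by_type, target, genes):
    """Rank genes by detection fraction in `target` minus detection
    fraction in all other cells; keep only genes with a positive difference."""
    inside = cells_by_type[target]
    outside = [cell for ct, cells in cells_by_type.items()
               if ct != target for cell in cells]
    scores = {g: detection_fraction(inside, g) - detection_fraction(outside, g)
              for g in genes}
    ranked = sorted((g for g in genes if scores[g] > 0),
                    key=lambda g: -scores[g])
    return ranked, scores

# Toy counts (invented): each cell is a dict mapping gene -> count.
toy_cells = {
    "NC":   [{"g1": 3, "g2": 1, "g4": 2}, {"g1": 2, "g4": 1}, {"g1": 1}],
    "SMC":  [{"g2": 4, "g4": 1}, {"g2": 1, "g3": 2}],
    "Hepa": [{"g3": 1}, {"g2": 2}],
}
ranked_nc, scores_nc = rank_markers(toy_cells, "NC", ["g1", "g2", "g3", "g4"])
```

Here `g1` is detected in every NC cell and in no other cell, so it ranks first; `g4` is detected in two of three NC cells and one of four other cells, so it ranks second; `g2` and `g3` are dropped because they are detected more often outside the target cell type.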

We found a low overlap of top marker genes among species, with a median of 15 of the top 100-ranked marker genes per cell type shared across all four species, while a larger proportion of markers was unique to individual species (

Figure 4A

). Notably, these species-specific markers often exhibited cell type-specific expression in only one species, with reduced or non-specific expression in others (

Figure 4B

;

Figure 4—figure supplement 1

).
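
The overlap summary underlying such a comparison can be sketched as below: for each species, take the top N ranked markers of a cell type, then count the genes shared by all species and the genes found in only one species. The function name, the toy gene lists, and the small N are illustrative assumptions; the analysis in the text uses the top 100 markers.

```python
def top_n_overlap(ranked_by_species, n=100):
    """Return (genes in the top n of every species,
               {species: genes in the top n of that species only})."""
    tops = {sp: set(genes[:n]) for sp, genes in ranked_by_species.items()}
    shared_all = set.intersection(*tops.values())
    unique = {}
    for sp, genes in tops.items():
        others = set().union(*(t for o, t in tops.items() if o != sp))
        unique[sp] = genes - others
    return shared_all, unique

# Toy ranked marker lists for one cell type (invented gene labels).
toy_rankings = {
    "human":      ["A", "B", "C", "D"],
    "orangutan":  ["A", "C", "E", "B"],
    "cynomolgus": ["A", "F", "C", "G"],
    "rhesus":     ["C", "A", "F", "H"],
}
shared, unique = top_n_overlap(toy_rankings, n=3)
```

In the toy example, two genes are shared by all four species, one gene each is unique to human and orangutan, and the two macaques have no unique gene because they share `F` with each other.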

Figure 4

with 3 supplements


Evaluation of marker gene conservation.

(

A

) UpSet plot illustrating the overlap between species for the top 100 marker genes per cell type. (

B

) Heatmap showing the expression fractions of marker genes: on the left, markers shared among all species, and on the right, markers unique to the human ranking. For each cell type, one representative gene is labeled and further detailed in

Figure 4—figure supplement 1

. iPSCs: induced pluripotent stem cells, EE: early ectoderm, NC: neural crest, SMC: smooth muscle cells, CFib: cardiac fibroblasts, EC: epithelial cells, Hepa: hepatocytes. (

C

) Rank-biased overlap (RBO) analysis comparing the concordance of gene rankings per cell type for lncRNAs, protein-coding genes, and transcription factors. (

D

) Average F1-score for a k-nearest neighbor (kNN) classifier trained on the human clone 29B5 to predict cell type identity based on the expression of 1–30 marker genes. Each line represents the performance in a different clone, with shaded areas indicating 95% bootstrap confidence intervals.

Given the special role of transcriptional regulators for the definition of a cell type (

Arendt et al., 2016

) and the differences in conservation between protein-coding and non-coding RNAs (

Johnsson et al., 2014

), we analyzed the comparability of marker genes of different types. To this end, we assessed the concordance of the top 100 marker genes across species for protein-coding genes, lncRNAs, transcription factors (TFs), or all genes using rank-biased overlap (RBO) scores (

Webber et al., 2010

). We find that marker genes that are TFs have the highest concordance between species and that the two macaque species, which are also phylogenetically most similar, are also most similar in their ranked marker gene lists. In contrast, lncRNA markers show the lowest overlap between species. In fact, their cross-species conservation is so low that they also significantly reduce the performance if they are included together with protein-coding markers (

Figure 4C

).
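
For readers unfamiliar with rank-biased overlap, the following sketch computes the truncated RBO sum of Webber et al., 2010, RBO = (1 − p) Σ p^(d−1) A_d, where A_d is the fraction of shared items among the top d entries of both lists. The persistence parameter `p = 0.9` and the use of the truncated (non-extrapolated) sum are assumptions of this sketch; the text does not state which variant or parameter was used.

```python
def rbo_truncated(list_a, list_b, p=0.9, depth=None):
    """Truncated rank-biased overlap of two ranked lists.

    At each depth d, the agreement A_d is the size of the intersection of
    the two top-d prefixes divided by d; agreements are weighted by
    p**(d-1), so disagreements near the top of the lists cost more."""
    k = min(len(list_a), len(list_b))
    if depth is not None:
        k = min(k, depth)
    total = 0.0
    for d in range(1, k + 1):
        agreement = len(set(list_a[:d]) & set(list_b[:d])) / d
        total += p ** (d - 1) * agreement
    return (1 - p) * total

# Toy rankings (invented gene labels).
reference = ["a", "b", "c", "d"]
top_swapped = ["b", "a", "c", "d"]      # disagreement at the top
bottom_swapped = ["a", "b", "d", "c"]   # disagreement at the bottom

rbo_top = rbo_truncated(reference, top_swapped)
rbo_bottom = rbo_truncated(reference, bottom_swapped)
```

Because the sum is truncated at the list length k, two identical lists score 1 − p^k rather than 1. The toy comparison shows the top-weighting: swapping the first two genes lowers the score more than swapping the last two.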

To properly evaluate the performance of marker genes, it is essential to consider their ability to differentiate between cell types. This discriminatory power ultimately determines how well marker genes perform in cell type classification within and across species. To this end, we trained a k-nearest neighbors (kNN) classifier on varying numbers of marker genes per cell type in one human clone (29B5) and evaluated prediction performance using the average F1 score across cell types (

Figure 4—figure supplement 2

). Again, we analyzed markers from a set of all protein-coding genes and TFs only and found that even though TFs appear to be more conserved across species, they do not discriminate cell types as well as the top protein-coding markers (

Figure 4—figure supplement 3

). When we used only protein-coding marker genes determined in 29B5 to classify the other human clone, we achieved good discriminatory power (F1 score > 0.9) with only 11 marker genes per cell type. In contrast, the classification performance for clones from the other species was substantially lower, failing to reach the performance levels observed in human clones even when using up to 30 marker genes (

Figure 4D

).
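
The evaluation logic, training a kNN classifier on marker expression in one clone and scoring predictions in another clone by the average F1 across cell types, can be sketched as follows. The choice of k = 3, Euclidean distance, the function names, and the two-marker toy expression values are assumptions for illustration only and do not reproduce the analysis or its results.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training cells closest to `query`."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def macro_f1(true_labels, predicted):
    """Unweighted mean of the per-cell-type F1 scores."""
    f1_scores = []
    for label in sorted(set(true_labels)):
        pairs = list(zip(true_labels, predicted))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy training clone: expression of two marker genes per cell (invented).
train_x = [(5, 0), (4, 1), (6, 0), (0, 5), (1, 4), (0, 6)]
train_y = ["NC", "NC", "NC", "SMC", "SMC", "SMC"]

# Test cells with similar marker expression, and test cells in which the
# first marker is expressed less specifically.
similar_x, similar_y = [(5, 1), (0, 4)], ["NC", "SMC"]
shifted_x, shifted_y = [(2, 2.5), (0, 5)], ["NC", "SMC"]

f1_similar = macro_f1(similar_y, [knn_predict(train_x, train_y, q) for q in similar_x])
f1_shifted = macro_f1(shifted_y, [knn_predict(train_x, train_y, q) for q in shifted_x])
```

In the toy data, the cells with shifted marker expression are partly misclassified, which lowers the average F1. This mirrors, in a purely schematic way, how reduced marker specificity in another clone translates into reduced discriminatory power.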

In summary, we find that lncRNA marker genes have low transferability between species, while protein-coding markers do reasonably well. However, the predictive value of marker genes decreases with increasing phylogenetic distance, requiring longer marker gene lists to achieve accurate cell type classification for more distantly related species.

Discussion

An essential criterion for a true cell type is reproducibility across experiments, individuals, or even species. This raises the question of how to reliably identify reproducible cell types across species. When cell types are annotated separately for each species, their reproducibility can be evaluated based on transcriptomic similarity (

Crow et al., 2018

;

Wang et al., 2021

). If integration-based methods are used to accomplish this task (

Barr et al., 2023

;

Bakken et al., 2021

), reproducibility not only depends on the similarity of the expression profiles but also on cell type composition. Integration works best when the cell type compositions are as similar as possible across experiments. This, however, is not the case for organoids, which often have highly heterogeneous cell type compositions (

He et al., 2023

) and our EB data are no exception. Moreover, integration methods struggle with large and variable batch effects, which are expected due to the varying phylogenetic distances across species (

Song et al., 2023

). In contrast, classification methods, such as SingleR (

Aran et al., 2019

) rely mainly on the similarity to a reference profile, which makes them less vulnerable to differences in cell type composition and to batch effects. Hence, in our pipeline to identify orthologous cell types, we mainly rely on classification. We start with an unsupervised step, in which we identify cell clusters, and then ensure reproducibility as well as comparability using a supervised approach with reciprocal classification of clusters across all species pairs.

Defining cell types in a developmental dataset is particularly challenging, and we do not believe that there is one perfect solution that would fit all cell types and samples. Therefore, we rely on an interactive approach that we implemented in a shiny app (

) to facilitate the flexible choice of parameters for cluster matching, merging and inspection by visualizing marker genes.

Suresh et al., 2023

employed a similar approach also requiring several manual parameter choices. This makes a formal comparison difficult. Generally, both methods seem to agree well on the orthology assignments of cell type clusters (

Figure 2—figure supplement 2

&

Figure 2—figure supplement 1

). MetaNeighbor, as used by

Suresh et al., 2023

, provides a more quantitative and potentially more sensitive framework for assessing cross-species cell type relationships. However, this higher sensitivity may also make it more affected by data with a lower signal-to-noise ratio, such as our developmental time series.

Hence, the carefully annotated dataset presented here can serve as a valuable resource for future research. Non-human primate iPSCs are central to many studies focusing on evolutionary comparisons, and the pool of iPSC lines for these purposes is expected to grow, incorporating more species and individuals. In this context, the transcriptomic data we generated offer a reference dataset that can be used to verify the pluripotency and differentiation potential of non-human primate iPSC lines by examining gene expression during EB formation.

The set of shared cell types between all four primate species allowed us to evaluate the conservation and transferability of marker genes between species. To begin with, marker genes are by definition cell type-specific, and also with this dataset, we can show that they are less conserved than broadly expressed genes. Expression breadth can be interpreted as a sign of pleiotropy and hence higher functional constraint (

Hastings, 1996

;

Duret and Mouchiroud, 2000

). Conversely, we expect cell type-specific marker genes to be among the least conserved genes. Indeed, we and others find that the overlap of marker genes across species is limited (

Hodge et al., 2019

;

Krienen et al., 2020

;

Bakken et al., 2021

;

Feng et al., 2022

). Moreover, conservation varies significantly across gene biotypes. On the one hand, lncRNAs, which are often highly cell type-specific, exhibit lower cross-species conservation. Their low sequence conservation further complicates their utility for comparative studies (

Johnsson et al., 2014

). On the other hand, TFs, which have been proposed as central elements of a Core Regulatory Complex (CoRC) that defines cell type identity (

Arendt et al., 2016

), are among the most conserved markers across species. However, the power to distinguish cell types based solely on the expression of TF markers remains lower than when markers are selected from the broader set of all protein-coding genes (

Figure 4—figure supplement 3

). Even though within species, a handful of marker genes can achieve remarkable accuracy, their discriminatory power remains lower for other species. Thus, whole transcriptome profiles offer a more comprehensive approach to cross-species cell type classification for single-cell data.

That said, marker genes remain fundamental to most current cell type annotations. Moreover, marker genes will continue to be used to match cell types across modalities, for example to validate cell type properties in experiments that are often based on immunofluorescence of individual markers or on gene panels, as used for spatial transcriptomics (

Benito-Kwiecinski et al., 2021

;

Gulati et al., 2025

). To this end, we have refined the ranking of marker genes beyond differential expression analysis to focus on consistent differences in detection rate. Markers identified in this way are expected to translate better into protein-based validations than markers defined by expression levels, given the discrepancy between mRNA and protein expression (

Pascal et al., 2008

). Furthermore, the presence-absence signal is more robust against cross-species fluctuations in gene expression than measures based on expression level differences.

In conclusion, we present a robust reference dataset for early primate development alongside tools to identify and evaluate orthologous cell types. Our findings emphasize the need for caution when transferring marker genes for cell type annotation and characterization in cross-species studies.

Materials and methods

Cell lines

We used 10 iPSC lines that were all generated in-house and have already been published (

Table 1

). Absence of Sendai virus was confirmed by RT-PCR, and all lines are mycoplasma-free. Cell lines were authenticated using SNP panels that were established using RNA-seq data (

Jocher et al., 2024

).

Table 1

Cell lines.

List of cell lines used for embryoid body (EB) differentiation.

ID	Species	Sex	Publication
29B5	Homo sapiens	Male	Geuder et al., 2021
63Ab2.2	Homo sapiens	Female	Geuder et al., 2021
69A1	Pongo abelii	Male	Geuder et al., 2021
68A20	Pongo abelii	Male	Geuder et al., 2021
82A3	Macaca fascicularis	Female	Edenhofer et al., 2024
56B1	Macaca fascicularis	Female	Edenhofer et al., 2024
56A1	Macaca fascicularis	Female
87B1	Macaca mulatta	Male	Jocher et al., 2024
83D1	Macaca mulatta	Male	Jocher et al., 2024
83Ab1.1	Macaca mulatta	Male	Jocher et al., 2024

EB differentiation method comparison

We initially compare four EB differentiation protocols, which combine two differentiation media (DFK20 and EB-medium) with two differentiation methods (dish and 96-well).

For single-cell differentiation in 96-well plates, primate iPSCs from one 80% confluent 6-well are washed with DPBS and incubated with Accumax (Sigma-Aldrich, SCR006) for 7 min at 37 °C. Afterwards, iPSCs are dissociated to single cells, the enzymatic reaction is stopped by adding DPBS, and cells are counted and pelleted at 300×g for 5 min. Single cells are resuspended either in EB-medium, consisting of StemFit Basic02 (Nippon Genetics, 3821.00) without bFGF, or in DFK20, both supplemented with 10 µM Y-27632 (Biozol, ESI-ST10019). The DFK20 medium consists of DMEM/F12 (Fisher Scientific, 15373541) with 20% KSR (Thermo Fisher Scientific, 10828–028), 1% MEM non-essential amino acids (Thermo Fisher Scientific, 11140–035), 1% Glutamax (Thermo Fisher Scientific, 35050038), 100 U/mL Penicillin, 100 µg/mL Streptomycin (Thermo Fisher Scientific, 15140122), and 0.1 mM 2-Mercaptoethanol (Thermo Fisher Scientific, M3148). Afterwards, 9000 cells in 150 µL medium are seeded per well of a Nunclon Sphera 96-well plate (Fisher Scientific, 15396123) and cultured at 37 °C and 5% CO₂. A medium change with the corresponding EB differentiation medium without ROCK inhibitor is performed every other day during the whole protocol. EBs are collected from the 96-well plate and subjected to flow cytometry after 7 days of differentiation.

For clump differentiation in culture dishes, primate iPSCs from one 80% confluent 12-well are washed with DPBS and incubated with 0.5 mM EDTA (Carl Roth, CN06.3) for 3–5 min at RT. The EDTA is removed, StemFit (Nippon Genetics, 3821.00) supplemented with 10 µM Y-27632 (Biozol, ESI-ST10019) is added, and cells are dissociated to clumps of varying sizes. Subsequently, the clumps are transferred to sterile bacterial dishes with vents and cultured at 37 °C and 5% CO₂. After 24 hr, the medium is replaced with either EB-medium or DFK20 supplemented with 10 µM Y-27632 for an additional 24 hr, before changing the medium to EB-medium or DFK20. From day 4 on, the medium is changed every other day. EBs are collected from the dishes and subjected to flow cytometry after 7 days of differentiation.

Flow cytometry

Flow cytometry is performed on day 7 of the differentiation protocol. For this, 1/10 of the EBs are collected, washed with DPBS, incubated with Accumax (Sigma-Aldrich, SCR006) for 10 min at 37 °C and dissociated to single cells. After washing, cells are incubated with the Viability Dye eFluor 780 (Thermo Fisher Scientific, 65-0865-18) diluted 1/1000 in PBS for 30 min at 4 °C in the dark. The live/dead stain is quenched by the addition of Cell Staining Buffer (CSB) consisting of DPBS with 0.5% BSA (Sigma-Aldrich, A3059), 0.01% NaN₃ (Sigma-Aldrich, S2002), and 2 mM EDTA (Carl Roth, CN06.3). Subsequently, cells are pelleted and incubated with a mixture of the following antibodies diluted 1/200 in CSB for 1 hr at 4 °C in the dark: anti-TRA-1–60-AF488 (STEMCELL Technologies, 60064AD.1), anti-CXCR4-PE (BioLegend, 306505), anti-NCAM1-PE/Cy7 (BioLegend, 318317), and anti-PDGFRα-APC (BioLegend, 323511). After centrifugation, cells are resuspended in PBS containing 0.5% BSA, 0.01% NaN₃, and 1 µg/mL DNase I (STEMCELL Technologies, 07469), filtered through a strainer and analyzed using the BD FACS Canto Flow Cytometry System. Flow cytometry data are analyzed using FlowJo (V10.8.2).

In-vitro embryoid body differentiation

Two human, two orangutan, three cynomolgus, and three rhesus iPSC lines are used for EB differentiation. The human and orangutan iPSCs were reprogrammed from urinary cells, while the cynomolgus and rhesus iPSCs were reprogrammed from fibroblasts. All cell lines were characterized and validated previously and tested negative for mycoplasma and SeV reprogramming vector integration (

Geuder et al., 2021

;

Jocher et al., 2024

;

Edenhofer et al., 2024

).

For embryoid body formation prior to 10x scRNA-seq, the EB differentiation protocol using DFK20 medium in culture dishes is performed in duplicate for each clone. After 8 days of floating culture in dishes, EBs from both replicates are pooled and seeded into 6-wells coated with 0.2% gelatin (Sigma-Aldrich, G1890) for another 8 days of attached culture with medium changes every other day. In total, three replicates of EB formation are performed on different days, and each replicate includes cell lines from all four primate species.

scRNA-seq library generation and sequencing

EBs are sampled on day 8 and day 16 of the protocol. For dissociation, floating EBs are collected, while attached EBs are kept in their wells, washed with DPBS, and incubated with Accumax (Sigma-Aldrich, SCR006) for 10–20 min at 37 °C. Afterwards, EBs are pipetted up and down with a p1000 pipette until they are completely dissociated. The enzymatic reaction is stopped by adding DFK20 medium, and cells are pelleted at 300×g for 5 min and resuspended in 1 mL DPBS. If cell clumps are observed, the suspension is filtered through a 40 µm strainer before the cells are counted with a Countess II automated cell counter (Thermo Fisher Scientific, C10228). Equal cell numbers from each cell line are pooled, washed with DPBS + 0.04% BSA and resuspended in DPBS + 0.04% BSA, aiming for a final concentration of 800–1000 cells/µL. scRNA-seq libraries are generated using the 10x Genomics Chromium Next GEM Single Cell 3’ Kit V3.1 workflow in three replicates. Each time, evenly pooled single cells from the different cell lines are loaded on 2–6 lanes of a 10x chip, targeting 16,000 cells per lane. Libraries are sequenced on an Illumina NextSeq1000/1500 with a 100-cycle kit and the following sequencing setup: read 1 (28 bases), read 2 (10 bases), read 3 (10 bases), and read 4 (90 bases).

Alignment of scRNA-seq data

Reads are processed with Cell Ranger version 7.0.0. We map all reads to four reference genomes: Homo sapiens GRCh38 (GENCODE release 32), Pongo abelii Susie_PABv2/ponAbe3, Macaca fascicularis macFas6, and Macaca mulatta rheMac10. The orangutan, cynomolgus macaque, and rhesus macaque GTF files are created by transferring the hg38 annotation to the corresponding primate genomes via the tool Liftoff (

Shumate and Salzberg, 2021

), followed by removal of transcripts with partial mapping (<50%), low sequence identity (<50%), or excessive length (>100 bp difference and >2 length ratio) for all species.
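
The transcript filter described above can be sketched as follows. This is a minimal illustration rather than the pipeline's actual code: the record fields (`coverage`, `identity`, `ref_len`, `target_len`) are hypothetical names, the length ratio is assumed to be longer over shorter, and we assume that the length criterion removes a transcript only when both the absolute difference and the ratio exceed their limits.

```python
from dataclasses import dataclass


@dataclass
class LiftedTranscript:
    # Hypothetical record for one transcript transferred to a primate genome.
    name: str
    coverage: float   # fraction of the human transcript that mapped (0-1)
    identity: float   # sequence identity of the mapped part (0-1)
    ref_len: int      # transcript length in the human annotation (bp)
    target_len: int   # transcript length after transfer (bp)


def keep_transcript(t: LiftedTranscript) -> bool:
    """Return True if the transcript passes all three filters."""
    if t.coverage < 0.5:           # partial mapping
        return False
    if t.identity < 0.5:           # low sequence identity
        return False
    diff = abs(t.target_len - t.ref_len)
    ratio = max(t.target_len, t.ref_len) / min(t.target_len, t.ref_len)
    if diff > 100 and ratio > 2:   # excessive length change (assumed: both)
        return False
    return True


transcripts = [
    LiftedTranscript("tx_ok", 0.98, 0.95, 2000, 2050),
    LiftedTranscript("tx_partial", 0.40, 0.99, 1500, 600),
    LiftedTranscript("tx_long", 0.90, 0.90, 300, 900),
]
kept = [t.name for t in transcripts if keep_transcript(t)]
```

In this toy input only `tx_ok` is retained; `tx_partial` fails the coverage filter and `tx_long` fails the length filter.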

Species and individual demultiplexing

Since we pool cells from multiple species on each 10x lane, we use cellsnp-lite (

Huang and Huang, 2021

) version 1.2.0 and vireo (

Huang et al., 2019

) version 0.5.7 to assign single cells to their respective species. Initially, we obtain a list of 51,000 informative variants (referred to as ‘species vcf file’) from a bulk RNA-seq experiment involving samples from

Homo sapiens

,

Pongo abelii

and

Macaca fascicularis

, mapped to the GRCh38 reference genome. We run cellsnp-lite in mode 2b for whole-chromosome pileup and filter for high-coverage homozygous variants to identify informative variants.

For the demultiplexing of species in the scRNA-seq data, we employ a two-step strategy:

Initial species assignment: Using the Cell Ranger output aligned to GRCh38, we genotype each single cell with cellsnp-lite providing the species vcf file as candidate SNPs and setting a minimum UMI count filter of 10. Subsequently, we assign single cells to human, orangutan, or macaque identity with vireo using again the species vcf file as the donor file.

Distinguishing macaque species: To differentiate between the two macaque species,

Macaca fascicularis

and

Macaca mulatta

, we use the Cell Ranger output aligned to rheMac10. After genotyping with cellsnp-lite, we demultiplex with vireo, setting the number of donors to two and, in this case, not providing a donor vcf file. We assign the donor for which the majority of distinguishing variants agree with the rheMac10 reference alleles to

Macaca mulatta,

and the other donor to

Macaca fascicularis

.
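
The second step can be illustrated with a small sketch. It assumes that the genotypes of the two anonymous vireo donors at the distinguishing variants have already been reduced to `"ref"` or `"alt"` calls relative to rheMac10; the input format and the function name `assign_macaque_species` are hypothetical and not part of cellsnp-lite or vireo.

```python
def assign_macaque_species(genotypes):
    """Label the donor that agrees most with the rheMac10 reference as rhesus.

    genotypes maps a donor id to a list of calls ("ref" or "alt"), one per
    distinguishing variant (hypothetical, simplified input format).
    """
    ref_fraction = {
        donor: sum(1 for call in calls if call == "ref") / len(calls)
        for donor, calls in genotypes.items()
    }
    rhesus_donor = max(ref_fraction, key=ref_fraction.get)
    return {
        donor: "Macaca mulatta" if donor == rhesus_donor else "Macaca fascicularis"
        for donor in genotypes
    }


# Toy example with five distinguishing variants per donor.
genotypes = {
    "donor0": ["alt", "alt", "ref", "alt", "alt"],
    "donor1": ["ref", "ref", "ref", "alt", "ref"],
}
species = assign_macaque_species(genotypes)
```

Here `donor1` matches the reference at four of five variants and is therefore labeled as rhesus macaque, while `donor0` is labeled as cynomolgus macaque.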

To distinguish different human individuals pooled in the same experiment, we genotype single cells with cellsnp-lite using a candidate vcf file of 7.4 million common variants from the 1000 Genomes Project, demultiplex with vireo specifying two donors, and assign donors to individuals based on the intersection with variants from bulk RNA-seq data of the same individuals. To distinguish between different cynomolgus individuals, we use a reference vcf with informative variants obtained from bulk RNA-seq data to genotype single cells and demultiplex the individuals.

Processing of scRNA-seq data

We remove background RNA with CellBender version 0.2.0 (

Fleming et al., 2023

) at a false positive rate (FPR) of 0.01. After quality control, we retain cells with more than 1000 detected genes and a mitochondrial fraction below 8%. We remove cross-species doublets based on the vireo assignments and intra-species doublets using scDblFinder version 1.6.0 (

Germain et al., 2021

), specifying the expected doublet rate based on the cross-species doublet fraction. For each species, we normalize the counts with scran version 1.28.2 (

Lun et al., 2016

) and integrate data from different experiments with Scanorama (

Hie et al., 2019

). UMAP dimensionality reductions are created with Seurat version 4.3.0 on the first 30 components of the Scanorama-corrected embedding per species.
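
As a small illustration of the quality-control thresholds stated above (more than 1000 detected genes and a mitochondrial fraction below 8%), the following sketch filters a toy list of cells. The dictionary keys and barcodes are hypothetical; in practice these values would come from the per-cell metadata of the count matrix.

```python
def passes_qc(cell, min_genes=1000, max_mito=0.08):
    """Keep cells with > min_genes detected genes and < max_mito mitochondrial fraction."""
    return cell["n_genes"] > min_genes and cell["mito_fraction"] < max_mito


# Toy per-cell metadata (hypothetical values).
cells = [
    {"barcode": "AAAC", "n_genes": 2500, "mito_fraction": 0.03},
    {"barcode": "AAAG", "n_genes": 800, "mito_fraction": 0.02},   # too few genes
    {"barcode": "AAAT", "n_genes": 3100, "mito_fraction": 0.12},  # too much mito
]
retained = [c["barcode"] for c in cells if passes_qc(c)]
```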

Besides the separate processing per species, we also create an integrated dataset of all four species together using Harmony version 0.1.1 (

Korsunsky et al., 2019

). We identify clusters on the first 20 Harmony-integrated PCs with Seurat at a resolution of 0.1, resulting in a number of clusters similar to the broad cell types described in a human EB dataset (

Rhodes et al., 2022

;

Figure 1D and E

).

Reference-based classification

To get an initial cell type annotation, we download a reference dataset of day 21 human EBs (

Rhodes et al., 2022

). We normalize the count matrix with scran and intersect the genes between the reference and our scRNA-seq dataset. Next, we train a SingleR version 2.0.0 (

Aran et al., 2019

) classifier for the broad cell type classes defined in

Figure 1G

of the original publication (

Rhodes et al., 2022

) using trainSingleR with pseudo-bulk aggregation. Cell type labels are transferred to cells of each species with

classifySingleR.

Orthologous cell type annotation

To annotate orthologous cell types, we first perform high-resolution clustering of the scRNA-seq data for each species separately. For this, we take the first 20 components of the Scanorama-corrected embedding as input to perform clustering in Seurat with

FindNeighbors

and

FindClusters

at a resolution of 2 to obtain the initial HRCs.

Next, we score the similarity of all HRCs with an approach based on reciprocal classification. For each species, we train a SingleR classifier on all HRCs of a species. We then classify the cells of all other species with

classifySingleR

. In this way, we can calculate the similarity of each HRC in the target species to each HRC in the reference species as the fraction of cells of the target HRC classified as the reference HRC. To also obtain similarity scores between HRCs within a species, we split the data of each species into a reference set with 80% of cells and a test set with 20% of cells. Analogous to the cross-species classification scheme, we transfer HRC labels from the reference set to the test set and score the overlap of target and reference HRCs.
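
The similarity score defined above can be computed directly from two label vectors. The sketch below assumes that each cell of the target species carries its own HRC label and the reference HRC label predicted by the classifier; the function name `similarity_scores` and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def similarity_scores(target_hrc, predicted_ref_hrc):
    """Fraction of cells of each target HRC classified as each reference HRC."""
    counts = defaultdict(Counter)
    for target, reference in zip(target_hrc, predicted_ref_hrc):
        counts[target][reference] += 1
    return {
        target: {ref: n / sum(c.values()) for ref, n in c.items()}
        for target, c in counts.items()
    }


# Toy example: six macaque cells in two HRCs, classified against human HRCs.
target_hrc = ["m1", "m1", "m1", "m1", "m2", "m2"]
predicted = ["h1", "h1", "h1", "h2", "h2", "h2"]
scores = similarity_scores(target_hrc, predicted)
```

In this example, three of the four `m1` cells are classified as `h1`, giving a similarity of 0.75 for that pair, while all `m2` cells are classified as `h2`.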

In the next step, we combine HRCs based on pairwise similarity scores. We average the bidirectional similarity scores for each HRC pair and construct a distance matrix with all HRCs. Subsequently, we apply hierarchical clustering (hclust, average method) and define 26 initial orthologous cell type clusters (OCCs) after visual inspection of the distance matrix. In this way, we merge similar HRCs within species and match HRCs across species to obtain a set of OCCs.
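
The merging step can be sketched in plain Python. The study uses R's hclust; the naive average-linkage routine below is only a stand-in for illustration. Two things are assumptions: the distance is taken as one minus the averaged similarity, and the number of clusters is passed as a parameter, whereas the text describes choosing 26 clusters by visual inspection. All names and scores are hypothetical.

```python
from itertools import combinations


def symmetric_distance(sim, labels):
    """Average both directed similarity scores and convert to a distance."""
    dist = {}
    for a, b in combinations(labels, 2):
        mean_sim = (sim.get((a, b), 0.0) + sim.get((b, a), 0.0)) / 2
        dist[frozenset((a, b))] = 1.0 - mean_sim  # assumed conversion
    return dist


def average_linkage(labels, dist, n_clusters):
    """Naive agglomerative clustering with average linkage."""
    clusters = [[label] for label in labels]

    def linkage(c1, c2):
        pairs = [dist[frozenset((a, b))] for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    while len(clusters) > n_clusters:
        # Merge the two clusters with the smallest average distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


labels = ["human_1", "human_2", "rhesus_1", "rhesus_2"]
sim = {  # directed scores: (target HRC, reference HRC) -> fraction of cells
    ("human_1", "rhesus_1"): 0.9, ("rhesus_1", "human_1"): 0.8,
    ("human_2", "rhesus_2"): 0.7, ("rhesus_2", "human_2"): 0.9,
    ("human_1", "human_2"): 0.1,
}
occs = average_linkage(labels, symmetric_distance(sim, labels), n_clusters=2)
```

With these toy scores, each human HRC is grouped with its rhesus counterpart, which mirrors how HRCs are matched across species into OCCs.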

OCCs with very similar expression and marker profiles can be further merged. To this end, we create pseudobulk profiles for each OCC and calculate Spearman’s

ρ

for all pair-wise comparisons within a species (s) based on the 2000 most variable genes. We perform hierarchical clustering on
$1 - {\bar{ρ}}_{s}$
and merge orthologous clusters at a cut height of 0.1, which was determined interactively by also inspecting the similarity of the top marker genes as found by Seurat’s

FindMarkers

. In the shiny app, we provide a list of OCC markers for each species separately, as well as the intersection of conserved markers. Based on those marker combinations, the user can then assign the cell types. If the marker gene distribution as visualized in UMAPs reveals overmerged OCCs, the user can split them interactively. Specifically, we separate merged OCC4 into iPSCs, cardiac progenitor cells and early epithelial cells for the final assignment. We assign merged OCC5 as neural crest I, but re-annotate a subcluster present only in cynomolgus and rhesus macaques as fibroblasts. Similarly, we re-annotate a subcluster of merged OCC12 (granule precursor cells) as astrocyte progenitors in cynomolgus and rhesus macaques. Finally, we exclude OCCs with fewer than 800 cells that are only present in one or two species.

We assess the correspondence of the final cell type assignments across species with two approaches. For the scores shown in

Figure 2—figure supplement 1

, we apply the same reciprocal classification approach as described above, providing cell type labels instead of HRCs as initial clusters. For the scores shown in

Figure 2—figure supplement 2

, we use the function MetaNeighborUS of MetaNeighbor Version 1.18.0 to compare cell type labels across species.

Pseudotime analysis

Pseudotime trajectories were inferred separately for ectodermal, mesodermal, and endodermal lineages in each species using slingshot (version 2.12.0) (

Street et al., 2018

). For each germ layer, cells were filtered to include iPSCs and cell types belonging to the respective germ layer. The analysis was based on Scanorama-integrated PCA embeddings (

Hie et al., 2019

), with iPSCs defined as the starting cluster and germ layer-specific differentiated cell types as endpoints (ectoderm: astrocyte progenitors, granule precursor cells, neurons, and neural crest cells; mesoderm: fibroblasts, smooth muscle cells, cardiac endothelial cells, and cardiac fibroblasts; endoderm: epithelial cells and hepatocytes). If neural_crest_II was absent, neural_crest_I was used as an alternative endpoint. PHATE embeddings (phateR, version 1.0.7;

Moon et al., 2019

) were computed from the Scanorama PCA space to visualize the inferred lineages in two dimensions.

Presence-absence scoring of expression

To determine when to define a gene as expressed in a certain cell type, we derive a lower limit of gene detection per cell type and species while accounting for noise and differences in power to detect expression. We first filter the count matrices for each clone, keeping only genes with at least 1% nonzero counts and cells within three median absolute deviations for number of UMIs and the number of genes with counts >0 per cell type and species. These filtered matrices are then downsampled so that we keep the same number of cells in each species (n=18,800), while keeping the original cell type proportion. Next, per species, we estimate the following distributional characteristics per gene (i) across cell types (j): (1) the fraction of nonzero counts (
$f_{i j}$
), (2) the mean (
$μ_{ij} \pm \mathrm{s.e.}(μ_{ij})$
) and dispersion (
$θ_i$
) of the negative binomial distribution using glmGamPoi v1.10.2 (

Ahlmann-Eltze and Huber, 2021

). In the next step, we define a putative expression status per gene per cell type. (1) Genes are detectable if their log mean expression
$\log(μ_{ij})$
is above the 5th percentile of the
$\log(μ)$
value distribution across all genes per cell type. (2) Genes are reliably estimable if the ratio
$\log\left(\frac{\mathrm{s.e.}(μ_{ij})}{μ_{ij}}\right)$
is below the 90th percentile of the
$\log\left(\frac{\mathrm{s.e.}(μ)}{μ}\right)$
value distribution. The expression status is set to 1 only when both conditions are met, and to 0 otherwise. A binomial logistic regression model with Firth’s bias reduction, as implemented in the R package logistf (version 1.26.0), is then fitted to derive the minimal detection fraction needed to call a gene expressed, i.e., with p = P(Y=1), we solve
$\log\left(\frac{p}{1 - p}\right) = a + b \cdot f_{ij}$
for
$f_{ij}$
. To ensure consistency between species, we set the detection threshold for each cell type to the maximum threshold among all species.
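
Solving the logistic model for the detection fraction is a one-line rearrangement, f = (log(p / (1 - p)) - a) / b. The sketch below illustrates this step and the subsequent maximum across species. The coefficients are invented for illustration, and the probability cut-off p = 0.5 is an assumption, since the text does not state the value used.

```python
import math


def detection_threshold(a, b, p=0.5):
    """Solve log(p / (1 - p)) = a + b * f for the detection fraction f.

    a and b are the intercept and slope of the fitted logistic model;
    p = 0.5 is an assumed cut-off, not a value given in the text.
    """
    return (math.log(p / (1 - p)) - a) / b


# Hypothetical (intercept, slope) pairs for one cell type in each species.
coefficients = {
    "human": (-4.0, 40.0),
    "orangutan": (-3.0, 40.0),
    "cynomolgus": (-5.0, 40.0),
    "rhesus": (-4.5, 50.0),
}
per_species = {s: detection_threshold(a, b) for s, (a, b) in coefficients.items()}

# The cell type threshold is the maximum over species.
cell_type_threshold = max(per_species.values())
```

With these toy coefficients the per-species thresholds are 0.1, 0.075, 0.125, and 0.09, so the shared threshold for the cell type is 0.125.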

Cell type specificity and expression conservation scores

To assess cell type specificity and expression conservation of genes across species, we first determine in which cell types a gene is expressed in a species, using the thresholds defined in the previous section. We then define cell type specificity as the number of cell types in which a gene is found to be expressed. This score is at most 7, i.e., the gene is detected in all cell types that were found in all four species.

To evaluate expression conservation, we develop a phylogenetically weighted conservation score for each gene, reflecting the number of species in which the gene is expressed, weighted by the scaled phylogenetic distance as estimated in

Bininda-Emonds et al., 2007

. For each gene, we calculate the expression conservation score as follows:

$$\text{Expression conservation} = \frac{1}{N_{ct}} \sum_{ct} \sum_{b \in \text{detected}} bl$$

where
$N_{ct}$
is the number of cell types in which the gene is detected. We then simply sum the scaled branch lengths
$bl$
across all cell types (
$ct$
) and branches (
$b$
) on which we infer the gene to be expressed. Because we only have four species, we only have one internal branch, for which we infer expression if at least one great ape and one macaque species show expression in that cell type. The score ranges from 0.075 (detected only in cynomolgus or rhesus macaque) to 1 (detected in the same cell types in all four species).
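
A worked example may help to make the score concrete. In the sketch below, the scaled branch lengths are placeholders chosen to sum to 1; only the macaque terminal value of 0.075 follows from the minimum score given in the text, and the other values are hypothetical rather than taken from the phylogeny. The input format (cell type mapped to the set of species with expression) is also an assumption.

```python
# Hypothetical scaled branch lengths summing to 1. Only the macaque value
# (0.075) is taken from the text; the remaining values are placeholders.
BRANCH_LENGTH = {
    "human": 0.2,
    "orangutan": 0.2,
    "cynomolgus": 0.075,
    "rhesus": 0.075,
    "internal": 0.45,
}
GREAT_APES = {"human", "orangutan"}
MACAQUES = {"cynomolgus", "rhesus"}


def detected_branches(species_expressing):
    """Terminal branches of expressing species, plus the internal branch if
    at least one great ape and one macaque show expression."""
    branches = set(species_expressing)
    if branches & GREAT_APES and branches & MACAQUES:
        branches.add("internal")
    return branches


def expression_conservation(detection):
    """detection maps cell type -> set of species in which the gene is expressed."""
    detected = {ct: sp for ct, sp in detection.items() if sp}
    if not detected:
        return 0.0
    total = sum(
        BRANCH_LENGTH[b] for sp in detected.values() for b in detected_branches(sp)
    )
    return total / len(detected)  # divide by N_ct


all_species = GREAT_APES | MACAQUES
fully_conserved = expression_conservation({"neurons": all_species, "iPSCs": all_species})
rhesus_only = expression_conservation({"neurons": {"rhesus"}})
mixed = expression_conservation({"neurons": all_species, "hepatocytes": {"human"}})
```

Under these placeholder branch lengths, a gene detected in the same cell types in all four species scores 1, a gene detected only in rhesus scores 0.075, and the mixed case averages 1 and 0.2 to give 0.6.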

Furthermore, we extract measures of sequence conservation for protein-coding genes from Supplementary Data S14 of a previous study (

Sullivan et al., 2023

). Here, we use the fraction of CDS bases with primate phastCons ≥0.96 as a gene-based measure of constraint.

Marker gene detection

We filter the count matrices for each clone to retain only genes with nonzero counts in one of the 7 cell types that were detected in all species. We then downsample these filtered matrices to equalize the number of cells across species, leaving us with ∼11,600 cells per species. Furthermore, to mitigate differences in statistical power due to varying numbers of cells per cell type, we perform testing on cell types with a minimum of 10 and a maximum of 250 cells for each pairwise comparison of ‘self’ versus ‘other.’ The maximum of 250 cells ensures that the cell type composition of the ‘other’ is comparable across species. We identify marker genes using the p-values (
$p_{\text{adj}} < 0.1$
) determined by ZIQ-Rank (

Ling et al., 2021

) and use Seurat

FindMarkers

with logistic regression to identify the cell types for which the gene is a marker. Furthermore, the marker gene needs to be above the cell type’s detection threshold (see above) and needs to be up-regulated in the cell type for which it is a marker (log fold change >0.01). Finally, a marker gene must be detected in a larger proportion of cells for which it is a marker than in other cell types (
$p_{j} - \bar{p}_{\text{other}} = Δ > 0.01$
). The difference in detection proportion Δ is also used to sort the lists of marker genes, deeming the genes with the largest Δ the best marker genes. To also gauge within-species variation in marker gene detection, we conduct the same analysis across clones instead of species. To compare the cross-species reproducibility of different types of marker genes, i.e., protein-coding genes, lncRNAs, and transcriptional regulators, we compare the ranked lists of marker genes across species. To this end, we perform a concordance analysis using RBO (

Webber et al., 2010

) on the top 100 marker genes (rbo R package version 0.0.1). For this part, a list of transcription factors was created by selecting genes with at least one annotated motif in the motif databases JASPAR 2022 vertebrate core (

Castro-Mondragon et al., 2022

), JASPAR 2022 vertebrate unvalidated (

Castro-Mondragon et al., 2022

) and IMAGE (

Madsen et al., 2018

). Annotations for protein-coding and lncRNA genes were extracted from the Ensembl GTF file provided with the human Cell Ranger reference dataset (GRCh38-2020-A). To assess the predictive performance of marker genes, we conduct a kNN classification (FNN R package version ). We train a kNN classifier (k=3) on the log-normalized counts of the top 1–30 human markers per cell type in the human clone 29B5. We then predict the cell type identity in all clones and summarize classification performance per cell type with F1-scores, as well as the average F1-score across all seven cell types.
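
The ranking by detection difference Δ described in this section can be illustrated with a short sketch. It covers only the Δ criterion, not the ZIQ-Rank test, the logistic regression, or the detection threshold. The gene names and detection rates are invented, and the mean over the other cell types is assumed to be an unweighted average of per-cell-type detection rates.

```python
def detection_delta(detection_rate, cell_type):
    """Delta = p_j minus the mean detection rate in all other cell types."""
    others = [p for ct, p in detection_rate.items() if ct != cell_type]
    return detection_rate[cell_type] - sum(others) / len(others)


def rank_markers(rates_by_gene, cell_type, min_delta=0.01):
    """Keep genes with Delta > min_delta and sort them by decreasing Delta."""
    deltas = {g: detection_delta(r, cell_type) for g, r in rates_by_gene.items()}
    passing = {g: d for g, d in deltas.items() if d > min_delta}
    return sorted(passing, key=passing.get, reverse=True)


# Toy detection rates (fraction of cells with nonzero counts) per cell type.
rates_by_gene = {
    "GENE_A": {"neurons": 0.90, "hepatocytes": 0.10, "iPSCs": 0.20},
    "GENE_B": {"neurons": 0.50, "hepatocytes": 0.40, "iPSCs": 0.30},
    "GENE_C": {"neurons": 0.20, "hepatocytes": 0.60, "iPSCs": 0.40},
}
neuron_markers = rank_markers(rates_by_gene, "neurons")
```

In this toy example `GENE_A` (Δ = 0.75) ranks ahead of `GENE_B` (Δ = 0.15), and `GENE_C` is excluded because it is detected less often in neurons than elsewhere.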

Appendix 1

Appendix 1—table 1

Marker genes.

Literature review for marker genes used in human and mouse / rodents to determine a specific cell type.

Cell type	Marker gene	Used in human	Used in mouse
iPSCs	POU5F1	Nguyen et al., 2018	Loh et al., 2006
iPSCs	NANOG	Nguyen et al., 2018	Apostolou et al., 2013
iPSCs	L1TD1	Närvä et al., 2012	Närvä et al., 2012
early ectoderm	SOX2	Graham et al., 2003	Lodato et al., 2013
early ectoderm	HES5	Ziller et al., 2015	Harada et al., 2021
early ectoderm	RFX4	Ziller et al., 2015	Kawase et al., 2014
granule precursor cells	NFIA	Tan et al., 2023	Fraser et al., 2020
granule precursor cells	ZIC1	Aruga et al., 1998	Schüller et al., 2006
granule precursor cells	ZIC4	Aruga et al., 1998	Blank et al., 2011
neural crest	SOX10	Mollaaghababa and Pavan, 2003	Mollaaghababa and Pavan, 2003 ; Kim et al., 2003
neural crest	FOXD3	Tseng et al., 2016	Dottori et al., 2001
neural crest	S100B	Hackland et al., 2017	Murphy et al., 1991
neurons	STMN2	Klim et al., 2019	Guerra San Juan et al., 2022 ; Ware et al., 2016
neurons	TAGLN3 (NP25)	Mori et al., 2004	Ware et al., 2016
neurons	DCX	Gleeson et al., 1999	Gleeson et al., 1999
smooth muscle cells	COL8A1	Rojas et al., 2024	Muhl et al., 2022
smooth muscle cells	ACTG2	Hashmi et al., 2020	Muhl et al., 2022
smooth muscle cells	ACTA2	Rojas et al., 2024	Muhl et al., 2022
cardiac fibroblasts	TNNT2	Mononen et al., 2020	Tachampa and Wongtawan, 2020
cardiac fibroblasts	DCN	Floy et al., 2021	Ko et al., 2022
cardiac fibroblasts	HAND2	Mononen et al., 2020	Furtado et al., 2014
epithelial cells	CDH1	Oikawa et al., 2018	Bondow et al., 2012
epithelial cells	EPCAM	Martowicz et al., 2016	Huang et al., 2018
epithelial cells	CLDN7	Farkas et al., 2015	Xing et al., 2020
hepatocytes	TTR	Banas et al., 2007	Lavon and Benvenisty, 2005
hepatocytes	APOA1	Krueger et al., 2013	De Giorgi et al., 2021
hepatocytes	APOA2	Krueger et al., 2013	Peng et al., 2018

Data availability

Code for analysis and figures is available on GitHub (

; copy archived at

Janssen, 2024

), and accompanying files are deposited in Zenodo (

). All sequencing files were deposited in GEO (GSE280441).

The following data sets were generated

1. Jocher J
2. Janssen P
3. Vieth B
4. Edenhofer FC
5. Dietl T
6. Térmeg A
7. Geuder J
8. Enard W
9. Hellmann I
(2024)

NCBI Gene Expression Omnibus

ID GSE280441. Identification and comparison of orthologous cell types from primate embryoid bodies shows limits of marker gene transferability.

1. Janssen P
(2024)

Zenodo

Identification and comparison of orthologous cell types from primate embryoid bodies shows limits of marker gene transferability.

References

1. Ahlmann-Eltze C
2. Huber W
(2021)

glmGamPoi: fitting Gamma-Poisson generalized linear models on single cell count data

Bioinformatics

36

:5701–5702.

1. Apostolou E
2. Ferrari F
3. Walsh RM
4. Bar-Nur O
5. Stadtfeld M
6. Cheloufi S
7. Stuart HT
8. Polo JM
9. Ohsumi TK
10. Borowsky ML
11. Kharchenko PV
12. Park PJ
13. Hochedlinger K
(2013)

Genome-wide chromatin interactions of the Nanog locus in pluripotency, differentiation, and reprogramming

Cell Stem Cell

12

:699–712.

1. Aran D
2. Looney AP
3. Liu L
4. Wu E
5. Fong V
6. Hsu A
7. Chak S
8. Naikawadi RP
9. Wolters PJ
10. Abate AR
11. Butte AJ
12. Bhattacharya M
(2019)

Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage

Nature Immunology

20

:163–172.

1. Arendt D
2. Musser JM
3. Baker CVH
4. Bergman A
5. Cepko C
6. Erwin DH
7. Pavlicev M
8. Schlosser G
9. Widder S
10. Laubichler MD
11. Wagner GP
(2016)

The origin and evolution of cell types

Nature Reviews. Genetics

17

:744–757.

1. Aruga J
2. Minowa O
3. Yaginuma H
4. Kuno J
5. Nagai T
6. Noda T
7. Mikoshiba K
(1998)

Mouse Zic1 is involved in cerebellar development

The Journal of Neuroscience

18

:284–293.

1. Bakken T
2. Cowell L
3. Aevermann BD
4. Novotny M
5. Hodge R
6. Miller JA
7. Lee A
8. Chang I
9. McCorrison J
10. Pulendran B
11. Qian Y
12. Schork NJ
13. Lasken RS
14. Lein ES
15. Scheuermann RH
(2017)

Cell type discovery and representation in the era of high-content single cell phenotyping

BMC Bioinformatics

18

:559.

1. Bakken TE
2. Jorstad NL
3. Hu Q
4. Lake BB
5. Tian W
6. Kalmbach BE
7. Crow M
8. Hodge RD
9. Krienen FM
10. Sorensen SA
11. Eggermont J
12. Yao Z
13. Aevermann BD
14. Aldridge AI
15. Bartlett A
16. Bertagnolli D
17. Casper T
18. Castanon RG
19. Crichton K
20. Daigle TL
21. Dalley R
22. Dee N
23. Dembrow N
24. Diep D
25. Ding S-L
26. Dong W
27. Fang R
28. Fischer S
29. Goldman M
30. Goldy J
31. Graybuck LT
32. Herb BR
33. Hou X
34. Kancherla J
35. Kroll M
36. Lathia K
37. van Lew B
38. Li YE
39. Liu CS
40. Liu H
41. Lucero JD
42. Mahurkar A
43. McMillen D
44. Miller JA
45. Moussa M
46. Nery JR
47. Nicovich PR
48. Niu S-Y
49. Orvis J
50. Osteen JK
51. Owen S
52. Palmer CR
53. Pham T
54. Plongthongkum N
55. Poirion O
56. Reed NM
57. Rimorin C
58. Rivkin A
59. Romanow WJ
60. Sedeño-Cortés AE
61. Siletti K
62. Somasundaram S
63. Sulc J
64. Tieu M
65. Torkelson A
66. Tung H
67. Wang X
68. Xie F
69. Yanny AM
70. Zhang R
71. Ament SA
72. Behrens MM
73. Bravo HC
74. Chun J
75. Dobin A
76. Gillis J
77. Hertzano R
78. Hof PR
79. Höllt T
80. Horwitz GD
81. Keene CD
82. Kharchenko PV
83. Ko AL
84. Lelieveldt BP
85. Luo C
86. Mukamel EA
87. Pinto-Duarte A
88. Preissl S
89. Regev A
90. Ren B
91. Scheuermann RH
92. Smith K
93. Spain WJ
94. White OR
95. Koch C
96. Hawrylycz M
97. Tasic B
98. Macosko EZ
99. McCarroll SA
100. Ting JT
101. Zeng H
102. Zhang K
103. Feng G
104. Ecker JR
105. Linnarsson S
106. Lein ES
(2021)

Comparative cellular analysis of motor cortex in human, marmoset and mouse

Nature

598

:111–119.

1. Banas A
2. Teratani T
3. Yamamoto Y
4. Tokuhara M
5. Takeshita F
6. Quinn G
7. Okochi H
8. Ochiya T
(2007)

Adipose tissue-derived mesenchymal stem cells as a source of human hepatocytes

Hepatology

46

:219–228.

1. Barr KA
2. Rhodes KL
3. Gilad Y
(2023)

The relationship between regulatory changes in cis and trans and the evolution of gene expression in humans and chimpanzees

Genome Biology

24

:207.

1. Benito-Kwiecinski S
2. Giandomenico SL
3. Sutcliffe M
4. Riis ES
5. Freire-Pritchett P
6. Kelava I
7. Wunderlich S
8. Martin U
9. Wray GA
10. McDole K
11. Lancaster MA
(2021)

An early cell shape transition drives evolutionary expansion of the human forebrain

Cell

184

:2084–2102.

1. Bininda-Emonds ORP
2. Cardillo M
3. Jones KE
4. MacPhee RDE
5. Beck RMD
6. Grenyer R
7. Price SA
8. Vos RA
9. Gittleman JL
10. Purvis A
(2007)

The delayed rise of present-day mammals

Nature

446

:507–512.

1. Blank MC
2. Grinberg I
3. Aryee E
4. Laliberte C
5. Chizhikov VV
6. Henkelman RM
7. Millen KJ
(2011)

Multiple developmental programs are altered by loss of Zic1 and Zic4 to cause Dandy-Walker malformation cerebellar pathogenesis

Development

138

:1207–1216.

1. Bondow BJ
2. Faber ML
3. Wojta KJ
4. Walker EM
5. Battle MA
(2012)

E-cadherin is required for intestinal morphogenesis in the mouse

Developmental Biology

371

:1–12.

1. Brickman JM
2. Serup P
(2017)

Properties of embryoid bodies

Wiley Interdisciplinary Reviews. Developmental Biology

6

:259.

1. Castro-Mondragon JA
2. Riudavets-Puig R
3. Rauluseviciute I
4. Lemma RB
5. Turchi L
6. Blanc-Mathieu R
7. Lucas J
8. Boddie P
9. Khan A
10. Manosalva Pérez N
11. Fornes O
12. Leung TY
13. Aguirre A
14. Hammal F
15. Schmelter D
16. Baranasic D
17. Ballester B
18. Sandelin A
19. Lenhard B
20. Vandepoele K
21. Wasserman WW
22. Parcy F
23. Mathelier A
(2022)

JASPAR 2022: the 9th release of the open-access database of transcription factor binding profiles

Nucleic Acids Research

50

:D165–D173.

1. Crow M
2. Paul A
3. Ballouz S
4. Huang ZJ
5. Gillis J
(2018)

Characterizing the replicability of cell types defined by single cell RNA-sequencing data using MetaNeighbor

Nature Communications

9

:884.

1. De Giorgi M
2. Li A
3. Hurley A
4. Barzi M
5. Doerfler AM
6. Cherayil NA
7. Smith HE
8. Brown JD
9. Lin CY
10. Bissig K-D
11. Bao G
12. Lagor WR
(2021)

Targeting the

Apoa1

locus for liver-directed gene therapy

Molecular Therapy. Methods & Clinical Development

21

:656–669.
- PubMed
- Google Scholar

1. Dottori M
2. Gross MK
3. Labosky P
4. Goulding M
(2001)

The winged-helix transcription factor Foxd3 suppresses interneuron differentiation and promotes neural crest cell fate

Development

128

:4127–4138.
- PubMed
- Google Scholar

1. Duret L
2. Mouchiroud D
(2000)

Determinants of substitution rates in mammalian genes: expression pattern affects selection intensity but not mutation rate

Molecular Biology and Evolution

17

:68–74.
- PubMed
- Google Scholar

1. Edenhofer FC
2. Térmeg A
3. Ohnuki M
4. Jocher J
5. Kliesmete Z
6. Briem E
7. Hellmann I
8. Enard W
(2024)

Generation and characterization of inducible KRAB-dCas9 iPSCs from primates for cross-species CRISPRi

iScience

27

:110090.
- PubMed
- Google Scholar

1. Farkas AE
2. Hilgarth RS
3. Capaldo CT
4. Gerner-Smidt C
5. Powell DR
6. Vertino PM
7. Koval M
8. Parkos CA
9. Nusrat A
(2015)

HNF4α regulates claudin-7 protein expression during intestinal epithelial differentiation

The American Journal of Pathology

185

:2206–2218.
- PubMed
- Google Scholar

1. Feng M
2. Swevers L
3. Sun J
(2022)

Hemocyte clusters defined by scRNA-Seq in

Bombyx mori

:

In Silico

analysis of predicted marker genes and implications for potential functional roles

Frontiers in Immunology

13

:852702.
- PubMed
- Google Scholar

1. Fleming SJ
2. Chaffin MD
3. Arduini A
4. Akkad AD
5. Banks E
6. Marioni JC
7. Philippakis AA
8. Ellinor PT
9. Babadi M
(2023)

Unsupervised removal of systematic background noise from droplet-based single-cell experiments using CellBender

Nature Methods

20

:1323–1335.
- PubMed
- Google Scholar

1. Floy ME
2. Givens SE
3. Matthys OB
4. Mateyka TD
5. Kerr CM
6. Steinberg AB
7. Silva AC
8. Zhang J
9. Mei Y
10. Ogle BM
11. McDevitt TC
12. Kamp TJ
13. Palecek SP
(2021)

Developmental lineage of human pluripotent stem cell-derived cardiac fibroblasts affects their functional phenotype

FASEB Journal

35

:e21799.
- PubMed
- Google Scholar

1. Franzén O
2. Gan LM
3. Björkegren JLM
(2019)

PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data

Database

2019

:baz046.
- PubMed
- Google Scholar

1. Fraser J
2. Essebier A
3. Brown AS
4. Davila RA
5. Harkins D
6. Zalucki O
7. Shapiro LP
8. Penzes P
9. Wainwright BJ
10. Scott MP
11. Gronostajski RM
12. Bodén M
13. Piper M
14. Harvey TJ
(2020)

Common regulatory targets of NFIA, NFIX and NFIB during postnatal cerebellar development

Cerebellum

19

:89–101.
- PubMed
- Google Scholar

1. Furtado MB
2. Costa MW
3. Pranoto EA
4. Salimova E
5. Pinto AR
6. Lam NT
7. Park A
8. Snider P
9. Chandran A
10. Harvey RP
11. Boyd R
12. Conway SJ
13. Pearson J
14. Kaye DM
15. Rosenthal NA
(2014)

Cardiogenic genes expressed in cardiac fibroblasts contribute to heart development and repair

Circulation Research

114

:1422–1434.
- PubMed
- Google Scholar

1. Germain PL
2. Lun A
3. Garcia Meixide C
4. Macnair W
5. Robinson MD
(2021)

Doublet identification in single-cell sequencing data using

scDblFinder

F1000Research

10

:979.
- PubMed
- Google Scholar

1. Geuder J
2. Wange LE
3. Janjic A
4. Radmer J
5. Janssen P
6. Bagnoli JW
7. Müller S
8. Kaul A
9. Ohnuki M
10. Enard W
(2021)

A non-invasive method to generate induced pluripotent stem cells from primate urine

Scientific Reports

11

:3516.
- PubMed
- Google Scholar

1. Gleeson JG
2. Lin PT
3. Flanagan LA
4. Walsh CA
(1999)

Doublecortin is a microtubule-associated protein and is expressed widely by migrating neurons

Neuron

23

:257–271.

https://doi.org/10.1016/s0896-6273(00)80778-3
- PubMed
- Google Scholar

1. Graham V
2. Khudyakov J
3. Ellis P
4. Pevny L
(2003)

SOX2 functions to maintain neural progenitor identity

Neuron

39

:749–765.

https://doi.org/10.1016/s0896-6273(03)00497-5
- PubMed
- Google Scholar

1. Guerra San Juan I
2. Nash LA
3. Smith KS
4. Leyton-Jaimes MF
5. Qian M
6. Klim JR
7. Limone F
8. Dorr AB
9. Couto A
10. Pintacuda G
11. Joseph BJ
12. Whisenant DE
13. Noble C
14. Melnik V
15. Potter D
16. Holmes A
17. Burberry A
18. Verhage M
19. Eggan K
(2022)

Loss of mouse Stmn2 function causes motor neuropathy

Neuron

110

:1671–1688.
- PubMed
- Google Scholar

1. Gulati GS
2. D’Silva JP
3. Liu Y
4. Wang L
5. Newman AM
(2025)

Profiling cell identity and tissue architecture with single-cell and spatial transcriptomics

Nature Reviews. Molecular Cell Biology

26

:11–31.
- PubMed
- Google Scholar

1. Guo H
2. Tian L
3. Zhang JZ
4. Kitani T
5. Paik DT
6. Lee WH
7. Wu JC
(2019)

Single-cell RNA sequencing of human embryonic stem cell differentiation delineates adverse effects of nicotine on embryonic development

Stem Cell Reports

12

:772–786.
- PubMed
- Google Scholar

1. Guo H
2. Li J
(2021)

scSorter: assigning cells to known cell types according to marker genes

Genome Biology

22

:69.
- PubMed
- Google Scholar

1. Hackland JOS
2. Frith TJR
3. Thompson O
4. Marin Navarro A
5. Garcia-Castro MI
6. Unger C
7. Andrews PW
(2017)

Top-down inhibition of BMP signaling enables robust induction of hPSCs into neural crest in fully defined, xeno-free conditions

Stem Cell Reports

9

:1043–1052.
- PubMed
- Google Scholar

1. Han X
2. Chen H
3. Huang D
4. Chen H
5. Fei L
6. Cheng C
7. Huang H
8. Yuan GC
9. Guo G
(2018)

Mapping human pluripotent stem cell differentiation pathways using high throughput single-cell RNA-sequencing

Genome Biology

19

:47.
- PubMed
- Google Scholar

1. Hao Y
2. Hao S
3. Andersen-Nissen E
4. Mauck WM III
5. Zheng S
6. Butler A
7. Lee MJ
8. Wilk AJ
9. Darby C
10. Zager M
11. Hoffman P
12. Stoeckius M
13. Papalexi E
14. Mimitou EP
15. Jain J
16. Srivastava A
17. Stuart T
18. Fleming LM
19. Yeung B
20. Rogers AJ
21. McElrath JM
22. Blish CA
23. Gottardo R
24. Smibert P
25. Satija R
(2021)

Integrated analysis of multimodal single-cell data

Cell

184

:3573–3587.
- PubMed
- Google Scholar

1. Harada Y
2. Yamada M
3. Imayoshi I
4. Kageyama R
5. Suzuki Y
6. Kuniya T
7. Furutachi S
8. Kawaguchi D
9. Gotoh Y
(2021)

Cell cycle arrest determines adult neural stem cell ontogeny by an embryonic Notch-nonoscillatory Hey1 module

Nature Communications

12

:6562.
- PubMed
- Google Scholar

1. Hashmi SK
2. Barka V
3. Yang C
4. Schneider S
5. Svitkina TM
6. Heuckeroth RO
(2020)

Pseudo-obstruction-inducing ACTG2R257C alters actin organization and function

JCI Insight

5

:140604.
- PubMed
- Google Scholar

1. Hastings KE
(1996)

Strong evolutionary conservation of broadly expressed protein isoforms in the troponin I gene family and other vertebrate gene families

Journal of Molecular Evolution

42

:631–640.
- PubMed
- Google Scholar

Preprint
1. He Z
2. Dony L
3. Fleck JS
4. Szałata A
5. Li KX
6. Slišković I
7. Lin HC
8. Santel M
9. Atamian A
10. Quadrato G
11. Sun J
12. Paşca SP
13. Camp JG
14. Theis F
15. Treutlein B
(2023)

An integrated transcriptomic cell atlas of human neural organoids

bioRxiv.
- Google Scholar

1. Hie B
2. Bryson BD
3. Berger B
(2019)

Efficient integration of heterogeneous single-cell transcriptomes using Scanorama

Nature Biotechnology

37

:685–691.
- PubMed
- Google Scholar

1. Hodge RD
2. Bakken TE
3. Miller JA
4. Smith KA
5. Barkan ER
6. Graybuck LT
7. Close JL
8. Long B
9. Johansen N
10. Penn O
11. Yao Z
12. Eggermont J
13. Höllt T
14. Levi BP
15. Shehata SI
16. Aevermann B
17. Beller A
18. Bertagnolli D
19. Brouner K
20. Casper T
21. Cobbs C
22. Dalley R
23. Dee N
24. Ding S-L
25. Ellenbogen RG
26. Fong O
27. Garren E
28. Goldy J
29. Gwinn RP
30. Hirschstein D
31. Keene CD
32. Keshk M
33. Ko AL
34. Lathia K
35. Mahfouz A
36. Maltzer Z
37. McGraw M
38. Nguyen TN
39. Nyhus J
40. Ojemann JG
41. Oldre A
42. Parry S
43. Reynolds S
44. Rimorin C
45. Shapovalova NV
46. Somasundaram S
47. Szafer A
48. Thomsen ER
49. Tieu M
50. Quon G
51. Scheuermann RH
52. Yuste R
53. Sunkin SM
54. Lelieveldt B
55. Feng D
56. Ng L
57. Bernard A
58. Hawrylycz M
59. Phillips JW
60. Tasic B
61. Zeng H
62. Jones AR
63. Koch C
64. Lein ES
(2019)

Conserved cell types with divergent features in human versus mouse cortex

Nature

573

:61–68.
- PubMed
- Google Scholar

1. Huang L
2. Yang Y
3. Yang F
4. Liu S
5. Zhu Z
6. Lei Z
7. Guo J
(2018)

Functions of EpCAM in physiological processes and diseases (Review)

International Journal of Molecular Medicine

42

:1771–1785.
- PubMed
- Google Scholar

1. Huang Y
2. McCarthy DJ
3. Stegle O
(2019)

Vireo: Bayesian demultiplexing of pooled single-cell RNA-seq data without genotype reference

Genome Biology

20

:273.
- PubMed
- Google Scholar

1. Huang X
2. Huang Y
(2021)

Cellsnp-lite: an efficient tool for genotyping single cells

Bioinformatics

37

:4569–4571.
- PubMed
- Google Scholar

1. Ianevski A
2. Giri AK
3. Aittokallio T
(2022)

Fully-automated and ultra-fast cell-type identification using specific marker combinations from single-cell transcriptomic data

Nature Communications

13

:1246.
- PubMed
- Google Scholar

1. Itskovitz-Eldor J
2. Schuldiner M
3. Karsenti D
4. Eden A
5. Yanuka O
6. Amit M
7. Soreq H
8. Benvenisty N
(2000)

Differentiation of human embryonic stem cells into embryoid bodies compromising the three embryonic germ layers

Molecular Medicine

6

:88–95.
- PubMed
- Google Scholar

Software
1. Janssen P
(2024)

Primate embryoid body analysis, version swh:1:rev:f6bd4b033bf4b167d6d5370543661c19e2a17e3d

Software Heritage.

1. Jocher J
2. Edenhofer FC
3. Janssen P
4. Müller S
5. Lopez-Parra DC
6. Geuder J
7. Enard W
(2024)

Generation and characterization of three fibroblast-derived Rhesus Macaque induced pluripotent stem cell lines

Stem Cell Research

74

:103277.
- PubMed
- Google Scholar

1. Johnsson P
2. Lipovich L
3. Grandér D
4. Morris KV
(2014)

Evolutionary conservation of long non-coding RNAs; sequence, structure, function

Biochimica et Biophysica Acta

1840

:1063–1071.
- PubMed
- Google Scholar

1. Kanton S
2. Boyle MJ
3. He Z
4. Santel M
5. Weigert A
6. Sanchís-Calleja F
7. Guijarro P
8. Sidow L
9. Fleck JS
10. Han D
11. Qian Z
12. Heide M
13. Huttner WB
14. Khaitovich P
15. Pääbo S
16. Treutlein B
17. Camp JG
(2019)

Organoid single-cell genomic atlas uncovers human-specific features of brain development

Nature

574

:418–422.
- PubMed
- Google Scholar

1. Kawase S
2. Kuwako K
3. Imai T
4. Renault-Mihara F
5. Yaguchi K
6. Itohara S
7. Okano H
(2014)

Regulatory factor X transcription factors control Musashi1 transcription in mouse neural stem/progenitor cells

Stem Cells and Development

23

:2250–2261.
- PubMed
- Google Scholar

1. Kim J
2. Lo L
3. Dormand E
4. Anderson DJ
(2003)

SOX10 maintains multipotency and inhibits neuronal differentiation of neural crest stem cells

Neuron

38

:17–31.

https://doi.org/10.1016/s0896-6273(03)00163-6
- PubMed
- Google Scholar

1. Kliesmete Z
2. Orchard P
3. Lee VYK
4. Geuder J
5. Krauß SM
6. Ohnuki M
7. Jocher J
8. Vieth B
9. Enard W
10. Hellmann I
(2024)

Evidence for compensatory evolution within pleiotropic regulatory elements

Genome Research

34

:1528–1539.
- PubMed
- Google Scholar

1. Klim JR
2. Williams LA
3. Limone F
4. Guerra San Juan I
5. Davis-Dusenbery BN
6. Mordes DA
7. Burberry A
8. Steinbaugh MJ
9. Gamage KK
10. Kirchner R
11. Moccia R
12. Cassel SH
13. Chen K
14. Wainger BJ
15. Woolf CJ
16. Eggan K
(2019)

ALS-implicated protein TDP-43 sustains levels of STMN2, a mediator of motor neuron growth and repair

Nature Neuroscience

22

:167–179.
- PubMed
- Google Scholar

1. Ko T
2. Nomura S
3. Yamada S
4. Fujita K
5. Fujita T
6. Satoh M
7. Oka C
8. Katoh M
9. Ito M
10. Katagiri M
11. Sassa T
12. Zhang B
13. Hatsuse S
14. Yamada T
15. Harada M
16. Toko H
17. Amiya E
18. Hatano M
19. Kinoshita O
20. Nawata K
21. Abe H
22. Ushiku T
23. Ono M
24. Ikeuchi M
25. Morita H
26. Aburatani H
27. Komuro I
(2022)

Cardiac fibroblasts regulate the development of heart failure via Htra3-TGF-β-IGFBP7 axis

Nature Communications

13

:3275.
- PubMed
- Google Scholar

1. Korsunsky I
2. Millard N
3. Fan J
4. Slowikowski K
5. Zhang F
6. Wei K
7. Baglaenko Y
8. Brenner M
9. Loh PR
10. Raychaudhuri S
(2019)

Fast, sensitive and accurate integration of single-cell data with Harmony

Nature Methods

16

:1289–1296.
- PubMed
- Google Scholar

1. Krienen FM
2. Goldman M
3. Zhang Q
4. C. H. del Rosario R
5. Florio M
6. Machold R
7. Saunders A
8. Levandowski K
9. Zaniewski H
10. Schuman B
11. Wu C
12. Lutservitz A
13. Mullally CD
14. Reed N
15. Bien E
16. Bortolin L
17. Fernandez-Otero M
18. Lin JD
19. Wysoker A
20. Nemesh J
21. Kulp D
22. Burns M
23. Tkachev V
24. Smith R
25. Walsh CA
26. Dimidschstein J
27. Rudy B
28. S. Kean L
29. Berretta S
30. Fishell G
31. Feng G
32. McCarroll SA
(2020)

Innovations present in the primate interneuron repertoire

Nature

586

:262–269.
- Google Scholar

1. Krueger WH
2. Tanasijevic B
3. Barber V
4. Flamier A
5. Gu X
6. Manautou J
7. Rasmussen TP
(2013)

Cholesterol-secreting and statin-responsive hepatocytes from human ES and iPS cells to model hepatic involvement in cardiovascular health

PLOS ONE

8

:e67296.
- PubMed
- Google Scholar

1. Lavon N
2. Benvenisty N
(2005)

Study of hepatocyte differentiation using embryonic stem cells

Journal of Cellular Biochemistry

96

:1193–1202.
- PubMed
- Google Scholar

1. Ling W
2. Zhang W
3. Cheng B
4. Wei Y
(2021)

Zero-inflated quantile rank-score based test (ziqrank) with application to scrna-seq differential gene expression analysis

The Annals of Applied Statistics

15

:1673–1696.
- PubMed
- Google Scholar

1. Liu X
2. Shen Q
3. Zhang S
(2023)

Cross-species cell-type assignment from single-cell RNA-seq data by a heterogeneous graph neural network

Genome Research

33

:96–111.
- PubMed
- Google Scholar

1. Lodato MA
2. Ng CW
3. Wamstad JA
4. Cheng AW
5. Thai KK
6. Fraenkel E
7. Jaenisch R
8. Boyer LA
(2013)

SOX2 co-occupies distal enhancer elements with distinct POU factors in ESCs and NPCs to specify cell state

PLOS Genetics

9

:e1003288.
- PubMed
- Google Scholar

1. Loh Y-H
2. Wu Q
3. Chew J-L
4. Vega VB
5. Zhang W
6. Chen X
7. Bourque G
8. George J
9. Leong B
10. Liu J
11. Wong K-Y
12. Sung KW
13. Lee CWH
14. Zhao X-D
15. Chiu K-P
16. Lipovich L
17. Kuznetsov VA
18. Robson P
19. Stanton LW
20. Wei C-L
21. Ruan Y
22. Lim B
23. Ng H-H
(2006)

The Oct4 and Nanog transcription network regulates pluripotency in mouse embryonic stem cells

Nature Genetics

38

:431–440.
- PubMed
- Google Scholar

1. Ludwig TE
2. Andrews PW
3. Barbaric I
4. Benvenisty N
5. Bhattacharyya A
6. Crook JM
7. Daheron LM
8. Draper JS
9. Healy LE
10. Huch M
11. Inamdar MS
12. Jensen KB
13. Kurtz A
14. Lancaster MA
15. Liberali P
16. Lutolf MP
17. Mummery CL
18. Pera MF
19. Sato Y
20. Shimasaki N
21. Smith AG
22. Song J
23. Spits C
24. Stacey G
25. Wells CA
26. Zhao T
27. Mosher JT
(2023)

ISSCR standards for the use of human stem cells in basic research

Stem Cell Reports

18

:1744–1752.
- PubMed
- Google Scholar

1. Lun ATL
2. Bach K
3. Marioni JC
(2016)

Pooling across cells to normalize single-cell RNA sequencing data with many zero counts

Genome Biology

17

:1–14.
- Google Scholar

1. Madsen JGS
2. Rauch A
3. Van Hauwaert EL
4. Schmidt SF
5. Winnefeld M
6. Mandrup S
(2018)

Integrated analysis of motif activity and gene expression changes of transcription factors

Genome Research

28

:243–255.
- PubMed
- Google Scholar

1. Martowicz A
2. Seeber A
3. Untergasser G
(2016)

The role of EpCAM in physiology and pathology of the epithelium

Histology and Histopathology

31

:349–355.
- PubMed
- Google Scholar

1. Mollaaghababa R
2. Pavan WJ
(2003)

The importance of having your SOX on: role of SOX10 in the development of neural crest-derived melanocytes and glia

Oncogene

22

:3024–3034.
- PubMed
- Google Scholar

1. Mononen MM
2. Leung CY
3. Xu J
4. Chien KR
(2020)

Trajectory mapping of human embryonic stem cell cardiogenesis reveals lineage branch points and an ISL1 progenitor-derived cardiac fibroblast lineage

Stem Cells

38

:1267–1278.
- PubMed
- Google Scholar

1. Moon KR
2. van Dijk D
3. Wang Z
4. Gigante S
5. Burkhardt DB
6. Chen WS
7. Yim K
8. van den Elzen A
9. Hirn MJ
10. Coifman RR
11. Ivanova NB
12. Wolf G
13. Krishnaswamy S
(2019)

Visualizing structure and transitions in high-dimensional biological data

Nature Biotechnology

37

:1482–1492.
- PubMed
- Google Scholar

1. Mori K
2. Muto Y
3. Kokuzawa J
4. Yoshioka T
5. Yoshimura S
6. Iwama T
7. Okano Y
8. Sakai N
(2004)

Neuronal protein NP25 interacts with F-actin

Neuroscience Research

48

:439–446.
- PubMed
- Google Scholar

1. Muhl L
2. Mocci G
3. Pietilä R
4. Liu J
5. He L
6. Genové G
7. Leptidis S
8. Gustafsson S
9. Buyandelger B
10. Raschperger E
11. Hansson EM
12. Björkegren JLM
13. Vanlandewijck M
14. Lendahl U
15. Betsholtz C
(2022)

A single-cell transcriptomic inventory of murine smooth muscle cells

Developmental Cell

57

:2426–2443.
- PubMed
- Google Scholar

1. Murphy M
2. Bernard O
3. Reid K
4. Bartlett PF
(1991)

Cell lines derived from mouse neural crest are representative of cells at various stages of differentiation

Journal of Neurobiology

22

:522–535.
- PubMed
- Google Scholar

1. Närvä E
2. Rahkonen N
3. Emani MR
4. Lund R
5. Pursiheimo JP
6. Nästi J
7. Autio R
8. Rasool O
9. Denessiouk K
10. Lähdesmäki H
11. Rao A
12. Lahesmaa R
(2012)

RNA-binding protein L1TD1 interacts with LIN28 via RNA and is required for human embryonic stem cell self-renewal and cancer cell proliferation

STEM CELLS

30

:452–460.
- PubMed
- Google Scholar

1. Nguyen QH
2. Lukowski SW
3. Chiu HS
4. Senabouth A
5. Bruxner TJC
6. Christ AN
7. Palpant NJ
8. Powell JE
(2018)

Single-cell RNA-seq of human induced pluripotent stem cells reveals cellular heterogeneity and cell state transitions between subpopulations

Genome Research

28

:1053–1066.
- PubMed
- Google Scholar

1. Oikawa T
2. Otsuka Y
3. Onodera Y
4. Horikawa M
5. Handa H
6. Hashimoto S
7. Suzuki Y
8. Sabe H
(2018)

Necessity of p53-binding to the CDH1 locus for its expression defines two epithelial cell types differing in their integrity

Scientific Reports

8

:1595.
- PubMed
- Google Scholar

1. Pascal LE
2. True LD
3. Campbell DS
4. Deutsch EW
5. Risk M
6. Coleman IM
7. Eichner LJ
8. Nelson PS
9. Liu AY
(2008)

Correlation of mRNA and protein levels: cell type-specific gene expression of cluster designation antigens in the prostate

BMC Genomics

9

:246.
- PubMed
- Google Scholar

1. Peng WC
2. Logan CY
3. Fish M
4. Anbarchian T
5. Aguisanda F
6. Álvarez-Varela A
7. Wu P
8. Jin Y
9. Zhu J
10. Li B
11. Grompe M
12. Wang B
13. Nusse R
(2018)

Inflammatory cytokine TNFα promotes the long-term expansion of primary hepatocytes in 3D culture

Cell

175

:1607–1619.
- PubMed
- Google Scholar

1. Regev A
2. Teichmann SA
3. Lander ES
4. Amit I
5. Benoist C
6. Birney E
7. Bodenmiller B
8. Campbell P
9. Carninci P
10. Clatworthy M
11. Clevers H
12. Deplancke B
13. Dunham I
14. Eberwine J
15. Eils R
16. Enard W
17. Farmer A
18. Fugger L
19. Göttgens B
20. Hacohen N
21. Haniffa M
22. Hemberg M
23. Kim S
24. Klenerman P
25. Kriegstein A
26. Lein E
27. Linnarsson S
28. Lundberg E
29. Lundeberg J
30. Majumder P
31. Marioni JC
32. Merad M
33. Mhlanga M
34. Nawijn M
35. Netea M
36. Nolan G
37. Pe’er D
38. Phillipakis A
39. Ponting CP
40. Quake S
41. Reik W
42. Rozenblatt-Rosen O
43. Sanes J
44. Satija R
45. Schumacher TN
46. Shalek A
47. Shapiro E
48. Sharma P
49. Shin JW
50. Stegle O
51. Stratton M
52. Stubbington MJT
53. Theis FJ
54. Uhlen M
55. van Oudenaarden A
56. Wagner A
57. Watt F
58. Weissman J
59. Wold B
60. Xavier R
61. Yosef N
62. Human Cell Atlas Meeting Participants
(2017)

The Human Cell Atlas

eLife

6

:27041.
- Google Scholar

1. Rhodes K
2. Barr KA
3. Popp JM
4. Strober BJ
5. Battle A
6. Gilad Y
(2022)

Human embryoid bodies as a novel system for genomic studies of functionally diverse cell types

eLife

11

:e71361.
- Google Scholar

1. Rojas MG
2. Pereira-Simon S
3. Zigmond ZM
4. Varona Santos J
5. Perla M
6. Santos Falcon N
7. Stoyell-Conti FF
8. Salama A
9. Yang X
10. Long X
11. Duque JC
12. Salman LH
13. Tabbara M
14. Martinez L
15. Vazquez-Padron RI
(2024)

Single-cell analyses offer insights into the different remodeling programs of arteries and veins

Cells

13

:793.
- PubMed
- Google Scholar

1. Schüller U
2. Kho AT
3. Zhao Q
4. Ma Q
5. Rowitch DH
(2006)

Cerebellar “transcriptome” reveals cell-type and stage-specific expression during postnatal development and tumorigenesis

Molecular and Cellular Neurosciences

33

:247–259.
- PubMed
- Google Scholar

1. Shumate A
2. Salzberg SL
(2021)

Liftoff: accurate mapping of gene annotations

Bioinformatics

37

:1639–1643.
- PubMed
- Google Scholar

1. Song Y
2. Miao Z
3. Brazma A
4. Papatheodorou I
(2023)

Benchmarking strategies for cross-species integration of single-cell RNA sequencing data

Nature Communications

14

:6495.
- Google Scholar

1. Street K
2. Risso D
3. Fletcher RB
4. Das D
5. Ngai J
6. Yosef N
7. Purdom E
8. Dudoit S
(2018)

Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics

BMC Genomics

19

:477.
- PubMed
- Google Scholar

1. Sullivan PF
2. Meadows JRS
3. Gazal S
4. Phan BN
5. Li X
6. Genereux DP
7. Dong MX
8. Bianchi M
9. Andrews G
10. Sakthikumar S
11. Nordin J
12. Roy A
13. Christmas MJ
14. Marinescu VD
15. Wang C
16. Wallerman O
17. Xue J
18. Yao S
19. Sun Q
20. Szatkiewicz J
21. Wen J
22. Huckins LM
23. Lawler A
24. Keough KC
25. Zheng Z
26. Zeng J
27. Wray NR
28. Li Y
29. Johnson J
30. Chen J
31. Zoonomia Consortium§
32. Paten B
33. Reilly SK
34. Hughes GM
35. Weng Z
36. Pollard KS
37. Pfenning AR
38. Forsberg-Nilsson K
39. Karlsson EK
40. Lindblad-Toh K
(2023)

Leveraging base-pair mammalian constraint to understand genetic variation and human disease

Science

380

:eabn2937.
- PubMed
- Google Scholar

1. Suresh H
2. Crow M
3. Jorstad N
4. Hodge R
5. Lein E
6. Dobin A
7. Bakken T
8. Gillis J
(2023)

Comparative single-cell transcriptomic analysis of primate brains highlights human-specific regulatory evolution

Nature Ecology & Evolution

7

:1930–1943.
- PubMed
- Google Scholar

1. Tachampa K
2. Wongtawan T
(2020)

Unique patterns of cardiogenic and fibrotic gene expression in rat cardiac fibroblasts

Veterinary World

13

:1697–1708.
- PubMed
- Google Scholar

Preprint
1. Tan L
2. Shi J
3. Moghadami S
4. Wright CP
5. Parasar B
6. Seo Y
7. Vallejo K
8. Cobos I
9. Duncan L
10. Chen R
11. Deisseroth K
(2023)

Cerebellar granule cells develop non-neuronal 3D genome architecture over the lifespan

bioRxiv.
- Google Scholar

1. The Tabula Muris Consortium
2. Overall coordination
3. Logistical coordination
4. Organ collection and processing
5. Library preparation and sequencing
6. Computational data analysis
7. Cell type annotation
8. Writing group
9. Supplemental text writing group
10. Principal investigators
(2018)

Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris

Nature

562

:367–372.
- Google Scholar

1. Tseng TC
2. Hsieh FY
3. Dai NT
4. Hsu SH
(2016)

Substrate-mediated reprogramming of human fibroblasts into neural crest stem-like cells and their applications in neural repair

Biomaterials

102

:148–161.
- PubMed
- Google Scholar

1. Wang J
2. Sun H
3. Jiang M
4. Li J
5. Zhang P
6. Chen H
7. Mei Y
8. Fei L
9. Lai S
10. Han X
11. Song X
12. Xu S
13. Chen M
14. Ouyang H
15. Zhang D
16. Yuan G-C
17. Guo G
(2021)

Tracing cell-type evolution by cross-species comparison of cell atlases

Cell Reports

34

:108803.
- Google Scholar

1. Ware M
2. Hamdi-Rozé H
3. Le Friec J
4. David V
5. Dupé V
(2016)

Regulation of downstream neuronal genes by proneural transcription factors during initial neurogenesis in the vertebrate brain

Neural Development

11

:22.
- PubMed
- Google Scholar

1. Webber W
2. Moffat A
3. Zobel J
(2010)

A similarity measure for indefinite rankings

ACM Transactions on Information Systems

28

:1–38.
- Google Scholar

1. Xing T
2. Benderman LJ
3. Sabu S
4. Parker J
5. Yang J
6. Lu Q
7. Ding L
8. Chen YH
(2020)

Tight junction protein claudin-7 is essential for intestinal epithelial stem cell self-renewal and differentiation

Cellular and Molecular Gastroenterology and Hepatology

9

:641–659.
- PubMed
- Google Scholar

1. Zhang X
2. Lan Y
3. Xu J
4. Quan F
5. Zhao E
6. Deng C
7. Luo T
8. Xu L
9. Liao G
10. Yan M
11. Ping Y
12. Li F
13. Shi A
14. Bai J
15. Zhao T
16. Li X
17. Xiao Y
(2019a)

CellMarker: a manually curated resource of cell markers in human and mouse

Nucleic Acids Research

47

:D721–D728.
- PubMed
- Google Scholar

1. Zhang Z
2. Luo D
3. Zhong X
4. Choi JH
5. Ma Y
6. Wang S
7. Mahrt E
8. Guo W
9. Stawiski EW
10. Modrusan Z
11. Seshagiri S
12. Kapur P
13. Hon GC
14. Brugarolas J
15. Wang T
(2019b)

SCINA: a semi-supervised subtyping algorithm of single cells and bulk samples

Genes

10

:531.
- Google Scholar

1. Ziller MJ
2. Edri R
3. Yaffe Y
4. Donaghey J
5. Pop R
6. Mallard W
7. Issner R
8. Gifford CA
9. Goren A
10. Xing J
11. Gu H
12. Cacchiarelli D
13. Tsankov AM
14. Epstein C
15. Rinn JL
16. Mikkelsen TS
17. Kohlbacher O
18. Gnirke A
19. Bernstein BE
20. Elkabetz Y
21. Meissner A
(2015)

Dissecting neural differentiation regulatory networks through epigenetic footprinting

Nature

518

:355–359.
- Google Scholar

Article and author information

Author details

Jessica Jocher

Anthropology and Human Genomics, Faculty of Biology, Ludwig-Maximilians-Universität München, Munich, Germany

Contribution

Conceptualization, Data curation, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing – original draft, Project administration

Contributed equally with

Philipp Janssen

Competing interests

No competing interests declared

Philipp Janssen

Anthropology and Human Genomics, Faculty of Biology, Ludwig-Maximilians-Universität München, Munich, Germany

Contribution

Conceptualization, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing – original draft, Writing – review and editing

Contributed equally with

Jessica Jocher

Competing interests

No competing interests declared

ORCID iD

0000-0002-3167-7503

Beate Vieth

Anthropology and Human Genomics, Faculty of Biology, Ludwig-Maximilians-Universität München, Munich, Germany

Contribution

Conceptualization, Software, Formal analysis, Supervision, Investigation, Writing – original draft

Competing interests

No competing interests declared

Fiona C Edenhofer

Anthropology and Human Genomics, Faculty of Biology, Ludwig-Maximilians-Universität München, Munich, Germany

Contribution

Data curation, Methodology

Competing interests

No competing interests declared

ORCID iD

0000-0001-6983-2938

Tamina Dietl

Helmholtz Zentrum München, Deutsches Forschungszentrum für Gesundheit und Umwelt, Munich, Germany

Contribution

Formal analysis, Methodology

Competing interests

No competing interests declared

ORCID iD

0009-0000-4126-2603

Anita Térmeg

Anthropology and Human Genomics, Faculty of Biology, Ludwig-Maximilians-Universität München, Munich, Germany

Contribution

Methodology

Competing interests

No competing interests declared

ORCID iD

0009-0005-8872-9086

Paulina Spurk

Anthropology and Human Genomics, Faculty of Biology, Ludwig-Maximilians-Universität München, Munich, Germany

Contribution

Methodology

Competing interests

No competing interests declared

ORCID iD

0000-0001-8682-370X

Johanna Geuder

Anthropology and Human Genomics, Faculty of Biology, Ludwig-Maximilians-Universität München, Munich, Germany

Contribution

Methodology

Competing interests

No competing interests declared

Wolfgang Enard

Anthropology and Human Genomics, Faculty of Biology, Ludwig-Maximilians-Universität München, Munich, Germany

Contribution

Conceptualization, Supervision, Funding acquisition, Project administration

Contributed equally with

Ines Hellmann

For correspondence

enard@bio.lmu.de

Competing interests

No competing interests declared

ORCID iD

0000-0002-4056-0550

Ines Hellmann

Anthropology and Human Genomics, Faculty of Biology, Ludwig-Maximilians-Universität München, Munich, Germany

Contribution

Conceptualization, Supervision, Funding acquisition, Investigation, Methodology, Writing – original draft, Writing – review and editing

Contributed equally with

Wolfgang Enard

For correspondence

hellmann@bio.lmu.de

Competing interests

No competing interests declared

ORCID iD

0000-0003-0588-1313

Funding

Deutsche Forschungsgemeinschaft (458247426): Wolfgang Enard, Ines Hellmann

Deutsche Forschungsgemeinschaft (458888224): Wolfgang Enard, Ines Hellmann

Deutsche Forschungsgemeinschaft (407541155): Ines Hellmann

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank all members of the Enard/Hellmann group for valuable input and discussions. We are grateful to Stefanie Färberböck for her expert technical assistance and help in cell culture. We acknowledge the Core Facility Flow Cytometry at the Biomedical Center, Ludwig-Maximilians-Universität München, for providing equipment and services. We thank Dr. Stefan Krebs and the staff of LAFUGA and the NGS Competence Center Tübingen (NCCT) for sequencing services. This work was supported by the Deutsche Forschungsgemeinschaft (DFG): PJ and JJ, as well as the majority of the project costs, were funded by a grant to IH and WE (458247426). BV was funded by the grant to IH (407541155) and FE by a grant to WE (458888224).

Version history

Sent for peer review: December 13, 2024

Preprint posted: March 18, 2025

Reviewed Preprint version 1: March 24, 2025

Reviewed Preprint version 2: March 6, 2026

Version of Record published: April 8, 2026

Cite all versions

You can cite all versions using the DOI. This DOI represents all versions, and will always resolve to the latest one.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.