Skip to main content
  • Research article
  • Open access
  • Published:

Detection and characterization of constitutive replication origins defined by DNA polymerase epsilon

Abstract

Background

Despite the process of DNA replication being mechanistically highly conserved, the location of origins of replication (ORI) may vary from one tissue to the next, or between rounds of replication in eukaryotes, suggesting flexibility in the choice of locations to initiate replication. Lists of human ORI therefore vary widely in number and location, and there are currently no methods available to compare them. Here, we propose a method of detection of ORI based on somatic mutation patterns generated by the mutator phenotype of damaged DNA polymerase epsilon (POLE).

Results

We report the genome-wide localization of constitutive ORI in POLE-mutated human tumors using whole genome sequencing data. Mutations accumulated after many rounds of replication of unsynchronized dividing cell populations in tumors allow to identify constitutive origins, which we show are shared with high fidelity between individuals and tumor types. Using a Smith–Waterman-like dynamic programming approach, we compared replication origin positions obtained from multiple different methods. The comparison allowed us to define a consensus set of replication origins, identified consistently by multiple ORI detection methods. Many DNA features co-localized with the consensus set of ORI, including chromatin loop anchors, G-quadruplexes, S/MARs, and CpGs. Among all features, the H2A.Z histone exhibited the most significant association.

Conclusions

Our results show that mutation-based detection of replication origins is a viable approach to determining their location and associated sequence features.

Background

DNA replication origins are crucial for initiation of the DNA synthesis, guiding the recruitment of proteins that form the pre-replication complex (pre-RC), including Mcm2-7 helicase. Helicase leads to the creation of a replication bubble, making the DNA accessible to polymerases, which replicate DNA in a bidirectional manner [1]. The first step in pre-RC formation is the recruitment of the origin recognition complex (ORC) that binds to specific regions in the DNA. These regions, referred to as the DNA replication origins (ORI), are selected based on sequence specificity in yeast [2, 3]; however, in humans, the recognition mechanism utilizes various DNA characteristics [4], and only a limited number of origins are active at each cell cycle [5]. The efficiency of activation of the origins is used to classify them into three categories: constitutive, flexible, and dormant [4]. Constitutive origins are used by all cells, independent of the cell type in each cell cycle, whereas activation of the flexible origins may vary in position or from one cell cycle to the next, or one cell type to the next [4]. Dormant origins become active in stress conditions that affect the S phase, including serum starvation and DNA damage [6]. Variation in origin activation by flexible or dormant origins may be one reason for large differences in the number of ORI between different cell types [7,8,9] and could also help explain differences in results obtained using various methods. Large variation in numbers and positioning of ORI in higher organisms constitutes one of the major confounding factors in the study of human ORI.

The target recognition mechanism of ORC requires DNA characteristics similar to those required by transcription factors at transcription start sites (TSS). They include nucleotide composition, chromatin state, DNA methylation, and secondary structure of DNA [4, 10]. Since ORI must be accessible to protein binding, their locations were shown to coincide with the nucleosome-free regions, histone acetylation, and DNAse sensitive sites [11]. Additionally, a low DNA methylation level is an important factor, making some promoter regions suitable targets for ORC binding [12]. As a result, several of the best studied ORI are in the vicinity of TSS of known genes such as MYC, TOP1, and LMNB2 [13,14,15].

Despite the similarities to TSS, there exists no definite evidence of the existence of specific sequence motifs required to initiate ORC assembly in humans. However, in some unicellular eukaryotic genomes, Cis-acting sequences determine the location of replication origins [2], and in yeast, two sequence elements are necessary: a 17-bp autonomously replicating consensus sequence (ACS) that binds origin recognition complex (ORC): WWWWTTTAYRTTTWGTT [16] and a broader sequence context encompassing 200 to 300 bp that appears to be important for depleting nucleosomes from the origin [17,18,19,20]. In human cells, nucleotide composition [21,22,23] and G-quadruplexes [24] have been shown to increase the replication origin activity also affecting its location. DNA structure is also believed to play an important role in replication origin location and activity. Chromatin loops were shown to be associated with origins [25], and also, it is believed that DNA replication is initiated in regions attached to the nuclear matrix (MARs) [26]. In a more recent work, DNA replication origins were also shown to be overrepresented at the borders of topologically associating domains (TADs) [23].

Previously, it was proposed [27] that mutational patterns emanated by the replicative DNA polymerases might effectively map the origins of replication, using mutations identified in whole genome sequencing experiments. Using this information, we developed a method for the detection of constitutive origins of replication, mORI (mutationally defined ORI), based solely on mutation data from mutator phenotype tumor genomes. We also developed a novel method for the comparison of genomic positions which we used to compare multiple replication origin detection methods. Finally, we used the identified replication origins to characterize DNA structure in the vicinity of constitutive replication origins, determining the factors that are associated with their location.

Results

Detection of replication origins based on mutation patterns

DNA polymerase ε is assumed to be responsible for the leading strand synthesis [28], a result first discovered in yeast, and subsequently reinforced by strong strand biases of context-specific mutations observed in cancers with mutation in the proof-reading domain of the enzyme (Fig. 1A) [27, 29]. To capture this feature and to identify potential replication origins, we analyzed the distribution of mutations in POLE exonuclease damage tumors by developing a POLE-exo-associated Mutation Asymmetry score (PMA). We obtained whole genome sequencing data (WGS) of tumors harboring POLE-exo mutation (at multiple different amino-acid positions), generated by the TCGA and ICGC projects (Additional file 1: Table S1) and focused on 20 cases out of 43 in which at least 20% of all mutations are either TCT→TAT or TCG→TTG—the two mutations most commonly found in these patients ( [30] see mutation signatures 10a and 10b). The main goal of selecting POLE exonuclease mutations is to remove samples which harbor missense mutations that do not lead to a strand-specific mutation pattern and therefore have no bearing on ORI detection. We additionally removed samples in which POLE-exo variant might be a false positive, or might have occurred late during the tumor development (in which case the pattern could be also not visible). Although TCT→TAT and TCG→TTG variants were reported to make up the majority of all mutations in POLE-exo mutant tumors, we noticed that other mutation types might be of relevance since they also occurred in similar spatial patterns around putative ORIs. Since our replication origin detection method (mORI) benefited from higher mutation numbers, we conducted it in two stages. In the first stage, we identified replication origins based on the PMA score calculated using only the reported TCT→TAT and TCG→TTG variants (Fig. 1B).

Fig. 1
figure 1

A DNA replication in human cells. Polymerase ε synthesizes the leading strand while polymerase δ is responsible for the synthesis of the lagging strand [28]. Both polymerases start the synthesis from the replication origin (ORI) after replication is initiated by the origin recognition complex (ORC). Damage in the exonuclease domain of polymerase ε leads to an increased mutation rate with a preference for C→A mutation in the TxT context, observed as TCT->TAT, in the newly replicated strand. This creates an asymmetric pattern of TCT>TAT and AGA>ATA in the vicinity of replication origins. B Diagram of the two-stage ORI detection method (mORI), based on the PMA score

We then used 1000 identified ORI positions with the highest PMA score to study the distribution around these ORI of all 96 possible triple context-dependent variants. Those that showed similar patterns, belonging to clusters A and B shown in Fig. 2, were used in the second stage of the detection, which was based this time on 58 mutation types. Mutations in cluster A show a positive PMA score while those in cluster B a negative one, in order to account for this difference; in the second stage of the detection, we combined mutations with C/A reference allele from cluster A with G/T from cluster B and G/T from cluster A with C/A from cluster B. This increased the total number of mutations used from on average 422 to 606k per sample, also increasing the number of identified replication origins from 5132 to 5409.

Fig. 2
figure 2

Heatmap of the PMA score calculated at ORI positions identified using known POLE-exo-specific variants (marked on the plot). Samples (rows) and mutation types (columns), both clustered into 3 groups. Columns were clustered using only samples in which the percentage of POLE-exo-specific variants exceeded 20% (bar plot located on the right side). Mutations from clusters A and B were used to calculate the PMA score in the final detection. Bar plots on the right show the percentage of POLE-exo-specific variants (marked in the lower left corner) and the total number of mutations per Mb in each of the samples divided into SigProfiler signatures listed in the upper right corner of the plot. Color-based annotation next to the sample IDs determines the cancer type (COAD, colon adenocarcinoma; LUAD, lung adenocarcinoma; READ, rectum adenocarcinoma; UCEC, uterine corpus endometrial carcinoma; PBCA, pediatric brain cancer)

Figure 2 shows the PMA score obtained for the 1000 replication origins identified using TCT→TAT and TCG→TTG (referred to as POLE-exo-specific variants) for individual mutation types and samples. The majority of 20 selected samples, which show at least 20% POLE-exo-specific variants, also show a similar pattern in other mutation types, including AAA→ACA/TTT→TGT, which make up around 9% of all mutations in all of the studied samples. Clusters A and B marked in Fig. 2 contain a total of 58 mutation types which show different frequencies in relation to the replication origin location. Cluster A contains, among others, all C→T mutations no matter the context while cluster B all A→C mutations. The context however seems to affect the frequency of the mutation occurrence with the TxT context being associated with the highest number of variants.

The samples differ significantly in terms of mutation profiles. Samples from cluster 1 (Fig. 2) show a high number of mutations per Mb, from 52 in TCGA-D1-A17Q up to 952 in TCGA-CA-6717. This cluster represents colon, lung, and rectum adenocarcinomas. SigProfiler decomposition of those variants shows a combination of SBS10a, SBS10b, SBS5, and SBS28 signatures, the first two of which are characteristic of POLE-exo mutants and the second two are often observed in cancers, especially the SBS5 clocklike signature. Cluster 3 shows a similar SigProfiler decomposition; however, the variant frequency is much lower, from an average 2 per Mb in TCGA−AG−3892 up to 14 in TCGA−FU−A3HZ. Cluster 3 contains similar cancer types as cluster 1, with an addition of one cervical squamous cell carcinoma and endocervical adenocarcinoma (TCGA-FU-A3HZ) and one pediatric brain cancer (DO227933). Sample TCGA−FI−A2D0 is an outlier in cluster 3 in terms of mutational signature, showing a majority of variants associated with SBS5 (unknown clock-like signature) and some from SBS15 (defective DNA mismatch repair) unseen in other samples from this group. This sample was not selected for the ORI detection as it does not pass the 20% cutoff of POLE-exo-specific variants. Cluster 2 contains the remaining samples which also have less than 20% of POLE-exo-specific variants, representing 12 different cancer types. While some of the cases exceed 100 mutations/Mb, they show a significantly different pattern associated with defective DNA mismatch repair (signatures SBS15 and SBS44) but also signature associated with polymerase ε mutation (SBS14).

The ORI detection algorithm described is expected to identify only the positions which are conserved among individual cell division cycles and among individuals, except for cell type-specific sites and sites that can change very often depending on the number of cell divisions. We believe these sites are highly dependent on the specificity of the DNA structure in their vicinity, guiding the ORC to those locations.

Distribution of identified replication origins

Figure 3 shows ORI sites identified in a 3-Mb segment on the p-arm of chromosome 1, demonstrating the extreme strand bias of selected mutations in their vicinity (for other examples, that show commonly studied ORI, see Additional file 4: Fig. S1, the entire list of identified replication origins is available in Additional file 2: Table S2). The coefficient also shows local minima, which we assume represent sites where polymerase ε meets polymerase δ, i.e., the endpoints of the replication process. The minima are not as evident as maxima, showing a different distribution, and in many cases are not placed halfway between the two neighboring replication origins. Possibly, this could be caused by variation in the firing time of individual replication origins [33], or the polymerase efficiency which, among other things, is known to be GC content-dependent [34].

Fig. 3
figure 3

Replication origins identified in a fragment of chromosome 1. A Individual context-dependent somatic mutations used in the ORI detection algorithm (C/A reference allele from cluster A and G/T from cluster B, as shown in Fig. 2), each row represents one individual patient; B mutations of reverse complementary type to those shown on panel A (G/T reference allele from cluster A and C/A from cluster B as shown on Fig. 2); C the consensus PMA score calculated for the combined samples (blue line) and individual sample PMA score (gray), red lines mark the peak positions which represent eight replication origins identified by mORI numbered 1–8; D replication origins identified by other NGS-based methods (see Supp Table 1), in other samples from various tissues; each row represents one sample; except the Akerman SNS-seq track where the top one corresponds to core and bottom to stochastic origins. E RFD profile (blue dots) for two OK-seq samples from [31], along with identified ORI positions (red rectangles); F exons of known genes; G GC content calculated in 1-kb windows; H nucleotide compositional skew profile (gray dots) and replication origin positions from [21]; I replication time obtained for individual ENCODE samples (gray line) and an average for all samples (purple line); J sequence conservation score (phastCons100way); K ENCODE histone marks; L CpG islands (UCSC data); M isochore positions from [32], lowest GC–L1,L2,H1,H2,H3–highest GC; N DNAse hypersensitivity sites obtained for K562 cell line (ENCODE data)

Locations of the replication origins are reported to be associated with DNA features that are non-randomly distributed over the entire genome [4]. For this reason, the identified replication origins might also be non-randomly distributed. To test this, we computed the median distance between identified ORI and compared it to the median distance between random positions in the genome, using a two-sided Wilcoxon rank sum test. The median distance between identified ORI positions is 397kb, compared to the 355kb of the random positions (p-value < 2.2 × 10−16). While the median distances differ by around 10%, a more significant difference can be observed for the variance, which is around 45% higher in a random set of positions (see Additional file 4: Fig. S2). This indicates that the identified ORI are more regularly spaced, than expected by chance. The average number of mORI per Mb is also fairly consistent between autosomes, with an average of 1.9/Mb (for details, see Additional file 4: Fig. S3).

Comparison of ORI detection methods

Replication origin detection can be carried out using various experimental approaches, summarized in Table 1, which provide from 1 thousand to over 0.5 million positions in the human genome. The detection is expected to vary significantly in terms of sensitivity and specificity also targeting various classes of replication origins (constitutive, flexible, and dormant), resulting in significantly different numbers of detected sites. Our replication origin detection method is based on an asymmetric mutational pattern between the leading and lagging strand. In that respect, it is similar to replication fork directionality (RFD) profiles, shown in Fig. 3, which originate from the Okazaki fragment purification and sequencing (OK-seq) [31, 42]. We compared the mutational pattern in the present work to the experimental RFD profiles available for GM06990, a “normal” cell line, and for HeLa, a cell line of cancerous origin; we also calculated the RFD profiles and carried out origin detection for 10 additional cell lines based on data provided by Wu [42]. The number of ORIs detected using our method (5409) is similar to the number described by Petryk et al. for GM06990 (5684) and samples from Wu et al. (on average 5374).

Table 1 Summary of large-scale replication origin detection methods in human cells

Comparison of two sets of genomic coordinates is usually carried out by calculating the distance between elements located on a specific chromosome or finding the overlaps after converting the coordinates into ranges [51, 52]. This however requires a definition of maximum allowed distance, used to find the overlaps, or pairing the coordinates between both sets for which the distance will be calculated. The latter can be very difficult especially if the overall number of features in both compared sets is different. The problem of comparing two sets of ORI is very similar to the one solved by Needleman–Wunsch global sequence alignment algorithm, where it is possible to introduce gaps into one of the compared sequences in order to account for insertions and deletions. We adopted this dynamic programming approach to work with ORI positions represented as numeric vectors on each chromosome instead of nucleotide sequences and used it to optimally pair to sets of origin positions, which led to an estimation of the pairwise distance between the two sets. Instead of rewards for match and mismatch, we used the absolute distance between locations and used a gap penalty to control the number of gaps introduced to optimize the comparison—gaps represent unmatched positions, from one method or the other (see the “Methods” section). We extended the algorithm to allow the comparison of multiple ORI sets, which works similarly to ClustalW multiple sequence alignment [53].

We used both algorithms, termed numeric vector alignment (NVA) and multiple numeric vector alignment (MNVA), to compare the positions of replication origins between methods that provide a similar number of replication origins (up to 10,000 positions in the genome). Comparison between methods that differ more significantly in the number of identified ORI sites (especially SNS-seq approaches with over 100,000 positions) was not carried out since they likely have significantly different detection resolutions. This would make the comparison misleading and also technically more challenging, since our comparison methodology, which is similar to global sequence alignment, requires that both compared sets of genomic positions have a similar size. Comparison of methods with higher resolution (higher number of genomic positions) would require changes in the gap penalty parameter of the NVA method, due to a significant decrease in the average distance between positions. However, we cannot alter this parameter in order to maintain identical comparison conditions between all pairs of methods.

Panels A and B in Fig. 4 show the results of MNVA conducted on replication origins we identified, based on mutation patterns in individual samples, with the highest number of mutations, as well as positions detected using the union of all samples. We also used experimentally identified ORI sites that originate from the works of Mesner, Picard, Langley, Petryk, and Wu and from the computational method developed by Huvet (see Table 1). Each row represents one origin and the color scale corresponds to the fraction of methods in which it was detected. By applying the MNVA algorithm, we were able to collapse nearly 150 thousand genomic coordinates, across all chromosomes, from 31 distinct sets, into 10,161 locations. Out of those, 4666 were identified in at least 50% and 518 by more than 90% of the samples, from all ORI detection methods (for details, see Additional file 4: Fig. S4). Positions of all compared origins combined using the MNVA algorithm are available as the Additional file 3: Table S3. Figure 4C shows the results of NVA pairwise comparisons between all origins identified. Figure 4 not only shows which positions are identified by the majority or minority of methods (panels A and B) but also the global similarities between them (panel C). Origins detected by mORI were most similar to those detected by the method of Huvet et al. and to the OK-seq. As expected, the three sets of randomly generated genomic positions were poor matches to all other sets.

Fig. 4
figure 4

Similarity between various replication origin detection methods. A Results obtained using the multiple vector alignment algorithm for chromosome 1, each row represents one origin and the color scale corresponds to the fraction of methods in which it was detected; the blue line represents the replication time data. B Enlarged fragment of panel A for 10–20-Mb region of chr1. C Heatmap showing the overall correlation between ORI detection methods, including a set of 5000 random genomic positions for comparison. The correlations are obtained using a dynamic programming-based vector alignment algorithm (see the “Methods” section)

DNA features associated with ORI

The mORI detection method is based on the PMA score, which additionally can be used to rank the origins. We assume higher PMA scores correspond not only to the detection confidence but also to the frequency at which the origin is utilized in a group of multiple analyzed samples. This provides the possibility to identify a set of consistently utilized replication origin positions that can be used to study the DNA features associated with their location. We conducted such analysis using all 5409 ORI positions, identified based on mutation patterns, and only using a subset of 1000 origins, with the highest values of the PMA score. In Fig. 5A, we compared the number of repeat sequences, G-quadruplexes, histone marks, CpG islands, transcription start sites, S/MARs, and chromatin loops at various distances from the replication origins, measured by 20-kb windows from the putative ORI. The occurrences were summed over all studied origins and divided by the minimum occurrence levels, showing only the fold change for each interval.

Fig. 5
figure 5

Correlation of mORI sites with epigenetic features of the genome. A Number of specific features at a given distance from the replication origins, expressed as the fold change with respect to the minimum value from each row. B Number of identified ORI at a specific distance from the chromatin loop anchor. The dashed vertical line shows the loop anchor points, and the red line marks the loop center. Since loops have different sizes but the plot shows a fixed interval of 1MB, the number of ORI was divided by the number of loops which reach this specific length. C Average methylation level of the TCGA samples used for the detection of replication origins at various distances from the top 1000 origin positions. D Number of H2A.Z histone modifications at a specific distance from the replication origins, expressed as the fold change with respect to the minimum value from each row. Individual rows were obtained using various replication origin detection methods and results of the method concordance obtained using the MNVA algorithm

As expected, the location of replication origins is associated with DNA features that affect the accessibility of DNA to protein binding. Many replication origins are located within the promoter and gene-rich regions [4], especially those which are active in a given cell [54]. Figure 5A shows an increase in the number of transcription start sites (TSS) and DNase hypersensitive sites in the vicinity of replication origins (±50kb). A stronger association can be observed for the CpG island locations which are also overrepresented in the vicinity or replication origins.

Simple tandem repeats and repeated sequences derived from the Repbase repository [55] show only a minor increase in the vicinity or ORI. The highest increase among repeated sequences is observable for the Alu SINEs, which were reported to be associated with ORI location [7, 56].

G-quadruplexes (G4) were previously reported as being overrepresented in the vicinity of replication origins [57, 58]. To test this association, we used G4 positions, identified in the [59] study based on G4-seq, which form under physiological K+ conditions and which are stabilized by pyridostatin (PDS). Results provided by both approaches are shown separately for each strand in Fig. 5A. Compared to other features, the increase in the G4 number in the vicinity of ORI is small, but clearly observable and offset from the center in opposite direction on plus and minus strands. The association of G-quadruplexes is stronger for ORI detected using other methods, especially SNS-seq out of which Core replication origins from Akerman et al. [23] show the strongest association (for details, see Additional file 4: Fig. S5). The association is also stronger for ORI detected using a computational method developed by Huvet et al. [21]. This indicates that those two detection methods might favor ORI of different properties, associated with G-quadruplexes. Our method aims to detect only a subset of all ORI (constitutive), for which G-quadruplexes might not be the most relevant factor. The higher association is however unlikely to be a result of more precise association estimates, resulting from a higher number of detected ORI sites. We have shown in Additional file 4: Fig. S6 that a selection of a random subset of 5000 ORI sites from the core SNS-seq ORI set (with a total of 65,329 positions), reported in Akerman et al. [23], has a negligible impact on the results. Promoters of transcriptionally active genes are located inside nucleosome-free regions, which are reported to be associated with replication origins [17, 18]. Chromatin structure limits DNA availability, and therefore, it is one of the major factors that affect replication origin activity. Among all of the features that we compared, H2A.Z histone exhibited the highest occurrence at the replication origins detected by mORI. Long has recently shown H2A.Z histone to epigenetically regulate the licensing and activation of early replication origins [60]. Interestingly, our data show that H3k36me3 and H3k79me2 exhibit depletion at the mORI sites, while showing an increased occurrence frequency at the ± 50–100-kb distance.

The topological organization of the DNA inside the nucleus is another important feature that affects DNA replication. Origins located near the anchor points of chromatin loops were shown to have a higher activity [25]. To validate this feature, we compared the locations of chromatin loops [61], with the locations of the highest scoring 1000 mORI replication origins (Fig. 5B, see the “Methods” section for details). The number of replication origins shows a clear enrichment in the vicinity of loop anchors marked with a vertical dashed line; however, the number of origins drops significantly, approaching the loop center. We also found a weak association between scaffold/nuclear matrix attached regions (S/MARs), gathered in the MARome database [62], and replication origins which were previously reported to have a possible association [26]. We also found mORI to be overrepresented at topologically associating domain (TAD) borders and underrepresented in TAD middle, based on ENCODE data [63]. This is consistent with the findings of Akerman et al. [23] (for details, see Additional file 4: Fig. S7); however, the same pattern for the TAD dataset used in the Akerman paper (referenced as RenLab [64] on the plot) is not as evident. This again may result from the fact that our method aims to detect only constitutive ORI, which make up a small subset of all ORI positions.

The number of mutations associated with the POLE-exo damage is assumed to be correlated with the number of tumor cell divisions (approximately 600 mutations per cell cycle [65]), implying tumor age contributes to the inter-patient differences that were observed (Fig. 3). However, epigenetic modifications to the DNA may also have a significant impact on the location of the replication origins (e.g., DNA methylation [66]) affecting the patient specificity of mutation patterns. We compared the methylation levels, determined in the TCGA project, in the vicinity of mORI replication origins with those for the distant parts of the genome (Fig. 5C). We observed a small drop in the methylation levels in the vicinity of replication origins (~10%) suggesting that the relation between both is not direct and might result from correlation with other structural features of the DNA.

The PMA score allowed us to select ORI conserved across multiple samples. Using the sample comparison based on MNVA, we also selected a subset of origin positions detected using other approaches. Figure 5C shows the frequency of H2A.Z histone in the vicinity of replication origins identified using various approaches. The first panel was created using a subset of positions obtained using the MNVA algorithm, based on a various cutoff level that defines the minimal agreement between all 31 analyzed samples shown below. We excluded the results of our mORI method obtained for individual samples (shown on the “mORI: individual” panel) since the data used to derive them were also used in the combined sample study (mORI: union). The MNVA panel shows that the higher the agreement between the methods, the stronger the association between the replication origin position and frequency of H2A.Z occurrence. A similar trend can be observed in the second panel obtained for the results of mORI selection based on various cutoff levels associated with the PMA score. The higher the PMA score of the identified replication origins, the more frequently H2A.Z is associated with an ORI. While both approaches, based on method concordance and PMA score, can be successfully used to identify a set of conserved origins, the H2A.Z frequency is increased in a narrower interval in the vicinity of ORI selected based on the PMA score and the same situation can be observed for other studied DNA features. This suggests that the origin position is more precisely estimated using the mutation-based approach, compared to the agreement between multiple methods, which is why we selected it for the study shown in Fig. 5A–C. Subsequent panels of Fig. 5D show the H2A.Z frequency in the vicinity of origins identified using the mORI method applied to individual samples with high mutation counts and other replication origin detection approaches also shown in Fig. 4. Similar to the ORI identified based on mutation patterns, the positions obtained in the OK-seq methods also show high association with H2A.Z; however, out of all approaches, the most clear association was obtained for the computational method developed by Huvet et al.

Discussion

We employed on a genomic scale the mutator phenotype associated with damaged polymerase ε, found in cancer cases (mainly colon adenocarcinoma and endometrial carcinoma), to identify genomic positions of ORI and developed an approach to compare positions obtained by other means, such as Repli-Seq or OK-seq. Mutations in the exonuclease domain of polymerase ε affect the proofreading mechanism, leading to characteristic context-specific mutations grouped asymmetrically on the leading strand near ORI. Although exonuclease mutations in POLD1 are much less frequent than in POLE, several cases have been reported [67] and it would be interesting to determine whether POLD1 can also effectively mark ORI.

There are important biological differences in ORI determined by POLE mutational processes compared to approaches depending on isolation and sequencing of newly synthesized DNA, which may lead to deeper insights into the function of mammalian ORI. Mutationally defined ORI (mORI) are inscribed directly onto the DNA through the action of the replicative enzyme over many rounds of cell division in vivo; Repli/OK-Seq obtain data from a single round of replication in cell culture. One outcome of our mutational approach is that only the most consistently firing origins will be clearly observed, which may reduce the overall ORI count relative to other methods. Repli/OK-Seq are dependent on mapping small fragments of DNA, which may prove limiting in understanding how highly repetitive regions of the genome are replicated. mORI in principle could use long-read technologies to overcome mapping difficulties.

We formalized the detection of mutationally derived ORI by constructing, the POLE-exo mutation asymmetry (PMA) score, which assumes the maximum value at the inferred position of ORI. ORI predicted by PMA scores are consistent overall with those obtained by other sequence-based methods. Using a subset of methods that provide a comparable number of replication origins, we conducted a two-stage comparison of replication origin positions. In the first stage, we conducted a pairwise comparison between each of the two methods using a numerical vector alignment (NVA) algorithm based on dynamic programming similar to Needleman–Wunsch for global nucleotide sequence alignment. mORI locations were most similar to the OK-seq-derived ORI among sequencing-based methods. Subsequently, we compared multiple methods by extending the NVA algorithm to a multiple vector version (MNVA), using an algorithm similar to ClustalW multiple sequence alignment [53]. Based on the results, we assessed the overlap among multiple ORI sets and selected origins which are identified in a specific fraction of samples by various methods. We found that 4666 ORI were identified in at least 50% and 518 by more than 90% of the samples, across all compared methods, that show similar detection resolution.

Given the large concordance observed between origins detected using cell types from different patients, we assume that despite cell type and tissue of origin, the majority of ORI positions that we identified can be also observed in normal human cells. We believe that the replication origin positions identified in our study, both based on the mutation patterns and multi-method overlap provide a good basis to study the DNA features that characterize replication origin positions.

Therefore, we sought associations of mORI locations with various DNA properties previously shown to co-occur with ORI location. Repeated sequences and G-quadruplexes were weakly associated with mORI origins, while DNA methylation levels and DNase hypersensitive sites were moderately associated. Replication origins from mORI were demonstrated to occur in the vicinity of chromatin loop anchors as was suggested previously [25]. A weak association with S/MARs [26] as a potential factor influencing ORI location was also observed. Replication origins identified using mORI were also associated with epigenetic modifications, including methylation level and specific chromatin histones, most importantly H2A.Z. We further used H2A.Z as a benchmark to compare other ORI detection methods as well as various criteria used to select a subset of mORI and ORI identified in multiple samples/methods. Association between ORI and H2A.Z is stronger the higher is our prediction score (PMA) and also stronger as more methods identify a particular origin. Additionally, among the sequencing-based ORI detection methods, we showed that Ok-seq, Ini-seq, and ORC2-based ChIP-seq exhibit the highest association with H2A.Z. However, an even stronger association with H2A.Z was observed for the computational method based on nucleotide composition skew [21]. The association with H2A.Z is much weaker for methods that provide a high number of origins, especially those based on SNS-seq [23, 35, 38], Ini-seq [39], and Bubble-seq [40]. This suggests that they may represent a more variable class of origins—as mentioned, mORI is biased toward constitutive ORI. A significantly higher number of ORI detected using those methods may additionally result from higher detection resolution that splits multiple, closely located sites, which in other methods could be reported as a single position. Furthermore, Ini-seq is expected to represent a different subset of positions since this method is highly biased towards the detection of ORI that are firing early in the S-phase [39].

The main weakness of the genomic feature association analysis, in which results are shown in Fig. 5 and Additional file 4: Fig. S5-7, is that it can be affected by the precision of the ORI site location. This is expected to be manifested by the width of the peaks shown on the heatmaps if the precision of the ORI detection algorithm is lower than the plot resolution (20kbp). For more precise ORI locations, the peaks are expected to be more narrow which would also increase their amplitude, affecting the interpretation of the results. For this reason, it is important to consider not only the peak amplitude but also its width, which can indicate that the motif can be located further apart from the ORI, to affect its location, but it is also potentially indicative of either imprecise position mapping, shifts in the ORI location between cells/divisions, or low resolution of the method, which may identify closely located ORI as a single position.

The mORI detection method is applicable to cells characterized by the POLE mutator phenotype, which currently is only associated with cancer. This study used those POLE-exo-mutated tumors found in ca. 1% of TCGA patients, primarily colorectal and endometrial cancers, but larger studies have been reported with representative tumors from many organ systems including the brain, ovaries, prostate pancreas, and lung [67]. Moreover, it should be possible to introduce exonuclease-mutated POLE into the nuclear genome of any cell type using gene editing technologies.

Conclusions

We developed a novel method of replication origin detection, MutORI, based on the mutation patterns of POLE-exo tumors. MutORI identifies replication origins using whole genome sequencing data without any modifications and then classifies origin utilization based on the value of the PMA score, revealing a set of constitutive replication origins in a single step. We applied this methodology to create the first ORI dataset generated from living tissues, rather than cell culture, and used it to characterize DNA structural features in the vicinity of identified ORI positions. The highest association of ORI was with histone H2A.Z, and the association increases with the PMA score. We additionally proposed a new replication origin comparison methodologies, for pairwise ORI comparison, based on Needleman–Wunsch global nucleotide alignment, and multiple ORI sets by adapting a ClustalW multiple sequence alignment approach. These methods enabled a comparative analysis of replication origin positions from different cell types and methods. We report the relevant DNA characteristics associated with the locations of the commonly observed ORI. Selection of ORI positions based on our comparison method allowed us to identify 518 highly conserved ORI, which are detected in over 90% of samples from multiple cell lines and using various detection methods of similar detection resolution.

Methods

Detection of replication origins based on somatic mutations (mORI)

For the purpose of the mutation pattern detection, we define a POLE-exo mutation asymmetry score PMA which is calculated for each n-th position of the genome:

$${\textrm{PMA}}_n=\frac{4}{W_n^2}\left(\sum_{i=n-d}^{n-1}{k}_i^{\textrm{lead}}\sum_{i=n}^{n+d}{k}_i^{\textrm{lag}}-\sum_{i=n-d}^{n-1}{k}_i^{\textrm{lag}}\sum_{i=n}^{n+d}{k}_i^{\textrm{lead}}\right)$$
(1)
$${W}_n=\sum_{i=-d}^d{k}_i^N$$
(2)

where:

dWindow size

\({k}_i^{\textrm{lead}}\)Number of mutations on the leading strand specific to the upstream of replication origin region, e.g., TCT → TAT, identified at position i of the genome, summed over all patients

\({k}_i^{\textrm{lag}}\)Number of mutations on the lagging strand, reverse complementary to those from \({k}_i^{\textrm{lead}}\), e.g., AGA → ATA, identified at position i of the genome, summed over all patients

\({k}_i^N\)Total number of all mutation types, at position i, summed over all patients

$${k}_i^{\textrm{lag}}+{k}_i^{\textrm{lead}}\le {k}_i^N\in \left\{0,1,\dots m\right\}$$

mNumber of patients

To clarify the definition of the coefficient, let us note that \(\sum_{i=n-d}^{n-1}{k}_i^{\textrm{lead}}\) and \(\sum_{i=n}^{n+d}{k}_i^{\textrm{lag}}\) are, correspondingly, the total numbers (in a window of length d) of mutations on the leading strand specific to the upstream region and the total numbers of mutations on the lagging strand specific to the downstream region, counting from a particular genome position n at which the coefficient is calculated. The interpretation of \(\sum_{i=n-d}^{n-1}{k}_i^{\textrm{lag}}\) and \(\sum_{i=n}^{n+d}{k}_i^{\textrm{lead}}\) is analogous.

Suppose now that coordinate n is perfectly “on target,” i.e., it is the position of the replication origin and there is no noise in the data. In this case, \(\sum_{i=n-d}^{n-1}{k}_i^{\textrm{lag}}=\sum_{i=n}^{n+d}{k}_i^{\textrm{lead}}=0\), while by symmetry \(\sum_{i=n-d}^{n-1}{k}_i^{\textrm{lead}}=\sum_{i=n}^{n+d}{k}_i^{\textrm{lag}}=0.5\times \sum_{i=-d}^d{k}_i^N\). Therefore, \(\textrm{PM}{\textrm{A}}_n=\left[{\left(0.5\times \sum_{i=-d}^d{k}_i^N\right)}^2-{0}^2\right]/\left({W}_n^2/4\right)=1\), which is the maximum possible value of the PMAn. The maximum value of the coefficient corresponds to the highest possible difference in mutation occurrence between regions upstream and downstream from a given position. Similarly, the minimum value is −1, which corresponds to a reverse pattern that can be observed in the vicinity of regions where two polymerases meet from opposite directions. These “ideal” values are corrupted by noise and not achievable, but the reasoning illustrates why the coefficient reaches maxima in the proximity of the origins.

The coefficient was calculated every 1kb over the entire genome, using a 200-kb window (100kb upstream and downstream from the selected position). We then smoothed the coefficient (moving average n = 100) and identified local maxima using the peakPick R package (neighlim = 100, peak. npos = 50). Maxima’s with PMA score lower than 0.1 (median value) were omitted (see Additional file 4: Fig. S8 for the PMA score distribution of all peaks). The peaks were then filtered using Fisher’s exact test with Benjamini–Hochberg multiple testing correction, assuming a 0.01 significance level. This removed regions identified using a small number of mutations which are likely false positives. We tested other approaches to peak filtering including methods using strict criteria based on the PMA score, number of mutations in the sliding window, or other tests and multiple testing correction methods. All methods used showed a high concordance affecting only the stringency of the tests which is dependent on the parameters and significance levels used.

Replication origin position obtained using other methods

The positions of identified replication origins were compared to positions obtained using other approaches, both experimental and one additional computational. Table 1 summarizes the results showing the number of available samples and identified replication origins. It also highlights which methods were used in the comparison.

Genomic positions of experimentally identified ORI positions were downloaded from 11 various sources, and we only used data that originate from NGS-based, genome-wide studies, listed in Table 1. In some cases, the data required some pre-processing, as described in Table 2.

Table 2 Sources of alternative ORI positions used in this study

Detection of origins based on OK-seq data

The ORI detection algorithm which we developed for OK-seq data is similar to the approach used in [31]. OK-seq is based on replicative incorporation of the EdU followed by size fractionation of EdU-labeled fragments and Illumina sequencing. The origins are identified using RFD profiles, computed for a 1-kb window using the following formula:

$$\boldsymbol{RFD}=\left(C-W\right)/\left(C+W\right)$$

where C and W correspond to the number of reads mapped on Crick and Watson strands respectively. The RFD values range from −1 to 1 which correspond to the highest proportion of leftward and rightward moving forks respectively. In the vicinity of the replication origin, the RFD profile should show a significant change in the RFD value from −1 to 1.

To calculate the RFD profile, we first used samtools view [69], ver. 1.11, to separate reads from both strands (-b -h -F 16 parameters to get the forward strand reads and -b -h -f 16 for the reverse strand). For this purpose, we used aligned reads stored in a BAM file. We then used bedtools coverage [70] ver. 2.25.0 to obtain the number of reads from both strands in 1-kb windows defined in a BED file, generated using bedtools makewindows, for the hg19 reference genome. Based on those results, we calculated the RFD profile using the formula defined above; however, we rescaled the number of Crick reads (C), so that the total number of W and C reads is identical for each BAM file. To detect the positions of replication origins based on the RFD profile, we first fitted a linear model to the RFD values from a 200-kb windows, moving by 1kb across the entire genome. The slope parameter of the linear model has the highest value at positions where the RFD profile shows a shift from −1 to 1, which is assumed to represent replication origins [31]. We smoothed (moving average n=200) the slope values of the linear models, calculated for the entire genome and later identified the local maxima, using the peakPick R package (neighlim=100, peak.npos= 100, deriv.lim = 1).

We applied this algorithm to the entire [42] dataset (11 samples) for which positions of replication origins were not provided by the authors. The analysis was based on BAM files downloaded from SRA (PRJEB25180), which were aligned to the hg19 reference genome.

Comparison of replication origin positions using NVA and MNVA

Genomic coordinates of DNA replication origins were compared using a modified version of the Needleman–Wunch global sequence alignment algorithm, named numeric vector alignment (NVA). The modification allows to use it for the comparison of two sets of numeric vectors, instead of nucleotide or amino acid sequences (character vectors). The alignment is performed for each chromosome individually. The modified algorithm does not require a substitution matrix, which defines scores given for matches and miss-matches between specific nucleotides. Instead, it uses the absolute distance between both locations and the only parameter required is the gap penalty that controls the number of gaps introduced to the compared vectors.

Individual elements of the scoring matrix S are defined as:

$${S}_{i,j}=\min \left({S}_{i-1,j-1}+{D}_{i,j},{S}_{i,j-1}+d,{S}_{i-1,j}+d\right)$$

where D is the absolute distance between locations i and j from the A and B vectors:

$${D}_{i,j}=\left|{A}_i-{B}_j\left|\right.\right.$$

and d is the gap penalty parameter. The scoring system is reversed compared to the classical Needleman–Wunch algorithm since the higher is the distance between the elements the less similar are the vectors. The alignment score (value associated with the overall quality of the alignment) is defined as the sum of absolute distances between all paired vector elements. For each unpaired element, we add the gap penalty. In our study, we used d=1,000,000. The method also returns a consensus vector, which is the average of paired elements from both individual vectors.

To compare a set of multiple replication origin positions, used to create Fig. 4A and B, we created an algorithm similar to the ClustalW multiple sequence alignment [53], named multiple numeric vector alignment (MNVA). The algorithm comprises the following steps:

  1. 1)

    Calculate all possible pairwise alignments using NVA, record the alignment score for each pair of vectors

  2. 2)

    Create a guide tree based on the pairwise alignment scores matrix obtained in the previous step, using the neighbor joining algorithm

  3. 3)

    Align the sequences with NVA by a progressive method based on the tree obtained in the previous step. In each following iteration, the selected vector is aligned to a consensus obtained in the previous iteration.

  4. 4)

    Create the final alignment matrix by aligning each individual vector to the vector obtained in the final iteration of the previous step (without introducing new gaps)

The function returns a n by m matrix where n is the number of provided sequences and m is the length of the vector with introduced gaps, obtained in the final iteration of step 3. The matrix elements include the original values of each vector separated by added gaps, marked with the dash character.

Implementations of both algorithms, NVA and MNVA, can be downloaded from our GitHub repository.

Generation of random genomic positions

Random genomic positions used in the ORI comparisons were selected using the createRandomRegions function from the regioneR package [71], which allows to exclude regions with gaps in the reference genome. For the study of distances between ORI identified using the PMA score which was compared to random genomic positions, we additionally selected the positions only from regions where mutations were identified. This approach allows omitting the regions with unknown sequence and low complexity, which due to lack of mutation data would bias the results. We compared the results obtained using the createRandomRegions function with our custom implementation based on the selection of random numbers from a uniform distribution, both of which provided very similar results.

Based on a known property of the Poisson process, if the number of events of the process is fixed and equal to (technically, conditional on this event), then the coordinates of the successive events are distributed identically to the order statistics from a sample of uniformly distributed random variables (Theorem 4.5.2 in [72]). Accordingly, if genomic coordinates are independent and sampled from the uniform distribution over the genome length, then effectively they constitute successive events of the Poisson process. For this reason, the random genomic positions obtained using uniform distribution are expected to be identically distributed as those obtained based on the Poisson distribution.

Study of methylation levels at ORI locations

Processed methylation data (beta values) for hg19 were downloaded from the GDC Data Portal (TCGA project) for 19 out of 20 samples, which were used for replication origin detection based on mutation patterns. The only omitted sample was DO227933, for which methylation data was not available in the ICGC database at the time we wrote the manuscript. The downloaded data originated from both Illumina Infinium HumanMethylation450 and HumanMethylation27 BeadChips. While both platforms differ significantly in the number of tested CpGs, we were only interested in the variability of methylation levels at a certain distance of identified replication origins without a direct comparison between the samples. For this reason, all samples were combined into one set which was used to calculate average methylation levels at a particular genomic location. The differences between both platforms affected the weight of individual sample in the average methylation levels; however, we find that to be of small relevance to the conclusions we have drawn based on those statistics.

Comparison of ORI locations and chromatin loops

Chromatin loop data were obtained from the Gene Expression Omnibus database (ID: GSE63525) for K562 cells [61]. The locations were used to estimate the distribution of replication origins in the vicinity of the loop anchor by calculating the number of replication origins at a specific distance from the closest loop. We considered the loop itself and the ± 500-kb surrounding region; however, since the loops have different lengths, we divided the number of origins by the total number of loops that reach a specific length, similarly as we proposed in our previous work [73] for transcription factor binding sites. 

Analysis of genomic features in the vicinity or replication origins

Table 3 lists all genomic features visualized in Fig. 4 and used to determine the number of features at a specific distance from replication origins used to create Fig. 5A, D.

Table 3 Sources of coordinate locations of various DNA features used in this study

Availability of data and materials

All data generated or analyzed during this study are included in this published article, its supplementary information files, and publicly available repositories.

Details concerning POLE-exo mutants, positions of identified replication origins, and comparison of replication origin positions, based on the MNVA algorithm, identified using multiple approaches, are available as supplementary tables S1, S2 and S3.

Implementations of the main R functions used in this study are available in the replicationOrigins R package, which is available through GitHub: https://github.com/rjaksik/replicationOrigins, and Zenodo [78]. The package includes implementations of the following algorithms:

• Replication origin detection based on the PMA score

• OK-seq origin detection based on RFD profiles

• Numeric vector alignment (NVA)

• Multiple numeric vector alignment (MNVA)

The remaining code used for additional analysis and visualizations is available per reasonable request.

Abbreviations

G4:

G-quadruplexes

ICGC:

International Cancer Genome Consortium

MNVA:

Multiple numeric vector alignment

mORI:

Mutationally defined ORI

NVA:

Numeric vector alignment

ORI:

Origins of replication

PMA:

POLE-exo-associated Mutation Asymmetry score

RFD:

Replication fork directionality

S/MARs:

Scaffold/nuclear matrix attached regions

TADs:

Topologically associating domains

TCGA:

The Cancer Genome Atlas

TSS:

Transcription start sites

WGS:

Whole genome sequencing data

References

  1. Cooper G, Hausman R. The cell: a molecular approach. 6th ed. Sunderland: Sinauer Associates; 2013.

    Google Scholar 

  2. Masai H, Matsumoto S, You Z, Yoshizawa-Sugata N, Oda M. Eukaryotic chromosome DNA replication: where, when, and how? Annu Rev Biochem. 2010;79:89–130.

    Article  CAS  PubMed  Google Scholar 

  3. Marahrens Y, Stillman B. A yeast chromosomal origin of DNA replication defined by multiple functional elements. Science. 1992;255(5046):817–23.

    Article  CAS  PubMed  Google Scholar 

  4. Fragkos M, Ganier O, Coulombe P, Mechali M. DNA replication origin activation in space and time. Nat Rev Mol Cell Biol. 2015;16(6):360–74.

    Article  CAS  PubMed  Google Scholar 

  5. DePamphilis ML. Origins of DNA replication in metazoan chromosomes. J Biol Chem. 1993;268(1):1–4.

    Article  CAS  PubMed  Google Scholar 

  6. Gilbert DM. Replication origin plasticity, Taylor-made: inhibition vs recruitment of origins under conditions of replication stress. Chromosoma. 2007;116(4):341–7.

    Article  PubMed  Google Scholar 

  7. Hansen RS, Thomas S, Sandstrom R, Canfield TK, Thurman RE, Weaver M, et al. Sequencing newly replicated DNA reveals widespread plasticity in human replication timing. Proc Natl Acad Sci U S A. 2010;107(1):139–44.

    Article  CAS  PubMed  Google Scholar 

  8. Gao F, Luo H, Zhang CT. DeOri: a database of eukaryotic DNA replication origins. Bioinformatics. 2012;28(11):1551–2.

    Article  CAS  PubMed  Google Scholar 

  9. Hyrien O. Peaks cloaked in the mist: the landscape of mammalian replication origins. J Cell Biol. 2015;208(2):147–60.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  10. Cvetic C, Walter JC. Eukaryotic origins of DNA replication: could you please be more specific? Semin Cell Dev Biol. 2005;16(3):343–53.

    Article  CAS  PubMed  Google Scholar 

  11. Mechali M. Eukaryotic DNA replication origins: many choices for appropriate answers. Nat Rev Mol Cell Biol. 2010;11(10):728–38.

    Article  CAS  PubMed  Google Scholar 

  12. Delgado S, Gomez M, Bird A, Antequera F. Initiation of DNA replication at CpG islands in mammalian chromosomes. EMBO J. 1998;17(8):2426–35.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  13. Vassilev L, Johnson EM. An initiation zone of chromosomal DNA replication located upstream of the c-myc gene in proliferating HeLa cells. Mol Cell Biol. 1990;10(9):4899–904.

    CAS  PubMed  PubMed Central  Google Scholar 

  14. Keller C, Ladenburger EM, Kremer M, Knippers R. The origin recognition complex marks a replication origin in the human TOP1 gene promoter. J Biol Chem. 2002;277(35):31430–40.

    Article  CAS  PubMed  Google Scholar 

  15. Abdurashidova G, Deganuto M, Klima R, Riva S, Biamonti G, Giacca M, et al. Start sites of bidirectional DNA synthesis at the human lamin B2 origin. Science. 2000;287(5460):2023–6.

    Article  CAS  PubMed  Google Scholar 

  16. Theis JF, Newlon CS. The ARS309 chromosomal replicator of Saccharomyces cerevisiae depends on an exceptional ARS consensus sequence. Proc Natl Acad Sci U S A. 1997;94(20):10786–91.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  17. Berbenetz NM, Nislow C, Brown GW. Diversity of eukaryotic DNA replication origins revealed by genome-wide analysis of chromatin structure. PLoS Genet. 2010;6(9):e1001092.

    Article  PubMed  PubMed Central  Google Scholar 

  18. Eaton ML, Galani K, Kang S, Bell SP, MacAlpine DM. Conserved nucleosome positioning defines replication origins. Genes Dev. 2010;24(8):748–53.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  19. Breier AM, Chatterji S, Cozzarelli NR. Prediction of Saccharomyces cerevisiae replication origins. Genome Biol. 2004;5(4):R22.

    Article  PubMed  PubMed Central  Google Scholar 

  20. Nieduszynski CA, Knox Y, Donaldson AD. Genome-wide identification of replication origins in yeast by comparative genomics. Genes Dev. 2006;20(14):1874–9.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  21. Huvet M, Nicolay S, Touchon M, Audit B, d'Aubenton-Carafa Y, Arneodo A, et al. Human gene organization driven by the coordination of replication and transcription. Genome Res. 2007;17(9):1278–85.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  22. Wang L, Lin CM, Lopreiato JO, Aladjem MI. Cooperative sequence modules determine replication initiation sites at the human beta-globin locus. Hum Mol Genet. 2006;15(17):2613–22.

    Article  CAS  PubMed  Google Scholar 

  23. Akerman I, Kasaai B, Bazarova A, Sang PB, Peiffer I, Artufel M, et al. A predictable conserved DNA base composition signature defines human core DNA replication origins. Nat Commun. 2020;11(1):4826.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  24. Besnard E, Babled A, Lapasset L, Milhavet O, Parrinello H, Dantec C, et al. Unraveling cell type-specific and reprogrammable human replication origin signatures associated with G-quadruplex consensus motifs. Nat Struct Mol Biol. 2012;19(8):837–44.

    Article  CAS  PubMed  Google Scholar 

  25. Courbet S, Gay S, Arnoult N, Wronka G, Anglana M, Brison O, et al. Replication fork movement sets chromatin loop size and origin choice in mammalian cells. Nature. 2008;455(7212):557–60.

    Article  CAS  PubMed  Google Scholar 

  26. Wilson RHC, Coverley D. Relationship between DNA replication and the nuclear matrix. Genes Cells. 2013;18(1):17–31.

    Article  CAS  PubMed  Google Scholar 

  27. Shinbrot E, Henninger EE, Weinhold N, Covington KR, Goksenin AY, Schultz N, et al. Exonuclease mutations in DNA polymerase epsilon reveal replication strand specific mutation patterns and human origins of replication. Genome Res. 2014;24(11):1740–50.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  28. Lujan SA, Williams JS, Kunkel TA. DNA polymerases divide the labor of genome replication. Trends Cell Biol. 2016;26(9):640–54.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  29. Haradhvala NJ, Polak P, Stojanov P, Covington KR, Shinbrot E, Hess JM, et al. Mutational strand asymmetries in cancer genomes reveal mechanisms of DNA damage and repair. Cell. 2016;164(3):538–49.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  30. Alexandrov LB, Nik-Zainal S, Wedge DC, Aparicio SA, Behjati S, Biankin AV, et al. Signatures of mutational processes in human cancer. Nature. 2013;500(7463):415–21.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  31. Petryk N, Kahli M, d'Aubenton-Carafa Y, Jaszczyszyn Y, Shen Y, Silvain M, et al. Replication landscape of the human genome. Nat Commun. 2016;7:10208.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  32. Oliver JL, Carpena P, Hackenberg M, Bernaola-Galvan P. IsoFinder: computational prediction of isochores in genome sequences. Nucleic Acids Res. 2004;32(Web Server issue):W287–92.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  33. Koren A, Handsaker RE, Kamitaki N, Karlic R, Ghosh S, Polak P, et al. Genetic variation in human DNA replication timing. Cell. 2014;159(5):1015–26.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  34. Arezi B, Xing W, Sorge JA, Hogrefe HH. Amplification efficiency of thermostable DNA polymerases. Anal Biochem. 2003;321(2):226–35.

    Article  CAS  PubMed  Google Scholar 

  35. Picard F, Cadoret JC, Audit B, Arneodo A, Alberti A, Battail C, et al. The spatiotemporal program of DNA replication is associated with specific combinations of chromatin marks in human cells. PLoS Genet. 2014;10(5):e1004282.

    Article  PubMed  PubMed Central  Google Scholar 

  36. Mukhopadhyay R, Lajugie J, Fourel N, Selzer A, Schizas M, Bartholdy B, et al. Allele-specific genome-wide profiling in human primary erythroblasts reveal replication program organization. PLoS Genet. 2014;10(5):e1004319.

    Article  PubMed  PubMed Central  Google Scholar 

  37. Fu H, Besnard E, Desprat R, Ryan M, Kahli M, Lemaitre JM, et al. Mapping replication origin sequences in eukaryotic chromosomes. Curr Protoc Cell Biol. 2014;65:22 0 1–17.

    Article  Google Scholar 

  38. Martin MM, Ryan M, Kim R, Zakas AL, Fu H, Lin CM, et al. Genome-wide depletion of replication initiation events in highly transcribed regions. Genome Res. 2011;21(11):1822–32.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  39. Langley AR, Graf S, Smith JC, Krude T. Genome-wide identification and characterisation of human DNA replication origins by initiation site sequencing (ini-seq). Nucleic Acids Res. 2016;44(21):10230–47.

    CAS  PubMed  PubMed Central  Google Scholar 

  40. Mesner LD, Valsakumar V, Cieslik M, Pickin R, Hamlin JL, Bekiranov S. Bubble-seq analysis of the human genome reveals distinct chromatin-mediated mechanisms for regulating early- and late-firing origins. Genome Res. 2013;23(11):1774–88.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  41. Chen YH, Keegan S, Kahli M, Tonzi P, Fenyo D, Huang TT, et al. Transcription shapes DNA replication initiation and termination in human cells. Nat Struct Mol Biol. 2019;26(1):67–77.

    Article  CAS  PubMed  Google Scholar 

  42. Wu X, Kabalane H, Kahli M, Petryk N, Laperrousaz B, Jaszczyszyn Y, et al. Developmental and cancer-associated plasticity of DNA replication preferentially targets GC-poor, lowly expressed and late-replicating regions. Nucleic Acids Res. 2018;46(19):10532.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  43. Dellino GI, Cittaro D, Piccioni R, Luzi L, Banfi S, Segalla S, et al. Genome-wide mapping of human DNA-replication origins: levels of transcription at ORC1 sites regulate origin selection and replication timing. Genome Res. 2013;23(1):1–11.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  44. Miotto B, Ji Z, Struhl K. Selectivity of ORC binding sites and the relation to replication timing, fragile sites, and deletions in cancers. Proc Natl Acad Sci U S A. 2016;113(33):E4810–9.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  45. Sugimoto N, Maehara K, Yoshida K, Ohkawa Y, Fujita M. Genome-wide analysis of the spatiotemporal regulation of firing and dormant replication origins in human cells. Nucleic Acids Res. 2018;46(13):6683–96.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  46. Ryba T, Hiratani I, Sasaki T, Battaglia D, Kulik M, Zhang J, et al. Replication timing: a fingerprint for cell identity and pluripotency. PLoS Comput Biol. 2011;7(10):e1002225.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  47. Mesner LD, Valsakumar V, Karnani N, Dutta A, Hamlin JL, Bekiranov S. Bubble-chip analysis of human origin distributions demonstrates on a genomic scale significant clustering into zones and significant association with transcription. Genome Res. 2011;21(3):377–89.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  48. Karnani N, Taylor CM, Malhotra A, Dutta A. Genomic study of replication initiation in human chromosomes reveals the influence of transcription regulation and chromatin structure on origin selection. Mol Biol Cell. 2010;21(3):393–404.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  49. Valenzuela MS, Chen Y, Davis S, Yang F, Walker RL, Bilke S, et al. Preferential localization of human origins of DNA replication at the 5'-ends of expressed genes and at evolutionarily conserved DNA sequences. PLoS One. 2011;6(5):e17308.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  50. Cadoret JC, Meisch F, Hassan-Zadeh V, Luyten I, Guillet C, Duret L, et al. Genome-wide studies highlight indirect links between human replication origins and gene regulation. Proc Natl Acad Sci U S A. 2008;105(41):15837–42.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  51. Standage DS, Brendel VP. ParsEval: parallel comparison and analysis of gene structure annotations. BMC Bioinformatics. 2012;13:187.

    Article  PubMed  PubMed Central  Google Scholar 

  52. Zytnicki M, Luo Y, Quesneville H. Efficient comparison of sets of intervals with NC-lists. Bioinformatics. 2013;29(7):933–9.

    Article  CAS  PubMed  Google Scholar 

  53. Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994;22(22):4673–80.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  54. Zehnbauer B, Vogelstein B. Supercoiled loops and the organization of replication and transcription in eukaryotes. BioEssays. 1985;2(2):52–4.

    Article  Google Scholar 

  55. Kojima KK. Human transposable elements in Repbase: genomic footprints from fish to humans. Mob DNA. 2018;9:2.

    Article  PubMed  PubMed Central  Google Scholar 

  56. Saegusa Y, Sato M, Galli I, Nakagawa T, Ono N, Iguchi-Ariga SM, et al. Stimulation of SV40 DNA replication and transcription by Alu family sequence. Biochim Biophys Acta. 1993;1172(3):274–82.

    Article  CAS  PubMed  Google Scholar 

  57. Cayrou C, Coulombe P, Vigneron A, Stanojcic S, Ganier O, Peiffer I, et al. Genome-scale analysis of metazoan replication origins reveals their organization in specific but flexible sites defined by conserved features. Genome Res. 2011;21(9):1438–49.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  58. Cayrou C, Coulombe P, Puy A, Rialle S, Kaplan N, Segal E, et al. New insights into replication origin characteristics in metazoans. Cell Cycle. 2012;11(4):658–67.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  59. Marsico G, Chambers VS, Sahakyan AB, McCauley P, Boutell JM, Antonio MD, et al. Whole genome experimental maps of DNA G-quadruplexes in multiple species. Nucleic Acids Res. 2019;47(8):3862–74.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  60. Long H, Zhang L, Lv M, Wen Z, Zhang W, Chen X, et al. H2A.Z facilitates licensing and activation of early replication origins. Nature. 2020;577(7791):576–81.

    Article  CAS  PubMed  Google Scholar 

  61. Rao SS, Huntley MH, Durand NC, Stamenova EK, Bochkov ID, Robinson JT, et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell. 2014;159(7):1665–80.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  62. Narwade N, Patel S, Alam A, Chattopadhyay S, Mittal S, Kulkarni A. Mapping of scaffold/matrix attachment regions in human genome: a data mining exercise. Nucleic Acids Res. 2019;47(14):7247–61.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  63. Guelen L, Pagie L, Brasset E, Meuleman W, Faza MB, Talhout W, et al. Domain organization of human chromosomes revealed by mapping of nuclear lamina interactions. Nature. 2008;453(7197):948–51.

    Article  CAS  PubMed  Google Scholar 

  64. Dixon JR, Selvaraj S, Yue F, Kim A, Li Y, Shen Y, et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature. 2012;485(7398):376–80.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  65. Shlien A, Campbell BB, de Borja R, Alexandrov LB, Merico D, Wedge D, et al. Combined hereditary and somatic mutations of replication error repair genes result in rapid onset of ultra-hypermutated cancers. Nat Genet. 2015;47(3):257–62.

    Article  CAS  PubMed  Google Scholar 

  66. Rein T, Kobayashi T, Malott M, Leffak M, DePamphilis ML. DNA methylation at mammalian replication origins. J Biol Chem. 1999;274(36):25792–800.

    Article  CAS  PubMed  Google Scholar 

  67. Campbell BB, Light N, Fabrizio D, Zatzman M, Fuligni F, de Borja R, et al. Comprehensive analysis of hypermutation in human cancer. Cell. 2017;171(5):1042–56 e10.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  68. Lawrence M, Gentleman R, Carey V. rtracklayer: an R package for interfacing with genome browsers. Bioinformatics. 2009;25(14):1841–2.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  69. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25(16):2078–9.

    Article  PubMed  PubMed Central  Google Scholar 

  70. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26(6):841–2.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  71. Gel B, Diez-Villanueva A, Serra E, Buschbeck M, Peinado MA, Malinverni R. regioneR: an R/Bioconductor package for the association analysis of genomic regions based on permutation tests. Bioinformatics. 2016;32(2):289–91.

    Article  CAS  PubMed  Google Scholar 

  72. Resnick SI. Adventures in stochastic processes: Springer Science & Business Media; 1992.

    Google Scholar 

  73. Jaksik R, Rzeszowska-Wolny J. The distribution of GC nucleotides and regulatory sequence motifs in genes and their adjacent sequences. Gene. 2012;492(2):375–81.

    Article  CAS  PubMed  Google Scholar 

  74. Charif D, Lobry JR. SeqinR 1.0-2: a contributed package to the R project for statistical computing devoted to biological sequences retrieval and analysis. In: Bastolla U, Porto M, Roman HE, Vendruscolo M, editors. Structural approaches to sequence evolution: molecules, networks, populations. New York: Springer Verlag; 2007.

    Google Scholar 

  75. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005;15(8):1034–50.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  76. Morgan M, Shepherd L. AnnotationHub: client to access AnnotationHub resources. R package version 2220; 2020.

    Google Scholar 

  77. Ho L, Crabtree GR. Chromatin remodelling during development. Nature. 2010;463(7280):474–84.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  78. rjaksik/replicationOrigins: functions used for replication origin detection and evaluation. 2023. Available from: https://0-doi-org.brum.beds.ac.uk/10.5281/zenodo.7513927.

Download references

Acknowledgements

The results shown here are in part based upon data generated by the TCGA Research Network: https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga. Calculations were carried out using the infrastructure of the Ziemowit computer cluster (www.ziemowit.hpc.polsl.pl) in the Laboratory of Bioinformatics and Computational Biology, The Biotechnology, Bioengineering and Bioinformatics Centre Silesian BIO-FARMA, created in the POIG.02.01.00-00-166/08 and expanded in the POIG.02.03.01-00-040/13 projects.

Funding

This work was supported by the Polish National Science Centre grant no. 2018/29/B/ST7/02550.

Author information

Authors and Affiliations

Authors

Contributions

RJ implemented the data processing methods, conducted the analysis, and drafted the paper. DW contributed to the study design, interpretation of the results, and writing of the manuscript. MK supervised the project and contributed to the study design, aided the data analysis, and contributed to the writing of the manuscript. The authors read and approved the final manuscript.

Corresponding author

Correspondence to Roman Jaksik.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Table S1.

TCGA samples used in the study, including POLE-exo variant types and mutation statistics.

Additional file 2: Table S2.

Genomic coordinates of replication origins identified using mutation patterns – mORI.

Additional file 3: Table SS.

Comparison of replication origin positions based on MNVA algorithm.

Additional file 4: Fig. S1.

Examples of mORI detection for locations described in the Shinbrot et al. 2014 article. Fig. S2. Distribution of distances between mORI positions compared to random genomic locations. Fig. S3. Distribution of mORI across chromosomes from the hg19 reference genome and number of mORI, with and without chromosome length standardization. Fig. S4. Overlap between various ORI detection methods. Fig. S5. Number of G-quadruplexes at specific distance from the replication origins, expressed as the fold change with respect to the minimum value from each row. Individual rows were obtained using various replication origin detection methods and results of the method concordance obtained using the MNVA algorithm. Fig. S6. Number of G- quadruplexes at specific distance from the replication origins, expressed as the fold change with respect to the minimum value from each row. The first row was obtained for all 65 329 positions from the core SNS-seq ORI set (Akerman et al. 2020), the following rows were obtained by randomly selecting a subset of 5 000 positions, repeated 10 times. Fig. S7. Number of topologically associating domain (TADs) regions (middle or border) at a given distance from the mORI replication origins, expressed as the fold change with respect to the minimum value from each row. Fig. S8.: Histogram of the PMA score obtained for peaks identified by the mORI detection algorithm for all samples combined.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Jaksik, R., Wheeler, D.A. & Kimmel, M. Detection and characterization of constitutive replication origins defined by DNA polymerase epsilon. BMC Biol 21, 41 (2023). https://0-doi-org.brum.beds.ac.uk/10.1186/s12915-023-01527-z

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://0-doi-org.brum.beds.ac.uk/10.1186/s12915-023-01527-z

Keywords