The structure, function and evolution of a complete human chromosome 8


Data reporting

No statistical methods were used to predetermine sample size. The experiments were not randomized, and investigators were not blinded to allocation during experiments and outcome assessment.

Cell line sources

CHM13hTERT (CHM13) cells were originally isolated from a hydatidiform mole at Magee-Womens Hospital as part of a research study (IRB MWH-20-054). Cryogenically frozen cells from this culture were grown and transformed with the human telomerase reverse transcriptase (TERT) gene to immortalize the cell line. This cell line has been authenticated by STR analysis, tested negative for mycoplasma contamination, and karyotyped to show a 46,XX karyotype13. Human HG00733 lymphoblastoid cells were originally obtained from a female Puerto Rican child, immortalized with the Epstein-Barr virus (EBV), and stored at the Coriell Institute for Medical Research. Chimpanzee (Pan troglodytes; Clint; S006007) fibroblast cells were originally obtained from a male western chimpanzee named Clint (now deceased) at the Yerkes National Primate Research Center and immortalized with EBV. Orangutan (Pongo abelii; Susie; PR01109) fibroblast cells were originally obtained from a female Sumatran orangutan named Susie (now deceased) at the Gladys Porter Zoo, immortalized with EBV, and stored at the Coriell Institute for Medical Research. Macaque (Macaca mulatta; AG07107) fibroblast cells were originally obtained from a female rhesus macaque of Indian origin and stored at the Coriell Institute for Medical Research. The HG00733, chimpanzee, orangutan and macaque cell lines have not yet been authenticated or assessed for mycoplasma contamination, to our knowledge.

Cell culture

CHM13 cells were cultured in complete AmnioMax C-100 Basal Medium (Thermo Fisher Scientific, 17001082) supplemented with 15% AmnioMax C-100 Supplement (Thermo Fisher Scientific, 12556015) and 1% penicillin-streptomycin (Thermo Fisher Scientific, 15140122). HG00733 cells were cultured in RPMI 1640 with l-glutamine (Thermo Fisher Scientific, 11875093) supplemented with 15% FBS (Thermo Fisher Scientific, 16000-044) and 1% penicillin-streptomycin (Thermo Fisher Scientific, 15140122). Chimpanzee (P. troglodytes; S006007) and macaque (M. mulatta; AG07107) cells were cultured in MEMα containing ribonucleosides, deoxyribonucleosides and l-glutamine (Thermo Fisher Scientific, 12571063) supplemented with 12% FBS (Thermo Fisher Scientific, 16000-044) and 1% penicillin-streptomycin (Thermo Fisher Scientific, 15140122). Orangutan (P. abelii; PR01109) cells were cultured in MEMα containing ribonucleosides, deoxyribonucleosides and l-glutamine (Thermo Fisher Scientific, 12571063) supplemented with 15% FBS (Thermo Fisher Scientific, 16000-044) and 1% penicillin-streptomycin (Thermo Fisher Scientific, 15140122). All cells were cultured in a humidity-controlled environment at 37 °C with 5% CO2.

DNA extraction, library preparation and sequencing

PacBio HiFi data were generated from the HG00733, chimpanzee, orangutan and macaque genomes as previously described36 with modifications. In brief, high-molecular-weight (HMW) DNA was extracted from cells using a modified Qiagen Gentra Puregene Cell Kit protocol37. HMW DNA was used to generate HiFi libraries via the SMRTbell Express Template Prep Kit v2 and SMRTbell Enzyme Clean Up kits (PacBio). Size selection was performed with SageELF (Sage Science), and fractions sized 11, 14, 18, 22, or 25 kb (as determined by FEMTO Pulse (Agilent)) were chosen for sequencing. Libraries were sequenced on the Sequel II platform (Instrument Control SW v7.1 or v8.0) with three to seven SMRT Cells 8M (PacBio) using either Sequel II Sequencing Chemistry 1.0 and 12-h pre-extension or Sequel II Sequencing Chemistry 2.0 and 3- or 4-h pre-extension, both with 30-h movies, aiming for a minimum estimated coverage of 25× in HiFi reads (assuming a genome size of 3.2 Gb). Raw data were processed using the CCS algorithm (v.3.4.1 or v.4.0.0) with the following parameters: –minPasses 3 –minPredictedAccuracy 0.99 –maxLength 21000 or 50000.

Ultra-long ONT data were generated from the CHM13, HG00733, chimpanzee and orangutan genomes according to a previously published protocol41. In brief, 5 × 107 cells were lysed in a buffer containing 10 mM Tris-Cl (pH 8.0), 0.1 M EDTA (pH 8.0), 0.5% (w/v) SDS, and 20 μg ml−1 RNase A for 1 h at 37 °C. Proteinase K (200 μg ml−1) was added, and the solution was incubated at 50 °C for 2 h. DNA was purified via two rounds of 25:24:1 phenol-chloroform-isoamyl alcohol extraction followed by ethanol precipitation. Precipitated DNA was solubilized in 10 mM Tris (pH 8) containing 0.02% Triton X-100 at 4 °C for two days. Libraries were constructed using the Rapid Sequencing Kit (SQK-RAD004) from ONT with modifications to the manufacturer’s protocol. Specifically, 2–3 μg of DNA was resuspended in a total volume of 18 μl with 16.6% FRA buffer. FRA enzyme was diluted 2- to 12-fold into FRA buffer, and 1.5 μl of diluted FRA was added to the DNA solution. The DNA solution was incubated at 30 °C for 1.5 min, followed by 8 °C for 1 min to inactivate the enzyme. RAP enzyme was diluted 2- to 12-fold into RAP buffer, and 0.5 μl of diluted RAP was added to the DNA solution. The DNA solution was incubated at room temperature for 2 h before loading onto a primed FLO-MIN106 R9.4.1 flow cell for sequencing on a GridION using MinKNOW (v.2.0 – v1.9.12).

Additional ONT data were generated from the CHM13, HG00733, chimpanzee, orangutan, and macaque genomes. In brief, HMW DNA was extracted from cells using a modified Qiagen Gentra Puregene Cell Kit protocol37. HMW DNA was prepared into libraries with the Ligation Sequencing kit (SQK-LSK109) from ONT and loaded onto primed FLO-MIN106 or FLO-PRO002 R9.4.1 flow cells for sequencing on a GridION or PromethION, respectively, using MinKNOW (v.2.0–v.19.12). All ONT data were base called with Guppy 3.6.0 or 4.0.11 with the HAC model.

PacBio HiFi whole-genome assembly

The CHM13 genome was assembled from PacBio HiFi data using HiCanu5 as previously described5. The HG00733 genome was assembled from PacBio HiFi data (Supplementary Table 3) using hifiasm6 (v.0.7). The chimpanzee, orangutan and macaque genomes were assembled from PacBio HiFi data (Supplementary Table 5) using HiCanu5 (v.2.0). Contigs from each assembly were used to replace the ONT-based sequence scaffolds in targeted regions (described below).

Targeted sequence assembly

Gapped regions within human chromosome 8 were targeted for assembly via a SUNK-based method that combines both PacBio HiFi and ONT data. Specifically, CHM13 PacBio HiFi data were used to generate a library of SUNKs (k = 20; total = 2,062,629,432) via Jellyfish (v.2.2.4) on the basis of the sequencing coverage of the HiFi dataset. In total, 99.88% (2,060,229,331) of the CHM13 PacBio HiFi SUNKs were validated with CHM13 Illumina data (SRR3189741). A subset of CHM13 ultra-long ONT reads aligning to the CHM1 β-defensin patch (GenBank: KZ208915.1) or select regions within the GRCh38 chromosome 8 reference sequence (chr8:42,881,543–47,029,467 for the centromere and chr8:85,562,829–85,848,463 for the 8q21.2 locus) were barcoded with Illumina-validated SUNKs. Reads sharing at least 50 SUNKs were selected for inspection to determine whether their SUNK barcodes overlapped. SUNK barcodes can be composed of ‘valid’ and ‘invalid’ SUNKs. Valid SUNKs are those that occur once in the genome and are located at the exact position on the read. By contrast, invalid SUNKs are those that occur once in the genome but are falsely located at the position on the read, and this may be due to a sequencing or base-calling error, for example. Valid SUNKs were identified within the barcode as those that share pairwise distances with at least ten other SUNKs on the same read. Reads that shared a SUNK barcode containing at least three valid SUNKs and their corresponding pairwise distances (±1% of the read length) were assembled into a tile. The process was repeated using the tile and subsetted ultra-long ONT reads several times until a sequence scaffold spanning the gapped region was generated. Validation of the scaffold organization was carried out via three independent methods. First, the sequence scaffold and underlying ONT reads were subjected to RepeatMasker (v.3.3.0) to ensure that read overlaps were concordant in repeat structure. Second, the centromeric scaffold and underlying ONT reads were subjected to StringDecomposer42 to validate the HOR organization in overlapping reads. Finally, the sequence scaffold for each target region was incorporated into the CHM13 chromosome 8 assembly previously generated5, thereby filling the gaps in the chromosome 8 assembly. CHM13 PacBio HiFi and ONT data were aligned to the entire chromosome 8 assembly via pbmm2 (v.1.1.0) (for PacBio data; https://github.com/PacificBiosciences/pbmm2) or Winnowmap43 (v.1.0) (for ONT data) to identify large collapses or misassemblies. Although the ONT-based scaffolds are structurally accurate, they are only 87–98% accurate at the base level owing to base-calling errors in the raw ONT reads7. Therefore, we sought to improve the base accuracy of the sequence scaffolds by replacing the ONT sequences with PacBio HiFi contigs assembled from the CHM13 genome5, which have a consensus accuracy greater than 99.99%5. Therefore, we aligned CHM13 PacBio HiFi contigs generated via HiCanu5 to the chromosome 8 assembly via minimap244 (v2.17-r941; parameters: minimap2 -t 8 -I 8G -a –eqx -x asm20 -s 5000) to identify contigs that share high sequence identity with the ONT-based sequence scaffolds. A typical scaffold had multiple PacBio HiFi contigs that aligned to regions within it. Therefore, the scaffold was used to order and orient the PacBio HiFi contigs and bridge gaps between them when necessary. PacBio HiFi contigs with high sequence identity replaced almost all regions of the ONT-based scaffolds: ultimately, the chromosome 8 assembly consists of 146,254,195 bp of PacBio HiFi contigs and only 5,490 bp of ONT sequence scaffolds (99.9963% PacBio HiFi contigs and 0.0037% ONT scaffold). The chromosome 8 assembly was incorporated into a whole-genome assembly of CHM13 previously generated5 for validation via orthogonal methods (detailed below). The HG00733, chimpanzee, orangutan and macaque chromosome 8 centromeres were assembled via the same SUNK-based method.

Accuracy estimation

The accuracy of the CHM13 chromosome 8 assembly was estimated from mapped k-mers using Merqury17. In brief, Merqury (v.1.1) was run on the chromosome 8 assembly with the following command: eval/qv.sh CHM13.k21.meryl chr8.fasta chr8_v9.

CHM13 Illumina data (SRR1997411, SRR3189741, SRR3189742 and SRR3189743) were used to identify k-mers with k = 21. In Merqury, every k-mer in the assembly is evaluated for its presence in the Illumina k-mer database, with any k-mer missing in the Illumina set counted as base-level ‘error’. We detected 1,474 k-mers found only in the assembly out of 146,259,650, resulting in a quality value score of 63.19, estimated as follows: −10 × log(1 − (1 − 1,474/146,259,650)(1/21)) = 63.19.

The accuracy percentage for chromosome 8 was estimated from this quality value score as: 100 − (10(63.19/−10)) × 100 = 99.999952.

The accuracy of the CHM13 chromosome 8 assembly and β-defensin locus were also estimated from sequenced BACs. In brief, 66 BACs from the CHM13 chromosome 8 (BAC library VMRC59) were aligned to the chromosome 8 assembly via minimap244 (v2.17-r941) with the following parameters: -I 8G -2K 1500m –secondary = no -a –eqx -Y -x asm20 -s 200000 -z 10000,1000 -r 50000 -O 5,56 -E 4,1 -B 5. The quality value was then estimated using the CIGAR string in the resulting BAM, counting alignment differences as errors according to the following formula:

$$begin{array}{c},{rm{Quality}},{rm{value}}=-10times {log }_{10}[1-({rm{matches}}/,\ ({rm{mismatches}}+{rm{matches}}+{rm{insertions}}+{rm{deletions}}))]end{array}$$

The median quality value was 40.6988 for the entire chromosome 8 assembly and 40.4769 for the β-defensin locus (chr8:6300000–13300000; estimated from 47 individual BACs) (see Extended Data Fig. 5 for more details), which falls within the 95% confidence interval for the whole chromosome. This quality value score was used to estimate the base accuracy36 as follows:

$$100-({10}^{(40.6988/-10)})times 100=99.9915$$

$$100-({10}^{(40.4769/-10)})times 100=99.9910$$

The BAC quality value estimation should be considered a lower bound, because differences between the BACs and the assembly may originate from errors in the BAC sequences themselves. BACs were previously shown to occasionally contain sequencing errors that are not supported by the underlying PacBio HiFi reads36. In addition, the upper bound for the estimated BAC quality value is limited to approximately 53, because BACs are typically 200 kb and, as a result, the maximum calculable quality value is 1 error in 200 kb (quality value 53). We also note that the quality value of the centromeric region could not be estimated from BACs owing to biases in BAC library preparation, which preclude centromeric sequences in BAC clones.

The accuracy of the HG00733, chimpanzee, orangutan and macaque chromosome 8 centromere assemblies was estimated with Merqury17. In brief, Merqury (v.1.1) was run on the centromere assemblies as described above for the CHM13 chromosome 8 assembly. Ultimately, we detected 248 k-mers found only in the HG00733 maternal assembly out of 3,877,376 bp (estimated quality value score of 55.16; base accuracy of 99.9997%); 10,562 k-mers found only in the HG00733 paternal assembly out of 3,597,645 bp (estimated quality value score of 38.54; base accuracy of 99.986%); 0 k-mers found only in the chimpanzee H1 assembly out of 2,803,083 bp (estimated quality value score of infinity; base accuracy of 100%); 20 k-mers found only in the chimpanzee H2 assembly out of 3,603,864 bp (estimated quality value score of 65.7796; base accuracy of 99.9999%); 1,302 k-mers found only in the orangutan assembly out of 5,372,621 bp (estimated quality value score of 49.3774; accuracy of 99.9988%); and 104 k-mers found only in the macaque assembly out of 14,999,980 bp (estimated quality value score of 64.8128; accuracy of 99.9999%). We note that Merqury detects the presence of erroneous k-mers in the assembly that have no support within the raw reads, but it cannot detect the absence of true k-mers (variants) within the assembled repeat copies. Thus, within these highly repetitive arrays, Merqury is useful for comparative analyses but may overestimate the overall accuracy of the consensus.

Strand-seq analysis

We evaluated the directional and structural contiguity of CHM13 chromosome 8 assembly, including the centromere, using Strand-seq data. First, all Strand-seq libraries produced from the CHM13 genome36 were aligned to the CHM13 assembly, including chromosome 8 using BWA-MEM45 (v.0.7.17-r1188) with default parameters for paired-end mapping. Next, duplicate reads were marked by sambamba46 (v.0.6.8) and removed before subsequent analyses. We used SAMtools47 (v.1.9) to sort and index the final BAM file for each Strand-seq library. To detect putative misassembly breakpoints in the chromosome 8 assembly, we ran breakpointR48 on all BAM files to detect strand-state breakpoints. Misassemblies are visible as recurrent changes in strand state across multiple Strand-seq libraries39. To increase our sensitivity of misassembly detection, we created a ‘composite file’ that groups directional reads across all available Strand-seq libraries49,50. Next, we ran breakpointR on the ‘composite reads file’ using the function ‘runBreakpointr’ to detect regions that are homozygous (‘ww’; ‘HOM’ – all reads mapped in minus orientation) or heterozygous inverted (‘wc’, ‘HET’ – approximately equal number of reads mapped in minus and plus orientation). To further detect any putative chimaerism in the chromosome 8 assembly, we applied Strand-seq to assign 200-kb long chunks of the chromosome 8 assembly to unique groups corresponding to individual chromosomal homologues using SaaRclust39,51. For this, we used the SaaRclust function ‘scaffoldDenovoAssembly’ on all BAM files.

Bionano analysis

Bionano Genomics data were generated from the CHM13 genome13. Long DNA molecules labelled with Bionano’s Direct Labelling Enzyme were collected on a Bionano Saphyr Instrument to a coverage of 130×. The molecules were assembled with the Bionano assembly pipeline Solve (v.3.4), using the nonhaplotype-aware parameters and GRCh38 as the reference. The resulting data produced 261 genome maps with a total length of 2.921.6 Mb and a genome map N50 of 69.02 Mb.

The molecule set and the nonhaplotype-aware map were aligned to the CHM13 draft assembly and the GRCh38 assembly, and discrepancies were identified between the Bionano maps and the sequence references using scripts in the Bionano Solve software package—runCharacterize.py, runSV.py, and align_bnx_to_cmap.py.

A second version of the map was assembled using the haplotype-aware parameters. This map was also aligned to GRCh38 and the final CHM13 assembly to verify heterozygous locations. These regions were then examined further.

Analysis of Bionano alignments revealed three heterozygous sites within chromosome 8 located at approximately chr8:21,025,201, chr8:80,044,843 and chr8:121,388,618 (Supplementary Table 7). The structure with the greatest ONT read support was selected for inclusion in the chromosome 8 assembly (Supplementary Table 7).

TandemMapper and TandemQUAST analysis of the centromeric HOR array

We assessed the structure of the CHM13 and NHP centromeric HOR arrays by applying TandemMapper and TandemQUAST52 (https://github.com/ablab/TandemTools; version from 20 March 2020), which can detect large structural assembly errors in repeat arrays. For the CHM13 centromere, we first aligned ONT reads longer than 50 kb to the CHM13 assembly containing the contiguous chromosome 8 with Winnowmap43 (v.1.0) and extracted reads aligning to the centromeric HOR array (chr8:44243868–46323885). We then inputted these reads in the following TandemQUAST command: tandemquast.py -t 24 –nano {ont_reads.fa} -o {out_dir} chr8.fa. For the NHP centromeres, we aligned ONT reads to the whole-genome assemblies containing the contiguous chromosome 8 centromeres with Winnowmap43 (v.1.0) and extracted reads aligning to the centromeric HOR arrays. We then inputted these reads in the following TandemQUAST command: tandemquast.py-t 24 –nano {ont_reads.fa} -o {out_dir} chr8.fa.

Methylation analysis

Nanopolish18 (v.0.12.5) was used to measure the frequency of CpG methylation from raw ONT reads (>50 kb in length for CHM13) aligned to whole-genome assemblies via Winnowmap43 (v.1.0). Nanopolish distinguishes 5-methylcytosine from unmethylated cytosine via a Hidden Markov,model (HMM) on the raw nanopore current signal. The methylation caller generates a log-likelihood value for the ratio of probability of methylated to unmethylated CpGs at a specific k-mer. We filtered methylation calls using the nanopore_methylation_utilities tool (https://github.com/isaclee/nanopore-methylation-utilities)53, which uses a log-likelihood ratio of 2.5 as a threshold for calling methylation. CpG sites with log-likelihood ratios greater than 2.5 (methylated) or less than −2.5 (unmethylated) are considered high quality and included in the analysis. Reads that do not have any high-quality CpG sites are filtered from the BAM for subsequent methylation analysis. Nanopore_methylation_utilities integrates methylation information into the BAM file for viewing in IGV54 bisulfite mode, which was used to visualize CpG methylation.

Iso-Seq data generation and sequence analyses

RNA was purified from approximately 1 × 107 CHM13 cells using an RNeasy kit (Qiagen; 74104) and prepared into Iso-Seq libraries following a standard protocol55. Libraries were loaded on two SMRT Cells 8M and sequenced on the Sequel II. The data were processed via isoseq3 (v.8.0), ultimately generating 3,576,198 full-length non-chimeric reads. Poly-A trimmed transcripts were aligned to this CHM13 chr8 assembly and to GRCh38 with minimap244 (v.2.17-r941) with the following parameters: -ax splice -f 1000 –sam-hit-only –secondary = no –eqx. Transcripts were assigned to genes using featureCounts56 with GENCODE57 (v.34) annotations, supplemented with CHESS v.2.258 for any transcripts unannotated in GENCODE. Each transcript was scored for the percentage identity of its alignment to each assembly, requiring 90% of the length of each transcript to align to the assembly for it to count as aligned. For each gene, the percentage identity of non-CHM13 transcripts to GRCh38 was compared to the CHM13 chromosome 8 assembly. Genes with an improved representation in the CHM13 assembly were identified with a cut-off value of 20 improved reads per gene, with at least 0.2% average improvement in percentage identity. GENCODE (v.34) transcripts were lifted over to the CHM13 chr8 assembly using Liftoff59 to compare the GRCh38 annotations to this assembly and Iso-Seq transcripts.

We combined the 3.6 million full-length transcript data (above) with 20,937,742 full-length non-chimeric publicly available human Iso-Seq data (Supplementary Table 8). In total, we compared the alignment of 24,513,940 full-length non-chimeric reads from 13 tissue and cell line sources to both the completed CHM13 chromosome 8 assemblies and the current human reference genome, GRCh38. Of the 848,048 non-CHM13 cell line transcripts that align to chromosome 8, 93,495 (11.02%) align with at least 0.1% greater percentage identity to the CHM13 assembly, and 52,821 (6.23%) to GRCh38. This metric suggests that the chromosome 8 reference improves human gene annotation by approximately 4.79% even though most of those changes are subtle in nature. Overall, 61 protein-coding and 33 noncoding loci have improved alignments to the CHM13 assembly compared to GRCh38, with >0.2% average percentage identity improvement, and at least 20 supporting transcripts (Extended Data Fig. 3a–c, Supplementary Table 1). As an example, WDYHV1 (also known as NTAQ1) has four amino acid replacements, with 13 transcripts sharing the identical open reading frame to CHM13 (Extended Data Fig. 3d).

Pairwise sequence identity heat maps

To generate pairwise sequence identity heat maps, we fragmented the centromere assemblies into 5-kb fragments (for example, 1–5,000, 5,001–10,000, and so on) and made all possible pairwise alignments between the fragments using the following minimap244 (v.2.17-r941) command: minimap2 -f 0.0001 -t 32 -X –eqx -ax ava-ont. The sequence identity was determined from the CIGAR string of the alignments and then visualized using ggplot2 (geom_raster) in R (v.1.1.383)60. The colour of each segment was determined by sorting the data by identity and then creating 10 equally sized bins, each of which received a distinct colour from the spectral pallet. The choice of a 5-kb window came after testing a variety of window sizes. Ultimately, we found 5 kb to be a good balance between resolution of the figure (because each 5 kb fragment is plotted as a pixel) and sensitivity of minimap2 (fragments less than 5 kb often missed alignments with the ava-ont preset). A schematic illustrating this process is shown in Supplementary Fig. 3.

Miropeats analysis

To compare the organization and orientation of the CHM13 and GRCh38 β-defensin loci, we aligned the two β-defensin regions (CHM13 chr8:6300000–13300000; GRCh38 chr8:6545299–13033398) to each other using the following minimap244 parameters: minimap2 -x asm20 -s 200000 -p 0.01 -N 1000 –cs {GRCh38_defensin.fasta} {CHM13_defensin.fasta}. Then, we applied a version of Miropeats61 that is modified to use minimap244 alignments (https://github.com/mrvollger/minimiro) to produce the figure showing homology between the two sequences.

Analysis of α-satellite organization

To determine the organization of the CHM13 chromosome 8 centromeric region, we used two independent approaches. First, we subjected the CHM13 centromere assembly to an in silico restriction enzyme digestion in which a set of restriction enzyme recognition sites were identified within the assembly. In agreement with previous findings that XbaI digestion can generate a pattern of HORs within the chromosome 8 HOR array9, we found that each α-satellite HOR could be extracted via XbaI digestion. The in silico digestion analysis indicates that the chromosome 8 centromeric HOR array consists of 1,462 HOR units: 283 4-monomer HORs, 4 5-monomer HORs, 13 6-monomer HORs, 356 7- monomer HORs, 295 8-monomer HORs, and 511 11-monomer HORs. As an alternative approach, we subjected the centromere assembly to StringDecomposer42 (https://github.com/ablab/stringdecomposer; version from 28 February 2020) using a set of 11 α-satellite monomers derived from a chromosome 8 11-mer HOR unit. The sequence of the α-satellite monomers used are as follows: A: AGCATTCTCAGAAACACCTTCGTGATGTTTGCAATCAAGTCACAGAGTTGAACCTTCCGTTTCATAGAGCAGGTTGGAAACA CTCTTATTGTAGTATCTGGAAGTGGACATTTGGAGCGCTTTCAGGCCTATGGTGAAAAAGGAAATATCTTCCCATAAAAACGACATAGA; B: AGCTATCTCAGGAACTTGTTTATGATGCATCTAATCAACTAACAGTGTTGAACCTTTGTACTGACAG AGCACTTTGAAACACTCTTTTTTGGAATCTGCAAGTGGATATTTGGATCGCTTTGAGGATTTCGTTGGAAACGGGATGCAATATAAAACGTACACAGC; C: AGCATACTCAGAAAATACTTTGCCATATTTCCATTCAAGTCACAGAGTGGAACATTCCCATTCATAGAGCAGGTTGGAAACACTCTTTTTGGAGTATCTGGAAGTGGACATTTGGAGCGCTTTCTGAACTATGGTGAAAAAGGAAATATCTTCCAATGAAAACAAGACAGA; D: AGCATTCTGAGAAACTTATTTGTGATGTGTGTCCTCAACAAACGGACTTGAACCTTTCGTTTCATGCAGTACTTCTGGAACACTCTTTTT GAAGATTCTGCATGCGGATATTTGGATAGCTTTGAGGATTTCGTTGGAAACGGGCTTACATGTAAAAATTAGACAGC; E: AGCATTCTCAGAAACTTCTTTGTGGTG TCTGCATTCAAGTCACAGAATTGAACTTCTCCTCACATAGAGCAGTTGTGCAGCACTCTATTTGTAGTATCTGGAAGTGGACATTTGGAGGGCTTTGTAGCCTATCTGGAAAAAGGAAATATCTTCCCATGAATGCGAGATAGA; F: AGTAATCTCAGAAACATGTTTATGCTGTATCTACTCAACTAACTGTGCTGAACATTTCTATTGATAGAGCAGTTTTGAGACCCTCTTCTTTTGGAATCTGCAAGTGGATATTTGGATAGATTTGAGGATTTCGTTGGAAACGGGATTATATATAAAAAGTAGACAGC; G: AGCATTCTCAGAAACTTCTTTGTGATGTTTGCATCCAGCTCTCAGAGTTGAACATTCCCTTTCATAGAGTAGGTTTGAAACCCTCTTTTTATAGTGTCTGGAAGCGGGCATTTGGAGCGCTTTCAGGCCTATGCTGAAAAAGGAAATATCTACATATAGAAACTAGACAGA; H: AGCATTCTGAGAATCAAGTTTGTGATGTGGGTACTCAACTAACAGTGTTGATCCATTCTTTTGATACAGCAGTTTTGAACCACACTTTTTGTAGAATCTGCAAGTGGATATTTGGATAGCTGTGAGGATTTCGTTGGAAACGGGAATGTCTTCATAGAAAATTTAGACAGA; I: AGCATTCTCAGAACCTTGATTGTGATGTGTGTTCTCCACTAACAGAGTTGAACCTTTCTTTTGACAGAACTGTTCTGAAACATTCTTTTTATAGAATCTGGAAGTGGATATTTGGAAAGCTTTGAGGATTTCGTTGGAAACGGGAATATCTTCAAATAAAATCTAGCCAGA; J: AGCATTCTAAGAAACATCTTAGGGATGTTTACATTCAAGTCACAGAGTTGAACATTCC CTTTCACAGAGCAGGTTTGAAACAATCTTCTCGTACTATCTGGCAGTGGACATTTTGAGCTCTTTGGGGCCTATGCTGAAAAAGGAAATATCTTCCGACAAAAACTAGTCAGA; K: AGCATTCGCAGAATCCCGTTTGTGATGTGTGCACTCAACTGTCAGAATTGAACCTTGGTTTGGAGAGAGCACTTTTGAAACACACT TTTTGTAGAATCTGCAGGTGGATATTTGGCT AGCTTTGAGGATTTCGTTGGAAACGGTAATGTCTTCAAAGAAAATCTAGACAGA.

This analysis indicated that the CHM13 chromosome 8 centromeric HOR array consists of 1,515 HOR units: 286 4-monomer HORs, 12 6-monomer HORs, 366 7-monomer HORs, 303 8-monomer HORs, 3 10-monomer HORs, 539 11-monomer HORs, 2 12-monomer HORs, 2 13-monomer HORs, 1 17-monomer HOR, and 1 18-monomer HOR, which is concordant with the in silico restriction enzyme digestion results. The predominant HOR types from StringDecomposer42 are presented in Extended Data Fig. 8.

Copy number estimation

To estimate the copy number for the 8q21.2 VNTR and DEFB loci in human lineages, we applied a read-depth based copy number genotyper14 to a collection of 1,105 published high-coverage genomes62,63,64,65,66,67. In brief, sequencing reads were divided into multiples of 36-monomer HORs, which were then mapped to a repeat-masked human reference genome (GRCh38) using mrsFAST68 (v.3.4.1). To increase the mapping sensitivity, we allowed up to two mismatches per 36-monomer HOR. The read depth of mappable sequences across the genome was corrected for underlying GC content, and copy number estimate for the locus of interest was computed by summarizing over all mappable bases for each sample.

Entropy calculation

To define regions of increased admixture within the centromeric HOR array, we calculated the entropy using the frequencies of the different HOR units in 10-unit windows (1 unit slide) over the entire array. The following formula was used to determine entropy:

$${rm{Entropy}}=-Sigma ({{rm{frequency}}}_{{rm{i}}}times {log }_{2}({{rm{frequency}}}_{{rm{i}}}))$$

in which frequency is: (no. of HORs)/(total no. of HORs) in a 10-unit window. The analysis is analogous to that previously performed69.

Droplet digital PCR

Droplet digital PCR was performed on CHM13 genomic DNA to estimate the number of D8Z2 α-satellite HORs, as was previously done for the DXZ1 α-satellite HORs13. In brief, genomic DNA was isolated from CHM13 cells using the DNeasy Blood & Tissue Kit (Qiagen). DNA was quantified using a Qubit Fluorometer and the Qubit dsDNA HS Assay (Invitrogen). Reactions (20 μl) were prepared with 0.1 ng of gDNA for the D8Z2 assay or 1 ng of gDNA for the MTUS1 single-copy gene (as a control). EvaGreen droplet digital PCR (Bio-Rad) master mixes were simultaneously prepared for the D8Z2 and MTUS1 reactions, which were then incubated for 15 min to allow for restriction digest, according to the manufacturer’s protocol.

Pulsed-field gel electrophoresis and Southern blot

CHM13 genomic DNA was prepared in agarose plugs and digested with either BamHI or MfeI (to characterize the chromosome 8 centromeric region) or BmgBI (to characterize the chromosome 8q21.2 region) in the buffer recommended by the manufacturer. The digested DNA was separated with the CHEF Mapper system (Bio-Rad; autoprogram, 5–850-kb range, 16 h run), transferred to a membrane (Amersham Hybond-N+) and blot-hybridized with a 156 bp probe specific to the chromosome 8 centromeric α-satellite or 8q21.2 region. The probe was labelled with 32P by PCR-amplifying a synthetic DNA template 233: 5′-TTTGTGGAAGTGGACATTTCGCTTTGTAGCCTATCTGGAAAAAGGAAATATCTTCCCATGAATGCGAGATAGAAGTAATCTCAGAA ACATGTTTATGCTGTATCTACTCAACTAACTGTGCTGAACATTTCTATTGTAAAAATAGACAGAAGCATT-3′ (for the centromere of chromosome 8); 264: 5′-TTTGTGGAAGTGGACATTTCG CCCGAGGGGCCGCGGCAGGGATTCCGGGGGACCGGGAGTGGGGGGTTGGGGTTACTCTTGGCTTTTTGCCCTCTCCTGCCGCCGGCTGCTCCAGTTTCTTTCGCTTTGCGGCGAGGTGGTAAAAATAGACAGAAGCATT-3′ (for the organization of the chromosome 8q21.2 locus) with PCR primers 129: 5′-TTTGTGGAAGTGGACATTTC-3′ and 130: 5′-AATGCTTCTGTCTATTTTTA-3′. The blot was incubated for 2 h at 65 °C for pre-hybridization in Church’s buffer (0.5 M Na-phosphate buffer containing 7% SDS and 100 μg ml−1 of unlabelled salmon sperm carrier DNA). The labelled probe was heat denatured in a boiling water bath for 5 min and snap-cooled on ice. The probe was added to the hybridization Church’s buffer and allowed to hybridize for 48 h at 65 °C. The blot was washed twice in 2× SSC (300 mM NaCl, 30 mM sodium citrate, pH 7.0), 0.05% SDS for 10 min at room temperature, twice in 2× SSC, 0.05% SDS for 5 min at 60 °C, twice in 0.5× SSC, 0.05% SDS for 5 min at 60 °C, and twice in 0.25× SSC, 0.05% SDS for 5 min at 60 °C. The blot was exposed to X-ray film for 16 h at −80 °C. Uncropped, unprocessed images of all gels and blots are shown in Supplementary Fig. 9.

FISH and immunofluorescence

To validate the organization of the chromosome 8 centromere, we performed FISH on metaphase chromosome spreads as previously described70 with slight modifications. In brief, CHM13 cells were treated with colcemid and resuspended in HCM buffer (10 mM HEPES pH7.3, 30 mM glycerol, 1 mM CaCl2, 0.8 mM MgCl2). After 10 min, cells were fixed with methanol:acetic acid (3:1), dropped onto previously clean slides, and soaked in 1× PBS. Slides were incubated overnight in cold methanol, hybridized with labelled FISH probes at 68 °C for 2 min, and incubated overnight at 37 °C. Slides were washed three times in 0.1× SSC at 65 °C for 5 min each before mounting in Vectashield containing 5 μg ml−1 DAPI. Slides were imaged on a fluorescence microscope (Leica DM RXA2) equipped with a charge-coupled device camera (CoolSNAP HQ2) and a 100× 1.6–0.6 NA objective lens. Images were collected using Leica Application Suite X (v.3.7).

The probes used to validate the organization of the chromosome 8 centromere were picked from the human large-insert clone fosmid library ABC10. ABC10 end sequences were mapped using MEGABLAST (similarity = 0.99, parameters: -D 2 -v 7 -b 7 -e 1e-40 -p 80 -s 90 -W 12 -t 21 -F F) to a repeat-masked CHM13 genome assembly containing the complete chromosome 8 (parameters: -e wublast -xsmall -no_is -s -species Homo sapiens). Expected insert size for fosmids was set to (min) 32 kb and (max) 48 kb. Resulting clone alignments were grouped into the following categories based on uniqueness of the alignment for a given pair of clones, alignment orientation and the inferred insert size from the assembly. (1) Concordant best: unique alignment for clone pair, insert size within expected fosmid range, expected orientation. (2) Concordant tied: non-unique alignment for clone pair, insert size within expected fosmid range, expected orientation. (3) Discordant best: unique alignment of clone pair, insert size too small, too large or in opposite expected orientation of expected fosmid clone. (4) Discordant tied: non unique alignment for clone pair, insert size too small, too large or in opposite expected orientation of expected fosmid clone. (5) Discordant trans: clone pair has ends mapping to different contigs.

Clones aligning to regions within the chromosome 8 centromeric region were selected for FISH validation. The fosmid clones used for validation of the chromosome 8 centromeric region are: 174552_ABC10_2_1_000046302400_C7 for the p-arm monomeric α-satellite region (Cy5; blue), 174222_ABC10_2_1_000044375100_H13 for the p-arm portion of the D8Z2 HOR array (FluorX; green), 171417_ABC10_2_1_000045531400_M19 for the central portion of the D8Z2 HOR array (Cy3; red), 173650_ABC10_2_1_000044508400_J14 for the q-arm portion of the D8Z2 HOR array (FluorX; green), and 173650_ABC10_2_1_000044091500_K11 for the q-arm monomeric α-satellite region (Cy5; blue).

To determine the location of CENP-A relative to methylated DNA (specifically, 5-methylcytosines), we performed immunofluorescence on stretched CHM13 chromatin fibres as previously described71,72 with modifications. In brief, CHM13 cells were swollen in a hypotonic buffer consisting of a 1:1:1 ratio of 75 mM KCl, 0.8% sodium citrate, and dH2O for 5 min. Then, 3.5 × 104 cells were cytospun onto an ethanol-washed glass slide with a Shandon Cytospin 4 at 55g for 4 min with high acceleration and allowed to adhere for 1 min before immersing in a salt-detergent-urea lysis buffer (25 mM Tris pH 7.5, 0.5 M NaCl, 1% Triton X-100 and 0.3 M urea) for 15 min at room temperature. The slide was slowly removed from the lysis buffer over a time period of 38 s and subsequently washed in PBS, incubated in 4% formaldehyde in PBS for 10 min, and washed with PBS and 0.1% Triton X-100. The slide was rinsed in PBS and 0.05% Tween-20 (PBST) for 3 min, blocked for 30 min with immunofluorescence block (2% FBS, 2% BSA, 0.1% Tween-20 and 0.02% NaN2), and then incubated with a mouse monoclonal anti-CENP-A antibody (1:200, Enzo, ADI-KAM-CC006-E) and rabbit monoclonal anti-5-methylcytosine antibody (1:200, RevMAb, RM231) for 3 h at room temperature. Cells were washed three times for 5 min each in PBST and then incubated with Alexa Fluor 488 goat anti-rabbit (1:200, Thermo Fisher Scientific, A-11034) and Alexa Fluor 594 conjugated to goat anti-mouse (1:200, Thermo Fisher Scientific, A-11005) for 1.5 h. Cells were washed three times for 5 min each in PBST, fixed for 10 min in 4% formaldehyde, and washed three times for 1 min each in dH2O before mounting in Vectashield containing 5 μg ml−1 DAPI. Slides were imaged on an inverted fluorescence microscope (Leica DMI6000) equipped with a charge-coupled device camera (Leica DFC365 FX) and a 40× 1.4 NA objective lens.

To assess the repeat organization of the 8q21 neocentromere, we performed FISH73 on CHM13 chromatin fibres. DNA fibres were obtained following Henry H. Q. Heng’s protocol with minor modifications74. In brief, chromosomes were fixed with methanol:acetic acid (3:1), dropped onto previously clean slides, and soaked in 1× PBS. Manual elongation was performed by coverslip in NaOH:ethanol (5:2) solution. Slides were mounted in Vectashield containing 5 μg ml−1 DAPI and imaged on a fluorescence microscope (Leica DM RXA2) equipped with a charge-coupled device camera (CoolSNAP HQ2) and a 100× 1.6–0.6 NA objective lens. The probes used for validation of the 8q21.2 locus were picked from the same ABC10 fosmid library described above and include 174552_ABC10_2_1_000044787700_O7 for Probe 1 (Cy3; red) and 173650_ABC10_2_1_000044086000_F24 for Probe 2 (FluorX; green). Several CHM13 8q21.2 chromatin fibres were imaged. We quantified the number and intensity of the probe signals on a set of CHM13 chromatin fibres using ImageJ’s Gel Analysis tool (v.1.51) and found that there were 63 ± 7.55 green signals and 67 ± 5.20 red signals (n = 3 independent experiments), consistent with the 67 full and 7 partial repeats in the CHM13 8q21.2 VNTR.

Native CENP-A ChIP–seq and analysis

We performed two independent replicates of native CENP-A ChIP–seq on CHM13 cells as previously described25,72 with some modifications. In brief, 3 × 107–4 × 107 cells were collected and resuspended in 2 ml of ice-cold buffer I (0.32 M sucrose, 15 mM Tris, pH 7.5, 15 mM NaCl, 5 mM MgCl2, 0.1 mM EGTA, and 2× Halt Protease Inhibitor Cocktail (Thermo Fisher 78429)). Ice-cold buffer II (2 ml; 0.32 M sucrose, 15 mM Tris, pH 7.5, 15 mM NaCl, 5 mM MgCl2, 0.1 mM EGTA, 0.1% IGEPAL, and 2× Halt Protease Inhibitor Cocktail) was added, and samples were placed on ice for 10 min. The resulting 4 ml of nuclei were gently layered on top of 8 ml of ice-cold buffer III (1.2 M sucrose, 60 mM KCl, 15 mM, Tris pH 7.5, 15 mM NaCl, 5 mM MgCl2, 0.1 mM EGTA, and 2× Halt Protease Inhibitor Cocktail (Thermo Fisher 78429)) and centrifuged at 10,000g for 20 min at 4 °C. Pelleted nuclei were resuspended in buffer A (0.34 M sucrose, 15 mM HEPES, pH 7.4, 15 mM NaCl, 60 mM KCl, 4 mM MgCl2 and 2× Halt Protease Inhibitor Cocktail) to 400 ng ml−1. Nuclei were frozen on dry ice and stored at 80 °C. MNase digestion reactions were carried out on 200–300 μg chromatin, using 0.2–0.3 U μg−1 MNase (Thermo Fisher 88216) in buffer A supplemented with 3 mM CaCl2 for 10 min at 37 °C. The reaction was quenched with 10 mM EGTA on ice and centrifuged at 500g for 7 min at 4 °C. The chromatin was resuspended in 10 mM EDTA and rotated at 4 °C for 2 h. The mixture was adjusted to 500 mM NaCl, rotated for another 45 min at 4 °C and then centrifuged at maximum speed (21,100g) for 5 min at 4 °C, yielding digested chromatin in the supernatant. Chromatin was diluted to 100 ng ml−1 with buffer B (20 mM Tris, pH 8.0, 5 mM EDTA, 500 mM NaCl and 0.2% Tween 20) and precleared with 100 μl 50% protein G Sepharose bead (GE Healthcare) slurry for 20 min at 4 °C, rotating. Precleared supernatant (10–20 μg bulk nucleosomes) was saved for further processing. To the remaining supernatant, 20 μg mouse monoclonal anti-CENP-A antibody (Enzo ADI-KAM-CC006-E) was added and rotated overnight at 4 °C. Immunocomplexes were recovered by the addition of 200 ml 50% protein G Sepharose bead slurry followed by rotation at 4 °C for 3 h. The beads were washed three times with buffer B and once with buffer B without Tween. For the input fraction, an equal volume of input recovery buffer (0.6 M NaCl, 20 mM EDTA, 20 mM Tris, pH 7.5 and 1% SDS) and 1 ml of RNase A (10 mg ml−1) was added, followed by incubation for 1 h at 37 °C. Proteinase K (100 mg ml−1, Roche) was then added, and samples were incubated for another 3 h at 37 °C. For the ChIP fraction, 300 μl of ChIP recovery buffer (20 mM Tris, pH 7.5, 20 mM EDTA, 0.5% SDS and 500 mg ml−1 proteinase K) was added directly to the beads and incubated for 3–4 h at 56 °C. The resulting proteinase K-treated samples were subjected to a phenol–chloroform extraction followed by purification with a QIAGEN MinElute PCR purification column. Unamplified bulk nucleosomal and ChIP DNA were analysed using an Agilent Bioanalyzer instrument and a 2100 High Sensitivity Kit.

Sequencing libraries were generated using the TruSeq ChIP Library Preparation Kit Set A (Illumina IP-202-1012) according to the manufacturer’s instructions, with some modifications. In brief, 5–10 ng bulk nucleosomal or ChIP DNA was end-repaired and A-tailed. Illumina TruSeq adaptors were ligated, libraries were size-selected to exclude polynucleosomes using an E-Gel SizeSelect II agarose gel, and the libraries were PCR-amplified using the PCR polymerase and primer cocktail provided in the kit. The resulting libraries were submitted for 150 bp, paired-end Illumina sequencing using a NextSeq 500/550 High Output Kit v2.5 (300 cycles). The resulting reads were assessed for quality using FastQC (https://github.com/s-andrews/FastQC), trimmed with Sickle (https://github.com/najoshi/sickle; v1.33) to remove low-quality 5′ and 3′ end bases, and trimmed with Cutadapt75 (v.1.18) to remove adapters.

Processed CENP-A ChIP and bulk nucleosomal reads were aligned to the CHM13 whole-genome assembly5 using two different approaches: (1) BWA-MEM76 (v.0.7.17) and (2) a k-mer-based mapping approach we developed (described below).

For BWA-MEM mapping, data were aligned with the following parameters: bwa mem -k 50 -c 1000000 {index} {read1.fastq.gz}for single-end data, and bwa mem -k 50 -c 1000000 {index} {read1.fastq.gz} {read2.fastq.gz} for paired-end data. The resulting SAM files were filtered using SAMtools47 with FLAG score 2308 to prevent multi-mapping of reads. With this filter, reads mapping to more than one location are randomly assigned a single mapping location, thereby preventing mapping biases in highly identical regions. Alignments to the chromosome 8 centromere were downsampled to the same coverage and normalized with deepTools77 (v.3.4.3) bamCompare with the following parameters: bamCompare -b1 {ChIP.bam} -b2 {Bulk_nucleosomal.bam} –operation ratio –binSize 1000 -o {out.bw}. The resulting bigWig file was visualized on the UCSC Genome Browser using the CHM13 chromosome 8 assembly as an assembly hub.

For the k-mer-based mapping, the initial BWA-MEM alignment was used to identify reads specific to the chromosome 8 centromeric region (chr8:43600000–47200000). The k-mers (k = 50) were identified from each chromosome 8 centromere-specific data set using Jellyfish (v.2.3.0) and mapped back onto reads and chromosome 8 centromere assembly allowing for no mismatches. Approximately 93–98% of all k-mers identified in the reads were also found within the D8Z2 HOR array. Each k-mer from the read data was then placed once at random between all sites in the HOR array that had a perfect match to that k-mer. These data were then visualized using a histogram with 1-kb bins in R (R core team, 2020).

Mappability of short reads within the chromosome 8 centromeric region

To determine the mappability of short reads within the chromosome 8 centromeric HOR array, we performed a simulation where we generated 300,000 random 150-bp fragments from five equally sized (416 kb) regions across the CHM13 D8Z2 HOR array. We mapped these fragments back to the CHM13 chromosome 8 centromeric region using BWA-MEM (v0.7.17) or the k-mer-based approach, as described above. For BWA-MEM mapping, the 150-bp fragments were aligned with the following parameters: bwa mem -k 50 -c 1000000 {index} {fragments.fasta}. The resulting SAM files were filtered using SAMtools47 with FLAG score 2308 to prevent multi-mapping of reads and then converted to a BAM file. BAM files were visualized in IGV54. For the k-mer-based mapping, k-mers (k = 50) were identified from each set of 150-bp fragments using Jellyfish (v.2.3.0) and mapped back onto the fragments and the chromosome 8 centromere assembly allowing for no mismatches. k-mers with perfect matches to multiple sites within the centromeric region were assigned to one of the sites at random. These data were visualized using a histogram with 1-kb bins in R (R core team, 2020).

Phylogenetic analysis

To assess the phylogenetic relationship between α-satellite repeats, we first masked every non-α-satellite repeat in the human and NHP centromere assemblies using RepeatMasker78 (v.4.1.0). Then, we subjected the masked assemblies to StringDecomposer42 (version available 28 February 2020) using a set of 11 α-satellite monomers derived from a chromosome 8 11-monomer HOR unit (described in the ‘Analysis of α-satellite organization’ section above). This tool identifies the location of α-satellite monomers in the assemblies, and we used this to extract the α-satellite monomers from the HOR/dimeric array and monomeric regions into multi-FASTA files. We ultimately extracted 12,989, 8,132, 12,224, 25,334 and 63,527 α-satellite monomers from the HOR/dimeric array in human, chimpanzee (H1), chimpanzee (H2), orangutan and macaque, respectively, and 2,879, 3,781, 3,351, 1,573 and 8,127 monomers from the monomeric regions in human, chimpanzee (H1), chimpanzee (H2), orangutan and macaque, respectively. We randomly selected 100 and 50 α-satellite monomers from the HOR/dimeric array and monomeric regions and aligned them with MAFFT79,80 (v.7.453). We used IQ-TREE81 to reconstruct the maximum-likelihood phylogeny with model selection and 1000 bootstraps. The resulting tree file was visualized in iTOL82.

To estimate sequence divergence along the pericentromeric regions, we first mapped each NHP centromere assembly to the CHM13 centromere assembly using minimap244 (v.2.17-r941) with the following parameters: -ax asm20 –eqx -Y -t 8 -r 500000. Then, we generated a BED file of 10 kb windows located within the CHM13 centromere assembly. We used the BED file to subset the BAM file, which was subsequently converted into a set of FASTA files. FASTA files contained at least 5 kb of orthologous sequences from one or more NHP centromere assemblies. Pairs of human and NHP orthologous sequences were realigned using MAFFT (v.7.453) and the following command: mafft –maxiterate 1000 –localpair. Sequence divergence was estimated using the Tamura-Nei substitution model83, which accounts for recurrent mutations and differences between transversions and transitions as well as within transitions. Mutation rate per segment was estimated using Kimura’s model of neutral evolution84. In brief, we modelled the estimated divergence (D) is a result of between-species substitutions and within-species polymorphisms; that is, D = 2μt + 4Neμ, in which Ne is the ancestral human effective population size, t is the divergence time for a given human–NHP pair, and μ is the mutation rate. We assumed a generation time of [20, 29] years and the following divergence times: human–macaque = [23 × 106, 25 × 106] years, human–orangutan = [12 × 106, 14 × 106] years, human–chimpanzee = [4 × 106, 6 × 106] years. To convert the genetic unit to a physical unit, our computation also assumes Ne = 10,000 and uniformly drawn values for the generation and divergence times.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.



Source link

Leave a Reply

Your email address will not be published. Required fields are marked *