Experimental design
The teeth used for DNA extraction were obtained with relevant institutional permissions from the Institute of Ethnology and Anthropology of Russian Academy of Sciences (Russia), Cherepovets Museum Association (Russia), and Archaeological Research Collection of Tallinn University (Estonia). DNA was extracted from the teeth of 48 individuals: 3 from Stone Age HGs from western Russia (WeRuHG; 10,800 to 4250 cal BCE), 44 from Bronze Age Fatyanovo Culture individuals from western Russia (Fatyanovo; 2900 to 2050 cal BCE), and 1 from a Corded Ware Culture individual from Estonia (EstCWC; 2850 to 2500 cal BCE) (Fig. 1, data S1, table S1, and text S1). Petrous bones of 13 of the Fatyanovo Culture individuals have been sampled for another project. More detailed information about the archaeological periods and the specific sites and burials of this study is given below.
All of the laboratory work was performed in dedicated aDNA laboratories of the Institute of Genomics, University of Tartu. The library quantification and sequencing were performed at the Institute of Genomics Core Facility, University of Tartu. The main steps of the laboratory work are detailed below.
DNA extraction. The teeth of 48 individuals were used to extract DNA. One individual was sampled twice from different teeth. Apical tooth roots were cut off with a drill and used for extraction because root cementum has been shown to contain more endogenous DNA than crown dentine (62). The root pieces were used whole to avoid heat damage during powdering with a drill and to reduce the risk of cross-contamination between samples. Contaminants were removed from the surface of tooth roots by soaking in 6% bleach for 5 min, then rinsing three times with Milli-Q water (Millipore), and lastly soaking in 70% ethanol for 2 min, shaking the tubes during each round to dislodge particles. Last, the samples were left to dry under an ultraviolet light for 2 hours.
Next, the samples were weighed, [20 * sample mass (mg)] l of EDTA and [sample mass (mg) / 2] l of proteinase K were added, and the samples were left to digest for 72 hours on a rotating mixer at 20C to compensate for the smaller surface area of the whole root compared to powder. Undigested material was stored for a second DNA extraction if need be.
The DNA solution was concentrated to 250 l (Vivaspin Turbo 15, 30,000 MWCO PES, Sartorius) and purified in large-volume columns (High Pure Viral Nucleic Acid Large Volume Kit, Roche) using 2.5 ml of PB buffer, 1 ml of PE buffer, and 100 l of EB buffer (MinElute PCR Purification Kit, QIAGEN).
Library preparation. Sequencing libraries were built using NEBNext DNA Library Prep Master Mix Set for 454 (E6070, New England Biolabs) and Illumina-specific adaptors (63) following established protocols (6365). The end repair module was implemented using 30 l of DNA extract, 12.5 l of water, 5 l of buffer, and 2.5 l of enzyme mix, incubating at 20C for 30 min. The samples were purified using 500 l of PB and 650 l of PE buffer and eluted in 30 l of EB buffer (MinElute PCR Purification Kit, QIAGEN). The adaptor ligation module was implemented using 10 l of buffer, 5 l of T4 ligase, and 5 l of adaptor mix (63), incubating at 20C for 15 min. The samples were purified as in the previous step and eluted in 30 l of EB buffer (MinElute PCR Purification Kit, QIAGEN). The adaptor fill-in module was implemented using 13 l of water, 5 l of buffer, and 2 l of Bst DNA polymerase, incubating at 37C for 30 min and at 80C for 20 min. The libraries were amplified, and both the indexed and universal primers (NEBNext Multiplex Oligos for Illumina, New England Biolabs) were added by polymerase chain reaction (PCR) using HGS Diamond Taq DNA polymerase (Eurogentec). The samples were purified and eluted in 35 l of EB buffer (MinElute PCR Purification Kit, QIAGEN). Three verification steps were implemented to make sure library preparation was successful and to measure the concentration of double-stranded DNA/sequencing librariesfluorometric quantitation (Qubit, Thermo Fisher Scientific), parallel capillary electrophoresis (Fragment Analyzer, Agilent Technologies), and quantitative PCR. One sample (TIM004) had a DNA concentration lower than our threshold for sequencing and was hence excluded, leaving 48 samples from 47 individuals to be sequenced.
DNA sequencing. DNA was sequenced using the Illumina NextSeq 500 platform with the 75base pair (bp) single-end method. First, 15 samples were sequenced together on one flow cell. Later, additional data were generated for some samples to increase coverage.
Mapping. Before mapping, the sequences of adaptors and indexes and poly-G tails occurring due to the specifics of the NextSeq 500 technology were cut from the ends of DNA sequences using cutadapt 1.11 (66). Sequences shorter than 30 bp were also removed with the same program to avoid random mapping of sequences from other species. The sequences were mapped to reference sequence GRCh37 (hs37d5) using Burrows-Wheeler Aligner (BWA 0.7.12) (67) and command mem with reseeding disabled.
After mapping, the sequences were converted to BAM format, and only sequences that mapped to the human genome were kept with samtools 1.3 (68). Next, data from different flow cell lanes were merged and duplicates were removed with picard 2.12 (http://broadinstitute.github.io/picard/index.html). Indels were realigned with GATK 3.5 (69), and lastly, reads with mapping quality under 10 were filtered out with samtools 1.3 (68).
The average endogenous DNA content (proportion of reads mapping to the human genome) for the 48 samples is 29% (table S1). The endogenous DNA content is variable as is common in aDNA studies, ranging from under 1 to around 78% (table S1).
aDNA authentication. As a result of degrading over time, aDNA can be distinguished from modern DNA by certain characteristics: short fragments and a high frequency of CT substitutions at the 5 ends of sequences due to cytosine deamination. The program mapDamage2.0 (70) was used to estimate the frequency of 5 CT transitions.
mtDNA contamination was estimated using the method from (71).This included calling an mtDNA consensus sequence based on reads with mapping quality of at least 30 and positions with at least 5 coverage, aligning the consensus with 311 other human mtDNA sequences from (71), mapping the original mtDNA reads to the consensus sequence, and running contamMix 1.0-10 with the reads mapping to the consensus and the 312 aligned mtDNA sequences while trimming seven bases from the ends of reads with the option trimBases. For the male individuals, contamination was also estimated on the basis of chrX using the two contamination estimation methods first described in (72) and incorporated in the ANGSD software (73) in the script contamination.R.
The samples show 10% CT substitutions at the 5 ends on average, ranging from 6 to 17% (table S1). The mtDNA contamination point estimate for samples with >5 mtDNA coverage ranges from 0.03 to 2.02% with an average of 0.4% (table S1). The average of the two chrX contamination methods of male individuals with average chrX coverage of >0.1 is between 0.4 and 0.87% with an average of 0.7% (table S1).
Kinship analysis. A total of 4,375,438 biallelic single-nucleotide variant sites, with minor allele frequency (MAF) > 0.1 in a set of more than 2000 high-coverage genomes of Estonian Genome Center (EGC) (74), were identified and called with ANGSD (73) command --doHaploCall from the 25 BAM files of 24 Fatyanovo individuals with coverage of >0.03. The ANGSD output files were converted to .tped format as an input for the analyses with READ script to infer pairs with first- and second-degree relatedness (41).
The results are reported for the 100 most similar pairs of individuals of the 300 tested, and the analysis confirmed that the two samples from one individual (NIK008A and NIK008B) were indeed genetically identical (fig. S6). The data from the two samples from one individual were merged (NIK008AB) with samtools 1.3 option merge (68).
Calculating general statistics and determining genetic sex. Samtools 1.3 (68) option stats was used to determine the number of final reads, average read length, average coverage, etc. Genetic sex was calculated using the script sexing.py from (75), estimating the fraction of reads mapping to chrY out of all reads mapping to either X or Y chromosome.
The average coverage of the whole genome for the samples is between 0.00004 and 5.03 (table S1). Of these, 2 samples have an average coverage of >0.01, 18 samples have >0.1, 9 samples have >1, 1 sample have around 5, and the rest are lower than 0.01 (table S1). Genetic sexing confirms morphological sex estimates or provides additional information about the sex of the individuals involved in the study. Genetic sex was estimated for samples with an average genomic coverage of >0.005. The study involves 16 females and 20 males (Table 1 and table S1).
Determining mtDNA hgs. The program bcftools (76) was used to produce VCF files for mitochondrial positions; genotype likelihoods were calculated using the option mpileup, and genotype calls were made using the option call. mtDNA hgs were determined by submitting the mtDNA VCF files to HaploGrep2 (77, 78). Subsequently, the results were checked by looking at all the identified polymorphisms and confirming the hg assignments in PhyloTree (78). Hgs for 41 of the 47 individuals were successfully determined (Table 1, fig. S1, and table S1).
No female samples have reads on the chrY consistent with a hg, indicating that levels of male contamination are negligible. Hgs for 17 (with coverage of >0.005) of the 20 males were successfully determined (Table 1 and tables S1 and S2).
chrY variant calling and hg determination. In total, 113,217 haplogroup informative chrY variants from regions that uniquely map to chrY (36, 7982) were called as haploid from the BAM files of the samples using the --doHaploCall function in ANGSD (73). Derived and ancestral allele and hg annotations for each of the called variants were added using BEDTools 2.19.0 intersect option (83). Hg assignments of each individual sample were made manually by determining the hg with the highest proportion of informative positions called in the derived state in the given sample. chrY haplogrouping was blindly performed on all samples regardless of their sex assignment.
Preparing the datasets for autosomal analyses. The HO array dataset (https://reich.hms.harvard.edu/downloadable-genotypes-present-day-and-ancient-dna-data-compiled-published-papers) was used as the modern DNA background. Individuals from the 1240K dataset (https://reich.hms.harvard.edu/downloadable-genotypes-present-day-and-ancient-dna-data-compiled-published-papers) were used as the aDNA background.
The data of the comparison datasets and of the individuals of this study were converted to BED format using PLINK 1.90 (http://pngu.mgh.harvard.edu/purcell/plink/) (84), and the datasets were merged. Two datasets were prepared for analyses: one with HO and 1240K individuals and the individuals of this study, where 584,901 autosomal SNPs of the HO dataset were kept; the other with 1240K individuals and the individuals of this study, where 1,136,395 autosomal and 48,284 chrX SNPs of the 1240K dataset were kept.
Individuals with <10,000 SNPs overlapping with the HO autosomal dataset were removed from further autosomal analyses, leaving 30 individuals of this study to be used in autosomal analyses. These included 3 from WeRuHG, 26 from Fatyanovo, and 1 from EstCWC (table S1).
Principal components analysis. To prepare for PCA, a reduced comparison sample set composed of 813 modern individuals from 53 populations of Europe, Caucasus, and Near East and 737 ancient individuals from 107 populations was assembled (tables S3 and S4). The data were converted to EIGENSTRAT format using the program convertf from the EIGENSOFT 7.2.0 package (85). PCA was performed with the program smartpca from the same package, projecting ancient individuals onto the components constructed based on the modern genotypes using the option lsqproject and trying to account for the shrinkage problem introduced by projecting by using the option autoshrink.
Admixture analysis. For Admixture analysis (86), the same ancient sample set was used as for PCA, and the modern sample set was increased to 1861 individuals from 144 populations from all over the world (tables S3 and S4). The analysis was carried out using ADMIXTURE 1.3 (86) with the P option, projecting ancient individuals into the genetic structure calculated on the modern dataset due to missing data in the ancient samples. The HO dataset of modern individuals was pruned to decrease linkage disequilibrium using the option indep-pairwise with parameters 1000 250 0.4 in PLINK 1.90 (http://pngu.mgh.harvard.edu/purcell/plink/) (84). This resulted in a set of 269,966 SNPs. Admixture was run on this set using K = 3 to K = 18 in 100 replicates. This enabled us to assess convergence of the different models. K = 10 and K = 9 were the models with the largest number of inferred genetic clusters for which >10% of the runs that reached the highest log likelihood values yielded very similar results. This was used as a proxy to assume that the global likelihood maximum for this particular model was indeed reached. Then, the inferred genetic cluster proportions and allele frequencies of the best run at K = 9 were used to run Admixture to project the aDNA individuals, for which the intersection with the LD pruned modern dataset yielded data for more than 10,000 SNPs, on the inferred clusters. The same projecting approach was taken for all models for which there is good indication that the global likelihood maximum was reached (K3 to 18). We present all ancient individuals in fig. S2 but only population averages in Fig. 2B. The resulting membership proportions to K genetic clusters are sometimes called ancestry components, which can lead to overinterpretation of the results. The clustering itself is, however, an objective description of genetic structure and hence a valuable tool in population comparisons.
Outgroup f3 statistics. For calculating autosomal outgroup f3 statistics, the same ancient sample set as for previous analyses was used, and the modern sample set included 1177 individuals from 80 populations from Europe, Caucasus, Near East, Siberia and Central Asia, and Yoruba as outgroup (tables S3 and S4). The data were converted to EIGENSTRAT format using the program convertf from the EIGENSOFT 5.0.2 package (85). Outgroup f3 statistics of the form f3(Yoruba; West_Siberia_N/EHG/CentralRussiaHG/Fatyanovo/ Yamnaya_Samara/Poland_CWC/Baltic_CWC/Central_CWC, modern/ancient) were computed using the ADMIXTOOLS 6.0 program qp3Pop (87).
To allow chrX versus autosome comparison for ancient populations, outgroup f3 statistics using chrX SNPs were computed. To allow the use of the bigger number of positions in the 1240K over the HO dataset, Mbuti from the Simons Genome Diversity Project (88) was used as the outgroup. The outgroup f3 analyses of the form f3(Mbuti; West_Siberia_N/EHG/CentralRussiaHG/Fatyanovo/ Yamnaya_Samara/Poland_CWC/Baltic_CWC/Central_CWC, ancient) were run both using not only 1,136,395 autosomal SNPs but also 48,284 chrX positions available in the 1240K dataset. Because all children inherit half of their autosomal material from their father but only female children inherit their chrX from their father, then in this comparison chrX data give more information about the female and autosomal data about the male ancestors of a population.
The autosomal outgroup f3 results of the two different SNP sets were compared to each other and to the results based on the chrX positions of the 1240K dataset to see whether the SNPs used affect the trends seen. Outgroup f3 analyses were also run with the form f3(Mbuti; PES001/I0061/Sidelkino, Paleolithic/Mesolithic HG) and admixture f3 analyses with the form f3(Fatyanovo; Yamnaya, EF) using the autosomal positions of the 1240K dataset.
D statistics. D statistics of the form D(Yoruba, West_Siberia_N/EHG/CentralRussiaHG/Fatyanovo/ Yamnaya_Samara/Poland_CWC/Baltic_CWC/Central_CWC; Russian, modern/ancient) were calculated on the same dataset as outgroup f3 statistics (tables S3 and S4) using the autosomal positions of the HO dataset. The ADMIXTOOLS 6.0 package program qpDstat was used (87).
In addition, D statistics of the form D(Mbuti, ancient; Yamnaya_Samara, Fatyanovo/Baltic_CWC/ Central_CWC) and D(Mbuti, ancient; Poland_CWC/Baltic_CWC/ Central_CWC, Fatyanovo) were calculated using the autosomal positions of the 1240K dataset. However, comparing very similar populations directly using D statistics seems to be affected by batch biasesCentral_CWC comes out as significantly closer to almost all populations than Fatyanovo, while this is not the case when comparing less similar Fatyanovo and Yamnaya_Samara. Because of this, the results of D(Mbuti, ancient; Poland_CWC/Baltic_CWC/Central_CWC, Fatyanovo) are not discussed in the main text, but the data are included in table S19.
FST. Weir and Cockerham pairwise average FST (89) was calculated for the dataset used for outgroup f3 and D statistics using the autosomal positions of the HO dataset using a custom script.
qpAdm. The ADMIXTOOLS 6.0 (87) package programs qpWave and qpAdm were used to estimate which populations and in which proportions are suitable proxies of admixture to form the populations or individuals of this study. The autosomal positions of the 1240K dataset were used. Only samples with more than 100,000 SNPs were used in the analyses. Mota, Ust-Ishim, Kostenki14, GoyetQ116, Vestonice16, MA1, AfontovaGora3, ElMiron, Villabruna, WHG, EHG, CHG, Iran_N, Natufian, Levant_N, and Anatolia_N (and Volosovo in some cases indicated in table S15) were used as right populations. Yamnaya_Samara or Yamnaya_Kalmykia was used as the left population representing Steppe ancestry. Levant_N, Anatolia_N, LBK_EN, Central_MN, Globular_Amphora, Trypillia, Ukraine_Eneolithic, or Ukraine_Neolithic was used as the left population representing EF ancestry. In some cases, WHG, EHG, WesternRussiaHG, or Volosovo was used as the left population representing HG ancestry. Alternatively, one-way models between Fatyanovo, Baltic_CWC, and Central_CWC were tested. Also, PES001 was modeled as a mixture of WHG and AfontovaGora3, MA1, or CHG.
To look at sex bias, four models that were not rejected using autosomal data were also tested using the 48,284 chrX positions of the 1240K dataset. The same samples were used as in the autosomal modeling.
ChromoPainter/NNLS. To infer the admixture proportions of ancient individuals, the ChromoPainter/NNLS pipeline was applied (28). Because of the low coverage of the ancient data, it is not possible to infer haplotypes, and the analysis was performed in unlinked mode (option -u). The autosomal positions of the HO dataset were used. Only samples with more than 20,000 SNPs were used in the analyses. Because ChromoPainter (90) does not tolerate missing data, every ancient target individual was iteratively painted together with one representative individual from potential source populations as recipients. All the remaining modern individuals from the sample set used for Admixture analysis were used as donors (tables S3 and S4). Subsequently, we reconstructed the profile of each target individual as a combination of two or more ancient individuals, using the non-negative least square approach. Let Xg and Yp be vectors summarizing the proportion of DNA that source and target individuals copy from each of the modern donor groups as inferred by ChromoPainter. Yp = 1X1 + 2X2 + + zXz was reconstructed using a slight modification of the nnls function in R (91) and implemented in GlobeTrotter (92) under the conditions g 0 and g = 1. To evaluate the fitness of the NNLS estimation, we inferred the sum of the squared residual for every tested model (93). Models identified as plausible with qpAdm with Yamnaya_Samara and Globular_Amphora/Trypillia as sources were used. The resulting painting profiles, which summarize the fraction of the individuals DNA inherited by each donor individual, were summed over individuals from the same population.
DATES. The time of admixture between Yamnaya and EF populations forming the Fatyanovo Culture population was estimated using the program DATES (37). The autosomal positions of the 1240K dataset were used.
Phenotyping. To predict eye, hair, and skin color in the ancient individuals (tables S20 to S22), the HIrisPlex-S variants from 19 genes in nine autosomes were selected (9496), and the region to be analyzed was selected adding 2 Mb around each SNP, collapsing in the same region the variants separated by less than 5 Mb. A total of 10 regions (2 for chromosome 15 and 1 for each of the remaining autosomes) were obtained, ranging from about 6 to about 1.5 Mb. Similarly, to analyze the other phenotype-informative markers (diet, immunity, and diseases), 2 Mb around each variant was selected, and the overlapping regions were merged, for a total of 47 regions (45 regions in 17 autosomes and 2 regions on chrX). For the local imputation, we used a two-step pipeline (97) as follows: (i) variant calling, (ii) first imputation step using a reference panel as much similar as possible to the target samples, (iii) variant filtering, (iv) second imputation step using a larger worldwide reference panel, and (v) final variant filtering. This pipeline has been validated by randomly downsampling a high-coverage Neolithic sample (NE1) (98) to 0.05 and comparing the imputed variants in the low-coverage version with the called variants from the original genome. For a local imputation approach on 2 Mb, we obtained a concordance rate higher than 90% for all the variants, a figure that increased to 99% for frequent variants (MAF 0.3). The variants were called using ATLAS v0.9.0 (99) (task = call and method = MLE commands) (step 1) at biallelic SNPs with a MAF 0.1% in a reference panel composed of more than 2000 high-coverage Estonian genomes (EGC) (74). The variants were called separately for each sample and merged in one VCF file per chromosomal region. The merged VCFs were used as input for the first step of our two-step imputation pipeline [genotype likelihood update; -gl command on Beagle 4.1 (100)], using the EGC panel as reference (step 2). Then, the variants with a genotype probability (GP) less than 0.99 were discarded (step 3), and the missing genotype was imputed with the -gt command of Beagle 5.0 (101) using the large HRC as reference panel (102), with the exception of variants rs333 and rs2430561 [not present in the HRC (Haplotype Reference Consortium)], imputed using the 1000 Genomes as reference panel (step 4) (103). Last, a second GP filter was applied to keep variants with GP 0.85 (step 5). Then, the 113 phenotype-informative SNPs were extracted, recoded, and organized in tables, using VCFtools (104), PLINK 1.9 (http://pngu.mgh.harvard.edu/purcell/plink/) (84), and R (91) (tables S21 and S22). The HIrisPlex-S variants were uploaded on the HIrisPlex webtool (https://hirisplex.erasmusmc.nl/) to perform the pigmentation prediction, after tabulating them according to the manual of the tool. Out of 41 variants of the HIrisPlex-s system, two markers were not analyzed, namely, the rs312262906 indel and the rare (MAF = 0 in the HRC) rs201326893 SNP, because of the difficulties in the imputation of such variants.
The 28 samples analyzed here were compared with 34 ancient samples from surrounding geographical regions from literature, gathering them in seven groups according to their region and/or culture: (i) 3 Western Russian Stone Age HGs (present study); (ii) 5 Latvian Mesolithic HGs (34); (iii) 7 Estonian and Latvian Corded Ware Culture farmers [present study and (27, 34)]; (iv) 24 Fatyanovo Culture individuals (present study); (v) 10 Estonian Bronze Age individuals (28); (vi) 9 Estonian and Ingrian Iron Age individuals (28); (vii) 4 Estonian Middle Age individuals (28). For each variant, an analysis of variance (ANOVA) test was performed between the seven groups, applying Bonferronis correction by the number of tested variants to set the significance threshold (table S20). For the significant variant, a Tukey test was performed to identify the significant pairs of groups.
See the rest here:
Genetic ancestry changes in Stone to Bronze Age transition in the East European plain - Science Advances