- Thu 10 October 2019
- How-to
- J. Lucas Boatwright
- #bioinformatics, #education, #gwas, #gp
Among the many tools used for GWAS and Genomic Prediction, there seems to equally as many file formats to navigate. Running multiple different algorithms then requires switching between file formats, which can be a tedious if not time-absorbing task. For both my future sanity as well as yours, I've provided a brief overview of some common formats and recommended routes for file conversion.
To start, I highly recommend that all genotype files be originally generated in VCF format. If you can do so following the GATK best practices for variant calling, all the better. The reason I recommend this is because VCF files are generally the most descriptive of all the genotype formats and conversion from VCF format to others results in a loss of informaton.
Once you have a VCF file, genotype file conversion is relatively simple. You can use Tassel which provides options to convert among several genotype file formats:
VCF | HapMap | HapMap Diploid | Plink | HDF5 | Phylip (interleaved or sequential) | Table
This may be done using the Tassel GUI (graphical user interface) or on command line.
Alternatively, you can use vcftools to convert a VCF or BCF (binary VCF) to the following formats:
BEAGLE | IMPUTE | LDhat | Plink | Plink TPED (transposed) | Numeric Matrix
Conversion among genotype files is typically relatively simple but can be more complicated when converting between phased or unphased genotypes or when files with multi-allelic sites need to be converted to contain only bi-allelic sites.
As for phenotype formats, the main issue is typically determining what format you need for your program of choice and then figuring out how best to convert it.
Plink format for phenotypes has the following tab-delimited structure:
- FID = Family ID
- IID = Individual ID
In this example, I don't have FIDs so the IID is just repeated for both.
FID | IID | Phenotype1 | ... | PhenotypeN |
---|---|---|---|---|
Indiv1 | Indiv1 | 0 | ... | 0 |
Indiv2 | Indiv2 | 3 | ... | 2 |
This is slightly different from the Tassel tab-delimited format for phenotypes:
<Trait> | Phenotype1 | ... | PhenotypeN |
---|---|---|---|
Indiv1 | 0 | ... | 0 |
Indiv2 | 3 | ... | 2 |
Tassel will allow you to save a phenotype file into either Tassel or Plink format.
If you're running GAPIT, then the phenotype file is read into an R dataframe. That means that the delimiter used is less important since files can be read with "read.csv" or "read.table". However, you will generally want the file to looks like this:
Taxa | Phenotype1 | ... | PhenotypeN |
---|---|---|---|
Indiv1 | 0 | ... | 0 |
Indiv2 | 3 | ... | 2 |
And, when you read the file, you want to designate that there are headers for the phenotype file ("head=TRUE"). Don't do this for the genotype file, though (in HapMap format). Even though there is a header, it should be read as "head=FALSE" so that the header occurs on the first row.
It's also worth noting that while the first 11 HapMap columns are required for GAPIT, only three of them are used ("rs" a.k.a. SNP name, "chrom" and "pos"). So, the other eight columns may be filled with "NA".
Now, let's assume instead that you want to run Plink or GEMMA. GEMMA can use Plink file formats, so let's use that common format. Conversion from VCF to Plink files is easily acheived using Tassel or vcftools.
To begin, Plink requires both a PED (pedigree) and MAP (genetic map) file. Plink PED file format requires all markers be biallelic and the file look like so (header included here for clarity -- not in actual PED file):
FamilyID | IndividualID | PaternalID | MaternalID | Sex | Phenotype | Snp1 | ... | SnpN |
---|---|---|---|---|---|---|---|---|
Indiv1 | Indiv1 | 0 | 0 | 0 | 0 | A | ... | G |
Indiv2 | Indiv2 | 0 | 0 | 0 | 0 | A | ... | G |
And the Plink MAP file looks like (again, header included for clarity only):
chromosome | SnpID | GeneticDistance | BasePairPosition |
---|---|---|---|
1 | S01_3918 | 0 | 3918 |
1 | S01_12215 | 0 | 12215 |
1 | S01_23664 | 0 | 23664 |
# This will generate the PED and MAP files -- note: Only biallelic loci will be output
# Tassel can also generate the PED and MAP files, but Tassel retains multiallelic loci
# which is a problem for GEMMA
vcftools --vcf genotypes.vcf --plink --out genotypes
# This will generate BED (binary PED), NoSex (if --allow-no-sex is included),
# FAM (PLINK sample information file), BIM (PLINK extended MAP file) and log files
plink --file genotypes --pheno phenotypes.txt --allow-no-sex --make-bed --out genotypes
The BED, BIM and FAM files are referred to as the Plink binary fileset. These are what you want to run GEMMA and Plink.
The last style of genotype matrix typically seen is the numeric genotype matrix. Depending on the software, you may want this matrix in 0, 1, 2 format or in -1, 0, 1. This matrix format is useful in genomic prediction as a design matrix or during the calculation of a genomic relationship matrix.
Conversion to 012 is easily accomplished using vcftools.
vcftools --vcf my.vcf --012 --out geno_matrix
Then, conversion to a -101 format simply requires that the matrix be read into either R or python and one be subtracted from the matrix. For example:
# Commented code for a small-scale example
# myMatrix <- matrix(c(c(0,1,2),c(2,1,0),c(1,2,0)), nrow=3)
myMatrix <- read.csv("geno_matrix.012", head=FALSE)
myMatrix <- myMatrix - 1
Do note, however, that different programs require missing data in particular formats. Here, the 012 matrix has missing as -1 and the -101 has -2. So, change for NAs as necessary.
That should cover the majority of file conversions needed for performing GWAS or Genomic Prediction. Good luck!