How-to: Edit GWAS and Genomic Prediction File Formats


Among the many tools used for GWAS and Genomic Prediction, there seems to equally as many file formats to navigate. Running multiple different algorithms then requires switching between file formats, which can be a tedious if not time-absorbing task. For both my future sanity as well as yours, I've provided a brief overview of some common formats and recommended routes for file conversion.

To start, I highly recommend that all genotype files be originally generated in VCF format. If you can do so following the GATK best practices for variant calling, all the better. The reason I recommend this is because VCF files are generally the most descriptive of all the genotype formats and conversion from VCF format to others results in a loss of informaton.

Once you have a VCF file, genotype file conversion is relatively simple. You can use Tassel which provides options to convert among several genotype file formats:

VCF | HapMap | HapMap Diploid | Plink | HDF5 | Phylip (interleaved or sequential) | Table

This may be done using the Tassel GUI (graphical user interface) or on command line.

Alternatively, you can use vcftools to convert a VCF or BCF (binary VCF) to the following formats:

BEAGLE | IMPUTE | LDhat | Plink | Plink TPED (transposed) | Numeric Matrix

Conversion among genotype files is typically relatively simple but can be more complicated when converting between phased or unphased genotypes or when files with multi-allelic sites need to be converted to contain only bi-allelic sites.

As for phenotype formats, the main issue is typically determining what format you need for your program of choice and then figuring out how best to convert it.

Plink format for phenotypes has the following tab-delimited structure:

  • FID = Family ID
  • IID = Individual ID

In this example, I don't have FIDs so the IID is just repeated for both.

FID IID Phenotype1 ... PhenotypeN
Indiv1 Indiv1 0 ... 0
Indiv2 Indiv2 3 ... 2

This is slightly different from the Tassel tab-delimited format for phenotypes:

<Trait> Phenotype1 ... PhenotypeN
Indiv1 0 ... 0
Indiv2 3 ... 2

Tassel will allow you to save a phenotype file into either Tassel or Plink format.

If you're running GAPIT, then the phenotype file is read into an R dataframe. That means that the delimiter used is less important since files can be read with "read.csv" or "read.table". However, you will generally want the file to looks like this:

Taxa Phenotype1 ... PhenotypeN
Indiv1 0 ... 0
Indiv2 3 ... 2

And, when you read the file, you want to designate that there are headers for the phenotype file ("head=TRUE"). Don't do this for the genotype file, though (in HapMap format). Even though there is a header, it should be read as "head=FALSE" so that the header occurs on the first row.

It's also worth noting that while the first 11 HapMap columns are required for GAPIT, only three of them are used ("rs" a.k.a. SNP name, "chrom" and "pos"). So, the other eight columns may be filled with "NA".

Now, let's assume instead that you want to run Plink or GEMMA. GEMMA can use Plink file formats, so let's use that common format. Conversion from VCF to Plink files is easily acheived using Tassel or vcftools.

To begin, Plink requires both a PED (pedigree) and MAP (genetic map) file. Plink PED file format requires all markers be biallelic and the file look like so (header included here for clarity -- not in actual PED file):

FamilyID IndividualID PaternalID MaternalID Sex Phenotype Snp1 ... SnpN
Indiv1 Indiv1 0 0 0 0 A ... G
Indiv2 Indiv2 0 0 0 0 A ... G

And the Plink MAP file looks like (again, header included for clarity only):

chromosome SnpID GeneticDistance BasePairPosition
1 S01_3918 0 3918
1 S01_12215 0 12215
1 S01_23664 0 23664
# This will generate the PED and MAP files -- note: Only biallelic loci will be output
# Tassel can also generate the PED and MAP files, but Tassel retains multiallelic loci
#     which is a problem for GEMMA
vcftools --vcf genotypes.vcf --plink --out genotypes 

# This will generate BED (binary PED), NoSex (if --allow-no-sex is included), 
# FAM (PLINK sample information file), BIM (PLINK extended MAP file)  and log files
plink --file genotypes --pheno phenotypes.txt --allow-no-sex --make-bed --out genotypes

The BED, BIM and FAM files are referred to as the Plink binary fileset. These are what you want to run GEMMA and Plink.

The last style of genotype matrix typically seen is the numeric genotype matrix. Depending on the software, you may want this matrix in 0, 1, 2 format or in -1, 0, 1. This matrix format is useful in genomic prediction as a design matrix or during the calculation of a genomic relationship matrix.

Conversion to 012 is easily accomplished using vcftools.

vcftools --vcf my.vcf --012 --out geno_matrix

Then, conversion to a -101 format simply requires that the matrix be read into either R or python and one be subtracted from the matrix. For example:

# Commented code for a small-scale example
# myMatrix <- matrix(c(c(0,1,2),c(2,1,0),c(1,2,0)), nrow=3)

myMatrix <- read.csv("geno_matrix.012", head=FALSE)
myMatrix <- myMatrix - 1

Do note, however, that different programs require missing data in particular formats. Here, the 012 matrix has missing as -1 and the -101 has -2. So, change for NAs as necessary.

That should cover the majority of file conversions needed for performing GWAS or Genomic Prediction. Good luck!