###################################
#           PopSeq2Geno           #
# Population Sequence to Genotype #
###################################

Contact: biojoiner@gmail.com

Citation: Fu L, Cai C, Cui Y, Wu J, Liang J, Cheng F and Wang X (2016).
Pooled Mapping: An Efficient Method of Calling Variations for Population
Samples with Low-depth Resequencing Data.

-----INTRODUCTION------------
PopSeq2Geno is a tool to rapidly and accurately call variations for
population samples with low-depth re-sequencing data. Currently, PopSeq2Geno
has been successfully trained for calling variations in Brassica species.

-----INSTALLATION------------
BWA (http://github.com/lh3/bwa) and SAMtools (http://samtools.sourceforge.net)
should be installed first. PopSeq2Geno is a Perl script and doesn't require
installation.

-----RUNNING PopSeq2Geno-----
USAGE:
  perl PopSeq2Geno.pl [options] [options]

  Type 'perl PopSeq2Geno.pl' in a Linux shell for a list of options.

INPUT:
  -inDir [required]
      The directory where the population re-sequencing fastq data are
      stored. The paired-end files should be named '*_1.fq.gz' and
      '*_2.fq.gz'.
  -outDir [required]
      The directory where the filtered reads and other temporary files are
      stored.
  -qual [required]
      File giving the Illumina quality version of the reads: 1.8 (for all
      versions >=1.8) or 1.3 (for all versions <1.8). The file name and the
      quality version are separated by a tab.
  -ref [required]
      The reference genome in FASTA format.
  -Gsize [optional]
      The size of the genome, in Gb. Calculated automatically from the
      reference genome by default.
  -folds [optional]
      How many folds of data from each sample are used for the calling. The
      program randomly selects a small part of the data to determine
      polymorphic loci, even when high-depth resequencing data are available.
  -build [optional]
      Build the index for the reference [1], or don't build it [0] if it
      has been built previously. Default [1].
  -pop [optional]
      The file of polymorphic loci to use for variation calling, if one
      exists.
      If the file does not exist, the program performs pooled mapping by
      default [0]. The file should be formatted like this:

      chrid  position  reference-allele  reference-allele  variation-allele  reference-reads-depth  variation-reads-depth
      A01    679       G                 G                 T                 20                     47

      Every line has seven columns, separated by tabs.

OUTPUT:
  -outStat [optional]
      File to store the statistics collected while filtering raw reads.
      By default, the file 'readsFilter.statistics' is generated.
  -sam [optional]
      File to store the samples used in the variation calling. By default,
      the file 'samples' is generated.
  -pos [optional]
      File to store the genomic positions of the variations called. By
      default, the file 'positions.loci' is generated. Every line represents
      a SNP locus. The file is formatted like this:

      chrid  position  reference-allele  variation-allele
      A01    679       G                 T

      Every line has four columns, separated by tabs.
  -geno [optional]
      File to store the genotypes of the variations called. By default, the
      file 'genotypes.loci' is generated. All the SNP genotypes of the sample
      population are listed in 'genotypes.loci'. Every two columns in this
      file represent one individual, in the order of the sample IDs in the
      file 'samples'; every line represents a SNP locus whose physical
      position is stored in the file 'positions.loci'. Loci whose genotypes
      are identical to the reference in some samples (but differ in other
      samples) are also listed, with the genotype of the reference.

NOTES:
  1. We provide examples in the 'example' directory. Users can download
     these files to test the program. All the files mentioned above also
     have an example in the 'examples_for_file_format' directory.
  2. We generate two types of genotype file: 1) genotypes.loci;
     2) genotypes.loci.imputated. The first file stores genotypes called
     purely from the resequencing data; its ungenotyped loci are recorded
     as 'NN'.
     The second file stores both the genotypes in the first file and the
     imputed genotypes of the ungenotyped loci. Users can therefore easily
     choose between genotypes called from the resequencing data and
     genotypes produced by imputation.
  3. We used the k-NN algorithm to perform the imputation. Users are free
     to choose other imputation methods such as IMPUTE
     (https://mathgen.stats.ox.ac.uk/impute/impute.html).
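
EXAMPLE (reading the output files):
  The output files described above can be combined with a short script. The
  following is a minimal Python sketch, not part of PopSeq2Geno itself. It
  assumes that 'samples' holds one sample ID per line, that alleles are
  single characters (SNPs), and that 'genotypes.loci' holds two allele
  characters per sample; any whitespace between columns is ignored. The
  function names and the sample IDs S1 to S3 are hypothetical. Check these
  assumptions against the files in 'examples_for_file_format' before use.

```python
import io

MISSING = "NN"  # ungenotyped loci are recorded as 'NN' (see NOTES)


def read_samples(handle):
    """Return sample IDs in file order (assumption: one ID per line)."""
    return [line.strip() for line in handle if line.strip()]


def read_positions(handle):
    """Parse tab-separated lines: chrid, position, reference allele,
    variation allele. Lines without a numeric position are skipped."""
    loci = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4 or not fields[1].isdigit():
            continue  # blank line or a possible header line
        loci.append((fields[0], int(fields[1]), fields[2], fields[3]))
    return loci


def read_genotypes(handle, n_samples):
    """Return one list of two-character genotypes per locus.

    Assumption: every two allele columns belong to one sample, in the
    order of the 'samples' file; column separators are dropped.
    """
    rows = []
    for line in handle:
        alleles = "".join(line.split())
        if not alleles:
            continue
        if len(alleles) != 2 * n_samples:
            raise ValueError("expected %d alleles, found %d"
                             % (2 * n_samples, len(alleles)))
        rows.append([alleles[i:i + 2] for i in range(0, len(alleles), 2)])
    return rows


def missing_rate(samples, rows):
    """Fraction of loci recorded as 'NN' for each sample."""
    rates = {}
    for idx, sample in enumerate(samples):
        missing = sum(1 for row in rows if row[idx] == MISSING)
        rates[sample] = missing / len(rows) if rows else 0.0
    return rates


# Small in-memory demo; real use would open the files with open(path).
samples = read_samples(io.StringIO("S1\nS2\nS3\n"))
loci = read_positions(io.StringIO("A01\t679\tG\tT\nA01\t1024\tC\tA\n"))
rows = read_genotypes(
    io.StringIO("G\tG\tT\tT\tN\tN\nC\tC\tC\tA\tA\tA\n"), len(samples))

for locus, row in zip(loci, rows):
    print(locus, dict(zip(samples, row)))
print(missing_rate(samples, rows))
```

  The per-sample missing rate gives a quick indication of how many loci
  depend on imputation when 'genotypes.loci.imputated' is used instead of
  'genotypes.loci'.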