###################################
#           PopSeq2Geno           #
# Population Sequence to Genotype #
###################################

Contact: biojoiner@gmail.com

Citation: Fu L, Cai C, Cui Y, Wu J, Liang J, Cheng F and Wang X (2016).
Pooled Mapping: An Efficient Method of Calling Variations for Population 
Samples with Low-depth Resequencing Data.

-----INTRODUCTION------------
PopSeq2Geno is a tool for rapidly and accurately calling variations in population
samples with low-depth resequencing data. So far, PopSeq2Geno has been
successfully trained for calling variations in Brassica species.

-----INSTALLATION------------
BWA (http://github.com/lh3/bwa) and SAMtools (http://samtools.sourceforge.net)
must be installed first. PopSeq2Geno itself is a Perl script and doesn't
require installation.

-----RUNNING PopSeq2Geno-----
USAGE:
perl PopSeq2Geno.pl [options]
[options] Type 'perl PopSeq2Geno.pl' in a Linux shell for a list of options.

INPUT:
-inDir    [required]
The directory containing the population resequencing FASTQ data. The paired-end
files must be named '*_1.fq.gz' and '*_2.fq.gz'.
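
Before a run it can be useful to check that every read-1 file in the input directory
has its read-2 mate. The following is only a sketch of such a check, not part of
PopSeq2Geno; the function name is made up for illustration.

```python
import os


def pair_fastq_names(file_names):
    """Group file names into read-1/read-2 pairs keyed by sample prefix.

    Returns (pairs, unpaired): pairs maps prefix -> (read1, read2), and
    unpaired lists '*_1.fq.gz' / '*_2.fq.gz' files whose mate is absent.
    Files that match neither pattern are ignored.
    """
    names = set(file_names)
    pairs, unpaired = {}, []
    for name in sorted(names):
        for suffix, mate_suffix in (("_1.fq.gz", "_2.fq.gz"),
                                    ("_2.fq.gz", "_1.fq.gz")):
            if name.endswith(suffix):
                prefix = name[: -len(suffix)]
                mate = prefix + mate_suffix
                if mate not in names:
                    unpaired.append(name)
                elif suffix == "_1.fq.gz":
                    pairs[prefix] = (name, mate)
                break
    return pairs, unpaired


# Typical use (the directory name is a placeholder):
# pairs, unpaired = pair_fastq_names(os.listdir("reads"))
```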

-outDir   [required]
The directory where the filtered reads and other temporary files are stored.

-qual     [required]
File giving the Illumina quality version of the reads: 1.8 (for all versions >=1.8) or
1.3 (for all versions <1.8). On each line, the file name and the quality version are
separated by a tab.
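
A small sketch of how such a file could be validated before a run is shown below. It
assumes one 'file name<Tab>version' record per line and nothing else in the file; the
function name is hypothetical and the check is not part of PopSeq2Geno.

```python
def parse_qual_file(lines):
    """Map each file name to its quality version ('1.8' or '1.3').

    Raises ValueError on lines that are not two tab-separated fields
    or that carry a version other than the two documented values.
    """
    versions = {}
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # tolerate blank lines (an assumption)
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected 2 tab-separated fields")
        name, version = fields[0].strip(), fields[1].strip()
        if version not in ("1.8", "1.3"):
            raise ValueError(f"line {line_no}: version must be 1.8 or 1.3")
        versions[name] = version
    return versions


# Typical use (the file name is a placeholder):
# with open("quality.txt") as handle:
#     versions = parse_qual_file(handle)
```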

-ref      [required]
The reference genome in FASTA format.
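
With the four required options described above, a minimal invocation could be assembled
as sketched here. This assumes each option is followed by its value as a separate
argument, which the option list suggests but does not state; all paths are placeholders.

```python
import shlex


def build_command(in_dir, out_dir, qual_file, ref_fasta):
    """Return the argument list for a run with only the required options."""
    return [
        "perl", "PopSeq2Geno.pl",
        "-inDir", in_dir,
        "-outDir", out_dir,
        "-qual", qual_file,
        "-ref", ref_fasta,
    ]


command = build_command("reads", "out", "qual.txt", "ref.fa")
# Print the command in a form that can be pasted into a shell.
print(shlex.join(command))
# To launch it from Python instead:
# import subprocess; subprocess.run(command, check=True)
```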

-Gsize    [optional]
The size of the genome in Gb. By default, it is calculated automatically from the
reference genome.
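
For orientation, a genome size in Gb can be derived from a FASTA file by summing the
sequence lines. The sketch below is one plausible way to do this; whether PopSeq2Geno
counts ambiguous bases or uses 10^9 bases per Gb is not stated, so both are assumptions.

```python
def genome_size_gb(fasta_lines):
    """Sum the lengths of all sequence lines and return the total in Gb.

    Header lines (starting with '>') are skipped; every other character,
    including N, is counted. 1 Gb is taken as 10**9 bases.
    """
    total_bases = 0
    for line in fasta_lines:
        if line.startswith(">"):
            continue
        total_bases += len(line.strip())
    return total_bases / 1e9


# Typical use (the file name is a placeholder):
# with open("ref.fa") as handle:
#     print(genome_size_gb(handle))
```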

-folds    [optional]
The sequencing depth, in folds of genome coverage, of each sample to use for calling. The
program randomly selects a small part of the data to determine polymorphic loci, even if
high-depth resequencing data were produced.
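
To get a feel for how much data a given fold value corresponds to, the usual coverage
relation (coverage = sequenced bases / genome size) can be inverted. This sketch assumes
paired-end reads of a fixed length; it is not taken from the program, whose sampling
procedure is not described here.

```python
import math


def read_pairs_for_coverage(folds, genome_size_gb, read_length):
    """Number of read pairs needed to reach `folds` coverage.

    Each pair contributes 2 * read_length bases; 1 Gb is taken as 10**9 bases.
    """
    bases_needed = folds * genome_size_gb * 1e9
    return math.ceil(bases_needed / (2 * read_length))


# Example: 1-fold coverage of a 0.5 Gb genome with 100 bp paired-end reads.
print(read_pairs_for_coverage(1, 0.5, 100))
```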

-build    [optional]
Build index for the reference [1], or don't build it [0] in case it has been built previously.
Default [1].

-pop      [optional]
The file of polymorphic loci to use for variation calling, if one exists. If no such file
exists, the program performs pooled mapping by default [0]. The file should be formatted like this:
chrid  position reference-allele reference-allele variation-allele  reference-reads-depth variation-reads-depth
A01	679	G	G	T	20	47
Every line has seven columns, separated by tabs.
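
The seven-column layout can be read with a few lines of code. The sketch below follows
the column order of the header line above; because that header names 'reference-allele'
twice, the fourth field is kept under a separate, made-up name rather than interpreted.

```python
from typing import NamedTuple


class PolymorphicLocus(NamedTuple):
    chrid: str
    position: int
    reference_allele: str
    reference_allele_repeat: str  # header lists 'reference-allele' twice
    variation_allele: str
    reference_depth: int
    variation_depth: int


def parse_pop_line(line):
    """Parse one tab-separated line of the polymorphic-loci file."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 7:
        raise ValueError(f"expected 7 tab-separated columns, got {len(fields)}")
    chrid, pos, ref, ref_repeat, var, ref_depth, var_depth = fields
    return PolymorphicLocus(chrid, int(pos), ref, ref_repeat, var,
                            int(ref_depth), int(var_depth))


locus = parse_pop_line("A01\t679\tG\tG\tT\t20\t47\n")
# Share of reads supporting the variation allele at this locus.
variation_fraction = locus.variation_depth / (
    locus.reference_depth + locus.variation_depth)
```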

OUTPUT:
-outStat  [optional]
File to store the statistics collected while filtering raw reads. By default, the file
'readsFilter.statistics' is generated.

-sam      [optional]
File to store the samples used in variation calling. By default, the file 'samples' is
generated.

-pos      [optional]
File to store the genomic positions of the variations called. By default, the file
'positions.loci' is generated. Every line represents a SNP locus. The file has the
following format:
chrid	position	reference-allele	variation-allele
A01	679	G	T
Every line has four columns, separated by tabs.

-geno     [optional]
File to store the genotypes of the variations called. By default, the file 'genotypes.loci'
is generated; it lists all SNP genotypes of the sample population. Every two columns in this
file represent one individual, in the order of the sample IDs in the file 'samples', and
every line represents a SNP locus whose physical position is stored in the file
'positions.loci'. Loci whose genotypes are identical to the reference in some samples (but
differ in other samples) are also listed, with the genotype of the reference given for
those samples.
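
The relationship between the three output files can be sketched as follows. This is an
illustration under several assumptions not stated above: that 'samples' holds one ID per
line, that the columns of 'genotypes.loci' are whitespace-separated, and that line i of
'genotypes.loci' corresponds to line i of 'positions.loci'. The two columns of each
individual are returned as-is, without interpreting their content.

```python
def genotypes_by_sample(sample_ids, position_lines, genotype_lines):
    """Yield ((chrid, position), {sample_id: (col_a, col_b)}) for each locus."""
    for pos_line, geno_line in zip(position_lines, genotype_lines):
        chrid, position, _ref, _var = pos_line.rstrip("\r\n").split("\t")
        columns = geno_line.split()
        if len(columns) != 2 * len(sample_ids):
            raise ValueError(
                f"{chrid}:{position}: expected {2 * len(sample_ids)} columns, "
                f"got {len(columns)}")
        calls = {
            sample_id: (columns[2 * i], columns[2 * i + 1])
            for i, sample_id in enumerate(sample_ids)
        }
        yield (chrid, int(position)), calls


# Typical use (file names are the documented defaults):
# with open("samples") as s, open("positions.loci") as p, open("genotypes.loci") as g:
#     ids = [line.strip() for line in s if line.strip()]
#     for locus, calls in genotypes_by_sample(ids, p, g):
#         print(locus, calls)
```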

NOTES:
1. We provide examples in the 'example' directory. Users can download these files to test the
program. All the files mentioned above have an example in the 'examples_for_file_format' directory.

2. We generate two genotype files: 1) genotypes.loci; 2) genotypes.loci.imputated. The first
file stores genotypes called purely from the resequencing data; ungenotyped loci are recorded
as 'NN'. The second file stores the genotypes from the first file together with imputed
genotypes for the ungenotyped loci. Users can therefore easily separate the two and choose
whether to use genotypes called from resequencing or genotypes produced by imputation.
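
One way to separate called from imputed genotypes is to compare the two files line by
line and report the fields that differ. The sketch below assumes both files have the same
number of lines and whitespace-separated fields in the same order; the function name is
hypothetical.

```python
def imputed_fields(called_line, imputed_line):
    """Return the 0-based indices of fields that differ between the two lines."""
    called = called_line.split()
    imputed = imputed_line.split()
    if len(called) != len(imputed):
        raise ValueError("lines have different numbers of fields")
    return [i for i, (a, b) in enumerate(zip(called, imputed)) if a != b]


# Typical use (file names as given in the note above):
# with open("genotypes.loci") as called, open("genotypes.loci.imputated") as imputed:
#     for line_no, (c, m) in enumerate(zip(called, imputed), start=1):
#         changed = imputed_fields(c, m)
#         if changed:
#             print(line_no, changed)
```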

3. We use the k-NN algorithm to perform the imputation. Users are free to choose other imputation
methods, such as IMPUTE (https://mathgen.stats.ox.ac.uk/impute/impute.html).
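
For readers unfamiliar with k-NN imputation, a generic sketch is given below. It is not
the implementation used by PopSeq2Geno: the distance measure (fraction of mismatching
genotypes), the majority vote, and the value of k are all assumptions chosen for
illustration, and each genotype is treated as a single string with 'NN' meaning missing.

```python
from collections import Counter

MISSING = "NN"


def distance(a, b):
    """Fraction of mismatching loci among those genotyped in both samples."""
    shared = [(x, y) for x, y in zip(a, b) if x != MISSING and y != MISSING]
    if not shared:
        return 1.0  # nothing to compare: treat as maximally distant
    return sum(x != y for x, y in shared) / len(shared)


def knn_impute(genotypes, k=3):
    """Fill missing genotypes by majority vote among the k nearest samples.

    genotypes: {sample_id: [genotype at locus 0, genotype at locus 1, ...]}
    Returns a new dict; the input is left unchanged.
    """
    imputed = {sample: list(calls) for sample, calls in genotypes.items()}
    for sample, calls in genotypes.items():
        # Rank all other samples once, nearest first.
        neighbours = sorted(
            (other for other in genotypes if other != sample),
            key=lambda other: distance(calls, genotypes[other]),
        )
        for locus, call in enumerate(calls):
            if call != MISSING:
                continue
            # Take the k nearest neighbours that are genotyped at this locus.
            votes = [genotypes[o][locus] for o in neighbours
                     if genotypes[o][locus] != MISSING][:k]
            if votes:
                imputed[sample][locus] = Counter(votes).most_common(1)[0][0]
    return imputed
```

A locus stays 'NN' when no other sample is genotyped there.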