
GALF-P for TFBS Identification (Motif Discovery)
Genetic Algorithm with Local Filtering and Post-processing


Overview

  • GALF-P is a novel framework for TFBS identification (motif discovery) in DNA sequences. It consists of a Genetic Algorithm with Local Filtering (GALF) and a post-processing procedure based on adaptive adding and removing. GALF-P is both effective and efficient, and performs more reliably than other state-of-the-art GA-based approaches. The post-processing procedure is designed to handle zero or more TFBSs in each sequence.

Download


Usage

  • Win32 Version: In cmd: GALF_P.exe -w width -i input_file -o output_file [-l MAXLEN] [-m MAXINS] [-b BETA] [-r MAXRUN] [-g Gen] [-p Pop] [-u Muta] [-c Cross] [-f Flag] [-a PGALF]
  • Linux Version: ./GALF_P.o -w width -i input_file -o output_file [-l MAXLEN] [-m MAXINS] [-b BETA] [-r MAXRUN] [-g Gen] [-p Pop] [-u Muta] [-c Cross] [-f Flag] [-a PGALF]
  • Options:
    Compulsory arguments:
     
    -w
    width is the desired motif width (input by the user)
     
    -i
    input_file is the fasta input file to read in
     
    -o
    output_file is the output file to store the results in detail (refer to the examples for more details)
    Optional configuration arguments:
     
    -l
    MAXLEN--maximal sequence length (default: 1000; >=500)
     
    -m
    MAXINS--maximal number of possible instances in each sequence (default: 20; >=1)
     
    -b
    BETA--the fitness gain threshold in post-processing (default: 0.001; >=0)
     
    -r
    MAXRUN--number of runs of GALF before post-processing (default: 20; >=5)
     
    -g
    Gen--maximal number of generations (default: 300; >=300)
     
    -p
    Pop--population size (default: 500; >=500)
     
    -u
    Muta--mutation rate (default: 0.9; >=0 and <=1)
     
    -c
    Cross--crossover rate (default: 0.3; >=0 and <=1)
    Optional control arguments:
     
    -f
    Flag--to control whether post-processing (adding and/or removing) is performed: 0 (default): both adding and removing; 1: adding only; 2: none
     
    -a
    PGALF--whether to print the results of GALF to the output file: 0 (default): No; 1: Yes
    Note: The convergence generation number is fixed at 50 in GALF. Argument flags are accepted in either upper or lower case. If an input parameter falls outside its listed constraint, it is automatically set to the nearest allowed (maximal or minimal) value. The actual computation time is recorded in the output file.
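
  • Illustration: The automatic adjustment of out-of-range parameters described in the note can be sketched in Python as follows. This is a hypothetical sketch and not GALF-P's actual code: the names BOUNDS and clamp_option are invented for illustration, and the bounds are simply copied from the option list.

```python
# Hypothetical sketch of the parameter adjustment described in the note;
# this is NOT taken from the GALF-P source code.
import math

# (minimum, maximum) for each optional configuration flag,
# copied from the option list. math.inf means "no upper bound listed".
BOUNDS = {
    "l": (500, math.inf),   # MAXLEN
    "m": (1, math.inf),     # MAXINS
    "b": (0.0, math.inf),   # BETA
    "r": (5, math.inf),     # MAXRUN
    "g": (300, math.inf),   # Gen
    "p": (500, math.inf),   # Pop
    "u": (0.0, 1.0),        # Muta
    "c": (0.0, 1.0),        # Cross
}


def clamp_option(flag, value):
    """Return value moved to the nearest allowed value for the given flag.

    The flag is matched case-insensitively, mirroring the note that
    upper- and lower-case arguments are both acceptable.
    """
    low, high = BOUNDS[flag.lower()]
    return min(max(value, low), high)
```

    Under these assumptions, clamp_option("r", 3) yields 5 (the minimum for MAXRUN), while clamp_option("g", 3000) is returned unchanged.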


Examples

  • Note: The input file can be any text file, but it must be in FASTA format. GALF-P currently does not support IUPAC codes and does not distinguish between upper and lower case for A, T, C and G. The background frequencies of A, T, C and G are calculated from the input sequences. The version for double-strand search is available here. The following examples use the Win32 version; please change the command accordingly on Linux.
  • Sample datasets: There are two real sample input files for the transcription factor MEF2. Each file contains a set of 17 sequences of 200 bp, in which 17 TFBSs of width 7 are embedded. The two files are identical except that the TFBSs in the first dataset are capitalized (so that the results in the output file are easy to check), while those in the second are not. Additionally, there is a third dataset for MyOD containing 17 sequences of 200 bp, in which 21 TFBSs of width 6 are embedded.
  • Example 1: Suppose there is a set of DNA sequences of co-expressed genes, as in the first dataset (MEF2200bp.fa), and the TFBSs regulated by the corresponding transcription factor are to be found. The desired motif width is 7, and the results are to be stored in R5.txt. Then in cmd (command line) mode, GALF-P can be run as follows: GALF_P.exe -w 7 -i MEF2200bp.fa -o R5.txt. In this case the optional arguments take their default values. The configuration used is listed in R5.txt. The results include the starting position(s) of the instance(s) and the extracted instance(s) in each sequence. See R5.txt for more details.
  • Example 2: For the above example, GALF-P can also be run as follows: GALF_P.exe -w 7 -i MEF2200bp.fa -o R5_l.txt -c 0.5 -u 0.9 -r 10 -f 2. In this case, GALF is run 10 times with a crossover rate of 0.5 and a mutation rate of 0.9, and no post-processing is applied (-f 2). See the output file R5_l.txt for more details.
  • Example 3: For the MyOD dataset, GALF-P can be run as follows: GALF_P.exe -w 6 -i MyOD200bp.fa -o R6_2.txt -r 5 -f 1 -a 1 -g 3000. Here a maximum of 3000 generations is set and GALF is run 5 times. The best result from GALF, together with its fitness score (information content without pseudo-counts), is printed to the output file. In the post-processing, only the adding stage is performed. See the sample output R6_2.txt for more details.
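
  • Illustration: The notes above mention that background frequencies are calculated from the input sequences and that the fitness score is the information content without pseudo-counts. The Python sketch below shows one way these quantities could be computed. It is an assumption-laden illustration, not GALF-P's implementation: the function names are invented, and the relative-entropy form of information content (in bits) is a common definition that this page does not explicitly confirm.

```python
# Hypothetical sketch; function names and the exact formula are assumptions,
# not taken from GALF-P.
import math
from collections import Counter


def read_fasta(text):
    """Return the sequences found in FASTA-formatted text, upper-cased."""
    sequences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):          # header line starts a new record
            if current:
                sequences.append("".join(current))
            current = []
        elif line:
            current.append(line.upper())  # case is not distinguished
    if current:
        sequences.append("".join(current))
    return sequences


def background_frequencies(sequences):
    """Estimate A, C, G, T frequencies from all input sequences."""
    counts = Counter(b for seq in sequences for b in seq if b in "ACGT")
    total = sum(counts.values())
    return {b: counts[b] / total for b in "ACGT"}


def information_content(instances, background):
    """Relative-entropy information content (bits), without pseudo-counts.

    instances: equal-width, upper-case motif instances.
    Bases that never occur in a column contribute nothing to the sum.
    """
    score = 0.0
    for column in zip(*instances):        # iterate over motif positions
        for base, n in Counter(column).items():
            freq = n / len(column)
            score += freq * math.log2(freq / background[base])
    return score
```

    To try it on the sample data, one could read MEF2200bp.fa with read_fasta, pass the result to background_frequencies, and score the instances reported in an output file with information_content. Whether the score matches the one GALF-P prints depends on the assumptions stated above.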

Supplementary Material


Contact

Email: tmchan at cse dot cuhk dot edu dot hk

Last update: 23/Nov/2007