CLU: An implementation of the CLU's analysis algorithm, originally developed by Andrey Ptitsyn and Winston Hide.
#include <CLU.h>
Classes | |
class | CLUESTData |
Container for CLU's hash maps for ESTs. | |
Public Member Functions | |
virtual | ~CLU () |
The destructor. | |
virtual void | showArguments (std::ostream &os) |
Display valid command line arguments for this analyzer. | |
virtual bool | parseArguments (int &argc, char **argv) |
Process command line arguments. | |
virtual int | initialize () |
Method to begin EST analysis. | |
virtual std::string | getName () const |
Method to obtain human-readable name for this EST analyzer. | |
virtual float | getValidMetric () const |
Obtain a valid (or the best) metric generated by this analyzer. | |
virtual int | setReferenceEST (const int estIdx) |
Set the reference EST id for analysis. | |
Protected Member Functions | |
virtual float | getMetric (const int otherEST) |
Analyze and obtain a similarity metric. | |
virtual void | dumpEST (ResultLog &log, const EST *est, const bool isReference=false) |
Dumps a given EST in 3-column format using a ResultLog. | |
void | buildHashMaps (EST *est) |
Helper method to build reference and complement hash maps. | |
void | createCLUHashMap (int *&table, const char *sequence) |
Helper method to create the CLU hash/look up table. | |
void | filterHashMap (int *table, const int sequenceLength) |
Helper method to filter out certain entries from the reference hash map. | |
int | getSimilarity (const int *const hashTable, const char *const sequence) const |
Obtain similarity metric between reference sequence and a given sequence. | |
Static Protected Attributes | |
static int | abundanceFraction = 10 |
Parameter to define fraction value to compute abundance metric. | |
Private Member Functions | |
CLU (const int refESTidx, const std::string &outputFile) | |
The default constructor. | |
Static Private Attributes | |
static arg_parser::arg_record | argsList [] |
The set of arguments specific to the CLU program. | |
static char | CharToInt [] |
A simple array to map characters A, T, C, and G to 0, 1, 2, and 3 respectively. | |
Friends | |
class | ESTAnalyzerFactory |
This analyzer implements the similarity metric generation part of CLU, a clustering algorithm developed by Andrey Ptitsyn and Winston Hide. The reference for CLU is:
"CLU: A new algorithm for EST clustering", A. Ptitsyn and W. Hide, BMC Bioinformatics, 6(2), 2005. doi: 10.1186/1471-2105-6-S2-S3.
The implementation in this class has been developed by suitably adapting parts of the source code for CLU available from http://lamar.colostate.edu/~ptitsyn/
This class extends the FWAnalyzer base class. The FWAnalyzer base class provides most of the standard functionality for reading FASTA files, generating formatted output, and processing some parameters. This class adds the functionality to compare ESTs using CLU's similarity comparison metrics.
Definition at line 71 of file CLU.h.
CLU::~CLU | ( | ) | [virtual] |
The destructor.
The destructor frees any dynamic memory allocated by this object for its operations.
Definition at line 69 of file CLU.cpp.
References EST::deleteAllESTs().
CLU::CLU | ( | const int | refESTidx, | |
const std::string & | outputFile | |||
) | [private] |
The default constructor.
The default constructor for this class. The constructor is made private so that this class cannot be directly instantiated. However, since the ESTAnalyzerFactory is a friend of this class, an object can be instantiated via the ESTAnalyzerFactory::create() method.
[in] | refESTidx | The reference EST index value to be used when performing EST analysis. This parameter should be >= 0. This value is simply passed on to the base class. This parameter is not really used if this analyzer is used for clustering. |
[in] | outputFile | The name of the output file to which the EST analysis data is to be written. This parameter is ignored if this analyzer is used for clustering. If this parameter is the empty string then output is written to standard output. This value is simply passed onto the base class. |
Definition at line 57 of file CLU.cpp.
References CharToInt.
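The private-constructor-plus-friend arrangement described above can be illustrated with a minimal sketch. The class names `Analyzer` and `AnalyzerFactory` and the `create()` signature shown here are hypothetical stand-ins; the actual signature of ESTAnalyzerFactory::create() is not given on this page.

```cpp
#include <cassert>
#include <memory>
#include <string>

class AnalyzerFactory;  // hypothetical stand-in for ESTAnalyzerFactory

// Hypothetical stand-in for CLU: the constructor is private, so only
// friends (here, the factory) can create instances.
class Analyzer {
    friend class AnalyzerFactory;
public:
    int refIndex() const { return refESTidx; }
private:
    Analyzer(const int refESTidx, const std::string& outputFile)
        : refESTidx(refESTidx), outputFile(outputFile) {}
    int refESTidx;
    std::string outputFile;
};

class AnalyzerFactory {
public:
    // Assumed signature, for illustration only.
    static std::unique_ptr<Analyzer> create(const int refESTidx,
                                            const std::string& outputFile) {
        // std::make_unique cannot reach the private constructor,
        // so the friend class calls new directly.
        return std::unique_ptr<Analyzer>(new Analyzer(refESTidx, outputFile));
    }
};
```

With this layout, `Analyzer a(0, "");` fails to compile outside the factory, while `AnalyzerFactory::create(0, "")` succeeds.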
void CLU::buildHashMaps | ( | EST * | est | ) | [protected] |
Helper method to build reference and complement hash maps.
This is a helper method that is used to build the reference and complement hash maps for a given EST. If the reference (and complement) hash maps already exist for the given EST then this method exits immediately without rebuilding the hash maps. Otherwise it builds the reference and complement hash maps using the createCLUHashMap() method and populates the custom CLUESTData object associated with the EST.
[in,out] | est | Pointer to the EST whose reference and complement hash maps are to be built. |
Definition at line 150 of file CLU.cpp.
References ASSERT, CharToInt, createCLUHashMap(), EST::getCustomData(), and EST::getSequence().
Referenced by getMetric().
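The build-once behaviour described for buildHashMaps() can be sketched as follows. This is an illustration under assumptions, not the code in CLU.cpp: the `WordMaps` struct, `buildMapsOnce()`, and `countWords()` are hypothetical names, and the sketch assumes that the complement map is built from the reverse complement of the sequence, which this page does not state.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the per-EST custom data (CLUESTData).
struct WordMaps {
    std::vector<int> reference;
    std::vector<int> complement;
};

// Assumption: the complement strand is the reverse complement.
std::string reverseComplement(const std::string& seq) {
    std::string rc(seq.rbegin(), seq.rend());
    for (char& c : rc) {
        switch (c) {
            case 'A': c = 'T'; break;
            case 'T': c = 'A'; break;
            case 'C': c = 'G'; break;
            case 'G': c = 'C'; break;
            default: break;  // leave other characters untouched
        }
    }
    return rc;
}

// Count overlapping words using 2 bits per base (A=0, T=1, C=2, G=3).
std::vector<int> countWords(const std::string& seq, const int wordSize) {
    const int tableSize = 1 << (2 * wordSize);  // 4^wordSize entries
    std::vector<int> table(tableSize, 0);
    int hash = 0;
    for (int i = 0; i < static_cast<int>(seq.size()); i++) {
        const int code = (seq[i] == 'T') ? 1 : (seq[i] == 'C') ? 2
                       : (seq[i] == 'G') ? 3 : 0;
        hash = ((hash << 2) | code) & (tableSize - 1);
        if (i >= wordSize - 1) table[hash]++;
    }
    return table;
}

// Returns true if the maps were built, false if they already existed.
bool buildMapsOnce(WordMaps& maps, const std::string& seq, const int wordSize) {
    if (!maps.reference.empty()) return false;  // already built, exit early
    maps.reference  = countWords(seq, wordSize);
    maps.complement = countWords(reverseComplement(seq), wordSize);
    return true;
}
```

Caching the maps on the EST means repeated comparisons against the same EST do not pay the table-building cost again.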
void CLU::createCLUHashMap | ( | int *& | table, | |
const char * | sequence | |||
) | [protected] |
Helper method to create the CLU hash/look up table.
This method is invoked from the buildHashMaps() method to create the reference and complement hash tables required for comparing and processing other ESTs with the reference EST. This method operates as follows:
It initializes the table (if it is NULL) to hold 4^wordSize hash entries.
It resets all the entries to 0 (zero) in the table.
For each word of wordSize base pairs in the sequence, it computes the hash value of the word and increments the corresponding entry (using the hash value as the index) in the table.
It finally calls the filterHashMap() method to filter out low-complexity and abundant oligos.
[in,out] | table | The hash table to be populated by this method. |
[in] | sequence | The EST sequence to be used for populating the hash table. |
Definition at line 180 of file CLU.cpp.
References ASSERT, CharToInt, filterHashMap(), and FWAnalyzer::wordSize.
Referenced by buildHashMaps().
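The table-building steps listed for createCLUHashMap() can be sketched as below. This is a simplified illustration, not the code in CLU.cpp: `baseCode()` and `countWords()` are hypothetical names, the sketch assumes words overlap (the window slides one base at a time), and it omits the final filterHashMap() call.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Map a base to a 2-bit code, using the A, T, C, G -> 0, 1, 2, 3
// mapping documented for CharToInt. Other characters map to 0 here.
inline int baseCode(const char c) {
    switch (c) {
        case 'T': return 1;
        case 'C': return 2;
        case 'G': return 3;
        default:  return 0;
    }
}

// Count how often each wordSize-long word occurs in the sequence.
std::vector<int> countWords(const char* sequence, const int wordSize) {
    const int tableSize = 1 << (2 * wordSize);  // 4^wordSize entries
    std::vector<int> table(tableSize, 0);       // all entries reset to 0
    const int mask   = tableSize - 1;
    const int length = static_cast<int>(std::strlen(sequence));
    int hash = 0;
    for (int i = 0; i < length; i++) {
        // Rolling hash: shift in the new base, drop the oldest one.
        hash = ((hash << 2) | baseCode(sequence[i])) & mask;
        // A full word is available once wordSize bases have been read.
        if (i >= wordSize - 1) table[hash]++;
    }
    return table;
}
```

For example, with a word size of 2 the sequence "ATAT" contains the words AT, TA, and AT, so the entry for AT (index 1) is 2 and the entry for TA (index 4) is 1.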
void CLU::dumpEST | ( | ResultLog & | log, | |
const EST * | est, | |||
const bool | isReference = false | |||
) | [protected, virtual] |
Dumps a given EST in 3-column format using a ResultLog.
This method is a helper method that dumps a given EST out to the log. This method overrides the default implementation in the base class to perform its own custom operation.
[out] | log | The log to which the EST is to be dumped. |
[in] | est | The EST to be dumped. This parameter is never NULL. |
[in] | isReference | If this flag is true, then this EST is the reference EST to be dumped out. |
Reimplemented from FWAnalyzer.
Definition at line 119 of file CLU.cpp.
References EST::getInfo(), EST::getSequence(), EST::getSimilarity(), ESTAnalyzer::htmlLog, and ResultLog::report().
void CLU::filterHashMap | ( | int * | table, | |
const int | sequenceLength | |||
) | [protected] |
Helper method to filter out certain entries from the reference hash map.
This helper method is invoked from the createCLUHashMap() method to filter out certain entries from the hash map as they are not significant when comparing ESTs. Specifically, this method filters out non-informative and low-complexity sequences in the following manner:
First, zero out all simple oligos from consideration. The simple oligos are sequences of the form: "AAAAAACCCCCCGGGGGGTTTTTT".
Next, it removes abundant sequences (sequences that occur too frequently) from the table. Such abundant sequences are not informative in EST comparisons.
Finally, all non-zero entries are normalized to 1.
[in,out] | table | The table of hash values to be filtered and normalized by this method. This table must have been populated by a call to the createCLUHashMap() method. |
[in] | sequenceLength | The length of the EST sequence from which the referenceHashMap has been generated. |
Definition at line 220 of file CLU.cpp.
References abundanceFraction, and FWAnalyzer::wordSize.
Referenced by createCLUHashMap().
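The three filtering steps can be sketched as follows. This is an illustration under assumptions rather than the logic in CLU.cpp: `filterWordTable()` is a hypothetical name, "simple oligos" are taken to mean words made of a single repeated base, and a word is treated as abundant when its count exceeds `sequenceLength / abundanceFraction`. The exact rules used by CLU are not spelled out on this page.

```cpp
#include <cassert>
#include <vector>

// Filter a word-count table holding 4^wordSize entries (2 bits per base).
void filterWordTable(std::vector<int>& table, const int sequenceLength,
                     const int abundanceFraction = 10) {
    if (table.empty()) return;
    const int tableSize = static_cast<int>(table.size());

    // Step 1: zero out single-base words (AAA.., TTT.., CCC.., GGG..).
    // With 2 bits per base, the word made of base code b repeated has
    // hash b * 0b0101...01, and 0b0101...01 == (4^wordSize - 1) / 3.
    const int repeatUnit = (tableSize - 1) / 3;
    for (int b = 0; b < 4; b++) {
        table[b * repeatUnit] = 0;
    }

    // Steps 2 and 3: drop abundant words, normalize the rest to 1.
    const int threshold = sequenceLength / abundanceFraction;  // assumed rule
    for (int& count : table) {
        if (count > threshold) {
            count = 0;   // too frequent to be informative
        } else if (count != 0) {
            count = 1;   // keep only presence/absence
        }
    }
}
```

After filtering, the table acts as a presence bitmap: an entry of 1 means the word occurs in the sequence and is considered informative.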
float CLU::getMetric | ( | const int | otherEST | ) | [protected, virtual] |
Analyze and obtain a similarity metric.
This method can be used to compare a given EST with the reference EST (set via the call to the setReferenceEST() method).
[in] | otherEST | The index (zero based) of the EST with which the reference EST is to be compared. |
Reimplemented from FWAnalyzer.
Definition at line 349 of file CLU.cpp.
References ASSERT, buildHashMaps(), CLU::CLUESTData::complementHashMap, EST::getCustomData(), EST::getEST(), EST::getSequence(), getSimilarity(), CLU::CLUESTData::referenceHashMap, and ESTAnalyzer::refESTidx.
virtual std::string CLU::getName | ( | ) | const [inline, virtual] |
Method to obtain human-readable name for this EST analyzer.
This method provides a human-readable string identifying the EST analyzer. This string is typically used for display/debugging purposes (particularly via the PEACE Interactive Console).
Implements ESTAnalyzer.
int CLU::getSimilarity | ( | const int *const | hashTable, | |
const char *const | sequence | |||
) | const [protected] |
Obtain similarity metric between reference sequence and a given sequence.
This method performs the core task of comparing a given EST sequence with the reference sequence given a CLU hash table. This method operates as follows:
It computes the hash for each word in the first frame of the given sequence and obtains a categorized distribution (cd) index by adding the number of matching words found in the reference sequence.
For subsequent frames in the sequence, it uses the values computed for the first frame to compute the categorized distribution (cd) index value.
It adds one to the corresponding cd entry as indicated by the resulting cdIndex value.
Finally it determines the sum of multiplying the cd values with the thresholds (constant values obtained earlier by the original authors via Monte-Carlo type simulations) to obtain the similarity metric.
[in] | hashTable | The hash table from the reference sequence (either the referenceHashTable or complementHashTable) to be used for searching/comparison. |
[in] | sequence | The other sequence with which the reference sequence is to be compared. |
Definition at line 267 of file CLU.cpp.
References ASSERT, CharToInt, FWAnalyzer::frameSize, and FWAnalyzer::wordSize.
Referenced by getMetric().
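The frame-by-frame counting described above can be sketched in simplified form. This is not the code in CLU.cpp: `frameSimilarity()` and `baseCode()` are hypothetical names, the sketch assumes frames slide one base at a time, and the `weights` argument is a placeholder for the Monte-Carlo-derived thresholds, whose actual values are not given on this page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// A, T, C, G -> 0, 1, 2, 3 (other characters map to 0 in this sketch).
inline int baseCode(const char c) {
    switch (c) {
        case 'T': return 1;
        case 'C': return 2;
        case 'G': return 3;
        default:  return 0;
    }
}

// refTable must hold 4^wordSize entries; non-zero means "word present".
int frameSimilarity(const std::vector<int>& refTable, const char* sequence,
                    const int wordSize, const int frameSize,
                    const std::vector<int>& weights) {
    const int mask          = static_cast<int>(refTable.size()) - 1;
    const int length        = static_cast<int>(std::strlen(sequence));
    const int wordsPerFrame = frameSize - wordSize + 1;
    const int wordCount     = length - wordSize + 1;
    if (wordsPerFrame <= 0 || wordCount < wordsPerFrame) return 0;

    // hit[i] is 1 when the word starting at base i is in the reference.
    std::vector<int> hit(wordCount, 0);
    int hash = 0;
    for (int i = 0; i < length; i++) {
        hash = ((hash << 2) | baseCode(sequence[i])) & mask;
        if (i >= wordSize - 1) hit[i - wordSize + 1] = (refTable[hash] != 0);
    }

    // cd[k] counts the frames that contain exactly k matching words.
    std::vector<int> cd(wordsPerFrame + 1, 0);
    int matches = 0;
    for (int i = 0; i < wordsPerFrame; i++) matches += hit[i];  // first frame
    cd[matches]++;
    for (int start = 1; start + wordsPerFrame <= wordCount; start++) {
        // Reuse the previous count: add the entering word, drop the leaving one.
        matches += hit[start + wordsPerFrame - 1] - hit[start - 1];
        cd[matches]++;
    }

    // Weighted sum of the distribution (weights are placeholders).
    int metric = 0;
    for (std::size_t k = 0; k < cd.size() && k < weights.size(); k++) {
        metric += cd[k] * weights[k];
    }
    return metric;
}
```

Updating the match count incrementally keeps the cost linear in the sequence length instead of rescanning every word in every frame.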
virtual float CLU::getValidMetric | ( | ) | const [inline, virtual] |
Obtain a valid (or the best) metric generated by this analyzer.
This method can be used to obtain a valid metric value for this analyzer. This value can be used to initialize metric values. By default this method returns 0, which should be ideal for distance-based metrics.
Reimplemented from ESTAnalyzer.
int CLU::initialize | ( | ) | [virtual] |
Method to begin EST analysis.
This method is invoked just before commencement of EST analysis. This method first invokes the base class method that loads the ESTs from a given input multi-FASTA file and populates the list of ESTs. If the ESTs were successfully loaded, then this method initializes the custom data for each EST (with empty hash maps).
Reimplemented from FWAnalyzer.
Definition at line 101 of file CLU.cpp.
References EST::getESTList().
bool CLU::parseArguments | ( | int & | argc, | |
char ** | argv | |||
) | [virtual] |
Process command line arguments.
This method is used to process command line arguments specific to this EST analyzer. This method is typically used from the main method just after the EST analyzer has been instantiated. This method consumes all valid command line arguments. If the command line arguments were valid and successfully processed, then this method returns true.
[in,out] | argc | The number of command line arguments to be processed. |
[in,out] | argv | The array of command line arguments. |
Returns true if the command line arguments were successfully processed. Otherwise this method returns false. This method checks to ensure that a valid frame size and a valid word size have been specified.
Reimplemented from FWAnalyzer.
Definition at line 86 of file CLU.cpp.
References abundanceFraction, ESTAnalyzer::analyzerName, and arg_parser::check_args().
int CLU::setReferenceEST | ( | const int | estIdx | ) | [virtual] |
Set the reference EST id for analysis.
This method is invoked just before a batch of ESTs is analyzed via a call to the analyze(EST *) method. This method builds the hash table (used by CLU for searching/comparison) for the reference EST. The reference EST is also called Sq in CLU literature.
Reimplemented from FWAnalyzer.
Definition at line 134 of file CLU.cpp.
References ASSERT, EST::getESTList(), and ESTAnalyzer::refESTidx.
void CLU::showArguments | ( | std::ostream & | os | ) | [virtual] |
Display valid command line arguments for this analyzer.
This method must be used to display all valid command line options that are supported by this analyzer. Note that derived classes may override this method to display additional command line options that are applicable to them. This method is typically used in the main() method when displaying usage information.
[out] | os | The output stream to which the valid command line arguments must be written. |
Reimplemented from FWAnalyzer.
friend class ESTAnalyzerFactory [friend] |
int CLU::abundanceFraction = 10 [static, protected] |
Parameter to define fraction value to compute abundance metric.
This static variable holds the fraction (with respect to sequence length) beyond which a specific oligonucleotide (word) is considered to be abundant. The default value is 10. This value can be changed by the user via a command line parameter.
Definition at line 409 of file CLU.h.
Referenced by filterHashMap(), and parseArguments().
arg_parser::arg_record CLU::argsList [static, private] |
{ {"--abdFrac", "Abundance Fraction (default=10)", &CLU::abundanceFraction, arg_parser::INTEGER}, {NULL, NULL, NULL, arg_parser::BOOLEAN} }
The set of arguments specific to the CLU program.
This static variable contains the list of arguments that are specific only to the CLU analyzer class. This argument list is statically defined and shared by all instances of this class.
char CLU::CharToInt [static, private] |
A simple array to map characters A, T, C, and G to 0, 1, 2, and 3 respectively.
This is a simple array of 255 entries that is used to convert the base pair encoding characters A, T, C, and G to 0, 1, 2, and 3 respectively to compute the hash as defined by CLU. This array is initialized in the constructor and is never changed during the lifetime of this class.
Definition at line 456 of file CLU.h.
Referenced by buildHashMaps(), CLU(), createCLUHashMap(), and getSimilarity().
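A lookup table of this kind, and its use in hashing a word, might look like the sketch below. The names `makeCharToInt()` and `encodeWord()` are hypothetical, and the sketch uses 256 entries (one per `unsigned char` value) rather than the 255 stated above, which is an assumption made so that every possible character is a valid index.

```cpp
#include <array>
#include <cassert>

// Build a lookup table mapping A, T, C, G to 0, 1, 2, 3.
// All other characters map to 0 in this sketch.
std::array<char, 256> makeCharToInt() {
    std::array<char, 256> table{};  // zero-initialized
    table['A'] = 0;
    table['T'] = 1;
    table['C'] = 2;
    table['G'] = 3;
    return table;
}

// Encode a word of wordSize bases as a base-4 number (2 bits per base).
int encodeWord(const std::array<char, 256>& table, const char* word,
               const int wordSize) {
    int hash = 0;
    for (int i = 0; i < wordSize; i++) {
        // Cast through unsigned char so the index is never negative.
        hash = (hash << 2) | table[static_cast<unsigned char>(word[i])];
    }
    return hash;
}
```

A table lookup avoids a branch per base in the inner hashing loop, which is the usual motivation for this layout.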