CLU Class Reference

CLU: An implementation of the CLU analysis algorithm, originally developed by Andrey Ptitsyn and Winston Hide.

#include <CLU.h>


Classes

class  CLUESTData
 Container for CLU's hash maps for ESTs.

Public Member Functions

virtual ~CLU ()
 The destructor.
virtual void showArguments (std::ostream &os)
 Display valid command line arguments for this analyzer.
virtual bool parseArguments (int &argc, char **argv)
 Process command line arguments.
virtual int initialize ()
 Method to begin EST analysis.
virtual std::string getName () const
 Method to obtain human-readable name for this EST analyzer.
virtual float getValidMetric () const
 Obtain a valid (or the best) metric generated by this analyzer.
virtual int setReferenceEST (const int estIdx)
 Set the reference EST id for analysis.

Protected Member Functions

virtual float getMetric (const int otherEST)
 Analyze and obtain a similarity metric.
virtual void dumpEST (ResultLog &log, const EST *est, const bool isReference=false)
 Dumps a given EST in 3-column format using a ResultLog.
void buildHashMaps (EST *est)
 Helper method to build reference and complement hash maps.
void createCLUHashMap (int *&table, const char *sequence)
 Helper method to create the CLU hash/look up table.
void filterHashMap (int *table, const int sequenceLength)
 Helper method to filter out certain entries from the reference hash map.
int getSimilarity (const int *const hashTable, const char *const sequence) const
 Obtain similarity metric between reference sequence and a given sequence.

Static Protected Attributes

static int abundanceFraction = 10
 Parameter to define fraction value to compute abundance metric.

Private Member Functions

 CLU (const int refESTidx, const std::string &outputFile)
 The default constructor.

Static Private Attributes

static arg_parser::arg_record argsList []
 The set of arguments specific to the CLU program.
static char CharToInt []
 A simple array to map characters A, T, C, and G to 0, 1, 2, and 3 respectively.

Friends

class ESTAnalyzerFactory

Detailed Description

CLU: An implementation of the CLU analysis algorithm, originally developed by Andrey Ptitsyn and Winston Hide.

This analyzer implements the similarity metric generation part of CLU, a clustering algorithm developed by Andrey Ptitsyn and Winston Hide. The reference for CLU is:

"CLU: A new algorithm for EST clustering", A. Ptitsyn and W. Hide, BMC Bioinformatics, 6(2), 2005. doi: 10.1186/1471-2105-6-S2-S3.

The implementation in this class has been developed by suitably adapting parts of the source code for CLU available from http://lamar.colostate.edu/~ptitsyn/

This class extends the FWAnalyzer base class. The FWAnalyzer base class provides most of the standard functionality involved in reading FASTA files, generating formatted output, and processing some parameters. This class adds functionality to compare ESTs using CLU's similarity comparison metrics.

Note:
This class is instantiated via the ESTAnalyzerFactory::create method.

Definition at line 71 of file CLU.h.


Constructor & Destructor Documentation

CLU::~CLU (  )  [virtual]

The destructor.

The destructor frees all dynamic memory allocated by this object for its operations.

Definition at line 69 of file CLU.cpp.

References EST::deleteAllESTs().

CLU::CLU ( const int  refESTidx,
const std::string &  outputFile 
) [private]

The default constructor.

The constructor is private so that this class cannot be instantiated directly. However, since ESTAnalyzerFactory is a friend of this class, an object can be instantiated via the ESTAnalyzerFactory::create() method.

Parameters:
[in] refESTidx The reference EST index value to be used when performing EST analysis. This parameter should be >= 0. This value is simply passed on to the base class. This parameter is not used if this analyzer is used for clustering.
[in] outputFile The name of the output file to which the EST analysis data is to be written. This parameter is ignored if this analyzer is used for clustering. If this parameter is the empty string then output is written to standard output. This value is simply passed onto the base class.
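
The private constructor plus friend factory arrangement described above can be sketched as follows. This is only an illustration of the pattern: SketchAnalyzer and SketchFactory are hypothetical stand-ins, and the signature of the real ESTAnalyzerFactory::create() is not documented on this page, so the one used here is an assumption.

```cpp
#include <cassert>
#include <memory>
#include <string>

class SketchFactory;  // hypothetical stand-in for ESTAnalyzerFactory

// Hypothetical stand-in for CLU: the constructor is private, so only
// friends of the class can create instances.
class SketchAnalyzer {
    friend class SketchFactory;
public:
    int refIdx() const { return refESTidx; }
    const std::string& output() const { return outputFile; }
private:
    SketchAnalyzer(const int refESTidx, const std::string& outputFile)
        : refESTidx(refESTidx), outputFile(outputFile) {
        assert(refESTidx >= 0);  // documented precondition on refESTidx
    }
    int refESTidx;
    std::string outputFile;
};

class SketchFactory {
public:
    // Assumed signature; the real create() method may take other arguments.
    static std::unique_ptr<SketchAnalyzer> create(const int refESTidx,
                                                  const std::string& outputFile) {
        // std::make_unique cannot reach the private constructor, so the
        // friend calls new directly.
        return std::unique_ptr<SketchAnalyzer>(
            new SketchAnalyzer(refESTidx, outputFile));
    }
};
```

With this layout, writing SketchAnalyzer a(0, "") outside the factory fails to compile, while SketchFactory::create(0, "") succeeds.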

Definition at line 57 of file CLU.cpp.

References CharToInt.


Member Function Documentation

void CLU::buildHashMaps ( EST *  est  )  [protected]

Helper method to build reference and complement hash maps.

This helper method builds the reference and complement hash maps for a given EST. If the reference (and complement) hash maps already exist for the given EST, then this method exits immediately without rebuilding them. Otherwise it builds the reference and complement hash maps using the createCLUHashMap() method and populates the custom CLUESTData object associated with the EST.

Parameters:
[in,out] est Pointer to the EST whose reference and complement hash maps are to be built.

Definition at line 150 of file CLU.cpp.

References ASSERT, CharToInt, createCLUHashMap(), EST::getCustomData(), and EST::getSequence().

Referenced by getMetric().

void CLU::createCLUHashMap ( int *&  table,
const char *  sequence 
) [protected]

Helper method to create the CLU hash/look up table.

This method is invoked from the buildHashMaps() method to create the referenceHashMap and complementHashMap required for comparing and processing other ESTs with the reference EST. This method operates as follows:

  1. It initializes the table (if it is NULL) to hold 4^wordSize hash entries.

  2. It resets all the entries to 0 (zero) in the table.

  3. For each word of wordSize base pairs in the sequence, it computes the hash value of the word and increments the corresponding entry (using the hash value as the index) in the table.

  4. It finally calls the filterHashMap() method to filter out low-complexity and abundant oligos.
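
Steps 1 to 3 above can be sketched as follows. This is a minimal illustration, not the code in CLU.cpp: countWords() and baseToInt() are hypothetical names, a std::vector replaces the raw int table, words are assumed to overlap (one word starting at every base), and words containing characters other than A, T, C, and G are skipped. The filtering in step 4 is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Maps A, T, C, G to 0, 1, 2, 3 (the order documented for CharToInt).
// Returning -1 for any other character is an assumption of this sketch.
inline int baseToInt(const char c) {
    switch (c) {
        case 'A': return 0;
        case 'T': return 1;
        case 'C': return 2;
        case 'G': return 3;
        default:  return -1;
    }
}

// Counts every overlapping word of wordSize bases in the sequence.
std::vector<int> countWords(const std::string& sequence, const int wordSize) {
    assert(wordSize > 0 && wordSize < 16);
    // Steps 1 and 2: a zero-filled table with 4^wordSize entries.
    const std::size_t tableSize = std::size_t(1) << (2 * wordSize);
    std::vector<int> table(tableSize, 0);
    const std::size_t mask = tableSize - 1;
    std::size_t hash = 0;
    int validBases = 0;  // consecutive valid bases seen so far
    for (const char c : sequence) {
        const int code = baseToInt(c);
        if (code < 0) {
            // Restart the rolling hash after an unknown character.
            hash = 0;
            validBases = 0;
            continue;
        }
        // Shift in two bits for the new base; the mask drops the oldest base.
        hash = ((hash << 2) | static_cast<std::size_t>(code)) & mask;
        if (++validBases >= wordSize) {
            ++table[hash];  // Step 3: count this word.
        }
    }
    return table;
}
```

For example, with wordSize 2 the sequence "ATAT" contains the words AT, TA, and AT, so entry 1 (AT) holds 2 and entry 4 (TA) holds 1.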

Parameters:
[in,out] table The hash table to be populated by this method.
[in] sequence The EST sequence to be used for populating the hash table.

Definition at line 180 of file CLU.cpp.

References ASSERT, CharToInt, filterHashMap(), and FWAnalyzer::wordSize.

Referenced by buildHashMaps().

void CLU::dumpEST ( ResultLog &  log,
const EST *  est,
const bool  isReference = false 
) [protected, virtual]

Dumps a given EST in 3-column format using a ResultLog.

This method is a helper method that dumps a given EST out to the log. This method overrides the default implementation in the base class to perform its own custom operation.

Parameters:
[out] log The log to which the EST is to be dumped.
[in] est The EST to be dumped. This parameter is never NULL.
[in] isReference If this flag is true, then this EST is the reference EST to be dumped out.

Reimplemented from FWAnalyzer.

Definition at line 119 of file CLU.cpp.

References EST::getInfo(), EST::getSequence(), EST::getSimilarity(), ESTAnalyzer::htmlLog, and ResultLog::report().

void CLU::filterHashMap ( int *  table,
const int  sequenceLength 
) [protected]

Helper method to filter out certain entries from the reference hash map.

This helper method is invoked from the createCLUHashMap() method to filter out entries from the hash map that are not significant when comparing ESTs. Specifically, this method filters out non-informative and low-complexity sequences in the following manner:

  • First, zero out all simple oligos from consideration. The simple oligos are sequences of the form: "AAAAAACCCCCCGGGGGGTTTTTT".

  • Next, it removes abundant sequences (sequences that occur too frequently) from the table. Such abundant sequences are not informative in EST comparisons.

  • Finally, all non-zero entries are normalized to 1.
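
The three filtering passes above can be sketched as follows. Several points are assumptions of this sketch and not taken from CLU.cpp: filterTable() is a hypothetical name, "simple oligos" is read as words made of a single repeated base, and a word is treated as abundant when its count exceeds sequenceLength / abundanceFraction. The table is assumed to use the base-4 word hash with A, T, C, G mapped to 0, 1, 2, 3.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

void filterTable(std::vector<int>& table, const int wordSize,
                 const int sequenceLength, const int abundanceFraction) {
    assert(wordSize > 0 && wordSize < 16 && abundanceFraction > 0);
    assert(table.size() == (std::size_t(1) << (2 * wordSize)));

    // Pass 1: zero out single-base repeats. A word made only of base b
    // hashes to b * (4^wordSize - 1) / 3 under the base-4 encoding.
    const std::size_t repeatUnit = (table.size() - 1) / 3;
    for (std::size_t base = 0; base < 4; ++base) {
        table[base * repeatUnit] = 0;
    }

    // Assumed abundance rule: counts above this limit are dropped.
    const int limit = sequenceLength / abundanceFraction;
    for (int& count : table) {
        if (count > limit) {
            count = 0;  // Pass 2: remove abundant words.
        } else if (count != 0) {
            count = 1;  // Pass 3: normalize remaining entries to 1.
        }
    }
}
```

After this call the table acts as a presence flag per word, which is what a lookup-based comparison needs.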

Parameters:
[in,out] table The table of hash values to be filtered and normalized by this method. This table must have been populated by a call to the createCLUHashMap() method.
[in] sequenceLength The length of the EST sequence from which the referenceHashMap has been generated.

Definition at line 220 of file CLU.cpp.

References abundanceFraction, and FWAnalyzer::wordSize.

Referenced by createCLUHashMap().

float CLU::getMetric ( const int  otherEST  )  [protected, virtual]

Analyze and obtain a similarity metric.

This method can be used to compare a given EST with the reference EST (set via a call to the setReferenceEST() method).

Parameters:
[in] otherEST The index (zero based) of the EST with which the reference EST is to be compared.
Returns:
This method returns a similarity metric obtained by comparing the given EST with the reference EST.

Reimplemented from FWAnalyzer.

Definition at line 349 of file CLU.cpp.

References ASSERT, buildHashMaps(), CLU::CLUESTData::complementHashMap, EST::getCustomData(), EST::getEST(), EST::getSequence(), getSimilarity(), CLU::CLUESTData::referenceHashMap, and ESTAnalyzer::refESTidx.

virtual std::string CLU::getName (  )  const [inline, virtual]

Method to obtain human-readable name for this EST analyzer.

This method provides a human-readable string identifying the EST analyzer. This string is typically used for display/debugging purposes (particularly via the PEACE Interactive Console).

Returns:
This method returns the string "CLU" identifying this analyzer.

Implements ESTAnalyzer.

Definition at line 153 of file CLU.h.

int CLU::getSimilarity ( const int *const   hashTable,
const char *const   sequence 
) const [protected]

Obtain similarity metric between reference sequence and a given sequence.

This method performs the core task of comparing a given EST sequence with the reference sequence given a CLU hash table. This method operates as follows:

  1. It computes the hash for each word in the first frame of the given sequence and obtains a categorized distribution (cd) index by adding the number of matching words found in the reference sequence.

  2. For subsequent frames in the sequence, it uses the values computed for the first frame to compute the categorized distribution (cd) index value.

  3. It adds one to the corresponding cd entry as indicated by the resulting cdIndex value.

  4. Finally, it computes the sum of the cd values multiplied by the thresholds (constant values obtained earlier by the original authors via Monte-Carlo type simulations) to obtain the similarity metric.
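
The four steps above can be sketched as follows. This is an illustration under stated assumptions, not the code in CLU.cpp: frameScore() and baseToInt() are hypothetical names, the frame is assumed to slide one base at a time, the sequence is assumed to contain only A, T, C, and G, and the weights vector is a placeholder for the authors' threshold constants, which are not listed on this page.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Maps A, T, C, G to 0, 1, 2, 3; -1 marks any other character.
inline int baseToInt(const char c) {
    switch (c) {
        case 'A': return 0;
        case 'T': return 1;
        case 'C': return 2;
        case 'G': return 3;
        default:  return -1;
    }
}

// table: filtered reference table (non-zero means the word is present).
// weights: hypothetical per-count weights, one per possible cdIndex value.
int frameScore(const std::vector<int>& table, const std::string& sequence,
               const int wordSize, const int frameSize,
               const std::vector<int>& weights) {
    assert(wordSize > 0 && wordSize < 16 && frameSize >= wordSize);
    assert(table.size() == (std::size_t(1) << (2 * wordSize)));
    const std::size_t wordsPerFrame =
        static_cast<std::size_t>(frameSize - wordSize + 1);
    assert(weights.size() == wordsPerFrame + 1);
    if (sequence.size() < static_cast<std::size_t>(frameSize)) {
        return 0;  // too short to hold even one frame
    }

    // hit[i] is 1 when the word starting at base i occurs in the reference.
    const std::size_t wordCount = sequence.size() - wordSize + 1;
    std::vector<int> hit(wordCount, 0);
    const std::size_t mask = table.size() - 1;
    std::size_t hash = 0;
    for (std::size_t i = 0; i < sequence.size(); ++i) {
        const int code = baseToInt(sequence[i]);
        assert(code >= 0);  // this sketch expects only A, T, C, G
        hash = ((hash << 2) | static_cast<std::size_t>(code)) & mask;
        if (i + 1 >= static_cast<std::size_t>(wordSize)) {
            hit[i + 1 - wordSize] = (table[hash] != 0) ? 1 : 0;
        }
    }

    std::vector<int> cd(wordsPerFrame + 1, 0);
    // Step 1: count the matching words in the first frame.
    int cdIndex = 0;
    for (std::size_t i = 0; i < wordsPerFrame; ++i) {
        cdIndex += hit[i];
    }
    ++cd[cdIndex];  // Step 3 for the first frame.
    // Step 2: update the count as one word enters and one leaves the frame.
    for (std::size_t next = wordsPerFrame; next < wordCount; ++next) {
        cdIndex += hit[next] - hit[next - wordsPerFrame];
        ++cd[cdIndex];  // Step 3 for each later frame.
    }
    // Step 4: weighted sum over the categorized distribution.
    int score = 0;
    for (std::size_t i = 0; i < cd.size(); ++i) {
        score += cd[i] * weights[i];
    }
    return score;
}
```

Reusing the previous frame's count keeps the cost per frame constant instead of rescanning every word in the frame.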

Parameters:
[in] hashTable The hash table from the reference sequence (either the referenceHashTable or complementHashTable) to be used for searching/comparison.
[in] sequence The other sequence with which the reference sequence is to be compared.
Returns:
A similarity score for the given sequence against the reference sequence.

Definition at line 267 of file CLU.cpp.

References ASSERT, CharToInt, FWAnalyzer::frameSize, and FWAnalyzer::wordSize.

Referenced by getMetric().

virtual float CLU::getValidMetric (  )  const [inline, virtual]

Obtain a valid (or the best) metric generated by this analyzer.

This method can be used to obtain a valid metric value for this analyzer. This value can be used to initialize metric values. By default this method returns 0, which should be ideal for distance-based metrics.

Note:
Derived similarity-based metric classes must override this method to provide a suitable value.
Returns:
This method returns a valid (or the best) metric for this EST analyzer.

Reimplemented from ESTAnalyzer.

Definition at line 155 of file CLU.h.

int CLU::initialize (  )  [virtual]

Method to begin EST analysis.

This method is invoked just before commencement of EST analysis. This method first invokes the base class method that loads the ESTs from a given input multi-FASTA file and populates the list of ESTs. If the ESTs were successfully loaded, then this method initializes the custom data for each EST (with empty hash maps).

Returns:
If the ESTs were successfully loaded from the FASTA file, then this method returns 0. Otherwise this method returns a non-zero error code.

Reimplemented from FWAnalyzer.

Definition at line 101 of file CLU.cpp.

References EST::getESTList().

bool CLU::parseArguments ( int &  argc,
char **  argv 
) [virtual]

Process command line arguments.

This method is used to process command line arguments specific to this EST analyzer. This method is typically used from the main method just after the EST analyzer has been instantiated. This method consumes all valid command line arguments. If the command line arguments were valid and successfully processed, then this method returns true.

Note:
Derived EST analyzer classes must override this method to process any command line arguments that are custom to their operation. When this method is overridden, don't forget to call the corresponding base class implementation to process common options.
Parameters:
[in,out] argc The number of command line arguments to be processed.
[in,out] argv The array of command line arguments.
Returns:
This method returns true if the command line arguments were successfully processed. Otherwise this method returns false. This method checks to ensure that a valid frame size and a valid word size have been specified.

Reimplemented from FWAnalyzer.

Definition at line 86 of file CLU.cpp.

References abundanceFraction, ESTAnalyzer::analyzerName, and arg_parser::check_args().

int CLU::setReferenceEST ( const int  estIdx  )  [virtual]

Set the reference EST id for analysis.

This method is invoked just before a batch of ESTs are analyzed via a call to the analyze(EST *) method. This method builds the hash table (used by CLU for searching/comparison) for the reference analysis. The reference EST is also called Sq in CLU literature.

Note:
This method must be called only after the initialize() method is called. This method overrides the implementation in the base class to perform its own custom operation.
Returns:
If the reference estIdx is invalid, then this method returns 1. Otherwise it populates the referenceTable array with the necessary information and returns 0.

Reimplemented from FWAnalyzer.

Definition at line 134 of file CLU.cpp.

References ASSERT, EST::getESTList(), and ESTAnalyzer::refESTidx.

void CLU::showArguments ( std::ostream &  os  )  [virtual]

Display valid command line arguments for this analyzer.

This method must be used to display all valid command line options that are supported by this analyzer. Note that derived classes may override this method to display additional command line options that are applicable to them. This method is typically used in the main() method when displaying usage information.

Note:
Derived EST analyzer classes must override this method to display help for their custom command line arguments. When this method is overridden don't forget to call the corresponding base class implementation to display common options.
Parameters:
[out] os The output stream to which the valid command line arguments must be written.

Reimplemented from FWAnalyzer.

Definition at line 78 of file CLU.cpp.


Friends And Related Function Documentation

friend class ESTAnalyzerFactory [friend]

Definition at line 72 of file CLU.h.


Member Data Documentation

int CLU::abundanceFraction = 10 [static, protected]

Parameter to define fraction value to compute abundance metric.

This static variable defines the fraction (with respect to sequence length) beyond which a specific oligonucleotide (word) is considered to be abundant. The default value is 10. This value can be changed by the user via a command line parameter.

Definition at line 409 of file CLU.h.

Referenced by filterHashMap(), and parseArguments().

arg_parser::arg_record CLU::argsList [static, private]

Initial value:
 {
    {"--abdFrac", "Abundance Fraction (default=10)",
     &CLU::abundanceFraction, arg_parser::INTEGER},
    {NULL, NULL, NULL, arg_parser::BOOLEAN}
}

The set of arguments specific to the CLU program.

This static variable contains a list of arguments that are specific only to the CLU analyzer class. This argument list is statically defined and shared by all instances of this class.

Note:
Use of static arguments and parameters makes the CLU class hierarchy not MT-safe.

Definition at line 422 of file CLU.h.

char CLU::CharToInt [static, private]

A simple array to map characters A, T, C, and G to 0, 1, 2, and 3 respectively.

This is a simple array of 255 entries that is used to convert the base pair encoding characters A, T, C, and G to 0, 1, 2, and 3 respectively to compute the hash as defined by CLU. This array is initialized in the constructor and is never changed during the lifetime of this class.
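
A lookup table of this kind, and the word hash it feeds, can be sketched as follows. The names makeCharToInt() and hashWord() are hypothetical. The sketch uses 256 entries so that every unsigned char value is a valid index, and it encodes a word as a base-4 number with the first base most significant; both choices are assumptions and may differ from CLU.cpp.

```cpp
#include <array>
#include <cassert>
#include <string>

// Builds a table indexed by character code. Entries for characters other
// than A, T, C, G stay at 0 in this sketch, so they encode like 'A'.
std::array<char, 256> makeCharToInt() {
    std::array<char, 256> table{};  // value-initialized: all entries are 0
    table['A'] = 0;
    table['T'] = 1;
    table['C'] = 2;
    table['G'] = 3;
    return table;
}

// Encodes one word as a base-4 number, first base most significant.
int hashWord(const std::array<char, 256>& charToInt, const std::string& word) {
    assert(word.size() < 16);  // keeps the result within a 32-bit int
    int hash = 0;
    for (const char c : word) {
        // Cast through unsigned char so the index is never negative.
        hash = hash * 4 + charToInt[static_cast<unsigned char>(c)];
    }
    return hash;
}
```

A table lookup replaces a chain of comparisons per base, which matters when every word of every EST is hashed.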

Definition at line 456 of file CLU.h.

Referenced by buildHashMaps(), CLU(), createCLUHashMap(), and getSimilarity().


The documentation for this class was generated from the following files:

Generated on 19 Mar 2010 for PEACE by  doxygen 1.6.1