D2 Class Reference

EST Analyzer that uses the D2 algorithm to compute distances between two ESTs. More...

#include <D2.h>

Inheritance diagram for D2:
Inheritance graph
[legend]
Collaboration diagram for D2:
Collaboration graph
[legend]

List of all members.

Public Member Functions

virtual ~D2 ()
 The destructor.
virtual void showArguments (std::ostream &os)
 Display valid command line arguments for this analyzer.
virtual bool parseArguments (int &argc, char **argv)
 Process command line arguments.
int initialize ()
 Method to begin EST analysis.
virtual std::string getName () const
 Method to obtain human-readable name for this EST analyzer.
virtual int setReferenceEST (const int estIdx)
 Set the reference EST id for analysis.
virtual int analyze ()
 Method to perform exhaustive EST analysis.

Protected Member Functions

virtual float getMetric (const int otherEST)
 Analyze and obtain a distance metric.
bool compareMetrics (const float metric1, const float metric2) const
 Method to compare two metrics generated by this class.
float getInvalidMetric () const
 Obtain an invalid (or the worst) metric generated by this analyzer.
virtual bool getAlignmentData (int &alignmentData)
 Get alignment data for the previous call to analyze method.
bool isDistanceMetric () const
 Determine if this EST analyzer provides distance metrics or similarity metrics.
template<typename Encoder >
void buildWordTable (std::vector< int > &wordTable, const char *estSeq, Encoder encoder)
 Helper method to create a word table.

Protected Attributes

std::vector< int > s1WordTable
 Instance variable that maps an index in the reference sequence (sequence s1) to the hash of a word.
std::vector< int > s2WordTable
 Instance variable that maps an index in the comparison sequence (sequence s2) to the hash of a word.
int * delta
 Array to track the delta values in the core D2 algorithm.

Static Protected Attributes

static int frameShift = 1
 Parameter to define number of characters to shift the frame on the reference sequence, when computing D2.
static int bitShift = 0
 Instance variable to store the number of bits to be shifted to create hash values.
static int threshold = 0
 The threshold score below which two ESTs are considered sufficiently similar to be clustered.

Private Member Functions

 D2 (const int refESTidx, const std::string &outputFileName)
void updateWindow (const int wordIn, const int wordOut, int &score, int &minScore)
 Helper method to update the scores based on a sliding window.
float runD2 (const int otherEST)
 The core method that run's the D2 algorithm.

Private Attributes

int alignmentMetric
 Instance variable to track alignment metric computed by the analyze() method.
int numWordsInWindow
 Instance variable to track the number of words (of wordSize) that can fit into a window (of frameSize).
const std::string hintKey
 The hint key that is used to add hint for normal or reverse-complement D2 computation.

Static Private Attributes

static arg_parser::arg_record argsList []
 The set of arguments specific to the D2 algorithm.
static int BitMask = 0
 The bitmask to be used when build hashing values.

Friends

class ESTAnalyzerFactory

Detailed Description

EST Analyzer that uses the D2 algorithm to compute distances between two ESTs.

This analyzer provides the mechanism to use D2 algorithm to compute the distance values between a pair of ESTs. The D2 implementation has been adapted purely from the implementations of WCD, Zimmerman, and CLU.

Note:
This D2 analyzer uses a word size of 8 base pairs. This is hard coded into the algorithm in the form of various assumptions (data types of variables etc.). A similar approach is used by other D2 implementations as well. This is a compromise between performance and flexibility of the implementation.

Definition at line 61 of file D2.h.


Constructor & Destructor Documentation

D2::~D2 (  )  [virtual]

The destructor.

The destructor frees up all any dynamic memory allocated by this object for its operations.

Definition at line 70 of file D2.cpp.

References delta.

D2::D2 ( const int  refESTidx,
const std::string &  outputFileName 
) [private]

Definition at line 64 of file D2.cpp.

References alignmentMetric, and delta.


Member Function Documentation

virtual int D2::analyze (  )  [inline, virtual]

Method to perform exhaustive EST analysis.

This method is used to perform the core tasks of comparing all ESTs to one another for full analysis of ESTs. This is an additional feature of PEACE that is not used for clustering but just doing an offline analysis. Currently, this method merely calls the corresponding base class implementation that performs all the necessary operations.

Note:
This method was introduced to avoid the unnecessary warning about partial overloading of virtual methods (warning #654) in ICC.
Returns:
This method returns zero if all the processing proceeded successfully. On errors this method returns a non-zero value.

Reimplemented from FWAnalyzer.

Definition at line 185 of file D2.h.

template<typename Encoder >
void D2::buildWordTable ( std::vector< int > &  wordTable,
const char *  estSeq,
Encoder  encoder 
) [inline, protected]

Helper method to create a word table.

Creates a "word table" mapping integer indices to integer hashes of words, in effect translating the sequence from a sequence of characters to a sequence of n-words (where n = wordSize).

Parameters:
encoder The encoder class to be used for building the word table.
[out] wordTable The word table to be populated with with hash values.
[in] estSeq The sequence of base pairs associated with this EST that must be converted to hash values.

Definition at line 289 of file D2.h.

References FWAnalyzer::frameSize, EST::getMaxESTLen(), and FWAnalyzer::wordSize.

Referenced by runD2(), and setReferenceEST().

bool D2::compareMetrics ( const float  metric1,
const float  metric2 
) const [inline, protected, virtual]

Method to compare two metrics generated by this class.

This method provides the interface for comparing metrics generated by this ESTAnalyzer when comparing two different ESTs. This method returns true if metric1 is comparatively better than or equal to metric2.

Note:
As per the ESTAnalyzer API requirements, EST analyzers that are based on distance measures (such as this D2 analyzer) must override this method.
Parameters:
[in] metric1 The first metric to be compared against.
[in] metric2 The second metric to be compared against.
Returns:
This method returns true if metric1 is comparatively better than metric2.

Reimplemented from ESTAnalyzer.

Definition at line 220 of file D2.h.

bool D2::getAlignmentData ( int &  alignmentData  )  [protected, virtual]

Get alignment data for the previous call to analyze method.

This method can be used to obtain alignment data that was obtained typically as an byproduct of the previous call to the analyze() method. This method essentially returns the difference between the two windows that provided the minimum d2 distance value.

Parameters:
[out] alignmentData The parameter is updated to the alignment information generated as a part of the the immediately preceding analyze(const int) method call is returned in the parameter.
Returns:
This method always returns true to indicate that alignment data is computed by this ESTAnalyzer.

Reimplemented from ESTAnalyzer.

Definition at line 272 of file D2.cpp.

References alignmentMetric.

float D2::getInvalidMetric (  )  const [inline, protected, virtual]

Obtain an invalid (or the worst) metric generated by this analyzer.

This method can be used to obtain an invalid metric value for this analyzer. This value can be used to initialize metric values.

Note:
Derived distance-based metric classes (such as this D2 analyzer) must override this method to provide a suitable value.
Returns:
This method returns an invalid (or the worst) metric of 1e7 for this EST analyzer.

Reimplemented from ESTAnalyzer.

Definition at line 237 of file D2.h.

References FWAnalyzer::frameSize.

float D2::getMetric ( const int  otherEST  )  [protected, virtual]

Analyze and obtain a distance metric.

This method can be used to compare a given EST with the reference EST (set via the call to the setReferenceEST()) method.

Parameters:
[in] otherEST The index (zero based) of the EST with which the reference EST is to be compared.
Returns:
This method returns the distance value reported by the D2 algorithm.

Reimplemented from FWAnalyzer.

Definition at line 144 of file D2.cpp.

References EST::getESTCount(), ESTAnalyzer::refESTidx, runD2(), and VALIDATE.

virtual std::string D2::getName (  )  const [inline, virtual]

Method to obtain human-readable name for this EST analyzer.

This method provides a human-readable string identifying the EST analyzer. This string is typically used for display/debugging purposes (particularly via the PEACE Interactive Console).

Returns:
This method returns the string "d2" identifiying this analyzer.

Implements ESTAnalyzer.

Definition at line 145 of file D2.h.

int D2::initialize (  )  [virtual]

Method to begin EST analysis.

This method is invoked just before commencement of EST analysis. This method first invokes the base class's initialize method that initializes the heuristic chain, if a chain has been set for this analyzer. It then initializes memory for the word tables used to store hash values for the pairs of ESTs to be analyzed.

Returns:
Currently, this method always returns 0 (zero) to indicate initialization was successfully completed.

Reimplemented from FWAnalyzer.

Definition at line 100 of file D2.cpp.

References BitMask, bitShift, delta, FWAnalyzer::frameSize, numWordsInWindow, and FWAnalyzer::wordSize.

bool D2::isDistanceMetric (  )  const [inline, protected, virtual]

Determine if this EST analyzer provides distance metrics or similarity metrics.

This method can be used to determine if this EST analyzer provides distance metrics or similarity metrics. If this method returns true, then this EST analyzer returns distance metrics (smaller is better). On the other hand, if this method returns false, then this EST analyzer returns similarity metrics (bigger is better).

Returns:
This method returns true to indicate that this EST analyzer operates using distance metrics.

Reimplemented from ESTAnalyzer.

Definition at line 270 of file D2.h.

bool D2::parseArguments ( int &  argc,
char **  argv 
) [virtual]

Process command line arguments.

This method is used to process command line arguments specific to this EST analyzer. This method is typically used from the main method just after the EST analyzer has been instantiated. This method consumes all valid command line arguments. If the command line arguments were valid and successfully processed, then this method returns true.

Currently, this EST analyzer accepts an optional frameShift value to control the stride used for D2 algorithm. However, the default value of 1 for frameShift is the preferred value for this parameter.

Note:
The ESTAnalyzer base class requires that derived EST analyzer classes must override this method to process any command line arguments that are custom to their operation. However, as per API requirement, this methodcalls the corresponding base class implementation to display additional options.
Parameters:
[in,out] argc The number of command line arguments to be processed.
[in,out] argv The array of command line arguments.
Returns:
This method returns true if the command line arguments were successfully processed. Otherwise this method returns false. This method checks to ensure that a valid frame size and a valid word size have been specified.

Reimplemented from FWAnalyzer.

Definition at line 85 of file D2.cpp.

References ESTAnalyzer::analyzerName, arg_parser::check_args(), and frameShift.

float D2::runD2 ( const int  otherEST  )  [private]

The core method that run's the D2 algorithm.

This method performs the core D2 analysis. This method is invoked from the getMetric() method to run the core D2 algorithm. This method operates as follows:

  1. If a heuristic chain has been set, then this method obtains a hint from the chain to decide if the normal or reverse complement analysis must be performed. The default is normal analysis.

  2. Next, a hash table of words is built for the otherEST sequence via a call to the buildWordTable method.

  3. Next, it computes the score using the first two windows.

  4. Finally, for each window in the reference EST it iterates over the windows in the other EST (sliding windows in each inner iteration) to compute the minimum d2 score by calling the updateWindow() method.

  5. It finally returns the minimum d2 score recorded.

Parameters:
[in] otherEST The index of the other EST to be analyzed by this method.
Returns:
The d2 score between the reference EST (set via call to setReferenceEST()) and the otherEST (parameter).

Definition at line 160 of file D2.cpp.

References buildWordTable(), ESTAnalyzer::chain, delta, FWAnalyzer::frameSize, EST::getEST(), ESTAnalyzer::getHeuristicChain(), HeuristicChain::getHint(), EST::getSequence(), numWordsInWindow, s1WordTable, s2WordTable, threshold, updateWindow(), and FWAnalyzer::wordSize.

Referenced by getMetric().

int D2::setReferenceEST ( const int  estIdx  )  [virtual]

Set the reference EST id for analysis.

This method is invoked just before a batch of ESTs are analyzed via a call to the analyze(EST *) method. This method currently saves the index in the refESTidx instance variable for further look up. Next it creates a "word table" (via a call to the buildWordTable method) mapping integer indices to integer hashes of words, in effect translating the sequence from a sequence of characters to a sequence of n-words (where n = wordSize). This word table is kept until the reference EST is changed, which reduces overhead.

Note:
This method must be called only after the initialize() method is called.
Returns:
This method returns true if the estIdx was within the given range of values. Otherwise this method returns a non-zero value as the error code.

Reimplemented from FWAnalyzer.

Definition at line 124 of file D2.cpp.

References buildWordTable(), ESTAnalyzer::chain, EST::getEST(), EST::getESTCount(), EST::getSequence(), ESTAnalyzer::refESTidx, s1WordTable, and HeuristicChain::setReferenceEST().

void D2::showArguments ( std::ostream &  os  )  [virtual]

Display valid command line arguments for this analyzer.

This method must be used to display all valid command line options that are supported by this analyzer. Currently, this analyzer does not require any special command line parameters.

Note:
The ESTAnalyzer base class requires that derived EST analyzer classes must override this method to display help for their custom command line arguments. When this method is overridden don't forget to call the corresponding base class implementation to display common options.
Parameters:
[out] os The output stream to which the valid command line arguments must be written.

Reimplemented from FWAnalyzer.

Definition at line 77 of file D2.cpp.

void D2::updateWindow ( const int  wordIn,
const int  wordOut,
int &  score,
int &  minScore 
) [inline, private]

Helper method to update the scores based on a sliding window.

This method is invoked from several different spots from the runD2() method to update the d2 scores as the window slides across the two sequences being analyzed. The hash values of the words moving into and out of the window are used to update the scores.

Parameters:
[in] wordIn The hash value of the word moving into the window.
[in] wordOut The hash value of the word moving out of the window.
[in,out] score The current running score for this window. This value is updated using the delta array.
[in,out] minScore The current minimum score. This value is updated after the score is updated to reflect the minimum of score and minScore.

Definition at line 483 of file D2.h.

References delta.

Referenced by runD2().


Friends And Related Function Documentation

friend class ESTAnalyzerFactory [friend]

Definition at line 62 of file D2.h.


Member Data Documentation

int D2::alignmentMetric [private]

Instance variable to track alignment metric computed by the analyze() method.

This instance variable is used to hold the alignment metric that was computed in the previous analyze method call. By default this value is set to zero. The alignment metric is computed as the difference in the window positions (on the two ESTs being analyzed) with the minimum d2 distance.

Definition at line 437 of file D2.h.

Referenced by D2(), and getAlignmentData().

Initial value:
 {
    {"--frameShift", "Frame Shift (default=1)",
     &D2::frameShift, arg_parser::INTEGER},
    {"--d2Threshold", "Threshold score to break out of D2 (default=0)",
     &D2::threshold, arg_parser::INTEGER},    
    {NULL, NULL, NULL, arg_parser::BOOLEAN}
}

The set of arguments specific to the D2 algorithm.

This instance variable contains a static list of arguments that are specific only to the D2 analyzer class. This argument list is statically defined and shared by all instances of this class.

Note:
Use of static arguments and parameters makes D2 class hierarchy not MT-safe.

Definition at line 405 of file D2.h.

int D2::BitMask = 0 [static, private]

The bitmask to be used when build hashing values.

This bitmask is used to drop higher order bits when building hash values for wordSize (defined in base class) base pairs in a given EST. This instance variable is actually passed on to the ESTCodec::NormalEncoder or ESTCodec::RevCompEncoder when computing hash values. Since this is value is passed in a template parameter, it is defined to be static (to ensure that it has external linkage as per the ISO/ANSI standards requirement).

Definition at line 450 of file D2.h.

Referenced by initialize().

int D2::bitShift = 0 [static, protected]

Instance variable to store the number of bits to be shifted to create hash values.

This instance variable is set to the value of 2 * (wordSize - 1) (in the initialize method) to reflect the number of bits that need to be shifted in order to build the hash values for common words (including the values stored in s1WordMap and s1RCWordMap).

This instance variable is actually passed on to the ESTCodec::NormalEncoder or ESTCodec::RevCompEncoder when computing hash values. Since this is value is passed in a template parameter, it is defined to be static (to ensure that it has external linkage as per the ISO/ANSI standards requirement).

Definition at line 380 of file D2.h.

Referenced by initialize().

int* D2::delta [protected]

Array to track the delta values in the core D2 algorithm.

This array is initialized to point to an array of size 4wordSize. This array is used to track the delta values generated in the core D2 algorithm.

Definition at line 347 of file D2.h.

Referenced by D2(), initialize(), runD2(), updateWindow(), and ~D2().

int D2::frameShift = 1 [static, protected]

Parameter to define number of characters to shift the frame on the reference sequence, when computing D2.

This parameter is used to enable D2-asymmetric behavior. The default value is 1, which means D2 symmetric: all frames in both sequences will be compared. Higher values mean that the algorithm will shift by more than one character when shifting the frame on the reference sequence, resulting in fewer computations but a possible loss of accuracy from not comparing every frame in both sequences.

Note:
Currently this value is not used.

Definition at line 362 of file D2.h.

Referenced by parseArguments().

const std::string D2::hintKey [private]

The hint key that is used to add hint for normal or reverse-complement D2 computation.

This hint key is used to set a hint in the hints hash map. This string is defined as a constant to save compute time in the core runHeuristics method.

Definition at line 537 of file D2.h.

int D2::numWordsInWindow [private]

Instance variable to track the number of words (of wordSize) that can fit into a window (of frameSize).

This instance variable is set in the initialize() method and its value is used by various methods in this class. Rather than computing this value repeatedly, it is computed once and used throughout.

Definition at line 460 of file D2.h.

Referenced by initialize(), and runD2().

std::vector<int> D2::s1WordTable [protected]

Instance variable that maps an index in the reference sequence (sequence s1) to the hash of a word.

This hash can then be used as an index in the delta array to get the frequency differential for that word.

The word table is created in the initialize() method and filled in using the buildWordTable() method. For the reference EST, buildWordTable() is called in the setReferenceEST() method, meaning it does not need to be rebuilt every time we analyze a new comparison sequence.

Definition at line 326 of file D2.h.

Referenced by runD2(), and setReferenceEST().

std::vector<int> D2::s2WordTable [protected]

Instance variable that maps an index in the comparison sequence (sequence s2) to the hash of a word.

This hash is used as an index in the delta array to get the frequency differential for that word.

The word table is created in the initialize() method and filled in using the buildWordTable() method. For the comparison EST, buildWordTable() must be called in the analyze() method because a new comparison sequence is given every time analyze() is called.

Definition at line 339 of file D2.h.

Referenced by runD2().

int D2::threshold = 0 [static, protected]

The threshold score below which two ESTs are considered sufficiently similar to be clustered.

This instance variable tracks the threshold value to be used to break out of the core D2 loop. This value essentially represents the D2 score below which two ESTs are considered sufficiently similar. Currently, the default threshold value is set to 40. This value can be overridden by the user with the "--threshold" command line argument.

Definition at line 392 of file D2.h.

Referenced by runD2().


The documentation for this class was generated from the following files:

Generated on 19 Mar 2010 for PEACE by  doxygen 1.6.1