EST Analyzer that uses the D2 algorithm in both its symmetric and asymmetric variations to compute distance between two ESTs. More...
#include <TwoPassD2.h>
Public Member Functions | |
virtual | ~TwoPassD2 () |
The destructor. | |
virtual void | showArguments (std::ostream &os) |
Display valid command line arguments for this analyzer. | |
virtual bool | parseArguments (int &argc, char **argv) |
Process command line arguments. | |
int | initialize () |
Method to begin EST analysis. | |
virtual std::string | getName () const |
Method to obtain human-readable name for this EST analyzer. | |
virtual int | setReferenceEST (const int estIdx) |
Set the reference EST id for analysis. | |
virtual int | analyze () |
Method to perform exhaustive EST analysis. | |
Protected Member Functions | |
virtual float | getMetric (const int otherEST) |
Analyze and obtain a distance metric. | |
bool | compareMetrics (const float metric1, const float metric2) const |
Method to compare two metrics generated by this class. | |
float | getInvalidMetric () const |
Obtain an invalid (or the worst) metric generated by this analyzer. | |
virtual bool | getAlignmentData (int &alignmentData) |
Get alignment data for the previous call to analyze method. | |
bool | isDistanceMetric () const |
Determine if this EST analyzer provides distance metrics or similarity metrics. | |
template<typename Encoder > | |
void | buildWordTable (int *wordTable, const char *estSeq, Encoder encoder) |
Helper method to create a word table. | |
Protected Attributes | |
int * | s1WordTable |
Instance variable that maps an index in the reference sequence (sequence s1) to the hash of a word. | |
int * | s2WordTable |
Instance variable that maps an index in the comparison sequence (sequence s2) to the hash of a word. | |
int * | delta |
Array to track the delta values in the core D2 algorithm. | |
int | frameShift |
Parameter to define number of characters to shift the frame on the reference sequence, when computing D2. | |
int | maxThreshold |
The threshold score above which two ESTs are considered sufficiently dissimilar that a more accurate symmetric D2 pass would still not cluster the ESTs. | |
int | threshold |
The threshold score to be used for clustering. | |
Static Protected Attributes | |
static int | bitShift = 0 |
Instance variable to store the number of bits to be shifted to create hash values. | |
static int | minThreshold = 0 |
The threshold score below which two ESTs are considered sufficiently similar to be clustered. | |
Private Member Functions | |
TwoPassD2 (const int refESTidx, const std::string &outputFileName) | |
void | updateWindowAsym (const int wordIn, const int wordOut, int &score, int &minScore, const int s1CurWindowIdx, const int s2CurWindowIdx, int *s1MinScoreIdx, int *s2MinScoreIdx) |
Helper method to update the delta table as well as the minimum score (if it has changed) when the window is shifted. | |
void | updateWindow (const int wordIn, const int wordOut, int &score, int &minScore, const int windowDistance) |
Helper method to update the delta table as well as the minimum score (if it has changed) when the window is shifted. | |
void | updateParameters () |
Method to dynamically compute parameters such as frame size and threshold. | |
float | runD2Asymmetric (int *s1MinScoreIdx, int *s2MinScoreIdx) |
Method to run the D2 algorithm, asymmetric version. | |
float | runD2Bounded (int sq1Start, int sq1End, int sq2Start, int sq2End) |
Method to run the D2 algorithm, symmetric version. | |
Private Attributes | |
int | alignmentMetric |
Instance variable to track alignment metric computed by the analyze() method. | |
int | numWordsInWindow |
Instance variable to track the number of words (of wordSize ) that can fit into a window (of frameSize ). | |
int | sq1Len |
Length of sequence 1, or the reference sequence. | |
int | sq2Len |
Length of sequence 2, or the comparison sequence. | |
const std::string | hintKey |
The hint key that is used to add hint for normal or reverse-complement D2 computation. | |
Static Private Attributes | |
static arg_parser::arg_record | argsList [] |
The set of arguments specific to the D2 algorithm. | |
static int | BitMask = 0 |
The bitmask to be used when building hash values. | |
static int | NHash = 0 |
The hash value to be used for a word containing 'N'. | |
static bool | noNormalize = false |
Boolean value indicating whether or not d2 scores should be normalized (converted to a 1.0-based value where 1.0 is equal to the d2 threshold). | |
Friends | |
class | ESTAnalyzerFactory |
EST Analyzer that uses the D2 algorithm in both its symmetric and asymmetric variations to compute distance between two ESTs.
This analyzer provides the mechanism to use D2 algorithm to compute the distance values between a pair of ESTs. For improved performance, the analyzer uses D2 asymmetric and symmetric together. It runs a fast "first pass" using D2 asymmetric and finds the minimum D2 score, as well as the two windows between which the score was computed. The analyzer then computes bounds and runs a bounded version of D2 symmetric. Ideally this analyzer will provide faster runtime because of the asymmetric pass, and not sacrifice any precision because of the bounded symmetric pass.
Definition at line 65 of file TwoPassD2.h.
TwoPassD2::~TwoPassD2 | ( | ) | [virtual] |
The destructor.
The destructor frees up all any dynamic memory allocated by this object for its operations.
Definition at line 85 of file TwoPassD2.cpp.
References delta, s1WordTable, and s2WordTable.
TwoPassD2::TwoPassD2 | ( | const int | refESTidx, | |
const std::string & | outputFileName | |||
) | [private] |
Definition at line 70 of file TwoPassD2.cpp.
References alignmentMetric, delta, frameShift, maxThreshold, s1WordTable, s2WordTable, and threshold.
virtual int TwoPassD2::analyze | ( | ) | [inline, virtual] |
Method to perform exhaustive EST analysis.
This method is used to perform the core tasks of comparing all ESTs to one another for full analysis of ESTs. This is an additional feature of PEACE that is not used for clustering but just doing an offline analysis. Currently, this method merely calls the corresponding base class implementation that performs all the necessary operations.
Reimplemented from FWAnalyzer.
Definition at line 189 of file TwoPassD2.h.
void TwoPassD2::buildWordTable | ( | int * | wordTable, | |
const char * | estSeq, | |||
Encoder | encoder | |||
) | [inline, protected] |
Helper method to create a word table.
Creates a "word table" mapping integer indices to integer hashes of words, in effect translating the sequence from a sequence of characters to a sequence of n-words (where n = wordSize).
[out] | wordTable | The word table to be populated with with hash values. |
[in] | estSeq | The sequence of base pairs associated with this EST that must be converted to hash values. |
encoder | The encoder object to be used to generate encoding for the words added to the generated word table. |
Definition at line 293 of file TwoPassD2.h.
References NHash, and FWAnalyzer::wordSize.
Referenced by getMetric(), and setReferenceEST().
bool TwoPassD2::compareMetrics | ( | const float | metric1, | |
const float | metric2 | |||
) | const [inline, protected, virtual] |
Method to compare two metrics generated by this class.
This method provides the interface for comparing metrics generated by this ESTAnalyzer when comparing two different ESTs. This method returns true
if metric1
is comparatively better than or equal to metric2
.
[in] | metric1 | The first metric to be compared against. |
[in] | metric2 | The second metric to be compared against. |
true
if metric1
is comparatively better than metric2
. Reimplemented from ESTAnalyzer.
Definition at line 224 of file TwoPassD2.h.
bool TwoPassD2::getAlignmentData | ( | int & | alignmentData | ) | [protected, virtual] |
Get alignment data for the previous call to analyze method.
This method can be used to obtain alignment data that was obtained typically as an byproduct of the previous call to the analyze() method. This method essentially returns the difference between the two windows that provided the minimum d2 distance value.
[out] | alignmentData | The parameter is updated to the alignment information generated as a part of the the immediately preceding analyze(const int) method call is returned in the parameter. |
true
to indicate that alignment data is computed by this ESTAnalyzer. Reimplemented from ESTAnalyzer.
Definition at line 437 of file TwoPassD2.cpp.
References alignmentMetric.
float TwoPassD2::getInvalidMetric | ( | ) | const [inline, protected, virtual] |
Obtain an invalid (or the worst) metric generated by this analyzer.
This method can be used to obtain an invalid metric value for this analyzer. This value can be used to initialize metric values.
Reimplemented from ESTAnalyzer.
Definition at line 241 of file TwoPassD2.h.
float TwoPassD2::getMetric | ( | const int | otherEST | ) | [protected, virtual] |
Analyze and obtain a distance metric.
This method can be used to compare a given EST with the reference EST (set via the call to the setReferenceEST()) method.
[in] | otherEST | The index (zero based) of the EST with which the reference EST is to be compared. |
Reimplemented from FWAnalyzer.
Definition at line 173 of file TwoPassD2.cpp.
References buildWordTable(), ESTAnalyzer::chain, frameShift, FWAnalyzer::frameSize, EST::getEST(), EST::getESTCount(), ESTAnalyzer::getHeuristicChain(), HeuristicChain::getHint(), maxThreshold, noNormalize, ESTAnalyzer::refESTidx, runD2Asymmetric(), runD2Bounded(), s2WordTable, sq1Len, sq2Len, threshold, updateParameters(), and VALIDATE.
virtual std::string TwoPassD2::getName | ( | ) | const [inline, virtual] |
Method to obtain human-readable name for this EST analyzer.
This method provides a human-readable string identifying the EST analyzer. This string is typically used for display/debugging purposes (particularly via the PEACE Interactive Console).
Implements ESTAnalyzer.
Definition at line 149 of file TwoPassD2.h.
int TwoPassD2::initialize | ( | ) | [virtual] |
Method to begin EST analysis.
This method is invoked just before commencement of EST analysis. This method first invokes the base class's initialize
method that initializes the heuristic chain, if a chain has been set for this analyzer. It then initializes memory for the word tables used to store hash values for the pairs of ESTs to be analyzed.
Reimplemented from FWAnalyzer.
Definition at line 121 of file TwoPassD2.cpp.
References BitMask, bitShift, delta, FWAnalyzer::frameSize, EST::getMaxESTLen(), ParameterSetManager::getMaxFrameSize(), ParameterSetManager::getParameterSetManager(), NHash, numWordsInWindow, s1WordTable, s2WordTable, and FWAnalyzer::wordSize.
bool TwoPassD2::isDistanceMetric | ( | ) | const [inline, protected, virtual] |
Determine if this EST analyzer provides distance metrics or similarity metrics.
This method can be used to determine if this EST analyzer provides distance metrics or similarity metrics. If this method returns true
, then this EST analyzer returns distance metrics (smaller is better). On the other hand, if this method returns false
, then this EST analyzer returns similarity metrics (bigger is better).
true
to indicate that this EST analyzer operates using distance metrics. Reimplemented from ESTAnalyzer.
Definition at line 274 of file TwoPassD2.h.
bool TwoPassD2::parseArguments | ( | int & | argc, | |
char ** | argv | |||
) | [virtual] |
Process command line arguments.
This method is used to process command line arguments specific to this EST analyzer. This method is typically used from the main
method just after the EST analyzer has been instantiated. This method consumes all valid command line arguments. If the command line arguments were valid and successfully processed, then this method returns true
.
Currently, this EST analyzer accepts an optional frameShift
value to control the stride used for D2 algorithm. However, the default value of 1 for frameShift
is the preferred value for this parameter.
ESTAnalyzer
base class requires that derived EST analyzer classes must override this method to process any command line arguments that are custom to their operation. However, as per API requirement, this methodcalls the corresponding base class implementation to display additional options.[in,out] | argc | The number of command line arguments to be processed. |
[in,out] | argv | The array of command line arguments. |
true
if the command line arguments were successfully processed. Otherwise this method returns false
. This method checks to ensure that a valid frame size and a valid word size have been specified. Reimplemented from FWAnalyzer.
Definition at line 106 of file TwoPassD2.cpp.
References ESTAnalyzer::analyzerName, arg_parser::check_args(), and frameShift.
float TwoPassD2::runD2Asymmetric | ( | int * | s1MinScoreIdx, | |
int * | s2MinScoreIdx | |||
) | [private] |
Method to run the D2 algorithm, asymmetric version.
This method runs D2 asymmetric on the two EST sequences and finds the minimum D2 score. Since this is D2 asymmetric, the score is not guaranteed to be the true minimum. Thus, this method also finds the indices of the windows on each sequence where the minimum D2 score was found. The analyzer will use these to compute appropriate bounds for the bounded symmetric D2 function.
Definition at line 244 of file TwoPassD2.cpp.
References delta, frameShift, minThreshold, NHash, numWordsInWindow, s1WordTable, s2WordTable, sq1Len, sq2Len, updateWindowAsym(), and FWAnalyzer::wordSize.
Referenced by getMetric().
float TwoPassD2::runD2Bounded | ( | int | sq1Start, | |
int | sq1End, | |||
int | sq2Start, | |||
int | sq2End | |||
) | [private] |
Method to run the D2 algorithm, symmetric version.
This method is essentially identical to the runD2 method in the D2.cpp class (Peace's base D2 implementation) with the exception that this method places bounds on each sequence. It will not search beyond these bounds. This saves a great deal of computation time (though the asymptotic runtime is unchanged) by avoiding unnecessary computations and only searching in the area where the best match might be found.
Definition at line 344 of file TwoPassD2.cpp.
References alignmentMetric, delta, minThreshold, NHash, numWordsInWindow, s1WordTable, s2WordTable, sq1Len, sq2Len, updateWindow(), and FWAnalyzer::wordSize.
Referenced by getMetric().
int TwoPassD2::setReferenceEST | ( | const int | estIdx | ) | [virtual] |
Set the reference EST id for analysis.
This method is invoked just before a batch of ESTs are analyzed via a call to the analyze(EST *) method. This method currently saves the index in the refESTidx
instance variable for further look up. Next it creates a "word table" (via a call to the buildWordTable
method) mapping integer indices to integer hashes of words, in effect translating the sequence from a sequence of characters to a sequence of n-words (where n = wordSize). This word table is kept until the reference EST is changed, which reduces overhead.
initialize()
method is called.true
if the estIdx was within the given range of values. Otherwise this method returns a non-zero value as the error code. Reimplemented from FWAnalyzer.
Definition at line 152 of file TwoPassD2.cpp.
References buildWordTable(), ESTAnalyzer::chain, EST::getEST(), EST::getESTCount(), EST::getSequence(), ESTAnalyzer::refESTidx, s1WordTable, HeuristicChain::setReferenceEST(), and sq1Len.
void TwoPassD2::showArguments | ( | std::ostream & | os | ) | [virtual] |
Display valid command line arguments for this analyzer.
This method must be used to display all valid command line options that are supported by this analyzer. Currently, this analyzer does not require any special command line parameters.
[out] | os | The output stream to which the valid command line arguments must be written. |
Reimplemented from FWAnalyzer.
Definition at line 98 of file TwoPassD2.cpp.
void TwoPassD2::updateParameters | ( | ) | [private] |
Method to dynamically compute parameters such as frame size and threshold.
Sets these parameters to their (hopefully) optimum values for better clustering of data sets with wide variation in length, such as data generated by 454 pyrosequencing.
Definition at line 230 of file TwoPassD2.cpp.
References ASSERT, ParameterSet::frameShift, frameShift, ParameterSet::frameSize, FWAnalyzer::frameSize, ParameterSetManager::getParameterSet(), ParameterSetManager::getParameterSetManager(), ParameterSet::maxThreshold, maxThreshold, numWordsInWindow, sq1Len, sq2Len, ParameterSet::threshold, threshold, and FWAnalyzer::wordSize.
Referenced by getMetric().
void TwoPassD2::updateWindow | ( | const int | wordIn, | |
const int | wordOut, | |||
int & | score, | |||
int & | minScore, | |||
const int | windowDistance | |||
) | [inline, private] |
Helper method to update the delta table as well as the minimum score (if it has changed) when the window is shifted.
This method is identical to the updateWindow method found in D2.h (Peace's base D2 implementation).
Definition at line 545 of file TwoPassD2.h.
References alignmentMetric, delta, and NHash.
Referenced by runD2Bounded().
void TwoPassD2::updateWindowAsym | ( | const int | wordIn, | |
const int | wordOut, | |||
int & | score, | |||
int & | minScore, | |||
const int | s1CurWindowIdx, | |||
const int | s2CurWindowIdx, | |||
int * | s1MinScoreIdx, | |||
int * | s2MinScoreIdx | |||
) | [inline, private] |
Helper method to update the delta table as well as the minimum score (if it has changed) when the window is shifted.
For d2 asymmetric, the updateWindow method also has to keep track of the location on each sequence where the minimum score was found, which is the function of the variables s1MinScoreIdx and s2MinScoreIdx. The variables s1CurWindowIdx and s2CurWindowIdx mark the start of the current window on each sequence. All of these variables are numbers, corresponding to indices of the word tables.
Definition at line 516 of file TwoPassD2.h.
Referenced by runD2Asymmetric().
friend class ESTAnalyzerFactory [friend] |
Definition at line 66 of file TwoPassD2.h.
int TwoPassD2::alignmentMetric [private] |
Instance variable to track alignment metric computed by the analyze() method.
This instance variable is used to hold the alignment metric that was computed in the previous analyze
method call. By default this value is set to zero. The alignment metric is computed as the difference in the window positions (on the two ESTs being analyzed) with the minimum d2 distance.
Definition at line 456 of file TwoPassD2.h.
Referenced by getAlignmentData(), runD2Bounded(), TwoPassD2(), and updateWindow().
arg_parser::arg_record TwoPassD2::argsList [static, private] |
{ {"--d2Threshold", "Threshold score to break out of D2 (default=0)", &TwoPassD2::minThreshold, arg_parser::INTEGER}, {"--noNormalize", "Signals that threshold scores should not be normalized", &TwoPassD2::noNormalize, arg_parser::BOOLEAN}, {NULL, NULL, NULL, arg_parser::BOOLEAN} }
The set of arguments specific to the D2 algorithm.
This instance variable contains a static list of arguments that are specific only to the D2 analyzer class. This argument list is statically defined and shared by all instances of this class.
Definition at line 424 of file TwoPassD2.h.
int TwoPassD2::BitMask = 0 [static, private] |
The bitmask to be used when building hash values.
This bitmask is used to drop higher order bits when building hash values for wordSize
(defined in base class) base pairs in a given EST. This instance variable is actually passed on to the ESTCodec::NormalEncoder or ESTCodec::RevCompEncoder when computing hash values. Since this is value is passed in a template parameter, it is defined to be static (to ensure that it has external linkage as per the ISO/ANSI standards requirement).
Definition at line 469 of file TwoPassD2.h.
Referenced by initialize().
int TwoPassD2::bitShift = 0 [static, protected] |
Instance variable to store the number of bits to be shifted to create hash values.
This instance variable is set to the value of 2 * (wordSize - 1) (in the initialize
method) to reflect the number of bits that need to be shifted in order to build the hash values for common words (including the values stored in s1WordMap
and s1RCWordMap
).
This instance variable is actually passed on to the ESTCodec::NormalEncoder or ESTCodec::RevCompEncoder when computing hash values. Since this is value is passed in a template parameter, it is defined to be static (to ensure that it has external linkage as per the ISO/ANSI standards requirement).
Definition at line 379 of file TwoPassD2.h.
Referenced by initialize().
int* TwoPassD2::delta [protected] |
Array to track the delta values in the core D2 algorithm.
This array is initialized to point to an array of size 4wordSize. This array is used to track the delta values generated in the core D2 algorithm.
Definition at line 348 of file TwoPassD2.h.
Referenced by initialize(), runD2Asymmetric(), runD2Bounded(), TwoPassD2(), updateWindow(), updateWindowAsym(), and ~TwoPassD2().
int TwoPassD2::frameShift [protected] |
Parameter to define number of characters to shift the frame on the reference sequence, when computing D2.
This parameter is used to enable D2-asymmetric behavior. The default value is 1, which means D2 symmetric: all frames in both sequences will be compared. Higher values mean that the algorithm will shift by more than one character when shifting the frame on the reference sequence, resulting in fewer computations but a possible loss of accuracy from not comparing every frame in both sequences.
Definition at line 361 of file TwoPassD2.h.
Referenced by getMetric(), parseArguments(), runD2Asymmetric(), TwoPassD2(), and updateParameters().
const std::string TwoPassD2::hintKey [private] |
The hint key that is used to add hint for normal or reverse-complement D2 computation.
This hint key is used to set a hint in the hints
hash map. This string is defined as a constant to save compute time in the core runHeuristics
method.
Definition at line 601 of file TwoPassD2.h.
int TwoPassD2::maxThreshold [protected] |
The threshold score above which two ESTs are considered sufficiently dissimilar that a more accurate symmetric D2 pass would still not cluster the ESTs.
This instance variable tracks the threshold value to be used to call the runD2Bounded method. If the calculated D2 score from the runD2Asymmetric method is above this threshold, bounded D2 will not be run. Currently, the default value is 130. This value can be overridden by the user with the "--maxThreshold" command line argument.
Definition at line 404 of file TwoPassD2.h.
Referenced by getMetric(), TwoPassD2(), and updateParameters().
int TwoPassD2::minThreshold = 0 [static, protected] |
The threshold score below which two ESTs are considered sufficiently similar to be clustered.
This instance variable tracks the threshold value to be used to break out of the core D2 loop. This value essentially represents the D2 score below which two ESTs are considered sufficiently similar. Currently, the default threshold value is set to 0. This value can be overridden by the user with the "--threshold" command line argument.
Definition at line 391 of file TwoPassD2.h.
Referenced by runD2Asymmetric(), and runD2Bounded().
int TwoPassD2::NHash = 0 [static, private] |
The hash value to be used for a word containing 'N'.
Initialized to MapSize (the size of the delta table).
Definition at line 475 of file TwoPassD2.h.
Referenced by buildWordTable(), initialize(), runD2Asymmetric(), runD2Bounded(), updateWindow(), and updateWindowAsym().
bool TwoPassD2::noNormalize = false [static, private] |
Boolean value indicating whether or not d2 scores should be normalized (converted to a 1.0-based value where 1.0 is equal to the d2 threshold).
Defaults to false (i.e. scores should be normalized).
Definition at line 483 of file TwoPassD2.h.
Referenced by getMetric().
int TwoPassD2::numWordsInWindow [private] |
Instance variable to track the number of words (of wordSize
) that can fit into a window (of frameSize
).
This instance variable is set in the initialize() method and its value is used by various methods in this class. Rather than computing this value repeatedly, it is computed once and used throughout.
Definition at line 493 of file TwoPassD2.h.
Referenced by initialize(), runD2Asymmetric(), runD2Bounded(), and updateParameters().
int* TwoPassD2::s1WordTable [protected] |
Instance variable that maps an index in the reference sequence (sequence s1) to the hash of a word.
This hash can then be used as an index in the fdHashMap to get the frequency differential for that word.
The word table is created in the initialize() method and filled in using the buildWordTable() method. For the reference EST, buildWordTable() is called in the setReferenceEST() method, meaning it does not need to be rebuilt every time we analyze a new comparison sequence.
Definition at line 327 of file TwoPassD2.h.
Referenced by initialize(), runD2Asymmetric(), runD2Bounded(), setReferenceEST(), TwoPassD2(), and ~TwoPassD2().
int* TwoPassD2::s2WordTable [protected] |
Instance variable that maps an index in the comparison sequence (sequence s2) to the hash of a word.
This hash is used as an index in the fdHashMap to get the frequency differential for that word.
The word table is created in the initialize() method and filled in using the buildWordTable() method. For the comparison EST, buildWordTable() must be called in the analyze() method because a new comparison sequence is given every time analyze() is called.
Definition at line 340 of file TwoPassD2.h.
Referenced by getMetric(), initialize(), runD2Asymmetric(), runD2Bounded(), TwoPassD2(), and ~TwoPassD2().
int TwoPassD2::sq1Len [private] |
Length of sequence 1, or the reference sequence.
Stored to avoid making multiple unnecessary strlen() calls.
Definition at line 498 of file TwoPassD2.h.
Referenced by getMetric(), runD2Asymmetric(), runD2Bounded(), setReferenceEST(), and updateParameters().
int TwoPassD2::sq2Len [private] |
Length of sequence 2, or the comparison sequence.
Stored to avoid making multiple unnecessary strlen() calls.
Definition at line 503 of file TwoPassD2.h.
Referenced by getMetric(), runD2Asymmetric(), runD2Bounded(), and updateParameters().
int TwoPassD2::threshold [protected] |
The threshold score to be used for clustering.
note: need to disambiguate from the two scores above; and possibly minThreshold should be made non-static for consistency
Definition at line 411 of file TwoPassD2.h.
Referenced by getMetric(), TwoPassD2(), and updateParameters().