EST Analyzer that uses the D2 algorithm to compute distances between two ESTs. More...
#include <D2.h>
Public Member Functions | |
virtual | ~D2 () |
The destructor. | |
virtual void | showArguments (std::ostream &os) |
Display valid command line arguments for this analyzer. | |
virtual bool | parseArguments (int &argc, char **argv) |
Process command line arguments. | |
int | initialize () |
Method to begin EST analysis. | |
virtual std::string | getName () const |
Method to obtain human-readable name for this EST analyzer. | |
virtual int | setReferenceEST (const int estIdx) |
Set the reference EST id for analysis. | |
virtual int | analyze () |
Method to perform exhaustive EST analysis. | |
Protected Member Functions | |
virtual float | getMetric (const int otherEST) |
Analyze and obtain a distance metric. | |
bool | compareMetrics (const float metric1, const float metric2) const |
Method to compare two metrics generated by this class. | |
float | getInvalidMetric () const |
Obtain an invalid (or the worst) metric generated by this analyzer. | |
virtual bool | getAlignmentData (int &alignmentData) |
Get alignment data for the previous call to analyze method. | |
bool | isDistanceMetric () const |
Determine if this EST analyzer provides distance metrics or similarity metrics. | |
template<typename Encoder > | |
void | buildWordTable (std::vector< int > &wordTable, const char *estSeq, Encoder encoder) |
Helper method to create a word table. | |
Protected Attributes | |
std::vector< int > | s1WordTable |
Instance variable that maps an index in the reference sequence (sequence s1) to the hash of a word. | |
std::vector< int > | s2WordTable |
Instance variable that maps an index in the comparison sequence (sequence s2) to the hash of a word. | |
int * | delta |
Array to track the delta values in the core D2 algorithm. | |
Static Protected Attributes | |
static int | frameShift = 1 |
Parameter to define number of characters to shift the frame on the reference sequence, when computing D2. | |
static int | bitShift = 0 |
Instance variable to store the number of bits to be shifted to create hash values. | |
static int | threshold = 0 |
The threshold score below which two ESTs are considered sufficiently similar to be clustered. | |
Private Member Functions | |
D2 (const int refESTidx, const std::string &outputFileName) | |
void | updateWindow (const int wordIn, const int wordOut, int &score, int &minScore) |
Helper method to update the scores based on a sliding window. | |
float | runD2 (const int otherEST) |
The core method that run's the D2 algorithm. | |
Private Attributes | |
int | alignmentMetric |
Instance variable to track alignment metric computed by the analyze() method. | |
int | numWordsInWindow |
Instance variable to track the number of words (of wordSize ) that can fit into a window (of frameSize ). | |
const std::string | hintKey |
The hint key that is used to add hint for normal or reverse-complement D2 computation. | |
Static Private Attributes | |
static arg_parser::arg_record | argsList [] |
The set of arguments specific to the D2 algorithm. | |
static int | BitMask = 0 |
The bitmask to be used when build hashing values. | |
Friends | |
class | ESTAnalyzerFactory |
EST Analyzer that uses the D2 algorithm to compute distances between two ESTs.
This analyzer provides the mechanism to use D2 algorithm to compute the distance values between a pair of ESTs. The D2 implementation has been adapted purely from the implementations of WCD, Zimmerman, and CLU.
Definition at line 61 of file D2.h.
D2::~D2 | ( | ) | [virtual] |
D2::D2 | ( | const int | refESTidx, | |
const std::string & | outputFileName | |||
) | [private] |
Definition at line 64 of file D2.cpp.
References alignmentMetric, and delta.
virtual int D2::analyze | ( | ) | [inline, virtual] |
Method to perform exhaustive EST analysis.
This method is used to perform the core tasks of comparing all ESTs to one another for full analysis of ESTs. This is an additional feature of PEACE that is not used for clustering but just doing an offline analysis. Currently, this method merely calls the corresponding base class implementation that performs all the necessary operations.
Reimplemented from FWAnalyzer.
void D2::buildWordTable | ( | std::vector< int > & | wordTable, | |
const char * | estSeq, | |||
Encoder | encoder | |||
) | [inline, protected] |
Helper method to create a word table.
Creates a "word table" mapping integer indices to integer hashes of words, in effect translating the sequence from a sequence of characters to a sequence of n-words (where n = wordSize).
encoder | The encoder class to be used for building the word table. | |
[out] | wordTable | The word table to be populated with with hash values. |
[in] | estSeq | The sequence of base pairs associated with this EST that must be converted to hash values. |
Definition at line 289 of file D2.h.
References FWAnalyzer::frameSize, EST::getMaxESTLen(), and FWAnalyzer::wordSize.
Referenced by runD2(), and setReferenceEST().
bool D2::compareMetrics | ( | const float | metric1, | |
const float | metric2 | |||
) | const [inline, protected, virtual] |
Method to compare two metrics generated by this class.
This method provides the interface for comparing metrics generated by this ESTAnalyzer when comparing two different ESTs. This method returns true
if metric1
is comparatively better than or equal to metric2
.
[in] | metric1 | The first metric to be compared against. |
[in] | metric2 | The second metric to be compared against. |
true
if metric1
is comparatively better than metric2
. Reimplemented from ESTAnalyzer.
bool D2::getAlignmentData | ( | int & | alignmentData | ) | [protected, virtual] |
Get alignment data for the previous call to analyze method.
This method can be used to obtain alignment data that was obtained typically as an byproduct of the previous call to the analyze() method. This method essentially returns the difference between the two windows that provided the minimum d2 distance value.
[out] | alignmentData | The parameter is updated to the alignment information generated as a part of the the immediately preceding analyze(const int) method call is returned in the parameter. |
true
to indicate that alignment data is computed by this ESTAnalyzer. Reimplemented from ESTAnalyzer.
Definition at line 272 of file D2.cpp.
References alignmentMetric.
float D2::getInvalidMetric | ( | ) | const [inline, protected, virtual] |
Obtain an invalid (or the worst) metric generated by this analyzer.
This method can be used to obtain an invalid metric value for this analyzer. This value can be used to initialize metric values.
Reimplemented from ESTAnalyzer.
Definition at line 237 of file D2.h.
References FWAnalyzer::frameSize.
float D2::getMetric | ( | const int | otherEST | ) | [protected, virtual] |
Analyze and obtain a distance metric.
This method can be used to compare a given EST with the reference EST (set via the call to the setReferenceEST()) method.
[in] | otherEST | The index (zero based) of the EST with which the reference EST is to be compared. |
Reimplemented from FWAnalyzer.
Definition at line 144 of file D2.cpp.
References EST::getESTCount(), ESTAnalyzer::refESTidx, runD2(), and VALIDATE.
virtual std::string D2::getName | ( | ) | const [inline, virtual] |
Method to obtain human-readable name for this EST analyzer.
This method provides a human-readable string identifying the EST analyzer. This string is typically used for display/debugging purposes (particularly via the PEACE Interactive Console).
Implements ESTAnalyzer.
int D2::initialize | ( | ) | [virtual] |
Method to begin EST analysis.
This method is invoked just before commencement of EST analysis. This method first invokes the base class's initialize
method that initializes the heuristic chain, if a chain has been set for this analyzer. It then initializes memory for the word tables used to store hash values for the pairs of ESTs to be analyzed.
Reimplemented from FWAnalyzer.
Definition at line 100 of file D2.cpp.
References BitMask, bitShift, delta, FWAnalyzer::frameSize, numWordsInWindow, and FWAnalyzer::wordSize.
bool D2::isDistanceMetric | ( | ) | const [inline, protected, virtual] |
Determine if this EST analyzer provides distance metrics or similarity metrics.
This method can be used to determine if this EST analyzer provides distance metrics or similarity metrics. If this method returns true
, then this EST analyzer returns distance metrics (smaller is better). On the other hand, if this method returns false
, then this EST analyzer returns similarity metrics (bigger is better).
true
to indicate that this EST analyzer operates using distance metrics. Reimplemented from ESTAnalyzer.
bool D2::parseArguments | ( | int & | argc, | |
char ** | argv | |||
) | [virtual] |
Process command line arguments.
This method is used to process command line arguments specific to this EST analyzer. This method is typically used from the main
method just after the EST analyzer has been instantiated. This method consumes all valid command line arguments. If the command line arguments were valid and successfully processed, then this method returns true
.
Currently, this EST analyzer accepts an optional frameShift
value to control the stride used for D2 algorithm. However, the default value of 1 for frameShift
is the preferred value for this parameter.
ESTAnalyzer
base class requires that derived EST analyzer classes must override this method to process any command line arguments that are custom to their operation. However, as per API requirement, this methodcalls the corresponding base class implementation to display additional options.[in,out] | argc | The number of command line arguments to be processed. |
[in,out] | argv | The array of command line arguments. |
true
if the command line arguments were successfully processed. Otherwise this method returns false
. This method checks to ensure that a valid frame size and a valid word size have been specified. Reimplemented from FWAnalyzer.
Definition at line 85 of file D2.cpp.
References ESTAnalyzer::analyzerName, arg_parser::check_args(), and frameShift.
float D2::runD2 | ( | const int | otherEST | ) | [private] |
The core method that run's the D2 algorithm.
This method performs the core D2 analysis. This method is invoked from the getMetric() method to run the core D2 algorithm. This method operates as follows:
If a heuristic chain has been set, then this method obtains a hint from the chain to decide if the normal or reverse complement analysis must be performed. The default is normal analysis.
Next, a hash table of words is built for the otherEST sequence via a call to the buildWordTable method.
Next, it computes the score using the first two windows.
Finally, for each window in the reference EST it iterates over the windows in the other EST (sliding windows in each inner iteration) to compute the minimum d2 score by calling the updateWindow() method.
It finally returns the minimum d2 score recorded.
[in] | otherEST | The index of the other EST to be analyzed by this method. |
Definition at line 160 of file D2.cpp.
References buildWordTable(), ESTAnalyzer::chain, delta, FWAnalyzer::frameSize, EST::getEST(), ESTAnalyzer::getHeuristicChain(), HeuristicChain::getHint(), EST::getSequence(), numWordsInWindow, s1WordTable, s2WordTable, threshold, updateWindow(), and FWAnalyzer::wordSize.
Referenced by getMetric().
int D2::setReferenceEST | ( | const int | estIdx | ) | [virtual] |
Set the reference EST id for analysis.
This method is invoked just before a batch of ESTs are analyzed via a call to the analyze(EST *) method. This method currently saves the index in the refESTidx
instance variable for further look up. Next it creates a "word table" (via a call to the buildWordTable
method) mapping integer indices to integer hashes of words, in effect translating the sequence from a sequence of characters to a sequence of n-words (where n = wordSize). This word table is kept until the reference EST is changed, which reduces overhead.
initialize()
method is called.true
if the estIdx was within the given range of values. Otherwise this method returns a non-zero value as the error code. Reimplemented from FWAnalyzer.
Definition at line 124 of file D2.cpp.
References buildWordTable(), ESTAnalyzer::chain, EST::getEST(), EST::getESTCount(), EST::getSequence(), ESTAnalyzer::refESTidx, s1WordTable, and HeuristicChain::setReferenceEST().
void D2::showArguments | ( | std::ostream & | os | ) | [virtual] |
Display valid command line arguments for this analyzer.
This method must be used to display all valid command line options that are supported by this analyzer. Currently, this analyzer does not require any special command line parameters.
[out] | os | The output stream to which the valid command line arguments must be written. |
Reimplemented from FWAnalyzer.
void D2::updateWindow | ( | const int | wordIn, | |
const int | wordOut, | |||
int & | score, | |||
int & | minScore | |||
) | [inline, private] |
Helper method to update the scores based on a sliding window.
This method is invoked from several different spots from the runD2() method to update the d2 scores as the window slides across the two sequences being analyzed. The hash values of the words moving into and out of the window are used to update the scores.
[in] | wordIn | The hash value of the word moving into the window. |
[in] | wordOut | The hash value of the word moving out of the window. |
[in,out] | score | The current running score for this window. This value is updated using the delta array. |
[in,out] | minScore | The current minimum score. This value is updated after the score is updated to reflect the minimum of score and minScore. |
Definition at line 483 of file D2.h.
References delta.
Referenced by runD2().
friend class ESTAnalyzerFactory [friend] |
int D2::alignmentMetric [private] |
Instance variable to track alignment metric computed by the analyze() method.
This instance variable is used to hold the alignment metric that was computed in the previous analyze
method call. By default this value is set to zero. The alignment metric is computed as the difference in the window positions (on the two ESTs being analyzed) with the minimum d2 distance.
Definition at line 437 of file D2.h.
Referenced by D2(), and getAlignmentData().
arg_parser::arg_record D2::argsList [static, private] |
{ {"--frameShift", "Frame Shift (default=1)", &D2::frameShift, arg_parser::INTEGER}, {"--d2Threshold", "Threshold score to break out of D2 (default=0)", &D2::threshold, arg_parser::INTEGER}, {NULL, NULL, NULL, arg_parser::BOOLEAN} }
The set of arguments specific to the D2 algorithm.
This instance variable contains a static list of arguments that are specific only to the D2 analyzer class. This argument list is statically defined and shared by all instances of this class.
int D2::BitMask = 0 [static, private] |
The bitmask to be used when build hashing values.
This bitmask is used to drop higher order bits when building hash values for wordSize
(defined in base class) base pairs in a given EST. This instance variable is actually passed on to the ESTCodec::NormalEncoder or ESTCodec::RevCompEncoder when computing hash values. Since this is value is passed in a template parameter, it is defined to be static (to ensure that it has external linkage as per the ISO/ANSI standards requirement).
Definition at line 450 of file D2.h.
Referenced by initialize().
int D2::bitShift = 0 [static, protected] |
Instance variable to store the number of bits to be shifted to create hash values.
This instance variable is set to the value of 2 * (wordSize - 1) (in the initialize
method) to reflect the number of bits that need to be shifted in order to build the hash values for common words (including the values stored in s1WordMap
and s1RCWordMap
).
This instance variable is actually passed on to the ESTCodec::NormalEncoder or ESTCodec::RevCompEncoder when computing hash values. Since this is value is passed in a template parameter, it is defined to be static (to ensure that it has external linkage as per the ISO/ANSI standards requirement).
Definition at line 380 of file D2.h.
Referenced by initialize().
int* D2::delta [protected] |
Array to track the delta values in the core D2 algorithm.
This array is initialized to point to an array of size 4wordSize. This array is used to track the delta values generated in the core D2 algorithm.
Definition at line 347 of file D2.h.
Referenced by D2(), initialize(), runD2(), updateWindow(), and ~D2().
int D2::frameShift = 1 [static, protected] |
Parameter to define number of characters to shift the frame on the reference sequence, when computing D2.
This parameter is used to enable D2-asymmetric behavior. The default value is 1, which means D2 symmetric: all frames in both sequences will be compared. Higher values mean that the algorithm will shift by more than one character when shifting the frame on the reference sequence, resulting in fewer computations but a possible loss of accuracy from not comparing every frame in both sequences.
Definition at line 362 of file D2.h.
Referenced by parseArguments().
const std::string D2::hintKey [private] |
int D2::numWordsInWindow [private] |
Instance variable to track the number of words (of wordSize
) that can fit into a window (of frameSize
).
This instance variable is set in the initialize() method and its value is used by various methods in this class. Rather than computing this value repeatedly, it is computed once and used throughout.
Definition at line 460 of file D2.h.
Referenced by initialize(), and runD2().
std::vector<int> D2::s1WordTable [protected] |
Instance variable that maps an index in the reference sequence (sequence s1) to the hash of a word.
This hash can then be used as an index in the delta array to get the frequency differential for that word.
The word table is created in the initialize() method and filled in using the buildWordTable() method. For the reference EST, buildWordTable() is called in the setReferenceEST() method, meaning it does not need to be rebuilt every time we analyze a new comparison sequence.
Definition at line 326 of file D2.h.
Referenced by runD2(), and setReferenceEST().
std::vector<int> D2::s2WordTable [protected] |
Instance variable that maps an index in the comparison sequence (sequence s2) to the hash of a word.
This hash is used as an index in the delta array to get the frequency differential for that word.
The word table is created in the initialize() method and filled in using the buildWordTable() method. For the comparison EST, buildWordTable() must be called in the analyze() method because a new comparison sequence is given every time analyze() is called.
Definition at line 339 of file D2.h.
Referenced by runD2().
int D2::threshold = 0 [static, protected] |
The threshold score below which two ESTs are considered sufficiently similar to be clustered.
This instance variable tracks the threshold value to be used to break out of the core D2 loop. This value essentially represents the D2 score below which two ESTs are considered sufficiently similar. Currently, the default threshold value is set to 40. This value can be overridden by the user with the "--threshold" command line argument.
Definition at line 392 of file D2.h.
Referenced by runD2().