A Minimum Spanning Tree (MST) based parallel cluster maker that uses conditional-transitivity relations to accelerate clustering. More...
#include <TransMSTClusterMaker.h>
Public Member Functions | |
virtual void | displayStats (std::ostream &os) |
Method to display performance statistics. | |
virtual | ~TransMSTClusterMaker () |
The destructor. | |
Protected Member Functions | |
virtual int | initialize () |
A method to handle initialization tasks for the TransMSTClusterMaker. | |
virtual float | analyze (const int otherEST) |
Helper method to call the actual heavy-weight analysis method(s). | |
virtual void | populateCache (const int estIdx, SMList *metricList=NULL) |
Computes sends/receives similarity list for a given EST. | |
TransMSTClusterMaker (ESTAnalyzer *analyzer, const int refESTidx, const std::string &outputFile) | |
The default constructor. | |
void | pruneMetricEntries (const SMList &list, SMList &entries) |
Helper method to remove invalid metrics from a given list of values. | |
void | processMetricList (SMList &metricList) |
Helper method to process a list of similarity metrics for caching. | |
Private Attributes | |
MetricCacheMap | metricCache |
An hash map that acts as the cache to hold known metrics. | |
int | analyzeCount |
Instance variable to track total number of calls for analysis. | |
int | successCount |
Instance variable to track number of times transitivity was successfully applied. | |
int | currRefESTidx |
The current reference EST for which analysis is underway. | |
Static Private Attributes | |
static float | badMetric = 0 |
Friends | |
class | ClusterMakerFactory |
A Minimum Spanning Tree (MST) based parallel cluster maker that uses conditional-transitivity relations to accelerate clustering.
This class aims to enhance the performance (without signficiantly impacting quality of clusters) of the standard MSTClusterMaker by minimizing the number of the heavy weight analysis (such as: d2 or clu) performed for building the MST. Reduction in the number of required heavy weight analyses is achieved by applying the concept of transitivity in a conditional manner as follows:
Given three ESTs, say e1, e2, and e2, assume heavy weight metrics m12 (between e1 & e2 and m23 (between e2 & e3) have been computed.
When the heavy weight metric m13 (between e1 & e3) is required, this class obtains this information in the following manner:
First a heuristic (such as: the t/v heuiristic) is applied to determine if the pair e1 & e3 meet the stipulated relationship condition. If not the ESTs are unrelated and conditional-transitivity is not applicable. Therefore, an invalid large distance value is assumed.
However, if the pair of ESTs pass the heuristic test then they are related and conditional-transitivity is applicable. In this case the heavy weight metric is deduced using a binary function f(m12, m23). Currently, the default implementation for f is std::max()
method that selects the maximum of m12 and m23
In order to rapidly retrieve pertinent metrics (or identify the lack of existing metrics), this class maintains a distributed set of hash maps in the transCache
instance variable. Each entry in the transCache
hash map is a TransCacheEntry object and corresponds to a given EST (identified by its index). The TransCacheEntry in-turn has a hash map of CachedESTInfo objects that provide all the necessary information to determine metrics via transitivity.
Definition at line 107 of file TransMSTClusterMaker.h.
TransMSTClusterMaker::~TransMSTClusterMaker | ( | ) | [virtual] |
The destructor.
The destructor frees up all any dynamic memory allocated by this object for its operations.
Definition at line 57 of file TransMSTClusterMaker.cpp.
References MSTClusterMaker::getOwnedESTidx(), and metricCache.
TransMSTClusterMaker::TransMSTClusterMaker | ( | ESTAnalyzer * | analyzer, | |
const int | refESTidx, | |||
const std::string & | outputFile | |||
) | [protected] |
The default constructor.
The default constructor for this class. The constructor is made protected so that this class cannot be directly instantiated. However, since the ClusterMakerFactory is a friend of this class, an object can be instantiated via the ClusterMakerFactory::create() method.
[in,out] | analyzer | The EST analyzer to be used for obtaining similarity metrics between two ESTs. This parameter is simply passed onto the base class. |
[in] | refESTidx | The reference EST index value to be used to root the spanning tree created by this method. This parameter should be >= 0. This value is simply passed onto the base class. |
[in] | outputFile | The name of the output file to which the raw MST cluster information is to be written. If this parameter is the empty string then output is written to standard output. This value is simply passed onto the base class. |
Definition at line 44 of file TransMSTClusterMaker.cpp.
References analyzeCount, badMetric, currRefESTidx, ESTAnalyzer::getInvalidMetric(), and successCount.
float TransMSTClusterMaker::analyze | ( | const int | otherEST | ) | [protected, virtual] |
Helper method to call the actual heavy-weight analysis method(s).
This method overrides the default implementation in the base-class before the actual call to the heavy weight analyzer is made. This method attempts to use conditional-transitivity to try and minimize the number of calls to the heavy weight analyzer. This method operates in the following manner:
[in] | otherEST | The index of the other EST to which the metric is required. |
Reimplemented from MSTClusterMaker.
Definition at line 96 of file TransMSTClusterMaker.cpp.
References ESTAnalyzer::analyze(), analyzeCount, ClusterMaker::analyzer, ASSERT, badMetric, ESTAnalyzer::compareMetrics(), currRefESTidx, TransCacheEntry::getMetric(), metricCache, and successCount.
void TransMSTClusterMaker::displayStats | ( | std::ostream & | os | ) | [virtual] |
Method to display performance statistics.
This method overrides the empty implementation in the base class to display additional statistics on conditional-transitivity application.
[out] | os | The output stream to which the statistics must be written. |
Reimplemented from MSTClusterMaker.
Definition at line 68 of file TransMSTClusterMaker.cpp.
References analyzeCount, and successCount.
int TransMSTClusterMaker::initialize | ( | ) | [protected, virtual] |
A method to handle initialization tasks for the TransMSTClusterMaker.
This method is called after the ESTs have been loaded into the ESTAnalyzer and is used to initialize the vector of TransCacheEntrys, as well as to check for the existence of a heuristic chain on the analyzer. Without a suitable heuristic, the TransMSTClusterMaker cannot function properly and will exit with an error code.
Reimplemented from MSTClusterMaker.
Definition at line 77 of file TransMSTClusterMaker.cpp.
References ClusterMaker::analyzer, ERROR_NO_HEURISTIC, EST::getESTList(), ESTAnalyzer::getHeuristicChain(), metricCache, and NO_ERROR.
void TransMSTClusterMaker::populateCache | ( | const int | estIdx, | |
SMList * | metricList = NULL | |||
) | [protected, virtual] |
Computes sends/receives similarity list for a given EST.
This method overrides the default implementation in the base class that performs the core tasks of: (i) computing similarity metrics, (ii) gathering metrics from various processes to obtain the highest set of similarity metrics, and (iii) indicating to the manager that similarity metric calculations have been completed.
This method overrides the default implementation and performs the following tasks:
It calls the base class method to perform all the tasks in the usual manner.
Next the process that owns estIdx
broadcasts the cached metrics to all the other processes. Each process in-turn waits to receive the cached metrics from the owner process.
Then each process extracts and populates metrics for conditional-transitivity calculation by adding/updating appropriate entries to the transCache.
[in] | estIdx | The index of the EST that was just added to the MST and for which the adjacent neighbors need to be determined. |
[out] | metricList | If this pointer is not NULL, then this vector is populated with the set of metrics that were computed for estIdx only on the owner process. This list contains the metrics collated from all the processes participating in the distributed computing process. Currently, this feature is used by TransMSTClusterMaker to obtain the list of metrics computed. |
Reimplemented from MSTClusterMaker.
Definition at line 143 of file TransMSTClusterMaker.cpp.
References currRefESTidx, MSTClusterMaker::getOwnerProcess(), metricCache, MPI_GET_RANK, MPI_GET_SIZE, MPI_PROBE, MPI_RECV, MPI_SEND, MPI_STATUS, MPI_TYPE_CHAR, processMetricList(), pruneMetricEntries(), TRACK_IDLE_TIME, and MSTClusterMaker::TRANSITIVITY_LIST.
void TransMSTClusterMaker::processMetricList | ( | SMList & | metricList | ) | [protected] |
Helper method to process a list of similarity metrics for caching.
This method is called in the populateCache method and will take care of a list of similarity metrics, either computed locally or received via remote communication. This method goes through the list of similarity metrics and updates TransCacheEntrys as appropriate.
[in] | metricList | The list of metrics received or computed. |
Definition at line 204 of file TransMSTClusterMaker.cpp.
References TransCacheEntry::addEntries(), MSTClusterMaker::getOwnedESTidx(), and metricCache.
Referenced by populateCache().
Helper method to remove invalid metrics from a given list of values.
This is a helper method that is invoked from populateCache() method to remove invalid/unwanted entries from a given list of metrics. This method iterates over the entries in list
and removes all entires that have their metric set to a value that is worse than badMetric value.
[in] | list | The list of metrics that need to be pruned. |
[out] | entries | The list of entries that are better than badMetric. |
Definition at line 122 of file TransMSTClusterMaker.cpp.
References ClusterMaker::analyzer, badMetric, and ESTAnalyzer::compareMetrics().
Referenced by populateCache().
friend class ClusterMakerFactory [friend] |
Reimplemented from MSTClusterMaker.
Definition at line 108 of file TransMSTClusterMaker.h.
int TransMSTClusterMaker::analyzeCount [private] |
Instance variable to track total number of calls for analysis.
This instance variable tracks the number of times the analyze() method was invoked. This instance variable essentially tracks the number of times the heavy-weight analysis method would have been invoked if conditional-transitivity was not applied. This instance variable is set to zero in the constructor, incremented in the analyze() method, and displayed by the displayStats() method.
Definition at line 292 of file TransMSTClusterMaker.h.
Referenced by analyze(), displayStats(), and TransMSTClusterMaker().
float TransMSTClusterMaker::badMetric = 0 [static, private] |
Definition at line 318 of file TransMSTClusterMaker.h.
Referenced by analyze(), pruneMetricEntries(), and TransMSTClusterMaker().
int TransMSTClusterMaker::currRefESTidx [private] |
The current reference EST for which analysis is underway.
This instance variable tracks the index of the reference EST for which we are trying to figure out the closest neighbor in the MST. This instance variable is set in the populateCache() method and is used in the analyze method. Note that the analyze method is actually called from within MSTClusterMaker::populateCache() (the base class implementation).
Definition at line 316 of file TransMSTClusterMaker.h.
Referenced by analyze(), populateCache(), and TransMSTClusterMaker().
An hash map that acts as the cache to hold known metrics.
This instance variable holds a cache of the previously computed metrics in a from that enables rapid application of conditional-transitivity to deduce metrics for related ESTs. The hash key for this hash map is the index of the reference EST for which metrics are to be computed. That is, given two ESTs ei and ej, metricCache[i].getMetric(j, metric) provides an estimate of metric assuming ei and ej pass t/v heuristic.
Entries in this cache are primarily added by the populateCache() method and entries are used by the analyze() method.
Definition at line 280 of file TransMSTClusterMaker.h.
Referenced by analyze(), initialize(), populateCache(), processMetricList(), and ~TransMSTClusterMaker().
int TransMSTClusterMaker::successCount [private] |
Instance variable to track number of times transitivity was successfully applied.
This instance variable tracks the number of times transitivity was successfully applied. This number indices how effective transitivity approach was for a given set of ESTs. This instance variable is set to zero in the constructor, incremented in the analyze() method, and displayed by the displayStats() method.
Definition at line 304 of file TransMSTClusterMaker.h.
Referenced by analyze(), displayStats(), and TransMSTClusterMaker().