Abstract
This paper proposes an algorithm, named HWK-Sets, based on K-Means, suited for
clustering data that are variable-sized sets of elementary items. An example
of such data occurs in the analysis of medical diagnoses, where the goal is to
detect human subjects who share common diseases, possibly so as to predict
future illnesses from previous medical history. Clustering sets is difficult because
data objects do not have numerical attributes and therefore it is not possible
to use the classical Euclidean distance upon which K-Means is normally based.
An adaptation of the Jaccard distance between sets is used, which exploits
application-sensitive information. In particular, the Hartigan and Wong
variation of K-Means is adopted, which can favor the fast attainment of an
accurate solution. The HWK-Sets algorithm can flexibly use various stochastic
seeding techniques. Because a mean is difficult to calculate among the sets
of a cluster, the concept of a medoid is employed as the cluster representative
(centroid), which always remains a data object of the application. The paper
describes the HWK-Sets clustering algorithm and outlines its current
implementation in Java based on parallel streams. After that, the efficiency
and accuracy of the proposed algorithm are demonstrated by applying it to 15
benchmark datasets.
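
To make the two central ideas concrete, here is a minimal Java sketch of a set distance and a medoid computed with a parallel stream. It is an illustration under stated assumptions, not the paper's implementation: it uses the plain Jaccard distance rather than the application-sensitive adaptation mentioned above, it does not include the Hartigan and Wong reassignment step, and the names SetMedoidSketch, jaccardDistance and medoid are hypothetical.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: plain Jaccard distance and medoid selection for a cluster of sets.
final class SetMedoidSketch {

    // Plain Jaccard distance: 1 - |A ∩ B| / |A ∪ B|.
    // Assumption: two empty sets are treated as identical (distance 0).
    static <T> double jaccardDistance(Set<T> a, Set<T> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        // Iterate over the smaller set to count the common items.
        Set<T> smaller = a.size() <= b.size() ? a : b;
        Set<T> larger = (smaller == a) ? b : a;
        long common = smaller.stream().filter(larger::contains).count();
        long union = (long) a.size() + b.size() - common;
        return 1.0 - (double) common / union;
    }

    // Medoid: the member of the cluster whose summed distance to all
    // members is minimal. The candidates are evaluated in parallel.
    static <T> Set<T> medoid(List<Set<T>> cluster) {
        return cluster.parallelStream()
                .min(Comparator.comparingDouble((Set<T> candidate) ->
                        cluster.stream()
                               .mapToDouble(other -> jaccardDistance(candidate, other))
                               .sum()))
                .orElseThrow(() -> new IllegalArgumentException("empty cluster"));
    }

    public static void main(String[] args) {
        // Made-up diagnosis sets, one per subject, for illustration only.
        List<Set<String>> cluster = List.of(
                Set.of("flu", "asthma"),
                Set.of("flu", "asthma", "diabetes"),
                Set.of("flu", "diabetes"),
                Set.of("asthma"));
        System.out.println(medoid(cluster));
    }
}
```

Because the medoid is chosen among the cluster's own members, the representative is always a real data object, which is the property the abstract relies on when a mean of sets is not available.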
Read More About This Article: https://crimsonpublishers.com/oabb/fulltext/OABB.000564.php
Read More About Crimson Publishers: https://crimsonpublishers.com/