Knowledge Discovery & Web Mining Lab, University of Louisville

Stream Clustering Algorithms in Mixed Domains with Soft Two-way Semi-Supervision



This project is supported by the National Science Foundation under Data Intensive Computation grant NSF IIS-0916489.

 Area Background

Clustering Data Streams

   Clustering is an essential data mining task that aims to discover the underlying structure of a set of data points, for example by partitioning the data into groups of similar objects. The explosion of data collections in the last decade has placed high demands on clustering algorithms, which must now handle very large data sets, spurring the development of scalable clustering techniques. More recently, an explosion of applications that generate and analyze data streams has added unprecedented challenges: clustering algorithms must track changing clusters in noisy streams using only the newly arriving data points, because storing past data is not an option. Data streams are massive data sets that arrive at such a high throughput that they can only be analyzed sequentially, in a single pass. The patterns that can be discovered in most streams follow dynamic trends, which distinguishes them from traditional static data sets, however large; such streams are referred to as evolving data streams. For these reasons, even techniques that scale to huge data sets may not be the answer for mining evolving data streams: they strive to work on the entire data set without distinguishing new data from old, and hence cannot be expected to handle the notion of emerging and obsolete patterns. Like their non-stream counterparts, data streams are not immune to noise and outliers, i.e., data points that deviate from the trend set by the majority of the remaining points. However, handling outliers while tracking evolving patterns is a tricky requirement that adds a further burden to the stream mining task because, at least the first time an outlier is detected, it is hard to distinguish it from the beginning of a new pattern.
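To make these constraints concrete, the sketch below (an illustrative assumption, not this project's algorithm; the class name and parameters are hypothetical) processes each point exactly once and keeps only a small set of weighted centroids. Cluster weights decay over time so that obsolete patterns fade, and a point that falls far from every centroid provisionally opens a new cluster, since at that moment it could be either an outlier or the start of an emerging pattern.

```python
import math

class OnlineKMeans:
    """Single-pass clustering sketch for evolving streams (illustrative only).
    Each cluster is a (centroid, weight) pair; weights decay so old data
    gradually loses influence, and distant points seed provisional clusters."""

    def __init__(self, max_clusters=5, decay=0.99, outlier_radius=3.0):
        self.max_clusters = max_clusters    # cap on tracked clusters
        self.decay = decay                  # forgetting factor per arrival
        self.outlier_radius = outlier_radius
        self.centroids = []                 # list of (centroid, weight)

    def _dist(self, a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def update(self, point):
        # Age all clusters: older contributions fade away.
        self.centroids = [(c, w * self.decay) for c, w in self.centroids]
        if not self.centroids:
            self.centroids.append((list(point), 1.0))
            return
        # Find the nearest existing centroid.
        i, (c, w) = min(enumerate(self.centroids),
                        key=lambda t: self._dist(point, t[1][0]))
        if self._dist(point, c) <= self.outlier_radius:
            # Absorb the point: weighted incremental mean update.
            w_new = w + 1.0
            c = [(ci * w + pi) / w_new for ci, pi in zip(c, point)]
            self.centroids[i] = (c, w_new)
        elif len(self.centroids) < self.max_clusters:
            # Far from everything: outlier OR an emerging pattern,
            # so provisionally open a new low-weight cluster.
            self.centroids.append((list(point), 1.0))
        # Otherwise discard the point as noise (capacity reached).
```

Note that the point is never stored: only the centroid summaries survive, which is what makes the single-pass, bounded-memory requirement of stream clustering satisfiable.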

Semi-Supervised Clustering

   Learning with both labeled and unlabeled data is called semi-supervised learning or transductive learning, and is used mainly to exploit the information in unlabeled data to enhance the performance of a classification model (traditionally trained using only labeled data). Many semi-supervised algorithms have been proposed, including co-training, transductive support vector machines, entropy minimization, semi-supervised EM, graph-based approaches, and clustering-based approaches. In semi-supervised clustering, labeled data can be used as (1) initial seeds, (2) constraints, or (3) feedback. All of these existing approaches are based on model-based clustering, where each cluster is represented by its centroid. Seed-based approaches use labeled data only to initialize the cluster centroids; constrained approaches keep the grouping of the labeled data unchanged throughout the clustering process; and feedback-based approaches run a regular clustering process first and then adjust the resulting clusters based on the labeled data.
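A minimal sketch of option (1), seed-based initialization, is shown below, assuming a plain k-means loop (the function name and parameters are illustrative). The labeled seeds are used only to fix the initial centroids; after that, ordinary assignment and update iterations run over all points, labeled and unlabeled alike.

```python
def seeded_kmeans(points, seeds, k, iters=20):
    """Seed-based semi-supervised k-means sketch (illustrative only).
    `seeds` maps each cluster id in range(k) to a list of labeled points,
    which serve solely to initialize that cluster's centroid."""
    # Initialization: each centroid is the mean of its labeled seeds.
    centroids = []
    for c in range(k):
        dim = len(seeds[c][0])
        centroids.append([sum(p[d] for p in seeds[c]) / len(seeds[c])
                          for d in range(dim)])
    for _ in range(iters):
        # Assignment step over all (labeled + unlabeled) points.
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((p[d] - centroids[c][d]) ** 2
                                      for d in range(len(p))))
            groups[j].append(p)
        # Update step: recompute each centroid from its assigned points.
        for c in range(k):
            if groups[c]:
                dim = len(groups[c][0])
                centroids[c] = [sum(p[d] for p in groups[c]) / len(groups[c])
                                for d in range(dim)]
    return centroids
```

A constrained variant would differ only in the assignment step, forcing each labeled point to stay in its given cluster instead of reassigning it by distance.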
