Clustering Data Streams:
Clustering is an essential data mining task that aims
to discover the underlying structure of a set of data points,
for example by partitioning the data into groups of similar objects.
The explosion of data collections in the last decade has placed
high demands on clustering algorithms, which must now handle
very large data sets, and this has led to a number of scalable clustering techniques.
More recently, an explosion of applications that generate and analyze
data streams has added new challenges for clustering algorithms:
they must track changing clusters in noisy data streams using
only the new data points, because storing past data is not an option.
Data streams are massive data sets that arrive at such a high throughput
that the data can only be analyzed sequentially and in a single pass.
The patterns that can be discovered in most streams follow dynamic trends,
which sets them apart from traditional data sets that are very large but static.
Such data streams are referred to as evolving data streams. For these reasons,
even techniques that scale to huge data sets may not be the answer for mining evolving data streams.
These techniques are designed to work on the entire data set without distinguishing new data from old data,
and hence cannot be expected to handle the notion of emerging and obsolete patterns. Like their non-stream counterparts,
data streams are not immune to noise and outliers, which are data points that deviate from
the trend set by the majority of the remaining data points. However, handling
outliers while tracking evolving patterns is a tricky requirement that adds a further
burden to the stream mining task because, at least the first time an outlier is detected,
it is not easy to distinguish it from the beginning of a new pattern.
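The requirements above (a single pass, forgetting of obsolete patterns, and caution about outliers) can be illustrated with a minimal sketch. It is not a specific published algorithm: the class name StreamClusterer and the parameters radius, decay, min_weight, and prune_below are hypothetical choices made for illustration. Each point is seen once and discarded, cluster weights decay so that old patterns fade, and a point that is far from every cluster is kept only as a low-weight candidate until later points confirm it.

```python
import math


def dist(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


class StreamClusterer:
    """Single-pass clustering sketch; all parameter names are illustrative."""

    def __init__(self, radius=1.0, decay=0.95, min_weight=2.0, prune_below=0.05):
        self.radius = radius            # assumed distance threshold for joining a cluster
        self.decay = decay              # assumed forgetting factor applied per arriving point
        self.min_weight = min_weight    # weight needed before a cluster counts as a pattern
        self.prune_below = prune_below  # clusters lighter than this are treated as obsolete
        self.clusters = []              # each cluster: {"center": [...], "weight": float}

    def update(self, point):
        # Age every cluster so old patterns fade, then drop the obsolete ones.
        for c in self.clusters:
            c["weight"] *= self.decay
        self.clusters = [c for c in self.clusters if c["weight"] >= self.prune_below]

        nearest = min(self.clusters, key=lambda c: dist(c["center"], point), default=None)
        if nearest is not None and dist(nearest["center"], point) <= self.radius:
            # The point fits an existing cluster: move the center toward it.
            nearest["weight"] += 1.0
            step = 1.0 / nearest["weight"]
            nearest["center"] = [m + step * (x - m)
                                 for m, x in zip(nearest["center"], point)]
            return nearest

        # Far from everything: an outlier or the start of a new pattern.
        # Keep it as a low-weight candidate; only further points can promote it.
        candidate = {"center": list(point), "weight": 1.0}
        self.clusters.append(candidate)
        return candidate

    def established(self):
        """Clusters with enough recent support to be reported as patterns."""
        return [c for c in self.clusters if c["weight"] >= self.min_weight]
```

With these settings, a lone outlier never reaches min_weight and is pruned once its weight has decayed below prune_below, whereas a genuinely new pattern accumulates weight and is reported by established() after a few confirming points. The price is a delay before new patterns are recognized, which reflects the ambiguity described above.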
Learning with both labeled and unlabeled data
is called semi-supervised learning or transductive learning.
It is used mainly to exploit information in unlabeled data to enhance the performance
of a classification model (traditionally trained using only labeled data).
Many semi-supervised algorithms have been proposed,
including co-training, transductive support vector machines,
semi-supervised EM, graph-based approaches,
and clustering-based approaches.
In semi-supervised clustering, labeled data can be used as (1) initial seeds,
(2) constraints, or
(3) feedback.
All these existing approaches rely on model-based clustering,
in which each cluster is represented by its centroid. Seed-based approaches use labeled data only to help initialize cluster centroids,
constrained approaches keep the grouping of labeled data unchanged throughout the clustering process,
and feedback-based approaches first run a regular clustering process and then adjust the resulting clusters
based on labeled data.
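The difference between the seed-based and constrained uses of labeled data can be sketched with a small centroid-based (k-means style) routine. This is an illustrative sketch under simplifying assumptions, not the method of any particular paper: the function name semi_supervised_kmeans and its parameters are hypothetical, one cluster per class label is assumed, and every class is assumed to have at least one labeled point. The feedback-based variant is not shown.

```python
import math


def dist(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def semi_supervised_kmeans(unlabeled, labeled, constrained=False, iters=20):
    """Cluster `unlabeled` points, guided by `labeled` (dict: class label -> points).

    constrained=False: labeled points only seed the initial centroids.
    constrained=True:  labeled points also stay in their own cluster in
                       every iteration, so they keep pulling its centroid.
    """
    labels = sorted(labeled)
    # Seeding: each initial centroid is the mean of one class's labeled points.
    centroids = {k: mean(labeled[k]) for k in labels}
    assign = []
    for _ in range(iters):
        # Assignment step: each unlabeled point goes to its nearest centroid.
        assign = [min(labels, key=lambda k: dist(x, centroids[k])) for x in unlabeled]
        # Update step: recompute each centroid from its current members.
        for k in labels:
            members = [x for x, a in zip(unlabeled, assign) if a == k]
            if constrained:
                members = members + list(labeled[k])  # grouping of labeled data is fixed
            if members:
                centroids[k] = mean(members)
    return centroids, assign
```

For example, calling semi_supervised_kmeans(points, {"a": [(0.1, 0.1)], "b": [(4.9, 5.0)]}) uses the two labeled points only as starting centroids, while passing constrained=True additionally keeps each labeled point in its own cluster when centroids are recomputed.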