Knowledge Discovery & Web Mining Lab, University of Louisville

Stream Clustering Algorithms in Mixed Domains with Soft Two-way Semi-Supervision


This project is supported by the National Science Foundation under the Data Intensive Computation grant NSF IIS-0916489.

Goals, Objectives and Targeted Activities

Motivations and Background

   One way to form a model of massive data sets is to use clustering techniques that summarize the data by several cluster representatives. However, clustering huge data sets is a very challenging problem, and the difficulty increases further when the data is dynamic. We propose to develop scalable and robust stream summarization methods that provide a concise summary of huge multi-dimensional data streams, keep track of each discovered cluster or component of the summary through time, and store only milestones corresponding to significant changes in these cluster representatives. Moreover, to handle diverse data formats and different sources of data, we propose a semi-supervised framework for (i) combining diverse representations of the data, in particular when the data comes from different sources, some of which may be unreliable or uncertain, and (ii) exploiting optional external concept set labels to guide the clustering of the main data set in its original domain.

Goals

  This project aims to develop a new framework for learning synopses in evolving data streams. The synopses are based on analytical learning strategies that derive their strength from the robustness and speed of statistical and mathematical analysis. Furthermore, the project aims to develop a new semi-supervised framework for clustering data with mixed data types or diverse representations. Our goals are detailed below:

   1. Mining Evolving Data Streams: As data is presented in a stream, it is processed sequentially in a single pass over the stream. A stream synopsis is learned in a continuous fashion, and consists of a set of synopsis nodes or clusters that offer a concise summary of the data stream. The size of the stream synopsis is constrained not to exceed a maximum that is predefined depending on the application, and newer parts of the data stream are given preference in occupying synopsis nodes.

   2. Higher-level Exploratory Analysis: We propose a method that keeps track of each discovered node or cluster of the stream's summary/synopsis through time and stores only milestones corresponding to significant changes. Instead of storing an unbounded number of summaries of the stream (one at each instant), only temporally salient synopsis snapshots of the stream will be stored to disk when significant changes have occurred, together with a model of the change between consecutive salient snapshots.

   3. Semi-supervised framework for combining diverse domains: To handle diverse data formats and different sources of data, we propose a semi-supervised framework for (i) combining diverse representations of the data (x), e.g. when each data record x is composed of two parts, a transactional part (x^{1}) and a numerical part (x^{2}), and (ii) exploiting optional external concept set labels (x^{c}) to guide the clustering of the main data in its original domain (x).
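
To make goals 1 and 2 more concrete, the following is a minimal Python sketch of a single-pass synopsis with a bounded number of nodes, exponential forgetting of older data, and milestone snapshots stored only on significant change. It is an illustration under stated assumptions, not the project's algorithms: the class names `Node` and `StreamSynopsis`, the parameters `radius`, `half_life` and `change_tol`, the Euclidean distance, and the eviction rule are all hypothetical choices made for this example.

```python
import math
from itertools import count


class Node:
    """One synopsis node: a cluster representative with a decaying weight."""
    _ids = count()

    def __init__(self, point, t):
        self.id = next(Node._ids)
        self.center = [float(v) for v in point]
        self.weight = 1.0
        self.last_update = t


class StreamSynopsis:
    """Bounded, single-pass summary of a stream (illustrative assumptions only)."""

    def __init__(self, max_nodes, radius, half_life, change_tol):
        self.max_nodes = max_nodes    # hard limit on the synopsis size
        self.radius = radius          # a point within this distance joins a node
        self.half_life = half_life    # time after which a node's weight halves
        self.change_tol = change_tol  # center shift that counts as significant
        self.nodes = []
        self.milestones = []          # list of (time, {node_id: center})

    def _decayed(self, node, t):
        # Exponential forgetting: older evidence counts less.
        return node.weight * 0.5 ** ((t - node.last_update) / self.half_life)

    def update(self, point, t):
        """Process one point exactly once, then record a milestone if needed."""
        nearest = min(self.nodes,
                      key=lambda n: math.dist(n.center, point),
                      default=None)
        if nearest is not None and math.dist(nearest.center, point) <= self.radius:
            # Absorb the point with a weighted running mean.
            w = self._decayed(nearest, t)
            nearest.center = [(w * c + p) / (w + 1.0)
                              for c, p in zip(nearest.center, point)]
            nearest.weight = w + 1.0
            nearest.last_update = t
        else:
            if len(self.nodes) >= self.max_nodes:
                # Evict the node with the smallest decayed weight, which
                # favors nodes representing newer parts of the stream.
                stalest = min(self.nodes, key=lambda n: self._decayed(n, t))
                self.nodes.remove(stalest)
            self.nodes.append(Node(point, t))
        self._maybe_store_milestone(t)

    def _maybe_store_milestone(self, t):
        current = {n.id: list(n.center) for n in self.nodes}
        if self.milestones:
            last = self.milestones[-1][1]
            unchanged = last.keys() == current.keys() and all(
                math.dist(last[i], current[i]) <= self.change_tol
                for i in current)
            if unchanged:
                return  # no significant change since the last snapshot
        self.milestones.append((t, current))


# Usage: two early points form one node; much later points replace it.
synopsis = StreamSynopsis(max_nodes=2, radius=1.0, half_life=10.0, change_tol=0.5)
for t, point in [(0, (0.0, 0.0)), (1, (0.2, 0.0)),
                 (100, (10.0, 10.0)), (101, (20.0, 20.0))]:
    synopsis.update(point, t)
```

In this sketch a snapshot is taken only when a node appears, disappears, or drifts by more than `change_tol`, so the small drift at time 1 is not stored. A fuller version would also store a model of the change between consecutive snapshots, which is omitted here.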

Activities

1) We have designed, implemented and validated the first component in the proposal:

- RINO-Streams: Robust clustering of data streams using Incremental Optimization.

We have conducted experiments to evaluate the performance of RINO-Streams against other density-based clustering algorithms (TRAC-Streams and IncDBSCAN).

2) We have designed, implemented and validated the second component in the proposal:

- Stream-Dashboard: A Framework for Mining, Tracking and Validating Data Stream Clusters.

3) We have conducted extensive experiments to evaluate the performance of Stream-Dashboard. For these experiments, we created different online streaming scenarios that generate noisy data streams with clusters varying in distribution, number, order of arrival of the data points and clusters, volatility, and speed of change in the data stream. We have experimented with large synthetic and real data sets. The real data sets include text, network activity and Twitter social media text data streams.

4) We have proposed an Inter-Domain Supervision (IDS) clustering framework to discover clusters within diverse data formats, mixed-type attributes and different sources of data.

5) We have performed extensive experiments on several publicly available data sets with mixed types of attributes. The data sets include synthetic as well as real data sets.

6) We have proposed a real-life application of our IDS approach to the cluster-based automated image annotation problem and presented evaluation results on a benchmark data set consisting of images described by their visual content along with noisy text descriptions generated by users of the social media sharing website Flickr.

7) We have proposed, together with collaborators from the National University of Colombia, novel techniques based on Non-negative Matrix Factorization (NMF) for handling data with mixed domains, and validated the proposed techniques on the tasks of multimodal data indexing, retrieval and annotation. The two main techniques for mixed data mining are called Mixed NMF and Asymmetric NMF.
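
For readers unfamiliar with NMF, the generic formulation that such techniques build on can be written as below. This is the standard least-squares NMF with multiplicative updates, shown for a matrix that stacks the two parts x^{1} and x^{2} of each record. The stacking, the symbols X, W, H, d_1, d_2, n and k, and the assumption that both parts are non-negative are illustrative choices made here; they are not the definitions of Mixed NMF or Asymmetric NMF.

```latex
% Column j of X^{1} holds x_j^{1}; column j of X^{2} holds x_j^{2}.
% Assumption: both blocks are non-negative (e.g. binary transactions
% and rescaled numerical attributes).
\[
X = \begin{bmatrix} X^{1} \\ X^{2} \end{bmatrix} \approx W H,
\qquad
W \in \mathbb{R}_{\ge 0}^{(d_1 + d_2) \times k},
\quad
H \in \mathbb{R}_{\ge 0}^{k \times n}
\]

% Least-squares objective with non-negativity constraints.
\[
\min_{W \ge 0,\; H \ge 0} \; \lVert X - W H \rVert_F^{2}
\]

% Multiplicative updates; \odot and the fractions are element-wise.
\[
H \leftarrow H \odot \frac{W^{\top} X}{W^{\top} W H},
\qquad
W \leftarrow W \odot \frac{X H^{\top}}{W H H^{\top}}
\]
```

Under this generic formulation, each column of H gives a k-dimensional latent representation shared by both parts of a record, which is what makes a factorization of this kind usable for indexing and retrieval across domains.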
