Knowledge Discovery & Web Mining Lab, University of Louisville

Stream Clustering Algorithms in Mixed Domains with Soft Two-way Semi-Supervision


  Goals, Objectives and Targeted Activities

  Merit and Impact


Area Background

Data sets
Outreach Activities

This project is supported by National Science Foundation under Data Intensive Computation grant NSF IIS-0916489.

Intellectual Merit and Impact within the Discipline

We propose a new framework for learning synopses in evolving data streams. The synopses are based on analytical learning strategies that derive their strength from the robustness and speed of statistical and mathematical analysis. While the theoretical foundation of the proposed methods are generic enough to be applied in a variety of dynamic learning contexts (including monitoring and summarizing sensor streams), we focus on adaptive stream mining systems that can handle massive amounts of data while being able to model and adapt to rapid changes. In particular, in our project, we propose:.

1. Mining evolving data streams: Data is presented in a stream, and is processed sequentially as it arrives, in a single pass over the data stream. A stream synopsis is learned in a continuous fashion. The stream synopsis consists of a set of synopsis nodes or clusters that offer a concise summary of the data stream. Because the data stream is a dynamic source of data, the stream synopsis itself is dynamic, and will change to reflect the status of the current data stream. The stream synopsis is constrained so that its size does not exceed a maximal limit that is predefined depending on the application, and preference will be given to newer parts of the data stream in occupying synopsis nodes that represent them. Obsolete parts of the summary that correspond to older parts of the data stream are purged from the synopsis, and delegated to secondary storage.

2. Higher-level exploratory analysis: Develop mechanisms to keep track of each discovered cluster or component of the data stream's summary through time, and that stores only milestones corresponding to the occurrence of significant changes in these cluster representatives. Instead of storing an infinite number of summaries of the stream (one at each instant), only salient snapshots of the stream synopsis will be stored, and these snapshots will only be stored when significant changes have occurred, together with a model of the change that has occurred in between consecutive salient snapshots. A linear regression model will be built for the amount of drift (or velocity of the centroids) with time for each synopsis node, as well as for the amount of expansion/shrinking of the scale. Each time that a significant change is detected (i.e., a failed goodness of fit test of new velocity data on previous models), the last snapshot will be stored, and the regression model building will restart to build a new model.

3. Semi-supervised framework for combining diverse domains: To handle possibly diverse data formats and different sources of data, we propose a semi-supervised framework for (i) combining diverse representations of the data (x), e.g. when each data record x is composed of two parts: one part transactional: (x^{1}) and one part numerical: (x^{2}), (ii) exploiting optional external concept set labels (x^{c}) to guide the clustering of the main data set in its original domain (x). Examples of case (i) abound in data with mixed attributes such as network activity data (e.g. the KDD cup data), most existing census or demographic data, environmental data, and other scientific data. In the latter case, an example is solar images captured by various instruments onboard different satellites, some of which have much higher quality and cadence, e.g. the soon to be launched Solar Dynamic Observatory (SDO), while others are either subject to failure, or are simply have less quality data: either lower resolution or slower cadence, e.g. the EIT solar data). Other examples include when data has a time stamp, but is collected from different sources, such as distributed sensors, and where part of the data may be missing or be less certain than the other (e.g. due to failure of one source). In this case, it may be preferable to cluster the most reliable or most complete part of the data, and yet still use the other part when it is available (to guide the clustering of the first part). Examples of case (ii) arise when some (not necessarily all) external annotation x^{c} (Human or machine) is available for some of the data records (x), e.g. tags (x^{c}) for web pages (x), tags (x^{c}) for images (x), search queries (x^{c}) from the REFERRER field in Web server access logs for the Web sessions/clickstreams (x). In the last example, the search queries must have been submitted by the user on an external search engine just prior to visiting the website where the clickstreams are being logged and analyzed. The queries indicate the intention between the user's clickstream in words, and are thus valuable to guide the Web usage mining process when mining clusters or user profiles. Note that because part of the data may be unavailable (e.g. not all user sessions follow a search query), the two sources of information need to be combined in a flexible framework. This is what led us to consider semi-supervised clustering.

4. Potential ultimate uses of proposed algorithm: Because our proposed stream mining algorithm is scalable, runs in one pass, and produces rich (ongoing as well as snapshots/archive) summaries of evolving large data sets, it can be used for (i) data summarization and reduction, (ii) exploratory analysis, (iii) visualization, (iv) anomaly detection, (v) using the generated knowledge to construct new features that capture the high resolution dynamics at the cluster level from the input stream.

Impact on Other Disciplines:

We anticipate our research to affect public welfare through improvements in the ability to mine and especially monitor and visualize massive data streams, in particular, for streams of data concerning health, climate and Earth science data, financial data, etc, since these can have a direct impact on society at large.

Our proposed methods have tremendous impact on applications that deal with streaming data in general, and more specifically on monitoring data streams in real-life dynamic settings. For example, as more and more of the everyday activities moved online, network and Web data has been increasing at a rapid pace that precludes standard and classical data analysis methods, and call instead for real time analysis. The same can be said about the deluge of data that is being or about to be generated by new and future sensor networks, astronomical observatories, and missions in space. Thus, new research efforts and new paradigms are needed and will have a strong impact on our ability to digest and make sense of this data. The proposed techniques for mining evolving data streams and characterizing their dynamic trends can be applied to a variety of other settings that involve massive and dynamic data streams, such as sensor streams, because they can form concise summaries involving only temporally salient stream synopsis snapshots.


Go to back to Knowledge Discovery & Web Mining Lab