Knowledge Discovery & Web Mining Lab, University of Louisville

 
NSF CAREER: New Clustering Algorithms Based on Robust Estimation and Genetic Niches

with Applications to Web Usage Mining

 
Home


  Goals, Objectives and Targeted Activities

  Selected Developments

 

Area Background

People
Publications
Software
Data sets
Links
Outreach Activities
 

This project is supported by National Science Foundation under grant NSF IIS-0533317

   

Goals, Objectives and Targeted Activities

   Motivations and Background:

      Clustering is an important task in data mining that aims at organizing an otherwise indistinguishable mess of data into several internally homogeneous groups or clusters. However, finding clusters, in real data sets, is a challenging problem because the number of clusters is typically unknown and data can be severely contaminated by noise and outliers. Clustering has many applications.

    In particular, within the context of mining user access patterns on websites, clustering can be used to automatically discover a set of mass user profiles from a mess of anonymous user sessions or clickstreams. Relying on clickstreams as opposed to user ratings is referred to as implicit user modeling, and is generally easy to apply because it does not require any explicit ratings from the users. Instead, users’ interests are captured directly from the trails of activity that they leave behind in Web server logs, as they click through the pages of a website. Each user profile offers a concise model of the items of interest (Web pages) of a group of similar users. For this reason, user profiles can enable an efficient collaborative filtering strategy where a new user’s interests can be predicted from the interests that fall within the interests of a similar profile or group of similar users. This approach that predicts what is interesting to one user based on the interests of a group is very attractive because it allows an anonymous user profiling (hence respecting privacy) and because user profiles constitute a much smaller or summarized knowledge base of the user access patterns as compared based Web personalization strategy to be successful, the clustering process must be accurate and must handle the large proportions of outliers that can permeate real access logs. Also, the final recommendation process must be carefully designed to take advantage of the learned user profiles.

 

  This project include activities that aim at achieving the following goals:

  • Develop robust unsupervised scalable clustering techniques based on Evolutionary computation and Genetic Niching. This includes: 

    • (i) Developing robust statistical estimators to automatically estimate the location and boundary of clusters in the presence of unknown amounts of noise, 

    • (ii) Developing hybrid robust unsupervised clustering techniques based on Robust Statistical theory and Evolutionary Computation Theory, and 

    • (iii) Developing scalable robust unsupervised clustering technique based on Evolutionary Computation Theory.

  • Develop an Evolutionary Web Personalization system based on the following components:

    • (i) A discovery engine that discovers an unknown number of robust multiresolution profiles and context-sensitive associations based on the above clustering technique, 

    • (ii) An intelligent recommendation system based on the discovered knowledge, 

    • (iii) New techniques for preprocessing Web usage data, and
       

    • (iv) New techniques for formulating user profiles.

  • Develop mentoring and outreach activities that will support this project and encourage the exchange and spread of knowledge in society. This includes: 

    • (i) Training several graduate and undergraduate students in the emerging high demand areas of data and Web mining, 

    • (ii) Integrating the results of this research into the undergraduate and undergraduate educational curriculum at the University of Memphis, in a way that will involve partnerships with local businesses and campus laboratories,
       

    • (iii) Instigating multidisciplinary and international collaborations that will benefit research and education in Web and data mining, Evolutionary computation, and other disciplines.

 

   Research Contributions:

    This project has developed robust unsupervised scalable clustering techniques based on Evolutionary computation, as well as adaptive and evolutionary Web Personalization strategies that are based on the following components: (i) a discovery engine that discovers an unknown number of robust user profiles based on the above clustering technique, and (ii) an intelligent recommendation system based on the discovered knowledge. The clustering techniques are unsupervised, meaning that they do not require the user to specify the number of clusters in advance. And they are robust since they can handle data that is contaminated with noise. The developed techniques have a strong impact on organizing the web information space with minimum intervention from the user or administrator. This is because they can be used to automatically extract web user profiles that do not rely on personal identification. When coupled with intelligent recommendation methods, they can be used to adapt the web information space according to the user's interest in a model based collaborative filtering approach. Recommendations range from adding suggestions to adding links on a requested page, which in essence reorganizes the web site.

    Our most important contributions so far, are the development of unsupervised clustering algorithms that are based on evolutionary computation. Evolutionary methods are nature inspired strategies that search for solutions to difficult problems by following an approach that mimics the way that genetic code in natural organisms evolves through several generations. A set of candidate solutions evolves through competition and genetic-like operators towards a pool of improved, and hence better fit individuals that represent better solutions for the problem that is being solved. Some evolutionary techniques mimic the way evolution takes place between different organisms within a population across long time spans, typically several lifetimes, such as in the case of evolution of mammals. These are known as genetic algorithms. While other evolutionary techniques mimic the way evolution takes place inside the same organism, typically over much shorter time spans, such as in the evolution of immune cells throughout a single lifetime within the Human body, in order to protect the body from harmful bacteria and viruses. These are called Immune based algorithms.

    In this project, we have developed clustering techniques that can cope with high levels of noise and large data sets based on both Genetic and Immune based search algorithms. In particular, Immune based methods (known as TECNO-Streams which stands for Tracking Evolving Clusters in Noisy data Streams) can cope with a massive stream of Web usage data or any multidimensional data in a single pass, and is thus particularly suitable to real-time Web usage mining and personalization. To our knowledge, our proposed algorithms are the first Evolutionary clustering algorithms that are scalable to massive data sets, while being robust to outliers and noise.

    We have also developed methodologies and metrics for the validation of the discovered user profiles, particularly in a streaming framework, and several recommendation strategies that use the discovered user profiles. Some of these strategies are based on Fuzzy Approximate Reasoning to handle the large amounts of uncertainty hidden in real user clickstream data, while others use Neural Networks to build several highly accurate and specialized recommender systems, one for each user profile or group. Finally, we have proposed a novel approach to easily, freely and quickly deploy recommendation systems by tweaking open source Search Engine software to work just like a recommendation system, and in particular, a system that can recommend content from many websites and not just one. This is accomplished mainly by indexing the websites’ content, and by transforming user clickstreams while browsing a website, into queries that get submitted to an underlying search engine, and then transforming the results of the search into suitable recommendations.

  Human Resources:

    The research in this project has involved several graduate students, the majority of whom are women and minorities. The project has resulted in one PhD dissertation by a female student and several Masters Theses and projects. Two new graduate courses related to Web Mining have also been developed and benefited a large number of students at the graduate level.

  Impacts:

    Our contributions in Web personalization have a significant impact on e-commerce, e-learning, human computer interaction, and adaptive user interfaces, where Web usage mining and personalization can play an important role in understanding user activities and in guiding them and recommending content located deep within large websites, that the user might never find on his or her own.

    Our contributions in developing clustering algorithms have impact on the discipline of data mining, since incisive techniques are being studied theoretically and experimentally for robust unsupervised clustering. The techniques are unsupervised, meaning that they do not require the user to specify the number of clusters in advance. They are robust since they can handle data that is contaminated with noise.

    The developed techniques have a strong impact on organizing the web information space with minimum intervention from the user or administrator. This is because they can be used to automatically extract collaborative web user profiles that do not rely on personal identification. When coupled with intelligent recommendation methods, they can be used to adapt the web information space according to the user's interest. This can range from adding suggestions to adding links on a requested page, which in essence reorganizes the web site. The fact that these techniques detect multi-resolution profiles makes them able to work at different levels of granularities.

    The techniques developed are generic enough that they can learn from and act on both the 'usage' and 'content' aspects of the web. Our contributions in Web personalization have an impact on e-commerce, human computer interaction and adaptive user interfaces.

 

Selected Developments

    We developed a statistically robust and efficient algorithm for clustering and generalized it to multivariate data. Hierarchical clustering based on genetic niching showed that it is a feasible technique and promises more success compared to other techniques, for clustering web sessions. We have also developed new recommendation techniques based on Web usage mining and fuzzy approximate reasoning to take into account the inherent uncertainties in all the phases of Web personalization (see publications).

    We have presented a systematic approach to automatic web recommender systems based on Web usage mining in a first stage to learn user profiles, and a second data mining phase that is devoted to learning an accurate model for predicting user requests. Our approach differs from existing methods because it includes two separate learning phases: one to learn the user profiles, and another to learn a recommendation model. Most previous approaches do not include adaptive learning in a separate second phase, and instead base the recommendations on simplistic assumptions (e.g. nearest profile recommendations, or deployment of pre-discovered association rules). We have also recently developed a new approach to generate simultaneously accurate and complete recommendations, called Context Ultra-Sensitive Approach based on two-step Recommender systems (CUSA-2-step-Rec). Our approach relies on a committee of profile-specific neural networks. This approach provides recommendations that are accurate and fast to train because only the URLs relevant to a specific profile are used to define the architecture of each network. Similar to the task of completing the missing pieces of a puzzle, each neural network is trained to predict the missing URLs of several complete ground-truth sessions from a given profile, given as input several incomplete sub sessions. We compared the proposed approach with K-NN based collaborative filtering, as well as other classical methods (such as nearest profile recommendations ) showing that our approach achieves higher coverage and precision while being faster, and requiring lower main memory at recommendation time. While most recommenders are inherently context sensitive, our approach is context ultra-sensitive because a different recommendation model is designed for each profile separately.

    On a separate track, we have continued our investigation into evolutionary clustering using Unsupervised Niche Clustering (UNC), achieving significant strides, specifically in (i)  performing a comprehensive evaluation of UNC under different density, noise and cluster distribution conditions, (ii) successfully applying UNC to the problem of anomaly detection for network intrusion detection, by using UNC to discover normal profiles, (iii) developing scalable models of UNC to cluster large volumes of data under stringent time and memory constraints, (iv) developing a parameter-free model of UNC that eliminates the need to specify Genetic Algorithm (GA) parameters such as crossover and mutation probabilities, and finally developing a flexible and generalized framework for evolutionary clustering based on  exchangeable modules that affect crucial mechanisms such as the selection strategy, replacement strategy, niching strategy, hybrid scale estimation, automatic adaptation of the GA parameters, and hybrid local refinement post evolution.

 
    We have also investigated several approaches to scalable evolutionary clustering of noisy and evolving data sets based on more recently understood Evolutionary phenomena, such a new artificial immune system model, multiploid and structured chromosome representations, and gene expression mechanisms. Preliminary results have shown a promise of success. We believe that creating hybrids that tap on the strengths of these different facets of evolutionary computation, as well as the Unsupervised Niche Clustering algorithm will naturally evolve into beneficial outcomes of this project to the data mining community. Tangible products of this research will be a suite of scalable, autonomous, and reliable clustering algorithms for intensive applications such as web and stream data mining.

 
Area Background

    As a valuable unsupervised learning tool, clustering is crucial to many applications. Genetic Algorithms have been used with success as global searchers in difficult problems, particularly in the optimization of non-differentiable functions. Recently, several Genetic Niching (GN) techniques have emerged to tackle multimodal function optimization. Current genetic clustering techniques are not robust in the presence of noise; assume a known number of clusters; and suffer from a search space size that explodes exponentially with the number of clusters. Web Personalization tailors a user’s interaction with the Web information space based on information gathered about them. Declarative user information such as manually entered profiles continue to raise privacy concerns and are neither scalable nor flexible in the face of very active dynamic Web sites and changing user trends and interests. One way to deal with this problem is through an automated Web personalization system. Such a system can be based on Web usage mining to discover Web usage profiles, followed by a recommendation system that can respond to the users’ individual interests.

Go to back to Knowledge Discovery & Web Mining Lab