Knowledge Discovery & Web Mining Lab, University of Louisville

 
NSF CAREER: New Clustering Algorithms Based on Robust Estimation and Genetic Niches with Applications to Web Usage Mining

 

 

This project is supported by the National Science Foundation under grant NSF IIS-0533317.

   

Goals, Objectives and Targeted Activities

   Motivations and Background:

Clustering is an important task in data mining that aims at organizing an otherwise indistinguishable mass of data into several internally homogeneous groups, or clusters. However, finding clusters in real data sets is a challenging problem because the number of clusters is typically unknown and the data can be severely contaminated by noise and outliers. Clustering has many applications.

In particular, within the context of mining user access patterns on websites, clustering can be used to automatically discover a set of mass user profiles from a mass of anonymous user sessions or clickstreams. Relying on clickstreams rather than user ratings is referred to as implicit user modeling, and it is generally easy to apply because it does not require any explicit ratings from the users. Instead, users’ interests are captured directly from the trails of activity that they leave behind in Web server logs as they click through the pages of a website. Each user profile offers a concise model of the items of interest (Web pages) of a group of similar users. For this reason, user profiles enable an efficient collaborative filtering strategy in which a new user’s interests are predicted from the interests captured by the most similar profile, that is, by a group of similar users. This approach, which predicts what is interesting to one user based on the interests of a group, is attractive for two reasons: it allows anonymous user profiling (hence respecting privacy), and the user profiles constitute a much smaller, summarized knowledge base of the user access patterns than the entire history of user clickstreams. However, for a Web personalization strategy based on user profiling to be successful, the clustering process must be accurate and must handle the large proportions of outliers that can permeate real access logs. The final recommendation process must also be carefully designed to take advantage of the learned user profiles.
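As an illustration only, the profile-based prediction described above can be sketched as follows. The sketch assumes that each profile is a mapping from page URL to a relevance weight and that a session is a set of visited URLs; the cosine similarity measure, the function names, and the sample URLs are assumptions made for this example, not the project's actual implementation.

```python
from math import sqrt

def cosine(session, profile):
    """Cosine similarity between a binary session (set of pages)
    and a weighted profile (dict: page -> weight)."""
    if not session or not profile:
        return 0.0
    dot = sum(profile.get(page, 0.0) for page in session)
    norm = sqrt(len(session)) * sqrt(sum(w * w for w in profile.values()))
    return dot / norm if norm else 0.0

def recommend(session, profiles, top_n=3):
    """Match the session to its most similar profile, then suggest the
    highest-weighted pages of that profile that were not visited yet."""
    best = max(profiles, key=lambda p: cosine(session, p))
    candidates = [(w, page) for page, w in best.items() if page not in session]
    candidates.sort(key=lambda t: (-t[0], t[1]))  # weight desc, URL asc for ties
    return [page for _, page in candidates[:top_n]]

# Hypothetical profiles, as a clustering step might produce them.
PROFILES = [
    {"/courses": 0.9, "/syllabus": 0.8, "/exams": 0.6},
    {"/research": 0.9, "/publications": 0.7, "/software": 0.5},
]

print(recommend({"/research"}, PROFILES))
```

Because only the small set of profiles is consulted, and not the full history of clickstreams, the lookup stays cheap and needs no information that identifies an individual user.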

  This project includes activities that aim at achieving the following goals:

  • Develop robust unsupervised scalable clustering techniques based on Evolutionary computation and Genetic Niching. This includes: 

    • (i) Developing robust statistical estimators to automatically estimate the location and boundary of clusters in the presence of unknown amounts of noise, 

    • (ii) Developing hybrid robust unsupervised clustering techniques based on Robust Statistical theory and Evolutionary Computation Theory, and 

    • (iii) Developing scalable robust unsupervised clustering techniques based on Evolutionary Computation Theory.

  • Develop an Evolutionary Web Personalization system based on the following components:

    • (i) A discovery engine that discovers an unknown number of robust multi-resolution profiles and context-sensitive associations based on the above clustering techniques, 

    • (ii) An intelligent recommendation system based on the discovered knowledge, 

    • (iii) New techniques for preprocessing Web usage data, and
       

    • (iv) New techniques for formulating user profiles.

  • Develop mentoring and outreach activities that will support this project and encourage the exchange and spread of knowledge in society. This includes: 

    • (i) Training several graduate and undergraduate students in the emerging high demand areas of data and Web mining, 

    • (ii) Integrating the results of this research into the graduate and undergraduate educational curriculum at the University of Memphis, in a way that will involve partnerships with local businesses and campus laboratories,
       

    • (iii) Instigating multidisciplinary and international collaborations that will benefit research and education in Web and data mining, Evolutionary computation, and other disciplines.

 

   Research Contributions:

This project has developed robust, unsupervised, scalable clustering techniques based on Evolutionary computation, as well as adaptive and evolutionary Web Personalization strategies that are based on the following components: (i) a discovery engine that discovers an unknown number of robust user profiles based on the above clustering techniques, and (ii) an intelligent recommendation system based on the discovered knowledge. The clustering techniques are unsupervised, meaning that they do not require the user to specify the number of clusters in advance. They are also robust, meaning that they can handle data that is contaminated with noise. The developed techniques have a strong impact on organizing the web information space with minimum intervention from the user or administrator, because they can be used to automatically extract web user profiles that do not rely on personal identification. When coupled with intelligent recommendation methods, they can be used to adapt the web information space according to the user's interest in a model-based collaborative filtering approach. Recommendations range from adding suggestions to adding links on a requested page, which in essence reorganizes the web site.

Our most important contributions so far are the development of unsupervised clustering algorithms that are based on evolutionary computation. Evolutionary methods are nature-inspired strategies that search for solutions to difficult problems by mimicking the way that genetic code in natural organisms evolves through several generations. A set of candidate solutions evolves, through competition and genetic-like operators, towards a pool of fitter individuals that represent better solutions to the problem being solved. Some evolutionary techniques mimic the way evolution takes place between different organisms within a population across long time spans, typically several lifetimes, as in the evolution of mammals. These are known as genetic algorithms. Other evolutionary techniques mimic the way evolution takes place inside a single organism, typically over much shorter time spans, as in the evolution of immune cells throughout a single lifetime within the human body in order to protect the body from harmful bacteria and viruses. These are called Immune based algorithms.
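The evolutionary search loop described above can be sketched as a minimal genetic algorithm. This is a generic illustration under stated assumptions, not the project's niching algorithm: the data are one-dimensional, each candidate solution is a single cluster center, fitness is a Gaussian-weighted density (so distant outliers contribute almost nothing), and the initial population is drawn from the data points themselves. All names and parameter values are hypothetical.

```python
import math
import random

def density_fitness(center, data, sigma=0.5):
    """Sum of Gaussian weights: high when many points lie close to `center`.
    Far-away points (outliers) add almost nothing to the score."""
    return sum(math.exp(-((x - center) ** 2) / (2 * sigma ** 2)) for x in data)

def evolve_center(data, pop_size=60, generations=40, mutation_std=0.2, seed=0):
    """Evolve one candidate cluster center towards the densest region."""
    rng = random.Random(seed)

    def fitness(center):
        return density_fitness(center, data)

    # Assumption: start from data points so candidates begin near real data.
    population = [rng.choice(data) for _ in range(pop_size)]
    for _ in range(generations):
        elite = max(population, key=fitness)      # elitism: keep the best as-is
        children = [elite]
        while len(children) < pop_size:
            # Competition: binary tournament selection for each parent.
            p1 = max(rng.sample(population, 2), key=fitness)
            p2 = max(rng.sample(population, 2), key=fitness)
            # Genetic-like operators: averaging crossover plus Gaussian mutation.
            children.append((p1 + p2) / 2 + rng.gauss(0, mutation_std))
        population = children
    return max(population, key=fitness)

# Seven points near 5.0 plus two far outliers.
DATA = [4.8, 4.9, 5.0, 5.1, 5.2, 5.0, 4.95, 20.0, -13.0]
best = evolve_center(DATA)
print(round(best, 2))  # expected to lie close to 5.0
```

A plain genetic algorithm such as this one converges to a single peak of the fitness landscape. Finding an unknown number of clusters, as the project aims to do, requires maintaining several peaks at once, which is the role of niching and is not shown here.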

In this project, we have developed clustering techniques, based on both Genetic and Immune based search algorithms, that can cope with high levels of noise and large data sets. In particular, the Immune based method (known as TECNO-Streams, which stands for Tracking Evolving Clusters in Noisy data Streams) can cope with a massive stream of Web usage data, or any multidimensional data, in a single pass, and is thus particularly suitable for real-time Web usage mining and personalization. To our knowledge, our proposed algorithms are the first Evolutionary clustering algorithms that are scalable to massive data sets, while being robust to outliers and noise.
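To make the single-pass constraint concrete, the following sketch clusters a one-dimensional stream while seeing each point exactly once. It is a simple leader-style scheme chosen for illustration and is not TECNO-Streams: the fixed `radius`, the running-mean update, and the `min_weight` noise filter are all assumptions of this example.

```python
def stream_cluster(stream, radius=1.0, min_weight=3):
    """Single-pass clustering of a 1-D stream.

    Each point either updates the nearest prototype (if within `radius`)
    or starts a new one. Prototypes that attract fewer than `min_weight`
    points are treated as noise and dropped from the result.
    """
    prototypes = []  # each entry is [center, weight]
    for x in stream:
        nearest = min(prototypes, key=lambda p: abs(p[0] - x), default=None)
        if nearest is not None and abs(nearest[0] - x) <= radius:
            nearest[1] += 1
            nearest[0] += (x - nearest[0]) / nearest[1]  # running mean
        else:
            prototypes.append([x, 1])
    return [(center, weight) for center, weight in prototypes
            if weight >= min_weight]

# Two dense groups (around 1 and around 10) with two isolated outliers.
STREAM = [1.0, 1.2, 0.8, 50.0, 10.0, 10.2, 9.8, 1.1, -30.0, 10.1]
for center, weight in stream_cluster(STREAM):
    print(round(center, 3), weight)
```

Memory use grows with the number of prototypes and not with the length of the stream, which is the property that makes single-pass methods attractive for real-time usage mining. The number of clusters is also not specified in advance; it emerges from the data.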

We have also developed methodologies and metrics for the validation of the discovered user profiles, particularly in a streaming framework, and several recommendation strategies that use the discovered user profiles. Some of these strategies are based on Fuzzy Approximate Reasoning to handle the large amounts of uncertainty hidden in real user clickstream data, while others use Neural Networks to build several highly accurate and specialized recommender systems, one for each user profile or group. Finally, we have proposed a novel approach to easily, freely and quickly deploy recommendation systems by adapting open source Search Engine software to work like a recommendation system, and in particular, a system that can recommend content from many websites and not just one. This is accomplished mainly by indexing the websites’ content, by transforming the clickstream that a user generates while browsing a website into queries that are submitted to an underlying search engine, and then by transforming the search results into suitable recommendations.
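The three steps of the search-engine approach (index the content, turn the clickstream into a query, turn the results into recommendations) can be sketched with a toy in-memory inverted index. This is a hedged illustration that does not use any real search engine software: the page texts, the term-frequency query construction, and every function name are hypothetical.

```python
from collections import Counter

# Hypothetical content from two websites: URL -> page text.
PAGES = {
    "siteA/clustering": "robust clustering noise outliers",
    "siteA/streams": "clustering data streams single pass",
    "siteB/recommender": "recommender systems user profiles",
    "siteB/outliers": "outliers noise robust estimation",
}

def build_index(pages):
    """Step 1: inverted index mapping each term to the pages containing it."""
    index = {}
    for url, text in pages.items():
        for term in set(text.split()):
            index.setdefault(term, set()).add(url)
    return index

def session_to_query(session, pages, max_terms=3):
    """Step 2: build a query from the most frequent terms of visited pages."""
    counts = Counter()
    for url in session:
        counts.update(pages.get(url, "").split())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:max_terms]]

def search(query, index):
    """Score each page by the number of query terms it contains."""
    scores = Counter()
    for term in query:
        for url in index.get(term, ()):
            scores[url] += 1
    return scores

def recommend_from_search(session, pages, index, top_n=2):
    """Step 3: turn search results into recommendations, skipping visited pages."""
    scores = search(session_to_query(session, pages), index)
    ranked = sorted(((s, u) for u, s in scores.items() if u not in session),
                    key=lambda t: (-t[0], t[1]))
    return [url for _, url in ranked[:top_n]]

INDEX = build_index(PAGES)
print(recommend_from_search(["siteA/clustering"], PAGES, INDEX))
```

Because the index covers pages from both hypothetical sites, a session on one site can yield a recommendation from the other, which mirrors the multi-website capability described above.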

  Human Resources:

The research in this project has involved several graduate students, the majority of whom are women and minorities. The project has resulted in one PhD dissertation by a female student and several Master's theses and projects. Two new graduate courses related to Web Mining have also been developed and have benefited a large number of graduate students.

  Impacts:

Our contributions in Web personalization have a significant impact on e-commerce, e-learning, human computer interaction, and adaptive user interfaces, where Web usage mining and personalization can play an important role in understanding user activities and in guiding them and recommending content located deep within large websites, that the user might never find on his or her own.

Our contributions in developing clustering algorithms have an impact on the discipline of data mining, since incisive techniques are being studied theoretically and experimentally for robust unsupervised clustering. The techniques are unsupervised, meaning that they do not require the user to specify the number of clusters in advance. They are robust since they can handle data that is contaminated with noise.

The developed techniques have a strong impact on organizing the web information space with minimum intervention from the user or administrator. This is because they can be used to automatically extract collaborative web user profiles that do not rely on personal identification. When coupled with intelligent recommendation methods, they can be used to adapt the web information space according to the user's interest. This can range from adding suggestions to adding links on a requested page, which in essence reorganizes the web site. Because these techniques detect multi-resolution profiles, they can work at different levels of granularity.

The techniques developed are generic enough that they can learn from and act on both the 'usage' and 'content' aspects of the web.
