|
|
|
 
Goals, Objectives and Targeted Activities
Motivations and Background:
Clustering is an
important task in data mining that aims at organizing an
otherwise indistinguishable mess of data into several
internally homogeneous groups or clusters. However, finding
clusters, in real data sets, is a challenging problem
because the number of clusters is typically unknown and data
can be severely contaminated by noise and outliers.
Clustering has many applications.
In particular, within the context of mining user access
patterns
on websites, clustering can be used to
automatically discover a set of mass user profiles from a
mess of anonymous user sessions or clickstreams. Relying on
clickstreams as opposed to user ratings is referred to as
implicit user modeling, and is generally easy to apply
because it does not require any explicit ratings from the
users. Instead, users’ interests are captured directly from
the trails of activity that they leave behind in Web server
logs, as they click through the pages of a website. Each
user profile offers a concise model of the items of interest
(Web pages) of a group of similar users. For this reason,
user profiles can enable an efficient collaborative
filtering strategy where a new user’s interests can be
predicted from the interests that fall within the interests
of a similar profile or group of similar users. This
approach that predicts what is interesting to one user based
on the interests of a group is very attractive because it
allows an anonymous user profiling (hence respecting
privacy) and because user profiles constitute a much smaller
or summarized knowledge base of the user access patterns as
compared
based Web
personalization strategy to be successful, the clustering
process must be accurate and must handle the large
proportions of outliers that can permeate real access logs.
Also, the final recommendation process must be carefully
designed to take advantage of the learned user profiles.
This project include
activities that aim at achieving the following goals:
-
(i)
Developing robust statistical estimators to
automatically estimate the location and
boundary of clusters in the presence of
unknown amounts of noise,
-
(ii) Developing hybrid robust
unsupervised clustering techniques based on
Robust Statistical theory and Evolutionary
Computation Theory, and
-
(iii) Developing scalable robust
unsupervised clustering technique based
on Evolutionary Computation Theory.
-
(i)
A discovery engine that discovers an
unknown number of robust multiresolution
profiles and context-sensitive
associations based on the above clustering
technique,
-
(ii) An intelligent recommendation
system based on the discovered knowledge,
-
(iii) New techniques for preprocessing
Web usage data, and
-
(iv) New techniques for formulating user
profiles.
-
(i)
Training several graduate and undergraduate
students in the emerging high demand areas of
data and Web mining,
-
(ii) Integrating the results of this
research into the undergraduate and
undergraduate educational curriculum at the
University of Memphis, in a way that will
involve partnerships with local businesses and
campus laboratories,
-
(iii) Instigating multidisciplinary and
international collaborations that will benefit
research and education in Web and data mining,
Evolutionary computation, and other disciplines.
Research
Contributions:
This project has developed
robust unsupervised scalable clustering techniques based on
Evolutionary computation, as well as adaptive and
evolutionary Web Personalization strategies that are based
on the following components: (i) a discovery engine that
discovers an unknown number of robust user profiles based on
the above clustering technique, and (ii) an intelligent
recommendation system based on the discovered knowledge. The
clustering techniques are unsupervised, meaning that they do
not require the user to specify the number of clusters in
advance. And they are robust since they can handle data that
is contaminated with noise. The developed techniques have a
strong impact on organizing the web information space with
minimum intervention from the user or administrator. This is
because they can be used to automatically extract web user
profiles that do not rely on personal identification. When
coupled with intelligent recommendation methods, they can be
used to adapt the web information space according to the
user's interest in a model based collaborative filtering
approach. Recommendations range from adding suggestions to
adding links on a requested page, which in essence
reorganizes the web site.
Our most important contributions
so far, are the development of unsupervised clustering
algorithms that are based on evolutionary computation.
Evolutionary methods are nature inspired strategies that
search for solutions to difficult problems by following an
approach that mimics the way that genetic code in natural
organisms evolves through several generations. A set of
candidate solutions evolves through competition and
genetic-like operators towards a pool of improved, and hence
better fit individuals that represent better solutions for
the problem that is being solved. Some evolutionary
techniques mimic the way evolution takes place between
different organisms within a population across long time
spans, typically several lifetimes, such as in the case of
evolution of mammals. These are known as genetic algorithms.
While other evolutionary techniques mimic the way evolution
takes place inside the same organism, typically over much
shorter time spans, such as in the evolution of immune cells
throughout a single lifetime within the Human body, in order
to protect the body from harmful bacteria and viruses. These
are called Immune based algorithms.
In this project, we have
developed clustering techniques that can cope with high
levels of noise and large data sets based on both Genetic
and Immune based search algorithms. In particular, Immune
based methods (known as TECNO-Streams which stands for
Tracking Evolving Clusters in Noisy data Streams) can cope
with a massive stream of Web usage data or any
multidimensional data in a single pass, and is thus
particularly suitable to real-time Web usage mining and
personalization. To our knowledge, our proposed algorithms
are the first Evolutionary clustering algorithms that are
scalable to massive data sets, while being robust to
outliers and noise.
We have also developed
methodologies and metrics for the validation of the
discovered user profiles, particularly in a streaming
framework, and several recommendation strategies that use
the discovered user profiles. Some of these strategies are
based on Fuzzy Approximate Reasoning to handle the large
amounts of uncertainty hidden in real user clickstream data,
while others use Neural Networks to build several highly
accurate and specialized recommender systems, one for each
user profile or group. Finally, we have proposed a novel
approach to easily, freely and quickly deploy recommendation
systems by tweaking open source Search Engine software to
work just like a recommendation system, and in particular, a
system that can recommend content from many websites and not
just one. This is accomplished mainly by indexing the
websites’ content, and by transforming user clickstreams
while browsing a website, into queries that get submitted to
an underlying search engine, and then transforming the
results of the search into suitable recommendations.
Human Resources:
The research in this project has
involved several graduate students, the majority of whom are
women and minorities. The project has resulted in one PhD
dissertation by a female student and several Masters Theses
and projects. Two new graduate courses related to Web Mining
have also been developed and benefited a large number of
students at the graduate level.
Impacts:
Our contributions in Web
personalization have a significant impact on e-commerce,
e-learning, human computer interaction, and adaptive user
interfaces, where Web usage mining and personalization can
play an important role in understanding user activities and
in guiding them and recommending content located deep within
large websites, that the user might never find on his or her
own.
Our contributions in developing
clustering algorithms have impact on the discipline of data
mining, since incisive techniques are being studied
theoretically and experimentally for robust unsupervised
clustering. The techniques are unsupervised,
meaning that they do not require the user to
specify the number of clusters in advance. They are
robust since they can handle data that
is contaminated with noise.
The developed techniques have a
strong impact on organizing the web information space with
minimum intervention from the user or administrator. This is
because they can be used to automatically extract
collaborative web user profiles that do not rely on personal
identification. When coupled with intelligent recommendation
methods, they can be used to adapt the web information space
according to the user's interest. This can range from adding
suggestions to adding links on a requested page, which in
essence reorganizes the web site. The fact that these
techniques detect multi-resolution profiles makes them able
to work at different levels of granularities.
The techniques developed are
generic enough that they can learn from and act on both the
'usage' and 'content' aspects of the web. Our contributions
in Web personalization have an impact on e-commerce, human
computer interaction and adaptive user interfaces.
Selected Developments
We
developed a statistically robust and efficient algorithm for
clustering and generalized it to multivariate data.
Hierarchical clustering based on genetic niching showed that
it is a feasible technique and promises more success
compared to other techniques, for clustering web sessions.
We have also developed new recommendation techniques based
on Web usage mining and fuzzy approximate reasoning to take
into account the inherent uncertainties in all the phases of
Web personalization (see publications).
We
have presented a systematic approach to automatic
web recommender systems based on Web usage mining in a first
stage to learn user profiles, and a second data mining phase
that is devoted to learning an accurate model for predicting
user requests. Our approach differs from existing methods
because it includes two separate learning phases:
one to learn the user profiles, and another to
learn a recommendation model. Most previous approaches
do not include adaptive learning in a separate second phase,
and instead base the recommendations on simplistic
assumptions (e.g. nearest profile recommendations,  or
deployment of pre-discovered association rules). We have
also recently developed a new approach to generate
simultaneously accurate and complete recommendations,
called Context Ultra-Sensitive Approach based on
two-step Recommender systems (CUSA-2-step-Rec).
Our approach relies on a committee of profile-specific
neural networks. This approach provides recommendations that
are accurate and fast to train because only the URLs
relevant to a specific profile are used to define the
architecture of each network. Similar to the task of
completing the missing pieces of a puzzle, each neural
network is trained to predict the missing URLs of several
complete ground-truth sessions from a given profile, given
as input several incomplete sub sessions. We compared the
proposed approach with K-NN based collaborative filtering,
as well as other classical methods (such as nearest profile
recommendations ) showing that our approach achieves higher
coverage and precision while being faster, and requiring
lower main memory at recommendation time. While most
recommenders are inherently context sensitive, our approach
is context ultra-sensitive because a
different recommendation model is designed for
each profile separately.
On
a separate track, we have continued our investigation into
evolutionary clustering using Unsupervised Niche
Clustering (UNC), achieving significant
strides, specifically in (i)
performing
a comprehensive evaluation of UNC under different density,
noise and cluster distribution conditions, (ii)
successfully applying UNC to the problem of anomaly
detection for network intrusion detection, by using UNC
to discover normal profiles, (iii) developing
scalable models of UNC to cluster large volumes of data
under stringent time and memory constraints, (iv)
developing a parameter-free model of UNC that
eliminates the need to specify Genetic
Algorithm (GA) parameters such as crossover and mutation
probabilities, and finally developing a flexible
and generalized framework for evolutionary clustering
based on exchangeable modules that
affect crucial mechanisms such as the selection strategy,
replacement strategy, niching strategy, hybrid scale
estimation, automatic adaptation of the GA parameters, and
hybrid local refinement post evolution.
We have also investigated several approaches to scalable
evolutionary clustering of noisy and evolving data sets
based on more recently understood Evolutionary phenomena,
such a new artificial immune system model, multiploid and
structured chromosome representations, and gene expression
mechanisms. Preliminary results have shown a promise of
success. We believe that creating hybrids that tap on the
strengths of these different facets of evolutionary
computation, as well as the Unsupervised Niche Clustering
algorithm will naturally evolve into beneficial outcomes of
this project to the data mining community. Tangible products
of this research will be a suite of scalable, autonomous,
and reliable clustering algorithms for intensive
applications such as web and stream data mining.
Area Background
As a valuable unsupervised
learning tool, clustering is crucial to many applications.
Genetic Algorithms have been used with success as global
searchers in difficult problems, particularly in the
optimization of non-differentiable functions. Recently,
several Genetic Niching (GN) techniques have emerged to
tackle multimodal function optimization. Current genetic
clustering techniques are not robust in the
presence of noise; assume a known number of clusters;
and suffer from a search space size that explodes
exponentially with the number of clusters. Web
Personalization tailors a user’s interaction with the Web
information space based on information gathered about them.
Declarative user information such as manually entered
profiles continue to raise privacy concerns and are neither
scalable nor flexible in the face of very active dynamic Web
sites and changing user trends and interests. One way to
deal with this problem is through an automated Web
personalization system. Such a system can be based on Web
usage mining to discover Web usage profiles, followed by a
recommendation system that can respond to the users’
individual interests.
Go to back to
Knowledge Discovery & Web Mining Lab
|
|
|
 |