true pt
A Brief Overview of Robust Statistics
Olfa Nasraoui
Department of Computer Engineering
& Computer Science
University of Louisville,
olfa.nasraoui_AT_louisville.edu
Robust statistics have recently emerged as a family of theories and techniques for estimating the parameters of a parametric model while dealing with deviations from idealized assumptions [Goo83,Hub81,HRRS86,RL87]. Examples of deviations include the contamination of data by gross errors, rounding and grouping errors, and departure from an assumed sample distribution.
Gross errors or outliers are data severely deviating from the pattern set by the majority of the data. This type of error usually occurs due to mistakes in copying or computation. They can also be due to part of the data not fitting the same model, as in the case of data with multiple clusters. Gross errors are often the most dangerous type of errors. In fact, a single outlier can completely spoil the Least Squares estimate, causing it to break down.
Rounding and grouping errors result from the inherent inaccuracy in the collection and recording of data which is usually rounded, grouped, or even coarsely classified.
The departure from an assumed model means that real data can deviate from the often assumed normal distribution. The departure from the normal distribution can manifest itself in many ways such as in the form of skewed (asymmetric) or longer tailed distributions.
The ordinary Least Squares (LS) method to estimate parameters is not robust because its objective function,
, increases indefinitely with the residuals
between the
data point and the estimated fit, with
being the total number of data points in a data set. Hence, extreme outliers with arbitrarily large residuals can have an infinitely large influence on the resulting estimate. M-estimators [Hub81] attempt to limit the influence of outliers by replacing the square of the residuals with a less rapidly increasing loss function of the data value,
, and parameter estimate,
,
. The M-estimate,
for the function
and the sample
, is the value that minimizes the following objective
 |
(1) |
The optimal parameter,
, is determined by solving
 |
(2) |
where, except for a multiplicative constant,
 |
(3) |
When the M-estimator is equivariant, i. e.,
for any real constant
, we can write
and
in terms of the residuals
. Also, in general, an auxiliary scale estimate,
is used to obtain the scaled residuals
. Hence, we can write
and
The
functions for some familiar M-estimators are listed in Table 1. Note that LS can be considered an M-estimator, even though it is not a robust M-estimator.
As seen in this table, M-estimators rely on both an accurate estimate of scale and a fixed tuning constant,
. Most M-estimators use a multiple of the Median of Absolute Deviations (MAD) as a scale estimate which implicitely assumes that the noise contamination rate is
. MAD is defined as follows
The most common scale estimate used is
where the
factor adjusts the scale for maximum efficiency when the data samples come from a Gaussian distribution.
Table 1:
A few common M- and W-estimators
| Type |
 |
pt pt |
pt w(r) pt |
Range |
used |
| |
|
|
|
of r |
scale |
(mean) |
 |
 |
1 |
 |
none |
(median) |
 |
sgn( ) |
 |
 |
none |
| Huber |
 |
 |
1 |
 |
MAD |
| |
 |
k sgn( ) |
 |
 |
|
| Cauchy |
![${c^2 \over 2} log \Big[1+({r\over c})^2 \Big]$](img42.gif) |
 |
 |
 |
 |
| Tukey's |
![${1\over 6} \Big[1-\big(1-r^2\big)^3 \Big]$](img46.gif) |
 |
 |
 |
 MAD |
| biweight |
 |
0 |
0 |
 |
|
| Andrews |
 |
 |
 |
 |
 MAD |
| |
 |
0 |
0 |
 |
|
| Welsch |
![${c^2 \over 2} \Big[1-exp(-({r\over c})^2) \Big]$](img57.gif) |
 |
 |
 |
 |
W-estimators [Goo83] represent an alternative form of M-estimators. Each W-estimator has a characteristic weight function,
that represents the importance of each sample in its contribution to the estimation of
, which is related to the corresponding M-estimator as follows
 |
(4) |
The optimal parameter is determined by solving
 |
(5) |
which is similar to the equations for a ``weighted LS'' regression problem. W-estimators offer a convenient and simple iterative computational procedure for M-estimators, where the W-estimator equations in the current iteration are solved by fixing the weight values,
, to those of the previous iteration. The resulting procedure is referred to as the Iterative Reweighted Least Squares (IRLS or RLS). As in the case of M- and W-estimators, the IRLS relies on an accurate and prefixed scale estimate for the definition of its weights. The most common scale estimate used is
.
The
and
functions for some familiar M- and W-estimators are listed in Table 1.
Also known as trimmed means for the case of location estimation (
), L-estimators [KJ78] are based on a definition of quantiles as follows.
 |
(6) |
In (6),
is the signed residual from the
data sample,
, to the location estimate,
. The loss function is defined as
 |
(7) |
The
quantile of the sample
, is the value of
that solves
 |
(8) |
It is easy to check that for
, the half-quantile,
corresponds to the sample median. The major inconvenience in L-estimators is that they are not easy to optimize, and that they rely on a known value for the noise contamination rate,
. They are also among the least efficient (accurate) estimators because they completely ignore part of the data.
In this approach [Jae72] , each residual is weighted by a score based on its rank as in the following objective.
 |
(9) |
where
is the rank of the
residual in
and
is a nondecreasing score function satisfying
. For example, the Wilcoxon scores are given by
. Like L-estimators, the major inconvenience in R-estimators is that they are not easy to optimize, and that the definition of the score function implicitly necessitates prior information about the noise contamination rate.
Instead of minimizing the sum of squared residuals,
, as in LS to estimate the parameter vector
, Rousseuw [RL87] proposed minimizing their median as follows
 |
(10) |
This estimator effectively trims the
observations having the largest residuals, and uses the
maximal residual value in the remaining set as the criterion to be minimized. Hence it is equivalent to assuming that the noise proportion is 50%, and its breakdown point asymptotically
approaches
for
-dimensional data sets, and
for
-dimensional
data sets. It can also be seen that (10) is unwieldy from an optimization point of view, because of its non-differentiable form. This means that a quasi-exhaustive search on all possible parameter values needs to be done to find the global minimum. As a variation, random sampling/searching of some kind has been suggested to find the best fit [RL87], leading to a reduced complexity of
instead of the high
complexity of the exhaustive option. A major drawback of LMedS is its low efficiency, since it only uses the middle residual value and hence assumes that the data set contains a
fraction of noise. When the data set contains less than
noise, the LMedS estimates suffer in terms of accuracy, since not all the good points are used in the estimation; and when the data set contains more than
noise, the LMedS estimates can suffer considerably in terms of robustness, as will be illustrated in Chapter 3.
LTS [RL87] offers a more efficient way to find robust estimates by minimizing the objective function given by
where
is the
smallest residual or distance when the residuals are ordered in ascending order, i.e.,
Since
is the number of data points whose residuals are included in the sum, this estimator basically finds a robust estimate by identifying the
points having the largest residuals as outliers, and discarding (trimming) them from the data set. The resulting estimates are essentially LS estimates of the trimmed data set. It can be seen that
should be as close as possible to the number of good points in the data set, because the higher the number of good points used in the estimates, the more accurate the estimates are. In this case, LTS will yield the best possible estimate. One problem with LTS is that its objective function does not lend itself to mathematical optimization. Besides, the estimation of
itself is difficult in practice. As will be illustrated in Chapter 3, when faced with more noise than assumed, LTS will lack robustness. And when the amount of noise is less than the assumed level, it will lack efficiency, i.e., the parameter estimates suffer in terms of accuracy, since not all the good data points are taken into account. This reliance on a known or assumed amount of noise present in the data set (contamination rate) means that an exhaustive search over all possible contamination rates needs to be done; and that the optimal estimates have to be chosen based on some kind of validity test because the LTS criterion is monotonically nondecreasing with
when
is less than the actual number of noise points. In addition, the LTS objective function is based on hard rejection. That is, a given data point is either totally included in the estimation process or totally excluded from it. This is not a good strategy if there are points in the region of doubt. As in the case of L-estimators, LTS suffers from a low efficiency, because it completely ignores part of the data.
Instead of the noise proportion, some algorithms explicitly cast their objective functions in terms of a set of weights that distinguish between inliers and outliers. However, these weights usually depend on a scale measure which is also difficult to estimate. For example, the RLS estimator [HW77] tries to minimize
 |
(11) |
where
are robust residuals resulting from an approximate LMedS or LTS procedure. Here the weights
essentially trim outliers from the data used in LS minimization, and can be computed after a preliminary approximate phase of LMedS or LTS. The function
is usually continuous and has a maximum at 0 and is monotonicaly non-increasing with
. In addition,
depends on an error scale
which is usually heuristically estimated from the results of LMedS or LTS. RLS can be considered to be equivalent to W-estimators if there exists a function
satisfying (4). A major advantage of RLS is its ease of computation using the IRLS procedure as for the case of W-estimators. However, RLS was intended to refine the estimates resulting from other robust but less efficient algorithms. Hence it is extremely dependent on a good initialization.
Resistance Properties of M-estimators
An estimator is resistant if a small number of gross errors or any number of small rounding and grouping errors have only a limited effect on the resulting estimate [Goo83]. As seen below, most of the resistance properties of an M-estimator can be inferred from the shape of its Influence Curve.
The Influence Curve (IC) tells us how an infinitesimal proportion of contamination affects the estimate in large samples. Formally [Goo83], the IC gives a quantitative expression of the change in the estimate that results from perturbing the samples underlying distribution,
, by a point mass at sample location
. For an estimator given by the functional
and defined by the
function
, the IC at
is
where
is a point mass perturbation at
and
is the underlying or empirical distribution. The IC can be shown to reduce to
where
is the auxiliary scale estimate. The IC can be further shown to be of the form
 |
(12) |
Hence, the shape of IC depends only on the shape of the
function, not the data distribution.
The Breakdown (BD) bound or point [Hub81] is the largest possible fraction of observations for which there is a bound on the change of the estimate when that fraction of the sample is altered without restrictions. M-estimators of location with an odd
function have a BD bound close to
provided that the auxiliary scale estimator has equal or better BD bound [RL87].
The rejection point is defined as the point beyond which IC becomes zero [Goo83]. Except possibly through the auxiliary scale estimate, observations with residuals beyond the rejection point have zero influence. Hence they make no contribution to the final estimate. Estimators who have a finite rejection point are said to be redescending and are well protected against very large outliers. However, a finite rejection point usually results in the underestimation of scale. This is because when the samples near the tails of a distribution are ignored, too little of the samples may remain for the estimation process. This in turn adversely affects the efficiency of the estimator. An estimator is efficient if the variance of its estimate is as close as possible to the variance of the best estimator for a given distribution. For the Gaussian distribution the best estimator is the mean which also yields the minimum variance of the estimate. In general, it is best for a robust estimator to use as many of the good samples of a distribution as possible, in order to maintain a good efficiency. Another adverse effect of finite rejection is that if a large part of the sample is ignored, the objective function may have many local minima [Goo83].
The Gross Error Sensitivity (g.e.s.) expresses asymptotically the maximum effect a contaminated observation can have on the estimator. It is the maximum absolute value of the IC. The asymptotic bias of an estimator, defined as the maximum effect of the contamination of a given distribution with a proportion
from an outlying distribution, is given by
Unfortunately, it was reported in [Goo83] that in general, poor g.e.s. corresponds to higher Gaussian efficiency, and vice versa.
The local Shift Sensitivity (l.s.s.) measures the effect of the removal of a mass
at
and its reintroduction at
. Therefore, it measures the effect of rounding and grouping errors on an estimator. For highest resistance, it is required that the l.s.s. be bounded. For a continuous and differentiable IC, l.s.s. is given by the maximum absolute value of the slope of IC at any point. In [Goo83], it was reported that in general, a lower (hence better) l.s.s. corresponds to higher Gaussian efficiency.
Winsor's principle [Tuk60] states that all distibutions are normal in the middle. Hence, the
function of M-estimators should resemble the one that is optimal for Gaussian data in the middle. Since the Maximum Likelihood estimate for Gaussian data is the mean which has a linear
function, it is desired that
for small
, where
is a nonzero constant. In general, a
function that is linear in the middle results in better efficiency at the Gaussian distribution.
As shown in [Goo83], it is necessary for the
function to be odd in order to have an unbiased estimator, when the samples' underlying distribution is symmetric. If the underlying distribution is symmetric with center
, then an estimator,
is said to be unbiased if
This crucial property is satisfied by all known M-estimators.
A simultaneous M-estimator of location and scale [Goo83] for the sample
is the combination of a location estimator
and a scale estimator
that satisfy the pair of equations
 |
(13) |
and
 |
(14) |
where
is a tuning constant, and for a symmetric underlying sample distribution,
is an odd function, and
is an even function. The problem with this approach lies in how to choose an appropriate
-function that will yield a scale estimate that is as accurate as possible, that is meaningful, and that is somehow related to the
-function so that the simultaneous optimization process, which usually alternates solving the two equations, makes global sense.
For all these reasons, this approach has hardly been used in the past.
As in the case of location M-estimators,
has the same shape as the IC of the scale estimator.
The classical nonrobust multiple regression model relates a response vector
to the explanatory variables in the matrix
in the form
where the rows of
represent the individual observation vectors
which are augmented by
in the first dimension to allow for a constant term,
, in the linear regression model.
The elements of
are unknown parameters to be estimated, and
is an error vector. The Least Squares chooses the estimate
as the value of
that minimizes the sum of squared residuals,
which results in the optimal closed form solution
and the fitted values
The M-estimator for
, based on the loss function
and the data
is the value
which minimizes
is determined by solving the set of
simultaneous equations
where
W-estimators offer an alternative form of M-estimators by writing
Hence the
simultaneous equations become
where
The above equations can be combined into the following single matrix equation
where
is a diagonal matrix with
This results in the optimal closed form solution
and the fitted values
In a similar fashion to the simple location M-estimator, the IC for a regression M-estimator can be shown to take the form [HRRS86]
![\begin{displaymath}
IC \left( ({\bf x}, y); F_0, \hat{{\bf\beta}} \right) = {\rm...
...{\bf\beta}} \left( F_0 \right) \right] {\bf B}^{-1} {\bf x}^t,
\end{displaymath}](img151.gif) |
(15) |
where
Though
is bounded for robust M-estimators, the term
can grow infinitely large depending on the position of
. Hence M- and W-estimators of regression have an unbounded IC. This means that leverage points for which the independent variables,
, are outlying compared to the remainder of the data can have an unbounded influence on the parameter estimates.
MVE [RL87] tries to generalize LMedS to the case of multivariate location estimation. The approach is based on seeking the ellipsoid with smallest volume, including at least
points of the sample data set. A subsample,
, consisting of
observations is first drawn from the data set,
, with dimensionality
. Then, the subsample mean and covariance matrix are computed as per maximum likelihood,
and
It can easily be shown that the factor,
needed to inflate the ellipsod covering the subsample
to the ellipsoid
, which includes
of the points, is given by
The volume of the resulting ellipsoid is proportional to
. After repeating this sampling process many times, the sample resulting in the minimum volume generates the optimal MVE parameters,
and
given by
and
, where the denominator adjusts the final covariance estimate to include all the good data points for the case of Gaussian data. Instead of the otherwise exhaustive sampling needed, Rousseeuw suggests selecting only
subsamples to guarantee a probability,
, of selecting at least one good subsample, where
is given by the relation
, and
is the noise contamination rate. However, it is clear that an accurate lower bound on the number of subsamples cannot be computed if the noise contamination rate is not known in advance. MVE is also limited by its assumption that the noise contamination rate is
.
This approach [FB81] relies on random sampling selection to search for the best fit. The model parameters are computed for each randomly selected subset of points. Then the points within some error tolerance are called the consensus set of the model, and if the cardinality of this set exceeds a prespecified threshold, the model is accepted and its parameters are recomputed based on the whole consensus set. Otherwise, the random sampling and validation is repeated as in the above. Hence, RANSAC can be considered to seek the best model that maximizes the number of inliers. The problem with this pioneering approach is that it requires the prior specification of a tolerance threshold limit which is actually related to the inlier bound.
MINPRAN [Ste95] relies on the assumption that the noise comes from a well known distribution. As in RANSAC, this approach uses random sampling to search for the fit and the inliers to this fit that are least likely to come from the known noise distribution. Even with random sampling, MINPRAN's computational complexity is
, which is higher than that of LMedS, where
is the size of the data set and
is the number of samples.
All the above estimators are either obliged to perform an exhaustive search or assume a known value for the amount of noise present in the data set (contamination rate), or equivalently an estimated scale value or inlier bound. When faced with more noise than assumed, all these estimators will lack robustness. And when the amount of noise is less than the assumed level, they will lack efficiency, i.e., the parameter estimates suffer in terms of accuracy, since not all the good data points are taken into account. They are also limited to estimating a single component in a data set.
This work is supported by the National Science Foundation (CAREER Award IIS-0133948 to O. Nasraoui).
Partial support of earlier stages of this work by the Office of Naval Research grant
(N000014-96-1-0439) and by the National Science Foundation Grant
IIS 9800899 is also gratefully acknowledged.
- FB81
-
M. A. Fischler and R. C. Bolles.
Random sample consensus for model fitting with applications to image
analysis and automated cartography.
Comm. of the ACM., 24:381-395, 1981.
- Goo83
-
C. Goodall.
M-estimators of location: An outline of the theory.
In D. Hoaglin, F. Mosteller, and J. W. Tukey, editors, Understanding Robust and Exploratory Data Analysis, pages 339-403. New
York, 1983.
- HRRS86
-
F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel.
Robust Statistics the Approach Based on Influence Functions.
John Wiley & Sons, New York, 1986.
- Hub81
-
P. J. Huber.
Robust Statistics.
John Wiley & Sons, New York, 1981.
- HW77
-
P. W. Holland and R. E. Welsch.
Robust regression using iteratively reweighted least-squares.
Commun. Statist. Theory Meth, A6(9):813-827, 1977.
- Jae72
-
L. A. Jaeckel.
Estimating regression coefficients by minimizing the dispersion of
the residuals.
Annals of Mathematical Statistics, 43:1449-1458, 1972.
- KJ78
-
R. Koenker and G. Basset Jr.
Regression quantiles.
Econometrica, 36:33-50, 1978.
- RL87
-
P. J. Rousseeuw and A. M. Leroy.
Robust Regression and Outlier Detection.
John Wiley & Sons, New York, 1987.
- Ste95
-
C. V. Stewart.
Minpran: A new robust estimator for computer vision.
IEEE Transactions on Pattern Analysis and Machine Intelligence,
17(10):925-938, Oct. 1995.
- Tuk60
-
J. W. Tukey.
A survey of sampling from contaminated distributions.
In I. Olkin, S. G. Ghurye, W. Hoeffding, W. G. Madow, , and H. B.
Mann, editors, Contributions to Probability and Statistics, Essays in
Honor of Harold Hotelling, pages 448-485. Stanford, CA: Stanford University
Press, 1960.
A Brief Overview of Robust Statistics
This document was generated using the
LaTeX2HTML translator Version 2002 (1.62)
Copyright © 1993, 1994, 1995, 1996,
Nikos Drakos,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999,
Ross Moore,
Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -split 0 -image_type gif RobustStatistics.tex