It’s no secret that ensemble methods are extremely powerful tools in statistical inference, data science, and machine learning. It’s long been known that many “imperfect” models combined together can often perform better than any single model.
For example, in the M5 forecasting competition almost all of the top performers used some element of model averaging or ensembling. Indeed, some of the most commonly used tools in machine learning, like random forests and boosting, are built on this very idea: they combine many individually weak models into a single, more powerful one. In the case of random forests, the success of this approach relies on the fact that averaging across many high-variance models generally results in a single, lower-variance model. This is most evident in the idea of the “wisdom of the crowd”, where large groups of individuals are often better at predicting something than a single expert.
However, one area which hasn’t received much attention is outlier detection. When we say “outliers” we’re generally referring to observations that are exceptionally unusual compared to the rest of the sample. A rather concise definition by Hawkins (1980) states:
“An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.”
This rather broad definition fits well with the general application of outlier detection. It can be used for identifying fraud in insurance or healthcare datasets, intrusion detection for computer networks, or flagging anomalies in time-series data - among many others. The specific challenge I want to address in this mini-blog is an ensembling approach for unsupervised outlier detection. This is doubly interesting because unsupervised learning presents many more issues compared to supervised learning. Below, I’ll contrast some of these differences and then describe an interesting ensemble approach.
# library and data imports
import random
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from ensamble_funcs import ALSOe
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.metrics import average_precision_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from sklearn.preprocessing import StandardScaler
from pyod.models.knn import KNN
from pyod.models.lof import LOF

# set plotting theme
cparams = {
    "axes.spines.left": False,
    "axes.spines.right": False,
    "axes.spines.top": False,
    "axes.spines.bottom": False,
    "grid.linestyle": "--"
}
sns.set_style("whitegrid", rc = cparams)
sns.set_palette(["#0077BB", "#EE7733"])

# define some helper functions
def eval_preds(yobs, ypred):
    """ Print AUC and average precision """
    auc = roc_auc_score(yobs, ypred)
    pre = average_precision_score(yobs, ypred)

    print(f'Roc:{np.round(auc,3)}')
    print(f'Prn:{np.round(pre,3)}')

# set seed
random.seed(46098)

# load data
data = np.load("C:/Users/gioc4/Documents/blog/data/Classical/6_cardio.npz")
X, y = data['X'], data['y']

# Scale input features to mean 0, sd 1
X = StandardScaler().fit_transform(X)

# Train-test split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)
Outlier Detection with Supervision
To start, let’s look at an example using the cardio dataset sourced from the pyod benchmark set. In this case we have 1831 observations with 21 variables, of which about 9.6% are considered anomalous. These are conveniently labeled for us, where a value of 1 indicates an anomalous reading. If we fit a simple random forest classifier we see that it is trivial to get a very high AUC on the test data (let’s also not get ahead of ourselves here, as this is a toy dataset with a target that is quite easy to predict). Below we see an example of the fairly strong separation between the inliers and outliers. Our random forest works well in this case - giving us a test AUC of .99 and an average precision of .98. While this is an overly simple example, it does show how straightforward modelling can be (in many cases) when there is a labeled target variable.
# fit a random forest classifier
rf = RandomForestClassifier()
rf.fit(Xtrain, ytrain)

# extract the predictions, calculate AUC
rf_preds = rf.predict_proba(Xtest)[:,1]
eval_preds(ytest, rf_preds)
Roc:0.997
Prn:0.977
sns.scatterplot(x = X[:,7], y = X[:,18], hue = y)
(
    plt.xlabel("Feature 7"),
    plt.ylabel("Feature 18"),
    plt.title("Cardio Scatterplot, Inliers and Outliers")
)
plt.show()
Unsupervised Outlier Detection
Let’s talk about unsupervised outlier detection. Unlike the situation above, in an unsupervised setting we don’t have the convenience of a set of labels identifying whether a given observation is anomalous or not. And because we’re lacking this ground truth, choosing both our model(s) and the parameters for those model(s) becomes a lot more complicated. Let’s talk about why.
How do we select the best model?
In classic supervised learning we can choose a metric to optimize (say, root mean squared error or log-loss), then fit a model which attempts to minimize that metric. In the simplest case, think about ordinary least squares. There we have the simple target of minimizing the sum of squared errors. We can validate the fit of the model by looking at evaluation metrics (RMSE, R-squared, standardized residuals). However, when we lack a way to identify if a given observation is anomalous or not, we don’t have any meaningful way to know if one model is doing better than another.
You might also be thinking about optimizing a specific parameter in a model (like the number of nearest neighbors \(K\) in a K-nearest neighbors model) using some criterion that doesn’t rely on a target variable (like the ‘elbow’ method). However, optimizing this parameter doesn’t guarantee that the model itself is useful. Simply put, we’re often left groping around in the dark trying to decide what the optimal model or set of parameters is.
Let’s consider this example: Say we’re looking at the same cardio dataset from above and trying to decide what unsupervised outlier detector we want to use. Maybe we’re deciding between a distance-based one like exact K-nearest neighbors (KNN) or a density-based one like local outlier factor (LOF). Let’s also say we’re agnostic to parameter choices, so we stick with the default ones provided by pyod.
# initialize and fit using default params
knn = KNN()
lof = LOF()

knn.fit(X)
lof.fit(X)

# extract the predictions, calculate AUC
knn_pred = knn.decision_function(X)
eval_preds(y, knn_pred)
Roc:0.686
Prn:0.286
# extract the predictions, calculate AUC
lof_pred = lof.decision_function(X)
eval_preds(y, lof_pred)
Roc:0.546
Prn:0.154
Here we see that the KNN model performs better than the LOF model - however we didn’t adjust the \(K\) parameter for either model. Because, in practice, we wouldn’t have the labels needed to compute these scores, we don’t know a priori which model or set of parameters will work best in a given case. This is in stark contrast to our first attempt, where we could simply focus on decreasing out-of-sample error.
An Unsupervised Ensembling Approach
So, because we can’t easily decrease bias (due to the lack of ground truth values), what are our options? Well, as we saw above, there is a large amount of variance implicit in these models. This variance can come from multiple sources:
Choice of model. There is considerable variance in how different models perform on different kinds of anomalies. For example, absent some evaluation metric, how do you meaningfully choose between KNN, LOF, or one of many other methods? The pyod benchmark page shows that many models perform quite differently across datasets with different dimensionalities and outlier proportions.
Choice of parameters. Almost all anomaly detection models have added uncertainty based on the choice of parameters. For example, in K-nearest neighbors we need to specify the parameter \(K\). One-class support vector machines (SVM) are notoriously difficult to tune, in part due to the number of parameters to choose (choice of kernel, polynomial degree, etc.).
Therefore, in an unsupervised setting our best option is to try to reduce the variance implicit in both of the sources above. Rather than staking everything on a single choice of model, or a single set of parameters, we can ensemble over a wide range of choices to avoid the risk of choosing a catastrophically bad combination. Since we don’t have access to ground truth, this ends up being the safest option (and, as we will see, generally produces good results). The sketch below illustrates the basic idea.
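To make this concrete, here is a minimal sketch of the general score-ensembling idea (this is not yet the ALSO-E method): fit several detectors with different parameter choices, standardize each detector’s scores so they are comparable, and average them. The detector types and \(K\) values are arbitrary choices of mine for illustration, and the code assumes the X matrix, y labels, and eval_preds() helper from the setup block above.

# A minimal sketch of averaging scores across detectors and parameters
# (illustration only, not the ALSO-E algorithm)
import numpy as np
from pyod.models.knn import KNN
from pyod.models.lof import LOF
from sklearn.preprocessing import StandardScaler

# a small, arbitrary pool of detectors with different K values
detectors = [KNN(n_neighbors=k) for k in (5, 10, 20, 50)] + \
            [LOF(n_neighbors=k) for k in (5, 10, 20, 50)]

scores = []
for det in detectors:
    det.fit(X)
    # standardize each detector's training scores to a common scale
    s = StandardScaler().fit_transform(det.decision_scores_.reshape(-1, 1))
    scores.append(s.flatten())

# the ensemble score is the average across all detectors
ensemble_score = np.mean(scores, axis=0)

# only possible here because we have the toy labels - not in practice
eval_preds(y, ensemble_score)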
ALSO: A regression-based approach
For this specific post I’m going to focus on an unsupervised ensemble algorithm proposed by Paulheim & Meusel (2015) and further discussed in Aggarwal & Sathe (2017). The authors dub this method “attribute-wise learning for scoring outliers”, or ALSO. The approach we are going to use extends the logic of the ALSO model to an ensemble-based approach (hence ALSO-E).
Let’s talk a bit about the logic here. The general idea of the algorithm is that we iteratively choose a target feature \(X_j\) from the full set of features \(X\). The chosen feature \(X_j\) is used as the target and the remaining features are used to predict it. We repeat this for all features in \(X\), collecting the standardized model residuals at each step. We then average the residuals across all the models and use them to identify “anomalous” observations. In this case, the more anomalous observations will be the ones whose residuals are substantially larger than the rest of the sample. The beauty of this method is that we can choose any base model for the prediction step (e.g. linear regression, random forests, etc.).
To avoid relying on models that are badly overfit, or have virtually no predictive ability, we define a weight for each model using its cross-validated RMSE. The goal here is to downweight models with low predictive ability so they have a proportionally lower effect on the final outlier score. This is defined as
\[w_k = 1 - \min(1, \mathrm{RMSE}(M_k))\]
which simply means that models that perform no better than predicting the mean (which, because each feature is standardized, corresponds to an RMSE of 1) are weighted to 0, while a theoretically “perfect” model would be weighted 1. This gives us the added benefit of downweighting features that have little or no predictive value, which helps in cases where we might have one or more irrelevant variables.
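To make the base method concrete before adding the ensemble step, here is a minimal sketch of ALSO as described above. The function name and the use of linear regression as the base model are my own illustrative choices; any regressor would work.

# A minimal sketch of the base ALSO scoring loop (illustrative, not the
# reference implementation). Features are standardized so that an RMSE
# of 1 corresponds to simply predicting the mean.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

def also_scores(X):
    X = StandardScaler().fit_transform(X)
    n, p = X.shape
    weighted_resids = []

    for j in range(p):
        yj = X[:, j]                    # feature j is the target
        Xj = np.delete(X, j, axis=1)    # remaining features are predictors

        model = LinearRegression().fit(Xj, yj)

        # standardized absolute residuals for feature j
        resid = np.abs(yj - model.predict(Xj))
        resid = (resid - resid.mean()) / resid.std()

        # weight: w_k = 1 - min(1, RMSE(M_k)), with RMSE from cross-validation
        cv_rmse = -np.mean(cross_val_score(model, Xj, yj, cv=3,
                                           scoring='neg_root_mean_squared_error'))
        w = 1 - min(1, cv_rmse)

        weighted_resids.append(w * resid)

    # average the weighted residuals across all p models
    return np.mean(weighted_resids, axis=0)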
Adding an E to ALSO
The base algorithm above fits one model per feature, using all \(n\) observations in the data. However, we can extend this into an ensembling method by applying a useful statistical tool - namely variable subsampling. The idea proposed by Aggarwal & Sathe (2017) is to define a number of iterations (say, 100), and for each iteration randomly subsample a fraction of the data between \(\min(1, \frac{50}{n})\) and \(\min(1, \frac{1000}{n})\). This means that each model is fit on a minimum of 50 observations and up to 1000 (or \(n\), if \(n\) is less than 1000). In addition, for each iteration we randomly choose a feature \(j\) to be predicted. Combined, this ensembling approach makes more efficient use of the available data and results in a more diverse set of models.
To my knowledge, there is no “official” implementation of the ALSO-E algorithm, and it is not present in any of the large libraries (e.g. pyod or sklearn). However, the method is generic enough that it is not difficult to code from scratch. Using the notes available, I implemented the method myself using a random forest regressor as the base detector. The code below defines a class with a fit() and a predict() function. The fit() function handles all the subsampling and fits a small random forest regressor to each subsample. The predict() function does the work of getting the residuals and re-weighting them according to their cross-validated error. While my implementation is certainly not “feature complete”, it’s good enough to try out:
# code to fit an ALSOe anomaly detector
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

class ALSOe():
    """ Initializes a regression-based outlier detector """
    def __init__(self, N = 100) -> None:
        self.param_list = []
        self.model_list = []
        self.anom_list = []
        self.wt = []
        self.N = N
        self.std_scaler = StandardScaler()

    def fit(self, data):
        """ Fit an ensemble detector """
        # standardize data
        self.std_scaler = self.std_scaler.fit(X = data)
        data = self.std_scaler.transform(data)

        # fit N models
        for i in range(0, self.N):
            # define sample space
            n = data.shape[0]
            p = data.shape[1]
            s = [min([n, 50]), min(n, 1000)]

            # draw s random samples from dataframe X
            s1 = np.random.randint(low = s[0], high = s[1])
            p1 = np.random.randint(low = 0, high = p)
            ind = np.random.choice(n, size = s1, replace = False)

            # define random y and X
            df = data[ind]
            y = df[:, p1]
            X = np.delete(df, p1, axis = 1)

            # initialize RF regressor
            rf = RandomForestRegressor(n_estimators = 10)

            # fit & predict
            rf.fit(X, y)

            # add fitted models & y param to list
            self.model_list.append(rf)
            self.param_list.append(p1)

    def predict(self, newdata):
        """ Get anomaly scores from fitted models """
        # standardize data
        newdata = self.std_scaler.transform(newdata)

        for i, j in zip(self.model_list, self.param_list):
            # define X, y
            y = newdata[:, j]
            X = np.delete(newdata, j, axis = 1)

            # get predictions on model i, dropping feature j
            yhat = i.predict(X)

            # standardized absolute residuals
            resid = np.sqrt(np.square(y - yhat))
            resid = (resid - np.mean(resid)) / np.std(resid)

            # compute and apply weights
            cve = cross_val_score(i, X, y, cv = 3, scoring = 'neg_root_mean_squared_error')
            w = 1 - min(1, np.mean(cve) * -1)
            resid = resid * w

            # add weights and preds to lists
            self.wt.append(w)
            self.anom_list.append(resid)

        # average the weighted residuals across all models
        anom_score = np.array(self.anom_list).T
        anom_score = np.mean(anom_score, axis = 1)

        # rescale to mean 0, sd 1 and export
        anom_score = StandardScaler().fit_transform(anom_score.reshape(-1, 1))
        anom_score = anom_score.flatten()

        return anom_score
Fitting the model
With all this in mind, fitting the actual model is quite simple. As stated above, the ALSO-E approach is largely parameter free, which means there isn’t much for the user to worry about. Here we’ll just initialize a model with 100 iterations, fit all of the random forest regressors, then get the weighted standardized residuals. This whole process can be condensed into basically 3 lines of code:
# Fit an ALSOe regression ensemble
# using 100 random forests
ad = ALSOe(N = 100)
ad.fit(X)

# extract predictions from the models and combine
# the standardized outlier scores
ad_preds = ad.predict(X)
Evaluating the predictions
Now that we have the predictions, we can look at the distribution of outlier scores. Below we see a histogram of the ensembled scores which, to recall, are rescaled to mean 0 and standard deviation 1. Therefore, the most anomalous observations will have large positive values. Consistent with what we would expect to see, there is a long tail of large residuals which correspond to the outliers, while the bulk of the data corresponds to a mostly “normal” set of values centered around zero.
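A minimal sketch of how this histogram can be produced (assuming the ad_preds scores and the plotting imports from the setup block) looks like this:

# histogram of the ensembled, standardized outlier scores
sns.histplot(x = ad_preds)
plt.xlabel("Outlier score")
plt.title("Distribution of ALSO-E Outlier Scores")
plt.show()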
We can evaluate the performance of our method by bootstrapping the original dataset 10 times, then running our model on each of the bootstrap replicates. This is because there is bound to be some degree of randomness arising from the samples and variables chosen at each iteration. Averaging over the bootstrap replicates gives us some idea of how this model might perform “in the wild”, so to speak. Below I define a little helper function to resample the dataset, fit the model, and then extract the relevant evaluation metrics. We then call the function in a loop and collect the fit statistics in a pair of lists. For evaluation we’ll look at the averages for each set.
# Bootstrap sample from base dataset and evaluate metrics
def boot_eval(df):
    X, y = resample(df['X'], df['y'])

    ad = ALSOe(N = 100)
    ad.fit(X)
    ad_preds = ad.predict(X)

    auc = roc_auc_score(y, ad_preds)
    pre = average_precision_score(y, ad_preds)

    return [auc, pre]

# run models
roc_list = []
prn_list = []

for i in range(10):
    roc, prn = boot_eval(data)
    roc_list.append(roc)
    prn_list.append(prn)
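The averages discussed below can then be pulled straight from these lists; a small addition of mine (the exact formatting of the original output may differ):

# average the bootstrapped evaluation metrics
print(f'Mean Roc: {np.round(np.mean(roc_list), 3)}')
print(f'Mean Prn: {np.round(np.mean(prn_list), 3)}')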
Shown below, we see we have a decent AUC and average precision of about 0.71 and 0.26 respectively. While this is substantially lower than the supervised model, it is still better than the base KNN and LOF models above. The ensembling process also keeps things simple because we don’t have to specify any parameters other than the number of iterations to run. In testing, the default of 100 works well as a starting point, and there aren’t huge performance gains from increasing it substantially.
We can also evaluate its performance on a variety of other datasets. Here I randomly chose another four datasets from the pyod benchmarks page and compared its performance over 10 bootstrap resamplings to the other benchmarked methods in the pyod ecosystem. Looking at the results, we see median ROC AUC scores between .7 and .85, and average precision scores between .2 and .85. For an unsupervised model this isn’t too bad, and it largely falls within the range of the other detectors.
We should note that while its performance is never the best, it is never the worst either. For example, the .707 we achieved on the cardio dataset is lower than some of the best methods (in this case, PCA and Cluster-Based LOF). However, we avoid extremely bad results like those from Angle-Based Outlier Detection or LOF. This underscores our goal with the ensemble model: we prefer a more conservative model that tends to perform consistently across many types of anomalies. We also sidestep issues related to choosing optimal parameters by simply ensembling over many detectors. In an unsupervised case this decrease in variance is especially desirable.
# load additional datasets
d1 = np.load("C:/Users/gioc4/Documents/blog/data/Classical/14_glass.npz")
d2 = np.load("C:/Users/gioc4/Documents/blog/data/Classical/18_Ionosphere.npz")
d3 = np.load("C:/Users/gioc4/Documents/blog/data/Classical/20_letter.npz")
d4 = np.load("C:/Users/gioc4/Documents/blog/data/Classical/21_Lymphography.npz")

dlist = [d1, d2, d3, d4]
mname = ['Cardio', 'Glass', 'Ionosphere', 'Letter', 'Lympho']

roc_list_m = []
prn_list_m = []

# run models
for j in dlist:
    for i in range(10):
        roc, prn = boot_eval(j)
        roc_list_m.append(roc)
        prn_list_m.append(prn)

# Plot evaluation metrics across datasets
evaldf = pd.DataFrame({
    'RocAUC': roc_list + roc_list_m,
    'Precision': prn_list + prn_list_m,
    'Dataset': sorted([x for x in mname * 10])
    })\
    .melt(id_vars = 'Dataset', var_name = 'Metric', value_name = 'Score')

# facet plot across datasets
g = sns.FacetGrid(evaldf, col = 'Metric', sharey = False, col_wrap = 1, aspect = 2)
(
    g.map(sns.boxplot, 'Dataset', 'Score', order = mname),
    g.fig.subplots_adjust(top = 0.9),
    g.fig.suptitle('ALSO-E Model Evaluation', fontsize = 16)
)
plt.show()