The Power of Ensembles

Adventures in outlier detection

Categories: Anomaly Detection, Ensembles

Author: Gio Circo, Ph.D.

Published: February 15, 2023

It’s no secret that ensemble methods are extremely powerful tools in statistical inference, data science, and machine learning. It’s long been known that many “imperfect” models combined together can often perform better than any single model.

For example, in the M5 forecasting competition almost all of the top performers used some element of model averaging or ensembling. Indeed, some of the most commonly used tools in machine learning, like random forests and boosting, work by combining many individually weak or unstable models into a single, more powerful one. The success of the averaging approach relies on the fact that combining many high-variance models generally results in a single, lower-variance model. This is most evident in the idea of the “wisdom of the crowd”, where a large group of individuals is often better at predicting something than a single expert.

However, one area which hasn’t received much attention is outlier detection. When we say “outliers” we’re generally referring to observations that are exceptionally unusual compared to the rest of the sample. A rather concise definition by Hawkins (1980) states:

“An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.”

This rather broad definition fits well with the general application of outlier detection. It can be used to identify fraud in insurance or healthcare data, detect intrusions in computer networks, or flag anomalies in time-series data, among many other applications. The specific challenge I want to address in this mini-blog is an ensembling approach for unsupervised outlier detection. This is doubly interesting because unsupervised learning presents many more issues than supervised learning. Below, I’ll contrast some of these differences and then describe an interesting ensemble approach.

Code
# library and data imports
import random
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from ensamble_funcs import ALSOe

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.metrics import average_precision_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from sklearn.preprocessing import StandardScaler

from pyod.models.knn import KNN
from pyod.models.lof import LOF

# set plotting theme
cparams = {
            "axes.spines.left": False,
            "axes.spines.right": False,
            "axes.spines.top": False,
            "axes.spines.bottom": False,
            "grid.linestyle": "--"
            }

sns.set_style("whitegrid", rc = cparams)
sns.set_palette(["#0077BB","#EE7733"])

# define some helper functions
def eval_preds(yobs, ypred):
    """ Print AUC and average precision
    """

    auc = roc_auc_score(yobs, ypred)
    pre = average_precision_score(yobs, ypred)

    print(f'Roc:{np.round(auc,3)}') 
    print(f'Prn:{np.round(pre,3)}')

# set seed (numpy seeded as well so the numpy-based sampling below is reproducible)
random.seed(46098)
np.random.seed(46098)

# load data
data = np.load("C:/Users/gioc4/Documents/blog/data/Classical/6_cardio.npz")
X, y = data['X'], data['y']

# Scale input features to mean 0, sd 1
X = StandardScaler().fit_transform(X)

# Train-test split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)

Outlier Detection with Supervision

To start, let’s look at an example using the cardio dataset sourced from the pyod benchmark set. In this case we have 1831 observations with 21 variables, of which about 9.6% are considered anomalous. These are conveniently labeled for us, where a value of 1 indicates an anomalous reading. If we fit a simple random forest classifier we see that it is trivial to get a very high AUC on the test data (let’s not get ahead of ourselves here, though: this is a toy dataset with a target that is quite easy to predict). Below we see the fairly strong separation between the inliers and outliers. Our random forest works well in this case, giving us a test AUC of .99 and an average precision of .98. While this is an overly simple example, it does show how easy the problem can be (in many cases) when there is a well-defined target variable.

Code
# fit a random forest classifier
rf = RandomForestClassifier()
rf.fit(Xtrain, ytrain)

# extract the predictions, calculate AUC
rf_preds = rf.predict_proba(Xtest)[:,1]

eval_preds(ytest, rf_preds)
Roc:0.997
Prn:0.977
Code
sns.scatterplot(x = X[:,7], y = X[:,18], hue = y)
(
    plt.xlabel("Feature 7"),
    plt.ylabel("Feature 18"),
    plt.title("Cardio Scatterplot, Inliers and Outliers")
)
plt.show()

Unsupervised Outlier Detection

Let’s talk about unsupervised outlier detection. Unlike the situation above, in an unsupervised setting we don’t have the convenience of a set of labels identifying whether a given observation is anomalous or not. And because we’re lacking this ground truth, it becomes much more complicated to choose both our model(s) and the parameters for those model(s). Let’s talk about why.

How do we select a best model?

In classic supervised learning we can choose a metric to optimize (say, root mean squared error or log-loss), then fit a model which attempts to minimize that metric. In the simplest case, think about ordinary least squares: there we have the simple target of minimizing the sum of squared errors, and we can validate the fit of the model by looking at evaluation metrics (RMSE, R-squared, standardized residuals). However, when we lack a way to identify whether a given observation is anomalous or not, we don’t have any meaningful way to know if one model is doing better than another.
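For contrast, when we do have labels this kind of comparison is nearly effortless. A minimal sketch using the classifier and training split from above (the choice of 5 folds and ROC AUC here is mine, purely for illustration):

Code
# With labels, comparing models is just a matter of computing a
# cross-validated score on the metric we care about.
cv_auc = cross_val_score(RandomForestClassifier(), Xtrain, ytrain,
                         cv=5, scoring='roc_auc')

print(f'CV ROC AUC: {np.round(np.mean(cv_auc), 3)}')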

You might also be thinking about optimizing a specific parameter in a model (like the number of nearest neighbors \(K\) in a K-nearest neighbors model) using some criterion that doesn’t rely on a target variable (like the ‘elbow’ method). However, optimizing this parameter doesn’t guarantee that the model itself is useful. Simply put, we’re often left groping around in the dark trying to decide what the optimal model or set of parameters is.
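As a quick illustration of what such a label-free criterion might look like (this is my own sketch, not something from the papers discussed below), we could scan over \(K\) and look for an ‘elbow’ in the mean \(K\)-th neighbor distance. Nothing in that curve, however, tells us whether the resulting detector actually ranks outliers well:

Code
# Sketch: a label-free 'elbow' scan over K using the mean K-th neighbor
# distance. This picks a K without any labels, but says nothing about
# whether the resulting detector separates outliers from inliers.
from sklearn.neighbors import NearestNeighbors

k_grid = range(2, 31)
mean_kth_dist = []

for k in k_grid:
    nn = NearestNeighbors(n_neighbors = k).fit(X)
    dist, _ = nn.kneighbors(X)
    mean_kth_dist.append(dist[:, -1].mean())

sns.lineplot(x = list(k_grid), y = mean_kth_dist)
(
    plt.xlabel("K"),
    plt.ylabel("Mean K-th neighbor distance"),
    plt.title("Label-Free 'Elbow' Scan Over K")
)
plt.show()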

Let’s consider this example: Say we’re looking at the same cardio dataset from above and trying to decide what unsupervised outlier detector we want to use. Maybe we’re deciding between a distance-based one like exact K-nearest neighbors (KNN) or a density-based one like local outlier factor (LOF). Let’s also say we’re agnostic to parameter choices, so we stick with the default ones provided by pyod.

# initialize and fit using default params
knn = KNN()
lof = LOF()

knn.fit(X)
lof.fit(X)

# extract the predictions, calculate AUC
knn_pred = knn.decision_function(X)

eval_preds(y, knn_pred)
Roc:0.686
Prn:0.286
Code
# extract the predictions, calculate AUC
lof_pred = lof.decision_function(X)

eval_preds(y, lof_pred)
Roc:0.546
Prn:0.154

Here we see that the KNN model performs better than the LOF model, even though we didn’t adjust the \(K\) parameter for either model. Because, in practice, we wouldn’t have the labels to see this, we don’t know a priori which model or set of parameters will work best in a given case. This is in stark contrast to our first attempt, when we could simply focus on decreasing out-of-sample bias.

An Unsupervised Ensembling Approach

So, because we can’t easily decrease bias (due to the lack of ground truth values), what are our options? Well, as we saw above, there is a lot of variance implicit in these models, and it comes from multiple sources:

  1. Choice of model. There is considerable variance in how different models perform on different kinds of anomalies. For example, absent some evaluation metric, how do you meaningfully choose between KNN, LOF, or one of many other methods? The pyod benchmark page shows that many models perform quite differently across different dimensionalities and outlier proportions.
  2. Choice of parameters. Almost all anomaly detection models carry added uncertainty based on the choice of parameters. For example, in K-nearest neighbors we need to specify the parameter \(K\). One-class support vector machines (SVM) are notoriously difficult to tune, in part due to the number of parameters to choose (choice of kernel, polynomial degree, etc.).

Therefore, in an unsupervised setting our best option is to try to reduce the variance implicit in both of the sources above. Rather than staking everything on a single model, or a single set of parameters, we can ensemble over a wide range of choices and avoid the risk of picking a catastrophically bad combination (a sketch of this idea is shown below). Since we don’t have access to ground truth, this ends up being the safest option (and, as we will see, it generally produces good results).
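As a quick sketch of the general idea (the specific detectors and \(K\) values here are arbitrary choices of mine, not part of any method discussed below), we can standardize and average the scores from several detector/parameter combinations; the labels are only used at the very end to score the result:

Code
# Sketch: ensemble over both the choice of detector and the choice of K by
# averaging standardized outlier scores, rather than betting on one combination.
detectors = [KNN(n_neighbors = k) for k in (5, 10, 20)] + \
            [LOF(n_neighbors = k) for k in (5, 10, 20)]

scores = []
for det in detectors:
    det.fit(X)
    s = det.decision_function(X)                # raw outlier scores
    scores.append((s - s.mean()) / s.std())     # put scores on a common scale

ens_pred = np.mean(scores, axis = 0)            # averaged, lower-variance score

# only possible here because this toy dataset happens to have labels
eval_preds(y, ens_pred)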

ALSO: A regression-based approach

For this specific post I’m going to focus on an unsupervised ensemble algorithm proposed by Paulheim & Meusel (2015) and further discussed in Aggarwal & Sathe (2017). The authors dub this method “attribute-wise learning for scoring outliers”, or ALSO. The approach we are going to use extends the logic of the ALSO model to an ensemble-based approach (hence ALSO-E).

Let’s talk a bit about the logic here. The general idea of the algorithm is that we iteratively choose a target feature \(X_j\) from the full set of features \(X\). The chosen feature \(X_j\) is used as the target, and the remaining features are used to predict it. We repeat this for every feature in \(X\), collecting the standardized model residuals at each step. We then average the residuals across all the models and use them to identify “anomalous” observations: more anomalous observations will tend to be those whose residuals are substantially larger than the rest of the sample’s. The beauty of this method is that we can choose any base model for the prediction step (e.g. linear regression, random forests, etc.).

To avoid models that are very overfit, or have virtually no predictive ability, we define weights for each model using cross-validated RMSE. The goal here is to downweight models that have low predictive ability so they have a proportionally lower effect on the final outlier score. This is defined as

\[w_k = 1 - \min(1, \mathrm{RMSE}(M_k))\]

which simply means that models that perform worse than predicting the mean (which, on standardized data, corresponds to an RMSE of 1) are weighted to 0, while a theoretically “perfect” model would be weighted 1. This has the added benefit of downweighting features that have little or no predictive value, which helps in cases where we have one or more irrelevant variables.
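To make the per-feature loop and the weighting concrete, here is a compact sketch of the base (non-ensemble) ALSO procedure as I read it, with a small random forest as the base model; this is my own condensed interpretation rather than the authors’ reference code:

Code
# Sketch of base ALSO: fit one model per feature, weight each model's
# standardized residuals by w = 1 - min(1, cross-validated RMSE).
from sklearn.ensemble import RandomForestRegressor

Z = StandardScaler().fit_transform(X)       # sd of 1 makes RMSE = 1 the "predict the mean" baseline
n, p = Z.shape
weighted_resid = np.zeros((n, p))

for j in range(p):
    yj = Z[:, j]                            # feature j as the target
    Xj = np.delete(Z, j, axis = 1)          # remaining features as predictors

    rf = RandomForestRegressor(n_estimators = 10).fit(Xj, yj)

    cv_rmse = -np.mean(cross_val_score(rf, Xj, yj, cv = 3,
                                       scoring = 'neg_root_mean_squared_error'))
    w = 1 - min(1, cv_rmse)                 # w_k = 1 - min(1, RMSE(M_k))

    resid = np.abs(yj - rf.predict(Xj))     # absolute residuals
    weighted_resid[:, j] = w * (resid - resid.mean()) / resid.std()

also_score = weighted_resid.mean(axis = 1)  # larger = more anomalous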

Adding an E to ALSO

The base algorithm above fits one model per feature, using all \(n\) observations in the data. However, we can extend this to an ensembling method by applying some useful statistical tools, namely variable subsampling. The idea proposed by Aggarwal & Sathe (2017) is to define a number of iterations (say, 100) and, for each iteration, randomly subsample a fraction of the data between \(\min(1, \frac{50}{n})\) and \(\min(1, \frac{1000}{n})\). This means that each model is fit on a minimum of 50 observations and at most 1000 (or \(n\), if \(n\) is less than 1000). In addition, we randomly choose a feature \(j\) to be predicted. Combined, this ensembling approach makes more efficient use of the available data and results in a more diverse set of models.
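The sampling step itself is only a few lines; under my reading of that description it amounts to something like the following (the full class below implements the same idea):

Code
# Sketch: draw the subsample size and the random target feature for one
# iteration of the ensemble, per the scheme described above.
n, p = X.shape
lo, hi = min(n, 50), min(n, 1000)             # between 50 and 1000 rows (or n, if smaller)
size = np.random.randint(lo, hi + 1) if hi > lo else lo
rows = np.random.choice(n, size = size, replace = False)
target = np.random.randint(p)                 # feature to predict this iteration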

To my knowledge, there is no “official” implementation of the ALSO-E algorithm, and it is not present in any of the large libraries (e.g. pyod or sklearn). However, the method is generic enough that it is not difficult to code from scratch. Using the notes available I implemented the method myself, using a random forest regressor as the base detector. The code below defines a class with a fit() and a predict() function. The fit() function handles all the subsampling and fits each sample with a very shallow random forest regressor. The predict() function does the work of getting the residuals and re-weighting them according to their cross-validated error. While my implementation is certainly not “feature complete”, it’s good enough to try out:

Code
# code to fit an ALSOe anomaly detector

import numpy as np

from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

class ALSOe():
    """ Initializes a regression-based outlier detector
    """

    def __init__(self, N = 100) -> None:

        self.param_list = []
        self.model_list = []
        self.anom_list = []
        self.wt = []
    
        self.N = N
        self.std_scaler = StandardScaler()

    def fit(self, data):
        """ Fit an ensemble detector
        """

        # standardize data
        self.std_scaler = self.std_scaler.fit(X = data)
        data = self.std_scaler.transform(data)

        # fit N models
        for i in range(0, self.N):

            # define sample space: between 50 and 1000 rows (or n, if smaller)
            n = data.shape[0]
            p = data.shape[1]
            s = [min(n, 50), min(n, 1000)]

            # draw a random subsample of rows and a random target feature
            s1 = np.random.randint(low = s[0], high = s[1]) if s[1] > s[0] else s[0]
            p1 = np.random.randint(low = 0, high = p)
            ind = np.random.choice(n, size = s1, replace = False)

            # define random y and X 
            df = data[ind]
            y = df[:,p1]
            X = np.delete(df, p1, axis=1)

            # initialize RF regressor
            rf = RandomForestRegressor(n_estimators=10)

            # fit & predict
            rf.fit(X, y)

            # add fitted models & y param to list
            self.model_list.append(rf)
            self.param_list.append(p1)

    def predict(self, newdata):

        """ Get anomaly scores from fitted models
        """

        # reset score lists so repeated calls don't accumulate old results
        self.wt = []
        self.anom_list = []

        # standardize data
        newdata = self.std_scaler.transform(newdata)

        for i,j in zip(self.model_list, self.param_list):

            # define X, y
            y = newdata[:,j]
            X = np.delete(newdata, j, axis=1)

            # get predictions on model i, dropping feature j
            yhat = i.predict(X)

            # absolute residuals, standardized within this model
            resid = np.abs(y - yhat)
            resid = (resid - np.mean(resid)) / np.std(resid)

            # weight = 1 - min(1, cross-validated RMSE), then apply to residuals
            cve = cross_val_score(i, X, y, cv=3, scoring='neg_root_mean_squared_error')
            w = 1 - min(1, np.mean(cve)*-1)

            resid = resid*w

            # add weights and preds to lists
            self.wt.append(w)
            self.anom_list.append(resid)

        # average the weighted residuals across all models
        anom_score = np.array(self.anom_list).T
        anom_score = np.mean(anom_score, axis = 1)

        # standardize to mean 0, sd 1 and export
        anom_score = StandardScaler().fit_transform(anom_score.reshape(-1,1))
        anom_score = anom_score.flatten()

        return anom_score

Fitting the model

With all this in mind, fitting the actual model is quite simple. As stated above, the ALSO-E approach is largely parameter free, which means there isn’t much for the user to worry about. Here we’ll just initialize a model with 100 iterations, fit all of the random forest regressors, then get the weighted standardized residuals. This whole process can be condensed into basically 3 lines of code:

# Fit an ALSOe regression ensemble
# using 100 random forests
ad = ALSOe(N = 100)
ad.fit(X)

# extract predictions from the models and combine
# the standardized outlier scores
ad_preds = ad.predict(X)

Evaluating the predictions

Now that we have the predictions, we can look at the distribution of outlier scores. Below we see a histogram of the ensembled scores which, to recall, are rescaled to mean 0 and standard deviation 1. Therefore, the most anomalous observations will have large positive values. Consistent with what we would expect to see, there is a long tail of large residuals which correspond to the outliers, while the bulk of the data corresponds to a mostly “normal” set of values centered around zero.

Code
sns.histplot(x = ad_preds)
(
    plt.xlabel("Anomaly Score"),
    plt.ylabel("Observations"),
    plt.title("ALSO-E Anomaly Scores")
)
plt.show()

We can evaluate the performance of our method by bootstrapping the original dataset 10 times, then running our model on each of the bootstrap replicates. This is because there is bound to be some degree of randomness based on the samples and variables chosen at each iteration. Averaging over the bootstrap replicates gives us some idea of how this model might perform “in the wild”, so to speak. Below I define a little helper function to resample the dataset, fit the model, and extract the relevant evaluation metrics. We then loop through the function and collect the fit statistics in a set of lists. For evaluation we’ll look at the averages for each set.

# Bootstrap sample from base dataset and evaluate metrics
def boot_eval(df):
    X, y = resample(df['X'], df['y'])

    ad = ALSOe(N = 100)
    ad.fit(X)
    ad_preds = ad.predict(X)

    auc = roc_auc_score(y, ad_preds)
    pre = average_precision_score(y, ad_preds)

    return [auc, pre]

# run models
roc_list = []
prn_list = []

for i in range(10):
    roc, prn = boot_eval(data)

    roc_list.append(roc)
    prn_list.append(prn)

Shown below, we see a decent AUC and average precision of roughly 0.7 and 0.26, respectively. While this is substantially lower than the supervised model, it is still better than the baseline KNN and LOF models above. The ensembling process is also convenient because we don’t have to specify any parameters other than the number of iterations to run. In testing, the default of 100 works well as a starting point, and there aren’t huge performance gains from increasing it substantially.

Code
print(f'Avg. ROC: {np.round(np.mean(roc_list),3) }\nAvg. Prn: {np.round(np.mean(prn_list),3)}')
Avg. ROC: 0.693
Avg. Prn: 0.261

Comparing performance across datasets

We can also evaluate its performance on a variety of other datasets. Here I randomly chose another four datasets from the pyod benchmarks page and compared the model’s performance over 10 bootstrap resamplings to the other benchmarked methods in the pyod ecosystem. Looking at the results, we see median ROC AUC scores between .7 and .85 and average precision scores between .2 and .85. For an unsupervised model this isn’t too bad, and it largely falls within the range of the other detectors.

We should note that while its performance is never the best, it is also never the worst. For example, the .707 we achieved on the cardio dataset is lower than some of the best methods (in this case, PCA and cluster-based LOF), but we avoid extremely bad results like those from angle-based outlier detection or LOF. This underscores our goal with the ensemble model: we prefer a more conservative model that tends to perform consistently across many types of anomalies. We also avoid issues related to choosing optimal parameters by simply ensembling over many detectors. In an unsupervised setting this decrease in variance is especially desirable.
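For reference, the same bootstrap idea can be pointed at a few off-the-shelf pyod detectors so their scores sit on roughly equal footing with ALSO-E; the detector choices below are mine, purely for illustration:

Code
# Sketch: run a few standard pyod detectors through the same bootstrap
# loop so their AUCs can be compared against ALSO-E.
from pyod.models.pca import PCA as PCADetector

def boot_eval_baseline(df, make_detector):
    Xb, yb = resample(df['X'], df['y'])
    Xb = StandardScaler().fit_transform(Xb)
    det = make_detector()
    det.fit(Xb)
    return roc_auc_score(yb, det.decision_function(Xb))

for name, make_detector in {'KNN': KNN, 'LOF': LOF, 'PCA': PCADetector}.items():
    aucs = [boot_eval_baseline(data, make_detector) for _ in range(10)]
    print(f'{name} avg. ROC: {np.round(np.mean(aucs), 3)}')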

Code
# load additional datasets
d1 = np.load("C:/Users/gioc4/Documents/blog/data/Classical/14_glass.npz")
d2 = np.load("C:/Users/gioc4/Documents/blog/data/Classical/18_Ionosphere.npz")
d3 = np.load("C:/Users/gioc4/Documents/blog/data/Classical/20_letter.npz")
d4 = np.load("C:/Users/gioc4/Documents/blog/data/Classical/21_Lymphography.npz")

dlist = [d1,d2,d3,d4]
mname = ['Cardio', 'Glass','Ionosphere','Letter','Lympho']

roc_list_m = []
prn_list_m = []

# run models
for j in dlist:
 
    for i in range(10):
        roc, prn = boot_eval(j)

        roc_list_m.append(roc)
        prn_list_m.append(prn)


# Plot evaluation metrics across datasets
evaldf = pd.DataFrame({'RocAUC' : roc_list +roc_list_m, 
                       'Precision' : prn_list + prn_list_m,
                       'Dataset': sorted([x for x in mname*10])})\
            .melt(id_vars = 'Dataset', var_name = 'Metric', value_name = 'Score')

# facet plot across datasets
g = sns.FacetGrid(evaldf, col = 'Metric', sharey = False, col_wrap=1, aspect = 2)
(
    g.map(sns.boxplot, 'Dataset','Score', order = mname),
    g.fig.subplots_adjust(top=0.9),
    g.fig.suptitle('ALSO-E Model Evaluation', fontsize=16)
)
plt.show()