We have deeply rethought the vision of this package and have completely rewritten the entire package to support existing, new, and future planned functionality. The changes are so radical that there is no continuity with the previous version 0.3.1. Thus, we’ve skipped a version number and now are at version 0.5.0.
Honestly, we can’t keep track of all the changes; experienced users are advised to rerun the vignettes to get up to speed with the new version. We apologize for the discontinuity but we trust that the latest version is easier to use and much more functional. What follows is a list of some of the most notable changes we’ve kept track of.
- The structure of ale package objects has been completely rewritten. The latest objects are not compatible with earlier versions. However, the new structure supports the roadmap of future functionality, so we hope that there will be minimal changes in the future that interrupt backward compatibility.
- New {S7} classes represent different kinds of ale package objects:
  - ALE: the core ale package object that holds ALE data for a model (replaces the former ale() and ale_ixn() functions).
  - ModelBoot: results of full-model bootstrapping (replaces the former model_bootstrap() function).
  - ALEPlots: stores ALE plots generated from either ALE or ModelBoot, with convenient print() and plot() methods.
  - ALEpDist: p-value distribution information (replaces the former create_p_dist() function).
- We no longer use {ALEPlot} package code and so now claim full authorship of the code. One of the most significant implications of this is that we have decided to change the package license from GPL 2 to MIT, which permits maximum dissemination of our algorithms.
- ale_ixn() has been eliminated; both 1D and 2D ALE are now calculated with the ALE() constructor.
- The ALE object constructor no longer produces plots directly. ALE plots are now created as ale_plot objects using the newly added plot() methods, which create all possible plots from the ALE data in ALE or ale_boot objects. Thus, serializing ALE objects now avoids the previous problem of environment bloat from the included ggplot objects.
- Renamed the rug_sample_size argument of the ALE constructor to sample_size. It now reflects the size of data that should be sampled in the ale object, which can be used not only for rug plots but also for other purposes.
We have dealt with innumerable bugs during our development journey but, fortunately, very few publicly signalled bugs. Only fixes for publicly reported bugs are indicated here.
- The x_cols argument in ALE() now supports a complex syntax for specifying which columns for 1D ALE or pairs of columns for 2D interactions are desired. It also supports specification using standard R formula syntax.
- get() methods now provide convenient access to ALE, ModelBoot, and ALEPlots objects.
- plot() methods, eliminated the compact_plots to ale().
- print() and plot() methods have been added to the ale_plots object.
- A print() method has been added to the ALE object.
- model_bootstrap() has added various model performance measures that are validated using bootstrap validation with the .632 correction.
- p_funs has been completely changed; it has now been converted to an object named ale_p and the functions are separated from the object as internal functions. The function create_p_funs() has been renamed create_p_dist().
- ALEpDist() now produces three types of p-values: “exact” (very slow) with at least 1000 random iterations on the original model; “approx” for 100 to 999 iterations on the original model; and “surrogate” for much faster but less reliable p-values based on a surrogate linear model. See ALEpDist() for details.
One of the most fundamental changes is not directly visible but affects how some ALE values are calculated. In certain very specific cases, the ALE values are now slightly different from those of the reference {ALEPlot} package. These differences arise only for non-numerical variables with some prediction types other than predictions scaled on the response variable. (E.g., a binary or categorical variable for a logarithmic prediction not scaled to the same scale as the response variable.) We made this change for two reasons:
- {ALEPlot} implementation. These cases are not covered at all in the base ALE scientific article and they are poorly documented in the {ALEPlot} code. We cannot help users to interpret results that we do not understand ourselves.
- The {ALEPlot} reference implementation is not scalable: custom code must be written for each type and each degree of interaction.
Other than for these edge cases, our implementation continues to give identical results to the reference {ALEPlot} package.
Other notable changes that might not be readily visible to users:
- {staccuracy}.
- {rlang} and {cli} packages. Reduced the imported functions to a minimum.
- {cli}.
- Replaced {assertthat} with custom validation functions that adapt some {assertthat} code.
- helper.R test files so that some testing objects are available to the loaded package.
- {future} parallelization code to restore original values on exit.
- ale_p objects.
- {ggplot2} 3.5.
- {covr}.
The most significant updates are the addition of p-values for the ALE statistics, the launch of a pkgdown website that will henceforth host the development version of the package, and parallelization of core functions with a resulting performance boost.
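The size of the parallelization boost can be estimated with the rule of thumb that these notes report for this release: a speed-up of roughly n – 2 for n physical cores. The sketch below only illustrates that arithmetic. The function name expected_speedup and the floor of 1 are assumptions made for this example, not part of the ale package.

```r
# Rule of thumb reported in these notes: about n - 2 for n physical cores.
# Flooring the result at 1 (no speed-up) is an assumption of this sketch.
expected_speedup <- function(physical_cores) {
  pmax(physical_cores - 2, 1)
}

expected_speedup(c(4, 6))  # 2 and 4, matching the examples in these notes

# Count physical (not logical) cores; detectCores() can return NA on some platforms
n_physical <- parallel::detectCores(logical = FALSE)
if (!is.na(n_physical)) expected_speedup(n_physical)
```

Actual gains depend on the model and the machine, so this is an expectation to check against, not a guarantee.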
One of the key goals for the ale package is to be truly model-agnostic: it should support any R object that can be considered a model, where a model is defined as an object that makes a prediction for each input row of data that it is provided. Towards this goal, we had to adjust the custom predict function to make it more flexible for various kinds of model objects. We are happy that our changes now enable support for tidymodels objects and various survival models (but for now, only those that return single-vector predictions). So, in addition to taking the required object and newdata arguments, the custom predict function pred_fun in the ale() function now also requires an argument for type to specify the prediction type, whether it is used or not. This change breaks previous code that used custom predict functions, but it allows ale to analyze many more model types than before. Code that did not require custom predict functions should not be affected by this change. See the updated documentation of the ale() function for details.
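As a minimal sketch of the function shape described above, the wrapper below accepts object, newdata, and type and passes them to base R's predict(). The name custom_predict and the logistic regression on mtcars are illustrative assumptions; see the ale() documentation for how pred_fun is actually supplied.

```r
# Hypothetical custom predict function: it takes object, newdata, and type,
# as these notes require, and returns one prediction per input row.
custom_predict <- function(object, newdata, type = "response") {
  # stats::predict() dispatches on the class of object
  as.numeric(stats::predict(object, newdata = newdata, type = type))
}

# Illustrative model (an assumption for this sketch): logistic regression on mtcars
model <- stats::glm(am ~ wt + hp, data = mtcars, family = stats::binomial())

preds <- custom_predict(model, newdata = mtcars, type = "response")
length(preds) == nrow(mtcars)  # one prediction per row
```

A function that has no use for type must still declare it in its signature, since the argument is required whether it is used or not.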
Another change that breaks former code is that the arguments for model_bootstrap() have been modified. Instead of a cumbersome model_call_string, model_bootstrap() now uses the {insight} package to automatically detect many R models and directly manipulate the model object as needed. So, the second argument is now the model object. However, for non-standard models that {insight} cannot automatically parse, a modified model_call_string is still available to assure model-agnostic functionality. Although this change breaks former code that ran model_bootstrap(), we believe that the new function interface is much more user-friendly.
A slight change that might break some existing code is that the conf_regions output associated with ALE statistics has been restructured. The new structure provides more useful information. See help(ale) for details.
- pkgdown website located at https://tripartio.github.io/ale/. This is where the most recent development features will be documented.
- create_p_funs() function for details and an example.
- vignette('ale-statistics') for details. The vignette has been expanded with more details on how to properly interpret normalized ALE statistics.
- vignette('ale-statistics') for details.
- {furrr} library. In our tests, we typically found speed-ups of n – 2, where n is the number of physical cores (machine learning is generally unable to use logical cores). For example, a computer with 4 physical cores should see at least ×2 speed-up and a computer with 6 physical cores should see at least ×4 speed-up. However, parallelization is tricky with our model-agnostic design. When users work with models that follow standard R conventions, the ale package should be able to automatically configure the system for parallelization. But for some non-standard models, users may have to explicitly list the model’s packages in the new model_packages argument so that each parallel thread can find all necessary functions. This is only a concern if you get weird errors. See help(ale) for details.
- ale() function. See help(ale) for details.
- The median_band_pct argument to ale() now takes a vector of two numbers, one for the inner band and one for the outer.
- Replaced {gridExtra} with {patchwork} for printing plots in examples and vignettes.
- ale() function documentation from ale-package documentation.
- alt tags to describe plots for accessibility.
- {insight} package to automatically detect y_col and model call objects when possible; this increases the range of automatic model detection of the ale package in general.
- {progressr} package for progress bars. With the cli progression handler, this enables accurate estimated times of arrival (ETA) for long procedures, even with parallel computing. A message is displayed once per session informing users of how to customize their progress bars. For details, see help(ale), particularly the documentation on progress bars and the silent argument.
- Moved {ggplot2} from a dependency to an import. So, it is no longer automatically loaded with the package.
- var_summary() function. In particular, encodes whether the user is using p-values (ALER band) or not (median band).
- validation.R file.
- compact_plots to plotting functions to strip plot environments to reduce the size of returned objects. See help(ale) for details.
- package_scope environment.
- ale_ixn()).
- ale_ixn()).
- ale() does not yet support multi-output model prediction types (e.g., multi-class classification and multi-time survival probabilities).
This version introduces various ALE-based statistics that let ALE be used for statistical inference, not just interpretable machine learning. A dedicated vignette introduces this functionality (see “ALE-based statistics for statistical inference and effect sizes” from the vignettes link on the main CRAN page at https://CRAN.R-project.org/package=ale). We introduce these statistics in detail in a working paper: Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning and Classical Techniques Based on Accumulated Local Effects (ALE).” arXiv. https://doi.org/10.48550/arXiv.2310.09877. Please note that they might be further refined after peer review.
- ale() and model_bootstrap() now output these statistics. (ale_ixn() will come later.)
- ale package with the reference {ALEPlot} package: “Comparison between {ALEPlot} and ale packages” (available from the vignettes link on the main CRAN page at https://CRAN.R-project.org/package=ale).
- var_cars is a modified version of mtcars that features many different types of variables.
- census is a polished version of the adult income dataset used for a vignette in the {ALEPlot} package.
- silent = TRUE to ale(), ale_ixn(), or model_bootstrap().
- seed argument to ale(), ale_ixn(), or model_bootstrap().
By far the most extensive changes have been to assure the accuracy and stability of the package from a software engineering perspective. Even though these are not visible to users, they make the package more robust with hopefully fewer bugs. Indeed, the extensive data validation may help users debug their own errors.
- {assertthat} package; if not, the function fails quickly with an appropriate error message.
- The {testthat} package is now used for testing the outputs of each user-facing function. This should help the code base to be more robust going forward with future developments.
- {ALEPlot} package. These tests should ensure that any future code that breaks the accuracy of ALE calculations will be caught quickly.
- ale_ixn()).
- ale_ixn()).
This is the first CRAN release of the ale package. Here is its official description with the initial release:
Accumulated Local Effects (ALE) were initially developed as a model-agnostic approach for global explanations of the results of black-box machine learning algorithms. (Apley, Daniel W., and Jingyu Zhu. “Visualizing the effects of predictor variables in black box supervised learning models.” Journal of the Royal Statistical Society Series B: Statistical Methodology 82.4 (2020): 1059-1086 doi:10.1111/rssb.12377.) ALE has two primary advantages over other approaches like partial dependency plots (PDP) and SHapley Additive exPlanations (SHAP): its values are not affected by the presence of interactions among variables in a model and its computation is relatively rapid. This package rewrites the original code from the ‘ALEPlot’ package for calculating ALE data and it completely reimplements the plotting of ALE values.
(This package uses the same GPL-2 license as the {ALEPlot} package.)
This initial release replicates the full functionality of the {ALEPlot} package and a lot more. It currently presents three functions:
- ale(): create data for and plot one-way ALE (single variables). ALE values may be bootstrapped.
- ale_ixn(): create data for and plot two-way ALE interactions. Bootstrapping of the interaction ALE values has not yet been implemented.
- model_bootstrap(): bootstrap an entire model, not just the ALE values. This function returns the bootstrapped model statistics and coefficients as well as the bootstrapped ALE values. This is the appropriate approach for small samples.
This release provides more details in the following vignettes (they are all available from the vignettes link on the main CRAN page at https://CRAN.R-project.org/package=ale):
- ale package
- ale() function handling of various datatypes for x