StatsModels.jl provides the `@formula` mini-language for conveniently specifying table-to-matrix transformations for statistical modeling. RegressionFormulae.jl extends this mini-language with additional syntax that users coming from other statistical modeling ecosystems such as R may be familiar with. This package also serves as a template for developers who wish to expand the StatsModels.jl `@formula` syntax in their own packages.

StatsModels.jl provides the `@formula` mini-language for conveniently specifying table-to-matrix transformations for statistical modeling. This mini-language is designed with extensibility and composability in mind, using normal Julia mechanisms of multiple dispatch to implement additional syntax both inside StatsModels.jl and in external packages. RegressionFormulae.jl takes advantage of this extensibility to provide *additional syntax* that is familiar to many users of other statistical software (e.g., R) in an "opt-in" manner, without forcing *all* downstream packages that depend on StatsModels.jl/`@formula` to support this syntax.

The StatsModels.jl `@formula` syntax is based on the Wilkinson-Rogers formula notation, which has been a widely used standard in multi-factor regression modeling since it was first described in Wilkinson and Rogers (1973). The basic syntax includes operators for *addition* (`+`) and *crossing* (`&` and `*`) of regressors, as well as the `~` operator to link outcome and regressor terms. As the conventions around this syntax have evolved over the last 50 years, other systems have introduced additional operators.

RegressionFormulae.jl expands the StatsModels.jl `@formula` to support two commonly used operators from R: `^` (incomplete crossing) and `/` (nesting). Specifically, it implements:

- `(a + b + c + ...) ^ n` to create all interactions up to `n`-way, corresponding to an incomplete cross of `a, b, c, ...`.
- `a / b` to create `a + a & b`, which results in a "nested" model of `b`, with a separate coefficient for `b` for each level of `a`.
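
As a rough sketch of how the two operators might be inspected in practice, the snippet below builds two formulas and prints the coefficient names that result once a schema has been applied. The data are made up, the column names (`y`, `a`, `b`, `c`, `g`) are purely illustrative, and the sketch assumes that both StatsModels.jl and RegressionFormulae.jl are installed and that `RegressionModel` is the context in which the extended syntax is active.

```julia
using StatsModels, RegressionFormulae

# Made-up data as a NamedTuple of columns (a Tables.jl-compatible column table).
data = (
    y = rand(12),
    a = rand(12),
    b = rand(12),
    c = rand(12),
    g = repeat(["u", "v", "w"], 4),  # string column, treated as categorical
)

# Incomplete cross: main effects plus all two-way interactions of a, b, c.
f_cross = @formula(y ~ (a + b + c)^2)
f_cross = apply_schema(f_cross, schema(f_cross, data), RegressionModel)
println(coefnames(f_cross.rhs))

# Nesting: g, plus a separate coefficient for b within each level of g.
f_nest = @formula(y ~ g / b)
f_nest = apply_schema(f_nest, schema(f_nest, data), RegressionModel)
println(coefnames(f_nest.rhs))
```

Printing the coefficient names before fitting is a cheap way to check that a formula expands to the terms you expect.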

Both of these operators are particularly useful for creating *interpretable* models. Models with high-order interactions are extremely challenging to interpret and require considerable care, and are prone to over-fitting since the number of coefficients grows very quickly with additional terms participating in the interactions. The incomplete cross `^` syntax can ameliorate these difficulties, limiting the highest degree of the resulting interaction terms and reducing the overall number of predictors. Nesting (`a / b`) similarly provides an alternative to fully crossed models (`a * b`) that is more directly interpretable in situations where the analytic questions are focused on the effects of a predictor `b` within each individual level of some other variable `a`, without concern for direct *comparison* of these effects to each other.
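
To make the growth in model size concrete, the following sketch counts the non-intercept terms produced by a full cross of `k` predictors and by an incomplete cross limited to `n`-way interactions. The helper names are hypothetical and not part of any package, and the counts equal the number of coefficients only under the simplifying assumption that every predictor is continuous, so that each term contributes a single column.

```julia
# Full crossing a * b * c * ... yields one term per non-empty subset
# of the k predictors: 2^k - 1 terms.
full_cross_terms(k) = 2^k - 1

# Incomplete crossing (a + b + c + ...)^n keeps only subsets of size 1 to n.
incomplete_cross_terms(k, n) = sum(binomial(k, i) for i in 1:min(n, k))

for k in (3, 5, 8)
    println("k = ", k,
            ": full cross = ", full_cross_terms(k),
            ", up to 2-way = ", incomplete_cross_terms(k, 2))
end
```

With eight predictors, a full cross has 255 terms, whereas restricting to two-way interactions leaves 36.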

Finally, this syntax is implemented in a way that does not *require* other modeling packages that use `@formula` to support it, or even *prevent* other packages from defining an *alternative* meaning for the `^` or `/` operators. Within a `@formula`, the special syntax is implemented by methods like

```
function StatsModels.apply_schema(
t::FunctionTerm{typeof(/)},
...
```

and

```
function Base.:(/)(outer::CategoricalTerm, inner::AbstractTerm)
...
```

The result of this is that if RegressionFormulae.jl is not loaded, then `/` and `^` inside a `@formula` behave exactly as they normally would (i.e., as calls to the normal Julia functions `/` and `^`). Moreover, if a user loads RegressionFormulae.jl at the same time as some other package that defines special syntax for `/` or `^` (for `RegressionModel`), they will receive a warning about method redefinition or method ambiguity.
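
The opt-in behavior described above follows from ordinary multiple dispatch, which can be sketched with a toy example. Every type and function below (`ToyModel`, `ToyRegression`, `ToyOther`, `ToyCall`, `toy_apply`) is a hypothetical stand-in invented for illustration; none of them are StatsModels.jl or RegressionFormulae.jl types, and the real methods are more involved.

```julia
# Hypothetical modeling contexts (stand-ins, not the real StatsModels.jl types).
abstract type ToyModel end
struct ToyRegression <: ToyModel end  # plays the role of a regression context
struct ToyOther <: ToyModel end       # some other modeling context

# Plays the role of an unevaluated call such as `a / b` inside a formula.
struct ToyCall{F}
    f::F
    args::Tuple{Symbol,Symbol}
end

# Fallback: in any context, a call is treated as an ordinary function call.
toy_apply(t::ToyCall, ::Type{<:ToyModel}) = :ordinary_call

# Opt-in extension: only `/` calls in a regression context get special meaning.
toy_apply(t::ToyCall{typeof(/)}, ::Type{<:ToyRegression}) = :nesting

slash = ToyCall(/, (:a, :b))
plus = ToyCall(+, (:a, :b))

println(toy_apply(slash, ToyRegression))  # the specialized method applies
println(toy_apply(slash, ToyOther))       # the fallback applies
println(toy_apply(plus, ToyRegression))   # the fallback applies
```

Because the specialized method is more specific in both arguments, it is selected only for `/` in the regression context. A second package defining a method with the same signature would overwrite it, which is the kind of clash that surfaces as a redefinition warning.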