Usage Guide

This guide covers LinearBoostClassifier and the base estimator SEFR from installation through advanced options, aligned with the current implementation in linear_boost.py and sefr.py.

Installation

Install the linearboost package (and optionally use a virtual environment):

pip install linearboost

Requirements: Python >= 3.8, scikit-learn >= 1.2.2.

A Complete Example

Basic workflow: load or generate data, split, fit LinearBoostClassifier, and evaluate.

from linearboost import LinearBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=10,
    n_redundant=5,
    random_state=42,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = LinearBoostClassifier()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"F1 (weighted): {f1_score(y_test, y_pred, average='weighted'):.4f}")

y_proba = clf.predict_proba(X_test)
print("Probability estimates for first 2 samples:")
print(y_proba[:2])

Key Parameters (LinearBoostClassifier)

Parameters below match the current API. See API Reference for the full signature and attributes.

Boosting type and algorithm

  • boosting_type : {'adaboost', 'gradient'}, default=`'adaboost'`
      - 'adaboost': Classic AdaBoost (SAMME or SAMME.R), which reweights samples according to classification error.
      - 'gradient': Gradient boosting; each base estimator fits pseudo-residuals (the negative gradient of the log-loss). Often better for highly non-linear or XOR-like patterns. With 'gradient', the algorithm parameter is ignored.

  • algorithm : {'SAMME', 'SAMME.R'}, default=`'SAMME.R'` Used only when boosting_type='adaboost'. 'SAMME.R' typically converges faster and achieves lower test error with fewer iterations. 'SAMME' is the discrete variant.

  • n_estimators : int, default=`200` Maximum number of base SEFR estimators. With early_stopping=True you can set a larger value (e.g. 500) and let training stop once the validation score stops improving.

  • learning_rate : float, default=`1.0` Shrinks the contribution of each estimator. There is a trade-off with n_estimators: lower learning rate usually needs more estimators but can improve generalization.
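
To make the two boosting modes concrete, the NumPy sketch below shows one round of each update in its textbook form: the discrete SAMME reweighting step for two classes, and the log-loss pseudo-residuals that a gradient-boosting round would fit. It illustrates the general techniques only and is not LinearBoost's internal code; the helper names samme_reweight and logloss_pseudo_residuals and the toy arrays are hypothetical.

```python
import numpy as np


def samme_reweight(sample_weight, y_true, y_pred, learning_rate=1.0):
    """One discrete SAMME round for two classes (assumes 0 < error < 1)."""
    incorrect = (y_true != y_pred).astype(float)
    # Weighted error of the current base estimator.
    err = np.sum(sample_weight * incorrect) / np.sum(sample_weight)
    # Estimator weight; for two classes the log(K - 1) term is zero.
    alpha = learning_rate * np.log((1.0 - err) / err)
    # Up-weight misclassified samples, then renormalize to sum to 1.
    new_weight = sample_weight * np.exp(alpha * incorrect)
    return new_weight / new_weight.sum(), alpha


def logloss_pseudo_residuals(y_true, raw_score):
    """Negative gradient of the log-loss with respect to the raw score."""
    prob = 1.0 / (1.0 + np.exp(-raw_score))
    return y_true - prob


y_true = np.array([1, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 0, 1])  # a single mistake, at index 1
weights = np.full(5, 0.2)

new_weights, alpha = samme_reweight(weights, y_true, y_pred)
residuals = logloss_pseudo_residuals(y_true, np.zeros(5))
print(new_weights, alpha, residuals)
```

In the AdaBoost step the misclassified sample ends up carrying half of the total weight, so the next base estimator concentrates on it. In the gradient step, a raw score of zero corresponds to a probability of 0.5, so every residual is +0.5 or -0.5.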

Regularization and early stopping

  • subsample : float, default=`1.0` Fraction of samples used to fit each base estimator; must lie in (0, 1]. Values below 1.0 (e.g. 0.8) enable stochastic boosting and can reduce variance.

  • shrinkage : float, default=`1.0` Multiplier applied to each estimator’s weight; must lie in (0, 1]. Values below 1.0 (e.g. 0.8 to 0.95) can reduce overfitting and improve generalization.

  • early_stopping : bool, default=`False` If True, training stops when validation score does not improve for n_iter_no_change consecutive iterations. Requires n_iter_no_change to be set.

  • validation_fraction : float, default=`0.1` Fraction of training data used as validation for early stopping. Only used when early_stopping=True and subsample >= 1.0. When subsample < 1.0, out-of-bag (OOB) evaluation is used instead and this parameter is ignored.

  • n_iter_no_change : int, default=`5` Number of iterations with no improvement to wait before stopping (when early_stopping=True).

  • tol : float, default=`1e-4` Minimum gain in validation score that counts as an improvement for early stopping.
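
The interplay of n_iter_no_change and tol can be sketched in plain Python. This is an illustration of the stopping rule as described in the list above, not the library's implementation; stopping_iteration is a hypothetical helper and the scores are made-up validation scores.

```python
def stopping_iteration(scores, n_iter_no_change=5, tol=1e-4):
    """Return the 1-based iteration at which training would stop."""
    best = float("-inf")
    stale = 0  # consecutive iterations without a gain larger than tol
    for iteration, score in enumerate(scores, start=1):
        if score > best + tol:
            best = score
            stale = 0
        else:
            stale += 1
            if stale >= n_iter_no_change:
                return iteration
    return len(scores)  # never triggered: all iterations were used


# Illustrative validation scores that plateau after the third iteration.
val_scores = [0.70, 0.75, 0.78, 0.78, 0.779, 0.78, 0.7801, 0.78]
stopped_at = stopping_iteration(val_scores, n_iter_no_change=3, tol=1e-4)
print(stopped_at)
```

Iterations 4, 5, and 6 fail to beat the best score by more than tol, so the patience of three is exhausted at iteration 6 and the later scores are never reached.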

Data scaling

  • scaler : str, default=`'minmax'` Scaling applied before training. When scaler != 'minmax', the pipeline is: chosen scaler → MinMaxScaler (so SEFR always sees values in a bounded range). Options:
      - 'minmax': MinMaxScaler only.
      - 'standard', 'robust', 'quantile-uniform', 'quantile-normal'.
      - 'normalizer-l1', 'normalizer-l2', 'normalizer-max', 'power', 'maxabs'.

The fitted transformer is available as scaler_ (a pipeline when scaler is not 'minmax').
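
As a rough picture of the two-stage scaling described above, the sketch below chains scikit-learn's StandardScaler and MinMaxScaler by hand. It is a stand-in for what scaler='standard' is described as doing, not the object LinearBoost builds internally; X_demo and two_stage are illustrative names.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))

# Chosen scaler first, then MinMaxScaler so every feature ends up in [0, 1].
two_stage = make_pipeline(StandardScaler(), MinMaxScaler())
X_scaled = two_stage.fit_transform(X_demo)

print(X_scaled.min(axis=0), X_scaled.max(axis=0))
```

Whatever the first stage does to the distribution, the final MinMaxScaler guarantees a bounded range on the training data, which is the property the base estimator relies on.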

Imbalanced data and custom loss

  • class_weight : dict, 'balanced', or None, default=`None` Class weights. Use 'balanced' to weight inversely to class frequencies. Can be combined with sample_weight in fit().

  • loss_function : callable or None, default=`None` Optional custom loss to optimize, with signature (y_true, y_pred, sample_weight) -> float.
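
A callable with the documented loss_function signature might look like the sketch below. The guide does not say whether y_pred holds hard labels or scores, so this example assumes hard labels; weighted_error_rate is a hypothetical name, not part of the library.

```python
import numpy as np


def weighted_error_rate(y_true, y_pred, sample_weight=None):
    """Weighted fraction of misclassified samples, returned as a float.

    Assumption: y_pred contains hard class labels, not probabilities.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if sample_weight is None:
        sample_weight = np.ones(len(y_true))
    sample_weight = np.asarray(sample_weight, dtype=float)
    mistakes = (y_true != y_pred).astype(float)
    return float(np.sum(sample_weight * mistakes) / np.sum(sample_weight))


# One mistake (third sample), which carries weight 2 out of a total of 5.
loss = weighted_error_rate([0, 1, 1, 0], [0, 1, 0, 0], sample_weight=[1, 1, 2, 1])
print(loss)
```

Such a function would then be passed as loss_function=weighted_error_rate when constructing the classifier.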

Kernels and kernel approximation

  • kernel : {'linear', 'poly', 'rbf', 'sigmoid'} or callable, default=`'linear'`
      - 'linear': No kernel; fastest, for linearly separable data.
      - 'rbf': Radial basis function; flexible for non-linear boundaries.
      - 'poly': Polynomial; complexity controlled by degree.
      - 'sigmoid': Sigmoid kernel.

  • gamma : float or None, default=`None` Kernel coefficient for 'rbf', 'poly', 'sigmoid'. If None, set to 1 / n_features.

  • degree : int, default=`3` Degree for 'poly' kernel.

  • coef0 : float, default=`1` Independent term in 'poly' and 'sigmoid' kernels.

  • kernel_approx : {'rff', 'nystrom'} or None, default=`None` For large datasets with a non-linear kernel, use an approximation to avoid building an O(n²) Gram matrix:
      - 'rff': Random Fourier Features; only valid for kernel='rbf'.
      - 'nystrom': Nyström approximation; works with 'rbf', 'poly', 'sigmoid'.
      - None: Exact kernel (full Gram matrix).

  • n_components : int, default=`256` Dimensionality of the kernel feature map when kernel_approx is used (number of random features for 'rff', rank for 'nystrom').
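
To see what the 'rff' option trades away, the NumPy sketch below compares an exact RBF Gram matrix with its Random Fourier Features approximation, using gamma = 1 / n_features as in the default described above. It shows the standard RFF construction, not LinearBoost's own code; the variable names and the random toy data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_components = 5, 256
gamma = 1.0 / n_features  # default described above when gamma is None

X_demo = rng.uniform(size=(8, n_features))

# Exact RBF Gram matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2).
sq_dists = ((X_demo[:, None, :] - X_demo[None, :, :]) ** 2).sum(axis=-1)
K_exact = np.exp(-gamma * sq_dists)

# Random Fourier Features: z(x) = sqrt(2 / D) * cos(x @ W + b),
# with W drawn from N(0, 2 * gamma) and b from Uniform(0, 2 * pi).
W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(n_features, n_components))
b = rng.uniform(0.0, 2.0 * np.pi, size=n_components)
Z = np.sqrt(2.0 / n_components) * np.cos(X_demo @ W + b)

# Inner products of the feature map approximate the kernel values.
K_approx = Z @ Z.T
mean_abs_error = np.abs(K_exact - K_approx).mean()
print(Z.shape, mean_abs_error)
```

The feature map has shape (n_samples, n_components), so memory grows linearly with the number of samples instead of quadratically. Raising n_components tightens the approximation at the cost of a wider feature matrix.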

Examples by feature

Non-linear kernel (exact)

model = LinearBoostClassifier(
    kernel="rbf",
    gamma=0.1,
    n_estimators=100,
    learning_rate=0.5,
)
model.fit(X_train, y_train)

Kernel approximation (scalable)

model = LinearBoostClassifier(
    kernel="rbf",
    kernel_approx="rff",
    n_components=256,
    n_estimators=100,
)
model.fit(X_train, y_train)

Gradient boosting

model = LinearBoostClassifier(
    boosting_type="gradient",
    kernel="rbf",
    n_estimators=200,
)
model.fit(X_train, y_train)

Early stopping with validation split

model = LinearBoostClassifier(
    n_estimators=500,
    early_stopping=True,
    validation_fraction=0.1,
    n_iter_no_change=5,
    tol=1e-4,
)
model.fit(X_train, y_train)

Early stopping with OOB (when using subsampling)

model = LinearBoostClassifier(
    n_estimators=500,
    subsample=0.8,
    early_stopping=True,
    n_iter_no_change=5,
)
model.fit(X_train, y_train)
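
The out-of-bag idea behind this mode can be sketched with NumPy: the samples not drawn for a given round form a free validation set. How LinearBoost actually draws its subsamples is not stated in this guide, so sampling without replacement is an assumption here, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, subsample = 800, 0.8

# Draw the in-bag subset for one boosting round
# (without replacement: an assumption, not confirmed library behavior).
n_in_bag = int(subsample * n_samples)
in_bag_idx = rng.choice(n_samples, size=n_in_bag, replace=False)

# Everything that was not drawn is out-of-bag for this round.
oob_mask = np.ones(n_samples, dtype=bool)
oob_mask[in_bag_idx] = False
n_oob = int(oob_mask.sum())

print(n_in_bag, n_oob)
```

Because the out-of-bag samples change every round, no separate validation split has to be held out of the training data.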

Imbalanced data

model = LinearBoostClassifier(class_weight="balanced", n_estimators=200)
model.fit(X_train, y_train)

Using SEFR standalone

SEFR is the base binary linear classifier. You can use it on its own as a very fast, lightweight model. It supports fit_intercept, kernel (linear, poly, rbf, sigmoid, or precomputed), and the kernel parameters gamma, degree, and coef0.

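from linearboost import SEFR

# Use a separate name so the LinearBoostClassifier stored in clf stays available
sefr = SEFR(kernel="rbf", fit_intercept=True)
sefr.fit(X_train, y_train)
sefr_pred = sefr.predict(X_test)
print(f"SEFR accuracy: {sefr.score(X_test, y_test):.4f}")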

See API Reference for SEFR’s full parameters and attributes.
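
For readers curious why SEFR is so cheap, here is a minimal NumPy sketch of the linear SEFR rule as described in the original SEFR paper. It is not a copy of sefr.py, which may differ in details such as intercept handling and kernels; sefr_fit, sefr_predict, and the toy data are hypothetical. It assumes non-negative features and labels in {0, 1}.

```python
import numpy as np


def sefr_fit(X, y, eps=1e-7):
    """Fit SEFR-style weights and bias in a single pass over the data."""
    pos, neg = X[y == 1], X[y == 0]
    avg_pos, avg_neg = pos.mean(axis=0), neg.mean(axis=0)
    # Per-feature weight: normalized difference of the class means.
    weights = (avg_pos - avg_neg) / (avg_pos + avg_neg + eps)
    scores = X @ weights
    pos_score, neg_score = scores[y == 1].mean(), scores[y == 0].mean()
    # Bias: threshold between the class score means, each weighted by the
    # size of the opposite class.
    bias = -(len(neg) * pos_score + len(pos) * neg_score) / len(X)
    return weights, bias


def sefr_predict(X, weights, bias):
    """Predict 1 where the linear score plus bias is positive, else 0."""
    return (X @ weights + bias > 0).astype(int)


X_toy = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
y_toy = np.array([1, 1, 0, 0])

w, bias = sefr_fit(X_toy, y_toy)
pred = sefr_predict(X_toy, w, bias)
print(w, bias, pred)
```

Training needs only class means and one matrix-vector product, with no iterative optimization. The division by the sum of class means is also why bounded, non-negative inputs matter for this kind of estimator.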

Advanced usage

Inspecting the fitted model

print(f"Number of estimators: {len(clf.estimators_)}")
print(f"Estimator weights: {clf.estimator_weights_}")
print(f"Estimator errors: {clf.estimator_errors_}")
print(f"Fitted scaler: {clf.scaler_}")

# Transform new data with the same scaling
X_new_scaled = clf.scaler_.transform(X_test)

When boosting_type='gradient', the raw scores and the initial score are stored in F_ and init_score_ (if present).

Hyperparameter tuning with GridSearchCV

from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.1, 0.5, 1.0],
    "kernel": ["linear", "rbf"],
    "scaler": ["minmax", "robust"],
    "boosting_type": ["adaboost", "gradient"],
}

lbc = LinearBoostClassifier()
grid = GridSearchCV(estimator=lbc, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
grid.fit(X_train, y_train)

print(grid.best_params_)
best_model = grid.best_estimator_
print(f"Test accuracy: {best_model.score(X_test, y_test):.4f}")
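
Before launching a search like the one above, it can help to count how many fits it implies. The sketch below does this in plain Python for the same grid; the +1 accounts for the final refit on the full training set that GridSearchCV performs with its default refit=True.

```python
from math import prod

param_grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.1, 0.5, 1.0],
    "kernel": ["linear", "rbf"],
    "scaler": ["minmax", "robust"],
    "boosting_type": ["adaboost", "gradient"],
}
cv_folds = 3

# Every combination of values is one candidate configuration.
n_candidates = prod(len(values) for values in param_grid.values())
cv_fits = n_candidates * cv_folds
total_fits = cv_fits + 1  # plus the final refit of the best candidate

print(n_candidates, cv_fits, total_fits)
```

If the count is too high for the available compute, shrinking the grid or switching to a randomized search over the same parameters keeps the cost under control.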

Limitations

  • Binary classification only: Both LinearBoostClassifier and SEFR support only two-class targets.

  • Numeric features only: Input X must be numeric. Encode categorical features (e.g. one-hot) before use.
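
Both limitations can be handled in a small preprocessing step. The sketch below one-hot encodes a categorical column with scikit-learn's OneHotEncoder, stacks it next to a numeric column, and checks that the target has exactly two classes; the toy arrays and variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Illustrative data: one categorical column, one numeric column, binary target.
colors = np.array([["red"], ["green"], ["red"], ["blue"]])
sizes = np.array([[1.0], [2.5], [0.5], [3.0]])
y_demo = np.array([0, 1, 0, 1])

# One-hot encode the categorical column; unseen categories map to all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
colors_encoded = encoder.fit_transform(colors).toarray()

# Combine into a purely numeric feature matrix.
X_numeric = np.hstack([sizes, colors_encoded])

# Guard against non-binary targets before fitting.
n_classes = np.unique(y_demo).size
if n_classes != 2:
    raise ValueError(f"expected exactly two classes, got {n_classes}")

print(X_numeric.shape)
```

The resulting X_numeric and y_demo are in the form the classifier expects: a numeric matrix with one column per original numeric feature plus one per category, and a two-class target.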

Feedback

For more details and source code, see the LinearBoost GitHub repository. We welcome issues, contributions, and suggestions.