Usage Guide
===========

This guide covers **LinearBoostClassifier** and the base estimator **SEFR** from installation through advanced options, aligned with the current implementation in ``linear_boost.py`` and ``sefr.py``.

.. _installation:

Installation
------------

Install the **linearboost** package (and optionally use a virtual environment):

.. code-block:: console

    pip install linearboost

Requirements: Python >= 3.8, scikit-learn >= 1.2.2.

---

A Complete Example
------------------

Basic workflow: load or generate data, split, fit **LinearBoostClassifier**, and evaluate.

.. code-block:: python

    from linearboost import LinearBoostClassifier
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, f1_score

    X, y = make_classification(
        n_samples=1000,
        n_features=20,
        n_informative=10,
        n_redundant=5,
        random_state=42,
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    clf = LinearBoostClassifier()
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
    print(f"F1 (weighted): {f1_score(y_test, y_pred, average='weighted'):.4f}")

    y_proba = clf.predict_proba(X_test)
    print("Probability estimates for first 2 samples:")
    print(y_proba[:2])

---

Key Parameters (LinearBoostClassifier)
--------------------------------------

Parameters below match the current API. See :doc:`api` for the full signature and attributes.

Boosting type and algorithm
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **boosting_type** : `{'adaboost', 'gradient'}`, default=`'adaboost'`

  - ``'adaboost'``: Classic AdaBoost (SAMME or SAMME.R) that reweights samples by classification error.
  - ``'gradient'``: Gradient boosting; each base estimator fits pseudo-residuals (negative gradient of log-loss). Often better for highly non-linear or XOR-like patterns.

  When using ``'gradient'``, the **algorithm** parameter is ignored.

- **algorithm** : `{'SAMME', 'SAMME.R'}`, default=`'SAMME.R'`
  Used only when **boosting_type='adaboost'**. ``'SAMME.R'`` typically converges faster and achieves lower test error with fewer iterations. ``'SAMME'`` is the discrete variant.

- **n_estimators** : `int`, default=`200`
  Maximum number of base SEFR estimators. With **early_stopping=True** you can set a larger value (e.g. 500) and let training stop once the validation score stops improving.

- **learning_rate** : `float`, default=`1.0`
  Shrinks the contribution of each estimator. There is a trade-off with **n_estimators**: a lower learning rate usually needs more estimators but can improve generalization.

Regularization and early stopping
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **subsample** : `float`, default=`1.0`
  Fraction of samples used to fit each base estimator; must be in (0, 1]. Values below 1.0 (e.g. ``0.8``) enable stochastic boosting and can reduce variance.

- **shrinkage** : `float`, default=`1.0`
  Multiplier for each estimator’s weight; must be in (0, 1]. Values below 1.0 (e.g. ``0.8`` to ``0.95``) can reduce overfitting and improve generalization.

- **early_stopping** : `bool`, default=`False`
  If ``True``, training stops when the validation score does not improve for **n_iter_no_change** consecutive iterations. Requires **n_iter_no_change** to be set.

- **validation_fraction** : `float`, default=`0.1`
  Fraction of training data held out as a validation set for early stopping. Only used when **early_stopping=True** and **subsample >= 1.0**. When **subsample < 1.0**, out-of-bag (OOB) evaluation is used instead and this parameter is ignored.

- **n_iter_no_change** : `int`, default=`5`
  Number of iterations with no improvement to wait before stopping (when **early_stopping=True**).

- **tol** : `float`, default=`1e-4`
  Minimum increase in score that counts as an improvement for early stopping.

Data scaling
~~~~~~~~~~~~

- **scaler** : `str`, default=`'minmax'`
  Scaling applied before training.
  When ``scaler != 'minmax'``, the pipeline is: chosen scaler → **MinMaxScaler** (so SEFR always sees values in a bounded range). Options include:

  - ``'minmax'``: MinMaxScaler only.
  - ``'standard'``, ``'robust'``, ``'quantile-uniform'``, ``'quantile-normal'``.
  - ``'normalizer-l1'``, ``'normalizer-l2'``, ``'normalizer-max'``, ``'power'``, ``'maxabs'``.

  The fitted transformer is available as **scaler_** (a pipeline when scaler is not ``'minmax'``).

Imbalanced data and custom loss
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **class_weight** : `dict`, `'balanced'`, or `None`, default=`None`
  Class weights. Use ``'balanced'`` to weight inversely to class frequencies. Can be combined with **sample_weight** in ``fit()``.

- **loss_function** : callable or `None`, default=`None`
  Optional custom loss with signature ``(y_true, y_pred, sample_weight) -> float`` for optimization.

Kernels and kernel approximation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **kernel** : `{'linear', 'poly', 'rbf', 'sigmoid'}` or callable, default=`'linear'`

  - ``'linear'``: No kernel; fastest, for linearly separable data.
  - ``'rbf'``: Radial basis function; flexible for non-linear boundaries.
  - ``'poly'``: Polynomial; complexity controlled by **degree**.
  - ``'sigmoid'``: Sigmoid kernel.

- **gamma** : `float` or `None`, default=`None`
  Kernel coefficient for ``'rbf'``, ``'poly'``, ``'sigmoid'``. If ``None``, set to ``1 / n_features``.

- **degree** : `int`, default=`3`
  Degree for ``'poly'`` kernel.

- **coef0** : `float`, default=`1`
  Independent term in ``'poly'`` and ``'sigmoid'`` kernels.

- **kernel_approx** : `{'rff', 'nystrom'}` or `None`, default=`None`
  For large datasets with non-linear **kernel**, use approximation to avoid an O(n²) Gram matrix:

  - ``'rff'``: Random Fourier Features; only valid for **kernel='rbf'**.
  - ``'nystrom'``: Nyström approximation; works with ``'rbf'``, ``'poly'``, ``'sigmoid'``.
  - ``None``: Exact kernel (full Gram matrix).


- **n_components** : `int`, default=`256`
  Dimensionality of the kernel feature map when **kernel_approx** is used (number of random features for ``'rff'``, rank for ``'nystrom'``).

---

Examples by feature
--------------------

Non-linear kernel (exact)
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    model = LinearBoostClassifier(
        kernel="rbf",
        gamma=0.1,
        n_estimators=100,
        learning_rate=0.5,
    )
    model.fit(X_train, y_train)

Kernel approximation (scalable)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    model = LinearBoostClassifier(
        kernel="rbf",
        kernel_approx="rff",
        n_components=256,
        n_estimators=100,
    )
    model.fit(X_train, y_train)

Gradient boosting
~~~~~~~~~~~~~~~~~~

.. code-block:: python

    model = LinearBoostClassifier(
        boosting_type="gradient",
        kernel="rbf",
        n_estimators=200,
    )
    model.fit(X_train, y_train)

Early stopping with validation split
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    model = LinearBoostClassifier(
        n_estimators=500,
        early_stopping=True,
        validation_fraction=0.1,
        n_iter_no_change=5,
        tol=1e-4,
    )
    model.fit(X_train, y_train)

Early stopping with OOB (when using subsampling)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    model = LinearBoostClassifier(
        n_estimators=500,
        subsample=0.8,
        early_stopping=True,
        n_iter_no_change=5,
    )
    model.fit(X_train, y_train)

Imbalanced data
~~~~~~~~~~~~~~~

.. code-block:: python

    model = LinearBoostClassifier(class_weight="balanced", n_estimators=200)
    model.fit(X_train, y_train)

---

Using SEFR standalone
----------------------

**SEFR** is the base binary linear classifier. You can use it alone for a very fast, lightweight model. It supports **fit_intercept**, **kernel** (linear, poly, rbf, sigmoid, precomputed), and **gamma**, **degree**, **coef0**.

.. code-block:: python

    from linearboost import SEFR

    clf = SEFR(kernel="rbf", fit_intercept=True)
    clf.fit(X_train, y_train)
    clf.predict(X_test)
    clf.score(X_test, y_test)

See :doc:`api` for SEFR’s full parameters and attributes.

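
The two-stage scaling described under *Data scaling* (chosen scaler followed by **MinMaxScaler**) can be reproduced with scikit-learn alone, which may help when inspecting the value range a base estimator receives. The snippet below is a minimal sketch, assuming the internal pipeline behaves like a plain scikit-learn ``Pipeline`` with ``RobustScaler`` standing in for ``scaler='robust'``; the actual construction in ``linear_boost.py`` may differ, and the variable names are illustrative.

.. code-block:: python

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MinMaxScaler, RobustScaler

    # Small synthetic binary problem, used only for illustration.
    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0
    )

    # Assumed stand-in for scaler="robust": RobustScaler, then MinMaxScaler.
    two_stage = make_pipeline(RobustScaler(), MinMaxScaler())
    X_train_scaled = two_stage.fit_transform(X_train)  # ranges learned here
    X_test_scaled = two_stage.transform(X_test)        # reuses training ranges

    # Per-feature range on the training split.
    print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
    # Overall range on the test split.
    print(X_test_scaled.min(), X_test_scaled.max())

Each training column spans exactly [0, 1] after the second stage. Test rows can fall slightly outside that interval, because the range is learned from the training split only.
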
---

Advanced usage
--------------

Inspecting the fitted model
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    # clf is assumed to be a fitted LinearBoostClassifier (not a standalone SEFR).
    print(f"Number of estimators: {len(clf.estimators_)}")
    print(f"Estimator weights: {clf.estimator_weights_}")
    print(f"Estimator errors: {clf.estimator_errors_}")
    print(f"Fitted scaler: {clf.scaler_}")

    # Transform new data with the same scaling
    X_new_scaled = clf.scaler_.transform(X_test)

When **boosting_type='gradient'**, the raw scores and initial score are stored in **F_** and **init_score_** (if present).

Hyperparameter tuning with GridSearchCV
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    from sklearn.model_selection import GridSearchCV

    param_grid = {
        "n_estimators": [50, 100, 200],
        "learning_rate": [0.1, 0.5, 1.0],
        "kernel": ["linear", "rbf"],
        "scaler": ["minmax", "robust"],
        "boosting_type": ["adaboost", "gradient"],
    }

    lbc = LinearBoostClassifier()
    grid = GridSearchCV(estimator=lbc, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
    grid.fit(X_train, y_train)

    print(grid.best_params_)
    best_model = grid.best_estimator_
    print(f"Test accuracy: {best_model.score(X_test, y_test):.4f}")

---

Limitations
-----------

- **Binary classification only**: Both LinearBoostClassifier and SEFR support only two-class targets.
- **Numeric features only**: Input ``X`` must be numeric. Encode categorical features (e.g. one-hot) before use.

---

Feedback
--------

For more details and source code, see the LinearBoost GitHub repository. We welcome issues, contributions, and suggestions.