API Reference

Top-level package for SpectoPrep, a comprehensive toolkit for spectroscopic data preprocessing and modeling.

This package provides tools for preprocessing spectroscopic data, pipeline optimization, and modeling using Ridge regression.

class spectoprep.OptimizedRidgeCV(alphas=None, cv=5, scoring='neg_mean_squared_error', fit_intercept=True, normalize=False, gcv_mode=None, store_cv_values=False, groups=None)

Bases: BaseEstimator, RegressorMixin

Ridge regression with built-in cross-validation and optimization capabilities.

Parameters

alphas : array-like, default=np.logspace(-3, 3, 10)

Array of alpha values to try. A large array of values will slow down the computation.

cv : int, cross-validation generator or an iterable, default=5

Determines the cross-validation splitting strategy.

scoring : str, callable, default='neg_mean_squared_error'

A string or a scorer callable object / function with signature scorer(estimator, X, y).

fit_intercept : bool, default=True

Whether to calculate the intercept for this model.

normalize : bool, default=False

This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm.

gcv_mode : {None, 'auto', 'svd', 'eigen'}, default=None

Flag indicating which strategy to use when performing Generalized Cross-Validation.

store_cv_values : bool, default=False

Flag indicating if the cross-validation values corresponding to each alpha should be stored in the cv_values_ attribute.

groups : array-like, default=None

Group labels for the samples. Only used if cv is a group-based cross-validation splitter.
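To illustrate what a group-based splitter does with groups, the following minimal sketch holds out one group label per split, so samples that share a label never land on both the training and the test side. It is plain Python for illustration only: the helper name leave_one_group_out and the example labels are assumptions, not part of spectoprep.

```python
from collections import defaultdict


def leave_one_group_out(groups):
    """Yield (train_idx, test_idx) pairs, holding out one group label at a time.

    Hypothetical helper for illustration; not part of spectoprep.
    """
    by_group = defaultdict(list)
    for idx, label in enumerate(groups):
        by_group[label].append(idx)
    for held_out in sorted(by_group):
        test_idx = by_group[held_out]
        train_idx = sorted(
            i for label, idxs in by_group.items() if label != held_out for i in idxs
        )
        yield train_idx, test_idx


# Made-up labels: e.g. replicate measurements of the same physical sample.
groups = ["a", "a", "b", "b", "c"]
splits = list(leave_one_group_out(groups))
```

Replicate spectra of the same sample are a typical reason to pass groups: splitting them across folds would make the cross-validation score overly optimistic.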

fit(X, y, sample_weight=None)

Fit Ridge regression model with cross-validation.

Parameters

X : array-like of shape (n_samples, n_features)

Training data.

y : array-like of shape (n_samples,) or (n_samples, n_targets)

Target values.

sample_weight : float or array-like of shape (n_samples,), default=None

Individual weights for each sample.

Returns

self : object

Returns self.
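As a rough picture of what fitting with cross-validation over alphas involves, the sketch below scores each alpha on a log-spaced grid with k-fold cross-validation and keeps the one with the lowest mean squared error. It is a single-feature, pure-Python illustration of the general technique, not the implementation used by OptimizedRidgeCV; the helper names log_grid, fit_ridge_1d and cv_mse and the toy data are assumptions.

```python
def log_grid(low_exp=-3.0, high_exp=3.0, num=10):
    """Log-spaced values, mirroring np.logspace(low_exp, high_exp, num)."""
    step = (high_exp - low_exp) / (num - 1)
    return [10.0 ** (low_exp + i * step) for i in range(num)]


def fit_ridge_1d(x, y, alpha):
    """Closed-form ridge fit for one feature with an unpenalized intercept."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / (sxx + alpha)  # alpha shrinks the slope towards zero
    return slope, y_mean - slope * x_mean


def cv_mse(x, y, alpha, n_folds=5):
    """Mean squared error averaged over interleaved folds."""
    n = len(x)
    total = 0.0
    for fold in range(n_folds):
        test = [i for i in range(n) if i % n_folds == fold]
        train = [i for i in range(n) if i % n_folds != fold]
        slope, intercept = fit_ridge_1d(
            [x[i] for i in train], [y[i] for i in train], alpha
        )
        total += sum((y[i] - (slope * x[i] + intercept)) ** 2 for i in test) / len(test)
    return total / n_folds


# Toy, noise-free data (made up for the example).
x = [float(i) for i in range(20)]
y = [2.0 * xi + 1.0 for xi in x]

alphas = log_grid()
scores = {alpha: cv_mse(x, y, alpha) for alpha in alphas}
best_alpha = min(scores, key=scores.get)
```

On noise-free data, shrinkage can only hurt, so the smallest alpha wins; with noisy, collinear spectra a larger alpha often scores better, which is why the grid is searched.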

get_cv_results()

Return cross-validation results.

Returns

cv_results : dict

Results from cross-validation.

predict(X)

Predict using the Ridge model.

Parameters

X : array-like of shape (n_samples, n_features)

Samples.

Returns

y_pred : array-like of shape (n_samples,) or (n_samples, n_targets)

Returns predicted values.

score(X, y, sample_weight=None)

Return the coefficient of determination R^2 of the prediction.

Parameters

X : array-like of shape (n_samples, n_features)

Test samples.

y : array-like of shape (n_samples,) or (n_samples, n_targets)

True values for X.

sample_weight : array-like of shape (n_samples,), default=None

Sample weights.

Returns

score : float

R^2 of self.predict(X) with respect to y.
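For reference, the coefficient of determination is one minus the ratio of the residual sum of squares to the total sum of squares. The unweighted, single-target sketch below shows that arithmetic in plain Python; the function name r_squared is an assumption for illustration and is not how spectoprep computes the score.

```python
def r_squared(y_true, y_pred):
    """Unweighted R^2 = 1 - SS_res / SS_tot for a single target."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y_true, y_true)        # predictions equal the targets
baseline = r_squared(y_true, [2.5] * 4)    # always predicting the mean
```

A perfect prediction gives 1.0, always predicting the mean gives 0.0, and a model worse than the mean gives a negative value.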

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → OptimizedRidgeCV

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in fit.

Returns

self : object

The updated object.

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → OptimizedRidgeCV

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in score.

Returns

self : object

The updated object.

class spectoprep.PipelineOptimizer(X_train: ndarray[tuple[Any, ...], dtype[_ScalarT]], y_train: ndarray[tuple[Any, ...], dtype[_ScalarT]], preprocessing_steps: List[str] | None = None, X_test: ndarray[tuple[Any, ...], dtype[_ScalarT]] | None = None, y_test: ndarray[tuple[Any, ...], dtype[_ScalarT]] | None = None, cv_method: str = 'group_shuffle_split', n_splits: int = 3, test_size: float = 0.3, n_groups_out: int = 2, random_state: int = 42, groups: ndarray[tuple[Any, ...], dtype[_ScalarT]] | None = None, max_pipeline_length: int = 5, n_jobs: int = -1, allowed_preprocess_combinations: int | List[int] | Tuple[int, ...] | None = [1, 2], log_level: str = 'INFO')

Bases: object

A class for optimizing machine learning pipelines using Bayesian optimization. It precomputes possible pipeline configurations and then searches over both the pipeline configuration (encoded as an index) and the hyperparameters.
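One way to picture "pipeline configuration encoded as an index" is to enumerate every allowed ordering of preprocessing steps up front and let the optimizer propose a number that is decoded back into one of them. The sketch below does this with itertools; the step names, the use of ordered permutations, and the rounding-and-clamping decode are all assumptions for illustration, since the class's actual enumeration is not described here.

```python
from itertools import permutations


def enumerate_configs(step_names, allowed_lengths=(1, 2)):
    """List every ordered selection of steps for each allowed pipeline length."""
    configs = []
    for length in allowed_lengths:
        configs.extend(permutations(step_names, length))
    return configs


def decode_index(raw_value, configs):
    """Map a continuous optimizer proposal onto a valid configuration."""
    index = int(round(raw_value))
    index = max(0, min(index, len(configs) - 1))  # clamp into range
    return configs[index]


# Hypothetical step names; real names come from preprocessing_steps.
steps = ["step_a", "step_b", "step_c"]
configs = enumerate_configs(steps)        # 3 single steps + 6 ordered pairs
first = decode_index(0.2, configs)
last = decode_index(99.0, configs)
```

Encoding the discrete choice as one bounded number lets a continuous optimizer search over pipeline structure and hyperparameters in the same parameter space.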

bayes_objective(**params) → float

Objective function for Bayesian optimization.

Args:

**params: Parameters to evaluate

Returns:

float: Negative RMSE or penalty score on error
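The "negative RMSE or penalty" contract can be sketched as follows: the error is negated so that a maximizing optimizer minimizes it, and any failed evaluation is mapped to a fixed, very poor score instead of raising. The penalty value, the helper names, and the toy evaluation functions are assumptions for illustration; the values bayes_objective actually uses are not documented here.

```python
import math

PENALTY_SCORE = -1e6  # hypothetical penalty; the library's value may differ


def negative_rmse(y_true, y_pred):
    """Negated root-mean-squared error, so that larger is better."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return -math.sqrt(mse)


def safe_objective(evaluate, **params):
    """Return evaluate(**params), or the penalty if it raises or is not finite."""
    try:
        score = evaluate(**params)
    except Exception:
        return PENALTY_SCORE
    return score if math.isfinite(score) else PENALTY_SCORE


def working_evaluate(shift):
    # Toy stand-in for "build pipeline, cross-validate, return -RMSE".
    return negative_rmse([0.0, 0.0], [shift, shift])


def failing_evaluate(shift):
    # Toy stand-in for a parameter combination that cannot be fitted.
    raise ValueError("simulated preprocessing failure")


ok_score = safe_objective(working_evaluate, shift=3.0)
bad_score = safe_objective(failing_evaluate, shift=1.0)
```

Returning a penalty keeps a long optimization run alive when a single parameter combination is invalid, while steering the search away from that region.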

bayesian_optimize(init_points: int = 10, n_iter: int = 50, acquisition_function: str = 'ei') → Tuple[Dict, Pipeline]

Run Bayesian optimization to find the best pipeline configuration and hyperparameters.

Args:

init_points: Number of random initial points

n_iter: Number of Bayesian optimization iterations

acquisition_function: Acquisition function for Bayesian optimization

Returns:
Tuple containing:
  • Dict of best parameters

  • Fitted Pipeline with best configuration

export_best_pipeline(file_path: str) → None

Export the best pipeline configuration and hyperparameters to a file.

Args:

file_path: Path to save the export file

Raises:

AttributeError: If optimizer hasn’t been run yet

get_all_tested_pipelines() → List[Dict]

Get details of all tested pipeline configurations.

Returns:

List of dictionaries with pipeline details
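A list of dictionaries like this is easy to rank after the fact. The sketch below sorts such a list by score and keeps the best entries; the keys "steps" and "score" and the sample values are hypothetical, since the actual dictionary layout is not documented here, so inspect one returned entry before relying on any key.

```python
def top_pipelines(tested, n=5, score_key="score"):
    """Return the n entries with the highest score, best first."""
    return sorted(tested, key=lambda entry: entry[score_key], reverse=True)[:n]


# Made-up entries standing in for the method's return value.
tested = [
    {"steps": ["step_a"], "score": -0.42},
    {"steps": ["step_a", "step_b"], "score": -0.31},
    {"steps": ["step_c"], "score": -0.55},
]
best_two = top_pipelines(tested, n=2)
```

Sorting in descending order assumes that higher scores are better, which matches an objective expressed as negative RMSE.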

get_best_pipeline_predictions(best_pipeline: Pipeline) → Tuple[ndarray[tuple[Any, ...], dtype[_ScalarT]], float, float]

Get predictions using the best pipeline.

Args:

best_pipeline: Fitted pipeline object

Returns:
Tuple containing:
  • Predictions array

  • RMSE score

  • R² score

print_evaluated_pipelines() → None

Print details for all evaluated pipelines from the Bayesian optimizer.

This method assumes that bayesian_optimize() has been run and that self.optimizer exists.

summarize_optimization() → Dict

Generate a summary of the optimization results.

Returns:

Dictionary containing optimization summary metrics

class spectoprep.SpectroPrepPlotter

Bases: object

A class for creating high-quality plots for spectroscopy data.

This class provides various plotting functions specifically designed for spectroscopy data and pipeline optimization results.

static plot_feature_importance(wavenumbers: ndarray, coefficients: ndarray, title: str = 'Feature Importance', xlabel: str = 'Wavenumber (cm$^{-1}$)', ylabel: str = 'Coefficient Value', figsize: Tuple[int, int] = (12, 6), color: str = 'purple', highlight_threshold: float | None = None, highlight_color: str = 'red', save_path: str | None = None)

Plot feature importance from model coefficients.

Parameters

wavenumbers : array-like

The x-axis values (wavenumbers).

coefficients : array-like

Model coefficients corresponding to each wavenumber.

title : str, default='Feature Importance'

Plot title.

xlabel : str, default='Wavenumber (cm$^{-1}$)'

X-axis label.

ylabel : str, default='Coefficient Value'

Y-axis label.

figsize : tuple, default=(12, 6)

Figure size.

color : str, default='purple'

Color of the line.

highlight_threshold : float, optional

If provided, highlights coefficients with absolute values above this threshold.

highlight_color : str, default='red'

Color for highlighted coefficients.

save_path : str, optional

If provided, save the figure to this path.

Returns

fig : matplotlib.figure.Figure

The figure object.

ax : matplotlib.axes.Axes

The axes object.
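The highlight_threshold rule compares the absolute value of each coefficient with the threshold, so strongly negative coefficients are highlighted as well as strongly positive ones. The sketch below shows that selection in plain Python; the helper name, the strict ">" comparison, and the sample values are assumptions for illustration, not the plotter's code.

```python
def indices_above_threshold(coefficients, threshold):
    """Positions whose coefficient magnitude exceeds the threshold."""
    return [i for i, value in enumerate(coefficients) if abs(value) > threshold]


# Made-up coefficients, one per wavenumber.
coefficients = [0.1, -0.8, 0.3, 0.9]
highlighted = indices_above_threshold(coefficients, threshold=0.5)
```

Here the second and fourth coefficients are selected, the former because of its magnitude despite the negative sign.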

static plot_optimization_progress(optimizer: PipelineOptimizer, figsize: Tuple[int, int] = (12, 6), title: str = 'Optimization Progress', save_path: str | None = None)

Plot optimization progress over iterations.

Parameters

optimizer : PipelineOptimizer

The fitted pipeline optimizer.

figsize : tuple, default=(12, 6)

Figure size.

title : str, default='Optimization Progress'

Plot title.

save_path : str, optional

If provided, save the figure to this path.

Returns

fig : matplotlib.figure.Figure

The figure object.

ax : matplotlib.axes.Axes

The axes object.

static plot_optimization_results(optimizer: PipelineOptimizer, top_n: int = 5, figsize: Tuple[int, int] = (12, 8), title: str = 'Pipeline Optimization Results', save_path: str | None = None)

Plot optimization results from PipelineOptimizer.

Parameters

optimizer : PipelineOptimizer

The fitted pipeline optimizer.

top_n : int, default=5

Number of top pipelines to display.

figsize : tuple, default=(12, 8)

Figure size.

title : str, default='Pipeline Optimization Results'

Plot title.

save_path : str, optional

If provided, save the figure to this path.

Returns

fig : matplotlib.figure.Figure

The figure object.

static plot_prediction_scatter(y_true: ndarray, y_pred: ndarray, title: str = 'Prediction Performance', xlabel: str = 'Measured', ylabel: str = 'Predicted', figsize: Tuple[int, int] = (10, 8), alpha: float = 0.7, color: str = 'blue', add_metrics: bool = True, save_path: str | None = None)

Create a scatter plot of predicted vs true values.

Parameters

y_true : array-like

True target values.

y_pred : array-like

Predicted target values.

title : str, default='Prediction Performance'

Plot title.

xlabel : str, default='Measured'

X-axis label.

ylabel : str, default='Predicted'

Y-axis label.

figsize : tuple, default=(10, 8)

Figure size.

alpha : float, default=0.7

Transparency of the points.

color : str, default='blue'

Color of the scatter points.

add_metrics : bool, default=True

Whether to add RMSE and R² metrics to the plot.

save_path : str, optional

If provided, save the figure to this path.

Returns

fig : matplotlib.figure.Figure

The figure object.

ax : matplotlib.axes.Axes

The axes object.

static plot_preprocessing_comparison(wavenumbers: ndarray, original_spectra: ndarray, processed_spectra: Dict[str, ndarray], sample_indices: List[int] | None = None, figsize: Tuple[int, int] = (15, 10), title: str = 'Preprocessing Comparison', color_map: str = 'tab10', save_path: str | None = None)

Plot comparison of original and processed spectra.

Parameters

wavenumbers : array-like

The x-axis values (wavenumbers).

original_spectra : array-like

The original spectra data of shape (n_samples, n_features).

processed_spectra : dict

Dictionary mapping preprocessing method names to processed spectra.

sample_indices : list of int, optional

Indices of samples to plot. If None, all samples are plotted.

figsize : tuple, default=(15, 10)

Figure size.

title : str, default='Preprocessing Comparison'

Main title for the figure.

color_map : str, default='tab10'

Colormap for differentiating samples.

save_path : str, optional

If provided, save the figure to this path.

Returns

fig : matplotlib.figure.Figure

The figure object.

static plot_spectra(wavenumbers: ndarray, spectra: ndarray, labels: List[str] | None = None, title: str = 'Spectral Data', xlabel: str = 'Wavenumber (cm$^{-1}$)', ylabel: str = 'Absorbance', alpha: float = 0.7, figsize: Tuple[int, int] = (12, 6), color_map: str = 'viridis', legend_loc: str = 'best', grid: bool = True, save_path: str | None = None)

Plot spectral data.

Parameters

wavenumbers : array-like

The x-axis values (wavenumbers).

spectra : array-like

The spectra data of shape (n_samples, n_features).

labels : list of str, optional

Labels for each spectrum. If None, spectra are numbered.

title : str, default='Spectral Data'

Plot title.

xlabel : str, default='Wavenumber (cm$^{-1}$)'

X-axis label.

ylabel : str, default='Absorbance'

Y-axis label.

alpha : float, default=0.7

Transparency of the lines.

figsize : tuple, default=(12, 6)

Figure size.

color_map : str, default='viridis'

Colormap for the spectra.

legend_loc : str, default='best'

Location of the legend.

grid : bool, default=True

Whether to show grid.

save_path : str, optional

If provided, save the figure to this path.

Returns

fig : matplotlib.figure.Figure

The figure object.

ax : matplotlib.axes.Axes

The axes object.

static set_style(style='whitegrid', context='paper', font_scale=1.2)

Set the visual style for the plots.

Parameters

style : str, default='whitegrid'

The seaborn style.

context : str, default='paper'

The seaborn context.

font_scale : float, default=1.2

The font scale.

Modules