API Reference#
Top-level package for SpectoPrep. SpectroPrep: A comprehensive toolkit for spectroscopic data preprocessing and modeling.
This package provides tools for preprocessing spectroscopic data, pipeline optimization, and modeling using Ridge regression.
- class spectoprep.OptimizedRidgeCV(alphas=None, cv=5, scoring='neg_mean_squared_error', fit_intercept=True, normalize=False, gcv_mode=None, store_cv_values=False, groups=None)[source]#
Bases: BaseEstimator, RegressorMixin
Ridge regression with built-in cross-validation and optimization capabilities.
Parameters#
- alphas : array-like, default=np.logspace(-3, 3, 10)
Array of alpha values to try. A large array of values will slow down the computation.
- cv : int, cross-validation generator or an iterable, default=5
Determines the cross-validation splitting strategy.
- scoring : str, callable, default='neg_mean_squared_error'
A string or a scorer callable object / function with signature scorer(estimator, X, y).
- fit_intercept : bool, default=True
Whether to calculate the intercept for this model.
- normalize : bool, default=False
This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm.
- gcv_mode : {None, 'auto', 'svd', 'eigen'}, default=None
Flag indicating which strategy to use when performing Generalized Cross-Validation.
- store_cv_values : bool, default=False
Flag indicating if the cross-validation values corresponding to each alpha should be stored in the cv_values_ attribute.
- groups : array-like, default=None
Group labels for the samples. Only used if cv is a group-based cross-validation splitter.
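As a rough illustration of the `alphas`, `cv` and `groups` parameters described above, the sketch below builds the documented default alpha grid and a group-based splitter with scikit-learn's `GroupShuffleSplit`. The data, the group layout and the splitter settings are invented for the example, and it only prepares inputs; it does not call `OptimizedRidgeCV` itself.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# The documented default grid: 10 values, log-spaced from 1e-3 to 1e3.
alphas = np.logspace(-3, 3, 10)

# Synthetic data (an assumption for illustration): 18 samples in 6 groups of 3,
# e.g. replicate measurements of the same physical sample.
rng = np.random.default_rng(42)
X = rng.normal(size=(18, 5))
y = rng.normal(size=18)
groups = np.repeat(np.arange(6), 3)

# A group-based splitter keeps every group entirely in train or entirely in test.
splitter = GroupShuffleSplit(n_splits=3, test_size=0.3, random_state=42)
splits = list(splitter.split(X, y, groups))

for train_idx, test_idx in splits:
    train_groups = set(groups[train_idx])
    test_groups = set(groups[test_idx])
    print(sorted(train_groups), sorted(test_groups))
```

Going by the signature above, these objects would presumably be passed as `OptimizedRidgeCV(alphas=alphas, cv=splitter, groups=groups)`; that call is not exercised here.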
- fit(X, y, sample_weight=None)[source]#
Fit Ridge regression model with cross-validation.
Parameters#
- X : array-like of shape (n_samples, n_features)
Training data.
- y : array-like of shape (n_samples,) or (n_samples, n_targets)
Target values.
- sample_weight : float or array-like of shape (n_samples,), default=None
Individual weights for each sample.
Returns#
- self : object
Returns self.
- get_cv_results()[source]#
Return cross-validation results.
Returns#
- cv_results : dict
Results from cross-validation.
- predict(X)[source]#
Predict using the Ridge model.
Parameters#
- X : array-like of shape (n_samples, n_features)
Samples.
Returns#
- y_pred : array-like of shape (n_samples,) or (n_samples, n_targets)
Returns predicted values.
- score(X, y, sample_weight=None)[source]#
Return the coefficient of determination R^2 of the prediction.
Parameters#
- X : array-like of shape (n_samples, n_features)
Test samples.
- y : array-like of shape (n_samples,) or (n_samples, n_targets)
True values for X.
- sample_weight : array-like of shape (n_samples,), default=None
Sample weights.
Returns#
- score : float
R^2 of self.predict(X) with respect to y.
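For reference, the coefficient of determination is conventionally defined as R^2 = 1 - SS_res / SS_tot. The sketch below computes the unweighted form with NumPy; the helper name `r2_sketch` is hypothetical and the function ignores `sample_weight`, so it is not a drop-in replacement for `score`.

```python
import numpy as np

def r2_sketch(y_true, y_pred):
    """Unweighted R^2 = 1 - SS_res / SS_tot (illustrative helper, not part of the package)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
    return 1.0 - ss_res / ss_tot

perfect = r2_sketch([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])    # predictions match exactly
mean_only = r2_sketch([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])  # always predicting the mean
print(perfect, mean_only)
```

A perfect fit gives 1.0, and a model that always predicts the mean of `y` gives 0.0; worse models can go negative.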
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') -> OptimizedRidgeCV#
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Parameters#
- sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for the sample_weight parameter in fit.
Returns#
- self : object
The updated object.
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') -> OptimizedRidgeCV#
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to score.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Parameters#
- sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for the sample_weight parameter in score.
Returns#
- self : object
The updated object.
- class spectoprep.PipelineOptimizer(X_train: ndarray[tuple[Any, ...], dtype[_ScalarT]], y_train: ndarray[tuple[Any, ...], dtype[_ScalarT]], preprocessing_steps: List[str] | None = None, X_test: ndarray[tuple[Any, ...], dtype[_ScalarT]] | None = None, y_test: ndarray[tuple[Any, ...], dtype[_ScalarT]] | None = None, cv_method: str = 'group_shuffle_split', n_splits: int = 3, test_size: float = 0.3, n_groups_out: int = 2, random_state: int = 42, groups: ndarray[tuple[Any, ...], dtype[_ScalarT]] | None = None, max_pipeline_length: int = 5, n_jobs: int = -1, allowed_preprocess_combinations: int | List[int] | Tuple[int, ...] | None = [1, 2], log_level: str = 'INFO')[source]#
Bases: object
A class for optimizing machine learning pipelines using Bayesian optimization. It precomputes possible pipeline configurations and then searches over both the pipeline configuration (encoded as an index) and the hyperparameters.
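One way to picture "precomputed configurations encoded as an index" is sketched below. It is an assumption-laden illustration, not the package's implementation: the step names are hypothetical, the configurations are taken to be ordered sequences of distinct steps, and `allowed_lengths` stands in for the role that `allowed_preprocess_combinations` appears to play.

```python
from itertools import permutations

def enumerate_configs(steps, allowed_lengths):
    """List every ordered sequence of distinct steps for each allowed length."""
    configs = []
    for length in allowed_lengths:
        configs.extend(permutations(steps, length))
    return configs

def decode_index(value, configs):
    """Map a continuous optimizer value to a valid configuration index."""
    idx = int(round(value))
    idx = min(max(idx, 0), len(configs) - 1)  # clamp to the valid range
    return configs[idx]

steps = ["snv", "detrend", "savgol"]  # hypothetical preprocessing step names
configs = enumerate_configs(steps, allowed_lengths=[1, 2])

chosen = decode_index(3.4, configs)   # a continuous proposal from the optimizer
print(len(configs), chosen)
```

Encoding the configuration as a single number lets a continuous optimizer treat the discrete pipeline choice as one more search dimension alongside the hyperparameters.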
- bayes_objective(**params) -> float[source]#
Objective function for Bayesian optimization.
- Args:
**params: Parameters to evaluate
- Returns:
float: Negative RMSE or penalty score on error
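The "negative RMSE or penalty on error" contract can be sketched as follows. The penalty value, the helper name and the `evaluate` callback are all hypothetical; the point is only that the optimizer maximizes the returned value, so a lower RMSE yields a higher score and a failed evaluation yields a very low one.

```python
import math

PENALTY = -1e6  # hypothetical score returned when a configuration fails

def negative_rmse_or_penalty(evaluate, **params):
    """Return -RMSE for the given params, or PENALTY if evaluation fails.

    `evaluate` is a hypothetical callback returning (y_true, y_pred).
    """
    try:
        y_true, y_pred = evaluate(**params)
        mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
        rmse = math.sqrt(mse)
    except Exception:
        return PENALTY
    if not math.isfinite(rmse):
        return PENALTY
    return -rmse

def good_eval(**params):
    return [0.0, 0.0], [3.0, 4.0]

def broken_eval(**params):
    raise ValueError("invalid pipeline configuration")

ok_score = negative_rmse_or_penalty(good_eval, alpha=1.0)
bad_score = negative_rmse_or_penalty(broken_eval, alpha=1.0)
print(ok_score, bad_score)
```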
- bayesian_optimize(init_points: int = 10, n_iter: int = 50, acquisition_function: str = 'ei') -> Tuple[Dict, Pipeline][source]#
Run Bayesian optimization to find the best pipeline configuration and hyperparameters.
- Args:
init_points: Number of random initial points
n_iter: Number of Bayesian optimization iterations
acquisition_function: Acquisition function for Bayesian optimization
- Returns:
- Tuple containing:
Dict of best parameters
Fitted Pipeline with best configuration
- export_best_pipeline(file_path: str) -> None[source]#
Export the best pipeline configuration and hyperparameters to a file.
- Args:
file_path: Path to save the export file
- Raises:
AttributeError: If optimizer hasn’t been run yet
- get_all_tested_pipelines() -> List[Dict][source]#
Get details of all tested pipeline configurations.
- Returns:
List of dictionaries with pipeline details
- get_best_pipeline_predictions(best_pipeline: Pipeline) -> Tuple[ndarray[tuple[Any, ...], dtype[_ScalarT]], float, float][source]#
Get predictions using the best pipeline.
- Args:
best_pipeline: Fitted pipeline object
- Returns:
- Tuple containing:
Predictions array
RMSE score
R² score
- class spectoprep.SpectroPrepPlotter[source]#
Bases: object
A class for creating high-quality plots for spectroscopy data.
This class provides various plotting functions specifically designed for spectroscopy data and pipeline optimization results.
- static plot_feature_importance(wavenumbers: ndarray, coefficients: ndarray, title: str = 'Feature Importance', xlabel: str = 'Wavenumber (cm$^{-1}$)', ylabel: str = 'Coefficient Value', figsize: Tuple[int, int] = (12, 6), color: str = 'purple', highlight_threshold: float | None = None, highlight_color: str = 'red', save_path: str | None = None)[source]#
Plot feature importance from model coefficients.
Parameters#
- wavenumbers : array-like
The x-axis values (wavenumbers).
- coefficients : array-like
Model coefficients corresponding to each wavenumber.
- title : str, default='Feature Importance'
Plot title.
- xlabel : str, default='Wavenumber (cm$^{-1}$)'
X-axis label.
- ylabel : str, default='Coefficient Value'
Y-axis label.
- figsize : tuple, default=(12, 6)
Figure size.
- color : str, default='purple'
Color of the line.
- highlight_threshold : float, optional
If provided, highlights coefficients with absolute values above this threshold.
- highlight_color : str, default='red'
Color for highlighted coefficients.
- save_path : str, optional
If provided, save the figure to this path.
Returns#
- fig : matplotlib.figure.Figure
The figure object.
- ax : matplotlib.axes.Axes
The axes object.
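The `highlight_threshold` rule (absolute coefficient value above the threshold) amounts to a boolean mask. The sketch below shows that selection with NumPy on made-up values; it is an illustration of the rule as documented, not the plotting code itself.

```python
import numpy as np

# Made-up example values: one coefficient per wavenumber.
wavenumbers = np.array([1000.0, 1100.0, 1200.0, 1300.0])
coefficients = np.array([0.05, -0.8, 0.3, 0.6])
highlight_threshold = 0.5

# Large negative coefficients count too, hence the absolute value.
mask = np.abs(coefficients) > highlight_threshold
highlighted_wavenumbers = wavenumbers[mask]
highlighted_coefficients = coefficients[mask]
print(highlighted_wavenumbers, highlighted_coefficients)
```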
- static plot_optimization_progress(optimizer: PipelineOptimizer, figsize: Tuple[int, int] = (12, 6), title: str = 'Optimization Progress', save_path: str | None = None)[source]#
Plot optimization progress over iterations.
Parameters#
- optimizer : PipelineOptimizer
The fitted pipeline optimizer.
- figsize : tuple, default=(12, 6)
Figure size.
- title : str, default='Optimization Progress'
Plot title.
- save_path : str, optional
If provided, save the figure to this path.
Returns#
- fig : matplotlib.figure.Figure
The figure object.
- ax : matplotlib.axes.Axes
The axes object.
- static plot_optimization_results(optimizer: PipelineOptimizer, top_n: int = 5, figsize: Tuple[int, int] = (12, 8), title: str = 'Pipeline Optimization Results', save_path: str | None = None)[source]#
Plot optimization results from PipelineOptimizer.
Parameters#
- optimizer : PipelineOptimizer
The fitted pipeline optimizer.
- top_n : int, default=5
Number of top pipelines to display.
- figsize : tuple, default=(12, 8)
Figure size.
- title : str, default='Pipeline Optimization Results'
Plot title.
- save_path : str, optional
If provided, save the figure to this path.
Returns#
- fig : matplotlib.figure.Figure
The figure object.
- static plot_prediction_scatter(y_true: ndarray, y_pred: ndarray, title: str = 'Prediction Performance', xlabel: str = 'Measured', ylabel: str = 'Predicted', figsize: Tuple[int, int] = (10, 8), alpha: float = 0.7, color: str = 'blue', add_metrics: bool = True, save_path: str | None = None)[source]#
Create a scatter plot of predicted vs true values.
Parameters#
- y_true : array-like
True target values.
- y_pred : array-like
Predicted target values.
- title : str, default='Prediction Performance'
Plot title.
- xlabel : str, default='Measured'
X-axis label.
- ylabel : str, default='Predicted'
Y-axis label.
- figsize : tuple, default=(10, 8)
Figure size.
- alpha : float, default=0.7
Transparency of the points.
- color : str, default='blue'
Color of the scatter points.
- add_metrics : bool, default=True
Whether to add RMSE and R² metrics to the plot.
- save_path : str, optional
If provided, save the figure to this path.
Returns#
- fig : matplotlib.figure.Figure
The figure object.
- ax : matplotlib.axes.Axes
The axes object.
- static plot_preprocessing_comparison(wavenumbers: ndarray, original_spectra: ndarray, processed_spectra: Dict[str, ndarray], sample_indices: List[int] | None = None, figsize: Tuple[int, int] = (15, 10), title: str = 'Preprocessing Comparison', color_map: str = 'tab10', save_path: str | None = None)[source]#
Plot comparison of original and processed spectra.
Parameters#
- wavenumbers : array-like
The x-axis values (wavenumbers).
- original_spectra : array-like
The original spectra data of shape (n_samples, n_features).
- processed_spectra : dict
Dictionary mapping preprocessing method names to processed spectra.
- sample_indices : list of int, optional
Indices of samples to plot. If None, all samples are plotted.
- figsize : tuple, default=(15, 10)
Figure size.
- title : str, default='Preprocessing Comparison'
Main title for the figure.
- color_map : str, default='tab10'
Colormap for differentiating samples.
- save_path : str, optional
If provided, save the figure to this path.
Returns#
- fig : matplotlib.figure.Figure
The figure object.
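To make the expected inputs concrete, the sketch below assembles a `processed_spectra` dictionary from synthetic data. The two preprocessing variants (standard normal variate and mean centering) are written directly in NumPy for illustration, the dictionary keys are arbitrary labels, and it is an assumption that each processed array shares the shape of `original_spectra`.

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: center and scale each spectrum (row) individually."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std

# Synthetic spectra: 6 samples measured at 50 wavenumbers.
rng = np.random.default_rng(0)
wavenumbers = np.linspace(4000, 400, 50)
original_spectra = rng.random((6, 50)) + 1.0

# Keys are method names, values are the processed spectra.
processed_spectra = {
    "SNV": snv(original_spectra),
    "Mean centered": original_spectra - original_spectra.mean(axis=0),
}

# Plot only the first and third samples.
sample_indices = [0, 2]
```

Going by the signature above, these would then be passed as `SpectroPrepPlotter.plot_preprocessing_comparison(wavenumbers, original_spectra, processed_spectra, sample_indices=sample_indices)`; that call is not exercised here.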
- static plot_spectra(wavenumbers: ndarray, spectra: ndarray, labels: List[str] | None = None, title: str = 'Spectral Data', xlabel: str = 'Wavenumber (cm$^{-1}$)', ylabel: str = 'Absorbance', alpha: float = 0.7, figsize: Tuple[int, int] = (12, 6), color_map: str = 'viridis', legend_loc: str = 'best', grid: bool = True, save_path: str | None = None)[source]#
Plot spectral data.
Parameters#
- wavenumbers : array-like
The x-axis values (wavenumbers).
- spectra : array-like
The spectra data of shape (n_samples, n_features).
- labels : list of str, optional
Labels for each spectrum. If None, spectra are numbered.
- title : str, default='Spectral Data'
Plot title.
- xlabel : str, default='Wavenumber (cm$^{-1}$)'
X-axis label.
- ylabel : str, default='Absorbance'
Y-axis label.
- alpha : float, default=0.7
Transparency of the lines.
- figsize : tuple, default=(12, 6)
Figure size.
- color_map : str, default='viridis'
Colormap for the spectra.
- legend_loc : str, default='best'
Location of the legend.
- grid : bool, default=True
Whether to show grid.
- save_path : str, optional
If provided, save the figure to this path.
Returns#
- fig : matplotlib.figure.Figure
The figure object.
- ax : matplotlib.axes.Axes
The axes object.