Pipelines#
A pipeline chains together multiple steps, meaning the output of each step is used as input to the next step.
Creating a pipeline#
Load data:
import pandas as pd
import numpy as np
train = pd.DataFrame({'feat1':[10, 20, np.nan, 2], 'feat2':[25., 20, 5, 3], 'label':['A', 'A', 'B', 'B']})
test = pd.DataFrame({'feat1':[30., 5, 15], 'feat2':[12, 10, np.nan]})
Defining steps:
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
imputer = SimpleImputer()  # replaces missing values with the column mean (the default strategy)
clf = LogisticRegression()
Create a 2-step pipeline. Impute missing values, then pass the results to the classifier:
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(imputer, clf)
Using the pipeline:
features = ['feat1', 'feat2']
X, y = train[features], train['label']
X_new = test[features]
# pipeline applies the imputer to X before fitting the classifier
pipe.fit(X, y)
# pipeline applies the imputer to X_new before making predictions
# note: pipeline uses imputation values learned during the "fit" step
pipe.predict(X_new)
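As a quick sanity check, the fitted pipeline can also be scored directly; for a classifier, score returns accuracy:
# accuracy of the entire pipeline on the training data
pipe.score(X, y)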
make_pipeline vs Pipeline#
Pipeline requires naming of steps, while make_pipeline does not.
With make_pipeline:
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(imputer, clf)
With Pipeline:
from sklearn.pipeline import Pipeline
pipe = Pipeline([('preprocessor', imputer), ('classifier', clf)])
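Either way, you can check the names that were assigned by inspecting the steps attribute (these are the names used for step__parameter strings later in grid search):
# list of (name, estimator) tuples
pipe.steps
# just the step names
pipe.named_steps.keys()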
Examine the intermediate steps in a Pipeline#
Use the named_steps attribute as pipe.named_steps.STEP_NAME.ATTRIBUTE:
pipe.named_steps.simpleimputer.statistics_
# or, for the Pipeline version with explicitly named steps
pipe.named_steps.preprocessor.statistics_
If using make_pipeline, the step name is the lowercased class name of the estimator (here simpleimputer). When using Pipeline, the name is the one you assigned when creating the pipeline (here preprocessor).
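Steps can also be accessed with dictionary-style indexing, which works regardless of how the pipeline was built:
# equivalent dictionary-style access
pipe.named_steps['simpleimputer'].statistics_
# a Pipeline can also be indexed by step name directly
pipe['simpleimputer'].statistics_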
Cross-validate and grid search an entire pipeline#
Cross-validate the entire pipeline (not just the model), so that preprocessing steps such as imputation are re-fit on each training fold and no information leaks from the validation fold:
from sklearn.model_selection import cross_val_score
# cv=2 because this toy dataset has only 4 rows
cross_val_score(pipe, X, y, cv=2, scoring='accuracy').mean()
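If you want several metrics at once, cross_validate works the same way on the whole pipeline; a minimal sketch:
from sklearn.model_selection import cross_validate
results = cross_validate(pipe, X, y, cv=2, scoring=['accuracy', 'f1_macro'])
results['test_accuracy'].mean()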
Find optimal tuning parameters for the entire pipeline:
# specify parameter values to search, using step__parameter names
# (step names here are the ones auto-generated by make_pipeline)
params = {}
params['simpleimputer__strategy'] = ['mean', 'median']
params['logisticregression__C'] = [0.1, 1, 10]
# try all possible combinations of those parameter values
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, params, cv=2, scoring='accuracy')
grid.fit(X, y);
Best score found during the search:
grid.best_score_
Combination of parameters that produced the best score:
grid.best_params_
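Because GridSearchCV refits the best pipeline on all of X by default (refit=True), the fitted grid object can be used for prediction directly:
# the refit best pipeline
grid.best_estimator_
# predict with the best pipeline
grid.predict(X_new)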
Pipeline diagram#
Create interactive diagrams of Pipelines (and other estimators). The 3-step pipeline built below (imputer, feature selector, classifier) is also used in the next section:
from sklearn import set_config
set_config(display='diagram')
from sklearn.feature_selection import SelectKBest, f_classif
selection = SelectKBest(f_classif, k=1)
pipe = make_pipeline(imputer, selection, clf)
pipe  # in a notebook, displays the interactive diagram
Export the diagram to an HTML file:
from sklearn.utils import estimator_html_repr
with open('pipeline.html', 'w') as f:
    f.write(estimator_html_repr(pipe))
Operate on part of a Pipeline#
Slice the Pipeline using Python’s slicing notation:
# access step 0 (preprocessor)
pipe[0].fit_transform(X)
# access steps 0 and 1 (preprocessor and feature selector)
pipe[0:2].fit_transform(X, y)
# access step 1 (feature selector)
pipe[1].get_support()
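Steps can also be selected by name or counted from the end; with the auto-generated make_pipeline names of the 3-step pipeline above:
# access the final step (the classifier)
pipe[-1]
# access a step by its auto-generated name
pipe['selectkbest']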