Pipelines
A pipeline chains together multiple steps, meaning the output of each step is used as input to the next step.
Creating a pipeline
Load data:
import pandas as pd
import numpy as np
train = pd.DataFrame({'feat1':[10, 20, np.nan, 2], 'feat2':[25., 20, 5, 3], 'label':['A', 'A', 'B', 'B']})
test = pd.DataFrame({'feat1':[30., 5, 15], 'feat2':[12, 10, np.nan]})
Defining steps:
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
imputer = SimpleImputer()
clf = LogisticRegression()
Create a 2-step pipeline that imputes missing values and then passes the results to the classifier:
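from sklearn.pipeline import make_pipeline
# chain the imputer and the classifier into a single estimator
pipe = make_pipeline(imputer, clf)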
Using the pipeline:
features = ['feat1', 'feat2']
X, y = train[features], train['label']
X_new = test[features]
# pipeline applies the imputer to X before fitting the classifier
pipe.fit(X, y)
# pipeline applies the imputer to X_new before making predictions
# note: pipeline uses imputation values learned during the "fit" step
pipe.predict(X_new)
make_pipeline vs Pipeline
Pipeline requires you to name each step, while make_pipeline names the steps automatically (using the lowercased class name of each estimator).
With make_pipeline:
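# step names are generated automatically ('simpleimputer', 'logisticregression')
pipe = make_pipeline(imputer, clf)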
With Pipeline:
from sklearn.pipeline import Pipeline
pipe = Pipeline([('preprocessor', imputer), ('classifier', clf)])
Examine the intermediate steps in a Pipeline
Use the named_steps attribute as pipe.named_steps.STEP_NAME.ATTRIBUTE:
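For example, assuming the pipeline has already been fitted, you can inspect the imputation values learned by the imputer step:
# 'preprocessor' is the step name used in the Pipeline version above
# (use 'simpleimputer' if the pipeline was built with make_pipeline)
pipe.named_steps.preprocessor.statistics_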
If using make_pipeline, each step's name is the lowercased class name of the estimator (here simpleimputer). When using Pipeline, the name is the one assigned when creating the pipeline (here preprocessor).
Cross-validate and grid search an entire pipeline
Cross-validate the entire pipeline (not just the model):
from sklearn.model_selection import cross_val_score
# note: cv must not exceed the number of samples in the smallest class, so reduce it for a tiny dataset like the one above
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
Find optimal tuning parameters for the entire pipeline (the parameter names below assume a pipeline containing a ColumnTransformer with a CountVectorizer, followed by a LogisticRegression):
# specify parameter values to search, using the "step__parameter" syntax
# (step name, then two underscores, then the parameter name; nested steps are chained)
params = {}
params['columntransformer__countvectorizer__min_df'] = [1, 2]
params['logisticregression__C'] = [0.1, 1, 10]
# note: the 'l1' penalty requires a solver that supports it (such as 'liblinear' or 'saga')
params['logisticregression__penalty'] = ['l1', 'l2']
# try all possible combinations of those parameter values
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy')
grid.fit(X, y);
Best score found during the search:
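# best cross-validated accuracy
grid.best_score_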
Combination of parameters that produced the best score:
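# dictionary of the parameter values that produced the best score
grid.best_params_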
Pipeline diagram
Create interactive diagrams of Pipelines (and other estimators):
from sklearn import set_config
from sklearn.pipeline import make_pipeline
set_config(display='diagram')
# ct, selection, and logreg are assumed to be a preprocessor (such as a
# ColumnTransformer), a feature selector, and a model defined elsewhere
pipe = make_pipeline(ct, selection, logreg)
pipe
Export the diagram to an HTML file:
from sklearn.utils import estimator_html_repr
with open('pipeline.html', 'w') as f:
    f.write(estimator_html_repr(pipe))
Operate on part of a Pipeline
Slice the Pipeline using Python's slicing notation:
# access step 0 (preprocessor)
pipe[0].fit_transform(X)
# access steps 0 and 1 (preprocessor and feature selector)
pipe[0:2].fit_transform(X, y)
# access step 1 (feature selector)
pipe[1].get_support()
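Steps can also be selected by name rather than by position; the name below assumes the first step is a ColumnTransformer, which make_pipeline names 'columntransformer':
# access a single step by its name instead of its position
pipe['columntransformer']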