How To Include An AdaBoost Model’s Base Estimator In A Grid Search When It’s Contained In A Pipeline (*GASP*)

Christopher Delacruz
4 min read · Jul 2, 2021

Easy there, friend. I promise, there’s a very easy way to get through this.

Pipelines and Grid Searches

As any machine learning specialist will tell you, pipelines and grid searches are of the utmost importance for optimizing and tuning the parameters of your machine learning model. Pipelines ensure that your data goes through the appropriate steps in the correct order (cleaning, pre-processing, vectorizing, classifying, etc.). Here’s an example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # SMOTE resamples, so this Pipeline comes from imblearn, not sklearn

example_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('SMOTE', SMOTE()),
    ('dtc', DecisionTreeClassifier())
])

In the above example, any data I fit to example_pipeline is FIRST vectorized through the TF-IDF Vectorizer (obviously, we’re working with language here). SECOND, SMOTE is run on the vectorized language, balancing the minority class by artificially creating additional data points that are variations on existing points in that class. FINALLY, the data is fed into the Decision Tree Classifier, which classifies it against the target I have asked it to look at.
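If it helps to see that ordering in action, here’s a minimal sketch of fitting and predicting with the pipeline (X_train, y_train, and X_test are hypothetical placeholders, not variables from this post):

example_pipeline.fit(X_train, y_train)    # runs tfidf -> SMOTE -> dtc, in that order
preds = example_pipeline.predict(X_test)  # SMOTE only resamples during fit, so prediction is just tfidf -> dtc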

Grid Searches are used to test multiple variations of a model to find the ideal parameters for what you want your model to do. Running a grid search on a model by itself is fairly straightforward; here’s an example:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

dtc = DecisionTreeClassifier()  # instantiate a decision tree model

# a dictionary of the Decision Tree parameters I would like to test in my grid search
parameters = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [1, 5, 9]
}

gs = GridSearchCV(dtc,                    # the model defined above
                  param_grid=parameters,  # the parameters we defined
                  cv=3,                   # 3 folds of cross-validation
                  scoring='accuracy')     # accuracy is our metric

gs.fit(X, y)  # fit to the data

You can see above that grid searches are fairly intuitive. Pipelines, however, contain multiple instantiated classes, so how do we specify parameters for each one? Let’s use the example Pipeline we instantiated earlier.

# For reference, so you don't have to scroll up and down :)
example_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('SMOTE', SMOTE()),
    ('dtc', DecisionTreeClassifier())
])

# define our parameters for each instantiation
tfidf_dtc_params = {
    'tfidf__max_features': [None, 27_000, 10_000],
    'tfidf__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'SMOTE__sampling_strategy': ['auto', .75, .9],
    'dtc__criterion': ['gini', 'entropy'],
    'dtc__max_depth': [1, 5, 9]
}

You can see above that we can simply reference each instantiation by the name we gave it in example_pipeline, followed by a double underscore __, and then the parameter. Easy peasy.
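To tie it all together, here’s a minimal sketch of how that parameter grid plugs into GridSearchCV (X and y are placeholders for your text data and labels, not variables from this post):

from sklearn.model_selection import GridSearchCV

pipe_gs = GridSearchCV(example_pipeline,
                       param_grid=tfidf_dtc_params,
                       cv=3,
                       scoring='accuracy')
pipe_gs.fit(X, y)
print(pipe_gs.best_params_)  # the winning combination for every step in the pipeline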

Enter The AdaBoost Model

AdaBoost models are incredibly powerful models that chain together multiple weak learners; each weak learner passes along what it misclassified so the next one can focus on those points, and the ensemble becomes a stronger learner through this process (this is a bit of an oversimplification, but this post is meant for people who are already familiar with AdaBoost). By default, the AdaBoost model uses a Decision Tree with a max depth of 1 as its weak learner. There are a number of parameters that can be grid searched in an AdaBoost model, and the Decision Tree base estimator is one of them:

from sklearn.ensemble import AdaBoostClassifier

dtc = DecisionTreeClassifier()
ada = AdaBoostClassifier(base_estimator=dtc)

You are welcome to change the base estimator based on your needs, but in our case we want to use a decision tree, which is already nested inside the AdaBoost model. So how do we access the Decision Tree’s parameters in a grid search? You can code a complicated way to do so (with additional variables and for loops) OR you can use this easy approach:

import numpy as np
from sklearn.compose import ColumnTransformer

# Let's make a different pipeline here
another_pipeline = Pipeline([
    ("ct", ColumnTransformer(transformers)),  # transformers = your own list of (name, transformer, columns) tuples
    ("ada", AdaBoostClassifier(base_estimator=dtc))
])

# Now, let's define those parameters!
params = {
    'ada__n_estimators': [int(x) for x in np.linspace(start=200, stop=2000, num=4)],
    'ada__learning_rate': [0.01, 0.1, 0.5, 1],
    'ada__base_estimator__criterion': ['gini', 'entropy'],
    'ada__base_estimator__max_depth': [1, 5, 9],
    'ada__base_estimator__min_samples_split': [2, 5, 9],  # must be 2 or greater
    'ada__base_estimator__min_samples_leaf': [1, 3, 5],
    'ada__base_estimator__splitter': ['best', 'random'],
    'ada__base_estimator__max_features': [None, 'sqrt', 'log2'],  # 'auto' was removed in newer scikit-learn
    'ada__base_estimator__max_leaf_nodes': [None, 5, 9],  # must be None or greater than 1
    'ada__base_estimator__class_weight': [None, 'balanced']
}

In the above parameters, n_estimators and learning_rate are parameters of the AdaBoost model itself, BUT the other parameters belong to the Decision Tree. We access the base estimator by first specifying the "ada" name from the pipeline, followed by a double underscore __, THEN referencing base_estimator (which we already defined inside the AdaBoostClassifier in the pipeline), followed by ANOTHER double underscore __, and finally the Decision Tree parameter we want to tune. Boom. Mic drop.
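To close the loop, here’s a minimal sketch of what running that grid search could look like (again, X and y are stand-ins for your own features and target):

from sklearn.model_selection import GridSearchCV

ada_gs = GridSearchCV(another_pipeline,
                      param_grid=params,
                      cv=3,
                      scoring='accuracy',
                      n_jobs=-1)  # use every available core -- this grid is big!
ada_gs.fit(X, y)
print(ada_gs.best_params_)  # includes the winning base estimator settings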

This is a great way to dig deeper into your grid searches and further fine-tune your models when base estimators are part of the journey (or really to reference anything nested deeper in a pipeline). Doing so will most certainly make your grid searches take longer (obviously, so don’t forget to set n_jobs = -1), but it will test out more possibilities and improve your odds of finding a better model. Maybe one that will change the world/your company/just you. Good luck, machine learners!

Chris de la Cruz is a guacamole-eating and fitness-loving data scientist, actor, freestyler, and beatboxer (under the moniker MC Lightbulb)