Scikit-Learn’s SimpleImputer – Fill Missing Values

In simple words, the SimpleImputer is a Python class from Scikit-Learn that is used to fill missing values in structured datasets, that is, entries marked with a placeholder such as None or NaN.

As the name suggests, the class performs simple imputations, that is, it replaces missing data with substitute values based on a given strategy. Let’s have a look at the syntax for SimpleImputer initialization to understand this better:

SimpleImputer(*, missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True, add_indicator=False)

The parameters/arguments in the SimpleImputer class are as follows:

  • missing_values: This is the placeholder for the missing values to fill, and it is set to np.nan by default. All occurrences of this parameter’s value will be imputed.
  • strategy: This parameter defines the imputation strategy. You can set it to ‘mean’ (the default), ‘median’, ‘most_frequent’, or ‘constant’.
  • fill_value: This parameter is used when strategy=‘constant’ and supplies the constant value to fill in. By default, fill_value is None, in which case 0 is used for numeric data and the string “missing_value” is used for strings or object data types.
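  • verbose: This parameter controls the verbosity of the SimpleImputer and is 0 by default. Note that this parameter has been deprecated and then removed in more recent Scikit-Learn releases, so it is safest not to pass it in new code.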
  • copy: If True, a copy of the input dataset will be created. If False, imputation will be done in-place whenever possible.
  • add_indicator: If True, a MissingIndicator transform is stacked onto the output of the imputer’s transform. This allows a predictive estimator to account for missingness despite imputation.
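
To make the add_indicator parameter from the list above more concrete, here is a minimal sketch. The array X and the variable names are made up for illustration; the point is only to show that an extra binary column is appended for each feature that had missing values when the imputer was fitted.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative data: only the second column contains a missing value
X = np.array([[1.0, np.nan],
              [3.0, 4.0]])

# add_indicator=True appends one indicator column per feature that
# had missing values at fit time (here, just the second feature)
imputer = SimpleImputer(strategy='mean', add_indicator=True)
X_out = imputer.fit_transform(X)

print(X_out)
# [[1. 4. 1.]
#  [3. 4. 0.]]
```

The first two columns hold the imputed data, and the last column is 1 wherever the second feature was originally missing.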

Getting started with the SimpleImputer

To start using the SimpleImputer class, you must install the Scikit-Learn library on your machine alongside Python.

You can run the following command from your command line/terminal to install scikit-learn using Python’s package manager (pip):

pip install scikit-learn

Once you’ve installed the library, you can import it in Python by running the following line of code in your Python IDE or Python Shell.

import sklearn

If running this line of code doesn’t give you an error, you’ve successfully installed Scikit-Learn and imported it in Python. Now, you can use the SimpleImputer to fill missing values.
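
As an optional sanity check, you can also print the installed version and import the SimpleImputer class directly. This is only a small sketch; the class lives in the sklearn.impute module, so an old installation that predates that module would fail on the second import.

```python
import sklearn

# SimpleImputer is provided by the sklearn.impute module
from sklearn.impute import SimpleImputer

# Print the installed Scikit-Learn version string
print(sklearn.__version__)
```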

Performing imputation using the ‘mean’ strategy in SimpleImputer

The ‘mean’ strategy of SimpleImputer replaces missing values using the mean along each column and this can only be used with numeric data.

Here’s an example of how a ‘mean’ strategy can be used to fill missing values using the SimpleImputer:

# Importing the NumPy library to create nan values
import numpy as np

# Importing the SimpleImputer class from sklearn
from sklearn.impute import SimpleImputer

# Initializing the SimpleImputer object with missing_value and strategy defined
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fitting the SimpleImputer using a sample dataset
imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])

# Initializing a dataset that isn't fitted to the SimpleImputer
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]

# Filling in the missing values in X using the fitted SimpleImputer
print(imp_mean.transform(X))
[[ 7. 2. 3. ] 
 [ 4. 3.5 6. ] 
 [10. 3.5 9. ]]

In the example above, you can see that we fitted the SimpleImputer using a sample dataset which in itself contained missing values. Then, the dataset X is transformed to fill in the missing values using the fitted SimpleImputer. This kind of imputation where you fill in the missing values with the mean is also known as ‘mean imputation’.
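
If you want to see which values the imputer learned, or you want to fit and transform the same dataset in one step, the following sketch may help. It reuses the sample dataset from the example above; the variable names X_train and X_filled are only illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same sample dataset as in the example above
X_train = np.array([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])

imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit_transform learns the column means and fills X_train in one call
X_filled = imp_mean.fit_transform(X_train)

# statistics_ holds the fill value learned for each column
print(imp_mean.statistics_)
# [7.  3.5 6. ]

print(X_filled)
```

Here the second column’s mean is 3.5 because the missing entry is ignored when the mean of 2 and 5 is computed.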

Performing imputation using the ‘median’ strategy in SimpleImputer

The ‘median’ strategy of SimpleImputer replaces missing values using the median along each column and this can only be used with numeric data.

Here’s an example of how a ‘median’ strategy can be used to fill missing values using the SimpleImputer:

# Importing the NumPy library to create nan values
import numpy as np

# Importing the SimpleImputer class from sklearn
from sklearn.impute import SimpleImputer

# Initializing the SimpleImputer object with missing_value and strategy defined
imp_median = SimpleImputer(missing_values=np.nan, strategy='median')

# Fitting the SimpleImputer using the given dataset
imp_median.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])

# Initializing a sample dataset that isn't fitted
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]

# Filling in the missing values in the sample dataset using the fitted SimpleImputer
print(imp_median.transform(X))
[[ 7. 2. 3. ]
 [ 4. 3.5 6. ]
 [10. 3.5 9. ]]
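
In the sample dataset above, the mean and the median of each column happen to be identical, so both strategies produce the same output. The sketch below uses a made-up single-column dataset with one outlier to show where the two strategies differ.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative column with an outlier (100) and one missing value
X = np.array([[1.0], [2.0], [3.0], [100.0], [np.nan]])

mean_filled = SimpleImputer(strategy='mean').fit_transform(X)
median_filled = SimpleImputer(strategy='median').fit_transform(X)

# The mean is pulled towards the outlier, the median is not
print(mean_filled[-1, 0])    # 26.5
print(median_filled[-1, 0])  # 2.5
```

When a column contains outliers like this, the ‘median’ strategy is often the more robust choice.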


Performing imputation using the ‘most_frequent’ strategy in SimpleImputer

The ‘most_frequent’ strategy of SimpleImputer replaces missing values using the most frequent value along each column and it can be used with strings or numeric data. If there is more than one such value, only the smallest value is returned.

Here’s an example of how the ‘most_frequent’ strategy can be used to fill missing values using the SimpleImputer:

# Importing the NumPy library to create nan values
import numpy as np

# Importing the SimpleImputer class from sklearn
from sklearn.impute import SimpleImputer

# Initializing the SimpleImputer object with missing_value and strategy defined
imp_most_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# Fitting the SimpleImputer using the given dataset
imp_most_freq.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])

# Initializing a sample dataset that isn't fitted
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]

# Filling in the missing values in the sample dataset using the fitted SimpleImputer
print(imp_most_freq.transform(X))
[[ 4. 2. 3.]
 [ 4. 2. 6.]
 [10. 2. 9.]]
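
Since the ‘most_frequent’ strategy also works with strings, here is a minimal sketch using a made-up column of colour names. It assumes the data is stored as a NumPy array with dtype=object and that missing entries are marked with np.nan.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative string column stored as an object array
X = np.array([['red'], ['blue'], ['red'], [np.nan]], dtype=object)

imp_most_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
X_filled = imp_most_freq.fit_transform(X)

# 'red' appears most often, so it replaces the missing entry
print(X_filled)
```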

Performing imputation using the ‘constant’ strategy in SimpleImputer

The ‘constant’ strategy of SimpleImputer replaces missing values using a provided fill_value and it can be used with strings or numeric data.

Here’s an example of how the ‘constant’ strategy can be used to fill missing values using the SimpleImputer:

# Importing the NumPy library to create nan values
import numpy as np

# Importing the SimpleImputer class from sklearn
from sklearn.impute import SimpleImputer

# Initializing the SimpleImputer object with missing_value and strategy defined
imp_constant = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=20)

# Fitting the SimpleImputer using the given dataset
imp_constant.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])

# Initializing a sample dataset that isn't fitted
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]

# Filling in the missing values in the sample dataset using the fitted SimpleImputer
print(imp_constant.transform(X))
[[20. 2. 3.]
 [ 4. 20. 6.]
 [10. 20. 9.]]
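
The ‘constant’ strategy can also be used without an explicit fill_value, or with a string fill_value for text data. The following sketch uses made-up arrays to illustrate both cases; the ‘unknown’ label is an arbitrary choice.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Numeric data: with fill_value left at its default, missing values become 0
X_num = np.array([[1.0, np.nan], [np.nan, 4.0]])
num_filled = SimpleImputer(strategy='constant').fit_transform(X_num)
print(num_filled)
# [[1. 0.]
#  [0. 4.]]

# String data (object array): supply a string fill_value
X_str = np.array([['a', np.nan], [np.nan, 'b']], dtype=object)
imp_str = SimpleImputer(strategy='constant', fill_value='unknown')
str_filled = imp_str.fit_transform(X_str)
print(str_filled)
```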

In Conclusion

You have now successfully learned how to use the SimpleImputer class from Scikit-Learn to fill missing values with 4 different strategies. If you have any questions, please feel free to ask them down in the comments and we will get back to you.

