In simple words, the SimpleImputer is a Python class from Scikit-Learn that is used to fill missing values in structured datasets, typically values stored as NaN (np.nan) or None.
As the name suggests, the class performs simple imputations, that is, it replaces missing data with substitute values based on a given strategy. Let’s have a look at the syntax for SimpleImputer initialization to understand this better:
SimpleImputer(*, missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True, add_indicator=False)
The parameters/arguments in the SimpleImputer class are as follows:
- missing_values: This is the placeholder for the missing values to fill, and it is set to np.nan by default. All occurrences of this parameter's value will be imputed.
- strategy: This parameter defines the imputation strategy, and you can set it to 'mean', 'median', 'most_frequent', or 'constant'.
- fill_value: This parameter is used when strategy='constant' to supply the constant that replaces the missing values. By default, fill_value is None, in which case 0 is used for numeric data and the string "missing_value" is used for string or object data.
- verbose: This parameter controls the verbosity of the SimpleImputer and is 0 by default. It was deprecated in scikit-learn 1.1 and removed in 1.3, so leave it out on recent versions.
- copy: If True, a copy of the input dataset will be created. If False, imputation will be done in-place whenever possible.
- add_indicator: If True, a MissingIndicator transform is stacked onto the output of the imputer's transform. This allows a predictive estimator to account for missingness despite imputation.
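To make the add_indicator parameter more concrete, here's a minimal sketch. The small array X_demo is made up for illustration and is not part of the examples in this article. With the default settings, one indicator column is appended for each feature that had missing values during fit.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical two-column dataset; both columns contain one missing value
X_demo = np.array([[7.0, 2.0],
                   [4.0, np.nan],
                   [np.nan, 5.0]])

# add_indicator=True appends binary "was missing" columns to the output
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imputer.fit_transform(X_demo)

# The first two columns are the imputed data, the last two are the indicators
print(X_out)
```

The indicator columns hold 1.0 wherever the original value was missing, so a downstream model can still see which entries were imputed.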
Getting started with the SimpleImputer
To start using the SimpleImputer class, you must install the Scikit-Learn library on your machine alongside Python.
You can run the following command from your command line/terminal to install scikit-learn using Python’s Package Manager (pip):
pip install scikit-learn
Once you’ve installed the library, you can import it in Python by running the following line of code in your Python IDE or Python Shell.
import sklearn
If running this line of code doesn’t give you an error, you’ve successfully installed Scikit-Learn and imported it in Python. Now, you can use the SimpleImputer to fill missing values.
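Since some parameters depend on the installed release, it can help to check which version you have. This is a small optional sketch; the variable name installed_version is only for illustration.

```python
import sklearn
from sklearn.impute import SimpleImputer

# Read the version string of the installed scikit-learn package
installed_version = sklearn.__version__
print("scikit-learn version:", installed_version)
```

If the import of SimpleImputer fails, your installation is likely older than the release that introduced the sklearn.impute module.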
Performing imputation using the ‘mean’ strategy in SimpleImputer
The ‘mean’ strategy of SimpleImputer replaces missing values using the mean along each column and this can only be used with numeric data.
Here’s an example of how a ‘mean’ strategy can be used to fill missing values using the SimpleImputer:
# Importing the NumPy library to create nan values
import numpy as np

# Importing the SimpleImputer class from sklearn
from sklearn.impute import SimpleImputer

# Initializing the SimpleImputer object with missing_values and strategy defined
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fitting the SimpleImputer using a sample dataset
imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])

# Initializing a dataset that the SimpleImputer wasn't fitted on
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]

# Filling in the missing values in X using the fitted SimpleImputer
print(imp_mean.transform(X))

[[ 7.   2.   3. ]
 [ 4.   3.5  6. ]
 [10.   3.5  9. ]]
In the example above, you can see that we fitted the SimpleImputer on a sample dataset that itself contained missing values. The fitted imputer learned the mean of each column (ignoring the missing entries), and the dataset X was then transformed by filling its missing values with those column means. This kind of imputation where you fill in the missing values with the mean is also known as ‘mean imputation’.
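If you want to see what the imputer actually learned during fit, you can inspect its statistics_ attribute. The sketch below reuses the sample dataset from the example above and compares the learned values with NumPy's np.nanmean; the variable names are only illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same sample dataset that the imputer was fitted on above
X_train = np.array([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])

imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit(X_train)

# statistics_ holds one fill value per column (here, the column means)
learned_means = imp_mean.statistics_
print(learned_means)

# The same values computed by hand, ignoring NaN entries
manual_means = np.nanmean(X_train, axis=0)
print(manual_means)
```

Both print statements should show the same three values, which are the numbers used to fill the gaps in each column.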
Performing imputation using the ‘median’ strategy in SimpleImputer
The ‘median’ strategy of SimpleImputer replaces missing values using the median along each column and this can only be used with numeric data.
Here’s an example of how a ‘median’ strategy can be used to fill missing values using the SimpleImputer:
# Importing the NumPy library to create nan values
import numpy as np

# Importing the SimpleImputer class from sklearn
from sklearn.impute import SimpleImputer

# Initializing the SimpleImputer object with missing_values and strategy defined
imp_median = SimpleImputer(missing_values=np.nan, strategy='median')

# Fitting the SimpleImputer using the given dataset
imp_median.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])

# Initializing a sample dataset that isn't fitted
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]

# Filling in the missing values in the sample dataset using the fitted SimpleImputer
print(imp_median.transform(X))

[[ 7.   2.   3. ]
 [ 4.   3.5  6. ]
 [10.   3.5  9. ]]
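In the sample data used here, the mean and the median of each column happen to be identical, so both strategies give the same output. The sketch below uses a made-up single column with one outlier (not taken from this article) to show a case where the two strategies differ.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical column with an outlier (100.0) and one missing value
column = np.array([[1.0], [2.0], [3.0], [100.0], [np.nan]])

# The mean is pulled towards the outlier, the median is not
mean_filled = SimpleImputer(strategy='mean').fit_transform(column)
median_filled = SimpleImputer(strategy='median').fit_transform(column)

print("mean fill:  ", mean_filled[-1, 0])
print("median fill:", median_filled[-1, 0])
```

This is one reason the ‘median’ strategy is often preferred for skewed columns, although the better choice depends on your data.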
Performing imputation using the ‘most_frequent’ strategy in SimpleImputer
The ‘most_frequent’ strategy of SimpleImputer replaces missing values using the most frequent value along each column and it can be used with strings or numeric data. If there is more than one such value, only the smallest value is returned.
Here’s an example of how the ‘most_frequent’ strategy can be used to fill missing values using the SimpleImputer:
# Importing the NumPy library to create nan values
import numpy as np

# Importing the SimpleImputer class from sklearn
from sklearn.impute import SimpleImputer

# Initializing the SimpleImputer object with missing_values and strategy defined
imp_most_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# Fitting the SimpleImputer using the given dataset
imp_most_freq.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])

# Initializing a sample dataset that isn't fitted
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]

# Filling in the missing values in the sample dataset using the fitted SimpleImputer
print(imp_most_freq.transform(X))

[[ 4.  2.  3.]
 [ 4.  2.  6.]
 [10.  2.  9.]]
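Since the ‘most_frequent’ strategy also works with strings, here's a minimal sketch using a small object array. The data is made up for illustration, and the array is given dtype=object so that text and np.nan can sit side by side.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical categorical data with missing entries marked as np.nan
X_text = np.array([["a", "x"],
                   [np.nan, "y"],
                   ["a", np.nan],
                   ["b", "y"]], dtype=object)

imp_most_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
X_text_filled = imp_most_freq.fit_transform(X_text)

# "a" is the most frequent value in column 0 and "y" in column 1
print(X_text_filled)
```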
Performing imputation using the ‘constant’ strategy in SimpleImputer
The ‘constant’ strategy of SimpleImputer replaces missing values using a provided fill_value and it can be used with strings or numeric data.
Here’s an example of how the ‘constant’ strategy can be used to fill missing values using the SimpleImputer:
# Importing the NumPy library to create nan values
import numpy as np

# Importing the SimpleImputer class from sklearn
from sklearn.impute import SimpleImputer

# Initializing the SimpleImputer object with missing_values, strategy and fill_value defined
imp_constant = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=20)

# Fitting the SimpleImputer using the given dataset
imp_constant.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])

# Initializing a sample dataset that isn't fitted
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]

# Filling in the missing values in the sample dataset using the fitted SimpleImputer
print(imp_constant.transform(X))

[[20.  2.  3.]
 [ 4. 20.  6.]
 [10. 20.  9.]]
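The ‘constant’ strategy can also be used without a fill_value on numeric data, and with a text label on string data. The sketch below shows both cases on made-up arrays; the label "unknown" is just an illustrative choice.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Numeric data: with fill_value left at its default, missing values become 0
X_num = np.array([[1.0, np.nan],
                  [np.nan, 4.0]])
zero_filled = SimpleImputer(strategy='constant').fit_transform(X_num)
print(zero_filled)

# String data: supply a text label as the fill_value
X_str = np.array([["a", np.nan],
                  [np.nan, "y"]], dtype=object)
label_filled = SimpleImputer(strategy='constant', fill_value="unknown").fit_transform(X_str)
print(label_filled)
```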
In Conclusion
You have now successfully learned how to use the SimpleImputer class from Scikit-Learn to fill missing values with 4 different strategies. If you have any questions, please feel free to ask them down in the comments and we will get back to you.