Data Cleaning
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. It's a crucial step in the data preparation phase.
Common Data Cleaning Tasks
- Handling Missing Values: Decide whether to impute or remove missing data.
- Removing Duplicates: Identify and remove duplicate entries.
- Fixing Structural Errors: Correct typos, inconsistent capitalization, etc.
- Handling Outliers: Decide how to treat extreme values.
Techniques for Data Cleaning
Handling Missing Values
# Fill missing values with the column mean
df['column'] = df['column'].fillna(df['column'].mean())
# Or drop rows that contain missing values
df.dropna(inplace=True)
# Or drop a column entirely (e.g., one that is mostly missing)
df = df.drop(columns=['column'])
Removing Duplicates
# Remove duplicate rows
df.drop_duplicates(inplace=True)
Handling Outliers
# Use the IQR method to remove outliers
Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['column'] >= Q1 - 1.5*IQR) & (df['column'] <= Q3 + 1.5*IQR)]
Common questions
- When should null / empty values be dropped?
  - Drop rows or columns that are missing or sparse; null values can break many models.
  - Whether to clean or drop values depends on what EDA reveals.
  - If a given column has too many missing values, drop the column.
  - If the target column has missing values, drop those rows, or (for a categorical target) treat "missing" as its own category.
- What should be done when only a few values are missing?
  - Imputation: fill missing values with substitutes.
  - Common strategies (see the sketches after this list):
    - Fill with the mean or median.
    - Fill with a constant or the previous value (forward fill).
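As a rough sketch of the drop heuristics above (assuming a pandas DataFrame df with a target column hypothetically named 'target', and an illustrative 50% missing-value threshold, neither of which comes from the notes themselves):
# Fraction of missing values in each column
missing_frac = df.isna().mean()
# Drop columns that are mostly missing (0.5 is an arbitrary example threshold)
df = df.drop(columns=missing_frac[missing_frac > 0.5].index)
# Drop rows where the target itself is missing
df = df.dropna(subset=['target'])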
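And a minimal sketch of the simple imputation strategies, using hypothetical column names num_col, cat_col, and ts_col:
# Fill a numeric column with its median
df['num_col'] = df['num_col'].fillna(df['num_col'].median())
# Fill a categorical column with a constant placeholder
df['cat_col'] = df['cat_col'].fillna('unknown')
# Fill a time-ordered column with the previous observed value (forward fill)
df['ts_col'] = df['ts_col'].ffill()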
Advanced imputation
Advanced techniques include:
- K-nearest neighbors (KNN) imputation
- SMOTE (Synthetic Minority Oversampling Technique); note that SMOTE generates synthetic samples to balance classes rather than filling in missing values
from sklearn.impute import KNNImputer
# Initialize the KNN imputer
imputer = KNNImputer(n_neighbors=2, weights="uniform")
# Impute the column; KNNImputer expects 2D input, so pass a one-column DataFrame
# (in practice, fit on several numeric columns so KNN can use related features)
df[['oldpeak']] = imputer.fit_transform(df[['oldpeak']])
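Since SMOTE is listed above, here is a minimal sketch of how it is typically applied with the imbalanced-learn package; the feature matrix X and labels y are assumed to already exist, and SMOTE requires X to contain no missing values, so it runs after imputation.
from imblearn.over_sampling import SMOTE
# SMOTE oversamples the minority class by generating synthetic examples;
# apply it after missing values have been handled (X must contain no NaNs)
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)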