Data Cleaning

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. It's a crucial step in the data preparation phase.

Common Data Cleaning Tasks

  1. Handling Missing Values: Decide whether to impute or remove missing data.
  2. Removing Duplicates: Identify and remove duplicate entries.
  3. Fixing Structural Errors: Correct typos, inconsistent capitalization, etc.
  4. Handling Outliers: Decide how to treat extreme values.

Techniques for Data Cleaning

Handling Missing Values

# Fill missing values with the column mean
df['column'] = df['column'].fillna(df['column'].mean())

# Or drop rows with missing values
df.dropna(inplace=True)

# Or drop a column with too many missing values
df = df.drop(columns=['column'])

Removing Duplicates

# Remove duplicate rows
df.drop_duplicates(inplace=True)
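
Fixing Structural Errors

Structural errors such as stray whitespace, inconsistent capitalization, and recurring typos can usually be fixed with pandas string methods. A minimal sketch, assuming a hypothetical text column named 'category'; the typo mapping is purely illustrative:

# Strip whitespace and normalize capitalization
df['category'] = df['category'].str.strip().str.lower()

# Map known typos and variant spellings to a canonical value
df['category'] = df['category'].replace({'n/a': 'unknown', 'un-known': 'unknown'})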

Handling Outliers

# Using the IQR method to remove outliers
Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['column'] >= Q1 - 1.5*IQR) & (df['column'] <= Q3 + 1.5*IQR)]

Common questions

  1. When should you drop null / empty values?
  • Drop rows or columns that are missing or sparse
  • Null values can break many models
  • Whether to clean or drop values depends on your EDA findings
  • If a given column has too many missing values:
    • Drop the column
  • If the target column has missing values:
    • Drop the rows with missing targets
    • Or treat the missing value as a separate category
  2. What should you do when there are only a few missing values?
  • Imputation:
    • Fill missing values with substitutes
  • Strategies (see the sketch after this list):
    • Fill with the mean or median
    • Fill with a constant or the previous value
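
A minimal sketch covering both questions above, assuming a pandas DataFrame df with a hypothetical target column named 'target' and a hypothetical numeric column named 'column'; the 50% threshold is illustrative:

# Inspect how much is missing in each column (this guides the EDA decision)
print(df.isna().sum())

# Drop columns where more than half of the values are missing
df = df.loc[:, df.isna().mean() <= 0.5]

# Drop rows where the target itself is missing
df = df.dropna(subset=['target'])

# Simple imputation strategies for the remaining gaps (pick one)
df['column'] = df['column'].fillna(df['column'].median())  # median
df['column'] = df['column'].fillna(0)                      # constant
df['column'] = df['column'].ffill()                        # previous value (forward fill)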

Advanced imputation

Advanced techniques:

  • K-nearest neighbors (KNN) imputation
  • SMOTE (Synthetic Minority Oversampling Technique), which addresses class imbalance rather than missing values

from sklearn.impute import KNNImputer

# Initialize KNNImputer
imputer = KNNImputer(n_neighbors=2, weights="uniform")

# Perform the imputation on your DataFrame (KNNImputer expects 2D input)
df_imputed = df.copy()
df_imputed[['oldpeak']] = imputer.fit_transform(df[['oldpeak']])
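
SMOTE balances classes by generating synthetic minority-class samples rather than filling missing values, so it is typically applied after imputation. A minimal sketch using the imbalanced-learn package, assuming a feature matrix X and label vector y with no remaining missing values:

from imblearn.over_sampling import SMOTE

# Oversample the minority class with synthetic examples
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)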