Data Cleaning

Data cleaning is the process of fixing or removing incorrect, corrupted, badly formatted, duplicate, or incomplete data within a dataset. It is a crucial step in data preparation, because errors left in the data carry over into every analysis or model built on it.

Common Data Cleaning Tasks

  1. Handling Missing Values: Decide whether to impute or remove missing data.
  2. Removing Duplicates: Identify and remove duplicate entries.
  3. Fixing Structural Errors: Correct typos, inconsistent capitalization, etc.
  4. Handling Outliers: Decide how to treat extreme values.
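
Before choosing a treatment, it can help to measure how much of each problem a dataset actually contains. The following is a minimal sketch, assuming pandas and NumPy are installed; the toy DataFrame, its column names, and the `audit` helper are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "city": ["Paris", "paris ", "Berlin", "Berlin", None, "Rome"],
    "price": [10.0, 12.0, 11.0, 11.0, np.nan, 500.0],
})


def audit(frame: pd.DataFrame, numeric_col: str, text_col: str) -> dict:
    """Count occurrences of the four kinds of problems listed above."""
    # Outliers: values outside the 1.5 * IQR fences of the numeric column.
    values = frame[numeric_col].dropna()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

    # Structural errors: labels that collapse together once whitespace
    # and capitalization are normalized.
    raw = frame[text_col].dropna()
    normalized = raw.str.strip().str.lower()

    return {
        "missing_cells": int(frame.isna().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        "inconsistent_labels": int(raw.nunique() - normalized.nunique()),
        "iqr_outliers": int(outside.sum()),
    }


report = audit(df, numeric_col="price", text_col="city")
print(report)
```

The counts only flag candidates. Whether a flagged value is a real error still depends on what the column means.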

Techniques for Data Cleaning

Handling Missing Values

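# Fill missing values with the column mean (df is an existing pandas DataFrame).
# Assign the result back: fillna(inplace=True) on a selected column may act on a
# copy and leave df unchanged in recent pandas versions.
df['column'] = df['column'].fillna(df['column'].mean())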

# Or drop rows with missing values
df.dropna(inplace=True)

Removing Duplicates

# Remove duplicate rows, keeping the first occurrence of each
df.drop_duplicates(inplace=True)

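Handling Outliers

# Use the IQR method to drop rows whose value lies outside the 1.5 * IQR fences
# (rows where 'column' is NaN are dropped too, because both comparisons are False)
Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
df = df[(df['column'] >= lower) & (df['column'] <= upper)]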
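
Fixing Structural Errors

The task list mentions typos and inconsistent capitalization, so here is one possible way to handle them. This is a minimal sketch rather than a definitive recipe: the `country` column, its values, and the `CANONICAL` mapping are all assumptions made up for illustration. The idea is to normalize whitespace and case first, then map known variants to a single spelling.

```python
import pandas as pd

# Hypothetical data with stray whitespace, mixed case, and spelling variants.
df = pd.DataFrame(
    {"country": [" usa", "USA", "U.S.A.", "Canada ", "canada", "Mexico"]}
)

# Assumed mapping from normalized variants to one canonical spelling.
CANONICAL = {
    "usa": "USA",
    "u.s.a.": "USA",
    "canada": "Canada",
    "mexico": "Mexico",
}

# Strip surrounding whitespace, then build a lowercase lookup key.
stripped = df["country"].str.strip()
key = stripped.str.lower()

# Map known variants; values missing from the mapping keep their stripped form.
df["country"] = key.map(CANONICAL).fillna(stripped)

print(df["country"].unique())
```

Falling back to the stripped original leaves unrecognized values visible for manual review instead of turning them into missing values.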