Detecting Missing Values¶

Missing data is inevitable in real-world datasets. pandas provides methods to detect and identify missing values.

isnull Method¶

The isnull() method returns a boolean mask indicating missing values.

1. Series Detection¶

import pandas as pd
import numpy as np

s = pd.Series([1, np.nan, 3, None, 5])
print(s.isnull())

0    False
1     True
2    False
3     True
4    False
dtype: bool

2. DataFrame Detection¶

df = pd.DataFrame({
    'A': [1, np.nan, 3],
    'B': [4, 5, np.nan]
})
print(df.isnull())

       A      B
0  False  False
1   True  False
2  False   True

3. Count Missing Values¶

print(df.isnull().sum())  # Per column
print(df.isnull().sum().sum())  # Total

notnull Method¶

The notnull() method is the inverse of isnull().

1. Basic Usage¶

print(s.notnull())

0     True
1    False
2     True
3    False
4     True
dtype: bool

2. Filter Non-null Values¶

s[s.notnull()]  # Keep only non-null values

3. Alias isna and notna¶

s.isna()    # Same as isnull()
s.notna()   # Same as notnull()

Missing Value Types¶

pandas treats several values as missing, but they behave differently.

The Three Main Missing Types¶

Type	Package	Works With	Propagates
`np.nan`	NumPy	float only	Yes
`None`	Python	object dtype	Yes
`pd.NA`	pandas	All nullable dtypes	Yes

1. np.nan (NumPy NaN)¶

The traditional missing value, but only works with float dtype.

import numpy as np
import pandas as pd

# np.nan is a float
print(type(np.nan))  # <class 'float'>

# Integer column with np.nan converts to float!
s = pd.Series([1, 2, np.nan])
print(s.dtype)  # float64 (not int!)
print(s)

0    1.0
1    2.0
2    NaN
dtype: float64

Key behaviors:

# NaN is not equal to itself
print(np.nan == np.nan)  # False

# Arithmetic with NaN propagates
print(1 + np.nan)  # nan

# Use isna() to detect
print(pd.isna(np.nan))  # True

2. None (Python None)¶

Python's built-in null, stored in object dtype.

s = pd.Series([1, None, 3])
print(s.dtype)  # object (not numeric!)
print(s)

0       1
1    None
2       3
dtype: object

Behavior in different contexts:

# In numeric Series, None becomes np.nan
s = pd.Series([1.0, None, 3.0])
print(s.dtype)  # float64
print(s[1])     # nan (converted from None)

# In string Series, stays as None but detected as missing
s = pd.Series(['a', None, 'c'])
print(s.isnull())  # [False, True, False]

3. pd.NA (pandas NA) - Recommended¶

Introduced in pandas 1.0, works with all nullable dtypes.

# pd.NA works with nullable integer dtype
s = pd.Series([1, 2, pd.NA], dtype='Int64')  # Note capital I
print(s.dtype)  # Int64
print(s)

0       1
1       2
2    <NA>
dtype: Int64

Why pd.NA is better:

# Integer column stays integer!
s = pd.Series([1, 2, pd.NA], dtype='Int64')
print(s.dtype)  # Int64 (preserved!)

# Boolean column stays boolean
b = pd.Series([True, False, pd.NA], dtype='boolean')
print(b.dtype)  # boolean

# String column with proper type
st = pd.Series(['a', 'b', pd.NA], dtype='string')
print(st.dtype)  # string

4. NaT (Not a Time)¶

Special missing value for datetime types.

dates = pd.Series([pd.Timestamp('2024-01-01'), pd.NaT, pd.Timestamp('2024-01-03')])
print(dates)
print(dates.isnull())

0   2024-01-01
1          NaT
2   2024-01-03
dtype: datetime64[ns]

0    False
1     True
2    False
dtype: bool

Comparison Table¶

# Create Series with different missing types
s_nan = pd.Series([1, np.nan, 3])           # float64
s_none = pd.Series([1, None, 3])            # object  
s_na = pd.Series([1, pd.NA, 3], dtype='Int64')  # Int64

print("np.nan Series:")
print(f"  dtype: {s_nan.dtype}, missing: {s_nan.isnull().sum()}")

print("None Series:")
print(f"  dtype: {s_none.dtype}, missing: {s_none.isnull().sum()}")

print("pd.NA Series:")
print(f"  dtype: {s_na.dtype}, missing: {s_na.isnull().sum()}")

Nullable Dtypes¶

Use nullable dtypes to maintain proper types with missing values:

Standard dtype	Nullable dtype
int64	Int64
float64	Float64
bool	boolean
object (str)	string

# Convert to nullable dtypes
df = pd.DataFrame({
    'int_col': [1, 2, None],
    'float_col': [1.0, None, 3.0],
    'str_col': ['a', None, 'c'],
    'bool_col': [True, None, False]
})

# Convert all to nullable dtypes
df = df.convert_dtypes()
print(df.dtypes)

int_col      Int64
float_col    Float64
str_col      string
bool_col     boolean
dtype: object

Best Practice Recommendations¶

For new code: Use pd.NA with nullable dtypes
For compatibility: np.nan is still widely used
For databases: None may appear from SQL NULLs
For datetime: pd.NaT is automatically used

# Recommended: Use convert_dtypes() for automatic handling
df = pd.read_csv('data.csv').convert_dtypes()

Visualizing Missing Data¶

Understand missing data patterns.

1. Missing Percentage¶

def missing_report(df):
    total = df.isnull().sum()
    percent = 100 * total / len(df)
    return pd.DataFrame({
        'Missing': total,
        'Percent': percent
    })

2. Missing Heatmap¶

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.imshow(df.isnull(), aspect='auto', cmap='gray')
plt.xlabel('Columns')
plt.ylabel('Rows')
plt.title('Missing Data Pattern')

3. Per-row Missing Count¶

df['missing_count'] = df.isnull().sum(axis=1)

Practical Example¶

Detect missing values in weather data.

1. Load Data¶

url = "https://raw.githubusercontent.com/codebasics/py/master/pandas/5_handling_missing_data_fillna_dropna_interpolate/weather_data.csv"
df = pd.read_csv(url, index_col='day', parse_dates=True)
print(df)

2. Identify Missing¶

print(df.isnull())
print(f"Total missing: {df.isnull().sum().sum()}")

3. Missing by Column¶

print(df.isnull().sum())

Runnable Example: `data_cleaning_tutorial.py`¶

"""
Pandas Tutorial 04: Data Cleaning and Preprocessing
===================================================

This tutorial covers essential data cleaning techniques.
We'll cover:
1. Handling missing data (advanced techniques)
2. Removing duplicates
3. Data validation
4. Outlier detection and handling
5. String cleaning
6. Data type conversions
7. Standardizing and normalizing data

Prerequisites: Tutorials 01-03
Difficulty: Beginner
"""

import pandas as pd
import numpy as np
import warnings

if __name__ == "__main__":
    warnings.filterwarnings('ignore')

    # ============================================================================
    # SECTION 1: COMPREHENSIVE MISSING DATA HANDLING
    # ============================================================================

    print("=" * 70)
    print("HANDLING MISSING DATA")
    print("=" * 70)

    # Create messy dataset with various missing data patterns
    df_missing = pd.DataFrame({
        'Name': ['Alice', 'Bob', None, 'David', 'Eve', '  ', 'Frank', np.nan],
        'Age': [25, np.nan, 35, 28, np.nan, 32, 45, 40],
        'Salary': [50000, 60000, np.nan, 55000, 65000, np.nan, 90000, 70000],
        'Department': ['HR', 'IT', 'Finance', None, 'HR', 'Finance', np.nan, 'IT'],
        'Years': [2, 5, np.nan, 3, np.nan, 7, 12, 8]
    })

    print("Original data with missing values:")
    print(df_missing)

    # Identify missing data
    print("\n1. Check for missing values:")
    print(df_missing.isnull().sum())

    print("\n2. Percentage of missing values per column:")
    print(df_missing.isnull().sum() / len(df_missing) * 100)

    # Replace empty strings and whitespace with NaN
    print("\n3. Replace empty strings with NaN:")
    df_missing = df_missing.replace(r'^\s*$', np.nan, regex=True)
    print(df_missing)

    # Different strategies for filling missing values
    print("\n4. Fill Age with median:")
    df_missing['Age'] = df_missing['Age'].fillna(df_missing['Age'].median())
    print(df_missing)

    print("\n5. Fill Department with mode (most common value):")
    mode_dept = df_missing['Department'].mode()[0]
    df_missing['Department'] = df_missing['Department'].fillna(mode_dept)
    print(df_missing)

    print("\n6. Interpolate Years (for numeric data with trend):")
    df_missing['Years'] = df_missing['Years'].interpolate()
    print(df_missing)

    print("\n7. Drop rows with remaining NaN values:")
    df_clean = df_missing.dropna()
    print(f"Rows before: {len(df_missing)}, after: {len(df_clean)}")
    print(df_clean)

    # ============================================================================
    # SECTION 2: HANDLING DUPLICATES
    # ============================================================================

    print("\n" + "=" * 70)
    print("HANDLING DUPLICATES")
    print("=" * 70)

    # Create data with duplicates
    df_duplicates = pd.DataFrame({
        'ID': [1, 2, 2, 3, 4, 4, 5],
        'Name': ['Alice', 'Bob', 'Bob', 'Charlie', 'David', 'David', 'Eve'],
        'Score': [85, 92, 92, 78, 88, 88, 95]
    })

    print("Data with duplicates:")
    print(df_duplicates)

    print("\n1. Check for duplicate rows:")
    print(df_duplicates.duplicated())
    print(f"Number of duplicates: {df_duplicates.duplicated().sum()}")

    print("\n2. Show duplicate rows:")
    duplicate_rows = df_duplicates[df_duplicates.duplicated()]
    print(duplicate_rows)

    print("\n3. Remove duplicates (keep first occurrence):")
    df_no_dups = df_duplicates.drop_duplicates()
    print(df_no_dups)

    print("\n4. Remove duplicates based on specific column:")
    df_no_dups_id = df_duplicates.drop_duplicates(subset=['ID'])
    print(df_no_dups_id)

    print("\n5. Keep last occurrence instead of first:")
    df_keep_last = df_duplicates.drop_duplicates(keep='last')
    print(df_keep_last)

    # ============================================================================
    # SECTION 3: DATA VALIDATION
    # ============================================================================

    print("\n" + "=" * 70)
    print("DATA VALIDATION")
    print("=" * 70)

    # Create data with validation issues
    df_validate = pd.DataFrame({
        'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, 150, 35, -5, 32],  # Invalid ages
        'Email': ['alice@email.com', 'bob@email', 'charlie@email.com', 
                  'invalid', 'eve@email.com'],  # Invalid emails
        'Salary': [50000, 60000, -10000, 55000, 65000]  # Negative salary
    })

    print("Data with validation issues:")
    print(df_validate)

    # Validate age range
    print("\n1. Validate Age (should be 18-65):")
    invalid_ages = df_validate[(df_validate['Age'] < 18) | (df_validate['Age'] > 65)]
    print(f"Invalid ages found:")
    print(invalid_ages[['Name', 'Age']])

    # Validate email format
    print("\n2. Validate Email format:")
    valid_email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    df_validate['Valid_Email'] = df_validate['Email'].str.match(valid_email_pattern)
    print(df_validate[['Name', 'Email', 'Valid_Email']])

    # Validate salary (must be positive)
    print("\n3. Validate Salary (must be positive):")
    invalid_salaries = df_validate[df_validate['Salary'] < 0]
    print(f"Invalid salaries found:")
    print(invalid_salaries[['Name', 'Salary']])

    # Fix invalid data
    print("\n4. Fix invalid data:")
    # Replace invalid ages with median
    median_age = df_validate[df_validate['Age'].between(18, 65)]['Age'].median()
    df_validate.loc[~df_validate['Age'].between(18, 65), 'Age'] = median_age

    # Replace negative salaries with mean
    mean_salary = df_validate[df_validate['Salary'] > 0]['Salary'].mean()
    df_validate.loc[df_validate['Salary'] < 0, 'Salary'] = mean_salary

    print("Fixed data:")
    print(df_validate)

    # ============================================================================
    # SECTION 4: OUTLIER DETECTION AND HANDLING
    # ============================================================================

    print("\n" + "=" * 70)
    print("OUTLIER DETECTION")
    print("=" * 70)

    # Create data with outliers
    np.random.seed(42)
    normal_data = np.random.normal(100, 15, 100)
    outliers = [200, 250, 300, -50, -100]  # Add extreme values
    df_outliers = pd.DataFrame({
        'Value': np.append(normal_data, outliers)
    })

    print(f"Data with {len(df_outliers)} values")
    print("\nStatistics:")
    print(df_outliers.describe())

    # Method 1: IQR (Interquartile Range) method
    print("\n1. Detect outliers using IQR method:")
    Q1 = df_outliers['Value'].quantile(0.25)
    Q3 = df_outliers['Value'].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    print(f"Q1: {Q1:.2f}, Q3: {Q3:.2f}, IQR: {IQR:.2f}")
    print(f"Lower bound: {lower_bound:.2f}, Upper bound: {upper_bound:.2f}")

    outlier_mask = (df_outliers['Value'] < lower_bound) | (df_outliers['Value'] > upper_bound)
    print(f"Number of outliers: {outlier_mask.sum()}")
    print(f"Outlier values: {df_outliers[outlier_mask]['Value'].tolist()}")

    # Method 2: Z-score method
    print("\n2. Detect outliers using Z-score method (|z| > 3):")
    mean = df_outliers['Value'].mean()
    std = df_outliers['Value'].std()
    df_outliers['Z_Score'] = (df_outliers['Value'] - mean) / std
    outliers_z = df_outliers[np.abs(df_outliers['Z_Score']) > 3]
    print(f"Number of outliers: {len(outliers_z)}")
    print(outliers_z)

    # Handling outliers
    print("\n3. Handle outliers by capping at boundaries:")
    df_capped = df_outliers.copy()
    df_capped['Value_Capped'] = df_capped['Value'].clip(lower=lower_bound, upper=upper_bound)
    print("Before and after capping:")
    print(df_capped[outlier_mask][['Value', 'Value_Capped']])

    print("\n4. Remove outliers:")
    df_no_outliers = df_outliers[~outlier_mask]
    print(f"Original size: {len(df_outliers)}, After removing outliers: {len(df_no_outliers)}")

    # ============================================================================
    # SECTION 5: STRING CLEANING
    # ============================================================================

    print("\n" + "=" * 70)
    print("STRING CLEANING")
    print("=" * 70)

    # Create data with messy strings
    df_strings = pd.DataFrame({
        'Name': ['  Alice  ', 'BOB', 'charlie', '  DAVID  ', 'eve'],
        'Email': ['ALICE@EMAIL.COM', 'bob@email.com  ', '  charlie@EMAIL.com', 
                  'david@email.COM', 'eve@Email.COM'],
        'Phone': ['(123) 456-7890', '123-456-7890', '123.456.7890', 
                  '1234567890', '(987)654-3210']
    })

    print("Messy string data:")
    print(df_strings)

    # Clean strings
    print("\n1. Strip whitespace:")
    df_strings['Name'] = df_strings['Name'].str.strip()
    df_strings['Email'] = df_strings['Email'].str.strip()
    print(df_strings)

    print("\n2. Standardize case:")
    df_strings['Name'] = df_strings['Name'].str.title()  # Title case
    df_strings['Email'] = df_strings['Email'].str.lower()  # Lowercase
    print(df_strings)

    print("\n3. Standardize phone numbers:")
    # Remove all non-digit characters and format
    df_strings['Phone_Clean'] = df_strings['Phone'].str.replace(r'\D', '', regex=True)
    df_strings['Phone_Formatted'] = df_strings['Phone_Clean'].apply(
        lambda x: f"({x[:3]}) {x[3:6]}-{x[6:]}" if len(x) == 10 else x
    )
    print(df_strings[['Phone', 'Phone_Formatted']])

    print("\n4. Remove special characters from names:")
    messy_names = pd.Series(['Alice#123', 'Bob$$$', 'Charlie@@@', 'David!!!'])
    clean_names = messy_names.str.replace(r'[^a-zA-Z\s]', '', regex=True)
    print(f"Before: {messy_names.tolist()}")
    print(f"After: {clean_names.tolist()}")

    # ============================================================================
    # SECTION 6: DATA TYPE CONVERSIONS
    # ============================================================================

    print("\n" + "=" * 70)
    print("DATA TYPE CONVERSIONS")
    print("=" * 70)

    # Create data with wrong types
    df_types = pd.DataFrame({
        'ID': ['1', '2', '3', '4', '5'],
        'Age': ['25', '30', '35', '28', '32'],
        'Salary': ['50000.50', '60000.75', '75000.25', '55000.00', '65000.99'],
        'Date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05'],
        'Active': ['True', 'False', 'True', 'True', 'False']
    })

    print("Data with string types:")
    print(df_types.dtypes)

    print("\n1. Convert to numeric:")
    df_types['ID'] = pd.to_numeric(df_types['ID'])
    df_types['Age'] = pd.to_numeric(df_types['Age'])
    df_types['Salary'] = pd.to_numeric(df_types['Salary'])
    print("After conversion:")
    print(df_types.dtypes)

    print("\n2. Convert to datetime:")
    df_types['Date'] = pd.to_datetime(df_types['Date'])
    print(df_types.dtypes)

    print("\n3. Convert to boolean:")
    df_types['Active'] = df_types['Active'].map({'True': True, 'False': False})
    print(df_types.dtypes)

    print("\nFinal DataFrame:")
    print(df_types)

    # Handle conversion errors
    print("\n4. Handle conversion errors:")
    messy_numbers = pd.Series(['1', '2', 'three', '4', 'five'])
    # errors='coerce' converts errors to NaN
    clean_numbers = pd.to_numeric(messy_numbers, errors='coerce')
    print(f"Original: {messy_numbers.tolist()}")
    print(f"Converted: {clean_numbers.tolist()}")

    # ============================================================================
    # SECTION 7: STANDARDIZING AND NORMALIZING
    # ============================================================================

    print("\n" + "=" * 70)
    print("STANDARDIZING AND NORMALIZING")
    print("=" * 70)

    # Create sample data
    df_norm = pd.DataFrame({
        'Height_cm': [160, 170, 180, 150, 175],
        'Weight_kg': [60, 70, 80, 50, 75],
        'Age': [25, 30, 35, 20, 32]
    })

    print("Original data:")
    print(df_norm)

    # Min-Max Normalization (scale to 0-1)
    print("\n1. Min-Max Normalization (0-1):")
    df_norm['Height_Normalized'] = (df_norm['Height_cm'] - df_norm['Height_cm'].min()) / \
                                   (df_norm['Height_cm'].max() - df_norm['Height_cm'].min())
    print(df_norm[['Height_cm', 'Height_Normalized']])

    # Standardization (Z-score normalization)
    print("\n2. Standardization (Z-score):")
    df_norm['Weight_Standardized'] = (df_norm['Weight_kg'] - df_norm['Weight_kg'].mean()) / \
                                     df_norm['Weight_kg'].std()
    print(df_norm[['Weight_kg', 'Weight_Standardized']])

    # Robust scaling (using median and IQR)
    print("\n3. Robust Scaling:")
    median = df_norm['Age'].median()
    Q1 = df_norm['Age'].quantile(0.25)
    Q3 = df_norm['Age'].quantile(0.75)
    IQR = Q3 - Q1
    df_norm['Age_Robust'] = (df_norm['Age'] - median) / IQR
    print(df_norm[['Age', 'Age_Robust']])

    # ============================================================================
    # SECTION 8: COMPLETE CLEANING PIPELINE
    # ============================================================================

    print("\n" + "=" * 70)
    print("COMPLETE CLEANING PIPELINE")
    print("=" * 70)

    # Create a messy dataset
    messy_data = pd.DataFrame({
        'Name': ['  Alice  ', 'BOB', 'charlie', 'Alice', '  DAVID  '],
        'Age': ['25', '150', '35', '25', '-5'],
        'Email': ['alice@email.com', 'bob@email', 'charlie@email.com', 
                  'alice@email.com', 'david@email.com'],
        'Salary': ['50000', '60000', None, '50000', '-10000']
    })

    print("Messy data:")
    print(messy_data)

    def clean_dataframe(df):
        """Complete data cleaning pipeline"""
        df = df.copy()

        # 1. Strip whitespace from strings
        for col in df.select_dtypes(include=['object']).columns:
            df[col] = df[col].str.strip()

        # 2. Standardize case
        if 'Name' in df.columns:
            df['Name'] = df['Name'].str.title()
        if 'Email' in df.columns:
            df['Email'] = df['Email'].str.lower()

        # 3. Convert data types
        if 'Age' in df.columns:
            df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
        if 'Salary' in df.columns:
            df['Salary'] = pd.to_numeric(df['Salary'], errors='coerce')

        # 4. Validate and fix ages
        if 'Age' in df.columns:
            median_age = df[df['Age'].between(18, 100)]['Age'].median()
            df.loc[~df['Age'].between(18, 100), 'Age'] = median_age

        # 5. Handle negative salaries
        if 'Salary' in df.columns:
            df.loc[df['Salary'] < 0, 'Salary'] = np.nan

        # 6. Fill missing values
        if 'Salary' in df.columns:
            df['Salary'] = df['Salary'].fillna(df['Salary'].median())

        # 7. Remove duplicates
        df = df.drop_duplicates()

        return df

    print("\nCleaned data:")
    cleaned_data = clean_dataframe(messy_data)
    print(cleaned_data)
    print("\nData types:")
    print(cleaned_data.dtypes)

    # ============================================================================
    # SECTION 9: SUMMARY
    # ============================================================================

    print("\n" + "=" * 70)
    print("KEY TAKEAWAYS")
    print("=" * 70)

    summary = """
    1. Missing Data:
       - Use isnull() to identify
       - Fill with fillna(), interpolate(), or drop with dropna()
       - Choose strategy based on data type and pattern

    2. Duplicates:
       - Identify with duplicated()
       - Remove with drop_duplicates()
       - Consider which occurrence to keep

    3. Data Validation:
       - Check ranges, formats, and business rules
       - Use boolean indexing to find invalid data
       - Fix or remove invalid entries

    4. Outliers:
       - Detect using IQR or Z-score methods
       - Handle by capping, removing, or transforming
       - Consider domain knowledge

    5. String Cleaning:
       - Use str.strip(), str.lower(), str.title()
       - Remove special characters with str.replace()
       - Standardize formats

    6. Data Types:
       - Convert with pd.to_numeric(), pd.to_datetime()
       - Handle errors with errors='coerce'
       - Validate conversions

    7. Normalization:
       - Min-Max: Scale to [0, 1]
       - Standardization: Z-score (mean=0, std=1)
       - Robust: Use median and IQR

    8. Create a cleaning pipeline for consistency

    Next Steps:
    -----------
    - Practice with real messy datasets
    - Try the exercise file: exercises/05_cleaning_exercises.py
    - Move on to intermediate topics
    """

    print(summary)