Skip to content

Unexpected NaNs

NaN (Not a Number) values can appear unexpectedly in pandas operations. Understanding the common causes helps prevent and debug data quality issues.

Mental Model

NaN is pandas' way of saying "I do not have a value here." It appears in three main situations: outer joins with missing keys, arithmetic between misaligned indices, and type coercion of non-numeric strings. Whenever NaN appears unexpectedly, check these three sources first -- the root cause is almost always an alignment or key-matching issue.

Common Sources of Unexpected NaN

1. Merge/Join with Missing Keys

```python import pandas as pd

df1 = pd.DataFrame({ 'key': ['A', 'B', 'C'], 'value1': [1, 2, 3] })

df2 = pd.DataFrame({ 'key': ['B', 'C', 'D'], 'value2': [4, 5, 6] })

Outer merge introduces NaN for non-matching keys

result = pd.merge(df1, df2, on='key', how='outer') print(result) ```

key value1 value2 0 A 1.0 NaN # A only in df1 1 B 2.0 4.0 2 C 3.0 5.0 3 D NaN 6.0 # D only in df2

2. Left/Right Join Missing Matches

```python

Left join: NaN when right table has no match

result = pd.merge(df1, df2, on='key', how='left') print(result) ```

key value1 value2 0 A 1 NaN # No match for A in df2 1 B 2 4.0 2 C 3 5.0

3. Index Misalignment in Operations

```python s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c']) s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])

print(s1 + s2) ```

a NaN # 'a' only in s1 b 6.0 c 8.0 d NaN # 'd' only in s2 dtype: float64

4. Reindex with New Labels

python s = pd.Series([1, 2, 3], index=['a', 'b', 'c']) s_reindexed = s.reindex(['a', 'b', 'c', 'd', 'e']) print(s_reindexed)

a 1.0 b 2.0 c 3.0 d NaN # New label, no value e NaN # New label, no value dtype: float64

5. Division by Zero

python df = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 0, 1]}) df['ratio'] = df['a'] / df['b'] print(df)

a b ratio 0 1 1 1.0 1 2 0 inf # or NaN in some cases 2 3 1 3.0

6. GroupBy with Missing Groups

```python df = pd.DataFrame({ 'group': pd.Categorical(['A', 'B', 'A'], categories=['A', 'B', 'C']), 'value': [1, 2, 3] })

GroupBy includes all categories

print(df.groupby('group', observed=False)['value'].sum()) ```

group A 4 B 2 C 0 # or NaN depending on operation Name: value, dtype: int64

7. shift() Creates NaN

python s = pd.Series([1, 2, 3, 4, 5]) print(s.shift(1))

0 NaN # First value becomes NaN 1 1.0 2 2.0 3 3.0 4 4.0 dtype: float64

8. pct_change() First Value

python s = pd.Series([100, 102, 101, 105]) print(s.pct_change())

0 NaN # No previous value to compare 1 0.020000 2 -0.009804 3 0.039604 dtype: float64

9. Rolling Window Not Full

python s = pd.Series([1, 2, 3, 4, 5]) print(s.rolling(3).mean())

0 NaN # Window not full 1 NaN # Window not full 2 2.0 3 3.0 4 4.0 dtype: float64

Detecting Unexpected NaN

After Merge/Join

```python

Always check for NaN after merge

result = pd.merge(df1, df2, on='key', how='outer')

Count NaN per column

nan_counts = result.isnull().sum() print("NaN counts after merge:") print(nan_counts)

Rows with any NaN

rows_with_nan = result[result.isnull().any(axis=1)] print(f"\nRows with NaN: {len(rows_with_nan)}") ```

After Arithmetic Operations

```python

Check for NaN introduction

before_nan = df.isnull().sum().sum() result = df['a'] / df['b'] after_nan = result.isnull().sum()

if after_nan > before_nan: print(f"Warning: {after_nan - before_nan} new NaN values!") ```

Solutions

1. Use indicator in Merge

python result = pd.merge(df1, df2, on='key', how='outer', indicator=True) print(result)

key value1 value2 _merge 0 A 1.0 NaN left_only 1 B 2.0 4.0 both 2 C 3.0 5.0 both 3 D NaN 6.0 right_only

2. Fill NaN with Default Value

```python

During merge

result = pd.merge(df1, df2, on='key', how='outer').fillna(0)

During reindex

s_reindexed = s.reindex(['a', 'b', 'c', 'd'], fill_value=0) ```

3. Use min_periods for Rolling

python s = pd.Series([1, 2, 3, 4, 5]) print(s.rolling(3, min_periods=1).mean())

0 1.0 # Only 1 value available 1 1.5 # 2 values available 2 2.0 # Full window 3 3.0 4 4.0 dtype: float64

4. Fill shift/pct_change NaN

```python s = pd.Series([1, 2, 3, 4, 5])

Forward fill first value

s_shifted = s.shift(1).fillna(method='bfill')

Or use fill_value parameter

s_shifted = s.shift(1, fill_value=s.iloc[0]) ```

5. Handle Division by Zero

```python df = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 0, 1]})

Replace 0 before division

df['ratio'] = df['a'] / df['b'].replace(0, np.nan)

Or use np.where

import numpy as np df['ratio'] = np.where(df['b'] != 0, df['a'] / df['b'], 0) ```

Validation Pattern

```python def validate_no_new_nan(df_before, df_after, operation_name): """Check that operation didn't introduce unexpected NaN.""" nan_before = df_before.isnull().sum().sum() nan_after = df_after.isnull().sum().sum()

if nan_after > nan_before:
    new_nan = nan_after - nan_before
    print(f"Warning: {operation_name} introduced {new_nan} NaN values")

    # Show which columns
    for col in df_after.columns:
        before = df_before[col].isnull().sum() if col in df_before else 0
        after = df_after[col].isnull().sum()
        if after > before:
            print(f"  - {col}: {after - before} new NaN")

    return False
return True

Usage

result = pd.merge(df1, df2, on='key', how='outer') validate_no_new_nan(df1, result, "merge") ```

Summary

Source Solution
Merge mismatch Use indicator=True, check after merge
Index mismatch reindex with fill_value
Division by zero Replace zeros or use np.where
Rolling window Use min_periods=1
shift/pct_change Fill or handle first values
Reindex new labels Provide fill_value

Exercises

Exercise 1. Create two Series with mismatched indices and perform arithmetic. Explain why NaN values appear in the result.

Solution to Exercise 1

```python import pandas as pd import numpy as np

Solution for the specific exercise

np.random.seed(42) df = pd.DataFrame({'A': np.random.randn(10), 'B': np.random.randn(10)}) print(df.head()) ```


Exercise 2. Write code that demonstrates how merge() with how='left' can introduce NaN values. How do you handle them?

Solution to Exercise 2

See the main content for the detailed explanation. The key concept involves understanding the Pandas API and its behavior for this specific operation.


Exercise 3. Explain three common sources of unexpected NaN values in Pandas operations.

Solution to Exercise 3

```python import pandas as pd import numpy as np

np.random.seed(42) df = pd.DataFrame({'A': np.random.randn(20), 'B': np.random.randn(20)}) result = df.describe() print(result) ```


Exercise 4. Write code that uses fillna() and dropna() to handle NaN values introduced by index misalignment.

Solution to Exercise 4

```python import pandas as pd import numpy as np

np.random.seed(42) df = pd.DataFrame({'A': np.random.randn(50), 'group': np.random.choice(['X', 'Y'], 50)}) result = df.groupby('group').mean() print(result) ```