Unexpected NaNs
NaN (Not a Number) values can appear unexpectedly in pandas operations. Understanding the common causes helps prevent and debug data quality issues.
Mental Model
NaN is pandas' way of saying "I do not have a value here." It shows up unexpectedly in three main situations: joins where a key has no match, arithmetic between objects with misaligned indices, and type coercion of non-numeric strings (for example, `pd.to_numeric(..., errors='coerce')`). Whenever NaN appears where you did not expect it, check these three sources first; the root cause is most often an alignment or key-matching issue.
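Joins and index alignment are illustrated in the sections that follow; type coercion is not. Here is a minimal sketch of it, assuming a column of numbers stored as strings (the values are made up for illustration) converted with `pd.to_numeric` and `errors='coerce'`:

```python
import pandas as pd

# Hypothetical raw column: numbers stored as strings, plus two bad entries
raw = pd.Series(['10', '20', 'n/a', 'thirty', '40'])

# errors='coerce' turns anything that cannot be parsed into NaN
numbers = pd.to_numeric(raw, errors='coerce')
print(numbers)

# Look up the original strings that failed to convert
bad = raw[numbers.isna()]
print(bad.tolist())
```

Keeping the raw column around, as done here, makes it easy to see which inputs were silently turned into NaN.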
Common Sources of Unexpected NaN
1. Merge/Join with Missing Keys
```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'D'], 'value2': [4, 5, 6]})

# Outer merge introduces NaN for non-matching keys
result = pd.merge(df1, df2, on='key', how='outer')
print(result)
```
```
  key  value1  value2
0   A     1.0     NaN   # A only in df1
1   B     2.0     4.0
2   C     3.0     5.0
3   D     NaN     6.0   # D only in df2
```
2. Left/Right Join Missing Matches
```python
# Left join: NaN when right table has no match
result = pd.merge(df1, df2, on='key', how='left')
print(result)
```
```
  key  value1  value2
0   A       1     NaN   # No match for A in df2
1   B       2     4.0
2   C       3     5.0
```
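One way to anticipate these NaN values is to compare the key columns before merging. The sketch below rebuilds the same two tables and uses `isin` to list the left keys that have no partner on the right; the variable name `unmatched` is only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'D'], 'value2': [4, 5, 6]})

# Keys in the left table that have no partner in the right table
unmatched = df1.loc[~df1['key'].isin(df2['key']), 'key']
print(unmatched.tolist())
```

If this list is non-empty, a left join will produce NaN in the right-hand columns for exactly those keys.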
3. Index Misalignment in Operations
```python
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])

print(s1 + s2)
```
```
a    NaN   # 'a' only in s1
b    6.0
c    8.0
d    NaN   # 'd' only in s2
dtype: float64
```
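If the NaN values from misaligned arithmetic are not wanted, two common options are to treat a missing label as zero or to restrict the operation to shared labels. The following is a sketch under the assumption that zero is a sensible default for the data at hand, which is not always true:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])

# Treat a label missing from one side as 0 instead of producing NaN
total = s1.add(s2, fill_value=0)
print(total)

# Alternatively, keep only the labels both Series share
common = s1.index.intersection(s2.index)
shared = s1[common] + s2[common]
print(shared)
```

The first approach keeps all four labels; the second keeps only `b` and `c`.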
4. Reindex with New Labels
```python
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s_reindexed = s.reindex(['a', 'b', 'c', 'd', 'e'])
print(s_reindexed)
```
```
a    1.0
b    2.0
c    3.0
d    NaN   # New label, no value
e    NaN   # New label, no value
dtype: float64
```
5. Division by Zero
```python
df = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 0, 1]})
df['ratio'] = df['a'] / df['b']
print(df)
```
```
   a  b  ratio
0  1  1    1.0
1  2  0    inf   # nonzero / 0 gives inf; 0 / 0 gives NaN
2  3  1    3.0
```
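Whether a zero divisor yields `inf` or NaN depends on the numerator. The short sketch below, using made-up values, shows all three outcomes and also that `isna()` does not flag infinities, so a NaN count alone can miss this problem:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 0, -3], 'b': [0, 0, 0]})
ratio = df['a'] / df['b']
print(ratio.tolist())  # [inf, nan, -inf]

# isna() only counts the 0 / 0 case
n_nan = ratio.isna().sum()
# np.isinf() catches the positive and negative infinities
n_inf = np.isinf(ratio).sum()
print(n_nan, n_inf)
```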
6. GroupBy with Missing Groups
```python
df = pd.DataFrame({
    'group': pd.Categorical(['A', 'B', 'A'], categories=['A', 'B', 'C']),
    'value': [1, 2, 3]
})

# GroupBy includes all categories
print(df.groupby('group', observed=False)['value'].sum())
```
```
group
A    4
B    2
C    0   # sum() of an empty group is 0; mean() would give NaN
Name: value, dtype: int64
```
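The result for an unused category depends on the aggregation. As a sketch of the NaN case, the example below repeats the same data with `mean()`, and then shows `observed=True` as one way to leave unused categories out of the result:

```python
import pandas as pd

df = pd.DataFrame({
    'group': pd.Categorical(['A', 'B', 'A'], categories=['A', 'B', 'C']),
    'value': [1, 2, 3],
})

# mean() has nothing to average for the unused category 'C', so it returns NaN
means_all = df.groupby('group', observed=False)['value'].mean()
print(means_all)

# observed=True keeps only categories that actually occur in the data
means_seen = df.groupby('group', observed=True)['value'].mean()
print(means_seen)
```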
7. shift() Creates NaN
```python
s = pd.Series([1, 2, 3, 4, 5])
print(s.shift(1))
```
```
0    NaN   # First value becomes NaN
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64
```
8. pct_change() First Value
```python
s = pd.Series([100, 102, 101, 105])
print(s.pct_change())
```
```
0         NaN   # No previous value to compare
1    0.020000
2   -0.009804
3    0.039604
dtype: float64
```
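The leading NaN follows directly from how the percentage change is defined: each value is divided by the previous one, and the first row has no previous value. A minimal sketch of that relationship, computed by hand with `shift`:

```python
import pandas as pd

s = pd.Series([100, 102, 101, 105])

# Divide each value by the one before it and subtract 1;
# the first row divides by NaN, so it stays NaN
manual = s / s.shift(1) - 1
print(manual)

# Same values as the built-in method
print(s.pct_change())
```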
9. Rolling Window Not Full
```python
s = pd.Series([1, 2, 3, 4, 5])
print(s.rolling(3).mean())
```
```
0    NaN   # Window not full
1    NaN   # Window not full
2    2.0
3    3.0
4    4.0
dtype: float64
```
Detecting Unexpected NaN
After Merge/Join
```python
# Always check for NaN after merge
result = pd.merge(df1, df2, on='key', how='outer')

# Count NaN per column
nan_counts = result.isnull().sum()
print("NaN counts after merge:")
print(nan_counts)

# Rows with any NaN
rows_with_nan = result[result.isnull().any(axis=1)]
print(f"\nRows with NaN: {len(rows_with_nan)}")
```
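After Arithmetic Operations
```python
import numpy as np

# Count rows where either input is already NaN, then compare with the result
before_nan = df[['a', 'b']].isnull().any(axis=1).sum()
result = df['a'] / df['b']
after_nan = result.isnull().sum()

if after_nan > before_nan:
    print(f"Warning: {after_nan - before_nan} new NaN values!")

# Division by a zero divisor usually gives inf, which isnull() does not count
n_inf = np.isinf(result).sum()
if n_inf:
    print(f"Warning: {n_inf} infinite values!")
```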
Solutions
1. Use indicator in Merge
```python
result = pd.merge(df1, df2, on='key', how='outer', indicator=True)
print(result)
```
```
  key  value1  value2      _merge
0   A     1.0     NaN   left_only
1   B     2.0     4.0        both
2   C     3.0     5.0        both
3   D     NaN     6.0  right_only
```
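The `_merge` column is most useful when it is summarized or filtered rather than read by eye. As a sketch on the same two tables, the example below counts rows per origin and pulls out the rows that failed to match:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'D'], 'value2': [4, 5, 6]})

result = pd.merge(df1, df2, on='key', how='outer', indicator=True)

# How many rows came from the left only, the right only, or both
print(result['_merge'].value_counts())

# Keep only the rows that did not find a partner
unmatched = result[result['_merge'] != 'both']
print(unmatched['key'].tolist())
```

Once the unmatched rows have been reviewed, the helper column can be removed with `result.drop(columns='_merge')`.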
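2. Fill NaN with Default Value
```python
# During merge
result = pd.merge(df1, df2, on='key', how='outer').fillna(0)

# During reindex (s is the label-indexed Series from the reindex example)
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s_reindexed = s.reindex(['a', 'b', 'c', 'd'], fill_value=0)
```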
3. Use min_periods for Rolling
```python
s = pd.Series([1, 2, 3, 4, 5])
print(s.rolling(3, min_periods=1).mean())
```
```
0    1.0   # Only 1 value available
1    1.5   # 2 values available
2    2.0   # Full window
3    3.0
4    4.0
dtype: float64
```
4. Fill shift/pct_change NaN
```python
s = pd.Series([1, 2, 3, 4, 5])

# Backfill the first value from the next one
# (bfill() replaces fillna(method='bfill'), which is deprecated in recent pandas)
s_shifted = s.shift(1).bfill()

# Or use the fill_value parameter
s_shifted = s.shift(1, fill_value=s.iloc[0])
```
5. Handle Division by Zero
```python
import numpy as np

df = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 0, 1]})

# Replace 0 with NaN before division, so the result is NaN instead of inf
df['ratio'] = df['a'] / df['b'].replace(0, np.nan)

# Or use np.where to substitute a default (here 0) where the divisor is 0
df['ratio'] = np.where(df['b'] != 0, df['a'] / df['b'], 0)
```
Validation Pattern
```python
def validate_no_new_nan(df_before, df_after, operation_name):
    """Check that operation didn't introduce unexpected NaN."""
    nan_before = df_before.isnull().sum().sum()
    nan_after = df_after.isnull().sum().sum()

    if nan_after > nan_before:
        new_nan = nan_after - nan_before
        print(f"Warning: {operation_name} introduced {new_nan} NaN values")

        # Show which columns
        for col in df_after.columns:
            before = df_before[col].isnull().sum() if col in df_before else 0
            after = df_after[col].isnull().sum()
            if after > before:
                print(f"  - {col}: {after - before} new NaN")
        return False
    return True

# Usage
result = pd.merge(df1, df2, on='key', how='outer')
validate_no_new_nan(df1, result, "merge")
```
Summary
| Source | Solution |
|---|---|
| Merge mismatch | Use indicator=True, check after merge |
| Index mismatch | reindex with fill_value |
| Division by zero | Replace zeros or use np.where |
| Rolling window | Use min_periods=1 |
| shift/pct_change | Fill or handle first values |
| Reindex new labels | Provide fill_value |
Exercises
Exercise 1. Create two Series with mismatched indices and perform arithmetic. Explain why NaN values appear in the result.
Solution to Exercise 1
```python
import pandas as pd

s1 = pd.Series([10, 20, 30], index=['x', 'y', 'z'])
s2 = pd.Series([1, 2, 3], index=['y', 'z', 'w'])

# pandas aligns on index labels before adding. 'x' and 'w' exist on only
# one side, so there is nothing to add and the result is NaN for them.
print(s1 + s2)
```
Exercise 2. Write code that demonstrates how merge() with how='left' can introduce NaN values. How do you handle them?
Solution to Exercise 2
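A left join keeps every row of the left table. Where a left key has no match in the right table, the columns that come from the right table are filled with NaN. To handle them, either fill those columns with a sensible default using `fillna()`, or pass `indicator=True` and inspect the rows marked `left_only` before deciding what to do with them.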
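A minimal sketch of one possible answer follows. The `orders` and `customers` tables and the `'unknown'` placeholder are hypothetical, chosen only to make the example self-contained:

```python
import pandas as pd

# Hypothetical tables for illustration
orders = pd.DataFrame({'customer_id': [1, 2, 3], 'amount': [50, 75, 20]})
customers = pd.DataFrame({'customer_id': [1, 2], 'name': ['Ann', 'Bob']})

merged = pd.merge(orders, customers, on='customer_id', how='left', indicator=True)
print(merged)

# Customer 3 has no match, so 'name' is NaN for that row
missing = merged[merged['_merge'] == 'left_only']
print(missing['customer_id'].tolist())

# One way to handle it: fill with an explicit placeholder
merged['name'] = merged['name'].fillna('unknown')
print(merged['name'].tolist())
```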
Exercise 3. Explain three common sources of unexpected NaN values in Pandas operations.
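Solution to Exercise 3
Three common sources are (1) merges where a key is missing from one table, (2) arithmetic between Series whose indices do not match, and (3) operations such as `shift()`, `pct_change()`, and `rolling()`, which have no earlier data for the first rows.
```python
import pandas as pd

# 1. Merge with a missing key: 'B' has no match, so 'y' is NaN
left = pd.DataFrame({'key': ['A', 'B'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['A'], 'y': [10]})
print(pd.merge(left, right, on='key', how='left'))

# 2. Misaligned indices: 'a' and 'c' appear on one side only
print(pd.Series([1, 2], index=['a', 'b']) + pd.Series([3, 4], index=['b', 'c']))

# 3. shift(): the first row has no previous value
print(pd.Series([1, 2, 3]).shift(1))
```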
Exercise 4. Write code that uses fillna() and dropna() to handle NaN values introduced by index misalignment.
Solution to Exercise 4
```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])
total = s1 + s2  # NaN at 'a' and 'd'

# fillna(): keep every label and substitute a default
print(total.fillna(0))

# dropna(): keep only the labels present in both Series
print(total.dropna())
```