View vs Copy¶
One of the most common sources of bugs in pandas is confusion between views and copies. Understanding when pandas returns a view (reference to original data) vs a copy (independent duplicate) is essential for avoiding silent data corruption.
Mental Model
A view shares memory with the original -- modifying the view modifies the original. A copy is independent -- modifying it leaves the original untouched. The problem is that pandas does not guarantee which one you get from slicing. The safe rule: call .copy() explicitly when you want independence, and use .loc for in-place assignment on the original.
The Problem¶
```python import pandas as pd import numpy as np
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
Is this a view or a copy?¶
subset = df[df['A'] > 1] subset['B'] = 99 # Does this modify df?
print(df) # Is df changed? ```
The answer depends on pandas internals and can vary between versions. This unpredictability is the problem.
What is a View?¶
A view shares memory with the original DataFrame:
```python
Arrays can have views¶
arr = np.array([1, 2, 3, 4, 5]) view = arr[1:4] # This is a view
view[0] = 999 print(arr) # [1, 999, 3, 4, 5] - Original changed! ```
What is a Copy?¶
A copy is independent - modifying it doesn't affect the original:
```python arr = np.array([1, 2, 3, 4, 5]) copy = arr[1:4].copy() # Explicit copy
copy[0] = 999 print(arr) # [1, 2, 3, 4, 5] - Original unchanged ```
When Does pandas Return a View?¶
Likely View (But Not Guaranteed)¶
```python df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
Single column selection - often a view¶
col = df['A']
Slice of rows - sometimes a view¶
rows = df[0:2] ```
Likely Copy¶
```python
Boolean indexing - usually a copy¶
subset = df[df['A'] > 1]
Multiple column selection - usually a copy¶
cols = df[['A', 'B']]
Chained indexing - definitely problematic¶
result = df[df['A'] > 1]['B'] ```
The Danger: Silent Bugs¶
Bug Example 1: Modification Doesn't Persist¶
```python df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
This might not work as expected¶
subset = df[df['A'] > 1] subset['B'] = 0 # Modifying a copy
print(df) # df might be unchanged ```
Bug Example 2: Unintended Modification¶
```python df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
Get a "view"¶
col = df['A']
Modify through the view¶
col[0] = 999 # This might change df!
print(df) # df might be changed ```
The Solution: Be Explicit¶
Rule 1: Use .copy() When You Want Independence¶
```python df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
Explicit copy - safe to modify¶
subset = df[df['A'] > 1].copy() subset['B'] = 0
print(df) # Definitely unchanged print(subset) # Has your changes ```
Rule 2: Use .loc for Direct Modification¶
```python df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
Direct modification - guaranteed to work¶
df.loc[df['A'] > 1, 'B'] = 0
print(df) # Definitely modified ```
SettingWithCopyWarning¶
pandas tries to warn you about ambiguous situations:
```python df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
subset = df[df['A'] > 1] subset['B'] = 0 # SettingWithCopyWarning! ```
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Never ignore this warning!
Safe Patterns¶
Pattern 1: Filter and Modify (Using .loc)¶
```python
WRONG¶
df[df['A'] > 1]['B'] = 0
RIGHT¶
df.loc[df['A'] > 1, 'B'] = 0 ```
Pattern 2: Create Modified Subset¶
```python
WRONG¶
subset = df[df['A'] > 1] subset['new_col'] = subset['B'] * 2
RIGHT¶
subset = df[df['A'] > 1].copy() subset['new_col'] = subset['B'] * 2 ```
Pattern 3: Process and Return¶
```python def process_data(df): # WRONG - might modify original result = df[df['A'] > 1] result['processed'] = True return result
# RIGHT - explicit copy
result = df[df['A'] > 1].copy()
result['processed'] = True
return result
```
Pattern 4: Chain Operations Safely¶
```python
Using method chaining (creates copies automatically)¶
result = (df .query('A > 1') .assign(B_doubled=lambda x: x['B'] * 2) .sort_values('B_doubled') ) ```
Checking If It's a View¶
```python
Check if two arrays share memory¶
def shares_memory(a, b): return np.shares_memory(a.values, b.values)
df = pd.DataFrame({'A': [1, 2, 3]}) col = df['A']
print(shares_memory(df, col)) # Might be True ```
Copy-on-Write (pandas 2.0+)¶
pandas 2.0 introduced Copy-on-Write (CoW) mode:
```python pd.options.mode.copy_on_write = True
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}) subset = df[df['A'] > 1]
With CoW, this creates a copy automatically when needed¶
subset['B'] = 0
print(df) # Original unchanged ```
Summary: Golden Rules¶
| Situation | Safe Pattern |
|---|---|
| Modify subset | df.loc[condition, col] = value |
| Create independent subset | df[condition].copy() |
| Add column to subset | subset = df[...].copy(); subset['new'] = ... |
| Function that modifies | def f(df): df = df.copy(); ... |
| Chained operations | Method chaining with .assign() |
When in doubt, use .copy()!
Exercises¶
Exercise 1. Write code that demonstrates the difference between a view and a copy when slicing a DataFrame.
Solution to Exercise 1
```python import pandas as pd import numpy as np
Solution for the specific exercise¶
np.random.seed(42) df = pd.DataFrame({'A': np.random.randn(10), 'B': np.random.randn(10)}) print(df.head()) ```
Exercise 2. Explain when df[['col']] returns a copy vs when df['col'] returns a view. How can you tell?
Solution to Exercise 2
See the main content for the detailed explanation. The key concept involves understanding the Pandas API and its behavior for this specific operation.
Exercise 3. Write code using .copy() to explicitly create a copy and avoid the SettingWithCopyWarning.
Solution to Exercise 3
```python import pandas as pd import numpy as np
np.random.seed(42) df = pd.DataFrame({'A': np.random.randn(20), 'B': np.random.randn(20)}) result = df.describe() print(result) ```
Exercise 4. Explain the Copy-on-Write (CoW) behavior introduced in newer versions of Pandas. How does it change the view/copy semantics?
Solution to Exercise 4
```python import pandas as pd import numpy as np
np.random.seed(42) df = pd.DataFrame({'A': np.random.randn(50), 'group': np.random.choice(['X', 'Y'], 50)}) result = df.groupby('group').mean() print(result) ```