View vs Copy

One of the most common sources of bugs in pandas is confusion between views and copies. Understanding when pandas returns a view (reference to original data) vs a copy (independent duplicate) is essential for avoiding silent data corruption.

The Problem

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Is this a view or a copy?
subset = df[df['A'] > 1]
subset['B'] = 99  # Does this modify df?

print(df)  # Is df changed?

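In this particular case, boolean indexing returns a copy, so df is left unchanged. Without Copy-on-Write, pandas still emits a SettingWithCopyWarning because it cannot tell whether you meant to modify df. For other indexing operations, whether you get a view or a copy depends on pandas internals and can vary between versions. This unpredictability is the problem.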

What is a View?

A view shares memory with the original object. NumPy arrays, which pandas builds on, show this most clearly:

# Arrays can have views
arr = np.array([1, 2, 3, 4, 5])
view = arr[1:4]  # This is a view

view[0] = 999
print(arr)  # [1, 999, 3, 4, 5] - Original changed!

What is a Copy?

A copy is independent - modifying it doesn't affect the original:

arr = np.array([1, 2, 3, 4, 5])
copy = arr[1:4].copy()  # Explicit copy

copy[0] = 999
print(arr)  # [1, 2, 3, 4, 5] - Original unchanged
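The difference between the two NumPy examples above can also be checked without modifying anything. The sketch below uses `np.shares_memory` and the `.base` attribute; the variable name `copied` is only illustrative.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
view = arr[1:4]            # basic slicing returns a view
copied = arr[1:4].copy()   # explicit, independent copy

# A view shares its buffer with the original array; a copy does not
print(np.shares_memory(arr, view))    # True
print(np.shares_memory(arr, copied))  # False

# .base points at the array that owns the memory (None if the array owns it)
print(view.base is arr)     # True
print(copied.base is None)  # True
```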

When Does pandas Return a View?

Likely View (But Not Guaranteed)

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Single column selection - often a view
col = df['A']

# Slice of rows - sometimes a view
rows = df[0:2]

Likely Copy

# Boolean indexing - usually a copy
subset = df[df['A'] > 1]

# Multiple column selection - usually a copy
cols = df[['A', 'B']]

# Chained indexing - fine for reading, unreliable for assignment
result = df[df['A'] > 1]['B']

The Danger: Silent Bugs

Bug Example 1: Modification Doesn't Persist

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# The boolean filter returns a copy, so the assignment below never reaches df
subset = df[df['A'] > 1]
subset['B'] = 0  # Modifies the copy, not df

print(df)  # df is unchanged

Bug Example 2: Unintended Modification

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Get a "view"
col = df['A']

# Modify through the view
col[0] = 999  # Without Copy-on-Write, this can change df too!

print(df)  # df may show 999 in column A, depending on version and mode

The Solution: Be Explicit

Rule 1: Use .copy() When You Want Independence

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Explicit copy - safe to modify
subset = df[df['A'] > 1].copy()
subset['B'] = 0

print(df)  # Definitely unchanged
print(subset)  # Has your changes

Rule 2: Use .loc for Direct Modification

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Direct modification - guaranteed to work
df.loc[df['A'] > 1, 'B'] = 0

print(df)  # Definitely modified

SettingWithCopyWarning

Without Copy-on-Write, pandas tries to warn you about ambiguous situations:

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

subset = df[df['A'] > 1]
subset['B'] = 0  # SettingWithCopyWarning!

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

Never ignore this warning!
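When the warning is buried in log output, it can help to capture it programmatically. The following is a minimal sketch using only the standard `warnings` module; the helper name `warning_names_for` is hypothetical. Which warnings appear depends on the pandas version and on whether Copy-on-Write is active, so the list may be empty on newer setups.

```python
import warnings
import pandas as pd

def warning_names_for(action):
    """Run `action` and return the class names of any warnings it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure nothing is filtered out
        action()
    return [w.category.__name__ for w in caught]

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

def assign_on_filtered_subset():
    subset = df[df['A'] > 1]
    subset['B'] = 0  # the ambiguous assignment from the example above

names = warning_names_for(assign_on_filtered_subset)
print(names)              # version-dependent list of warning class names
print(df['B'].tolist())   # [4, 5, 6] - the original is untouched either way
```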

Safe Patterns

Pattern 1: Filter and Modify (Using .loc)

# WRONG
df[df['A'] > 1]['B'] = 0

# RIGHT
df.loc[df['A'] > 1, 'B'] = 0
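To see the difference between the two forms side by side, here is a small sketch that runs both on fresh frames. The helper `make_frame` is hypothetical, and warnings are silenced only to keep the output readable.

```python
import warnings
import pandas as pd

def make_frame():
    return pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Chained form: the assignment lands on a temporary copy and is lost
chained = make_frame()
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    chained[chained['A'] > 1]['B'] = 0

# .loc form: a single indexing operation applied directly to the frame
direct = make_frame()
direct.loc[direct['A'] > 1, 'B'] = 0

print(chained['B'].tolist())  # [4, 5, 6]
print(direct['B'].tolist())   # [4, 0, 0]
```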

Pattern 2: Create Modified Subset

# WRONG
subset = df[df['A'] > 1]
subset['new_col'] = subset['B'] * 2

# RIGHT
subset = df[df['A'] > 1].copy()
subset['new_col'] = subset['B'] * 2

Pattern 3: Process and Return

# WRONG - ambiguous whether result is still tied to the caller's df
def process_data_unsafe(df):
    result = df[df['A'] > 1]
    result['processed'] = True
    return result

# RIGHT - explicit copy
def process_data(df):
    result = df[df['A'] > 1].copy()
    result['processed'] = True
    return result
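One way to confirm that a function like this leaves its input alone is to compare the frame against a snapshot taken before the call. The sketch below repeats the copy-based `process_data` and uses `pd.testing.assert_frame_equal`; the name `snapshot` is illustrative.

```python
import pandas as pd

def process_data(df):
    # Explicit copy, so the new column never touches the caller's frame
    result = df[df['A'] > 1].copy()
    result['processed'] = True
    return result

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
snapshot = df.copy(deep=True)

out = process_data(df)

# Raises AssertionError if process_data changed df in any way
pd.testing.assert_frame_equal(df, snapshot)
print(list(out.columns))  # ['A', 'B', 'processed']
```

A check like this fits naturally into a unit test for any function that takes a DataFrame.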

Pattern 4: Chain Operations Safely

# Using method chaining (each step returns a new DataFrame)
result = (df
    .query('A > 1')
    .assign(B_doubled=lambda x: x['B'] * 2)
    .sort_values('B_doubled')
)
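As a quick sanity check, the same chain can be run on the small example frame used throughout this page to confirm that the original keeps its columns while the result gains the new one. This is only a sketch on toy data.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

result = (df
    .query('A > 1')
    .assign(B_doubled=lambda x: x['B'] * 2)
    .sort_values('B_doubled')
)

print(list(df.columns))              # ['A', 'B'] - original untouched
print(result['B_doubled'].tolist())  # [10, 12]
```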

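Checking If It's a View

# Check whether two pandas objects share underlying memory.
# Note: .values on a mixed-dtype DataFrame builds a new array,
# so this check is only meaningful for single-dtype data.
def shares_memory(a, b):
    return np.shares_memory(a.values, b.values)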

df = pd.DataFrame({'A': [1, 2, 3]})
col = df['A']

print(shares_memory(df, col))  # Might be True
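For frames with mixed dtypes, comparing one column at a time tends to be more reliable than comparing whole-frame arrays. The following is a sketch under that assumption; `column_shares_memory` is a hypothetical helper, and the results reflect pandas internals that may differ between versions.

```python
import numpy as np
import pandas as pd

def column_shares_memory(df, column, other):
    """Return True if the Series `other` shares its buffer with df[column]."""
    return np.shares_memory(df[column].to_numpy(), other.to_numpy())

# Mixed dtypes: an int column next to a float column
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4.0, 5.0, 6.0]})

col = df['A']                    # plain column selection
independent = df['A'].copy()     # explicit copy
filtered = df[df['A'] > 0]['A']  # boolean indexing, keeps every row

print(column_shares_memory(df, 'A', col))
print(column_shares_memory(df, 'A', independent))
print(column_shares_memory(df, 'A', filtered))
```

Treat the output as a debugging aid, not as a guarantee about how assignments will behave.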

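Copy-on-Write (pandas 2.0+)

Copy-on-Write (CoW) first shipped as an experimental opt-in mode in pandas 1.5 and became practical to use in pandas 2.0. According to the pandas documentation, it becomes the default behavior in pandas 3.0. To opt in on pandas 2.x: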

pd.options.mode.copy_on_write = True

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
subset = df[df['A'] > 1]

# With CoW, this creates a copy automatically when needed
subset['B'] = 0

print(df)  # Original unchanged
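If you are unsure which behavior your environment has, a behavioral probe can tell you without touching any options. The sketch below writes through a selected column of a throwaway frame and checks whether the parent changed; the function name `writes_propagate_to_parent` is hypothetical.

```python
import warnings
import pandas as pd

def writes_propagate_to_parent():
    """Return True if writing through a selected column also changes the parent frame."""
    probe = pd.DataFrame({'x': [1, 2, 3]})
    col = probe['x']
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # silence any version-specific warnings
        col.iloc[0] = -1
    return bool(probe['x'].iloc[0] == -1)

if writes_propagate_to_parent():
    print("Legacy view semantics: writes through a column can reach the parent")
else:
    print("Copy-on-Write semantics: the parent frame is protected")
```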

Summary: Golden Rules

Situation	Safe Pattern
Modify subset	`df.loc[condition, col] = value`
Create independent subset	`df[condition].copy()`
Add column to subset	`subset = df[...].copy(); subset['new'] = ...`
Function that modifies	`def f(df): df = df.copy(); ...`
Chained operations	Method chaining with `.assign()`

When in doubt, use .copy()!