Arrays vs DataFrames¶

Data Structure¶

1. NumPy Array¶

Homogeneous, n-dimensional:

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.dtype)  # All elements same type
print(arr.shape)  # (2, 3)

2. Pandas DataFrame¶

Heterogeneous, labeled, tabular:

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'age': [25, 30],
    'score': [85.5, 92.0]
})
print(df.dtypes)  # Different types per column

3. Key Difference¶

Array: Single dtype, unlabeled DataFrame: Multiple dtypes, labeled

When to Use Each¶

1. Use NumPy When¶

# Matrix operations
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
C = A @ B  # Matrix multiplication

# Numerical computation
data = np.random.randn(1000, 100)
result = data.mean(axis=0)

2. Use Pandas When¶

# Heterogeneous data
df = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=10),
    'category': ['A', 'B'] * 5,
    'value': np.random.rand(10)
})

# Data analysis
summary = df.groupby('category')['value'].mean()

3. Conversion¶

# DataFrame to array
arr = df.values  # or df.to_numpy()

# Array to DataFrame
df = pd.DataFrame(arr, columns=['col1', 'col2'])

Real-World Example¶

1. Financial Data¶

import yfinance as yf

# Returns DataFrame (heterogeneous)
df = yf.download("AAPL", start="2023-01-01")
print(df.dtypes)

# Extract for numpy (homogeneous numerical)
close_prices = df['Close'].values
returns = np.diff(close_prices) / close_prices[:-1]

2. Image Processing¶

from PIL import Image

# Load as array (homogeneous)
img = np.array(Image.open('photo.jpg'))
print(img.shape)  # (height, width, 3)

# Process
img_gray = img.mean(axis=2)

3. Time Series¶

# Use DataFrame for labeled time series
df = pd.DataFrame({
    'timestamp': pd.date_range('2023-01-01', periods=100, freq='H'),
    'temperature': np.random.randn(100) + 20
}).set_index('timestamp')

# Use array for computation
temps = df['temperature'].values
ma = np.convolve(temps, np.ones(5)/5, mode='valid')