Arrays vs DataFrames
Mental Model
NumPy arrays are homogeneous grids optimized for numerical computation -- every element shares one dtype. DataFrames are heterogeneous tables where each column can have a different dtype, plus labeled axes for alignment. Choose NumPy when all data is the same type and speed matters most; choose pandas when you need mixed types, labels, or missing-value handling.
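The two properties named above, a single dtype per array and label-based alignment in pandas, can be seen in a few lines. This is a minimal sketch; the values and the index labels `'a'`, `'b'`, `'c'` are made up for illustration.

```python
import numpy as np
import pandas as pd

# NumPy: mixing an int and a float upcasts the whole array to one dtype.
arr = np.array([1, 2.5])
print(arr.dtype)  # float64

# pandas: arithmetic aligns on labels, not on position.
# Labels present in only one operand produce NaN (a missing value).
s1 = pd.Series([1.0, 2.0], index=['a', 'b'])
s2 = pd.Series([10.0, 20.0], index=['b', 'c'])
total = s1 + s2
print(total)
```

Only the shared label `'b'` gets a numeric result here; `'a'` and `'c'` come back as missing values, which is the alignment and missing-value behavior that plain arrays do not provide.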
Data Structure
1. NumPy Array
Homogeneous, n-dimensional:
```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.dtype)  # All elements same type
print(arr.shape)  # (2, 3)
```
2. Pandas DataFrame
Heterogeneous, labeled, tabular:
```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'age': [25, 30],
    'score': [85.5, 92.0]
})
print(df.dtypes)  # Different types per column
```
3. Key Difference
- Array: single dtype, unlabeled
- DataFrame: multiple dtypes, labeled
When to Use Each
1. Use NumPy When
```python
import numpy as np

# Matrix operations
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
C = A @ B  # Matrix multiplication

# Numerical computation
data = np.random.randn(1000, 100)
result = data.mean(axis=0)  # Column means, shape (100,)
```
2. Use Pandas When
```python
import numpy as np
import pandas as pd

# Heterogeneous data
df = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=10),
    'category': ['A', 'B'] * 5,
    'value': np.random.rand(10)
})

# Data analysis
summary = df.groupby('category')['value'].mean()
```
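3. Conversion
```python
import pandas as pd

# DataFrame to array (to_numpy() is preferred over the older .values)
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
arr = df.to_numpy()

# Array to DataFrame (one column name per array column)
df = pd.DataFrame(arr, columns=['col1', 'col2'])
```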
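One detail of the DataFrame-to-array direction is worth a sketch: because an array has a single dtype, converting a mixed-type DataFrame forces a common dtype. The `df_mixed` table below is a hypothetical example that reuses the name/age/score columns from earlier on this page.

```python
import pandas as pd

df_mixed = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'age': [25, 30],
    'score': [85.5, 92.0]
})

# Strings and numbers together: the only common dtype is object.
all_cols = df_mixed.to_numpy()
print(all_cols.dtype)  # object

# Numeric columns only: int and float are upcast to float64.
numeric = df_mixed[['age', 'score']].to_numpy()
print(numeric.dtype)  # float64
```

Selecting the numeric columns before converting is usually what you want when the array is headed for numerical computation, since object arrays lose most of NumPy's speed advantage.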
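Real-World Example
1. Financial Data
```python
import numpy as np
import yfinance as yf

# Returns a DataFrame (labeled columns, each with its own dtype)
df = yf.download("AAPL", start="2023-01-01")
print(df.dtypes)

# Extract for numpy (homogeneous numerical)
# ravel() flattens the result in case df['Close'] comes back as a
# one-column DataFrame, which can happen depending on the yfinance version
close_prices = df['Close'].to_numpy().ravel()
returns = np.diff(close_prices) / close_prices[:-1]  # Simple period-over-period returns
```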
2. Image Processing
```python
import numpy as np
from PIL import Image

# Load as array (homogeneous)
img = np.array(Image.open('photo.jpg'))
print(img.shape)  # (height, width, 3) for an RGB image

# Process: average the color channels for a simple grayscale
img_gray = img.mean(axis=2)
```
3. Time Series
```python
import numpy as np
import pandas as pd

# Use DataFrame for labeled time series
# ('h' is the hourly alias; uppercase 'H' is deprecated in recent pandas)
df = pd.DataFrame({
    'timestamp': pd.date_range('2023-01-01', periods=100, freq='h'),
    'temperature': np.random.randn(100) + 20
}).set_index('timestamp')

# Use array for computation: 5-point moving average
temps = df['temperature'].to_numpy()
ma = np.convolve(temps, np.ones(5) / 5, mode='valid')
```
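For comparison, the same 5-point moving average can be computed without leaving pandas. The sketch below assumes synthetic data from a seeded generator (not the values above) and checks that `Series.rolling` agrees with the `np.convolve` approach; the difference is that the pandas result keeps the timestamp index and marks the first four positions as missing.

```python
import numpy as np
import pandas as pd

# Synthetic hourly temperatures (illustrative data only).
rng = np.random.default_rng(0)
ts = pd.Series(
    rng.normal(20, 1, size=100),
    index=pd.date_range('2023-01-01', periods=100, freq='h'),
)

# NumPy: 'valid' mode drops the incomplete windows, so the labels are lost.
ma_np = np.convolve(ts.to_numpy(), np.ones(5) / 5, mode='valid')

# pandas: same window, but the result stays aligned to the timestamps.
ma_pd = ts.rolling(window=5).mean()

print(len(ma_np))            # 96
print(ma_pd.isna().sum())    # 4
print(np.allclose(ma_np, ma_pd.dropna().to_numpy()))  # True
```

Which form to prefer depends on what happens next: the array version is convenient for further numerical work, while the rolling version is easier to plot or join back against other time-indexed columns.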
Exercises
Exercise 1. Create a NumPy 2D array and a Pandas DataFrame with the same data. Demonstrate one operation that is easier with a DataFrame (e.g., column selection by name).
Solution to Exercise 1
```python
import pandas as pd
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(arr, columns=['a', 'b', 'c'])
print(df['b'])    # Easy column selection by name
print(arr[:, 1])  # The array equivalent requires the integer position
```
Exercise 2. Explain three advantages of Pandas DataFrames over NumPy arrays for tabular data analysis.
Solution to Exercise 2
- Named columns: DataFrames allow accessing columns by name rather than integer index.
- Mixed types: Each column can have a different dtype (int, float, string, etc.).
- Built-in methods: DataFrames provide `groupby`, `merge`, `pivot`, and other high-level data manipulation methods not available in NumPy.
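The three advantages listed above can be seen together in a short sketch. The `sales` and `regions` tables are hypothetical data invented for this illustration.

```python
import pandas as pd

# Mixed types: a string column next to an integer column.
sales = pd.DataFrame({
    'store': ['A', 'B', 'A', 'B'],
    'units': [3, 5, 2, 4]
})
regions = pd.DataFrame({
    'store': ['A', 'B'],
    'region': ['North', 'South']
})

# Named columns + built-in methods: group by a label, then sum.
totals = sales.groupby('store')['units'].sum()
print(totals)

# merge joins two tables on a shared column, like a SQL join.
merged = sales.merge(regions, on='store')
print(merged)
```

Doing the same with plain arrays would mean tracking column positions by hand and writing the grouping and join logic yourself.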
Exercise 3. Write code that converts a NumPy array to a DataFrame with named columns, and then converts a DataFrame back to a NumPy array using .to_numpy().
Solution to Exercise 3
```python
import pandas as pd
import numpy as np

arr = np.array([[1, 2], [3, 4], [5, 6]])
df = pd.DataFrame(arr, columns=['x', 'y'])
print(df)

arr_back = df.to_numpy()
print(arr_back)
```
Exercise 4. Create a DataFrame with mixed types (int, float, string) and show that df.dtypes reports different dtypes per column, while a NumPy array would coerce all to one type.
Solution to Exercise 4
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'int_col': [1, 2], 'float_col': [1.5, 2.5], 'str_col': ['a', 'b']})
print(df.dtypes)
# int64, float64, object -- different types per column
# (newer pandas versions may report a string dtype instead of object)

arr = np.array([[1, 1.5, 'a'], [2, 2.5, 'b']])
print(arr.dtype)  # All coerced to '<U32' (string)
```