DataFrame Architecture
Mental Model
Under the hood, a DataFrame is not a single 2D array. It is a column-oriented container whose data lives in typed arrays managed by a BlockManager, which typically groups columns of the same dtype into shared blocks. Each column can therefore have its own dtype, and adding a column or deleting one in place can usually be done without copying the remaining columns. This columnar layout explains why column operations are fast and row iteration is slow.
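As a rough illustration of column operations versus row iteration, the sketch below computes the same product both ways. The frame and the names `vectorized`, `looped`, and `first_row` are made up for the example; it uses only public pandas calls and does not inspect the BlockManager itself.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4.0, 5.0, 6.0]})

# Column-wise: one vectorized operation over two whole columns.
vectorized = (df["A"] * df["B"]).tolist()

# Row-wise: iterrows() builds a new Series for every row, which is
# much slower on large frames and upcasts each row to a common dtype.
looped = [row["A"] * row["B"] for _, row in df.iterrows()]

first_row = next(df.iterrows())[1]
print(vectorized == looped)  # both approaches give the same values
print(first_row.dtype)       # one dtype shared by the whole row
```

The row loop gives the same answer here, but it pays the cost of materializing a Series per row, which is why vectorized column expressions are generally preferred.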
Columnar Design
1. Structure
A DataFrame is a dict-like container of Series:
```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4.0, 5.0, 6.0],
    'C': ['x', 'y', 'z']
})
```
2. Columns as Series
```python
col_a = df['A']
print(type(col_a))  # Series
print(col_a.dtype)  # int64

col_c = df['C']
print(col_c.dtype)  # object
```
3. Heterogeneous Types
Each column has its own dtype:
```python
print(df.dtypes)
# A      int64
# B    float64
# C     object
```
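One way to see the per-column dtypes in practice, sketched here with an illustrative frame: `select_dtypes` picks columns by dtype, while converting the whole frame to a single NumPy array forces one common dtype.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [4.0, 5.0, 6.0],
    "C": ["x", "y", "z"],
})

# Keep only the numeric columns (A and B).
numeric = df.select_dtypes(include="number")
print(numeric.columns.tolist())

# A single 2D array needs one dtype, so mixed columns fall back to object.
print(df.to_numpy().dtype)
```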
Construction
1. From Dict
```python
data = {
    'name': ['Alice', 'Bob'],
    'age': [25, 30],
    'salary': [50000, 60000]
}
df = pd.DataFrame(data)
```
2. From Lists
```python
data = [
    [1, 4, 'x'],
    [2, 5, 'y'],
    [3, 6, 'z']
]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
```
3. From Records
```python
df = pd.DataFrame([
    {'name': 'Alice', 'age': 25},
    {'name': 'Bob', 'age': 30}
])
```
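As a quick check that the three construction routes describe the same table, the sketch below builds one small frame each way and compares the results. The variable names are illustrative.

```python
import pandas as pd

from_dict = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]})
from_lists = pd.DataFrame([["Alice", 25], ["Bob", 30]], columns=["name", "age"])
from_records = pd.DataFrame([
    {"name": "Alice", "age": 25},
    {"name": "Bob", "age": 30},
])

# assert_frame_equal raises AssertionError if values, labels, or dtypes differ.
pd.testing.assert_frame_equal(from_dict, from_lists)
pd.testing.assert_frame_equal(from_dict, from_records)
```

The dict form is column-oriented, while the list and record forms are row-oriented; pandas converts all of them to the same columnar result.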
Indexing
1. Column Selection
```python
df['A']         # Single column (Series)
df[['A', 'B']]  # Multiple columns (DataFrame)
```
2. Row Selection
```python
df.loc[0]    # By label
df.iloc[0]   # By position
df.loc[0:2]  # Slice by label (both endpoints included)
```
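A label slice with `.loc` does not behave like an ordinary Python slice. A small sketch, using an illustrative frame with the default integer index:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4]})

by_label = df.loc[0:2]      # label slice: includes both 0 and 2
by_position = df.iloc[0:2]  # position slice: stops before position 2

print(len(by_label), len(by_position))
```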
3. Boolean Indexing
```python
df[df['age'] > 25]  # Filter rows
```
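Boolean indexing works in two steps: the comparison produces a boolean Series, and indexing with it keeps the rows marked True. A minimal sketch with made-up data follows; conditions are combined with `&` or `|`, and each condition needs its own parentheses.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35]})

mask = df["age"] > 25  # boolean Series aligned to df's index
print(mask.tolist())

older = df[mask]       # rows where mask is True
both = df[(df["age"] > 25) & (df["name"] != "Bob")]
print(older["name"].tolist(), both["name"].tolist())
```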
Operations
1. Column-wise
```python
df['D'] = df['A'] + df['B']         # New column
df.drop('D', axis=1, inplace=True)  # Remove column
```
2. Row-wise
```python
df.loc[3] = [4, 7, 'w']  # Add row
df = df.drop(3)          # Remove row
```
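Assigning rows one at a time with `.loc` is fine for small edits, but each assignment enlarges the frame. When several rows arrive together, one commonly used alternative is to build a second frame and combine the two with `pd.concat`. The data below is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4.0, 5.0, 6.0], "C": ["x", "y", "z"]})
new_rows = pd.DataFrame({"A": [4, 5], "B": [7.0, 8.0], "C": ["w", "v"]})

# ignore_index=True renumbers the result 0..n-1 instead of repeating labels.
combined = pd.concat([df, new_rows], ignore_index=True)
print(combined.shape)
```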
3. Aggregation
```python
df.sum()                    # Sum each column
df.mean(numeric_only=True)  # Mean of each numeric column (pandas 2.x raises TypeError on string columns otherwise)
df.describe()               # Summary statistics
```
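A small sketch of how these reductions behave on a mixed-dtype frame (illustrative data). `numeric_only=True` restricts the reduction to numeric columns, and `axis=1` reduces across each row instead of down each column.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4.0, 5.0, 6.0], "C": ["x", "y", "z"]})

col_means = df.mean(numeric_only=True)  # one value per numeric column
row_sums = df[["A", "B"]].sum(axis=1)   # one value per row

print(col_means.to_dict())
print(row_sums.tolist())
```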
Exercises
Exercise 1. Create a DataFrame from a dictionary with three columns and five rows. Print its shape, columns, and dtypes.
Solution to Exercise 1
```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age': [25, 30, 35, 40, 28],
    'score': [85, 92, 78, 95, 88]
})
print(f'Shape: {df.shape}')
print(f'Columns: {df.columns.tolist()}')
print(f'Dtypes:\n{df.dtypes}')
```
Exercise 2. Explain the relationship between a DataFrame and its constituent Series objects. How do you extract a single column as a Series?
Solution to Exercise 2
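A DataFrame behaves like a dict-like container of Series that share one row index. Each column is a Series with its own dtype, and the column name is the key. Selecting a single label, as in `df['A']`, returns that column as a Series; selecting with a list of labels, as in `df[['A', 'B']]`, returns a DataFrame.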
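The relationship between a DataFrame and its columns can be sketched as follows, using an illustrative frame: a single label returns a Series, a list of labels returns a DataFrame, and the extracted Series shares the frame's row index.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4.0, 5.0, 6.0]})

as_series = df["A"]   # single label -> Series
as_frame = df[["A"]]  # list of labels -> one-column DataFrame

print(type(as_series).__name__, type(as_frame).__name__)
print(as_series.index.equals(df.index))  # the column shares the frame's row index
```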
Exercise 3. Write code that creates a DataFrame and accesses rows using .loc[] (label-based) and .iloc[] (position-based). Show the difference.
Solution to Exercise 3
```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'score': [85, 92, 78]
})

# Label-based
print(df.loc[0])

# Position-based
print(df.iloc[-1])
```
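With the default 0..n-1 index, labels and positions coincide, so the difference is easier to see on a frame whose labels are not in positional order. The index values below are an assumption chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35]},
    index=[2, 1, 0],
)

print(df.loc[0, "name"])   # row whose label is 0
print(df.iloc[0]["name"])  # row at position 0
```

Here `.loc[0]` finds the row labelled 0, which is stored last, while `.iloc[0]` always returns the first stored row.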
Exercise 4. Create a DataFrame and add a new column computed from existing columns. Then delete a column using del or .drop().
Solution to Exercise 4
```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df['c'] = df['a'] + df['b']
df = df.drop(columns=['b'])
print(df)
```
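The exercise also mentions `del`. A sketch of that variant: `del df['b']` removes the column from the existing frame in place, whereas `.drop()` returns a new frame by default.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df["c"] = df["a"] + df["b"]

del df["b"]  # in-place removal of a single column
print(df.columns.tolist())
```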