Skip to content

DataFrame Architecture

Mental Model

Under the hood, a DataFrame is not a 2D array -- it is a collection of 1D arrays (one per column) managed by a BlockManager. Each column can have a different dtype, and operations like adding or dropping columns rearrange pointers rather than copying data. Understanding this columnar layout explains why column operations are fast and row iterations are slow.

Columnar Design

1. Structure

DataFrame is a dict-like container of Series:

```python import pandas as pd

df = pd.DataFrame({ 'A': [1, 2, 3], 'B': [4.0, 5.0, 6.0], 'C': ['x', 'y', 'z'] }) ```

2. Columns as Series

```python col_a = df['A'] print(type(col_a)) # Series print(col_a.dtype) # int64

col_c = df['C'] print(col_c.dtype) # object ```

3. Heterogeneous Types

Each column has its own dtype:

```python print(df.dtypes)

A int64

B float64

C object

```

Construction

1. From Dict

python data = { 'name': ['Alice', 'Bob'], 'age': [25, 30], 'salary': [50000, 60000] } df = pd.DataFrame(data)

2. From Lists

python data = [ [1, 4, 'x'], [2, 5, 'y'], [3, 6, 'z'] ] df = pd.DataFrame(data, columns=['A', 'B', 'C'])

3. From Records

python df = pd.DataFrame([ {'name': 'Alice', 'age': 25}, {'name': 'Bob', 'age': 30} ])

Indexing

1. Column Selection

python df['A'] # Single column (Series) df[['A', 'B']] # Multiple columns (DataFrame)

2. Row Selection

python df.loc[0] # By label df.iloc[0] # By position df.loc[0:2] # Slice by label

3. Boolean Indexing

python df[df['age'] > 25] # Filter rows

Operations

1. Column-wise

python df['D'] = df['A'] + df['B'] # New column df.drop('D', axis=1, inplace=True) # Remove column

2. Row-wise

python df.loc[3] = [4, 7, 'w'] # Add row df = df.drop(3) # Remove row

3. Aggregation

python df.sum() # Sum each column df.mean() # Mean each column df.describe() # Summary statistics


Exercises

Exercise 1. Create a DataFrame from a dictionary with three columns and five rows. Print its shape, columns, and dtypes.

Solution to Exercise 1

```python import pandas as pd

df = pd.DataFrame({ 'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'], 'age': [25, 30, 35, 40, 28], 'score': [85, 92, 78, 95, 88] }) print(f'Shape: {df.shape}') print(f'Columns: {df.columns.tolist()}') print(f'Dtypes:\n{df.dtypes}') ```


Exercise 2. Explain the relationship between a DataFrame and its constituent Series objects. How do you extract a single column as a Series?

Solution to Exercise 2

See the main content for the detailed explanation. The key concept involves understanding the Pandas data structures and their relationships.


Exercise 3. Write code that creates a DataFrame and accesses rows using .loc[] (label-based) and .iloc[] (position-based). Show the difference.

Solution to Exercise 3

```python import pandas as pd

df = pd.DataFrame({ 'name': ['Alice', 'Bob', 'Charlie'], 'age': [25, 30, 35], 'score': [85, 92, 78] })

Label-based

print(df.loc[0])

Position-based

print(df.iloc[-1]) ```


Exercise 4. Create a DataFrame and add a new column computed from existing columns. Then delete a column using del or .drop().

Solution to Exercise 4

```python import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}) df['c'] = df['a'] + df['b'] df = df.drop(columns=['b']) print(df) ```