Series vs DataFrame

Understanding the relationship between Series and DataFrame is fundamental to working effectively with pandas. This document clarifies when to use each and how they interact.

Mental Model

A Series is one column; a DataFrame is a collection of columns sharing the same row index. Selecting a single column from a DataFrame returns a Series, and combining multiple Series side-by-side creates a DataFrame. The two types are not separate worlds -- they are the 1D and 2D faces of the same design.
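The mental model can be checked directly. The following is a minimal sketch using made-up data (the names `height`, `weight`, and `people` are illustrative assumptions, not part of the original text): two Series are combined side by side into a DataFrame, and one column is then pulled back out as a Series.

```python
import pandas as pd

# Hypothetical example data, chosen only for illustration
height = pd.Series([1.62, 1.75, 1.80], name="height_m")
weight = pd.Series([55, 72, 90], name="weight_kg")

# Combining Series side by side (axis=1) yields a DataFrame;
# each Series name becomes a column label.
people = pd.concat([height, weight], axis=1)
print(type(people).__name__)           # DataFrame
print(list(people.columns))            # ['height_m', 'weight_kg']

# Selecting one column hands back a Series that shares the row index.
col = people["height_m"]
print(type(col).__name__)              # Series
print(col.index.equals(people.index))  # True
```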

Structural Comparison

┌─────────────────────────────────────────────────────────────┐
│                          DataFrame                          │
│  ┌─────────┬─────────┬─────────┬─────────┐                  │
│  │ Series  │ Series  │ Series  │ Series  │  ← Columns       │
│  │ (col A) │ (col B) │ (col C) │ (col D) │                  │
│  ├─────────┼─────────┼─────────┼─────────┤                  │
│  │  1.0    │  'foo'  │  True   │  100    │  ← Row 0         │
│  │  2.0    │  'bar'  │  False  │  200    │  ← Row 1         │
│  │  3.0    │  'baz'  │  True   │  300    │  ← Row 2         │
│  └─────────┴─────────┴─────────┴─────────┘                  │
│       ↑         ↑         ↑         ↑                       │
│    float64   object      bool     int64   ← dtype per col   │
└─────────────────────────────────────────────────────────────┘

Aspect	Series	DataFrame
Dimensions	1D (single column)	2D (multiple columns)
Data types	Single dtype	Different dtype per column
Analogy	Excel column	Excel spreadsheet
NumPy equivalent	1D array	2D array (but each column may have its own dtype)

Type Transitions

Knowing which operations change the type, and which preserve it, prevents many surprises.

DataFrame to Series

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
})

# Single column selection -> Series
col_a = df['A']
print(type(col_a))  # <class 'pandas.core.series.Series'>

# Single row selection -> Series
row_0 = df.iloc[0]
print(type(row_0))  # <class 'pandas.core.series.Series'>

# Aggregation -> Series
col_means = df.mean()
print(type(col_means))  # <class 'pandas.core.series.Series'>
```
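One consequence of row selection is worth seeing in isolation: the resulting Series is indexed by the column labels, and because a Series has a single dtype, a row drawn from mixed columns has to fall back to a common one. The sketch below uses a hypothetical frame named `mixed` to illustrate this.

```python
import pandas as pd

# Hypothetical mixed-dtype frame, for illustration only
mixed = pd.DataFrame({"n": [1, 2], "x": [1.5, 2.5], "label": ["a", "b"]})

# A row becomes a Series whose index is the column labels.
row = mixed.iloc[0]
print(list(row.index))    # ['n', 'x', 'label']
print(row.dtype)          # object

# With only numeric columns, the row is upcast to a common numeric dtype.
numeric_row = mixed[["n", "x"]].iloc[0]
print(numeric_row.dtype)  # float64
```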

Preserving DataFrame Type

```python
# Double brackets preserve DataFrame
col_a_df = df[['A']]
print(type(col_a_df))  # <class 'pandas.core.frame.DataFrame'>
print(col_a_df.shape)  # (3, 1)

# Multiple column selection -> DataFrame
subset = df[['A', 'B']]
print(type(subset))  # <class 'pandas.core.frame.DataFrame'>
```

Series to DataFrame

```python
s = pd.Series([1, 2, 3], name='values')

# to_frame() method
df = s.to_frame()
print(type(df))  # <class 'pandas.core.frame.DataFrame'>

# reset_index() also creates DataFrame
df = s.reset_index()
print(df.columns)  # Index(['index', 'values'])
```

Shape Differences

```python
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

# Full DataFrame
print(df.shape)  # (891, 12)

# Multiple columns -> DataFrame
print(df[["Survived", "Sex"]].shape)  # (891, 2)

# Single column with double brackets -> DataFrame
print(df[["Survived"]].shape)  # (891, 1)

# Single column with single brackets -> Series
print(df["Survived"].shape)  # (891,) - Note: one-element tuple, because a Series is 1D
```

Access Patterns

Equivalent Operations

Operation	DataFrame Syntax	Series Syntax
Get element	`df.loc[row, col]`	`s[label]` or `s.loc[label]`
Get by position	`df.iloc[i, j]`	`s.iloc[i]`
Boolean filter	`df[df['A'] > 0]`	`s[s > 0]`
Get values	`df.values`	`s.values`
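The rows of this table can be exercised side by side. The following is a small sketch on a hypothetical frame with string row labels (the data and labels are assumptions chosen for illustration), showing that the Series form of each operation simply drops the column argument.

```python
import pandas as pd

# Hypothetical data with string row labels
df = pd.DataFrame({"A": [-1, 2, 3], "B": [4, 5, 6]}, index=["x", "y", "z"])
s = df["A"]

# Label-based access
print(df.loc["y", "B"])       # 5
print(s.loc["y"])             # 2

# Position-based access
print(df.iloc[2, 0])          # 3
print(s.iloc[2])              # 3

# Boolean filtering keeps the rows (DataFrame) or elements (Series)
# where the mask is True.
print(df[df["A"] > 0].shape)  # (2, 2)
print(s[s > 0].tolist())      # [2, 3]
```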

Column-wise vs Element-wise

```python
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# DataFrame aggregation is column-wise by default
print(df.sum())
# A     6
# B    15
# dtype: int64

s = pd.Series([1, 2, 3])

# Series aggregation reduces all elements to a single scalar
print(s.sum())  # 6
```
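The default direction can be changed with the `axis` parameter. As a brief sketch on the same small frame, the example below contrasts the default column-wise sum with a row-wise sum; both results are Series, and only the index differs.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

col_totals = df.sum()         # axis=0 (default): one value per column
row_totals = df.sum(axis=1)   # axis=1: one value per row

print(col_totals.tolist())    # [6, 15]
print(row_totals.tolist())    # [5, 7, 9]

# The index tells you which direction was reduced.
print(list(col_totals.index)) # ['A', 'B']
print(list(row_totals.index)) # [0, 1, 2]
```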

Method Behavior Differences

Aggregations

```python
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
s = pd.Series([1, 2, 3])

# DataFrame.mean() returns Series (one value per column)
print(df.mean())
# A    2.0
# B    5.0
# dtype: float64

# Series.mean() returns scalar
print(s.mean())  # 2.0
```

Apply Behavior

```python
# DataFrame apply works on columns (axis=0) or rows (axis=1)
df.apply(sum, axis=0)  # Sum each column
df.apply(sum, axis=1)  # Sum each row

# Series apply works element-wise
s.apply(lambda x: x ** 2)  # Square each element
```
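What the applied function receives is the key difference. The sketch below (the lambda functions are illustrative choices, not from the original) shows that `DataFrame.apply` passes a whole Series per column or per row, whereas `Series.apply` passes one scalar at a time.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
s = pd.Series([1, 2, 3])

# axis=0: the function receives each column as a Series.
col_ranges = df.apply(lambda col: col.max() - col.min(), axis=0)
print(col_ranges.tolist())   # [2, 2]

# axis=1: the function receives each row as a Series.
row_ranges = df.apply(lambda row: row.max() - row.min(), axis=1)
print(row_ranges.tolist())   # [3, 3, 3]

# Series.apply: the function receives one scalar at a time.
squares = s.apply(lambda x: x ** 2)
print(squares.tolist())      # [1, 4, 9]
```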

Common Conversion Patterns

Aggregation Results

```python
# Assumes df has 'category' and 'value' columns

# Aggregating a single grouped column returns a Series by default
result = df.groupby('category')['value'].sum()
print(type(result))  # Series

# Convert to DataFrame with reset_index
result_df = df.groupby('category')['value'].sum().reset_index()
print(type(result_df))  # DataFrame

# Or use to_frame with custom column name
result_df = df.groupby('category')['value'].sum().to_frame(name='total')
```
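The snippet above assumes a frame with `category` and `value` columns. The following sketch supplies hypothetical data with those names so the pattern can be run, and also shows `as_index=False` as one alternative way to get a DataFrame without a separate conversion step.

```python
import pandas as pd

# Hypothetical data using the column names from the snippet above
df = pd.DataFrame({
    "category": ["a", "b", "a", "b"],
    "value": [10, 20, 30, 40],
})

as_series = df.groupby("category")["value"].sum()
print(type(as_series).__name__)   # Series
print(as_series.to_dict())        # {'a': 40, 'b': 60}

# as_index=False keeps the group key as a regular column,
# so the result is already a DataFrame.
as_frame = df.groupby("category", as_index=False)["value"].sum()
print(type(as_frame).__name__)    # DataFrame
print(list(as_frame.columns))     # ['category', 'value']
```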

Value Counts

```python
s = pd.Series(['a', 'b', 'a', 'c', 'a', 'b'])

# value_counts returns Series
counts = s.value_counts()
print(type(counts))  # Series

# Convert to DataFrame
counts_df = s.value_counts().reset_index()
counts_df.columns = ['value', 'count']
```
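Overwriting `columns` after the fact works, but the default names produced by `reset_index()` have differed between pandas versions. A possible alternative, sketched here, names both parts explicitly in the chain so the result does not depend on those defaults.

```python
import pandas as pd

s = pd.Series(["a", "b", "a", "c", "a", "b"])

counts_df = (
    s.value_counts()
     .rename_axis("value")        # name the index that holds the distinct values
     .reset_index(name="count")   # name the column that holds the counts
)
print(list(counts_df.columns))      # ['value', 'count']
print(counts_df["value"].tolist())  # ['a', 'b', 'c']
print(counts_df["count"].tolist())  # [3, 2, 1]
```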

Practical Guidelines

When to Use Series

Working with a single variable
Time series of one measurement
Result of column extraction
Input to plotting functions expecting 1D data

```python
# Time series analysis
prices = df['Close']  # Series
returns = prices.pct_change()
rolling_mean = prices.rolling(20).mean()
```
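The snippet above presumes a frame with a `Close` column. As a self-contained sketch, the example below builds a tiny Series of made-up closing prices (the dates and values are assumptions) and uses a window of 3 rather than 20 so that the rolling mean produces values on such a short sample.

```python
import pandas as pd

# Hypothetical closing prices; dates and values are made up for illustration
prices = pd.Series(
    [100.0, 102.0, 101.0, 105.0, 110.0],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
    name="Close",
)

returns = prices.pct_change()            # first value is NaN (no previous price)
rolling_mean = prices.rolling(3).mean()  # window of 3 for this tiny sample

print(round(returns.iloc[1], 4))           # 0.02
print(round(rolling_mean.iloc[2], 2))      # 101.0
print(returns.index.equals(prices.index))  # True
```

Every derived Series keeps the original date index, which is what lets the results line up with the prices they came from.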

When to Use DataFrame

Multiple variables that should stay aligned
Tabular data with different column types
Data requiring row-wise operations
Input/output for file operations

```python
# Multi-asset analysis
portfolio = df[['AAPL', 'MSFT', 'GOOGL']]  # DataFrame
correlations = portfolio.corr()
portfolio_returns = portfolio.pct_change()
```

Avoiding Common Mistakes

```python
# WRONG: Expecting DataFrame, getting Series
col = df['price']  # This is a Series!
col.columns  # AttributeError: 'Series' object has no attribute 'columns'

# RIGHT: Keep as DataFrame if needed
col = df[['price']]  # This is a DataFrame
col.columns  # Index(['price'], dtype='object')

# WRONG: Chained assignment warning
df[df['A'] > 0]['B'] = 1  # Assigns to a temporary copy, so df is not updated

# RIGHT: Use loc for assignment
df.loc[df['A'] > 0, 'B'] = 1
```
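The `.loc` form can be verified on a small frame. This is a minimal sketch with hypothetical data, showing that a single `.loc` call selecting rows and column together updates the original frame, and that double brackets keep a one-column result two-dimensional.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({"A": [-1, 2, 3], "B": [4, 5, 6]})

# df[mask] builds a new DataFrame, so assigning into it would not
# reach df. A single .loc call selects rows and column at once.
mask = df["A"] > 0
df.loc[mask, "B"] = 1
print(df["B"].tolist())   # [4, 1, 1]

# Double brackets keep a one-column selection as a DataFrame.
print(df[["B"]].shape)    # (3, 1)
print(df["B"].shape)      # (3,)
```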

Performance Considerations

Operation	Series	DataFrame
Memory	Lower (single dtype)	Higher (metadata per column)
Iteration	Faster	Slower
Vectorized ops	Optimal	Optimal
Type consistency	Guaranteed	Per-column

For large-scale numerical operations, extracting to NumPy arrays may provide additional performance benefits:

```python
import numpy as np

# Extract for numerical operations
arr = df['price'].values  # NumPy array
result = np.sqrt(arr)     # Fast NumPy operation

# Put back into pandas if needed
df['price_sqrt'] = result
```
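Extraction is not always necessary. As a hedged sketch on hypothetical data, the example below contrasts `to_numpy()` (which returns a plain array and drops the index) with calling a NumPy ufunc on the Series directly, which returns a Series and keeps the index. Whether extraction is actually faster depends on the workload and is worth measuring.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with string row labels
df = pd.DataFrame({"price": [1.0, 4.0, 9.0]}, index=["a", "b", "c"])

arr = df["price"].to_numpy()   # plain ndarray, index is dropped
print(type(arr).__name__)      # ndarray
print(np.sqrt(arr).tolist())   # [1.0, 2.0, 3.0]

# NumPy ufuncs also accept a Series directly and keep its index.
roots = np.sqrt(df["price"])
print(type(roots).__name__)    # Series
print(list(roots.index))       # ['a', 'b', 'c']
```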

Runnable Example: `python_numpy_pandas_comparison.py`

```python
"""
Three Ways: Pure Python vs NumPy vs Pandas for Data Analysis

This tutorial solves the same sales data analysis problem three ways,
showing why NumPy and Pandas exist and how they simplify data work.

Problem: Analyze monthly sales data across 5 regions.
Tasks:
  1. Monthly total sales
  2. Month-over-month growth rate
  3. Annual sales by region (sorted)
  4. Find peak sales (month + region)
  5. Find most volatile region (variance)

Based on Python-100-Days Day66-80 day01.ipynb cells.
"""

import numpy as np
import pandas as pd

# =============================================================================
# Setup: Sales Data (12 months x 5 regions, in millions)
# =============================================================================

months = [f'{i:>2d}' for i in range(1, 13)]
regions = ['East', 'West', 'North', 'South', 'Central']
sales_data = [
    [32, 17, 12, 20, 28],
    [41, 30, 17, 15, 35],
    [35, 18, 13, 11, 24],
    [12, 42, 44, 21, 34],
    [29, 11, 42, 32, 50],
    [10, 15, 11, 12, 26],
    [16, 28, 48, 22, 28],
    [31, 40, 45, 30, 39],
    [25, 41, 47, 42, 47],
    [47, 21, 13, 49, 48],
    [41, 36, 17, 36, 22],
    [22, 25, 15, 20, 37],
]

# =============================================================================
# Way 1: Pure Python (loops and comprehensions)
# =============================================================================

def pure_python_analysis():
    """Analyze sales data using only Python builtins."""
    print("=" * 50)
    print("WAY 1: Pure Python")
    print("=" * 50)

    # Task 1: Monthly totals
    monthly_totals = [sum(row) for row in sales_data]
    print("\n--- Monthly Totals ---")
    for m, total in zip(months, monthly_totals):
        print(f"  Month {m}: {total}M")

    # Task 2: Month-over-month growth
    print("\n--- Month-over-Month Growth ---")
    for i in range(1, len(monthly_totals)):
        growth = (monthly_totals[i] - monthly_totals[i-1]) / monthly_totals[i-1]
        print(f"  Month {months[i]}: {growth:>+.2%}")

    # Task 3: Annual sales by region (sorted)
    region_totals = {}
    for j, region in enumerate(regions):
        region_totals[region] = sum(sales_data[i][j] for i in range(12))
    sorted_regions = sorted(region_totals, key=lambda r: region_totals[r], reverse=True)
    print("\n--- Annual Sales by Region (sorted) ---")
    for r in sorted_regions:
        print(f"  {r}: {region_totals[r]}M")

    # Task 4: Peak sales
    max_val, max_month, max_region = 0, 0, 0
    for i in range(len(months)):
        for j in range(len(regions)):
            if sales_data[i][j] > max_val:
                max_val = sales_data[i][j]
                max_month, max_region = i, j
    print(f"\n--- Peak Sales ---")
    print(f"  Month {months[max_month]}, {regions[max_region]}: {max_val}M")

    # Task 5: Most volatile region (population variance)
    print("\n--- Most Volatile Region ---")
    max_var, most_volatile = 0, ""
    for j, region in enumerate(regions):
        values = [sales_data[i][j] for i in range(12)]
        avg = sum(values) / len(values)
        var = sum((x - avg) ** 2 for x in values) / len(values)
        if var > max_var:
            max_var, most_volatile = var, region
    print(f"  {most_volatile} (variance: {max_var:.1f})")
    print()

# =============================================================================
# Way 2: NumPy (vectorized operations with axis)
# =============================================================================

def numpy_analysis():
    """Same analysis using NumPy - vectorized, no loops over the data."""
    print("=" * 50)
    print("WAY 2: NumPy")
    print("=" * 50)

    data = np.array(sales_data)
    print(f"\nArray shape: {data.shape}  (12 months x 5 regions)")

    # Task 1: Monthly totals - sum across columns (axis=1), one total per row
    monthly_totals = data.sum(axis=1)
    print(f"\n--- Monthly Totals (axis=1) ---")
    print(f"  {monthly_totals}")

    # Task 2: Month-over-month growth
    mom = np.diff(monthly_totals) / monthly_totals[:-1]
    print(f"\n--- MoM Growth ---")
    print(f"  {np.round(mom * 100, 1)}%")

    # Task 3: Annual by region - sum down the rows (axis=0), one total per column
    region_totals = data.sum(axis=0)
    sorted_idx = np.argsort(region_totals)[::-1]
    print(f"\n--- Annual Sales by Region (sorted) ---")
    for idx in sorted_idx:
        print(f"  {regions[idx]}: {region_totals[idx]}M")

    # Task 4: Peak sales - argmax on flattened then unravel
    flat_idx = data.argmax()
    peak_month, peak_region = np.unravel_index(flat_idx, data.shape)
    print(f"\n--- Peak Sales ---")
    print(f"  Month {months[peak_month]}, {regions[peak_region]}: "
          f"{data[peak_month, peak_region]}M")

    # Task 5: Most volatile - variance along axis=0 (population variance, ddof=0)
    variances = data.var(axis=0)
    most_volatile = np.argmax(variances)
    print(f"\n--- Most Volatile Region ---")
    print(f"  {regions[most_volatile]} (variance: {variances[most_volatile]:.1f})")
    print(f"  All variances: {np.round(variances, 1)}")
    print()

# =============================================================================
# Way 3: Pandas (labeled data, built-in methods)
# =============================================================================

def pandas_analysis():
    """Same analysis using Pandas - labeled, expressive, chainable."""
    print("=" * 50)
    print("WAY 3: Pandas")
    print("=" * 50)

    df = pd.DataFrame(sales_data, columns=regions,
                      index=[f'Month {m}' for m in months])
    print(f"\n{df}\n")

    # Task 1: Monthly totals
    print("--- Monthly Totals (df.sum(axis=1)) ---")
    print(df.sum(axis=1))
    print()

    # Task 2: Month-over-month with pct_change()
    print("--- MoM Growth (pct_change()) ---")
    print(df.sum(axis=1).pct_change().dropna().map('{:.2%}'.format))
    print()

    # Task 3: Annual by region (sorted)
    print("--- Annual Sales by Region (sorted) ---")
    print(df.sum().sort_values(ascending=False))
    print()

    # Task 4: Peak sales with idxmax on stacked DataFrame
    stacked = df.stack()
    peak_idx = stacked.idxmax()
    print(f"--- Peak Sales ---")
    print(f"  {peak_idx[0]}, {peak_idx[1]}: {stacked[peak_idx]}M")
    print()

    # Task 5: Most volatile
    print("--- Most Volatile Region (var()) ---")
    variances = df.var(ddof=0)
    print(f"  {variances.idxmax()} (variance: {variances.max():.1f})")
    print(f"  All variances:\n{variances.round(1)}")
    print()

# =============================================================================
# Comparison Summary
# =============================================================================

def comparison_summary():
    """Compare the three approaches."""
    print("=" * 50)
    print("COMPARISON SUMMARY")
    print("=" * 50)
    print("""
    Pure Python:
      + No dependencies
      + Easy to understand
      - Verbose (many loops)
      - Slow on large data

    NumPy:
      + Fast (vectorized C operations)
      + Concise (axis-based operations)
      - Integer indexing only (no labels)
      - Homogeneous dtype

    Pandas:
      + Labeled data (named rows/columns)
      + Rich methods (pct_change, describe, groupby)
      + Handles mixed types and missing values
      + Great for tabular data
      - More memory overhead
      - Learning curve for API
    """)

# =============================================================================
# Main
# =============================================================================

if __name__ == '__main__':
    pure_python_analysis()
    numpy_analysis()
    pandas_analysis()
    comparison_summary()
```

Exercises

Exercise 1. Create a DataFrame with 3 columns. Extract one column as a Series using df['col'] and as a single-column DataFrame using df[['col']]. Print the type and shape of each.

Solution to Exercise 1

Compare Series vs single-column DataFrame.

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})
series = df['A']
dataframe = df[['A']]
print(f"Series type: {type(series)}, shape: {series.shape}")
print(f"DataFrame type: {type(dataframe)}, shape: {dataframe.shape}")

Exercise 2. Create a Series and convert it to a DataFrame using .to_frame(). Then create a DataFrame and extract a row as a Series using .loc[]. Observe how the index of the resulting Series corresponds to the column names.

Solution to Exercise 2

Convert between Series and DataFrame.

import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'], name='values')
df_from_series = s.to_frame()
print(type(df_from_series))

df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]}, index=['row0', 'row1'])
row_series = df.loc['row0']
print(row_series)
print("Index of row Series:", row_series.index.tolist())

Exercise 3. Demonstrate that a DataFrame can hold columns of different dtypes (int, float, string) while a Series has a single dtype. Create both and use .dtypes (DataFrame) and .dtype (Series) to verify.

Solution to Exercise 3

Compare dtypes in DataFrame vs dtype in Series.

import pandas as pd

df = pd.DataFrame({'ints': [1, 2], 'floats': [1.5, 2.5], 'strings': ['a', 'b']})
print("DataFrame dtypes:\n", df.dtypes)
s = df['ints']
print(f"\nSeries dtype: {s.dtype}")
