Series vs DataFrame¶

Understanding the relationship between Series and DataFrame is fundamental to working effectively with pandas. This document clarifies when to use each and how they interact.

Structural Comparison¶

┌─────────────────────────────────────────────────────────────┐
│                        DataFrame                             │
│  ┌─────────┬─────────┬─────────┬─────────┐                  │
│  │ Series  │ Series  │ Series  │ Series  │  ← Columns       │
│  │ (col A) │ (col B) │ (col C) │ (col D) │                  │
│  ├─────────┼─────────┼─────────┼─────────┤                  │
│  │   1.0   │  'foo'  │  True   │  100    │  ← Row 0        │
│  │   2.0   │  'bar'  │  False  │  200    │  ← Row 1        │
│  │   3.0   │  'baz'  │  True   │  300    │  ← Row 2        │
│  └─────────┴─────────┴─────────┴─────────┘                  │
│      ↑          ↑          ↑          ↑                      │
│   float64    object      bool      int64    ← dtype per col │
└─────────────────────────────────────────────────────────────┘

Aspect	Series	DataFrame
Dimensions	1D (single column)	2D (multiple columns)
Data types	Single dtype	Different dtype per column
Analogy	Excel column	Excel spreadsheet
NumPy equivalent	1D array	2D array (but heterogeneous)

Type Transitions¶

Understanding how operations change the type is crucial.

DataFrame to Series¶

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
})

# Single column selection -> Series
col_a = df['A']
print(type(col_a))  # <class 'pandas.core.series.Series'>

# Single row selection -> Series
row_0 = df.iloc[0]
print(type(row_0))  # <class 'pandas.core.series.Series'>

# Aggregation -> Series
col_means = df.mean()
print(type(col_means))  # <class 'pandas.core.series.Series'>

Preserving DataFrame Type¶

# Double brackets preserve DataFrame
col_a_df = df[['A']]
print(type(col_a_df))  # <class 'pandas.core.frame.DataFrame'>
print(col_a_df.shape)  # (3, 1)

# Multiple column selection -> DataFrame
subset = df[['A', 'B']]
print(type(subset))  # <class 'pandas.core.frame.DataFrame'>

Series to DataFrame¶

s = pd.Series([1, 2, 3], name='values')

# to_frame() method
df = s.to_frame()
print(type(df))  # <class 'pandas.core.frame.DataFrame'>

# reset_index() also creates DataFrame
df = s.reset_index()
print(df.columns)  # Index(['index', 'values'])

Shape Differences¶

url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

# Full DataFrame
print(df.shape)                  # (891, 12)

# Multiple columns -> DataFrame
print(df[["Survived", "Sex"]].shape)  # (891, 2)

# Single column with double brackets -> DataFrame
print(df[["Survived"]].shape)    # (891, 1)

# Single column with single brackets -> Series
print(df["Survived"].shape)      # (891,) - Note: 1D tuple

Access Patterns¶

Equivalent Operations¶

Operation	DataFrame Syntax	Series Syntax
Get element	`df.loc[row, col]`	`s[label]` or `s.loc[label]`
Get by position	`df.iloc[i, j]`	`s.iloc[i]`
Boolean filter	`df[df['A'] > 0]`	`s[s > 0]`
Get values	`df.values`	`s.values`

Column-wise vs Element-wise¶

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# DataFrame aggregation is column-wise by default
print(df.sum())
# A     6
# B    15
# dtype: int64

s = pd.Series([1, 2, 3])

# Series aggregation is element-wise
print(s.sum())  # 6

Method Behavior Differences¶

Aggregations¶

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
s = pd.Series([1, 2, 3])

# DataFrame.mean() returns Series (one value per column)
print(df.mean())
# A    2.0
# B    5.0
# dtype: float64

# Series.mean() returns scalar
print(s.mean())  # 2.0

Apply Behavior¶

# DataFrame apply works on columns (axis=0) or rows (axis=1)
df.apply(sum, axis=0)  # Sum each column
df.apply(sum, axis=1)  # Sum each row

# Series apply works element-wise
s.apply(lambda x: x ** 2)  # Square each element

Common Conversion Patterns¶

Aggregation Results¶

# groupby returns Series by default
result = df.groupby('category')['value'].sum()
print(type(result))  # Series

# Convert to DataFrame with reset_index
result_df = df.groupby('category')['value'].sum().reset_index()
print(type(result_df))  # DataFrame

# Or use to_frame with custom column name
result_df = df.groupby('category')['value'].sum().to_frame(name='total')

Value Counts¶

s = pd.Series(['a', 'b', 'a', 'c', 'a', 'b'])

# value_counts returns Series
counts = s.value_counts()
print(type(counts))  # Series

# Convert to DataFrame
counts_df = s.value_counts().reset_index()
counts_df.columns = ['value', 'count']

Practical Guidelines¶

When to Use Series¶

Working with a single variable
Time series of one measurement
Result of column extraction
Input to plotting functions expecting 1D data

# Time series analysis
prices = df['Close']  # Series
returns = prices.pct_change()
rolling_mean = prices.rolling(20).mean()

When to Use DataFrame¶

Multiple variables that should stay aligned
Tabular data with different column types
Data requiring row-wise operations
Input/output for file operations

# Multi-asset analysis
portfolio = df[['AAPL', 'MSFT', 'GOOGL']]  # DataFrame
correlations = portfolio.corr()
portfolio_returns = portfolio.pct_change()

Avoiding Common Mistakes¶

# WRONG: Expecting DataFrame, getting Series
col = df['price']  # This is a Series!
col.columns  # AttributeError: 'Series' object has no attribute 'columns'

# RIGHT: Keep as DataFrame if needed
col = df[['price']]  # This is a DataFrame
col.columns  # Index(['price'], dtype='object')

# WRONG: Chained assignment warning
df[df['A'] > 0]['B'] = 1  # May not work as expected

# RIGHT: Use loc for assignment
df.loc[df['A'] > 0, 'B'] = 1

Performance Considerations¶

Operation	Series	DataFrame
Memory	Lower (single dtype)	Higher (metadata per column)
Iteration	Faster	Slower
Vectorized ops	Optimal	Optimal
Type consistency	Guaranteed	Per-column

For large-scale numerical operations, extracting to NumPy arrays may provide additional performance benefits:

# Extract for numerical operations
arr = df['price'].values  # NumPy array
result = np.sqrt(arr)     # Fast NumPy operation

# Put back into pandas if needed
df['price_sqrt'] = result

Runnable Example: `python_numpy_pandas_comparison.py`¶

"""
Three Ways: Pure Python vs NumPy vs Pandas for Data Analysis

This tutorial solves the same sales data analysis problem three ways,
showing why NumPy and Pandas exist and how they simplify data work.

Problem: Analyze monthly sales data across 5 regions.
Tasks:
1. Monthly total sales
2. Month-over-month growth rate
3. Annual sales by region (sorted)
4. Find peak sales (month + region)
5. Find most volatile region (variance)

Based on Python-100-Days Day66-80 day01.ipynb cells.
"""

import numpy as np
import pandas as pd


# =============================================================================
# Setup: Sales Data (12 months x 5 regions, in millions)
# =============================================================================

months = [f'{i:>2d}' for i in range(1, 13)]
regions = ['East', 'West', 'North', 'South', 'Central']
sales_data = [
    [32, 17, 12, 20, 28],
    [41, 30, 17, 15, 35],
    [35, 18, 13, 11, 24],
    [12, 42, 44, 21, 34],
    [29, 11, 42, 32, 50],
    [10, 15, 11, 12, 26],
    [16, 28, 48, 22, 28],
    [31, 40, 45, 30, 39],
    [25, 41, 47, 42, 47],
    [47, 21, 13, 49, 48],
    [41, 36, 17, 36, 22],
    [22, 25, 15, 20, 37],
]


# =============================================================================
# Way 1: Pure Python (loops and comprehensions)
# =============================================================================

def pure_python_analysis():
    """Analyze sales data using only Python builtins."""
    print("=" * 50)
    print("WAY 1: Pure Python")
    print("=" * 50)

    # Task 1: Monthly totals
    monthly_totals = [sum(row) for row in sales_data]
    print("\n--- Monthly Totals ---")
    for m, total in zip(months, monthly_totals):
        print(f"  Month {m}: {total}M")

    # Task 2: Month-over-month growth
    print("\n--- Month-over-Month Growth ---")
    for i in range(1, len(monthly_totals)):
        growth = (monthly_totals[i] - monthly_totals[i-1]) / monthly_totals[i-1]
        print(f"  Month {months[i]}: {growth:>+.2%}")

    # Task 3: Annual sales by region (sorted)
    region_totals = {}
    for j, region in enumerate(regions):
        region_totals[region] = sum(sales_data[i][j] for i in range(12))
    sorted_regions = sorted(region_totals, key=lambda r: region_totals[r], reverse=True)
    print("\n--- Annual Sales by Region (sorted) ---")
    for r in sorted_regions:
        print(f"  {r}: {region_totals[r]}M")

    # Task 4: Peak sales
    max_val, max_month, max_region = 0, 0, 0
    for i in range(len(months)):
        for j in range(len(regions)):
            if sales_data[i][j] > max_val:
                max_val = sales_data[i][j]
                max_month, max_region = i, j
    print(f"\n--- Peak Sales ---")
    print(f"  Month {months[max_month]}, {regions[max_region]}: {max_val}M")

    # Task 5: Most volatile region (population variance)
    print("\n--- Most Volatile Region ---")
    max_var, most_volatile = 0, ""
    for j, region in enumerate(regions):
        values = [sales_data[i][j] for i in range(12)]
        avg = sum(values) / len(values)
        var = sum((x - avg) ** 2 for x in values) / len(values)
        if var > max_var:
            max_var, most_volatile = var, region
    print(f"  {most_volatile} (variance: {max_var:.1f})")
    print()


# =============================================================================
# Way 2: NumPy (vectorized operations with axis)
# =============================================================================

def numpy_analysis():
    """Same analysis using NumPy - vectorized, no loops."""
    print("=" * 50)
    print("WAY 2: NumPy")
    print("=" * 50)

    data = np.array(sales_data)
    print(f"\nArray shape: {data.shape}  (12 months x 5 regions)")

    # Task 1: Monthly totals - sum along axis=1 (columns)
    monthly_totals = data.sum(axis=1)
    print(f"\n--- Monthly Totals (axis=1) ---")
    print(f"  {monthly_totals}")

    # Task 2: Month-over-month growth
    mom = np.diff(monthly_totals) / monthly_totals[:-1]
    print(f"\n--- MoM Growth ---")
    print(f"  {np.round(mom * 100, 1)}%")

    # Task 3: Annual by region - sum along axis=0 (rows)
    region_totals = data.sum(axis=0)
    sorted_idx = np.argsort(region_totals)[::-1]
    print(f"\n--- Annual Sales by Region (sorted) ---")
    for idx in sorted_idx:
        print(f"  {regions[idx]}: {region_totals[idx]}M")

    # Task 4: Peak sales - argmax on flattened then unravel
    flat_idx = data.argmax()
    peak_month, peak_region = np.unravel_index(flat_idx, data.shape)
    print(f"\n--- Peak Sales ---")
    print(f"  Month {months[peak_month]}, {regions[peak_region]}: "
          f"{data[peak_month, peak_region]}M")

    # Task 5: Most volatile - variance along axis=0
    variances = data.var(axis=0)
    most_volatile = np.argmax(variances)
    print(f"\n--- Most Volatile Region ---")
    print(f"  {regions[most_volatile]} (variance: {variances[most_volatile]:.1f})")
    print(f"  All variances: {np.round(variances, 1)}")
    print()


# =============================================================================
# Way 3: Pandas (labeled data, built-in methods)
# =============================================================================

def pandas_analysis():
    """Same analysis using Pandas - labeled, expressive, chainable."""
    print("=" * 50)
    print("WAY 3: Pandas")
    print("=" * 50)

    df = pd.DataFrame(sales_data, columns=regions,
                      index=[f'Month {m}' for m in months])
    print(f"\n{df}\n")

    # Task 1: Monthly totals
    print("--- Monthly Totals (df.sum(axis=1)) ---")
    print(df.sum(axis=1))
    print()

    # Task 2: Month-over-month with pct_change()
    print("--- MoM Growth (pct_change()) ---")
    print(df.sum(axis=1).pct_change().dropna().map('{:.2%}'.format))
    print()

    # Task 3: Annual by region (sorted)
    print("--- Annual Sales by Region (sorted) ---")
    print(df.sum().sort_values(ascending=False))
    print()

    # Task 4: Peak sales with idxmax on stacked DataFrame
    stacked = df.stack()
    peak_idx = stacked.idxmax()
    print(f"--- Peak Sales ---")
    print(f"  {peak_idx[0]}, {peak_idx[1]}: {stacked[peak_idx]}M")
    print()

    # Task 5: Most volatile
    print("--- Most Volatile Region (var()) ---")
    variances = df.var(ddof=0)
    print(f"  {variances.idxmax()} (variance: {variances.max():.1f})")
    print(f"  All variances:\n{variances.round(1)}")
    print()


# =============================================================================
# Comparison Summary
# =============================================================================

def comparison_summary():
    """Compare the three approaches."""
    print("=" * 50)
    print("COMPARISON SUMMARY")
    print("=" * 50)
    print("""
    Pure Python:
      + No dependencies
      + Easy to understand
      - Verbose (many loops)
      - Slow on large data

    NumPy:
      + Fast (vectorized C operations)
      + Concise (axis-based operations)
      - Integer indexing only (no labels)
      - Homogeneous dtype

    Pandas:
      + Labeled data (named rows/columns)
      + Rich methods (pct_change, describe, groupby)
      + Handles mixed types and missing values
      + Great for tabular data
      - More memory overhead
      - Learning curve for API
    """)


# =============================================================================
# Main
# =============================================================================

if __name__ == '__main__':
    pure_python_analysis()
    numpy_analysis()
    pandas_analysis()
    comparison_summary()