Series vs DataFrame¶
Understanding the relationship between Series and DataFrame is fundamental to working effectively with pandas. This document clarifies when to use each and how they interact.
Structural Comparison¶
┌─────────────────────────────────────────────────────────────┐
│                          DataFrame                          │
│  ┌─────────┬─────────┬─────────┬─────────┐                  │
│  │ Series  │ Series  │ Series  │ Series  │ ← Columns        │
│  │ (col A) │ (col B) │ (col C) │ (col D) │                  │
│  ├─────────┼─────────┼─────────┼─────────┤                  │
│  │ 1.0     │ 'foo'   │ True    │ 100     │ ← Row 0          │
│  │ 2.0     │ 'bar'   │ False   │ 200     │ ← Row 1          │
│  │ 3.0     │ 'baz'   │ True    │ 300     │ ← Row 2          │
│  └─────────┴─────────┴─────────┴─────────┘                  │
│       ↑         ↑         ↑         ↑                       │
│    float64   object     bool     int64   ← dtype per col    │
└─────────────────────────────────────────────────────────────┘
| Aspect | Series | DataFrame |
|---|---|---|
| Dimensions | 1D (single column) | 2D (multiple columns) |
| Data types | Single dtype | Different dtype per column |
| Analogy | Excel column | Excel spreadsheet |
| NumPy equivalent | 1D array | 2D array (but heterogeneous) |
Type Transitions¶
Several common operations silently switch the result between DataFrame and Series, so it helps to know which ones do.
DataFrame to Series¶
import pandas as pd
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
})
# Single column selection -> Series
col_a = df['A']
print(type(col_a)) # <class 'pandas.core.series.Series'>
# Single row selection -> Series
row_0 = df.iloc[0]
print(type(row_0)) # <class 'pandas.core.series.Series'>
# Aggregation -> Series
col_means = df.mean()
print(type(col_means)) # <class 'pandas.core.series.Series'>
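One consequence of row selection is worth seeing directly: a row taken from a frame with mixed column dtypes has to fit into a single Series dtype, so pandas falls back to `object`. A minimal sketch, assuming a small made-up frame (the name `mixed` and its values are illustrative only):

```python
import pandas as pd

# Hypothetical frame with one float, one string, and one bool column
mixed = pd.DataFrame({
    'A': [1.0, 2.0],
    'B': ['foo', 'bar'],
    'C': [True, False],
})

col = mixed['A']      # column Series keeps the column's own dtype
row = mixed.iloc[0]   # row Series must hold float, str, and bool together

print(col.dtype)         # float64
print(row.dtype)         # object
print(list(row.index))   # ['A', 'B', 'C'] - column labels become the index
```

This is one reason row-wise access on mixed-type frames tends to be slower and less type-safe than column-wise access.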
Preserving DataFrame Type¶
# Double brackets preserve DataFrame
col_a_df = df[['A']]
print(type(col_a_df)) # <class 'pandas.core.frame.DataFrame'>
print(col_a_df.shape) # (3, 1)
# Multiple column selection -> DataFrame
subset = df[['A', 'B']]
print(type(subset)) # <class 'pandas.core.frame.DataFrame'>
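The same list-versus-scalar rule appears to carry over to row selection, and `squeeze()` goes in the opposite direction. A short sketch under the assumption of the same three-column frame used above:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

row_series = df.iloc[0]     # scalar position -> Series
row_frame = df.iloc[[0]]    # list of positions -> DataFrame

print(type(row_series).__name__, row_series.shape)  # Series (3,)
print(type(row_frame).__name__, row_frame.shape)    # DataFrame (1, 3)

# squeeze() collapses a one-column DataFrame back into a Series
back = df[['A']].squeeze()
print(type(back).__name__)  # Series
```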
Series to DataFrame¶
s = pd.Series([1, 2, 3], name='values')
# to_frame() method
df = s.to_frame()
print(type(df)) # <class 'pandas.core.frame.DataFrame'>
# reset_index() also creates DataFrame
df = s.reset_index()
print(df.columns) # Index(['index', 'values'])
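Two related patterns may be useful here: `to_frame(name=...)` to pick the column name, and `pd.concat(..., axis=1)` to assemble several Series into one DataFrame. The Series names and values below are made up for illustration:

```python
import pandas as pd

# Hypothetical example data
prices = pd.Series([10.0, 11.5, 12.0], name='price')
volume = pd.Series([100, 150, 120], name='volume')

# Override the column name while converting
renamed = prices.to_frame(name='close')
print(list(renamed.columns))   # ['close']

# Each Series becomes one column; rows are aligned on the index
combined = pd.concat([prices, volume], axis=1)
print(combined.shape)          # (3, 2)
print(list(combined.columns))  # ['price', 'volume']
```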
Shape Differences¶
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)
# Full DataFrame
print(df.shape) # (891, 12)
# Multiple columns -> DataFrame
print(df[["Survived", "Sex"]].shape) # (891, 2)
# Single column with double brackets -> DataFrame
print(df[["Survived"]].shape) # (891, 1)
# Single column with single brackets -> Series
print(df["Survived"].shape) # (891,) - Note: 1D tuple
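The shape rules can also be checked without a network download. The sketch below uses a tiny stand-in frame with made-up values, plus a hypothetical helper `as_frame` (not part of pandas) for code that needs a 2D result regardless of how the column was selected:

```python
import pandas as pd

# Tiny stand-in for the Titanic data (values are made up for illustration)
small = pd.DataFrame({'Survived': [0, 1, 1], 'Sex': ['male', 'female', 'female']})

def as_frame(obj):
    """Hypothetical helper: return obj as a DataFrame, converting a Series."""
    return obj.to_frame() if isinstance(obj, pd.Series) else obj

print(small['Survived'].ndim, small['Survived'].shape)      # 1 (3,)
print(small[['Survived']].ndim, small[['Survived']].shape)  # 2 (3, 1)
print(as_frame(small['Survived']).shape)                    # (3, 1)
```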
Access Patterns¶
Equivalent Operations¶
| Operation | DataFrame Syntax | Series Syntax |
|---|---|---|
| Get element | df.loc[row, col] | s[label] or s.loc[label] |
| Get by position | df.iloc[i, j] | s.iloc[i] |
| Boolean filter | df[df['A'] > 0] | s[s > 0] |
| Get values | df.values | s.values |
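The rows of the table can be run side by side. This is a sketch on a small frame with made-up values and string row labels; it uses `to_numpy()`, which the pandas documentation recommends over `.values`:

```python
import pandas as pd

df = pd.DataFrame({'A': [-1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])
s = df['A']

# Label-based access
print(df.loc['y', 'B'])   # 5
print(s.loc['y'])         # 2

# Position-based access
print(df.iloc[2, 0])      # 3
print(s.iloc[2])          # 3

# Boolean filtering keeps the container type
print(df[df['A'] > 0].shape)   # (2, 2)
print(s[s > 0].tolist())       # [2, 3]

# Underlying arrays: 2D for the frame, 1D for the Series
print(df.to_numpy().shape)     # (3, 2)
print(s.to_numpy().shape)      # (3,)
```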
Column-wise vs Element-wise¶
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# DataFrame aggregation is column-wise by default
print(df.sum())
# A 6
# B 15
# dtype: int64
s = pd.Series([1, 2, 3])
# Series aggregation reduces all elements to a single scalar
print(s.sum()) # 6
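The `axis` argument controls which direction a DataFrame aggregation collapses. A brief sketch on the same two-column frame, showing that each aggregation step drops one dimension:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

col_sums = df.sum()          # axis=0 (default): one value per column
row_sums = df.sum(axis=1)    # axis=1: one value per row
total = df.sum().sum()       # aggregating the resulting Series gives a scalar

print(col_sums.tolist())   # [6, 15]
print(row_sums.tolist())   # [5, 7, 9]
print(total)               # 21
```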
Method Behavior Differences¶
Aggregations¶
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
s = pd.Series([1, 2, 3])
# DataFrame.mean() returns Series (one value per column)
print(df.mean())
# A 2.0
# B 5.0
# dtype: float64
# Series.mean() returns scalar
print(s.mean()) # 2.0
Apply Behavior¶
# DataFrame apply works on columns (axis=0) or rows (axis=1)
df.apply(sum, axis=0) # Sum each column
df.apply(sum, axis=1) # Sum each row
# Series apply works element-wise
s.apply(lambda x: x ** 2) # Square each element
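To make the difference concrete: the function given to `DataFrame.apply` receives a whole Series on each call, while the function given to `Series.apply` receives one element. A minimal sketch (the lambda bodies are illustrative choices, not from the original):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
s = pd.Series([1, 2, 3])

# Each call receives one column (axis=0) or one row (axis=1) as a Series
col_range = df.apply(lambda col: col.max() - col.min(), axis=0)
row_range = df.apply(lambda row: row.max() - row.min(), axis=1)
print(col_range.tolist())   # [2, 2]
print(row_range.tolist())   # [3, 3, 3]

# Each call receives a single element
squared_apply = s.apply(lambda x: x ** 2)
# For simple arithmetic the vectorized form gives the same result
squared_vec = s ** 2
print(squared_apply.tolist())   # [1, 4, 9]
```

For plain arithmetic like this, the vectorized expression is usually preferable to `apply`, since it avoids a Python-level call per element.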
Common Conversion Patterns¶
Aggregation Results¶
# Aggregating one selected column after groupby returns a Series
# (assumes df has 'category' and 'value' columns)
result = df.groupby('category')['value'].sum()
print(type(result)) # Series
# Convert to DataFrame with reset_index
result_df = df.groupby('category')['value'].sum().reset_index()
print(type(result_df)) # DataFrame
# Or use to_frame with custom column name
result_df = df.groupby('category')['value'].sum().to_frame(name='total')
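A self-contained version of the pattern, assuming a small made-up `sales` frame with `category` and `value` columns. It also shows two alternatives that return a DataFrame directly: selecting with a list, and passing `as_index=False`:

```python
import pandas as pd

# Hypothetical data for illustration
sales = pd.DataFrame({
    'category': ['a', 'b', 'a', 'b'],
    'value': [10, 20, 30, 40],
})

as_series = sales.groupby('category')['value'].sum()     # Series
as_frame = sales.groupby('category')[['value']].sum()    # DataFrame, category in index
flat = sales.groupby('category', as_index=False)['value'].sum()  # DataFrame, category as column

print(type(as_series).__name__)   # Series
print(type(as_frame).__name__)    # DataFrame
print(list(flat.columns))         # ['category', 'value']
print(flat['value'].tolist())     # [40, 60]
```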
Value Counts¶
s = pd.Series(['a', 'b', 'a', 'c', 'a', 'b'])
# value_counts returns Series
counts = s.value_counts()
print(type(counts)) # Series
# Convert to DataFrame
counts_df = s.value_counts().reset_index()
counts_df.columns = ['value', 'count']
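An alternative that names both columns in one chain, without assigning to `.columns` afterwards, is sketched below. It relies on `rename_axis` to name the index and on the `name` argument of `Series.reset_index` to name the values column:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c', 'a', 'b'])

counts_df = (
    s.value_counts()
     .rename_axis('value')         # name the index holding the unique values
     .reset_index(name='count')    # move it to a column and name the counts
)

print(list(counts_df.columns))      # ['value', 'count']
print(counts_df['count'].tolist())  # [3, 2, 1]
```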
Practical Guidelines¶
When to Use Series¶
- Working with a single variable
- Time series of one measurement
- Result of column extraction
- Input to plotting functions expecting 1D data
# Time series analysis (assumes df has a 'Close' price column)
prices = df['Close'] # Series
returns = prices.pct_change()
rolling_mean = prices.rolling(20).mean()
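A runnable variant of the snippet above, using four made-up closing prices and a window of 2 instead of 20 so the result is easy to check by hand. Both `pct_change()` and `rolling(...).mean()` return a Series of the same length as the input:

```python
import pandas as pd

# Hypothetical prices on a daily index
prices = pd.Series(
    [100.0, 102.0, 101.0, 105.0],
    index=pd.date_range('2024-01-01', periods=4, freq='D'),
    name='Close',
)

returns = prices.pct_change()            # first value is NaN (no previous day)
rolling_mean = prices.rolling(2).mean()  # first value is NaN (window not full)

print(type(returns).__name__, len(returns))   # Series 4
print(round(returns.iloc[1], 4))              # 0.02
print(rolling_mean.iloc[1])                   # 101.0
```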
When to Use DataFrame¶
- Multiple variables that should stay aligned
- Tabular data with different column types
- Data requiring row-wise operations
- Input/output for file operations
# Multi-asset analysis
portfolio = df[['AAPL', 'MSFT', 'GOOGL']] # DataFrame
correlations = portfolio.corr()
portfolio_returns = portfolio.pct_change()
Avoiding Common Mistakes¶
# WRONG: Expecting DataFrame, getting Series
col = df['price'] # This is a Series!
col.columns # AttributeError: 'Series' object has no attribute 'columns'
# RIGHT: Keep as DataFrame if needed
col = df[['price']] # This is a DataFrame
col.columns # Index(['price'], dtype='object')
# WRONG: Chained indexing assigns into a temporary copy
df[df['A'] > 0]['B'] = 1 # Raises a warning and typically leaves df unchanged
# RIGHT: Use loc for assignment
df.loc[df['A'] > 0, 'B'] = 1
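Both pitfalls can be checked on a small frame. The sketch below uses made-up values; it verifies that a single-bracket selection has no `columns` attribute and that the `.loc` form updates the original frame in place:

```python
import pandas as pd

df = pd.DataFrame({'A': [-1, 2, 3], 'B': [0, 0, 0]})

# Single brackets give a Series, which has no .columns attribute
print(hasattr(df['A'], 'columns'))     # False
print(hasattr(df[['A']], 'columns'))   # True

# Row mask and column label in one .loc call modify df itself
df.loc[df['A'] > 0, 'B'] = 1
print(df['B'].tolist())                # [0, 1, 1]
```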
Performance Considerations¶
| Operation | Series | DataFrame |
|---|---|---|
| Memory | Lower (single dtype) | Higher (metadata per column) |
| Iteration | Faster (yields scalars) | Slower (iterrows() builds a Series per row) |
| Vectorized ops | Optimal | Optimal |
| Type consistency | Guaranteed | Per-column |
For large-scale numerical operations, extracting to NumPy arrays may provide additional performance benefits:
import numpy as np
# Extract for numerical operations
arr = df['price'].to_numpy() # NumPy array (to_numpy() is preferred over .values)
result = np.sqrt(arr) # Fast NumPy operation
# Put back into pandas if needed
df['price_sqrt'] = result
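A self-contained version of the round trip, assuming a made-up `price` column. It also shows that NumPy ufuncs such as `np.sqrt` accept a Series directly and return a Series, so the explicit extraction is optional in simple cases:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'price': [1.0, 4.0, 9.0]})

# Explicit round trip through a NumPy array
arr = df['price'].to_numpy()
df['price_sqrt'] = np.sqrt(arr)
print(type(arr).__name__)            # ndarray
print(df['price_sqrt'].tolist())     # [1.0, 2.0, 3.0]

# Applying the ufunc to the Series keeps the index and returns a Series
direct = np.sqrt(df['price'])
print(type(direct).__name__)         # Series
```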
Runnable Example: python_numpy_pandas_comparison.py¶
"""
Three Ways: Pure Python vs NumPy vs Pandas for Data Analysis
This tutorial solves the same sales data analysis problem three ways,
showing why NumPy and Pandas exist and how they simplify data work.
Problem: Analyze monthly sales data across 5 regions.
Tasks:
1. Monthly total sales
2. Month-over-month growth rate
3. Annual sales by region (sorted)
4. Find peak sales (month + region)
5. Find most volatile region (variance)
Based on Python-100-Days Day66-80 day01.ipynb cells.
"""
import numpy as np
import pandas as pd
# =============================================================================
# Setup: Sales Data (12 months x 5 regions, in millions)
# =============================================================================
months = [f'{i:>2d}' for i in range(1, 13)]
regions = ['East', 'West', 'North', 'South', 'Central']
sales_data = [
    [32, 17, 12, 20, 28],
    [41, 30, 17, 15, 35],
    [35, 18, 13, 11, 24],
    [12, 42, 44, 21, 34],
    [29, 11, 42, 32, 50],
    [10, 15, 11, 12, 26],
    [16, 28, 48, 22, 28],
    [31, 40, 45, 30, 39],
    [25, 41, 47, 42, 47],
    [47, 21, 13, 49, 48],
    [41, 36, 17, 36, 22],
    [22, 25, 15, 20, 37],
]
# =============================================================================
# Way 1: Pure Python (loops and comprehensions)
# =============================================================================
def pure_python_analysis():
    """Analyze sales data using only Python builtins."""
    print("=" * 50)
    print("WAY 1: Pure Python")
    print("=" * 50)
    # Task 1: Monthly totals
    monthly_totals = [sum(row) for row in sales_data]
    print("\n--- Monthly Totals ---")
    for m, total in zip(months, monthly_totals):
        print(f" Month {m}: {total}M")
    # Task 2: Month-over-month growth
    print("\n--- Month-over-Month Growth ---")
    for i in range(1, len(monthly_totals)):
        growth = (monthly_totals[i] - monthly_totals[i-1]) / monthly_totals[i-1]
        print(f" Month {months[i]}: {growth:>+.2%}")
    # Task 3: Annual sales by region (sorted)
    region_totals = {}
    for j, region in enumerate(regions):
        region_totals[region] = sum(sales_data[i][j] for i in range(12))
    sorted_regions = sorted(region_totals, key=lambda r: region_totals[r], reverse=True)
    print("\n--- Annual Sales by Region (sorted) ---")
    for r in sorted_regions:
        print(f" {r}: {region_totals[r]}M")
    # Task 4: Peak sales
    max_val, max_month, max_region = 0, 0, 0
    for i in range(len(months)):
        for j in range(len(regions)):
            if sales_data[i][j] > max_val:
                max_val = sales_data[i][j]
                max_month, max_region = i, j
    print(f"\n--- Peak Sales ---")
    print(f" Month {months[max_month]}, {regions[max_region]}: {max_val}M")
    # Task 5: Most volatile region (population variance)
    print("\n--- Most Volatile Region ---")
    max_var, most_volatile = 0, ""
    for j, region in enumerate(regions):
        values = [sales_data[i][j] for i in range(12)]
        avg = sum(values) / len(values)
        var = sum((x - avg) ** 2 for x in values) / len(values)
        if var > max_var:
            max_var, most_volatile = var, region
    print(f" {most_volatile} (variance: {max_var:.1f})")
    print()
# =============================================================================
# Way 2: NumPy (vectorized operations with axis)
# =============================================================================
def numpy_analysis():
    """Same analysis using NumPy - vectorized computation instead of loops."""
    print("=" * 50)
    print("WAY 2: NumPy")
    print("=" * 50)
    data = np.array(sales_data)
    print(f"\nArray shape: {data.shape} (12 months x 5 regions)")
    # Task 1: Monthly totals - sum across the columns of each row (axis=1)
    monthly_totals = data.sum(axis=1)
    print(f"\n--- Monthly Totals (axis=1) ---")
    print(f" {monthly_totals}")
    # Task 2: Month-over-month growth
    mom = np.diff(monthly_totals) / monthly_totals[:-1]
    print(f"\n--- MoM Growth ---")
    print(f" {np.round(mom * 100, 1)}%")
    # Task 3: Annual by region - sum down the rows of each column (axis=0)
    region_totals = data.sum(axis=0)
    sorted_idx = np.argsort(region_totals)[::-1]
    print(f"\n--- Annual Sales by Region (sorted) ---")
    for idx in sorted_idx:
        print(f" {regions[idx]}: {region_totals[idx]}M")
    # Task 4: Peak sales - argmax on flattened then unravel
    flat_idx = data.argmax()
    peak_month, peak_region = np.unravel_index(flat_idx, data.shape)
    print(f"\n--- Peak Sales ---")
    print(f" Month {months[peak_month]}, {regions[peak_region]}: "
          f"{data[peak_month, peak_region]}M")
    # Task 5: Most volatile - population variance of each column (axis=0)
    variances = data.var(axis=0)
    most_volatile = np.argmax(variances)
    print(f"\n--- Most Volatile Region ---")
    print(f" {regions[most_volatile]} (variance: {variances[most_volatile]:.1f})")
    print(f" All variances: {np.round(variances, 1)}")
    print()
# =============================================================================
# Way 3: Pandas (labeled data, built-in methods)
# =============================================================================
def pandas_analysis():
    """Same analysis using Pandas - labeled, expressive, chainable."""
    print("=" * 50)
    print("WAY 3: Pandas")
    print("=" * 50)
    df = pd.DataFrame(sales_data, columns=regions,
                      index=[f'Month {m}' for m in months])
    print(f"\n{df}\n")
    # Task 1: Monthly totals
    print("--- Monthly Totals (df.sum(axis=1)) ---")
    print(df.sum(axis=1))
    print()
    # Task 2: Month-over-month with pct_change()
    print("--- MoM Growth (pct_change()) ---")
    print(df.sum(axis=1).pct_change().dropna().map('{:.2%}'.format))
    print()
    # Task 3: Annual by region (sorted)
    print("--- Annual Sales by Region (sorted) ---")
    print(df.sum().sort_values(ascending=False))
    print()
    # Task 4: Peak sales with idxmax on stacked DataFrame
    stacked = df.stack()
    peak_idx = stacked.idxmax()
    print(f"--- Peak Sales ---")
    print(f" {peak_idx[0]}, {peak_idx[1]}: {stacked[peak_idx]}M")
    print()
    # Task 5: Most volatile
    print("--- Most Volatile Region (var()) ---")
    variances = df.var(ddof=0)
    print(f" {variances.idxmax()} (variance: {variances.max():.1f})")
    print(f" All variances:\n{variances.round(1)}")
    print()
# =============================================================================
# Comparison Summary
# =============================================================================
def comparison_summary():
    """Compare the three approaches."""
    print("=" * 50)
    print("COMPARISON SUMMARY")
    print("=" * 50)
    print("""
Pure Python:
  + No dependencies
  + Easy to understand
  - Verbose (many loops)
  - Slow on large data
NumPy:
  + Fast (vectorized C operations)
  + Concise (axis-based operations)
  - Integer indexing only (no labels)
  - Homogeneous dtype
Pandas:
  + Labeled data (named rows/columns)
  + Rich methods (pct_change, describe, groupby)
  + Handles mixed types and missing values
  + Great for tabular data
  - More memory overhead
  - Learning curve for API
""")
# =============================================================================
# Main
# =============================================================================
if __name__ == '__main__':
    pure_python_analysis()
    numpy_analysis()
    pandas_analysis()
    comparison_summary()