GroupBy Aggregations¶
GroupBy objects support various aggregation methods to summarize grouped data.
Mental Model
A GroupBy aggregation is SQL's GROUP BY + aggregate function. The data is split into groups, a reduction function collapses each group to one row, and the results are combined into a smaller DataFrame. The output always has one row per unique group key.
Basic Aggregations¶
Apply single aggregation functions.
1. Mean per Group¶
```python import pandas as pd
df = pd.DataFrame({ 'asset': ['A', 'A', 'B', 'B'], 'return': [0.01, -0.02, 0.03, 0.01], })
df.groupby('asset')['return'].mean() ```
asset
A -0.005
B 0.020
Name: return, dtype: float64
2. Common Aggregations¶
python
df.groupby('asset')['return'].sum()
df.groupby('asset')['return'].count()
df.groupby('asset')['return'].std()
df.groupby('asset')['return'].min()
df.groupby('asset')['return'].max()
3. Multiple Methods¶
python
df.groupby('asset')['return'].agg(['mean', 'std', 'count'])
LeetCode Example: Duplicate Emails¶
Count email occurrences.
1. Sample Data¶
python
person = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'email': ['a@ex.com', 'b@ex.com', 'a@ex.com', 'b@ex.com', 'c@ex.com']
})
2. GroupBy Count¶
python
email_counts = person.groupby('email')['id'].count().reset_index(name='count')
print(email_counts)
email count
0 a@ex.com 2
1 b@ex.com 2
2 c@ex.com 1
3. Find Duplicates¶
python
duplicates = email_counts[email_counts['count'] > 1]['email']
LeetCode Example: Customer Orders¶
Find customer with most orders.
1. Sample Data¶
python
orders = pd.DataFrame({
'order_number': [101, 102, 103, 104, 105],
'customer_number': [1, 1, 2, 3, 2]
})
2. Count per Customer¶
python
order_counts = orders.groupby('customer_number')['order_number'].count().reset_index()
print(order_counts)
3. Find Maximum¶
python
max_orders = order_counts.loc[order_counts['order_number'].idxmax()]
Named Aggregations¶
Create descriptive column names.
1. Named Syntax¶
python
df.groupby('asset').agg(
mean_return=('return', 'mean'),
std_return=('return', 'std'),
count=('return', 'count')
)
2. Dictionary Syntax¶
python
df.groupby('asset').agg({
'return': ['mean', 'std', 'count']
})
3. Custom Functions¶
python
df.groupby('asset').agg(
range=('return', lambda x: x.max() - x.min())
)
reset_index¶
Convert index to columns.
1. Default Result¶
```python result = df.groupby('asset')['return'].mean()
asset is index¶
```
2. With reset_index¶
```python result = df.groupby('asset')['return'].mean().reset_index()
asset is column¶
```
3. as_index=False¶
```python result = df.groupby('asset', as_index=False)['return'].mean()
Equivalent to reset_index()¶
```
Runnable Example: business_analytics_example.py¶
```python """ Business Analytics: GroupBy, Pivot Tables, and Visualization
A practical business analytics workflow using Pandas groupby, pivot_table, and crosstab for multi-dimensional analysis.
Topics covered: - GroupBy with multiple aggregation functions - Pivot tables for cross-tabulation - Month-over-month (MoM) calculations - Percent-of-total calculations - Summary statistics by category
Based on Python-100-Days Day66-80 day05.ipynb business analytics examples. """
import numpy as np import pandas as pd
=============================================================================¶
Setup: Generate Sample Business Data¶
=============================================================================¶
def create_sales_data() -> pd.DataFrame: """Create sample sales data for analytics.""" np.random.seed(42) n = 200
data = {
'date': pd.date_range('2023-01-01', periods=n, freq='B'),
'region': np.random.choice(['East', 'West', 'North', 'South'], n),
'channel': np.random.choice(['Online', 'Retail', 'Wholesale'], n,
p=[0.4, 0.35, 0.25]),
'product': np.random.choice(['Widget A', 'Widget B', 'Widget C'], n),
'units': np.random.randint(10, 200, n),
'unit_price': np.random.choice([25.0, 45.0, 80.0], n),
}
df = pd.DataFrame(data)
df['revenue'] = df['units'] * df['unit_price']
df['cost'] = df['revenue'] * np.random.uniform(0.4, 0.7, n)
df['profit'] = df['revenue'] - df['cost']
df['month'] = df['date'].dt.to_period('M')
return df
=============================================================================¶
Analysis 1: Monthly Revenue Summary¶
=============================================================================¶
def monthly_summary(df: pd.DataFrame) -> None: """Summarize revenue by month with MoM growth.""" print("=== Monthly Revenue Summary ===")
monthly = df.groupby('month').agg(
total_revenue=('revenue', 'sum'),
total_profit=('profit', 'sum'),
num_orders=('revenue', 'count'),
avg_order_value=('revenue', 'mean'),
).round(0)
# Month-over-month growth
monthly['revenue_mom'] = monthly['total_revenue'].pct_change()
monthly['profit_margin'] = monthly['total_profit'] / monthly['total_revenue']
print(monthly.to_string())
print()
=============================================================================¶
Analysis 2: Region x Channel Pivot Table¶
=============================================================================¶
def region_channel_analysis(df: pd.DataFrame) -> None: """Pivot table: revenue by region and channel.""" print("=== Revenue by Region x Channel ===")
pivot = pd.pivot_table(
df,
values='revenue',
index='region',
columns='channel',
aggfunc='sum',
margins=True, # Add row/column totals
margins_name='Total',
).round(0)
print(pivot.to_string())
print()
# Percent of total
print("--- Percent of Total ---")
total = pivot.loc['Total', 'Total']
pct = (pivot / total * 100).round(1)
print(pct.to_string())
print()
=============================================================================¶
Analysis 3: Product Performance¶
=============================================================================¶
def product_performance(df: pd.DataFrame) -> None: """Analyze performance by product.""" print("=== Product Performance ===")
product_stats = df.groupby('product').agg(
total_units=('units', 'sum'),
total_revenue=('revenue', 'sum'),
total_profit=('profit', 'sum'),
avg_price=('unit_price', 'mean'),
num_orders=('revenue', 'count'),
)
product_stats['profit_margin'] = (
product_stats['total_profit'] / product_stats['total_revenue']
)
product_stats['revenue_per_order'] = (
product_stats['total_revenue'] / product_stats['num_orders']
)
# Sort by total revenue
product_stats = product_stats.sort_values('total_revenue', ascending=False)
print(product_stats.round(1).to_string())
print()
=============================================================================¶
Analysis 4: Cross-Tabulation¶
=============================================================================¶
def channel_region_crosstab(df: pd.DataFrame) -> None: """Cross-tabulation of order counts by channel and region.""" print("=== Order Count: Channel x Region (crosstab) ===")
ct = pd.crosstab(
df['channel'],
df['region'],
margins=True,
margins_name='Total',
)
print(ct.to_string())
print()
# Normalize by row (channel distribution across regions)
print("--- Channel Distribution Across Regions (%) ---")
ct_pct = pd.crosstab(
df['channel'],
df['region'],
normalize='index',
).round(3) * 100
print(ct_pct.to_string())
print()
=============================================================================¶
Analysis 5: Multi-Level Grouping¶
=============================================================================¶
def multi_level_grouping(df: pd.DataFrame) -> None: """Group by multiple columns with multiple aggregations.""" print("=== Multi-Level Grouping: Region + Product ===")
result = df.groupby(['region', 'product']).agg({
'revenue': ['sum', 'mean', 'count'],
'profit': ['sum'],
'units': ['sum'],
}).round(0)
# Flatten column MultiIndex
result.columns = ['_'.join(col).strip() for col in result.columns]
# Top 5 by revenue
print("Top 5 by Total Revenue:")
print(result.nlargest(5, 'revenue_sum').to_string())
print()
=============================================================================¶
Analysis 6: Transform and Rank within Groups¶
=============================================================================¶
def group_transform_rank(df: pd.DataFrame) -> None: """Use transform and rank within groups.""" print("=== Transform: Percent of Regional Revenue ===")
# Each order's revenue as % of its region's total
df_copy = df[['region', 'channel', 'revenue']].copy()
df_copy['region_total'] = df_copy.groupby('region')['revenue'].transform('sum')
df_copy['pct_of_region'] = (df_copy['revenue'] / df_copy['region_total'] * 100).round(2)
# Rank within region
df_copy['rank_in_region'] = df_copy.groupby('region')['revenue'].rank(
ascending=False, method='dense'
).astype(int)
print(df_copy.head(10).to_string(index=False))
print()
=============================================================================¶
Main¶
=============================================================================¶
if name == 'main': df = create_sales_data() print(f"Dataset: {df.shape[0]} rows, {df.shape[1]} columns") print(f"Date range: {df['date'].min().date()} to {df['date'].max().date()}") print(f"Columns: {df.columns.tolist()}") print()
monthly_summary(df)
region_channel_analysis(df)
product_performance(df)
channel_region_crosstab(df)
multi_level_grouping(df)
group_transform_rank(df)
```
Exercises¶
Exercise 1.
Group a DataFrame by 'department' and apply named aggregations to compute total_sales=('sales', 'sum') and avg_sales=('sales', 'mean'). Reset the index to get a flat DataFrame.
Solution to Exercise 1
Use named aggregations for clean column names.
import pandas as pd
df = pd.DataFrame({
'department': ['Sales', 'Sales', 'IT', 'IT'],
'sales': [100, 200, 150, 250]
})
result = df.groupby('department').agg(
total_sales=('sales', 'sum'),
avg_sales=('sales', 'mean')
).reset_index()
print(result)
Exercise 2.
Use the dictionary syntax of .agg() to apply different functions to different columns after a groupby: sum the 'quantity' column and compute the mean of the 'price' column per group.
Solution to Exercise 2
Use a dictionary mapping columns to functions.
import pandas as pd
df = pd.DataFrame({
'group': ['A', 'A', 'B', 'B'],
'quantity': [10, 20, 30, 40],
'price': [5.0, 7.5, 3.0, 6.0]
})
result = df.groupby('group').agg({
'quantity': 'sum',
'price': 'mean'
})
print(result)
Exercise 3.
Group by 'customer_id' and count the number of orders per customer using .count(). Then filter to find customers with more than 3 orders using boolean indexing on the result.
Solution to Exercise 3
Count per group and filter using boolean indexing.
import pandas as pd
df = pd.DataFrame({
'customer_id': [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
'order_id': range(11)
})
counts = df.groupby('customer_id')['order_id'].count()
frequent = counts[counts > 3]
print(frequent)