Basic Aggregations
Aggregation functions summarize data by computing statistics like sum, mean, and count. They reduce multiple values to a single result.
Mental Model
An aggregation collapses many values into one -- a column of 1000 numbers becomes a single mean, sum, or count. Picture a funnel: data flows in at the top, and a single summary number comes out at the bottom. Every built-in aggregation method (mean, sum, std, ...) is just a different funnel shape.
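As a quick illustration of the funnel idea, the sketch below collapses a column of 1000 numbers into single values. The Series `values` is made-up data, chosen so the results are easy to check by hand.

```python
import pandas as pd

# Hypothetical data: the integers 1 through 1000
values = pd.Series(range(1, 1001))

total = values.sum()     # 500500
average = values.mean()  # 500.5
n = values.count()       # 1000

print(total, average, n)
```

Each call takes the same 1000 inputs and returns one number; only the "funnel shape" differs.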
Column Aggregations
Apply aggregations to DataFrame columns.
1. Single Column Mean
```python
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 60000, 70000, 80000]
})
print(df['Age'].mean())  # 32.5
```
2. Single Column Sum
```python
print(df['Salary'].sum())  # 260000
```
3. Multiple Aggregations
```python
print(df['Age'].min())    # 25
print(df['Age'].max())    # 40
print(df['Age'].std())    # 6.454972... (sample standard deviation)
print(df['Age'].count())  # 4
```
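The four calls above can also be combined into one. A minimal sketch, assuming the same `Age` values as in the earlier example: `Series.agg` accepts a list of method names and returns a Series indexed by those names. The variable names `ages` and `summary` are introduced here for illustration.

```python
import pandas as pd

ages = pd.Series([25, 30, 35, 40], name='Age')

# One call, several aggregations; the result is indexed by method name
summary = ages.agg(['min', 'max', 'std', 'count'])
print(summary)
```

This keeps related statistics together in a single object, which can be convenient when they are reported side by side.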
DataFrame Aggregations
Apply aggregations across the entire DataFrame.
1. All Columns Mean
```python
print(df.mean())
```
```
Age          32.5
Salary    65000.0
dtype: float64
```
2. All Columns Sum
```python
print(df.sum())
```
3. Summary Statistics
```python
print(df.describe())
```
```
             Age        Salary
count   4.000000      4.000000
mean   32.500000  65000.000000
std     6.454972  12909.944487
min    25.000000  50000.000000
25%    28.750000  57500.000000
50%    32.500000  65000.000000
75%    36.250000  72500.000000
max    40.000000  80000.000000
```
Common Aggregation Methods
Methods available on Series and DataFrame.
1. Central Tendency
```python
df['Age'].mean()    # Arithmetic mean
df['Age'].median()  # Middle value
df['Age'].mode()    # Most frequent value(s), returned as a Series
```
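Unlike `mean()` and `median()`, `mode()` returns a Series rather than a scalar, because several values can tie for most frequent. The sketch below uses two made-up Series (`unique_ages` and `repeated_ages` are illustrative names) to show both cases.

```python
import pandas as pd

unique_ages = pd.Series([25, 30, 35, 40])
repeated_ages = pd.Series([25, 30, 30, 40])

# Every value appears once, so every value is a mode
print(unique_ages.mode().tolist())    # [25, 30, 35, 40]

# 30 appears twice, so it is the only mode
print(repeated_ages.mode().tolist())  # [30]

# Take the first entry when a single scalar is needed
first_mode = repeated_ages.mode().iloc[0]
```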
2. Dispersion
```python
df['Age'].std()  # Standard deviation (sample, ddof=1 by default)
df['Age'].var()  # Variance (sample, ddof=1 by default)
df['Age'].sem()  # Standard error of the mean
```
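The three dispersion measures are closely related. As a hedged sketch using the same four ages (the names `ages`, `sample_var`, and `population_var` are introduced here), the code below checks that `std()` is the square root of `var()`, that `sem()` is `std()` divided by the square root of the count, and how the `ddof` parameter switches between sample and population variance.

```python
import numpy as np
import pandas as pd

ages = pd.Series([25, 30, 35, 40])

sample_var = ages.var()            # divides by n - 1 (ddof=1, the default)
population_var = ages.var(ddof=0)  # divides by n

# std is the square root of var; sem is std divided by sqrt(n)
print(np.isclose(ages.std(), np.sqrt(sample_var)))
print(np.isclose(ages.sem(), ages.std() / np.sqrt(ages.count())))
```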
3. Quantiles
```python
df['Age'].quantile(0.25)          # First quartile (scalar)
df['Age'].quantile([0.25, 0.75])  # Multiple quantiles (returns a Series)
```
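Passing a list to `quantile()` returns a Series indexed by the requested probabilities, which makes derived statistics such as the interquartile range easy to compute. A minimal sketch with the same four ages; `ages`, `quartiles`, and `iqr` are illustrative names.

```python
import pandas as pd

ages = pd.Series([25, 30, 35, 40])

q1 = ages.quantile(0.25)                 # scalar: 28.75
quartiles = ages.quantile([0.25, 0.75])  # Series indexed by 0.25 and 0.75

# Interquartile range: third quartile minus first quartile
iqr = quartiles.loc[0.75] - quartiles.loc[0.25]
print(q1, iqr)
```

The quartile values agree with the 25% and 75% rows reported by `describe()` for the `Age` column.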
Numeric Only Aggregations
Restrict aggregations to numeric columns when a DataFrame mixes data types.
1. numeric_only Parameter
```python
df = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30],
    'Salary': [50000, 60000]
})
df.mean(numeric_only=True)  # Exclude the non-numeric 'Name' column
```
2. Select Numeric Columns
```python
df.select_dtypes(include='number').mean()
```
3. Specific Columns
```python
df[['Age', 'Salary']].mean()
```
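For this DataFrame the three approaches should give the same result. The sketch below checks that with `Series.equals`; the variable names `by_flag`, `by_dtype`, and `by_name` are introduced only for this comparison.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30],
    'Salary': [50000, 60000]
})

by_flag = df.mean(numeric_only=True)                  # let pandas pick numeric columns
by_dtype = df.select_dtypes(include='number').mean()  # filter by dtype first
by_name = df[['Age', 'Salary']].mean()                # name the columns explicitly

print(by_flag.equals(by_dtype) and by_dtype.equals(by_name))
```

Naming the columns explicitly is the most rigid option but also the most predictable, since the result does not change if new numeric columns are added later.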
Axis Parameter
Aggregate along rows or columns.
1. axis=0 (Default)
```python
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})
df.sum(axis=0)  # Sum each column
```
```
A     6
B    15
dtype: int64
```
2. axis=1
```python
df.sum(axis=1)  # Sum each row
```
```
0    5
1    7
2    9
dtype: int64
```
3. Row Mean
```python
df['RowMean'] = df.mean(axis=1)
```
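Once a derived column such as `RowMean` is stored in the DataFrame, later row-wise aggregations will include it unless the columns are selected explicitly. A minimal sketch of one way to avoid that, where `value_cols` and `RowSum` are hypothetical names introduced for illustration:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Hypothetical helper list: the columns that hold the original data
value_cols = ['A', 'B']

df['RowMean'] = df[value_cols].mean(axis=1)  # 2.5, 3.5, 4.5
df['RowSum'] = df[value_cols].sum(axis=1)    # 5, 7, 9 (RowMean is not included)

print(df)
```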
Handling Missing Values
Aggregation methods skip NaN values by default.
1. skipna=True (Default)
```python
import numpy as np

s = pd.Series([1, 2, np.nan, 4])
s.mean()  # 2.333... (ignores NaN)
```
2. skipna=False
```python
s.mean(skipna=False)  # NaN (the missing value propagates to the result)
```
3. Count Non-NaN
```python
s.count()  # 3 (only non-NaN values)
len(s)     # 4 (all values including NaN)
```
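One edge case of skipping NaN is worth seeing: when every value is missing, `sum()` has nothing left to add and returns 0.0. The `min_count` parameter of `sum()` sets how many non-NaN values are required before a numeric result is returned. A short sketch, using a made-up Series named `all_missing`:

```python
import numpy as np
import pandas as pd

all_missing = pd.Series([np.nan, np.nan])

print(all_missing.sum())             # 0.0, the sum of zero remaining values
print(all_missing.sum(min_count=1))  # nan, fewer than 1 non-NaN value
print(all_missing.count())           # 0
```

Setting `min_count=1` is a common way to tell "the total is zero" apart from "there was no data to total".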
Exercises
Exercise 1.
Create a DataFrame with columns 'math', 'science', and 'english' containing scores for five students. Use .mean() to compute the average score for each subject (column-wise) and for each student (row-wise with axis=1).
Solution to Exercise 1
Use .mean() with different axis values.
```python
import pandas as pd

df = pd.DataFrame({
    'math': [85, 90, 78, 92, 88],
    'science': [80, 85, 82, 95, 90],
    'english': [75, 88, 91, 84, 79]
})
print("Average per subject:\n", df.mean())
print("\nAverage per student:\n", df.mean(axis=1))
```
Exercise 2.
Given a DataFrame of daily stock prices for three tickers, compute the describe() statistics. Then extract only the 'mean' and 'std' rows from the result using .loc.
Solution to Exercise 2
Use describe() and extract specific rows with .loc.
```python
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({
    'AAPL': np.random.uniform(140, 160, 20),
    'MSFT': np.random.uniform(340, 360, 20),
    'GOOGL': np.random.uniform(130, 150, 20)
})
stats = df.describe()
print(stats.loc[['mean', 'std']])
```
Exercise 3.
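Create a Series with some NaN values. Demonstrate the difference between .count() and len() by printing both. Then compute the .sum() and show that NaN values are skipped by default, that skipna=False makes the result NaN, and that min_count sets how many non-NaN values are required for a non-NaN result.
Solution to Exercise 3
Compare count() with len(), then vary skipna and min_count in sum().
```python
import pandas as pd
import numpy as np

s = pd.Series([10, 20, np.nan, 40, np.nan])
print(f"count(): {s.count()}")  # 3
print(f"len(): {len(s)}")       # 5
print(f"sum(): {s.sum()}")      # 70.0 (NaN skipped)
print(f"sum(skipna=False): {s.sum(skipna=False)}")  # nan (NaN propagates)
print(f"sum(min_count=4): {s.sum(min_count=4)}")    # nan (only 3 non-NaN values)
```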