apply with axis¶
The axis parameter in apply() determines what the function receives: with axis=0 (the default) it is called once per column, and with axis=1 it is called once per row. Getting the axis right is essential for correct DataFrame operations.
Mental Model
axis=0 feeds each column as a Series to your function (the function "moves down" the rows). axis=1 feeds each row as a Series (the function "moves across" the columns). The axis number tells you which dimension is being collapsed or iterated over.
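To make the mental model concrete, here is a minimal sketch that reports what apply() hands to the function on each axis. The helper name `describe_input` and the small two-column DataFrame are illustrative assumptions, not part of the examples that follow.

```python
import pandas as pd

# Illustrative data (assumed for this sketch)
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

def describe_input(s):
    # s is the Series that apply() passes in:
    # its name is the column label (axis=0) or the row label (axis=1)
    return f"name={s.name}, index={list(s.index)}"

by_column = df.apply(describe_input, axis=0)  # one call per column
by_row = df.apply(describe_input, axis=1)     # one call per row

print(by_column['A'])   # name=A, index=[0, 1, 2]
print(by_row.loc[0])    # name=0, index=['A', 'B']
```

With axis=0 each Series is indexed by the row labels; with axis=1 each Series is indexed by the column labels, which is why row-wise functions can use `row['A']`.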
axis=0 Column-wise¶
Apply function to each column (default behavior).
1. Default Behavior¶
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
})

result = df.apply(np.sum, axis=0)
print(result)
```

```
A     6
B    15
C    24
dtype: int64
```
2. Custom Function¶
```python
def column_stats(col):
    return pd.Series({
        'mean': col.mean(),
        'std': col.std(),
        'min': col.min(),
        'max': col.max()
    })

df.apply(column_stats, axis=0)
```
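When the applied function returns a Series for each column, the results are assembled into a DataFrame whose rows are the keys of that Series. The sketch below repeats the example in self-contained form to show the resulting shape; the variable name `stats` is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

def column_stats(col):
    # One Series of statistics per column
    return pd.Series({
        'mean': col.mean(),
        'std': col.std(),
        'min': col.min(),
        'max': col.max()
    })

stats = df.apply(column_stats, axis=0)

# Rows are the statistic names, columns are the original column labels
print(stats.index.tolist())    # ['mean', 'std', 'min', 'max']
print(stats.columns.tolist())  # ['A', 'B', 'C']
```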
3. Column Normalization¶
```python
df.apply(lambda col: (col - col.mean()) / col.std(), axis=0)
```
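Column normalization can also be written without apply(), because DataFrame arithmetic aligns a Series of column means with the columns. The following sketch, using the same A/B/C data, checks that both forms agree; the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# Column-wise apply: each column is normalized by its own mean and std
via_apply = df.apply(lambda col: (col - col.mean()) / col.std(), axis=0)

# Vectorized form: df.mean() and df.std() are aligned on the column labels
vectorized = (df - df.mean()) / df.std()

assert np.allclose(via_apply, vectorized)
```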
axis=1 Row-wise¶
Apply function to each row.
1. Row Sum¶
```python
result = df.apply(np.sum, axis=1)
print(result)
```

```
0    12
1    15
2    18
dtype: int64
```
2. Row Maximum¶
```python
df.apply(lambda row: row.max(), axis=1)
```
3. Conditional Row Logic¶
```python
df.apply(
    lambda row: 'High' if row['A'] > 2 else 'Low',
    axis=1
)
```
Multiple Column Operations¶
Common patterns using axis=1.
1. Weighted Calculation¶
```python
df = pd.DataFrame({
    'quantity': [10, 20, 30],
    'price': [100, 200, 150],
    'discount': [0.1, 0.2, 0.15]
})

df['total'] = df.apply(
    lambda row: row['quantity'] * row['price'] * (1 - row['discount']),
    axis=1
)
```
2. String Concatenation¶
```python
df = pd.DataFrame({
    'first_name': ['John', 'Jane'],
    'last_name': ['Doe', 'Smith']
})

df['full_name'] = df.apply(
    lambda row: f"{row['first_name']} {row['last_name']}",
    axis=1
)
```
3. Conditional Assignment¶
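```python
# Illustrative data: this pattern needs a 'score' column
df = pd.DataFrame({'score': [45, 60, 82]})

df['status'] = df.apply(
    lambda row: 'Pass' if row['score'] >= 60 else 'Fail',
    axis=1
)
```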
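A simple two-way condition like this one can usually be expressed without a row-wise apply(). The sketch below compares the apply() version with `np.where` on assumed example scores; the data and variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical scores for illustration
df = pd.DataFrame({'score': [45, 60, 82]})

# Row-wise apply: the lambda runs once per row
via_apply = df.apply(
    lambda row: 'Pass' if row['score'] >= 60 else 'Fail',
    axis=1
)

# Vectorized alternative: evaluate the condition on the whole column at once
via_where = np.where(df['score'] >= 60, 'Pass', 'Fail')

print(via_apply.tolist())  # ['Fail', 'Pass', 'Pass']
print(via_where.tolist())  # ['Fail', 'Pass', 'Pass']
```

The apply() form remains useful when the branch logic is too involved to write as array expressions.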
result_type Parameter¶
Control the output format when applying row-wise.
1. result_type='expand'¶
```python
# Assumes df is the A/B/C DataFrame from the first example
def extract_parts(row):
    return [row['A'], row['B'], row['A'] + row['B']]

df.apply(extract_parts, axis=1, result_type='expand')
```
Returns a DataFrame with columns 0, 1, 2.
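The following sketch contrasts the expanded result with what the same function returns when result_type is left unset, and shows one way to give the expanded columns readable names. The column names `a`, `b`, and `a_plus_b` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

def extract_parts(row):
    return [row['A'], row['B'], row['A'] + row['B']]

# Without result_type, list results stay packed in a Series of lists
unexpanded = df.apply(extract_parts, axis=1)

# With result_type='expand', each list element becomes its own column
expanded = df.apply(extract_parts, axis=1, result_type='expand')
expanded.columns = ['a', 'b', 'a_plus_b']  # hypothetical names

print(type(unexpanded).__name__)       # Series
print(expanded['a_plus_b'].tolist())   # [5, 7, 9]
```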
2. result_type='reduce'¶
```python
# Returns a Series; for a scalar-returning function this matches the default (result_type=None)
df.apply(lambda row: row.sum(), axis=1, result_type='reduce')
```
3. result_type='broadcast'¶
```python
# Same shape as original DataFrame
df.apply(lambda row: row - row.mean(), axis=1, result_type='broadcast')
```
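Broadcasting also applies when the function returns a single value per row: that value is repeated across the row so the result keeps the original index and columns. A minimal sketch on the A/B/C data, with `row_max` as an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# Each row's maximum is a scalar; 'broadcast' fills the whole row with it
row_max = df.apply(lambda row: row.max(), axis=1, result_type='broadcast')

print(row_max.shape == df.shape)   # True
print(row_max.loc[0].tolist())     # [7, 7, 7]
```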
LeetCode Example: Quality Metrics¶
Calculate metrics with groupby and apply.
1. Sample Data¶
```python
queries = pd.DataFrame({
    'query_name': ['Query1', 'Query1', 'Query2', 'Query2'],
    'rating': [5, 4, 3, 2],
    'position': [2, 1, 3, 2]
})

queries['quality'] = queries['rating'] / queries['position']
queries['poor_query'] = (queries['rating'] < 3).astype(int) * 100
```
2. Round Function¶
```python
round2 = lambda x: round(x, 2)

result = (queries
          .groupby('query_name')[['quality', 'poor_query']]
          .mean()
          .apply(round2)
          .reset_index())
print(result)
```
3. Result¶
```
  query_name  quality  poor_query
0     Query1     3.25         0.0
1     Query2     1.00        50.0
```
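In this pipeline apply() uses its default axis=0, so the rounding function receives each aggregated column as a Series. As a rough check under that reading, the sketch below compares it with the built-in `DataFrame.round`; the names `means`, `via_apply`, and `via_round` are assumptions for illustration.

```python
import pandas as pd

queries = pd.DataFrame({
    'query_name': ['Query1', 'Query1', 'Query2', 'Query2'],
    'rating': [5, 4, 3, 2],
    'position': [2, 1, 3, 2]
})
queries['quality'] = queries['rating'] / queries['position']
queries['poor_query'] = (queries['rating'] < 3).astype(int) * 100

means = queries.groupby('query_name')[['quality', 'poor_query']].mean()

# Column-wise apply: round() is called on each column Series
via_apply = means.apply(lambda col: round(col, 2))

# Built-in equivalent that rounds the whole DataFrame at once
via_round = means.round(2)

assert via_apply.equals(via_round)
```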
Performance Comparison¶
Row-wise apply calls a Python function once per row, so it is typically much slower than the equivalent vectorized operation.
1. Slow Row-wise¶
```python
%%timeit
df.apply(lambda row: row['A'] + row['B'], axis=1)
```
2. Fast Vectorized¶
```python
%%timeit
df['A'] + df['B']
```
3. When to Use axis=1¶
- Complex conditional logic
- Operations that combine several columns in ways column arithmetic cannot express
- Non-vectorizable custom functions
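The `%%timeit` cells above only run in IPython or Jupyter. As a rough sketch for a plain Python script, the standard-library `timeit` module can make the same comparison. The data size, the random seed, and the number of repetitions are arbitrary assumptions, and the measured times will vary by machine.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical benchmark data: 5,000 rows of random integers
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'A': rng.integers(0, 100, 5_000),
    'B': rng.integers(0, 100, 5_000)
})

# Both approaches must produce the same values before timing them
row_wise = df.apply(lambda row: row['A'] + row['B'], axis=1)
vectorized = df['A'] + df['B']
assert (row_wise == vectorized).all()

# Time three runs of each approach
t_apply = timeit.timeit(
    lambda: df.apply(lambda row: row['A'] + row['B'], axis=1), number=3
)
t_vec = timeit.timeit(lambda: df['A'] + df['B'], number=3)

print(f"apply: {t_apply:.4f}s, vectorized: {t_vec:.4f}s")
```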
Exercises¶
Exercise 1.
Create a numeric DataFrame with 4 columns. Use apply(np.mean, axis=0) to compute the mean of each column, then apply(np.mean, axis=1) to compute the mean of each row. Verify the column means match df.mean().
Solution to Exercise 1
Apply mean along both axes.
```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 4), columns=['A', 'B', 'C', 'D'])
col_means = df.apply(np.mean, axis=0)
row_means = df.apply(np.mean, axis=1)
print("Column means:\n", col_means)
print("Row means:\n", row_means)
# Compare with a tolerance rather than exact float equality
assert np.allclose(col_means, df.mean())
```
Exercise 2.
Create a DataFrame with columns 'math', 'science', and 'english'. Use apply() with axis=1 to add a new column 'highest_subject' that contains the name of the column with the highest score for each row (use idxmax()).
Solution to Exercise 2
Find the column name of the max value per row.
```python
import pandas as pd

df = pd.DataFrame({
    'math': [85, 92, 78],
    'science': [90, 88, 95],
    'english': [88, 85, 80]
})
df['highest_subject'] = df.apply(lambda row: row.idxmax(), axis=1)
print(df)
```
Exercise 3.
Create a DataFrame and use apply() with axis=0 to return a Series with the count of values above the column mean for each column. Compare using a custom function vs a vectorized approach.
Solution to Exercise 3
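Count values above the column mean using apply, then compare with a vectorized expression.
```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(100, 3), columns=['A', 'B', 'C'])

# Custom function applied to each column
above_mean = df.apply(lambda col: (col > col.mean()).sum(), axis=0)
print(above_mean)

# Vectorized equivalent: compare the whole DataFrame against its column means
above_mean_vec = (df > df.mean()).sum()
assert (above_mean == above_mean_vec).all()
```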