One-Hot Encoding with get_dummies
One-hot encoding converts categorical variables into binary columns, enabling their use in machine learning models that require numerical input.
Mental Model
One-hot encoding replaces a single column of labels with a set of binary flag columns, one per unique label. If a column has values [red, blue, green], it becomes three columns, and in each row exactly one is 1 and the rest are 0. This trades a compact representation for one that linear models can consume directly.
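As a minimal sketch of this mental model, the flags can be built by hand by comparing the column against each unique label. The names `colors` and `manual` are illustrative only; `pd.get_dummies` produces the same flags in a single call.

```python
import pandas as pd

colors = pd.Series(['red', 'blue', 'green', 'red'])

# One boolean column per unique label: True where the row holds that label
manual = pd.DataFrame({label: colors == label for label in sorted(colors.unique())})
print(manual)

# Each row has exactly one flag set
print(manual.sum(axis=1).tolist())

# The built-in encoder yields the same flags
print(pd.get_dummies(colors))
```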
pd.get_dummies Basics
Basic Usage
```python
import pandas as pd

df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue'],
    'size': ['S', 'M', 'L', 'M', 'S'],
    'price': [10, 20, 30, 15, 25]
})

# Encode categorical columns
encoded = pd.get_dummies(df)
print(encoded)
```
price color_blue color_green color_red size_L size_M size_S
0 10 False False True False False True
1 20 True False False False True False
2 30 False True False True False False
3 15 False False True False True False
4 25 True False False False False True
Encode Specific Columns
```python
# Only encode the 'color' column
encoded = pd.get_dummies(df, columns=['color'])
print(encoded)
```
size price color_blue color_green color_red
0 S 10 False False True
1 M 20 True False False
2 L 30 False True False
3 M 15 False False True
4 S 25 True False False
Custom Prefix
```python
# Add custom prefix to encoded columns
encoded = pd.get_dummies(df, columns=['color'], prefix='c')
print(encoded.columns)
# Index(['size', 'price', 'c_blue', 'c_green', 'c_red'], dtype='object')

# Multiple columns with different prefixes
encoded = pd.get_dummies(
    df,
    columns=['color', 'size'],
    prefix={'color': 'col', 'size': 'sz'}
)
print(encoded.columns)
```
Handling Data Types
dtype Parameter
```python
# Default: bool dtype (memory efficient)
encoded = pd.get_dummies(df['color'])
print(encoded.dtypes)

# Integer dtype for ML compatibility
encoded = pd.get_dummies(df['color'], dtype=int)
print(encoded.dtypes)

# Float dtype
encoded = pd.get_dummies(df['color'], dtype=float)
```
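To put a number on the "memory efficient" remark, the following sketch compares the footprint of the same encoding under three dtypes. The `labels` Series is made up for illustration, and the exact byte count for `int` depends on the platform's default integer width.

```python
import pandas as pd

labels = pd.Series(['red', 'blue', 'green'] * 1000)

as_bool = pd.get_dummies(labels, dtype=bool)
as_int = pd.get_dummies(labels, dtype=int)
as_float = pd.get_dummies(labels, dtype=float)

for name, frame in [('bool', as_bool), ('int', as_int), ('float', as_float)]:
    # index=False counts only the encoded columns, not the row index
    print(name, frame.memory_usage(index=False).sum(), 'bytes')
```

A bool column uses one byte per row, while a float64 column uses eight, so the dtype choice matters once the number of dummy columns grows.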
Drop First (Dummy Variable Trap)
In regression models with an intercept, including a dummy column for every category creates perfect multicollinearity, because the dummy columns always sum to 1 and so duplicate the intercept. Use drop_first=True to avoid this.
The Problem
```python
# Full encoding: 3 columns for 3 colors
# red + blue + green always sums to 1 (perfect collinearity with the intercept)
full = pd.get_dummies(df['color'])
print(full)
```
blue green red
0 False False True
1 True False False
2 False True False
3 False False True
4 True False False
The Solution
```python
# Drop first category ('blue'): 2 columns are sufficient
# If green=0 and red=0, it's blue
reduced = pd.get_dummies(df['color'], drop_first=True)
print(reduced)
```
green red
0 False True
1 False False
2 True False
3 False True
4 False False
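The collinearity can be checked numerically. This is a small sketch, assuming a design matrix made of an intercept column plus the dummy columns; the variable names are illustrative. A matrix whose rank is lower than its column count contains a redundant column.

```python
import numpy as np
import pandas as pd

colors = pd.Series(['red', 'blue', 'green', 'red', 'blue'])
intercept = np.ones((len(colors), 1))

full = pd.get_dummies(colors, dtype=float).to_numpy()
reduced = pd.get_dummies(colors, drop_first=True, dtype=float).to_numpy()

X_full = np.hstack([intercept, full])        # intercept + 3 dummies
X_reduced = np.hstack([intercept, reduced])  # intercept + 2 dummies

# Rank below the column count means one column is a linear combination of the others
print(X_full.shape[1], np.linalg.matrix_rank(X_full))
print(X_reduced.shape[1], np.linalg.matrix_rank(X_reduced))
```

With the full encoding the matrix has four columns but rank three; after dropping the first category, the column count and the rank agree.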
Handling Missing Values
Default Behavior
```python
df_missing = pd.DataFrame({
    'color': ['red', 'blue', None, 'red', pd.NA]
})

# NaN is not encoded by default
encoded = pd.get_dummies(df_missing['color'])
print(encoded)
```
blue red
0 False True
1 True False
2 False False # NaN row: all False
3 False True
4 False False # NA row: all False
Create NaN Indicator Column
```python
# Create column for missing values
encoded = pd.get_dummies(df_missing['color'], dummy_na=True)
print(encoded)
```
blue red NaN
0 False True False
1 True False False
2 False False True
3 False True False
4 False False True
Encoding Series vs DataFrame
Series Encoding
```python
s = pd.Series(['A', 'B', 'A', 'C'])
encoded = pd.get_dummies(s)
print(encoded)
```
A B C
0 True False False
1 False True False
2 True False False
3 False False True
DataFrame Encoding
```python
# Encodes ALL object/category columns automatically
# (named df_mixed so the earlier df with 'color' stays available)
df_mixed = pd.DataFrame({
    'cat1': ['A', 'B', 'A'],
    'cat2': ['X', 'Y', 'X'],
    'num': [1, 2, 3]
})
encoded = pd.get_dummies(df_mixed)
print(encoded)
```
Sparse Output
For columns with many categories, most entries of the encoded result are zero. Passing sparse=True stores the dummy columns as sparse arrays, which can save memory.
```python
# Dense (default)
encoded_dense = pd.get_dummies(df['color'])
print(f"Dense memory: {encoded_dense.memory_usage(deep=True).sum()} bytes")

# Sparse
encoded_sparse = pd.get_dummies(df['color'], sparse=True)
print(f"Sparse memory: {encoded_sparse.memory_usage(deep=True).sum()} bytes")
```
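On a five-row column the difference is negligible, so the saving is easier to see on a larger input. The sketch below uses a synthetic high-cardinality column (10,000 rows, up to 500 distinct labels); the data and variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical high-cardinality column: 10,000 rows, up to 500 distinct labels
labels = pd.Series(rng.integers(0, 500, size=10_000)).astype(str)

dense = pd.get_dummies(labels)
sparse = pd.get_dummies(labels, sparse=True)

dense_bytes = dense.memory_usage(deep=True).sum()
sparse_bytes = sparse.memory_usage(deep=True).sum()
print(f"Dense:  {dense_bytes:,} bytes")
print(f"Sparse: {sparse_bytes:,} bytes")
```

Each row sets only one flag, so the sparse result stores roughly one entry per row instead of one per row and category.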
Practical Examples
1. Preparing Data for Machine Learning
```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Sample dataset
df = pd.DataFrame({
    'age': [25, 35, 45, 30, 50],
    'education': ['High School', 'Bachelor', 'Master', 'Bachelor', 'PhD'],
    'income_level': ['Low', 'Medium', 'High', 'Medium', 'High'],
    'purchased': [0, 1, 1, 0, 1]
})

# Encode categorical variables
X = pd.get_dummies(
    df[['age', 'education', 'income_level']],
    drop_first=True,
    dtype=int
)
y = df['purchased']

# Train model
model = LogisticRegression()
model.fit(X, y)
print(f"Features: {list(X.columns)}")
```
2. Handling New Categories in Test Data
```python
# Training data
train = pd.DataFrame({'color': ['red', 'blue', 'green']})
train_encoded = pd.get_dummies(train['color'])

# Test data has new category 'yellow'
test = pd.DataFrame({'color': ['red', 'yellow', 'blue']})
test_encoded = pd.get_dummies(test['color'])

# Problem: different columns!
print(f"Train columns: {list(train_encoded.columns)}")
print(f"Test columns: {list(test_encoded.columns)}")

# Solution: reindex to match training columns.
# The unseen 'yellow' column is dropped; the missing 'green' column is
# filled with False so it matches the bool dtype of the other columns.
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=False)
print(test_aligned)
```
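An alternative to reindexing, sketched here under the assumption that the category set is known from the training data, is to cast both columns to a shared `CategoricalDtype` before encoding. `pd.get_dummies` then emits one column per declared category, whether or not it appears in the data. The name `color_dtype` is illustrative.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

train = pd.DataFrame({'color': ['red', 'blue', 'green']})
test = pd.DataFrame({'color': ['red', 'yellow', 'blue']})

# Fix the category set from the training data
color_dtype = CategoricalDtype(categories=sorted(train['color'].unique()))

train_encoded = pd.get_dummies(train['color'].astype(color_dtype))
# 'yellow' is not a declared category, so it becomes NaN and encodes as all-False
test_encoded = pd.get_dummies(test['color'].astype(color_dtype))

print(list(train_encoded.columns))
print(list(test_encoded.columns))
print(test_encoded)
```

Both approaches silently discard the unseen label, so it may be worth logging how many rows end up with no flag set.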
3. Financial Sector Analysis
```python
import pandas as pd

# Hand-built portfolio with sector labels (simplified example)
portfolio = pd.DataFrame({
    'ticker': ['AAPL', 'JPM', 'JNJ', 'XOM', 'GOOGL'],
    'sector': ['Technology', 'Financial', 'Healthcare', 'Energy', 'Technology'],
    'weight': [0.25, 0.20, 0.20, 0.15, 0.20]
})

# Create sector exposure matrix
sector_exposure = pd.get_dummies(portfolio['sector'], dtype=float)

# Weight by portfolio allocation
weighted_exposure = sector_exposure.multiply(portfolio['weight'], axis=0)
print("Sector Exposure:")
print(weighted_exposure)
print("\nTotal Sector Allocation:")
print(weighted_exposure.sum())
```
4. Time-Based Categorical Features
```python
# Create time-based features
dates = pd.date_range('2024-01-01', periods=10, freq='D')
df = pd.DataFrame({'date': dates, 'value': range(10)})

df['day_of_week'] = df['date'].dt.day_name()
df['is_weekend'] = df['date'].dt.dayofweek >= 5

# Encode day of week
encoded = pd.get_dummies(
    df[['value', 'day_of_week', 'is_weekend']],
    columns=['day_of_week'],
    drop_first=True
)
print(encoded.head())
```
Comparison: get_dummies vs sklearn
```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'red']})

# pandas get_dummies
pd_encoded = pd.get_dummies(df['color'], dtype=int)

# sklearn OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)
sk_encoded = encoder.fit_transform(df[['color']])
sk_df = pd.DataFrame(sk_encoded, columns=encoder.get_feature_names_out())
```
| Feature | pd.get_dummies | sklearn OneHotEncoder |
|---|---|---|
| Returns | DataFrame | SciPy sparse matrix (default); NumPy array with sparse_output=False |
| Handles new categories | No (must reindex) | Yes (with handle_unknown) |
| Fit/transform pattern | No | Yes |
| Column names | Automatic | Via get_feature_names_out() |
| Best for | Quick exploration | Production ML pipelines |
Key Parameters
| Parameter | Description | Default |
|---|---|---|
| `data` | Input DataFrame or Series | Required |
| `columns` | Columns to encode (DataFrame only) | None (all object/category) |
| `prefix` | String prefix for column names | None |
| `prefix_sep` | Separator between prefix and value | '_' |
| `drop_first` | Drop first category | False |
| `dummy_na` | Add NaN indicator column | False |
| `dtype` | Data type for encoded columns | bool |
| `sparse` | Return sparse DataFrame | False |
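As a brief illustration of how `prefix` and `prefix_sep` combine, the sketch below passes one prefix per encoded column and a custom separator. The small DataFrame and the prefixes `col` and `sz` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue'], 'size': ['S', 'M']})

# One prefix per encoded column, joined to the category value with '='
encoded = pd.get_dummies(df, prefix=['col', 'sz'], prefix_sep='=', dtype=int)
print(list(encoded.columns))
```

A separator that cannot occur inside the category values makes the column names easier to split back into column and value later.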
Common Pitfalls
1. Not Handling Unseen Categories
```python
# Training
train_encoded = pd.get_dummies(train_df['category'])

# Test has new category - causes mismatch
test_encoded = pd.get_dummies(test_df['category'])

# Fix: align columns (False matches the default bool dtype)
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=False)
```
2. Forgetting drop_first for Regression
```python
# For unregularized linear regression with an intercept, use drop_first=True
X = pd.get_dummies(df[categorical_cols], drop_first=True)
```
3. Memory Issues with High Cardinality
```python
# A column with 10,000 unique values creates 10,000 columns!
# Consider: grouping rare categories, target encoding, or embeddings

# Count each category once, then keep only labels seen more than 100 times
counts = df['category'].value_counts()
df['category_grouped'] = df['category'].where(
    df['category'].map(counts) > 100, 'Other'
)
```
Exercises
Exercise 1. Write code that converts a categorical column to dummy variables using pd.get_dummies(). Show the resulting DataFrame.
Solution to Exercise 1
```python
import pandas as pd

# Convert a categorical column to dummy variables
df = pd.DataFrame({'grade': ['a', 'b', 'c', 'a', 'b']})
dummies = pd.get_dummies(df['grade'])
print(dummies)
```
Exercise 2. Explain the difference between one-hot encoding and label encoding. Give an example of when each is appropriate.
Solution to Exercise 2
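One-hot encoding creates one binary column per category and implies no order between categories; it suits nominal variables such as color, especially in linear models. Label encoding maps each category to a single integer code (for example via .cat.codes); it is more compact and suits ordinal variables such as low < medium < high, or tree-based models that do not interpret the codes as magnitudes.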
Exercise 3. Write code that performs label encoding by using .cat.codes on a categorical Series. Print the codes alongside the original values.
Solution to Exercise 3
```python
import pandas as pd

s = pd.Series(['low', 'high', 'medium', 'low'], dtype='category')

# .cat.codes gives each value's integer position in s.cat.categories
result = pd.DataFrame({'value': s, 'code': s.cat.codes})
print(s.cat.categories)
print(result)
```
Exercise 4. Create a DataFrame with a categorical column and use pd.get_dummies(drop_first=True) to avoid the dummy variable trap. Explain why drop_first is useful.
Solution to Exercise 4
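```python
import pandas as pd

df = pd.DataFrame({'level': ['low', 'medium', 'high', 'low']})

# drop_first=True drops the first category ('high' in sorted order).
# That column is redundant: level_low=0 and level_medium=0 implies 'high'.
# Removing it avoids perfect collinearity with a regression intercept.
encoded = pd.get_dummies(df, columns=['level'], drop_first=True)
print(encoded)
```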