Creating Categoricals

There are several ways to create categorical data in pandas, from simple type conversion to explicit construction with custom categories and ordering.

Method 1: Using astype('category')

Calling astype('category') is the simplest way to convert existing data to categorical.

import pandas as pd

# From a Series
s = pd.Series(['apple', 'banana', 'apple', 'cherry'])
s_cat = s.astype('category')
print(s_cat)
0     apple
1    banana
2     apple
3    cherry
dtype: category
Categories (3, object): ['apple', 'banana', 'cherry']

# From a DataFrame column
df = pd.DataFrame({
    'product': ['A', 'B', 'A', 'C', 'B'],
    'price': [100, 200, 100, 300, 200]
})

df['product'] = df['product'].astype('category')
print(df['product'].dtype)  # category

Categories are Automatically Inferred

When using astype('category'), pandas automatically:

- Identifies unique values
- Creates categories in sorted order (alphabetical for strings)
- Assigns integer codes to each value

s = pd.Series(['zebra', 'apple', 'mango', 'apple'])
s_cat = s.astype('category')
print(s_cat.cat.categories)  # ['apple', 'mango', 'zebra'] (sorted)
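
The integer codes mentioned above can be inspected through the .cat.codes accessor. A minimal sketch, reusing the same sample Series, of how each value maps to a position in the sorted categories:

```python
import pandas as pd

s = pd.Series(['zebra', 'apple', 'mango', 'apple'])
s_cat = s.astype('category')

# Categories are stored once, in sorted order
categories = s_cat.cat.categories.tolist()

# Each value is stored as the integer position of its category
codes = s_cat.cat.codes.tolist()

print(categories)  # ['apple', 'mango', 'zebra']
print(codes)       # [2, 0, 1, 0]
```

Here 'zebra' gets code 2 because it sorts last, even though it appears first in the data.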

Method 2: Using pd.Categorical()

Use pd.Categorical() when you need explicit control over categories and ordering.

Basic Construction

cat = pd.Categorical(['a', 'b', 'c', 'a', 'b'])
print(cat)
['a', 'b', 'c', 'a', 'b']
Categories (3, object): ['a', 'b', 'c']

Specifying Categories

Define the allowed categories explicitly:

# Only these categories are valid
cat = pd.Categorical(
    ['small', 'medium', 'large', 'small'],
    categories=['small', 'medium', 'large', 'extra-large']
)
print(cat)
['small', 'medium', 'large', 'small']
Categories (4, object): ['small', 'medium', 'large', 'extra-large']

Note: 'extra-large' is a valid category even though it doesn't appear in the data.
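
One place where an unused category shows up is in counts. The sketch below, which reuses the sizes example, suggests how value_counts() reports declared categories with zero occurrences and how remove_unused_categories() drops them; the variable names are illustrative only.

```python
import pandas as pd

cat = pd.Categorical(
    ['small', 'medium', 'large', 'small'],
    categories=['small', 'medium', 'large', 'extra-large']
)

# value_counts() on a categorical Series reports every declared category,
# including those that never occur in the data
counts = pd.Series(cat).value_counts(sort=False)
count_by_size = dict(zip(counts.index.tolist(), counts.tolist()))
print(count_by_size)

# Drop categories that never occur in the data
trimmed = cat.remove_unused_categories()
print(trimmed.categories.tolist())  # ['small', 'medium', 'large']
```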

Ordered Categories

Create categories with logical ordering:

cat = pd.Categorical(
    ['medium', 'small', 'large', 'small'],
    categories=['small', 'medium', 'large'],
    ordered=True
)
print(cat)
['medium', 'small', 'large', 'small']
Categories (3, object): ['small' < 'medium' < 'large']

The < symbols indicate ordering.
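
Ordering is what makes min, max, comparisons, and sorting meaningful. A short sketch, using the same ordered categorical, of what that enables:

```python
import pandas as pd

cat = pd.Categorical(
    ['medium', 'small', 'large', 'small'],
    categories=['small', 'medium', 'large'],
    ordered=True
)

# min/max follow the declared order, not alphabetical order
smallest = cat.min()
largest = cat.max()
print(smallest, largest)  # small large

# Element-wise comparison against a value taken from the categories
above_small = (cat > 'small').tolist()
print(above_small)  # [True, False, True, False]

# Sorting also uses the declared order
sorted_values = pd.Series(cat).sort_values().tolist()
print(sorted_values)  # ['small', 'small', 'medium', 'large']
```

Without ordered=True, calls such as cat.min() and cat > 'small' raise a TypeError.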

Handling Invalid Values

Values not in categories become NaN:

cat = pd.Categorical(
    ['a', 'b', 'c', 'd'],  # 'd' not in categories
    categories=['a', 'b', 'c']
)
print(cat)
['a', 'b', 'c', NaN]
Categories (3, object): ['a', 'b', 'c']
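
Because the conversion to NaN raises no warning, it can help to check for unexpected values first. The following is one possible approach, not the only one: it uses isin() on the raw data before converting, then confirms the result through the codes, where missing entries are stored as -1. The names raw and allowed are illustrative.

```python
import pandas as pd

raw = pd.Series(['a', 'b', 'c', 'd'])
allowed = ['a', 'b', 'c']

# Find values that would be lost before converting
invalid = raw[~raw.isin(allowed)].tolist()
print(invalid)  # ['d']

cat = pd.Categorical(raw, categories=allowed)

# After conversion, missing entries carry the code -1
codes = cat.codes.tolist()
n_missing = int(cat.isna().sum())
print(codes)      # [0, 1, 2, -1]
print(n_missing)  # 1
```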

Method 3: Using pd.CategoricalDtype

Define a categorical type for reuse across multiple columns.

# Define the dtype once
size_dtype = pd.CategoricalDtype(
    categories=['S', 'M', 'L', 'XL'],
    ordered=True
)

# Apply to multiple columns
df = pd.DataFrame({
    'shirt_size': ['M', 'L', 'S', 'XL'],
    'pants_size': ['L', 'M', 'M', 'L']
})

df['shirt_size'] = df['shirt_size'].astype(size_dtype)
df['pants_size'] = df['pants_size'].astype(size_dtype)

print(df.dtypes)
shirt_size    category
pants_size    category
dtype: object
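
Sharing one dtype also means the two columns can be compared with each other, which pandas only allows when both sides have the same categories and ordering. A minimal sketch with the same data:

```python
import pandas as pd

size_dtype = pd.CategoricalDtype(
    categories=['S', 'M', 'L', 'XL'],
    ordered=True
)

df = pd.DataFrame({
    'shirt_size': ['M', 'L', 'S', 'XL'],
    'pants_size': ['L', 'M', 'M', 'L']
})
df['shirt_size'] = df['shirt_size'].astype(size_dtype)
df['pants_size'] = df['pants_size'].astype(size_dtype)

# Both columns carry the identical dtype
same_dtype = df['shirt_size'].dtype == df['pants_size'].dtype
print(same_dtype)  # True

# Row-wise comparison uses the declared order S < M < L < XL
shirt_bigger = (df['shirt_size'] > df['pants_size']).tolist()
print(shirt_bigger)  # [False, True, False, True]
```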

CategoricalDtype with read_csv

# Define dtype before reading; with ordered=True the list order sets the rank
# (as written here, AAA < AA < A < BBB < BB < B)
rating_dtype = pd.CategoricalDtype(
    categories=['AAA', 'AA', 'A', 'BBB', 'BB', 'B'],
    ordered=True
)

# Apply during read
df = pd.read_csv('bonds.csv', dtype={'rating': rating_dtype})
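
To try this without a file on disk, the sketch below reads from an in-memory string instead. The CSV text is made-up sample data standing in for bonds.csv, and the column names are assumptions.

```python
import io
import pandas as pd

# Hypothetical file contents standing in for bonds.csv
csv_text = (
    "issuer,rating\n"
    "Company A,AA\n"
    "Company B,BBB\n"
    "Company C,A\n"
)

rating_dtype = pd.CategoricalDtype(
    categories=['AAA', 'AA', 'A', 'BBB', 'BB', 'B'],
    ordered=True
)

# The column is parsed directly into the categorical dtype
df = pd.read_csv(io.StringIO(csv_text), dtype={'rating': rating_dtype})

print(df['rating'].dtype == rating_dtype)  # True
print(df['rating'].cat.ordered)            # True
```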

Method 4: During DataFrame Creation

Specify categorical dtype when creating a DataFrame:

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'grade': pd.Categorical(['A', 'B', 'A'], ordered=True)
})

print(df['grade'].dtype)  # category

Method 5: Using Series Constructor

s = pd.Series(
    pd.Categorical(['low', 'medium', 'high', 'low']),
    name='priority'
)
print(s)
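
The Series constructor also accepts a dtype argument, which avoids wrapping the data in pd.Categorical(). A short sketch comparing inferred categories with an explicit dtype; priority_dtype is a name introduced here for illustration:

```python
import pandas as pd

# Shorthand: let the Series constructor infer the categories
s_inferred = pd.Series(
    ['low', 'medium', 'high', 'low'],
    dtype='category',
    name='priority'
)
inferred_categories = s_inferred.cat.categories.tolist()
print(inferred_categories)  # ['high', 'low', 'medium']

# Pass a CategoricalDtype to control the category order
priority_dtype = pd.CategoricalDtype(['low', 'medium', 'high'], ordered=True)
s_ordered = pd.Series(
    ['low', 'medium', 'high', 'low'],
    dtype=priority_dtype,
    name='priority'
)
ordered_categories = s_ordered.cat.categories.tolist()
print(ordered_categories)  # ['low', 'medium', 'high']
print(s_ordered.max())     # high
```

Inferred categories are sorted alphabetically, so 'high' comes before 'low'. When the logical order matters, an explicit dtype is the safer choice.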

Practical Examples

Stock Sectors

import pandas as pd

# Define valid sectors
sector_dtype = pd.CategoricalDtype(categories=[
    'Technology', 'Healthcare', 'Finance', 'Energy',
    'Consumer Discretionary', 'Consumer Staples',
    'Industrials', 'Materials', 'Utilities',
    'Real Estate', 'Communication Services'
])

# Create stock data
stocks = pd.DataFrame({
    'ticker': ['AAPL', 'JNJ', 'JPM', 'XOM', 'AMZN'],
    'sector': ['Technology', 'Healthcare', 'Finance', 'Energy', 'Consumer Discretionary']
})

stocks['sector'] = stocks['sector'].astype(sector_dtype)
print(stocks['sector'].cat.categories)
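
Declaring all sectors up front affects grouping as well. The sketch below, using the same data, contrasts observed=False (every declared sector, including empty ones) with observed=True (only sectors present in the data).

```python
import pandas as pd

sector_dtype = pd.CategoricalDtype(categories=[
    'Technology', 'Healthcare', 'Finance', 'Energy',
    'Consumer Discretionary', 'Consumer Staples',
    'Industrials', 'Materials', 'Utilities',
    'Real Estate', 'Communication Services'
])

stocks = pd.DataFrame({
    'ticker': ['AAPL', 'JNJ', 'JPM', 'XOM', 'AMZN'],
    'sector': ['Technology', 'Healthcare', 'Finance', 'Energy', 'Consumer Discretionary']
})
stocks['sector'] = stocks['sector'].astype(sector_dtype)

# One row per declared sector, with 0 for sectors that have no stocks
all_sectors = stocks.groupby('sector', observed=False).size()

# One row per sector that actually appears in the data
seen_sectors = stocks.groupby('sector', observed=True).size()

print(len(all_sectors), len(seen_sectors))  # 11 5
print(int(all_sectors.loc['Utilities']))    # 0
```

Passing observed explicitly avoids depending on the default, which differs between pandas versions.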

Credit Ratings

# Ratings with natural order (AAA is best)
rating_dtype = pd.CategoricalDtype(
    categories=['D', 'C', 'CC', 'CCC', 'B', 'BB', 'BBB', 'A', 'AA', 'AAA'],
    ordered=True
)

bonds = pd.DataFrame({
    'issuer': ['Company A', 'Company B', 'Company C'],
    'rating': ['AA', 'BBB', 'A']
})

bonds['rating'] = bonds['rating'].astype(rating_dtype)

# Now we can compare ratings; investment grade is BBB or better
print(bonds[bonds['rating'] >= 'BBB'])  # Investment grade
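
The same ordering drives sorting and aggregation. A minimal sketch with the bonds data above, ranking issuers from best to worst rating:

```python
import pandas as pd

rating_dtype = pd.CategoricalDtype(
    categories=['D', 'C', 'CC', 'CCC', 'B', 'BB', 'BBB', 'A', 'AA', 'AAA'],
    ordered=True
)

bonds = pd.DataFrame({
    'issuer': ['Company A', 'Company B', 'Company C'],
    'rating': ['AA', 'BBB', 'A']
})
bonds['rating'] = bonds['rating'].astype(rating_dtype)

# Sort by credit quality rather than alphabetically
best_first = bonds.sort_values('rating', ascending=False)['issuer'].tolist()
print(best_first)  # ['Company A', 'Company C', 'Company B']

# min/max respect the declared order
lowest = bonds['rating'].min()
highest = bonds['rating'].max()
print(lowest, highest)  # BBB AA
```

An alphabetical sort of the raw strings would have put 'A' first and 'BBB' last, which only happens to be close to the right answer here.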

Survey Responses

# Likert scale with order
likert_dtype = pd.CategoricalDtype(
    categories=[
        'Strongly Disagree',
        'Disagree', 
        'Neutral',
        'Agree',
        'Strongly Agree'
    ],
    ordered=True
)

survey = pd.DataFrame({
    'respondent_id': [1, 2, 3, 4, 5],
    'satisfaction': ['Agree', 'Neutral', 'Strongly Agree', 'Disagree', 'Agree']
})

survey['satisfaction'] = survey['satisfaction'].astype(likert_dtype)

# Find positive responses
positive = survey[survey['satisfaction'] > 'Neutral']
print(positive)
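
Because the categories are ordered, their integer codes can double as a numeric score. The sketch below maps the responses to a 1-5 scale by adding 1 to the codes; treating Likert responses as evenly spaced numbers is an assumption of this example, not something pandas implies.

```python
import pandas as pd

likert_dtype = pd.CategoricalDtype(
    categories=[
        'Strongly Disagree',
        'Disagree',
        'Neutral',
        'Agree',
        'Strongly Agree'
    ],
    ordered=True
)

survey = pd.DataFrame({
    'respondent_id': [1, 2, 3, 4, 5],
    'satisfaction': ['Agree', 'Neutral', 'Strongly Agree', 'Disagree', 'Agree']
})
survey['satisfaction'] = survey['satisfaction'].astype(likert_dtype)

# Codes run 0-4 in category order, so adding 1 gives a 1-5 score
scores = (survey['satisfaction'].cat.codes + 1).tolist()
print(scores)  # [4, 3, 5, 2, 4]

mean_score = sum(scores) / len(scores)
print(mean_score)  # 3.6
```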

Summary of Creation Methods

| Method | Use Case | Example |
|---|---|---|
| astype('category') | Quick conversion | s.astype('category') |
| pd.Categorical() | Custom categories/order | pd.Categorical(data, categories=[...]) |
| pd.CategoricalDtype | Reusable type definition | dtype = pd.CategoricalDtype(...) |
| In DataFrame creation | Direct specification | pd.DataFrame({'col': pd.Categorical(...)}) |

Best Practices

  1. Define categories explicitly when you know the valid values
  2. Use ordered=True when categories have logical order
  3. Use CategoricalDtype for consistency across multiple columns
  4. Specify dtype in read_csv to save memory on large files
  5. Check for invalid values before converting: anything outside the declared categories silently becomes NaN
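
The memory saving mentioned in point 4 can be checked with memory_usage(deep=True). A rough sketch on made-up data with three repeated labels; the exact byte counts depend on the pandas version and string backend, so none are shown here.

```python
import pandas as pd

# Made-up data: 150,000 rows drawn from three repeated labels
labels = pd.Series(['Technology', 'Healthcare', 'Finance'] * 50_000)

# Memory used when every row stores its own string
as_text_bytes = labels.memory_usage(deep=True)

# Memory used when rows store small integer codes instead
as_category_bytes = labels.astype('category').memory_usage(deep=True)

print(as_text_bytes, as_category_bytes)
print(as_category_bytes < as_text_bytes)  # True
```

The saving comes from storing each distinct label once, so it shrinks as the number of unique values approaches the number of rows.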