Creating Categoricals

There are several ways to create categorical data in pandas, from simple type conversion to explicit construction with custom categories and ordering.

Mental Model

Creating a categorical is like defining an enum: you declare the allowed values (the categories), and each element is then stored as an integer code that indexes into that list. The simplest path is astype('category'), which infers the categories from the data; CategoricalDtype gives you full control over which values are allowed and in what order.
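To make the enum analogy concrete, here is a minimal sketch (the color values are made up for illustration) that inspects the two pieces a categorical stores: the list of categories and the integer codes pointing into it.

```python
import pandas as pd

cat = pd.Categorical(['red', 'green', 'red', 'blue'])

# The "enum": the distinct allowed values, stored once
print(cat.categories.tolist())  # ['blue', 'green', 'red']

# The "pointers": one integer per element, indexing into categories
print(cat.codes.tolist())       # [2, 1, 2, 0]

# Looking each code up in the categories recovers the original values
restored = cat.categories.take(cat.codes).tolist()
print(restored)                 # ['red', 'green', 'red', 'blue']
```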

Method 1: Using astype('category')

The simplest way to convert existing data to categorical.

```python
import pandas as pd

# From a Series
s = pd.Series(['apple', 'banana', 'apple', 'cherry'])
s_cat = s.astype('category')
print(s_cat)
```

```
0     apple
1    banana
2     apple
3    cherry
dtype: category
Categories (3, object): ['apple', 'banana', 'cherry']
```

```python
# From a DataFrame column
df = pd.DataFrame({
    'product': ['A', 'B', 'A', 'C', 'B'],
    'price': [100, 200, 100, 300, 200]
})

df['product'] = df['product'].astype('category')
print(df['product'].dtype)  # category
```

Categories are Automatically Inferred

When using astype('category'), pandas automatically:

1. Identifies the unique values
2. Creates the categories in sorted order (alphabetical for strings); if the values cannot be sorted, the categories keep their order of appearance
3. Assigns an integer code to each value

```python
s = pd.Series(['zebra', 'apple', 'mango', 'apple'])
s_cat = s.astype('category')
print(s_cat.cat.categories)  # ['apple', 'mango', 'zebra'] (sorted)
```
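The codes follow the sorted categories, not the order in which values appear. The sketch below also includes a missing value (an addition for illustration, not part of the example above) to show that NaN is not turned into a category and is stored with the code -1.

```python
import numpy as np
import pandas as pd

s = pd.Series(['zebra', 'apple', np.nan, 'apple']).astype('category')

# Categories are sorted; NaN is not one of them
print(s.cat.categories.tolist())  # ['apple', 'zebra']

# 'zebra' is category 1, 'apple' is category 0, NaN is -1
print(s.cat.codes.tolist())       # [1, 0, -1, 0]
```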

Method 2: Using pd.Categorical()

For explicit control over categories and ordering.

Basic Construction

```python
cat = pd.Categorical(['a', 'b', 'c', 'a', 'b'])
print(cat)
```

```
['a', 'b', 'c', 'a', 'b']
Categories (3, object): ['a', 'b', 'c']
```

Specifying Categories

Define the allowed categories explicitly:

```python
# Only these categories are valid
cat = pd.Categorical(
    ['small', 'medium', 'large', 'small'],
    categories=['small', 'medium', 'large', 'extra-large']
)
print(cat)
```

```
['small', 'medium', 'large', 'small']
Categories (4, object): ['small', 'medium', 'large', 'extra-large']
```

Note: 'extra-large' is a valid category even though it doesn't appear in the data.
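One practical consequence of an unused category is that it still shows up in summaries. The following sketch wraps the same data in a Series (an assumption made here so that value_counts and the .cat accessor are available) and then drops the unused category.

```python
import pandas as pd

sizes = pd.Series(pd.Categorical(
    ['small', 'medium', 'large', 'small'],
    categories=['small', 'medium', 'large', 'extra-large']
))

# Every declared category is counted, including the unused one
counts = sizes.value_counts()
print(counts)  # 'extra-large' is listed with a count of 0

# Drop categories that never occur in the data
trimmed = sizes.cat.remove_unused_categories()
print(trimmed.cat.categories.tolist())  # ['small', 'medium', 'large']
```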

Ordered Categories

Create categories with logical ordering:

```python
cat = pd.Categorical(
    ['medium', 'small', 'large', 'small'],
    categories=['small', 'medium', 'large'],
    ordered=True
)
print(cat)
```

```
['medium', 'small', 'large', 'small']
Categories (3, object): ['small' < 'medium' < 'large']
```

The < symbols indicate ordering.
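As a short sketch of what the ordering enables, the same values wrapped in a Series can be sorted, reduced with min and max, and compared against a category label. All of these follow the declared order rather than alphabetical order.

```python
import pandas as pd

s = pd.Series(pd.Categorical(
    ['medium', 'small', 'large', 'small'],
    categories=['small', 'medium', 'large'],
    ordered=True
))

# Sorting follows small < medium < large, not the alphabet
print(s.sort_values().tolist())  # ['small', 'small', 'medium', 'large']

# min and max are defined only because the categorical is ordered
print(s.min(), s.max())          # small large

# Comparison against a category label
print((s > 'small').tolist())    # [True, False, True, False]
```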

Handling Invalid Values

Values not in categories become NaN:

```python
cat = pd.Categorical(
    ['a', 'b', 'c', 'd'],  # 'd' not in categories
    categories=['a', 'b', 'c']
)
print(cat)
```

```
['a', 'b', 'c', NaN]
Categories (3, object): ['a', 'b', 'c']
```
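Because the conversion does not raise an error, it can help to look for unexpected values first. This is one possible check, not the only one; the names allowed, raw, and unexpected are illustrative.

```python
import pandas as pd

allowed = ['a', 'b', 'c']
raw = pd.Series(['a', 'b', 'c', 'd'])

# Values that are present but not in the allowed list
unexpected = raw[~raw.isin(allowed) & raw.notna()]
print(unexpected.tolist())    # ['d']

# After conversion, the same value shows up as a missing entry
cat = pd.Categorical(raw, categories=allowed)
print(int(cat.isna().sum()))  # 1
```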

Method 3: Using pd.CategoricalDtype

Define a categorical type for reuse across multiple columns.

```python
# Define the dtype once
size_dtype = pd.CategoricalDtype(
    categories=['S', 'M', 'L', 'XL'],
    ordered=True
)

# Apply to multiple columns
df = pd.DataFrame({
    'shirt_size': ['M', 'L', 'S', 'XL'],
    'pants_size': ['L', 'M', 'M', 'L']
})

df['shirt_size'] = df['shirt_size'].astype(size_dtype)
df['pants_size'] = df['pants_size'].astype(size_dtype)

print(df.dtypes)
```

```
shirt_size    category
pants_size    category
dtype: object
```
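Sharing one dtype has a benefit beyond consistency: columns with identical categories and ordering can be compared element by element. The sketch below assumes every column in the frame should get the same dtype, so it converts them all in one astype call.

```python
import pandas as pd

size_dtype = pd.CategoricalDtype(categories=['S', 'M', 'L', 'XL'], ordered=True)

df = pd.DataFrame({
    'shirt_size': ['M', 'L', 'S', 'XL'],
    'pants_size': ['L', 'M', 'M', 'L']
}).astype(size_dtype)  # one call converts every column

# Both columns carry exactly the same dtype
print(df['shirt_size'].dtype == df['pants_size'].dtype)  # True

# Element-wise comparison uses the shared S < M < L < XL order
shirt_larger = df['shirt_size'] > df['pants_size']
print(shirt_larger.tolist())  # [False, True, False, True]
```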

CategoricalDtype with read_csv

```python
# Define dtype before reading
# Categories are listed from lowest to highest, so here 'AAA' < 'AA' < ... < 'B'
rating_dtype = pd.CategoricalDtype(
    categories=['AAA', 'AA', 'A', 'BBB', 'BB', 'B'],
    ordered=True
)

# Apply during read
df = pd.read_csv('bonds.csv', dtype={'rating': rating_dtype})
```
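The snippet above needs a bonds.csv file on disk. For a version that runs anywhere, the sketch below reads the same kind of data from an in-memory string; the issuers and the 'NR' rating are invented for illustration. It also shows that a rating missing from the dtype is read as a missing value.

```python
import io
import pandas as pd

rating_dtype = pd.CategoricalDtype(
    categories=['AAA', 'AA', 'A', 'BBB', 'BB', 'B'],
    ordered=True
)

# Hypothetical file contents; 'NR' is not one of the declared categories
csv_text = "issuer,rating\nAlpha,AA\nBeta,BBB\nGamma,NR\n"
df = pd.read_csv(io.StringIO(csv_text), dtype={'rating': rating_dtype})

print(df['rating'].dtype)            # category
print(df['rating'].isna().tolist())  # [False, False, True]
```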

Method 4: During DataFrame Creation

Specify categorical dtype when creating a DataFrame:

```python
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'grade': pd.Categorical(['A', 'B', 'A'], ordered=True)
})

print(df['grade'].dtype)  # category
```

Method 5: Using Series Constructor

```python
s = pd.Series(
    pd.Categorical(['low', 'medium', 'high', 'low']),
    name='priority'
)
print(s)
```
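The Series constructor also accepts a dtype argument, which avoids the intermediate pd.Categorical call. A brief sketch of both variants follows; priority_dtype is a name introduced here for illustration.

```python
import pandas as pd

# Inferred categories: sorted alphabetically, unordered
s1 = pd.Series(['low', 'medium', 'high', 'low'], dtype='category', name='priority')

# Explicit categories and order via a CategoricalDtype
priority_dtype = pd.CategoricalDtype(['low', 'medium', 'high'], ordered=True)
s2 = pd.Series(['low', 'medium', 'high', 'low'], dtype=priority_dtype, name='priority')

print(s1.cat.categories.tolist())  # ['high', 'low', 'medium']
print(s2.cat.categories.tolist())  # ['low', 'medium', 'high']
print(s2.cat.ordered)              # True
```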

Practical Examples

Stock Sectors

```python
import pandas as pd

# Define valid sectors
sector_dtype = pd.CategoricalDtype(categories=[
    'Technology', 'Healthcare', 'Finance', 'Energy',
    'Consumer Discretionary', 'Consumer Staples', 'Industrials',
    'Materials', 'Utilities', 'Real Estate', 'Communication Services'
])

# Create stock data
stocks = pd.DataFrame({
    'ticker': ['AAPL', 'JNJ', 'JPM', 'XOM', 'AMZN'],
    'sector': ['Technology', 'Healthcare', 'Finance', 'Energy',
               'Consumer Discretionary']
})

stocks['sector'] = stocks['sector'].astype(sector_dtype)
print(stocks['sector'].cat.categories)
```

Credit Ratings

```python
# Ratings with natural order (AAA is best)
rating_dtype = pd.CategoricalDtype(
    categories=['D', 'C', 'CC', 'CCC', 'B', 'BB', 'BBB', 'A', 'AA', 'AAA'],
    ordered=True
)

bonds = pd.DataFrame({
    'issuer': ['Company A', 'Company B', 'Company C'],
    'rating': ['AA', 'BBB', 'A']
})

bonds['rating'] = bonds['rating'].astype(rating_dtype)

# Now we can compare ratings
print(bonds[bonds['rating'] >= 'A'])  # Rated A or better
```
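The same ordering supports ranking and threshold filters. The sketch below adds a hypothetical fourth issuer rated 'B' so the filter has something to exclude, and it assumes the common convention that investment grade means 'BBB' or better.

```python
import pandas as pd

rating_dtype = pd.CategoricalDtype(
    categories=['D', 'C', 'CC', 'CCC', 'B', 'BB', 'BBB', 'A', 'AA', 'AAA'],
    ordered=True
)

# 'Company D' is a made-up addition for this illustration
bonds = pd.DataFrame({
    'issuer': ['Company A', 'Company B', 'Company C', 'Company D'],
    'rating': ['AA', 'BBB', 'A', 'B']
})
bonds['rating'] = bonds['rating'].astype(rating_dtype)

# Best rating first, following the declared order
ranked = bonds.sort_values('rating', ascending=False)
print(ranked['issuer'].tolist())

# Assumed convention: investment grade is 'BBB' or better
investment_grade = bonds[bonds['rating'] >= 'BBB']
print(investment_grade['issuer'].tolist())
```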

Survey Responses

```python

# Likert scale with order
likert_dtype = pd.CategoricalDtype(
    categories=[
        'Strongly Disagree', 'Disagree', 'Neutral', 'Agree', 'Strongly Agree'
    ],
    ordered=True
)

survey = pd.DataFrame({
    'respondent_id': [1, 2, 3, 4, 5],
    'satisfaction': ['Agree', 'Neutral', 'Strongly Agree', 'Disagree', 'Agree']
})

survey['satisfaction'] = survey['satisfaction'].astype(likert_dtype)

# Find positive responses
positive = survey[survey['satisfaction'] > 'Neutral']
print(positive)
```
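A typical next step with Likert data is a frequency table in scale order. The sketch below rebuilds the responses as a standalone Series (a simplification for illustration) and uses sort_index so the counts follow the declared categories, including the option nobody chose.

```python
import pandas as pd

likert_dtype = pd.CategoricalDtype(
    categories=[
        'Strongly Disagree', 'Disagree', 'Neutral', 'Agree', 'Strongly Agree'
    ],
    ordered=True
)

satisfaction = pd.Series(
    ['Agree', 'Neutral', 'Strongly Agree', 'Disagree', 'Agree'],
    dtype=likert_dtype
)

# Counts in scale order; unused options appear with a count of 0
counts = satisfaction.value_counts().sort_index()
print(counts)

# Share of responses above 'Neutral'
positive_share = (satisfaction > 'Neutral').mean()
print(positive_share)  # 0.6
```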

Summary of Creation Methods

| Method | Use Case | Example |
| --- | --- | --- |
| `astype('category')` | Quick conversion | `s.astype('category')` |
| `pd.Categorical()` | Custom categories/order | `pd.Categorical(data, categories=[...])` |
| `pd.CategoricalDtype` | Reusable type definition | `dtype = pd.CategoricalDtype(...)` |
| In DataFrame creation | Direct specification | `pd.DataFrame({'col': pd.Categorical(...)})` |

Best Practices

1. Define categories explicitly when you know the valid values
2. Use ordered=True when categories have a logical order
3. Use CategoricalDtype for consistency across multiple columns
4. Specify dtype in read_csv to save memory on large files
5. Handle invalid values: they become NaN silently

Exercises

Exercise 1. Write code that creates a Pandas Categorical from the list ['low', 'medium', 'high', 'medium', 'low'] with categories ['low', 'medium', 'high'] and ordered=True.

Solution to Exercise 1

```python
import pandas as pd

cat = pd.Categorical(
    ['low', 'medium', 'high', 'medium', 'low'],
    categories=['low', 'medium', 'high'],
    ordered=True
)
print(cat)
print(f'Ordered: {cat.ordered}')
```

Exercise 2. Explain the difference between pd.Categorical() and astype('category'). When would you use each?

Solution to Exercise 2

pd.Categorical() creates a standalone categorical array where you can specify the categories and ordering explicitly at creation time. .astype('category') converts an existing Series to a categorical dtype, inferring the categories from the data. Use pd.Categorical() when you need to set custom categories or ordering upfront. Use .astype('category') for quick conversion of an existing column.

Exercise 3. Create a DataFrame with a string column and convert it to categorical using .astype('category'). Print the category codes and the memory usage before and after conversion.

Solution to Exercise 3

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'red', 'blue'] * 100})

# deep=True counts the string data itself, not just the object pointers
print(f'Before: {df["color"].memory_usage(deep=True)} bytes')

df['color'] = df['color'].astype('category')
print(f'After: {df["color"].memory_usage(deep=True)} bytes')
print(df['color'].cat.codes[:5])
```

Exercise 4. Write code that creates a categorical with custom categories that include a category not present in the data (e.g., 'very_high'). Show that this category appears in .cat.categories.

Solution to Exercise 4

```python
import pandas as pd

cat = pd.Categorical(
    ['low', 'medium', 'high'],
    categories=['low', 'medium', 'high', 'very_high']
)
print(cat)
print(f'Categories: {cat.categories.tolist()}')
```