Type Conversion with astype¶

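The astype() method converts a Series or DataFrame columns to a specified data type and returns a new object; the original is left unchanged. It is a core tool for data cleaning, for reducing memory usage, and for making sure arithmetic and comparisons run on the intended types.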

Basic Usage¶

Series Conversion¶

import pandas as pd

s = pd.Series([1, 2, 3])
print(f"Original dtype: {s.dtype}")  # int64

# Convert to float
s_float = s.astype('float64')
print(f"Converted dtype: {s_float.dtype}")  # float64
print(s_float)

0    1.0
1    2.0
2    3.0
dtype: float64

DataFrame Column Conversion¶

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['4', '5', '6']
})

# Convert single column
df['B'] = df['B'].astype(int)
print(df.dtypes)

A    int64
B    int64
dtype: object

Multiple Columns at Once¶

df = pd.DataFrame({
    'A': ['1', '2', '3'],
    'B': ['4.0', '5.0', '6.0'],
    'C': ['True', 'False', 'True']
})

# Convert multiple columns using a dictionary;
# columns not listed (here 'C') keep their current dtype
df = df.astype({
    'A': 'int64',
    'B': 'float64'
})
print(df.dtypes)

Common Type Conversions¶

String to Numeric¶

s = pd.Series(['1', '2', '3'])
s_int = s.astype(int)
s_float = s.astype(float)
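
Strings that contain a decimal point, such as the '4.0' values used earlier, generally cannot be cast straight to an integer dtype. A minimal sketch of the two-step route, assuming every value represents a whole number (the variable names are illustrative):

```python
import pandas as pd

s = pd.Series(['4.0', '5.0', '6.0'])

# Parse the text as float first, then cast the numeric values to int.
s_int = s.astype(float).astype('int64')
print(s_int.tolist())  # [4, 5, 6]
```

If some values may carry a fractional part, decide explicitly whether to round or truncate before the final cast.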

Numeric to String¶

s = pd.Series([1, 2, 3])
s_str = s.astype(str)
print(s_str)

0    1
1    2
2    3
dtype: object

To Boolean¶

s = pd.Series([0, 1, 0, 1])
s_bool = s.astype(bool)
print(s_bool)

0    False
1     True
2    False
3     True
dtype: bool
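
Text labels need more care than numbers: casting strings to bool typically follows Python truthiness, so a non-empty string such as 'False' would not become False. One hedged alternative is an explicit mapping, sketched here with illustrative label values:

```python
import pandas as pd

s = pd.Series(['True', 'False', 'True'])

# Map each expected label explicitly. Labels missing from the mapping
# become NaN, so unexpected input shows up as missing data.
s_bool = s.map({'True': True, 'False': False})
print(s_bool.tolist())  # [True, False, True]
```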

To Category¶

s = pd.Series(['low', 'medium', 'high', 'low', 'medium'])
s_cat = s.astype('category')
print(s_cat)
print(f"Categories: {s_cat.cat.categories.tolist()}")

0       low
1    medium
2      high
3       low
4    medium
dtype: category
Categories (3, object): ['high', 'low', 'medium']
Categories: ['high', 'low', 'medium']
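
By default the categories are sorted alphabetically and carry no order. When the labels have a natural ranking, an ordered categorical dtype can be passed to astype instead. A small sketch (the name level_dtype is illustrative):

```python
import pandas as pd

s = pd.Series(['low', 'medium', 'high', 'low', 'medium'])

# Declare the categories in their logical order rather than alphabetical order.
level_dtype = pd.CategoricalDtype(categories=['low', 'medium', 'high'], ordered=True)
s_ordered = s.astype(level_dtype)

print(s_ordered.cat.categories.tolist())  # ['low', 'medium', 'high']
print((s_ordered > 'low').tolist())       # [False, True, True, False, True]
print(s_ordered.max())                    # high
```

With an ordered dtype, comparisons, sorting, and min/max follow the declared order.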

Practical Example: Trip Status Encoding¶

Adapted from LeetCode 262 (Trips and Users): encode each trip status as 0 or 1 so that the cancellation rate can be computed numerically.

trips = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'status': ['completed', 'cancelled_by_driver', 'completed', 
               'cancelled_by_client', 'completed']
})

# Step 1: Replace status strings with integers
status_encoded = trips['status'].replace({
    'cancelled_by_driver': 1,
    'cancelled_by_client': 1,
    'completed': 0
})
print(status_encoded)

0    0
1    1
2    0
3    1
4    0
Name: status, dtype: int64

# Step 2: Ensure integer type (important after replace)
status_encoded = status_encoded.astype(int)

# Calculate cancellation rate
cancellation_rate = status_encoded.sum() / len(status_encoded)
print(f"Cancellation rate: {cancellation_rate:.2%}")  # 40.00%

Why astype(int) After Replace?¶

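replace() leaves any value that is not in the mapping untouched, so a column containing unmapped strings ends up with object dtype and a mix of integers and strings. Calling astype(int) afterwards makes the integer type explicit and fails immediately if a stray string remains, instead of letting it corrupt later calculations.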

# Example where replace might leave mixed types
df = pd.DataFrame({
    'status': ['completed', 'cancelled_by_driver', 'in_progress']
})

# 'in_progress' is not in the mapping, remains as string
result = df['status'].replace({
    'cancelled_by_driver': 1,
    'completed': 0
})
print(result.dtype)  # object (mixed)

# astype(int) fails here because 'in_progress' cannot be parsed as an integer
# result.astype(int)  # Would raise ValueError
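
An alternative that sidesteps mixed types altogether is to build a boolean mask and cast that. This is a sketch of one possible approach, not the approach used in the original solution; cancelled_labels is an illustrative name:

```python
import pandas as pd

trips = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'status': ['completed', 'cancelled_by_driver', 'completed',
               'cancelled_by_client', 'completed']
})

# isin() always returns booleans, so any status outside the list
# (for example 'in_progress') simply counts as not cancelled.
cancelled_labels = ['cancelled_by_driver', 'cancelled_by_client']
is_cancelled = trips['status'].isin(cancelled_labels).astype(int)

print(is_cancelled.tolist())         # [0, 1, 0, 1, 0]
print(f"{is_cancelled.mean():.2%}")  # 40.00%
```

The trade-off is that unknown statuses are silently treated as 0, whereas the replace-then-astype route fails on them.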

Handling Conversion Errors¶

errors Parameter¶

s = pd.Series(['1', '2', 'three', '4'])

# astype raises as soon as it meets a value it cannot parse
# s.astype(int)  # ValueError

# pd.to_numeric with errors='coerce' turns unparseable values into NaN
s_numeric = pd.to_numeric(s, errors='coerce')
print(s_numeric)

0    1.0
1    2.0
2    NaN
3    4.0
dtype: float64
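
Coercion produces float64 because NaN is a float. If the valid values should stay integers, the coerced result can be cast to the nullable 'Int64' dtype. A minimal sketch, assuming the valid entries are whole numbers:

```python
import pandas as pd

s = pd.Series(['1', '2', 'three', '4'])

# Coerce bad values to NaN, then move to a nullable integer dtype
# so the valid entries are stored as integers again.
s_nullable = pd.to_numeric(s, errors='coerce').astype('Int64')

print(s_nullable.isna().tolist())  # [False, False, True, False]
print(s_nullable.sum())            # 7
```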

Safe Conversion Pattern¶

def safe_convert(series, dtype):
    """Safely convert series to dtype, returning None on failure."""
    try:
        return series.astype(dtype)
    except (ValueError, TypeError) as e:
        print(f"Conversion failed: {e}")
        return None

s = pd.Series(['1', '2', 'invalid'])
result = safe_convert(s, int)  # Conversion failed

Memory Optimization¶

Downcasting Integers¶

# int64 stores 8 bytes per value; index=False leaves the index out of the total
s = pd.Series([1, 2, 3, 4, 5])
print(f"int64 memory: {s.memory_usage(index=False)} bytes")  # 40

# Downcast to int8 (1 byte per value) when every value fits in -128 to 127
s_small = s.astype('int8')
print(f"int8 memory: {s_small.memory_usage(index=False)} bytes")  # 5

Integer Types and Their Ranges¶

Type	Bytes	Range
int8	1	-128 to 127
int16	2	-32,768 to 32,767
int32	4	-2,147,483,648 to 2,147,483,647
int64	8	about -9.22×10¹⁸ to 9.22×10¹⁸
uint8	1	0 to 255
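
Before casting to a smaller type, it is worth checking that the data actually fits the ranges in the table. The helper below is a hypothetical sketch (fits_in is not a pandas function) that compares the Series against the limits reported by NumPy's np.iinfo:

```python
import numpy as np
import pandas as pd

def fits_in(series, dtype):
    """Return True if every value lies inside the range of the integer dtype."""
    info = np.iinfo(dtype)
    return bool(series.min() >= info.min and series.max() <= info.max)

s = pd.Series([1, 2, 3, 100, 200])
print(fits_in(s, 'int8'))   # False, because 200 > 127
print(fits_in(s, 'uint8'))  # True
print(fits_in(s, 'int16'))  # True
```

A check like this guards against values falling outside the target range, which astype is not guaranteed to report.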

Using pd.to_numeric with Downcast¶

s = pd.Series([1, 2, 3, 100, 200])

# Automatically choose the smallest signed integer type that fits the data
s_downcast = pd.to_numeric(s, downcast='integer')
print(f"Downcast dtype: {s_downcast.dtype}")  # int16 (200 does not fit in int8)

Nullable Integer Types¶

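pandas supports nullable integer dtypes (note the capital letter, e.g. 'Int64') that can hold missing values, represented as pd.NA, without converting the column to float.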

# Standard int64 cannot hold NaN
s = pd.Series([1, 2, None])
print(s.dtype)  # float64 (upcasted)

# Nullable integer preserves integer nature
s = pd.Series([1, 2, None], dtype='Int64')
print(s)

0       1
1       2
2    <NA>
dtype: Int64

Converting to Nullable¶

s = pd.Series([1.0, 2.0, float('nan')])
s_nullable = s.astype('Int64')
print(s_nullable)

0       1
1       2
2    <NA>
dtype: Int64

Datetime Conversions¶

String to Datetime¶

s = pd.Series(['2024-01-01', '2024-01-02', '2024-01-03'])

# Using astype
s_datetime = s.astype('datetime64[ns]')

# Better: using pd.to_datetime for more control
s_datetime = pd.to_datetime(s)
print(s_datetime)
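
The extra control mentioned above includes an explicit format string and error handling. A short sketch using the format and errors parameters of pd.to_datetime (the sample values are made up for illustration):

```python
import pandas as pd

s = pd.Series(['2024-01-01', '2024-01-02', 'not a date'])

# An explicit format avoids guessing; errors='coerce' turns
# unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(s, format='%Y-%m-%d', errors='coerce')

print(parsed.isna().tolist())  # [False, False, True]
```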

Datetime to String¶

dates = pd.Series(pd.date_range('2024-01-01', periods=3))
dates_str = dates.astype(str)
print(dates_str)

0    2024-01-01
1    2024-01-02
2    2024-01-03
dtype: object
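
astype(str) uses the default ISO-style representation. When a specific layout is needed, the dt.strftime accessor is the usual alternative. A brief sketch with a day/month/year pattern chosen for illustration:

```python
import pandas as pd

dates = pd.Series(pd.date_range('2024-01-01', periods=3))

# Format each timestamp with an explicit pattern.
labels = dates.dt.strftime('%d/%m/%Y')
print(labels.tolist())  # ['01/01/2024', '02/01/2024', '03/01/2024']
```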

Practical Example: Data Cleaning Pipeline¶

# Raw data with mixed types
raw_data = pd.DataFrame({
    'user_id': ['1', '2', '3', '4'],
    'age': ['25', '30', '28', 'unknown'],
    'premium': ['1', '0', '1', '1'],
    'signup_date': ['2024-01-01', '2024-01-15', '2024-02-01', '2024-02-15']
})

# Clean and convert types
cleaned = raw_data.copy()

# Convert user_id to int
cleaned['user_id'] = cleaned['user_id'].astype(int)

# Convert age to numeric, coercing errors
cleaned['age'] = pd.to_numeric(cleaned['age'], errors='coerce')

# Convert premium to boolean
cleaned['premium'] = cleaned['premium'].astype(int).astype(bool)

# Convert signup_date to datetime
cleaned['signup_date'] = pd.to_datetime(cleaned['signup_date'])

print(cleaned.dtypes)

user_id                 int64
age                   float64
premium                  bool
signup_date    datetime64[ns]
dtype: object

Financial Example: Portfolio Data Types¶

portfolio = pd.DataFrame({
    'ticker': ['AAPL', 'MSFT', 'GOOGL'],
    'shares': ['100', '150', '50'],
    'price': ['150.25', '350.50', '140.75'],
    'sector': ['Tech', 'Tech', 'Tech']
})

# Convert to appropriate types
portfolio = portfolio.astype({
    'shares': 'int64',
    'price': 'float64',
    'sector': 'category'
})

# Calculate position values
portfolio['value'] = portfolio['shares'] * portfolio['price']
print(portfolio)
print(f"\nData types:\n{portfolio.dtypes}")

Summary of Type Conversions¶

Conversion	Method	Notes
str → int	`s.astype(int)`	Direct conversion
str → float	`s.astype(float)`	Direct conversion
str → datetime	`pd.to_datetime(s)`	Preferred for dates
int → str	`s.astype(str)`	Direct conversion
float → int	`s.astype(int)`	Truncates toward zero; raises if NaN is present
object → category	`s.astype('category')`	Memory efficient
int64 → Int64	`s.astype('Int64')`	Nullable integer
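
The float → int row deserves a closer look, since truncation and missing values are common sources of surprises. A minimal sketch with made-up values:

```python
import pandas as pd

s = pd.Series([1.9, -1.9, 2.4])

# astype(int) drops the fractional part (truncation toward zero).
print(s.astype(int).tolist())          # [1, -1, 2]

# Round first if nearest-integer behaviour is wanted.
print(s.round().astype(int).tolist())  # [2, -2, 2]

# NaN cannot be stored in a plain integer dtype, so the cast fails.
s_nan = pd.Series([1.0, float('nan')])
try:
    s_nan.astype(int)
    nan_cast_failed = False
except ValueError:
    nan_cast_failed = True
print(nan_cast_failed)  # True

# The nullable 'Int64' dtype keeps the missing value as <NA>.
print(s_nan.astype('Int64').isna().tolist())  # [False, True]
```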