Type Conversion with astype¶
The astype() method converts a Series, or selected DataFrame columns, to a specified data type. This is essential for data cleaning, memory optimization, and ensuring that arithmetic and comparisons run on the intended types rather than on strings.
Basic Usage¶
Series Conversion¶
import pandas as pd
s = pd.Series([1, 2, 3])
print(f"Original dtype: {s.dtype}") # int64
# Convert to float
s_float = s.astype('float64')
print(f"Converted dtype: {s_float.dtype}") # float64
print(s_float)
0 1.0
1 2.0
2 3.0
dtype: float64
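One detail the example above relies on is that astype() returns a converted copy rather than changing the Series in place, so the result has to be assigned to a name. A minimal sketch (the variable name `converted` is chosen only for illustration):

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# astype() returns a new Series; the original keeps its dtype.
converted = s.astype('float64')

print(s.dtype)          # int64
print(converted.dtype)  # float64
```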
DataFrame Column Conversion¶
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['4', '5', '6']
})
# Convert single column
df['B'] = df['B'].astype(int)
print(df.dtypes)
A int64
B int64
dtype: object
Multiple Columns at Once¶
df = pd.DataFrame({
    'A': ['1', '2', '3'],
    'B': ['4.0', '5.0', '6.0'],
    'C': ['True', 'False', 'True']
})
# Convert multiple columns using a dictionary
df = df.astype({
    'A': 'int64',
    'B': 'float64'
})
print(df.dtypes)
Common Type Conversions¶
String to Numeric¶
s = pd.Series(['1', '2', '3'])
s_int = s.astype(int)
s_float = s.astype(float)
Numeric to String¶
s = pd.Series([1, 2, 3])
s_str = s.astype(str)
print(s_str)
0 1
1 2
2 3
dtype: object
To Boolean¶
s = pd.Series([0, 1, 0, 1])
s_bool = s.astype(bool)
print(s_bool)
0 False
1 True
2 False
3 True
dtype: bool
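Boolean conversion of text needs more care than the numeric case above. For an object-dtype Series of strings, astype(bool) applies Python truthiness, so every non-empty string, including 'False', becomes True. The sketch below contrasts that with an explicit mapping; the names `flags`, `naive`, and `mapped` are illustrative, and the Series is forced to object dtype so the behavior does not depend on the pandas string backend in use.

```python
import pandas as pd

# Text labels, stored explicitly as Python objects.
flags = pd.Series(['True', 'False', 'True'], dtype=object)

# Pitfall: any non-empty string is truthy, so this yields all True.
naive = flags.astype(bool)

# Explicit mapping from the text labels to real booleans.
mapped = flags.map({'True': True, 'False': False})

print(naive.tolist())   # [True, True, True]
print(mapped.tolist())  # [True, False, True]
```

Mapping the labels explicitly also makes it easy to spot unexpected values, since anything missing from the dictionary comes back as NaN.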
To Category¶
s = pd.Series(['low', 'medium', 'high', 'low', 'medium'])
s_cat = s.astype('category')
print(s_cat)
print(f"Categories: {s_cat.cat.categories.tolist()}")
0 low
1 medium
2 high
3 low
4 medium
dtype: category
Categories (3, object): ['high', 'low', 'medium']
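The category dtype stores each distinct label once and keeps small integer codes per row, which is where its memory savings come from when a column repeats a few values many times. A rough way to see the effect, assuming a repeated three-label column (the exact byte counts depend on the pandas version and string backend, so none are shown):

```python
import pandas as pd

# 3,000 rows drawn from only three distinct labels.
labels = pd.Series(['low', 'medium', 'high'] * 1000)
as_category = labels.astype('category')

# deep=True counts the string data itself, not just the references to it.
plain_bytes = labels.memory_usage(index=False, deep=True)
category_bytes = as_category.memory_usage(index=False, deep=True)

print(f"plain strings: {plain_bytes} bytes")
print(f"category:      {category_bytes} bytes")
```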
Practical Example: Trip Status Encoding¶
Based on LeetCode 262 (Trips and Users): encode each trip status as 0 or 1 so that the cancellation rate can be computed.
trips = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'status': ['completed', 'cancelled_by_driver', 'completed',
               'cancelled_by_client', 'completed']
})
# Step 1: Replace status strings with integers
status_encoded = trips['status'].replace({
    'cancelled_by_driver': 1,
    'cancelled_by_client': 1,
    'completed': 0
})
print(status_encoded)
0 0
1 1
2 0
3 1
4 0
Name: status, dtype: int64
# Step 2: Ensure integer type (important after replace)
status_encoded = status_encoded.astype(int)
# Calculate cancellation rate
cancellation_rate = status_encoded.sum() / len(status_encoded)
print(f"Cancellation rate: {cancellation_rate:.2%}") # 40.00%
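The same rate can be reached without replace() by building a boolean mask and converting it with astype(int). This is an alternative sketch rather than the approach used above; it assumes every status other than 'completed' counts as a cancellation, which holds for this sample data.

```python
import pandas as pd

trips = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'status': ['completed', 'cancelled_by_driver', 'completed',
               'cancelled_by_client', 'completed']
})

# Boolean mask: True for every status other than 'completed'.
is_cancelled = trips['status'] != 'completed'

# astype(int) turns True/False into 1/0.
cancelled_flag = is_cancelled.astype(int)
cancellation_rate = cancelled_flag.sum() / len(cancelled_flag)

print(f"Cancellation rate: {cancellation_rate:.2%}")  # 40.00%
```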
Why astype(int) After Replace?¶
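The replace() method leaves any value that is not in the mapping untouched, so a partially mapped column ends up with object dtype holding a mix of integers and strings. Calling astype(int) afterwards guarantees an integer dtype for the calculation, and raises a ValueError if an unmapped string is still present.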
# Example where replace might leave mixed types
df = pd.DataFrame({
    'status': ['completed', 'cancelled_by_driver', 'in_progress']
})
# 'in_progress' is not in the mapping, remains as string
result = df['status'].replace({
    'cancelled_by_driver': 1,
    'completed': 0
})
print(result.dtype) # object (mixed)
# Force to numeric (will error if truly invalid)
# result.astype(int) # Would raise ValueError
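When unmapped labels are possible, one option is to use map() instead of replace(): labels missing from the dictionary become NaN, which can be inspected and then stored in a nullable integer column. A sketch under that assumption (the names `mapping`, `unmapped`, and `encoded_nullable` are illustrative):

```python
import pandas as pd

status = pd.Series(['completed', 'cancelled_by_driver', 'in_progress'])
mapping = {'cancelled_by_driver': 1, 'completed': 0}

# map() returns NaN for any label that is not a key of the mapping.
encoded = status.map(mapping)

# List the labels that the mapping did not cover.
unmapped = status[encoded.isna()].unique().tolist()
print(unmapped)  # ['in_progress']

# Keep integer semantics while representing the gap as <NA>.
encoded_nullable = encoded.astype('Int64')
print(encoded_nullable.dtype)  # Int64
```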
Handling Conversion Errors¶
errors Parameter¶
s = pd.Series(['1', '2', 'three', '4'])
# Default: astype raises as soon as it meets a value it cannot parse
# s.astype(int) # ValueError
# pd.to_numeric with errors='coerce' converts unparseable values to NaN instead
s_numeric = pd.to_numeric(s, errors='coerce')
print(s_numeric)
0 1.0
1 2.0
2 NaN
3 4.0
dtype: float64
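After coercing, it is often useful to know which inputs were turned into NaN. Comparing the coerced result with the original values is one way to do that; the sketch below assumes the same sample Series, and `rejected` is an illustrative name.

```python
import pandas as pd

s = pd.Series(['1', '2', 'three', '4'])
s_numeric = pd.to_numeric(s, errors='coerce')

# Rows that were present in the input but became NaN after coercion.
rejected = s[s_numeric.isna() & s.notna()]

print(rejected.tolist())  # ['three']
```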
Safe Conversion Pattern¶
def safe_convert(series, dtype):
    """Safely convert series to dtype, returning None on failure."""
    try:
        return series.astype(dtype)
    except (ValueError, TypeError) as e:
        print(f"Conversion failed: {e}")
        return None
s = pd.Series(['1', '2', 'invalid'])
result = safe_convert(s, int) # prints "Conversion failed: ..." and returns None
Memory Optimization¶
Downcasting Integers¶
# Default int64 uses 8 bytes per value
s = pd.Series([1, 2, 3, 4, 5])
print(f"int64 memory: {s.memory_usage()} bytes")
# Downcast to int8 (1 byte) for small integers
s_small = s.astype('int8')
print(f"int8 memory: {s_small.memory_usage()} bytes")
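Note that memory_usage() includes the index by default, so the two totals above differ by less than the 8-to-1 ratio of the element sizes. Passing index=False isolates the bytes used by the values themselves, as in this short sketch:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
s_small = s.astype('int8')

# index=False excludes the index, leaving only the value buffer.
print(s.memory_usage(index=False))        # 40 (5 values x 8 bytes)
print(s_small.memory_usage(index=False))  # 5  (5 values x 1 byte)
```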
Integer Types and Their Ranges¶
| Type | Bytes | Range |
|---|---|---|
| int8 | 1 | -128 to 127 |
| int16 | 2 | -32,768 to 32,767 |
| int32 | 4 | -2,147,483,648 to 2,147,483,647 |
| int64 | 8 | -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 |
| uint8 | 1 | 0 to 255 |
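Before downcasting with astype, it helps to check that the data actually fits in the target range from the table. The helper below is a hypothetical convenience function, not part of pandas; it assumes an integer Series without missing values and uses np.iinfo to look up the limits.

```python
import numpy as np
import pandas as pd

def fits_in(series, dtype):
    """Return True if every value of an integer Series fits in the integer dtype."""
    info = np.iinfo(np.dtype(dtype))
    return bool(series.min() >= info.min and series.max() <= info.max)

s = pd.Series([1, 2, 3, 100, 200])

print(fits_in(s, 'int8'))   # False: 200 is above the int8 maximum of 127
print(fits_in(s, 'uint8'))  # True
print(fits_in(s, 'int16'))  # True
```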
Using pd.to_numeric with Downcast¶
s = pd.Series([1, 2, 3, 100, 200])
# Automatically choose smallest integer type
s_downcast = pd.to_numeric(s, downcast='integer')
print(f"Downcast dtype: {s_downcast.dtype}") # int16 (200 exceeds the int8 maximum of 127)
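The downcast argument also accepts 'unsigned', which matters when the values are non-negative but too large for the signed type of the same width. A short comparison on the same data:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 100, 200])

# Smallest signed type that holds 200 is int16 (int8 stops at 127).
signed = pd.to_numeric(s, downcast='integer')

# Smallest unsigned type that holds 200 is uint8 (0 to 255).
unsigned = pd.to_numeric(s, downcast='unsigned')

print(signed.dtype)    # int16
print(unsigned.dtype)  # uint8
```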
Nullable Integer Types¶
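pandas provides nullable integer dtypes (Int8 through Int64, written with a capital I) that represent missing values as pd.NA instead of forcing an upcast to float.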
# Standard int64 cannot hold NaN
s = pd.Series([1, 2, None])
print(s.dtype) # float64 (upcasted)
# Nullable integer preserves integer nature
s = pd.Series([1, 2, None], dtype='Int64')
print(s)
0 1
1 2
2 <NA>
dtype: Int64
Converting to Nullable¶
s = pd.Series([1.0, 2.0, float('nan')])
s_nullable = s.astype('Int64')
print(s_nullable)
0 1
1 2
2 <NA>
dtype: Int64
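The conversion above works because every non-missing float is a whole number. With fractional values, astype('Int64') refuses to cast rather than silently dropping the fraction, so an explicit rounding step is needed first. A sketch of that behavior (the exception type is caught broadly here, as an assumption that it may differ between pandas versions):

```python
import pandas as pd

s = pd.Series([1.6, 2.0, float('nan')])

# Fractional values cannot be cast safely to a nullable integer.
try:
    s.astype('Int64')
    raised = False
except (TypeError, ValueError):
    raised = True
print(raised)  # True

# Rounding first makes the intent explicit; NaN becomes <NA>.
rounded = s.round().astype('Int64')
print(rounded.dropna().tolist())  # [2, 2]
```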
Datetime Conversions¶
String to Datetime¶
s = pd.Series(['2024-01-01', '2024-01-02', '2024-01-03'])
# Using astype
s_datetime = s.astype('datetime64[ns]')
# Better: using pd.to_datetime for more control
s_datetime = pd.to_datetime(s)
print(s_datetime)
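The extra control that pd.to_datetime offers includes an explicit format and the same errors='coerce' option as pd.to_numeric. A small sketch, assuming ISO-style dates with one malformed entry added for illustration:

```python
import pandas as pd

s = pd.Series(['2024-01-01', '2024-01-02', 'not a date'])

# An explicit format avoids guessing; unparseable entries become NaT.
parsed = pd.to_datetime(s, format='%Y-%m-%d', errors='coerce')

print(parsed.isna().tolist())  # [False, False, True]
```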
Datetime to String¶
dates = pd.Series(pd.date_range('2024-01-01', periods=3))
dates_str = dates.astype(str)
print(dates_str)
0 2024-01-01
1 2024-01-02
2 2024-01-03
dtype: object
Practical Example: Data Cleaning Pipeline¶
# Raw data with mixed types
raw_data = pd.DataFrame({
    'user_id': ['1', '2', '3', '4'],
    'age': ['25', '30', '28', 'unknown'],
    'premium': ['1', '0', '1', '1'],
    'signup_date': ['2024-01-01', '2024-01-15', '2024-02-01', '2024-02-15']
})
# Clean and convert types
cleaned = raw_data.copy()
# Convert user_id to int
cleaned['user_id'] = cleaned['user_id'].astype(int)
# Convert age to numeric, coercing errors
cleaned['age'] = pd.to_numeric(cleaned['age'], errors='coerce')
# Convert premium to boolean via int (astype(bool) directly on the strings would turn '0' into True)
cleaned['premium'] = cleaned['premium'].astype(int).astype(bool)
# Convert signup_date to datetime
cleaned['signup_date'] = pd.to_datetime(cleaned['signup_date'])
print(cleaned.dtypes)
user_id int64
age float64
premium bool
signup_date datetime64[ns]
dtype: object
Financial Example: Portfolio Data Types¶
portfolio = pd.DataFrame({
    'ticker': ['AAPL', 'MSFT', 'GOOGL'],
    'shares': ['100', '150', '50'],
    'price': ['150.25', '350.50', '140.75'],
    'sector': ['Tech', 'Tech', 'Tech']
})
# Convert to appropriate types
portfolio = portfolio.astype({
    'shares': 'int64',
    'price': 'float64',
    'sector': 'category'
})
# Calculate position values
portfolio['value'] = portfolio['shares'] * portfolio['price']
print(portfolio)
print(f"\nData types:\n{portfolio.dtypes}")
Summary of Type Conversions¶
| Conversion | Method | Notes |
|---|---|---|
| str → int | s.astype(int) | Direct conversion |
| str → float | s.astype(float) | Direct conversion |
| str → datetime | pd.to_datetime(s) | Preferred for dates |
| int → str | s.astype(str) | Direct conversion |
| float → int | s.astype(int) | Truncates decimals |
| object → category | s.astype('category') | Memory efficient |
| int64 → Int64 | s.astype('Int64') | Nullable integer |
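The float → int row deserves a concrete illustration, since truncation is not the same as rounding and missing values are not allowed in a plain integer column. A brief sketch with made-up values:

```python
import pandas as pd

s = pd.Series([1.9, -1.9])

# astype(int) drops the fractional part (truncation toward zero).
print(s.astype(int).tolist())          # [1, -1]

# Round explicitly when nearest-integer behavior is wanted.
print(s.round().astype(int).tolist())  # [2, -2]

# A NaN cannot be stored in a plain integer dtype, so the cast raises.
try:
    pd.Series([1.0, float('nan')]).astype(int)
    nan_raised = False
except ValueError:
    nan_raised = True
print(nan_raised)  # True
```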