Type Conversion with astype¶
The astype() method converts Series or DataFrame columns to a specified data type. This is essential for data cleaning, memory optimization, and ensuring correct operations.
Mental Model
astype() is an explicit type cast for an entire column. It tells pandas "convert every value to this new dtype." Common uses include converting strings to numbers after cleaning, shrinking int64 to int8 for memory savings, and switching to categorical. If any value cannot be converted, astype() raises an error; use pd.to_numeric(s, errors='coerce') to turn unparseable values into NaN instead.
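As a minimal sketch of the cast described above (the variable names are illustrative): astype() returns a converted copy and leaves the original untouched, so the result has to be assigned to something.

```python
import pandas as pd

original = pd.Series([1, 2, 3])

# astype() returns a new Series; the original keeps its integer dtype
converted = original.astype('float64')

print(original.dtype)   # integer dtype, unchanged
print(converted.dtype)  # float64
```

To change a DataFrame column, assign the result back, for example `df['col'] = df['col'].astype('float64')`.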
Basic Usage¶
Series Conversion¶
```python
import pandas as pd

s = pd.Series([1, 2, 3])
print(f"Original dtype: {s.dtype}")  # int64

# Convert to float
s_float = s.astype('float64')
print(f"Converted dtype: {s_float.dtype}")  # float64
print(s_float)
```
0 1.0
1 2.0
2 3.0
dtype: float64
DataFrame Column Conversion¶
```python
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['4', '5', '6']
})

# Convert single column
df['B'] = df['B'].astype(int)
print(df.dtypes)
```
A int64
B int64
dtype: object
Multiple Columns at Once¶
```python
df = pd.DataFrame({
    'A': ['1', '2', '3'],
    'B': ['4.0', '5.0', '6.0'],
    'C': ['True', 'False', 'True']
})

# Convert multiple columns using a dictionary
df = df.astype({
    'A': 'int64',
    'B': 'float64'
})
print(df.dtypes)
```
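Column C in the example above is left as strings. One pitfall worth sketching: casting strings directly with astype(bool) follows Python truthiness, so every non-empty string, including 'False', becomes True. The explicit mapping below is one possible workaround, not the only one; the variable names are illustrative.

```python
import pandas as pd

# dtype=object is set explicitly so the example behaves the same across pandas versions
flags = pd.Series(['True', 'False', 'True'], dtype=object)

# Direct cast: non-empty strings are truthy, so 'False' becomes True
naive = flags.astype(bool)
print(naive.tolist())   # [True, True, True]

# Explicit mapping gives the intended result
mapped = flags.map({'True': True, 'False': False})
print(mapped.tolist())  # [True, False, True]
```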
Common Type Conversions¶
String to Numeric¶
```python
s = pd.Series(['1', '2', '3'])
s_int = s.astype(int)
s_float = s.astype(float)
```
Numeric to String¶
```python
s = pd.Series([1, 2, 3])
s_str = s.astype(str)
print(s_str)
```
0 1
1 2
2 3
dtype: object
To Boolean¶
```python
s = pd.Series([0, 1, 0, 1])
s_bool = s.astype(bool)
print(s_bool)
```
0 False
1 True
2 False
3 True
dtype: bool
To Category¶
```python
s = pd.Series(['low', 'medium', 'high', 'low', 'medium'])
s_cat = s.astype('category')
print(s_cat)
print(f"Categories: {s_cat.cat.categories.tolist()}")
```
0 low
1 medium
2 high
3 low
4 medium
dtype: category
Categories (3, object): ['high', 'low', 'medium']
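The categories above are sorted alphabetically, which does not match the natural low < medium < high order. If the order matters, a pd.CategoricalDtype can be passed to astype() instead of the string 'category'. A minimal sketch (the names `level_type` and `s_ordered` are illustrative):

```python
import pandas as pd

s = pd.Series(['low', 'medium', 'high', 'low', 'medium'])

# Declare the categories explicitly, in their logical order
level_type = pd.CategoricalDtype(categories=['low', 'medium', 'high'], ordered=True)
s_ordered = s.astype(level_type)

print(s_ordered.cat.categories.tolist())  # ['low', 'medium', 'high']
print(s_ordered.max())                    # high
print((s_ordered > 'low').tolist())       # [False, True, True, False, True]
```

With an ordered dtype, comparisons and sorting follow the declared order rather than alphabetical order.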
Practical Example: Trip Status Encoding¶
From LeetCode 262: Convert trip status to binary for cancellation rate calculation.
```python
trips = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'status': ['completed', 'cancelled_by_driver', 'completed',
               'cancelled_by_client', 'completed']
})

# Step 1: Replace status strings with integers
status_encoded = trips['status'].replace({
    'cancelled_by_driver': 1,
    'cancelled_by_client': 1,
    'completed': 0
})
print(status_encoded)
```
0 0
1 1
2 0
3 1
4 0
Name: status, dtype: int64
```python
# Step 2: Ensure integer type (important after replace)
status_encoded = status_encoded.astype(int)

# Calculate cancellation rate
cancellation_rate = status_encoded.sum() / len(status_encoded)
print(f"Cancellation rate: {cancellation_rate:.2%}")  # 40.00%
```
Why astype(int) After Replace?¶
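replace() leaves values that are not in the mapping untouched, so the result can be an object column that mixes integers and strings. Depending on the pandas version, the result may also stay object even when every value is mapped. Calling astype(int) afterwards guarantees an integer dtype for calculations and raises a ValueError if unmapped strings remain, which surfaces the problem early.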
```python
# Example where replace might leave mixed types
df = pd.DataFrame({
    'status': ['completed', 'cancelled_by_driver', 'in_progress']
})

# 'in_progress' is not in the mapping, remains as string
result = df['status'].replace({
    'cancelled_by_driver': 1,
    'completed': 0
})
print(result.dtype)  # object (mixed)

# Force to numeric (will error if truly invalid)
# result.astype(int)  # Would raise ValueError
```
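One way to catch unmapped values before casting, sketched under the assumption that every status is expected to appear in the mapping, is to use map(), which turns anything missing from the dictionary into NaN. The names `mapping`, `encoded`, and `unmapped` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'status': ['completed', 'cancelled_by_driver', 'in_progress']
})
mapping = {'cancelled_by_driver': 1, 'completed': 0}

# map() yields NaN for values that are not keys of the mapping
encoded = df['status'].map(mapping)

# Inspect the offending values before deciding how to handle them
unmapped = df.loc[encoded.isna(), 'status']
print(unmapped.tolist())  # ['in_progress']

# Cast only once nothing is left unmapped
if unmapped.empty:
    encoded = encoded.astype(int)
```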
Handling Conversion Errors¶
errors Parameter¶
```python
s = pd.Series(['1', '2', 'three', '4'])

# Default: raises error
# s.astype(int)  # ValueError

# With pd.to_numeric for error handling
s_numeric = pd.to_numeric(s, errors='coerce')
print(s_numeric)
```
0 1.0
1 2.0
2 NaN
3 4.0
dtype: float64
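Coercion hides which inputs were invalid. A small sketch for listing them before they are lost to NaN (the name `bad_values` is illustrative):

```python
import pandas as pd

s = pd.Series(['1', '2', 'three', '4'])
s_numeric = pd.to_numeric(s, errors='coerce')

# Values that were present in the input but became NaN after coercion
bad_values = s[s_numeric.isna() & s.notna()]
print(bad_values.tolist())  # ['three']
```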
Safe Conversion Pattern¶
```python
def safe_convert(series, dtype):
    """Safely convert series to dtype, returning None on failure."""
    try:
        return series.astype(dtype)
    except (ValueError, TypeError) as e:
        print(f"Conversion failed: {e}")
        return None

s = pd.Series(['1', '2', 'invalid'])
result = safe_convert(s, int)  # Conversion failed
```
Memory Optimization¶
Downcasting Integers¶
```python
# Default int64 uses 8 bytes per value
# Note: memory_usage() also counts the index; pass index=False to count values only
s = pd.Series([1, 2, 3, 4, 5])
print(f"int64 memory: {s.memory_usage()} bytes")

# Downcast to int8 (1 byte) for small integers
s_small = s.astype('int8')
print(f"int8 memory: {s_small.memory_usage()} bytes")
```
Integer Types and Their Ranges¶
| Type | Bytes | Range |
|---|---|---|
| int8 | 1 | -128 to 127 |
| int16 | 2 | -32,768 to 32,767 |
| int32 | 4 | -2,147,483,648 to 2,147,483,647 |
| int64 | 8 | about -9.22×10¹⁸ to 9.22×10¹⁸ |
| uint8 | 1 | 0 to 255 |
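Before downcasting, it is worth checking that the data fits the target range listed in the table. The helper below is a hypothetical sketch (`fits_in` is not a pandas function) that uses np.iinfo to look up the limits of an integer dtype.

```python
import numpy as np
import pandas as pd

def fits_in(series, dtype):
    """Return True if every value of an integer Series fits in the integer dtype."""
    info = np.iinfo(dtype)
    return bool(series.min() >= info.min and series.max() <= info.max)

s = pd.Series([1, 2, 3, 100, 200])

print(fits_in(s, 'int8'))   # False: 200 exceeds 127
print(fits_in(s, 'uint8'))  # True
print(fits_in(s, 'int16'))  # True

# Cast only when the check passes, otherwise fall back to a wider type
target = 'int8' if fits_in(s, 'int8') else 'int16'
s_small = s.astype(target)
print(s_small.dtype)        # int16
```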
Using pd.to_numeric with Downcast¶
```python
s = pd.Series([1, 2, 3, 100, 200])

# Automatically choose smallest integer type
s_downcast = pd.to_numeric(s, downcast='integer')
print(f"Downcast dtype: {s_downcast.dtype}")  # int16 (200 does not fit in int8)
```
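pd.to_numeric() also accepts downcast='unsigned' and downcast='float'. A brief sketch with illustrative variable names:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 100, 200])

# Unsigned downcast: all values lie in 0..255, so uint8 is enough
s_unsigned = pd.to_numeric(s, downcast='unsigned')
print(s_unsigned.dtype)  # uint8

# Float downcast: float64 is reduced to float32, the smallest float type used
f = pd.Series([1.5, 2.5, 3.5])
f_small = pd.to_numeric(f, downcast='float')
print(f_small.dtype)     # float32
```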
Nullable Integer Types¶
pandas supports nullable integer dtypes (note the capital "I" in Int64) that can hold missing values, shown as <NA>, without falling back to float.
```python
# Standard int64 cannot hold NaN
s = pd.Series([1, 2, None])
print(s.dtype)  # float64 (upcasted)

# Nullable integer preserves integer nature
s = pd.Series([1, 2, None], dtype='Int64')
print(s)
```
0 1
1 2
2 <NA>
dtype: Int64
Converting to Nullable¶
```python
s = pd.Series([1.0, 2.0, float('nan')])
s_nullable = s.astype('Int64')
print(s_nullable)
```
0 1
1 2
2 <NA>
dtype: Int64
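The conversion above works because the floats are whole numbers. Floats with a fractional part are generally rejected by the nullable cast rather than truncated, so they need explicit rounding first. A sketch, with the exception types caught broadly because the exact error may differ between pandas versions:

```python
import pandas as pd

s = pd.Series([1.4, 2.6, float('nan')])

try:
    s.astype('Int64')
except (TypeError, ValueError) as e:
    print(f"Direct cast failed: {e}")

# Round explicitly, then cast; NaN becomes <NA>
s_rounded = s.round().astype('Int64')
print(s_rounded)
```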
Datetime Conversions¶
String to Datetime¶
```python
s = pd.Series(['2024-01-01', '2024-01-02', '2024-01-03'])

# Using astype
s_datetime = s.astype('datetime64[ns]')

# Better: using pd.to_datetime for more control
s_datetime = pd.to_datetime(s)
print(s_datetime)
```
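The extra control offered by pd.to_datetime() includes an explicit `format` and the same `errors='coerce'` option as pd.to_numeric(). A minimal sketch; the invalid entry is made up for illustration:

```python
import pandas as pd

s = pd.Series(['2024-01-01', 'not a date', '2024-01-03'])

# An explicit format avoids guessing; errors='coerce' turns failures into NaT
s_datetime = pd.to_datetime(s, format='%Y-%m-%d', errors='coerce')
print(s_datetime.isna().tolist())  # [False, True, False]
```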
Datetime to String¶
```python
dates = pd.Series(pd.date_range('2024-01-01', periods=3))
dates_str = dates.astype(str)
print(dates_str)
```
0 2024-01-01
1 2024-01-02
2 2024-01-03
dtype: object
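astype(str) always produces the default ISO-style text. When a specific layout is needed, dt.strftime() is a common alternative. A short sketch (the name `dates_fmt` is illustrative):

```python
import pandas as pd

dates = pd.Series(pd.date_range('2024-01-01', periods=3))

# dt.strftime gives control over the output format
dates_fmt = dates.dt.strftime('%d/%m/%Y')
print(dates_fmt.tolist())  # ['01/01/2024', '02/01/2024', '03/01/2024']
```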
Practical Example: Data Cleaning Pipeline¶
```python
# Raw data with mixed types
raw_data = pd.DataFrame({
    'user_id': ['1', '2', '3', '4'],
    'age': ['25', '30', '28', 'unknown'],
    'premium': ['1', '0', '1', '1'],
    'signup_date': ['2024-01-01', '2024-01-15', '2024-02-01', '2024-02-15']
})

# Clean and convert types
cleaned = raw_data.copy()

# Convert user_id to int
cleaned['user_id'] = cleaned['user_id'].astype(int)

# Convert age to numeric, coercing errors
cleaned['age'] = pd.to_numeric(cleaned['age'], errors='coerce')

# Convert premium to boolean
cleaned['premium'] = cleaned['premium'].astype(int).astype(bool)

# Convert signup_date to datetime
cleaned['signup_date'] = pd.to_datetime(cleaned['signup_date'])

print(cleaned.dtypes)
```
user_id int64
age float64
premium bool
signup_date datetime64[ns]
dtype: object
Financial Example: Portfolio Data Types¶
```python
portfolio = pd.DataFrame({
    'ticker': ['AAPL', 'MSFT', 'GOOGL'],
    'shares': ['100', '150', '50'],
    'price': ['150.25', '350.50', '140.75'],
    'sector': ['Tech', 'Tech', 'Tech']
})

# Convert to appropriate types
portfolio = portfolio.astype({
    'shares': 'int64',
    'price': 'float64',
    'sector': 'category'
})

# Calculate position values
portfolio['value'] = portfolio['shares'] * portfolio['price']
print(portfolio)
print(f"\nData types:\n{portfolio.dtypes}")
```
Summary of Type Conversions¶
| Conversion | Method | Notes |
|---|---|---|
| str → int | s.astype(int) | Direct conversion |
| str → float | s.astype(float) | Direct conversion |
| str → datetime | pd.to_datetime(s) | Preferred for dates |
| int → str | s.astype(str) | Direct conversion |
| float → int | s.astype(int) | Truncates decimals |
| object → category | s.astype('category') | Memory efficient |
| int64 → Int64 | s.astype('Int64') | Nullable integer |
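The "Truncates decimals" entry in the summary deserves a quick illustration: astype(int) drops the fractional part rather than rounding. A minimal sketch with made-up values:

```python
import pandas as pd

s = pd.Series([1.9, -1.9, 2.2])

# astype(int) drops the fractional part (truncation toward zero)
truncated = s.astype(int)
print(truncated.tolist())  # [1, -1, 2]

# Round first when nearest-integer behaviour is wanted
rounded = s.round().astype(int)
print(rounded.tolist())    # [2, -2, 2]
```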
Exercises¶
Exercise 1.
Create a DataFrame with a column of string numbers (e.g., ['1', '2', '3']). Convert it to int using .astype(int). Then convert it to float64 and verify the dtype changed.
Solution to Exercise 1
Convert string numbers to int then float.
```python
import pandas as pd

df = pd.DataFrame({'nums': ['1', '2', '3']})
print("Original dtype:", df['nums'].dtype)

df['nums'] = df['nums'].astype(int)
print("After int:", df['nums'].dtype)

df['nums'] = df['nums'].astype('float64')
print("After float64:", df['nums'].dtype)
```
Exercise 2.
Create a DataFrame with an int64 column where values range from 0 to 100. Convert it to int8 using .astype('int8') to save memory. Use .memory_usage() to compare memory before and after.
Solution to Exercise 2
Downcast int64 to int8 and compare memory.
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'values': np.random.randint(0, 101, 10000).astype('int64')})
mem_before = df['values'].memory_usage(deep=True)

df['values'] = df['values'].astype('int8')
mem_after = df['values'].memory_usage(deep=True)

print(f"Before: {mem_before} bytes")
print(f"After: {mem_after} bytes")
```
Exercise 3.
Create a DataFrame with a column containing mixed values including some that cannot be converted to numeric (e.g., 'N/A'). Demonstrate that .astype(float) raises an error, then use pd.to_numeric() with errors='coerce' as an alternative to convert valid values and set invalid ones to NaN.
Solution to Exercise 3
Handle conversion errors with pd.to_numeric.
```python
import pandas as pd

df = pd.DataFrame({'col': ['1.5', '2.0', 'N/A', '4.0']})

try:
    df['col'].astype(float)
except ValueError as e:
    print(f"astype error: {e}")

df['col'] = pd.to_numeric(df['col'], errors='coerce')
print(df)
print(df.dtypes)
```