Parsing Dates¶
Parsing dates involves converting string representations of dates and times into datetime objects. This is essential for time series analysis and date-based operations in pandas.
Understanding Date Parsing¶
String vs Datetime¶
import pandas as pd
# String - limited operations
date_str = "2024-01-15"
print(type(date_str)) # <class 'str'>
# Datetime - full date operations
date_dt = pd.to_datetime(date_str)
print(type(date_dt)) # <class 'pandas._libs.tslibs.timestamps.Timestamp'>
print(date_dt.year) # 2024
print(date_dt.month) # 1
print(date_dt.day) # 15
Why Parse Dates?¶
- Date arithmetic: Add/subtract time periods
- Component extraction: Get year, month, day, etc.
- Filtering: Select date ranges
- Resampling: Aggregate by time periods
- Sorting: Chronological ordering
pd.to_datetime()¶
The primary function for converting to datetime.
Basic Usage¶
# Single value
date = pd.to_datetime('2024-01-15')
print(date) # 2024-01-15 00:00:00
# Series
s = pd.Series(['2024-01-15', '2024-01-16', '2024-01-17'])
dates = pd.to_datetime(s)
print(dates)
0 2024-01-15
1 2024-01-16
2 2024-01-17
dtype: datetime64[ns]
Automatic Format Detection¶
pandas can often infer the date format automatically:
# Various formats - all parsed correctly
formats = [
'2024-01-15', # ISO format
'01/15/2024', # US format
'15-01-2024', # European format
'January 15, 2024', # Written format
'2024/01/15', # Slash format
]
for f in formats:
print(f"{f:20} -> {pd.to_datetime(f)}")
Specifying Format¶
For faster parsing or ambiguous formats, specify the format explicitly.
# Explicit format
s = pd.Series(['01/15/2024', '02/20/2024', '03/25/2024'])
# Without format (slower, may be ambiguous)
dates = pd.to_datetime(s)
# With format (faster, unambiguous)
dates = pd.to_datetime(s, format='%m/%d/%Y')
Common Format Codes¶
| Code | Meaning | Example |
|---|---|---|
%Y |
4-digit year | 2024 |
%y |
2-digit year | 24 |
%m |
Month (01-12) | 01 |
%d |
Day (01-31) | 15 |
%H |
Hour (00-23) | 14 |
%I |
Hour (01-12) | 02 |
%M |
Minute (00-59) | 30 |
%S |
Second (00-59) | 45 |
%f |
Microsecond | 000000 |
%p |
AM/PM | PM |
%B |
Full month name | January |
%b |
Abbreviated month | Jan |
%A |
Full weekday | Monday |
%a |
Abbreviated weekday | Mon |
Handling Errors¶
s = pd.Series(['2024-01-15', 'invalid', '2024-01-17'])
# Default: raises error
# pd.to_datetime(s) # ValueError
# errors='coerce': invalid dates become NaT (Not a Time)
dates = pd.to_datetime(s, errors='coerce')
print(dates)
0 2024-01-15
1 NaT
2 2024-01-17
dtype: datetime64[ns]
# errors='ignore': return original input unchanged
dates = pd.to_datetime(s, errors='ignore')
print(dates)
0 2024-01-15
1 invalid
2 2024-01-17
dtype: object
Unix Timestamps¶
# Unix timestamps (seconds since 1970-01-01)
timestamps = pd.Series([1705276800, 1705363200, 1705449600])
dates = pd.to_datetime(timestamps, unit='s')
print(dates)
0 2024-01-15
1 2024-01-16
2 2024-01-17
dtype: datetime64[ns]
| Unit | Description |
|---|---|
'D' |
Days |
's' |
Seconds |
'ms' |
Milliseconds |
'us' |
Microseconds |
'ns' |
Nanoseconds |
Parsing in pd.read_csv()¶
parse_dates Parameter¶
url = 'https://raw.githubusercontent.com/srivatsan88/YouTubeLI/master/dataset/electricity_consumption.csv'
# Without parsing - dates are strings
df = pd.read_csv(url)
print(df['Bill_Date'].dtype) # object
# With parse_dates=True - attempt to parse index
df = pd.read_csv(url, index_col='Bill_Date', parse_dates=True)
print(df.index.dtype) # datetime64[ns]
# With parse_dates=['column'] - parse specific columns
df = pd.read_csv(url, parse_dates=['Bill_Date'])
print(df['Bill_Date'].dtype) # datetime64[ns]
date_format Parameter¶
Specify the format for date columns:
# When date format is known
df = pd.read_csv(url, parse_dates=['Bill_Date'], date_format='%m/%d/%Y')
Multiple Date Columns¶
# Parse multiple columns
df = pd.read_csv(filename, parse_dates=['start_date', 'end_date'])
# Combine columns into single datetime
df = pd.read_csv(filename, parse_dates={'datetime': ['date', 'time']})
strptime vs strftime¶
Understanding the difference between parsing and formatting:
strptime (String Parse Time)¶
String → Datetime Object
from datetime import datetime
date_string = "2024-01-15 14:30:00"
format_string = "%Y-%m-%d %H:%M:%S"
# Parse string to datetime
dt = datetime.strptime(date_string, format_string)
print(dt) # 2024-01-15 14:30:00
print(type(dt)) # <class 'datetime.datetime'>
strftime (String From Time)¶
Datetime Object → String
from datetime import datetime
dt = datetime(2024, 1, 15, 14, 30, 0)
# Format datetime to string
date_string = dt.strftime("%B %d, %Y at %I:%M %p")
print(date_string) # January 15, 2024 at 02:30 PM
In pandas¶
# Parsing (strptime equivalent)
s = pd.Series(['2024-01-15', '2024-01-16'])
dates = pd.to_datetime(s, format='%Y-%m-%d')
# Formatting (strftime)
formatted = dates.dt.strftime('%B %d, %Y')
print(formatted)
# January 15, 2024
# January 16, 2024
Timezone-Aware Parsing¶
Parsing with Timezone¶
# String with timezone
s = pd.Series(['2024-01-15 12:00:00+00:00', '2024-01-15 07:00:00-05:00'])
dates = pd.to_datetime(s)
print(dates)
Adding Timezone After Parsing¶
# Parse first (naive)
dates = pd.to_datetime(['2024-01-15', '2024-01-16'])
# Then localize
dates_utc = dates.tz_localize('UTC')
print(dates_utc)
Performance Considerations¶
Use format Parameter¶
import time
s = pd.Series(['01/15/2024'] * 100000)
# Without format (slower)
start = time.time()
pd.to_datetime(s)
print(f"Without format: {time.time() - start:.3f}s")
# With format (faster)
start = time.time()
pd.to_datetime(s, format='%m/%d/%Y')
print(f"With format: {time.time() - start:.3f}s")
infer_datetime_format (deprecated)¶
In older pandas versions, infer_datetime_format=True could speed up parsing. This is now the default behavior and the parameter is deprecated.
Practical Examples¶
Financial Data¶
# Stock data with various date formats
df = pd.DataFrame({
'date_str': ['2024-01-15', '2024-01-16', '2024-01-17'],
'close': [150.0, 151.5, 149.8]
})
# Parse and set as index
df['date'] = pd.to_datetime(df['date_str'])
df = df.set_index('date').drop('date_str', axis=1)
print(df)
Mixed Date Formats¶
# Handle mixed formats with errors='coerce'
mixed_dates = pd.Series([
'2024-01-15',
'01/16/2024',
'January 17, 2024',
'invalid'
])
dates = pd.to_datetime(mixed_dates, errors='coerce')
print(dates)
Creating DatetimeIndex for Time Series¶
# Create proper time series
prices = pd.Series(
[100, 101, 102, 103, 104],
index=pd.to_datetime(['2024-01-15', '2024-01-16', '2024-01-17',
'2024-01-18', '2024-01-19']),
name='price'
)
# Now can use time series operations
print(prices['2024-01-16':'2024-01-18']) # Date slicing
print(prices.resample('2D').mean()) # Resampling
Summary¶
| Task | Method |
|---|---|
| Parse strings to datetime | pd.to_datetime() |
| Parse during CSV read | parse_dates parameter |
| Format datetime to string | dt.strftime() |
| Handle invalid dates | errors='coerce' |
| Parse Unix timestamps | unit='s' parameter |
| Specify date format | format parameter |