Parsing Dates¶

Parsing dates involves converting string representations of dates and times into datetime objects. This is essential for time series analysis and date-based operations in pandas.

Mental Model

A date string like "2024-01-15" is just text -- you cannot add days or extract the month from it. pd.to_datetime() parses the string into a Timestamp object that understands calendar logic. Always parse date columns at load time (parse_dates in read_csv) rather than after, so every downstream operation benefits from proper datetime dtype.

Understanding Date Parsing¶

String vs Datetime¶

```python import pandas as pd

String - limited operations¶

date_str = "2024-01-15" print(type(date_str)) #

Datetime - full date operations¶

date_dt = pd.to_datetime(date_str) print(type(date_dt)) # print(date_dt.year) # 2024 print(date_dt.month) # 1 print(date_dt.day) # 15 ```

Why Parse Dates?¶

Date arithmetic: Add/subtract time periods
Component extraction: Get year, month, day, etc.
Filtering: Select date ranges
Resampling: Aggregate by time periods
Sorting: Chronological ordering

pd.to_datetime()¶

The primary function for converting to datetime.

Basic Usage¶

```python

Single value¶

date = pd.to_datetime('2024-01-15') print(date) # 2024-01-15 00:00:00

Series¶

s = pd.Series(['2024-01-15', '2024-01-16', '2024-01-17']) dates = pd.to_datetime(s) print(dates) ```

0 2024-01-15 1 2024-01-16 2 2024-01-17 dtype: datetime64[ns]

Automatic Format Detection¶

pandas can often infer the date format automatically:

```python

Various formats - all parsed correctly¶

formats = [ '2024-01-15', # ISO format '01/15/2024', # US format '15-01-2024', # European format 'January 15, 2024', # Written format '2024/01/15', # Slash format ]

for f in formats: print(f"{f:20} -> {pd.to_datetime(f)}") ```

Specifying Format¶

For faster parsing or ambiguous formats, specify the format explicitly.

```python

Explicit format¶

s = pd.Series(['01/15/2024', '02/20/2024', '03/25/2024'])

Without format (slower, may be ambiguous)¶

dates = pd.to_datetime(s)

With format (faster, unambiguous)¶

dates = pd.to_datetime(s, format='%m/%d/%Y') ```

Common Format Codes¶

Code	Meaning	Example
`%Y`	4-digit year	2024
`%y`	2-digit year	24
`%m`	Month (01-12)	01
`%d`	Day (01-31)	15
`%H`	Hour (00-23)	14
`%I`	Hour (01-12)	02
`%M`	Minute (00-59)	30
`%S`	Second (00-59)	45
`%f`	Microsecond	000000
`%p`	AM/PM	PM
`%B`	Full month name	January
`%b`	Abbreviated month	Jan
`%A`	Full weekday	Monday
`%a`	Abbreviated weekday	Mon

Handling Errors¶

```python s = pd.Series(['2024-01-15', 'invalid', '2024-01-17'])

Default: raises error¶

pd.to_datetime(s) # ValueError¶

errors='coerce': invalid dates become NaT (Not a Time)¶

dates = pd.to_datetime(s, errors='coerce') print(dates) ```

0 2024-01-15 1 NaT 2 2024-01-17 dtype: datetime64[ns]

```python

errors='ignore': return original input unchanged¶

dates = pd.to_datetime(s, errors='ignore') print(dates) ```

0 2024-01-15 1 invalid 2 2024-01-17 dtype: object

Unix Timestamps¶

```python

Unix timestamps (seconds since 1970-01-01)¶

timestamps = pd.Series([1705276800, 1705363200, 1705449600]) dates = pd.to_datetime(timestamps, unit='s') print(dates) ```

0 2024-01-15 1 2024-01-16 2 2024-01-17 dtype: datetime64[ns]

Unit	Description
`'D'`	Days
`'s'`	Seconds
`'ms'`	Milliseconds
`'us'`	Microseconds
`'ns'`	Nanoseconds

Parsing in pd.read_csv()¶

parse_dates Parameter¶

```python url = 'https://raw.githubusercontent.com/srivatsan88/YouTubeLI/master/dataset/electricity_consumption.csv'

Without parsing - dates are strings¶

df = pd.read_csv(url) print(df['Bill_Date'].dtype) # object

With parse_dates=True - attempt to parse index¶

df = pd.read_csv(url, index_col='Bill_Date', parse_dates=True) print(df.index.dtype) # datetime64[ns]

With parse_dates=['column'] - parse specific columns¶

df = pd.read_csv(url, parse_dates=['Bill_Date']) print(df['Bill_Date'].dtype) # datetime64[ns] ```

date_format Parameter¶

Specify the format for date columns:

```python

When date format is known¶

df = pd.read_csv(url, parse_dates=['Bill_Date'], date_format='%m/%d/%Y') ```

Multiple Date Columns¶

```python

Parse multiple columns¶

df = pd.read_csv(filename, parse_dates=['start_date', 'end_date'])

Combine columns into single datetime¶

df = pd.read_csv(filename, parse_dates={'datetime': ['date', 'time']}) ```

strptime vs strftime¶

Understanding the difference between parsing and formatting:

strptime (String Parse Time)¶

String → Datetime Object

```python from datetime import datetime

date_string = "2024-01-15 14:30:00" format_string = "%Y-%m-%d %H:%M:%S"

Parse string to datetime¶

dt = datetime.strptime(date_string, format_string) print(dt) # 2024-01-15 14:30:00 print(type(dt)) # ```

strftime (String From Time)¶

Datetime Object → String

```python from datetime import datetime

dt = datetime(2024, 1, 15, 14, 30, 0)

Format datetime to string¶

date_string = dt.strftime("%B %d, %Y at %I:%M %p") print(date_string) # January 15, 2024 at 02:30 PM ```

In pandas¶

```python

Parsing (strptime equivalent)¶

s = pd.Series(['2024-01-15', '2024-01-16']) dates = pd.to_datetime(s, format='%Y-%m-%d')

Formatting (strftime)¶

formatted = dates.dt.strftime('%B %d, %Y') print(formatted)

January 15, 2024¶

January 16, 2024¶

```

Timezone-Aware Parsing¶

Parsing with Timezone¶

```python

String with timezone¶

s = pd.Series(['2024-01-15 12:00:00+00:00', '2024-01-15 07:00:00-05:00']) dates = pd.to_datetime(s) print(dates) ```

Adding Timezone After Parsing¶

```python

Parse first (naive)¶

dates = pd.to_datetime(['2024-01-15', '2024-01-16'])

Then localize¶

dates_utc = dates.tz_localize('UTC') print(dates_utc) ```

Performance Considerations¶

Use format Parameter¶

```python import time

s = pd.Series(['01/15/2024'] * 100000)

Without format (slower)¶

start = time.time() pd.to_datetime(s) print(f"Without format: {time.time() - start:.3f}s")

With format (faster)¶

start = time.time() pd.to_datetime(s, format='%m/%d/%Y') print(f"With format: {time.time() - start:.3f}s") ```

infer_datetime_format (deprecated)¶

In older pandas versions, infer_datetime_format=True could speed up parsing. This is now the default behavior and the parameter is deprecated.

Practical Examples¶

Financial Data¶

```python

Stock data with various date formats¶

df = pd.DataFrame({ 'date_str': ['2024-01-15', '2024-01-16', '2024-01-17'], 'close': [150.0, 151.5, 149.8] })

Parse and set as index¶

df['date'] = pd.to_datetime(df['date_str']) df = df.set_index('date').drop('date_str', axis=1) print(df) ```

Mixed Date Formats¶

```python

Handle mixed formats with errors='coerce'¶

mixed_dates = pd.Series([ '2024-01-15', '01/16/2024', 'January 17, 2024', 'invalid' ])

dates = pd.to_datetime(mixed_dates, errors='coerce') print(dates) ```

Creating DatetimeIndex for Time Series¶

```python

Create proper time series¶

prices = pd.Series( [100, 101, 102, 103, 104], index=pd.to_datetime(['2024-01-15', '2024-01-16', '2024-01-17', '2024-01-18', '2024-01-19']), name='price' )

Now can use time series operations¶

print(prices['2024-01-16':'2024-01-18']) # Date slicing print(prices.resample('2D').mean()) # Resampling ```

Summary¶

Task	Method
Parse strings to datetime	`pd.to_datetime()`
Parse during CSV read	`parse_dates` parameter
Format datetime to string	`dt.strftime()`
Handle invalid dates	`errors='coerce'`
Parse Unix timestamps	`unit='s'` parameter
Specify date format	`format` parameter

Exercises¶

Exercise 1. Write code that converts a string column to datetime using pd.to_datetime(). Handle a custom format like '15-Mar-2024'.

Solution to Exercise 1

```python import pandas as pd import numpy as np

Solution for the specific exercise¶

np.random.seed(42) df = pd.DataFrame({'A': np.random.randn(10), 'B': np.random.randn(10)}) print(df.head()) ```

Exercise 2. Explain the format parameter in pd.to_datetime(). Why is specifying the format faster than letting Pandas infer it?

Solution to Exercise 2

See the main content for the detailed explanation. The key concept involves understanding the Pandas API and its behavior for this specific operation.

Exercise 3. Write code that handles parsing errors using errors='coerce' in pd.to_datetime(). Show what happens to unparseable values.

Solution to Exercise 3

```python import pandas as pd import numpy as np

np.random.seed(42) df = pd.DataFrame({'A': np.random.randn(20), 'B': np.random.randn(20)}) result = df.describe() print(result) ```

Exercise 4. Create a DataFrame with a date string column and set it as the index after converting to datetime.

Solution to Exercise 4

```python import pandas as pd import numpy as np

np.random.seed(42) df = pd.DataFrame({'A': np.random.randn(50), 'group': np.random.choice(['X', 'Y'], 50)}) result = df.groupby('group').mean() print(result) ```