Data Handling with pandas¶

Overview¶

pandas is the primary library for structured data manipulation in Python. Built on top of NumPy, it provides two core data structures—Series (1-D) and DataFrame (2-D)—along with a rich set of tools for loading, cleaning, transforming, and summarizing data. Throughout this book, pandas is used to prepare datasets for statistical analysis.

import pandas as pd
import numpy as np

Core Data Structures¶

Series¶

A Series is a one-dimensional labeled array. It behaves like a NumPy array with an attached index.

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])
print(s)
# a    10
# b    20
# c    30
# d    40

print(s["b"])       # 20
print(s[s > 15])    # b, c, d entries
print(s.dtype)      # int64

DataFrame¶

A DataFrame is a two-dimensional labeled table where each column is a Series. Columns can have different data types.

data = {
    "Name":   ["Alice", "Bob", "Charlie"],
    "Age":    [25, 30, 35],
    "Salary": [70_000, 80_000, 90_000]
}
df = pd.DataFrame(data)
print(df)
#       Name  Age  Salary
# 0    Alice   25   70000
# 1      Bob   30   80000
# 2  Charlie   35   90000

Loading Data¶

# CSV
df = pd.read_csv("data.csv")

# Excel
df = pd.read_excel("data.xlsx", sheet_name="Sheet1")

# From a URL
url = "https://example.com/dataset.csv"
df = pd.read_csv(url)

# Tab-separated
df = pd.read_csv("data.tsv", sep="\t")

Inspecting a DataFrame¶

df.head()            # first 5 rows
df.tail(3)           # last 3 rows
df.shape             # (rows, columns)
df.dtypes            # data type of each column
df.info()            # concise summary (types, non-null counts, memory)
df.describe()        # descriptive statistics for numeric columns
df.columns           # column names
df.index             # row index

Selecting Data¶

Column Selection¶

# Single column → Series
df["Age"]

# Multiple columns → DataFrame
df[["Name", "Salary"]]

Row Selection¶

# By integer position
df.iloc[0]          # first row as Series
df.iloc[0:2]        # first two rows as DataFrame

# By label
df.loc[0]           # row with index label 0
df.loc[0:1, "Name":"Age"]   # label-based slicing (inclusive on both ends)

Conditional Filtering¶

# Rows where Salary > 75000
high_salary = df[df["Salary"] > 75_000]
print(high_salary)
#       Name  Age  Salary
# 1      Bob   30   80000
# 2  Charlie   35   90000

# Multiple conditions (use & for AND, | for OR; parentheses required)
subset = df[(df["Age"] >= 30) & (df["Salary"] > 75_000)]

Adding and Modifying Columns¶

# New column from arithmetic
df["Bonus"] = df["Salary"] * 0.10

# New column from a function
df["Senior"] = df["Age"].apply(lambda x: x >= 30)

# Rename columns
df = df.rename(columns={"Salary": "Annual_Salary"})

Handling Missing Data¶

# Detect missing values
df.isna().sum()              # count NaNs per column

# Drop rows with any NaN
df_clean = df.dropna()

# Drop rows where specific column is NaN
df_clean = df.dropna(subset=["Age"])

# Fill missing values
df["Age"] = df["Age"].fillna(df["Age"].median())

Sorting¶

# Sort by a single column
df.sort_values("Age")

# Sort descending
df.sort_values("Salary", ascending=False)

# Sort by multiple columns
df.sort_values(["Age", "Salary"], ascending=[True, False])

Grouping and Aggregation¶

groupby splits the data by one or more keys, applies an aggregation function, and combines the results—a cornerstone of exploratory data analysis.

data = {
    "Department": ["Sales", "Sales", "Engineering", "Engineering", "HR"],
    "Employee":   ["A", "B", "C", "D", "E"],
    "Salary":     [60_000, 70_000, 90_000, 95_000, 65_000]
}
df = pd.DataFrame(data)

# Mean salary by department
df.groupby("Department")["Salary"].mean()
# Department
# Engineering    92500
# HR             65000
# Sales          65000

# Multiple aggregations
df.groupby("Department")["Salary"].agg(["mean", "std", "count"])

Merging and Joining¶

# Two DataFrames sharing a key column
orders = pd.DataFrame({
    "OrderID":    [1, 2, 3],
    "CustomerID": [101, 102, 103],
    "Amount":     [250, 450, 300]
})

customers = pd.DataFrame({
    "CustomerID": [101, 102, 104],
    "Name":       ["Alice", "Bob", "Diana"]
})

# Inner join (only matching keys)
merged = pd.merge(orders, customers, on="CustomerID", how="inner")

# Left join (keep all orders)
merged_left = pd.merge(orders, customers, on="CustomerID", how="left")

Pivot Tables¶

sales = pd.DataFrame({
    "Region":  ["East", "East", "West", "West"],
    "Quarter": ["Q1", "Q2", "Q1", "Q2"],
    "Revenue": [100, 150, 200, 180]
})

pivot = sales.pivot_table(values="Revenue", index="Region", columns="Quarter", aggfunc="sum")
print(pivot)
# Quarter   Q1   Q2
# Region
# East     100  150
# West     200  180

Descriptive Statistics with pandas¶

pandas provides convenient methods that align with the statistical concepts covered in this book.

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "X": rng.normal(50, 10, 200),
    "Y": rng.normal(100, 25, 200)
})

# Central tendency
df.mean()
df.median()

# Spread
df.std()          # sample standard deviation (ddof=1 by default)
df.var()          # sample variance
df.quantile([0.25, 0.5, 0.75])

# Shape
df.skew()
df.kurt()         # excess kurtosis

# Correlation
df.corr()         # Pearson correlation matrix
df.corr(method="spearman")

Working with Dates and Times¶

Statistical data frequently includes timestamps. pandas has first-class support for datetime operations.

# Parse dates during CSV read
df = pd.read_csv("timeseries.csv", parse_dates=["date"])

# Manual conversion
df["date"] = pd.to_datetime(df["date"])

# Set as index for time-series operations
df = df.set_index("date")

# Resample to monthly frequency
monthly = df.resample("M").mean()

# Rolling statistics
df["rolling_mean_7d"] = df["value"].rolling(window=7).mean()

Applying Functions¶

# Element-wise via apply
df["log_Salary"] = df["Salary"].apply(np.log)

# Row-wise via apply with axis=1
df["Total"] = df.apply(lambda row: row["Base"] + row["Bonus"], axis=1)

# Vectorized string methods
df["Name_upper"] = df["Name"].str.upper()

Exporting Data¶

df.to_csv("output.csv", index=False)
df.to_excel("output.xlsx", index=False)

Summary¶

Concept	Key Takeaway
`Series` / `DataFrame`	Core 1-D and 2-D labeled data structures
Selection	`loc` for labels, `iloc` for positions, boolean masks for filtering
Missing data	`isna`, `dropna`, `fillna`
Grouping	`groupby` + aggregation for split-apply-combine analysis
Merging	`pd.merge` for SQL-style joins on key columns
Descriptive stats	`mean`, `std`, `var`, `quantile`, `corr` map directly to statistical concepts
Time series	`pd.to_datetime`, `resample`, `rolling` for temporal data