DataFrame¶

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It is the primary pandas data structure for tabular data, analogous to a spreadsheet or SQL table.

Conceptual Overview¶

A DataFrame can be thought of as: - A dictionary of Series objects sharing the same index - A 2D NumPy array with row and column labels - An Excel spreadsheet with named columns

import pandas as pd

df = pd.DataFrame({
    "price": [100, 101, 102],
    "volume": [10, 12, 9],
})
print(df)

   price  volume
0    100      10
1    101      12
2    102       9

Creating a DataFrame¶

From a Dictionary of Lists¶

The most common creation method. Keys become column names, values become column data.

df = pd.DataFrame({
    "city": ["mumbai", "delhi"],
    "temperature": [32, 45],
    "humidity": [70, 60],
})
print(df)

     city  temperature  humidity
0  mumbai           32        70
1   delhi           45        60

From a List of Dictionaries¶

Each dictionary represents a row. Keys become column names.

df = pd.DataFrame([
    {'name': 'Alice', 'age': 25, 'city': 'NYC'},
    {'name': 'Bob', 'age': 30, 'city': 'LA'},
    {'name': 'Charlie', 'age': 35}  # Missing 'city' becomes NaN
])
print(df)

      name  age city
0    Alice   25  NYC
1      Bob   30   LA
2  Charlie   35  NaN

With Custom Index¶

df = pd.DataFrame(
    {"price": [100, 101, 102]},
    index=["day1", "day2", "day3"]
)
print(df)

      price
day1    100
day2    101
day3    102

From NumPy Array¶

import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6]])
df = pd.DataFrame(data, columns=['A', 'B'], index=['x', 'y', 'z'])
print(df)

DataFrame Attributes¶

shape¶

Returns the dimensionality as (rows, columns).

url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)
print(df.shape)  # (891, 12)

columns¶

The column labels as an Index object.

print(df.columns)
# Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', ...])

index¶

The row labels.

print(df.index)
# RangeIndex(start=0, stop=891, step=1)

dtypes¶

Data types of each column.

print(df.dtypes)

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
...

values¶

The underlying data as a NumPy array.

print(type(df.values))  # <class 'numpy.ndarray'>
print(df.values.shape)  # (891, 12)

Column Access¶

Bracket Notation (Recommended)¶

# Single column -> returns Series
price_series = df["price"]
print(type(price_series))  # <class 'pandas.core.series.Series'>

# Multiple columns -> returns DataFrame
subset = df[["price", "volume"]]
print(type(subset))  # <class 'pandas.core.frame.DataFrame'>

Attribute Access (Use Cautiously)¶

# Works for simple column names
df.price

# Fails if column name:
# - Contains spaces
# - Starts with a number
# - Conflicts with DataFrame methods (e.g., 'count', 'mean')

Adding New Columns¶

# Direct assignment
df["return"] = df["price"].pct_change()

# Using assign() for method chaining
df = df.assign(
    return_pct=lambda x: x["price"].pct_change(),
    volume_ma=lambda x: x["volume"].rolling(3).mean()
)

Row Access¶

Using loc (Label-based)¶

df = pd.DataFrame(
    {"price": [100, 101, 102]},
    index=["day1", "day2", "day3"]
)

df.loc["day1"]           # Single row as Series
df.loc[["day1", "day3"]] # Multiple rows as DataFrame
df.loc["day1", "price"]  # Specific cell

Using iloc (Position-based)¶

df.iloc[0]        # First row as Series
df.iloc[0:2]      # First two rows as DataFrame
df.iloc[0, 0]     # First cell
df.iloc[-1]       # Last row

Slicing Rows¶

df[0:2]           # First two rows (position-based slicing)
df["day1":"day2"] # Label-based slicing (inclusive)

DataFrame Operations¶

Arithmetic Operations¶

Operations align by index and column labels.

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [10, 20], 'B': [30, 40]})

print(df1 + df2)

    A   B
0  11  33
1  22  44

Broadcasting with Series¶

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
s = pd.Series([10, 100], index=['A', 'B'])

print(df * s)  # Multiplies each column by corresponding Series value

Financial Context¶

DataFrames are the standard structure for financial data analysis.

Price Histories¶

import pandas as pd

prices = pd.DataFrame({
    "AAPL": [150, 151, 152, 153, 154],
    "MSFT": [300, 301, 302, 303, 304],
    "GOOGL": [140, 141, 142, 143, 144]
}, index=pd.date_range("2024-01-01", periods=5))

print(prices)

            AAPL  MSFT  GOOGL
2024-01-01   150   300    140
2024-01-02   151   301    141
2024-01-03   152   302    142
2024-01-04   153   303    143
2024-01-05   154   304    144

Returns Calculation¶

# Daily returns
returns = prices.pct_change()

# Cumulative returns
cumulative_returns = (1 + returns).cumprod() - 1

Risk Metrics¶

import numpy as np

# Annualized volatility
volatility = returns.std() * np.sqrt(252)

# Correlation matrix
correlation = returns.corr()

# Covariance matrix
covariance = returns.cov() * 252  # Annualized

Portfolio Analysis¶

# Portfolio weights
weights = pd.Series({'AAPL': 0.4, 'MSFT': 0.35, 'GOOGL': 0.25})

# Portfolio return
portfolio_return = (returns * weights).sum(axis=1)

# Portfolio volatility
portfolio_vol = np.sqrt(weights @ covariance @ weights)

DataFrame vs Series¶

Aspect	Series	DataFrame
Dimensions	1D	2D
Access single item	`s['label']`	`df.loc[row, col]`
Selecting subset	Returns Series	Returns Series or DataFrame
Typical use	Single variable	Multiple variables
Column relationship	N/A	Each column is a Series

Runnable Example: `dataframe_tutorial.py`¶

"""
Pandas Tutorial 02: Introduction to DataFrames
===============================================

This tutorial introduces DataFrames, the most important data structure in pandas.
We'll cover:
1. What is a DataFrame
2. Creating DataFrames from various sources
3. DataFrame attributes and methods
4. Selecting columns and rows
5. Adding and removing columns/rows
6. Basic DataFrame operations

Prerequisites: Tutorial 01 - Series
Difficulty: Beginner
"""

import pandas as pd
import numpy as np

# ============================================================================
# SECTION 1: WHAT IS A DATAFRAME?
# ============================================================================

if __name__ == "__main__":

    """
    What is a DataFrame?
    --------------------
    A DataFrame is a 2-dimensional labeled data structure with columns that can be 
    of different types. You can think of it as:
    - A spreadsheet or SQL table
    - A dictionary of Series objects (each column is a Series)
    - A 2D NumPy array with row and column labels

    Key characteristics:
    - Rows and columns have labels (index and columns)
    - Columns can have different data types
    - Size-mutable: can add/delete columns and rows
    - Automatic data alignment
    - Powerful data manipulation capabilities
    """

    # ============================================================================
    # SECTION 2: CREATING DATAFRAMES
    # ============================================================================

    print("=" * 70)
    print("CREATING DATAFRAMES")
    print("=" * 70)

    # Method 1: From a dictionary of lists
    # Keys become column names, lists become column data
    print("\n1. DataFrame from dictionary of lists:")
    data_dict = {
        'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, 30, 35, 28, 32],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
        'Salary': [50000, 60000, 75000, 55000, 65000]
    }
    df1 = pd.DataFrame(data_dict)
    print(df1)
    print(f"\nType: {type(df1)}")

    # Method 2: From a list of dictionaries
    # Each dictionary is a row
    print("\n2. DataFrame from list of dictionaries:")
    data_list = [
        {'Name': 'Alice', 'Age': 25, 'Score': 85},
        {'Name': 'Bob', 'Age': 30, 'Score': 92},
        {'Name': 'Charlie', 'Age': 35, 'Score': 78},
    ]
    df2 = pd.DataFrame(data_list)
    print(df2)

    # Method 3: From a 2D NumPy array
    # Need to specify column names and index
    print("\n3. DataFrame from NumPy array:")
    numpy_data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    df3 = pd.DataFrame(numpy_data,
                       columns=['Column A', 'Column B', 'Column C'],
                       index=['Row 1', 'Row 2', 'Row 3'])
    print(df3)

    # Method 4: From a dictionary of Series
    # Useful when you already have Series objects
    print("\n4. DataFrame from dictionary of Series:")
    series_x = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
    series_y = pd.Series([100, 200, 300], index=['a', 'b', 'c'])
    series_z = pd.Series([1000, 2000, 3000], index=['a', 'b', 'c'])
    df4 = pd.DataFrame({'X': series_x, 'Y': series_y, 'Z': series_z})
    print(df4)

    # Method 5: From a list of lists with column names
    print("\n5. DataFrame from list of lists:")
    data_lists = [
        ['Product A', 100, 10.5],
        ['Product B', 150, 12.0],
        ['Product C', 200, 9.5]
    ]
    df5 = pd.DataFrame(data_lists, columns=['Product', 'Quantity', 'Price'])
    print(df5)

    # Method 6: Empty DataFrame (then fill it later)
    print("\n6. Empty DataFrame:")
    df6 = pd.DataFrame()
    print(f"Empty DataFrame:\n{df6}")
    print(f"Shape: {df6.shape}")

    # ============================================================================
    # SECTION 3: DATAFRAME ATTRIBUTES
    # ============================================================================

    print("\n" + "=" * 70)
    print("DATAFRAME ATTRIBUTES")
    print("=" * 70)

    # Let's work with df1 (employee data)
    print("\nWorking with this DataFrame:")
    print(df1)

    # Shape: (rows, columns)
    print(f"\nShape (rows, columns): {df1.shape}")

    # Dimensions
    print(f"Number of dimensions: {df1.ndim}")

    # Size: total number of elements (rows × columns)
    print(f"Size (total elements): {df1.size}")

    # Column names
    print(f"\nColumn names: {df1.columns.tolist()}")

    # Index (row labels)
    print(f"Index (row labels): {df1.index.tolist()}")

    # Data types of each column
    print("\nData types of columns:")
    print(df1.dtypes)

    # Values as a NumPy array
    print("\nValues (as 2D NumPy array):")
    print(df1.values)

    # Information about the DataFrame
    print("\nDataFrame info:")
    df1.info()

    # Memory usage
    print(f"\nMemory usage:\n{df1.memory_usage()}")

    # ============================================================================
    # SECTION 4: VIEWING DATA
    # ============================================================================

    print("\n" + "=" * 70)
    print("VIEWING DATA")
    print("=" * 70)

    # Create a larger DataFrame for demonstration
    np.random.seed(42)  # For reproducibility
    large_df = pd.DataFrame({
        'A': np.random.randint(1, 100, 20),
        'B': np.random.randint(1, 100, 20),
        'C': np.random.randint(1, 100, 20),
        'D': np.random.randint(1, 100, 20)
    })

    print("Full DataFrame (20 rows):")
    print(large_df)

    # View first n rows (default: 5)
    print("\nFirst 5 rows (head):")
    print(large_df.head())

    print("\nFirst 3 rows:")
    print(large_df.head(3))

    # View last n rows (default: 5)
    print("\nLast 5 rows (tail):")
    print(large_df.tail())

    print("\nLast 2 rows:")
    print(large_df.tail(2))

    # Random sample of rows
    print("\nRandom sample of 3 rows:")
    print(large_df.sample(3))

    # ============================================================================
    # SECTION 5: SELECTING COLUMNS
    # ============================================================================

    print("\n" + "=" * 70)
    print("SELECTING COLUMNS")
    print("=" * 70)

    print("Original DataFrame:")
    print(df1)

    # Method 1: Select a single column (returns a Series)
    print("\n1. Select 'Name' column (returns Series):")
    name_column = df1['Name']
    print(name_column)
    print(f"Type: {type(name_column)}")

    # Alternative: using dot notation (only if column name is valid Python identifier)
    print("\n2. Select 'Age' using dot notation:")
    age_column = df1.Age
    print(age_column)

    # Method 2: Select multiple columns (returns a DataFrame)
    print("\n3. Select multiple columns ['Name', 'Salary']:")
    subset = df1[['Name', 'Salary']]
    print(subset)
    print(f"Type: {type(subset)}")

    # Change column order while selecting
    print("\n4. Select columns in different order:")
    reordered = df1[['Salary', 'Name', 'City']]
    print(reordered)

    # ============================================================================
    # SECTION 6: SELECTING ROWS
    # ============================================================================

    print("\n" + "=" * 70)
    print("SELECTING ROWS")
    print("=" * 70)

    print("Original DataFrame:")
    print(df1)

    # Method 1: Select by integer position using iloc
    # iloc[row, column] - both are integer positions
    # Remember: Python uses 0-based indexing

    print("\n1. Select first row using iloc[0]:")
    first_row = df1.iloc[0]
    print(first_row)
    print(f"Type: {type(first_row)}")  # Returns a Series

    print("\n2. Select first 3 rows using iloc[0:3]:")
    first_three = df1.iloc[0:3]
    print(first_three)

    print("\n3. Select specific rows using iloc[[0, 2, 4]]:")
    specific_rows = df1.iloc[[0, 2, 4]]
    print(specific_rows)

    # Method 2: Select by label using loc
    # loc[row_label, column_label]
    # First, let's set custom index
    df1_custom = df1.copy()
    df1_custom.index = ['emp1', 'emp2', 'emp3', 'emp4', 'emp5']
    print("\n4. DataFrame with custom index:")
    print(df1_custom)

    print("\n5. Select row by label using loc['emp2']:")
    row_emp2 = df1_custom.loc['emp2']
    print(row_emp2)

    print("\n6. Select multiple rows using loc[['emp1', 'emp3', 'emp5']]:")
    selected_emps = df1_custom.loc[['emp1', 'emp3', 'emp5']]
    print(selected_emps)

    # Method 3: Boolean indexing (very powerful!)
    print("\n7. Boolean indexing - employees with Age > 28:")
    condition = df1['Age'] > 28
    print(f"Condition: {condition.tolist()}")
    filtered = df1[condition]
    print(filtered)

    # Multiple conditions using & (and) or | (or)
    # Important: Each condition must be in parentheses!
    print("\n8. Multiple conditions - Age > 28 AND Salary > 60000:")
    filtered2 = df1[(df1['Age'] > 28) & (df1['Salary'] > 60000)]
    print(filtered2)

    print("\n9. Multiple conditions - Age < 30 OR Salary > 65000:")
    filtered3 = df1[(df1['Age'] < 30) | (df1['Salary'] > 65000)]
    print(filtered3)

    # String conditions
    print("\n10. Filter by string - City contains 'New':")
    filtered4 = df1[df1['City'].str.contains('New')]
    print(filtered4)

    # ============================================================================
    # SECTION 7: SELECTING BOTH ROWS AND COLUMNS
    # ============================================================================

    print("\n" + "=" * 70)
    print("SELECTING ROWS AND COLUMNS TOGETHER")
    print("=" * 70)

    print("Original DataFrame:")
    print(df1)

    # Using iloc[rows, columns] - integer positions
    print("\n1. iloc[0:3, 0:2] - first 3 rows, first 2 columns:")
    subset1 = df1.iloc[0:3, 0:2]
    print(subset1)

    print("\n2. iloc[[0, 2], [0, 3]] - specific rows and columns:")
    subset2 = df1.iloc[[0, 2], [0, 3]]
    print(subset2)

    # Using loc[rows, columns] - labels
    print("\n3. loc[0:2, ['Name', 'Salary']] - rows 0-2, specific columns:")
    subset3 = df1.loc[0:2, ['Name', 'Salary']]
    print(subset3)

    # Select all rows, specific columns
    print("\n4. All rows, columns 'Name' and 'Age':")
    subset4 = df1.loc[:, ['Name', 'Age']]
    print(subset4)

    # Combine boolean indexing with column selection
    print("\n5. Age > 28, show Name and Salary only:")
    subset5 = df1.loc[df1['Age'] > 28, ['Name', 'Salary']]
    print(subset5)

    # ============================================================================
    # SECTION 8: ADDING AND MODIFYING COLUMNS
    # ============================================================================

    print("\n" + "=" * 70)
    print("ADDING AND MODIFYING COLUMNS")
    print("=" * 70)

    # Create a working copy
    df_work = df1.copy()
    print("Original DataFrame:")
    print(df_work)

    # Add a new column with a constant value
    df_work['Country'] = 'USA'
    print("\n1. Added 'Country' column with constant value:")
    print(df_work)

    # Add a new column based on calculation
    df_work['Annual Bonus'] = df_work['Salary'] * 0.1
    print("\n2. Added 'Annual Bonus' (10% of Salary):")
    print(df_work)

    # Add a column based on conditions
    df_work['Category'] = df_work['Age'].apply(
        lambda x: 'Young' if x < 30 else 'Experienced'
    )
    print("\n3. Added 'Category' based on Age:")
    print(df_work)

    # Modify existing column
    df_work['Salary'] = df_work['Salary'] * 1.05  # 5% raise
    print("\n4. Modified 'Salary' (5% increase):")
    print(df_work)

    # Add column using assign() method (doesn't modify original)
    df_assigned = df_work.assign(Tax=lambda x: x['Salary'] * 0.2)
    print("\n5. Added 'Tax' column using assign():")
    print(df_assigned)

    # ============================================================================
    # SECTION 9: ADDING AND REMOVING ROWS
    # ============================================================================

    print("\n" + "=" * 70)
    print("ADDING AND REMOVING ROWS")
    print("=" * 70)

    # Create a working copy
    df_rows = df1.copy()
    print("Original DataFrame:")
    print(df_rows)

    # Add a new row using loc (with specific index)
    df_rows.loc[len(df_rows)] = ['Frank', 40, 'Seattle', 70000]
    print("\n1. Added new row:")
    print(df_rows)

    # Add row using dictionary
    new_row_dict = {'Name': 'Grace', 'Age': 27, 'City': 'Boston', 'Salary': 58000}
    df_rows = pd.concat([df_rows, pd.DataFrame([new_row_dict])], ignore_index=True)
    print("\n2. Added row from dictionary:")
    print(df_rows)

    # Remove row by index position
    df_dropped = df_rows.drop(0)  # Drops row with index 0
    print("\n3. Removed first row (index 0):")
    print(df_dropped)

    # Remove multiple rows
    df_dropped_multiple = df_rows.drop([0, 2, 4])
    print("\n4. Removed rows at indices 0, 2, 4:")
    print(df_dropped_multiple)

    # Remove rows based on condition
    df_filtered = df_rows[df_rows['Age'] < 35]
    print("\n5. Kept only rows where Age < 35:")
    print(df_filtered)

    # ============================================================================
    # SECTION 10: BASIC DATAFRAME OPERATIONS
    # ============================================================================

    print("\n" + "=" * 70)
    print("BASIC DATAFRAME OPERATIONS")
    print("=" * 70)

    print("Original DataFrame:")
    print(df1)

    # Statistical summary
    print("\n1. Statistical summary (describe):")
    print(df1.describe())

    # Summary including non-numeric columns
    print("\n2. Summary including all columns:")
    print(df1.describe(include='all'))

    # Mean of numeric columns
    print("\n3. Mean of numeric columns:")
    print(df1.mean(numeric_only=True))

    # Sum of numeric columns
    print("\n4. Sum of numeric columns:")
    print(df1.sum(numeric_only=True))

    # Count non-null values in each column
    print("\n5. Count of non-null values:")
    print(df1.count())

    # Unique values in a column
    print("\n6. Unique cities:")
    print(df1['City'].unique())
    print(f"Number of unique cities: {df1['City'].nunique()}")

    # Value counts
    print("\n7. Value counts for 'City':")
    print(df1['City'].value_counts())

    # ============================================================================
    # SECTION 11: SORTING
    # ============================================================================

    print("\n" + "=" * 70)
    print("SORTING")
    print("=" * 70)

    print("Original DataFrame:")
    print(df1)

    # Sort by a single column
    print("\n1. Sort by Age (ascending):")
    sorted_age = df1.sort_values('Age')
    print(sorted_age)

    print("\n2. Sort by Age (descending):")
    sorted_age_desc = df1.sort_values('Age', ascending=False)
    print(sorted_age_desc)

    # Sort by multiple columns
    print("\n3. Sort by Salary (desc), then by Name (asc):")
    sorted_multi = df1.sort_values(['Salary', 'Name'], ascending=[False, True])
    print(sorted_multi)

    # Sort by index
    print("\n4. Sort by index:")
    df_index_sorted = df1.sort_index()
    print(df_index_sorted)

    # ============================================================================
    # SECTION 12: RENAMING COLUMNS AND INDEX
    # ============================================================================

    print("\n" + "=" * 70)
    print("RENAMING")
    print("=" * 70)

    print("Original DataFrame:")
    print(df1)

    # Rename columns using a dictionary
    df_renamed = df1.rename(columns={'Name': 'Employee Name', 'Salary': 'Annual Salary'})
    print("\n1. Renamed columns:")
    print(df_renamed)

    # Rename all columns at once
    df_renamed_all = df1.copy()
    df_renamed_all.columns = ['emp_name', 'emp_age', 'emp_city', 'emp_salary']
    print("\n2. Renamed all columns:")
    print(df_renamed_all)

    # Rename index
    df_renamed_index = df1.rename(index={0: 'first', 1: 'second', 2: 'third'})
    print("\n3. Renamed index:")
    print(df_renamed_index)

    # ============================================================================
    # SECTION 13: HANDLING MISSING DATA
    # ============================================================================

    print("\n" + "=" * 70)
    print("HANDLING MISSING DATA")
    print("=" * 70)

    # Create DataFrame with missing values
    data_missing = {
        'A': [1, 2, np.nan, 4, 5],
        'B': [np.nan, 2, 3, np.nan, 5],
        'C': [1, 2, 3, 4, 5],
        'D': [np.nan, np.nan, np.nan, np.nan, np.nan]
    }
    df_missing = pd.DataFrame(data_missing)
    print("DataFrame with missing values:")
    print(df_missing)

    # Check for missing values
    print("\n1. Is null? (True for NaN values):")
    print(df_missing.isnull())

    # Count missing values per column
    print("\n2. Count of missing values per column:")
    print(df_missing.isnull().sum())

    # Total missing values
    print(f"\n3. Total missing values: {df_missing.isnull().sum().sum()}")

    # Drop rows with any missing values
    print("\n4. Drop rows with ANY missing values:")
    df_dropped_rows = df_missing.dropna()
    print(df_dropped_rows)

    # Drop columns with any missing values
    print("\n5. Drop columns with ANY missing values:")
    df_dropped_cols = df_missing.dropna(axis=1)
    print(df_dropped_cols)

    # Drop rows where ALL values are missing
    print("\n6. Drop rows where ALL values are missing:")
    df_dropped_all = df_missing.dropna(how='all')
    print(df_dropped_all)

    # Drop columns where ALL values are missing
    print("\n7. Drop columns where ALL values are missing:")
    df_dropped_all_cols = df_missing.dropna(axis=1, how='all')
    print(df_dropped_all_cols)

    # Fill missing values with 0
    print("\n8. Fill missing values with 0:")
    df_filled_zero = df_missing.fillna(0)
    print(df_filled_zero)

    # Fill missing values with mean of each column
    print("\n9. Fill missing values with column mean:")
    df_filled_mean = df_missing.fillna(df_missing.mean())
    print(df_filled_mean)

    # Fill missing values with different values for each column
    print("\n10. Fill with different values per column:")
    df_filled_dict = df_missing.fillna({'A': 0, 'B': 999, 'D': -1})
    print(df_filled_dict)

    # Forward fill
    print("\n11. Forward fill (use previous row's value):")
    df_ffill = df_missing.fillna(method='ffill')
    print(df_ffill)

    # Backward fill
    print("\n12. Backward fill (use next row's value):")
    df_bfill = df_missing.fillna(method='bfill')
    print(df_bfill)

    # ============================================================================
    # SECTION 14: SUMMARY AND KEY TAKEAWAYS
    # ============================================================================

    print("\n" + "=" * 70)
    print("KEY TAKEAWAYS")
    print("=" * 70)

    summary = """
    1. DataFrame is a 2D labeled data structure (like a table)
    2. Create DataFrames from dicts, lists, arrays, or Series
    3. Use head(), tail(), and sample() to view data
    4. Select columns with [] or dot notation
    5. Select rows with iloc (position) or loc (label)
    6. Boolean indexing is powerful for filtering
    7. Add columns by assignment or assign() method
    8. Add rows with loc or concat()
    9. Remove rows/columns with drop()
    10. Sort with sort_values() or sort_index()
    11. Rename with rename() or direct assignment to .columns
    12. Handle missing data with dropna() or fillna()

    Next Steps:
    -----------
    - Practice DataFrame operations
    - Try the exercise file: exercises/02_dataframe_exercises.py
    - Move on to Tutorial 03: Reading and Writing Data
    """

    print(summary)