Reading SQL and Parquet Files¶

pandas supports reading data from SQL databases and Parquet files, which are common in enterprise and big data environments.

Mental Model

read_sql treats a database query result as a DataFrame -- the SQL engine does the heavy filtering, and pandas receives the final table. read_parquet loads columnar binary files that preserve dtypes exactly, making round-trips lossless and fast. Both skip the parsing overhead of text formats like CSV.

pd.read_sql - Database Connectivity¶

Basic Setup¶

```python import pandas as pd import sqlite3

Create a SQLite connection (example)¶

conn = sqlite3.connect('database.db')

For other databases, use appropriate drivers:¶

PostgreSQL: psycopg2¶

MySQL: pymysql or mysql-connector-python¶

SQL Server: pyodbc¶

Oracle: cx_Oracle¶

```

Reading with SQL Query¶

```python

Execute a SQL query and load results into DataFrame¶

query = "SELECT * FROM employees WHERE department = 'Sales'" df = pd.read_sql(query, conn) print(df) ```

Reading Entire Table¶

```python

Load an entire table¶

df = pd.read_sql_table('employees', conn)

Or using read_sql with table name¶

df = pd.read_sql('employees', conn) ```

Connection Strings¶

```python

SQLAlchemy connection string format (recommended)¶

from sqlalchemy import create_engine

SQLite¶

engine = create_engine('sqlite:///database.db')

PostgreSQL¶

engine = create_engine('postgresql://user:password@localhost:5432/dbname')

MySQL¶

engine = create_engine('mysql+pymysql://user:password@localhost:3306/dbname')

SQL Server¶

engine = create_engine('mssql+pyodbc://user:password@server/dbname?driver=ODBC+Driver+17+for+SQL+Server')

Read using engine¶

df = pd.read_sql("SELECT * FROM table_name", engine) ```

Parameterized Queries¶

```python

Safe parameter passing (prevents SQL injection)¶

query = "SELECT * FROM employees WHERE salary > :min_salary" df = pd.read_sql(query, conn, params={'min_salary': 50000})

With SQLite positional parameters¶

query = "SELECT * FROM employees WHERE department = ?" df = pd.read_sql(query, conn, params=['Sales']) ```

Chunked Reading for Large Tables¶

```python

Read in chunks for memory efficiency¶

chunks = pd.read_sql( "SELECT * FROM large_table", conn, chunksize=10000 )

Process chunks¶

for chunk in chunks: process(chunk)

Or concatenate all chunks¶

df = pd.concat(chunks, ignore_index=True) ```

Setting Index from Column¶

```python

Use a column as the DataFrame index¶

df = pd.read_sql( "SELECT id, name, value FROM table", conn, index_col='id' ) ```

Parsing Dates¶

```python

Auto-parse date columns¶

df = pd.read_sql( "SELECT * FROM orders", conn, parse_dates=['order_date', 'ship_date'] )

With custom format¶

df = pd.read_sql( "SELECT * FROM orders", conn, parse_dates={'order_date': '%Y-%m-%d'} ) ```

Key Parameters for read_sql¶

Parameter	Description	Default
`sql`	SQL query or table name	Required
`con`	Database connection	Required
`index_col`	Column to use as index	None
`coerce_float`	Convert decimal to float	True
`params`	Query parameters	None
`parse_dates`	Columns to parse as dates	None
`chunksize`	Rows per chunk	None

pd.read_parquet - Columnar File Format¶

Parquet is a columnar storage format optimized for analytical queries.

Why Parquet?¶

Feature	CSV	Parquet
Compression	None/External	Built-in (snappy, gzip, zstd)
Schema	None	Embedded
Read speed	Row-by-row	Column-oriented
Partial read	No	Column selection
File size	Large	2-10x smaller
Type preservation	No	Yes

Basic Reading¶

```python

Read entire parquet file¶

df = pd.read_parquet('data.parquet')

Read from URL¶

df = pd.read_parquet('https://example.com/data.parquet')

Read from S3 (requires s3fs)¶

df = pd.read_parquet('s3://bucket/data.parquet') ```

Column Selection¶

One of Parquet's key advantages is reading only needed columns.

```python

Read only specific columns (much faster for wide tables)¶

df = pd.read_parquet('data.parquet', columns=['id', 'name', 'value'])

Compare performance¶

import time

start = time.time() df_all = pd.read_parquet('wide_table.parquet') # 100 columns print(f"All columns: {time.time() - start:.2f}s")

start = time.time() df_subset = pd.read_parquet('wide_table.parquet', columns=['col1', 'col2']) print(f"2 columns: {time.time() - start:.2f}s") ```

Reading with Filters (Row Groups)¶

```python

PyArrow filters (if using pyarrow engine)¶

This filters at read time, reducing memory usage¶

df = pd.read_parquet( 'data.parquet', filters=[ ('year', '>=', 2020), ('category', '==', 'A') ] ) ```

Partitioned Parquet Files¶

Parquet files can be partitioned by column values (like Hive-style partitioning).

data/ ├── year=2023/ │ ├── month=01/ │ │ └── data.parquet │ └── month=02/ │ └── data.parquet └── year=2024/ └── month=01/ └── data.parquet

```python

Read entire partitioned dataset¶

df = pd.read_parquet('data/')

Partition columns are added automatically¶

print(df.columns) # [...original columns..., 'year', 'month']

With filters on partition columns (very efficient)¶

df = pd.read_parquet( 'data/', filters=[('year', '==', 2024)] ) ```

Choosing Parquet Engine¶

```python

PyArrow (default, recommended)¶

df = pd.read_parquet('data.parquet', engine='pyarrow')

fastparquet (alternative)¶

df = pd.read_parquet('data.parquet', engine='fastparquet') ```

Writing Parquet Files¶

```python

Basic write¶

df.to_parquet('output.parquet')

With compression¶

df.to_parquet('output.parquet', compression='snappy') # Fast, moderate compression df.to_parquet('output.parquet', compression='gzip') # Slower, better compression df.to_parquet('output.parquet', compression='zstd') # Best balance

Partitioned write¶

df.to_parquet('output/', partition_cols=['year', 'month']) ```

Key Parameters for read_parquet¶

Parameter	Description	Default
`path`	File path or URL	Required
`engine`	'pyarrow' or 'fastparquet'	'auto'
`columns`	Columns to read	None (all)
`filters`	Row group filters	None
`use_nullable_dtypes`	Use nullable dtypes	False

Practical Examples¶

1. Database to DataFrame Workflow¶

```python from sqlalchemy import create_engine import pandas as pd

Connect to PostgreSQL¶

engine = create_engine('postgresql://user:pass@localhost/sales_db')

Load recent orders¶

query = """ SELECT o.order_id, o.order_date, c.customer_name, p.product_name, o.quantity, o.total_price FROM orders o JOIN customers c ON o.customer_id = c.id JOIN products p ON o.product_id = p.id WHERE o.order_date >= '2024-01-01' """

df = pd.read_sql(query, engine, parse_dates=['order_date'])

Analysis¶

daily_sales = df.groupby(df['order_date'].dt.date)['total_price'].sum() print(daily_sales) ```

2. Large Dataset with Parquet¶

```python

Read large dataset efficiently¶

df = pd.read_parquet( 'transactions.parquet', columns=['date', 'amount', 'category'], filters=[('date', '>=', '2024-01-01')] )

Aggregate¶

summary = df.groupby('category')['amount'].agg(['sum', 'mean', 'count']) print(summary) ```

3. ETL Pipeline: SQL to Parquet¶

```python

Extract from database¶

df = pd.read_sql( "SELECT * FROM large_table WHERE updated_at > :last_run", engine, params={'last_run': '2024-01-01'}, chunksize=100000 )

Transform and load to Parquet¶

for i, chunk in enumerate(df): # Transform chunk = chunk.dropna() chunk['processed_at'] = pd.Timestamp.now()

# Load to partitioned Parquet
chunk.to_parquet(
    f'output/batch_{i}.parquet',
    compression='snappy'
)

```

4. Financial Data Storage¶

```python import yfinance as yf

Get historical data¶

tickers = ['AAPL', 'GOOGL', 'MSFT', 'AMZN'] data = {}

for ticker in tickers: data[ticker] = yf.Ticker(ticker).history(period='5y')

Combine into panel format¶

df = pd.concat(data, names=['ticker', 'date']) df = df.reset_index()

Save as partitioned Parquet (efficient for queries by ticker)¶

df.to_parquet('stock_data/', partition_cols=['ticker'])

Read specific ticker efficiently¶

aapl = pd.read_parquet('stock_data/', filters=[('ticker', '==', 'AAPL')]) ```

Performance Comparison¶

```python import time import pandas as pd import numpy as np

Create sample data¶

n_rows = 1_000_000 df = pd.DataFrame({ 'id': range(n_rows), 'value': np.random.randn(n_rows), 'category': np.random.choice(['A', 'B', 'C'], n_rows), 'date': pd.date_range('2020-01-01', periods=n_rows, freq='s') })

Save in different formats¶

df.to_csv('data.csv', index=False) df.to_parquet('data.parquet')

Compare read times¶

start = time.time() df_csv = pd.read_csv('data.csv') csv_time = time.time() - start

start = time.time() df_parquet = pd.read_parquet('data.parquet') parquet_time = time.time() - start

print(f"CSV read time: {csv_time:.2f}s") print(f"Parquet read time: {parquet_time:.2f}s") print(f"Speedup: {csv_time/parquet_time:.1f}x")

File sizes¶

import os print(f"\nCSV size: {os.path.getsize('data.csv') / 1e6:.1f} MB") print(f"Parquet size: {os.path.getsize('data.parquet') / 1e6:.1f} MB") ```

Typical results:

Parquet is 5-10x faster to read than CSV
Parquet files are 2-10x smaller than CSV
Column selection in Parquet provides additional speedup

Common Pitfalls (SQL)¶

1. Database Connection Not Closed¶

```python

Bad: Connection stays open¶

df = pd.read_sql("SELECT * FROM table", conn)

Good: Use context manager¶

with engine.connect() as conn: df = pd.read_sql("SELECT * FROM table", conn)

Connection automatically closed¶

```

2. SQL Injection Vulnerability¶

```python

DANGEROUS: String formatting¶

user_input = "Sales'; DROP TABLE employees;--" query = f"SELECT * FROM emp WHERE dept = '{user_input}'"

SAFE: Use parameterized queries¶

query = "SELECT * FROM emp WHERE dept = :dept" df = pd.read_sql(query, conn, params={'dept': user_input}) ```

3. Parquet Type Preservation¶

```python

Categorical and nullable types may need explicit handling¶

df = pd.read_parquet('data.parquet')

Convert categorical back if needed¶

df['category'] = df['category'].astype('category') ```

Exercises¶

Exercise 1. Using SQLite and pd.read_sql, create an in-memory database with a table named products that has columns id, name, and price. Insert three rows, then read the table into a DataFrame using a parameterized query that selects only products with price > 10.

Solution to Exercise 1

Create an in-memory SQLite database, insert rows, and query with parameters.

import pandas as pd
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('''
    CREATE TABLE products (id INTEGER, name TEXT, price REAL)
''')
conn.executemany(
    'INSERT INTO products VALUES (?, ?, ?)',
    [(1, 'Widget', 9.99), (2, 'Gadget', 19.99), (3, 'Gizmo', 14.99)]
)
conn.commit()

query = "SELECT * FROM products WHERE price > ?"
df = pd.read_sql(query, conn, params=[10])
print(df)
conn.close()

Exercise 2. Create a DataFrame with columns 'date', 'ticker', and 'close'. Save it to a Parquet file. Then read back only the 'ticker' and 'close' columns (without loading 'date') and verify the result has exactly two columns.

Solution to Exercise 2

Use the columns parameter of read_parquet to select specific columns.

import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=5),
    'ticker': ['AAPL'] * 5,
    'close': [150.0, 151.0, 149.0, 152.0, 153.0]
})
df.to_parquet('stock_data.parquet', index=False)

df_subset = pd.read_parquet('stock_data.parquet', columns=['ticker', 'close'])
print(df_subset)
print(f"Number of columns: {len(df_subset.columns)}")  # 2

Exercise 3. Create a DataFrame with 10,000 rows. Save it as both CSV and Parquet. Read both files back and compare the read times using time.time(). Print the speedup factor.

Solution to Exercise 3

Time the read operations and compute the speedup ratio.

import pandas as pd
import numpy as np
import time

np.random.seed(42)
df = pd.DataFrame({
    'id': range(10000),
    'value': np.random.randn(10000),
    'category': np.random.choice(['A', 'B', 'C', 'D'], 10000)
})

df.to_csv('bench.csv', index=False)
df.to_parquet('bench.parquet')

start = time.time()
_ = pd.read_csv('bench.csv')
csv_time = time.time() - start

start = time.time()
_ = pd.read_parquet('bench.parquet')
parquet_time = time.time() - start

print(f"CSV read: {csv_time:.4f}s")
print(f"Parquet read: {parquet_time:.4f}s")
print(f"Speedup: {csv_time / parquet_time:.1f}x")