Skip to content

Reading SQL and Parquet Files

pandas supports reading data from SQL databases and Parquet files, which are common in enterprise and big data environments.

Mental Model

read_sql treats a database query result as a DataFrame -- the SQL engine does the heavy filtering, and pandas receives the final table. read_parquet loads columnar binary files that preserve dtypes exactly, making round-trips lossless and fast. Both skip the parsing overhead of text formats like CSV.

pd.read_sql - Database Connectivity

Basic Setup

```python import pandas as pd import sqlite3

Create a SQLite connection (example)

conn = sqlite3.connect('database.db')

For other databases, use appropriate drivers:

PostgreSQL: psycopg2

MySQL: pymysql or mysql-connector-python

SQL Server: pyodbc

Oracle: cx_Oracle

```

Reading with SQL Query

```python

Execute a SQL query and load results into DataFrame

query = "SELECT * FROM employees WHERE department = 'Sales'" df = pd.read_sql(query, conn) print(df) ```

Reading Entire Table

```python

Load an entire table

df = pd.read_sql_table('employees', conn)

Or using read_sql with table name

df = pd.read_sql('employees', conn) ```

Connection Strings

```python

SQLAlchemy connection string format (recommended)

from sqlalchemy import create_engine

SQLite

engine = create_engine('sqlite:///database.db')

PostgreSQL

engine = create_engine('postgresql://user:password@localhost:5432/dbname')

MySQL

engine = create_engine('mysql+pymysql://user:password@localhost:3306/dbname')

SQL Server

engine = create_engine('mssql+pyodbc://user:password@server/dbname?driver=ODBC+Driver+17+for+SQL+Server')

Read using engine

df = pd.read_sql("SELECT * FROM table_name", engine) ```

Parameterized Queries

```python

Safe parameter passing (prevents SQL injection)

query = "SELECT * FROM employees WHERE salary > :min_salary" df = pd.read_sql(query, conn, params={'min_salary': 50000})

With SQLite positional parameters

query = "SELECT * FROM employees WHERE department = ?" df = pd.read_sql(query, conn, params=['Sales']) ```

Chunked Reading for Large Tables

```python

Read in chunks for memory efficiency

chunks = pd.read_sql( "SELECT * FROM large_table", conn, chunksize=10000 )

Process chunks

for chunk in chunks: process(chunk)

Or concatenate all chunks

df = pd.concat(chunks, ignore_index=True) ```

Setting Index from Column

```python

Use a column as the DataFrame index

df = pd.read_sql( "SELECT id, name, value FROM table", conn, index_col='id' ) ```

Parsing Dates

```python

Auto-parse date columns

df = pd.read_sql( "SELECT * FROM orders", conn, parse_dates=['order_date', 'ship_date'] )

With custom format

df = pd.read_sql( "SELECT * FROM orders", conn, parse_dates={'order_date': '%Y-%m-%d'} ) ```

Key Parameters for read_sql

Parameter Description Default
sql SQL query or table name Required
con Database connection Required
index_col Column to use as index None
coerce_float Convert decimal to float True
params Query parameters None
parse_dates Columns to parse as dates None
chunksize Rows per chunk None

pd.read_parquet - Columnar File Format

Parquet is a columnar storage format optimized for analytical queries.

Why Parquet?

Feature CSV Parquet
Compression None/External Built-in (snappy, gzip, zstd)
Schema None Embedded
Read speed Row-by-row Column-oriented
Partial read No Column selection
File size Large 2-10x smaller
Type preservation No Yes

Basic Reading

```python

Read entire parquet file

df = pd.read_parquet('data.parquet')

Read from URL

df = pd.read_parquet('https://example.com/data.parquet')

Read from S3 (requires s3fs)

df = pd.read_parquet('s3://bucket/data.parquet') ```

Column Selection

One of Parquet's key advantages is reading only needed columns.

```python

Read only specific columns (much faster for wide tables)

df = pd.read_parquet('data.parquet', columns=['id', 'name', 'value'])

Compare performance

import time

start = time.time() df_all = pd.read_parquet('wide_table.parquet') # 100 columns print(f"All columns: {time.time() - start:.2f}s")

start = time.time() df_subset = pd.read_parquet('wide_table.parquet', columns=['col1', 'col2']) print(f"2 columns: {time.time() - start:.2f}s") ```

Reading with Filters (Row Groups)

```python

PyArrow filters (if using pyarrow engine)

This filters at read time, reducing memory usage

df = pd.read_parquet( 'data.parquet', filters=[ ('year', '>=', 2020), ('category', '==', 'A') ] ) ```

Partitioned Parquet Files

Parquet files can be partitioned by column values (like Hive-style partitioning).

data/ ├── year=2023/ │ ├── month=01/ │ │ └── data.parquet │ └── month=02/ │ └── data.parquet └── year=2024/ └── month=01/ └── data.parquet

```python

Read entire partitioned dataset

df = pd.read_parquet('data/')

Partition columns are added automatically

print(df.columns) # [...original columns..., 'year', 'month']

With filters on partition columns (very efficient)

df = pd.read_parquet( 'data/', filters=[('year', '==', 2024)] ) ```

Choosing Parquet Engine

```python

PyArrow (default, recommended)

df = pd.read_parquet('data.parquet', engine='pyarrow')

fastparquet (alternative)

df = pd.read_parquet('data.parquet', engine='fastparquet') ```

Writing Parquet Files

```python

Basic write

df.to_parquet('output.parquet')

With compression

df.to_parquet('output.parquet', compression='snappy') # Fast, moderate compression df.to_parquet('output.parquet', compression='gzip') # Slower, better compression df.to_parquet('output.parquet', compression='zstd') # Best balance

Partitioned write

df.to_parquet('output/', partition_cols=['year', 'month']) ```

Key Parameters for read_parquet

Parameter Description Default
path File path or URL Required
engine 'pyarrow' or 'fastparquet' 'auto'
columns Columns to read None (all)
filters Row group filters None
use_nullable_dtypes Use nullable dtypes False

Practical Examples

1. Database to DataFrame Workflow

```python from sqlalchemy import create_engine import pandas as pd

Connect to PostgreSQL

engine = create_engine('postgresql://user:pass@localhost/sales_db')

Load recent orders

query = """ SELECT o.order_id, o.order_date, c.customer_name, p.product_name, o.quantity, o.total_price FROM orders o JOIN customers c ON o.customer_id = c.id JOIN products p ON o.product_id = p.id WHERE o.order_date >= '2024-01-01' """

df = pd.read_sql(query, engine, parse_dates=['order_date'])

Analysis

daily_sales = df.groupby(df['order_date'].dt.date)['total_price'].sum() print(daily_sales) ```

2. Large Dataset with Parquet

```python

Read large dataset efficiently

df = pd.read_parquet( 'transactions.parquet', columns=['date', 'amount', 'category'], filters=[('date', '>=', '2024-01-01')] )

Aggregate

summary = df.groupby('category')['amount'].agg(['sum', 'mean', 'count']) print(summary) ```

3. ETL Pipeline: SQL to Parquet

```python

Extract from database

df = pd.read_sql( "SELECT * FROM large_table WHERE updated_at > :last_run", engine, params={'last_run': '2024-01-01'}, chunksize=100000 )

Transform and load to Parquet

for i, chunk in enumerate(df): # Transform chunk = chunk.dropna() chunk['processed_at'] = pd.Timestamp.now()

# Load to partitioned Parquet
chunk.to_parquet(
    f'output/batch_{i}.parquet',
    compression='snappy'
)

```

4. Financial Data Storage

```python import yfinance as yf

Get historical data

tickers = ['AAPL', 'GOOGL', 'MSFT', 'AMZN'] data = {}

for ticker in tickers: data[ticker] = yf.Ticker(ticker).history(period='5y')

Combine into panel format

df = pd.concat(data, names=['ticker', 'date']) df = df.reset_index()

Save as partitioned Parquet (efficient for queries by ticker)

df.to_parquet('stock_data/', partition_cols=['ticker'])

Read specific ticker efficiently

aapl = pd.read_parquet('stock_data/', filters=[('ticker', '==', 'AAPL')]) ```

Performance Comparison

```python import time import pandas as pd import numpy as np

Create sample data

n_rows = 1_000_000 df = pd.DataFrame({ 'id': range(n_rows), 'value': np.random.randn(n_rows), 'category': np.random.choice(['A', 'B', 'C'], n_rows), 'date': pd.date_range('2020-01-01', periods=n_rows, freq='s') })

Save in different formats

df.to_csv('data.csv', index=False) df.to_parquet('data.parquet')

Compare read times

start = time.time() df_csv = pd.read_csv('data.csv') csv_time = time.time() - start

start = time.time() df_parquet = pd.read_parquet('data.parquet') parquet_time = time.time() - start

print(f"CSV read time: {csv_time:.2f}s") print(f"Parquet read time: {parquet_time:.2f}s") print(f"Speedup: {csv_time/parquet_time:.1f}x")

File sizes

import os print(f"\nCSV size: {os.path.getsize('data.csv') / 1e6:.1f} MB") print(f"Parquet size: {os.path.getsize('data.parquet') / 1e6:.1f} MB") ```

Typical results:

  • Parquet is 5-10x faster to read than CSV
  • Parquet files are 2-10x smaller than CSV
  • Column selection in Parquet provides additional speedup

Common Pitfalls (SQL)

1. Database Connection Not Closed

```python

Bad: Connection stays open

df = pd.read_sql("SELECT * FROM table", conn)

Good: Use context manager

with engine.connect() as conn: df = pd.read_sql("SELECT * FROM table", conn)

Connection automatically closed

```

2. SQL Injection Vulnerability

```python

DANGEROUS: String formatting

user_input = "Sales'; DROP TABLE employees;--" query = f"SELECT * FROM emp WHERE dept = '{user_input}'"

SAFE: Use parameterized queries

query = "SELECT * FROM emp WHERE dept = :dept" df = pd.read_sql(query, conn, params={'dept': user_input}) ```

3. Parquet Type Preservation

```python

Categorical and nullable types may need explicit handling

df = pd.read_parquet('data.parquet')

Convert categorical back if needed

df['category'] = df['category'].astype('category') ```


Exercises

Exercise 1. Using SQLite and pd.read_sql, create an in-memory database with a table named products that has columns id, name, and price. Insert three rows, then read the table into a DataFrame using a parameterized query that selects only products with price > 10.

Solution to Exercise 1

Create an in-memory SQLite database, insert rows, and query with parameters.

import pandas as pd
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('''
    CREATE TABLE products (id INTEGER, name TEXT, price REAL)
''')
conn.executemany(
    'INSERT INTO products VALUES (?, ?, ?)',
    [(1, 'Widget', 9.99), (2, 'Gadget', 19.99), (3, 'Gizmo', 14.99)]
)
conn.commit()

query = "SELECT * FROM products WHERE price > ?"
df = pd.read_sql(query, conn, params=[10])
print(df)
conn.close()

Exercise 2. Create a DataFrame with columns 'date', 'ticker', and 'close'. Save it to a Parquet file. Then read back only the 'ticker' and 'close' columns (without loading 'date') and verify the result has exactly two columns.

Solution to Exercise 2

Use the columns parameter of read_parquet to select specific columns.

import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=5),
    'ticker': ['AAPL'] * 5,
    'close': [150.0, 151.0, 149.0, 152.0, 153.0]
})
df.to_parquet('stock_data.parquet', index=False)

df_subset = pd.read_parquet('stock_data.parquet', columns=['ticker', 'close'])
print(df_subset)
print(f"Number of columns: {len(df_subset.columns)}")  # 2

Exercise 3. Create a DataFrame with 10,000 rows. Save it as both CSV and Parquet. Read both files back and compare the read times using time.time(). Print the speedup factor.

Solution to Exercise 3

Time the read operations and compute the speedup ratio.

import pandas as pd
import numpy as np
import time

np.random.seed(42)
df = pd.DataFrame({
    'id': range(10000),
    'value': np.random.randn(10000),
    'category': np.random.choice(['A', 'B', 'C', 'D'], 10000)
})

df.to_csv('bench.csv', index=False)
df.to_parquet('bench.parquet')

start = time.time()
_ = pd.read_csv('bench.csv')
csv_time = time.time() - start

start = time.time()
_ = pd.read_parquet('bench.parquet')
parquet_time = time.time() - start

print(f"CSV read: {csv_time:.4f}s")
print(f"Parquet read: {parquet_time:.4f}s")
print(f"Speedup: {csv_time / parquet_time:.1f}x")