Python and Jupyter Basics¶
Why Python for Statistics?¶
Python is a general-purpose programming language that has become one of the most widely used tools for data analysis, scientific computing, and machine learning. Its readability, extensive library ecosystem, and active community make it well suited to statistical work ranging from exploratory analysis to production-grade modeling.
Choosing a Python Distribution¶
For statistical and data-science work the Anaconda distribution is the recommended starting point.
| Feature | Anaconda | Standard Python |
|---|---|---|
| Pre-installed packages | Several hundred (NumPy, Pandas, Matplotlib, SciPy, …), with thousands more installable from the Anaconda repository | Standard library only |
| Package manager | conda + pip | pip only |
| Environment management | Built-in (conda env) | Requires venv or virtualenv |
| Jupyter Notebook | Included | Manual install |
| Platforms | Windows, macOS, Linux | Windows, macOS, Linux |
Installing Anaconda¶
- Download the installer from anaconda.com.
- Run the installer and (optionally) add Anaconda to your system PATH.
- Verify the installation:
conda --version
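The same check can be run from Python, which is handy inside a notebook. The following is a minimal sketch using only the standard library; it assumes nothing beyond `conda` being discoverable on `PATH` when it is installed, and it reports a message rather than failing when it is not.

```python
import shutil
import subprocess
import sys

# Report the interpreter that is running this code.
print("Python", sys.version.split()[0], "at", sys.executable)

# shutil.which returns the full path of an executable on PATH, or None.
conda_path = shutil.which("conda")
if conda_path is None:
    print("conda was not found on PATH")
else:
    # Equivalent to running `conda --version` in a terminal.
    result = subprocess.run(
        [conda_path, "--version"], capture_output=True, text=True
    )
    print(result.stdout.strip() or result.stderr.strip())
```

If the interpreter path printed on the first line does not point inside your Anaconda installation, the notebook or script is probably running under a different Python.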
Package Management¶
conda¶
conda is Anaconda's native package and environment manager.
# Install a package
conda install numpy
# Install a specific version
conda install pandas=2.1.0
# Update a package
conda update matplotlib
# List installed packages
conda list
pip¶
pip is the standard Python package installer and is useful for packages not available through conda.
# Install a package
pip install seaborn
# Install from a requirements file
pip install -r requirements.txt
Best Practice
Use conda for packages available in the Anaconda repository and fall back to pip for everything else. Mixing the two carelessly can cause dependency conflicts.
Virtual Environments¶
A virtual environment isolates a project's dependencies from the system-wide Python installation, preventing version conflicts across projects.
Creating and Managing Environments¶
# Create a new environment with a specific Python version
conda create --name stats_env python=3.11
# Activate the environment
conda activate stats_env
# Install packages inside the environment
conda install numpy pandas matplotlib scipy
# Deactivate when finished
conda deactivate
# List all environments
conda env list
# Remove an environment
conda env remove --name stats_env
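To confirm from Python which environment is currently active, one option is to read the `CONDA_DEFAULT_ENV` variable that conda sets on activation. This is a small sketch, not a complete diagnostic; the fallback message is an illustrative choice.

```python
import os
import sys

# conda sets CONDA_DEFAULT_ENV when an environment is activated;
# the variable is absent if no conda environment is active.
active_env = os.environ.get("CONDA_DEFAULT_ENV", "(no conda environment active)")

print("Active environment:", active_env)
# sys.prefix is the root directory of the running interpreter's installation.
print("Interpreter prefix:", sys.prefix)
```

After `conda activate stats_env`, the first line should show `stats_env` and the prefix should end in that environment's directory.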
Exporting and Reproducing Environments¶
# Export environment specification
conda env export > environment.yml
# Recreate environment from file
conda env create -f environment.yml
Integrated Development Environments¶
Jupyter Notebook¶
Jupyter Notebook is the primary tool used throughout this book. It provides an interactive, cell-based interface where code, output, and narrative text coexist in a single document.
Launching Jupyter:
jupyter notebook
This opens the Jupyter interface in your default web browser. From there, create a new notebook and select the Python kernel.
Key features:
- Execute code cells independently and see results inline.
- Mix Markdown cells for documentation with code cells for analysis.
- Render \(\LaTeX\) equations directly in Markdown cells.
- Export notebooks to HTML, PDF, or slides.
Useful keyboard shortcuts:
| Shortcut | Action |
|---|---|
| Shift + Enter | Run cell and move to next |
| Ctrl + Enter | Run cell in place |
| Esc + A | Insert cell above |
| Esc + B | Insert cell below |
| Esc + M | Convert cell to Markdown |
| Esc + Y | Convert cell to Code |
| Esc + D D | Delete cell |
Other IDEs¶
- Spyder — Ships with Anaconda; MATLAB-like layout with variable explorer, editor, and console panes. Well-suited for interactive scientific computing.
- PyCharm — Full-featured IDE with intelligent code completion, debugging, and project management. The Professional edition includes Jupyter support.
- VS Code — Lightweight editor with excellent Python and Jupyter extensions; a popular all-purpose choice.
Essential Python Refresher¶
The subsections below review core Python constructs that appear throughout the book.
Data Types and Variables¶
# Numeric types
x_int = 42 # int
x_float = 3.14 # float
x_bool = True # bool (subclass of int)
# Strings
name = "statistics"
# Type checking
print(type(x_float)) # <class 'float'>
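Two properties of these types matter in statistical code: booleans behave as 0 and 1 in arithmetic, and floats are binary approximations, so exact equality tests can fail. The sketch below illustrates both; the list `flags` is made-up data.

```python
import math

# bool is a subclass of int, so True counts as 1 and False as 0.
flags = [True, False, True, True]   # hypothetical indicator values
n_true = sum(flags)                 # number of True values
proportion = n_true / len(flags)    # sample proportion

# Floating-point sums are approximate; compare with a tolerance instead of ==.
exact_equal = (0.1 + 0.2 == 0.3)
close_enough = math.isclose(0.1 + 0.2, 0.3)

print(isinstance(True, int), n_true, proportion)  # True 3 0.75
print(exact_equal, close_enough)                  # False True
```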
Collections¶
# List — ordered, mutable
values = [1, 2, 3, 4, 5]
# Tuple — ordered, immutable
point = (3.0, 4.0)
# Dictionary — key-value pairs
params = {"mu": 0.0, "sigma": 1.0}
# Set — unordered, unique elements
unique = {1, 2, 3, 3, 2} # {1, 2, 3}
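As a rough illustration of how these collections work together, the sketch below tallies a small categorical sample. The data are invented, and `collections.Counter` (a dictionary subclass from the standard library) is brought in here for convenience rather than taken from the text above.

```python
from collections import Counter

observations = ["a", "b", "a", "c", "a", "b"]  # hypothetical categorical sample

# Counter is a dict subclass that maps each item to its frequency.
counts = Counter(observations)

# most_common(1) returns a list with one (item, count) tuple; unpack it.
mode, mode_count = counts.most_common(1)[0]

# Dictionary comprehension: relative frequency of each category.
rel_freq = {k: c / len(observations) for k, c in counts.items()}

# A set keeps only the distinct categories.
categories = set(observations)

print(mode, mode_count)    # a 3
print(sorted(categories))  # ['a', 'b', 'c']
```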
Control Flow¶
# Conditional
x = 42  # example value to test
if x > 0:
    print("positive")
elif x == 0:
    print("zero")
else:
    print("negative")
# For loop
total = 0
for v in values:
    total += v
# List comprehension
squares = [v ** 2 for v in values]
# While loop
n = 10
while n > 0:
    n -= 1
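The constructs above are usually combined. The following sketch, on made-up data, tallies observations by sign with a loop and conditional, filters with a comprehension, and uses a `while` loop to shrink a step size below a tolerance. The names `data`, `sign_counts`, and `step` are illustrative only.

```python
data = [2.5, -1.0, 0.0, 3.2, -0.7]  # hypothetical sample

# Loop + conditional: tally observations by sign.
sign_counts = {"positive": 0, "zero": 0, "negative": 0}
for v in data:
    if v > 0:
        sign_counts["positive"] += 1
    elif v == 0:
        sign_counts["zero"] += 1
    else:
        sign_counts["negative"] += 1

# Comprehension with a filter: keep only the positive values.
positives = [v for v in data if v > 0]

# While loop: halve a step size until it drops below a tolerance.
step, n_halvings = 1.0, 0
while step >= 1e-3:
    step /= 2
    n_halvings += 1

print(sign_counts)  # {'positive': 2, 'zero': 1, 'negative': 2}
print(positives)    # [2.5, 3.2]
print(n_halvings)   # 10
```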
Functions¶
def sample_mean(data):
    """Return the arithmetic mean of a list of numbers."""
    return sum(data) / len(data)

# Lambda (anonymous) function
square = lambda x: x ** 2
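Functions can also take default arguments and validate their input. As a sketch in the same spirit as `sample_mean`, the hypothetical `sample_variance` below accepts a `ddof` parameter (a name borrowed from NumPy's convention) and is compared against the standard library's `statistics.variance`. The data are invented.

```python
import statistics

def sample_variance(data, ddof=1):
    """Return the variance of `data`, dividing by len(data) - ddof."""
    n = len(data)
    if n - ddof <= 0:
        raise ValueError("need more than `ddof` observations")
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - ddof)

scores = [2.0, 4.0, 4.0, 5.0, 7.0]      # hypothetical data

print(sample_variance(scores))          # divides by n - 1
print(sample_variance(scores, ddof=0))  # divides by n
print(statistics.variance(scores))      # standard-library reference value

# A lambda is convenient as a short key function:
# find the observation farthest from the mean.
center = sum(scores) / len(scores)
farthest = max(scores, key=lambda x: abs(x - center))
print(farthest)                         # 7.0
```

Default arguments keep the common case short (`sample_variance(scores)`) while still allowing the population variance with `ddof=0`.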
Importing Libraries¶
# Standard import
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Import specific objects
from scipy.stats import norm
Installing the Libraries Used in This Book¶
Run the following once inside your environment to install all core dependencies:
# From a notebook cell (%pip installs into the environment of the running kernel)
%pip install numpy pandas matplotlib seaborn scipy scikit-learn statsmodels
Or from the terminal:
conda install numpy pandas matplotlib seaborn scipy scikit-learn statsmodels
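After installing, it can be useful to confirm what actually landed in the active environment. The sketch below uses the standard-library `importlib.metadata` module (Python 3.8+) to look up each package's installed version. The list `required` simply mirrors the install command above.

```python
from importlib import metadata

# Distribution names as passed to pip/conda
# (note that scikit-learn is imported as `sklearn`).
required = [
    "numpy", "pandas", "matplotlib", "seaborn",
    "scipy", "scikit-learn", "statsmodels",
]

installed, missing = {}, []
for name in required:
    try:
        installed[name] = metadata.version(name)
    except metadata.PackageNotFoundError:
        missing.append(name)

# Print a small report: one line per installed package, then any gaps.
for name, version in installed.items():
    print(f"{name:<14}{version}")
if missing:
    print("Not installed:", ", ".join(missing))
```

Any package listed as not installed can be added with the commands shown above.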
Recommended Directory Structure¶
Keeping your projects organized makes reproducibility straightforward:
project/
├── data/
│ ├── raw/
│ └── processed/
├── notebooks/
│ ├── 01_exploration.ipynb
│ └── 02_modeling.ipynb
├── src/
│ └── utils.py
├── environment.yml
└── README.md
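One way to set up this layout is a short script. The sketch below uses `pathlib` to create the directories and empty placeholder files; the helper name `create_project_skeleton` is hypothetical, and the notebooks themselves are left out because an `.ipynb` file must contain valid notebook JSON and is better created from Jupyter.

```python
import tempfile
from pathlib import Path

def create_project_skeleton(root):
    """Create the directories and empty placeholder files of the layout above."""
    root = Path(root)
    for sub in ("data/raw", "data/processed", "notebooks", "src"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    for file in ("src/utils.py", "environment.yml", "README.md"):
        (root / file).touch(exist_ok=True)
    return root

# Demonstration in a temporary directory, so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    project = create_project_skeleton(Path(tmp) / "project")
    for path in sorted(project.rglob("*")):
        print(path.relative_to(project).as_posix())
```

To create a real project, call `create_project_skeleton("project")` from the directory where the project should live. Because `exist_ok=True` is used throughout, rerunning the function does not overwrite existing files.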
Summary¶
| Concept | Key Takeaway |
|---|---|
| Distribution | Use Anaconda for a batteries-included setup |
| Package manager | Prefer conda; use pip as fallback |
| Environments | Always isolate projects with virtual environments |
| IDE | Jupyter Notebook for interactive analysis; Spyder / VS Code for scripts |
| Organization | Maintain a clean directory structure and export environment.yml |