Protocols (TCP/IP, HTTP)¶
What is a Protocol?¶
A protocol is a set of rules for communication. It works like a shared language for computers: both sides must follow the same rules to understand each other.
Without Protocol:              With Protocol (HTTP):

  "gimme page"                   GET /page HTTP/1.1
  "here stuff"                   Host: example.com
  ???
                                 HTTP/1.1 200 OK
                                 Content-Type: text/html

                                 <html>...
TCP/IP Protocol Suite¶
The foundation of internet communication:
TCP/IP Layers
┌─────────────────────────────────────────────────┐
│ Application Layer │
│ HTTP, FTP, SMTP, DNS, SSH │
├─────────────────────────────────────────────────┤
│ Transport Layer │
│ TCP (reliable) / UDP (fast) │
├─────────────────────────────────────────────────┤
│ Internet Layer │
│ IP (addressing and routing) │
├─────────────────────────────────────────────────┤
│ Network Access Layer │
│ Ethernet, WiFi, physical transmission │
└─────────────────────────────────────────────────┘
IP: Internet Protocol¶
IP handles addressing and routing packets across networks.
IP Packet Structure¶
IP Packet:
┌──────────────────────────────────────────────────┐
│ IP Header │
│ ┌──────────┬──────────┬────────────────────┐ │
│ │ Version │ Length │ Type of Service │ │
│ ├──────────┴──────────┼────────────────────┤ │
│ │ Source IP Address │ Dest IP Address │ │
│ └─────────────────────┴────────────────────┘ │
├──────────────────────────────────────────────────┤
│ Data │
│ (TCP/UDP segment) │
└──────────────────────────────────────────────────┘
IP Characteristics¶
- Connectionless: Each packet independent
- Best-effort: No delivery guarantee
- Unreliable: Packets can be lost, duplicated, reordered
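The addressing side of IP can be explored directly with Python's standard-library `ipaddress` module, which validates and classifies addresses and CIDR networks:

```python
import ipaddress

# Parse and classify a single address
addr = ipaddress.ip_address("192.168.1.10")
print(addr.version)     # 4
print(addr.is_private)  # True (RFC 1918 range)

# Parse a network in CIDR notation
net = ipaddress.ip_network("192.168.1.0/24")
print(net.num_addresses)      # 256
print(addr in net)            # True
print(net.broadcast_address)  # 192.168.1.255
```

This is purely about addressing; routing and delivery are handled by the network stack and routers, not by application code.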
TCP: Transmission Control Protocol¶
TCP provides reliable, ordered delivery over unreliable IP.
TCP Features¶
| Feature | Description |
|---|---|
| Connection-oriented | Establishes connection before data |
| Reliable | Guarantees delivery (retransmits lost) |
| Ordered | Data arrives in sequence |
| Flow control | Prevents overwhelming receiver |
| Error checking | Checksums detect corruption |
TCP Three-Way Handshake¶
Connection Establishment:
Client Server
│ │
│ ──────── SYN ────────────────▶ │ 1. Client: "Want to connect"
│ │
│ ◀─────── SYN-ACK ───────────── │ 2. Server: "OK, I'm ready"
│ │
│ ──────── ACK ────────────────▶ │ 3. Client: "Let's go!"
│ │
│ ════ Connection Established ═══│
TCP Data Transfer¶
Reliable Delivery:
Client Server
│ ──── Data [Seq=1] ──────────▶ │
│ ◀─── ACK [Ack=2] ─────────── │ "Got it"
│ │
│ ──── Data [Seq=2] ──────────▶ │
│ (lost!) │
│ │
│ ...timeout... │
│ │
│ ──── Data [Seq=2] ──────────▶ │ Retransmit
│ ◀─── ACK [Ack=3] ─────────── │
Python TCP Socket¶
import socket

# TCP Client
def tcp_client(host, port, message):
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # SOCK_STREAM = TCP
    sock.connect((host, port))
    sock.sendall(message.encode())  # sendall retries until every byte is sent
    response = sock.recv(4096)
    sock.close()
    return response.decode()

# TCP Server
def tcp_server(host, port):
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, port))
    sock.listen(5)
    while True:
        client, addr = sock.accept()  # Blocks until a connection arrives
        data = client.recv(4096)
        client.sendall(b"Received: " + data)
        client.close()
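You can watch a client and server like the ones above talk to each other entirely on the loopback interface. A minimal sketch (the one-shot server helper and ephemeral-port trick are illustrative, not part of the snippet above):

```python
import socket
import threading

def echo_once(server_sock):
    """Accept one connection, echo the data back, then return."""
    client, addr = server_sock.accept()
    data = client.recv(4096)
    client.sendall(b"Received: " + data)
    client.close()

# Port 0 asks the OS for any free port; getsockname() reveals which one
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

t = threading.Thread(target=echo_once, args=(server,))
t.start()

# connect() is where the three-way handshake happens
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))
client.sendall(b"hello")
reply = client.recv(4096)
client.close()
t.join()
server.close()

print(reply)  # b'Received: hello'
```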
UDP: User Datagram Protocol¶
UDP provides fast, connectionless communication without guarantees.
UDP vs TCP¶
| Aspect | TCP | UDP |
|---|---|---|
| Connection | Required | None |
| Reliability | Guaranteed | Best-effort |
| Order | Preserved | Not guaranteed |
| Speed | Slower (overhead) | Faster |
| Use case | Web, email, files | Streaming, gaming, DNS |
Python UDP Socket¶
import socket

# UDP Client
def udp_client(host, port, message):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)  # SOCK_DGRAM = UDP
    sock.sendto(message.encode(), (host, port))
    response, addr = sock.recvfrom(4096)
    return response.decode()

# UDP Server
def udp_server(host, port):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    while True:
        data, addr = sock.recvfrom(4096)  # No accept() needed
        sock.sendto(b"Received: " + data, addr)
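The same loopback trick works for UDP, and the contrast with TCP is visible in the code: no `listen()`, no `accept()`, no `connect()`, just datagrams to an address. A sketch (helper names are illustrative):

```python
import socket
import threading

def udp_echo_once(server_sock):
    """Receive one datagram and echo it back to the sender."""
    data, addr = server_sock.recvfrom(4096)
    server_sock.sendto(b"Received: " + data, addr)

server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))  # no listen()/accept() for UDP
port = server.getsockname()[1]

t = threading.Thread(target=udp_echo_once, args=(server,))
t.start()

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.settimeout(5)  # UDP has no delivery guarantee, so never wait forever
client.sendto(b"ping", ("127.0.0.1", port))  # no connection setup
reply, _ = client.recvfrom(4096)
t.join()
client.close()
server.close()

print(reply)  # b'Received: ping'
```

On loopback the datagram is effectively never lost; over a real network this code would need the retry logic that TCP otherwise provides for free.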
HTTP: HyperText Transfer Protocol¶
HTTP is the application protocol for the web.
HTTP Request¶
GET /api/data?id=123 HTTP/1.1
Host: api.example.com
User-Agent: Python/3.10
Accept: application/json
Authorization: Bearer token123
[optional body for POST/PUT]
HTTP Response¶
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 43
Date: Mon, 01 Jan 2024 12:00:00 GMT
{"id": 123, "name": "Example", "value": 42}
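Because HTTP/1.1 messages are plain text, the response above can be pulled apart with ordinary string handling. A simplified sketch (real clients also handle chunked encoding, header continuations, and more):

```python
raw = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: application/json\r\n"
    "Content-Length: 43\r\n"
    "\r\n"
    '{"id": 123, "name": "Example", "value": 42}'
)

# An HTTP message is: status line, header lines, blank line, body
head, body = raw.split("\r\n\r\n", 1)
status_line, *header_lines = head.split("\r\n")
version, code, reason = status_line.split(" ", 2)
headers = dict(line.split(": ", 1) for line in header_lines)

print(code, reason)             # 200 OK
print(headers["Content-Type"])  # application/json
print(len(body) == int(headers["Content-Length"]))  # True
```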
HTTP Methods¶
| Method | Purpose | Body |
|---|---|---|
| GET | Retrieve resource | No |
| POST | Create resource | Yes |
| PUT | Update resource | Yes |
| DELETE | Remove resource | Optional |
| PATCH | Partial update | Yes |
| HEAD | Get headers only | No |
HTTP Status Codes¶
| Code | Meaning | Example |
|---|---|---|
| 2xx | Success | 200 OK, 201 Created |
| 3xx | Redirect | 301 Moved, 304 Not Modified |
| 4xx | Client Error | 400 Bad Request, 404 Not Found |
| 5xx | Server Error | 500 Internal, 503 Unavailable |
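The standard library exposes these codes as the `http.HTTPStatus` enum, which makes comparisons readable; the class (2xx, 3xx, ...) is just the hundreds digit (the small `status_class` helper here is illustrative):

```python
from http import HTTPStatus

print(int(HTTPStatus.NOT_FOUND))    # 404
print(HTTPStatus.NOT_FOUND.phrase)  # Not Found

def status_class(code: int) -> str:
    """Classify a numeric status code by its hundreds digit."""
    return {2: "success", 3: "redirect",
            4: "client error", 5: "server error"}.get(code // 100, "other")

print(status_class(201))  # success
print(status_class(503))  # server error
```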
Python HTTP Client¶
import requests

# GET request (always pass a timeout so a dead server can't hang you)
response = requests.get('https://api.example.com/data', timeout=10)
print(response.status_code)  # 200
print(response.json())       # {'key': 'value'}

# POST request
response = requests.post(
    'https://api.example.com/create',
    json={'name': 'test'},
    headers={'Authorization': 'Bearer token123'},
    timeout=10,
)

# Error handling
response = requests.get('https://api.example.com/data', timeout=10)
response.raise_for_status()  # Raises an exception for 4xx/5xx
Python HTTP Server¶
from http.server import HTTPServer, BaseHTTPRequestHandler
import json

class SimpleHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.end_headers()
        self.wfile.write(json.dumps({'status': 'ok'}).encode())

    def do_POST(self):
        content_length = int(self.headers['Content-Length'])
        body = self.rfile.read(content_length)
        data = json.loads(body)
        self.send_response(201)
        self.send_header('Content-Type', 'application/json')
        self.end_headers()
        self.wfile.write(json.dumps({'received': data}).encode())

# server = HTTPServer(('localhost', 8080), SimpleHandler)
# server.serve_forever()
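A server like this can be exercised in-process without blocking the interpreter: run it on an ephemeral port in a background thread and hit it with the standard-library `http.client`. A sketch (the GET-only handler and `log_message` override are for the demo):

```python
import http.client
import json
import threading
from http.server import HTTPServer, BaseHTTPRequestHandler

class DemoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.end_headers()
        self.wfile.write(json.dumps({'status': 'ok'}).encode())

    def log_message(self, *args):
        pass  # silence per-request logging for the demo

server = HTTPServer(('127.0.0.1', 0), DemoHandler)  # port 0: OS picks one
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection('127.0.0.1', port)
conn.request('GET', '/')
resp = conn.getresponse()
payload = json.loads(resp.read())
conn.close()
server.shutdown()

print(resp.status, payload)  # 200 {'status': 'ok'}
```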
HTTPS: Secure HTTP¶
HTTPS = HTTP + TLS encryption:
HTTPS Connection:
Client Server
│ │
│ ──── ClientHello ────────────▶ │ Supported ciphers
│ ◀─── ServerHello ───────────── │ Chosen cipher + certificate
│ │
│ [Certificate validation] │
│ │
│ ──── Key Exchange ───────────▶ │ Establish shared secret
│ ◀───────────────────────────── │
│ │
│ ═══ Encrypted HTTP Traffic ═══ │
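On the Python side, the TLS layer lives in the `ssl` module. `ssl.create_default_context()` returns settings that verify the server's certificate and hostname; this sketch only inspects those defaults (no network needed), with the actual wrapping step shown as comments:

```python
import ssl

ctx = ssl.create_default_context()

# Secure by default: hostname and certificate verification are on
print(ctx.check_hostname)                    # True
print(ctx.verify_mode == ssl.CERT_REQUIRED)  # True

# A client would then wrap a plain TCP socket before speaking HTTP:
#   with socket.create_connection((host, 443)) as sock:
#       with ctx.wrap_socket(sock, server_hostname=host) as tls:
#           tls.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n")
```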
Protocol Comparison¶
┌─────────────┬─────────────────────────────────────────┐
│ Protocol │ Characteristics │
├─────────────┼─────────────────────────────────────────┤
│ IP │ Addressing, routing, unreliable │
│ TCP │ Reliable, ordered, connection-oriented │
│ UDP │ Fast, unreliable, connectionless │
│ HTTP │ Web requests, text-based, stateless │
│ HTTPS │ HTTP + encryption │
│ WebSocket │ Full-duplex, persistent connection │
└─────────────┴─────────────────────────────────────────┘
Summary¶
| Protocol | Layer | Purpose |
|---|---|---|
| IP | Internet | Addressing and routing |
| TCP | Transport | Reliable delivery |
| UDP | Transport | Fast, unreliable delivery |
| HTTP | Application | Web communication |
| HTTPS | Application | Secure web communication |
Key points for Python:
- Use `socket` for low-level TCP/UDP communication
- Use `requests` for HTTP client operations
- Use `flask`/`fastapi` for HTTP servers
- TCP for reliability, UDP for speed
- HTTPS for any sensitive data
- Understand status codes for proper error handling
Runnable Example: web_scraping_api_tutorial.py¶
"""
Topic 15.2 - Web Scraping and API Data Collection Tutorial
Practical examples of making HTTP requests, parsing JSON responses,
handling pagination, writing results to CSV, and implementing retry logic.
These patterns appear constantly in real-world Python: collecting data from
REST APIs, scraping public web pages, and building data pipelines.
Inspired by common social media and mapping API scraping patterns,
modernized for Python 3 with current best practices.
Learning Objectives:
- Making HTTP requests with urllib.request (standard library)
- Parsing JSON API responses
- Writing structured data to CSV files
- Implementing pagination to collect all results
- Retry logic with exponential backoff
- Rate limiting and polite scraping
- Environment variable-based credential management
Prerequisites:
- ch01/io (File I/O, CSV, JSON basics)
- ch07/json (json module)
- ch07/regex (re module for text extraction)
- ch14/concepts (I/O-bound concurrency)
Author: Python Educator
Date: 2024
"""
import urllib.request
import urllib.parse  # used below to encode query strings
import urllib.error
import json
import csv
import time
import datetime
import re
import os
from typing import Any
# ============================================================================
# PART 1: BEGINNER - Making HTTP Requests
# ============================================================================
def demonstrate_basic_http_request():
"""
Show how to make a simple HTTP GET request using urllib (standard library).
urllib.request is built into Python — no pip install needed.
For production use, the third-party 'requests' library is more ergonomic,
but understanding urllib teaches you what happens under the hood.
"""
print("=" * 70)
print("BEGINNER: Basic HTTP Requests with urllib.request")
print("=" * 70)
# --- Simple GET request ---
# httpbin.org is a free HTTP testing service
url = "https://httpbin.org/get"
print(f"\n1. Simple GET request to: {url}")
print("-" * 50)
    try:
        # urllib.request.urlopen sends a GET request and returns a response
        with urllib.request.urlopen(url, timeout=10) as response:
            # response.status gives the HTTP status code
            print(f" Status code : {response.status}")
            # response.read() returns bytes; decode to get a string
            raw_bytes = response.read()
            body = raw_bytes.decode("utf-8")
            # The response is JSON, so parse it
            data = json.loads(body)
            print(f" Content type: {response.headers['Content-Type']}")
            print(f" Origin IP : {data.get('origin', 'N/A')}")
    except urllib.error.HTTPError as e:
        # HTTP errors: 404, 500, 403, etc.
        # Caught first: HTTPError is a subclass of URLError
        print(f" HTTP error {e.code}: {e.reason}")
    except urllib.error.URLError as e:
        # Network errors: DNS failure, connection refused, timeout, etc.
        print(f" Network error: {e.reason}")
# --- GET request with query parameters ---
print(f"\n2. GET request with query parameters")
print("-" * 50)
# Build URL with query parameters properly encoded
params = urllib.parse.urlencode({"name": "Python", "version": "3.12"})
url_with_params = f"https://httpbin.org/get?{params}"
print(f" URL: {url_with_params}")
try:
with urllib.request.urlopen(url_with_params, timeout=10) as response:
data = json.loads(response.read().decode("utf-8"))
print(f" Server received args: {data.get('args', {})}")
except (urllib.error.URLError, urllib.error.HTTPError) as e:
print(f" Request failed: {e}")
# --- Adding headers to a request ---
print(f"\n3. Custom headers (User-Agent, Accept)")
print("-" * 50)
# Some APIs require specific headers
req = urllib.request.Request(
"https://httpbin.org/headers",
headers={
"User-Agent": "PythonCourseScraper/1.0",
"Accept": "application/json",
},
)
try:
with urllib.request.urlopen(req, timeout=10) as response:
data = json.loads(response.read().decode("utf-8"))
headers_sent = data.get("headers", {})
print(f" User-Agent sent: {headers_sent.get('User-Agent')}")
print(f" Accept sent : {headers_sent.get('Accept')}")
except (urllib.error.URLError, urllib.error.HTTPError) as e:
print(f" Request failed: {e}")
print("\n" + "=" * 70 + "\n")
# ============================================================================
# PART 2: BEGINNER - Parsing JSON API Responses
# ============================================================================
def demonstrate_json_api_parsing():
"""
Show how to work with JSON data returned from a REST API.
Most modern web APIs return JSON. The pattern is always:
1. Make HTTP request
2. Read response body (bytes)
3. Decode to string
4. Parse with json.loads()
5. Navigate the resulting dict/list
"""
print("=" * 70)
print("BEGINNER: Parsing JSON API Responses")
print("=" * 70)
# --- Fetching and parsing JSON ---
print("\n1. Fetch and parse JSON from a public API")
print("-" * 50)
# JSONPlaceholder is a free fake REST API for testing
url = "https://jsonplaceholder.typicode.com/posts/1"
try:
with urllib.request.urlopen(url, timeout=10) as response:
raw = response.read().decode("utf-8")
post = json.loads(raw)
# post is now a regular Python dict
print(f" Type of parsed data: {type(post).__name__}")
print(f" Post ID : {post['id']}")
print(f" User ID : {post['userId']}")
print(f" Title : {post['title'][:50]}...")
print(f" Body (len): {len(post['body'])} chars")
except (urllib.error.URLError, urllib.error.HTTPError) as e:
print(f" Request failed: {e}")
# --- Working with a list of JSON objects ---
print(f"\n2. Fetch a list of JSON objects")
print("-" * 50)
url = "https://jsonplaceholder.typicode.com/posts?userId=1"
try:
with urllib.request.urlopen(url, timeout=10) as response:
posts = json.loads(response.read().decode("utf-8"))
print(f" Number of posts: {len(posts)}")
print(f" First post title: {posts[0]['title'][:40]}...")
print(f" Last post title : {posts[-1]['title'][:40]}...")
# Extract just titles using a list comprehension
titles = [p["title"] for p in posts]
print(f" All titles extracted: {len(titles)} items")
except (urllib.error.URLError, urllib.error.HTTPError) as e:
print(f" Request failed: {e}")
# --- Nested JSON structures ---
print(f"\n3. Navigating nested JSON")
print("-" * 50)
url = "https://jsonplaceholder.typicode.com/posts/1/comments"
try:
with urllib.request.urlopen(url, timeout=10) as response:
comments = json.loads(response.read().decode("utf-8"))
print(f" Comments on post 1: {len(comments)}")
for comment in comments[:3]:
# Safely access nested fields with .get()
name = comment.get("name", "Unknown")
email = comment.get("email", "N/A")
body_preview = comment.get("body", "")[:40]
print(f" - {name} ({email})")
print(f" {body_preview}...")
except (urllib.error.URLError, urllib.error.HTTPError) as e:
print(f" Request failed: {e}")
print("\n" + "=" * 70 + "\n")
# ============================================================================
# PART 3: INTERMEDIATE - Writing API Data to CSV
# ============================================================================
def demonstrate_api_to_csv():
"""
Complete pipeline: fetch data from an API and save to CSV.
This is one of the most common real-world Python tasks.
The pattern: API → JSON → process → CSV
"""
print("=" * 70)
print("INTERMEDIATE: API Data → CSV Pipeline")
print("=" * 70)
csv_path = "/tmp/api_posts.csv"
print(f"\n1. Fetching posts from JSONPlaceholder API...")
try:
url = "https://jsonplaceholder.typicode.com/posts"
with urllib.request.urlopen(url, timeout=10) as response:
posts = json.loads(response.read().decode("utf-8"))
print(f" Fetched {len(posts)} posts")
except (urllib.error.URLError, urllib.error.HTTPError) as e:
print(f" Request failed: {e}")
return
# --- Process and clean the data ---
print("\n2. Processing data...")
processed = []
for post in posts:
processed.append({
"post_id": post["id"],
"user_id": post["userId"],
"title": post["title"].strip(),
"body_length": len(post["body"]),
"word_count": len(post["body"].split()),
"fetched_at": datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
})
print(f" Processed {len(processed)} records")
print(f" Fields: {list(processed[0].keys())}")
# --- Write to CSV ---
print(f"\n3. Writing to CSV: {csv_path}")
fieldnames = ["post_id", "user_id", "title", "body_length",
"word_count", "fetched_at"]
with open(csv_path, "w", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(processed)
print(f" Wrote {len(processed)} rows")
# --- Verify the CSV ---
print(f"\n4. Verifying CSV (first 3 rows):")
with open(csv_path, "r", encoding="utf-8") as f:
reader = csv.DictReader(f)
for i, row in enumerate(reader):
if i >= 3:
break
print(f" Row {i}: id={row['post_id']}, "
f"words={row['word_count']}, "
f"title={row['title'][:30]}...")
print("\n" + "=" * 70 + "\n")
# ============================================================================
# PART 4: INTERMEDIATE - Retry Logic with Exponential Backoff
# ============================================================================
def request_with_retry(
url: str,
max_retries: int = 3,
initial_delay: float = 1.0,
backoff_factor: float = 2.0,
timeout: int = 10,
) -> dict[str, Any]:
"""
Make an HTTP GET request with retry logic and exponential backoff.
When scraping APIs, requests sometimes fail due to:
- Network glitches
- Server overload (HTTP 429 Too Many Requests)
- Temporary server errors (HTTP 500, 502, 503)
Exponential backoff: wait 1s, then 2s, then 4s, etc.
This avoids hammering a struggling server.
Args:
url: The URL to request
max_retries: Maximum number of retry attempts
initial_delay: Seconds to wait before first retry
backoff_factor: Multiply delay by this factor each retry
timeout: Request timeout in seconds
Returns:
Parsed JSON response as a dict
Raises:
Exception: If all retries are exhausted
"""
delay = initial_delay
for attempt in range(max_retries + 1):
try:
req = urllib.request.Request(
url,
headers={"User-Agent": "PythonCourseScraper/1.0"},
)
with urllib.request.urlopen(req, timeout=timeout) as response:
if response.status == 200:
return json.loads(response.read().decode("utf-8"))
except urllib.error.HTTPError as e:
# Retry on server errors and rate limits
if e.code in (429, 500, 502, 503) and attempt < max_retries:
print(f" HTTP {e.code} on attempt {attempt + 1}, "
f"retrying in {delay:.1f}s...")
time.sleep(delay)
delay *= backoff_factor
continue
raise
except urllib.error.URLError as e:
if attempt < max_retries:
print(f" Network error on attempt {attempt + 1}: {e.reason}, "
f"retrying in {delay:.1f}s...")
time.sleep(delay)
delay *= backoff_factor
continue
raise
    raise Exception(f"All {max_retries + 1} attempts failed for {url}")
def demonstrate_retry_logic():
"""
Show the retry pattern in action.
"""
print("=" * 70)
print("INTERMEDIATE: Retry Logic with Exponential Backoff")
print("=" * 70)
print("\n1. Successful request (no retries needed):")
print("-" * 50)
try:
data = request_with_retry("https://jsonplaceholder.typicode.com/posts/1")
print(f" Got post: '{data['title'][:40]}...'")
except Exception as e:
print(f" Failed: {e}")
print("\n2. Request to non-existent endpoint (will fail fast):")
print("-" * 50)
try:
data = request_with_retry(
"https://httpbin.org/status/404",
max_retries=1,
initial_delay=0.5,
)
except urllib.error.HTTPError as e:
print(f" Correctly failed with HTTP {e.code} (not retryable)")
except Exception as e:
print(f" Failed: {e}")
print("\n3. The exponential backoff pattern:")
print("-" * 50)
print(" Attempt 1: immediate")
print(" Attempt 2: wait 1.0s (initial_delay)")
print(" Attempt 3: wait 2.0s (1.0 × backoff_factor)")
print(" Attempt 4: wait 4.0s (2.0 × backoff_factor)")
print(" ...")
print(" This prevents overwhelming struggling servers.")
print("\n" + "=" * 70 + "\n")
# ============================================================================
# PART 5: INTERMEDIATE - Pagination
# ============================================================================
def demonstrate_pagination():
"""
Show how to handle paginated API responses.
Most APIs don't return all results at once. Instead, they paginate:
- Page-based: ?page=1&per_page=10
- Offset-based: ?offset=0&limit=10
- Cursor-based: ?cursor=abc123 (next page token)
You must loop through pages to collect all data.
"""
print("=" * 70)
print("INTERMEDIATE: Handling Paginated API Responses")
print("=" * 70)
# --- Page-based pagination ---
print("\n1. Page-based pagination (most common)")
print("-" * 50)
all_posts = []
page = 1
per_page = 10
print(f" Fetching posts, {per_page} per page...")
while True:
url = (f"https://jsonplaceholder.typicode.com/posts"
f"?_page={page}&_limit={per_page}")
try:
with urllib.request.urlopen(url, timeout=10) as response:
posts = json.loads(response.read().decode("utf-8"))
if not posts:
# Empty response means no more pages
break
all_posts.extend(posts)
print(f" Page {page}: got {len(posts)} posts "
f"(total so far: {len(all_posts)})")
if len(posts) < per_page:
# Partial page means this is the last one
break
page += 1
# Be polite: don't hammer the server
time.sleep(0.1)
except (urllib.error.URLError, urllib.error.HTTPError) as e:
print(f" Error on page {page}: {e}")
break
print(f"\n Total posts collected: {len(all_posts)}")
# --- Simulated cursor-based pagination ---
print(f"\n2. Cursor-based pagination (conceptual)")
print("-" * 50)
print(" Many APIs (Facebook, Twitter, Slack) use cursor pagination:")
print()
print(" response = fetch('/api/items?cursor=START')")
print(" while response['next_cursor']:")
print(" process(response['data'])")
print(" response = fetch(f'/api/items?cursor={response[\"next_cursor\"]}')")
print()
print(" Advantages over page-based:")
print(" - Consistent results even if data changes between requests")
print(" - More efficient for the server (no offset scanning)")
print(" - Server controls traversal order")
print("\n" + "=" * 70 + "\n")
# ============================================================================
# PART 6: ADVANCED - Rate Limiting and Polite Scraping
# ============================================================================
def demonstrate_rate_limiting():
"""
Show how to respect API rate limits and scrape politely.
Being a good API citizen means:
1. Respecting rate limits (usually in response headers)
2. Adding delays between requests
3. Using proper User-Agent strings
4. Caching responses when possible
5. Handling 429 (Too Many Requests) gracefully
"""
print("=" * 70)
print("ADVANCED: Rate Limiting and Polite Scraping")
print("=" * 70)
print("\n1. Reading rate limit headers:")
print("-" * 50)
try:
url = "https://api.github.com/rate_limit"
req = urllib.request.Request(
url,
headers={"User-Agent": "PythonCourseScraper/1.0"},
)
with urllib.request.urlopen(req, timeout=10) as response:
# Many APIs include rate limit info in headers
remaining = response.headers.get("X-RateLimit-Remaining", "N/A")
limit = response.headers.get("X-RateLimit-Limit", "N/A")
reset = response.headers.get("X-RateLimit-Reset", "N/A")
print(f" X-RateLimit-Limit : {limit} requests/hour")
print(f" X-RateLimit-Remaining : {remaining} left")
if reset != "N/A":
reset_time = datetime.datetime.fromtimestamp(int(reset))
print(f" X-RateLimit-Reset : {reset_time}")
except (urllib.error.URLError, urllib.error.HTTPError) as e:
print(f" Request failed: {e}")
# --- Rate limiter class ---
print(f"\n2. Simple rate limiter implementation:")
print("-" * 50)
class RateLimiter:
"""
Enforce a minimum delay between requests.
Usage:
limiter = RateLimiter(requests_per_second=2)
for url in urls:
limiter.wait()
fetch(url)
"""
def __init__(self, requests_per_second: float = 1.0):
self.min_interval = 1.0 / requests_per_second
self.last_request_time = 0.0
def wait(self):
"""Block until enough time has passed since the last request."""
elapsed = time.time() - self.last_request_time
if elapsed < self.min_interval:
sleep_time = self.min_interval - elapsed
time.sleep(sleep_time)
self.last_request_time = time.time()
# Demo the rate limiter
limiter = RateLimiter(requests_per_second=5) # Max 5 req/sec
print(f" RateLimiter(requests_per_second=5)")
print(f" Making 5 rapid requests...")
start = time.time()
for i in range(5):
limiter.wait()
# In real code: fetch(url) here
    elapsed = time.time() - start
    print(f" 5 requests took {elapsed:.2f}s (4 enforced gaps × 0.2s ≈ 0.8s minimum)")
# --- Best practices summary ---
print(f"\n3. Polite scraping best practices:")
print("-" * 50)
print(" - Set a descriptive User-Agent header")
print(" - Respect robots.txt (check before scraping)")
print(" - Add 0.5-2s delay between requests")
print(" - Honor Retry-After headers on 429 responses")
print(" - Cache responses to avoid redundant requests")
print(" - Use API keys when available (higher rate limits)")
print(" - Never hardcode credentials — use environment variables:")
print()
print(' api_key = os.environ.get("MY_API_KEY")')
print(' if not api_key:')
print(' raise ValueError("Set MY_API_KEY environment variable")')
print("\n" + "=" * 70 + "\n")
# ============================================================================
# PART 7: ADVANCED - Complete Scraping Pipeline
# ============================================================================
def demonstrate_complete_pipeline():
"""
A complete, production-style data collection pipeline.
Combines all patterns: requests, JSON parsing, pagination,
retry logic, rate limiting, CSV output, and error handling.
"""
print("=" * 70)
print("ADVANCED: Complete Data Collection Pipeline")
print("=" * 70)
csv_path = "/tmp/users_and_posts.csv"
# Step 1: Fetch users
print("\n--- Step 1: Fetch all users ---")
try:
with urllib.request.urlopen(
"https://jsonplaceholder.typicode.com/users", timeout=10
) as response:
users = json.loads(response.read().decode("utf-8"))
print(f" Fetched {len(users)} users")
except Exception as e:
print(f" Failed to fetch users: {e}")
return
# Build a user lookup dict for efficient joins
user_lookup = {u["id"]: u for u in users}
# Step 2: Fetch posts with pagination
print("\n--- Step 2: Fetch all posts (paginated) ---")
all_posts = []
page = 1
per_page = 20
while True:
url = (f"https://jsonplaceholder.typicode.com/posts"
f"?_page={page}&_limit={per_page}")
try:
data = request_with_retry(url, max_retries=2, initial_delay=0.5)
if not data:
break
all_posts.extend(data)
print(f" Page {page}: {len(data)} posts (total: {len(all_posts)})")
if len(data) < per_page:
break
page += 1
time.sleep(0.1) # Rate limiting
except Exception as e:
print(f" Error on page {page}: {e}")
break
# Step 3: Enrich posts with user data (join)
print(f"\n--- Step 3: Enrich posts with user info ---")
enriched = []
for post in all_posts:
user = user_lookup.get(post["userId"], {})
enriched.append({
"post_id": post["id"],
"title": post["title"].strip(),
"body_word_count": len(post["body"].split()),
"user_name": user.get("name", "Unknown"),
"user_email": user.get("email", "N/A"),
"user_company": user.get("company", {}).get("name", "N/A"),
"collected_at": datetime.datetime.now().isoformat(),
})
print(f" Enriched {len(enriched)} records")
# Step 4: Write to CSV
print(f"\n--- Step 4: Write to CSV ---")
fieldnames = ["post_id", "title", "body_word_count", "user_name",
"user_email", "user_company", "collected_at"]
with open(csv_path, "w", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(enriched)
print(f" Wrote {len(enriched)} rows to {csv_path}")
# Step 5: Summary statistics
print(f"\n--- Step 5: Summary ---")
total_words = sum(r["body_word_count"] for r in enriched)
users_with_posts = len({r["user_name"] for r in enriched})
avg_words = total_words / len(enriched) if enriched else 0
print(f" Total posts : {len(enriched)}")
print(f" Unique authors : {users_with_posts}")
print(f" Total words : {total_words}")
print(f" Avg words/post : {avg_words:.1f}")
print("\n" + "=" * 70 + "\n")
# ============================================================================
# PART 8: ADVANCED - Text Extraction with Regex
# ============================================================================
def demonstrate_regex_extraction():
"""
Show how regex is used in web scraping to extract structured data
from semi-structured text.
When APIs return text (comments, posts, bios), you often need
to extract specific patterns: hashtags, URLs, emails, mentions.
"""
print("=" * 70)
print("ADVANCED: Regex for Data Extraction in Scraping")
print("=" * 70)
# Simulate scraped social media text
sample_texts = [
"Loving the new #Python3 features! Check https://python.org @guido",
"Data science with #pandas and #numpy is amazing! 📊 contact@ds.org",
"Meeting at 2:30 PM EST. Join via https://meet.example.com/abc123",
"#MachineLearning model got 95.2% accuracy on the test set!",
]
# --- Extract hashtags ---
print("\n1. Extracting hashtags from text")
print("-" * 50)
hashtag_pattern = re.compile(r"#(\w+)")
for text in sample_texts:
hashtags = hashtag_pattern.findall(text)
if hashtags:
print(f" Text: {text[:50]}...")
print(f" Tags: {hashtags}")
print()
# --- Extract URLs ---
print("2. Extracting URLs from text")
print("-" * 50)
url_pattern = re.compile(r"https?://[^\s]+")
for text in sample_texts:
urls = url_pattern.findall(text)
if urls:
print(f" Text: {text[:50]}...")
print(f" URLs: {urls}")
print()
# --- Extract email addresses ---
print("3. Extracting email addresses")
print("-" * 50)
email_pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
for text in sample_texts:
emails = email_pattern.findall(text)
if emails:
print(f" Text: {text[:50]}...")
print(f" Emails: {emails}")
print()
# --- Extract numbers and percentages ---
print("4. Extracting numbers and percentages")
print("-" * 50)
pct_pattern = re.compile(r"(\d+\.?\d*)%")
for text in sample_texts:
percentages = pct_pattern.findall(text)
if percentages:
print(f" Text: {text[:50]}...")
print(f" Percentages: {[float(p) for p in percentages]}")
print()
print("=" * 70 + "\n")
# ============================================================================
# MAIN EXECUTION
# ============================================================================
def main():
"""Run all demonstrations."""
print("\n" + "=" * 70)
print(" " * 10 + "WEB SCRAPING AND API DATA COLLECTION")
print(" " * 15 + "Complete Tutorial")
print("=" * 70 + "\n")
# Beginner
demonstrate_basic_http_request()
demonstrate_json_api_parsing()
# Intermediate
demonstrate_api_to_csv()
demonstrate_retry_logic()
demonstrate_pagination()
# Advanced
demonstrate_rate_limiting()
demonstrate_complete_pipeline()
demonstrate_regex_extraction()
print("\n" + "=" * 70)
print("Tutorial Complete!")
print("=" * 70)
print("\nKey Takeaways:")
print("1. urllib.request for HTTP — no install needed")
print("2. json.loads() to parse API responses")
print("3. csv.DictWriter for structured output")
print("4. Always implement retry logic for production scrapers")
print("5. Respect rate limits — be a polite scraper")
print("6. Never hardcode API keys — use os.environ")
print("7. Pagination: loop until empty response or partial page")
print("8. Regex for extracting patterns from scraped text")
print("=" * 70 + "\n")
if __name__ == "__main__":
main()