Group Comparisons¶

Overview¶

Effective data visualization often requires comparing distributions, frequencies, or relationships across groups. This section covers the major visualization tools for group comparisons: scatter plots, line plots, bar plots, pie charts, pair plots, stem-and-leaf plots, dot plots, frequency tables, and mosaic plots.

1. Line Plots¶

Line plots connect data points in sequence, making them ideal for time series and trends.

Stock Price Example¶

import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt

def download_stock_prices(ticker, start='2023-01-01', end='2023-12-31'):
    return yf.download(ticker, start=start, end=end)

def display_stock_prices(data, ticker, ticker_name):
    data.index = data.index.tz_localize(None)
    fig, ax = plt.subplots(figsize=(12, 3))
    ax.plot(data['Close'], label=ticker_name, color='blue')

    date_to_mark = pd.to_datetime('2023-10-19').tz_localize(None)
    if date_to_mark in data.index:
        ax.plot([date_to_mark], [data.loc[date_to_mark, 'Close']],
                'or', label=f'{ticker_name} on {date_to_mark.date()}')

    ax.set_xlabel('Date')
    ax.set_ylabel(f'{ticker_name} Price (KRW)')
    ax.set_title(f'{ticker_name} Prices in 2023')
    ax.legend()
    plt.show()

ticker = '019170.KS'
data = download_stock_prices(ticker)
display_stock_prices(data, ticker, "Shinpoong")

2. Scatter Plots¶

Scatter plots display the relationship between two continuous variables. Matplotlib offers two methods with different capabilities.

`ax.plot` vs. `ax.scatter`¶

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

np.random.seed(0)
num_samples = 10
x = stats.norm().rvs(size=num_samples)
noise = 0.7 * stats.norm().rvs(size=num_samples)
y = 1 + 2 * x + noise

fig, (ax_plot, ax_scatter) = plt.subplots(1, 2, figsize=(12, 3))

point_sizes = 100 * stats.norm().rvs(size=num_samples) ** 2
color_values = stats.uniform().rvs(size=num_samples)

# ax.plot: Fixed marker properties
ax_plot.plot(x, y, 'o', markersize=10, mec="red", mfc="blue", mew=3)
ax_plot.set_title("Standard Plot\nFixed Marker Size")

# ax.scatter: Variable marker properties
ax_scatter.scatter(x, y, s=point_sizes, c=color_values)
ax_scatter.set_title("Scatter Plot\nVariable Marker Size")

for ax in (ax_plot, ax_scatter):
    ax.set_xticks([])
    ax.set_yticks([])
    for spine in ['left', 'right', 'top', 'bottom']:
        ax.spines[spine].set_visible(False)

plt.show()

Key differences: ax.plot uses uniform marker size and color—ideal for simple point displays. ax.scatter allows each point to have individual size and color, enabling visualization of additional data dimensions.

3. Bar Plots¶

Single Group Bar Plot¶

import matplotlib.pyplot as plt
import pandas as pd

data = {
    'Courses': ('Language', 'History', 'Geometry', 'Chemistry', 'Physics'),
    'Number of Teachers': (7, 3, 9, 1, 2)
}
df = pd.DataFrame(data).set_index('Courses')

fig, ax = plt.subplots(figsize=(12, 3))
ax.bar(x=range(len(df)), height=df["Number of Teachers"],
       tick_label=df.index, width=0.5)
ax.set_xlabel('Courses')
ax.set_ylabel('Number of Teachers')
ax.set_title("Favorite Courses of Teachers")
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.show()

Grouped Bar Plot¶

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

data = {
    'Student': ['Brandon', 'Vanessa', 'Daniel', 'Kevin', 'William'],
    'Midterm': [85, 60, 60, 65, 100],
    'Final': [90, 90, 65, 80, 95]
}
df = pd.DataFrame(data).set_index('Student')

positions = np.arange(len(df))
width = 0.3

fig, ax = plt.subplots(figsize=(12, 3))
ax.bar(positions - width / 2, df['Midterm'], width=width, label="Midterm")
ax.bar(positions + width / 2, df['Final'], width=width, label="Final")
ax.set_xticks(positions)
ax.set_xticklabels(df.index)
ax.set_xlabel("Student")
ax.set_ylabel("Scores")
ax.set_title("Midterm and Final Scores")
ax.legend(title="Exam Type")
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.show()

Segmented (Stacked) Bar Plot¶

Segmented bar plots show the composition of each category.

import matplotlib.pyplot as plt
import numpy as np

labels = ("Yes", "No")
counts = (np.array([95, 90, 40]), np.array([5, 10, 60]))
age_groups = ("Adults", "Children", "Infants")

fig, ax = plt.subplots(figsize=(6, 3))
bottom = np.zeros(3)

for label, count in zip(labels, counts):
    ax.bar(np.arange(3), count, width=0.5, bottom=bottom,
           tick_label=age_groups, label=label)
    bottom += count

ax.set_title("Has Antibodies?")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.legend(title="Response", loc="center left", bbox_to_anchor=(1.0, 0.5))
plt.tight_layout()
plt.show()

4. Pie Charts¶

Pie charts show proportions of a whole. They work best with a small number of categories.

import matplotlib.pyplot as plt

labels = 'Apples', 'Bananas', 'Cherries', 'Dates'
sizes = [215, 130, 245, 210]
colors = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue']
explode = (0.1, 0, 0, 0)

fig, ax = plt.subplots()
ax.pie(sizes, explode=explode, labels=labels, colors=colors,
       autopct='%1.1f%%', shadow=True, startangle=140,
       radius=1.5, counterclock=True)
ax.axis('equal')
ax.set_title('Fruit Distribution in Basket')
plt.show()

The autopct='%1.1f%%' format string displays percentages to one decimal place on each slice.

5. Pair Plots¶

Pair plots create a matrix of scatter plots for every pair of variables, with histograms on the diagonal. They are invaluable for multivariate exploration.

import seaborn as sns
import pandas as pd

url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url, index_col='PassengerId')
df['Sex_int'] = df['Sex'].apply(lambda x: 1 if x == 'male' else 0)

sns.pairplot(df[["Survived", "Age", "Sex_int"]])

6. Stem-and-Leaf Plots¶

Stem-and-leaf plots preserve individual data values while showing the distribution shape.

import stemgraphic

data = [65, 93, 45, 73, 99, 70, 88, 46, 75, 34, 83, 100, 88, 72, 70]
fig, ax = stemgraphic.stem_graphic(data, scale=10,
                                    title="Stem-and-Leaf Plot of Student Scores")

7. Dot Plots and Frequency Tables¶

Dot Plot¶

import matplotlib.pyplot as plt

data = [5, 7, 5, 9, 7, 7, 6, 9, 9, 9, 10, 12, 12, 7]

age_freq = {}
for age in data:
    age_freq[age] = age_freq.get(age, 0) + 1

fig, ax = plt.subplots(figsize=(12, 3))
for age, freq in age_freq.items():
    ax.plot([age] * freq, range(1, freq + 1), 'ok')

ax.set_xlabel('Ages')
ax.set_ylabel('Number of Students')
ax.set_title("Ages of Students in Class")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_yticks([0, 1, 2, 3, 4])
ax.spines["bottom"].set_position("zero")
plt.show()

8. Two-Way Frequency Tables¶

Frequency Table¶

import pandas as pd

data = {'SUV': 28*['yes'] + 35*['no'] + 97*['yes'] + 104*['no'],
        'Accident': 28*['yes'] + 35*['yes'] + 97*['no'] + 104*['no']}
df = pd.DataFrame(data)

dg = pd.crosstab(df.SUV, df.Accident, rownames=['SUV'], colnames=['Accident'])
dg.loc['TOTAL', :] = dg.sum()
dg.loc[:, 'TOTAL'] = dg.sum(axis=1)
dg = dg.astype(int)
print(dg)

Relative Frequency Table¶

dh = dg / dg.loc['TOTAL', 'TOTAL']
print(dh)

Connection to Probability¶

Two-way frequency tables directly relate to joint, marginal, and conditional distributions:

\[ \begin{array}{ll} \text{Chain rule:} & p(x, y) = p(x) \, p(y|x) \\ \text{Marginalization:} & p(x) = \sum_y p(x, y) \\ \text{Conditioning:} & p(y|x) = \frac{p(x, y)}{p(x)} \end{array} \]

9. Data Types and Appropriate Visualizations¶

\[ \text{Data} \begin{cases} \text{Categorical Data: Pie Chart, Bar Chart, Mosaic Plot, \ldots} \\ \text{Quantitative Data: Histogram, Box Plot, Stem Plot, Time Plot, \ldots} \end{cases} \]

Summary¶

Different group comparison tasks call for different visualization tools. Bar plots and pie charts work for categorical data; histograms, box plots, and violin plots reveal the shape of continuous distributions; scatter plots and pair plots expose bivariate relationships; and frequency tables bridge visualization with probability. Choosing the right tool depends on the data type, the number of groups, and the specific aspect of the comparison you want to emphasize.