Value Counts and Unique¶
Methods for counting and identifying unique values are essential for data exploration and analysis. This document covers value_counts(), nunique(), and related methods.
Mental Model
value_counts() is a frequency table: it counts how many times each unique value appears and sorts by frequency. nunique() returns just the count of distinct values. Together they answer the two most basic questions about a categorical column: "what values exist?" and "how often does each appear?"
value_counts()¶
Returns a Series containing counts of unique values, sorted by frequency.
Basic Usage¶
```python import pandas as pd
s = pd.Series(['a', 'b', 'a', 'c', 'a', 'b']) print(s.value_counts()) ```
a 3
b 2
c 1
Name: count, dtype: int64
On DataFrame Columns¶
```python url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv" df = pd.read_csv(url)
Count passengers by class¶
print(df['Pclass'].value_counts()) ```
3 491
1 216
2 184
Name: count, dtype: int64
Parameters¶
normalize¶
Return proportions instead of counts.
python
print(df['Pclass'].value_counts(normalize=True))
3 0.551066
1 0.242424
2 0.206510
Name: proportion, dtype: float64
sort¶
Control sorting behavior.
```python
Sort by index instead of count¶
print(df['Pclass'].value_counts(sort=False)) ```
1 216
2 184
3 491
Name: count, dtype: int64
ascending¶
```python
Least frequent first¶
print(df['Pclass'].value_counts(ascending=True)) ```
2 184
1 216
3 491
Name: count, dtype: int64
bins¶
Group numeric data into bins.
```python
Bin ages into groups¶
print(df['Age'].value_counts(bins=5)) ```
(16.336, 32.252] 346
(32.252, 48.168] 188
(0.339, 16.336] 100
(48.168, 64.084] 69
(64.084, 80.0] 11
Name: count, dtype: int64
dropna¶
Control whether to count NaN values.
```python s = pd.Series([1, 2, 2, None, None, None])
Default: exclude NaN¶
print(s.value_counts()) ```
2.0 2
1.0 1
Name: count, dtype: int64
```python
Include NaN in counts¶
print(s.value_counts(dropna=False)) ```
NaN 3
2.0 2
1.0 1
Name: count, dtype: int64
Practical Example: Manager Direct Reports¶
From LeetCode 570: Find managers with at least 5 direct reports.
```python employee = pd.DataFrame({ 'id': [1, 2, 3, 4, 5, 6, 7, 8, 9], 'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Helen', 'Ian'], 'managerId': [2, None, 2, 2, 3, 2, 2, 2, 2] })
Count direct reports per manager¶
manager_counts = employee['managerId'].value_counts() print(manager_counts) ```
2.0 7
3.0 1
Name: count, dtype: int64
```python
Find managers with >= 5 direct reports¶
managers_with_5_plus = manager_counts[manager_counts >= 5].index print(managers_with_5_plus) # Index([2.0], dtype='float64') ```
nunique()¶
Returns the number of unique values, excluding NaN by default.
Basic Usage¶
python
s = pd.Series([1, 2, 2, 3, 3, 3])
print(s.nunique()) # 3
On DataFrame¶
```python
Unique values per column¶
print(df.nunique()) ```
PassengerId 891
Survived 2
Pclass 3
Name 891
Sex 2
Age 88
...
Including NaN¶
```python s = pd.Series([1, 2, None, None])
print(s.nunique()) # 2 (excludes NaN) print(s.nunique(dropna=False)) # 3 (includes NaN as unique) ```
Practical Example: Counting Unique Players¶
From LeetCode 550: Count unique players in activity data.
```python activity = pd.DataFrame({ 'player_id': [1, 2, 1, 3, 2], 'event_date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05'] })
Count unique players¶
total_players = activity['player_id'].nunique() print(f"Total unique players: {total_players}") # 3 ```
unique()¶
Returns an array of unique values (not counts).
python
s = pd.Series([3, 1, 2, 3, 1])
print(s.unique()) # array([3, 1, 2])
Note: Unlike value_counts(), unique() returns values in order of first appearance, not sorted.
Sorted Unique Values¶
```python
For sorted unique values¶
print(sorted(s.unique())) # [1, 2, 3]
Or use numpy¶
import numpy as np print(np.sort(s.unique())) # array([1, 2, 3]) ```
duplicated()¶
Returns a boolean Series indicating duplicate rows.
python
s = pd.Series([1, 2, 2, 3, 3, 3])
print(s.duplicated())
0 False
1 False
2 True
3 False
4 True
5 True
dtype: bool
Parameters¶
keep¶
'first'(default): Mark duplicates except for the first occurrence'last': Mark duplicates except for the last occurrenceFalse: Mark all duplicates as True
```python s = pd.Series([1, 2, 2, 3])
print(s.duplicated(keep='first')) # [False, False, True, False] print(s.duplicated(keep='last')) # [False, True, False, False] print(s.duplicated(keep=False)) # [False, True, True, False] ```
DataFrame value_counts()¶
Count unique combinations of values across multiple columns.
```python df = pd.DataFrame({ 'A': ['foo', 'foo', 'foo', 'bar', 'bar'], 'B': ['one', 'one', 'two', 'two', 'one'] })
print(df.value_counts()) ```
A B
foo one 2
two 1
bar two 1
one 1
Name: count, dtype: int64
Reset Index for DataFrame Output¶
```python
Convert to DataFrame¶
counts_df = df.value_counts().reset_index() print(counts_df) ```
A B count
0 foo one 2
1 foo two 1
2 bar two 1
3 bar one 1
Practical Example: Finding Duplicate Emails¶
From LeetCode 182: Find duplicate emails.
```python person = pd.DataFrame({ 'id': [1, 2, 3, 4, 5], 'email': ['a@example.com', 'b@example.com', 'a@example.com', 'b@example.com', 'c@example.com'] })
Method 1: Using value_counts¶
email_counts = person['email'].value_counts() duplicates = email_counts[email_counts > 1].index.tolist() print(duplicates) # ['a@example.com', 'b@example.com']
Method 2: Using groupby and count¶
email_counts = person.groupby('email')['id'].count().reset_index(name='count') duplicates = email_counts[email_counts['count'] > 1]['email'] print(duplicates.tolist()) # ['a@example.com', 'b@example.com'] ```
Comparison Summary¶
| Method | Returns | Purpose |
|---|---|---|
value_counts() |
Series | Count occurrences of each unique value |
nunique() |
int | Count of unique values |
unique() |
ndarray | Array of unique values |
duplicated() |
bool Series | Mark duplicate values |
Financial Example: Sector Distribution¶
```python
Portfolio holdings by sector¶
holdings = pd.DataFrame({ 'ticker': ['AAPL', 'MSFT', 'GOOGL', 'JPM', 'BAC', 'XOM', 'CVX'], 'sector': ['Tech', 'Tech', 'Tech', 'Finance', 'Finance', 'Energy', 'Energy'], 'weight': [0.20, 0.18, 0.15, 0.12, 0.10, 0.13, 0.12] })
Count holdings per sector¶
print(holdings['sector'].value_counts()) ```
Tech 3
Finance 2
Energy 2
Name: count, dtype: int64
```python
Sector concentration¶
print(holdings['sector'].value_counts(normalize=True)) ```
Tech 0.428571
Finance 0.285714
Energy 0.285714
Name: proportion, dtype: float64
```python
Number of sectors¶
print(f"Number of sectors: {holdings['sector'].nunique()}") # 3 ```
Exercises¶
Exercise 1.
Create a Series of 50 random choices from ['red', 'blue', 'green']. Use .value_counts() to count occurrences and .value_counts(normalize=True) to get the proportion of each color.
Solution to Exercise 1
Use value_counts with and without normalize.
import pandas as pd
import numpy as np
np.random.seed(42)
colors = pd.Series(np.random.choice(['red', 'blue', 'green'], 50))
print("Counts:")
print(colors.value_counts())
print("\nProportions:")
print(colors.value_counts(normalize=True))
Exercise 2.
Given a DataFrame with a numeric 'score' column, use .value_counts(bins=5) to group scores into 5 bins. Then use .nunique() to count the number of distinct scores.
Solution to Exercise 2
Use bins parameter and nunique.
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame({'score': np.random.randint(0, 101, 100)})
print("Binned counts:")
print(df['score'].value_counts(bins=5))
print(f"\nUnique scores: {df['score'].nunique()}")
Exercise 3.
Create a DataFrame with duplicate rows. Use .duplicated() to identify them, .duplicated(keep=False) to mark all duplicates, and then use value_counts() on a column to find which values appear more than once.
Solution to Exercise 3
Identify duplicates using duplicated and value_counts.
import pandas as pd
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Alice', 'Carol', 'Bob'],
'dept': ['HR', 'IT', 'HR', 'IT', 'IT']
})
print("Duplicated (keep first):")
print(df.duplicated())
print("\nAll duplicates:")
print(df.duplicated(keep=False))
print("\nValues appearing more than once:")
counts = df['name'].value_counts()
print(counts[counts > 1])