set Internals¶
Sets are implemented similarly to dictionaries—as hash tables—enabling O(1) average membership testing and set operations. Understanding set internals explains their performance characteristics and implementation constraints.
Hash-Based Implementation¶
Membership Testing is Fast¶
s = {i for i in range(100000)}
# O(1) average case
print(50000 in s)
print(99999 in s)
print(100000 in s)
Output:
True
True
False
Elements Must Be Hashable¶
s = {1, 2.5, "three", (4, 5)}
print(s)
# Unhashable types fail
try:
s.add([6, 7])
except TypeError as e:
print(f"Error: {e}")
Output:
{1, 2.5, 'three', (4, 5)}
Error: unhashable type: 'list'
Set Operations as Hash Operations¶
Union, Intersection, Difference¶
a = {1, 2, 3, 4}
b = {3, 4, 5, 6}
print(f"Union: {a | b}")
print(f"Intersection: {a & b}")
print(f"Difference: {a - b}")
print(f"Symmetric difference: {a ^ b}")
Output:
Union: {1, 2, 3, 4, 5, 6}
Intersection: {3, 4}
Difference: {1, 2}
Symmetric difference: {1, 2, 5, 6}
Set vs List Performance¶
Membership Testing¶
import time
# Using list
lst = list(range(100000))
set_time = time.time()
for _ in range(1000):
50000 in lst
list_duration = time.time() - set_time
# Using set
s = set(range(100000))
start = time.time()
for _ in range(1000):
50000 in s
set_duration = time.time() - start
print(f"List: {list_duration:.4f}s")
print(f"Set: {set_duration:.6f}s")
Output:
List: 0.0234s
Set: 0.000024s
Practical Uses¶
Deduplication¶
data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
unique = set(data)
print(unique)
Output:
{1, 2, 3, 4}
Finding Common Elements¶
students_morning = {'Alice', 'Bob', 'Charlie'}
students_afternoon = {'Bob', 'Charlie', 'David'}
both_shifts = students_morning & students_afternoon
print(f"Both shifts: {both_shifts}")
Output:
Both shifts: {'Charlie', 'Bob'}
Runnable Example: set_vs_list_performance.py¶
"""
TUTORIAL: Sets vs Lists for Membership Testing - O(1) vs O(n)
Why this matters:
When you need to check "is this item in my collection?", the data structure
you choose determines speed. Lists require scanning every item (O(n) time).
Sets use hashing to find items instantly (O(1) time).
For small lists, the difference is negligible. But with hundreds or thousands
of items, using a list instead of a set can cost you seconds per second of work.
Core lesson:
If you're doing membership testing on a large collection, USE A SET.
If you need order or duplicates, use a list and accept the O(n) cost.
But whenever you can, choose the faster structure.
"""
import random
import string
import timeit
# ============ Example 1: List-based uniqueness ============
# This function finds unique first names by using a nested loop.
# For each name, it searches through the unique list (O(n) per name).
# Total time: O(n^2) - quadratic, gets slow fast!
def list_unique_names(phonebook):
"""Find unique first names using a list. O(n^2) due to nested search."""
unique_names = []
for name, phonenumber in phonebook:
first_name, last_name = name.split(" ", 1)
# Check if first_name already in unique_names list
# This is O(n) - must scan list each time
for unique in unique_names:
if unique == first_name:
break # Found it, don't add
else:
# Only reached if loop completed without break
unique_names.append(first_name)
return len(unique_names)
# ============ Example 2: Set-based uniqueness ============
# This function uses a set. Membership test is O(1) using hash lookup.
# For each name, we just add to set (duplicates automatically ignored).
# Total time: O(n) - linear, much faster!
def set_unique_names(phonebook):
"""Find unique first names using a set. O(n) due to O(1) lookups."""
unique_names = set()
for name, phonenumber in phonebook:
first_name, last_name = name.split(" ", 1)
# Add to set. Duplicates automatically ignored (O(1) time).
unique_names.add(first_name)
return len(unique_names)
# ============ Example 3: Helper function ============
def random_name():
"""Generate a random first and last name."""
first_name = "".join(random.sample(string.ascii_letters, 8))
last_name = "".join(random.sample(string.ascii_letters, 8))
return f"{first_name} {last_name}"
def demo_small_data():
"""Show both methods on small data where difference is small."""
print("\n" + "=" * 70)
print("Example 1: Small Phonebook (few entries)")
print("=" * 70)
small_phonebook = [
("John Doe", "555-555-5555"),
("Jane Smith", "555-555-5556"),
("John Davis", "555-555-5557"),
("Albert Einstein", "212-555-5555"),
]
print(f"Phonebook size: {len(small_phonebook)}")
print()
unique_from_list = list_unique_names(small_phonebook)
unique_from_set = set_unique_names(small_phonebook)
print(f"Unique first names (list method): {unique_from_list}")
print(f"Unique first names (set method): {unique_from_set}")
print()
print("Both give correct answers. On small data, speed difference is tiny.")
def demo_large_data():
"""Show performance difference on large data."""
print("\n" + "=" * 70)
print("Example 2: Large Phonebook (1000 entries)")
print("=" * 70)
random.seed(42) # Reproducible results
large_phonebook = [(random_name(), "555-5555") for i in range(1000)]
print(f"Phonebook size: {len(large_phonebook)}")
print(f"Running 50 iterations to measure time accurately...")
print()
# Time the list method
setup_list = """
from __main__ import large_phonebook, list_unique_names
"""
t_list = timeit.timeit(
stmt="list_unique_names(large_phonebook)",
setup=setup_list,
number=50
)
# Time the set method
setup_set = """
from __main__ import large_phonebook, set_unique_names
"""
t_set = timeit.timeit(
stmt="set_unique_names(large_phonebook)",
setup=setup_set,
number=50
)
time_per_call_list = t_list / 50
time_per_call_set = t_set / 50
speedup = time_per_call_list / time_per_call_set
print(f"List method time: {time_per_call_list:2e} seconds per call")
print(f"Set method time: {time_per_call_set:2e} seconds per call")
print()
print(f"SET is {speedup:.1f}x FASTER than list!")
print()
print("""
WHY such a big difference:
List approach (O(n^2)):
- Check item 1 against 0 items: 1 comparison
- Check item 2 against 1 item: 1 comparison
- Check item 3 against 2 items: 2 comparisons
- ...
- Check item 1000 against ~500 items: 500 comparisons
- Total: 1 + 1 + 2 + 3 + ... + 500 ≈ 125,000 comparisons
Set approach (O(n)):
- Add item 1: 1 operation
- Add item 2: 1 operation
- Add item 3: 1 operation
- ...
- Add item 1000: 1 operation
- Total: 1000 operations
- 125x fewer operations!
""")
def demo_complexity_comparison():
"""Show how complexity grows with data size."""
print("\n" + "=" * 70)
print("Example 3: How Complexity Grows with Data Size")
print("=" * 70)
print("""
For checking membership in N items:
Data Size List Method Set Method Speedup
--------- ----------- ---------- -------
10 items ~50 comparisons 10 ops 5x
100 items ~5,000 comps 100 ops 50x
1,000 items ~500,000 comps 1,000 ops 500x
10,000 items ~50M comps 10,000 ops 5,000x
Notice: As data grows, list method gets exponentially slower!
Set method grows linearly (much more reasonable).
For real-world phonebooks:
- 1,000 entries: Set is 500x faster
- 10,000 entries: Set is 5,000x faster
- 1,000,000 entries: Set is 500,000x faster
That's not theoretical - that's real wall-clock time!
""")
def demo_when_to_use_each():
"""Explain when to use list vs set."""
print("\n" + "=" * 70)
print("Example 4: When to Use List vs Set")
print("=" * 70)
print("""
USE A LIST when:
- You need to preserve order
- You need to access items by position (index)
- You need to store duplicates
- Data is small (< 100 items typically)
- You rarely check membership
- You need custom ordering
USE A SET when:
- You frequently check membership ("is X in collection?")
- You need fast add/remove operations
- Duplicates don't make sense for your data
- You have large collections (> 100 items)
- You need O(1) membership testing
HYBRID APPROACH:
- Store data as list (ordered, duplicates preserved)
- Create set for membership testing
- Cost: Extra memory, but gains speed on lookups
EXAMPLE:
names = [...] # List for iteration and order
name_set = set(names) # Set for fast checking
if "John" in name_set: ... # O(1) instead of O(n)
""")
if __name__ == "__main__":
print("=" * 70)
print("TUTORIAL: Sets vs Lists - Membership Testing Performance")
print("=" * 70)
demo_small_data()
demo_large_data()
demo_complexity_comparison()
demo_when_to_use_each()
# -------- KEY INSIGHTS --------
print("\n" + "=" * 70)
print("KEY INSIGHTS")
print("=" * 70)
print("""
1. LIST membership is O(n):
- Must scan every item to check membership
- Slow for large collections
- Quadratic complexity for checking all items
2. SET membership is O(1):
- Hash lookup finds item instantly
- Constant time regardless of collection size
- Linear complexity for checking all items
3. COMPLEXITY MATTERS at SCALE:
- Small differences (< 100 items): negligible
- Medium differences (100-1,000): noticeable
- Large differences (1,000+ items): critical
4. REAL-WORLD IMPACT:
- List on 1,000 items: seconds
- Set on 1,000 items: milliseconds
- Difference is user-visible
5. MEMORY TRADEOFF:
- Set uses ~1.5x more memory than list (for hashing)
- But gains 500x+ speed on membership testing
- Usually worthwhile for frequently-tested collections
6. BEST PRACTICE:
- Default to set for membership testing
- Only use list if you need order or duplicates
- If uncertain, profile and measure
- The overhead of set is minimal if you need speed
""")