Keyword - join and verify_integrity¶
When concatenating DataFrames with pd.concat, mismatched columns can silently introduce NaN values, and duplicate index labels can go undetected. The join parameter controls how columns (or indices) that do not appear in all DataFrames are handled, while verify_integrity provides a safety check against duplicate index values in the result.
import pandas as pd
join Parameter¶
The join parameter determines what happens to columns that exist in some DataFrames but not others. It accepts two values: "outer" (the default) and "inner".
Outer Join (Default)¶
With join="outer", the result includes all columns from every DataFrame. Missing values are filled with NaN.
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"B": [5, 6], "C": [7, 8]})
result = pd.concat([df1, df2], join="outer", ignore_index=True)
print(result)
     A  B    C
0  1.0  3  NaN
1  2.0  4  NaN
2  NaN  5  7.0
3  NaN  6  8.0
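Columns A and C each appear in only one DataFrame, so the rows that come from the other DataFrame get NaN in those columns. Because NaN is a float, the affected integer columns are upcast to float64 (note 1.0 and 7.0 in the output).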
Inner Join¶
With join="inner", the result keeps only columns that appear in every DataFrame. No NaN values are introduced from column mismatch.
result = pd.concat([df1, df2], join="inner", ignore_index=True)
print(result)
B
0 3
1 4
2 5
3 6
Only column B is common to both DataFrames, so A and C are dropped.
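The same rule applies to row labels when frames are concatenated side by side. The following is a minimal sketch, assuming axis=1 (which the examples above do not use) and two illustrative frames named left and right that are not part of the original examples:

```python
import pandas as pd

# Hypothetical frames whose row labels only partly overlap.
left = pd.DataFrame({"A": [1, 2, 3]}, index=["x", "y", "z"])
right = pd.DataFrame({"B": [10, 20]}, index=["y", "z"])

# With axis=1, join acts on the row index instead of the columns.
outer = pd.concat([left, right], axis=1, join="outer")
inner = pd.concat([left, right], axis=1, join="inner")

print(outer)  # rows x, y, z; B is NaN for row x
print(inner)  # rows y and z only
```

Here "outer" keeps every row label and fills the gap in B with NaN, while "inner" keeps only the labels present in both frames.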
When to Use Each¶
| Situation | Recommended join |
|---|---|
| DataFrames share all columns | Either (same result) |
| Some columns differ, keep all data | "outer" (default) |
| Some columns differ, keep only shared | "inner" |
Silent NaN Introduction
The default join="outer" can introduce NaN values without warning when DataFrames have different columns. If your downstream code does not handle NaN, consider using join="inner" or explicitly checking columns before concatenation. Keep in mind that join="inner" avoids NaN by dropping the unshared columns, which is also done without warning.
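One way to check columns before concatenating is to compare the column sets directly. This is only a sketch: the variable names shared, all_cols, and mismatched are illustrative, and the frames repeat the earlier df1 and df2 so the block runs on its own.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"B": [5, 6], "C": [7, 8]})

# Columns present in every frame vs. columns present in at least one.
shared = set(df1.columns) & set(df2.columns)
all_cols = set(df1.columns) | set(df2.columns)
mismatched = sorted(all_cols - shared)

if mismatched:
    # These columns would receive NaN under join="outer"
    # and would be dropped under join="inner".
    print(f"Columns missing from at least one frame: {mismatched}")
```

Whether a mismatch should raise, log, or be ignored depends on the pipeline, so the sketch only reports it.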
verify_integrity Parameter¶
Setting verify_integrity=True causes pd.concat to raise a ValueError if the resulting index contains duplicate values. This is useful as a sanity check when duplicate indices would indicate a data problem.
Default Behavior (No Checking)¶
By default, pd.concat allows duplicate index values.
df1 = pd.DataFrame({"A": [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({"A": [3, 4]}, index=[0, 1])
result = pd.concat([df1, df2])
print(result)
A
0 1
1 2
0 3
1 4
The result has duplicate index values (0 and 1 each appear twice), which is allowed by default.
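Duplicate labels can also be detected after the fact, without raising an error. The sketch below rebuilds the same frames and uses Index.is_unique and Index.duplicated; the name dup_labels is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({"A": [3, 4]}, index=[0, 1])
result = pd.concat([df1, df2])

print(result.index.is_unique)  # False

# Label-based selection returns every row carrying that label,
# so result.loc[0] is a two-row DataFrame rather than a single row.
print(result.loc[0])

# Labels that occur more than once.
dup_labels = result.index[result.index.duplicated()].unique()
print(list(dup_labels))
```

This shows why duplicates matter in practice: code that expects result.loc[0] to be one row gets two.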
Enabling Integrity Check¶
try:
    result = pd.concat([df1, df2], verify_integrity=True)
except ValueError as e:
    print(e)
# Indexes have overlapping values: Index([0, 1], dtype='int64')
# (pandas versions before 2.0 print Int64Index([0, 1], dtype='int64'))
The call raises a ValueError because indices 0 and 1 appear in both DataFrames.
Fixing Duplicate Indices¶
Two common approaches to resolve duplicate indices before or during concatenation:
# Option 1: Reset indices with ignore_index
result = pd.concat([df1, df2], ignore_index=True)
print(result)
A
0 1
1 2
2 3
3 4
# Option 2: Use keys to create a hierarchical index
result = pd.concat([df1, df2], keys=["first", "second"])
print(result)
          A
first  0  1
       1  2
second 0  3
       1  4
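Both approaches produce a result with unique index values. ignore_index=True discards the original labels, while keys preserves them under a new outer index level.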
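The difference between the two options shows up when you need to trace rows back to their source. A short sketch, with by_position and by_source as illustrative names:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({"A": [3, 4]}, index=[0, 1])

by_position = pd.concat([df1, df2], ignore_index=True)
by_source = pd.concat([df1, df2], keys=["first", "second"])

# Both results have unique index values.
print(by_position.index.is_unique, by_source.index.is_unique)

# With keys, the outer level selects all rows from one input frame...
print(by_source.loc["second"])

# ...and a (key, label) tuple selects a single cell.
print(by_source.loc[("second", 0), "A"])  # 3
```

If the original labels carry no meaning, ignore_index=True is the simpler choice; if provenance matters, keys is likely the better fit.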
Combining join and verify_integrity¶
The two parameters work independently and can be used together.
df1 = pd.DataFrame({"A": [1], "B": [2]}, index=["x"])
df2 = pd.DataFrame({"B": [3], "C": [4]}, index=["y"])
# Inner join + integrity check
result = pd.concat(
    [df1, df2],
    join="inner",
    verify_integrity=True
)
print(result)
B
x 2
y 3
Here join="inner" keeps only column B, and verify_integrity=True confirms no duplicate indices exist.
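To see that the integrity check is independent of the join, consider a variant where the second frame reuses the label "x". The frame df3 and the flag raised are hypothetical additions for illustration only.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1], "B": [2]}, index=["x"])
# Hypothetical frame that reuses the row label "x".
df3 = pd.DataFrame({"B": [5], "C": [6]}, index=["x"])

try:
    pd.concat([df1, df3], join="inner", verify_integrity=True)
    raised = False
except ValueError as exc:
    # The column join would have succeeded; the duplicate label is the problem.
    raised = True
    print(f"Integrity check failed: {exc}")
```

The inner join alone would return a valid single-column result, so the error comes only from the duplicated row label.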
Summary¶
| Parameter | Values | Purpose |
|---|---|---|
| join | "outer" (default), "inner" | Controls column handling for mismatched DataFrames |
| verify_integrity | False (default), True | Raises error on duplicate index values |
Key Takeaways:
- join="outer" keeps all columns but may introduce NaN; join="inner" keeps only shared columns
- verify_integrity=True is a safety net that raises ValueError on duplicate indices
- Use ignore_index=True or keys to resolve duplicate indices before they cause problems
- These parameters apply to pd.concat only, not to pd.merge (which has its own join logic)
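For contrast, pd.merge selects its join type with the how parameter and can check key uniqueness with validate. The sketch below uses made-up frames and a made-up column named key; it is meant only to show that the parameter names differ, not to cover pd.merge in full.

```python
import pandas as pd

# Hypothetical frames joined on a shared "key" column.
left = pd.DataFrame({"key": [1, 2, 3], "A": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "B": ["x", "y", "z"]})

# how="inner" keeps only keys present in both frames;
# validate="one_to_one" raises if a key repeats on either side.
merged = pd.merge(left, right, on="key", how="inner", validate="one_to_one")
print(merged)
```

Unlike join in pd.concat, how in pd.merge matches rows by key values, not by column or index labels alone.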