Keyword - join and verify_integrity¶
When concatenating DataFrames with pd.concat, mismatched columns can silently introduce NaN values, and duplicate index labels can go undetected. The join parameter controls how columns (or indices) that do not appear in all DataFrames are handled, while verify_integrity provides a safety check against duplicate index values in the result.
Mental Model
join="outer" keeps every column from every DataFrame, filling gaps with NaN. join="inner" keeps only the columns that appear in all DataFrames. verify_integrity=True is a safety net that raises an error if the resulting index has duplicates -- useful for catching accidental double-stacking.
python
import pandas as pd
join Parameter¶
The join parameter determines what happens to columns that exist in some DataFrames but not others. It accepts two values: "outer" (the default) and "inner".
Outer Join (Default)¶
With join="outer", the result includes all columns from every DataFrame. Missing values are filled with NaN.
```python df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]}) df2 = pd.DataFrame({"B": [5, 6], "C": [7, 8]})
result = pd.concat([df1, df2], join="outer", ignore_index=True) print(result) ```
A B C
0 1.0 3 NaN
1 2.0 4 NaN
2 NaN 5 7.0
3 NaN 6 8.0
Columns A and C appear in only one DataFrame, so the other gets NaN in those positions.
Inner Join¶
With join="inner", the result keeps only columns that appear in every DataFrame. No NaN values are introduced from column mismatch.
python
result = pd.concat([df1, df2], join="inner", ignore_index=True)
print(result)
B
0 3
1 4
2 5
3 6
Only column B is common to both DataFrames, so A and C are dropped.
When to Use Each¶
| Situation | Recommended join |
|---|---|
| DataFrames share all columns | Either (same result) |
| Some columns differ, keep all data | "outer" (default) |
| Some columns differ, keep only shared | "inner" |
Silent NaN Introduction
The default join="outer" can introduce NaN values without warning when DataFrames have different columns. If your downstream code does not handle NaN, consider using join="inner" or explicitly checking columns before concatenation.
verify_integrity Parameter¶
Setting verify_integrity=True causes pd.concat to raise a ValueError if the resulting index contains duplicate values. This is useful as a sanity check when duplicate indices would indicate a data problem.
Default Behavior (No Checking)¶
By default, pd.concat allows duplicate index values.
```python df1 = pd.DataFrame({"A": [1, 2]}, index=[0, 1]) df2 = pd.DataFrame({"A": [3, 4]}, index=[0, 1])
result = pd.concat([df1, df2]) print(result) ```
A
0 1
1 2
0 3
1 4
The result has duplicate index values (0 and 1 each appear twice), which is allowed by default.
Enabling Integrity Check¶
```python try: result = pd.concat([df1, df2], verify_integrity=True) except ValueError as e: print(e)
Indexes have overlapping values: Int64Index([0, 1], dtype='int64')¶
```
The call raises a ValueError because indices 0 and 1 appear in both DataFrames.
Fixing Duplicate Indices¶
Two common approaches to resolve duplicate indices before or during concatenation:
```python
Option 1: Reset indices with ignore_index¶
result = pd.concat([df1, df2], ignore_index=True) print(result) ```
A
0 1
1 2
2 3
3 4
```python
Option 2: Use keys to create a hierarchical index¶
result = pd.concat([df1, df2], keys=["first", "second"]) print(result) ```
A
first 0 1
1 2
second 0 3
1 4
Both approaches produce a result with unique index values.
Combining join and verify_integrity¶
The two parameters work independently and can be used together.
```python df1 = pd.DataFrame({"A": [1], "B": [2]}, index=["x"]) df2 = pd.DataFrame({"B": [3], "C": [4]}, index=["y"])
Inner join + integrity check¶
result = pd.concat( [df1, df2], join="inner", verify_integrity=True ) print(result) ```
B
x 2
y 3
Here join="inner" keeps only column B, and verify_integrity=True confirms no duplicate indices exist.
Summary¶
| Parameter | Values | Purpose |
|---|---|---|
join |
"outer" (default), "inner" |
Controls column handling for mismatched DataFrames |
verify_integrity |
False (default), True |
Raises error on duplicate index values |
Key Takeaways:
join="outer"keeps all columns but may introduceNaN;join="inner"keeps only shared columnsverify_integrity=Trueis a safety net that raisesValueErroron duplicate indices- Use
ignore_index=Trueorkeysto resolve duplicate indices before they cause problems - These parameters apply to
pd.concatonly, not topd.merge(which has its own join logic)
Exercises¶
Exercise 1. Write code that concatenates two DataFrames with different columns using join='inner' and join='outer'. Compare the results.
Solution to Exercise 1
```python import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}) df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]}) result = pd.concat([df1, df2], ignore_index=True) print(result) ```
Exercise 2. Explain what the verify_integrity parameter does in pd.concat(). What happens if there are duplicate index values?
Solution to Exercise 2
See the explanation in the main content. The key concept involves understanding how pd.concat() aligns data along the specified axis and handles mismatched indices or columns.
Exercise 3. Create two DataFrames with partially overlapping columns. Concatenate with join='inner' and show that only common columns are kept.
Solution to Exercise 3
```python import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}) df2 = pd.DataFrame({'A': [5, 6], 'C': [7, 8]}) result = pd.concat([df1, df2], axis=0) print(result) ```
Exercise 4. Write code demonstrating that pd.concat([df1, df2], verify_integrity=True) raises an error when index values are duplicated.
Solution to Exercise 4
```python import pandas as pd
df1 = pd.DataFrame({'A': [1, 2]}, index=[0, 1]) df2 = pd.DataFrame({'A': [3, 4]}, index=[2, 3]) result = pd.concat([df1, df2]) print(result) ```