Keyword - join and verify_integrity¶

When concatenating DataFrames with pd.concat, mismatched columns can silently introduce NaN values, and duplicate index labels can go undetected. The join parameter controls how columns (or indices) that do not appear in all DataFrames are handled, while verify_integrity provides a safety check against duplicate index values in the result.

import pandas as pd

join Parameter¶

The join parameter determines what happens to columns that exist in some DataFrames but not others. It accepts two values: "outer" (the default) and "inner".

Outer Join (Default)¶

With join="outer", the result includes all columns from every DataFrame. Missing values are filled with NaN.

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"B": [5, 6], "C": [7, 8]})

result = pd.concat([df1, df2], join="outer", ignore_index=True)
print(result)

     A  B    C
0  1.0  3  NaN
1  2.0  4  NaN
2  NaN  5  7.0
3  NaN  6  8.0

Columns A and C appear in only one DataFrame, so the other gets NaN in those positions.

Inner Join¶

With join="inner", the result keeps only columns that appear in every DataFrame. No NaN values are introduced from column mismatch.

result = pd.concat([df1, df2], join="inner", ignore_index=True)
print(result)

Only column B is common to both DataFrames, so A and C are dropped.

When to Use Each¶

Situation	Recommended `join`
DataFrames share all columns	Either (same result)
Some columns differ, keep all data	`"outer"` (default)
Some columns differ, keep only shared	`"inner"`

Silent NaN Introduction

The default join="outer" can introduce NaN values without warning when DataFrames have different columns. If your downstream code does not handle NaN, consider using join="inner" or explicitly checking columns before concatenation.

verify_integrity Parameter¶

Setting verify_integrity=True causes pd.concat to raise a ValueError if the resulting index contains duplicate values. This is useful as a sanity check when duplicate indices would indicate a data problem.

Default Behavior (No Checking)¶

By default, pd.concat allows duplicate index values.

df1 = pd.DataFrame({"A": [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({"A": [3, 4]}, index=[0, 1])

result = pd.concat([df1, df2])
print(result)

The result has duplicate index values (0 and 1 each appear twice), which is allowed by default.

Enabling Integrity Check¶

try:
    result = pd.concat([df1, df2], verify_integrity=True)
except ValueError as e:
    print(e)
# Indexes have overlapping values: Int64Index([0, 1], dtype='int64')

The call raises a ValueError because indices 0 and 1 appear in both DataFrames.

Fixing Duplicate Indices¶

Two common approaches to resolve duplicate indices before or during concatenation:

# Option 1: Reset indices with ignore_index
result = pd.concat([df1, df2], ignore_index=True)
print(result)

# Option 2: Use keys to create a hierarchical index
result = pd.concat([df1, df2], keys=["first", "second"])
print(result)

            A
first  0    1
       1    2
second 0    3
       1    4

Both approaches produce a result with unique index values.

Combining join and verify_integrity¶

The two parameters work independently and can be used together.

df1 = pd.DataFrame({"A": [1], "B": [2]}, index=["x"])
df2 = pd.DataFrame({"B": [3], "C": [4]}, index=["y"])

# Inner join + integrity check
result = pd.concat(
    [df1, df2],
    join="inner",
    verify_integrity=True
)
print(result)

   B
x  2
y  3

Here join="inner" keeps only column B, and verify_integrity=True confirms no duplicate indices exist.

Summary¶

Parameter	Values	Purpose
`join`	`"outer"` (default), `"inner"`	Controls column handling for mismatched DataFrames
`verify_integrity`	`False` (default), `True`	Raises error on duplicate index values

Key Takeaways:

join="outer" keeps all columns but may introduce NaN; join="inner" keeps only shared columns
verify_integrity=True is a safety net that raises ValueError on duplicate indices
Use ignore_index=True or keys to resolve duplicate indices before they cause problems
These parameters apply to pd.concat only, not to pd.merge (which has its own join logic)