Skip to content

DataFrame Creation

DataFrames can be created from various data structures including dictionaries, lists, NumPy arrays, and other DataFrames.

Mental Model

Every DataFrame constructor answers the same question: "How should I interpret this raw data as rows and columns?" A dict of lists reads keys as column names. A list of dicts reads keys as column names per row. A 2D array needs explicit column labels. Choose the constructor that matches how your source data is already organized.

From Dictionary of Lists

Column-oriented data with lists as values.

1. Basic Dictionary

```python import pandas as pd

data = { 'temperature': [32, 35, 28], 'windspeed': [6, 7, 2], 'event': ['Rain', 'Sunny', 'Snow'] }

df = pd.DataFrame(data) print(df) ```

temperature windspeed event 0 32 6 Rain 1 35 7 Sunny 2 28 2 Snow

2. With Custom Index

python day = ['1/1/2017', '1/2/2017', '1/3/2017'] df = pd.DataFrame(data, index=day) print(df)

temperature windspeed event 1/1/2017 32 6 Rain 1/2/2017 35 7 Sunny 1/3/2017 28 2 Snow

3. Access Attributes

python print(df.index) # Index(['1/1/2017', '1/2/2017', '1/3/2017']) print(df.columns) # Index(['temperature', 'windspeed', 'event'])

From Dictionary of Dictionaries

Row keys become the index automatically.

1. Nested Dictionaries

```python temp = {'1/1/2017': 32, '1/2/2017': 35, '1/3/2017': 28} wind = {'1/1/2017': 6, '1/2/2017': 7, '1/3/2017': 2} event = {'1/1/2017': 'Rain', '1/2/2017': 'Sunny', '1/3/2017': 'Snow'}

data = {'temperature': temp, 'windspeed': wind, 'event': event} df = pd.DataFrame(data) print(df) ```

2. Automatic Index

Keys from inner dictionaries become the DataFrame index.

3. Handling Missing Keys

```python

If inner dicts have different keys, NaN fills missing values

```

From List of Lists

Row-oriented data with each inner list as a row.

1. Basic List of Lists

```python data = [ ['1/1/2017', 32, 6, 'Rain'], ['1/2/2017', 35, 7, 'Sunny'], ['1/3/2017', 28, 2, 'Snow'] ]

columns = ['day', 'temperature', 'windspeed', 'event'] df = pd.DataFrame(data, columns=columns) print(df) ```

2. Set Column as Index

python df = df.set_index('day')

3. Direct Index Assignment

python df = pd.DataFrame(data, columns=columns).set_index('day')

From List of Dictionaries

Each dictionary represents a row.

1. Row Dictionaries

```python data = [ {'day': '1/1/2017', 'temperature': 32, 'windspeed': 6, 'event': 'Rain'}, {'day': '1/2/2017', 'temperature': 35, 'windspeed': 7, 'event': 'Sunny'}, {'day': '1/3/2017', 'temperature': 28, 'windspeed': 2, 'event': 'Snow'} ]

df = pd.DataFrame(data).set_index('day') ```

2. Automatic Column Detection

Column names are inferred from dictionary keys.

3. Missing Keys

```python

Missing keys in some dicts result in NaN values

```

From NumPy Array

Create DataFrame from 2D array.

1. Random Data

```python import numpy as np

np.random.seed(0) data = np.random.normal(size=(3, 4))

index = ['Jenny', 'Frank', 'Wenfei'] columns = list('ABCD')

df = pd.DataFrame(data, index=index, columns=columns) print(df) ```

A B C D Jenny 1.764052 0.400157 0.978738 2.240893 Frank 1.867558 -0.977278 0.950088 -0.151357 Wenfei -0.103219 0.410599 0.144044 1.454274

2. Specify dtype

python df = pd.DataFrame(data, dtype=float)

3. Shape Preservation

DataFrame shape matches array shape.

LeetCode Example

Create DataFrame from list of student data.

1. Sample Data

python student_data = [ [101, 20], [102, 22], [103, 21] ]

2. Create DataFrame

python df = pd.DataFrame(student_data, columns=['student_id', 'age']) print(df)

student_id age 0 101 20 1 102 22 2 103 21

3. Type Annotation

```python from typing import List

def createDataframe(student_data: List[List[int]]) -> pd.DataFrame: return pd.DataFrame(student_data, columns=['student_id', 'age']) ```


Exercises

Exercise 1. Create a DataFrame from a dictionary where keys are 'ticker', 'sector', and 'market_cap' with at least four rows of sample stock data. Set the 'ticker' column as the index after creation using set_index.

Solution to Exercise 1

Create from a dictionary and set the index.

import pandas as pd

df = pd.DataFrame({
    'ticker': ['AAPL', 'MSFT', 'GOOGL', 'AMZN'],
    'sector': ['Tech', 'Tech', 'Tech', 'Consumer'],
    'market_cap': [2800, 2400, 1800, 1500]
})
df = df.set_index('ticker')
print(df)

Exercise 2. Create a DataFrame from a list of dictionaries where each dictionary represents a student with keys 'name', 'grade', and 'score'. One of the dictionaries should be missing the 'score' key. Print the DataFrame and observe how pandas handles the missing value.

Solution to Exercise 2

Missing keys in dictionaries produce NaN in the DataFrame.

import pandas as pd

students = [
    {'name': 'Alice', 'grade': 'A', 'score': 95},
    {'name': 'Bob', 'grade': 'B', 'score': 85},
    {'name': 'Carol', 'grade': 'A'},  # Missing 'score'
]
df = pd.DataFrame(students)
print(df)
# Carol's score will be NaN

Exercise 3. Create a DataFrame from a 3x4 NumPy random array. Assign custom column names ['Q1', 'Q2', 'Q3', 'Q4'] and custom index labels ['2022', '2023', '2024']. Then verify the shape is (3, 4).

Solution to Exercise 3

Create from a NumPy array with custom labels.

import pandas as pd
import numpy as np

np.random.seed(42)
data = np.random.rand(3, 4)
df = pd.DataFrame(
    data,
    columns=['Q1', 'Q2', 'Q3', 'Q4'],
    index=['2022', '2023', '2024']
)
print(df)
print(f"Shape: {df.shape}")  # (3, 4)