
Structured Arrays

Structured arrays (closely related to record arrays, which are described below) let you store heterogeneous data types in a single array, where each element resembles a row of a database table or spreadsheet.

Mental Model

A structured array is NumPy's version of a database table: each element is a "row" containing named fields of potentially different types (e.g., a string name, an int age, a float score). Access a column by field name (arr['age']), a row by index (arr[0]). For most tabular data work, Pandas is more convenient, but structured arrays shine when you need tight memory control or C-level interop.

At the memory level, a structured array stores different fields in a single contiguous memory block with byte offsets — the dtype acts as a schema that tells NumPy where each field starts within each record.
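As a minimal sketch of the schema idea, the snippet below inspects the byte layout of a dtype. The field names (`name`, `age`, `score`) are illustrative, and the offsets assume NumPy's default packed (unaligned) layout.

```python
import numpy as np

# Illustrative schema: 10-char Unicode string, 32-bit int, 64-bit float
dt = np.dtype([('name', 'U10'), ('age', 'i4'), ('score', 'f8')])

# dt.fields maps each field name to a tuple starting with (dtype, byte offset)
offsets = {name: dt.fields[name][1] for name in dt.names}

print(dt.itemsize)  # 52 bytes per record: 40 (U10 = 10 * 4 bytes) + 4 + 8
print(offsets)      # {'name': 0, 'age': 40, 'score': 44}
```

Every record occupies `dt.itemsize` bytes, so record `i` starts at byte `i * dt.itemsize` of the buffer and each field sits at its fixed offset inside that record.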

Array of Structs vs Struct of Arrays

There are two ways to store tabular data in memory, and the choice has real performance consequences:

Array of Structs (AoS), which is what structured arrays use:

    [row1: name, age, score] [row2: name, age, score] ...

Each record's fields are contiguous. Good for record-at-a-time access and C struct interop, but column operations (e.g., arr['score'].mean()) must skip over interleaved fields, reducing cache efficiency.

Struct of Arrays (SoA), which is what separate NumPy arrays / Pandas use:

    names:  [name1, name2, ...]
    ages:   [age1, age2, ...]
    scores: [score1, score2, ...]

Each column is contiguous. Vectorized column operations are cache-friendly and fast, but accessing a full record requires gathering from multiple arrays.

| Concern | AoS (structured) | SoA (separate arrays) |
| --- | --- | --- |
| Record access | Fast (contiguous) | Slow (scattered) |
| Column vectorization | Slower (interleaved) | Fast (contiguous) |
| C interop | Natural (memcpy a struct) | Requires packing |
| Cache locality | Per-record | Per-column |

Rule of thumb: use structured arrays (AoS) for binary I/O and C interop; use separate arrays or Pandas (SoA) for analytical computation.
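The layout difference is visible in the strides of a field view. The following is a small sketch, assuming an illustrative three-field dtype and the default packed layout; it does not measure performance, it only shows how far apart consecutive `score` values sit in memory.

```python
import numpy as np

dt = np.dtype([('name', 'U10'), ('age', 'i4'), ('score', 'f8')])
aos = np.zeros(1000, dtype=dt)  # Array of Structs: 52 bytes per record

# A field of a structured array is a strided view into the interleaved records
score_view = aos['score']

# Copying the field out gives a contiguous, SoA-style column
score_col = np.ascontiguousarray(score_view)

print(score_view.strides)  # (52,) : one full record between consecutive scores
print(score_col.strides)   # (8,)  : scores are back to back
print(score_view.flags['C_CONTIGUOUS'], score_col.flags['C_CONTIGUOUS'])  # False True
```

If a column is used repeatedly in heavy numeric work, copying it out once (as `score_col` does here) is one way to get SoA-style access without abandoning the structured array.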

Structured Data Model

```text
A structured array is:
- a contiguous block of memory
- interpreted as records with named fields
- defined by a dtype schema (field names + types + offsets)

It allows:
- heterogeneous data in a single array
- vectorized operations across fields
- memory-efficient tabular storage
- direct interop with C structs and binary formats
```

Decision Guide

| Use case | Tool |
| --- | --- |
| Memory-critical tabular data | Structured arrays |
| Binary file formats / C interop | Structured arrays |
| Data analysis (groupby, joins, missing values) | Pandas |
| Heavy analytics / exploratory work | Pandas |

Avoid structured arrays for operations that Pandas handles natively (groupby, pivot, merge) — the ergonomic cost is not worth the memory savings.
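To illustrate the binary-format use case, here is a hedged sketch that decodes fixed-size records from raw bytes. The record layout (a little-endian `int32` id followed by a `float64` value) is hypothetical, and the bytes are produced with Python's `struct` module to stand in for a real file or C program.

```python
import struct
import numpy as np

# Hypothetical on-disk layout: little-endian int32 id, then float64 value
record_dt = np.dtype([('id', '<i4'), ('value', '<f8')])

# Stand-in for bytes read from a binary file (12 bytes per record, no padding)
raw = b''.join(struct.pack('<id', i, i * 1.5) for i in range(3))

# Interpret the buffer as records without copying or parsing field by field
records = np.frombuffer(raw, dtype=record_dt)

print(records['id'].tolist())     # [0, 1, 2]
print(records['value'].tolist())  # [0.0, 1.5, 3.0]
print(records.tobytes() == raw)   # True: the round trip is byte-exact
```

Arrays created by `np.frombuffer` over a `bytes` object are read-only; call `.copy()` first if the records need to be modified.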

The examples below assume NumPy has been imported as `np`:

    import numpy as np


What are Structured Arrays?

Regular NumPy arrays hold homogeneous data (all same type). Structured arrays hold records with multiple named fields of different types:

```python
# Regular array: all floats
regular = np.array([1.0, 2.0, 3.0])

# Structured array: mixed types
dt = np.dtype([('name', 'U10'), ('age', 'i4'), ('score', 'f8')])
structured = np.array([
    ('Alice', 25, 95.5),
    ('Bob', 30, 87.3),
    ('Charlie', 22, 91.0)
], dtype=dt)
```


Creating Structured Arrays

The dtype is the schema definition for your structured array — it specifies field names, types, and (implicitly) byte offsets, just as a CREATE TABLE statement defines column names and types in SQL.

Method 1: dtype with List of Tuples

```python
# Define dtype: (field_name, data_type)
dt = np.dtype([
    ('name', 'U20'),    # Unicode string, max 20 chars
    ('age', 'i4'),      # 32-bit integer
    ('salary', 'f8'),   # 64-bit float
    ('active', '?')     # Boolean
])

# Create array
employees = np.array([
    ('Alice', 30, 75000.0, True),
    ('Bob', 25, 65000.0, True),
    ('Charlie', 35, 85000.0, False)
], dtype=dt)
```

Method 2: Dictionary Format

```python
dt = np.dtype({
    'names': ['x', 'y', 'z'],
    'formats': ['f8', 'f8', 'f8']
})

points = np.array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)], dtype=dt)
```
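The dictionary format also accepts optional `'offsets'` and `'itemsize'` keys, which matters when a dtype has to match a padded C struct. The sketch below uses illustrative field names (`flag`, `value`) and compares the default packed layout with `align=True`; the aligned numbers in the comments assume a common 64-bit platform where `double` is 8-byte aligned.

```python
import numpy as np

# Default layout is packed: no padding between fields
packed = np.dtype([('flag', 'u1'), ('value', 'f8')])

# align=True inserts padding the way a typical C compiler would
aligned = np.dtype([('flag', 'u1'), ('value', 'f8')], align=True)

# The dictionary form can state a padded layout explicitly
explicit = np.dtype({
    'names': ['flag', 'value'],
    'formats': ['u1', 'f8'],
    'offsets': [0, 8],
    'itemsize': 16,
})

print(packed.itemsize, packed.fields['value'][1])      # 9 1
print(aligned.itemsize, aligned.fields['value'][1])    # 16 8 on typical 64-bit platforms
print(explicit.itemsize, explicit.fields['value'][1])  # 16 8
```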

Method 3: String Format

```python
# Comma-separated type strings
dt = np.dtype('U10, i4, f8')  # Unnamed fields get default names: f0, f1, f2

data = np.array([('Alice', 25, 95.5)], dtype=dt)
print(data['f0'])  # ['Alice']
```


Data Type Codes

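| Code | Type | NumPy type |
| --- | --- | --- |
| `'i4'` | 32-bit int | `np.int32` |
| `'i8'` | 64-bit int | `np.int64` |
| `'f4'` | 32-bit float | `np.float32` |
| `'f8'` | 64-bit float | `np.float64` |
| `'U10'` | Unicode string (10 chars) | `np.str_` |
| `'S10'` | Byte string (10 bytes) | `np.bytes_` |
| `'?'` | Boolean | `np.bool_` |
| `'c16'` | Complex 128 | `np.complex128` |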
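A quick way to check what a type code means is to build a dtype from it and inspect the result. This is a small sketch that loops over the codes in the table and prints each dtype's name and size in bytes.

```python
import numpy as np

codes = ['i4', 'i8', 'f4', 'f8', 'U10', 'S10', '?', 'c16']

# Size in bytes of one element for each code
sizes = {code: np.dtype(code).itemsize for code in codes}

for code in codes:
    print(f"{code!r:>6} -> {np.dtype(code).name:<12} {sizes[code]} bytes")

# 'U10' takes 40 bytes because NumPy stores 4 bytes per Unicode character
print(sizes['U10'], sizes['S10'])  # 40 10
```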

Accessing Data

By Field Name

```python
dt = np.dtype([('name', 'U10'), ('age', 'i4'), ('score', 'f8')])
students = np.array([
    ('Alice', 20, 95.5),
    ('Bob', 22, 87.3),
    ('Charlie', 21, 91.0)
], dtype=dt)

# Access entire column
print(students['name'])   # ['Alice' 'Bob' 'Charlie']
print(students['age'])    # [20 22 21]
print(students['score'])  # [95.5 87.3 91. ]

# Access single record
print(students[0])          # ('Alice', 20, 95.5)
print(students[0]['name'])  # Alice
```

By Index

```python
# First record
print(students[0])  # ('Alice', 20, 95.5)

# Slice
print(students[:2])  # First two records

# Boolean indexing
adults = students[students['age'] >= 21]
print(adults['name'])  # ['Bob' 'Charlie']
```

Multiple Fields

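```python
# Select multiple fields (returns a structured view of the original array)
subset = students[['name', 'score']]
print(subset.dtype.names)     # ('name', 'score')
print(subset.dtype.itemsize)  # 52: the view keeps the original record size (NumPy >= 1.16)
```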
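Because multi-field indexing returns a view in NumPy 1.16 and later, the result still spans the bytes of the unselected fields. The sketch below, which assumes that NumPy version and redefines a small `students` array so it runs on its own, uses `numpy.lib.recfunctions.repack_fields` to obtain a compact, independent copy.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

dt = np.dtype([('name', 'U10'), ('age', 'i4'), ('score', 'f8')])
students = np.array([('Alice', 20, 95.5), ('Bob', 22, 87.3)], dtype=dt)

subset = students[['name', 'score']]  # view: still 52 bytes per record
packed = rfn.repack_fields(subset)    # copy: the 4 'age' bytes are dropped

print(subset.dtype.itemsize)  # 52
print(packed.dtype.itemsize)  # 48

# Writing through the view changes the original; the packed copy is independent
subset['score'][0] = 99.0
packed['score'][1] = 0.0
print(students['score'].tolist())  # [99.0, 87.3]
```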


Modifying Data

```python
# Modify single field
students['score'][0] = 98.0

# Modify entire record
students[1] = ('Robert', 23, 90.0)

# Modify column
students['age'] = students['age'] + 1  # Everyone ages by 1
```
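One pitfall when modifying a subset of records is the order of indexing. Boolean (mask) indexing returns a copy, so assigning to a field of that copy leaves the original untouched. The following sketch, with a small self-contained `students` array, shows the silent no-op and the working form.

```python
import numpy as np

dt = np.dtype([('name', 'U10'), ('age', 'i4'), ('score', 'f8')])
students = np.array([('Alice', 20, 95.5), ('Bob', 22, 87.3)], dtype=dt)
mask = students['age'] > 21

# No effect: students[mask] is a temporary copy, so the write is lost
students[mask]['score'] = 0.0
print(students['score'].tolist())  # [95.5, 87.3]

# Works: select the field first (a view), then apply the mask
students['score'][mask] = 0.0
print(students['score'].tolist())  # [95.5, 0.0]
```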


Nested Structures

```python
# Nested dtype
address_dt = np.dtype([('city', 'U20'), ('zip', 'U10')])
person_dt = np.dtype([
    ('name', 'U20'),
    ('address', address_dt)
])

people = np.array([
    ('Alice', ('New York', '10001')),
    ('Bob', ('Los Angeles', '90001'))
], dtype=person_dt)

# Access nested fields
print(people['address']['city'])  # ['New York' 'Los Angeles']
```


Array Fields

```python
# Field that is itself an array
# (note: this redefines `students` with a different dtype)
dt = np.dtype([
    ('name', 'U10'),
    ('grades', 'f8', (3,))  # Array of 3 floats
])

students = np.array([
    ('Alice', [95, 87, 92]),
    ('Bob', [88, 91, 85])
], dtype=dt)

print(students['grades'])
# [[95. 87. 92.]
#  [88. 91. 85.]]

print(students[0]['grades'].mean())  # approximately 91.33
```


Record Arrays (recarray)

A recarray is syntactic sugar over a structured array — the underlying data and dtype are identical, but you can write rec.name instead of rec['name']. The convenience comes at a small performance cost (attribute access is slower than item access), so use recarrays for interactive exploration and structured arrays for production code.

Record arrays allow attribute-style access:

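```python
# A structured array with name, age, and score fields
students = np.array([
    ('Alice', 20, 95.5),
    ('Bob', 22, 87.3)
], dtype=[('name', 'U10'), ('age', 'i4'), ('score', 'f8')])

# Convert structured array to recarray
rec = students.view(np.recarray)

# Attribute access (instead of indexing)
print(rec.name)     # ['Alice' 'Bob']
print(rec.age)      # [20 22]
print(rec[0].name)  # Alice

# Create recarray directly
rec = np.rec.array([
    ('Alice', 25, 95.5),
    ('Bob', 30, 87.3)
], dtype=[('name', 'U10'), ('age', 'i4'), ('score', 'f8')])
```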
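The claim that a recarray shares its data with the underlying structured array can be checked directly. This is a small sketch with a self-contained two-record `students` array: it views the array as a recarray, confirms both objects use the same buffer, and shows that a write through one is visible in the other.

```python
import numpy as np

dt = np.dtype([('name', 'U10'), ('age', 'i4'), ('score', 'f8')])
students = np.array([('Alice', 20, 95.5), ('Bob', 22, 87.3)], dtype=dt)

rec = students.view(np.recarray)

# Same buffer, same fields: only the access style differs
print(np.shares_memory(rec, students))          # True
print(rec.dtype.names == students.dtype.names)  # True

# A write through the recarray is visible in the original array
rec.age[0] = 21
print(students['age'].tolist())  # [21, 22]
```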


Practical Examples

CSV-like Data

```python
# Load CSV-like data
dt = np.dtype([
    ('id', 'i4'),
    ('product', 'U30'),
    ('price', 'f8'),
    ('quantity', 'i4')
])

inventory = np.array([
    (1, 'Widget', 9.99, 100),
    (2, 'Gadget', 24.99, 50),
    (3, 'Gizmo', 14.99, 75)
], dtype=dt)

# Calculate total value
total_value = (inventory['price'] * inventory['quantity']).sum()
print(f"Total inventory value: ${total_value:.2f}")
```
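The inventory above is typed in by hand. As a hedged sketch of getting from actual CSV text to the same kind of array, the snippet below parses rows with the standard `csv` module and converts each column explicitly before building the structured array. The CSV content is hypothetical and held in a string so the example is self-contained.

```python
import csv
import io
import numpy as np

dt = np.dtype([('id', 'i4'), ('product', 'U30'), ('price', 'f8'), ('quantity', 'i4')])

# Hypothetical CSV content; in practice this would come from a file
csv_text = "id,product,price,quantity\n1,Widget,9.99,100\n2,Gadget,24.99,50\n"

reader = csv.reader(io.StringIO(csv_text))
next(reader)  # skip the header row

# Each record must be a tuple whose values match the dtype's field order
rows = [(int(i), product, float(price), int(qty)) for i, product, price, qty in reader]

inventory = np.array(rows, dtype=dt)
print(inventory['product'].tolist())  # ['Widget', 'Gadget']
print(inventory['quantity'].sum())    # 150
```

Records are passed as tuples rather than lists because NumPy interprets a tuple as one record and a list as a sequence of elements.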

Sorting Structured Arrays

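```python
# `students` here is the (name, age, score) array from "Accessing Data"

# Sort by single field
sorted_by_score = np.sort(students, order='score')

# Sort by multiple fields (ties in 'age' are broken by 'score')
sorted_multi = np.sort(students, order=['age', 'score'])
```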
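The `order` parameter always sorts in ascending order. One possible way to get a descending order, sketched here with a self-contained `students` array, is to reverse the result of `np.argsort` on the field and use it as an index. The snippet also shows the in-place `ndarray.sort` variant.

```python
import numpy as np

dt = np.dtype([('name', 'U10'), ('age', 'i4'), ('score', 'f8')])
students = np.array([
    ('Alice', 20, 95.5),
    ('Bob', 22, 87.3),
    ('Charlie', 21, 91.0),
], dtype=dt)

# Descending by score: argsort ascending, then reverse the index array
by_score_desc = students[np.argsort(students['score'])[::-1]]
print(by_score_desc['name'].tolist())  # ['Alice', 'Charlie', 'Bob']

# In-place alternative: ndarray.sort also accepts order=
students.sort(order='age')
print(students['age'].tolist())  # [20, 21, 22]
```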

Filtering

```python
# Boolean filtering
high_scorers = students[students['score'] > 90]
young_students = students[students['age'] < 22]

# Combined conditions
filtered = students[(students['age'] >= 20) & (students['score'] > 85)]
```


When to Use Structured Arrays

Use Structured Arrays When:

  • Working with large datasets in pure NumPy
  • Need memory-efficient storage
  • Interfacing with C/binary data formats
  • Simple tabular operations without pandas overhead

Use Pandas Instead When:

  • Need advanced data manipulation
  • Working with time series
  • Need missing value handling (NaN)
  • Complex groupby/merge operations

Summary

| Task | Code |
| --- | --- |
| Create dtype | `np.dtype([('name', 'U10'), ('age', 'i4')])` |
| Create array | `np.array([('Alice', 25)], dtype=dt)` |
| Access field | `arr['name']` |
| Access record | `arr[0]` |
| Access nested | `arr['address']['city']` |
| Sort by field | `np.sort(arr, order='age')` |
| Filter | `arr[arr['age'] > 20]` |
| To recarray | `arr.view(np.recarray)` |

Key Takeaways:

  • Structured arrays store heterogeneous data with named fields
  • Access fields by name: arr['fieldname']
  • Use dtype codes: 'i4' (int32), 'f8' (float64), 'U10' (string)
  • Record arrays add attribute-style access: arr.name
  • Can nest structures and include array fields
  • Good for memory-efficient tabular data without pandas
  • Use order parameter for sorting by fields

Exercises

Exercise 1. Write a short code example that demonstrates the main concept covered on this page. Include comments explaining each step.

Solution to Exercise 1

Refer to the code examples in the page content above. A complete solution would recreate the key pattern with clear comments explaining the NumPy operations involved.


Exercise 2. Predict the output of a code snippet that uses the features described on this page. Explain why the output is what it is.

Solution to Exercise 2

The output depends on how NumPy handles the specific operation. Key factors include array shapes, dtypes, and broadcasting rules. Trace through the computation step by step.


Exercise 3. Write a practical function that applies the concepts from this page to solve a real data processing task. Test it with sample data.

Solution to Exercise 3

```python
import numpy as np

# Example: apply the page's concept to process sample data
data = np.random.default_rng(42).random((5, 3))

# Apply the relevant operation
result = data  # replace with actual operation
print(result)
```


Exercise 4. Identify a common mistake when using the features described on this page. Write code that demonstrates the mistake and then show the corrected version.

Solution to Exercise 4

A common mistake is misunderstanding array shapes or dtypes. Always check .shape and .dtype when debugging unexpected results.