Structured Arrays
Structured arrays (closely related to record arrays, which are covered later on this page) store heterogeneous data types in a single array, with each element resembling a row of a database table or spreadsheet.
Mental Model
A structured array is NumPy's version of a database table: each element is a "row" containing named fields of potentially different types (e.g., a string name, an int age, a float score). Access a column by field name (arr['age']), a row by index (arr[0]). For most tabular data work, Pandas is more convenient, but structured arrays shine when you need tight memory control or C-level interop.
At the memory level, a structured array stores different fields in a single contiguous memory block with byte offsets — the dtype acts as a schema that tells NumPy where each field starts within each record.
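As a small illustration of the dtype acting as a schema, the sketch below inspects the byte layout of the three-field dtype used in later examples on this page. The offsets shown assume NumPy's default packed (unaligned) layout.

```python
import numpy as np

# Same three-field schema used in the examples on this page
dt = np.dtype([('name', 'U10'), ('age', 'i4'), ('score', 'f8')])

# Total bytes per record: 10 chars * 4 bytes + 4 + 8
print(dt.itemsize)  # 52

# dt.fields maps each field name to a tuple starting with (dtype, byte offset)
offsets = {name: dt.fields[name][1] for name in dt.names}
print(offsets)  # {'name': 0, 'age': 40, 'score': 44}
```

Each record occupies `itemsize` bytes, and a field's address within record `i` is `i * itemsize + offset`.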
Array of Structs vs Struct of Arrays
There are two ways to store tabular data in memory, and the choice has real performance consequences:
Array of Structs (AoS) — what structured arrays use:
```text
[row1: name, age, score] [row2: name, age, score] ...
```
Each record's fields are contiguous. Good for record-at-a-time access and
C struct interop, but column operations (e.g., arr['score'].mean()) must
skip over interleaved fields, reducing cache efficiency.
Struct of Arrays (SoA) — what separate NumPy arrays / Pandas use:
```text
names:  [name1, name2, ...]
ages:   [age1, age2, ...]
scores: [score1, score2, ...]
```
Each column is contiguous. Vectorized column operations are cache-friendly
and fast, but accessing a full record requires gathering from multiple arrays.
| Concern | AoS (structured) | SoA (separate arrays) |
|---|---|---|
| Record access | Fast (contiguous) | Slow (scattered) |
| Column vectorization | Slower (interleaved) | Fast (contiguous) |
| C interop | Natural (memcpy a struct) | Requires packing |
| Cache locality | Per-record | Per-column |
Rule of thumb: use structured arrays (AoS) for binary I/O and C interop; use separate arrays or Pandas (SoA) for analytical computation.
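The layout difference can be observed through array strides. The following is a minimal sketch, assuming the same (name, age, score) schema used elsewhere on this page; the array sizes are arbitrary.

```python
import numpy as np

dt = np.dtype([('name', 'U10'), ('age', 'i4'), ('score', 'f8')])
aos = np.zeros(1000, dtype=dt)          # array of structs
soa_score = np.zeros(1000, dtype='f8')  # one column of a struct of arrays

# A field of a structured array is a strided view: consecutive scores
# sit one full record (52 bytes) apart.
print(aos['score'].strides)                # (52,)
print(aos['score'].flags['C_CONTIGUOUS'])  # False

# A standalone column is densely packed: 8 bytes between elements.
print(soa_score.strides)                   # (8,)

# Copying the field out produces a contiguous column when that is needed.
score_col = np.ascontiguousarray(aos['score'])
print(score_col.strides)                   # (8,)
```

Whether the strided access matters in practice depends on record size and workload, so it is worth measuring before restructuring data.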
Structured Data Model
```text
A structured array is:
- a contiguous block of memory
- interpreted as records with named fields
- defined by a dtype schema (field names + types + offsets)

It allows:
- heterogeneous data in a single array
- vectorized operations across fields
- memory-efficient tabular storage
- direct interop with C structs and binary formats
```
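For C struct interop, field offsets usually need to match the padding a C compiler would insert. The sketch below compares the default packed layout with `align=True`; the field names are hypothetical, and the aligned sizes depend on the platform.

```python
import numpy as np

fields = [('flag', 'i1'), ('value', 'f8')]

packed = np.dtype(fields)               # default: no padding between fields
aligned = np.dtype(fields, align=True)  # pad fields like a C compiler would

print(packed.itemsize)             # 9
print(packed.fields['value'][1])   # 1 (value starts right after flag)
print(aligned.itemsize)            # typically 16 on 64-bit platforms
print(aligned.fields['value'][1])  # typically 8 on 64-bit platforms
```

When reading a binary format, the layout has to match whatever the writer used, so neither setting is correct in general.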
Decision Guide
| Use case | Tool |
|---|---|
| Memory-critical tabular data | Structured arrays |
| Binary file formats / C interop | Structured arrays |
| Data analysis (groupby, joins, missing values) | Pandas |
| Heavy analytics / exploratory work | Pandas |
Avoid structured arrays for operations that Pandas handles natively (groupby, pivot, merge) — the ergonomic cost is not worth the memory savings.
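To make the binary-format use case concrete, here is a minimal sketch that interprets packed bytes as records. The record layout (a little-endian int32 followed by a float64) is a made-up example, and the buffer is built with Python's `struct` module to stand in for data written by an external program.

```python
import struct
import numpy as np

# Hypothetical record layout: little-endian int32 id followed by float64 value
record_dt = np.dtype([('id', '<i4'), ('value', '<f8')])

# Build a small binary buffer the way an external writer might ('<' = no padding)
buf = struct.pack('<id', 1, 2.5) + struct.pack('<id', 2, 3.5)

# Interpret the bytes in place; frombuffer does not copy the data
records = np.frombuffer(buf, dtype=record_dt)
print(records['id'])     # [1 2]
print(records['value'])  # [2.5 3.5]

# Round trip: the array's raw bytes match the original buffer
assert records.tobytes() == buf
```

The same dtype can be passed to `np.fromfile` and used with `ndarray.tofile` when the records live in a file rather than in memory.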
```python
import numpy as np
```
What are Structured Arrays?
Regular NumPy arrays hold homogeneous data (all same type). Structured arrays hold records with multiple named fields of different types:
```python
# Regular array: all floats
regular = np.array([1.0, 2.0, 3.0])

# Structured array: mixed types
dt = np.dtype([('name', 'U10'), ('age', 'i4'), ('score', 'f8')])
structured = np.array([
    ('Alice', 25, 95.5),
    ('Bob', 30, 87.3),
    ('Charlie', 22, 91.0)
], dtype=dt)
```
Creating Structured Arrays
The dtype is the schema definition for your structured array — it specifies
field names, types, and (implicitly) byte offsets, just as a CREATE TABLE
statement defines column names and types in SQL.
Method 1: dtype with List of Tuples
```python
# Define dtype: (field_name, data_type)
dt = np.dtype([
    ('name', 'U20'),    # Unicode string, max 20 chars
    ('age', 'i4'),      # 32-bit integer
    ('salary', 'f8'),   # 64-bit float
    ('active', '?')     # Boolean
])

# Create array
employees = np.array([
    ('Alice', 30, 75000.0, True),
    ('Bob', 25, 65000.0, True),
    ('Charlie', 35, 85000.0, False)
], dtype=dt)
```
Method 2: Dictionary Format
```python
dt = np.dtype({
    'names': ['x', 'y', 'z'],
    'formats': ['f8', 'f8', 'f8']
})

points = np.array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)], dtype=dt)
```
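The dictionary format also accepts optional `'offsets'` and `'itemsize'` keys, which is useful when a record layout with gaps has to be matched exactly. A minimal sketch with a made-up 16-byte layout:

```python
import numpy as np

# Two float32 fields placed at explicit byte offsets inside a 16-byte record;
# bytes 4-7 and 12-15 are left as padding.
dt = np.dtype({
    'names': ['x', 'y'],
    'formats': ['f4', 'f4'],
    'offsets': [0, 8],
    'itemsize': 16,
})

pts = np.zeros(2, dtype=dt)
pts['x'] = [1.0, 2.0]
pts['y'] = [3.0, 4.0]

print(dt.itemsize)          # 16
print(dt.fields['y'][1])    # 8
print(pts['y'])             # [3. 4.]
```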
Method 3: String Format
```python
# Comma-separated type strings
dt = np.dtype('U10, i4, f8')  # Unnamed fields get default names: f0, f1, f2

data = np.array([('Alice', 25, 95.5)], dtype=dt)
print(data['f0'])  # ['Alice']
```
Data Type Codes
| Code | Type | Example |
|---|---|---|
| 'i4' | 32-bit int | np.int32 |
| 'i8' | 64-bit int | np.int64 |
| 'f4' | 32-bit float | np.float32 |
| 'f8' | 64-bit float | np.float64 |
| 'U10' | Unicode string (10 chars) | np.str_ |
| 'S10' | Byte string (10 bytes) | np.bytes_ |
| '?' | Boolean | np.bool_ |
| 'c16' | Complex 128 | np.complex128 |
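The codes in the table can be checked directly against the NumPy scalar types. This is a quick sanity-check sketch rather than an exhaustive list of codes:

```python
import numpy as np

# Type codes compare equal to the corresponding scalar types
print(np.dtype('i4') == np.int32)        # True
print(np.dtype('f8') == np.float64)      # True
print(np.dtype('c16') == np.complex128)  # True
print(np.dtype('?') == np.bool_)         # True

# Item sizes in bytes: Unicode strings use 4 bytes per character,
# byte strings use 1 byte per character
print(np.dtype('U10').itemsize)  # 40
print(np.dtype('S10').itemsize)  # 10
```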
Accessing Data
By Field Name
```python
dt = np.dtype([('name', 'U10'), ('age', 'i4'), ('score', 'f8')])
students = np.array([
    ('Alice', 20, 95.5),
    ('Bob', 22, 87.3),
    ('Charlie', 21, 91.0)
], dtype=dt)

# Access entire column
print(students['name'])   # ['Alice' 'Bob' 'Charlie']
print(students['age'])    # [20 22 21]
print(students['score'])  # [95.5 87.3 91. ]

# Access single record
print(students[0])          # ('Alice', 20, 95.5)
print(students[0]['name'])  # 'Alice'
```
By Index
```python
# First record
print(students[0])  # ('Alice', 20, 95.5)

# Slice
print(students[:2])  # First two records

# Boolean indexing
adults = students[students['age'] >= 21]
print(adults['name'])  # ['Bob' 'Charlie']
```
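Multiple Fields
```python
# Select multiple fields (returns a structured view of the original array)
subset = students[['name', 'score']]
print(subset.dtype)
# In NumPy >= 1.16 the view keeps the original offsets and record size:
# {'names': ['name', 'score'], 'formats': ['<U10', '<f8'], 'offsets': [0, 44], 'itemsize': 52}
```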
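Because a multi-field selection is a view in NumPy 1.16 and later, it still carries the unused bytes of the dropped fields. The sketch below uses `repack_fields` from `numpy.lib.recfunctions` to obtain a tightly packed copy; the sample data mirrors the examples above.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

dt = np.dtype([('name', 'U10'), ('age', 'i4'), ('score', 'f8')])
students = np.array([('Alice', 20, 95.5), ('Bob', 22, 87.3)], dtype=dt)

# Multi-field indexing returns a view that keeps the original record size
subset = students[['name', 'score']]
print(subset.dtype.itemsize)  # 52 (same as the full record)

# repack_fields copies the data into a layout without the gap left by 'age'
packed = rfn.repack_fields(subset)
print(packed.dtype.itemsize)  # 48 (40 bytes of name + 8 bytes of score)
```

Repacking is mainly worthwhile before writing the subset to disk or handing it to code that expects a packed layout.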
Modifying Data
```python
# Modify single field
students['score'][0] = 98.0

# Modify entire record
students[1] = ('Robert', 23, 90.0)

# Modify column
students['age'] = students['age'] + 1  # Everyone ages by 1
```
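One pitfall when modifying a subset of records is the order of indexing. The sketch below, which rebuilds the sample array so it can run on its own, shows that assigning through a boolean-indexed copy has no effect, while selecting the field first does.

```python
import numpy as np

dt = np.dtype([('name', 'U10'), ('age', 'i4'), ('score', 'f8')])
students = np.array([
    ('Alice', 20, 95.5),
    ('Bob', 22, 87.3),
    ('Charlie', 21, 91.0)
], dtype=dt)

mask = students['age'] >= 21

# No effect: boolean indexing makes a copy first, and the assignment
# then modifies that temporary copy.
students[mask]['score'] = 0.0
unchanged = students['score'].copy()  # still 95.5, 87.3, 91.0

# Works: select the field (a view), then assign through the mask.
students['score'][mask] = 0.0
print(students['score'])  # Bob's and Charlie's scores are now 0.0
```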
Nested Structures
```python
# Nested dtype
address_dt = np.dtype([('city', 'U20'), ('zip', 'U10')])
person_dt = np.dtype([
    ('name', 'U20'),
    ('address', address_dt)
])

people = np.array([
    ('Alice', ('New York', '10001')),
    ('Bob', ('Los Angeles', '90001'))
], dtype=person_dt)

# Access nested fields
print(people['address']['city'])  # ['New York' 'Los Angeles']
```
Array Fields
```python
# Field that is itself an array
dt = np.dtype([
    ('name', 'U10'),
    ('grades', 'f8', (3,))  # Array of 3 floats
])

students = np.array([
    ('Alice', [95, 87, 92]),
    ('Bob', [88, 91, 85])
], dtype=dt)

print(students['grades'])
# [[95. 87. 92.]
#  [88. 91. 85.]]

print(students[0]['grades'].mean())  # approximately 91.33
```
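Since an array field behaves like an ordinary 2-D array once selected, axis-based reductions work on it directly. A short sketch using the same two-student data:

```python
import numpy as np

dt = np.dtype([('name', 'U10'), ('grades', 'f8', (3,))])
students = np.array([
    ('Alice', [95, 87, 92]),
    ('Bob', [88, 91, 85])
], dtype=dt)

# Two records with three grades each give a (2, 3) array
print(students['grades'].shape)  # (2, 3)

# Mean per student (across each row of grades): about 91.33 and 88.0
per_student = students['grades'].mean(axis=1)

# Mean per assignment (down each column): 91.5, 89.0, 88.5
per_assignment = students['grades'].mean(axis=0)
```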
Record Arrays (recarray)
A recarray is syntactic sugar over a structured array — the underlying data
and dtype are identical, but you can write rec.name instead of rec['name'].
The convenience comes at a small performance cost (attribute access is slower than
item access), so use recarrays for interactive exploration and structured arrays
for production code.
Record arrays allow attribute-style access:
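```python
# Start from a structured array with name, age, and score fields
dt = np.dtype([('name', 'U10'), ('age', 'i4'), ('score', 'f8')])
students = np.array([('Alice', 20, 95.5), ('Bob', 22, 87.3)], dtype=dt)

# Convert structured array to recarray
rec = students.view(np.recarray)

# Attribute access (instead of indexing)
print(rec.name)     # ['Alice' 'Bob']
print(rec.age)      # [20 22]
print(rec[0].name)  # 'Alice'

# Create recarray directly
rec = np.rec.array([
    ('Alice', 25, 95.5),
    ('Bob', 30, 87.3)
], dtype=[('name', 'U10'), ('age', 'i4'), ('score', 'f8')])
```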
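Attribute access has one caveat worth knowing: if a field shares its name with an existing ndarray attribute, the attribute wins. The sketch below uses a hypothetical field called `size` to show the collision; item access still reaches the field.

```python
import numpy as np

# Hypothetical schema whose field name 'size' collides with ndarray.size
dt = np.dtype([('label', 'U5'), ('size', 'i4')])
rec = np.rec.array([('a', 10), ('b', 20)], dtype=dt)

print(rec.label)    # ['a' 'b']  (no collision, attribute access works)
print(rec.size)     # 2          (the ndarray attribute, not the field)
print(rec['size'])  # [10 20]    (item access always reaches the field)
```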
Practical Examples
CSV-like Data
```python
# Load CSV-like data
dt = np.dtype([
    ('id', 'i4'),
    ('product', 'U30'),
    ('price', 'f8'),
    ('quantity', 'i4')
])

inventory = np.array([
    (1, 'Widget', 9.99, 100),
    (2, 'Gadget', 24.99, 50),
    (3, 'Gizmo', 14.99, 75)
], dtype=dt)

# Calculate total value
total_value = (inventory['price'] * inventory['quantity']).sum()
print(f"Total inventory value: ${total_value:.2f}")
```
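The example above builds the array from literals. As a sketch of reading the same records from actual CSV text, the code below uses `np.genfromtxt` with the structured dtype; the in-memory `StringIO` object is a stand-in for a file on disk, and its contents are made up to match the inventory above.

```python
from io import StringIO
import numpy as np

# Hypothetical CSV content standing in for a file on disk
csv_text = StringIO(
    "id,product,price,quantity\n"
    "1,Widget,9.99,100\n"
    "2,Gadget,24.99,50\n"
    "3,Gizmo,14.99,75\n"
)

dt = np.dtype([('id', 'i4'), ('product', 'U30'),
               ('price', 'f8'), ('quantity', 'i4')])

# skip_header=1 drops the header row; dtype supplies field names and types
inventory = np.genfromtxt(csv_text, delimiter=',', skip_header=1,
                          dtype=dt, encoding='utf-8')

print(inventory['product'])  # ['Widget' 'Gadget' 'Gizmo']
total_value = (inventory['price'] * inventory['quantity']).sum()
print(f"Total inventory value: ${total_value:.2f}")
```

For files with quoting, missing values, or mixed encodings, pandas' CSV reader is usually the more robust choice.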
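Sorting Structured Arrays
```python
# `students` here is the (name, age, score) array from "Accessing Data"

# Sort by single field
sorted_by_score = np.sort(students, order='score')

# Sort by multiple fields (age first, ties broken by score)
sorted_multi = np.sort(students, order=['age', 'score'])
```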
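`np.sort` with `order` always sorts ascending and returns a copy. The sketch below shows two common variations, assuming the same sample data: a descending order obtained by reversing `np.argsort`, and an in-place sort with `ndarray.sort`.

```python
import numpy as np

dt = np.dtype([('name', 'U10'), ('age', 'i4'), ('score', 'f8')])
students = np.array([
    ('Alice', 20, 95.5),
    ('Bob', 22, 87.3),
    ('Charlie', 21, 91.0)
], dtype=dt)

# argsort returns the ordering as indices; reversing it gives descending order
order_desc = np.argsort(students, order='score')[::-1]
by_score_desc = students[order_desc]
print(by_score_desc['name'])  # ['Alice' 'Charlie' 'Bob']

# ndarray.sort sorts in place instead of returning a copy
students.sort(order='age')
print(students['name'])  # ['Alice' 'Charlie' 'Bob']
```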
Filtering
```python
# `students` here is the (name, age, score) array from "Accessing Data"

# Boolean filtering
high_scorers = students[students['score'] > 90]
young_students = students[students['age'] < 22]

# Combined conditions
filtered = students[(students['age'] >= 20) & (students['score'] > 85)]
```
When to Use Structured Arrays
Use Structured Arrays When:
- Working with large datasets in pure NumPy
- Need memory-efficient storage
- Interfacing with C/binary data formats
- Simple tabular operations without pandas overhead
Use Pandas Instead When:
- Need advanced data manipulation
- Working with time series
- Need missing value handling (NaN)
- Complex groupby/merge operations
Summary
| Task | Code |
|---|---|
| Create dtype | np.dtype([('name', 'U10'), ('age', 'i4')]) |
| Create array | np.array([('Alice', 25)], dtype=dt) |
| Access field | arr['name'] |
| Access record | arr[0] |
| Access nested | arr['address']['city'] |
| Sort by field | np.sort(arr, order='age') |
| Filter | arr[arr['age'] > 20] |
| To recarray | arr.view(np.recarray) |
Key Takeaways:
- Structured arrays store heterogeneous data with named fields
- Access fields by name: arr['fieldname']
- Use dtype codes: 'i4' (int32), 'f8' (float64), 'U10' (string)
- Record arrays add attribute-style access: arr.name
- Can nest structures and include array fields
- Good for memory-efficient tabular data without pandas
- Use the order parameter for sorting by fields
Exercises
Exercise 1. Write a short code example that demonstrates the main concept covered on this page. Include comments explaining each step.
Solution to Exercise 1
Refer to the code examples in the page content above. A complete solution would recreate the key pattern with clear comments explaining the NumPy operations involved.
Exercise 2. Predict the output of a code snippet that uses the features described on this page. Explain why the output is what it is.
Solution to Exercise 2
The output depends on how NumPy handles the specific operation. Key factors include array shapes, dtypes, and broadcasting rules. Trace through the computation step by step.
Exercise 3. Write a practical function that applies the concepts from this page to solve a real data processing task. Test it with sample data.
Solution to Exercise 3
```python
import numpy as np

# Example: apply the page's concept to process sample data
data = np.random.default_rng(42).random((5, 3))

# Apply the relevant operation
result = data  # replace with actual operation
print(result)
```
Exercise 4. Identify a common mistake when using the features described on this page. Write code that demonstrates the mistake and then show the corrected version.
Solution to Exercise 4
A common mistake is misunderstanding array shapes or dtypes. Always check .shape and .dtype when debugging unexpected results.
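One possible illustration of such a mistake, offered as a sketch rather than the intended solution: a string field declared too narrow silently truncates longer values. The field widths and names below are made up for the example, and the printed dtype assumes a little-endian machine.

```python
import numpy as np

# Mistake: the string field is too narrow, so long names are cut off silently
narrow = np.dtype([('name', 'U5'), ('age', 'i4')])
truncated = np.array([('Alexandria', 30), ('Bob', 25)], dtype=narrow)
print(truncated['name'])  # ['Alexa' 'Bob']

# Checking shape and dtype makes the problem visible
print(truncated.shape)  # (2,)  one dimension of records, not rows x columns
print(truncated.dtype)  # [('name', '<U5'), ('age', '<i4')]

# Corrected version: size the field for the longest expected value
wide = np.dtype([('name', 'U20'), ('age', 'i4')])
fixed = np.array([('Alexandria', 30), ('Bob', 25)], dtype=wide)
print(fixed['name'])  # ['Alexandria' 'Bob']
```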