mirror of https://github.com/aljazceru/Tutorial-Codebase-Knowledge.git
synced 2025-12-19 07:24:20 +01:00
Commit: init push

docs/NumPy Core/01_ndarray__n_dimensional_array_.md (new file, 242 lines)

# Chapter 1: ndarray (N-dimensional array)

Welcome to the NumPy Core tutorial! If you're interested in how NumPy works under the hood, you're in the right place. NumPy is the foundation for scientific computing in Python, and its core strength comes from a special object called the `ndarray`.

Imagine you have a huge list of numbers, maybe temperatures recorded every second for a year, or the pixel values of a large image. Doing math with standard Python lists can be quite slow for these large datasets. This is the problem NumPy, and specifically the `ndarray`, is designed to solve.

## What is an ndarray?

Think of an `ndarray` (which stands for N-dimensional array) as a powerful grid or table designed to hold items **of the same type**, usually numbers (like integers or decimals). It's the fundamental building block of NumPy.

* **Grid:** It can be a simple list (1-dimensional), a table with rows and columns (2-dimensional), or even have more dimensions (3D, 4D, ... N-D).
* **Same Type:** This is key! Unlike Python lists that can hold anything (numbers, strings, objects), NumPy arrays require all elements to be of the *same data type* (e.g., all 32-bit integers or all 64-bit floating-point numbers). This restriction allows NumPy to store and operate on the data extremely efficiently. We'll explore data types more in [Chapter 2: dtype (Data Type Object)](02_dtype__data_type_object_.md).

Analogy: Think of a Python list as a drawer where you can throw anything in – socks, books, tools. An `ndarray` is like a specialized toolbox or an egg carton – designed to hold only specific things (only tools, only eggs) in an organized way. This organization makes it much faster to work with.

Here's a quick peek at what different dimensional arrays look like conceptually.

A 1-D array (a simple sequence of elements):

```mermaid
flowchart LR
    A[0] --> B[1] --> C[2] --> D[3]
```

A 2-D array (rows and columns):

```mermaid
flowchart LR
    subgraph Row 1
        R1C1[ R1C1 ] --> R1C2[ R1C2 ] --> R1C3[ R1C3 ]
    end

    subgraph Row 2
        R2C1[ R2C1 ] --> R2C2[ R2C2 ] --> R2C3[ R2C3 ]
    end

    R1C1 -.-> R2C1
    R1C2 -.-> R2C2
    R1C3 -.-> R2C3
```

A 3-D array (layers of 2-D grids):

```mermaid
flowchart LR
    subgraph Layer 1
        L1R1C1[ L1R1C1 ] --> L1R1C2[ L1R1C2 ]
        L1R2C1[ L1R2C1 ] --> L1R2C2[ L1R2C2 ]
        L1R1C1 -.-> L1R2C1
        L1R1C2 -.-> L1R2C2
    end

    subgraph Layer 2
        L2R1C1[ L2R1C1 ] --> L2R1C2[ L2R1C2 ]
        L2R2C1[ L2R2C1 ] --> L2R2C2[ L2R2C2 ]
        L2R1C1 -.-> L2R2C1
        L2R1C2 -.-> L2R2C2
    end

    L1R1C1 --- L2R1C1
    L1R1C2 --- L2R1C2
    L1R2C1 --- L2R2C1
    L1R2C2 --- L2R2C2
```

## Why ndarrays? The Magic of Vectorization

Let's say you have two lists of numbers and you want to add them element by element. In standard Python, you'd use a loop:

```python
# Using standard Python lists
list1 = [1, 2, 3, 4]
list2 = [5, 6, 7, 8]
result = []
for i in range(len(list1)):
    result.append(list1[i] + list2[i])

print(result)
# Output: [6, 8, 10, 12]
```

This works, but for millions of numbers, this Python loop becomes slow.

Now, see how you do it with NumPy ndarrays:

```python
import numpy as np  # Standard way to import NumPy

array1 = np.array([1, 2, 3, 4])
array2 = np.array([5, 6, 7, 8])

# Add the arrays directly!
result_array = array1 + array2

print(result_array)
# Output: [ 6  8 10 12]
```

Notice how we just used `+` directly on the arrays? This is called **vectorization**. You write the operation as if you're working on single values, but NumPy applies it to *all* elements automatically.

**Why is this better?**

1. **Speed:** The looping happens behind the scenes in highly optimized C code, which is *much* faster than a Python loop.
2. **Readability:** The code is cleaner and looks more like standard mathematical notation.

This ability to perform operations on entire arrays at once is a core reason why NumPy is so powerful and widely used.
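
A quick way to convince yourself of the speed claim is to time both approaches. The snippet below is an illustrative sketch rather than part of the original chapter; exact timings depend on your machine and NumPy version, but the vectorized version is typically orders of magnitude faster.

```python
import time
import numpy as np

n = 1_000_000
list1, list2 = list(range(n)), list(range(n))
array1, array2 = np.arange(n), np.arange(n)

# Pure-Python loop
start = time.perf_counter()
result = []
for i in range(n):
    result.append(list1[i] + list2[i])
print(f"Python loop: {time.perf_counter() - start:.4f} s")

# Vectorized NumPy addition
start = time.perf_counter()
result_array = array1 + array2
print(f"NumPy add:   {time.perf_counter() - start:.4f} s")
```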

## Creating Your First ndarrays

Let's create some arrays. First, we always import NumPy, usually as `np`:

```python
import numpy as np
```

**1. From Python Lists:** The most common way is using `np.array()`:

```python
# Create a 1-dimensional array (vector)
my_list = [10, 20, 30]
arr1d = np.array(my_list)
print(arr1d)
# Output: [10 20 30]

# Create a 2-dimensional array (matrix/table)
my_nested_list = [[1, 2, 3], [4, 5, 6]]
arr2d = np.array(my_nested_list)
print(arr2d)
# Output:
# [[1 2 3]
#  [4 5 6]]
```

`np.array()` takes your list (or list of lists) and converts it into an ndarray. NumPy tries to figure out the best data type automatically.
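
As a quick illustration of that automatic choice (a supplementary sketch; the exact integer dtype NumPy picks is platform dependent):

```python
import numpy as np

# All inputs are Python ints -> NumPy picks an integer dtype (e.g. int64)
print(np.array([1, 2, 3]).dtype)

# One float in the mix -> the whole array becomes floating point
print(np.array([1, 2, 3.0]).dtype)   # float64
```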

**2. Arrays of Zeros or Ones:** Often useful as placeholders.

```python
# Create an array of shape (2, 3) filled with zeros
zeros_arr = np.zeros((2, 3))
print(zeros_arr)
# Output:
# [[0. 0. 0.]
#  [0. 0. 0.]]

# Create an array of shape (3,) filled with ones
ones_arr = np.ones(3)
print(ones_arr)
# Output: [1. 1. 1.]
```

Notice we pass a tuple like `(2, 3)` to specify the desired shape. By default, these are filled with 64-bit floating-point numbers (`float64`).

**3. Using `np.arange`:** Similar to Python's `range`.

```python
# Create an array with numbers from 0 up to (but not including) 5
range_arr = np.arange(5)
print(range_arr)
# Output: [0 1 2 3 4]
```

There are many other ways to create arrays, but these are fundamental.
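
Two of those other constructors you are likely to meet early on are `np.linspace` and `np.full` (a brief, supplementary sketch):

```python
import numpy as np

# Evenly spaced values between two endpoints (endpoint included by default)
print(np.linspace(0.0, 1.0, 5))
# [0.   0.25 0.5  0.75 1.  ]

# An array of a given shape filled with one constant value
print(np.full((2, 2), 7))
# [[7 7]
#  [7 7]]
```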

## Exploring Your ndarray: Basic Attributes

Once you have an array, you can easily check its properties:

```python
arr = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

# 1. Shape: The size of each dimension
print(f"Shape: {arr.shape}")
# Output: Shape: (2, 3)   (2 rows, 3 columns)

# 2. Number of Dimensions (ndim): How many axes it has
print(f"Dimensions: {arr.ndim}")
# Output: Dimensions: 2

# 3. Size: Total number of elements
print(f"Size: {arr.size}")
# Output: Size: 6

# 4. Data Type (dtype): The type of elements in the array
print(f"Data Type: {arr.dtype}")
# Output: Data Type: float64
```

These attributes are crucial for understanding the structure of your data. The `dtype` tells you what kind of data is stored (e.g., `int32`, `float64`, `bool`). We'll dive much deeper into this in [Chapter 2: dtype (Data Type Object)](02_dtype__data_type_object_.md).

## A Glimpse Under the Hood

So, how does NumPy achieve its speed? The `ndarray` you manipulate in Python is actually a clever wrapper around a highly efficient data structure implemented in the **C programming language**.

When you perform an operation like `array1 + array2`, Python doesn't slowly loop through the elements. Instead, NumPy:

1. Checks if the operation is valid (e.g., arrays are compatible).
2. Hands off the arrays and the operation (`+` in this case) to its underlying C code.
3. The C code, which is pre-compiled and highly optimized for your processor, performs the addition very rapidly across the entire block of memory holding the array data.
4. The result (another block of memory) is then wrapped back into a new Python `ndarray` object for you to use.

Here's a simplified view of what happens when you call `np.array()`:

```mermaid
sequenceDiagram
    participant P as Python Code (Your script)
    participant NPF as NumPy Python Function (e.g., np.array)
    participant CF as C Function (in _multiarray_umath)
    participant M as Memory

    P->>NPF: np.array([1, 2, 3])
    NPF->>CF: Call C implementation with list data
    CF->>M: Allocate contiguous memory block
    CF->>M: Copy data [1, 2, 3] into block
    CF-->>NPF: Return C-level ndarray structure pointing to memory
    NPF-->>P: Return Python ndarray object wrapping the C structure
```

The core implementation lives within compiled C extension modules, primarily `_multiarray_umath`. Python files like `numpy/core/multiarray.py` and `numpy/core/numeric.py` provide the convenient Python functions (`np.array`, `np.zeros`, etc.) that eventually call this fast C code. You can see how `numeric.py` imports functions from `multiarray`:

```python
# From numpy/core/numeric.py - Simplified
from . import multiarray
from .multiarray import (
    arange, array, asarray, asanyarray,  # <-- names re-exported for Python users
    empty, empty_like, zeros             # <-- More functions
    # ... many others ...
)

# The `array` function seen in multiarray.py is often a wrapper
# that calls the actual C implementation.
```

This setup gives you the ease of Python with the speed of C. The `ndarray` object itself stores metadata (like shape, dtype, strides) and a pointer to the actual raw data block in memory. We will see more details about the Python modules involved in [Chapter 6: multiarray Module](06_multiarray_module.md) and [Chapter 7: umath Module](07_umath_module.md).
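
You can peek at some of that metadata directly from Python (an illustrative sketch; the data pointer value is a machine-specific integer that changes on every run):

```python
import numpy as np

arr = np.zeros((2, 3), dtype=np.float64)

# Metadata held by the ndarray object
print(arr.shape, arr.dtype, arr.strides)   # (2, 3) float64 (24, 8)

# Address of the underlying raw data buffer (a plain integer)
print(arr.ctypes.data)
```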

## Conclusion

You've met the `ndarray`, the heart of NumPy! You learned:

* It's a powerful, efficient grid for storing elements of the **same type**.
* It enables **vectorization**, allowing fast operations on entire arrays without explicit Python loops.
* How to create basic arrays using `np.array`, `np.zeros`, `np.ones`, and `np.arange`.
* How to check key properties like `shape`, `ndim`, `size`, and `dtype`.
* That the speed comes from an underlying **C implementation**.

The `ndarray` is the container. Now, let's look more closely at *what* it contains – the different types of data it can hold.

Ready to learn about data types? Let's move on to [Chapter 2: dtype (Data Type Object)](02_dtype__data_type_object_.md).

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

docs/NumPy Core/02_dtype__data_type_object_.md (new file, 213 lines)

# Chapter 2: dtype (Data Type Object)

In [Chapter 1: ndarray (N-dimensional array)](01_ndarray__n_dimensional_array_.md), we learned that NumPy's `ndarray` is a powerful grid designed to hold items **of the same type**. This "same type" requirement is fundamental to NumPy's speed and efficiency. But how does NumPy know *what kind* of data it's storing? That's where the `dtype` comes in!

## What Problem Does `dtype` Solve?

Imagine you have a list of numbers in Python: `[1, 2, 3]`. Are these small integers? Big integers? Numbers with decimal points? Python figures this out on the fly, which is flexible but can be slow for large datasets.

NumPy needs to be much faster. To achieve speed, it needs to know *exactly* what kind of data is in an array *before* doing any calculations. Is it a tiny integer that fits in 1 byte? A standard integer using 4 bytes? A decimal number needing 8 bytes?

Knowing the exact type and size allows NumPy to:

1. **Allocate Memory Efficiently:** If you have a million small integers, NumPy can reserve exactly the right amount of memory, not wasting space.
2. **Perform Fast Math:** NumPy can use highly optimized, low-level C or Fortran code that works directly with specific number types (like 32-bit integers or 64-bit floats). These low-level operations are much faster than Python's flexible number handling.

Think of it like packing boxes. If you know you're only packing small screws (like `int8`), you can use small, efficiently packed boxes. If you're packing large bolts (`int64`), you need bigger boxes. If you just have a mixed bag (like a Python list), you need a much larger, less efficient container to hold everything. The `dtype` is the label on the box telling NumPy exactly what's inside.
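
The memory point is easy to check with the `ndarray.nbytes` attribute (a small illustrative sketch):

```python
import numpy as np

a_int8  = np.ones(1_000_000, dtype=np.int8)    # 1 byte per element
a_int64 = np.ones(1_000_000, dtype=np.int64)   # 8 bytes per element

print(a_int8.nbytes)    # 1000000  (~1 MB)
print(a_int64.nbytes)   # 8000000  (~8 MB)
```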

## What is a `dtype` (Data Type Object)?

A `dtype` is a special **object** in NumPy that describes the **type** and **size** of data stored in an `ndarray`. Every `ndarray` has a `dtype` associated with it.

It's like specifying the "column type" in a database or spreadsheet. If you set a column to "Integer", you expect only whole numbers in that column. If you set it to "Decimal", you expect numbers with potential decimal points. Similarly, the `dtype` ensures all elements in a NumPy array are consistent.

Let's see it in action. Remember from Chapter 1 how we could check the attributes of an array?

```python
import numpy as np

# Create an array of integers
int_array = np.array([1, 2, 3])
print(f"Integer array: {int_array}")
print(f"Data type: {int_array.dtype}")

# Create an array of floating-point numbers (decimals)
float_array = np.array([1.0, 2.5, 3.14])
print(f"\nFloat array: {float_array}")
print(f"Data type: {float_array.dtype}")

# Create an array of booleans (True/False)
bool_array = np.array([True, False, True])
print(f"\nBoolean array: {bool_array}")
print(f"Data type: {bool_array.dtype}")
```

**Output:**

```
Integer array: [1 2 3]
Data type: int64

Float array: [1.   2.5  3.14]
Data type: float64

Boolean array: [ True False  True]
Data type: bool
```

Look at the `Data type:` lines.

* For `int_array`, NumPy chose `int64`. This means each element is a 64-bit signed integer (a whole number that can be positive or negative, stored using 64 bits or 8 bytes). The `64` tells us the size.
* For `float_array`, NumPy chose `float64`. Each element is a 64-bit floating-point number (a number with a potential decimal point, following the standard IEEE 754 format, stored using 64 bits or 8 bytes).
* For `bool_array`, NumPy chose `bool`. Each element is a boolean value (True or False), typically stored using 1 byte.

The `dtype` object holds this crucial information.

## Specifying the `dtype`

NumPy usually makes a good guess about the `dtype` when you create an array from a list. But sometimes you need to be explicit, especially if you want to save memory or ensure a specific precision.

You can specify the `dtype` when creating an array using the `dtype` argument:

```python
import numpy as np

# Create an array, specifying 32-bit integers
arr_i32 = np.array([1, 2, 3], dtype=np.int32)
print(f"Array: {arr_i32}")
print(f"Data type: {arr_i32.dtype}")
print(f"Bytes per element: {arr_i32.itemsize}")  # itemsize shows bytes

# Create an array, specifying 32-bit floats
arr_f32 = np.array([1, 2, 3], dtype=np.float32)
print(f"\nArray: {arr_f32}")  # Notice the decimal points now!
print(f"Data type: {arr_f32.dtype}")
print(f"Bytes per element: {arr_f32.itemsize}")

# Create an array using string codes for dtype
arr_f64_str = np.array([4, 5, 6], dtype='float64')  # Equivalent to np.float64
print(f"\nArray: {arr_f64_str}")
print(f"Data type: {arr_f64_str.dtype}")
print(f"Bytes per element: {arr_f64_str.itemsize}")
```

**Output:**

```
Array: [1 2 3]
Data type: int32
Bytes per element: 4

Array: [1. 2. 3.]
Data type: float32
Bytes per element: 4

Array: [4. 5. 6.]
Data type: float64
Bytes per element: 8
```

Notice a few things:

1. We used `np.int32` and `np.float32` to explicitly ask for 32-bit types.
2. The `.itemsize` attribute shows how many *bytes* each element takes. `int32` and `float32` use 4 bytes, while `float64` uses 8 bytes. Choosing `int32` instead of the default `int64` uses half the memory!
3. You can use string codes like `'float64'` (or `'f8'`) instead of the type object `np.float64`.

### Common Data Type Codes

NumPy offers various ways to specify dtypes. Here are the most common:

| Type Category | NumPy Type Objects | String Codes (Common) | Description |
| :----------------- | :------------------------- | :-------------------- | :-------------------------------- |
| **Boolean** | `np.bool_` | `'?'` or `'bool'` | True / False |
| **Signed Integer** | `np.int8`, `np.int16`, `np.int32`, `np.int64` | `'i1'`, `'i2'`, `'i4'`, `'i8'` | Whole numbers (positive/negative) |
| **Unsigned Int** | `np.uint8`, `np.uint16`, `np.uint32`, `np.uint64` | `'u1'`, `'u2'`, `'u4'`, `'u8'` | Whole numbers (non-negative) |
| **Floating Point** | `np.float16`, `np.float32`, `np.float64` | `'f2'`, `'f4'`, `'f8'` | Decimal numbers |
| **Complex Float** | `np.complex64`, `np.complex128` | `'c8'`, `'c16'` | Complex numbers (real+imaginary) |
| **String (Fixed)** | `np.bytes_` | `'S'` + number | Fixed-length byte strings |
| **Unicode (Fixed)** | `np.str_` | `'U'` + number | Fixed-length unicode strings |
| **Object** | `np.object_` | `'O'` | Python objects |
| **Datetime** | `np.datetime64` | `'M8'` + unit | Date and time values |
| **Timedelta** | `np.timedelta64` | `'m8'` + unit | Time durations |

* The numbers in the string codes (`i4`, `f8`, `u2`) usually represent the number of **bytes**. So `i4` = 4-byte integer (`int32`), `f8` = 8-byte float (`float64`).
* `'S'` and `'U'` often need a number after them (e.g., `'S10'`, `'U25'`) to specify the maximum length of the string.
* `'M8'` and `'m8'` usually have a unit like `[D]` for day or `[s]` for second (e.g., `'M8[D]'`); a short example follows this list. We'll explore numeric types more in [Chapter 4: Numeric Types (`numerictypes`)](04_numeric_types___numerictypes__.md).
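
Here is a brief sketch of those string-based codes in action:

```python
import numpy as np

# Fixed-length byte strings: 'S5' stores at most 5 bytes per element
names = np.array([b'Alice', b'Bob'], dtype='S5')
print(names.dtype)           # |S5

# Datetimes with day resolution: 'M8[D]' (same as 'datetime64[D]')
dates = np.array(['2024-01-01', '2024-06-15'], dtype='M8[D]')
print(dates.dtype)           # datetime64[D]
print(dates[1] - dates[0])   # 166 days
```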

Using explicit dtypes is important when:

* You need to control memory usage (e.g., using `int8` if your numbers are always small).
* You are reading data from a file that has a specific binary format.
* You need a specific precision for calculations.

## A Glimpse Under the Hood

How does NumPy manage this `dtype` information internally?

The Python `dtype` object you interact with (like `arr.dtype`) is essentially a wrapper around more detailed information stored in a C structure within NumPy's core. This C structure (often referred to as `PyArray_Descr`) contains everything NumPy needs to know to interpret the raw bytes in the `ndarray`'s memory block:

1. **Type Kind:** Is it an integer, float, boolean, string, etc.? (Represented by a character like `'i'`, `'f'`, `'b'`, `'S'`.)
2. **Item Size:** How many bytes does one element occupy? (e.g., 1, 2, 4, 8).
3. **Byte Order:** How are multi-byte numbers stored? (Little-endian `<` or big-endian `>`. Important for reading files created on different types of computers.)
4. **Type-Specific Functions:** Pointers to the C-level functions that know how to operate on this data type.
5. **Fields (for Structured Types):** If it's a structured dtype (like a C struct or a database row), information about the names, dtypes, and offsets of each field.
6. **Subarray (for Nested Types):** Information if the dtype itself represents an array.

When you create an array or perform an operation:

```mermaid
sequenceDiagram
    participant P as Python Code (Your script)
    participant NPF as NumPy Python Func (e.g., np.array)
    participant C_API as NumPy C API
    participant DTypeC as C Struct (PyArray_Descr)
    participant Mem as Memory

    P->>NPF: np.array([1, 2], dtype='int32')
    NPF->>C_API: Parse dtype string 'int32'
    C_API->>DTypeC: Create/Find PyArray_Descr for int32 (kind='i', itemsize=4, etc.)
    C_API->>Mem: Allocate memory (2 items * 4 bytes/item = 8 bytes)
    C_API->>Mem: Copy data [1, 2] into memory as 32-bit ints
    C_API-->>NPF: Return C ndarray struct (pointing to Mem and DTypeC)
    NPF-->>P: Return Python ndarray object wrapping the C struct
```

The `dtype` is created or retrieved *once* and then referenced by potentially many arrays. This C-level description allows NumPy's core functions, especially the [ufunc (Universal Function)](03_ufunc__universal_function_.md)s we'll see next, to work directly on the raw memory with maximum efficiency.

The Python code in `numpy/core/_dtype.py` helps manage the creation and representation (like the nice string output you see when you `print(arr.dtype)`) of these `dtype` objects in Python. For instance, functions like `_kind_name`, `__str__`, and `__repr__` in `_dtype.py` generate the user-friendly names and representations based on the underlying C structure's information. The `_dtype_ctypes.py` file helps bridge the gap between NumPy dtypes and Python's built-in `ctypes` module, allowing interoperability.

## Beyond Simple Numbers: Structured Data and Byte Order

`dtype`s can do more than just describe simple numbers:

* **Structured Arrays:** You can define a `dtype` that represents a mix of types, like a row in a table or a C struct. This is useful for representing structured data efficiently.

  ```python
  # Define a structured dtype: a name (up to 10 bytes) and an age (4-byte int)
  person_dtype = np.dtype([('name', 'S10'), ('age', 'i4')])
  people = np.array([('Alice', 30), ('Bob', 25)], dtype=person_dtype)

  print(people)
  print(people.dtype)
  print(people[0]['name'])  # Access fields by name
  ```

  **Output:**

  ```
  [(b'Alice', 30) (b'Bob', 25)]
  [('name', 'S10'), ('age', '<i4')]
  b'Alice'
  ```

* **Byte Order:** Computers can store multi-byte numbers in different ways ("endianness"). `dtype`s can specify byte order (`<` for little-endian, `>` for big-endian), which is crucial for reading binary data correctly across different systems. Notice the `'<i4'` in the output above – the `<` indicates little-endian, which is common on x86 processors. A tiny demonstration follows below.
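
To make the byte-order point concrete, here is a small illustrative sketch:

```python
import numpy as np

little = np.dtype('<i4')   # 32-bit int, little-endian layout
big    = np.dtype('>i4')   # 32-bit int, big-endian layout

print(little == big)                  # False: same kind and size, different layout
print(little.newbyteorder() == big)   # True: swapping '<' gives '>'

# Interpreting the same 4 raw bytes with the wrong byte order gives a different value
raw = np.array([1], dtype='<i4').tobytes()
print(np.frombuffer(raw, dtype='<i4'))   # [1]
print(np.frombuffer(raw, dtype='>i4'))   # [16777216]
```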

## Conclusion

You've now learned about the `dtype` object, the crucial piece of metadata that tells NumPy *what kind* of data is stored in an `ndarray`. You saw:

* `dtype` describes the **type** and **size** of array elements.
* It's essential for NumPy's **memory efficiency** and **computational speed**.
* How to **inspect** (`arr.dtype`) and **specify** (`dtype=...`) data types using type objects (`np.int32`) or string codes (`'i4'`).
* That the Python `dtype` object represents lower-level C information (`PyArray_Descr`) used for efficient operations.
* `dtype`s can also handle more complex scenarios like **structured data** and **byte order**.

Understanding `dtype`s is key to understanding how NumPy manages data efficiently. With the container (`ndarray`) and its contents (`dtype`) defined, we can now explore how NumPy performs fast calculations on these arrays.

Next up, we'll dive into the workhorses of NumPy's element-wise computations: [Chapter 3: ufunc (Universal Function)](03_ufunc__universal_function_.md).

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

docs/NumPy Core/03_ufunc__universal_function_.md (new file, 185 lines)

# Chapter 3: ufunc (Universal Function)

Welcome back! In [Chapter 1: ndarray (N-dimensional array)](01_ndarray__n_dimensional_array_.md), we met the `ndarray`, NumPy's powerful container for numerical data. In [Chapter 2: dtype (Data Type Object)](02_dtype__data_type_object_.md), we learned how `dtype`s specify the exact *kind* of data stored within those arrays.

Now, let's tackle a fundamental question: How does NumPy actually *perform calculations* on these arrays so quickly? If you have two large arrays, `a` and `b`, why is `a + b` massively faster than using a Python `for` loop? The answer lies in a special type of function: the **ufunc**.

## What Problem Do ufuncs Solve? Speeding Up Element-wise Math

Imagine you have temperature readings from a sensor stored in a NumPy array, and you need to convert them from Celsius to Fahrenheit. The formula is `F = C * 9/5 + 32`.

With standard Python lists, you'd loop through each temperature:

```python
# Celsius temperatures in a Python list
celsius_list = [0.0, 10.0, 20.0, 30.0, 100.0]
fahrenheit_list = []

# Python loop for conversion
for temp_c in celsius_list:
    temp_f = temp_c * (9/5) + 32
    fahrenheit_list.append(temp_f)

print(fahrenheit_list)
# Output: [32.0, 50.0, 68.0, 86.0, 212.0]
```

This works, but as we saw in Chapter 1, Python loops are relatively slow, especially for millions of data points.

NumPy offers a much faster way using its `ndarray` and vectorized operations:

```python
import numpy as np

# Celsius temperatures in a NumPy array
celsius_array = np.array([0.0, 10.0, 20.0, 30.0, 100.0])

# NumPy vectorized conversion - NO explicit Python loop!
fahrenheit_array = celsius_array * (9/5) + 32

print(fahrenheit_array)
# Output: [ 32.  50.  68.  86. 212.]
```

Look how clean that is! We just wrote the math formula directly using the array. But *how* does NumPy execute `*`, `/`, and `+` so efficiently on *every element* without a visible loop? This magic is powered by ufuncs.

## What is a ufunc (Universal Function)?

A **ufunc** (Universal Function) is a special type of function in NumPy designed to operate on `ndarray`s **element by element**. Think of them as super-powered mathematical functions specifically built for NumPy arrays.

Examples include `np.add`, `np.subtract`, `np.multiply`, `np.sin`, `np.cos`, `np.exp`, `np.sqrt`, `np.maximum`, `np.equal`, and many more.

**Key Features:**

1. **Element-wise Operation:** A ufunc applies the same operation independently to each element of the input array(s). When you do `np.add(a, b)`, it conceptually does `result[0] = a[0] + b[0]`, `result[1] = a[1] + b[1]`, and so on for all elements.
2. **Speed (Optimized C Loops):** This is the secret sauce! Ufuncs don't actually perform the element-wise operation using slow Python loops. Instead, they execute highly optimized, pre-compiled **C loops** under the hood. This C code can work directly with the raw data buffers of the arrays (remember, ndarrays store data contiguously), making the computations extremely fast.
   * **Analogy:** Imagine you need to staple 1000 documents. A Python loop is like picking up the stapler, stapling one document, putting the stapler down, picking it up again, stapling the next... A ufunc is like using an industrial stapling machine that processes the entire stack almost instantly.
3. **Broadcasting Support:** Ufuncs automatically handle operations between arrays of different, but compatible, shapes. For example, you can add a single number (a scalar) to every element of an array, or add a 1D array to each row of a 2D array. The ufunc "stretches" or "broadcasts" the smaller array to match the shape of the larger one during the calculation. (We won't dive deep into broadcasting rules here, just know that ufuncs enable it.)
4. **Type Casting:** Ufuncs can intelligently handle inputs with different [dtype](02_dtype__data_type_object_.md)s. For instance, if you add an `int32` array and a `float64` array, the ufunc converts the integers to `float64` before performing the addition to avoid losing precision, returning a `float64` array. This happens according to well-defined casting rules. A small example of broadcasting and casting together follows this list.
5. **Optional Output Arrays (`out` argument):** You can tell a ufunc to place its result into an *existing* array instead of creating a new one. This can save memory, especially when working with very large arrays or inside loops.
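
Here is a small sketch showing broadcasting and type casting working together:

```python
import numpy as np

matrix = np.arange(6).reshape(2, 3)   # shape (2, 3), integer dtype
row = np.array([10.0, 20.0, 30.0])    # shape (3,),  float64

# Broadcasting: `row` is applied to every row of `matrix`.
# Type casting: int + float64 promotes the result to float64.
result = matrix + row
print(result)
# [[10. 21. 32.]
#  [13. 24. 35.]]
print(result.dtype)   # float64
```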

## Using ufuncs

You use ufuncs just like regular Python functions, but you pass NumPy arrays as arguments. Many common mathematical operators (`+`, `-`, `*`, `/`, `**`, `==`, `<`, etc.) also call ufuncs behind the scenes when used with NumPy arrays.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([5, 0, 7, 2])

# Using the ufunc directly
c = np.add(a, b)
print(f"np.add(a, b) = {c}")
# Output: np.add(a, b) = [ 6  2 10  6]

# Using the corresponding operator (which calls np.add internally)
d = a + b
print(f"a + b = {d}")
# Output: a + b = [ 6  2 10  6]

# Other examples
print(f"np.maximum(a, b) = {np.maximum(a, b)}")  # Element-wise maximum
# Output: np.maximum(a, b) = [5 2 7 4]

print(f"np.sin(a) = {np.sin(a)}")  # Element-wise sine
# Output: np.sin(a) = [ 0.84147098  0.90929743  0.14112001 -0.7568025 ]
```

**Using the `out` Argument:**

Let's pre-allocate an array and tell the ufunc to use it for the result.

```python
import numpy as np

a = np.arange(5)      # [0 1 2 3 4]
b = np.arange(5, 10)  # [5 6 7 8 9]

# Create an empty array with the same shape and type
result = np.empty_like(a)

# Perform addition, storing the result in the 'result' array
np.add(a, b, out=result)

print(f"a = {a}")
print(f"b = {b}")
print(f"result (after np.add(a, b, out=result)) = {result}")
# Output:
# a = [0 1 2 3 4]
# b = [5 6 7 8 9]
# result (after np.add(a, b, out=result)) = [ 5  7  9 11 13]
```

Instead of creating a *new* array for the sum, `np.add` placed the values directly into `result`.

## A Glimpse Under the Hood

So, what happens internally when you call, say, `np.add(array1, array2)`?

1. **Identify Ufunc:** NumPy recognizes `np.add` as a specific ufunc object. This object holds metadata about the operation (like its name, number of inputs/outputs, identity element if any, etc.).
2. **Check Dtypes:** NumPy inspects the `dtype` of `array1` and `array2` (e.g., `int32`, `float64`). This uses the `dtype` information we learned about in [Chapter 2: dtype (Data Type Object)](02_dtype__data_type_object_.md).
3. **Find the Loop:** The ufunc object contains an internal table (a list of "loops"). Each loop is a specific, pre-compiled C function designed to handle a particular combination of input/output `dtype`s (e.g., `int32 + int32 -> int32`, `float32 + float32 -> float32`, `int32 + float64 -> float64`). NumPy searches this table to find the most appropriate C function based on the input dtypes and casting rules. It might need to select a loop that involves converting one or both inputs to a common, safer type (type casting). You can inspect this table from Python; see the snippet after this list.
4. **Check Broadcasting:** NumPy checks if the shapes of `array1` and `array2` are compatible according to broadcasting rules. If they are compatible but different, it calculates how to "stretch" the smaller array's dimensions virtually.
5. **Allocate Output:** If the `out` argument wasn't provided, NumPy allocates a new block of memory for the result array, determining its shape (based on broadcasting) and `dtype` (based on the chosen loop).
6. **Execute C Loop:** NumPy calls the selected C function. This function iterates through the elements of the input arrays (using pointers to their raw memory locations, respecting broadcasting rules) and performs the addition, storing the result in the output array's memory. This loop is *very* fast because it's simple, compiled C code operating on primitive types.
7. **Return ndarray:** NumPy wraps the output memory block (either the newly allocated one or the one provided via `out`) into a new Python `ndarray` object ([Chapter 1: ndarray (N-dimensional array)](01_ndarray__n_dimensional_array_.md)) with the correct `shape`, `dtype`, etc., and returns it to your Python code.
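
Every ufunc exposes some of this metadata and its loop table from Python, which is a handy way to see step 3 in practice (an illustrative sketch; the exact list of signatures varies between NumPy versions):

```python
import numpy as np

# Number of inputs, number of outputs, identity element
print(np.add.nin, np.add.nout, np.add.identity)   # 2 1 0

# Type signatures of the pre-compiled loops, as 'inputs->output' character codes
print(np.add.types[:6])
# e.g. ['??->?', 'bb->b', 'BB->B', 'hh->h', 'HH->H', 'ii->i']
```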

Here's a simplified sequence diagram:

```mermaid
sequenceDiagram
    participant P as Python Code
    participant UFunc as np.add (Ufunc Object)
    participant C_API as NumPy C Core (Ufunc Machinery)
    participant C_Loop as Specific C Loop (e.g., int32_add)
    participant Mem as Memory

    P->>UFunc: np.add(arr1, arr2)
    UFunc->>C_API: Request execution
    C_API->>C_API: Check dtypes (arr1.dtype, arr2.dtype)
    C_API->>UFunc: Find appropriate C loop (e.g., int32_add)
    C_API->>C_API: Check broadcasting rules
    C_API->>Mem: Allocate memory for result (if no 'out')
    C_API->>C_Loop: Execute C loop(arr1_data, arr2_data, result_data)
    C_Loop->>Mem: Read inputs, Compute, Write output
    C_Loop-->>C_API: Signal completion
    C_API->>Mem: Wrap result memory in ndarray object
    C_API-->>P: Return result ndarray
```

**Where is the Code?**

* The ufunc objects themselves are typically defined in C, often generated by helper scripts like `numpy/core/code_generators/generate_umath.py`. This script reads definitions (like those in the `defdict` variable within the script) specifying the ufunc's name, inputs, outputs, and the C functions to use for different type combinations.

  ```python
  # Snippet from generate_umath.py's defdict for 'add'
  'add':
      Ufunc(2, 1, Zero,  # nin=2, nout=1, identity=0
            docstrings.get('numpy._core.umath.add'),
            'PyUFunc_AdditionTypeResolver',  # Function for type resolution
            TD('?', cfunc_alias='logical_or', ...),  # Loop for bools
            TD(no_bool_times_obj, dispatch=[...]),   # Loops for numeric types
            # ... loops for datetime, object ...
            indexed=intfltcmplx  # Types supporting indexed access
            ),
  ```

* The Python functions you call (like `numpy.add`) are often thin wrappers defined in places like `numpy/core/umath.py` or `numpy/core/numeric.py`. These Python functions essentially just retrieve the corresponding C ufunc object and trigger its execution mechanism.
* The core C machinery for handling ufunc dispatch (finding the right loop), broadcasting, and executing the loops resides within the compiled `_multiarray_umath` C extension module. We'll touch upon these modules in [Chapter 6: multiarray Module](06_multiarray_module.md) and [Chapter 7: umath Module](07_umath_module.md).
* Helper Python modules like `numpy/core/_methods.py` provide Python implementations for array methods (like `.sum()`, `.mean()`, `.max()`) which often leverage the underlying ufunc's reduction capabilities.
* Error handling during ufunc execution (e.g., division by zero, invalid operations) can be configured using functions like `seterr` defined in `numpy/core/_ufunc_config.py`, and specific exception types like `UFuncTypeError` from `numpy/core/_exceptions.py` may be raised if things go wrong (e.g., no suitable loop found for the input types). A short sketch of the public side of these last two points follows this list.
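
To make those last two bullets concrete, here is a small sketch using only public APIs (`np.add.reduce` for reductions, and the `np.errstate` context manager, which wraps the same machinery as `seterr`):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])

# Array methods such as .sum() are backed by ufunc reductions like np.add.reduce
print(a.sum(), np.add.reduce(a))   # 10.0 10.0

# Ufunc error handling can be adjusted temporarily, e.g. to silence
# divide-by-zero warnings inside a block:
with np.errstate(divide='ignore'):
    print(np.array([1.0]) / np.array([0.0]))   # [inf]
```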

## Conclusion

Ufuncs are the powerhouses behind NumPy's speed for element-wise operations. You've learned:

* They perform operations **element by element** on arrays.
* Their speed comes from executing optimized **C loops**, avoiding slow Python loops.
* They support **broadcasting** (handling compatible shapes) and **type casting** (handling different dtypes).
* You can use them directly (`np.add(a, b)`) or often via operators (`a + b`).
* The `out` argument allows reusing existing arrays, saving memory.
* Internally, NumPy finds the right C loop based on input dtypes, handles broadcasting, executes the loop, and returns a new ndarray.

Now that we understand how basic element-wise operations work, let's delve deeper into the different kinds of numbers NumPy works with.

Next up: [Chapter 4: Numeric Types (`numerictypes`)](04_numeric_types___numerictypes__.md).

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

docs/NumPy Core/04_numeric_types___numerictypes__.md (new file, 245 lines)

# Chapter 4: Numeric Types (`numerictypes`)

Hello again! In [Chapter 3: ufunc (Universal Function)](03_ufunc__universal_function_.md), we saw how NumPy uses universal functions (`ufuncs`) to perform fast calculations on arrays. We learned that these `ufuncs` operate element by element and can handle different data types using optimized C loops.

But what exactly *are* all the different data types that NumPy knows about? We touched on `dtype` objects in [Chapter 2: dtype (Data Type Object)](02_dtype__data_type_object_.md), which *describe* the type of data in an array (like '64-bit integer' or '32-bit float'). Now, we'll look at the actual **types themselves** – the specific building blocks like `numpy.int32`, `numpy.float64`, etc., and how they relate to each other. This collection and classification system is handled within the `numerictypes` concept in NumPy's core.

## What Problem Do `numerictypes` Solve? Organizing the Data Ingredients

Imagine you're organizing a huge pantry. You have different kinds of items: grains, spices, canned goods, etc. Within grains, you have rice, oats, quinoa. Within rice, you might have basmati, jasmine, brown rice.

NumPy's data types are similar. It has many specific types of numbers (`int8`, `int16`, `int32`, `int64`, `float16`, `float32`, `float64`, etc.) and other kinds of data (`bool`, `complex`, `datetime`). Just having a list of all these types isn't very organized.

We need a system to:

1. **Define** each specific type precisely (e.g., what exactly is `np.int32`?).
2. **Group** similar types together (e.g., all integers, all floating-point numbers).
3. **Establish relationships** between types (e.g., know that an `int32` *is a kind of* `integer`, which *is a kind of* `number`).
4. Provide convenient **shortcuts or aliases** (e.g., maybe `np.double` is just another name for `np.float64`).

The `numerictypes` concept in NumPy provides this structured catalog or "family tree" for all its scalar data types. It helps NumPy (and you!) understand how different data types are related, which is crucial for operations like choosing the right `ufunc` loop or deciding the output type of a calculation (type promotion).

## What are Numeric Types (`numerictypes`)?

In NumPy, `numerictypes` refers to the collection of **scalar type objects** themselves (like the Python classes `numpy.int32`, `numpy.float64`, `numpy.bool_`) and the **hierarchy** that organizes them.

Think back to the `dtype` object from Chapter 2. The `dtype` object *describes* the data type of an array. The actual type it's describing *is* one of these numeric types (or more accurately, a scalar type, since it includes non-numbers like `bool_` and `str_`).

```python
import numpy as np

# Create an array of 32-bit integers
arr = np.array([10, 20, 30], dtype=np.int32)

# The dtype object describes the type
print(f"Array's dtype object: {arr.dtype}")
# Output: Array's dtype object: int32

# The actual Python type of elements (if accessed individually)
# and the type referred to by the dtype object's `.type` attribute
print(f"The element type class: {arr.dtype.type}")
# Output: The element type class: <class 'numpy.int32'>

# This <class 'numpy.int32'> is one of NumPy's scalar types
# managed under the numerictypes concept.
```

So, `numerictypes` defines the actual classes like `np.int32`, `np.float64`, `np.integer`, `np.floating`, etc., that form the basis of NumPy's type system.

## The Type Hierarchy: A Family Tree

NumPy organizes its scalar types into a hierarchy, much like biological classification (Kingdom > Phylum > Class > Order...). This helps group related types.

At the top is `np.generic`, the base class for all NumPy scalars. Below that, major branches include `np.number`, `np.flexible`, `np.bool_`, etc.

Here's a simplified view of the *numeric* part of the hierarchy:

```mermaid
graph TD
    N[np.number] --> I[np.integer]
    N --> IX[np.inexact]

    I --> SI[np.signedinteger]
    I --> UI[np.unsignedinteger]

    IX --> F[np.floating]
    IX --> C[np.complexfloating]

    SI --> i8[np.int8]
    SI --> i16[np.int16]
    SI --> i32[np.int32]
    SI --> i64[np.int64]
    SI --> ip[np.intp]
    SI --> dots_i[...]

    UI --> u8[np.uint8]
    UI --> u16[np.uint16]
    UI --> u32[np.uint32]
    UI --> u64[np.uint64]
    UI --> up[np.uintp]
    UI --> dots_u[...]

    F --> f16[np.float16]
    F --> f32[np.float32]
    F --> f64[np.float64]
    F --> fld[np.longdouble]
    F --> dots_f[...]

    C --> c64[np.complex64]
    C --> c128[np.complex128]
    C --> cld[np.clongdouble]
    C --> dots_c[...]

    %% Styling for clarity
    classDef abstract fill:#f9f,stroke:#333,stroke-width:2px;
    class N,I,IX,SI,UI,F,C abstract;
```

* **Abstract Types:** Boxes like `np.number`, `np.integer`, `np.floating` represent *categories* or abstract base classes. You usually don't create arrays directly of type `np.integer`, but you can use these categories to check if a specific type belongs to that group.
* **Concrete Types:** Boxes like `np.int32`, `np.float64`, `np.complex128` are the specific, concrete types that you typically use to create arrays. They inherit from the abstract types. For example, `np.int32` is a subclass of `np.signedinteger`, which is a subclass of `np.integer`, which is a subclass of `np.number`.

You can check these relationships using `np.issubdtype` or Python's built-in `issubclass`:

```python
import numpy as np

# Is np.int32 a kind of integer?
print(f"issubdtype(np.int32, np.integer): {np.issubdtype(np.int32, np.integer)}")
# Output: issubdtype(np.int32, np.integer): True

# Is np.float64 a kind of integer?
print(f"issubdtype(np.float64, np.integer): {np.issubdtype(np.float64, np.integer)}")
# Output: issubdtype(np.float64, np.integer): False

# Is np.float64 a kind of number?
print(f"issubdtype(np.float64, np.number): {np.issubdtype(np.float64, np.number)}")
# Output: issubdtype(np.float64, np.number): True

# Using issubclass directly on the types also works
print(f"issubclass(np.int32, np.integer): {issubclass(np.int32, np.integer)}")
# Output: issubclass(np.int32, np.integer): True
```

This hierarchy is useful for understanding how NumPy treats different types, especially during calculations where types might need to be promoted (e.g., adding an `int32` and a `float64` usually results in a `float64`).
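
You can query those promotion rules directly, which is a quick way to see the hierarchy at work (a small illustrative sketch):

```python
import numpy as np

# Which dtype can safely hold values of both input dtypes?
print(np.promote_types(np.int32, np.float64))   # float64
print(np.promote_types(np.int8, np.uint8))      # int16

# The dtype an operation on these types would actually produce
print(np.result_type(np.int32, np.float64))     # float64
```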

## Common Types and Aliases

While NumPy defines many specific types (like `np.int8`, `np.uint16`, `np.float16`), you'll most often encounter these:

* **Integers:** `np.int32`, `np.int64` (default on 64-bit systems is usually `np.int64`)
* **Unsigned Integers:** `np.uint8` (common for images), `np.uint32`, `np.uint64`
* **Floats:** `np.float32` (single precision), `np.float64` (double precision, usually the default)
* **Complex:** `np.complex64`, `np.complex128`
* **Boolean:** `np.bool_` (True/False)

NumPy also provides several **aliases** or alternative names for convenience or historical reasons. Some common ones:

* `np.byte` is an alias for `np.int8`
* `np.short` is an alias for `np.int16`
* `np.intc` often corresponds to the C `int` type (usually `np.int32` or `np.int64`)
* `np.int_` is the default integer type (often `np.int64` on 64-bit systems, `np.int32` on 32-bit systems). Platform dependent!
* `np.single` is an alias for `np.float32`
* `np.double` or `np.float_` is an alias for `np.float64` (matches Python's `float`)
* `np.longdouble` corresponds to the C `long double` (size varies by platform)
* `np.csingle` is an alias for `np.complex64`
* `np.cdouble` or `np.complex_` is an alias for `np.complex128` (matches Python's `complex`)

You can usually use the specific name (like `np.float64`) or an alias (like `np.double`) interchangeably when specifying a `dtype`.

```python
import numpy as np

# Using the specific name
arr_f64 = np.array([1.0, 2.0], dtype=np.float64)
print(f"Type using np.float64: {arr_f64.dtype}")
# Output: Type using np.float64: float64

# Using an alias
arr_double = np.array([1.0, 2.0], dtype=np.double)
print(f"Type using np.double: {arr_double.dtype}")
# Output: Type using np.double: float64

# They refer to the same underlying type
print(f"Is np.float64 the same as np.double? {np.float64 is np.double}")
# Output: Is np.float64 the same as np.double? True
```

## A Glimpse Under the Hood

How does NumPy define all these types and their relationships? It's mostly done in Python code within the `numpy.core` submodule.

1. **Base C Types:** The fundamental types (like a 32-bit integer, a 64-bit float) are ultimately implemented in C as part of the [multiarray Module](06_multiarray_module.md).
2. **Python Class Definitions:** Python classes are defined for each scalar type (e.g., `class int32(signedinteger): ...`) in modules like `numpy/core/numerictypes.py`. These classes inherit from each other to create the hierarchy (e.g., `int32` inherits from `signedinteger`, which inherits from `integer`, etc.).
3. **Type Aliases:** Files like `numpy/core/_type_aliases.py` set up dictionaries (`sctypeDict`, `allTypes`) that map various names (including aliases like "double" or "int_") to the actual type objects (like `np.float64` or `np.intp`). This allows you to use different names when creating `dtype` objects.
4. **Registration:** The Python number types are also registered with Python's abstract base classes (`numbers.Integral`, `numbers.Real`, etc.) in `numerictypes.py` to improve interoperability with standard Python type checking.
5. **Documentation Generation:** Helper scripts like `numpy/core/_add_newdocs_scalars.py` use the type information and aliases to automatically generate parts of the documentation strings you see when you type `help(np.int32)`, making sure the aliases and platform specifics are correctly listed.

When you use a function like `np.issubdtype(np.int32, np.integer)`:

```mermaid
sequenceDiagram
    participant P as Your Python Code
    participant NPFunc as np.issubdtype
    participant PyTypes as Python Type System
    participant TypeHier as NumPy Type Hierarchy (in numerictypes.py)

    P->>NPFunc: np.issubdtype(np.int32, np.integer)
    NPFunc->>TypeHier: Get type object for np.int32
    NPFunc->>TypeHier: Get type object for np.integer
    NPFunc->>PyTypes: Ask: issubclass(np.int32_obj, np.integer_obj)?
    PyTypes-->>NPFunc: Return True (based on class inheritance)
    NPFunc-->>P: Return True
```

Essentially, `np.issubdtype` leverages Python's standard `issubclass` mechanism, applied to the hierarchy of type classes defined within `numerictypes`. The `_type_aliases.py` file plays a crucial role in making sure that string names or alias names used in `dtype` specifications resolve to the correct underlying type object before such checks happen.
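
For example, different spellings of the same type all resolve to the same objects, so the checks behave identically (a brief sketch):

```python
import numpy as np

# The alias string 'double' and the type object np.float64 describe one type
print(np.dtype('double') == np.dtype(np.float64))   # True
print(np.dtype('double').type is np.float64)        # True

# issubdtype accepts strings too, thanks to the same resolution step
print(np.issubdtype('float64', np.floating))        # True
```

The snippet below sketches how those alias tables are laid out in `_type_aliases.py`.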

```python
# Simplified view from numpy/core/_type_aliases.py

# ... (definitions of actual types like np.int8, np.float64) ...

allTypes = {
    'int8': np.int8,
    'int16': np.int16,
    # ...
    'float64': np.float64,
    # ...
    'signedinteger': np.signedinteger,  # Abstract type
    'integer': np.integer,              # Abstract type
    'number': np.number,                # Abstract type
    # ... etc
}

_aliases = {
    'double': 'float64',  # "double" maps to the key "float64"
    'int_': 'intp',       # "int_" maps to the key "intp" (platform dependent type)
    # ... etc
}

sctypeDict = {}  # Dictionary mapping names/aliases to types
# Populate sctypeDict using allTypes and _aliases
# ... (code to merge these dictionaries) ...

# When you do np.dtype('double'), NumPy uses sctypeDict (or similar logic)
# to find that 'double' means np.float64.
```

This setup provides a flexible and organized way to manage NumPy's rich set of data types.

## Conclusion

You've now explored the world of NumPy's `numerictypes`! You learned:

* `numerictypes` define the actual scalar **type objects** (like `np.int32`) and their **relationships**.
* They form a **hierarchy** (like a family tree) with abstract categories (e.g., `np.integer`) and concrete types (e.g., `np.int32`).
* This hierarchy helps NumPy understand how types relate, useful for calculations and type checking (`np.issubdtype`).
* NumPy provides many convenient **aliases** (e.g., `np.double` for `np.float64`).
* The types, hierarchy, and aliases are managed within Python code in `numpy.core`, primarily `numerictypes.py` and `_type_aliases.py`.

Understanding this catalog of types helps clarify why NumPy behaves the way it does when mixing different kinds of numbers.

Now that we know about the arrays, their data types, the functions that operate on them, and the specific numeric types available, how does NumPy *show* us the results?

Let's move on to how NumPy displays arrays: [Chapter 5: Array Printing (`arrayprint`)](05_array_printing___arrayprint__.md).

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

docs/NumPy Core/05_array_printing___arrayprint__.md (new file, 336 lines)

# Chapter 5: Array Printing (`arrayprint`)

In the previous chapter, [Chapter 4: Numeric Types (`numerictypes`)](04_numeric_types___numerictypes__.md), we explored the different kinds of data NumPy can store in its arrays, like `int32`, `float64`, and more. Now that we know about the arrays ([`ndarray`](01_ndarray__n_dimensional_array_.md)), their data types ([`dtype`](02_dtype__data_type_object_.md)), the functions that operate on them ([`ufunc`](03_ufunc__universal_function_.md)), and the specific number types (`numerictypes`), a practical question arises: How do we actually *look* at these arrays, especially if they are very large?

## What Problem Does `arrayprint` Solve? Making Arrays Readable

Imagine you have a NumPy array representing a large image, maybe with millions of pixel values. Or perhaps you have simulation data with thousands of temperature readings.

```python
import numpy as np

# Imagine this is a huge array, maybe thousands of numbers
large_array = np.arange(2000)

# If Python just tried to print every single number...
# it would flood your screen and be impossible to read!
# print(list(large_array))  # <-- Don't run this! It would be too long.
```

If NumPy just dumped *all* the numbers onto your screen whenever you tried to display a large array, it would be overwhelming and useless. We need a way to show the array's contents in a concise, human-friendly format. How can we get a *sense* of the array's data without printing every single element?

This is the job of NumPy's **array printing** mechanism, often referred to internally by the name of its main Python module, `arrayprint`.

## What is Array Printing (`arrayprint`)?

`arrayprint` is NumPy's **"pretty printer"** for `ndarray` objects. It's responsible for converting a NumPy array into a nicely formatted string representation that's easy to read and understand when you display it (e.g., in your Python console, Jupyter notebook, or using the `print()` function).

Think of it like getting a summary report instead of the raw database dump. `arrayprint` intelligently decides how to show the array, considering things like:

* **Summarization:** For large arrays, it shows only the beginning and end elements, using ellipsis (`...`) to indicate the omitted parts.
* **Precision:** It controls how many decimal places are shown for floating-point numbers.
* **Line Wrapping:** It breaks long rows of data into multiple lines to fit within a certain width.
* **Special Values:** It uses consistent strings for "Not a Number" (`nan`) and infinity (`inf`).
* **Customization:** It allows you to change these settings to suit your needs.

Let's see it in action with our `large_array`:

```python
import numpy as np

large_array = np.arange(2000)

# Let NumPy's array printing handle it
print(large_array)
```

**Output:**

```
[   0    1    2 ... 1997 1998 1999]
```

Instead of 2000 numbers flooding the screen, NumPy smartly printed only the first three and the last three, with `...` in between. This gives us a good idea of the array's contents (a sequence starting from 0) without being overwhelming.
|
||||
|
||||
## Key Features and Options
|
||||
|
||||
`arrayprint` has several options you can control to change how arrays are displayed.
|
||||
|
||||
### 1. Summarization (`threshold` and `edgeitems`)
|
||||
|
||||
* `threshold`: The total number of array elements that triggers summarization. If the array's `size` is greater than `threshold`, the array gets summarized. (Default: 1000)
|
||||
* `edgeitems`: When summarizing, this is the number of items shown at the beginning and end of each dimension. (Default: 3)
|
||||
|
||||
Let's try printing a smaller array and then changing the threshold:
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
|
||||
# An array with 10 elements
|
||||
arr = np.arange(10)
|
||||
print("Original:")
|
||||
print(arr)
|
||||
|
||||
# Temporarily set the threshold lower (e.g., 5)
|
||||
# We use np.printoptions as a context manager for temporary settings
|
||||
with np.printoptions(threshold=5):
|
||||
print("\nWith threshold=5:")
|
||||
print(arr)
|
||||
|
||||
# Change edgeitems too
|
||||
with np.printoptions(threshold=5, edgeitems=2):
|
||||
print("\nWith threshold=5, edgeitems=2:")
|
||||
print(arr)
|
||||
```
|
||||
|
||||
**Output:**
|
||||
|
||||
```
|
||||
Original:
|
||||
[0 1 2 3 4 5 6 7 8 9]
|
||||
|
||||
With threshold=5:
|
||||
[0 1 2 ... 7 8 9]
|
||||
|
||||
With threshold=5, edgeitems=2:
|
||||
[0 1 ... 8 9]
|
||||
```
|
||||
You can see how lowering the `threshold` caused the array (size 10) to be summarized, and `edgeitems` controlled how many elements were shown at the ends.
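
Summarization is applied along every dimension, not just the first one. Here is a small sketch with a 2-D array; the exact spacing depends on your settings, but the `...` markers appear in both the rows and the columns:

```python
import numpy as np

big2d = np.arange(10000).reshape(100, 100)

# size (10000) is far above the default threshold (1000), so NumPy shows
# only the first and last `edgeitems` rows and columns, with '...' in between.
print(big2d)
```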
|
||||
|
||||
### 2. Floating-Point Precision (`precision` and `suppress`)
|
||||
|
||||
* `precision`: Controls the number of digits displayed after the decimal point for floats. (Default: 8)
|
||||
* `suppress`: If `True`, floating-point numbers are always printed in fixed-point notation rather than scientific notation, and values too small to be shown at the current precision are printed as zero. (Default: False)
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
|
||||
# An array with floating-point numbers
|
||||
float_arr = np.array([0.123456789, 1.5e-10, 2.987])
|
||||
print("Default precision:")
|
||||
print(float_arr)
|
||||
|
||||
# Set precision to 3
|
||||
with np.printoptions(precision=3):
|
||||
print("\nWith precision=3:")
|
||||
print(float_arr)
|
||||
|
||||
# Set precision to 3 and suppress small numbers
|
||||
with np.printoptions(precision=3, suppress=True):
|
||||
print("\nWith precision=3, suppress=True:")
|
||||
print(float_arr)
|
||||
```
|
||||
|
||||
**Output:**
|
||||
|
||||
```
|
||||
Default precision:
|
||||
[1.23456789e-01 1.50000000e-10 2.98700000e+00]
|
||||
|
||||
With precision=3:
|
||||
[1.235e-01 1.500e-10 2.987e+00]
|
||||
|
||||
With precision=3, suppress=True:
|
||||
[0.123 0. 2.987]
|
||||
```
|
||||
Notice how `precision` changed the rounding, and `suppress=True` made the very small number (`1.5e-10`) display as `0.` and switched from scientific notation to fixed-point for the others. There's also a `floatmode` option for more fine-grained control over float formatting (e.g., 'fixed', 'unique').
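
Here is a brief sketch of how `floatmode` interacts with `precision` (the output comments are approximate; exact padding can differ between NumPy versions):

```python
import numpy as np

vals = np.array([0.1, 0.123456789])

# 'fixed': always print exactly `precision` fractional digits
with np.printoptions(precision=4, floatmode='fixed'):
    print(vals)   # e.g. [0.1000 0.1235]

# 'unique': print as many digits as needed to identify each value uniquely,
# ignoring `precision`
with np.printoptions(floatmode='unique'):
    print(vals)   # e.g. [0.1         0.123456789]
```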
|
||||
|
||||
### 3. Line Width (`linewidth`)
|
||||
|
||||
* `linewidth`: The maximum number of characters allowed per line before wrapping. (Default: 75)
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
|
||||
# A 2D array
|
||||
arr2d = np.arange(12).reshape(3, 4) * 0.1
|
||||
print("Default linewidth:")
|
||||
print(arr2d)
|
||||
|
||||
# Set a narrow linewidth
|
||||
with np.printoptions(linewidth=30):
|
||||
print("\nWith linewidth=30:")
|
||||
print(arr2d)
|
||||
```
|
||||
|
||||
**Output:**
|
||||
|
||||
```
|
||||
Default linewidth:
|
||||
[[0. 0.1 0.2 0.3]
|
||||
[0.4 0.5 0.6 0.7]
|
||||
[0.8 0.9 1. 1.1]]
|
||||
|
||||
With linewidth=30:
|
||||
[[0. 0.1 0.2 0.3]
|
||||
[0.4 0.5 0.6 0.7]
|
||||
[0.8 0.9 1. 1.1]]
|
||||
```
|
||||
*(Note: The output might not actually wrap here because the lines are short. If the array was wider, you'd see the rows break across multiple lines with the narrower `linewidth` setting.)*
|
||||
|
||||
### 4. Other Options
|
||||
|
||||
* `nanstr`: String representation for Not a Number. (Default: 'nan')
|
||||
* `infstr`: String representation for Infinity. (Default: 'inf')
|
||||
* `sign`: Control sign display for floats ('-', '+', or ' ').
|
||||
* `formatter`: A dictionary to provide completely custom formatting functions for specific data types (like bool, int, float, datetime, etc.). This is more advanced.
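
These are needed less often, but here is a small sketch of what they look like in practice (the `formatter` entry below is just an illustrative lambda, not a built-in):

```python
import numpy as np

special = np.array([1.5, -2.0, np.nan, np.inf])

# Custom text for nan/inf, and a leading space in place of '+' for positive numbers
with np.printoptions(nanstr='MISSING', infstr='INFINITY', sign=' '):
    print(special)

# A custom formatter for all floats: show each value as a percentage
with np.printoptions(formatter={'float': lambda x: f'{x:6.1%}'}):
    print(np.array([0.25, 0.5, 1.0]))
```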
|
||||
|
||||
## Using and Customizing Array Printing
|
||||
|
||||
You usually interact with array printing implicitly just by displaying an array:
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
arr = np.linspace(0, 1, 5)
|
||||
|
||||
# These both use NumPy's array printing behind the scenes
|
||||
print(arr) # Calls __str__ -> array_str -> array2string
|
||||
arr # In interactive sessions, calls __repr__ -> array_repr -> array2string
|
||||
```
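
If you want these strings explicitly (for logging, for example), `np.array_str` and `np.array_repr` return the same text that `print()` and the interactive prompt display:

```python
import numpy as np

arr = np.linspace(0, 1, 5)

print(np.array_str(arr))   # same text as str(arr) / print(arr)
print(np.array_repr(arr))  # same text as repr(arr), e.g. array([0.  , 0.25, ...])
```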
|
||||
|
||||
To customize the output, you can use:
|
||||
|
||||
1. **`np.set_printoptions(...)`:** Sets options globally (for your entire Python session).
|
||||
2. **`np.get_printoptions()`:** Returns a dictionary of the current settings.
|
||||
3. **`np.printoptions(...)`:** A context manager to set options *temporarily* within a `with` block (as used in the examples above). This is often the preferred way to avoid changing settings permanently.
|
||||
4. **`np.array2string(...)`:** A function to get the string representation directly, allowing you to override options just for that one call.
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
import sys # Needed for sys.maxsize
|
||||
|
||||
arr = np.random.rand(10, 10) * 1000
|
||||
|
||||
# --- Global Setting ---
|
||||
print("--- Setting threshold globally ---")
|
||||
original_options = np.get_printoptions() # Store original settings
|
||||
np.set_printoptions(threshold=50)
|
||||
print(arr)
|
||||
np.set_printoptions(**original_options) # Restore original settings
|
||||
|
||||
# --- Temporary Setting (Context Manager) ---
|
||||
print("\n--- Setting precision temporarily ---")
|
||||
with np.printoptions(precision=2, suppress=True):
|
||||
print(arr)
|
||||
print("\n--- Back to default precision ---")
|
||||
print(arr) # Options are automatically restored outside the 'with' block
|
||||
|
||||
# --- Direct Call with Overrides ---
|
||||
print("\n--- Using array2string with summarization off ---")
|
||||
# Use sys.maxsize to effectively disable summarization
|
||||
arr_string = np.array2string(arr, threshold=sys.maxsize, precision=1)
|
||||
# print(arr_string) # This might still be very long! Let's just print the first few lines
|
||||
print('\n'.join(arr_string.splitlines()[:5]) + '\n...')
|
||||
```
|
||||
|
||||
**Output (will vary due to random numbers):**
|
||||
|
||||
```
|
||||
--- Setting threshold globally ---
|
||||
[[992.84337197 931.73648142 119.68616987 ... 305.61919366 516.97897205
|
||||
707.69140878]
|
||||
[507.45895986 253.00740626 739.97091378 ... 755.69943511 813.11931119
|
||||
19.84654589]
|
||||
[941.25264871 689.43209981 820.11954711 ... 709.83933545 192.49837505
|
||||
609.30358618]
|
||||
...
|
||||
[498.86686503 872.79555956 401.19333028 ... 552.97492858 303.59379464
|
||||
308.61881807]
|
||||
[797.51920685 427.86020151 783.2019203 ... 511.63382762 322.52764881
|
||||
778.22766019]
|
||||
[ 54.84391309 938.24403397 796.7431406 ... 495.90873227 267.16620292
|
||||
409.51491904]]
|
||||
|
||||
--- Setting precision temporarily ---
|
||||
[[992.84 931.74 119.69 ... 305.62 516.98 707.69]
|
||||
[507.46 253.01 739.97 ... 755.7 813.12 19.85]
|
||||
[941.25 689.43 820.12 ... 709.84 192.5 609.3 ]
|
||||
...
|
||||
[498.87 872.8 401.19 ... 552.97 303.59 308.62]
|
||||
[797.52 427.86 783.2 ... 511.63 322.53 778.23]
|
||||
[ 54.84 938.24 796.74 ... 495.91 267.17 409.51]]
|
||||
|
||||
--- Back to default precision ---
|
||||
[[992.84337197 931.73648142 119.68616987 ... 305.61919366 516.97897205
|
||||
707.69140878]
|
||||
[507.45895986 253.00740626 739.97091378 ... 755.69943511 813.11931119
|
||||
19.84654589]
|
||||
[941.25264871 689.43209981 820.11954711 ... 709.83933545 192.49837505
|
||||
609.30358618]
|
||||
...
|
||||
[498.86686503 872.79555956 401.19333028 ... 552.97492858 303.59379464
|
||||
308.61881807]
|
||||
[797.51920685 427.86020151 783.2019203 ... 511.63382762 322.52764881
|
||||
778.22766019]
|
||||
[ 54.84391309 938.24403397 796.7431406 ... 495.90873227 267.16620292
|
||||
409.51491904]]
|
||||
|
||||
--- Using array2string with summarization off ---
|
||||
[[992.8 931.7 119.7 922. 912.2 156.5 459.4 305.6 517. 707.7]
|
||||
[507.5 253. 740. 640.3 420.3 652.1 197. 755.7 813.1 19.8]
|
||||
[941.3 689.4 820.1 125.8 598.2 219.3 466.7 709.8 192.5 609.3]
|
||||
[ 32. 855.2 362.1 434.9 133.5 148.1 522.6 725.1 395.5 377.9]
|
||||
[332.7 782.2 587.3 320.3 905.5 412.8 378. 911.9 972.1 400.2]
|
||||
...
|
||||
```
|
||||
|
||||
## A Glimpse Under the Hood
|
||||
|
||||
What happens when you call `print(my_array)`?
|
||||
|
||||
1. Python calls the `__str__` method of the `ndarray` object.
|
||||
2. NumPy's `ndarray.__str__` method typically calls the internal function `_array_str_implementation`.
|
||||
3. `_array_str_implementation` checks for simple cases (like 0-dimensional arrays) and then calls the main workhorse: `array2string`.
|
||||
4. **`array2string`** (defined in `numpy/core/arrayprint.py`) takes the array and any specified options (like `precision`, `threshold`, etc.). It also reads the current default print options (managed by `numpy/core/printoptions.py` using context variables).
|
||||
5. It determines if the array needs **summarization** based on its `size` and the `threshold` option.
|
||||
6. It figures out the **correct formatting function** for the array's `dtype` (e.g., `IntegerFormat`, `FloatingFormat`, `DatetimeFormat`). These formatters handle details like precision, sign, and scientific notation for individual elements. `FloatingFormat`, for example, might use the efficient `dragon4` algorithm (implemented in C) to convert floats to strings accurately.
|
||||
7. It recursively processes the array's dimensions:
|
||||
* For each element (or summarized chunk), it calls the chosen formatting function to get its string representation.
|
||||
* It arranges these strings, adding separators (like spaces or commas) and brackets (`[` `]`).
|
||||
* It checks the `linewidth` and inserts line breaks and indentation as needed.
|
||||
* If summarizing, it inserts the ellipsis (`...`) string (`summary_insert`).
|
||||
8. Finally, `array2string` returns the complete, formatted string representation of the array.
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant User
|
||||
participant Python as print() / REPL
|
||||
participant NDArray as my_array object
|
||||
participant ArrayPrint as numpy.core.arrayprint module
|
||||
participant PrintOpts as numpy.core.printoptions module
|
||||
|
||||
User->>Python: print(my_array) or my_array
|
||||
Python->>NDArray: call __str__ or __repr__
|
||||
NDArray->>ArrayPrint: call array_str or array_repr
|
||||
ArrayPrint->>ArrayPrint: call array2string(my_array, ...)
|
||||
ArrayPrint->>PrintOpts: Get current print options (threshold, precision, etc.)
|
||||
ArrayPrint->>ArrayPrint: Check size vs threshold -> Summarize?
|
||||
ArrayPrint->>ArrayPrint: Select Formatter based on my_array.dtype
|
||||
loop For each element/chunk
|
||||
ArrayPrint->>ArrayPrint: Format element using Formatter
|
||||
end
|
||||
ArrayPrint->>ArrayPrint: Arrange strings, add brackets, wrap lines
|
||||
ArrayPrint-->>NDArray: Return formatted string
|
||||
NDArray-->>Python: Return formatted string
|
||||
Python-->>User: Display formatted string
|
||||
```
|
||||
|
||||
The core logic resides in `numpy/core/arrayprint.py`. This file contains `array2string`, `array_repr`, `array_str`, and various formatter classes (`FloatingFormat`, `IntegerFormat`, `BoolFormat`, `ComplexFloatingFormat`, `DatetimeFormat`, `TimedeltaFormat`, `StructuredVoidFormat`, etc.). The global print options themselves are managed using Python's `contextvars` in `numpy/core/printoptions.py`, allowing settings to be changed globally or temporarily within a context.
|
||||
|
||||
## Conclusion
|
||||
|
||||
You've now learned how NumPy takes potentially huge and complex arrays and turns them into readable string representations using its `arrayprint` mechanism. Key takeaways:
|
||||
|
||||
* `arrayprint` is NumPy's "pretty printer" for arrays.
|
||||
* It uses **summarization** (`threshold`, `edgeitems`) for large arrays.
|
||||
* It controls **formatting** (like `precision`, `suppress` for floats) and **layout** (`linewidth`).
|
||||
* You can customize printing **globally** (`set_printoptions`), **temporarily** (`printoptions` context manager), or for **single calls** (`array2string`).
|
||||
* The core logic resides in `numpy/core/arrayprint.py`, using formatters tailored to different dtypes and reading options from `numpy/core/printoptions.py`.
|
||||
|
||||
Understanding array printing helps you effectively inspect and share your NumPy data.
|
||||
|
||||
Next, we'll start looking at the specific C and Python modules that form the core of NumPy's implementation, beginning with the central [Chapter 6: multiarray Module](06_multiarray_module.md).
|
||||
|
||||
---
|
||||
|
||||
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)
|
||||
|
||||
# Chapter 6: multiarray Module
|
||||
|
||||
Welcome back! In [Chapter 5: Array Printing (`arrayprint`)](05_array_printing___arrayprint__.md), we saw how NumPy takes complex arrays and presents them in a readable format. We've now covered the array container ([`ndarray`](01_ndarray__n_dimensional_array_.md)), its data types ([`dtype`](02_dtype__data_type_object_.md)), the functions that compute on them ([`ufunc`](03_ufunc__universal_function_.md)), the catalog of types ([`numerictypes`](04_numeric_types___numerictypes__.md)), and how arrays are displayed ([`arrayprint`](05_array_printing___arrayprint__.md)).
|
||||
|
||||
Now, let's peek deeper into the engine room. Where does the fundamental `ndarray` object *actually* come from? How are core operations like creating arrays or accessing elements implemented so efficiently? The answer lies largely within the C code associated with the concept of the `multiarray` module.
|
||||
|
||||
## What Problem Does `multiarray` Solve? Providing the Engine
|
||||
|
||||
Think about the very first step in using NumPy: creating an array.
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
|
||||
# How does this seemingly simple line actually work?
|
||||
my_array = np.array([1, 2, 3, 4, 5])
|
||||
|
||||
# How does NumPy know its shape? How is the data stored?
|
||||
print(my_array)
|
||||
print(my_array.shape)
|
||||
```
|
||||
|
||||
When you execute `np.array()`, you're using a convenient Python function. But NumPy's speed doesn't come from Python itself. It comes from highly optimized code written in the C programming language. How do these Python functions connect to that fast C code? And where is that C code defined?
|
||||
|
||||
The `multiarray` concept represents this core C engine. It's the part of NumPy responsible for:
|
||||
|
||||
1. **Defining the `ndarray` object:** The very structure that holds your data, its shape, its data type ([`dtype`](02_dtype__data_type_object_.md)), and how it's laid out in memory.
|
||||
2. **Implementing Fundamental Operations:** Providing the low-level C functions for creating arrays (like allocating memory), accessing elements (indexing), changing the view (slicing, reshaping), and basic mathematical operations.
|
||||
|
||||
Think of the Python functions like `np.array`, `np.zeros`, or accessing `arr.shape` as the dashboard and controls of a car. The `multiarray` C code is the powerful engine under the hood that actually makes the car move efficiently.
|
||||
|
||||
## What is the `multiarray` Module (Concept)?
|
||||
|
||||
Historically, `multiarray` was a distinct C extension module in NumPy. An "extension module" is a module written in C (or C++) that Python can import and use just like a regular Python module. This allows Python code to leverage the speed of C for performance-critical tasks.
|
||||
|
||||
More recently (since NumPy 1.16), the C code for `multiarray` was merged with the C code for the [ufunc (Universal Function)](03_ufunc__universal_function_.md) system (which we'll discuss more in [Chapter 7: umath Module](07_umath_module.md)) into a single, larger C extension module typically called `_multiarray_umath.cpython-*.so` (on Linux/Mac) or `_multiarray_umath.pyd` (on Windows).
|
||||
|
||||
Even though the C code is merged, the *concept* of `multiarray` remains important. It represents the C implementation layer that provides:
|
||||
|
||||
* The **`ndarray` object type** itself (`PyArrayObject` in C).
|
||||
* The **C-API (Application Programming Interface)**: A set of C functions that can be called by other C extensions (and internally by NumPy's Python code) to work with `ndarray` objects. Examples include functions to create arrays from data, get the shape, get the data pointer, perform indexing, etc.
|
||||
* Implementations of **core array functionalities**: array creation, data type handling ([`dtype`](02_dtype__data_type_object_.md)), memory layout management (strides), indexing, slicing, reshaping, transposing, and some basic operations.
|
||||
|
||||
The Python files you might see in the NumPy source code, like `numpy/core/multiarray.py` and `numpy/core/numeric.py`, often serve as Python wrappers. They provide the user-friendly Python functions (like `np.array`, `np.empty`, `np.dot`) that eventually call the fast C functions implemented within the `_multiarray_umath` extension module.
|
||||
|
||||
```python
|
||||
# numpy/core/multiarray.py - Simplified Example
|
||||
# This Python file imports directly from the C extension module
|
||||
|
||||
from . import _multiarray_umath # Import the compiled C module
|
||||
from ._multiarray_umath import * # Make C functions available
|
||||
|
||||
# Functions like 'array', 'empty', 'dot' that you use via `np.`
|
||||
# might be defined or re-exported here, ultimately calling C code.
|
||||
# For example, the `array` function here might parse the Python input
|
||||
# and then call a C function like `PyArray_NewFromDescr` from _multiarray_umath.
|
||||
```
|
||||
|
||||
This structure gives you the flexibility and ease of Python on the surface, powered by the speed and efficiency of C underneath.
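
One quick way to see this split from Python is to inspect the objects themselves: the familiar entry points are compiled callables rather than ordinary Python functions (a small sketch; the exact `repr` text may vary by NumPy version):

```python
import numpy as np

# np.array is implemented in C, so it shows up as a builtin, not a Python 'def'
print(type(np.array))   # <class 'builtin_function_or_method'>

# The ndarray type itself is also defined by the C extension
print(np.ndarray)       # <class 'numpy.ndarray'>
```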
|
||||
|
||||
## A Glimpse Under the Hood: Creating an Array
|
||||
|
||||
Let's trace what happens when you call `my_array = np.array([1, 2, 3])`:
|
||||
|
||||
1. **Python Call:** You call the Python function `np.array`. This function likely lives in `numpy/core/numeric.py` or is exposed through `numpy/core/multiarray.py`.
|
||||
2. **Argument Parsing:** The Python function examines the input `[1, 2, 3]`. It figures out the data type (likely `int64` by default on many systems) and the shape (which is `(3,)`).
|
||||
3. **Call C-API Function:** The Python function calls a specific function within the compiled `_multiarray_umath` C extension module. This C function is designed to create a new array. A common one is `PyArray_NewFromDescr` or a related helper.
|
||||
4. **Memory Allocation (C):** The C function asks the operating system for a block of memory large enough to hold 3 integers of the chosen type (e.g., 3 * 8 bytes = 24 bytes for `int64`).
|
||||
5. **Data Copying (C):** The C function copies the values `1`, `2`, and `3` from the Python list into the newly allocated memory block.
|
||||
6. **Create C `ndarray` Struct:** The C function creates an internal C structure (called `PyArrayObject`). This structure stores:
|
||||
* A pointer to the actual data block in memory.
|
||||
* Information about the data type ([`dtype`](02_dtype__data_type_object_.md)).
|
||||
* The shape of the array (`(3,)`).
|
||||
* The strides (how many bytes to jump to get to the next element in each dimension).
|
||||
* Other metadata (like flags indicating if it owns the data, if it's writeable, etc.).
|
||||
7. **Wrap in Python Object:** The C function wraps this internal `PyArrayObject` structure into a Python object that Python can understand – the `ndarray` object you interact with.
|
||||
8. **Return to Python:** The C function returns this new Python `ndarray` object back to your Python code, which assigns it to the variable `my_array`.
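
You can see the results of these C-level steps reflected in the attributes of the object that comes back (a small sketch; the default integer width and the data address depend on your platform):

```python
import numpy as np

my_array = np.array([1, 2, 3])

print(my_array.dtype)             # e.g. int64 on most 64-bit Linux/macOS systems
print(my_array.shape)             # (3,)
print(my_array.strides)           # (8,) for int64: 8 bytes to the next element
print(my_array.flags['OWNDATA'])  # True: this array owns its memory block
print(my_array.ctypes.data)       # integer address of the underlying data buffer
```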
|
||||
|
||||
Here's a simplified view of that flow:
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant User as Your Python Script
|
||||
participant PyFunc as NumPy Python Func (np.array)
|
||||
participant C_API as C Code (_multiarray_umath)
|
||||
participant Memory
|
||||
|
||||
User->>PyFunc: my_array = np.array([1, 2, 3])
|
||||
PyFunc->>C_API: Call C function (e.g., PyArray_NewFromDescr) with list data, inferred dtype, shape
|
||||
C_API->>Memory: Allocate memory block (e.g., 24 bytes for 3x int64)
|
||||
C_API->>Memory: Copy data [1, 2, 3] into block
|
||||
C_API->>C_API: Create internal C ndarray struct (PyArrayObject) pointing to data, storing shape=(3,), dtype=int64, etc.
|
||||
C_API->>PyFunc: Return Python ndarray object wrapping the C struct
|
||||
PyFunc-->>User: Assign returned ndarray object to `my_array`
|
||||
```
|
||||
|
||||
**Where is the Code?**
|
||||
|
||||
* **C Implementation:** The core logic is in C files compiled into the `_multiarray_umath` extension module (e.g., parts of `numpy/core/src/multiarray/`). Files like `alloc.c`, `ctors.c` (constructors), `getset.c` (for getting/setting attributes like shape), `item_selection.c` (indexing) contain relevant C code.
|
||||
* **Python Wrappers:** `numpy/core/numeric.py` and `numpy/core/multiarray.py` provide many of the familiar Python functions. They import directly from `_multiarray_umath`.
|
||||
```python
|
||||
# From numpy/core/numeric.py - Simplified
|
||||
from . import multiarray # Imports numpy/core/multiarray.py
|
||||
# multiarray.py itself imports from _multiarray_umath
|
||||
from .multiarray import (
|
||||
array, asarray, zeros, empty, # Functions defined/re-exported
|
||||
# ... many others ...
|
||||
)
|
||||
```
|
||||
* **Initialization:** `numpy/core/__init__.py` helps set up the `numpy.core` namespace, importing from `multiarray` and `umath`.
|
||||
```python
|
||||
# From numpy/core/__init__.py - Simplified
|
||||
from . import multiarray
|
||||
from . import umath
|
||||
# ... other imports ...
|
||||
from . import numeric
|
||||
from .numeric import * # Pulls in functions like np.array, np.zeros
|
||||
# ... more setup ...
|
||||
```
|
||||
* **C API Definition:** Files like `numpy/core/include/numpy/multiarray.h` define the C structures (`PyArrayObject`) and function prototypes (`PyArray_NewFromDescr`, etc.) that make up the NumPy C-API. Code generators like `numpy/core/code_generators/generate_numpy_api.py` help create tables (`__multiarray_api.h`, `__multiarray_api.c`) that allow other C extensions to easily access these core NumPy C functions.
|
||||
```python
|
||||
# Snippet from numpy/core/code_generators/generate_numpy_api.py
|
||||
# This script generates C code that defines an array of function pointers
|
||||
# making up the C-API.
|
||||
|
||||
# Describes API functions, their index in the API table, return type, args...
|
||||
multiarray_funcs = {
|
||||
# ... many functions ...
|
||||
'NewLikeArray': (10, None, 'PyObject *', (('PyArrayObject *', 'prototype'), ...)),
|
||||
'NewFromDescr': (9, None, 'PyObject *', ...),
|
||||
'Empty': (8, None, 'PyObject *', ...),
|
||||
# ...
|
||||
}
|
||||
|
||||
# ... code to generate C header (.h) and implementation (.c) files ...
|
||||
# These generated files help expose the C functions consistently.
|
||||
```
|
||||
|
||||
## Conclusion
|
||||
|
||||
You've now learned about the conceptual `multiarray` module, the C engine at the heart of NumPy.
|
||||
|
||||
* It's implemented in **C** (as part of the `_multiarray_umath` extension module) for maximum **speed and efficiency**.
|
||||
* It provides the fundamental **`ndarray` object** structure.
|
||||
* It implements **core array operations** like creation, memory management, indexing, and reshaping at a low level.
|
||||
* Python modules like `numpy.core.numeric` and `numpy.core.multiarray` provide user-friendly interfaces that call this underlying C code.
|
||||
* Understanding this separation helps explain *why* NumPy is so fast compared to standard Python lists for numerical tasks.
|
||||
|
||||
While `multiarray` provides the array structure and basic manipulation, the element-wise mathematical operations often rely on another closely related C implementation layer.
|
||||
|
||||
Let's explore that next in [Chapter 7: umath Module](07_umath_module.md).
|
||||
|
||||
---
|
||||
|
||||
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)
|
||||
|
||||
# Chapter 7: umath Module
|
||||
|
||||
Welcome to Chapter 7! In [Chapter 6: multiarray Module](06_multiarray_module.md), we explored the core C engine that defines the `ndarray` object and handles fundamental operations like creating arrays and accessing elements. We saw that the actual power comes from C code.
|
||||
|
||||
But what about the mathematical operations themselves? When you perform `np.sin(my_array)` or `array1 + array2`, which part of the C engine handles the actual sine calculation or the addition for *every single element*? This is where the concept of the `umath` module comes in.
|
||||
|
||||
## What Problem Does `umath` Solve? Implementing Fast Array Math
|
||||
|
||||
Remember the [ufunc (Universal Function)](03_ufunc__universal_function_.md) from Chapter 3? Ufuncs are NumPy's special functions designed to operate element-wise on arrays with incredible speed (like `np.add`, `np.sin`, `np.log`).
|
||||
|
||||
Let's take a simple example:
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
|
||||
angles = np.array([0, np.pi/2, np.pi])
|
||||
sines = np.sin(angles) # How is this sine calculated so fast?
|
||||
|
||||
print(angles)
|
||||
print(sines)
|
||||
```
|
||||
|
||||
**Output:**
|
||||
|
||||
```
|
||||
[0. 1.57079633 3.14159265]
|
||||
[0.0000000e+00 1.0000000e+00 1.2246468e-16] # Note: pi value is approximate
|
||||
```
|
||||
|
||||
The Python function `np.sin` acts as a dispatcher. It needs to hand off the actual, heavy-duty work of calculating the sine for each element in the `angles` array to highly optimized code. Where does this optimized code live?
|
||||
|
||||
Historically, the C code responsible for implementing the *loops and logic* of these mathematical ufuncs (like addition, subtraction, sine, cosine, logarithm, etc.) was contained within a dedicated C extension module called `umath`. It provided the fast, element-by-element computational kernels.
|
||||
|
||||
## What is the `umath` Module (Concept)?
|
||||
|
||||
The `umath` module represents the part of NumPy's C core dedicated to implementing **universal functions (ufuncs)**. Think of it as NumPy's built-in, highly optimized math library specifically designed for element-wise operations on arrays.
|
||||
|
||||
**Key Points:**
|
||||
|
||||
1. **Houses ufunc Implementations:** It contains the low-level C code that performs the actual calculations for functions like `np.add`, `np.sin`, `np.exp`, `np.sqrt`, etc.
|
||||
2. **Optimized Loops:** This C code includes specialized loops that iterate over the array elements very efficiently, often tailored for specific [dtype (Data Type Object)](02_dtype__data_type_object_.md)s (like a fast loop for adding 32-bit integers, another for 64-bit floats, etc.).
|
||||
3. **Historical C Module:** Originally, `umath` was a separate compiled C extension module (`umath.so` or `umath.pyd`).
|
||||
4. **Merged with `multiarray`:** Since NumPy 1.16, the C code for `umath` has been merged with the C code for `multiarray` into a single, larger C extension module named `_multiarray_umath`. While they are now in the same compiled file, the *functions and purpose* associated with `umath` (implementing ufunc math) are distinct from those associated with `multiarray` (array object structure and basic manipulation).
|
||||
5. **Python Access (`numpy/core/umath.py`):** You don't usually interact with the C code directly. Instead, NumPy provides Python functions (like `np.add`, `np.sin`) in the Python file `numpy/core/umath.py`. These Python functions are wrappers that know how to find and trigger the correct C implementation within the `_multiarray_umath` extension module.
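
You can glimpse this structure from Python by inspecting a ufunc object directly; the Python-level name is just a thin handle onto the C-level metadata and loops (a small sketch, and the exact list of type signatures varies between NumPy versions and platforms):

```python
import numpy as np

print(type(np.add))              # <class 'numpy.ufunc'> - defined in C
print(np.add.nin, np.add.nout)   # 2 1  -> two inputs, one output
print(np.add.identity)           # 0    -> identity element used by np.add.reduce
print(np.add.types[:4])          # a few of the per-dtype inner-loop signatures,
                                 # e.g. ['??->?', 'bb->b', 'BB->B', 'hh->h']
```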
|
||||
|
||||
**Analogy:** Imagine `multiarray` builds the car chassis and engine block (`ndarray` structure). `umath` provides specialized, high-performance engine components like the fuel injectors for addition (`np.add`'s C code), the turbocharger for exponentiation (`np.exp`'s C code), and the precise valve timing for trigonometry (`np.sin`'s C code). The Python functions (`np.add`, `np.sin`) are the pedals and buttons you use to activate these components.
|
||||
|
||||
## How it Works (Usage Perspective)
|
||||
|
||||
As a NumPy user, you typically trigger the `umath` C code indirectly by calling a ufunc:
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
|
||||
a = np.array([1, 2, 3])
|
||||
b = np.array([10, 20, 30])
|
||||
|
||||
# Calling the ufunc np.add
|
||||
result1 = np.add(a, b) # Triggers the C implementation for addition
|
||||
|
||||
# Using the operator '+' which also calls np.add for arrays
|
||||
result2 = a + b # Also triggers the C implementation
|
||||
|
||||
print(f"Using np.add: {result1}")
|
||||
print(f"Using + operator: {result2}")
|
||||
```
|
||||
|
||||
**Output:**
|
||||
|
||||
```
|
||||
Using np.add: [11 22 33]
|
||||
Using + operator: [11 22 33]
|
||||
```
|
||||
|
||||
Both `np.add(a, b)` and `a + b` ultimately lead to NumPy executing the highly optimized C code associated with the addition ufunc, which conceptually belongs to the `umath` part of the core.
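
Because both spellings run the same C loop, you also get the ufunc-only extras either way; for example, the `out=` argument lets the loop write its results into an array you already allocated (a small sketch):

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

result = np.empty_like(a)
np.add(a, b, out=result)   # the same addition loop runs, but fills `result` in place
print(result)              # [11 22 33]
```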
|
||||
|
||||
## A Glimpse Under the Hood
|
||||
|
||||
When you call a ufunc like `np.add(a, b)`:
|
||||
|
||||
1. **Python Call:** You invoke the Python function `np.add` (found in `numpy/core/umath.py` or exposed through `numpy/core/__init__.py`).
|
||||
2. **Identify Ufunc Object:** This Python function accesses the corresponding ufunc object (`np.add` itself is a ufunc object). This object holds metadata about the operation.
|
||||
3. **Dispatch to C:** The ufunc object mechanism (part of the `_multiarray_umath` C core) takes over.
|
||||
4. **Type Resolution & Loop Selection:** The C code inspects the `dtype`s of the input arrays (`a` and `b`). Based on the input types, it looks up an internal table associated with the `add` ufunc to find the *best* matching, pre-compiled C loop. For example, if `a` and `b` are both `int64`, it selects the C function specifically designed for `int64 + int64 -> int64`. This selection process might involve type casting rules (e.g., adding `int32` and `float64` might choose a loop that operates on `float64`).
|
||||
5. **Execute C Loop:** The selected C function (the core `umath` implementation for this specific type combination) is executed. This function iterates efficiently over the input array(s) memory, performs the addition element by element, and stores the results in the output array's memory.
|
||||
6. **Return Result:** The C machinery wraps the output memory into a new `ndarray` object and returns it back to your Python code.
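
The type-resolution step is easy to observe from Python: the input dtypes decide which inner loop runs, and therefore what dtype the result has (a small sketch):

```python
import numpy as np

ints = np.array([1, 2, 3], dtype=np.int64)
more_ints = np.array([10, 20, 30], dtype=np.int64)
floats = np.array([0.5, 0.5, 0.5], dtype=np.float64)

print(np.add(ints, more_ints).dtype)  # int64   -> the int64 + int64 loop was chosen
print(np.add(ints, floats).dtype)     # float64 -> ints are cast and a float64 loop runs
```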
|
||||
|
||||
Here's a simplified sequence diagram:
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant User as Your Python Script
|
||||
participant PyUfunc as np.add (Python Wrapper)
|
||||
participant UfuncObj as Ufunc Object (Metadata)
|
||||
participant C_Core as C Code (_multiarray_umath)
|
||||
participant C_Loop as Specific Add Loop (e.g., int64_add)
|
||||
participant Memory
|
||||
|
||||
User->>PyUfunc: result = np.add(a, b)
|
||||
PyUfunc->>UfuncObj: Access the 'add' ufunc object
|
||||
UfuncObj->>C_Core: Initiate ufunc execution (pass inputs a, b)
|
||||
C_Core->>C_Core: Inspect a.dtype, b.dtype
|
||||
C_Core->>UfuncObj: Find best C loop (e.g., int64_add loop)
|
||||
C_Core->>Memory: Allocate memory for result (if needed)
|
||||
C_Core->>C_Loop: Execute int64_add(a_data, b_data, result_data)
|
||||
C_Loop->>Memory: Read a, b, compute sum, write result
|
||||
C_Loop-->>C_Core: Signal loop completion
|
||||
C_Core->>Memory: Wrap result memory in ndarray object
|
||||
C_Core-->>PyUfunc: Return result ndarray
|
||||
PyUfunc-->>User: Assign result ndarray to 'result'
|
||||
|
||||
```
|
||||
|
||||
**Where is the Code?**
|
||||
|
||||
* **C Extension Module:** The compiled code lives in `_multiarray_umath.so` / `.pyd`.
|
||||
* **Ufunc Definition & Generation:** The script `numpy/core/code_generators/generate_umath.py` is crucial. It contains definitions (like the `defdict` dictionary) that describe each ufunc: its name, number of inputs/outputs, identity element, the C functions to use for different type combinations (`TD` entries), and associated docstrings. This script generates C code (`__umath_generated.c`, which is then compiled) that sets up the ufunc objects and their internal loop tables.
|
||||
```python
|
||||
# Simplified snippet from generate_umath.py's defdict for 'add'
|
||||
'add':
|
||||
Ufunc(2, 1, Zero, # nin=2, nout=1, identity=0
|
||||
docstrings.get('numpy._core.umath.add'), # Docstring reference
|
||||
'PyUFunc_AdditionTypeResolver', # Type resolution logic
|
||||
TD('?', ...), # Loop for booleans
|
||||
TD(no_bool_times_obj, dispatch=[...]), # Loops for numeric types
|
||||
# ... loops for datetime, object ...
|
||||
),
|
||||
```
|
||||
This definition tells the generator how to build the `np.add` ufunc, including which C functions (often defined in other C files or generated from templates) handle addition for different data types.
|
||||
* **C Loop Implementations:** The actual C code performing the math often comes from template files (like `numpy/core/src/umath/loops.c.src`) or CPU-dispatch-specific files (like `numpy/core/src/umath/loops_arithm_fp.dispatch.c.src`). These `.src` files contain templates written in a C-like syntax that get processed to generate specific C code for various data types (e.g., generating `int32_add`, `int64_add`, `float32_add`, `float64_add` from a single addition template). The dispatch files allow NumPy to choose optimized code paths (using e.g., AVX2, AVX512 instructions) based on your CPU's capabilities at runtime.
|
||||
* **Python Wrappers:** `numpy/core/umath.py` provides the Python functions like `np.add`, `np.sin` that you call. It primarily imports these functions directly from the `_multiarray_umath` C extension module.
|
||||
```python
|
||||
# From numpy/core/umath.py - Simplified
|
||||
from . import _multiarray_umath
|
||||
from ._multiarray_umath import * # Imports C-defined ufuncs like 'add'
|
||||
|
||||
# Functions like 'add', 'sin', 'log' are now available in this module's
|
||||
# namespace, ready to be used via `np.add`, `np.sin`, etc.
|
||||
```
|
||||
* **Namespace Setup:** `numpy/core/__init__.py` imports from `numpy.core.umath` (among others) to make functions like `np.add` easily accessible under the main `np` namespace.
|
||||
|
||||
## Conclusion
|
||||
|
||||
You've now seen that the `umath` concept represents the implementation heart of NumPy's universal functions.
|
||||
|
||||
* It provides the optimized **C code** that performs element-wise mathematical operations.
|
||||
* It contains specialized **loops** for different data types, crucial for NumPy's speed.
|
||||
* While historically a separate C module, its functionality is now part of the merged `_multiarray_umath` C extension.
|
||||
* Python files like `numpy/core/umath.py` provide access, but the real work happens in C, often defined via generators like `generate_umath.py` and implemented in templated `.src` or dispatchable C files.
|
||||
|
||||
Understanding `umath` clarifies where the computational power for element-wise operations originates within NumPy's core.
|
||||
|
||||
So far, we've focused on NumPy's built-in functions. But how does NumPy interact with other libraries or allow customization of how operations work on its arrays?
|
||||
|
||||
Next, we'll explore a powerful mechanism for extending NumPy's reach: [Chapter 8: __array_function__ Protocol / Overrides (`overrides`)](08___array_function___protocol___overrides___overrides__.md).
|
||||
|
||||
---
|
||||
|
||||
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)
|
||||
|
||||
# Chapter 8: __array_function__ Protocol / Overrides (`overrides`)
|
||||
|
||||
Welcome to the final chapter of our NumPy Core exploration! In [Chapter 7: umath Module](07_umath_module.md), we learned how NumPy implements its fast, element-wise mathematical functions (`ufuncs`) using optimized C code. We've seen the core components: the `ndarray` container, `dtype` descriptions, `ufunc` operations, numeric types, printing, and the C modules (`multiarray`, `umath`) that power them.
|
||||
|
||||
But NumPy doesn't exist in isolation. The Python scientific ecosystem is full of other libraries that also work with array-like data. Think of libraries like Dask (for parallel computing on large datasets that don't fit in memory) or CuPy (for running NumPy-like operations on GPUs). How can these *different* types of arrays work smoothly with standard NumPy functions like `np.sum`, `np.mean`, or `np.concatenate`?
|
||||
|
||||
## What Problem Does `__array_function__` Solve? Speaking NumPy's Language
|
||||
|
||||
Imagine you have a special type of array, maybe one that lives on a GPU (like a CuPy array) or one that represents a computation spread across many machines (like a Dask array). You want to calculate the sum of its elements.
|
||||
|
||||
Ideally, you'd just write:
|
||||
|
||||
```python
|
||||
# Assume 'my_special_array' is an instance of a custom array type
|
||||
# (e.g., from CuPy or Dask)
|
||||
result = np.sum(my_special_array)
|
||||
```
|
||||
|
||||
But wait, `np.sum` is a NumPy function, designed primarily for NumPy's `ndarray` ([Chapter 1: ndarray (N-dimensional array)](01_ndarray__n_dimensional_array_.md)). How can it possibly know how to sum elements on a GPU or coordinate a distributed calculation?
|
||||
|
||||
Before the `__array_function__` protocol, this was tricky. Either the library (like CuPy) had to provide its *own* complete set of functions (`cupy.sum`), or NumPy would have needed specific code to handle every possible external array type, which is impossible to maintain.
|
||||
|
||||
We need a way for NumPy functions to ask the input objects: "Hey, do *you* know how to handle this operation (`np.sum` in this case)?" If the object says yes, NumPy can step back and let the object take control.
|
||||
|
||||
This is exactly what the `__array_function__` protocol (defined in NEP-18) allows. It's like a common language or negotiation rule that lets different array libraries "override" or take over the execution of NumPy functions when their objects are involved.
|
||||
|
||||
**Analogy:** Think of NumPy functions as a universal remote control. Initially, it only knows how to control NumPy-brand TVs (`ndarray`s). The `__array_function__` protocol is like adding a feature where the remote, when pointed at a different brand TV (like a CuPy array), asks the TV: "Do you understand this button (e.g., 'sum')?" If the TV responds, "Yes, here's how I do 'sum'," the remote lets the TV handle it.
|
||||
|
||||
## What is the `__array_function__` Protocol?
|
||||
|
||||
The `__array_function__` protocol is a special method that array-like objects can implement. When a NumPy function is called with arguments that include one or more objects defining `__array_function__`, NumPy follows these steps:
|
||||
|
||||
1. **Check Arguments:** NumPy looks at all the input arguments passed to the function (e.g., `np.sum(my_array, axis=0)`).
|
||||
2. **Find Overrides:** It identifies which arguments have an `__array_function__` method.
|
||||
3. **Prioritize:** It orders the distinct types of those arguments so that subclasses are tried before their superclasses; otherwise they are tried in the order their arguments appear in the call.
|
||||
4. **Negotiate:** It calls the `__array_function__` method of the highest-priority object. It passes two key pieces of information to this method:
|
||||
* The original NumPy function object itself (e.g., `np.sum`).
|
||||
* The arguments (`*args`) and keyword arguments (`**kwargs`) that were originally passed to the NumPy function.
|
||||
5. **Delegate:** The object's `__array_function__` method now has control. It can:
|
||||
* Handle the operation itself (e.g., perform a GPU sum if it's a CuPy array) and return the result.
|
||||
* Decide it *cannot* handle this specific function or combination of arguments and return a special value `NotImplemented`. In this case, NumPy tries the `__array_function__` method of the *next* highest-priority object.
|
||||
* Potentially call the original NumPy function on converted inputs if needed.
|
||||
6. **Fallback:** If *no* object's `__array_function__` method handles the call (they all return `NotImplemented`), NumPy raises a `TypeError`. *Crucially, NumPy usually does NOT fall back to its own default implementation on the foreign objects unless explicitly told to by the override.*
|
||||
|
||||
## Using `__array_function__` (Implementing a Simple Override)
|
||||
|
||||
Let's create a very basic array-like class that overrides `np.sum` but lets other functions pass through (by returning `NotImplemented`).
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
|
||||
class MySimpleArray:
|
||||
def __init__(self, data):
|
||||
# Store data internally, maybe as a NumPy array for simplicity here
|
||||
self._data = np.asarray(data)
|
||||
|
||||
# This is the magic method!
|
||||
def __array_function__(self, func, types, args, kwargs):
|
||||
print(f"MySimpleArray.__array_function__ got called for {func.__name__}")
|
||||
|
||||
if func is np.sum:
|
||||
# Handle np.sum ourselves!
|
||||
print("-> Handling np.sum internally!")
|
||||
# Convert args to NumPy arrays if they are MySimpleArray
|
||||
np_args = [a._data if isinstance(a, MySimpleArray) else a for a in args]
|
||||
np_kwargs = {k: v._data if isinstance(v, MySimpleArray) else v for k, v in kwargs.items()}
|
||||
# Perform the actual sum using NumPy on the internal data
|
||||
return np.sum(*np_args, **np_kwargs)
|
||||
else:
|
||||
# For any other function, say we don't handle it
|
||||
print(f"-> Don't know how to handle {func.__name__}, returning NotImplemented.")
|
||||
return NotImplemented
|
||||
|
||||
# Make it look a bit like an array for printing
|
||||
def __repr__(self):
|
||||
return f"MySimpleArray({self._data})"
|
||||
|
||||
# --- Try it out ---
|
||||
my_arr = MySimpleArray([1, 2, 3, 4])
|
||||
print("Array:", my_arr)
|
||||
|
||||
# Call np.sum
|
||||
print("\nCalling np.sum(my_arr):")
|
||||
total = np.sum(my_arr)
|
||||
print("Result:", total)
|
||||
|
||||
# Call np.mean (which our class doesn't handle)
|
||||
print("\nCalling np.mean(my_arr):")
|
||||
try:
|
||||
mean_val = np.mean(my_arr)
|
||||
print("Result:", mean_val)
|
||||
except TypeError as e:
|
||||
print("Caught expected TypeError:", e)
|
||||
```
|
||||
|
||||
**Output:**
|
||||
|
||||
```
|
||||
Array: MySimpleArray([1 2 3 4])
|
||||
|
||||
Calling np.sum(my_arr):
|
||||
MySimpleArray.__array_function__ got called for sum
|
||||
-> Handling np.sum internally!
|
||||
Result: 10
|
||||
|
||||
Calling np.mean(my_arr):
|
||||
MySimpleArray.__array_function__ got called for mean
|
||||
-> Don't know how to handle mean, returning NotImplemented.
|
||||
Caught expected TypeError: no implementation found for 'numpy.mean' on types that implement __array_function__: [<class '__main__.MySimpleArray'>]
|
||||
```
|
||||
|
||||
**Explanation:**
|
||||
|
||||
1. We created `MySimpleArray` which holds some data (here, a standard NumPy array `_data`).
|
||||
2. We implemented `__array_function__(self, func, types, args, kwargs)`.
|
||||
* `func`: The NumPy function being called (e.g., `np.sum`, `np.mean`).
|
||||
* `types`: A tuple of unique types implementing `__array_function__` in the arguments.
|
||||
* `args`, `kwargs`: The original arguments passed to `func`.
|
||||
3. Inside `__array_function__`, we check if `func` is `np.sum`.
|
||||
* If yes, we print a message, extract the internal `_data` from any `MySimpleArray` arguments, call `np.sum` on that data, and return the result. NumPy uses this returned value directly.
|
||||
* If no (like for `np.mean`), we print a message and return `NotImplemented`.
|
||||
4. When we call `np.sum(my_arr)`, NumPy detects `__array_function__` on `my_arr`. It calls it. Our method handles `np.sum` and returns `10`.
|
||||
5. When we call `np.mean(my_arr)`, NumPy again calls `__array_function__`. This time, our method returns `NotImplemented`. Since no other arguments handle it, NumPy raises a `TypeError` because it doesn't know how to calculate the mean of `MySimpleArray` by default.
|
||||
|
||||
This example demonstrates how an external library object can selectively take control of NumPy functions. Libraries like CuPy or Dask implement `__array_function__` much more thoroughly, handling many NumPy functions to perform operations on their specific data representations (GPU arrays, distributed arrays).
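
As a variation on the class above, an override does not have to pick functions one by one. A minimal "pass-through" implementation (a sketch, not how CuPy or Dask actually do it) can unwrap its own objects and hand every call back to NumPy's implementation on the underlying data:

```python
import numpy as np

class MyPassthroughArray:
    """Like MySimpleArray, but delegates every NumPy function to the wrapped data."""

    def __init__(self, data):
        self._data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # Unwrap any of our own objects, then let NumPy's implementation do the work.
        unwrapped = [a._data if isinstance(a, MyPassthroughArray) else a for a in args]
        return func(*unwrapped, **kwargs)

arr = MyPassthroughArray([1.0, 2.0, 3.0])
print(np.sum(arr))   # 6.0
print(np.mean(arr))  # 2.0 - no TypeError this time
```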
|
||||
|
||||
## A Glimpse Under the Hood (`overrides.py`)
|
||||
|
||||
How does NumPy actually manage this dispatching process? The logic lives primarily in the `numpy/core/overrides.py` module.
|
||||
|
||||
1. **Decorator:** Many NumPy functions (especially those intended to be public and potentially overridden) are decorated with `@array_function_dispatch(...)` or a similar helper (`@array_function_from_dispatcher`). You can see this decorator used in files like `numpy/core/function_base.py` (for `linspace`, `logspace`, etc.) or `numpy/core/fromnumeric.py` (for `sum`, `mean`, and other array functions).
|
||||
```python
|
||||
# Example from numpy/core/function_base.py (simplified)
|
||||
from numpy._core import overrides
|
||||
|
||||
array_function_dispatch = functools.partial(
|
||||
overrides.array_function_dispatch, module='numpy')
|
||||
|
||||
def _linspace_dispatcher(start, stop, num=None, ...):
|
||||
# This helper identifies arguments relevant for dispatch
|
||||
return (start, stop)
|
||||
|
||||
@array_function_dispatch(_linspace_dispatcher) # Decorator applied!
|
||||
def linspace(start, stop, num=50, ...):
|
||||
# ... Actual implementation for NumPy arrays ...
|
||||
pass
|
||||
```
|
||||
2. **Dispatcher Class:** The decorator wraps the original function (like `linspace`) in a special callable object, often an instance of `_ArrayFunctionDispatcher`.
|
||||
3. **Call Interception:** When you call the decorated NumPy function (e.g., `np.linspace(...)`), you're actually calling the `_ArrayFunctionDispatcher` object.
|
||||
4. **Argument Check (`_get_implementing_args`):** The dispatcher object first calls the little helper function provided to the decorator (like `_linspace_dispatcher`) to figure out which arguments are relevant for checking the `__array_function__` protocol. Then, it calls the C helper function `_get_implementing_args` (defined in `numpy/core/src/multiarray/overrides.c`) which efficiently inspects the relevant arguments, finds those with `__array_function__`, and sorts them according to priority and type relationships.
|
||||
5. **Delegation Loop:** The dispatcher iterates through the implementing arguments found in step 4 (from highest priority to lowest). For each one, it calls its `__array_function__` method.
|
||||
6. **Handle Result:**
|
||||
* If `__array_function__` returns a value other than `NotImplemented`, the dispatcher immediately returns that value to the original caller. The process stops.
|
||||
* If `__array_function__` returns `NotImplemented`, the dispatcher continues to the next implementing argument in the list.
|
||||
7. **Error or Default:** If the loop finishes without any override handling the call, a `TypeError` is raised.
|
||||
|
||||
Here's a simplified sequence diagram for `np.sum(my_arr)`:
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant User
|
||||
participant NumPyFunc as np.sum (Dispatcher Object)
|
||||
participant Overrides as numpy.core.overrides
|
||||
participant CustomArr as my_arr (MySimpleArray)
|
||||
|
||||
User->>NumPyFunc: np.sum(my_arr)
|
||||
NumPyFunc->>Overrides: Get relevant args (my_arr)
|
||||
Overrides->>Overrides: _get_implementing_args([my_arr])
|
||||
Overrides-->>NumPyFunc: Found [my_arr] implements __array_function__
|
||||
NumPyFunc->>CustomArr: call __array_function__(func=np.sum, ...)
|
||||
CustomArr->>CustomArr: Check if func is np.sum (Yes)
|
||||
CustomArr->>CustomArr: Perform custom sum logic
|
||||
CustomArr-->>NumPyFunc: Return result (e.g., 10)
|
||||
NumPyFunc-->>User: Return result (10)
|
||||
```
|
||||
|
||||
The `numpy/core/overrides.py` file defines the Python-level infrastructure (`array_function_dispatch`, `_ArrayFunctionDispatcher`), while the core logic for efficiently finding and sorting implementing arguments (`_get_implementing_args`) is implemented in C for performance.
|
||||
|
||||
## Conclusion
|
||||
|
||||
The `__array_function__` protocol is a powerful mechanism that makes NumPy far more extensible and integrated with the wider Python ecosystem. You've learned:
|
||||
|
||||
* It allows objects from **other libraries** (like Dask, CuPy) to **override** how NumPy functions behave when passed instances of those objects.
|
||||
* It works via a special method, `__array_function__`, that implementing objects define.
|
||||
* NumPy **negotiates** with arguments: it checks for the method and **delegates** the call if an argument handles it.
|
||||
* This enables writing code that looks like standard NumPy (`np.sum(my_obj)`) but can operate seamlessly on diverse array types (CPU, GPU, distributed).
|
||||
* The dispatch logic is managed primarily by decorators and helpers in `numpy/core/overrides.py`, relying on a C function (`_get_implementing_args`) for efficient argument checking.
|
||||
|
||||
This protocol is a key part of why NumPy remains central to scientific computing in Python, allowing it to interact smoothly with specialized array libraries without requiring NumPy itself to know the specifics of each one.
|
||||
|
||||
This concludes our tour through the core concepts of NumPy! We hope this journey from the fundamental `ndarray` to the sophisticated `__array_function__` protocol has given you a deeper appreciation for how NumPy works under the hood.
|
||||
|
||||
---
|
||||
|
||||
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)
|
||||
|
||||
# Tutorial: NumPy Core
|
||||
|
||||
NumPy provides the powerful **ndarray** object, a *multi-dimensional grid* optimized for numerical computations on large datasets. It uses **dtypes** (data type objects) to precisely define the *kind of data* (like integers or floating-point numbers) stored within an array, ensuring memory efficiency and enabling optimized low-level operations. NumPy also features **ufuncs** (universal functions), which are functions like `add` or `sin` designed to operate *element-wise* on entire arrays very quickly, leveraging compiled code. Together, these components form the foundation for high-performance scientific computing in Python.
|
||||
|
||||
|
||||
**Source Repository:** [https://github.com/numpy/numpy/tree/3b377854e8b1a55f15bda6f1166fe9954828231b/numpy/_core](https://github.com/numpy/numpy/tree/3b377854e8b1a55f15bda6f1166fe9954828231b/numpy/_core)
|
||||
|
||||
```mermaid
|
||||
flowchart TD
|
||||
A0["ndarray (N-dimensional array)"]
|
||||
A1["dtype (Data Type Object)"]
|
||||
A2["ufunc (Universal Function)"]
|
||||
A3["multiarray Module"]
|
||||
A4["umath Module"]
|
||||
A5["Numeric Types"]
|
||||
A6["Array Printing"]
|
||||
A7["__array_function__ Protocol / Overrides"]
|
||||
A0 -- "Has data type" --> A1
|
||||
A2 -- "Operates element-wise on" --> A0
|
||||
A3 -- "Provides implementation for" --> A0
|
||||
A4 -- "Provides implementation for" --> A2
|
||||
A5 -- "Defines scalar types for" --> A1
|
||||
A6 -- "Formats for display" --> A0
|
||||
A6 -- "Uses for formatting info" --> A1
|
||||
A7 -- "Overrides functions from" --> A3
|
||||
A7 -- "Overrides functions from" --> A4
|
||||
A1 -- "References type hierarchy" --> A5
|
||||
```
|
||||
|
||||
## Chapters
|
||||
|
||||
1. [ndarray (N-dimensional array)](01_ndarray__n_dimensional_array_.md)
|
||||
2. [dtype (Data Type Object)](02_dtype__data_type_object_.md)
|
||||
3. [ufunc (Universal Function)](03_ufunc__universal_function_.md)
|
||||
4. [Numeric Types (`numerictypes`)](04_numeric_types___numerictypes__.md)
|
||||
5. [Array Printing (`arrayprint`)](05_array_printing___arrayprint__.md)
|
||||
6. [multiarray Module](06_multiarray_module.md)
|
||||
7. [umath Module](07_umath_module.md)
|
||||
8. [__array_function__ Protocol / Overrides (`overrides`)](08___array_function___protocol___overrides___overrides__.md)
|
||||
|
||||
|
||||
---
|
||||
|
||||
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)
|
||||