init push

This commit is contained in:
zachary62
2025-04-04 13:03:54 -04:00
parent e62ee2cb13
commit 2ebad5e5f2
160 changed files with 2 additions and 0 deletions

# Chapter 1: ndarray (N-dimensional array)
Welcome to the NumPy Core tutorial! If you're interested in how NumPy works under the hood, you're in the right place. NumPy is the foundation for scientific computing in Python, and its core strength comes from a special object called the `ndarray`.
Imagine you have a huge list of numbers, maybe temperatures recorded every second for a year, or the pixel values of a large image. Doing math with standard Python lists can be quite slow for these large datasets. This is the problem NumPy, and specifically the `ndarray`, is designed to solve.
## What is an ndarray?
Think of an `ndarray` (which stands for N-dimensional array) as a powerful grid or table designed to hold items **of the same type**, usually numbers (like integers or decimals). It's the fundamental building block of NumPy.
* **Grid:** It can be a simple list (1-dimension), a table with rows and columns (2-dimensions), or even have more dimensions (3D, 4D, ... N-D).
* **Same Type:** This is key! Unlike Python lists that can hold anything (numbers, strings, objects), NumPy arrays require all elements to be of the *same data type* (e.g., all 32-bit integers or all 64-bit floating-point numbers). This restriction allows NumPy to store and operate on the data extremely efficiently. We'll explore data types more in [Chapter 2: dtype (Data Type Object)](02_dtype__data_type_object_.md).
Analogy: Think of a Python list as a drawer where you can throw anything in: socks, books, tools. An `ndarray` is like a specialized toolbox or an egg carton designed to hold only specific things (only tools, only eggs) in an organized way. This organization makes it much faster to work with.
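You can see the "same type" rule in action: when you build an array from mixed Python values, NumPy upcasts everything to one common type. A small illustrative sketch:

```python
import numpy as np

# Ints mixed with a float: everything is upcast to a single float64 type
mixed = np.array([1, 2, 3.0])
print(mixed)        # [1. 2. 3.]
print(mixed.dtype)  # float64
```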
Here's a quick peek at what different dimensional arrays look like conceptually:
```mermaid
flowchart LR
A[0] --> B[1] --> C[2] --> D[3]
```
```mermaid
flowchart LR
subgraph Row 1
R1C1[ R1C1 ] --> R1C2[ R1C2 ] --> R1C3[ R1C3 ]
end
subgraph Row 2
R2C1[ R2C1 ] --> R2C2[ R2C2 ] --> R2C3[ R2C3 ]
end
R1C1 -.-> R2C1
R1C2 -.-> R2C2
R1C3 -.-> R2C3
```
```mermaid
flowchart LR
subgraph Layer 1
L1R1C1[ L1R1C1 ] --> L1R1C2[ L1R1C2 ]
L1R2C1[ L1R2C1 ] --> L1R2C2[ L1R2C2 ]
L1R1C1 -.-> L1R2C1
L1R1C2 -.-> L1R2C2
end
subgraph Layer 2
L2R1C1[ L2R1C1 ] --> L2R1C2[ L2R1C2 ]
L2R2C1[ L2R2C1 ] --> L2R2C2[ L2R2C2 ]
L2R1C1 -.-> L2R2C1
L2R1C2 -.-> L2R2C2
end
L1R1C1 --- L2R1C1
L1R1C2 --- L2R1C2
L1R2C1 --- L2R2C1
L1R2C2 --- L2R2C2
```
## Why ndarrays? The Magic of Vectorization
Let's say you have two lists of numbers and you want to add them element by element. In standard Python, you'd use a loop:
```python
# Using standard Python lists
list1 = [1, 2, 3, 4]
list2 = [5, 6, 7, 8]
result = []
for i in range(len(list1)):
result.append(list1[i] + list2[i])
print(result)
# Output: [6, 8, 10, 12]
```
This works, but for millions of numbers, this Python loop becomes slow.
Now, see how you do it with NumPy ndarrays:
```python
import numpy as np # Standard way to import NumPy
array1 = np.array([1, 2, 3, 4])
array2 = np.array([5, 6, 7, 8])
# Add the arrays directly!
result_array = array1 + array2
print(result_array)
# Output: [ 6 8 10 12]
```
Notice how we just used `+` directly on the arrays? This is called **vectorization**. You write the operation as if you're working on single values, but NumPy applies it to *all* elements automatically.
**Why is this better?**
1. **Speed:** The looping happens behind the scenes in highly optimized C code, which is *much* faster than a Python loop.
2. **Readability:** The code is cleaner and looks more like standard mathematical notation.
This ability to perform operations on entire arrays at once is a core reason why NumPy is so powerful and widely used.
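To make the speed claim concrete, here is a quick, unscientific timing sketch; the exact numbers depend on your machine, but the vectorized version is typically orders of magnitude faster:

```python
import time
import numpy as np

n = 1_000_000
list1 = list(range(n))
list2 = list(range(n))
arr1 = np.arange(n)
arr2 = np.arange(n)

# Element-wise addition with a Python loop
t0 = time.perf_counter()
loop_result = [list1[i] + list2[i] for i in range(n)]
t_loop = time.perf_counter() - t0

# The same addition, vectorized
t0 = time.perf_counter()
vec_result = arr1 + arr2
t_vec = time.perf_counter() - t0

print(f"Python loop: {t_loop:.4f}s  NumPy: {t_vec:.4f}s")
```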
## Creating Your First ndarrays
Let's create some arrays. First, we always import NumPy, usually as `np`:
```python
import numpy as np
```
**1. From Python Lists:** The most common way is using `np.array()`:
```python
# Create a 1-dimensional array (vector)
my_list = [10, 20, 30]
arr1d = np.array(my_list)
print(arr1d)
# Output: [10 20 30]
# Create a 2-dimensional array (matrix/table)
my_nested_list = [[1, 2, 3], [4, 5, 6]]
arr2d = np.array(my_nested_list)
print(arr2d)
# Output:
# [[1 2 3]
# [4 5 6]]
```
`np.array()` takes your list (or list of lists) and converts it into an ndarray. NumPy tries to figure out the best data type automatically.
**2. Arrays of Zeros or Ones:** Often useful as placeholders.
```python
# Create an array of shape (2, 3) filled with zeros
zeros_arr = np.zeros((2, 3))
print(zeros_arr)
# Output:
# [[0. 0. 0.]
# [0. 0. 0.]]
# Create an array of shape (3,) filled with ones
ones_arr = np.ones(3)
print(ones_arr)
# Output: [1. 1. 1.]
```
Notice we pass a tuple like `(2, 3)` to specify the desired shape. By default, these are filled with floating-point numbers.
**3. Using `np.arange`:** Similar to Python's `range`.
```python
# Create an array with numbers from 0 up to (but not including) 5
range_arr = np.arange(5)
print(range_arr)
# Output: [0 1 2 3 4]
```
There are many other ways to create arrays, but these are fundamental.
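A few other standard creation helpers you will meet often, shown here as a quick sketch:

```python
import numpy as np

full_arr = np.full((2, 2), 7)    # shape (2, 2), every element is 7
lin = np.linspace(0.0, 1.0, 5)   # 5 evenly spaced values from 0 to 1
eye = np.eye(3)                  # 3x3 identity matrix
print(full_arr)
print(lin)  # [0.   0.25 0.5  0.75 1.  ]
print(eye)
```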
## Exploring Your ndarray: Basic Attributes
Once you have an array, you can easily check its properties:
```python
arr = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
# 1. Shape: The size of each dimension
print(f"Shape: {arr.shape}")
# Output: Shape: (2, 3) (2 rows, 3 columns)
# 2. Number of Dimensions (ndim): How many axes it has
print(f"Dimensions: {arr.ndim}")
# Output: Dimensions: 2
# 3. Size: Total number of elements
print(f"Size: {arr.size}")
# Output: Size: 6
# 4. Data Type (dtype): The type of elements in the array
print(f"Data Type: {arr.dtype}")
# Output: Data Type: float64
```
These attributes are crucial for understanding the structure of your data. The `dtype` tells you what kind of data is stored (e.g., `int32`, `float64`, `bool`). We'll dive much deeper into this in [Chapter 2: dtype (Data Type Object)](02_dtype__data_type_object_.md).
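Two related attributes are handy when reasoning about memory; a small sketch using the same array:

```python
import numpy as np

arr = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(arr.itemsize)  # 8: bytes per element for float64
print(arr.nbytes)    # 48: total bytes = size * itemsize = 6 * 8
```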
## A Glimpse Under the Hood
So, how does NumPy achieve its speed? The `ndarray` you manipulate in Python is actually a clever wrapper around a highly efficient data structure implemented in the **C programming language**.
When you perform an operation like `array1 + array2`, Python doesn't slowly loop through the elements. Instead, NumPy:
1. Checks if the operation is valid (e.g., arrays are compatible).
2. Hands off the arrays and the operation (`+` in this case) to its underlying C code.
3. The C code, which is pre-compiled and highly optimized for your processor, performs the addition very rapidly across the entire block of memory holding the array data.
4. The result (another block of memory) is then wrapped back into a new Python `ndarray` object for you to use.
Here's a simplified view of what happens when you call `np.array()`:
```mermaid
sequenceDiagram
participant P as Python Code (Your script)
participant NPF as NumPy Python Function (e.g., np.array)
participant CF as C Function (in _multiarray_umath)
participant M as Memory
P->>NPF: np.array([1, 2, 3])
NPF->>CF: Call C implementation with list data
CF->>M: Allocate contiguous memory block
CF->>M: Copy data [1, 2, 3] into block
CF-->>NPF: Return C-level ndarray structure pointing to memory
NPF-->>P: Return Python ndarray object wrapping the C structure
```
The core implementation lives within compiled C extension modules, primarily `_multiarray_umath`. Python files like `numpy/core/multiarray.py` and `numpy/core/numeric.py` provide the convenient Python functions (`np.array`, `np.zeros`, etc.) that eventually call this fast C code. You can see how `numeric.py` imports functions from `multiarray`:
```python
# From numpy/core/numeric.py - Simplified
from . import multiarray
from .multiarray import (
arange, array, asarray, asanyarray, # <-- Python functions defined here
empty, empty_like, zeros # <-- More functions
# ... many others ...
)
# The `array` function seen in multiarray.py is often a wrapper
# that calls the actual C implementation.
```
This setup gives you the ease of Python with the speed of C. The `ndarray` object itself stores metadata (like shape, dtype, strides) and a pointer to the actual raw data block in memory. We will see more details about the Python modules involved in [Chapter 6: multiarray Module](06_multiarray_module.md) and [Chapter 7: umath Module](07_umath_module.md).
## Conclusion
You've met the `ndarray`, the heart of NumPy! You learned:
* It's a powerful, efficient grid for storing elements of the **same type**.
* It enables **vectorization**, allowing fast operations on entire arrays without explicit Python loops.
* How to create basic arrays using `np.array`, `np.zeros`, `np.ones`, and `np.arange`.
* How to check key properties like `shape`, `ndim`, `size`, and `dtype`.
* That the speed comes from an underlying **C implementation**.
The `ndarray` is the container. Now, let's look more closely at *what* it contains: the different types of data it can hold.
Ready to learn about data types? Let's move on to [Chapter 2: dtype (Data Type Object)](02_dtype__data_type_object_.md).
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

# Chapter 2: dtype (Data Type Object)
In [Chapter 1: ndarray (N-dimensional array)](01_ndarray__n_dimensional_array_.md), we learned that NumPy's `ndarray` is a powerful grid designed to hold items **of the same type**. This "same type" requirement is fundamental to NumPy's speed and efficiency. But how does NumPy know *what kind* of data it's storing? That's where the `dtype` comes in!
## What Problem Does `dtype` Solve?
Imagine you have a list of numbers in Python: `[1, 2, 3]`. Are these small integers? Big integers? Numbers with decimal points? Python figures this out on the fly, which is flexible but can be slow for large datasets.
NumPy needs to be much faster. To achieve speed, it needs to know *exactly* what kind of data is in an array *before* doing any calculations. Is it a tiny integer that fits in 1 byte? A standard integer using 4 bytes? A decimal number needing 8 bytes?
Knowing the exact type and size allows NumPy to:
1. **Allocate Memory Efficiently:** If you have a million small integers, NumPy can reserve exactly the right amount of memory, not wasting space.
2. **Perform Fast Math:** NumPy can use highly optimized, low-level C or Fortran code that works directly with specific number types (like 32-bit integers or 64-bit floats). These low-level operations are much faster than Python's flexible number handling.
Think of it like packing boxes. If you know you're only packing small screws (like `int8`), you can use small, efficiently packed boxes. If you're packing large bolts (`int64`), you need bigger boxes. If you just have a mixed bag (like a Python list), you need a much larger, less efficient container to hold everything. The `dtype` is the label on the box telling NumPy exactly what's inside.
## What is a `dtype` (Data Type Object)?
A `dtype` is a special **object** in NumPy that describes the **type** and **size** of data stored in an `ndarray`. Every `ndarray` has a `dtype` associated with it.
It's like specifying the "column type" in a database or spreadsheet. If you set a column to "Integer", you expect only whole numbers in that column. If you set it to "Decimal", you expect numbers with potential decimal points. Similarly, the `dtype` ensures all elements in a NumPy array are consistent.
Let's see it in action. Remember from Chapter 1 how we could check the attributes of an array?
```python
import numpy as np
# Create an array of integers
int_array = np.array([1, 2, 3])
print(f"Integer array: {int_array}")
print(f"Data type: {int_array.dtype}")
# Create an array of floating-point numbers (decimals)
float_array = np.array([1.0, 2.5, 3.14])
print(f"\nFloat array: {float_array}")
print(f"Data type: {float_array.dtype}")
# Create an array of booleans (True/False)
bool_array = np.array([True, False, True])
print(f"\nBoolean array: {bool_array}")
print(f"Data type: {bool_array.dtype}")
```
**Output:**
```
Integer array: [1 2 3]
Data type: int64
Float array: [1. 2.5 3.14]
Data type: float64
Boolean array: [ True False True]
Data type: bool
```
Look at the `Data type:` lines.
* For `int_array`, NumPy chose `int64` (the default integer type on most 64-bit systems). This means each element is a 64-bit signed integer (a whole number that can be positive or negative, stored using 64 bits, or 8 bytes). The `64` tells us the size.
* For `float_array`, NumPy chose `float64`. Each element is a 64-bit floating-point number (a number with a potential decimal point, following the standard IEEE 754 format, stored using 64 bits or 8 bytes).
* For `bool_array`, NumPy chose `bool`. Each element is a boolean value (True or False), typically stored using 1 byte.
The `dtype` object holds this crucial information.
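The information a `dtype` carries is visible from Python as attributes on the object itself; a quick sketch:

```python
import numpy as np

dt = np.dtype(np.float64)
print(dt.kind)       # 'f': floating point
print(dt.itemsize)   # 8: bytes per element
print(dt.name)       # 'float64'
print(dt.byteorder)  # '=': native byte order
```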
## Specifying the `dtype`
NumPy usually makes a good guess about the `dtype` when you create an array from a list. But sometimes you need to be explicit, especially if you want to save memory or ensure a specific precision.
You can specify the `dtype` when creating an array using the `dtype` argument:
```python
import numpy as np
# Create an array, specifying 32-bit integers
arr_i32 = np.array([1, 2, 3], dtype=np.int32)
print(f"Array: {arr_i32}")
print(f"Data type: {arr_i32.dtype}")
print(f"Bytes per element: {arr_i32.itemsize}") # itemsize shows bytes
# Create an array, specifying 32-bit floats
arr_f32 = np.array([1, 2, 3], dtype=np.float32)
print(f"\nArray: {arr_f32}") # Notice the decimal points now!
print(f"Data type: {arr_f32.dtype}")
print(f"Bytes per element: {arr_f32.itemsize}")
# Create an array using string codes for dtype
arr_f64_str = np.array([4, 5, 6], dtype='float64') # Equivalent to np.float64
print(f"\nArray: {arr_f64_str}")
print(f"Data type: {arr_f64_str.dtype}")
print(f"Bytes per element: {arr_f64_str.itemsize}")
```
**Output:**
```
Array: [1 2 3]
Data type: int32
Bytes per element: 4
Array: [1. 2. 3.]
Data type: float32
Bytes per element: 4
Array: [4. 5. 6.]
Data type: float64
Bytes per element: 8
```
Notice a few things:
1. We used `np.int32` and `np.float32` to explicitly ask for 32-bit types.
2. The `.itemsize` attribute shows how many *bytes* each element takes. `int32` and `float32` use 4 bytes, while `float64` uses 8 bytes. Choosing `int32` instead of the default `int64` uses half the memory!
3. You can use string codes like `'float64'` (or `'f8'`) instead of the type object `np.float64`.
### Common Data Type Codes
NumPy offers various ways to specify dtypes. Here are the most common:
| Type Category | NumPy Type Objects | String Codes (Common) | Description |
| :----------------- | :------------------------- | :-------------------- | :-------------------------------- |
| **Boolean** | `np.bool_` | `'?'` or `'bool'` | True / False |
| **Signed Integer** | `np.int8`, `np.int16`, `np.int32`, `np.int64` | `'i1'`, `'i2'`, `'i4'`, `'i8'` | Whole numbers (positive/negative) |
| **Unsigned Int** | `np.uint8`, `np.uint16`, `np.uint32`, `np.uint64` | `'u1'`, `'u2'`, `'u4'`, `'u8'` | Whole numbers (non-negative) |
| **Floating Point** | `np.float16`, `np.float32`, `np.float64` | `'f2'`, `'f4'`, `'f8'` | Decimal numbers |
| **Complex Float** | `np.complex64`, `np.complex128` | `'c8'`, `'c16'` | Complex numbers (real+imaginary) |
| **String (Fixed)** | `np.bytes_` | `'S'` + number | Fixed-length byte strings |
| **Unicode (Fixed)**| `np.str_` | `'U'` + number | Fixed-length unicode strings |
| **Object** | `np.object_` | `'O'` | Python objects |
| **Datetime** | `np.datetime64` | `'M8'` + unit | Date and time values |
| **Timedelta** | `np.timedelta64` | `'m8'` + unit | Time durations |
* The numbers in the string codes (`i4`, `f8`, `u2`) usually represent the number of **bytes**. So `i4` = 4-byte integer (`int32`), `f8` = 8-byte float (`float64`).
* `'S'` and `'U'` often need a number after them (e.g., `'S10'`, `'U25'`) to specify the maximum length of the string.
* `'M8'` and `'m8'` usually have a unit like `[D]` for day or `[s]` for second (e.g., `'M8[D]'`). We'll explore numeric types more in [Chapter 4: Numeric Types (`numerictypes`)](04_numeric_types___numerictypes__.md).
Using explicit dtypes is important when:
* You need to control memory usage (e.g., using `int8` if your numbers are always small).
* You are reading data from a file that has a specific binary format.
* You need a specific precision for calculations.
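Here's what the memory-saving point looks like in practice, along with a caution: converting to a smaller type with `astype` can silently wrap values that don't fit (illustrative sketch):

```python
import numpy as np

big = np.ones(1_000_000, dtype=np.int64)
small = big.astype(np.int8)   # every value here fits in a single byte

print(big.nbytes)    # 8000000
print(small.nbytes)  # 1000000 -- one eighth of the memory

# Careful: values outside the target range wrap around silently
overflow = np.array([300], dtype=np.int64).astype(np.int8)
print(overflow)      # [44], because 300 doesn't fit in int8
```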
## A Glimpse Under the Hood
How does NumPy manage this `dtype` information internally?
The Python `dtype` object you interact with (like `arr.dtype`) is essentially a wrapper around more detailed information stored in a C structure within NumPy's core. This C structure (often referred to as `PyArray_Descr`) contains everything NumPy needs to know to interpret the raw bytes in the `ndarray`'s memory block:
1. **Type Kind:** Is it an integer, float, boolean, string, etc.? (Represented by a character like `'i'`, `'f'`, `'b'`, `'S'`).
2. **Item Size:** How many bytes does one element occupy? (e.g., 1, 2, 4, 8).
3. **Byte Order:** How are multi-byte numbers stored? (Little-endian `<` or Big-endian `>`. Important for reading files created on different types of computers).
4. **Element Type:** A pointer to the specific C-level functions that know how to operate on this data type.
5. **Fields (for Structured Types):** If it's a structured dtype (like a C struct or a database row), information about the names, dtypes, and offsets of each field.
6. **Subarray (for Nested Types):** Information if the dtype itself represents an array.
When you create an array or perform an operation:
```mermaid
sequenceDiagram
participant P as Python Code (Your script)
participant NPF as NumPy Python Func (e.g., np.array)
participant C_API as NumPy C API
participant DTypeC as C Struct (PyArray_Descr)
participant Mem as Memory
P->>NPF: np.array([1, 2], dtype='int32')
NPF->>C_API: Parse dtype string 'int32'
C_API->>DTypeC: Create/Find PyArray_Descr for int32 (kind='i', itemsize=4, etc.)
C_API->>Mem: Allocate memory (2 items * 4 bytes/item = 8 bytes)
C_API->>Mem: Copy data [1, 2] into memory as 32-bit ints
C_API-->>NPF: Return C ndarray struct (pointing to Mem and DTypeC)
NPF-->>P: Return Python ndarray object wrapping the C struct
```
The `dtype` is created or retrieved *once* and then referenced by potentially many arrays. This C-level description allows NumPy's core functions, especially the [ufunc (Universal Function)](03_ufunc__universal_function_.md)s we'll see next, to work directly on the raw memory with maximum efficiency.
The Python code in `numpy/core/_dtype.py` helps manage the creation and representation (like the nice string output you see when you `print(arr.dtype)`) of these `dtype` objects in Python. For instance, functions like `_kind_name`, `__str__`, and `__repr__` in `_dtype.py` are used to generate the user-friendly names and representations based on the underlying C structure's information. The `_dtype_ctypes.py` file helps bridge the gap between NumPy dtypes and Python's built-in `ctypes` module, allowing interoperability.
## Beyond Simple Numbers: Structured Data and Byte Order
`dtype`s can do more than just describe simple numbers:
* **Structured Arrays:** You can define a `dtype` that represents a mix of types, like a row in a table or a C struct. This is useful for representing structured data efficiently.
```python
# Define a structured dtype: a name (up to 10 chars) and an age (4-byte int)
person_dtype = np.dtype([('name', 'S10'), ('age', 'i4')])
people = np.array([('Alice', 30), ('Bob', 25)], dtype=person_dtype)
print(people)
print(people.dtype)
print(people[0]['name']) # Access fields by name
```
**Output:**
```
[(b'Alice', 30) (b'Bob', 25)]
[('name', 'S10'), ('age', '<i4')]
b'Alice'
```
* **Byte Order:** Computers can store multi-byte numbers in different ways ("endianness"). `dtype`s can specify byte order (`<` for little-endian, `>` for big-endian), which is crucial for reading binary data correctly across different systems. Notice the `'<i4'` in the output above: the `<` indicates little-endian, which is common on x86 processors.
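A small sketch of byte order in action: the same values can be stored little- or big-endian, comparisons still work across the two, but the raw bytes differ:

```python
import numpy as np

a_le = np.array([1, 2, 3], dtype='<i4')  # little-endian 32-bit ints
a_be = a_le.astype('>i4')                # big-endian copy of the same values

print(a_be.dtype)              # >i4
print((a_le == a_be).all())    # True: NumPy compares values, not raw bytes
print(a_le.tobytes()[:4])      # b'\x01\x00\x00\x00'
print(a_be.tobytes()[:4])      # b'\x00\x00\x00\x01': same number, bytes reversed
```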
## Conclusion
You've now learned about the `dtype` object, the crucial piece of metadata that tells NumPy *what kind* of data is stored in an `ndarray`. You saw:
* `dtype` describes the **type** and **size** of array elements.
* It's essential for NumPy's **memory efficiency** and **computational speed**.
* How to **inspect** (`arr.dtype`) and **specify** (`dtype=...`) data types using type objects (`np.int32`) or string codes (`'i4'`).
* That the Python `dtype` object represents lower-level C information (`PyArray_Descr`) used for efficient operations.
* `dtype`s can also handle more complex scenarios like **structured data** and **byte order**.
Understanding `dtype`s is key to understanding how NumPy manages data efficiently. With the container (`ndarray`) and its contents (`dtype`) defined, we can now explore how NumPy performs fast calculations on these arrays.
Next up, we'll dive into the workhorses of NumPy's element-wise computations: [Chapter 3: ufunc (Universal Function)](03_ufunc__universal_function_.md).

# Chapter 3: ufunc (Universal Function)
Welcome back! In [Chapter 1: ndarray (N-dimensional array)](01_ndarray__n_dimensional_array_.md), we met the `ndarray`, NumPy's powerful container for numerical data. In [Chapter 2: dtype (Data Type Object)](02_dtype__data_type_object_.md), we learned how `dtype`s specify the exact *kind* of data stored within those arrays.
Now, let's tackle a fundamental question: How does NumPy actually *perform calculations* on these arrays so quickly? If you have two large arrays, `a` and `b`, why is `a + b` massively faster than using a Python `for` loop? The answer lies in a special type of function: the **ufunc**.
## What Problem Do ufuncs Solve? Speeding Up Element-wise Math
Imagine you have temperature readings from a sensor stored in a NumPy array, and you need to convert them from Celsius to Fahrenheit. The formula is `F = C * 9/5 + 32`.
With standard Python lists, you'd loop through each temperature:
```python
# Celsius temperatures in a Python list
celsius_list = [0.0, 10.0, 20.0, 30.0, 100.0]
fahrenheit_list = []
# Python loop for conversion
for temp_c in celsius_list:
temp_f = temp_c * (9/5) + 32
fahrenheit_list.append(temp_f)
print(fahrenheit_list)
# Output: [32.0, 50.0, 68.0, 86.0, 212.0]
```
This works, but as we saw in Chapter 1, Python loops are relatively slow, especially for millions of data points.
NumPy offers a much faster way using its `ndarray` and vectorized operations:
```python
import numpy as np
# Celsius temperatures in a NumPy array
celsius_array = np.array([0.0, 10.0, 20.0, 30.0, 100.0])
# NumPy vectorized conversion - NO explicit Python loop!
fahrenheit_array = celsius_array * (9/5) + 32
print(fahrenheit_array)
# Output: [ 32. 50. 68. 86. 212.]
```
Look how clean that is! We just wrote the math formula directly using the array. But *how* does NumPy execute `*`, `/`, and `+` so efficiently on *every element* without a visible loop? This magic is powered by ufuncs.
## What is a ufunc (Universal Function)?
A **ufunc** (Universal Function) is a special type of function in NumPy designed to operate on `ndarray`s **element by element**. Think of them as super-powered mathematical functions specifically built for NumPy arrays.
Examples include `np.add`, `np.subtract`, `np.multiply`, `np.sin`, `np.cos`, `np.exp`, `np.sqrt`, `np.maximum`, `np.equal`, and many more.
**Key Features:**
1. **Element-wise Operation:** A ufunc applies the same operation independently to each element of the input array(s). When you do `np.add(a, b)`, it conceptually does `result[0] = a[0] + b[0]`, `result[1] = a[1] + b[1]`, and so on for all elements.
2. **Speed (Optimized C Loops):** This is the secret sauce! Ufuncs don't actually perform the element-wise operation using slow Python loops. Instead, they execute highly optimized, pre-compiled **C loops** under the hood. This C code can work directly with the raw data buffers of the arrays (remember, ndarrays store data contiguously), making the computations extremely fast.
* **Analogy:** Imagine you need to staple 1000 documents. A Python loop is like picking up the stapler, stapling one document, putting the stapler down, picking it up again, stapling the next... A ufunc is like using an industrial stapling machine that processes the entire stack almost instantly.
3. **Broadcasting Support:** Ufuncs automatically handle operations between arrays of different, but compatible, shapes. For example, you can add a single number (a scalar) to every element of an array, or add a 1D array to each row of a 2D array. The ufunc "stretches" or "broadcasts" the smaller array to match the shape of the larger one during the calculation. (We won't dive deep into broadcasting rules here, just know that ufuncs enable it).
4. **Type Casting:** Ufuncs can intelligently handle inputs with different [Chapter 2: dtype (Data Type Object)](02_dtype__data_type_object_.md)s. For instance, if you add an `int32` array and a `float64` array, the ufunc might decide to convert the integers to `float64` before performing the addition to avoid losing precision, returning a `float64` array. This happens according to well-defined casting rules.
5. **Optional Output Arrays (`out` argument):** You can tell a ufunc to place its result into an *existing* array instead of creating a new one. This can save memory, especially when working with very large arrays or inside loops.
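Broadcasting and type casting (features 3 and 4 above) are easy to see in a short sketch:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]], dtype=np.int32)

# Broadcasting a scalar across every element
print(a + 10)
# [[11 12 13]
#  [14 15 16]]

# Broadcasting a 1-D array across each row of the 2-D array
row = np.array([100, 200, 300])
print(a + row)
# [[101 202 303]
#  [104 205 306]]

# Type casting: int32 + float64 -> float64, preserving precision
b = np.array([0.5, 0.5, 0.5])
print((a + b).dtype)  # float64
```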
## Using ufuncs
You use ufuncs just like regular Python functions, but you pass NumPy arrays as arguments. Many common mathematical operators (`+`, `-`, `*`, `/`, `**`, `==`, `<`, etc.) also call ufuncs behind the scenes when used with NumPy arrays.
```python
import numpy as np
a = np.array([1, 2, 3, 4])
b = np.array([5, 0, 7, 2])
# Using the ufunc directly
c = np.add(a, b)
print(f"np.add(a, b) = {c}")
# Output: np.add(a, b) = [ 6 2 10 6]
# Using the corresponding operator (which calls np.add internally)
d = a + b
print(f"a + b = {d}")
# Output: a + b = [ 6 2 10 6]
# Other examples
print(f"np.maximum(a, b) = {np.maximum(a, b)}") # Element-wise maximum
# Output: np.maximum(a, b) = [5 2 7 4]
print(f"np.sin(a) = {np.sin(a)}") # Element-wise sine
# Output: np.sin(a) = [ 0.84147098 0.90929743 0.14112001 -0.7568025 ]
```
**Using the `out` Argument:**
Let's pre-allocate an array and tell the ufunc to use it for the result.
```python
import numpy as np
a = np.arange(5) # [0 1 2 3 4]
b = np.arange(5, 10) # [5 6 7 8 9]
# Create an empty array with the same shape and type
result = np.empty_like(a)
# Perform addition, storing the result in the 'result' array
np.add(a, b, out=result)
print(f"a = {a}")
print(f"b = {b}")
print(f"result (after np.add(a, b, out=result)) = {result}")
# Output:
# a = [0 1 2 3 4]
# b = [5 6 7 8 9]
# result (after np.add(a, b, out=result)) = [ 5 7 9 11 13]
```
Instead of creating a *new* array for the sum, `np.add` placed the values directly into `result`.
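Beyond being called directly, ufuncs also carry a family of methods such as `reduce`, `accumulate`, and `outer`; array methods like `.sum()` build on these reduction capabilities. A quick sketch:

```python
import numpy as np

a = np.array([1, 2, 3, 4])

print(np.add.reduce(a))      # 10: same result as a.sum()
print(np.add.accumulate(a))  # [ 1  3  6 10]: running totals
print(np.multiply.outer([1, 2], [10, 20]))
# [[10 20]
#  [20 40]]
```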
## A Glimpse Under the Hood
So, what happens internally when you call, say, `np.add(array1, array2)`?
1. **Identify Ufunc:** NumPy recognizes `np.add` as a specific ufunc object. This object holds metadata about the operation (like its name, number of inputs/outputs, identity element if any, etc.).
2. **Check Dtypes:** NumPy inspects the `dtype` of `array1` and `array2` (e.g., `int32`, `float64`). This uses the `dtype` information we learned about in [Chapter 2: dtype (Data Type Object)](02_dtype__data_type_object_.md).
3. **Find the Loop:** The ufunc object contains an internal table (a list of "loops"). Each loop is a specific, pre-compiled C function designed to handle a particular combination of input/output `dtype`s (e.g., `int32 + int32 -> int32`, `float32 + float32 -> float32`, `int32 + float64 -> float64`). NumPy searches this table to find the most appropriate C function based on the input dtypes and casting rules. It might need to select a loop that involves converting one or both inputs to a common, safer type (type casting).
4. **Check Broadcasting:** NumPy checks if the shapes of `array1` and `array2` are compatible according to broadcasting rules. If they are compatible but different, it calculates how to "stretch" the smaller array's dimensions virtually.
5. **Allocate Output:** If the `out` argument wasn't provided, NumPy allocates a new block of memory for the result array, determining its shape (based on broadcasting) and `dtype` (based on the chosen loop).
6. **Execute C Loop:** NumPy calls the selected C function. This function iterates through the elements of the input arrays (using pointers to their raw memory locations, respecting broadcasting rules) and performs the addition, storing the result in the output array's memory. This loop is *very* fast because it's simple, compiled C code operating on primitive types.
7. **Return ndarray:** NumPy wraps the output memory block (either the newly allocated one or the one provided via `out`) into a new Python `ndarray` object ([Chapter 1: ndarray (N-dimensional array)](01_ndarray__n_dimensional_array_.md)) with the correct `shape`, `dtype`, etc., and returns it to your Python code.
Here's a simplified sequence diagram:
```mermaid
sequenceDiagram
participant P as Python Code
participant UFunc as np.add (Ufunc Object)
participant C_API as NumPy C Core (Ufunc Machinery)
participant C_Loop as Specific C Loop (e.g., int32_add)
participant Mem as Memory
P->>UFunc: np.add(arr1, arr2)
UFunc->>C_API: Request execution
C_API->>C_API: Check dtypes (arr1.dtype, arr2.dtype)
C_API->>UFunc: Find appropriate C loop (e.g., int32_add)
C_API->>C_API: Check broadcasting rules
C_API->>Mem: Allocate memory for result (if no 'out')
C_API->>C_Loop: Execute C loop(arr1_data, arr2_data, result_data)
C_Loop->>Mem: Read inputs, Compute, Write output
C_Loop-->>C_API: Signal completion
C_API->>Mem: Wrap result memory in ndarray object
C_API-->>P: Return result ndarray
```
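You can inspect some of this machinery from Python: each ufunc publishes the dtype signatures of its compiled loops, and `np.result_type` shows which common type the casting rules would pick. (A sketch; the exact loop list varies between NumPy versions.)

```python
import numpy as np

print(np.add.nin, np.add.nout)  # 2 1: two inputs, one output
print(np.add.types[:4])         # dtype signatures of some compiled C loops
print('dd->d' in np.add.types)  # True: a float64 + float64 -> float64 loop exists
print(np.result_type(np.int32, np.float64))  # float64
```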
**Where is the Code?**
* The ufunc objects themselves are typically defined in C, often generated by helper scripts like `numpy/core/code_generators/generate_umath.py`. This script reads definitions (like those in the `defdict` variable within the script) specifying the ufunc's name, inputs, outputs, and the C functions to use for different type combinations.
```python
# Snippet from generate_umath.py's defdict for 'add'
'add':
Ufunc(2, 1, Zero, # nin=2, nout=1, identity=0
docstrings.get('numpy._core.umath.add'),
'PyUFunc_AdditionTypeResolver', # Function for type resolution
TD('?', cfunc_alias='logical_or', ...), # Loop for bools
TD(no_bool_times_obj, dispatch=[...]), # Loops for numeric types
# ... loops for datetime, object ...
indexed=intfltcmplx # Types supporting indexed access
),
```
* The Python functions you call (like `numpy.add`) are often thin wrappers defined in places like `numpy/core/umath.py` or `numpy/core/numeric.py`. These Python functions essentially just retrieve the corresponding C ufunc object and trigger its execution mechanism.
* The core C machinery for handling ufunc dispatch (finding the right loop), broadcasting, and executing the loops resides within the compiled `_multiarray_umath` C extension module. We'll touch upon these modules in [Chapter 6: multiarray Module](06_multiarray_module.md) and [Chapter 7: umath Module](07_umath_module.md).
* Helper Python modules like `numpy/core/_methods.py` provide Python implementations for array methods (like `.sum()`, `.mean()`, `.max()`) which often leverage the underlying ufunc's reduction capabilities.
* Error handling during ufunc execution (e.g., division by zero, invalid operations) can be configured using functions like `seterr` defined in `numpy/core/_ufunc_config.py`, and specific exception types like `UFuncTypeError` from `numpy/core/_exceptions.py` might be raised if things go wrong (e.g., no suitable loop found for the input types).
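The error-handling configuration mentioned above is most conveniently used through `np.errstate`, a context-manager wrapper over `seterr`; a short sketch:

```python
import numpy as np

num = np.array([1.0, 1.0])
den = np.array([1.0, 0.0])

# Silence the divide-by-zero warning; the result is still inf
with np.errstate(divide='ignore'):
    result = num / den
print(result)  # [ 1. inf]

# Or turn the warning into an exception instead
try:
    with np.errstate(divide='raise'):
        num / den
except FloatingPointError as exc:
    print("Caught:", exc)
```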
## Conclusion
Ufuncs are the powerhouses behind NumPy's speed for element-wise operations. You've learned:
* They perform operations **element by element** on arrays.
* Their speed comes from executing optimized **C loops**, avoiding slow Python loops.
* They support **broadcasting** (handling compatible shapes) and **type casting** (handling different dtypes).
* You can use them directly (`np.add(a, b)`) or often via operators (`a + b`).
* The `out` argument allows reusing existing arrays, saving memory.
* Internally, NumPy finds the right C loop based on input dtypes, handles broadcasting, executes the loop, and returns a new ndarray.
Now that we understand how basic element-wise operations work, let's delve deeper into the different kinds of numbers NumPy works with.
Next up: [Chapter 4: Numeric Types (`numerictypes`)](04_numeric_types___numerictypes__.md).
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

# Chapter 4: Numeric Types (`numerictypes`)
Hello again! In [Chapter 3: ufunc (Universal Function)](03_ufunc__universal_function_.md), we saw how NumPy uses universal functions (`ufuncs`) to perform fast calculations on arrays. We learned that these `ufuncs` operate element by element and can handle different data types using optimized C loops.
But what exactly *are* all the different data types that NumPy knows about? We touched on `dtype` objects in [Chapter 2: dtype (Data Type Object)](02_dtype__data_type_object_.md), which *describe* the type of data in an array (like '64-bit integer' or '32-bit float'). Now, we'll look at the actual **types themselves**: the specific building blocks like `numpy.int32`, `numpy.float64`, etc., and how they relate to each other. This collection and classification system is handled within the `numerictypes` concept in NumPy's core.
## What Problem Do `numerictypes` Solve? Organizing the Data Ingredients
Imagine you're organizing a huge pantry. You have different kinds of items: grains, spices, canned goods, etc. Within grains, you have rice, oats, quinoa. Within rice, you might have basmati, jasmine, brown rice.
NumPy's data types are similar. It has many specific types of numbers (`int8`, `int16`, `int32`, `int64`, `float16`, `float32`, `float64`, etc.) and other kinds of data (`bool`, `complex`, `datetime`). Just having a list of all these types isn't very organized.
We need a system to:
1. **Define** each specific type precisely (e.g., what exactly is `np.int32`?).
2. **Group** similar types together (e.g., all integers, all floating-point numbers).
3. **Establish relationships** between types (e.g., know that an `int32` *is a kind of* `integer`, which *is a kind of* `number`).
4. Provide convenient **shortcuts or aliases** (e.g., maybe `np.double` is just another name for `np.float64`).
The `numerictypes` concept in NumPy provides this structured catalog or "family tree" for all its scalar data types. It helps NumPy (and you!) understand how different data types are related, which is crucial for operations like choosing the right `ufunc` loop or deciding the output type of a calculation (type promotion).
## What are Numeric Types (`numerictypes`)?
In NumPy, `numerictypes` refers to the collection of **scalar type objects** themselves (like the Python classes `numpy.int32`, `numpy.float64`, `numpy.bool_`) and the **hierarchy** that organizes them.
Think back to the `dtype` object from Chapter 2. The `dtype` object *describes* the data type of an array. The actual type it's describing *is* one of these numeric types (or more accurately, a scalar type, since it includes non-numbers like `bool_` and `str_`).
```python
import numpy as np
# Create an array of 32-bit integers
arr = np.array([10, 20, 30], dtype=np.int32)
# The dtype object describes the type
print(f"Array's dtype object: {arr.dtype}")
# Output: Array's dtype object: int32
# The actual Python type of elements (if accessed individually)
# and the type referred to by the dtype object's `.type` attribute
print(f"The element type class: {arr.dtype.type}")
# Output: The element type class: <class 'numpy.int32'>
# This <class 'numpy.int32'> is one of NumPy's scalar types
# managed under the numerictypes concept.
```
So, `numerictypes` defines the actual classes like `np.int32`, `np.float64`, `np.integer`, `np.floating`, etc., that form the basis of NumPy's type system.
## The Type Hierarchy: A Family Tree
NumPy organizes its scalar types into a hierarchy, much like biological classification (Kingdom > Phylum > Class > Order...). This helps group related types.
At the top is `np.generic`, the base class for all NumPy scalars. Below that, major branches include `np.number`, `np.flexible`, `np.bool_`, etc.
Here's a simplified view of the *numeric* part of the hierarchy:
```mermaid
graph TD
N[np.number] --> I[np.integer]
N --> IX[np.inexact]
I --> SI[np.signedinteger]
I --> UI[np.unsignedinteger]
IX --> F[np.floating]
IX --> C[np.complexfloating]
SI --> i8[np.int8]
SI --> i16[np.int16]
SI --> i32[np.int32]
SI --> i64[np.int64]
SI --> ip[np.intp]
SI --> dots_i[...]
UI --> u8[np.uint8]
UI --> u16[np.uint16]
UI --> u32[np.uint32]
UI --> u64[np.uint64]
UI --> up[np.uintp]
UI --> dots_u[...]
F --> f16[np.float16]
F --> f32[np.float32]
F --> f64[np.float64]
F --> fld[np.longdouble]
F --> dots_f[...]
C --> c64[np.complex64]
C --> c128[np.complex128]
C --> cld[np.clongdouble]
C --> dots_c[...]
%% Styling for clarity
classDef abstract fill:#f9f,stroke:#333,stroke-width:2px;
class N,I,IX,SI,UI,F,C abstract;
```
* **Abstract Types:** Boxes like `np.number`, `np.integer`, `np.floating` represent *categories* or abstract base classes. You usually don't create arrays directly of type `np.integer`, but you can use these categories to check if a specific type belongs to that group.
* **Concrete Types:** Boxes like `np.int32`, `np.float64`, `np.complex128` are the specific, concrete types that you typically use to create arrays. They inherit from the abstract types. For example, `np.int32` is a subclass of `np.signedinteger`, which is a subclass of `np.integer`, which is a subclass of `np.number`.
You can check these relationships using `np.issubdtype` or Python's built-in `issubclass`:
```python
import numpy as np
# Is np.int32 a kind of integer?
print(f"issubdtype(np.int32, np.integer): {np.issubdtype(np.int32, np.integer)}")
# Output: issubdtype(np.int32, np.integer): True
# Is np.float64 a kind of integer?
print(f"issubdtype(np.float64, np.integer): {np.issubdtype(np.float64, np.integer)}")
# Output: issubdtype(np.float64, np.integer): False
# Is np.float64 a kind of number?
print(f"issubdtype(np.float64, np.number): {np.issubdtype(np.float64, np.number)}")
# Output: issubdtype(np.float64, np.number): True
# Using issubclass directly on the types also works
print(f"issubclass(np.int32, np.integer): {issubclass(np.int32, np.integer)}")
# Output: issubclass(np.int32, np.integer): True
```
This hierarchy is useful for understanding how NumPy treats different types, especially during calculations where types might need to be promoted (e.g., adding an `int32` and a `float64` usually results in a `float64`).
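You can query these promotion rules directly with `np.result_type`, which applies them without doing any arithmetic:

```python
import numpy as np

# Ask NumPy what type a mixed operation would produce
print(np.result_type(np.int32, np.float64))  # float64
print(np.result_type(np.int8, np.int16))     # int16

# The same rule governs actual arithmetic
mixed = np.array([1], dtype=np.int32) + np.array([1.0], dtype=np.float64)
print(mixed.dtype)  # float64
```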
## Common Types and Aliases
While NumPy defines many specific types (like `np.int8`, `np.uint16`, `np.float16`), you'll most often encounter these:
* **Integers:** `np.int32`, `np.int64` (default on 64-bit systems is usually `np.int64`)
* **Unsigned Integers:** `np.uint8` (common for images), `np.uint32`, `np.uint64`
* **Floats:** `np.float32` (single precision), `np.float64` (double precision, usually the default)
* **Complex:** `np.complex64`, `np.complex128`
* **Boolean:** `np.bool_` (True/False)
NumPy also provides several **aliases** or alternative names for convenience or historical reasons. Some common ones:
* `np.byte` is an alias for `np.int8`
* `np.short` is an alias for `np.int16`
* `np.intc` often corresponds to the C `int` type (usually `np.int32` or `np.int64`)
* `np.int_` is the default integer type. Platform dependent: often `np.int64` on 64-bit Linux and macOS, but `np.int32` on Windows and on 32-bit systems.
* `np.single` is an alias for `np.float32`
* `np.double` or `np.float_` is an alias for `np.float64` (matches Python's `float`)
* `np.longdouble` corresponds to the C `long double` (size varies by platform)
* `np.csingle` is an alias for `np.complex64`
* `np.cdouble` or `np.complex_` is an alias for `np.complex128` (matches Python's `complex`)
You can usually use the specific name (like `np.float64`) or an alias (like `np.double`) interchangeably when specifying a `dtype`.
```python
import numpy as np
# Using the specific name
arr_f64 = np.array([1.0, 2.0], dtype=np.float64)
print(f"Type using np.float64: {arr_f64.dtype}")
# Output: Type using np.float64: float64
# Using an alias
arr_double = np.array([1.0, 2.0], dtype=np.double)
print(f"Type using np.double: {arr_double.dtype}")
# Output: Type using np.double: float64
# They refer to the same underlying type
print(f"Is np.float64 the same as np.double? {np.float64 is np.double}")
# Output: Is np.float64 the same as np.double? True
```
## A Glimpse Under the Hood
How does NumPy define all these types and their relationships? It's mostly done in Python code within the `numpy.core` submodule.
1. **Base C Types:** The fundamental types (like a 32-bit integer, a 64-bit float) are ultimately implemented in C as part of the [multiarray Module](06_multiarray_module.md).
2. **Python Class Definitions:** Python classes are defined for each scalar type (e.g., `class int32(signedinteger): ...`) in modules like `numpy/core/numerictypes.py`. These classes inherit from each other to create the hierarchy (e.g., `int32` inherits from `signedinteger`, which inherits from `integer`, etc.).
3. **Type Aliases:** Files like `numpy/core/_type_aliases.py` set up dictionaries (`sctypeDict`, `allTypes`) that map various names (including aliases like "double" or "int_") to the actual type objects (like `np.float64` or `np.intp`). This allows you to use different names when creating `dtype` objects.
4. **Registration:** The Python number types are also registered with Python's abstract base classes (`numbers.Integral`, `numbers.Real`, etc.) in `numerictypes.py` to improve interoperability with standard Python type checking.
5. **Documentation Generation:** Helper scripts like `numpy/core/_add_newdocs_scalars.py` use the type information and aliases to automatically generate parts of the documentation strings you see when you type `help(np.int32)`, making sure the aliases and platform specifics are correctly listed.
When you use a function like `np.issubdtype(np.int32, np.integer)`:
```mermaid
sequenceDiagram
participant P as Your Python Code
participant NPFunc as np.issubdtype
participant PyTypes as Python Type System
participant TypeHier as NumPy Type Hierarchy (in numerictypes.py)
P->>NPFunc: np.issubdtype(np.int32, np.integer)
NPFunc->>TypeHier: Get type object for np.int32
NPFunc->>TypeHier: Get type object for np.integer
NPFunc->>PyTypes: Ask: issubclass(np.int32_obj, np.integer_obj)?
PyTypes-->>NPFunc: Return True (based on class inheritance)
NPFunc-->>P: Return True
```
Essentially, `np.issubdtype` leverages Python's standard `issubclass` mechanism, applied to the hierarchy of type classes defined within `numerictypes`. The `_type_aliases.py` file plays a crucial role in making sure that string names or alias names used in `dtype` specifications resolve to the correct underlying type object before such checks happen.
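Because of this alias resolution, `np.issubdtype` also accepts string names and one-character type codes; they are converted to type objects before the subclass check runs:

```python
import numpy as np

print(np.issubdtype('float64', np.floating))      # True
print(np.issubdtype(np.dtype('d'), np.floating))  # True ('d' is the float64 type code)
print(np.issubdtype('int32', np.floating))        # False
```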
```python
# Simplified view from numpy/core/_type_aliases.py
# ... (definitions of actual types like np.int8, np.float64) ...
allTypes = {
'int8': np.int8,
'int16': np.int16,
# ...
'float64': np.float64,
# ...
'signedinteger': np.signedinteger, # Abstract type
'integer': np.integer, # Abstract type
'number': np.number, # Abstract type
# ... etc
}
_aliases = {
'double': 'float64', # "double" maps to the key "float64"
'int_': 'intp', # "int_" maps to the key "intp" (platform dependent type)
# ... etc
}
sctypeDict = {} # Dictionary mapping names/aliases to types
# Populate sctypeDict using allTypes and _aliases
# ... (code to merge these dictionaries) ...
# When you do np.dtype('double'), NumPy uses sctypeDict (or similar logic)
# to find that 'double' means np.float64.
```
This setup provides a flexible and organized way to manage NumPy's rich set of data types.
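A quick check that several spellings all resolve to the same underlying scalar type object:

```python
import numpy as np

print(np.dtype('double') == np.dtype('float64'))  # True
print(np.dtype('double').type is np.float64)      # True
print(np.dtype('=f8').type is np.float64)         # True (native-order 8-byte float)
```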
## Conclusion
You've now explored the world of NumPy's `numerictypes`! You learned:
* `numerictypes` define the actual scalar **type objects** (like `np.int32`) and their **relationships**.
* They form a **hierarchy** (like a family tree) with abstract categories (e.g., `np.integer`) and concrete types (e.g., `np.int32`).
* This hierarchy helps NumPy understand how types relate, useful for calculations and type checking (`np.issubdtype`).
* NumPy provides many convenient **aliases** (e.g., `np.double` for `np.float64`).
* The types, hierarchy, and aliases are managed within Python code in `numpy.core`, primarily `numerictypes.py` and `_type_aliases.py`.
Understanding this catalog of types helps clarify why NumPy behaves the way it does when mixing different kinds of numbers.
Now that we know about the arrays, their data types, the functions that operate on them, and the specific numeric types available, how does NumPy *show* us the results?
Let's move on to how NumPy displays arrays: [Chapter 5: Array Printing (`arrayprint`)](05_array_printing___arrayprint__.md).
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

# Chapter 5: Array Printing (`arrayprint`)
In the previous chapter, [Chapter 4: Numeric Types (`numerictypes`)](04_numeric_types___numerictypes__.md), we explored the different kinds of data NumPy can store in its arrays, like `int32`, `float64`, and more. Now that we know about the arrays ([`ndarray`](01_ndarray__n_dimensional_array_.md)), their data types ([`dtype`](02_dtype__data_type_object_.md)), the functions that operate on them ([`ufunc`](03_ufunc__universal_function_.md)), and the specific number types (`numerictypes`), a practical question arises: How do we actually *look* at these arrays, especially if they are very large?
## What Problem Does `arrayprint` Solve? Making Arrays Readable
Imagine you have a NumPy array representing a large image, maybe with millions of pixel values. Or perhaps you have simulation data with thousands of temperature readings.
```python
import numpy as np
# Imagine this is a huge array, maybe thousands of numbers
large_array = np.arange(2000)
# If Python just tried to print every single number...
# it would flood your screen and be impossible to read!
# print(list(large_array)) # <-- Don't run this! It would be too long.
```
If NumPy just dumped *all* the numbers onto your screen whenever you tried to display a large array, it would be overwhelming and useless. We need a way to show the array's contents in a concise, human-friendly format. How can we get a *sense* of the array's data without printing every single element?
This is the job of NumPy's **array printing** mechanism, often referred to internally by the name of its main Python module, `arrayprint`.
## What is Array Printing (`arrayprint`)?
`arrayprint` is NumPy's **"pretty printer"** for `ndarray` objects. It's responsible for converting a NumPy array into a nicely formatted string representation that's easy to read and understand when you display it (e.g., in your Python console, Jupyter notebook, or using the `print()` function).
Think of it like getting a summary report instead of the raw database dump. `arrayprint` intelligently decides how to show the array, considering things like:
* **Summarization:** For large arrays, it shows only the beginning and end elements, using ellipsis (`...`) to indicate the omitted parts.
* **Precision:** It controls how many decimal places are shown for floating-point numbers.
* **Line Wrapping:** It breaks long rows of data into multiple lines to fit within a certain width.
* **Special Values:** It uses consistent strings for "Not a Number" (`nan`) and infinity (`inf`).
* **Customization:** It allows you to change these settings to suit your needs.
Let's see it in action with our `large_array`:
```python
import numpy as np
large_array = np.arange(2000)
# Let NumPy's array printing handle it
print(large_array)
```
**Output:**
```
[ 0 1 2 ... 1997 1998 1999]
```
Instead of 2000 numbers flooding the screen, NumPy smartly printed only the first three and the last three, with `...` in between. This gives us a good idea of the array's contents (a sequence starting from 0) without being overwhelming.
## Key Features and Options
`arrayprint` has several options you can control to change how arrays are displayed.
### 1. Summarization (`threshold` and `edgeitems`)
* `threshold`: The total number of array elements that triggers summarization. If the array's `size` is greater than `threshold`, the array gets summarized. (Default: 1000)
* `edgeitems`: When summarizing, this is the number of items shown at the beginning and end of each dimension. (Default: 3)
Let's try printing a smaller array and then changing the threshold:
```python
import numpy as np
# An array with 10 elements
arr = np.arange(10)
print("Original:")
print(arr)
# Temporarily set the threshold lower (e.g., 5)
# We use np.printoptions as a context manager for temporary settings
with np.printoptions(threshold=5):
print("\nWith threshold=5:")
print(arr)
# Change edgeitems too
with np.printoptions(threshold=5, edgeitems=2):
print("\nWith threshold=5, edgeitems=2:")
print(arr)
```
**Output:**
```
Original:
[0 1 2 3 4 5 6 7 8 9]
With threshold=5:
[0 1 2 ... 7 8 9]
With threshold=5, edgeitems=2:
[0 1 ... 8 9]
```
You can see how lowering the `threshold` caused the array (size 10) to be summarized, and `edgeitems` controlled how many elements were shown at the ends.
### 2. Floating-Point Precision (`precision` and `suppress`)
* `precision`: Controls the number of digits displayed after the decimal point for floats. (Default: 8)
* `suppress`: If `True`, always prints floating-point numbers in fixed-point notation; values too small to display at the current precision are shown as zero instead of switching to scientific notation. (Default: False)
```python
import numpy as np
# An array with floating-point numbers
float_arr = np.array([0.123456789, 1.5e-10, 2.987])
print("Default precision:")
print(float_arr)
# Set precision to 3
with np.printoptions(precision=3):
print("\nWith precision=3:")
print(float_arr)
# Set precision to 3 and suppress small numbers
with np.printoptions(precision=3, suppress=True):
print("\nWith precision=3, suppress=True:")
print(float_arr)
```
**Output:**
```
Default precision:
[1.23456789e-01 1.50000000e-10 2.98700000e+00]
With precision=3:
[1.235e-01 1.500e-10 2.987e+00]
With precision=3, suppress=True:
[0.123 0. 2.987]
```
Notice how `precision` changed the rounding, and `suppress=True` made the very small number (`1.5e-10`) display as `0.` and switched from scientific notation to fixed-point for the others. There's also a `floatmode` option for more fine-grained control over float formatting (e.g., 'fixed', 'unique').
### 3. Line Width (`linewidth`)
* `linewidth`: The maximum number of characters allowed per line before wrapping. (Default: 75)
```python
import numpy as np
# A 2D array
arr2d = np.arange(12).reshape(3, 4) * 0.1
print("Default linewidth:")
print(arr2d)
# Set a narrow linewidth
with np.printoptions(linewidth=30):
print("\nWith linewidth=30:")
print(arr2d)
```
**Output:**
```
Default linewidth:
[[0. 0.1 0.2 0.3]
[0.4 0.5 0.6 0.7]
[0.8 0.9 1. 1.1]]
With linewidth=30:
[[0. 0.1 0.2 0.3]
[0.4 0.5 0.6 0.7]
[0.8 0.9 1. 1.1]]
```
*(Note: The output might not actually wrap here because the lines are short. If the array was wider, you'd see the rows break across multiple lines with the narrower `linewidth` setting.)*
### 4. Other Options
* `nanstr`: String representation for Not a Number. (Default: 'nan')
* `infstr`: String representation for Infinity. (Default: 'inf')
* `sign`: Control sign display for floats ('-', '+', or ' ').
* `formatter`: A dictionary to provide completely custom formatting functions for specific data types (like bool, int, float, datetime, etc.). This is more advanced.
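A brief sketch of the last two options. The `'float'` key in the `formatter` dictionary applies the given function to every floating-point element (including `nan` and `inf`, which is why the first example uses `nanstr`/`infstr` instead when only the special values should change):

```python
import numpy as np

arr = np.array([1.5, np.nan, np.inf, -0.25])

# Custom strings for special values only
with np.printoptions(nanstr='MISSING', infstr='INFINITY'):
    print(arr)

# A custom formatter applied to every float element
with np.printoptions(formatter={'float': lambda x: f'{x:8.2f}'}):
    print(arr)
```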
## Using and Customizing Array Printing
You usually interact with array printing implicitly just by displaying an array:
```python
import numpy as np
arr = np.linspace(0, 1, 5)
# These both use NumPy's array printing behind the scenes
print(arr) # Calls __str__ -> array_str -> array2string
arr # In interactive sessions, calls __repr__ -> array_repr -> array2string
```
To customize the output, you can use:
1. **`np.set_printoptions(...)`:** Sets options globally (for your entire Python session).
2. **`np.get_printoptions()`:** Returns a dictionary of the current settings.
3. **`np.printoptions(...)`:** A context manager to set options *temporarily* within a `with` block (as used in the examples above). This is often the preferred way to avoid changing settings permanently.
4. **`np.array2string(...)`:** A function to get the string representation directly, allowing you to override options just for that one call.
```python
import numpy as np
import sys # Needed for sys.maxsize
arr = np.random.rand(10, 10) * 1000
# --- Global Setting ---
print("--- Setting threshold globally ---")
original_options = np.get_printoptions() # Store original settings
np.set_printoptions(threshold=50)
print(arr)
np.set_printoptions(**original_options) # Restore original settings
# --- Temporary Setting (Context Manager) ---
print("\n--- Setting precision temporarily ---")
with np.printoptions(precision=2, suppress=True):
print(arr)
print("\n--- Back to default precision ---")
print(arr) # Options are automatically restored outside the 'with' block
# --- Direct Call with Overrides ---
print("\n--- Using array2string with summarization off ---")
# Use sys.maxsize to effectively disable summarization
arr_string = np.array2string(arr, threshold=sys.maxsize, precision=1)
# print(arr_string) # This might still be very long! Let's just print the first few lines
print('\n'.join(arr_string.splitlines()[:5]) + '\n...')
```
**Output (will vary due to random numbers):**
```
--- Setting threshold globally ---
[[992.84337197 931.73648142 119.68616987 ... 305.61919366 516.97897205
707.69140878]
[507.45895986 253.00740626 739.97091378 ... 755.69943511 813.11931119
19.84654589]
[941.25264871 689.43209981 820.11954711 ... 709.83933545 192.49837505
609.30358618]
...
[498.86686503 872.79555956 401.19333028 ... 552.97492858 303.59379464
308.61881807]
[797.51920685 427.86020151 783.2019203 ... 511.63382762 322.52764881
778.22766019]
[ 54.84391309 938.24403397 796.7431406 ... 495.90873227 267.16620292
409.51491904]]
--- Setting precision temporarily ---
[[992.84 931.74 119.69 ... 305.62 516.98 707.69]
[507.46 253.01 739.97 ... 755.7 813.12 19.85]
[941.25 689.43 820.12 ... 709.84 192.5 609.3 ]
...
[498.87 872.8 401.19 ... 552.97 303.59 308.62]
[797.52 427.86 783.2 ... 511.63 322.53 778.23]
[ 54.84 938.24 796.74 ... 495.91 267.17 409.51]]
--- Back to default precision ---
[[992.84337197 931.73648142 119.68616987 ... 305.61919366 516.97897205
707.69140878]
[507.45895986 253.00740626 739.97091378 ... 755.69943511 813.11931119
19.84654589]
[941.25264871 689.43209981 820.11954711 ... 709.83933545 192.49837505
609.30358618]
...
[498.86686503 872.79555956 401.19333028 ... 552.97492858 303.59379464
308.61881807]
[797.51920685 427.86020151 783.2019203 ... 511.63382762 322.52764881
778.22766019]
[ 54.84391309 938.24403397 796.7431406 ... 495.90873227 267.16620292
409.51491904]]
--- Using array2string with summarization off ---
[[992.8 931.7 119.7 922. 912.2 156.5 459.4 305.6 517. 707.7]
[507.5 253. 740. 640.3 420.3 652.1 197. 755.7 813.1 19.8]
[941.3 689.4 820.1 125.8 598.2 219.3 466.7 709.8 192.5 609.3]
[ 32. 855.2 362.1 434.9 133.5 148.1 522.6 725.1 395.5 377.9]
[332.7 782.2 587.3 320.3 905.5 412.8 378. 911.9 972.1 400.2]
...
```
## A Glimpse Under the Hood
What happens when you call `print(my_array)`?
1. Python calls the `__str__` method of the `ndarray` object.
2. NumPy's `ndarray.__str__` method typically calls the internal function `_array_str_implementation`.
3. `_array_str_implementation` checks for simple cases (like 0-dimensional arrays) and then calls the main workhorse: `array2string`.
4. **`array2string`** (defined in `numpy/core/arrayprint.py`) takes the array and any specified options (like `precision`, `threshold`, etc.). It also reads the current default print options (managed by `numpy/core/printoptions.py` using context variables).
5. It determines if the array needs **summarization** based on its `size` and the `threshold` option.
6. It figures out the **correct formatting function** for the array's `dtype` (e.g., `IntegerFormat`, `FloatingFormat`, `DatetimeFormat`). These formatters handle details like precision, sign, and scientific notation for individual elements. `FloatingFormat`, for example, might use the efficient `dragon4` algorithm (implemented in C) to convert floats to strings accurately.
7. It recursively processes the array's dimensions:
* For each element (or summarized chunk), it calls the chosen formatting function to get its string representation.
* It arranges these strings, adding separators (like spaces or commas) and brackets (`[` `]`).
* It checks the `linewidth` and inserts line breaks and indentation as needed.
* If summarizing, it inserts the ellipsis (`...`) string (`summary_insert`).
8. Finally, `array2string` returns the complete, formatted string representation of the array.
```mermaid
sequenceDiagram
participant User
participant Python as print() / REPL
participant NDArray as my_array object
participant ArrayPrint as numpy.core.arrayprint module
participant PrintOpts as numpy.core.printoptions module
User->>Python: print(my_array) or my_array
Python->>NDArray: call __str__ or __repr__
NDArray->>ArrayPrint: call array_str or array_repr
ArrayPrint->>ArrayPrint: call array2string(my_array, ...)
ArrayPrint->>PrintOpts: Get current print options (threshold, precision, etc.)
ArrayPrint->>ArrayPrint: Check size vs threshold -> Summarize?
ArrayPrint->>ArrayPrint: Select Formatter based on my_array.dtype
loop For each element/chunk
ArrayPrint->>ArrayPrint: Format element using Formatter
end
ArrayPrint->>ArrayPrint: Arrange strings, add brackets, wrap lines
ArrayPrint-->>NDArray: Return formatted string
NDArray-->>Python: Return formatted string
Python-->>User: Display formatted string
```
The core logic resides in `numpy/core/arrayprint.py`. This file contains `array2string`, `array_repr`, `array_str`, and various formatter classes (`FloatingFormat`, `IntegerFormat`, `BoolFormat`, `ComplexFloatingFormat`, `DatetimeFormat`, `TimedeltaFormat`, `StructuredVoidFormat`, etc.). The global print options themselves are managed using Python's `contextvars` in `numpy/core/printoptions.py`, allowing settings to be changed globally or temporarily within a context.
## Conclusion
You've now learned how NumPy takes potentially huge and complex arrays and turns them into readable string representations using its `arrayprint` mechanism. Key takeaways:
* `arrayprint` is NumPy's "pretty printer" for arrays.
* It uses **summarization** (`threshold`, `edgeitems`) for large arrays.
* It controls **formatting** (like `precision`, `suppress` for floats) and **layout** (`linewidth`).
* You can customize printing **globally** (`set_printoptions`), **temporarily** (`printoptions` context manager), or for **single calls** (`array2string`).
* The core logic resides in `numpy/core/arrayprint.py`, using formatters tailored to different dtypes and reading options from `numpy/core/printoptions.py`.
Understanding array printing helps you effectively inspect and share your NumPy data.
Next, we'll start looking at the specific C and Python modules that form the core of NumPy's implementation, beginning with the central [Chapter 6: multiarray Module](06_multiarray_module.md).
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

# Chapter 6: multiarray Module
Welcome back! In [Chapter 5: Array Printing (`arrayprint`)](05_array_printing___arrayprint__.md), we saw how NumPy takes complex arrays and presents them in a readable format. We've now covered the array container ([`ndarray`](01_ndarray__n_dimensional_array_.md)), its data types ([`dtype`](02_dtype__data_type_object_.md)), the functions that compute on them ([`ufunc`](03_ufunc__universal_function_.md)), the catalog of types ([`numerictypes`](04_numeric_types___numerictypes__.md)), and how arrays are displayed ([`arrayprint`](05_array_printing___arrayprint__.md)).
Now, let's peek deeper into the engine room. Where does the fundamental `ndarray` object *actually* come from? How are core operations like creating arrays or accessing elements implemented so efficiently? The answer lies largely within the C code associated with the concept of the `multiarray` module.
## What Problem Does `multiarray` Solve? Providing the Engine
Think about the very first step in using NumPy: creating an array.
```python
import numpy as np
# How does this seemingly simple line actually work?
my_array = np.array([1, 2, 3, 4, 5])
# How does NumPy know its shape? How is the data stored?
print(my_array)
print(my_array.shape)
```
When you execute `np.array()`, you're using a convenient Python function. But NumPy's speed doesn't come from Python itself. It comes from highly optimized code written in the C programming language. How do these Python functions connect to that fast C code? And where is that C code defined?
The `multiarray` concept represents this core C engine. It's the part of NumPy responsible for:
1. **Defining the `ndarray` object:** The very structure that holds your data, its shape, its data type ([`dtype`](02_dtype__data_type_object_.md)), and how it's laid out in memory.
2. **Implementing Fundamental Operations:** Providing the low-level C functions for creating arrays (like allocating memory), accessing elements (indexing), changing the view (slicing, reshaping), and basic mathematical operations.
Think of the Python functions like `np.array`, `np.zeros`, or accessing `arr.shape` as the dashboard and controls of a car. The `multiarray` C code is the powerful engine under the hood that actually makes the car move efficiently.
## What is the `multiarray` Module (Concept)?
Historically, `multiarray` was a distinct C extension module in NumPy. An "extension module" is a module written in C (or C++) that Python can import and use just like a regular Python module. This allows Python code to leverage the speed of C for performance-critical tasks.
More recently (since NumPy 1.16), the C code for `multiarray` was merged with the C code for the [ufunc (Universal Function)](03_ufunc__universal_function_.md) system (which we'll discuss more in [Chapter 7: umath Module](07_umath_module.md)) into a single, larger C extension module typically called `_multiarray_umath.cpython-*.so` (on Linux/Mac) or `_multiarray_umath.pyd` (on Windows).
Even though the C code is merged, the *concept* of `multiarray` remains important. It represents the C implementation layer that provides:
* The **`ndarray` object type** itself (`PyArrayObject` in C).
* The **C-API (Application Programming Interface)**: A set of C functions that can be called by other C extensions (and internally by NumPy's Python code) to work with `ndarray` objects. Examples include functions to create arrays from data, get the shape, get the data pointer, perform indexing, etc.
* Implementations of **core array functionalities**: array creation, data type handling ([`dtype`](02_dtype__data_type_object_.md)), memory layout management (strides), indexing, slicing, reshaping, transposing, and some basic operations.
The Python files you might see in the NumPy source code, like `numpy/core/multiarray.py` and `numpy/core/numeric.py`, often serve as Python wrappers. They provide the user-friendly Python functions (like `np.array`, `np.empty`, `np.dot`) that eventually call the fast C functions implemented within the `_multiarray_umath` extension module.
```python
# numpy/core/multiarray.py - Simplified Example
# This Python file imports directly from the C extension module
from . import _multiarray_umath # Import the compiled C module
from ._multiarray_umath import * # Make C functions available
# Functions like 'array', 'empty', 'dot' that you use via `np.`
# might be defined or re-exported here, ultimately calling C code.
# For example, the `array` function here might parse the Python input
# and then call a C function like `PyArray_NewFromDescr` from _multiarray_umath.
```
This structure gives you the flexibility and ease of Python on the surface, powered by the speed and efficiency of C underneath.
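You can even see this split from Python using the standard `inspect` module: functions implemented directly in C show up as builtins, while Python-level wrappers do not (a rough check; the exact wrapper types vary across NumPy versions):

```python
import inspect

import numpy as np

# np.array comes straight from the compiled C extension module
print(inspect.isbuiltin(np.array))  # True: a C-implemented function

# np.sum is a Python-level wrapper/dispatcher that eventually calls into C
print(inspect.isbuiltin(np.sum))    # False
```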
## A Glimpse Under the Hood: Creating an Array
Let's trace what happens when you call `my_array = np.array([1, 2, 3])`:
1. **Python Call:** You call the Python function `np.array`. This function likely lives in `numpy/core/numeric.py` or is exposed through `numpy/core/multiarray.py`.
2. **Argument Parsing:** The Python function examines the input `[1, 2, 3]`. It figures out the data type (likely `int64` by default on many systems) and the shape (which is `(3,)`).
3. **Call C-API Function:** The Python function calls a specific function within the compiled `_multiarray_umath` C extension module. This C function is designed to create a new array. A common one is `PyArray_NewFromDescr` or a related helper.
4. **Memory Allocation (C):** The C function asks the operating system for a block of memory large enough to hold 3 integers of the chosen type (e.g., 3 * 8 bytes = 24 bytes for `int64`).
5. **Data Copying (C):** The C function copies the values `1`, `2`, and `3` from the Python list into the newly allocated memory block.
6. **Create C `ndarray` Struct:** The C function creates an internal C structure (called `PyArrayObject`). This structure stores:
* A pointer to the actual data block in memory.
* Information about the data type ([`dtype`](02_dtype__data_type_object_.md)).
* The shape of the array (`(3,)`).
* The strides (how many bytes to jump to get to the next element in each dimension).
* Other metadata (like flags indicating if it owns the data, if it's writeable, etc.).
7. **Wrap in Python Object:** The C function wraps this internal `PyArrayObject` structure into a Python object that Python can understand: the `ndarray` object you interact with.
8. **Return to Python:** The C function returns this new Python `ndarray` object back to your Python code, which assigns it to the variable `my_array`.
Here's a simplified view of that flow:
```mermaid
sequenceDiagram
participant User as Your Python Script
participant PyFunc as NumPy Python Func (np.array)
participant C_API as C Code (_multiarray_umath)
participant Memory
User->>PyFunc: my_array = np.array([1, 2, 3])
PyFunc->>C_API: Call C function (e.g., PyArray_NewFromDescr) with list data, inferred dtype, shape
C_API->>Memory: Allocate memory block (e.g., 24 bytes for 3x int64)
C_API->>Memory: Copy data [1, 2, 3] into block
C_API->>C_API: Create internal C ndarray struct (PyArrayObject) pointing to data, storing shape=(3,), dtype=int64, etc.
C_API->>PyFunc: Return Python ndarray object wrapping the C struct
PyFunc-->>User: Assign returned ndarray object to `my_array`
```
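The metadata stored in the C `PyArrayObject` struct (data pointer, dtype, shape, strides, flags) is exposed as attributes on the resulting Python object, so you can inspect the outcome of this process directly (exact dtype and byte counts are platform-dependent; `int64` is typical on Linux/Mac):

```python
import numpy as np

my_array = np.array([1, 2, 3])
print(my_array.shape)             # (3,)
print(my_array.dtype)             # default integer type, e.g. int64
print(my_array.strides)           # (8,) when itemsize is 8 bytes
print(my_array.flags['OWNDATA'])  # True: this array owns its memory block
print(my_array.nbytes)            # 3 * itemsize, e.g. 24 bytes
```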
**Where is the Code?**
* **C Implementation:** The core logic is in C files compiled into the `_multiarray_umath` extension module (e.g., parts of `numpy/core/src/multiarray/`). Files like `alloc.c`, `ctors.c` (constructors), `getset.c` (for getting/setting attributes like shape), `item_selection.c` (indexing) contain relevant C code.
* **Python Wrappers:** `numpy/core/numeric.py` and `numpy/core/multiarray.py` provide many of the familiar Python functions. They import directly from `_multiarray_umath`.
```python
# From numpy/core/numeric.py - Simplified
from . import multiarray # Imports numpy/core/multiarray.py
# multiarray.py itself imports from _multiarray_umath
from .multiarray import (
array, asarray, zeros, empty, # Functions defined/re-exported
# ... many others ...
)
```
* **Initialization:** `numpy/core/__init__.py` helps set up the `numpy.core` namespace, importing from `multiarray` and `umath`.
```python
# From numpy/core/__init__.py - Simplified
from . import multiarray
from . import umath
# ... other imports ...
from . import numeric
from .numeric import * # Pulls in functions like np.array, np.zeros
# ... more setup ...
```
* **C API Definition:** Files like `numpy/core/include/numpy/multiarray.h` define the C structures (`PyArrayObject`) and function prototypes (`PyArray_NewFromDescr`, etc.) that make up the NumPy C-API. Code generators like `numpy/core/code_generators/generate_numpy_api.py` help create tables (`__multiarray_api.h`, `__multiarray_api.c`) that allow other C extensions to easily access these core NumPy C functions.
```python
# Snippet from numpy/core/code_generators/generate_numpy_api.py
# This script generates C code that defines an array of function pointers
# making up the C-API.
# Describes API functions, their index in the API table, return type, args...
multiarray_funcs = {
# ... many functions ...
'NewLikeArray': (10, None, 'PyObject *', (('PyArrayObject *', 'prototype'), ...)),
'NewFromDescr': (9, None, 'PyObject *', ...),
'Empty': (8, None, 'PyObject *', ...),
# ...
}
# ... code to generate C header (.h) and implementation (.c) files ...
# These generated files help expose the C functions consistently.
```
## Conclusion
You've now learned about the conceptual `multiarray` module, the C engine at the heart of NumPy.
* It's implemented in **C** (as part of the `_multiarray_umath` extension module) for maximum **speed and efficiency**.
* It provides the fundamental **`ndarray` object** structure.
* It implements **core array operations** like creation, memory management, indexing, and reshaping at a low level.
* Python modules like `numpy.core.numeric` and `numpy.core.multiarray` provide user-friendly interfaces that call this underlying C code.
* Understanding this separation helps explain *why* NumPy is so fast compared to standard Python lists for numerical tasks.
While `multiarray` provides the array structure and basic manipulation, the element-wise mathematical operations often rely on another closely related C implementation layer.
Let's explore that next in [Chapter 7: umath Module](07_umath_module.md).
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

# Chapter 7: umath Module
Welcome to Chapter 7! In [Chapter 6: multiarray Module](06_multiarray_module.md), we explored the core C engine that defines the `ndarray` object and handles fundamental operations like creating arrays and accessing elements. We saw that the actual power comes from C code.
But what about the mathematical operations themselves? When you perform `np.sin(my_array)` or `array1 + array2`, which part of the C engine handles the actual sine calculation or the addition for *every single element*? This is where the concept of the `umath` module comes in.
## What Problem Does `umath` Solve? Implementing Fast Array Math
Remember the [ufunc (Universal Function)](03_ufunc__universal_function_.md) from Chapter 3? Ufuncs are NumPy's special functions designed to operate element-wise on arrays with incredible speed (like `np.add`, `np.sin`, `np.log`).
Let's take a simple example:
```python
import numpy as np
angles = np.array([0, np.pi/2, np.pi])
sines = np.sin(angles) # How is this sine calculated so fast?
print(angles)
print(sines)
```
**Output:**
```
[0. 1.57079633 3.14159265]
[0.0000000e+00 1.0000000e+00 1.2246468e-16]  # sin(pi) is ~1.2e-16, not exactly 0, because np.pi is only a finite approximation of pi
```
The Python function `np.sin` acts as a dispatcher. It needs to hand off the actual, heavy-duty work of calculating the sine for each element in the `angles` array to highly optimized code. Where does this optimized code live?
Historically, the C code responsible for implementing the *loops and logic* of these mathematical ufuncs (like addition, subtraction, sine, cosine, logarithm, etc.) was contained within a dedicated C extension module called `umath`. It provided the fast, element-by-element computational kernels.
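To make the motivation concrete, here is a rough (machine-dependent) timing sketch comparing a pure-Python loop to the ufunc's C loop; the exact ratio will vary, but the C loop wins by a wide margin:

```python
import math
import timeit

import numpy as np

x = np.linspace(0, np.pi, 100_000)

# Pure-Python loop: one interpreter round-trip per element
t_py = timeit.timeit(lambda: [math.sin(v) for v in x], number=5)

# NumPy ufunc: a single call into an optimized, compiled C loop
t_np = timeit.timeit(lambda: np.sin(x), number=5)

print(f"speedup: ~{t_py / t_np:.0f}x")
```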
## What is the `umath` Module (Concept)?
The `umath` module represents the part of NumPy's C core dedicated to implementing **universal functions (ufuncs)**. Think of it as NumPy's built-in, highly optimized math library specifically designed for element-wise operations on arrays.
**Key Points:**
1. **Houses ufunc Implementations:** It contains the low-level C code that performs the actual calculations for functions like `np.add`, `np.sin`, `np.exp`, `np.sqrt`, etc.
2. **Optimized Loops:** This C code includes specialized loops that iterate over the array elements very efficiently, often tailored for specific [dtype (Data Type Object)](02_dtype__data_type_object_.md)s (like a fast loop for adding 32-bit integers, another for 64-bit floats, etc.).
3. **Historical C Module:** Originally, `umath` was a separate compiled C extension module (`umath.so` or `umath.pyd`).
4. **Merged with `multiarray`:** Since NumPy 1.16, the C code for `umath` has been merged with the C code for `multiarray` into a single, larger C extension module named `_multiarray_umath`. While they are now in the same compiled file, the *functions and purpose* associated with `umath` (implementing ufunc math) are distinct from those associated with `multiarray` (array object structure and basic manipulation).
5. **Python Access (`numpy/core/umath.py`):** You don't usually interact with the C code directly. Instead, NumPy provides Python functions (like `np.add`, `np.sin`) in the Python file `numpy/core/umath.py`. These Python functions are wrappers that know how to find and trigger the correct C implementation within the `_multiarray_umath` extension module.
**Analogy:** Imagine `multiarray` builds the car chassis and engine block (`ndarray` structure). `umath` provides specialized, high-performance engine components like the fuel injectors for addition (`np.add`'s C code), the turbocharger for exponentiation (`np.exp`'s C code), and the precise valve timing for trigonometry (`np.sin`'s C code). The Python functions (`np.add`, `np.sin`) are the pedals and buttons you use to activate these components.
## How it Works (Usage Perspective)
As a NumPy user, you typically trigger the `umath` C code indirectly by calling a ufunc:
```python
import numpy as np
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
# Calling the ufunc np.add
result1 = np.add(a, b) # Triggers the C implementation for addition
# Using the operator '+' which also calls np.add for arrays
result2 = a + b # Also triggers the C implementation
print(f"Using np.add: {result1}")
print(f"Using + operator: {result2}")
```
**Output:**
```
Using np.add: [11 22 33]
Using + operator: [11 22 33]
```
Both `np.add(a, b)` and `a + b` ultimately lead to NumPy executing the highly optimized C code associated with the addition ufunc, which conceptually belongs to the `umath` part of the core.
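Note that `np.add` is not a plain Python function but a `ufunc` object, and the metadata it carries for loop selection is inspectable:

```python
import numpy as np

print(type(np.add))             # <class 'numpy.ufunc'>
print(np.add.nin, np.add.nout)  # 2 inputs, 1 output

# Registered inner-loop signatures; 'dd->d' is the float64+float64 loop
print('dd->d' in np.add.types)  # True
```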
## A Glimpse Under the Hood
When you call a ufunc like `np.add(a, b)`:
1. **Python Call:** You invoke the Python function `np.add` (found in `numpy/core/umath.py` or exposed through `numpy/core/__init__.py`).
2. **Identify Ufunc Object:** This Python function accesses the corresponding ufunc object (`np.add` itself is a ufunc object). This object holds metadata about the operation.
3. **Dispatch to C:** The ufunc object mechanism (part of the `_multiarray_umath` C core) takes over.
4. **Type Resolution & Loop Selection:** The C code inspects the `dtype`s of the input arrays (`a` and `b`). Based on the input types, it looks up an internal table associated with the `add` ufunc to find the *best* matching, pre-compiled C loop. For example, if `a` and `b` are both `int64`, it selects the C function specifically designed for `int64 + int64 -> int64`. This selection process might involve type casting rules (e.g., adding `int32` and `float64` might choose a loop that operates on `float64`).
5. **Execute C Loop:** The selected C function (the core `umath` implementation for this specific type combination) is executed. This function iterates efficiently over the input array(s) memory, performs the addition element by element, and stores the results in the output array's memory.
6. **Return Result:** The C machinery wraps the output memory into a new `ndarray` object and returns it back to your Python code.
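The casting behavior in step 4 can be observed directly: mixing an integer and a floating-point array selects a loop that operates on a common promoted type:

```python
import numpy as np

a = np.array([1, 2, 3], dtype=np.int32)
b = np.array([0.5, 0.5, 0.5], dtype=np.float64)

result = np.add(a, b)        # no int32+float64 loop exists; inputs are cast
print(result.dtype)          # float64
print(np.result_type(a, b))  # float64: NumPy's promotion rule, queryable directly
```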
Here's a simplified sequence diagram:
```mermaid
sequenceDiagram
participant User as Your Python Script
participant PyUfunc as np.add (Python Wrapper)
participant UfuncObj as Ufunc Object (Metadata)
participant C_Core as C Code (_multiarray_umath)
participant C_Loop as Specific Add Loop (e.g., int64_add)
participant Memory
User->>PyUfunc: result = np.add(a, b)
PyUfunc->>UfuncObj: Access the 'add' ufunc object
UfuncObj->>C_Core: Initiate ufunc execution (pass inputs a, b)
C_Core->>C_Core: Inspect a.dtype, b.dtype
C_Core->>UfuncObj: Find best C loop (e.g., int64_add loop)
C_Core->>Memory: Allocate memory for result (if needed)
C_Core->>C_Loop: Execute int64_add(a_data, b_data, result_data)
C_Loop->>Memory: Read a, b, compute sum, write result
C_Loop-->>C_Core: Signal loop completion
C_Core->>Memory: Wrap result memory in ndarray object
C_Core-->>PyUfunc: Return result ndarray
PyUfunc-->>User: Assign result ndarray to 'result'
```
**Where is the Code?**
* **C Extension Module:** The compiled code lives in `_multiarray_umath.so` / `.pyd`.
* **Ufunc Definition & Generation:** The script `numpy/core/code_generators/generate_umath.py` is crucial. It contains definitions (like the `defdict` dictionary) that describe each ufunc: its name, number of inputs/outputs, identity element, the C functions to use for different type combinations (`TD` entries), and associated docstrings. This script generates C code (`__umath_generated.c`, which is then compiled) that sets up the ufunc objects and their internal loop tables.
```python
# Simplified snippet from generate_umath.py's defdict for 'add'
'add':
Ufunc(2, 1, Zero, # nin=2, nout=1, identity=0
docstrings.get('numpy._core.umath.add'), # Docstring reference
'PyUFunc_AdditionTypeResolver', # Type resolution logic
TD('?', ...), # Loop for booleans
TD(no_bool_times_obj, dispatch=[...]), # Loops for numeric types
# ... loops for datetime, object ...
),
```
This definition tells the generator how to build the `np.add` ufunc, including which C functions (often defined in other C files or generated from templates) handle addition for different data types.
* **C Loop Implementations:** The actual C code performing the math often comes from template files (like `numpy/core/src/umath/loops.c.src`) or CPU-dispatch-specific files (like `numpy/core/src/umath/loops_arithm_fp.dispatch.c.src`). These `.src` files contain templates written in a C-like syntax that get processed to generate specific C code for various data types (e.g., generating `int32_add`, `int64_add`, `float32_add`, `float64_add` from a single addition template). The dispatch files allow NumPy to choose optimized code paths (using e.g., AVX2, AVX512 instructions) based on your CPU's capabilities at runtime.
* **Python Wrappers:** `numpy/core/umath.py` provides the Python functions like `np.add`, `np.sin` that you call. It primarily imports these functions directly from the `_multiarray_umath` C extension module.
```python
# From numpy/core/umath.py - Simplified
from . import _multiarray_umath
from ._multiarray_umath import * # Imports C-defined ufuncs like 'add'
# Functions like 'add', 'sin', 'log' are now available in this module's
# namespace, ready to be used via `np.add`, `np.sin`, etc.
```
* **Namespace Setup:** `numpy/core/__init__.py` imports from `numpy.core.umath` (among others) to make functions like `np.add` easily accessible under the main `np` namespace.
## Conclusion
You've now seen that the `umath` concept represents the implementation heart of NumPy's universal functions.
* It provides the optimized **C code** that performs element-wise mathematical operations.
* It contains specialized **loops** for different data types, crucial for NumPy's speed.
* While historically a separate C module, its functionality is now part of the merged `_multiarray_umath` C extension.
* Python files like `numpy/core/umath.py` provide access, but the real work happens in C, often defined via generators like `generate_umath.py` and implemented in templated `.src` or dispatchable C files.
Understanding `umath` clarifies where the computational power for element-wise operations originates within NumPy's core.
So far, we've focused on NumPy's built-in functions. But how does NumPy interact with other libraries or allow customization of how operations work on its arrays?
Next, we'll explore a powerful mechanism for extending NumPy's reach: [Chapter 8: __array_function__ Protocol / Overrides (`overrides`)](08___array_function___protocol___overrides___overrides__.md).
---

# Chapter 8: __array_function__ Protocol / Overrides (`overrides`)
Welcome to the final chapter of our NumPy Core exploration! In [Chapter 7: umath Module](07_umath_module.md), we learned how NumPy implements its fast, element-wise mathematical functions (`ufuncs`) using optimized C code. We've seen the core components: the `ndarray` container, `dtype` descriptions, `ufunc` operations, numeric types, printing, and the C modules (`multiarray`, `umath`) that power them.
But NumPy doesn't exist in isolation. The Python scientific ecosystem is full of other libraries that also work with array-like data. Think of libraries like Dask (for parallel computing on large datasets that don't fit in memory) or CuPy (for running NumPy-like operations on GPUs). How can these *different* types of arrays work smoothly with standard NumPy functions like `np.sum`, `np.mean`, or `np.concatenate`?
## What Problem Does `__array_function__` Solve? Speaking NumPy's Language
Imagine you have a special type of array, maybe one that lives on a GPU (like a CuPy array) or one that represents a computation spread across many machines (like a Dask array). You want to calculate the sum of its elements.
Ideally, you'd just write:
```python
# Assume 'my_special_array' is an instance of a custom array type
# (e.g., from CuPy or Dask)
result = np.sum(my_special_array)
```
But wait, `np.sum` is a NumPy function, designed primarily for NumPy's `ndarray` ([Chapter 1: ndarray (N-dimensional array)](01_ndarray__n_dimensional_array_.md)). How can it possibly know how to sum elements on a GPU or coordinate a distributed calculation?
Before the `__array_function__` protocol, this was tricky. Either the library (like CuPy) had to provide its *own* complete set of functions (`cupy.sum`), or NumPy would have needed specific code to handle every possible external array type, which is impossible to maintain.
We need a way for NumPy functions to ask the input objects: "Hey, do *you* know how to handle this operation (`np.sum` in this case)?" If the object says yes, NumPy can step back and let the object take control.
This is exactly what the `__array_function__` protocol (defined in NEP-18) allows. It's like a common language or negotiation rule that lets different array libraries "override" or take over the execution of NumPy functions when their objects are involved.
**Analogy:** Think of NumPy functions as a universal remote control. Initially, it only knows how to control NumPy-brand TVs (`ndarray`s). The `__array_function__` protocol is like adding a feature where the remote, when pointed at a different brand TV (like a CuPy array), asks the TV: "Do you understand this button (e.g., 'sum')?" If the TV responds, "Yes, here's how I do 'sum'," the remote lets the TV handle it.
## What is the `__array_function__` Protocol?
The `__array_function__` protocol is a special method that array-like objects can implement. When a NumPy function is called with arguments that include one or more objects defining `__array_function__`, NumPy follows these steps:
1. **Check Arguments:** NumPy looks at all the input arguments passed to the function (e.g., `np.sum(my_array, axis=0)`).
2. **Find Overrides:** It identifies which arguments have an `__array_function__` method.
3. **Prioritize:** It orders these arguments as they appear in the call, left to right, except that instances of subclasses are tried before instances of their superclasses.
4. **Negotiate:** It calls the `__array_function__` method of the highest-priority object. It passes two key pieces of information to this method:
* The original NumPy function object itself (e.g., `np.sum`).
* The arguments (`*args`) and keyword arguments (`**kwargs`) that were originally passed to the NumPy function.
5. **Delegate:** The object's `__array_function__` method now has control. It can:
* Handle the operation itself (e.g., perform a GPU sum if it's a CuPy array) and return the result.
* Decide it *cannot* handle this specific function or combination of arguments and return a special value `NotImplemented`. In this case, NumPy tries the `__array_function__` method of the *next* highest-priority object.
* Potentially call the original NumPy function on converted inputs if needed.
6. **Fallback:** If *no* object's `__array_function__` method handles the call (they all return `NotImplemented`), NumPy raises a `TypeError`. *Crucially, NumPy usually does NOT fall back to its own default implementation on the foreign objects unless explicitly told to by the override.*
## Using `__array_function__` (Implementing a Simple Override)
Let's create a very basic array-like class that overrides `np.sum` but lets other functions pass through (by returning `NotImplemented`).
```python
import numpy as np
class MySimpleArray:
def __init__(self, data):
# Store data internally, maybe as a NumPy array for simplicity here
self._data = np.asarray(data)
# This is the magic method!
def __array_function__(self, func, types, args, kwargs):
print(f"MySimpleArray.__array_function__ got called for {func.__name__}")
if func is np.sum:
# Handle np.sum ourselves!
print("-> Handling np.sum internally!")
# Convert args to NumPy arrays if they are MySimpleArray
np_args = [a._data if isinstance(a, MySimpleArray) else a for a in args]
np_kwargs = {k: v._data if isinstance(v, MySimpleArray) else v for k, v in kwargs.items()}
# Perform the actual sum using NumPy on the internal data
return np.sum(*np_args, **np_kwargs)
else:
# For any other function, say we don't handle it
print(f"-> Don't know how to handle {func.__name__}, returning NotImplemented.")
return NotImplemented
# Make it look a bit like an array for printing
def __repr__(self):
return f"MySimpleArray({self._data})"
# --- Try it out ---
my_arr = MySimpleArray([1, 2, 3, 4])
print("Array:", my_arr)
# Call np.sum
print("\nCalling np.sum(my_arr):")
total = np.sum(my_arr)
print("Result:", total)
# Call np.mean (which our class doesn't handle)
print("\nCalling np.mean(my_arr):")
try:
mean_val = np.mean(my_arr)
print("Result:", mean_val)
except TypeError as e:
print("Caught expected TypeError:", e)
```
**Output:**
```
Array: MySimpleArray([1 2 3 4])
Calling np.sum(my_arr):
MySimpleArray.__array_function__ got called for sum
-> Handling np.sum internally!
Result: 10
Calling np.mean(my_arr):
MySimpleArray.__array_function__ got called for mean
-> Don't know how to handle mean, returning NotImplemented.
Caught expected TypeError: no implementation found for 'numpy.mean' on types that implement __array_function__: [<class '__main__.MySimpleArray'>]
```
**Explanation:**
1. We created `MySimpleArray` which holds some data (here, a standard NumPy array `_data`).
2. We implemented `__array_function__(self, func, types, args, kwargs)`.
* `func`: The NumPy function being called (e.g., `np.sum`, `np.mean`).
* `types`: A tuple of unique types implementing `__array_function__` in the arguments.
* `args`, `kwargs`: The original arguments passed to `func`.
3. Inside `__array_function__`, we check if `func` is `np.sum`.
* If yes, we print a message, extract the internal `_data` from any `MySimpleArray` arguments, call `np.sum` on that data, and return the result. NumPy uses this returned value directly.
* If no (like for `np.mean`), we print a message and return `NotImplemented`.
4. When we call `np.sum(my_arr)`, NumPy detects `__array_function__` on `my_arr`. It calls it. Our method handles `np.sum` and returns `10`.
5. When we call `np.mean(my_arr)`, NumPy again calls `__array_function__`. This time, our method returns `NotImplemented`. Since no other arguments handle it, NumPy raises a `TypeError` because it doesn't know how to calculate the mean of `MySimpleArray` by default.
This example demonstrates how an external library object can selectively take control of NumPy functions. Libraries like CuPy or Dask implement `__array_function__` much more thoroughly, handling many NumPy functions to perform operations on their specific data representations (GPU arrays, distributed arrays).
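Real implementations usually also inspect the `types` argument before committing to handle a call, so that an unrecognized third-party type in the mix makes them defer. Here is a sketch of that pattern (class and attribute names are illustrative, loosely following the recommendations in NEP 18):

```python
import numpy as np

class CheckedArray:
    # Types we are willing to operate alongside (illustrative choice)
    HANDLED_TYPES = (np.ndarray, int, float, list)

    def __init__(self, data):
        self._data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # Defer if any participating type is one we don't recognize
        if not all(issubclass(t, (CheckedArray,) + self.HANDLED_TYPES)
                   for t in types):
            return NotImplemented
        # Unwrap our instances and delegate to NumPy's implementation
        unwrapped = [a._data if isinstance(a, CheckedArray) else a
                     for a in args]
        return func(*unwrapped, **kwargs)

arr = CheckedArray([1.0, 2.0, 3.0])
print(np.mean(arr))  # handled by delegating to NumPy on the raw data
print(np.sum(arr))
```

Unlike `MySimpleArray` above, this class handles *any* NumPy function by unwrapping and delegating, while still politely returning `NotImplemented` when confronted with types it does not know about.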
## A Glimpse Under the Hood (`overrides.py`)
How does NumPy actually manage this dispatching process? The logic lives primarily in the `numpy/core/overrides.py` module.
1. **Decorator:** Many NumPy functions (especially those intended to be public and potentially overridden) are decorated with `@array_function_dispatch(...)` or a similar helper (`@array_function_from_dispatcher`). You can see this decorator used in files like `numpy/core/function_base.py` (for `linspace`, `logspace`, etc.) or `numpy/core/numeric.py` (for `sum`, `mean`, etc. indirectly via ufunc machinery).
```python
# Example from numpy/core/function_base.py (simplified)
from numpy._core import overrides
array_function_dispatch = functools.partial(
overrides.array_function_dispatch, module='numpy')
def _linspace_dispatcher(start, stop, num=None, ...):
# This helper identifies arguments relevant for dispatch
return (start, stop)
@array_function_dispatch(_linspace_dispatcher) # Decorator applied!
def linspace(start, stop, num=50, ...):
# ... Actual implementation for NumPy arrays ...
pass
```
2. **Dispatcher Class:** The decorator wraps the original function (like `linspace`) in a special callable object, often an instance of `_ArrayFunctionDispatcher`.
3. **Call Interception:** When you call the decorated NumPy function (e.g., `np.linspace(...)`), you're actually calling the `_ArrayFunctionDispatcher` object.
4. **Argument Check (`_get_implementing_args`):** The dispatcher object first calls the little helper function provided to the decorator (like `_linspace_dispatcher`) to figure out which arguments are relevant for checking the `__array_function__` protocol. Then, it calls the C helper function `_get_implementing_args` (defined in `numpy/core/src/multiarray/overrides.c`) which efficiently inspects the relevant arguments, finds those with `__array_function__`, and sorts them according to priority and type relationships.
5. **Delegation Loop:** The dispatcher iterates through the implementing arguments found in step 4 (from highest priority to lowest). For each one, it calls its `__array_function__` method.
6. **Handle Result:**
* If `__array_function__` returns a value other than `NotImplemented`, the dispatcher immediately returns that value to the original caller. The process stops.
* If `__array_function__` returns `NotImplemented`, the dispatcher continues to the next implementing argument in the list.
7. **Error or Default:** If the loop finishes without any override handling the call, a `TypeError` is raised.
Here's a simplified sequence diagram for `np.sum(my_arr)`:
```mermaid
sequenceDiagram
participant User
participant NumPyFunc as np.sum (Dispatcher Object)
participant Overrides as numpy.core.overrides
participant CustomArr as my_arr (MySimpleArray)
User->>NumPyFunc: np.sum(my_arr)
NumPyFunc->>Overrides: Get relevant args (my_arr)
Overrides->>Overrides: _get_implementing_args([my_arr])
Overrides-->>NumPyFunc: Found [my_arr] implements __array_function__
NumPyFunc->>CustomArr: call __array_function__(func=np.sum, ...)
CustomArr->>CustomArr: Check if func is np.sum (Yes)
CustomArr->>CustomArr: Perform custom sum logic
CustomArr-->>NumPyFunc: Return result (e.g., 10)
NumPyFunc-->>User: Return result (10)
```
The `numpy/core/overrides.py` file defines the Python-level infrastructure (`array_function_dispatch`, `_ArrayFunctionDispatcher`), while the core logic for efficiently finding and sorting implementing arguments (`_get_implementing_args`) is implemented in C for performance.
## Conclusion
The `__array_function__` protocol is a powerful mechanism that makes NumPy far more extensible and integrated with the wider Python ecosystem. You've learned:
* It allows objects from **other libraries** (like Dask, CuPy) to **override** how NumPy functions behave when passed instances of those objects.
* It works via a special method, `__array_function__`, that implementing objects define.
* NumPy **negotiates** with arguments: it checks for the method and **delegates** the call if an argument handles it.
* This enables writing code that looks like standard NumPy (`np.sum(my_obj)`) but can operate seamlessly on diverse array types (CPU, GPU, distributed).
* The dispatch logic is managed primarily by decorators and helpers in `numpy/core/overrides.py`, relying on a C function (`_get_implementing_args`) for efficient argument checking.
This protocol is a key part of why NumPy remains central to scientific computing in Python, allowing it to interact smoothly with specialized array libraries without requiring NumPy itself to know the specifics of each one.
This concludes our tour through the core concepts of NumPy! We hope this journey from the fundamental `ndarray` to the sophisticated `__array_function__` protocol has given you a deeper appreciation for how NumPy works under the hood.
---

# Tutorial: NumPy Core
NumPy provides the powerful **ndarray** object, a *multi-dimensional grid* optimized for numerical computations on large datasets. It uses **dtypes** (data type objects) to precisely define the *kind of data* (like integers or floating-point numbers) stored within an array, ensuring memory efficiency and enabling optimized low-level operations. NumPy also features **ufuncs** (universal functions), which are functions like `add` or `sin` designed to operate *element-wise* on entire arrays very quickly, leveraging compiled code. Together, these components form the foundation for high-performance scientific computing in Python.
**Source Repository:** [https://github.com/numpy/numpy/tree/3b377854e8b1a55f15bda6f1166fe9954828231b/numpy/_core](https://github.com/numpy/numpy/tree/3b377854e8b1a55f15bda6f1166fe9954828231b/numpy/_core)
```mermaid
flowchart TD
A0["ndarray (N-dimensional array)"]
A1["dtype (Data Type Object)"]
A2["ufunc (Universal Function)"]
A3["multiarray Module"]
A4["umath Module"]
A5["Numeric Types"]
A6["Array Printing"]
A7["__array_function__ Protocol / Overrides"]
A0 -- "Has data type" --> A1
A2 -- "Operates element-wise on" --> A0
A3 -- "Provides implementation for" --> A0
A4 -- "Provides implementation for" --> A2
A5 -- "Defines scalar types for" --> A1
A6 -- "Formats for display" --> A0
A6 -- "Uses for formatting info" --> A1
A7 -- "Overrides functions from" --> A3
A7 -- "Overrides functions from" --> A4
A1 -- "References type hierarchy" --> A5
```
## Chapters
1. [ndarray (N-dimensional array)](01_ndarray__n_dimensional_array_.md)
2. [dtype (Data Type Object)](02_dtype__data_type_object_.md)
3. [ufunc (Universal Function)](03_ufunc__universal_function_.md)
4. [Numeric Types (`numerictypes`)](04_numeric_types___numerictypes__.md)
5. [Array Printing (`arrayprint`)](05_array_printing___arrayprint__.md)
6. [multiarray Module](06_multiarray_module.md)
7. [umath Module](07_umath_module.md)
8. [__array_function__ Protocol / Overrides (`overrides`)](08___array_function___protocol___overrides___overrides__.md)
---