2 Modular codebases: the secrets of scalable and maintainable code
One of the key differences between programming and software engineering is that the latter deals with very large codebases that sometimes need to be extended or modified in unanticipated ways, and that require the work of numerous collaborators. It is similar to the difference between designing, say, a demonstration steam engine and a commercial jet engine. Successfully building the latter requires subdividing it into many components, each with a clear interface to the others so that it can be worked on by different people, as well as a general plan that allows for possible reorganizations, additions or deletions of components. The same is true for software, where large codebases are organized into abstractions: a general concept that can be realized by functions, classes, or modules, all entities that hide the details of their implementation and expose a simple interface to the rest of the codebase. This is what we mean by modularity, and what we will cover in this chapter.
❯ Level 1
What are classes for?
Classes are one of the basic concepts taught in traditional computer science courses, considered as fundamental as functions, variables and loops, but they are often skipped or barely touched upon in programming classes for scientists of other domains. In part, that is because they are not an essential feature of a programming language: they only pertain to the object-oriented style of programming, and languages like C or Caml do without classes while being just as expressive. It also might be because the “class” systems of R and Matlab are cumbersome enough to dissuade people from using them.
However, as codebases grow bigger and more complex, the main job of a programmer shifts from writing clever algorithms towards organizing the complexity, by making code modular and breaking it up into abstractions. This is why most industrial projects use languages with classes and why many big python projects make heavy use of them. The concept of abstraction can seem a bit, well, abstract to someone learning programming, but identifying the right abstractions for a given problem really is the main obsession of a software engineer, and thinking through this lens will quickly make you good at writing modular code.
There are many good introductions on how to define classes in python with code examples, so I will give pointers to these in case you haven’t seen them yet, and will focus here instead on the main features of object-oriented programming you should know about and why they exist. This chapter also gives a few concrete, science-oriented examples to guide you in using these concepts in your work.
Let’s first focus on the main concepts everyone should know:
- Classes are ways to define “super-variables” composed of different elements (the attributes) that together form some sort of coherent entity. Possible examples are a model with several parameters that can be grouped together, the state of a system (for a pendulum: position and velocity together), a physical value and its unit, or an image (the values of many pixels) together with its dimensions and acquisition method… You quickly see classes everywhere.
- Classes also define “methods”, which are functions attached to the class1. Methods let you define the functions that act on a certain kind of data right next to the details of how that data is encoded numerically (the attributes). They also group together all the functions that act on this kind of data, instead of having them dispersed around the codebase.
- Among their methods, classes usually have a few special ones, most notably an initializer that defines what needs to be provided to create an instance of the class. In python, the initializer is all the more important because it typically defines what attributes the instances will have. Note that a class without an explicit initializer can still be instantiated: it simply inherits a default, do-nothing initializer, and its instances start with no attributes of their own.
- Classes can inherit from another class: a class B that inherits from class A will have all the attributes and methods of class A, and can add a few more. This helps respect important aspects of good code: avoidance of repetition, extensibility of existing code, and well-organised abstractions that allow programmers to keep good mind-maps of what is going on.
- Inheritance naturally only goes one way: if class B inherits from class A, A cannot inherit from B, and more generally there can be no cycles in the inheritance graph. However, inheritances can form a chain: a class C can inherit from B, which itself inherits from A, in which case all methods of A are also applicable to objects of C. Inheritance thus forms a kind of family tree. In level 2 we will also see that in python a class can inherit from two classes at the same time (which is not the case in many languages).
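To make these points concrete, here is a minimal sketch (the class names and the energy formula are purely illustrative):

```python
import math

class PendulumState:
    def __init__(self, angle, velocity):
        # attributes: the variables that together describe one coherent entity
        self.angle = angle
        self.velocity = velocity

    def energy(self, g=9.81, length=1.0):
        # a method: a function attached to the class, acting on its attributes
        # (energy per unit mass of an ideal pendulum)
        kinetic = 0.5 * (length * self.velocity) ** 2
        potential = g * length * (1 - math.cos(self.angle))
        return kinetic + potential

class DampedPendulumState(PendulumState):
    def __init__(self, angle, velocity, damping):
        # inheritance: reuse the parent's initializer, then add an attribute
        super().__init__(angle, velocity)
        self.damping = damping
```

A `DampedPendulumState` instance has everything a `PendulumState` has (including the energy method), plus its own damping attribute.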
What is an abstract class?
An important but lesser-known concept is that of the abstract class, also present in most object-oriented languages. Basically, an abstract class is a class designed only to be inherited from: it defines a set of methods and attributes that all its children must have, without actually writing their implementation. In python it looks like this:
    from abc import ABC, abstractmethod

    class AbstractClass(ABC):
        @abstractmethod
        def method1(self):
            raise NotImplementedError("This method should be implemented in the child class")

        def method2(self):
            print("This is a concrete method in the abstract class")

    class ConcreteClass(AbstractClass):
        def method1(self):
            print("This is the implementation of method1 in the concrete class")

    if __name__ == "__main__":
        c = ConcreteClass()
        c.method1()
        c.method2()
        a = AbstractClass()  # This will raise an error

As you see, abstract classes need to be defined using the abc module of the standard library. In python, abstract classes are kind of an “afterthought” feature of the language and their implementation can feel a bit unnatural. Basically, an abstract class is a class that inherits directly from ABC and has at least one method decorated with @abstractmethod (see the section on decorators below, but no need to understand it deeply here: just think of it as a modifier of the method it decorates). By convention, abstract methods raise a NotImplementedError in their body. Then, if you want to make a concrete class that inherits from the abstract class, you will need to implement all the abstract methods. Note that the abstract base class can also implement concrete methods, which will be directly usable in child classes. The killer feature is that the concrete methods can make use of the abstract methods, because we are sure they will be implemented in the concrete child classes.
Now, why would anyone use this? Basically, you can think of abstract classes as allowing you to form “families of classes” that are grouped together by their capabilities. In more technical terms, they separate functionality from implementation. This is very useful because it lets you write code that can be applied interchangeably to any child of the abstract class. A typical example is the Iterable abstract class (doc), whose children can all be plugged into a for x in iterable_object block. Concrete iterables can be vastly different types of objects: lists, numpy arrays, dictionaries, you name it, but they all share this key functionality. If you write your code such that it only relies on the functionality of the Iterable class, you know it will work whether you pass it a list or a numpy array. Another typical example, as we will see below, is a Model abstract class that could define both a fit and a predict method that all children would then fill with their concrete implementation. Again, this lets you write code that tests models interchangeably, and saves considerable time.
Another way of putting it is to say that an abstract class defines an interface for interacting with a family of classes (some computer scientists would say “a contract”, “a protocol”, or even “a scaffold” for a family of classes). They are actually called “interfaces” in some languages like Java. The reason I insist on this terminology is that it makes clearer what they are useful for: if you have a bunch of classes that look like they should behave similarly, or if you have code that looks exactly the same for different classes, it’s probably a sign you can define an abstract class. The benefits will be more code reuse, better readability for peers, and less risk of fumbling method names or forgetting something when you define another class of the family.
Examples of applications for classes and abstract classes
Let’s give some examples in scientific computing where object-oriented programming can be put to good use.
First, let’s say you want to define a set of machine learning models for some type of data you are working on2. You will probably have models composed of several parameters, sometimes a vector and a scalar (for linear regression, say), sometimes a set of matrices (for a neural network), sometimes a set of boundaries (for a decision tree). For most models, you will apply similar steps: a form of random initialization, then fitting on the data, maybe a method to test on held-out data, probably also a predict method to generate new predictions. All these models will thus share identical methods, but not attributes: this is a great opportunity to use an abstract class! You can define a BaseModel abstract class, with all those abstract methods, and then define one child class for each type of model with the actual implementation of those methods. Note that you could even write the test method in a way that only relies on the predict method, and thus implement it as a concrete method in the abstract class already, saving some code duplication. Finally, with all this, you can write a very simple loop to try all models on a dataset:
    models = {
        "linear": LinearModel(),
        "neural": NeuralModel(),
        "tree": TreeModel(),
    }

    for name, model in models.items():
        model.fit(data)
        print(f"Model {name} has a score of {model.test(held_out_data)}")

The reason you can write this single loop is that all models are children of the same abstract class! Note also that thanks to the use of classes, we can completely separate the data handling and evaluation code from the model computations, which saves a lot of your brain processing power when working on one or the other aspect.
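As a sketch of what the abstract class side could look like (BaseModel, fit, predict and test follow the text above; the toy MeanModel is made up for illustration):

```python
from abc import ABC, abstractmethod

class BaseModel(ABC):
    @abstractmethod
    def fit(self, data):
        raise NotImplementedError

    @abstractmethod
    def predict(self, x):
        raise NotImplementedError

    def test(self, held_out_data):
        # concrete method written against the abstract predict only:
        # mean squared error over (x, y) pairs
        errors = [(self.predict(x) - y) ** 2 for x, y in held_out_data]
        return sum(errors) / len(errors)

class MeanModel(BaseModel):
    # toy model: always predicts the mean of the training targets
    def fit(self, data):
        targets = [y for _, y in data]
        self.mean = sum(targets) / len(targets)

    def predict(self, x):
        return self.mean

model = MeanModel()
model.fit([(0.0, 1.0), (1.0, 3.0)])
print(model.test([(0.0, 2.0)]))  # 0.0
```

Any real model only needs to implement fit and predict; test comes for free from the base class.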
Let’s take another example in the form of a simulation of a physical system: you have a system (a pendulum, a planetary system, or anything you might think of) that is always in a state, characterised by several variables, and that evolves in time following some rules. The state itself is a good candidate for a class, which would group together its variables and can contain methods to compute other state-dependent quantities (for example the energy). If you are sure about the rules of evolution, you can then add a next_state method to the State class that computes the next state of the system given the current one. This makes the simulation loop trivial to write, and if you need to change those rules of evolution you know exactly where to go. Now, if you want to take into account different phenomena in different simulations, you could instead put the next_state method in different XSystem classes, all inheriting from a BaseSystem abstract class declaring this next_state method (and maybe even a full simulation_loop method that could be defined as a concrete method in the abstract class). Again, this allows switching between simulations very easily, and having the State as a separate class isolates the code that depends only on a single state.
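A possible skeleton of this design, with a deliberately trivial Euler-step evolution rule as a placeholder (all names are illustrative):

```python
from abc import ABC, abstractmethod

class State:
    # groups the variables describing the system at one instant
    def __init__(self, position, velocity):
        self.position = position
        self.velocity = velocity

class BaseSystem(ABC):
    @abstractmethod
    def next_state(self, state, dt):
        raise NotImplementedError

    def simulation_loop(self, state, dt, n_steps):
        # concrete method shared by all child systems
        trajectory = [state]
        for _ in range(n_steps):
            state = self.next_state(state, dt)
            trajectory.append(state)
        return trajectory

class FreeFallSystem(BaseSystem):
    g = -9.81

    def next_state(self, state, dt):
        # explicit Euler step, just as a placeholder evolution rule
        return State(state.position + state.velocity * dt,
                     state.velocity + self.g * dt)

system = FreeFallSystem()
trajectory = system.simulation_loop(State(0.0, 0.0), dt=0.1, n_steps=3)
```

Adding another simulation is then just a matter of writing one more XSystem class with its own next_state.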
Here are a few more examples of scientific libraries that can benefit from OOP:
- a geometry package with an abstract class for a shape, and concrete implementations for a circle, a square, a triangle, etc., all implementing methods to compute the area, draw the shape, compute intersections, etc.
- a package to handle physical quantities with units, with for example a single Mass class that would have a value attribute and a unit attribute, and methods to output it in different units, add it to another mass, etc.
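The geometry idea, for instance, could be sketched like this (the names and the chosen shapes are just for illustration):

```python
import math
from abc import ABC, abstractmethod

class Shape(ABC):
    @abstractmethod
    def area(self):
        raise NotImplementedError

class Circle(Shape):
    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return math.pi * self.radius ** 2

class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side ** 2

# code written against Shape works for every concrete shape
total_area = sum(shape.area() for shape in [Circle(1.0), Square(2.0)])
```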
Dunder methods
All python classes can implement special methods whose names start and end with two underscores __, called “dunder methods”. Everyone knows the __init__ method, which is never called explicitly but through the initialization syntax of the class (when you write c = Class(), python creates the object and then calls its __init__ method). But there are many more of these special methods that are implicitly called when a certain syntax is used. Here are a few examples:
- __str__(self) is called when the object is converted to a string, whether with the str built-in function (s = str(c)) or when it is printed (print(c)). It should return a string that represents the object in a human-readable way, and you can choose which information to put in it or leave out. If you don’t define it, python will use a rather useless default string that looks like <__main__.Class object at 0x7f8e3c7b3d30>.
- __add__(self, other) is called when the object is used with the plus operator (for example c1 + c2 where c1 and c2 are instances of the class). This is how the “non-natural sums” of python are defined, for example those of lists or strings (which amount to concatenation). It is quite powerful because the second operand doesn’t even need to be of the same class! As long as your code in the __add__ method handles it, c2 can be of any type, and the result as well. However, you won’t be able to easily add an object of your class to one of a pre-existing type like an int, because python would then call the __add__ method of the int class (so you could arrange your code so that c + 1 works, but not 1 + c). Similarly, the __sub__, __mul__, __truediv__ and many other dunder methods defined here let you use the corresponding operators.
- __lt__(self, other) (“less than”) is called when you use the < operator, as in c1 < c2. It should return a boolean obviously. All the comparison operators can be implemented with such dunder methods.
- __getitem__(self, key) is called when you use square brackets to access an element of your object, as in c[3]. It should return the element at the index key, and the key can be of any type you want as long as you support it. This is great to implement a container with random-access items like a list, numpy array or dictionary, but note that you can also define containers that can be iterated on without being random-access (like a set or a queue): see the section below for how to do this.
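A toy class exercising a few of these dunder methods (the Vector name and its behaviour are made up for illustration):

```python
class Vector:
    def __init__(self, values):
        self.values = list(values)

    def __str__(self):
        # called by str(v) and print(v)
        return f"Vector({self.values})"

    def __add__(self, other):
        # called by v1 + v2: element-wise sum
        return Vector(x + y for x, y in zip(self.values, other.values))

    def __lt__(self, other):
        # called by v1 < v2: here we compare squared Euclidean norms
        return sum(x * x for x in self.values) < sum(x * x for x in other.values)

    def __getitem__(self, index):
        # called by v[index]
        return self.values[index]

v = Vector([1, 2]) + Vector([3, 4])
print(v)  # Vector([4, 6])
```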
Iterables and iterators
Basically, an iterable is an object that can be iterated on in a for loop, as in for x in obj. For this, the iterable must implement a method __iter__(), which returns an iterator, itself an object implementing a method __next__(). This __next__() method should return one new element of the iteration each time it is called, and raise a StopIteration exception when the iteration is finished. The two concepts seem close, but the key to understanding them is that the iterator holds some sort of internal state describing where it is in the iteration, while the iterable can be iterated on multiple times by creating a new iterator every time.
The code for x in obj is secretly run by python as:

    iterator = obj.__iter__()
    while True:
        try:
            x = iterator.__next__()
        except StopIteration:
            break
        # run contents of the for loop

Here, the first line initializes a new iterator in order to start a new iteration; then at each iteration the __next__ method is called to get the next element, until a StopIteration is raised, at which point we break out of the loop.
To make things more concrete, consider the following re-implementation of a list as an iterable3:
    class ListIterator:
        def __init__(self, data):
            self.data = data
            self.index = 0

        def __next__(self):
            if self.index >= len(self.data):
                raise StopIteration
            x = self.data[self.index]
            self.index += 1
            return x

        def __iter__(self):
            return self

    class ListIterable:
        def __init__(self, data):
            self.data = data

        def __iter__(self):
            return ListIterator(self.data)

As you can see, the __iter__ method of the ListIterable class simply returns a new ListIterator object initialized with an index at 0. The __next__ method then always returns the element the current index points to and increments the index, until the end of the list is reached. Note that for consistency reasons, the iterator must also implement the __iter__ method, which should return the iterator itself (without changing its internal state).
Generators
There is a different way to implement an iterator without so much boilerplate code: using a generator. A generator is a function that has yield statements instead of a return statement, and can be used as an iterator. Every time a yield statement is reached, it’s as if the function was paused at that line, and the value after the yield keyword is returned as an element of the iteration. To generate the next element of the iteration, python resumes the execution of the generator where it had paused, continuing until the next yield statement, and so on until the function exits without yielding. Here is how the same list iterable could be implemented with a generator4:
    def list_generator(data):
        index = 0
        while index < len(data):
            yield data[index]
            index += 1

    for x in list_generator([1, 2, 3]):
        print(x)

What happens here is that when the list_generator function is called, python creates an iterator object that holds the values of all internal variables of the list_generator function, along with the current line of execution. The __next__ method of this implicit iterator resumes execution of the function until it finds a yield statement or the end of the function. There is a ton of things happening under the hood: this is python adding a lot of syntactic sugar to make code that reads beautifully and can be tremendously efficient, albeit hard to debug.
The real power of generators appears when one tries to implement sequences that need to generate new data, for example the classical implementation of the Fibonacci sequence:
    def fibonacci():
        a, b = 0, 1
        while True:
            yield a
            a, b = b, a + b

    for x in fibonacci():
        if x > 1000:
            break
        print(x)

Here, a key feature of iterators and generators appears: they are lazy, meaning they only compute data as it is needed. This even allows infinite iterators, as in this case.
Note that thanks to the iterator interface, any iterable or iterator that you implement, whether as a class or as a generator function, can be used not only in a for loop but also in functions such as enumerate, zip, list, set, sorted, max, min, sum, all, any, or those of the itertools module (as long as the necessary operators are supported by elements of the iteration, for example if x < y can be interpreted by python if you want to use the sorted or max functions). This allows creative solutions to problems that would otherwise take numerous lines of code.
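For example, combining the fibonacci generator above (repeated here so the snippet stands alone) with itertools.islice:

```python
import itertools

def fibonacci():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

# islice truncates the infinite iterator; list() then consumes it
first_ten = list(itertools.islice(fibonacci(), 10))
print(first_ten)  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]

# generators compose with comprehensions and built-ins alike
evens = [x for x in itertools.islice(fibonacci(), 10) if x % 2 == 0]
print(max(first_ten))  # 34
```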
Finally, let’s mention generator expressions which are yet another layer of syntactic sugar on top of all this: it’s a syntax that allows you to define a generator in a single line composed of an expression followed by one or several for and if clauses. It reads very well, for example:
    upper_triangle_elements = (matrix[i, j] for i in range(n) for j in range(i, n))
    for x in upper_triangle_elements:
        # ...

The first line here is lazy: it doesn’t lead to any actual computation, but rather defines a generator. It is equivalent to the following code:
    def gen_upper_triangle_elements(matrix, n):
        for i in range(n):
            for j in range(i, n):
                yield matrix[i, j]

    upper_triangle_elements = gen_upper_triangle_elements(matrix, n)

Again, the advantage is that generator expressions are very readable and concise; the drawback is that they can be hard to debug if something goes wrong.
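Generator expressions shine when fed directly to such functions; in the sketch below, no intermediate list of squares is ever stored in memory:

```python
# sum of x * x for x in 0..999, computed lazily one term at a time
total = sum(x * x for x in range(1000))

# any() stops at the first match, so the generator is only partially consumed
has_big_square = any(x * x > 500_000 for x in range(1000))
```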
A guide to the intricacies of python imports
Let’s talk about an aspect of python that is often confusing even for experienced programmers: modules and the import system. As long as you only import external packages you will surely be fine (provided the packages are installed, which is another story: see the section on pip and conda in chapter 5), but have you ever tried to import modules from your own codebase? All sorts of complicated errors can happen then.
Let us say here we have a codebase with the following structure:
    my_project/
        main.py
        utils.py
        models/
            linear.py
            common.py
        data/
            preprocessing.py
So far you can quite trivially import functions into main.py from the other scripts. This should work fine:
    # main.py
    from utils import my_function
    from models.linear import LinearModel

Weirdly however, importing things into linear.py won’t work as trivially. If we try to put into linear.py the line:
    # In models/linear.py
    from common import common_function

you get an error: the common module is not found. However if you put in the linear.py file:
    # In models/linear.py
    from utils import my_function

then the main script will work just fine, even though utils.py is not in the same directory as linear.py! This seems incomprehensible.
The key to these mysteries is that every time python finds an import statement, it searches a list of directories called the sys.path. You can see it by running import sys; print(sys.path) in your code. Any imported module has to be in one of those directories to be found, either in the form of a script named module_name.py, or of a directory named module_name containing an __init__.py file. If we look at the documentation, we see that “The first entry in the module search path is the directory that contains the input script, if there is one.” So in our case, when main.py is run, the directory containing it is added to the sys.path, and any import, even in other scripts, will be able to find the utils.py module. That would break however if you tried to import the functionality of linear.py into a notebook or a script somewhere else, as the directory containing main.py and utils.py would not be in the sys.path anymore.
A more robust solution is to use python’s relative imports system. First you can try the following in linear.py:

    # In models/linear.py
    from .common import common_function

This should work wherever you import linear.py from! The dot tells python to look for the module in the same directory as the file it is reading, instead of in the sys.path. This system can also be used to search parent directories, with one additional dot per additional directory to go back. Following this logic, you can also import the utils.py module from linear.py with:
    # In models/linear.py
    from ..utils import my_function

However, you’ll notice that python still complains. Things now get convoluted: the above solution actually works if (brace yourself) you import linear.py from at least the parent directory of the my_project/ directory! You can try it: if you open the python interpreter in the parent of my_project and run import my_project.models.linear, it will work. This is because during a relative import, python moves backwards in the package hierarchy, not in the filesystem hierarchy. For the user it looks exactly the same, but in practice it means that from ..utils import my_function will only work in a file that is imported at least two packages deep, as was the case when we ran import my_project.models.linear from the parent directory of my_project. If we run python in the my_project directory and there run import models.linear, any relative import in linear.py can only have at most one initial dot. Another thing: any relative import that starts with a dot won’t work in a script that is run directly, or in a notebook. So in main.py, any statement like from .utils import ... won’t work if you run the script with python main.py (but from models.linear import ... will work, because it relies on the sys.path, not on the relative imports mechanism). A workaround is to run the script with python -m my_project.main, which makes python “run the script as a module”, but it’s still not applicable to notebooks.
To wrap up on relative imports: they are only a solution for imports between the files of a single library, excluding scripts and notebooks, and they don’t allow you to import the library from outside of its top-level directory. So they are not a robust solution for a large project, or a project that you want to share with others.
What should you do then? A dirty solution that is often seen is direct meddling with the sys.path, but that’s not very robust. Other than that, the only clean solution is to make your library installable with one of the solutions detailed in chapter 5, the simplest being a pyproject.toml file, such that you can run pip install -e . in the directory containing the pyproject.toml file to install your library. In our case, we would first need to rework the directory structure to separate the scripts from the library, yielding something like:
    my_project/
        my_project/
            __init__.py
            utils.py
            models/
                __init__.py
                linear.py
                common.py
            data/
                __init__.py
                preprocessing.py
        scripts/
            main.py
        pyproject.toml
We will cover how to write the pyproject.toml in chapter 5, but once it is done you can run pip install -e . when you are in the top-level directory, and from there you can run import my_project.models.linear for example from anywhere on your computer, in the python REPL or in a script or notebook. The script main.py could then contain statements like from my_project.utils import ... and from my_project.models.linear import .... The files of the library however can still use relative imports, as they are now all in the same top-level package (but from my_project-style imports are also fine).
You will have noticed that we also added __init__.py files in all directories of the library; these are simply empty files. It is indeed recommended for any python library to add a possibly empty __init__.py to each package directory, although this is not strictly necessary since python 3.35.
Docstrings
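In short, a docstring is a string literal placed as the first statement of a module, class or function; python stores it in the __doc__ attribute, and tools like help() display it. A minimal sketch (the normalize function is just an example):

```python
def normalize(values):
    """Return the values rescaled so that they sum to 1.

    The first line should be a short summary; more details, parameter
    descriptions and examples can follow after a blank line.
    """
    total = sum(values)
    return [v / total for v in values]

# the docstring is stored on the function object itself
print(normalize.__doc__.splitlines()[0])
```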
❯❯ Level 2
Decorators
This is another of those “syntactic sugar” features of python that make it such an enjoyable language to read. You have seen it when we mentioned abstract classes, as the @abstractmethod line that had to be added before the abstract method definition.
Basically, a decorator is a function that takes a function as input and returns a modified version of that function as output6. A typical example would be a function that times the execution of another function:
    import time
    import functools

    def timer(func):
        @functools.wraps(func)  # This is recommended, explanations below
        def wrapper(*args, **kwargs):
            start = time.time()
            result = func(*args, **kwargs)
            end = time.time()
            print(f"Function {func.__name__} took {end - start} seconds to run")
            return result
        return wrapper

The above code looks convoluted, but it really just defines a new function wrapper that calls func, does some things before and after, and returns the result. The decorator itself simply takes any function func as argument and returns the wrapper it built around it. You can then use this decorator by adding @timer before the definition of any function you want to time:
    @timer
    def my_function(x, y):
        # do something

This is equivalent to writing:
    def my_function(x, y):
        # do something

    my_function = timer(my_function)

You can then call my_function(x, y) as usual, except that the call now goes through the wrapper built by timer, which times it. As you can see, this is very useful because it lets you avoid repeating boilerplate code that would be the same for many functions but cannot be put in a single function call (because, as in the example above, it has to run at the beginning and end of the function, or because it would depend on the function’s arguments).
You can imagine many applications of this syntax: logging function calls somewhere, checking the validity of a function’s inputs, retrying function calls if they fail… One decorator you might want to know about is @functools.lru_cache (import functools before). A function decorated with this will have a “least recently used” (LRU) cache of its results, meaning that every time you call it, it stores in memory the arguments and the result obtained. If it is called again with the same arguments, the result is simply retrieved from the cache. This is very useful for functions that run lengthy calculations but are often called with the same arguments. The decorator accepts a maxsize argument to specify the size of its cache (simply write @functools.lru_cache(maxsize=128), for example).
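Here is a small sketch of the effect (the square function is made up, and the call counter is only there to observe the caching):

```python
import functools

calls = {"count": 0}

@functools.lru_cache(maxsize=128)
def square(x):
    calls["count"] += 1  # counts actual executions of the body
    return x * x

square(4)
square(4)  # same argument: result retrieved from the cache, body not re-run
square(5)
print(calls["count"])  # 2
```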
Another useful decorator is @functools.wraps which funnily enough is used pretty much only when writing a decorator. It is recommended to always decorate the wrapper inside your decorator with it, for reasons that are explained in detail here but in short to preserve properties of the decorated function like its __name__ and __doc__ attributes. Finally, if you need to write a decorator that takes arguments, the syntax is a bit more complex, you can see here or there.
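For reference, a decorator that takes arguments is one more level of nesting: a function that receives the arguments and returns a decorator. A sketch (the repeat name is made up):

```python
import functools

def repeat(n):
    # the outer function receives the decorator's argument...
    def decorator(func):
        # ...and returns an ordinary decorator, as in the timer example
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = None
            for _ in range(n):
                result = func(*args, **kwargs)
            return result
        return wrapper
    return decorator

calls = []

@repeat(3)
def record(x):
    calls.append(x)
    return x

record("hi")
print(len(calls))  # 3
```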
Multiple inheritance and mixins
The traditional view of class inheritance is a linear one, where a child class inherits from a single parent, adding more specialized methods and attributes, and can then itself have more specialized children. For example you would create a Classifier base class, and then some children like a LinearClassifier and a NeuralClassifier. But what if you want your NeuralClassifier to also inherit the methods of torch.nn.Module? Well, in python you can! You can simply add as many parents as you want to a single class:
    class NeuralClassifier(Classifier, torch.nn.Module):
        # ...

This is not possible in many languages because the consequences can be hard to deal with. For example, if both parents have a train method, python needs a rule to know which one to call. And if those parents themselves inherit from multiple classes, the likelihood of such conflicts only increases. To solve this, python gives each class an ordering of its ancestors, in the form of a list called the “method resolution order” (accessible as MyClass.__mro__ for any class). When a method of MyClass is called, python first looks for it among the methods defined in MyClass itself, then among the methods of the next element of the MRO, and moves on along the MRO until it finds it.
Note that you can always explicitly call a method of a parent class by using the super() function. Basically, a default call to super() defers the method call to the next class in the MRO (for a simple class, its parent). But you can also pass a class as an argument to super(), and it will then search for the method starting from the first class in the MRO that follows this argument. An example will be more telling:
    class A:
        def method(self):
            print("A")

    class B:
        def method(self):
            print("B")

    class C(A, B):
        def method(self):
            # the MRO of C is [C, A, B, object]
            super().method()         # Will print "A" (first class after C in the MRO)
            super(A, self).method()  # Will print "B" (first class after A in the MRO)
            super(B, self).method()  # Will raise an error (object, after B, has no such method)

A particularly interesting use of multiple inheritance is the concept of mixin classes, often used in large codebases. A mixin is a class (often abstract) that is not meant to be instantiated on its own but only to add some functionality to other classes through secondary inheritance. This allows compositional inheritance patterns which would be impossible with simple inheritance. For example, let’s say you have models that descend from a BaseModel class. Some of these models can be continually trained with some shared functionality, so you think of adding a ContinualTrainingModel subclass they will inherit from. Some of these models can also predict confidence intervals with some shared functionality, so you think of adding a ConfidenceIntervalModel subclass they will inherit from. But what if you want a model that can do both? And what if you want to reuse the confidence interval code in another class that is not a model? Mixins provide the cleanest way to deal with this kind of situation, by making each mixin class focused on a narrow set of functionality, and by keeping them all independent from one another. That way, you just have to think about which features you want to add to your child class, and make it inherit from all the right mixins at the same time as the base class. Something like: class MyGreatModel(BaseModel, ContinualTrainingMixin, ConfidenceIntervalMixin):.
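A minimal sketch of the pattern (all names are hypothetical, and the “model” is a toy):

```python
class BaseModel:
    def fit(self, data):
        self.data = data

class ConfidenceIntervalMixin:
    # narrow, reusable functionality, not meant to be instantiated alone;
    # it relies on the host class providing a predict method
    def confidence_interval(self, x, width=1.0):
        prediction = self.predict(x)
        return (prediction - width, prediction + width)

class MeanModel(BaseModel, ConfidenceIntervalMixin):
    def predict(self, x):
        return sum(self.data) / len(self.data)

model = MeanModel()
model.fit([1.0, 2.0, 3.0])
print(model.confidence_interval(0.0))  # (1.0, 3.0)
```

The same mixin could be bolted onto any other class that has a predict method, model or not.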
Dataclasses
To understand this section, it is better to know about type annotations, seen in Chapter 4 - level 2.
Classes are very useful, but they have the drawback of adding a lot of boilerplate code, in particular in the __init__ function. Typically, you define a class, for example Result, to hold the result of a linear regression, whose attributes will be the fitted parameters, bias, p-value, and R2. In python, the attributes are defined in the __init__ function, which will look something like this:
class Result:
    def __init__(self, parameters, bias, p_value, r2):
        self.parameters = parameters
        self.bias = bias
        self.p_value = p_value
        self.r2 = r2

This code is very predictable and boring; it’s essentially a verbose way of saying “these are the attributes of my class and they can be passed from the initializer”. For these types of classes, which mainly hold data passed directly from the initializer without complicated internal mechanisms, it is handy to use the dataclasses standard library (installed by default with python). It consists of a mere class decorator that renders the initializer implicit if you declare your attributes in the body of the class, with type annotations (those are necessary). The previous class could then be created as follows:
from dataclasses import dataclass
@dataclass
class Result:
    parameters: list[float]
    bias: float
    p_value: float
    r2: float

The decorator will create an __init__ function whose arguments have the same names as the attributes they are assigned to (as any sane programmer would do anyway). So it can be used very simply, for example: result = Result(parameters=[1.2, 3], bias=3.5, p_value=0.01, r2=0.98). The attributes you define in the body of the class[^7] can also have default values, which are automatically mapped to the implicit initializer. On top of that, dataclasses will create a few handy dunder methods for your class, for comparisons (like __eq__), converting to a string (__str__), and other utilities you can learn about in the doc. One handy argument of the dataclass decorator you might want to know about is frozen: when set to True, it ensures that the attributes can only be set at the moment of initialization and cannot be modified individually afterwards (doing result.r2 = 0.87 would raise an error), which is useful for objects that should not be modified after creation, like the Result class above.
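To illustrate the frozen behavior, here is a minimal sketch: reassigning an attribute on a frozen dataclass raises dataclasses.FrozenInstanceError.

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class Result:
    parameters: list[float]
    bias: float
    p_value: float
    r2: float

result = Result(parameters=[1.2, 3], bias=3.5, p_value=0.01, r2=0.98)

try:
    result.r2 = 0.87  # attribute assignment is blocked on a frozen dataclass
except FrozenInstanceError:
    print("cannot modify a frozen instance")
```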
Finally, if you want to customize the __init__ code a bit, for example to check that an argument is within a given range or to set an attribute from the values of others, you can still do so via the __post_init__ special method, described here.
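As a sketch, here is the Result dataclass extended with a __post_init__ that checks the range of p_value and derives a hypothetical n_parameters attribute from the others (both additions are illustrative, not from the text above):

```python
from dataclasses import dataclass

@dataclass
class Result:
    parameters: list[float]
    bias: float
    p_value: float
    r2: float
    n_parameters: int = 0  # derived in __post_init__

    def __post_init__(self):
        # check that an argument is within a given range...
        if not 0 <= self.p_value <= 1:
            raise ValueError(f"p_value must be in [0, 1], got {self.p_value}")
        # ...and set an attribute from the value of others
        self.n_parameters = len(self.parameters)

r = Result(parameters=[1.2, 3.0], bias=3.5, p_value=0.01, r2=0.98)
print(r.n_parameters)  # 2
```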
❯❯❯ Level 3
Pydantic
This section deals with a technique at the interface of object-oriented programming and robust programming, having read the first parts of chapter 4 is thus recommended.
Dataclasses already come in handy when you want to create objects made mostly to handle data, but they keep python’s tradition of “not checking anything”. As we see in chapter 4, python’s blessing and curse is that it lets you do anything, which makes it flexible but dangerous. For example, if you define the Result dataclass shown above and pass a string to the p_value attribute, python will not complain. If you really want this class to be robust, you would have to add tests in the constructor to verify that the types are correct, and that p_value is between 0 and 1, for instance. These kinds of checks are very boring to write, so nobody writes them. Thankfully, pydantic is a library that renders this boilerplate code implicit, so you don’t have to worry about it anymore.
To give a concrete example, here is how we would write the above Result class with pydantic:
from pydantic import BaseModel, Field
class Result(BaseModel):
    parameters: list[float]
    bias: float
    p_value: float = Field(..., ge=0, le=1)
    r2: float

So instead of adding a class decorator as above, you need to inherit from the pydantic class BaseModel (because in data handling theory, such a structure can be called a “data model”, or even a “data schema”). The rest is similar to what you would do for a dataclass, except that you can assign to certain attributes a Field object, which allows you to specify more detailed constraints, such as a range of acceptable values. The first argument of Field is a default value if you want to pass one, and can be ... if you want to require the user to pass a value. The result of this definition is that pydantic will leverage both the type hints and the Field assignments to implicitly check all calls to the constructor of this class. You can try to execute r = Result(parameters=[1, 2], bias="hi", p_value=1.5, r2=0.98) and you will see that pydantic refuses to instantiate the object and gives you informative error messages. Even better, if you pass "1" for the bias attribute in the constructor, pydantic will automatically convert it to a float for you! That’s a lot of desirable boilerplate that’s automatically done behind the scenes.
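As a quick, runnable illustration of this behavior (assuming pydantic v2 is installed):

```python
from pydantic import BaseModel, Field, ValidationError

class Result(BaseModel):
    parameters: list[float]
    bias: float
    p_value: float = Field(..., ge=0, le=1)
    r2: float

# A string that looks like a number is coerced to float:
r = Result(parameters=[1, 2], bias="1", p_value=0.01, r2=0.98)
print(r.bias)  # 1.0, converted from the string "1"

# Invalid values are rejected with informative error messages:
try:
    Result(parameters=[1, 2], bias="hi", p_value=1.5, r2=0.98)
except ValidationError as e:
    print(e)  # reports both the bad bias and the out-of-range p_value
```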
Pydantic models can be configured in a rather peculiar way, by passing keyword arguments to the inheritance statement. For example, changing the first line of the class definition to:
class Result(BaseModel, validate_assignment=True, extra="forbid"):
Will also enforce checks on attribute assignments (for example r.p_value = 1.5 would raise an error, which is unfortunately not the default behavior), and will forbid the creation of attributes that are not defined in the class (for example r.new_attribute = 3 would raise an error). Another useful argument is frozen, which, when set to True, makes any instance of the class immutable. The whole set of arguments that can be passed corresponds to the arguments of pydantic.ConfigDict, which can be found here. This way of passing extra arguments to an inheritance statement is very unusual, and is actually not a python feature but something pydantic implements through metaclasses, which are a subject for later[^8]. This is to say you shouldn’t expect this syntax to work with other classes!
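A minimal sketch of these two options in action (assuming pydantic v2 is installed):

```python
from pydantic import BaseModel, Field, ValidationError

class Result(BaseModel, validate_assignment=True, extra="forbid"):
    parameters: list[float]
    bias: float
    p_value: float = Field(..., ge=0, le=1)
    r2: float

r = Result(parameters=[1.0], bias=0.0, p_value=0.01, r2=0.9)

try:
    r.p_value = 1.5  # rejected thanks to validate_assignment=True
except ValidationError:
    print("assignment rejected")

try:
    r.new_attribute = 3  # rejected thanks to extra="forbid"
except ValueError:  # pydantic raises a ValueError subclass here
    print("extra attribute rejected")
```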
Another interesting feature of pydantic is the possibility to decorate functions with @validate_call to enforce similar type checking capabilities for simple functions:
from math import log
from typing import Annotated
from pydantic import validate_call, Field

@validate_call
def f(x: float, y: Annotated[float, Field(gt=0)]) -> float:
    return x * log(y)

where the syntax for specifying constraints is a bit different, wrapped in the Annotated type (which is simply a way to add metadata to a type hint), but has the same effect.
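For completeness, here is a runnable sketch of the decorated function in action (redefining f so the example stands on its own, assuming pydantic v2 is installed):

```python
from math import log
from typing import Annotated
from pydantic import validate_call, Field

@validate_call
def f(x: float, y: Annotated[float, Field(gt=0)]) -> float:
    return x * log(y)

print(f(2, 1))  # arguments are coerced to float; log(1.0) == 0.0, so this prints 0.0

try:
    f(2, -1)  # violates the gt=0 constraint on y
except ValueError:  # pydantic's ValidationError is a ValueError subclass
    print("rejected: y must be > 0")
```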
[^8] There are alternative equivalent ways to do this, such as adding a model_config attribute as shown here or the old way of adding a Config class inside the model class which can still be seen in some places.
Technically that’s what a “class” is: a super-variable composed of many variables, together with functions acting on them. Some languages offer the “super-variable” feature without the methods (like structs in C), which are a form of proto-class. In python, dictionaries could play such a role.↩︎
Any resemblance to existing machine learning libraries would be a pure coincidence.↩︎
lists are already iterables, in pretty much the same way, this is used here simply to illustrate the concept.↩︎
again, this is purely illustrative, and one could simply iterate over the list but let’s pretend lists are not natively iterable for the sake of the exercise.↩︎
what happens behind the scenes of those __init__.py files is even more complicated than what we just saw, but if you still have some courage left you can read about it here and in references therein.↩︎
the physicists or mathy people around here may see a similarity to the concept of “operator”.↩︎