Software Engineering for Scientists

Author

Adrian Valente

Published

July 1, 2025

[This book is a draft, and many parts are still to be finished.]

Welcome to the Software Engineering for Scientists book! This e-book compiles information on the diverse facets of software engineering that can help you do good science day-to-day. If you want to learn Git, automated testing, or the principles of maintainable code quickly, you’re in the right place!

Book organization

This book is organized as a matrix, because that makes it easier to read for a heterogeneous audience, and because I’m a nerd. The columns are the chapters, each covering a different topic, while the rows are the levels, from 1 to 3. That way, if you already feel comfortable, you can jump directly to the advanced topics on all fronts; if you are just beginning, you can read only the level 1 parts of each chapter first. The chapters are pretty much independent, by the way, so feel free to read them in whatever order suits you best. Moreover, each level-chapter is itself divided into little “nuggets” that should take only a few minutes to read and give you directly actionable advice that you can start applying.

Prerequisites

This book is NOT a book for learning programming: there are already a fair number of those! I’ll expect that you know the basics of Python and have already been using it from time to time. I’ll also expect that you know the very basics of the terminal. This book is not really aimed at R or Matlab users, simply because it is very hard to do actual software engineering with these languages. Python hits a sweet spot: it is great for scientific computing and quick prototyping, and it has everything one needs to build complex professional projects. As such, I really urge you to switch to it if you haven’t already. In any case, certain chapters, as well as most general concepts, will be useful even to users of other languages.

Why this book?

As someone who has spent time both in the software industry and in academia (computational neuroscience mostly), I noticed that some of the skills I acquired in the former were tremendously useful in research. Many tools and processes that professional programmers use every day are practically never taught in universities, and as such academic labs tend to perpetuate a certain tradition of spaghettiness in their codebases. We can hardly blame them, however: software engineering (SWE) is vast, and it is hard to know where to even begin learning it on your own. Moreover, programmers can care all day about CI pipelines and modularity, but scientists have a whole other field of knowledge to focus on!

However, I believe that if you are an early-career scientist, you need to take some time to master SWE, for the following reasons:

- Teamwork: research is moving towards big science, with projects involving many people collaborating on a single codebase, and this requires a mastery of the industry’s best practices.
- Reproducibility: it is not so rare to be unable to reproduce one’s own results after a few months. To avoid this dread, and to move towards a world where we can easily check each other’s results and build on them, a few good practices can go a long way.
- Your future: most PhD students eventually leave academia, and quite a few of them find jobs related to software. Knowing the good practices will put you considerably ahead when looking for such a job. This is especially true nowadays, as anyone can program with an LLM at their side, and merely knowing how to code is no longer considered sufficient.
- Vibe-coding in peace: you may use AI tools to help you write code in your daily work, and you definitely should! However, these tools don’t scale very well: they can make your code a mess even faster than you can, and they make errors that are hard to spot. Being good at version control, testing, debugging, and organizing your codebase will help you use the full power of vibe-coding without losing your mind.
- Efficiency and productivity: countless PhD students’ nights have been lost to bugs that a test could easily have caught, to reproducing a result from an earlier version of the code, or to deciphering what the previous postdoc’s code was doing. All the principles in this book are consistently applied in the industry because they are effective and save enormous amounts of time and money, a fact that remains true for academics.

If you are convinced, let’s jump in!

About the author

My name is Adrian Valente, and you can learn more about me on my website. I did my PhD in Paris, at the intersection of computational neuroscience and machine learning, and have worked in computational biology at Centre Léon Bérard, and in software engineering at Microsoft and now InstaDeep. I love code, research, and bringing the two together.

You can reach me via Bluesky or LinkedIn, or find my email on my website.