In this article series, I will explain the benefits of using Windows 10 with Windows Subsystem for Linux 2 (WSL 2) for machine learning (ML) problems.
The series is published in three parts. In part one, we covered what you need to know before using GPU-accelerated models on your laptop. In this part, we’ll go through the benefits of using WSL 2 and discuss why you might want to avoid Mac OS in machine learning. The next article will be an in-depth install guide for WSL 2 in case you run into problems.
I have tried all the major operating systems several times during the past 15 years. The best years for Mac OS seemed to be 2012–2014. As of 2021, I would choose Chrome OS, because there is nothing better than developing against a cloud-native copy of the production environment, or some smaller version of that same system.
The second-best option would be Linux, but I have not been able to install it without glitches on any machine that has an Nvidia GPU, which I need for local testing of GPU-accelerated model training.
Most data scientists (and developers in general) choose Mac OS and make do without local testing of GPU-accelerated models, which is fine I suppose. But it does cause some persistent problems.
Many data scientists work with more mature models, so they rarely have to fix the code that accompanies university papers. Others lack full-stack development experience and are unaware that some problems can and should be fixed.
GPU acceleration is nice, but there are other problems in Mac OS too
It is very nice to have the option of testing GPU-accelerated models locally on your laptop, but using Windows Subsystem for Linux 2 also solves a few other important problems mentioned below. It is almost as good as cloud-native development with Chrome OS against real server machines, as WSL 2 uses a real Linux kernel locally.
There are three fundamental problems in Mac OS compared to Linux that can make the system you are trying to deploy look faster on your laptop than it will be in production (this is about CPU and memory performance, not model performance). Two of them are specific to data science: CPU-heavy and file-I/O-heavy (pre)processing. The third is package management, which is common to all development.
Linux is the superior development platform in all three aspects, mostly because you are always developing for Linux servers.
1) The problem with processing-intensive solutions
CPU-heavy processes need multithreading or multiprocessing. In Linux, a good default is to build small programs that can be piped together into a bigger process. Once you learn the paradigm, it is easy to use and tends to keep the CPU cores busy, because each stage of the pipe runs as its own process.
In my opinion, this guides you towards good software architecture choices (modularization of code).
So you basically just develop with this paradigm, and the OS runs the stages of the pipeline in parallel for you. Also, if you need more fine-grained multithreading at the programming-language level, Linux is a reliable companion, because the server machine handles all these concepts in the same way.
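As a minimal sketch of the pipe paradigm, the example below connects two tiny programs with an OS pipe, much like `produce | square` would in a shell. The two stages and the `run_pipeline` name are made up for illustration; real stages would be your own preprocessing programs.

```python
import subprocess
import sys

# Two tiny "programs", written inline so the sketch is self-contained.
# In a real project these would be separate scripts or binaries.
PRODUCE = "for i in range(5): print(i)"
SQUARE = "import sys\nfor line in sys.stdin: print(int(line) ** 2)"


def run_pipeline():
    """Run the two stages as separate processes connected by a pipe."""
    producer = subprocess.Popen(
        [sys.executable, "-c", PRODUCE],
        stdout=subprocess.PIPE,
    )
    consumer = subprocess.Popen(
        [sys.executable, "-c", SQUARE],
        stdin=producer.stdout,
        stdout=subprocess.PIPE,
        text=True,
    )
    # Close our copy of the pipe so only the consumer holds the read end.
    producer.stdout.close()
    output, _ = consumer.communicate()
    producer.wait()
    return [int(value) for value in output.split()]
```

Both stages run at the same time, and the kernel decides how to schedule them across cores, so adding a stage does not require any threading code of your own.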
Mac OS is different in places, which means that what is well CPU-optimized on Mac OS might perform poorly on a Linux system.
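One concrete example of such a difference, as far as I know, is the default way Python's `multiprocessing` module starts worker processes: since Python 3.8 the default on Mac OS is `spawn`, while on Linux it has traditionally been `fork`. The sketch below pins the start method explicitly so that both machines behave the same way; `portable_context` is a hypothetical helper name.

```python
import multiprocessing as mp


def portable_context(method="spawn"):
    """Return a multiprocessing context with an explicit start method.

    "spawn" is available on every platform, so code tested with it
    should behave the same way on a laptop and on a Linux server.
    """
    return mp.get_context(method)


def describe_start_methods():
    """Report the platform default and all supported start methods."""
    return {
        "default": mp.get_start_method(),
        "available": mp.get_all_start_methods(),
    }
```

You would then create pools and processes from the returned context (for example `portable_context().Pool(4)`) instead of from the module-level functions, which follow the platform default.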
2) The problem with file-I/O-heavy solutions
File-I/O-heavy processes depend on a few basic tricks: sorting the data and pushing it to the memory file system from HDD or SSD drives (or from Hadoop). Most of the time, it is easiest to optimize a pipeline for fast inter-cloud bucket access that writes directly to RAM without any mass-storage access besides the cloud buckets.
In Linux, you would use `/dev/shm`, the RAM-backed file system that also backs POSIX shared memory, but that path doesn’t exist in Mac OS. Directory listing order is another trap: neither system guarantees a sorted list, and the order you get depends on the file system, so code that happens to see file names in one order on your laptop may see a different order on the server.
This can lead to unexpected bugs, especially when you use third-party components that were tested on Linux, but not on Mac OS.
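A defensive way to handle both differences is sketched below: fall back to the regular temp directory when `/dev/shm` is missing, and sort directory listings explicitly rather than relying on whatever order the OS returns. The helper names `scratch_dir` and `list_files_sorted` are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def scratch_dir():
    """Return a RAM-backed directory when available, else the temp dir."""
    shm = Path("/dev/shm")
    if shm.is_dir() and os.access(shm, os.W_OK):
        return shm
    # No usable /dev/shm (for example on Mac OS): use the default temp dir.
    return Path(tempfile.gettempdir())


def list_files_sorted(directory):
    """List regular files in a deterministic, name-sorted order."""
    return sorted(p for p in Path(directory).iterdir() if p.is_file())
```

Note that the fallback directory is usually disk-backed, so timings measured with it will not match what you get from RAM on the server.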
Another non-performance-related issue that differs between the operating systems is access right management. If you share a git repository with Linux developers and Mac OS developers, a commit by a Mac user can change a file access right in a way that prevents the Linux system from accessing the file any more.
This doesn’t happen often, but I have experienced it several times with some JVM systems (I do not understand the root of the problem, though).
3) The problem of untouchable dependencies
There is another problem with dependencies on third-party solutions, which you will very often have in real-world projects (or I seem to bump into this at any rate). You can install GNU libraries on Mac OS, but if you depend on some deeply nested combination of JVM thingies, or other everyday developer stuff like that, there could be some Linux-related assumptions in the implementation, and it just might not work no matter how you install the GNU libraries.
Usually the symptom is missing data. The fundamental difference (or at least the most important one for what I do) in the Mac OS implementation of GNU libraries is that they count byte stream sizes differently, using different default units (MiB vs MB).
This is usually easy to debug if you know what you are looking for; just check whether the last files pushed into the data pipeline were left unprocessed. But if you do not know what you are looking for, you could very well spend days debugging these.
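To make the size of the unit mismatch concrete, and to sketch the "last files left unprocessed" check, here is a small example. The function names and the file names in the usage note are made up for illustration; they are not part of any tool mentioned above.

```python
MB = 1000 ** 2   # decimal megabyte: 1,000,000 bytes
MIB = 1024 ** 2  # binary mebibyte: 1,048,576 bytes


def unread_bytes(size_in_units, writer_unit=MIB, reader_unit=MB):
    """Bytes left unread when writer and reader disagree on the unit."""
    return size_in_units * writer_unit - size_in_units * reader_unit


def unprocessed_tail(pushed, processed):
    """Return the trailing inputs, in push order, that were never processed."""
    done = set(processed)
    tail = []
    for name in reversed(pushed):
        if name in done:
            break
        tail.append(name)
    return list(reversed(tail))
```

For a stream declared as 100 units, `unread_bytes(100)` gives 4,857,600 bytes that the reader never sees, which is easily enough to drop the last few files. Calling `unprocessed_tail(["a", "b", "c", "d"], ["a", "b"])` returns `["c", "d"]`, which is the pattern to look for.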
That’s all for now. The next article will be an in-depth install guide for WSL 2 in case you run into problems. Happy GPU-accelerated development times and stay tuned!