Arch Linux setup guide tailored towards data science, R and spatial analysis

This guide does not claim to be complete. It reflects my view on how to setup a working Arch Linux system tailored towards data science, R and spatial analysis at the time of writing. If you have suggestions for modifications, please open an issue. Enjoy the power of Linux!

Setting up Arch Linux the old-school way (i.e. configuring all yourself) is quite tedious. To be honest, I’ve never done it myself. I always used distributions like Antergos or Manjaro. Another layer which sits on top of the operating system is the desktop environment (DE) (e.g. GNOME, KDE, XFCE, etc). I will only briefly touch this point and focus on points which apply to all DE’s.

Make sure to check out the ArchWiki FAQs and Arch compared to other distributions - ArchWiki to get a better understanding of Arch.

Installation

Setting up the partitions

Several valid concepts exists on how to partition a Linux system. The following reflects my current view:

  1. Select “Manual” partitioning when being prompted
  2. Create a partition of 1 GB. Mount point: /boot/efi. Format: fat32
  3. Create a SWAP partition which is slightly larger than your RAM size. (e.g. for 16 GB RAM use 16.5 GB partition size). Format: Linux Swap
  4. Create a 50 - 100 GB GB partition for “root”. Mount point: /. Format: ext4
  5. With the remainng space create “home”. Mount point: /home. Format: ext4

Installing the package manager

One of the best wrappers around pacman is trizen. Here is a list (AUR helpers - ArchWiki) comparing alternatives (scroll to the bottom).

Install trizen:

git clone https://aur.archlinux.org/trizen.git
cd trizen
makepkg -si

In ~/.config/trizen/trizen.conf set no_edit => 1. This setting saves you the confirmation prompt asking whether you want to edit a PKGBUILD (i.e. the scripts that specify how to download and install a package).

(Until v1.55 there was also a switch that saved you the prompt asking whether you really want to install the package (install_built_with_noconfirm => 1). Unfortunately it has been removed in v1.56 and now you need to pass --noconfirm to trizen commands to skip the re-occuring confirmation prompts.)

Choose your shell

Almost all Linux distros come with bash (Bourne-again shell) as the default. While bash is not bad, there are better alternatives. If you are already used to bash you have to decide whether it is worth investing the time to switch to a new shell. Most things will be identical and if you have not written scripts with that specific shell, you will not probably not face any problems.

My favorite is the fish shell.

zsh

The zsh (Z-shell) is highly customizable but configuration is a bit complicated. It has several advantages (file globbing, visual appearance, etc.) to bash.

A good zsh helper is prezto: GitHub - sorin-ionescu/prezto: The configuration framework for Zsh). Install it via (trizen zsh). My favorite theme is agnoster.

fish

The fish shell is similar to zsh but comes with better defaults and an easier syntax. The omf package manager is great for installing additional plugins that simplify the shell usage. Check Oh-my-fish for an introduction.

The shell is available in the standard repositories and can be installed with trizen fish.

A great way to get started is to call fish_config in a fish session to configure fish to your needs in a graphical browser window.

Enabling parallel compiling

Compiling certain libraries from source can take time. To speed up the process via parallelization, set the MAKEVARS variable in /etc/makepkg.conf: MAKEFLAGS="-j$(nproc)". This will use all available cores on your machine for compiling.

(This options seems to be enabled by default in some distros. However, you might want to verify this.)

Installing system libraries

A note before the installation process: If you only have 8 GB of RAM, you need to temporary increase the size of your /tmp folder. Otherwise the tmp folder will run out of space due to all the package downloads and installations and you will face weird error messages if you do all the installations without a reboot. Increasing the /tmp folder temporarily can be done by sudo mount -o remount,size=20G,noatime /tmp. This increases the size for your current session to 20 GB. Usually the size is about half or your RAM size. The size will be resetted when you restart your machine next time.

When calling trizen <package> a search in the official repos AND the AUR is done. This is different to pacman which only searches in the official repos. Now you know why a pacman-wrapper is convenient :-)

A word of caution: Try to avoid installing python libraries via pip. If you install libraries via pacman that have python dependencies, pacman will try to install the respective python-<package>. If you installed this package via pip before, you will face conflicts. First search if the package is available in the repos and if not, you can safely install it via pip.

While already talking about python packages, QGIS needs some external python libraries for its plugins. Otherwise it will throw errors during startup.

trizen -S python-gdal python-yaml python-jinja python-psycopg2 python-owslib python-numpy python-pygments

Other important system libraries for spatial applications:

trizen -S gdal udunits postgis jdk-openjdk openjdk-src texlive-most pandoc pandoc-citeproc pandoc-crossref

texlive-most (wrapper package that installs the most important tex libraries. Similar to texlive-full on other Linux distributions.)

Apps

An opinionated list of applications which are great for science and productivity.

A note on Mailspring: If you are on battery, quit the app as it consume a lot of energy while steadily trying to sync your mails.

Editors

Editors are an important topic so I devote an extra section to them. You can use editors to only edit text files but they can also be used as an IDE for coding. There are many editors out there, all loved by a certain group of people.

Here’s a list of the most popular ones (this list does not claim to be complete). Also some of the listed apps are actually IDEs while some are only text editors.

I currently use Neovim which is a fork of vim for quick editing on the command line. A good general purpose IDE is Visual Studio Code. For R development I use RStudio.

Office

trizen --noconfirm libreoffice-fresh

If you chose KDE as the desktop environment, LibreOffice may flicker black/white. This is caused by OpenGl.

To solve it, set the value item in following two lines of ~/.config/libreoffice/4/user/registrymodifications.xcu to false:

<item oor:path="/org.openoffice.Office.Common/VCL"><prop oor:name="ForceOpenGL" oor:op="fuse"><value>false</value></prop></item>
<item oor:path="/org.openoffice.Office.Common/VCL"><prop oor:name="UseOpenGL" oor:op="fuse"><value>false</value></prop></item>

Additionally, I recommend to install the Papirus Icon theme for Libreoffice: trizen -S --noconfirm papirus-libreoffice-theme.

R

General

ccache

This library caches all C compiled code, making reoccuring package installations that use C code a lot faster.

Put the following into ~/.R/Makevars (create it if missing):

VER=
CCACHE=ccache
CC=$(CCACHE) gcc$(VER)
CXX=$(CCACHE) g++$(VER)
C11=$(CCACHE) g++$(VER)
C14=$(CCACHE) g++$(VER)
FC=$(CCACHE) gfortran$(VER)
F77=$(CCACHE) gfortran$(VER)

Additionally, install ccache on your system: trizen ccache. See this blog post by Dirk Eddelbuettel as a reference.

To use R from the shell without a prior defined mirror, you need the system libraries tcl and tk to launch the mirror selection popup (trizen -S tcl tk).

R & RStudio

R with optimized Openblas / LAPACK

Next, install either the “Intel MKL” or libopenblas to be used in favor of the standard “libRlapack/libRblas” libraries that are shipped with the default R installation. These libraries are responsible for numerical computations and have impressive speedups compared to the default libraries. Thanks @marcosci for the hint. While the “Intel MKL” library is the fasted according to the benchmarks, its also much more complicated to install.

libopenblas will automatically be used if its installed since the default R installation on Arch is configured with the --with-blas option (see section A.3.1 in https://cran.r-project.org/doc/manuals/r-release/R-admin.html#Installation). I recommend installing the AUR package openblas-lapack as its package cominbing multiple libraries: trizen -S --noconfirm openblas-lapack.

To verify your installation in R, simply run sessionInfo() and check the printed information:

sessionInfo()

R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

Matrix products: default
BLAS/LAPACK: /usr/lib/libopenblas_haswellp-r0.2.20.so

If you want to try out the “Intel-MKL” library, follow these instructions:

There is an AUR package that provides R compiled with intel-mkl named r-mkl. Note: The download size of intel-mkl is around 4 GB and takes a lot of memory during installation. Most of it will stored in the swap (around 10 GB) so make sure your SWAP space is > 10 GB.

Also to successfully install intel-mkl, you need to temporarly increase the /tmp directory as intel-mkl requires quite some space: sudo mount -o remount,size=30G,noatime /tmp.

RStudio

Use trizen -s rstudio and pick your favorite release channel. During installation R will get installed as a dependency (if you have not already done so). If you build the “git” version, the appearance will match your chosen DE. Others (e.g. the “preview” version) are binary applications which come with a fixed appearance (GTK) and might look odd if you are using a QT-based DE (e.g. KDE).

Packages

Open RStudio and install the R package usethis (it will install quite a few dependencies, get a coffee) and then call usethis::browse_github_pat(). Follow the instructions to set up a valid GITHUB_PAT environment variable that will be used for installing packages from Github.

Task view “Spatial”

Of course you it is not required to install all packages of a task view. You will never use all packages of a task view. In my opinion, however, it is pretty neat to have one command that installs (almost) all packages I use of a certain field. I do not care about the additional packages installed.

Required system libraries:

trizen -S jq gcc-fortran tk nlopt gsl

trizen -s nlopt

trizen -s v8-3.14

For rJava you first need to call sudo R CMD javareconf.

Now you can install the ctv package and then call ctv::install.views("Spatial"). This will install all packages listed in the spatial task view.

Packages that error during installation (Please report back if you have a working solution):

Task view “Machine Learning”

Required system libraries:

Packages that error during installation (Please report back if you have a working solution):

Git* repos

The easiest way (in my opinion) is to use SSH and usethis::create_from_github().

SSH configuration

If you have never set a “ssh keygen pair” at your local machine, please do so by calling ssh-keygen -t rsa.

If you already have a file named id_rsa.pub in your ~/.ssh folder at your local machine, skip this step! Otherwise it will override your existing one and may invalidate previous ssh connections you set up. You now have an id_rsa.pub file in a (hidden!) folder named .ssh within /home (at your local machine). (You can enable viewing hidden files/folders in the file-manager with the shortcut ALT + . (Dolphin) or CTRL + h (Nautilus)).

Next, make sure that the permissions of the ssh files are correct:

  1. The local directory in which you want to store your Github repos should have 777 permissions. This usually is not the case if you create the directory. If the permissions are wrong, usethis::create_from_github() will not be able to write files there. sudo chmod 777 ~/git.
  2. Make sure your ssh keys have the right permissions: sudo chmod 600 ~/.ssh/id_rsa, sudo chmod 644 ~/.id_rsa.pub
  3. Add your ssh-key to the keychain: ssh-add -K ~/.ssh/id_rsa

Sometimes, the “ssh agent” is not initialized when starting a new shell. You can force this behavior by putting the following either into your ~/.bash_profile (if you are using the bash shell)

if [ -f ~/.ssh/agent.env ] ; then
    . ~/.ssh/agent.env > /dev/null
    if ! kill -0 $SSH_AGENT_PID > /dev/null 2>&1; then
        echo "Stale agent file found. Spawning new agent… "
        eval `ssh-agent | tee ~/.ssh/agent.env`
        ssh-add
    fi
else
    echo "Starting ssh-agent"
    eval `ssh-agent | tee ~/.ssh/agent.env`
    ssh-add
fi

or into ~/.config/fish/config.fish (for the fish shell):

# SSH AGENT
setenv SSH_ENV $HOME/.ssh/environment

function start_agent
    echo "Initializing new SSH agent ..."
    ssh-agent -c | sed 's/^echo/#echo/' > $SSH_ENV
    echo "succeeded"
    chmod 600 $SSH_ENV
    . $SSH_ENV > /dev/null
    ssh-add
end

function test_identities
    ssh-add -l | grep "The agent has no identities" > /dev/null
    if [ $status -eq 0 ]
        ssh-add
        if [ $status -eq 2 ]
            start_agent
        end
    end
end

if [ -n "$SSH_AGENT_PID" ]
    ps -ef | grep $SSH_AGENT_PID | grep ssh-agent > /dev/null
    if [ $status -eq 0 ]
        test_identities
    end
else
    if [ -f $SSH_ENV ]
        . $SSH_ENV > /dev/null
    end
    ps -ef | grep $SSH_AGENT_PID | grep -v grep | grep ssh-agent > /dev/null
    if [ $status -eq 0 ]
        test_identities
    else
        start_agent
    end
end

If you are still facing problems, have a look at this comment.

As the very last, you can also hand over the information manually:

cred <- git2r::cred_ssh_key(publickey = "~/.ssh/id_rsa.pub", privatekey = "~/.ssh/id_rsa")

This object is then passed to the credentials argument in create_from_github().

Now clone all your repos from Github, e.g. create_from_github(repo = "pat-s/oddsratio", destdir = "~/git", credentials = cred). Alternatively, you can also check if git2r::check_ssh_key() returns the correct credentials. If it returns

git2r::cred_ssh_key()
$publickey
[1] "/home/pjs/.ssh/id_rsa.pub"

$privatekey
[1] "/home/pjs/.ssh/id_rsa"

$passphrase
character(0)

attr(,"class")
[1] "cred_ssh_key"

you can also use create_from_github(repo = "pat-s/oddsratio", destdir = "~/git", credentials = git2r::cred_ssh_key()).

The little overhead is really worth it: You have a working ssh setup and by reusing the command and just replacing the repo name the cloning off all your repos is done within minutes!

4.5 R from the command line

While most people use R from within RStudio, it is important to have a proper command line setup. I often call R in a second session (besides RStudio) to update packages, run R CMD check on a package, etc. The native R command line that you get when typing R in your shell lacks a lot of features. Fortunately, there is radian. Its advantages are listed in the README of the repo.

I’ve set an alias that maps r to radian. So whenever I type r into the console and hit enter, I get a “21st century ready” R console.

trizen -S --noconfirm radian

Rprofile

The ~/.Rprofile holds several options that will be applied during R startup. However, specifying all custom functions, options and other calls in one R file can get messy. Fortunately, there is the startup package. You can put several .R files into the ~/.Rprofile.d directory. This way you can organize your custom R startup better. Also, by running startup::startup(debug = TRUE) you can actually see what happens if you start R.

You can find my settings in my Dropbox.

Accessing remote servers

File access (file manager)

fstab

There are multiple approaches how to achieve this (Auto-mount network shares (cifs, sshfs, nfs) on-demand using autofs | Patrick Schratz, fstab - ArchWiki).

Here is an example of a fstab setup for a sshfs (to Linux server) and cifs (to Windows server) mount. Append those lines to /etc/fstab; don’t overwrite the existing content as this will result in boot errors otherwise!

# sshfs
sshfs#<username>@<ip>:<remote mount point> <local mount point> fuse        reconnect,idmap=user,transform_symlinks,identityFile=~/.ssh/id_rsa,allow_other,cache=yes,kernel_cache,compression=no,default_permissions,uid=1000,gid=100,umask=0,_netdev,x-systemd.after=network-online.target   0 0

# cifs
//<ip>/<remote mount point> <local mount point> cifs        credentials=/etc/.smbcredentials.txt,uid=1000,file_mode=0775,dir_mode=0775,gid=100,sec=ntlm,vers=1.0,dom=ads.uni-jena.de,forcegid,_netdev,x-systemd.after=network-online.target 0 0

Notes:

Reboot.

Both advantage and disadvantage of using fstab are that it tries to mount the directories during boot. However, often this fails. Either because of a missing network connection at this point or because you need a VPN to access a server remotely.

Manual mount

You can also put this information in a different file and do a manual mount when everything is ready (i.e. your machine is booted, your VPN connected). The syntax looks a bit different then:

sudo sshfs -o reconnect,idmap=user,transform_symlinks,identityFile=~/.ssh/id_rsa,allow_other,cache=yes,kernel_cache,compression=no,default_permissions,uid=1000,gid=100,umask=0 <username>@<server ip>:/ <local mount point>

sudo mount -t cifs -o credentials=<location of credentials file>,uid=1000,file_mode=0775,dir_mode=0775,gid=100,sec=ntlm,vers=1.0,dom=<domain name>,forcegid //<server ip>/<shared folder>     <local mount point>

I won’t go into details of all options which are used here. Check out the manual pages of the respective protocols if you are facing errors.

Executing the mount

If you use fstab, you can mount all mounts with sudo mount -a. The manual approach needs to be saved in a bash script and called from your shell with bash <filename>.sh. To avoid conflicts when remounting (after a network disconnect or similar), I wrote a little helper script:

#! /bin/bash

sudo pkill -kill -f "sshfs"
sudo umount -f /mnt/<name>
sudo umount -l /mnt//<name>
sudo umount -a -t cifs -l
sudo bash `<name of mount file`.sh

Some operations in this file may be redundant or ineffective.

So in summary, I am calling my helper script which does the following:

  1. Unmount all mounts
  2. Mount all mounts specified in <name of mount file>.sh.

Command-line access (Terminal)

SSH setup

Again we use ssh, this time to log into remote servers rather than downloading Github repos. If you have never set a “ssh keygen pair” at your local machine, please do so by calling ssh-keygen -t rsa.

If you already have a file named id_rsa.pub in your ~/.ssh folder at your local machine, skip this step! Otherwise it will override your existing one and may invalidate previous ssh connections you set up. You now have an id_rsa.pub file in a (hidden!) folder named .ssh within /home (at your local machine). Now you need to copy this file (id_rsa.pub) to the server so that you can be identified:

ssh username@<server ip> 'test -d ~/.ssh && mkdir ~/.ssh' # creates the .ssh directory if it does not exist
scp .ssh/id_rsa.pub username@<server>:/home/<username>/.ssh/ # copies your local public key to the server

Every time you log in via command line now, you will not be prompted for your password. You can further simplify the login process. Instead of having to type ssh <user>@<server> you can store all of this information in your ~/.ssh/config file:

Host <name-you-wanna-use>
    user <username>
    Hostname <server>
    IdentityFile /path/to/.ssh/id_rsa

You can easily connect to all servers you have access to with a little one-time effort.

Using tmux and tmuxinator for terminal automatization

tmux is a terminal multiplexer that lets you create complex terminal arrangements. Additionally, you can use keybindings to quickly move between panes and a lot of extensions exists to save and restore layouts.

One extension, namely tmuxinator gives you the power to write template files for server connections. This makes it possible to load several server connections with just one command.

For example, I currently have connections to six different servers stored in my config file. Executing tmuxinator start servers opens 6 windows with 3 panes each for each server. Here is a screenshot:

KDE

KDE is probably the most customizable desktop environment. By default it comes with a Windows-like navigation panel. However, you can have a dock by using latte-dock. I used KDE for quite some time and I was satisfied with most of it.

Networkmanager

If you want to use an automatic login to a VPN and the networkmanager-daemon (e.g. Openconnect) does not store your password, try the network-manager-applet package. It is the GNOME network-manager and has (for whatever reason) no problems with storing the password.

GNOME

GNOME looks more like macOS. Clean, fresh, smooth. However, it needs a bit more love in my opinion to get a proper working setup. You need to find which gnome-extensions you need for your work. Most of them are present in the AUR.

Also (and this is the biggest downside) it consumes a lot of memory. I ditched it in the end because of this. On my 16 GB machine, I could hardly have all my standard apps open with now hitting my RAM limit. In contrast, tiling window managers (e.gg. i3 or Sway) hardly cross the 8 GB mark with the same applications.

I had the following extensions installed:

trizen -Q | grep "gnome-shell"
gnome-shell 1:3.30.2-1
gnome-shell-extension-appindicator-git 1:21+15+g7bd97d4-1
gnome-shell-extension-arch-update 28-1
gnome-shell-extension-cpufreq-git v30.0.r0.g17e0861-1
gnome-shell-extension-dash-to-dock 1:64-1
gnome-shell-extension-freon-git 35.r16.g2cf167a-1
gnome-shell-extension-gravatar 4-1
gnome-shell-extension-multi-monitors-add-on-git 20171018-1
gnome-shell-extension-status-menu-buttons 0.3-2
gnome-shell-extension-system-monitor-git 884.21ae32a-1
gnome-shell-extension-taskbar 57.0-1
gnome-shell-extension-topicons 1:20-1
gnome-shell-extension-topicons-plus-git 21+23+g1766933-1
gnome-shell-extension-weather-git 1:r589.ea2d56a-1
gnome-shell-extensions 3.30.1-1

You can control the settings via the “Tweak tool” that comes with GNOME by default.

Note that some GNOME based distros come with Wayland as the desktop compositor library while KDE still uses X. I am not an expert in both but the future is said to be Wayland. You might run into some troubles with driver support for Wayland and there is much better documentation and help for X. You can also run GNOME with X if you are facing too many issues.

Laptop battery life optimization

Although the Linux kernel has a lot of power saving options, they are not all enabled by default.

There are two main power optimization tools:

I prefer tlp as powertop often causes trouble with USB devices going into sleep mode. Also, applying the changes on boot is easier with tlp.

Do trizen -S --noconfirm tlp and then follow the instructions on TLP - ArchWiki to configure it correctly. powertop though is useful to check the applied settings. Do sudo powertop and go to the “tunables” section and check if most settings are “GOOD” (most are “BAD” before applying tlp).

Miscellaneous

Backup your config files and settings

A machine can crash any time. It is not only important to backup your data and scripts (or even better, have them stored in the cloud). You also want to backup all configurations of your apps so that you can restore them easily with the most recent settings. This is also important and useful if you want to sync all your configurations across multiple machines.

In the past I recommended mackup but nowadys I highly recommedn yadm.

arara

GitHub - cereda/arara: arara is a TeX automation tool based on rules and directives. An automatization tool for TeX: pac install arara-git. However, lately I use the latex-workshop extention in Visual Studio Code for all my LaTex stuff.

latexindent.pl: Required perl modules

latexindent is a library which automatically indents your LaTeX document during compilation: GitHub - cmhughes/latexindent.pl

trizen -S perl-log-dispatch perl-dbix-log4perl perl-file-homedir perl-unicode-linebreak.

It’s also integrad in the latex-workshop extention in Visual Studio Code.

Editor schemes

Fonts

Icon themes

Desktop Themes

KDE

My overall desktop theme favorite is “Adapta”. Set it via “System Settings -> Workspace theme -> Desktop Theme”. For “Look and Feel” I prefer “Arc Dark”.

To install these, simply click on “Get new looks” on the bottom right when you are in “System Settings -> Workspace theme”.

GNOME

KDE apps in GNOME (and the other way round) usually have an odd appearance because they are powered by different graphical libraries. To make KDE apps look acceptable in GNOME, do the following:

  1. Install qt5ct
  2. Set the environment variable QT_QPA_PLATFORMTHEME to "qt5ct" (add set -gx QT_QPA_PLATFORMTHEME "qt5ct" in .config/fish/config.fish).
  3. Run qt5ct and change the settings to your liking. Note: The default GNOME font is “Cantarell Regular 11pt”. To select the default KDE “Adwaita” theme, you need to install it first: trizen -s adwaita-qt5.

Now you can enjoy KDE apps such as Dolphin or Okular. However, you have to start them from the command line. A convenient workaround is to autostart them at boot. Create a file called .config/autostart/dolphin.desktop with the following content:

[Desktop Entry]
Name=dolphin
Comment=Run dolphin
Exec=dolphin
Terminal=false
Type=Application⏎

Presentations

To create presentations I use the R package xaringan. Usually I convert the resulting HTML slides to PDF using decktape (install with trizen -s nodejs-decktape) and present the talk using impressive (install with trizen -s impressive).

Touchpad drivers

Some devices need to install the “synaptic touchpad drivers” to enable the a “soft-click” window activation. trizen -s xf86-input-libinput.

Other helpful tools