LFS, round #6

It’s been a while since I wrote a technical post in this series; since the last post I have made a build of what I called the Viewpoint Linux Distribution available. This post will cover the time between that post (round #5) and the launch of the distro.

By the time I’d written the previous post, things had roughly taken shape and I was thinking about what would sit on top via packaged software. Having been interested in Guix from afar, I thought about using that, as there had been some interesting talks about it at FOSDEM’s Declarative and Minimalistic Computing devroom a month prior. I didn’t end up going down that route because Guix requires GNU Guile, GnuTLS, and various extensions for Guile. The problem is not so much what its requirements are, but that I would have to ship and maintain copies of them in the base OS, and I didn’t want to do that, so I stuck with what I knew. I’ve spent a lot of time with pkgsrc and am comfortable working with it. pkgsrc gives you control over where it satisfies dependencies from, and as long as you have a shell & compiler installed it can get itself to a working state. Unless told otherwise, the bootstrap process on Linux opts to satisfy all dependencies from pkgsrc itself and ignore anything already installed on the system. This behaviour can be overridden by specifying --prefer-native yes when bootstrapping, which was preferable in this scenario since the OS was using recent if not the latest available versions of things. Despite preferring native components, when it came to building packages, things that were present on the OS were being built again anyway, specifically readline.
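For reference, a bootstrap run with that preference set looks roughly like this (run from the bootstrap directory of a pkgsrc checkout):

$ cd /usr/pkgsrc/bootstrap
$ ./bootstrap --prefer-native yes

The readline case was easy to confirm by asking pkgsrc whether it considered the builtin usable: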

$ cd /usr/pkgsrc/shells/bash ; bmake show-var VARNAME=IS_BUILTIN.readline
no

After some investigation it turned out the builtin detection mechanism was not working, so dependencies would always get built. This was due to a difference between where libraries are installed when following the LFS guide and where pkgsrc expects to find them: the LFS guide specifies /usr/lib for libdir, while pkgsrc expects to find libraries in /usr/lib${LIBABISUFFIX}, which in this case expands to /usr/lib64. Just to move things along I patched pkgsrc/mk/platform/Linux.mk to include /usr/lib in _OPSYS_SYSTEM_RPATH / _OPSYS_LIB_DIRS, and builtin detection then started working. With a working packaging system, I began packaging BCC and bpftrace, though in the end I opted to use the bpftrace binary which the project produces with every release. This made things easier as there is a working environment out of the box to start with, and if BCC is needed it can be installed; but since the BPF Performance Tools book is largely about using bpftrace, you get to start off without dealing with packaging. Keeping the packaging system a separate component also saves on shipping a bootstrap kit for it with every release, along with packages that are likely stale depending on how quickly things evolve. I dislike the idea of having to run a package update on first boot to shed stale packages shipped with the OS.
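Going back to the Linux.mk change for a moment, the local patch amounted to something along these lines (the exact defaults differ between pkgsrc branches, so treat this as a sketch rather than the precise diff):

# mk/platform/Linux.mk: let builtin detection also look in /usr/lib
_OPSYS_SYSTEM_RPATH?=	/usr/lib:/usr/lib${LIBABISUFFIX}
_OPSYS_LIB_DIRS?=	/usr/lib /usr/lib${LIBABISUFFIX}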

After testing various things out I set out to make a new build of the distro to publish, this time opting to use lib64 as the libdir to reduce the need for changes to pkgsrc. I have not attempted any large bulkbuild runs, but the Emacs 21 package was definitely not happy, as it expected to find some things in /usr/lib.

There are various packages which ship with DTrace USDT probes which bpftrace can also make use of. This involves building those packages with DTrace support enabled; on Linux the relevant work is done by SystemTap, which provides a Python script called dtrace. I created a package for it, but since it requires Python, it created a circular dependency when using Python 3, as Python 3 itself has USDT probes. As a workaround to sidestep the issue, my SystemTap package uses Python 2, which is still supported by SystemTap. To enable building with DTrace support I introduced a “dtrace tool” which pulls in SystemTap as a dependency on Linux when USE_TOOLS+=dtrace is specified, and nothing on other platforms. I then added USE_TOOLS+=dtrace across the tree where dtrace was a supported option.
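On the consuming side, a package with optional USDT probes ends up carrying something like the following in its Makefile; the configure flag name here is hypothetical and varies from package to package:

# enable USDT probes; pulls in SystemTap via the dtrace tool on Linux
USE_TOOLS+=		dtrace
CONFIGURE_ARGS+=	--enable-dtrace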

bpftrace listing the USDT probes found in libpython built from the Python 3.8 package in pkgsrc
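That listing came from a command along these lines, assuming pkgsrc’s default /usr/pkg prefix and the usual library name:

$ bpftrace -l 'usdt:/usr/pkg/lib/libpython3.8.so'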

With the OS rebuild, I dropped nscd(8) from the system; the thought of having up to three caching resolvers (nscd/systemd-resolved/unbound) seemed a bit excessive. This post highlights why you might not want nscd support on your system. As part of the rebuild I began populating the repository with sources for everything that would ship with the distro. It was a tedious process that slowed down as I progressed through the build and imported more and more components, because for each initial import I would roll the tree back to the start, import into a branch, update to the tip of the tree, merge the branch, and repeat. I used the hg-git mercurial plugin to convert and push the tree to a Git mirror.
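The hg-git side of things is little more than enabling the extension, bookmarking the default branch as master, and pushing to a Git URL; the repository address below is a placeholder:

# ~/.hgrc
[extensions]
hggit =

$ hg bookmarks -r default master
$ hg push git+ssh://git@example.org/viewpoint.git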

The kernel config used started life as the default config which gets created when you run make defconfig, and was built up from there to cover what the LFS guide suggests plus the options required by BCC / bpftrace. Testing that X11 worked ok revealed that I was missing various options, from mouse support to support for emulated graphics. The safe bet is to use the VMware virtual card on both VirtualBox (VMSVGA, the default) and QEMU; other options resulted in offset problems with the cursor, where it would appear in one place on the screen but clicks and drags would register at a different location. Everything works out of the box with the VMware option.
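As an illustration of the kind of options that had to be switched on over the defconfig baseline (not the exact config that shipped):

# graphics & input for X11 under VirtualBox/QEMU
CONFIG_DRM_VMWGFX=y
CONFIG_INPUT_MOUSEDEV=y
CONFIG_MOUSE_PS2=y
# required by BCC / bpftrace
CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_EVENTS=y
CONFIG_KPROBES=y
CONFIG_UPROBE_EVENTS=y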

I’ve been really impressed by how quickly the system boots and shuts down (not having an initrd image to load, and having minimal drivers to probe for, account for that), and I hope I don’t end up losing that. I used the work leading up to the release as an excuse to start using org-mode in Emacs. Following the beginners guide, I now have a long list of todo items which I work through. The next big item is build infrastructure so I can turn around releases quicker.

Book review: BPF Performance Tools: Linux System and Application Observability

It’s more than 11 years since the shouting in the data centre video landed, and in 2020 I still manage to surprise folks who have never seen it with what is possible.
The idea that such transparency is a reality in some circles comes as a shock.

Without the facility to dynamically instrument a system, the operator is severely limited in the insight they can gain into what is happening using conventional tools alone. Having to resort to debugging tools to gain insight is usually a non-option for several reasons:
1) it is disruptive (the application may need to be re-invoked via tooling).
2) it has a considerable performance impact.
3) it is unable to provide a holistic view (it may provide insight into one component, leaving the operator to correlate information from other sources).
If you do have the luxury, the problem is how you instrument the system.
The mechanism offers the ability to ask questions about the system, but can you formulate the right question? This book hopefully helps with that.

To observe an application you need both resource analysis and application-level analysis. BPF tracing allows you to study the flow from the application, with its code and context, through libraries and syscalls, kernel services, and device drivers. Imagine taking the various ways disk I/O is instrumented and adding the query string as another dimension for breakdowns.

The BPF Performance Tools book centres around bpftrace but covers BCC as well. bpftrace gives you a DTrace-like tool for one-liners and for writing scripts in a language similar to D, so if you are comfortable with DTrace the syntax should be familiar, though it is slightly different.
BCC provides a more powerful but more complex interface for writing scripts, leveraging other languages to compose the desired tool. I believe the majority of the BCC tools use Python, though LuaJIT is supported too.
Either way, in the background everything ends up as LLVM IR and goes through libLLVM to compile to BPF.
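To give a flavour of the one-liner side, this is the sort of thing bpftrace makes trivial: tracing new processes via the execve tracepoint, using the join() built-in mentioned in the quote further down.

$ bpftrace -e 'tracepoint:syscalls:sys_enter_execve { join(args->argv); }'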

The first part of the book covers the technology, starting with an introduction to eBPF and moving on to cover the history, the interfaces, how things work, and the tooling which complements eBPF, such as PMCs, flame graphs, perf_events, and more.
A quick introduction to performance analysis followed by a BCC and bpftrace introduction rounds off the first part of the book in preparation for applying them to different parts of a system, broken down by chapter, starting with CPU.

The methodology is clear cut: use the traditional tools commonly available to gauge the state of the system, then use bpftrace or BCC to home in on the problem, iterating through the layers of the system to find the root cause, as opposed to trying to solve things purely with eBPF.

I did not read the third and fourth sections of the book, which cover additional topics and the appendixes, but I suspect I will be returning to read the “tips, tricks and common problems” chapter.
From the first sixteen chapters, which I did read, the CPU chapter really helped me understand the way CPU usage is measured on Linux. I enjoyed the chapter dedicated to languages, especially the Bash Shell section: given a binary (in this case bash), how you go about extracting information from it, whether or not it has been compiled with frame pointers preserved, and how you could extend the shell to add USDT probes.
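Not an example lifted from the book itself, but the classic readline uretprobe one-liner gives an idea of the kind of instrumentation that section is about:

$ bpftrace -e 'uretprobe:/bin/bash:readline { printf("read a line: %s\n", str(retval)); }'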
I did not finish the Java section; it was too painful to read about what needs to be done, owing to Java being a C++ code base with a JIT runtime (the book states it is a complex target to trace), and I couldn’t bring myself to read the containers *yawn* chapter.
All the scripts covered in the book have their history covered in the footnotes of the page, which was nice to see (I like history):

I created the first execsnoop using DTrace on 24-Mar-2004, to solve a common performance problem I was seeing with short-lived processes in Solaris environments. My prior analysis technique was to enable process accounting or BSM auditing and pick the exec events out of the logs, but both of these came with caveats: Process accounting truncated the process name and arguments to only eight characters. By comparison, my execsnoop tool could be run on a system immediately, without needing special audit modes, and could show much more of the command string. execsnoop is installed by default on OS X, and some Solaris and BSD versions. I also developed the BCC version on 7-Feb-2016, and the bpftrace version on 15-Nov-2017, and for that I added the join() built-in to bpftrace.

and a heads-up is given on the impact running each script is likely to have, because some will have a noticeable impact:

The performance overhead of offcputime(8) can be significant, exceeding 5%, depending on the rate of context switches. This is at least manageable: it could be run for short periods in production as needed. Prior to BPF, performing off-CPU analysis involved dumping all stacks to user-space for post processing, and the overhead was usually prohibitive for production use.

I followed the book with a copy of Ubuntu 20.04 installed on my ThinkPad X230 and it mostly went smoothly. The only annoying thing was that user-space stack traces were usually broken due to things such as libc not being built with frame pointers preserved (-fno-omit-frame-pointer).
Section 13.2.9 discusses the issue and the libc/libpthread rebuild requirement, as well as pointing to the Debian bug tracking the issue.
I’m comfortable compiling and installing software, but I didn’t want to go down the rabbit hole of trying to rebuild my OS as I worked through the book just yet; the thought of maintaining such a system alongside binary updates from the vendor seemed like a hassle in this space. My next step is to address that so I have working stack traces. 🙂
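The fix itself, when you control the build, is just a matter of keeping frame pointers when the relevant libraries are compiled; for glibc that means something along these lines, run from a separate build directory inside the source tree (a sketch, not a tested recipe):

$ CFLAGS="-O2 -g -fno-omit-frame-pointer" ../configure --prefix=/usr
$ make && make install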

Besides that, I enjoyed reading the book especially the background/history parts and look forward to Systems Performance: Enterprise and the Cloud, 2nd Edition, which is out in a couple of months.

Juniper SRX & FreeBSD/mips

I didn’t realise the Juniper SRX line (at least the SRX100) was based on a MIPS SoC, the Cavium OCTEON.

CPU in a SRX100b
OCTEON CN5020-SCP pass 1.1, Core clock: 500 MHz, DDR clock: 266MHz (532 Mhz data rate)

dmesg from SRX100

Thinking about it now, I understand why Juniper contributed the code back up to FreeBSD in 2007, & as I search around for reference material to link to in this blog post the pieces are falling into place.
An announcement was made at the start of the month that DTrace had been ported to FreeBSD/MIPS by Oleksandr Tymoshenko.
What this will mean is that when the code makes it back into a Junos release, you will have the ability to get near-realtime answers about what is going on in your router or firewall, for example using the network provider, & it’ll be safe to run in production because DTrace is designed not to be harmful. That is something Cisco doesn’t offer, & the use of debug commands is discouraged on production systems because they are considered harmful.

If you’ve never played with DTrace & have a Mac, it’s available on all systems running Leopard & above; see this article on getting started.
It’s available in Solaris (& derivatives), which is also where it originates from, & on FreeBSD, where the system has to be rebuilt to enable support; see the wiki article for details.
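If you want a quick taste of it, the canonical first one-liner counts system calls by process name; run it as root & hit Ctrl-C to print the summary:

$ dtrace -n 'syscall:::entry { @[execname] = count(); }'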