New laptop and Silverblue update
Figured I'd post an update on how things are going with the new laptop (HP Omnibook Ultra 14, AMD Ryzen AI 9 365 "Strix Point", for the searchers) and with Silverblue.
I managed to work around the hub issue by swapping out the fancy $300 Thunderbolt hub for a $40 USB-C hub off Amazon. This comes with limitations - you're only going to get a single 4K 60Hz external display, and limited bandwidth for anything else - but it's sufficient for my needs, and makes me regret buying the fancy hub in the first place. It seems to work 100% reliably on startup, on reboot, and across suspend/resume. There's still clearly something wrong with Thunderbolt handling in the kernel, but it's not my problem any more.
The poor performance of some sites in Firefox turned out to be tied to the hanging problem - I'd disabled graphics acceleration in Firefox, which helped with the hanging, but was causing the appalling performance on Google sites and others. I've now cargo-culted a set of kernel args - amdgpu.dcdebugmask=0x800 amdgpu.lockup_timeout=100000 drm.vblankoffdelay=0 - which seem to be helping; I turned graphics acceleration back on in Firefox and it hasn't started hanging again. At least, I haven't had any random hangs for the last few days, and this morning I played a YouTube video and the system hasn't hung since. I've no idea how bad those args are for battery life, but hey, they seem to be keeping things stable. So the system is pretty workable at this point. I've been using it full-time and haven't had to go back to the old one.
I'm also feeling better about Silverblue as a main OS this time. A lot of things seem to have got better. The toolbox container experience is pretty smooth now. I managed to get adb working inside a container by putting these udev rules in /etc/udev/rules.d. It seems like I have to kill and re-start the adb server any time the phone disconnects or reboots - usually adb would keep seeing the phone just fine across those events - but it's a minor inconvenience. I had to print something yesterday, was worried for a moment that I'd have to figure out how to get hp-setup to do its thing, but then...Silverblue saw my ancient HP printer on the network, let me print to it, and it worked, all without any manual setup at all. It seems to be working over IPP, but I'm a bit surprised, as the printer is from 2010 or 2011 and I don't think it worked before. But I'm not complaining!
I haven't had any real issues with app availability so far. All the desktop apps I need to use are available as flatpaks, and the toolbox container handles CLI stuff. I'm running Firefox (baked-in version), Evolution, gedit, ptyxis (built-in), liferea, nheko, slack and vesktop (for discord) without any trouble. LibreOffice and GIMP flatpaks also work fine. Everything's really been pretty smooth.
I do have a couple of tweaks in my bashrc (I put them in a file in ~/.bashrc.d, which is a neat invention) that other Atomic users might find useful...
    if [ -n "$container" ]; then
        alias gedit="flatpak-spawn --host /var/lib/flatpak/exports/bin/org.gnome.gedit"
        alias xdg-open=flatpak-xdg-open
    else
        alias gedit=/var/lib/flatpak/exports/bin/org.gnome.gedit
    fi

The gedit aliases let me do gedit somefile either inside or outside a container, and the file just opens in my existing gedit instance. Can't really live without that. You can adapt it for anything that's a flatpak app on the host. The xdg-open alias similarly makes xdg-open somefile within a container do the same as it would outside the container.
So it's still early days, but I'm optimistic I'll keep this setup this time. I might try rebasing to the bootc build soon.
New laptop experience (Fedora on HP Omnibook Ultra 14 - Ryzen AI 365, "Strix Point")
New year, new blog post! Fedora's going great...41 came out and seems to be getting good reviews, there's exciting stuff going on with atomic/bootc, we're getting a new forge, it's an exciting time to be alive...
Personally I've spent a large chunk of the last few weeks bashing my head against two awkward bugs - one kernel bug, one systemd bug. It's a bit of a slog, but hey. Also now working up to our next big openQA project, extending test coverage for the new installer.
But also! I bought myself a new laptop. For the last couple of years I've been using a Dell XPS 13 9315, the Alder Lake generation. I've been using various generations of XPS 13 ever since Sony stopped making laptops (previously I used a 2010 Vaio Z) - I always found it to be the best thin-and-light design, and this one was definitely that. But over time it felt really underpowered. Some of this is the fault of modern apps. I have to run a dumb amount of modern chat apps, and while they're much nicer than IRC, they sure use a lot more resources than hexchat. Of course I have a browser with about 50 tabs open at all times, Evolution uses quite a lot of memory for my email setup for some reason, and obviously I have to run VMs quite often for my work. Put all that together, and...I was often running out of RAM despite having 16GB, which is pretty ridiculous. But even aside from that, you could tell the CPU was just struggling with everything. Just being in a video chat was hard work for it (if I switched apps too much while in a meeting, my audio would start chopping up for others on the call). Running more than two VMs tended to hang the system irretrievably. Just normal use often caused the fan to spin up pretty high. And the battery life wasn't great; it got better with kernel updates over time, but was still probably only 3-4 hours.
So I figured I'd throw some hardware at the problem. I've been following all the chipset releases over the last couple of years, and decided I wanted to get something with AMD's latest silicon, codenamed "Strix Point", the Ryzen AI 3xx chips. They're not massively higher-performing than the previous gen, but the battery life seems to be improved, and they have somewhat better GPUs. That pretty much brought it down to the Asus Vivobook S 14, HP Omnibook Ultra 14, and Lenovo T14S gen 6 AMD. The Asus is stuck with 24GB of RAM max and I'm not a huge Asus fan in general, and the HP came in like $600 cheaper than the Thinkpad with equivalent specs, and had a 3 year warranty included. So I went with the HP, with 1TB of storage and 32GB of RAM.
I really like the system as a whole. It's heavier than the XPS 13, obviously, the bezels are a little bigger, and the screen is glossier. But the screen is pretty nice, I like the keyboard, and the overall build quality feels pretty solid. The trackpad seems fine.
As for running Fedora (and Linux in general) on it...well, it's almost great. Everything more or less works out of the box, except the fingerprint reader. I don't care about that, because I set up the reader on the XPS 13 and kinda hated it; it's nowhere near as nice as a fingerprint reader on a phone. Even if it worked on the HP, I'd leave it off. The performance is fantastic (except that Google office sites perform weirdly terribly in Firefox; I haven't tried Chromium yet).
But...after using it for a while, the issues become apparent. The first one I hit is that the system seems to hang pretty reproducibly playing video in browsers. This seems to be affecting pretty much everyone with a Strix Point system, and the only 'fix' is to turn off hardware video acceleration in the browser, which isn't really great (it means playing video will use the CPU, hurting battery life and performance). Then I found that even with that workaround applied, the system would hang occasionally. Looking at the list of Strix Point issues on the AMD issue tracker, I found a few that recommended kernel parameters to disable various features of the GPU to work around this; I'm running with amdgpu.dcdebugmask=0x800, which disables idle power states for the GPU and probably hurts battery life pretty badly. Haven't had a hang with that yet, but we'll see.

But aside from that, I'm also having issues with docks. I have a Caldigit TS3+, which was probably overpowered for what I really need, but worked great with the XPS 13. I have a keyboard, camera, headset, ethernet and monitor connected to it. With the HP, I find that at encryption passphrase entry during boot (so, in the initramfs) the keyboard works fine, but once I reach the OS proper, only the monitor works. Nothing else attached to the dock works at all. A couple of times, suspending the system and resuming it seemed to make it start working - but then I tried that a couple more times and it didn't work. I have another Caldigit dock in the basement; tried that, same deal. Then I tried my cheap no-name travel hub (which just has power pass-through, an HDMI port, and one USB-A port on it) with a USB-A hub attached, and...at first it worked fine! But then I suspended and resumed, and the camera and headset stopped working. Keyboard still worked. Sigh. I've ordered a mid-range hub with HDMI, ethernet, a card reader and four USB-A ports on it off Amazon, so I won't need the USB-A hub daisychain any more...I'm hoping that'll work well enough. If not, it's a bit awkward.
So, so far it's a bit of a frustrating experience. It could clearly be a fantastic Linux laptop, but it isn't quite one yet. I'd probably recommend holding off for a bit while the upstream devs (hopefully) shake out all the bugs...
AdamW's Debugging Adventures: Inadvertent Extreme Optimization
It's time for that rarest of events: a blog post! And it's another debugging adventure. Settle in, folks!
Recently I got interested in improving the time it takes to do a full compose of Fedora. This is when we start from just the packages and a few other inputs (basically image recipes and package groups), and produce a set of repositories and boot trees and images and all that stuff. For a long time this took somewhere between 5 and 10 hours. Recently we've managed to get it down to 3-4, then I figured out a change which has got it under 3 hours.
After that I re-analyzed the process and figured out that the biggest remaining point to attack is something called the 'pkgset' phase, which happens near the start of the process, not in parallel with anything else, and takes 35 minutes or so. So I started digging into that to see if it can be improved.
I fairly quickly found that it spends about 20 minutes in one relatively small codepath. By that point it has created one giant package set (just a concept in memory at this stage; it gets turned into an actual repo later) with every package in the compose in it. During those 20 minutes, it creates subsets of that package set per architecture, with only the packages relevant to that architecture in each (so packages for that arch, plus noarch packages, plus source packages, plus packages for 'compatible' arches, like i686 for x86_64).
I poked about at that code a bit and decided I could maybe make it a bit more efficient. The current version works by creating each arch subset one at a time by looping over the big global set. Because every arch includes noarch and src packages, it winds up looping over the noarch and src lists once per arch, which seemed inefficient. So I went ahead and rewrote it to create them all at once, to try and reduce the repeated looping.
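To make the shape of the change concrete, here's a deliberately simplified sketch of the two approaches. The names, the compat_arches helper and the data layout are all made up for illustration; this is not pungi's actual code:

    from collections import defaultdict

    ARCHES = ["x86_64", "aarch64", "ppc64le", "s390x"]

    def compat_arches(arch):
        # Hypothetical helper: extra arches whose packages are also
        # relevant for 'arch' (e.g. i686 packages on x86_64).
        return {"x86_64": ["i686"]}.get(arch, [])

    def subsets_per_arch(packages):
        # Old shape: one full pass over the global set per arch, so the
        # noarch and src packages get examined once for every arch.
        subsets = {}
        for arch in ARCHES:
            wanted = {arch, "noarch", "src", *compat_arches(arch)}
            subsets[arch] = [pkg for pkg in packages if pkg.arch in wanted]
        return subsets

    def subsets_single_pass(packages):
        # New shape: precompute which subsets each package arch lands in,
        # then make just one pass over the global set.
        targets = defaultdict(list)
        for arch in ARCHES:
            for pkgarch in (arch, "noarch", "src", *compat_arches(arch)):
                targets[pkgarch].append(arch)
        subsets = {arch: [] for arch in ARCHES}
        for pkg in packages:
            for arch in targets.get(pkg.arch, ()):
                subsets[arch].append(pkg)
        return subsets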
Today I was testing that out, which unfortunately has to be done more or less 'in production', so if you like you can watch me here, where you'll see composes appearing and disappearing every fifteen minutes or so. At first of course my change didn't work at all because I'd made the usual few dumb mistakes with wrong variable names and stuff. After fixing all that up, I timed it, and it turned out about 7 minutes faster. Not earth shattering, but hey.
So I started checking it was accurate (i.e. created the same package sets as the old code). It turned out it wasn't quite (a subtle bug with noarch package exclusions). While fixing that, I ran across some lines in the code that had bugged me since the first time I started looking at it:
    if i.file_path in self.file_cache:
        # TODO: test if it really works
        continue

These were extra suspicious to me because, not much later, they're followed by this:
    self.file_cache.file_cache[i.file_path] = i

that is, we check if the thing is in self.file_cache and move on if it is, but if it's not, we add it to self.file_cache.file_cache? That didn't look right at all. But up till now I'd left it alone, because hey, it had been this way for years, right? Must be OK. Well, this afternoon, in passing, I thought "eh, let's try changing it".
Then things got weird.
I was having trouble getting the compose process to actually run exactly as it does in production, but once I did, I was getting what seemed like really bizarre results. The original code was taking 22 minutes in my tests. My earlier test of my branch had taken about 14 minutes. Now it was taking three seconds.
I thought, this can't possibly be right! So I spent a few hours running and re-running the tests, adding debug lines, trying to figure out how (surely) I had completely broken it and it was just bypassing the whole block, or something.
Then I thought...what if I go back to the original code, but change the cache thing?
So I went back to unmodified pungi code, commented out those three lines, ran a compose...and it took three seconds. Tried again with the check corrected to self.file_cache.file_cache instead of self.file_cache...three seconds.
I repeated this enough times that it must be true, but it still bugged me. So I just spent a while digging into it, and I think I know why. These file caches are kobo.pkgset.FileCache instances; see the source code here. So, what's the difference between foo in self.file_cache and foo in self.file_cache.file_cache? Well, a FileCache instance's own file_cache is a dict. FileCache instances also implement __iter__, returning iter(self.file_cache). I think this is why foo in self.file_cache works at all - it actually does do the right thing. But the key is, I think, that it does it inefficiently.
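Stripped down to just the parts that matter here, the class looks something like this (a paraphrase of the shape described above, not kobo's actual code):

    class FileCache(object):
        def __init__(self):
            # Maps a file path to the parsed package object for it.
            self.file_cache = {}

        def __iter__(self):
            # Iterating a FileCache iterates the inner dict's keys -
            # this is what makes 'foo in cache' work at all.
            return iter(self.file_cache)

        # Note what's missing: no __contains__.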
Python's preferred way to do foo in bar is to call bar.__contains__(foo). If that doesn't work, it falls back on iterating over bar until it either hits foo or runs out of iterations. If bar doesn't support iteration it just raises an exception.
Python dictionaries have a very efficient implementation of __contains__. So when we do foo in self.file_cache.file_cache, we hit that efficient algorithm. But FileCache does not implement __contains__, so when we do foo in self.file_cache, we fall back to iteration and wind up using that iterator over the dictionary's keys. This works, but is massively less efficient than the dictionary's __contains__ method would be. And because these package sets are absolutely fracking huge, that makes a very significant difference in the end (because we hit the cache check a huge number of times, and every time it has to iterate over a huge number of dict keys).
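You can reproduce the gap in isolation with a toy version of the class. The exact numbers are machine-dependent, but the orders-of-magnitude difference isn't:

    import timeit

    class FileCache:
        # Toy stand-in with the membership behaviour described above.
        def __init__(self):
            self.file_cache = {}

        def __iter__(self):
            return iter(self.file_cache)

    cache = FileCache()
    for num in range(200_000):
        cache.file_cache[f"/path/to/pkg-{num}.rpm"] = object()

    # Worst case for the fallback: a miss has to iterate every key.
    miss = "/path/to/nonexistent.rpm"

    # No __contains__ on FileCache, so this falls back to __iter__: O(n).
    print(timeit.timeit(lambda: miss in cache, number=100))
    # dict.__contains__ is a hash lookup: O(1).
    print(timeit.timeit(lambda: miss in cache.file_cache, number=100))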
So...here's the pull request.
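Conceptually the fix is tiny: either point the check at the inner dict at the call site (if i.file_path in self.file_cache.file_cache:, which is what I tested above), or teach the class itself the fast path, along these lines (illustrative; I'm not claiming this is exactly what the PR does):

    class FileCache:
        def __init__(self):
            self.file_cache = {}

        def __iter__(self):
            return iter(self.file_cache)

        def __contains__(self, name):
            # Delegate to the dict's hash-based lookup instead of
            # letting 'in' fall back to iteration via __iter__.
            return name in self.file_cache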
Turns out I could have saved the day and a half it took me to get my rewrite correct. And if anyone had ever got around to TODOing the TODO, we could've saved about 20 minutes out of every Fedora compose for the last nine years...