Sanitize ArXiv RSS feeds

2023/05/18

Categories: Emacs Lisp Tags: Emacs Gnus Lisp

If you like to dip your toes into the latest stats research on ArXiv, one of the best ways to track what’s going on is to follow the RSS feed. Naturally, I do this in Emacs with Gnus, which offers some distinct benefits over web-based readers.

First, one of the annoying things about the ArXiv RSS feed is that the author names are wrapped in HTML hyperlinks. In Gnus, the From header of an article can end up looking like this:

Not very good.

But we can clean it up with some Emacs Lisp that sanitizes the header.

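(defun jth-article-remove-html-tag-from ()
  "Remove <a ...> and </a> tags from the From header of the current article."
  (interactive nil gnus-article-mode gnus-summary-mode)
  (gnus-with-article-headers
    ;; Do nothing when the article has no From header.
    (when (gnus-article-goto-header "from")
      (save-restriction
        (mail-header-narrow-to-field)
        ;; Drop the closing tags.
        (while (re-search-forward "</a>" nil t)
          (replace-match "" t t))
        ;; Drop the opening tags.  [^>]* stops at the first ">" and also
        ;; matches a newline, so a tag folded across lines is removed
        ;; without swallowing the author name between two tags.
        (goto-char (point-min))
        (while (re-search-forward "<a[ \t\n][^>]*>" nil t)
          (replace-match "" t t))
        ;; Join the folded continuation lines that remain.
        (goto-char (point-min))
        (while (re-search-forward "\n " nil t)
          (replace-match "" t t))))))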

(add-hook 'gnus-article-prepare-hook #'jth-article-remove-html-tag-from)
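
If you want to poke at the regular expressions without opening Gnus, here is a minimal sketch of the same cleanup applied to a plain string. The helper name `jth-strip-anchor-tags` and the sample input are made up for illustration, and the opening-tag pattern is `<a[ \t\n][^>]*>`, which stops at the first `>` so it cannot eat the text between two tags on the same line.

```elisp
;; Hypothetical helper for experimenting outside Gnus; not part of Gnus itself.
(defun jth-strip-anchor-tags (string)
  "Return STRING with <a ...> and </a> tags removed and folded lines joined."
  (let* ((no-close (replace-regexp-in-string "</a>" "" string t t))
         ;; [^>]* also matches a newline, so a tag split across lines goes too.
         (no-open (replace-regexp-in-string "<a[ \t\n][^>]*>" "" no-close t t)))
    ;; Join whatever folded continuation lines are left.
    (replace-regexp-in-string "\n " "" no-open t t)))

;; Invented sample input with two linked authors, the second tag folded.
(jth-strip-anchor-tags
 "<a href=\"https://example.org/a1\">Author One</a>, <a\n href=\"https://example.org/a2\">Author Two</a>")
```

Evaluating the last form in `*scratch*` or with `M-:` is a quick way to check a pattern before putting it in the hook.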

Now the headers are much improved (the example shows a different article’s header):

The second nice thing about Gnus for reading ArXiv is that I can sort new articles by keyword. This is called scoring, and it’s a shame that other readers don’t have a system as good as Gnus’s. For example, I use the following score rules on the Subject line to prioritize articles about experimental design. New articles with these words in the subject appear above the others. I usually read these abstracts, and maybe a few others, then delete the rest.

(("subject"
  ("design of experiments" nil nil s)
  ("experimental design" nil nil s)
  ("design" nil nil s)))
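
For reference, each score entry has the form (MATCH SCORE DATE TYPE). A `nil` score falls back to `gnus-score-interactive-default-score` (1000 unless you have changed it), a `nil` date makes the rule permanent, and `s` means substring match. Scores from every matching entry add up, so a subject containing "experimental design" also picks up the score for "design". Here is a sketch of the same rules with explicit numbers; the values 1000 and 200 are only illustrative, not what I actually use.

```elisp
;; Each entry is (MATCH SCORE DATE TYPE).  The numbers below are illustrative.
;; Gnus may drop these comments if it rewrites the score file.
(("subject"
  ("design of experiments" 1000 nil s)
  ("experimental design" 1000 nil s)
  ;; Lower weight for the broad keyword, since it also matches the two above.
  ("design" 200 nil s)))
```

With numbers like these, an article about experimental design scores 1200, while one that only mentions "design" scores 200.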

I don’t read ArXiv very much, but this makes it a little bit better in Emacs, and maybe I’ll read it more later.