The most cited Annals of Statistics articles

John Haman

2021/01/19

Categories: R Statistics Tags: R Statistics

Or, how to bot Project Euclid with R.

Intro

Whenever I look at the front page of Project Euclid, they try to be helpful and show me the top articles of the week, but it’s not exactly what I want to see. I want to know the all-time top articles from AOS, and I don’t just want a list of 5 articles, I want all the data.

True, there are some papers and blog posts out there that do some tabulation of the top articles in the whole discipline of statistics, but I want a break-down for a specific journal, and I want to use the most up-to-date information I can find. I also want to be able to see the top articles by keyword in each journal [1], and by decade, so we’re just going to have to roll up our sleeves and do this the hard way :)

Why would someone want this information? It’s just intellectual curiosity. Sometimes I flip through a journal, and I start reading, but it’s kinda aimless if the journal is too technical. I can read American Statistician cover to cover because they mix it up with lots of different themes. Not so much with AOS. I need some help, because I’m only going to read a few articles a year. So this project is effectively about making a recommendation engine for an academic journal. [2]

library(rvest) # Web scraping
library(rcrossref) # Get citations
library(data.table) # Data tables and disk I/O
library(magrittr) # this is not a pipe

Goal: We need to collect a list of all published articles in AOS, and get the number of citations for each of the articles. The plan is to use a couple web scrapers to get the job done.

Getting all the articles in AOS

First, crawl around the main AOS landing page. The AOS landing page can be swapped for the landing page of any journal hosted on Project Euclid. Later, I’m going to want to adapt this code to work on some other journals, such as the Annals of Applied Statistics (see the example after the next code block).

Let me be the first to admit that I’m not a web-scraping guru, so my scraping strategy is likely suboptimal. But ya gotta make some mistakes before you know how to avoid them. Let me know how I can make things more stable and faster for my next go at it!

## This is the landing page for AOS
url <- "https://projecteuclid.org/all/euclid.aos"
main <- read_html(url)

## Suck up all links from the main page
aos_links <- html_nodes(main, "a") %>% html_attr("href") # links

## regex to match issues from the main page.
link_to_issue <- grepl("euclid.aos/[[:digit:]]+", aos_links)

## Toss out the links that do not appear to correspond to any issues.
aos_links <- aos_links[link_to_issue]
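
If I want to point this at another journal later, only the landing-page URL above should need to change. For example, for the Annals of Applied Statistics (assuming euclid.aoas is its Project Euclid identifier, which I haven’t verified):

## Hypothetical: same crawl, different journal landing page
url <- "https://projecteuclid.org/all/euclid.aoas"
main <- read_html(url)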

Now we have links to all landing pages of each issue the journal has released. Each issue page contains links to the articles in that issue, and each article page contains the precious DOIs. We will use the DOIs to search for the number of citations later. That’s the goal.

Here is a function to loop through all the articles in an issue. The main idea is that the function takes in a prepared, but empty, data.table, goes through all the articles in issue_links, and spits out a completed data.table for that issue. Each row of the output corresponds to a single article, and for each article we record the DOI, year, authors, and article name. Nice!

issue_data_table <- function(DT, issue_links) {

  ## This loop goes through all the articles in an issue, and gets all the article
  ## information
  for (i in seq_along(issue_links)) {

    ## Go to each article page.
    ## Each article page is an element in the vector of issue_links.
    article_url <- paste0("https://projecteuclid.org", issue_links[i], "#info")

    message("Downloading an article...")

    ## FIXME: Wrap this in tryCatch() (see the retry sketch after this function).
    article_html <- read_html(article_url)

    ## Get the DOI
    ## Finding the correct nodes takes some fiddling around with SelectorGadget
    DOI_unclean <- article_html %>%
      html_node("p:nth-child(5)") %>%
      html_text()
    ## Some of the article information needs to be cleaned up.
    ## I prefer base-R regex functions when the job is not too bad :)
    DOI_clean <- sub("^Digital Object Identifier", "", DOI_unclean)

    ## Get the year of each article
    year_unclean <- article_html %>%
      html_node("#info p:nth-child(2)") %>%
      html_text()
    year_clean <- sub("^Source", "", year_unclean)

    ## Get article name
    art_name <- article_html %>%
      html_node("h3") %>%
      html_text()

    ## Get authors
    authors <- article_html %>%
      html_node(".small") %>%
      html_text()

    ## FIXME: Additionally get the keywords for each article. I forgot to do
    ## that after starting the AOS crawler.

    ## Assign all the article information to one row of a pre-made data.table
    ## `set` is a loopable version of `:=`.
    set(DT, i, "DOI", DOI_clean)
    set(DT, i, "Year", year_clean)
    set(DT, i, "Article_name", art_name)
    set(DT, i, "Authors", authors)

    ## Pause for 10 secs to not overwhelm the server.
    Sys.sleep(10)
  }
  DT
}
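
The tryCatch() FIXME above could be handled with a small retry wrapper around read_html(). Here is a minimal sketch (read_html_retry is a hypothetical helper, not something I ran as part of the actual crawl); swapping it in for the bare read_html() calls would let a single flaky page fail without losing the whole issue:

read_html_retry <- function(url, tries = 3, wait = 10) {
  ## Try the page a few times before giving up.
  for (attempt in seq_len(tries)) {
    page <- tryCatch(read_html(url), error = function(e) e)
    if (!inherits(page, "error")) return(page)
    message("Attempt ", attempt, " failed for ", url, ", waiting ", wait, " seconds...")
    Sys.sleep(wait)
  }
  stop("Could not download ", url, " after ", tries, " attempts")
}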

Apply the issue -> data.table function to each issue of the AOS catalog, while taking care to fail gracefully.

get_aos <- function(aos_links) {
  for (i in seq_along(aos_links)) {
    ## Crawl on the main page for each issue.
    issue_url <- paste0("https://projecteuclid.org", aos_links[i])

    ## Make a different data.table for each issue and name them programmatically.
    DTname <- paste0("issueDT_",
                     regmatches(issue_url, regexpr("[[:digit:]]+", issue_url)))

    ## Filename for each data.table
    fname <- paste0("./assets/", DTname, ".csv")

    ## If we already have that data, next
    if (file.exists(fname)) {
      message(paste("Checked issue", issue_url, "   ", i / length(aos_links) * 100, "% Done."))
      next
    }

    ## Fail silently, if it must be ...
    issue_html <- tryCatch(read_html(issue_url), error = function(e) e)

    ## Skip this issue if there is something wrong, we can try again later.
    ## https://stackoverflow.com/a/8094059/7281549
    if (inherits(issue_html, "error")) {
      message(paste("Problem encountered with issue", aos_links[i], "- trying the next issue..."))
      next
    }

    ## Search the issue page for links to each paper
    issue_links <- html_nodes(issue_html, "a") %>% html_attr("href")
    link_to_papers <- grepl("^/euclid.aos/[[:digit:]]+", issue_links)
    issue_links <- issue_links[link_to_papers] %>% unique()

    ## Make empty data.table for the issue information with the right dimensions
    assign(DTname,
           data.table(DOI = character(length = length(issue_links)),
                      Year = character(length = length(issue_links)),
                      Article_name = character(length = length(issue_links)),
                      Authors = character(length = length(issue_links))))

    ## Use our previous function to get the issue information into the data.table
    assign(DTname, issue_data_table(get(DTname), issue_links))

    ## Save the data.table
    fwrite(get(DTname), fname)

    ## Update the R console with our progress.
    message(paste("Downloaded issue", issue_url, "   ", i / length(aos_links) * 100, "% Done."))

    ## Pause again, do not burden ProjectEuclid.
    Sys.sleep(10)
  }
}

Let it rip:

get_aos(aos_links)

Looping through all the articles in all the issues takes about a day, because we cannot go too quickly without getting IP-banned from the website. I found this out the hard way …

The outer loop output looks something like this, so we can keep track of progress and see any failures when they inevitably arise. Normally I would have used txtProgressBar(), but I do want to print out which issue has a failure on each iteration, so the progress bar is a bit superfluous.

Checked issue https://projecteuclid.org/euclid.aos/1176342455     98.8679245283019 % Done.
Checked issue https://projecteuclid.org/euclid.aos/1176342405     99.2452830188679 % Done.
Checked issue https://projecteuclid.org/euclid.aos/1176342358     99.622641509434 % Done.
Checked issue https://projecteuclid.org/euclid.aos/1193342377     100 % Done.
...

When data retrieval fails, I think I’ve got things rigged correctly so that the failure happens before the data.table is written to disk, so re-running the code will skip all “correct” data.tables and just fill in the missing ones.

Fast-forward 12 hours of web crawling …

Now we have a bunch of data.tables, one for each issue, with all the information required to run a citation search against each entry. All the data.tables were saved to individual files, so that we don’t have to run the function again :)

Minor data cleaning

We’ve got about 260 data.tables to merge and clean. Let’s do that. Assuming this is a new R session, we read the files from disk, clean the data, and save it to a new directory. Later on, the tables with citation information will be kept separate from the tables without it.

clean_all_issues <- function() {

  ## Make sure the output directory for the cleaned tables exists.
  if (!dir.exists("./assets/cleaned_issues")) dir.create("./assets/cleaned_issues")

  ## get a list of all issues.
  issues <- list.files("./assets/", pattern = ".csv$")

  for (i in seq_along(issues)) {

    issue <- fread(paste0("./assets/", issues[i]))[grepl("^doi:", DOI)][
      !grepl("^Discussion", Article_name)][
        !grepl("^Rejoinder", Article_name)][
          !grepl("^Correction", Article_name)][
            !grepl("^Volume Information", Article_name)]

    fwrite(issue, paste0("./assets/cleaned_issues/", issues[i]))
  }
}
clean_all_issues()

Next, we need a couple of functions that take the cleaned data.tables from assets, append the citation counts, and write the results to a new folder.

Getting the citation data

Google Scholar (utter failure)

It seems that Google Scholar might be the best way to do this, but we’ll have to be careful about obtaining the information since Google no doubt has the best anti-botting software.

My first idea was that Google Scholar would be perfect for getting the citation counts. I wrote a nice function, which takes one of my cleaned Project Euclid data.tables, and goes through each issue and gets the citations from Google Scholar.

While this approach was promising at first, I ultimately had to abandon the idea that Google was going to be any help at all. The problem is that Google simply throws up too many CAPTCHAs: one appears after every 20 or so searches, regardless of my connection. [3]

I tried to rig things up so that my IP address would change after downloading an issue, but that proved to be more trouble than it’s worth (and probably against Google’s terms of service…)

Time to start looking elsewhere.

Trying again with CrossRef API

CrossRef.org seems to be one of the few alternatives to Google Scholar, and better yet, they offer an API, and a well-maintained R package is available. Let’s take the easy way out this time, and not use rvest.
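
Before looping over thousands of DOIs, it is worth sanity-checking a single lookup. Something like this should do (the DOI is the Schwarz 1978 BIC paper from the table below, and the count it returns will of course drift over time):

## One-off check of the CrossRef citation lookup via rcrossref
cr_citation_count("10.1214/aos/1176344136")$count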

This time we need a couple of functions to collect all the citation information:

  1. append_citations. This function takes in a cleaned data.table and loops through all the DOIs. For each DOI, we check whether there is already some citation information. If none is found, we retrieve it from the CrossRef API. At the end, the function hands back the data.table with a new column for citation counts.
append_citations <- function(DT) {
  ## Make some space to store the citations (if not there)
  ## We use -1 as the "not fetched yet" sentinel; we can't use 0,
  ## because an article may actually have 0 cites.
  ## Failed lookups get recorded as NA so that re-runs skip them.
  if (!("citations" %in% colnames(DT)))
    DT[, citations := rep(-1, .N)]

  ## Now loop through all articles in an issue.
  for (i in seq_len(nrow(DT))) {

    ## skip some rows that evaded my crappy data cleaning skills.
    if (!grepl("^doi:", DT[i, DOI])) next

    read_cites <- DT[i, citations]

    ## The count (or a recorded failure) is already present, no need to try again.
    if (is.na(read_cites) || read_cites > -1) {
      message(paste("Article", DT[i, DOI], "is already here!"))
      next
    }

    doi <- sub("^doi:", "", DT[i, DOI])

    ## Get the number of citations for the article.
    ## tryCatch so one failed lookup does not kill the whole loop.
    article_citations <- tryCatch(cr_citation_count(doi)$count,
                                  error = function(e) NA_real_)

    ## save the data (failures are stored as NA so they are not retried)
    if (is.numeric(article_citations) && !is.na(article_citations)) {
      set(DT, as.integer(i), "citations", article_citations)
      message(paste("Just got article", DT[i, DOI], "... The count was", article_citations))
    } else {
      set(DT, as.integer(i), "citations", NA_real_)
      message(paste("Error with article", DT[i, DOI]))
    }

    ## wait 5 seconds. Do not poke the bear.
    Sys.sleep(5)
  }
  message("Finished an issue!")
  DT
}
  2. get_citations carefully applies the append_citations function above to all my Annals of Statistics issues.
get_citations <- function(issues_list) {

  ## Loop over all the issues.
  for (i in seq_along(issues_list)) {
    issue <- issues_list[i]

    ## If the data.table has citation information, use that, otherwise get a clean data.table
    if (issue %in% list.files("./assets/cite_tables/")) {
      issue_DT <- fread(paste0("./assets/cite_tables/", issue))
    } else {
      issue_DT <- fread(paste0("./assets/cleaned_issues/", issue))
    }

    message("Now working on ", issue)

    ## Apply the citations retrieval function to the table
    issue_with_cites <- append_citations(issue_DT)

    ## save or overwrite.
    fwrite(issue_with_cites, paste0("./assets/cite_tables/", issue))
    message(i / length(issues_list) * 100, "% Done.")
  }
}

Apply the citation function to all DOIs:

First, make a directory to store the citation frames as we download them.

if (!dir.exists("./assets/cite_tables")) dir.create("./assets/cite_tables")

And apply the get_citations() function to suck in all the citation counts. The whole process takes about 12 hours, but that’s mostly because I set a 5-second delay between API requests. It’s free data, so I don’t want to bog them down with my project.

issues <- list.files("./assets/cleaned_issues/", pattern = ".csv$")
get_citations(issues)

The output is nicely formatted for monitoring:

...
Just got article doi:10.1214/12-AOS1025 ... The count was 7
Just got article doi:10.1214/12-AOS1026 ... The count was 18
Finished an issue!
80% Done.
Now working on issueDT_1351602526.csv
Just got article doi:10.1214/11-AOS949 ... The count was 130
Just got article doi:10.1214/12-AOS999 ... The count was 45
...

I’m happy to use the CrossRef API to collect the citation counts. Even if the data is not 100% reliable, the API and R package worked well. And I’m not one to endorse many R packages!

Tabulate Results

First, gotta combine all the issue tables into one table. There were some parsing warnings that I’m going to ignore for now…

issues <- list.files(paste0("./assets/cite_tables/"), pattern = ".csv$", full.names = TRUE)
list_of_issues <- lapply(issues, fread)
issuesDT <- rbindlist(list_of_issues)[order(-citations)]
## Column 1 ['doi:10.1214/aos/1018031261'] of item 18 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
issuesDT[, Article_name := gsub("\r?\n|\r", "", Article_name)]

summary(issuesDT$citations)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##     0.00     6.00    17.00    56.84    43.00 22674.00       48

The top 100 AOS articles:

knitr::kable(head(issuesDT, 100), row.names = TRUE)
Columns: rank, DOI, Year (the full source string), Article_name, Authors, citations
1 doi:10.1214/aos/1176344136 Ann. Statist., Volume 6, Number 2 (1978), 461-464. Estimating the Dimension of a Model Gideon Schwarz 22674
2 doi:10.1214/aos/1176344552 Ann. Statist., Volume 7, Number 1 (1979), 1-26. Bootstrap Methods: Another Look at the Jackknife B. Efron 8945
3 doi:10.1214/aos/1013203451 Ann. Statist., Volume 29, Number 5 (2001), 1189-1232. Greedy function approximation: A gradient boosting machine. Jerome H. Friedman 5550
4 doi:10.1214/009053604000000067 Ann. Statist., Volume 32, Number 2 (2004), 407-499. Least angle regression Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani 4363
5 doi:10.1214/aos/1176347963 Ann. Statist., Volume 19, Number 1 (1991), 1-67. Multivariate Adaptive Regression Splines Jerome H. Friedman 4082
6 doi:10.1214/aos/1016218223 Ann. Statist., Volume 28, Number 2 (2000), 337-407. Additive logistic regression: a statistical view of boosting (With discussion and a rejoinder by the authors) Jerome Friedman, Trevor Hastie, and Robert Tibshirani 3193
7 doi:10.1214/aos/1176350951 Ann. Statist., Volume 16, Number 3 (1988), 1141-1154. A Class of \(K\)-Sample Tests for Comparing the Cumulative Incidence of a Competing Risk Robert J. Gray 3060
8 doi:10.1214/aos/1013699998 Ann. Statist., Volume 29, Number 4 (2001), 1165-1188. The control of the false discovery rate in multiple testing under dependency Yoav Benjamini and Daniel Yekutieli 2626
9 doi:10.1214/aos/1176345976 Ann. Statist., Volume 10, Number 4 (1982), 1100-1120. Cox’s Regression Model for Counting Processes: A Large Sample Study P. K. Andersen and R. D. Gill 2202
10 doi:10.1214/aos/1176342360 Ann. Statist., Volume 1, Number 2 (1973), 209-230. A Bayesian Analysis of Some Nonparametric Problems Thomas S. Ferguson 2176
11 doi:10.1214/aos/1176325750 Ann. Statist., Volume 22, Number 4 (1994), 1701-1728. Markov Chains for Exploring Posterior Distributions Luke Tierney 1806
12 doi:10.1214/aos/1176343003 Ann. Statist., Volume 3, Number 1 (1975), 119-131. Statistical Inference Using Extreme Order Statistics James Pickands III 1781
13 doi:10.1214/aos/1176346060 Ann. Statist., Volume 11, Number 1 (1983), 95-103. On the Convergence Properties of the EM Algorithm C. F. Jeff Wu 1712
14 doi:10.1214/aos/1176343247 Ann. Statist., Volume 3, Number 5 (1975), 1163-1174. A Simple General Approach to Inference About the Tail of a Distribution Bruce M. Hill 1646
15 doi:10.1214/aos/1176345632 Ann. Statist., Volume 9, Number 6 (1981), 1135-1151. Estimation of the Mean of a Multivariate Normal Distribution Charles M. Stein 1249
16 doi:10.1214/009053606000000281 Ann. Statist., Volume 34, Number 3 (2006), 1436-1462. High-dimensional graphs and variable selection with the Lasso Nicolai Meinshausen and Peter Bühlmann 1237
17 doi:10.1214/aos/1176344064 Ann. Statist., Volume 6, Number 1 (1978), 34-58. Bayesian Inference for Causal Effects: The Role of Randomization Donald B. Rubin 1197
18 doi:10.1214/aos/1176343654 Ann. Statist., Volume 4, Number 6 (1976), 1236-1239. Agreeing to Disagree Robert J. Aumann 1191
19 doi:10.1214/aos/1176342503 Ann. Statist., Volume 1, Number 5 (1973), 799-821. Robust Regression: Asymptotics, Conjectures and Monte Carlo Peter J. Huber 1183
20 doi:10.1214/aos/1176347265 Ann. Statist., Volume 17, Number 3 (1989), 1217-1241. The Jackknife and the Bootstrap for General Stationary Observations Hans R. Kunsch 1182
21 doi:10.1214/aos/1074290335 Ann. Statist., Volume 31, Number 6 (2003), 2013-2035. The positive false discovery rate: a Bayesian interpretation and the q-value John D. Storey 1159
22 doi:10.1214/09-AOS729 Ann. Statist., Volume 38, Number 2 (2010), 894-942. Nearly unbiased variable selection under minimax concave penalty Cun-Hui Zhang 1149
23 doi:10.1214/aos/1024691352 Ann. Statist., Volume 26, Number 5 (1998), 1651-1686. Boosting the margin: a new explanation for the effectiveness of voting methods Peter Bartlett, Yoav Freund, Wee Sun Lee, and Robert E. Schapire 1082
24 doi:10.1214/aos/1176349519 Ann. Statist., Volume 13, Number 2 (1985), 435-475. Projection Pursuit Peter J. Huber 1047
25 doi:10.1214/aos/1176346150 Ann. Statist., Volume 11, Number 2 (1983), 416-431. A Universal Prior for Integers and Estimation by Minimum Description Length Jorma Rissanen 954
26 doi:10.1214/aos/1176346577 Ann. Statist., Volume 13, Number 1 (1985), 70-84. The Dip Test of Unimodality J. A. Hartigan and P. M. Hartigan 954
27 doi:10.1214/aos/1176350142 Ann. Statist., Volume 14, Number 4 (1986), 1261-1295. Jackknife, Bootstrap and Other Resampling Methods in Regression Analysis C. F. J. Wu 947
28 doi:10.1214/10-AOS799 Ann. Statist., Volume 38, Number 5 (2010), 2916-2957. Kernel density estimation via diffusion Z. I. Botev, J. F. Grotowski, and D. P. Kroese 940
29 doi:10.1214/aos/1176347494 Ann. Statist., Volume 18, Number 1 (1990), 90-120. Empirical Likelihood Ratio Confidence Regions Art Owen 925
30 doi:10.1214/aos/1176325370 Ann. Statist., Volume 22, Number 1 (1994), 300-325. Empirical Likelihood and General Estimating Equations Jin Qin and Jerry Lawless 921
31 doi:10.1214/aos/1176343886 Ann. Statist., Volume 5, Number 4 (1977), 595-620. Consistent Nonparametric Regression Charles J. Stone 921
32 doi:10.1214/aos/1176324317 Ann. Statist., Volume 23, Number 5 (1995), 1630-1661. Gaussian Semiparametric Estimation of Long Range Dependence P. M. Robinson 905
33 doi:10.1214/aos/1176342871 Ann. Statist., Volume 2, Number 6 (1974), 1152-1174. Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems Charles E. Antoniak 884
34 doi:10.1214/aos/1176345637 Ann. Statist., Volume 9, Number 6 (1981), 1196-1217. Some Asymptotic Theory for the Bootstrap Peter J. Bickel and David A. Freedman 852
35 doi:10.1214/aos/1176350366 Ann. Statist., Volume 15, Number 2 (1987), 642-656. High Breakdown-Point and High Efficiency Robust Estimates for Regression Victor J. Yohai 820
36 doi:10.1214/aos/1056562461 Ann. Statist., Volume 31, Number 3 (2003), 705-767. Slice sampling Radford M. Neal 776
37 doi:10.1214/009053607000000505 Ann. Statist., Volume 35, Number 6 (2007), 2769-2794. Measuring and testing dependence by correlation of distances Gábor J. Székely, Maria L. Rizzo, and Nail K. Bakirov 762
38 doi:10.1214/009053607000000677 Ann. Statist., Volume 36, Number 3 (2008), 1171-1220. Kernel methods in machine learning Thomas Hofmann, Bernhard Schölkopf, and Alexander J. Smola 760
39 doi:10.1214/aos/1176346079 Ann. Statist., Volume 11, Number 1 (1983), 286-295. Negative Association of Random Variables with Applications Kumar Joag-Dev and Frank Proschan 758
40 doi:10.1214/aos/1176344247 Ann. Statist., Volume 6, Number 4 (1978), 701-726. Nonparametric Inference for a Family of Counting Processes Odd Aalen 739
41 doi:10.1214/aos/1176346391 Ann. Statist., Volume 12, Number 1 (1984), 46-60. On Chi-Squared Tests for Multiway Contingency Tables with Cell Proportions Estimated from Survey Data J. N. K. Rao and A. J. Scott 718
42 doi:10.1214/aos/1176345969 Ann. Statist., Volume 10, Number 4 (1982), 1040-1053. Optimal Global Rates of Convergence for Nonparametric Regression Charles J. Stone 690
43 doi:10.1214/aos/1176349040 Ann. Statist., Volume 21, Number 1 (1993), 520-533. Consistency and Limiting Distribution of the Least Squares Estimator of a Threshold Autoregressive Model K. S. Chan 680
44 doi:10.1214/aos/1176345513 Ann. Statist., Volume 9, Number 4 (1981), 705-724. Logistic Regression Diagnostics Daryl Pregibon 677
45 doi:10.1214/aos/1176324636 Ann. Statist., Volume 23, Number 3 (1995), 1048-1072. Log-Periodogram Regression of Time Series with Long Range Dependence P. M. Robinson 676
46 doi:10.1214/aos/1009210544 Ann. Statist., Volume 29, Number 2 (2001), 295-327. On the distribution of the largest eigenvalue in principal components analysis Iain M. Johnstone 675
47 doi:10.1214/aos/1176345462 Ann. Statist., Volume 9, Number 3 (1981), 586-596. The Jackknife Estimate of Variance B. Efron and C. Stein 666
48 doi:10.1214/08-AOS620 Ann. Statist., Volume 37, Number 4 (2009), 1705-1732. Simultaneous analysis of Lasso and Dantzig selector Peter J. Bickel, Ya’acov Ritov, and Alexandre B. Tsybakov 628
49 doi:10.1214/aos/1176346785 Ann. Statist., Volume 12, Number 4 (1984), 1151-1172. Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician Donald B. Rubin 622
50 doi:10.1214/aos/1176349403 Ann. Statist., Volume 21, Number 4 (1993), 1926-1947. Comparing Nonparametric Versus Parametric Regression Fits W. Hardle and E. Mammen 622
51 doi:10.1214/aos/1028144844 Ann. Statist., Volume 26, Number 2 (1998), 451-471. Classification by pairwise coupling Trevor Hastie and Robert Tibshirani 606
52 doi:10.1214/aos/1176342372 Ann. Statist., Volume 1, Number 2 (1973), 353-355. Ferguson Distributions Via Polya Urn Schemes David Blackwell and James B. MacQueen 597
53 doi:10.1214/aos/1176350051 Ann. Statist., Volume 14, Number 3 (1986), 1080-1100. Stochastic Complexity and Modeling Jorma Rissanen 566
54 doi:10.1214/aos/1031689016 Ann. Statist., Volume 30, Number 4 (2002), 1031-1068. Vines–a new graphical model for dependent random variables Tim Bedford and Roger M. Cooke 562
55 doi:10.1214/aos/1176350164 Ann. Statist., Volume 14, Number 4 (1986), 1379-1387. Optimal Stopping Times for Detecting Changes in Distributions George V. Moustakides 561
56 doi:10.1214/aos/1176349548 Ann. Statist., Volume 13, Number 2 (1985), 689-705. Additive Regression and Other Nonparametric Models Charles J. Stone 556
57 doi:10.1214/aos/1032181158 Ann. Statist., Volume 24, Number 6 (1996), 2350-2383. Heuristics of instability and stabilization in model selection Leo Breiman 552
58 doi:10.1214/aos/1176325632 Ann. Statist., Volume 22, Number 3 (1994), 1346-1370. Multivariate Locally Weighted Least Squares Regression D. Ruppert and M. P. Wand 538
59 doi:10.1214/aos/1176343347 Ann. Statist., Volume 4, Number 1 (1976), 51-67. Robust \(M\)-Estimators of Multivariate Location and Scatter Ricardo Antonio Maronna 534
60 doi:10.1214/aos/1176349936 Ann. Statist., Volume 14, Number 2 (1986), 517-532. Large-Sample Properties of Parameter Estimates for Strongly Dependent Stationary Gaussian Time Series Robert Fox and Murad S. Taqqu 523
61 doi:10.1214/aos/1024691079 Ann. Statist., Volume 26, Number 3 (1998), 801-849. Arcing classifier (with discussion and a rejoinder by the author) Leo Breiman 519
62 doi:10.1214/aos/1176348666 Ann. Statist., Volume 20, Number 2 (1992), 971-1001. Asymptotics for Linear Processes Peter C. B. Phillips and Victor Solo 513
63 doi:10.1214/aos/1176347393 Ann. Statist., Volume 17, Number 4 (1989), 1749-1766. Efficient Parameter Estimation for Self-Similar Processes Rainer Dahlhaus 508
64 doi:10.1214/aos/1015957397 Ann. Statist., Volume 28, Number 5 (2000), 1356-1378. Asymptotics for lasso-type estimators Wenjiang Fu and Keith Knight 506
65 doi:10.1214/aos/1176347115 Ann. Statist., Volume 17, Number 2 (1989), 453-510. Linear Smoothers and Additive Models Andreas Buja, Trevor Hastie, and Robert Tibshirani 505
66 doi:10.1214/009053607000000758 Ann. Statist., Volume 36, Number 1 (2008), 199-227. Regularized estimation of large covariance matrices Peter J. Bickel and Elizaveta Levina 500
67 doi:10.1214/009053607000000802 Ann. Statist., Volume 36, Number 4 (2008), 1509-1533. One-step sparse estimates in nonconcave penalized likelihood models Hui Zou and Runze Li 498
68 doi:10.1214/aos/1176349022 Ann. Statist., Volume 21, Number 1 (1993), 196-216. Local Linear Regression Smoothers and Their Minimax Efficiencies Jianqing Fan 494
69 doi:10.1214/aos/1176345338 Ann. Statist., Volume 9, Number 1 (1981), 130-134. The Bayesian Bootstrap Donald B. Rubin 490
70 doi:10.1214/aos/1176345638 Ann. Statist., Volume 9, Number 6 (1981), 1218-1228. Bootstrapping Regression Models D. A. Freedman 471
71 doi:10.1214/aos/1176324456 Ann. Statist., Volume 23, Number 1 (1995), 73-102. Penalized Discriminant Analysis Trevor Hastie, Andreas Buja, and Robert Tibshirani 467
72 doi:10.1214/aos/1176348385 Ann. Statist., Volume 19, Number 4 (1991), 2032-2066. Why Least Squares and Maximum Entropy? An Axiomatic Approach to Inference for Linear Inverse Problems Imre Csiszar 460
73 doi:10.1214/aos/1176346056 Ann. Statist., Volume 11, Number 1 (1983), 59-67. Quasi-Likelihood Functions Peter McCullagh 456
74 doi:10.1214/aos/1176350933 Ann. Statist., Volume 16, Number 3 (1988), 927-953. Theoretical Comparison of Bootstrap Confidence Intervals Peter Hall 451
75 doi:10.1214/08-AOS600 Ann. Statist., Volume 36, Number 6 (2008), 2577-2604. Covariance regularization by thresholding Peter J. Bickel and Elizaveta Levina 435
76 doi:10.1214/009053607000000127 Ann. Statist., Volume 35, Number 5 (2007), 2173-2192. On the “degrees of freedom” of the lasso Hui Zou, Trevor Hastie, and Robert Tibshirani 426
77 doi:10.1214/009053604000000256 Ann. Statist., Volume 32, Number 3 (2004), 928-961. Nonconcave penalized likelihood with a diverging number of parameters Jianqing Fan and Heng Peng 423
78 doi:10.1214/aos/1176348653 Ann. Statist., Volume 20, Number 2 (1992), 712-736. Exact Mean Integrated Squared Error J. S. Marron and M. P. Wand 423
79 doi:10.1214/aos/1176342558 Ann. Statist., Volume 1, Number 6 (1973), 1071-1095. On Some Global Measures of the Deviations of Density Function Estimates P. J. Bickel and M. Rosenblatt 418
80 doi:10.1214/aos/1176342810 Ann. Statist., Volume 2, Number 5 (1974), 849-879. General Equivalence Theory for Optimum Designs (Approximate Theory) J. Kiefer 416
81 doi:10.1214/aos/1176345636 Ann. Statist., Volume 9, Number 6 (1981), 1187-1195. On the Asymptotic Accuracy of Efron’s Bootstrap Kesar Singh 414
82 doi:10.1214/aos/1176350057 Ann. Statist., Volume 14, Number 3 (1986), 1171-1179. The Use of Subseries Values for Estimating the Variance of a General Statistic from a Stationary Sequence Edward Carlstein 414
83 doi:10.1214/aos/1176342705 Ann. Statist., Volume 2, Number 3 (1974), 437-453. A Large Sample Study of the Life Table and Product Limit Estimates Under Random Censorship N. Breslow and J. Crowley 412
84 doi:10.1214/aos/1034276620 Ann. Statist., Volume 25, Number 1 (1997), 1-37. Fitting time series models to nonstationary processes R. Dahlhaus 401
85 doi:10.1214/aos/1176348248 Ann. Statist., Volume 19, Number 3 (1991), 1257-1272. On the Optimal Rates of Convergence for Nonparametric Deconvolution Problems Jianqing Fan 400
86 doi:10.1214/aos/1176348368 Ann. Statist., Volume 19, Number 4 (1991), 1725-1747. Empirical Likelihood for Linear Models Art Owen 399
87 doi:10.1214/aos/1176345206 Ann. Statist., Volume 8, Number 6 (1980), 1348-1360. Optimal Rates of Convergence for Nonparametric Estimators Charles J. Stone 398
88 doi:10.1214/aos/1024691081 Ann. Statist., Volume 26, Number 3 (1998), 879-921. Minimax estimation via wavelet shrinkage David L. Donoho and Iain M. Johnstone 396
89 doi:10.1214/aos/1176346788 Ann. Statist., Volume 12, Number 4 (1984), 1215-1230. Bandwidth Choice for Nonparametric Regression John Rice 394
90 doi:10.1214/aos/1176349025 Ann. Statist., Volume 21, Number 1 (1993), 255-285. Bootstrap and Wild Bootstrap for High Dimensional Linear Models Enno Mammen 393
91 doi:10.1214/aos/1176347397 Ann. Statist., Volume 17, Number 4 (1989), 1833-1855. A Moment Estimator for the Index of an Extreme-Value Distribution A. L. M. Dekkers, J. H. J. Einmahl, and L. De Haan 388
92 doi:10.1214/009053604000001048 Ann. Statist., Volume 33, Number 1 (2005), 1-53. Analysis of variance—why it is more important than ever Andrew Gelman 386
93 doi:10.1214/aos/1176325622 Ann. Statist., Volume 22, Number 3 (1994), 1142-1160. Posterior Predictive \(p\)-Values Xiao-Li Meng 386
94 doi:10.1214/aos/1176345697 Ann. Statist., Volume 10, Number 1 (1982), 154-166. Least Squares Estimates in Stochastic Regression Models with Applications to Identification and Control of Dynamic Systems Tze Leung Lai and Ching Zong Wei 386
95 doi:10.1214/aos/1016218226 Ann. Statist., Volume 28, Number 2 (2000), 461-482. General notions of statistical depth function Robert Serfling and Yijun Zuo 383
96 doi:10.1214/aos/1176343842 Ann. Statist., Volume 5, Number 3 (1977), 445-463. Minimum Hellinger Distance Estimates for Parametric Models Rudolf Beran 375
97 doi:10.1214/aos/1176342752 Ann. Statist., Volume 2, Number 4 (1974), 615-629. Prior Distributions on Spaces of Probability Measures Thomas S. Ferguson 374
98 doi:10.1214/aos/1176347507 Ann. Statist., Volume 18, Number 1 (1990), 405-414. On a Notion of Data Depth Based on Random Simplices Regina Y. Liu 374
99 doi:10.1214/009053604000000238 Ann. Statist., Volume 32, Number 3 (2004), 870-897. Optimal predictive model selection Maria Maddalena Barbieri and James O. Berger 373
100 doi:10.1214/aos/1176349020 Ann. Statist., Volume 21, Number 1 (1993), 157-178. Optimal Smoothing in Single-Index Models Wolfgang Hardle, Peter Hall, and Hidehiko Ichimura 366
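
The intro also promised a by-decade breakdown. Here is a sketch of how that could come out of the combined table, assuming the four-digit year can be pulled reliably out of the Year column (which holds the full source string); I have not validated this against every row:

## Extract the publication year from the source string, e.g. "(1978)"
issuesDT[, pub_year := as.integer(sub(".*\\(([0-9]{4})\\).*", "\\1", Year))]
issuesDT[, decade := 10 * (pub_year %/% 10)]

## Highest-cited article in each decade
issuesDT[order(-citations), .SD[1], by = decade][
  order(decade), .(decade, Article_name, Authors, citations)]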

Conclusion

This project took way too much time (several days just to reliably download all the required data), but I had some fun and learned a bit about web scraping. I’ll revisit this work when I get interested in other journals.

Time permitting, it would be good to compile the data from this project into an R package for others to use. I’d first want to validate the data and further clean out some annoying tab and newline characters, but it’s doable.
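
For instance, the tab and newline cleanup could look something like this (a sketch, not something I have run against the full table):

## Strip stray tabs, carriage returns, and newlines from the text columns
text_cols <- c("DOI", "Year", "Article_name", "Authors")
issuesDT[, (text_cols) := lapply(.SD, function(x) trimws(gsub("[\r\n\t]+", " ", x))),
         .SDcols = text_cols]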

Minimally, anyone can read this blog post and figure out how to apply my scraping functions to other Project Euclid journals.


  1. Unfortunately, summarizing by keyword only occurred to me the day after I sucked up the Project Euclid data, so that tabulation will have to wait for another journal.

  2. Citations are definitely not the best measure of popularity, but I’m not sure how to get better metrics…

  3. There is also a Google Scholar R package, but I have a feeling that it would be subject to similar or the same issues that I encountered. Better to use a service that actually exposes an API.