Or, how to bot Project Euclid with R.
Intro
Whenever I look at the front page of Project Euclid, they try to be helpful and show me the top articles of the week, but it’s not exactly what I want to see. I want to know the all-time top articles from AOS, and I don’t just want a list of 5 articles, I want all the data.
True, there are some papers and blog posts out there that do some tabulation of the top articles in the whole discipline of statistics, but I want a break-down for a specific journal, and I want to use the most up-to-date information I can find. I also want to be able to see the top articles by keyword in each journal1, and by decade, so we’re just going to have to roll up our sleeves and do this the hard way :)
Why would someone want this information? It’s just intellectual curiosity. Sometimes I flip through a journal, and I start reading, but it’s kinda aimless if the journal is too technical. I can read American Statistician cover to cover because they mix it up with lots of different themes. Not so much with AOS. I need some help, because I’m only going to read a few articles a year. So this project is effectively about making a recommendation engine for an academic journal.2
library(rvest) # Web scraping
library(rcrossref) # Get citations
library(data.table) # Data tables and disk I/O
library(magrittr) # this is not a pipe
Goal: We need to collect a list of all published articles in AOS, and get the number of citations for each of the articles. The plan is to use a couple web scrapers to get the job done.
Getting all the articles in AOS
First, crawl around the main AOS landing page. The AOS landing page can be changed to the landing page of any journal hosted on Project Euclid. Later, I’m going to want to adapt this code to work on some other journals, such as the Annals of Applied Statistics.
Let me be the first to admit that I’m not a web-scraping guru, so my scraping strategy is likely suboptimal. But ya gotta make some mistakes before you know how to avoid them. Let me know how I can make things more stable and faster for my next go at it!
## This is the landing page for AOS
url <- "https://projecteuclid.org/all/euclid.aos"
main <- read_html(url)
## Suck up all links from the main page
aos_links <- html_nodes(main, "a") %>% html_attr("href") # links
## regex to match issues from the main page.
link_to_issue <- grepl("euclid.aos/[[:digit:]]+", aos_links)
## Toss out the links that do not appear to correspond to any issues.
aos_links <- aos_links[link_to_issue]
Now we have links to all landing pages of each issue the journal has released. Each issue page contains links to the articles in that issue, and each article page contains the precious DOIs. We will use the DOIs to search for the number of citations later. That’s the goal.
Here is a function to loop through all the articles in an issue. The main idea is that the function takes in a prepared, but empty, data.table, goes through all the articles in issue_links, and spits out a completed data.table for that issue. Each row of the output data.table corresponds to a single article, and for each article we record the DOI, year, authors, and article name. Nice!
issue_data_table <- function(DT, issue_links) {
  ## This loop goes through all the articles in an issue, and gets all the
  ## article information.
  for (i in seq_along(issue_links)) {
    ## Go to each article page.
    ## Each article page is an element in the vector of issue_links.
    article_url <- paste0("https://projecteuclid.org", issue_links[i], "#info")
    message("Downloading an article...")
    ## FIXME: Wrap this in tryCatch()
    article_html <- read_html(article_url)
    ## Get the DOI.
    ## Finding the correct nodes takes some fiddling around with SelectorGadget.
    DOI_unclean <- article_html %>%
      html_node("p:nth-child(5)") %>%
      html_text()
    ## Some of the article information needs to be cleaned up.
    ## I prefer base-R regex functions when the job is not too bad :)
    DOI_clean <- sub("^Digital Object Identifier", "", DOI_unclean)
    ## Get the year of each article
    year_unclean <- article_html %>%
      html_node("#info p:nth-child(2)") %>%
      html_text()
    year_clean <- sub("^Source", "", year_unclean)
    ## Get the article name
    art_name <- article_html %>%
      html_node("h3") %>%
      html_text()
    ## Get the authors
    authors <- article_html %>%
      html_node(".small") %>%
      html_text()
    ## FIXME: Additionally get the keywords for each article. I forgot to do
    ## that after starting the AOS crawler.
    ## Assign all the article information to one row of a pre-made data.table.
    ## `set` is a loopable version of `:=`.
    set(DT, i, "DOI", DOI_clean)
    set(DT, i, "Year", year_clean)
    set(DT, i, "Article_name", art_name)
    set(DT, i, "Authors", authors)
    ## Pause for 10 secs to not overwhelm the server.
    Sys.sleep(10)
  }
  DT
}
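While I’m confessing FIXMEs: here is a minimal sketch of the tryCatch() wrapper I have in mind, with a retry thrown in. The helper name read_html_safely and its defaults are hypothetical; the crawler above does not actually use it.

read_html_safely <- function(url, tries = 3, wait = 10) {
  ## Hypothetical helper: retry read_html() a few times before giving up,
  ## handing back the error object instead of aborting the whole crawl.
  for (attempt in seq_len(tries)) {
    result <- tryCatch(read_html(url), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    message("Attempt ", attempt, " failed for ", url, "; retrying...")
    Sys.sleep(wait)
  }
  result # still an error object; callers can check with inherits()
}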
Apply the issue -> data.table function to each issue of the AOS catalog, while taking care to fail gracefully.
get_aos <- function(aos_links) {
  for (i in seq_along(aos_links)) {
    ## Crawl the main page for each issue.
    issue_url <- paste0("https://projecteuclid.org", aos_links[i])
    ## Make a different data.table for each issue and name them programmatically.
    DTname <- paste0("issueDT_",
                     regmatches(issue_url, regexpr("[[:digit:]]+", issue_url)))
    ## Filename for each data.table
    fname <- paste0("./assets/", DTname, ".csv")
    ## If we already have that data, skip to the next issue.
    if (file.exists(fname)) {
      message(paste("Checked issue", issue_url, " ", i / length(aos_links) * 100, "% Done."))
      next
    }
    ## Fail silently, if it must be ...
    issue_html <- tryCatch(read_html(issue_url), error = function(e) e)
    ## Skip this issue if there is something wrong; we can try again later.
    ## https://stackoverflow.com/a/8094059/7281549
    if (inherits(issue_html, "error")) {
      message(paste("Problem encountered with issue", aos_links[i], "- trying the next issue..."))
      next
    }
    ## Search the issue page for links to each paper
    issue_links <- html_nodes(issue_html, "a") %>% html_attr("href")
    link_to_papers <- grepl("^/euclid.aos/[[:digit:]]+", issue_links)
    issue_links <- issue_links[link_to_papers] %>% unique()
    ## Make an empty data.table for the issue information with the right dimensions
    assign(DTname,
           data.table(DOI = character(length = length(issue_links)),
                      Year = character(length = length(issue_links)),
                      Article_name = character(length = length(issue_links)),
                      Authors = character(length = length(issue_links))))
    ## Use our previous function to get the issue information into the data.table
    assign(DTname, issue_data_table(get(DTname), issue_links))
    ## Save the data.table
    fwrite(get(DTname), fname)
    ## Update the R console with our progress.
    message(paste("Downloaded issue", issue_url, " ", i / length(aos_links) * 100, "% Done."))
    ## Pause again, do not burden Project Euclid.
    Sys.sleep(10)
  }
}
Let it rip:
get_aos(aos_links)
Looping through all the articles in all the issues takes about a day, because we cannot go too quickly without getting IP-banned from the website. I found this out the hard way …
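If I redo this, I’d probably also jitter the delay so the requests don’t land on a perfectly regular, bot-like schedule. A tiny sketch (polite_sleep is my own invention, not used in the code above):

polite_sleep <- function(base = 10) {
  ## Sleep for base seconds plus up to 5 seconds of random jitter
  Sys.sleep(base + runif(1, min = 0, max = 5))
}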
The outer loop output looks something like this, so we can keep track of progress and see any failures when they inevitably arise. Normally I would have used txtProgressBar(), but I do want to print out which issue has a failure on each iteration, so the progress bar is a bit superfluous.
Checked issue https://projecteuclid.org/euclid.aos/1176342455 98.8679245283019 % Done.
Checked issue https://projecteuclid.org/euclid.aos/1176342405 99.2452830188679 % Done.
Checked issue https://projecteuclid.org/euclid.aos/1176342358 99.622641509434 % Done.
Checked issue https://projecteuclid.org/euclid.aos/1193342377 100 % Done.
...
When there is a failure with data retrieval, I think I’ve got things rigged correctly to ensure that it happens before the data.table is written to disk, so re-running the code will skip all “correct” data.tables and just fill in the missing ones.
Fast-forward 12 hours of web crawling …
Now we have a bunch of data.tables, one for each issue, with all the information required to run a citation search against each entry. All the data.tables were saved to individual files, so that we don’t have to run the function again :)
Minor data cleaning
We’ve got about 260 data.tables to merge and clean. Let’s do that. Assuming this is a new R session, we read the files from disk. I’d like to clean the data and save it to a new directory. Let’s keep the tables with citation information separate from the tables without.
clean_all_issues <- function() {
  ## Get a list of all the issue files.
  issues <- list.files("./assets/", pattern = ".csv$")
  ## Make sure the output directory exists before writing to it.
  if (!dir.exists("./assets/cleaned_issues")) dir.create("./assets/cleaned_issues")
  for (i in seq_along(issues)) {
    ## Keep only real articles: drop rows without a DOI, plus discussions,
    ## rejoinders, corrections, and volume information.
    issue <- fread(paste0("./assets/", issues[i]))[grepl("^doi:", DOI)][
      !grepl("^Discussion", Article_name)][
      !grepl("^Rejoinder", Article_name)][
      !grepl("^Correction", Article_name)][
      !grepl("^Volume Information", Article_name)]
    fwrite(issue, paste0("./assets/cleaned_issues/", issues[i]))
  }
}
clean_all_issues()
Next we need a function that takes each cleaned data.table from assets, appends the citation counts, and writes the result to the new folder.
Getting the citation data
Google Scholar (utter failure)
It seems that Google Scholar might be the best way to do this, but we’ll have to be careful about obtaining the information since Google no doubt has the best anti-botting software.
My first idea was that Google Scholar would be perfect for getting the citation counts. I wrote a nice function that takes one of my cleaned Project Euclid data.tables, goes through each article, and gets its citations from Google Scholar.
While this approach was promising at first, I ultimately had to abandon the idea that Google was going to be any help at all. The problem is that Google simply throws up too many CAPTCHAs: one after every 20 or so searches, regardless of my connection.3
I tried to rig things up so that my IP address would change after downloading an issue, but that proved to be more trouble than it’s worth (and probably against Google’s terms of service…)
Time to start looking elsewhere.
Trying again with CrossRef API
CrossRef.org seems to be one of the few alternatives to Google Scholar; better yet, they offer an API, and a well-maintained R package is available. Let’s take the easy way out this time and not use rvest.
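As a quick smoke test, here is the package call I use below, pointed at a single DOI (the Schwarz 1978 paper from the table further down). The count you get back will drift over time, of course.

## One-off citation count for a single DOI
cr_citation_count("10.1214/aos/1176344136")$count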
This time we need a couple of functions to collect all the citation information.

append_citations takes in a cleaned data.table and loops through all the DOIs. For each DOI, we check if there is already some citation information; if none is found, we retrieve it from the CrossRef API. At the end, the function hands the data.table back to us, with a new column for citation counts.
append_citations <- function(DT) {
  ## Make some space to store the citations (if the column is not there).
  ## We initialize the citations to -1; we can't use 0, because an article
  ## may actually have 0 cites.
  if (!("citations" %in% colnames(DT)))
    DT[, citations := rep(-1, .N)]
  ## Now loop through all articles in the issue.
  for (i in seq_len(nrow(DT))) {
    ## Skip some rows that evaded my crappy data cleaning skills.
    if (!grepl("^doi:", DT[i, DOI])) next
    read_cites <- DT[i, citations]
    ## The article is already here; no need to try again.
    if (read_cites > -1 | read_cites == "error") {
      message(paste("Article", DT[i, DOI], "is already here!"))
      next
    }
    doi <- sub("^doi:", "", DT[i, DOI])
    ## Get the number of citations for the article
    article_citations <- cr_citation_count(doi)$count
    ## Save the data
    if (is.numeric(article_citations)) {
      set(DT, as.integer(i), "citations", article_citations)
      message(paste("Just got article", DT[i, DOI], "... The count was", article_citations))
    } else {
      set(DT, as.integer(i), "citations", "error")
      message(paste("Error with article", DT[i, DOI]))
    }
    ## Wait 5 seconds. Do not poke the bear.
    Sys.sleep(5)
  }
  message("Finished an issue!")
  DT
}
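On a single issue, usage would look something like this; the file name is borrowed from the crawl log above.

## Append citation counts to one cleaned issue table
one_issue <- fread("./assets/cleaned_issues/issueDT_1176342455.csv")
one_issue <- append_citations(one_issue)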
get_citations just carefully applies that previous append_citations function to all my Annals of Statistics issues.
get_citations <- function(issues_list) {
  ## Loop over all the issues.
  for (i in seq_along(issues_list)) {
    issue <- issues_list[i]
    ## If the data.table already has citation information, use that;
    ## otherwise start from a clean data.table.
    if (issue %in% list.files("./assets/cite_tables/")) {
      issue_DT <- fread(paste0("./assets/cite_tables/", issue))
    } else {
      issue_DT <- fread(paste0("./assets/cleaned_issues/", issue))
    }
    message("Now working on ", issue)
    ## Apply the citation-retrieval function to the table
    issue_with_cites <- append_citations(issue_DT)
    ## Save or overwrite.
    fwrite(issue_with_cites, paste0("./assets/cite_tables/", issue))
    message(i / length(issues_list) * 100, "% Done.")
  }
}
Apply the citation function to all DOIs:
First, make a directory to store the citation frames as we download them.
if (!dir.exists("./assets/cite_tables")) dir.create("./assets/cite_tables")
And apply the get_citations() function to suck in all the citation counts. The whole process takes about 12 hours, but that’s mostly because I set a 5-second delay between API requests. It’s free data, so I don’t want to bog them down with my project.
issues <- list.files("./assets/cleaned_issues/", pattern = ".csv$")
get_citations(issues)
The output is nicely formatted for monitoring:
...
Just got article doi:10.1214/12-AOS1025 ... The count was 7
Just got article doi:10.1214/12-AOS1026 ... The count was 18
Finished an issue!
80% Done.
Now working on issueDT_1351602526.csv
Just got article doi:10.1214/11-AOS949 ... The count was 130
Just got article doi:10.1214/12-AOS999 ... The count was 45
...
I’m happy to use the CrossRef API to collect the citation counts. Even if the data is not 100% reliable, the API and the R package worked well. And I’m not one to endorse many R packages!
Tabulate Results
First, gotta combine all the issue tables into one table. There were some parsing warnings that I’m going to ignore for now…
issues <- list.files(paste0("./assets/cite_tables/"), pattern = ".csv$", full.names = TRUE)
list_of_issues <- lapply(issues, fread)
issuesDT <- rbindlist(list_of_issues)[order(-citations)]
## Column 1 ['doi:10.1214/aos/1018031261'] of item 18 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
issuesDT[, Article_name := gsub("\r?\n|\r", "", Article_name)]
summary(issuesDT$citations)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 6.00 17.00 56.84 43.00 22674.00 48
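For what it’s worth, if you’d rather make that rbindlist() message go away than ignore it, the function takes use.names and fill arguments. A sketch, assuming the stray column in item 18 is safe to pad with NAs:

## Bind the issue tables by column name, padding missing columns with NA
issuesDT <- rbindlist(list_of_issues, use.names = TRUE, fill = TRUE)[order(-citations)]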
The top 100 AOS articles:
knitr::kable(head(issuesDT, 100), row.names = TRUE)
 | DOI | Year | Article_name | Authors | citations |
---|---|---|---|---|---|
1 | doi:10.1214/aos/1176344136 | Ann. Statist., Volume 6, Number 2 (1978), 461-464. | Estimating the Dimension of a Model | Gideon Schwarz | 22674 |
2 | doi:10.1214/aos/1176344552 | Ann. Statist., Volume 7, Number 1 (1979), 1-26. | Bootstrap Methods: Another Look at the Jackknife | B. Efron | 8945 |
3 | doi:10.1214/aos/1013203451 | Ann. Statist., Volume 29, Number 5 (2001), 1189-1232. | Greedy function approximation: A gradient boosting machine. | Jerome H. Friedman | 5550 |
4 | doi:10.1214/009053604000000067 | Ann. Statist., Volume 32, Number 2 (2004), 407-499. | Least angle regression | Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani | 4363 |
5 | doi:10.1214/aos/1176347963 | Ann. Statist., Volume 19, Number 1 (1991), 1-67. | Multivariate Adaptive Regression Splines | Jerome H. Friedman | 4082 |
6 | doi:10.1214/aos/1016218223 | Ann. Statist., Volume 28, Number 2 (2000), 337-407. | Additive logistic regression: a statistical view of boosting (With discussion and a rejoinder by the authors) | Jerome Friedman, Trevor Hastie, and Robert Tibshirani | 3193 |
7 | doi:10.1214/aos/1176350951 | Ann. Statist., Volume 16, Number 3 (1988), 1141-1154. | A Class of \(K\)-Sample Tests for Comparing the Cumulative Incidence of a Competing Risk | Robert J. Gray | 3060 |
8 | doi:10.1214/aos/1013699998 | Ann. Statist., Volume 29, Number 4 (2001), 1165-1188. | The control of the false discovery rate in multiple testing under dependency | Yoav Benjamini and Daniel Yekutieli | 2626 |
9 | doi:10.1214/aos/1176345976 | Ann. Statist., Volume 10, Number 4 (1982), 1100-1120. | Cox’s Regression Model for Counting Processes: A Large Sample Study | P. K. Andersen and R. D. Gill | 2202 |
10 | doi:10.1214/aos/1176342360 | Ann. Statist., Volume 1, Number 2 (1973), 209-230. | A Bayesian Analysis of Some Nonparametric Problems | Thomas S. Ferguson | 2176 |
11 | doi:10.1214/aos/1176325750 | Ann. Statist., Volume 22, Number 4 (1994), 1701-1728. | Markov Chains for Exploring Posterior Distributions | Luke Tierney | 1806 |
12 | doi:10.1214/aos/1176343003 | Ann. Statist., Volume 3, Number 1 (1975), 119-131. | Statistical Inference Using Extreme Order Statistics | James Pickands III | 1781 |
13 | doi:10.1214/aos/1176346060 | Ann. Statist., Volume 11, Number 1 (1983), 95-103. | On the Convergence Properties of the EM Algorithm | C. F. Jeff Wu | 1712 |
14 | doi:10.1214/aos/1176343247 | Ann. Statist., Volume 3, Number 5 (1975), 1163-1174. | A Simple General Approach to Inference About the Tail of a Distribution | Bruce M. Hill | 1646 |
15 | doi:10.1214/aos/1176345632 | Ann. Statist., Volume 9, Number 6 (1981), 1135-1151. | Estimation of the Mean of a Multivariate Normal Distribution | Charles M. Stein | 1249 |
16 | doi:10.1214/009053606000000281 | Ann. Statist., Volume 34, Number 3 (2006), 1436-1462. | High-dimensional graphs and variable selection with the Lasso | Nicolai Meinshausen and Peter Bühlmann | 1237 |
17 | doi:10.1214/aos/1176344064 | Ann. Statist., Volume 6, Number 1 (1978), 34-58. | Bayesian Inference for Causal Effects: The Role of Randomization | Donald B. Rubin | 1197 |
18 | doi:10.1214/aos/1176343654 | Ann. Statist., Volume 4, Number 6 (1976), 1236-1239. | Agreeing to Disagree | Robert J. Aumann | 1191 |
19 | doi:10.1214/aos/1176342503 | Ann. Statist., Volume 1, Number 5 (1973), 799-821. | Robust Regression: Asymptotics, Conjectures and Monte Carlo | Peter J. Huber | 1183 |
20 | doi:10.1214/aos/1176347265 | Ann. Statist., Volume 17, Number 3 (1989), 1217-1241. | The Jackknife and the Bootstrap for General Stationary Observations | Hans R. Kunsch | 1182 |
21 | doi:10.1214/aos/1074290335 | Ann. Statist., Volume 31, Number 6 (2003), 2013-2035. | The positive false discovery rate: a Bayesian interpretation and the q-value | John D. Storey | 1159 |
22 | doi:10.1214/09-AOS729 | Ann. Statist., Volume 38, Number 2 (2010), 894-942. | Nearly unbiased variable selection under minimax concave penalty | Cun-Hui Zhang | 1149 |
23 | doi:10.1214/aos/1024691352 | Ann. Statist., Volume 26, Number 5 (1998), 1651-1686. | Boosting the margin: a new explanation for the effectiveness of voting methods | Peter Bartlett, Yoav Freund, Wee Sun Lee, and Robert E. Schapire | 1082 |
24 | doi:10.1214/aos/1176349519 | Ann. Statist., Volume 13, Number 2 (1985), 435-475. | Projection Pursuit | Peter J. Huber | 1047 |
25 | doi:10.1214/aos/1176346150 | Ann. Statist., Volume 11, Number 2 (1983), 416-431. | A Universal Prior for Integers and Estimation by Minimum Description Length | Jorma Rissanen | 954 |
26 | doi:10.1214/aos/1176346577 | Ann. Statist., Volume 13, Number 1 (1985), 70-84. | The Dip Test of Unimodality | J. A. Hartigan and P. M. Hartigan | 954 |
27 | doi:10.1214/aos/1176350142 | Ann. Statist., Volume 14, Number 4 (1986), 1261-1295. | Jackknife, Bootstrap and Other Resampling Methods in Regression Analysis | C. F. J. Wu | 947 |
28 | doi:10.1214/10-AOS799 | Ann. Statist., Volume 38, Number 5 (2010), 2916-2957. | Kernel density estimation via diffusion | Z. I. Botev, J. F. Grotowski, and D. P. Kroese | 940 |
29 | doi:10.1214/aos/1176347494 | Ann. Statist., Volume 18, Number 1 (1990), 90-120. | Empirical Likelihood Ratio Confidence Regions | Art Owen | 925 |
30 | doi:10.1214/aos/1176325370 | Ann. Statist., Volume 22, Number 1 (1994), 300-325. | Empirical Likelihood and General Estimating Equations | Jin Qin and Jerry Lawless | 921 |
31 | doi:10.1214/aos/1176343886 | Ann. Statist., Volume 5, Number 4 (1977), 595-620. | Consistent Nonparametric Regression | Charles J. Stone | 921 |
32 | doi:10.1214/aos/1176324317 | Ann. Statist., Volume 23, Number 5 (1995), 1630-1661. | Gaussian Semiparametric Estimation of Long Range Dependence | P. M. Robinson | 905 |
33 | doi:10.1214/aos/1176342871 | Ann. Statist., Volume 2, Number 6 (1974), 1152-1174. | Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems | Charles E. Antoniak | 884 |
34 | doi:10.1214/aos/1176345637 | Ann. Statist., Volume 9, Number 6 (1981), 1196-1217. | Some Asymptotic Theory for the Bootstrap | Peter J. Bickel and David A. Freedman | 852 |
35 | doi:10.1214/aos/1176350366 | Ann. Statist., Volume 15, Number 2 (1987), 642-656. | High Breakdown-Point and High Efficiency Robust Estimates for Regression | Victor J. Yohai | 820 |
36 | doi:10.1214/aos/1056562461 | Ann. Statist., Volume 31, Number 3 (2003), 705-767. | Slice sampling | Radford M. Neal | 776 |
37 | doi:10.1214/009053607000000505 | Ann. Statist., Volume 35, Number 6 (2007), 2769-2794. | Measuring and testing dependence by correlation of distances | Gábor J. Székely, Maria L. Rizzo, and Nail K. Bakirov | 762 |
38 | doi:10.1214/009053607000000677 | Ann. Statist., Volume 36, Number 3 (2008), 1171-1220. | Kernel methods in machine learning | Thomas Hofmann, Bernhard Schölkopf, and Alexander J. Smola | 760 |
39 | doi:10.1214/aos/1176346079 | Ann. Statist., Volume 11, Number 1 (1983), 286-295. | Negative Association of Random Variables with Applications | Kumar Joag-Dev and Frank Proschan | 758 |
40 | doi:10.1214/aos/1176344247 | Ann. Statist., Volume 6, Number 4 (1978), 701-726. | Nonparametric Inference for a Family of Counting Processes | Odd Aalen | 739 |
41 | doi:10.1214/aos/1176346391 | Ann. Statist., Volume 12, Number 1 (1984), 46-60. | On Chi-Squared Tests for Multiway Contingency Tables with Cell Proportions Estimated from Survey Data | J. N. K. Rao and A. J. Scott | 718 |
42 | doi:10.1214/aos/1176345969 | Ann. Statist., Volume 10, Number 4 (1982), 1040-1053. | Optimal Global Rates of Convergence for Nonparametric Regression | Charles J. Stone | 690 |
43 | doi:10.1214/aos/1176349040 | Ann. Statist., Volume 21, Number 1 (1993), 520-533. | Consistency and Limiting Distribution of the Least Squares Estimator of a Threshold Autoregressive Model | K. S. Chan | 680 |
44 | doi:10.1214/aos/1176345513 | Ann. Statist., Volume 9, Number 4 (1981), 705-724. | Logistic Regression Diagnostics | Daryl Pregibon | 677 |
45 | doi:10.1214/aos/1176324636 | Ann. Statist., Volume 23, Number 3 (1995), 1048-1072. | Log-Periodogram Regression of Time Series with Long Range Dependence | P. M. Robinson | 676 |
46 | doi:10.1214/aos/1009210544 | Ann. Statist., Volume 29, Number 2 (2001), 295-327. | On the distribution of the largest eigenvalue in principal components analysis | Iain M. Johnstone | 675 |
47 | doi:10.1214/aos/1176345462 | Ann. Statist., Volume 9, Number 3 (1981), 586-596. | The Jackknife Estimate of Variance | B. Efron and C. Stein | 666 |
48 | doi:10.1214/08-AOS620 | Ann. Statist., Volume 37, Number 4 (2009), 1705-1732. | Simultaneous analysis of Lasso and Dantzig selector | Peter J. Bickel, Ya’acov Ritov, and Alexandre B. Tsybakov | 628 |
49 | doi:10.1214/aos/1176346785 | Ann. Statist., Volume 12, Number 4 (1984), 1151-1172. | Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician | Donald B. Rubin | 622 |
50 | doi:10.1214/aos/1176349403 | Ann. Statist., Volume 21, Number 4 (1993), 1926-1947. | Comparing Nonparametric Versus Parametric Regression Fits | W. Hardle and E. Mammen | 622 |
51 | doi:10.1214/aos/1028144844 | Ann. Statist., Volume 26, Number 2 (1998), 451-471. | Classification by pairwise coupling | Trevor Hastie and Robert Tibshirani | 606 |
52 | doi:10.1214/aos/1176342372 | Ann. Statist., Volume 1, Number 2 (1973), 353-355. | Ferguson Distributions Via Polya Urn Schemes | David Blackwell and James B. MacQueen | 597 |
53 | doi:10.1214/aos/1176350051 | Ann. Statist., Volume 14, Number 3 (1986), 1080-1100. | Stochastic Complexity and Modeling | Jorma Rissanen | 566 |
54 | doi:10.1214/aos/1031689016 | Ann. Statist., Volume 30, Number 4 (2002), 1031-1068. | Vines–a new graphical model for dependent random variables | Tim Bedford and Roger M. Cooke | 562 |
55 | doi:10.1214/aos/1176350164 | Ann. Statist., Volume 14, Number 4 (1986), 1379-1387. | Optimal Stopping Times for Detecting Changes in Distributions | George V. Moustakides | 561 |
56 | doi:10.1214/aos/1176349548 | Ann. Statist., Volume 13, Number 2 (1985), 689-705. | Additive Regression and Other Nonparametric Models | Charles J. Stone | 556 |
57 | doi:10.1214/aos/1032181158 | Ann. Statist., Volume 24, Number 6 (1996), 2350-2383. | Heuristics of instability and stabilization in model selection | Leo Breiman | 552 |
58 | doi:10.1214/aos/1176325632 | Ann. Statist., Volume 22, Number 3 (1994), 1346-1370. | Multivariate Locally Weighted Least Squares Regression | D. Ruppert and M. P. Wand | 538 |
59 | doi:10.1214/aos/1176343347 | Ann. Statist., Volume 4, Number 1 (1976), 51-67. | Robust \(M\)-Estimators of Multivariate Location and Scatter | Ricardo Antonio Maronna | 534 |
60 | doi:10.1214/aos/1176349936 | Ann. Statist., Volume 14, Number 2 (1986), 517-532. | Large-Sample Properties of Parameter Estimates for Strongly Dependent Stationary Gaussian Time Series | Robert Fox and Murad S. Taqqu | 523 |
61 | doi:10.1214/aos/1024691079 | Ann. Statist., Volume 26, Number 3 (1998), 801-849. | Arcing classifier (with discussion and a rejoinder by the author) | Leo Breiman | 519 |
62 | doi:10.1214/aos/1176348666 | Ann. Statist., Volume 20, Number 2 (1992), 971-1001. | Asymptotics for Linear Processes | Peter C. B. Phillips and Victor Solo | 513 |
63 | doi:10.1214/aos/1176347393 | Ann. Statist., Volume 17, Number 4 (1989), 1749-1766. | Efficient Parameter Estimation for Self-Similar Processes | Rainer Dahlhaus | 508 |
64 | doi:10.1214/aos/1015957397 | Ann. Statist., Volume 28, Number 5 (2000), 1356-1378. | Asymptotics for lasso-type estimators | Wenjiang Fu and Keith Knight | 506 |
65 | doi:10.1214/aos/1176347115 | Ann. Statist., Volume 17, Number 2 (1989), 453-510. | Linear Smoothers and Additive Models | Andreas Buja, Trevor Hastie, and Robert Tibshirani | 505 |
66 | doi:10.1214/009053607000000758 | Ann. Statist., Volume 36, Number 1 (2008), 199-227. | Regularized estimation of large covariance matrices | Peter J. Bickel and Elizaveta Levina | 500 |
67 | doi:10.1214/009053607000000802 | Ann. Statist., Volume 36, Number 4 (2008), 1509-1533. | One-step sparse estimates in nonconcave penalized likelihood models | Hui Zou and Runze Li | 498 |
68 | doi:10.1214/aos/1176349022 | Ann. Statist., Volume 21, Number 1 (1993), 196-216. | Local Linear Regression Smoothers and Their Minimax Efficiencies | Jianqing Fan | 494 |
69 | doi:10.1214/aos/1176345338 | Ann. Statist., Volume 9, Number 1 (1981), 130-134. | The Bayesian Bootstrap | Donald B. Rubin | 490 |
70 | doi:10.1214/aos/1176345638 | Ann. Statist., Volume 9, Number 6 (1981), 1218-1228. | Bootstrapping Regression Models | D. A. Freedman | 471 |
71 | doi:10.1214/aos/1176324456 | Ann. Statist., Volume 23, Number 1 (1995), 73-102. | Penalized Discriminant Analysis | Trevor Hastie, Andreas Buja, and Robert Tibshirani | 467 |
72 | doi:10.1214/aos/1176348385 | Ann. Statist., Volume 19, Number 4 (1991), 2032-2066. | Why Least Squares and Maximum Entropy? An Axiomatic Approach to Inference for Linear Inverse Problems | Imre Csiszar | 460 |
73 | doi:10.1214/aos/1176346056 | Ann. Statist., Volume 11, Number 1 (1983), 59-67. | Quasi-Likelihood Functions | Peter McCullagh | 456 |
74 | doi:10.1214/aos/1176350933 | Ann. Statist., Volume 16, Number 3 (1988), 927-953. | Theoretical Comparison of Bootstrap Confidence Intervals | Peter Hall | 451 |
75 | doi:10.1214/08-AOS600 | Ann. Statist., Volume 36, Number 6 (2008), 2577-2604. | Covariance regularization by thresholding | Peter J. Bickel and Elizaveta Levina | 435 |
76 | doi:10.1214/009053607000000127 | Ann. Statist., Volume 35, Number 5 (2007), 2173-2192. | On the “degrees of freedom” of the lasso | Hui Zou, Trevor Hastie, and Robert Tibshirani | 426 |
77 | doi:10.1214/009053604000000256 | Ann. Statist., Volume 32, Number 3 (2004), 928-961. | Nonconcave penalized likelihood with a diverging number of parameters | Jianqing Fan and Heng Peng | 423 |
78 | doi:10.1214/aos/1176348653 | Ann. Statist., Volume 20, Number 2 (1992), 712-736. | Exact Mean Integrated Squared Error | J. S. Marron and M. P. Wand | 423 |
79 | doi:10.1214/aos/1176342558 | Ann. Statist., Volume 1, Number 6 (1973), 1071-1095. | On Some Global Measures of the Deviations of Density Function Estimates | P. J. Bickel and M. Rosenblatt | 418 |
80 | doi:10.1214/aos/1176342810 | Ann. Statist., Volume 2, Number 5 (1974), 849-879. | General Equivalence Theory for Optimum Designs (Approximate Theory) | J. Kiefer | 416 |
81 | doi:10.1214/aos/1176345636 | Ann. Statist., Volume 9, Number 6 (1981), 1187-1195. | On the Asymptotic Accuracy of Efron’s Bootstrap | Kesar Singh | 414 |
82 | doi:10.1214/aos/1176350057 | Ann. Statist., Volume 14, Number 3 (1986), 1171-1179. | The Use of Subseries Values for Estimating the Variance of a General Statistic from a Stationary Sequence | Edward Carlstein | 414 |
83 | doi:10.1214/aos/1176342705 | Ann. Statist., Volume 2, Number 3 (1974), 437-453. | A Large Sample Study of the Life Table and Product Limit Estimates Under Random Censorship | N. Breslow and J. Crowley | 412 |
84 | doi:10.1214/aos/1034276620 | Ann. Statist., Volume 25, Number 1 (1997), 1-37. | Fitting time series models to nonstationary processes | R. Dahlhaus | 401 |
85 | doi:10.1214/aos/1176348248 | Ann. Statist., Volume 19, Number 3 (1991), 1257-1272. | On the Optimal Rates of Convergence for Nonparametric Deconvolution Problems | Jianqing Fan | 400 |
86 | doi:10.1214/aos/1176348368 | Ann. Statist., Volume 19, Number 4 (1991), 1725-1747. | Empirical Likelihood for Linear Models | Art Owen | 399 |
87 | doi:10.1214/aos/1176345206 | Ann. Statist., Volume 8, Number 6 (1980), 1348-1360. | Optimal Rates of Convergence for Nonparametric Estimators | Charles J. Stone | 398 |
88 | doi:10.1214/aos/1024691081 | Ann. Statist., Volume 26, Number 3 (1998), 879-921. | Minimax estimation via wavelet shrinkage | David L. Donoho and Iain M. Johnstone | 396 |
89 | doi:10.1214/aos/1176346788 | Ann. Statist., Volume 12, Number 4 (1984), 1215-1230. | Bandwidth Choice for Nonparametric Regression | John Rice | 394 |
90 | doi:10.1214/aos/1176349025 | Ann. Statist., Volume 21, Number 1 (1993), 255-285. | Bootstrap and Wild Bootstrap for High Dimensional Linear Models | Enno Mammen | 393 |
91 | doi:10.1214/aos/1176347397 | Ann. Statist., Volume 17, Number 4 (1989), 1833-1855. | A Moment Estimator for the Index of an Extreme-Value Distribution | A. L. M. Dekkers, J. H. J. Einmahl, and L. De Haan | 388 |
92 | doi:10.1214/009053604000001048 | Ann. Statist., Volume 33, Number 1 (2005), 1-53. | Analysis of variance—why it is more important than ever | Andrew Gelman | 386 |
93 | doi:10.1214/aos/1176325622 | Ann. Statist., Volume 22, Number 3 (1994), 1142-1160. | Posterior Predictive \(p\)-Values | Xiao-Li Meng | 386 |
94 | doi:10.1214/aos/1176345697 | Ann. Statist., Volume 10, Number 1 (1982), 154-166. | Least Squares Estimates in Stochastic Regression Models with Applications to Identification and Control of Dynamic Systems | Tze Leung Lai and Ching Zong Wei | 386 |
95 | doi:10.1214/aos/1016218226 | Ann. Statist., Volume 28, Number 2 (2000), 461-482. | General notions of statistical depth function | Robert Serfling and Yijun Zuo | 383 |
96 | doi:10.1214/aos/1176343842 | Ann. Statist., Volume 5, Number 3 (1977), 445-463. | Minimum Hellinger Distance Estimates for Parametric Models | Rudolf Beran | 375 |
97 | doi:10.1214/aos/1176342752 | Ann. Statist., Volume 2, Number 4 (1974), 615-629. | Prior Distributions on Spaces of Probability Measures | Thomas S. Ferguson | 374 |
98 | doi:10.1214/aos/1176347507 | Ann. Statist., Volume 18, Number 1 (1990), 405-414. | On a Notion of Data Depth Based on Random Simplices | Regina Y. Liu | 374 |
99 | doi:10.1214/009053604000000238 | Ann. Statist., Volume 32, Number 3 (2004), 870-897. | Optimal predictive model selection | Maria Maddalena Barbieri and James O. Berger | 373 |
100 | doi:10.1214/aos/1176349020 | Ann. Statist., Volume 21, Number 1 (1993), 157-178. | Optimal Smoothing in Single-Index Models | Wolfgang Hardle, Peter Hall, and Hidehiko Ichimura | 366 |
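Since the Year column carries the full source string, the by-decade breakdown I wanted in the intro is within reach. A sketch, assuming every source string contains a four-digit year in parentheses (rows that don’t will come out NA):

## Pull the publication year out of strings like
## "Ann. Statist., Volume 6, Number 2 (1978), 461-464."
issuesDT[, pub_year := as.integer(sub(".*\\(([[:digit:]]{4})\\).*", "\\1", Year))]
issuesDT[, decade := pub_year - pub_year %% 10]
## Top 3 most-cited articles in each decade
issuesDT[order(-citations), head(.SD, 3), by = decade,
         .SDcols = c("Article_name", "Authors", "citations")]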
Conclusion
This project took way too much time: several days to reliably download all the required data. But I had some fun, and learned a bit about web scraping. I’ll revisit this work when I get interested in other journals.
Time permitting, it would be good to compile the data from this project into an R package for others to use. I’d first want to validate the data and further clean out some annoying tab and newline characters, but it’s doable.
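The cleanup I have in mind is nothing fancy; a sketch over the character columns of the combined table from above:

## Squeeze stray tabs, newlines, and repeated spaces out of every
## character column
char_cols <- names(issuesDT)[sapply(issuesDT, is.character)]
for (col in char_cols) {
  set(issuesDT, j = col, value = gsub("[[:space:]]+", " ", trimws(issuesDT[[col]])))
}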
Minimally, anyone can read this blog post and figure out how to apply my scraping functions to other Project Euclid journals.
Unfortunately, summarizing by keyword only occurred to me the day after I sucked up the Project Euclid data, so that tabulation will have to wait for another journal.↩
Citations are definitely not the best measure of popularity, but I’m not sure how to get any better metrics…↩
There is also a Google Scholar R package, but I have a feeling that it would be subject to similar or the same issues that I encountered. Better to use a service that actually exposes an API.↩