Or, how to bot Project Euclid with R.
Intro
Whenever I look at the front page of Project Euclid, they try to be helpful and show me the top articles of the week, but it’s not exactly what I want to see. I want to know the all-time top articles from AOS, and I don’t just want a list of 5 articles, I want all the data.
True, there are some papers and blog posts out there that do some tabulation of the top articles in the whole discipline of statistics, but I want a break-down for a specific journal, and I want to use the most up-to-date information I can find. I also want to be able to see the top articles by keyword in each journal1, and by decade, so we’re just going to have to roll up our sleeves and do this the hard way :)
Why would someone want this information? It’s just intellectual curiosity. Sometimes I flip through a journal, and I start reading, but it’s kinda aimless if the journal is too technical. I can read American Statistician cover to cover because they mix it up with lots of different themes. Not so much with AOS. I need some help, because I’m only going to read a few articles a year. So this project is effectively about making a recommendation engine for an academic journal.2
library(rvest) # Web scraping
library(rcrossref) # Get citations
library(data.table) # Data tables and disk I/O
library(magrittr) # this is not a pipe
Goal: We need to collect a list of all published articles in AOS, and get the number of citations for each of the articles. The plan is to use a couple web scrapers to get the job done.
Getting all the articles in AOS
First, crawl around the main AOS landing page. The AOS landing page can be changed to the landing page of any journal hosted on Project Euclid. Later, I’m going to want to adapt this code to work on some other journals, such as the Annals of Applied Statistics.
Let me be the first to admit that I’m not a web-scraping guru, so my scraping strategy is likely suboptimal. But ya gotta make some mistakes before you know how to avoid them. Let me know how I can make things more stable and faster for my next go at it!
## This is the landing page for AOS
url <- "https://projecteuclid.org/all/euclid.aos"
main <- read_html(url)
## Suck up all links from the main page
aos_links <- html_nodes(main, "a") %>% html_attr("href") # links
## regex to match issues from the main page.
link_to_issue <- grepl("euclid.aos/[[:digit:]]+", aos_links)
## Toss out the links that do not appear to correspond to any issues.
aos_links <- aos_links[link_to_issue]
Now we have links to all landing pages of each issue the journal has released. Each issue page contains links to the articles in that issue, and each article page contains the precious DOIs. We will use the DOIs to search for the number of citations later. That’s the goal.
Here is a function to loop through all the articles in an issue. The main idea is that the function takes in a prepared, but empty, data.table, goes through all the articles in issue_links, and spits out a completed data.table for that issue. Each row of the output data.table corresponds to a single article, and for each article we record the DOI, year, authors, and article name. Nice!
issue_data_table <- function(DT, issue_links) {
  ## This loop goes through all the articles in an issue, and gets all the
  ## article information.
  for (i in seq_along(issue_links)) {
    ## Go to each article page.
    ## Each article page is an element in the vector of issue_links.
    article_url <- paste0("https://projecteuclid.org", issue_links[i], "#info")
    message("Downloading an article...")
    ## FIXME: Wrap this in tryCatch()
    article_html <- read_html(article_url)
    ## Get the DOI.
    ## Finding the correct nodes takes some fiddling around with SelectorGadget.
    DOI_unclean <- article_html %>%
      html_node("p:nth-child(5)") %>%
      html_text()
    ## Some of the article information needs to be cleaned up.
    ## I prefer base-R regex functions when the job is not too bad :)
    DOI_clean <- sub("^Digital Object Identifier", "", DOI_unclean)
    ## Get the year of each article
    year_unclean <- article_html %>%
      html_node("#info p:nth-child(2)") %>%
      html_text()
    year_clean <- sub("^Source", "", year_unclean)
    ## Get the article name
    art_name <- article_html %>%
      html_node("h3") %>%
      html_text()
    ## Get the authors
    authors <- article_html %>%
      html_node(".small") %>%
      html_text()
    ## FIXME: Additionally get the keywords for each article. I forgot to do
    ## that after starting the AOS crawler.
    ## Assign all the article information to one row of a pre-made data.table.
    ## `set` is a loopable version of `:=`.
    set(DT, i, "DOI", DOI_clean)
    set(DT, i, "Year", year_clean)
    set(DT, i, "Article_name", art_name)
    set(DT, i, "Authors", authors)
    ## Pause for 10 secs to not overwhelm the server.
    Sys.sleep(10)
  }
  DT
}
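While I’m confessing FIXMEs: here is a minimal sketch of the tryCatch() wrapper I have in mind, with a retry thrown in. The helper name read_html_safely and its defaults are hypothetical; the crawler above does not actually use it.

read_html_safely <- function(url, tries = 3, wait = 10) {
  ## Hypothetical helper: retry read_html() a few times before giving up,
  ## handing back the error object instead of aborting the whole crawl.
  for (attempt in seq_len(tries)) {
    result <- tryCatch(read_html(url), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    message("Attempt ", attempt, " failed for ", url, "; retrying...")
    Sys.sleep(wait)
  }
  result # still an error object; callers can check with inherits()
}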
Apply the issue -> data.table function to each issue of the AOS catalog, while taking care to fail gracefully.
get_aos <- function(aos_links) {
  for (i in seq_along(aos_links)) {
    ## Crawl the main page for each issue.
    issue_url <- paste0("https://projecteuclid.org", aos_links[i])
    ## Make a different data.table for each issue and name them programmatically.
    DTname <- paste0("issueDT_",
                     regmatches(issue_url, regexpr("[[:digit:]]+", issue_url)))
    ## Filename for each data.table
    fname <- paste0("./assets/", DTname, ".csv")
    ## If we already have that data, skip to the next issue.
    if (file.exists(fname)) {
      message(paste("Checked issue", issue_url, " ", i / length(aos_links) * 100, "% Done."))
      next
    }
    ## Fail silently, if it must be ...
    issue_html <- tryCatch(read_html(issue_url), error = function(e) e)
    ## Skip this issue if there is something wrong; we can try again later.
    ## https://stackoverflow.com/a/8094059/7281549
    if (inherits(issue_html, "error")) {
      message(paste("Problem encountered with issue", aos_links[i], "- trying the next issue..."))
      next
    }
    ## Search the issue page for links to each paper
    issue_links <- html_nodes(issue_html, "a") %>% html_attr("href")
    link_to_papers <- grepl("^/euclid.aos/[[:digit:]]+", issue_links)
    issue_links <- issue_links[link_to_papers] %>% unique()
    ## Make an empty data.table for the issue information with the right dimensions
    assign(DTname,
           data.table(DOI = character(length = length(issue_links)),
                      Year = character(length = length(issue_links)),
                      Article_name = character(length = length(issue_links)),
                      Authors = character(length = length(issue_links))))
    ## Use our previous function to get the issue information into the data.table
    assign(DTname, issue_data_table(get(DTname), issue_links))
    ## Save the data.table
    fwrite(get(DTname), fname)
    ## Update the R console with our progress.
    message(paste("Downloaded issue", issue_url, " ", i / length(aos_links) * 100, "% Done."))
    ## Pause again, do not burden Project Euclid.
    Sys.sleep(10)
  }
}
Let it rip:
get_aos(aos_links)
Looping through all the articles in all the issues takes about a day, because we cannot go too quickly without getting IP-banned from the website. I found this out the hard way …
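If I redo this, I’d probably also jitter the delay so the requests don’t land on a perfectly regular, bot-like schedule. A tiny sketch (polite_sleep is my own invention, not used in the code above):

polite_sleep <- function(base = 10) {
  ## Sleep for base seconds plus up to 5 seconds of random jitter
  Sys.sleep(base + runif(1, min = 0, max = 5))
}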
The outer loop output looks something like this, so we can keep track of progress and see any failures when they inevitably arise. Normally I would have used txtProgressBar(), but I do want to print out which issue has a failure on each iteration, so the progress bar is a bit superfluous.
Checked issue https://projecteuclid.org/euclid.aos/1176342455 98.8679245283019 % Done.
Checked issue https://projecteuclid.org/euclid.aos/1176342405 99.2452830188679 % Done.
Checked issue https://projecteuclid.org/euclid.aos/1176342358 99.622641509434 % Done.
Checked issue https://projecteuclid.org/euclid.aos/1193342377 100 % Done.
...
When there is a failure with data retrieval, I think I’ve got things rigged correctly to ensure that it happens before the data.table is written to disk, so re-running the code will skip all “correct” data.tables and just fill in the missing ones.
Fast-forward 12 hours of web crawling …
Now we have a bunch of data.tables, one for each issue, with all the information required to run a citation search against each entry. All the data.tables were saved to individual files, so that we don’t have to run the function again :)
Minor data cleaning
We’ve got about 260 data.tables to merge and clean. Let’s do that. Assuming this is a new R session, we read the files from disk. I’d like to clean the data and save it to a new directory. Let’s keep the tables with citation information separate from the tables without.
clean_all_issues <- function() {
  ## Get a list of all the issue files.
  issues <- list.files("./assets/", pattern = ".csv$")
  ## Make sure the output directory exists before writing to it.
  if (!dir.exists("./assets/cleaned_issues")) dir.create("./assets/cleaned_issues")
  for (i in seq_along(issues)) {
    ## Keep only real articles: drop rows without a DOI, plus discussions,
    ## rejoinders, corrections, and volume information.
    issue <- fread(paste0("./assets/", issues[i]))[grepl("^doi:", DOI)][
      !grepl("^Discussion", Article_name)][
      !grepl("^Rejoinder", Article_name)][
      !grepl("^Correction", Article_name)][
      !grepl("^Volume Information", Article_name)]
    fwrite(issue, paste0("./assets/cleaned_issues/", issues[i]))
  }
}
clean_all_issues()
Next we need a function that takes each cleaned data.table from assets, appends the citation counts, and writes the result to the new folder.
Getting the citation data
Google Scholar (utter failure)
It seems that Google Scholar might be the best way to do this, but we’ll have to be careful about obtaining the information since Google no doubt has the best anti-botting software.
My first idea was that Google Scholar would be perfect for getting the citation counts. I wrote a nice function that takes one of my cleaned Project Euclid data.tables, goes through each article, and gets its citations from Google Scholar.
While this approach was promising at first, I ultimately had to abandon the idea that Google was going to be any help at all. The problem is that Google simply throws up too many CAPTCHAs: one after every 20 or so searches, regardless of my connection.3
I tried to rig things up so that my IP address would change after downloading an issue, but that proved to be more trouble than it’s worth (and probably against Google’s terms of service…)
Time to start looking elsewhere.
Trying again with CrossRef API
CrossRef.org seems to be one of the few alternatives to Google Scholar; better yet, they offer an API, and a well-maintained R package is available. Let’s take the easy way out this time and not use rvest.
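As a quick smoke test, here is the package call I use below, pointed at a single DOI (the Schwarz 1978 paper from the table further down). The count you get back will drift over time, of course.

## One-off citation count for a single DOI
cr_citation_count("10.1214/aos/1176344136")$count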
This time we need a couple of functions to collect all the citation information.

append_citations takes in a cleaned data.table and loops through all the DOIs. For each DOI, we check if there is already some citation information; if none is found, we retrieve it from the CrossRef API. At the end, the function hands the data.table back to us, with a new column for citation counts.
append_citations <- function(DT) {
  ## Make some space to store the citations (if the column is not there).
  ## We initialize the citations to -1; we can't use 0, because an article
  ## may actually have 0 cites.
  if (!("citations" %in% colnames(DT)))
    DT[, citations := rep(-1, .N)]
  ## Now loop through all articles in the issue.
  for (i in seq_len(nrow(DT))) {
    ## Skip some rows that evaded my crappy data cleaning skills.
    if (!grepl("^doi:", DT[i, DOI])) next
    read_cites <- DT[i, citations]
    ## The article is already here; no need to try again.
    if (read_cites > -1 | read_cites == "error") {
      message(paste("Article", DT[i, DOI], "is already here!"))
      next
    }
    doi <- sub("^doi:", "", DT[i, DOI])
    ## Get the number of citations for the article
    article_citations <- cr_citation_count(doi)$count
    ## Save the data
    if (is.numeric(article_citations)) {
      set(DT, as.integer(i), "citations", article_citations)
      message(paste("Just got article", DT[i, DOI], "... The count was", article_citations))
    } else {
      set(DT, as.integer(i), "citations", "error")
      message(paste("Error with article", DT[i, DOI]))
    }
    ## Wait 5 seconds. Do not poke the bear.
    Sys.sleep(5)
  }
  message("Finished an issue!")
  DT
}
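On a single issue, usage would look something like this; the file name is borrowed from the crawl log above.

## Append citation counts to one cleaned issue table
one_issue <- fread("./assets/cleaned_issues/issueDT_1176342455.csv")
one_issue <- append_citations(one_issue)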
get_citations just carefully applies that previous append_citations function to all my Annals of Statistics issues.
get_citations <- function(issues_list) {
  ## Loop over all the issues.
  for (i in seq_along(issues_list)) {
    issue <- issues_list[i]
    ## If the data.table already has citation information, use that;
    ## otherwise start from a clean data.table.
    if (issue %in% list.files("./assets/cite_tables/")) {
      issue_DT <- fread(paste0("./assets/cite_tables/", issue))
    } else {
      issue_DT <- fread(paste0("./assets/cleaned_issues/", issue))
    }
    message("Now working on ", issue)
    ## Apply the citation-retrieval function to the table
    issue_with_cites <- append_citations(issue_DT)
    ## Save or overwrite.
    fwrite(issue_with_cites, paste0("./assets/cite_tables/", issue))
    message(i / length(issues_list) * 100, "% Done.")
  }
}
Apply the citation function to all DOIs:
First, make a directory to store the citation frames as we download them.
if (!dir.exists("./assets/cite_tables")) dir.create("./assets/cite_tables")
And apply the get_citations() function to suck in all the citation counts. The whole process takes about 12 hours, but that’s mostly because I set a 5-second delay between API requests. It’s free data, so I don’t want to bog them down with my project.
issues <- list.files("./assets/cleaned_issues/", pattern = ".csv$")
get_citations(issues)
The output is nicely formatted for monitoring:
...
Just got article doi:10.1214/12-AOS1025 ... The count was 7
Just got article doi:10.1214/12-AOS1026 ... The count was 18
Finished an issue!
80% Done.
Now working on issueDT_1351602526.csv
Just got article doi:10.1214/11-AOS949 ... The count was 130
Just got article doi:10.1214/12-AOS999 ... The count was 45
...
I’m happy to use the CrossRef API to collect the citation counts. Even if the data is not 100% reliable, the API and the R package worked well. And I’m not one to endorse many R packages!
Tabulate Results
First, gotta combine all the issue tables into one table. There were some parsing warnings that I’m going to ignore for now…
issues <- list.files(paste0("./assets/cite_tables/"), pattern = ".csv$", full.names = TRUE)
list_of_issues <- lapply(issues, fread)
issuesDT <- rbindlist(list_of_issues)[order(-citations)]
## Column 1 ['doi:10.1214/aos/1018031261'] of item 18 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
issuesDT[, Article_name := gsub("\r?\n|\r", "", Article_name)]
summary(issuesDT$citations)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 6.00 17.00 56.84 43.00 22674.00 48
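For what it’s worth, if you’d rather make that rbindlist() message go away than ignore it, the function takes use.names and fill arguments. A sketch, assuming the stray column in item 18 is safe to pad with NAs:

## Bind the issue tables by column name, padding missing columns with NA
issuesDT <- rbindlist(list_of_issues, use.names = TRUE, fill = TRUE)[order(-citations)]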
The top 100 AOS articles:
knitr::kable(head(issuesDT, 100), row.names = TRUE)
 | DOI | Year | Article_name | Authors | citations |
---|---|---|---|---|---|
1 | doi:10.1214/aos/1176344136 | Ann. Statist., Volume 6, Number 2 (1978), 461-464. | Estimating the Dimension of a Model | Gideon Schwarz | 22674 |
2 | doi:10.1214/aos/1176344552 | Ann. Statist., Volume 7, Number 1 (1979), 1-26. | Bootstrap Methods: Another Look at the Jackknife | B. Efron | 8945 |
3 | doi:10.1214/aos/1013203451 | Ann. Statist., Volume 29, Number 5 (2001), 1189-1232. | Greedy function approximation: A gradient boosting machine. | Jerome H. Friedman | 5550 |
4 | doi:10.1214/009053604000000067 | Ann. Statist., Volume 32, Number 2 (2004), 407-499. | Least angle regression | Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani | 4363 |
5 | doi:10.1214/aos/1176347963 | Ann. Statist., Volume 19, Number 1 (1991), 1-67. | Multivariate Adaptive Regression Splines | Jerome H. Friedman | 4082 |
6 | doi:10.1214/aos/1016218223 | Ann. Statist., Volume 28, Number 2 (2000), 337-407. | Additive logistic regression: a statistical view of boosting (With discussion and a rejoinder by the authors) | Jerome Friedman, Trevor Hastie, and Robert Tibshirani | 3193 |
7 | doi:10.1214/aos/1176350951 | Ann. Statist., Volume 16, Number 3 (1988), 1141-1154. | A Class of \(K\)-Sample Tests for Comparing the Cumulative Incidence of a Competing Risk | Robert J. Gray | 3060 |
8 | doi:10.1214/aos/1013699998 | Ann. Statist., Volume 29, Number 4 (2001), 1165-1188. | The control of the false discovery rate in multiple testing under dependency | Yoav Benjamini and Daniel Yekutieli | 2626 |
9 | doi:10.1214/aos/1176345976 | Ann. Statist., Volume 10, Number 4 (1982), 1100-1120. | Cox’s Regression Model for Counting Processes: A Large Sample Study | P. K. Andersen and R. D. Gill | 2202 |
10 | doi:10.1214/aos/1176342360 | Ann. Statist., Volume 1, Number 2 (1973), 209-230. | A Bayesian Analysis of Some Nonparametric Problems | Thomas S. Ferguson | 2176 |
11 | doi:10.1214/aos/1176325750 | Ann. Statist., Volume 22, Number 4 (1994), 1701-1728. | Markov Chains for Exploring Posterior Distributions | Luke Tierney | 1806 |
12 | doi:10.1214/aos/1176343003 | Ann. Statist., Volume 3, Number 1 (1975), 119-131. | Statistical Inference Using Extreme Order Statistics | James Pickands III | 1781 |
13 | doi:10.1214/aos/1176346060 | Ann. Statist., Volume 11, Number 1 (1983), 95-103. | On the Convergence Properties of the EM Algorithm | C. F. Jeff Wu | 1712 |
14 | doi:10.1214/aos/1176343247 | Ann. Statist., Volume 3, Number 5 (1975), 1163-1174. | A Simple General Approach to Inference About the Tail of a Distribution | Bruce M. Hill | 1646 |
15 | doi:10.1214/aos/1176345632 | Ann. Statist., Volume 9, Number 6 (1981), 1135-1151. | Estimation of the Mean of a Multivariate Normal Distribution | Charles M. Stein | 1249 |
16 | doi:10.1214/009053606000000281 | Ann. Statist., Volume 34, Number 3 (2006), 1436-1462. | High-dimensional graphs and variable selection with the Lasso | Nicolai Meinshausen and Peter Bühlmann | 1237 |
17 | doi:10.1214/aos/1176344064 | Ann. Statist., Volume 6, Number 1 (1978), 34-58. | Bayesian Inference for Causal Effects: The Role of Randomization | Donald B. Rubin | 1197 |
18 | doi:10.1214/aos/1176343654 | Ann. Statist., Volume 4, Number 6 (1976), 1236-1239. | Agreeing to Disagree | Robert J. Aumann | 1191 |
19 | doi:10.1214/aos/1176342503 | Ann. Statist., Volume 1, Number 5 (1973), 799-821. | Robust Regression: Asymptotics, Conjectures and Monte Carlo | Peter J. Huber | 1183 |
20 | doi:10.1214/aos/1176347265 | Ann. Statist., Volume 17, Number 3 (1989), 1217-1241. | The Jackknife and the Bootstrap for General Stationary Observations | Hans R. Kunsch | 1182 |
21 | doi:10.1214/aos/1074290335 | Ann. Statist., Volume 31, Number 6 (2003), 2013-2035. | The positive false discovery rate: a Bayesian interpretation and the q-value | John D. Storey | 1159 |
22 | doi:10.1214/09-AOS729 | Ann. Statist., Volume 38, Number 2 (2010), 894-942. | Nearly unbiased variable selection under minimax concave penalty | Cun-Hui Zhang | 1149 |
23 | doi:10.1214/aos/1024691352 | Ann. Statist., Volume 26, Number 5 (1998), 1651-1686. | Boosting the margin: a new explanation for the effectiveness of voting methods | Peter Bartlett, Yoav Freund, Wee Sun Lee, and Robert E. Schapire | 1082 |
24 | doi:10.1214/aos/1176349519 | Ann. Statist., Volume 13, Number 2 (1985), 435-475. | Projection Pursuit | Peter J. Huber | 1047 |
25 | doi:10.1214/aos/1176346150 | Ann. Statist., Volume 11, Number 2 (1983), 416-431. | A Universal Prior for Integers and Estimation by Minimum Description Length | Jorma Rissanen | 954 |
26 | doi:10.1214/aos/1176346577 | Ann. Statist., Volume 13, Number 1 (1985), 70-84. | The Dip Test of Unimodality | J. A. Hartigan and P. M. Hartigan | 954 |
27 | doi:10.1214/aos/1176350142 | Ann. Statist., Volume 14, Number 4 (1986), 1261-1295. | Jackknife, Bootstrap and Other Resampling Methods in Regression Analysis | C. F. J. Wu | 947 |
28 | doi:10.1214/10-AOS799 | Ann. Statist., Volume 38, Number 5 (2010), 2916-2957. | Kernel density estimation via diffusion | Z. I. Botev, J. F. Grotowski, and D. P. Kroese | 940 |
29 | doi:10.1214/aos/1176347494 | Ann. Statist., Volume 18, Number 1 (1990), 90-120. | Empirical Likelihood Ratio Confidence Regions | Art Owen | 925 |
30 | doi:10.1214/aos/1176325370 | Ann. Statist., Volume 22, Number 1 (1994), 300-325. | Empirical Likelihood and General Estimating Equations | Jin Qin and Jerry Lawless | 921 |
31 | doi:10.1214/aos/1176343886 | Ann. Statist., Volume 5, Number 4 (1977), 595-620. | Consistent Nonparametric Regression | Charles J. Stone | 921 |
32 | doi:10.1214/aos/1176324317 | Ann. Statist., Volume 23, Number 5 (1995), 1630-1661. | Gaussian Semiparametric Estimation of Long Range Dependence | P. M. Robinson | 905 |
33 | doi:10.1214/aos/1176342871 | Ann. Statist., Volume 2, Number 6 (1974), 1152-1174. | Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems | Charles E. Antoniak | 884 |
34 | doi:10.1214/aos/1176345637 | Ann. Statist., Volume 9, Number 6 (1981), 1196-1217. | Some Asymptotic Theory for the Bootstrap | Peter J. Bickel and David A. Freedman | 852 |
35 | doi:10.1214/aos/1176350366 | Ann. Statist., Volume 15, Number 2 (1987), 642-656. | High Breakdown-Point and High Efficiency Robust Estimates for Regression | Victor J. Yohai | 820 |
36 | doi:10.1214/aos/1056562461 | Ann. Statist., Volume 31, Number 3 (2003), 705-767. | Slice sampling | Radford M. Neal | 776 |
37 | doi:10.1214/009053607000000505 | Ann. Statist., Volume 35, Number 6 (2007), 2769-2794. | Measuring and testing dependence by correlation of distances | Gábor J. Székely, Maria L. Rizzo, and Nail K. Bakirov | 762 |
38 | doi:10.1214/009053607000000677 | Ann. Statist., Volume 36, Number 3 (2008), 1171-1220. | Kernel methods in machine learning | Thomas Hofmann, Bernhard Schölkopf, and Alexander J. Smola | 760 |
39 | doi:10.1214/aos/1176346079 | Ann. Statist., Volume 11, Number 1 (1983), 286-295. | Negative Association of Random Variables with Applications | Kumar Joag-Dev and Frank Proschan | 758 |
40 | doi:10.1214/aos/1176344247 | Ann. Statist., Volume 6, Number 4 (1978), 701-726. | Nonparametric Inference for a Family of Counting Processes | Odd Aalen | 739 |
41 | doi:10.1214/aos/1176346391 | Ann. Statist., Volume 12, Number 1 (1984), 46-60. | On Chi-Squared Tests for Multiway Contingency Tables with Cell Proportions Estimated from Survey Data | J. N. K. Rao and A. J. Scott | 718 |
42 | doi:10.1214/aos/1176345969 | Ann. Statist., Volume 10, Number 4 (1982), 1040-1053. | Optimal Global Rates of Convergence for Nonparametric Regression | Charles J. Stone | 690 |
43 | doi:10.1214/aos/1176349040 | Ann. Statist., Volume 21, Number 1 (1993), 520-533. | Consistency and Limiting Distribution of the Least Squares Estimator of a Threshold Autoregressive Model | K. S. Chan | 680 |
44 | doi:10.1214/aos/1176345513 | Ann. Statist., Volume 9, Number 4 (1981), 705-724. | Logistic Regression Diagnostics | Daryl Pregibon | 677 |
45 | doi:10.1214/aos/1176324636 | Ann. Statist., Volume 23, Number 3 (1995), 1048-1072. | Log-Periodogram Regression of Time Series with Long Range Dependence | P. M. Robinson | 676 |
46 | doi:10.1214/aos/1009210544 | Ann. Statist., Volume 29, Number 2 (2001), 295-327. | On the distribution of the largest eigenvalue in principal components analysis | Iain M. Johnstone | 675 |
47 | doi:10.1214/aos/1176345462 | Ann. Statist., Volume 9, Number 3 (1981), 586-596. | The Jackknife Estimate of Variance | B. Efron and C. Stein | 666 |
48 | doi:10.1214/08-AOS620 | Ann. Statist., Volume 37, Number 4 (2009), 1705-1732. | Simultaneous analysis of Lasso and Dantzig selector | Peter J. Bickel, Ya’acov Ritov, and Alexandre B. Tsybakov | 628 |
49 | doi:10.1214/aos/1176346785 | Ann. Statist., Volume 12, Number 4 (1984), 1151-1172. | Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician | Donald B. Rubin | 622 |
50 | doi:10.1214/aos/1176349403 | Ann. Statist., Volume 21, Number 4 (1993), 1926-1947. | Comparing Nonparametric Versus Parametric Regression Fits | W. Hardle and E. Mammen | 622 |
51 | doi:10.1214/aos/1028144844 | Ann. Statist., Volume 26, Number 2 (1998), 451-471. | Classification by pairwise coupling | Trevor Hastie and Robert Tibshirani | 606 |
52 | doi:10.1214/aos/1176342372 | Ann. Statist., Volume 1, Number 2 (1973), 353-355. | Ferguson Distributions Via Polya Urn Schemes | David Blackwell and James B. MacQueen | 597 |
53 | doi:10.1214/aos/1176350051 | Ann. Statist., Volume 14, Number 3 (1986), 1080-1100. | Stochastic Complexity and Modeling | Jorma Rissanen | 566 |
54 | doi:10.1214/aos/1031689016 | Ann. Statist., Volume 30, Number 4 (2002), 1031-1068. | Vines–a new graphical model for dependent random variables | Tim Bedford and Roger M. Cooke | 562 |
55 | doi:10.1214/aos/1176350164 | Ann. Statist., Volume 14, Number 4 (1986), 1379-1387. | Optimal Stopping Times for Detecting Changes in Distributions | George V. Moustakides | 561 |
56 | doi:10.1214/aos/1176349548 | Ann. Statist., Volume 13, Number 2 (1985), 689-705. | Additive Regression and Other Nonparametric Models | Charles J. Stone | 556 |
57 | doi:10.1214/aos/1032181158 | Ann. Statist., Volume 24, Number 6 (1996), 2350-2383. | Heuristics of instability and stabilization in model selection | Leo Breiman | 552 |
58 | doi:10.1214/aos/1176325632 | Ann. Statist., Volume 22, Number 3 (1994), 1346-1370. | Multivariate Locally Weighted Least Squares Regression | D. Ruppert and M. P. Wand | 538 |
59 | doi:10.1214/aos/1176343347 | Ann. Statist., Volume 4, Number 1 (1976), 51-67. | Robust \(M\)-Estimators of Multivariate Location and Scatter | Ricardo Antonio Maronna | 534 |
60 | doi:10.1214/aos/1176349936 | Ann. Statist., Volume 14, Number 2 (1986), 517-532. | Large-Sample Properties of Parameter Estimates for Strongly Dependent Stationary Gaussian Time Series | Robert Fox and Murad S. Taqqu | 523 |
61 | doi:10.1214/aos/1024691079 | Ann. Statist., Volume 26, Number 3 (1998), 801-849. | Arcing classifier (with discussion and a rejoinder by the author) | Leo Breiman | 519 |
62 | doi:10.1214/aos/1176348666 | Ann. Statist., Volume 20, Number 2 (1992), 971-1001. | Asymptotics for Linear Processes | Peter C. B. Phillips and Victor Solo | 513 |
63 | doi:10.1214/aos/1176347393 | Ann. Statist., Volume 17, Number 4 (1989), 1749-1766. | Efficient Parameter Estimation for Self-Similar Processes | Rainer Dahlhaus | 508 |
64 | doi:10.1214/aos/1015957397 | Ann. Statist., Volume 28, Number 5 (2000), 1356-1378. | Asymptotics for lasso-type estimators | Wenjiang Fu and Keith Knight | 506 |
65 | doi:10.1214/aos/1176347115 | Ann. Statist., Volume 17, Number 2 (1989), 453-510. | Linear Smoothers and Additive Models | Andreas Buja, Trevor Hastie, and Robert Tibshirani | 505 |
66 | doi:10.1214/009053607000000758 | Ann. Statist., Volume 36, Number 1 (2008), 199-227. | Regularized estimation of large covariance matrices | Peter J. Bickel and Elizaveta Levina | 500 |
67 | doi:10.1214/009053607000000802 | Ann. Statist., Volume 36, Number 4 (2008), 1509-1533. | One-step sparse estimates in nonconcave penalized likelihood models | Hui Zou and Runze Li | 498 |
68 | doi:10.1214/aos/1176349022 | Ann. Statist., Volume 21, Number 1 (1993), 196-216. | Local Linear Regression Smoothers and Their Minimax Efficiencies | Jianqing Fan | 494 |
69 | doi:10.1214/aos/1176345338 | Ann. Statist., Volume 9, Number 1 (1981), 130-134. | The Bayesian Bootstrap | Donald B. Rubin | 490 |
70 | doi:10.1214/aos/1176345638 | Ann. Statist., Volume 9, Number 6 (1981), 1218-1228. | Bootstrapping Regression Models | D. A. Freedman | 471 |
71 | doi:10.1214/aos/1176324456 | Ann. Statist., Volume 23, Number 1 (1995), 73-102. | Penalized Discriminant Analysis | Trevor Hastie, Andreas Buja, and Robert Tibshirani | 467 |
72 | doi:10.1214/aos/1176348385 | Ann. Statist., Volume 19, Number 4 (1991), 2032-2066. | Why Least Squares and Maximum Entropy? An Axiomatic Approach to Inference for Linear Inverse Problems | Imre Csiszar | 460 |
73 | doi:10.1214/aos/1176346056 | Ann. Statist., Volume 11, Number 1 (1983), 59-67. | Quasi-Likelihood Functions | Peter McCullagh | 456 |
74 | doi:10.1214/aos/1176350933 | Ann. Statist., Volume 16, Number 3 (1988), 927-953. | Theoretical Comparison of Bootstrap Confidence Intervals | Peter Hall | 451 |
75 | doi:10.1214/08-AOS600 | Ann. Statist., Volume 36, Number 6 (2008), 2577-2604. | Covariance regularization by thresholding | Peter J. Bickel and Elizaveta Levina | 435 |
76 | doi:10.1214/009053607000000127 | Ann. Statist., Volume 35, Number 5 (2007), 2173-2192. | On the “degrees of freedom” of the lasso | Hui Zou, Trevor Hastie, and Robert Tibshirani | 426 |
77 | doi:10.1214/009053604000000256 | Ann. Statist., Volume 32, Number 3 (2004), 928-961. | Nonconcave penalized likelihood with a diverging number of parameters | Jianqing Fan and Heng Peng | 423 |
78 | doi:10.1214/aos/1176348653 | Ann. Statist., Volume 20, Number 2 (1992), 712-736. | Exact Mean Integrated Squared Error | J. S. Marron and M. P. Wand | 423 |
79 | doi:10.1214/aos/1176342558 | Ann. Statist., Volume 1, Number 6 (1973), 1071-1095. | On Some Global Measures of the Deviations of Density Function Estimates | P. J. Bickel and M. Rosenblatt | 418 |
80 | doi:10.1214/aos/1176342810 | Ann. Statist., Volume 2, Number 5 (1974), 849-879. | General Equivalence Theory for Optimum Designs (Approximate Theory) | J. Kiefer | 416 |
81 | doi:10.1214/aos/1176345636 | Ann. Statist., Volume 9, Number 6 (1981), 1187-1195. | On the Asymptotic Accuracy of Efron’s Bootstrap | Kesar Singh | 414 |
82 | doi:10.1214/aos/1176350057 | Ann. Statist., Volume 14, Number 3 (1986), 1171-1179. | The Use of Subseries Values for Estimating the Variance of a General Statistic from a Stationary Sequence | Edward Carlstein | 414 |
83 | doi:10.1214/aos/1176342705 | Ann. Statist., Volume 2, Number 3 (1974), 437-453. | A Large Sample Study of the Life Table and Product Limit Estimates Under Random Censorship | N. Breslow and J. Crowley | 412 |
84 | doi:10.1214/aos/1034276620 | Ann. Statist., Volume 25, Number 1 (1997), 1-37. | Fitting time series models to nonstationary processes | R. Dahlhaus | 401 |
85 | doi:10.1214/aos/1176348248 | Ann. Statist., Volume 19, Number 3 (1991), 1257-1272. | On the Optimal Rates of Convergence for Nonparametric Deconvolution Problems | Jianqing Fan | 400 |
86 | doi:10.1214/aos/1176348368 | Ann. Statist., Volume 19, Number 4 (1991), 1725-1747. | Empirical Likelihood for Linear Models | Art Owen | 399 |
87 | doi:10.1214/aos/1176345206 | Ann. Statist., Volume 8, Number 6 (1980), 1348-1360. | Optimal Rates of Convergence for Nonparametric Estimators | Charles J. Stone | 398 |
88 | doi:10.1214/aos/1024691081 | Ann. Statist., Volume 26, Number 3 (1998), 879-921. | Minimax estimation via wavelet shrinkage | David L. Donoho and Iain M. Johnstone | 396 |
89 | doi:10.1214/aos/1176346788 | Ann. Statist., Volume 12, Number 4 (1984), 1215-1230. | Bandwidth Choice for Nonparametric Regression | John Rice | 394 |
90 | doi:10.1214/aos/1176349025 | Ann. Statist., Volume 21, Number 1 (1993), 255-285. | Bootstrap and Wild Bootstrap for High Dimensional Linear Models | Enno Mammen | 393 |
91 | doi:10.1214/aos/1176347397 | Ann. Statist., Volume 17, Number 4 (1989), 1833-1855. | A Moment Estimator for the Index of an Extreme-Value Distribution | A. L. M. Dekkers, J. H. J. Einmahl, and L. De Haan | 388 |
92 | doi:10.1214/009053604000001048 | Ann. Statist., Volume 33, Number 1 (2005), 1-53. | Analysis of variance—why it is more important than ever | Andrew Gelman | 386 |
93 | doi:10.1214/aos/1176325622 | Ann. Statist., Volume 22, Number 3 (1994), 1142-1160. | Posterior Predictive \(p\)-Values | Xiao-Li Meng | 386 |
94 | doi:10.1214/aos/1176345697 | Ann. Statist., Volume 10, Number 1 (1982), 154-166. | Least Squares Estimates in Stochastic Regression Models with Applications to Identification and Control of Dynamic Systems | Tze Leung Lai and Ching Zong Wei | 386 |
95 | doi:10.1214/aos/1016218226 | Ann. Statist., Volume 28, Number 2 (2000), 461-482. | General notions of statistical depth function | Robert Serfling and Yijun Zuo | 383 |
96 | doi:10.1214/aos/1176343842 | Ann. Statist., Volume 5, Number 3 (1977), 445-463. | Minimum Hellinger Distance Estimates for Parametric Models | Rudolf Beran | 375 |
97 | doi:10.1214/aos/1176342752 | Ann. Statist., Volume 2, Number 4 (1974), 615-629. | Prior Distributions on Spaces of Probability Measures | Thomas S. Ferguson | 374 |
98 | doi:10.1214/aos/1176347507 | Ann. Statist., Volume 18, Number 1 (1990), 405-414. | On a Notion of Data Depth Based on Random Simplices | Regina Y. Liu | 374 |
99 | doi:10.1214/009053604000000238 | Ann. Statist., Volume 32, Number 3 (2004), 870-897. | Optimal predictive model selection | Maria Maddalena Barbieri and James O. Berger | 373 |
100 | doi:10.1214/aos/1176349020 | Ann. Statist., Volume 21, Number 1 (1993), 157-178. | Optimal Smoothing in Single-Index Models | Wolfgang Hardle, Peter Hall, and Hidehiko Ichimura | 366 |
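Since the Year column carries the full source string, the by-decade breakdown I wanted in the intro is within reach. A sketch, assuming every source string contains a four-digit year in parentheses (rows that don’t will come out NA):

## Pull the publication year out of strings like
## "Ann. Statist., Volume 6, Number 2 (1978), 461-464."
issuesDT[, pub_year := as.integer(sub(".*\\(([[:digit:]]{4})\\).*", "\\1", Year))]
issuesDT[, decade := pub_year - pub_year %% 10]
## Top 3 most-cited articles in each decade
issuesDT[order(-citations), head(.SD, 3), by = decade,
         .SDcols = c("Article_name", "Authors", "citations")]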
Conclusion
This project took way too much time: several days to reliably download all the required data. But I had some fun, and learned a bit about web scraping. I’ll revisit this work when I get interested in other journals.
Time permitting, it would be good to compile the data from this project into an R package for others to use. I’d first want to validate the data and further clean out some annoying tab and newline characters, but it’s doable.
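The cleanup I have in mind is nothing fancy; a sketch over the character columns of the combined table from above:

## Squeeze stray tabs, newlines, and repeated spaces out of every
## character column
char_cols <- names(issuesDT)[sapply(issuesDT, is.character)]
for (col in char_cols) {
  set(issuesDT, j = col, value = gsub("[[:space:]]+", " ", trimws(issuesDT[[col]])))
}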
Minimally, anyone can read this blog post and figure out how to apply my scraping functions to other Project Euclid journals.
Unfortunately, summarizing by keyword only occurred to me the day after I sucked up the Project Euclid data, so that tabulation will have to wait for another journal.↩
Citations are definitely not the best measure of popularity, but I’m not sure how to get any better metrics…↩
There is also a Google Scholar R package, but I have a feeling that it would be subject to similar or the same issues that I encountered. Better to use a service that actually exposes an API.↩