Home on Random effect
https://randomeffect.net/
Recent content in Home on Random effect
Hugo -- gohugo.io
en-us
Tue, 20 Jul 2021 00:00:00 +0000
Starting a new R experience in Emacs
https://randomeffect.net/post/2021/07/20/starting-a-new-r-experience-in-emacs/
Tue, 20 Jul 2021 00:00:00 +0000https://randomeffect.net/post/2021/07/20/starting-a-new-r-experience-in-emacs/For some time, I’ve been wondering if it’s possible to make an alternative to ESS. While I love ESS, I feel that it could be interesting to try to integrate some of the more modern features that the Emacs ecosystem has to offer with the R programming experience.
I’ve started a small project to do just that. The plan is to create some sort of thin R mode for Emacs that looks more like python.Statisticians at conferences
https://randomeffect.net/post/2021/06/10/statisticians-at-conferences/
Thu, 10 Jun 2021 00:00:00 +0000https://randomeffect.net/post/2021/06/10/statisticians-at-conferences/Today, while doing some review on causal inference, I saw a funny quote, paraphrasing:
It wasn’t too long ago that you could go to any statistics or social science conference, and bump into a statistician that would proclaim “causal inference doesn’t exist” or “there are no causal effects outside of randomized experiments”.
Hey! They are talking about me from two years ago!How to relevel a factor in R
https://randomeffect.net/post/2021/06/04/how-to-relevel-a-factor-in-r/
Fri, 04 Jun 2021 00:00:00 +0000https://randomeffect.net/post/2021/06/04/how-to-relevel-a-factor-in-r/Here is an interesting ‘problem’ I had with R. Suppose we have a factor:
f <- factor(c("a", "a", "b", "b", "b"), levels = c("a", "b"))
f
## [1] a a b b b
## Levels: a b
Suppose I want the levels of f to be reversed, or placed into the order that makes the most sense for my plot.
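Reordering levels and relabelling them are easy to confuse. Here is a minimal base R sketch of the difference; the names `f_relabelled` and `f_reordered` are mine, chosen only for illustration.

```r
f <- factor(c("a", "a", "b", "b", "b"), levels = c("a", "b"))

# Assigning to levels() relabels the underlying integer codes,
# so every "a" becomes "b" and vice versa: the data change.
f_relabelled <- f
levels(f_relabelled) <- rev(levels(f_relabelled))

# Rebuilding the factor with a new `levels` order keeps the data
# and only changes the order in which the levels are stored.
f_reordered <- factor(f, levels = rev(levels(f)))

as.character(f_relabelled)  # values swapped
as.character(f_reordered)   # values preserved
levels(f_reordered)         # level order reversed
```

Only the second version leaves the observations alone, which is what a plot legend or axis ordering usually needs.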
Here is what not to do: do not meddle with the levels of f directly, as this will change the data:
More about correlated predictors
https://randomeffect.net/post/2021/05/30/more-about-correlated-predictors/
Sun, 30 May 2021 00:00:00 +0000https://randomeffect.net/post/2021/05/30/more-about-correlated-predictors/The problem with correlated predictors is that there’s no way to pin-point the unique effect of the individual predictors. Well, that’s a bit of a lie. We can determine partial effects if the model is correctly specified.
Here’s the simulation: Let’s take two correlated predictors, \(x\) and \(z\), and say they have some joint effect on \(y\).
This code block simulates the correlated variates. \(x\) and \(z\) are correlated with a Pearson’s correlation of \(0.The {rms} validate function
https://randomeffect.net/post/2021/05/02/the-rms-validate-function/
Sun, 02 May 2021 00:00:00 +0000https://randomeffect.net/post/2021/05/02/the-rms-validate-function/The post is about the statistics in the rms function validate. I find it very hard to remember what each of these statistics is, and the regression modeling strategies book does not lay these out as nicely as I would like.
I’m going to concentrate on the validation of logistic regression models, since that’s the most relevant to my job.
library(rms)
dat <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
We fit an example model to predict the probability of admittance to a school based on GPA and class rank.
Nomograms
https://randomeffect.net/post/2021/04/22/nomograms/
Thu, 22 Apr 2021 00:00:00 +0000https://randomeffect.net/post/2021/04/22/nomograms/How far can you take the nomogram?
A nomogram is an easy visual description of a fitted model. Nomogram creation is facilitated by the rms package in R. They are popular as a tool to give physicians, so that someone can estimate (say) the probability of hazardous side-effect without plugging numbers into an equation.
library(rms)
Let’s take a previous data set, the college admission data from UCLA. From the data, one can model the probability of college admission using two variates: high school GPA and class rank.
Murders vs. Firearm Prevalence
https://randomeffect.net/post/2021/04/18/deaths-vs-firearm-prevalence/
Sun, 18 Apr 2021 00:00:00 +0000https://randomeffect.net/post/2021/04/18/deaths-vs-firearm-prevalence/I received the following email:
From the World Health Organization, The latest Murder Statistics for the world: Murders per 100,000 citizens per year. Honduras 91.6 (WOW!!) El Salvador 69.2, … [the rest of the numbers I omit], Suriname 4.6, Laos 4.6, Georgia 4.3, Martinique 4.2, And ……….The United States 4.2, ALL (109) of the countries above America, HAVE 100% gun bans. It might be of interest to note that SWITZERLAND is not shown on this list, because it has… NO MURDER OCCURRENCE!Kinesis keyboard recommended layout
https://randomeffect.net/post/2021/03/31/kinesis-keyboard-recommended-layout/
Wed, 31 Mar 2021 00:00:00 +0000https://randomeffect.net/post/2021/03/31/kinesis-keyboard-recommended-layout/The Kinesis advantage keyboard is a wonderful keyboard that I keep coming back to, but it takes some adjustments to make it really work well for me. Luckily, this is very simple: the Kinesis keyboard has programmable firmware.
The default layout is not bad:
You can see that the keys are arranged in columns, instead of rows. You’ve also got some nice ‘thumb clusters’ that help take the strain off of my pinkies.Firth's bias correction as a Bayesian model
https://randomeffect.net/post/2021/03/12/firth-s-bias-reducing-logit-regression-correction/
Fri, 12 Mar 2021 00:00:00 +0000https://randomeffect.net/post/2021/03/12/firth-s-bias-reducing-logit-regression-correction/
Introduction
Visualizing the Prior
Sampling from the Posterior
Metropolis-Hastings Sampler (Random walk version)
Diagnostics
Confirm the MH algorithm agrees with brglm
library(mvtnorm)
library(brglm2)
library(data.table)
library(ggplot2); theme_set(theme_bw(base_size = 14))
Introduction
Firth’s adjustment is a technique in logistic regression that ensures the maximum likelihood estimates always exist.
It’s an unfortunate fact that MLEs for logistic regression frequently don’t exist. This is due to sparsity in the data, which makes certain classes perfectly separable.Bias corrected calibration curve from scratch
https://randomeffect.net/post/2021/03/10/bias-corrected-calibration-curve-from-scratch/
Wed, 10 Mar 2021 00:00:00 +0000https://randomeffect.net/post/2021/03/10/bias-corrected-calibration-curve-from-scratch/library(ggplot2); theme_set(theme_bw(base_size = 14)) library(rms) In the last post, we saw how to fit a bias-corrected calibration curve using the rms package. In this post we see how to do the same thing without loading the rms package. Of course, we need a model to start things off.
set.seed(125)
dat <- lme4::InstEval[sample(nrow(lme4::InstEval), 800), ]
fit <- glm(y > 3 ~ lectage + studage + service + dept, binomial, dat)
In trying to reproduce my own version of calibrate, I am a bit disappointed that the documentation for calibrate does not provide any references.
How to draw a calibration curve for logistic regression
https://randomeffect.net/post/2021/03/08/how-to-draw-a-calibration-curve-for-logistic-regression/
Mon, 08 Mar 2021 00:00:00 +0000https://randomeffect.net/post/2021/03/08/how-to-draw-a-calibration-curve-for-logistic-regression/Calibration curves are a useful little regression diagnostic that provide a nice goodness of fit measure. The basic idea behind the diagnostic is that if we plot our estimated probabilities against the observed binary data, and if the model is a good fit, a loess curve1 on this scatter plot should be close to a diagonal line.
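The idea can be sketched in a few lines of base R. This is only an illustration on simulated data: the data-generating model, the sample size, and the variable names are my assumptions, not the example used in the post.

```r
set.seed(1)  # simulated data, assumed for illustration only
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))

fit <- glm(y ~ x, family = binomial)
p_hat <- fitted(fit)  # estimated probabilities

# Smooth the observed 0/1 outcomes against the estimated probabilities.
cal <- loess(y ~ p_hat)
grid <- seq(min(p_hat), max(p_hat), length.out = 100)
smoothed <- predict(cal, newdata = data.frame(p_hat = grid))

plot(p_hat, y, col = "grey60",
     xlab = "Estimated probability", ylab = "Observed outcome")
lines(grid, smoothed, lwd = 2)  # calibration curve
abline(0, 1, lty = 2)           # reference line: perfect calibration
```

Because the fitted model here matches the simulated truth, the smoothed curve should track the dashed diagonal reasonably well.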
The key advantage of calibration curves is that they show goodness of fit in an absolute sense.ANOVA pie chart
https://randomeffect.net/post/2021/02/05/anova-pie-chart/
Fri, 05 Feb 2021 00:00:00 +0000https://randomeffect.net/post/2021/02/05/anova-pie-chart/library(nlme) library(colorspace) library(ggplot2); theme_set(theme_bw(base_size = 16)) It’s easy to make fun of pie charts. They get a lot of hate, but I don’t think it’s all well deserved. A pie chart can be a superior representation of the data if we want to visualize data that must sum to \(100\%\).
For analysis of variance (ANOVA), a pie chart is a good way of showing the sum of squares (SS) decomposition.The most cited Annals of Statistics articles
https://randomeffect.net/post/2021/01/19/the-most-cited-annals-of-statistics-articles/
Tue, 19 Jan 2021 00:00:00 +0000https://randomeffect.net/post/2021/01/19/the-most-cited-annals-of-statistics-articles/Intro Getting all the articles in AOS Minor data cleaning Getting the citation data Google Scholar (utter failure) Trying again with CrossRef API Tabulate Results The top 100 AOS articles: Conclusion Or, how to bot Project Euclid with R.
Intro Whenever I look at the front page of Project Euclid, they try to be helpful and show me the top articles of the week, but it’s not exactly what I want to see.Does model selection bias p-values?
https://randomeffect.net/post/2021/01/16/does-model-selection-bias-p-values/
Sat, 16 Jan 2021 00:00:00 +0000https://randomeffect.net/post/2021/01/16/does-model-selection-bias-p-values/YES.
Model selection techniques have the potential to wreck inferential statistics.
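A small simulation makes the point concrete. This is a sketch under my own assumptions (pure-noise predictors, 100 observations, 10 candidate variables, backward selection with `step()`), not the analysis from the post.

```r
set.seed(42)
n_sims <- 100  # number of simulated data sets (assumed)
n <- 100       # observations per data set (assumed)
p <- 10        # candidate predictors, all pure noise (assumed)

pvals <- unlist(lapply(seq_len(n_sims), function(s) {
  dat <- as.data.frame(matrix(rnorm(n * p), n, p))
  dat$y <- rnorm(n)  # the response is unrelated to every predictor
  full <- lm(y ~ ., data = dat)
  sel <- step(full, trace = 0)  # backward selection by AIC
  coefs <- summary(sel)$coefficients
  # keep the p-values of whatever predictors survived selection
  coefs[rownames(coefs) != "(Intercept)", "Pr(>|t|)"]
}))

# Share of surviving predictors that look "significant" at the 5% level
mean(pvals < 0.05)
```

Without selection, about 5% of these null p-values would fall below 0.05. Among the predictors that survive the AIC step, the share is far higher, because selection only keeps the variables that already looked promising.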
My friend brought up an interesting practice at her work: Model builders were using step-wise AIC to select a linear model specification. There are dozens of possible variables, and there is little interest in trying to select a reasonable model based on subject matter expertise. After the best model is found using step-wise selection, \(p\)-values are further called in to pare down the model: predictors with large \(p\)-values get tossed out.Ubuntu Desktop 2021
https://randomeffect.net/post/2021/01/16/ubuntu-desktop-2021/
Sat, 16 Jan 2021 00:00:00 +0000https://randomeffect.net/post/2021/01/16/ubuntu-desktop-2021/New year. Similar Desktop.
Well, some minor changes. I’m still sticking with Ubuntu, and I’m still mostly using Emacs, and some ‘helper’ programs, displayed in the taskbar.
All personal work occurs in Emacs, Firefox, Thunderbird, and Gnome terminal.
Emacs is very different in 2021. I switched to GccEmacs a few days ago, which is lightning fast compared to the regular Emacs 26 in the Ubuntu LTS repository. I’m using the excellent Modus themes to keep it looking nice.Plotting a spherical distribution in R
https://randomeffect.net/post/2021/01/05/plotting-a-spherical-distribution-in-r/
Tue, 05 Jan 2021 00:00:00 +0000https://randomeffect.net/post/2021/01/05/plotting-a-spherical-distribution-in-r/Directional data, data collected on the surface of a sphere or circle, is rather common in certain lines of work. There are several approaches to handling this data, but a most satisfying approach is to adopt a directional statistics mindset, which respects the topology of the sphere or circle.
However, visualizing spherical data in R is challenging. R is exceptional for making two-dimensional Cartesian visualizations, but these can distort the geometry of the sphere.Visualizing the log determinant surface
https://randomeffect.net/post/2020/12/19/visualizing-the-log-determinant-surface/
Sat, 19 Dec 2020 00:00:00 +0000https://randomeffect.net/post/2020/12/19/visualizing-the-log-determinant-surface/A small simulation that shows how the volume of the linear model covariance matrix can change under sample size and factor correlations:
library(MASS)
library(data.table)
save_det <- data.table(sample_size = rep(1 : 100 + 5, each = 100),
                       corr = rep(0 : 99 / 100, 100),
                       det = NA)
det1 <- vector("numeric")
for (i in 1 : 100 + 5) {
  for (j in 0 : 99) {
    S <- matrix(c(1, j / 100, j / 100, 1), nrow = 2)
    X <- mvrnorm(i, c(0, 0), S)
    beta <- c(1, 2, 3)
    y <- cbind(1, X) %*% beta + rnorm(i)
    fit <- lm(y ~ X)
    det1 <- c(det1, det(vcov(fit)))
  }
}
save_det[, det := det1]
with(save_det, filled.
Explainable AI again
https://randomeffect.net/post/2020/12/15/explainable-ai-again/
Tue, 15 Dec 2020 00:00:00 +0000https://randomeffect.net/post/2020/12/15/explainable-ai-again/Judging by the multiple and non-overlapping options for explaining a machine learning model, I am happy to report that the explainable AI problem is still kicking.
It will not be resolved. Factors interact, factors are correlated. Factors affect responses non-linearly. How can one claim to boil down these super complicated machines into heuristics that don’t deceive decision makers?
Meanwhile, machine learners still are not using analysis of variance to explain their models.LASSO variable selection performance
https://randomeffect.net/post/2020/12/09/lasso-variable-selection-performance/
Wed, 09 Dec 2020 00:00:00 +0000https://randomeffect.net/post/2020/12/09/lasso-variable-selection-performance/library(MASS) library(glmnet) library(ggplot2); theme_set(theme_bw(base_size = 15)) After seeing Frank Harrell’s discussion of the LASSO as a variable selection procedure on YouTube, I started to get curious about just how bad it might be. Link. The LASSO bit starts at around 24 minutes.
In Frank’s video, he shows a simulation that purports to show that LASSO is unsuitable for variable selection because it is too noisy.
Here was the basic idea.Arcane Emacs
https://randomeffect.net/post/2020/12/08/arcane-emacs/
Tue, 08 Dec 2020 00:00:00 +0000https://randomeffect.net/post/2020/12/08/arcane-emacs/Fiddling around with my Emacs configuration tonight, I discovered that the way I specify my default frame font can (paradoxically) have a big impact on my Emacs startup time.
I was picking my font using:
(set-face-attribute 'default nil :family "Fira Code" :height 120)
Which I kinda thought was prohibitively slow. After a little bit of deep Googling, I found another way, namely
(add-to-list 'default-frame-alist '(font . "Fira Code-12"))
This shaves about 0.
Variable Importance - Linear Regression
https://randomeffect.net/post/2020/11/01/variable-importance-linear-regression/
Sun, 01 Nov 2020 00:00:00 +0000https://randomeffect.net/post/2020/11/01/variable-importance-linear-regression/Sometimes, I hear people say that to determine the most important variables in a regression model, you have to standardize all the predictors. For example here.
The advice is good, because model coefficients alone are calculated on predictors of different scales, but also misguided, because the advice seems to imply that one should use the magnitude of the model coefficients to determine variable importance.
But standardizing variables is unnecessary. It’s probably better to use the \(t\) scores from the model to figure out what matters.Multiplicity problems in vaccine hunting
https://randomeffect.net/post/2020/10/27/multiplicity-problems-in-vaccine-hunting/
Tue, 27 Oct 2020 00:00:00 +0000https://randomeffect.net/post/2020/10/27/multiplicity-problems-in-vaccine-hunting/library(ggplot2); theme_set(theme_bw(base_size = 15)) library(data.table) One thing that is statistically concerning about the current COVID situation is the multiple testing problem around vaccine trials. The New York Times is currently tracking the status of vaccines in the US and elsewhere. As of October 27, 34 are in phase 1, 13 in phase 2, and 11 are in phase 3. If we suppose that all the phase 1 and 2 vaccine candidates are safe, then there are 58 possibly effective treatments.The correct interpretation of a confidence interval
https://randomeffect.net/post/2020/10/20/the-correct-interpretation-of-a-confidence-interval/
Tue, 20 Oct 2020 00:00:00 +0000https://randomeffect.net/post/2020/10/20/the-correct-interpretation-of-a-confidence-interval/Thanks Google! Your results are getting better every day.
Visualizing Contrast Coding
https://randomeffect.net/post/2020/10/18/visualizing-treatment-coding/
Sun, 18 Oct 2020 00:00:00 +0000https://randomeffect.net/post/2020/10/18/visualizing-treatment-coding/library(ggplot2); theme_set(theme_bw(base_size = 15)) library(data.table) First, some data
dat <- data.table(
  x = gl(2, 10, 20),
  z = gl(2, 5, 20))
X <- model.matrix( ~ . * ., data = dat)
beta <- c(1, 2, 3, 1)
dat$y <- rnorm(20, X %*% beta, 1)
Plot the data:
## calculate means by each group
means <- dat[, .(m = mean(y)), by = .(x, z)]
## plot means with raw data
p <- ggplot(dat, aes(x = x, y = y, color = z, group = interaction(x, z))) +
  geom_point() +
  geom_point(data = means, aes(y = m), fill = 'black', col = "black", size = 4, shape = 21)
p
When we fit an ANOVA model, what do the coefficients mean?
Sample Size for interaction effects under different codings
https://randomeffect.net/post/2020/10/17/sample-size-for-interaction-effects-under-different-codings/
Sat, 17 Oct 2020 00:00:00 +0000https://randomeffect.net/post/2020/10/17/sample-size-for-interaction-effects-under-different-codings/Following another discussion of the 16x sample size rule for half-sized interactions on Andrew Gelman’s blog, I commented that I do not think the rule holds up under different coding schemes, but that I’d not done a simulation. In particular, I was concerned about the rule under the DOE ‘-1, +1’ coding scheme compared to the usual treatment coding scheme.
It’s actually an important distinction to make, because it’s easy to design an experiment with one coding in mind, and then analyze it under a different one.Quasibinomial model in R glm()
https://randomeffect.net/post/2020/10/12/quasi-binomial-in-r-glm/
Mon, 12 Oct 2020 00:00:00 +0000https://randomeffect.net/post/2020/10/12/quasi-binomial-in-r-glm/We talk a lot about Bayes on the blog, because Bayes is internally coherent. But there’s also a fair amount of coherence on the frequentist side as well. The coherence in frequentist stats generally comes from the theory of estimating equations. In this post, we will start with one of the grandparents of estimating equations, the quasi-binomial model. Quasi models are a beautiful class of models that don’t assume any likelihood (!Randomizing run order in a designed experiment
https://randomeffect.net/post/2020/10/10/when-to-randomize-the-doe/
Sat, 10 Oct 2020 00:00:00 +0000https://randomeffect.net/post/2020/10/10/when-to-randomize-the-doe/Introduction Randomization of an experimental design (DOE) is different from randomization in a clinical trial. In a DOE, the design of the \(n\) trials is known in advance, but in the clinical trial, it is not.
DOE randomization refers to randomizing the order of runs from the fixed design.
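As a minimal sketch of what that means in R: the design below is a hypothetical 2 x 2 x 2 full factorial, and the factor names `temp`, `pressure`, and `catalyst` are placeholders of mine.

```r
set.seed(2020)

# The design points are fixed in advance (hypothetical 2^3 factorial).
design <- expand.grid(temp = c(-1, 1),
                      pressure = c(-1, 1),
                      catalyst = c(-1, 1))

# Only the order in which the runs are performed is randomized.
design$run_order <- sample(nrow(design))

# Print the runs in the order they would be carried out.
design[order(design$run_order), ]
```

Every design point still appears exactly once; the random number generator only decides when each run happens.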
Clinical trial randomization means that as individuals present themselves to the trial, they are assigned to a treatment group depending on the result of a random number generator.Overlapping confidence intervals doesn't mean non-significant difference!!!
https://randomeffect.net/post/2020/10/09/overlapping-confidence-intervals-doesn-t-mean-non-significant-difference/
Fri, 09 Oct 2020 00:00:00 +0000https://randomeffect.net/post/2020/10/09/overlapping-confidence-intervals-doesn-t-mean-non-significant-difference/I am tired of observing a common statistical mistake. The mistake is simple: It’s the idea that the plot of confidence intervals on two groups can be used in place of a \(t\)-test.
To demonstrate that this practice is incorrect, we find a small counter example in R.
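A counter example can also be built by hand rather than searched for. The sketch below is my own construction, not the brute-force search: two samples of size 10 with standard deviation 1 whose means differ by exactly 1.

```r
x <- as.numeric(scale(1:10))  # n = 10, mean 0, sd 1
y <- x + 1                    # same spread, mean shifted by 1

# 95% confidence interval for a single group mean
ci <- function(v) {
  mean(v) + c(-1, 1) * qt(0.975, length(v) - 1) * sd(v) / sqrt(length(v))
}
ci_x <- ci(x)
ci_y <- ci(y)

overlap <- ci_x[2] > ci_y[1]     # do the two intervals overlap?
p_value <- t.test(x, y)$p.value  # Welch two-sample t-test

overlap
p_value
```

The intervals overlap, yet the \(t\)-test rejects at the 5% level. The reason is that the standard error of the difference is smaller than the sum of the two individual standard errors, so eyeballing two separate intervals is too conservative.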
We can find a data set that gives a barely significant \(p\)-value, even though the error bars on the two groups overlap. I accomplish this by doing a brute-force search for two samples of normal data, both of size \(10\), such that the \(p\)-value of the \(t\)-test is just barely below \(0.
Controversies in Machine Learning
https://randomeffect.net/post/2020/10/06/controversies-in-statistics/
Tue, 06 Oct 2020 00:00:00 +0000https://randomeffect.net/post/2020/10/06/controversies-in-statistics/Frank Harrell has a new keynote talk on Youtube. Watch it here.
While the name of the talk is controversies in predictive modeling and machine learning, I would say it is also Frank Harrell’s philosophy of statistics in a nutshell. Here are my notes on the talk, which are also sprinkled throughout his RMS book.
External Validation is Overrated: Data splitting (using training and testing sets) is bad. Training and testing is not external validation.Big 'p'
https://randomeffect.net/post/2020/10/05/big-p/
Mon, 05 Oct 2020 00:00:00 +0000https://randomeffect.net/post/2020/10/05/big-p/When the number of dimensions is big, everything is far away.
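A quick base R check of that claim, as a sketch: the dimensions tried and the sample size of 1000 are my choices, and the draws are independent standard normals.

```r
set.seed(15)
dims <- c(2, 10, 100, 1000)  # dimensions to compare (assumed)

# Average Euclidean distance from the origin for 1000 draws in each dimension
mean_dist <- sapply(dims, function(p) {
  dat <- matrix(rnorm(1000 * p), ncol = p)
  mean(sqrt(rowSums(dat^2)))
})

round(cbind(p = dims, mean_dist = mean_dist, sqrt_p = sqrt(dims)), 2)
```

The average distance grows roughly like \(\sqrt{p}\), so in high dimensions almost no points sit near the origin.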
Let’s start with two dimensions. Just standard normal distribution.
library(mvtnorm)
set.seed(15)
dat <- rmvnorm(1000, sigma = diag(2))
We can calculate the distance from the points to the origin:
inter <- dat ^ 2
d <- sqrt(apply(inter, 1, sum))
How far away are the points from the origin?
library(ggplot2); theme_set(theme_bw(base_size = 15))
ggplot(data.frame(d = d), aes(x = d)) +
  ggtitle("Distribution of Distances", "N(0, diag(2))") +
  geom_histogram(col = "black", fill = 'royalblue2')
Lots of data are close to 0.
Comparison: Roll-ups vs. Post-Hoc Linear Contrast
https://randomeffect.net/post/2020/10/01/comparison-roll-ups-vs-post-hoc-linear-contrast/
Thu, 01 Oct 2020 00:00:00 +0000https://randomeffect.net/post/2020/10/01/comparison-roll-ups-vs-post-hoc-linear-contrast/The nice thing about a designed experiment is that the data are always conditioned on the factors. Variance resides within conditions and between conditions. In a typical ANOVA model, the residual variance is an estimate of the ‘within condition’ variance, and the fixed effects explain the between condition variability. Fancier models extend this basic idea to more complex situations.
In this note I’ll show that consideration of the ‘within condition’ variance is beneficial, even if the goal is to do inference on the overall mean \(E(y)\).
python version of Xian's code
https://randomeffect.net/post/2020/09/09/python-version-of-xian-s-code/
Wed, 09 Sep 2020 00:00:00 +0000https://randomeffect.net/post/2020/09/09/python-version-of-xian-s-code/Sort of inspired to find a faster solution to a Riddler by Xian, I rewrote some of his code in Python and got a ~10-15x speed increase. Read Xian’s post for a link to the original Riddle.
Load Libs
library(reticulate)
library(tictoc)
R Solution
tic()
simz=t(apply(matrix(runif(3*1e5),ncol=3),1,sort))
mean((simz[,1]>.5)*simz[,1]+
     (simz[,1]<.5)*(simz[,2]>.5)*(simz[,2]-simz[,1])+
     (simz[,2]<.5)*(simz[,3]>.5)*(simz[,3]-simz[,2])+
     (simz[,3]<.5)*(1-simz[,3]))
## [1] 0.4685494
toc()
## 4.123 sec elapsed
Python solution
import numpy as np
import time
start = time.
What's the model?
https://randomeffect.net/post/2020/09/01/what-s-the-model/
Tue, 01 Sep 2020 00:00:00 +0000https://randomeffect.net/post/2020/09/01/what-s-the-model/This post is a short R demonstration on how to make fake data from three statistical models. All models are different in that each model is usually taught or encountered in a different context. But they are all the same because each model involves the generation of a “latent” normal random variable.
I am struck by the ubiquity of the latent random normal variable. It is useful for
Time series data (Model 1)
Distance correlated data (Model 2)
clustered data (Model 3)
library(MASS)
library(ggplot2)
theme_set(theme_bw(base_size = 16))
Model 1
Random draw from a multi-normal with a time-series like correlation matrix:
AUC-UQ
https://randomeffect.net/post/2020/08/27/auc-uq/
Thu, 27 Aug 2020 00:00:00 +0000https://randomeffect.net/post/2020/08/27/auc-uq/Here’s an interesting problem: Suppose you have a curve that represents the probability of an event, conditioned on some \(x\) values. It could be any curve, but we hope it is reasonably well-behaved. The \(x\) values represent an offset, some distance from a center-point, so \(x=-1\) is functionally similar to \(x=+1\).
Now, we’d like to define a typical set: a range of \(x\) values with some characteristic probability, which is the average probability on the typical set.Fix Rmarkdown fontification issues in Emacs
https://randomeffect.net/post/2020/08/22/buggy-fontification-in-polymode/
Sat, 22 Aug 2020 00:00:00 +0000https://randomeffect.net/post/2020/08/22/buggy-fontification-in-polymode/Using markdown-mode with Polymode in Emacs is a bit of a challenge for statisticians. But it’s still probably the best way to work with .Rmd files. I’ll describe the issues and my solution in this post.
Polymode, the software that lets Emacs have multiple major modes in a single buffer, generally works well and provides a lot of premade polymodes for working with different types of code. The most important premade polymode is the one for .Principle of Marginality, scattered thoughts.
https://randomeffect.net/post/2020/08/11/principle-of-marginality/
Tue, 11 Aug 2020 00:00:00 +0000https://randomeffect.net/post/2020/08/11/principle-of-marginality/The principle of marginality (PM or MP) is a statistical principle (not a mathematical principle) that guides the way researchers should interpret linear models. (some PDF notes)
The principle
If a factor effect is marginal to another effect in the model, then we neither test nor interpret that factor effect. Further, if interactions are included in the regression model, then all lower order main effects should be included in the model as well.
Statistical analysis of website load times
https://randomeffect.net/post/2020/08/09/my-slow-website/
Sun, 09 Aug 2020 00:00:00 +0000https://randomeffect.net/post/2020/08/09/my-slow-website/How this site is hosted Possible Culprits A profile in Chrome How do JS elements affect load time? How much data? Experimental Data Inspect the data Model Fit Testing A better model? Conclusions Can you help me? Despite the spartan looks of this website, it is actually a low-key Javascript monster. I’m not really sure how this happened, but I don’t know a lot about websites, and I just kind of trust that if I don’t do too much fiddling then the website will be fast.Is Explainable AI anything at all?
https://randomeffect.net/post/2020/08/07/is-explainable-ai-anything-at-all/
Fri, 07 Aug 2020 00:00:00 +0000https://randomeffect.net/post/2020/08/07/is-explainable-ai-anything-at-all/Machine learning people will often tell you that there is a trade-off between explainability and predictive performance, so we have no choice but to use highly predictive models that we cannot explain. Maybe this explains (ha) the new field in machine learning called “explainable AI” (XAI) that seeks to shed some light on the inner workings of the ‘black box’ so that we can understand why the models predict one way or another.SEM
https://randomeffect.net/post/2020/08/07/sem/
Fri, 07 Aug 2020 00:00:00 +0000https://randomeffect.net/post/2020/08/07/sem/While reading a bit of Yihui’s blog I found something that I’ve been hoping to hear from another statistician for a long time …
Typically I ignore any questions on Structural Equation Modeling (SEM) or factor analysis, since I’m not convinced of their usefulness at all. I know little about time series and do not like econometrics. I have little interest in quantitative research in social sciences.
SEM and factor analysis lead to very few helpful inferences.Free R package idea!!
https://randomeffect.net/post/2020/08/06/free-r-package-idea/
Thu, 06 Aug 2020 00:00:00 +0000https://randomeffect.net/post/2020/08/06/free-r-package-idea/JSM was this week, but I’m sad to report that I didn’t get to watch as many talks as I would have liked. The conference was only a so-so experience for me, for a couple of reasons:
Loads of technical problems. All of the talks that I attended had some kind of technical problem, either audio, video, or both. This usually resulted in thumb twiddling while waiting for tech support to sort things out.Moving from General to Hydra
https://randomeffect.net/post/2020/07/22/general-to-hydra/
Wed, 22 Jul 2020 00:00:00 +0000https://randomeffect.net/post/2020/07/22/general-to-hydra/I had been happy to use General to define all my Emacs keybindings. General is especially nice if you are an Evil user, and it takes much of the pain out of the usual Emacs way of defining keymaps. I use General to make several custom keymaps: I have one big keymap that is bound to SPC giving me a mini-Spacemacs setup, and several mode specific keymaps that I bind to ,, no matter the mode.Work from home
https://randomeffect.net/post/2020/07/21/work-from-home/
Tue, 21 Jul 2020 00:00:00 +0000https://randomeffect.net/post/2020/07/21/work-from-home/My work from home setup is… okay: some parts are good and others are lacking considerably. I have made a few upgrades (and downgrades!) in the last few months.
Desk
It’s a cheap desk that I bought at Aldi, of all places. It’s fairly spacious and has good drawers. The only problem is that it’s too high. I miss my keyboard tray.
Chair
Some folding chair that my wife purchased.
Counterfactuals aren't allowed in history class, why statistics?
https://randomeffect.net/post/2020/07/16/counterfactuals-aren-t-allowed-in-history-class-why-statistics/
Thu, 16 Jul 2020 00:00:00 +0000https://randomeffect.net/post/2020/07/16/counterfactuals-aren-t-allowed-in-history-class-why-statistics/A counterfactual is a what-if statement. Like “what if 9/11 had not happened?”, or “What if Trump had not won the election?” I read that historians don’t entertain counterfactuals, partially because they hardly know what actually did happen.
Why would a statistician use a counter-factual argument?
It turns out that we use them all the time: a \(p\)-value is a counterfactual – It is a statement about what could have happened, as opposed to what actually happened.Do you need 16x the sample size to estimate a half-sized interaction effect?
https://randomeffect.net/post/2020/07/15/you-need-16x-the-sample-size-to-estimate-a-half-size-interaction/
Wed, 15 Jul 2020 00:00:00 +0000https://randomeffect.net/post/2020/07/15/you-need-16x-the-sample-size-to-estimate-a-half-size-interaction/Today, another good number to keep in your back pocket. 16. That’s how much more data you need to estimate an interaction compared to an average treatment effect… assuming the interaction is half the size of the main-effect and you want the same power as before. And you are fitting an OLS regression model.
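The arithmetic behind the 16 can be checked directly. This sketch assumes a balanced 2 x 2 layout with equal cell sizes, a common residual standard deviation, and an interaction defined as a difference of two differences; the function names are mine.

```r
# SE of a difference in means: N units split evenly into two groups
se_main <- function(N, sigma = 1) sigma * sqrt(1 / (N / 2) + 1 / (N / 2))

# SE of a difference of two differences: four cells of size N / 4 each
se_interaction <- function(N, sigma = 1) sigma * sqrt(4 / (N / 4))

N <- 100
se_ratio <- se_interaction(N) / se_main(N)  # interaction SE vs main-effect SE

# Signal-to-noise for a main effect of size 1 at N, and for a
# half-sized interaction (0.5) at 16 * N
z_main <- 1 / se_main(N)
z_interaction <- 0.5 / se_interaction(16 * N)

c(se_ratio = se_ratio, z_main = z_main, z_interaction = z_interaction)
```

At the same sample size, the interaction's standard error is twice the main effect's, which costs a factor of 4 in data. Halving the effect size costs another factor of 4, giving 16.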
You need 4x the data to estimate an interaction that is the same magnitude as the main effect.The effect of imbalance on power in the two-sample proportions test
https://randomeffect.net/post/2020/07/12/the-effect-of-imbalance-on-power-in-the-two-sample-proportions-test/
Sun, 12 Jul 2020 00:00:00 +0000https://randomeffect.net/post/2020/07/12/the-effect-of-imbalance-on-power-in-the-two-sample-proportions-test/Generally, data imbalance is thought to be bad in experiments, but this is not always the case. Suppose you want to study the effect of a drug on \(Pr(Survival)\). In a trial where the cost of enrolling a subject for the treatment or control is identical, then the optimal experiment (from the statistician’s perspective) is to roughly balance the treatment and control groups.
This is the assumption power.prop.test makes in R when it calculates the power of a two-sample proportions test.Lines of code in my Emacs packages
https://randomeffect.net/post/2020/07/11/the-weight-of-my-emacs-packages/
Sat, 11 Jul 2020 00:00:00 +0000https://randomeffect.net/post/2020/07/11/the-weight-of-my-emacs-packages/I was curious about how many lines of code comprised the individual Emacs packages that I use, and decided to write an R script that will count the lines of code and make a little chart:
List all the packages in your ELPA directory:
library(utils)
p <- "~/.emacs.d/elpa/"
ps <- list.files(p, full.names = TRUE)
pshort <- list.files(p)
For simplicity, I just count the lines of code in the “.el” files.
Variance Explained
https://randomeffect.net/post/2020/06/30/variance-explained/
Tue, 30 Jun 2020 00:00:00 +0000https://randomeffect.net/post/2020/06/30/variance-explained/‘Variance explained’. It’s so simple, it’s seductive for researchers and decision makers. You fit a model and compute the \(R^{2}\) from the model – this is the amount of variation explained in the response by the statistical model. With \(R^{2}\) in hand, you get to make wonderfully simple statements like ‘90% of the variation in Y is due to X’ according to my model.
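For reference, the quantity itself is easy to compute by hand. The sketch below uses simulated data of my own choosing and compares the hand calculation with what `summary()` reports for a linear model.

```r
set.seed(3)
x <- rnorm(50)
y <- 2 * x + rnorm(50)  # simulated response, assumed for illustration

fit <- lm(y ~ x)

rss <- sum(resid(fit)^2)     # variation left unexplained by the model
tss <- sum((y - mean(y))^2)  # total variation in the response
r2 <- 1 - rss / tss

all.equal(r2, summary(fit)$r.squared)
```

So \(R^{2}\) is just one minus the ratio of residual to total sums of squares, which is worth keeping in mind when it gets read as a statement about causes.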
I’ve got two issues with \(R^2\).The goal of statistics
https://randomeffect.net/post/2020/06/27/the-goal-of-statistics/
Sat, 27 Jun 2020 00:00:00 +0000https://randomeffect.net/post/2020/06/27/the-goal-of-statistics/The goal of statistics is to summarize without misleading
That’s one view. I think as long as statisticians keep their eyes on the ball – focus estimation on the relevant quantities that directly link to research questions – we are doing the right thing. The specific philosophy doesn’t matter. What’s more important than the method is the data going into it, the assumptions we make about the data, and how seriously we take those assumptions.Multiple versions of statistical power
https://randomeffect.net/post/2020/06/23/multiple-versions-of-statistical-power/
Tue, 23 Jun 2020 00:00:00 +0000https://randomeffect.net/post/2020/06/23/multiple-versions-of-statistical-power/Bayesian Power Power is an idea from frequency statistics that is used by some to judge the quality of an experimental design. I don’t think power is (or should be) a universal design requirement, but it seems to get into just about any design – even experiments that do not intend to analyze data using ANOVA, or something like that. Briefly, for a simple test of \(H_0\) vs \(H_1\),Facebook's bogus machine learning interview questions
https://randomeffect.net/post/2020/06/14/the-difference-between-statistics-and-ml-according-to-facebook/
Sun, 14 Jun 2020 00:00:00 +0000https://randomeffect.net/post/2020/06/14/the-difference-between-statistics-and-ml-according-to-facebook/Since my wife recently interviewed for a job at Facebook, I’m privy to a slew of FB interview questions. They range from statistics and machine learning to algorithms, but some of the questions were particularly annoying from a statistical perspective.
I’ll share the more extreme examples, but most of the questions were actually pretty modest.
How do you deal with outliers? Outliers do not exist. An outlier is a false dichotomy in the data, and if the word outlier enters your thoughts, it just means that your data are overdispersed and that you have to change your model.mgcv: two options for smoothing splines over grouped data
https://randomeffect.net/post/2020/06/14/mgcv-two-options-for-smoothing-splines-over-grouped-data/
Sun, 14 Jun 2020 00:00:00 +0000https://randomeffect.net/post/2020/06/14/mgcv-two-options-for-smoothing-splines-over-grouped-data/Simulate some fake data
dat <- gamSim(4, n = 400, dist = "normal", scale = 1)
## Factor `by' variable example
The simulation creates a dataframe with a single response and 3 numeric covariates. The effect of each covariate depends on the value of a factor, fac.
We can take a look at the data by plotting the response against each of the covariates:
ggplot(dat, aes(x = f1, y = y)) + geom_point(size = 1.My two talks on validation and evidence
https://randomeffect.net/post/2020/06/14/my-two-talks-on-validation-and-evidence/
Sun, 14 Jun 2020 00:00:00 +0000https://randomeffect.net/post/2020/06/14/my-two-talks-on-validation-and-evidence/Last month I gave a talk about computer model validation. The point of the talk was to detail a couple of trends I observed in the validation space. The trends are:
Computer model validation methodologies have been held back by attempts to frame model validation projects as hypothesis testing problems. Validation research can better serve sponsor needs by providing estimates and uncertainties about important parameters.
Because computer model validation strategies range from completely visual to null hypothesis significance tests, and because there are so many different and interesting ways that models can differ from reality, we have trouble comparing validation techniques, even empirically.Bayesian methods can violate the likelihood principle
https://randomeffect.net/post/2020/05/30/bayesian-methods-can-violate-the-likelihood-principle/
Sat, 30 May 2020 00:00:00 +0000https://randomeffect.net/post/2020/05/30/bayesian-methods-can-violate-the-likelihood-principle/The likelihood principle (LP) is a statistical principle that can be stated, roughly, as follows:
All the information/evidence regarding a model parameter is contained in the likelihood function. Or equivalently:
If two likelihoods are proportional, then the information/evidence concerning the parameters should be the same. These are not the most precise definitions, but I’m trying to convey the spirit of LP instead of the most accurate definition, which may suffer from some interpretability issues.Your prior is too informative?!
https://randomeffect.net/post/2020/05/27/your-prior-is-too-informative/
Wed, 27 May 2020 00:00:00 +0000https://randomeffect.net/post/2020/05/27/your-prior-is-too-informative/(Disclaimer: I perform a lot of frequency calculations in the post as approximations of Bayesian calculations.)
Suppose you have historical data that indicates a coin is biased. The data consists of \(100\) trials. From the data, we gathered that \(P_{Head} = 0.95\). We can use a normal approximation to justify that the Bernoulli parameter is somewhere between 0.906 and 0.994.
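The arithmetic behind those endpoints can be checked with a short sketch. The variable names are illustrative assumptions; the multiplier 2 is the usual rough stand-in for the 97.5% normal quantile.

```r
p_hat <- 0.95   # observed proportion of heads
n     <- 100    # number of trials

se <- sqrt(p_hat * (1 - p_hat) / n)                    # normal-approximation standard error
interval <- c(lower = p_hat - 2 * se, upper = p_hat + 2 * se)
round(interval, 3)
```

Rounded to three decimals, this reproduces the interval from 0.906 to 0.994.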
That’s a confidence interval of the form
\[ (l,u) = \left(\hat{p} - 2 \sqrt{\frac{\hat{p}(1-\hat{p})}{100}},\hat{p} + 2 \sqrt{\frac{\hat{p}(1-\hat{p})}{100}}\right). \]Conditionality is Bogus?
https://randomeffect.net/post/2020/05/23/conditionality-is-bogus/
Sat, 23 May 2020 00:00:00 +0000https://randomeffect.net/post/2020/05/23/conditionality-is-bogus/Larry Wasserman has a nice post on his blog about the rationality of statistical principles, in particular, the conditionality principle. Link. I know it’s old, but I just found it yesterday!
Statistics doesn’t have axioms like math; we have principles that guide data analysis. Different principles compete for mind share, but the likelihood principle is a big one that Bayesians like to cling to. The likelihood principle is rather important in the philosophy of statistics because it follows logically from two simpler and seemingly acceptable principles, sufficiency and conditionality (Birnbaum, 1962), and LP itself seems to be a good argument against the use of some frequentist techniques.Continuous Variable Time Series
https://randomeffect.net/post/2020/05/21/time-series/
Thu, 21 May 2020 00:00:00 +0000https://randomeffect.net/post/2020/05/21/time-series/This came up in a conversation today. What’s the right way to handle time series data when the index variable is continuous?
I’m only aware of the Gaussian process model, general regression models with the CAR correlation function for residuals, and some of the usual moving-average models that you can find in any old time series book.
Gaussian processes are obviously the most popular (right now), due in part to the publicity that they get on Gelman’s blog, and in BDA3.Disagreements
https://randomeffect.net/post/2020/05/21/disagreements/
Thu, 21 May 2020 00:00:00 +0000https://randomeffect.net/post/2020/05/21/disagreements/The existence of persistent disagreements is a puzzle from a Bayesian perspective. If there is only one reality, and everyone can talk to each other, then we should converge on the right answer.
That persistent disagreements exist is (perhaps) an argument against the Bayesian perspective. At least against the philosophy; I don’t know whether it has much bearing on statistical practice.
On the other hand, Bayesians have tools that let others take the evidence contained in the data, then properly and simply apply the user’s own prior distributions.Measurement trumps analysis
https://randomeffect.net/post/2020/05/20/measurement-trump-analysis/
Wed, 20 May 2020 00:00:00 +0000https://randomeffect.net/post/2020/05/20/measurement-trump-analysis/Along with good design, precise measurement beats analysis.
Statisticians can do a lot of good by convincing researchers to measure the extremity of an event, rather than its occurrence. This is hard: it’s much easier to measure that a component failed than the time of failure. It is easier to say a block is broken than to (possibly subjectively) measure its degree of broken-ness.
How do we enable researchers and testers to make the best measurements?We don't like data
https://randomeffect.net/post/2020/05/20/we-don-t-like-data/
Wed, 20 May 2020 00:00:00 +0000https://randomeffect.net/post/2020/05/20/we-don-t-like-data/Here is a model I’d like to fit: What proportion of statisticians actually like data? What does your prior distribution for this proportion look like?
Some statisticians like the idea of data, the theory of data, stuff like that. Real data means data cleaning, which is kind of fun, in limited doses. But I do not think some (many?, most?) statisticians like working on real data problems. They are just too messy.Not Identifiable
https://randomeffect.net/post/2020/05/18/not-identifiable/
Mon, 18 May 2020 00:00:00 +0000https://randomeffect.net/post/2020/05/18/not-identifiable/A non-identifiable model is one that has parameters that cannot be disentangled (i.e., estimated distinctly) given the likelihood alone. If we want to fit such a model, something extra is needed. That ‘extra’ might be a constraint (hard) or a prior distribution (soft).
Here’s a simple example of non-identifiability:
\[ y_1, ..., y_n \sim Normal(a+b, \sigma^2) \]
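A quick numerical sketch of this model shows the problem: the log-likelihood depends on \(a\) and \(b\) only through their sum, so very different pairs with the same sum fit equally well. The simulated data, the fixed \(\sigma = 1\), and the helper `loglik` are assumptions made for illustration.

```r
# Simulated data with a + b = 3 and sigma = 1 (made-up values).
set.seed(1)
y <- rnorm(50, mean = 3, sd = 1)

# Log-likelihood of the model y_i ~ Normal(a + b, sigma^2).
loglik <- function(a, b, sigma = 1) {
  sum(dnorm(y, mean = a + b, sd = sigma, log = TRUE))
}

# Three different (a, b) pairs that share the same sum.
ll1 <- loglik(1, 2)
ll2 <- loglik(3, 0)
ll3 <- loglik(-10, 13)
```

All three values coincide, so the data alone cannot distinguish between these pairs.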
From the likelihood alone, the parameters \(a\) and \(b\) are not estimable, unless we introduce something extra (prior, constraints, etc.Against bucketing variables
https://randomeffect.net/post/2020/05/10/against-bucketing-variables/
Sun, 10 May 2020 00:00:00 +0000https://randomeffect.net/post/2020/05/10/against-bucketing-variables/If you want to know how something, say the probability of a failure, varies with respect to another continuous variable, like time, don’t cut the variable into several little buckets and calculate the mean in each bucket. You’ll end up with a discontinuous function. And because the probabilities will depend on how you make the buckets, you’ll start to worry about the best way to make the buckets, a very bias-variance trade-off type problem.The minimum sample size for logistic regression
https://randomeffect.net/post/2020/04/19/the-minimum-sample-size-for-logistic-regression/
Sun, 19 Apr 2020 00:00:00 +0000https://randomeffect.net/post/2020/04/19/the-minimum-sample-size-for-logistic-regression/I saw something interesting on Frank Harrell’s Datamethods forum. A user wanted to know what to do with her ‘big p, small n’ data. She was analyzing clinical trial data from a trial that had been cut short. The outcome was a binary variable (something like whether the medicine successfully reduced inflammation), but she’d only collected \(20\) observations, and had a handful of predictors.
This is exactly the situation I don’t want to be in.Bad Questions
https://randomeffect.net/post/2020/04/16/bad-questions/
Thu, 16 Apr 2020 00:00:00 +0000https://randomeffect.net/post/2020/04/16/bad-questions/There are many bad statistics questions on the internet, and about half as many poor answers to those questions. In education, we like to say that there are no bad questions.
Maybe there are no bad questions, but there is just a plethora of lazy questions.
I spend a fair amount of time answering a lot of good questions at work (and at home!) so when I see a lazy question on the internet, I get very annoyed.Credit to Tidyverse
https://randomeffect.net/post/2020/04/16/credit-to-tidyverse/
Thu, 16 Apr 2020 00:00:00 +0000https://randomeffect.net/post/2020/04/16/credit-to-tidyverse/I do not generally agree with the workflow that the tidyverse insists users follow, but I’m starting to come around. I find it very hard to maintain consistency in data cleaning, because I often switch back and forth between tidy and base R coding in the same script. I’m guessing that a lot of R users who learned base R years before tidyverse became popular do the same thing.Blogdown w/ Polymode
https://randomeffect.net/post/2020/04/13/blogdown/
Mon, 13 Apr 2020 00:00:00 +0000https://randomeffect.net/post/2020/04/13/blogdown/Polymode is a nifty little Emacs “mode” that lets you run multiple major modes in the same buffer. This is one heck of a hack because Emacs typically requires one major mode per buffer.
In particular, you can have a buffer with markdown-mode and ESS running simultaneously, which is exactly what we need to edit Rmarkdown files. When the point is in an R code fence, ESS mode is active, otherwise markdown-mode is active.Boring
https://randomeffect.net/post/2020/04/12/boring/
Sun, 12 Apr 2020 00:00:00 +0000https://randomeffect.net/post/2020/04/12/boring/Bayesian analysis is “boring” because all analyses have the same workflow: given prior and likelihood, calculate posterior. Then the loss function determines how to summarize the posterior. Maybe perform some posterior predictive checks.
Machine learning is “boring” because, given the data, you just partition into training and test and find the method that minimizes the test error. Or maximizes the accuracy.
Frequentist statistics might be the only non-boring way to do analysis: you have to be really creative to find the best method for your problem.Design - Model
https://randomeffect.net/post/2020/04/12/design-model/
Sun, 12 Apr 2020 00:00:00 +0000https://randomeffect.net/post/2020/04/12/design-model/What’s the value in DOE if we don’t use a model to analyze the data?
How do we design experiments that are robust to different modes of analysis?
Optimal experimental designs require a model for optimality; if the person who designed the experiment is not the one who models the data, optimal designs are … sub-optimal!
Thoughts like these creep in all the time.Estimating three or more things
https://randomeffect.net/post/2020/04/12/estimating-three-or-more-things/
Sun, 12 Apr 2020 00:00:00 +0000https://randomeffect.net/post/2020/04/12/estimating-three-or-more-things/
First Post
https://randomeffect.net/post/2020/04/12/first-post/
Sun, 12 Apr 2020 00:00:00 +0000https://randomeffect.net/post/2020/04/12/first-post/Hello World.Univariate statistics for multivariate problems
https://randomeffect.net/post/2020/04/12/univariate-statistics-for-multivariate-problems/
Sun, 12 Apr 2020 00:00:00 +0000https://randomeffect.net/post/2020/04/12/univariate-statistics-for-multivariate-problems/
About
https://randomeffect.net/about/
Mon, 01 Jan 0001 00:00:00 +0000https://randomeffect.net/about/I love statistics. And Emacs. I enjoy research in decision theory, statistical evidence, Bayesian statistics and philosophy, experimental design (including randomization and causality), statistical models, graphics, and computational statistics.
I’m slightly active on Stack Exchange (Cross Validated), HN, and some R/Emacs/ESS mailing lists. I have a bit of code available on GitHub.
This blog is some sort of spiritual successor to a research blog I kept in grad school. The old blog is a hodge-podge of org files that I might someday convert to R-markdown so I can host them here.Links
https://randomeffect.net/links/
Mon, 01 Jan 0001 00:00:00 +0000https://randomeffect.net/links/Statistics and Data Science
Christian Robert https://xianblog.wordpress.com/
Andrew Gelman https://statmodeling.stat.columbia.edu/
Deborah Mayo https://errorstatistics.com/
Frank Harrell https://www.fharrell.com/
Radford Neal https://radfordneal.wordpress.com/
Variance Explained http://varianceexplained.org/
Larry Wasserman (inactive) https://normaldeviate.wordpress.com/
20% Statistician https://daniellakens.blogspot.com/
Jim Albert https://baseballwithr.wordpress.com/author/bayesball/
The R blog https://developer.r-project.org/Blog/public/
Graphic Detail, by The Economist https://www.economist.com/graphic-detail/rss.xml
The Royal Society Data Science Section https://rssdss.design.blog/blog-feed/
Many good “blogs” can be created by following good users on Cross Validated.
Many statistics journals also offer RSS feeds.Old Photos
https://randomeffect.net/photos/
Mon, 01 Jan 0001 00:00:00 +0000https://randomeffect.net/photos/Canada Portugal DC Costa Rica Japan Taiwan Quotes and Jokes
https://randomeffect.net/quotes/
Mon, 01 Jan 0001 00:00:00 +0000https://randomeffect.net/quotes/The problem with machine learning is that the machine does all the learning. (Unknown)
We have more data than ever, more good data than ever, a lower proportion of data that are good, a lack of strategic thinking about what data are needed to answer questions of interest, sub-optimal analysis of data, and an occasional tendency to do research that should not be done. (Frank Harrell)