Credit to Tidyverse

John Haman


Categories: R

I do not generally agree with the workflow that the tidyverse insists users follow, but I’m starting to come around. I find it very hard to maintain consistency in data cleaning, because I often switch back and forth between tidy and base R coding in the same script. I’m guessing that a lot of R users who learned base R years before tidyverse became popular do the same thing.

I’m a mild Tidyverse Skeptic.

Despite my skepticism, the tidyverse deserves massive credit for a certain second-order effect on statistics and data science: standardizing nomenclature. This is my big pet peeve in statistics – everyone using different words to describe the same thing. This is probably worst in the machine learning vs. statistics divide (feature vs. covariate, response vs. target, and so on). But at least tidyverse has brought order to the data cleaning lexicon, as far as R users are concerned. [1]

No longer do I have to worry about sorting vs. arranging; now it’s just arrange. Joining vs. merging is just joining. Subsetting vs. filtering is now always filtering when I talk to my colleagues. Everyone is using the same words to (in)formally describe their work. This is the real advantage of the ‘verb first’ coding style that tidyverse evangelizes.
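To make the vocabulary point concrete, here is a rough side-by-side sketch of the three operations named above. It assumes the dplyr package is installed, and the data frames `scores` and `groups` are made up purely for illustration. The base R calls shown are only one common way of doing each task.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration
scores <- data.frame(id = c(3, 1, 2), score = c(70, 90, 80))
groups <- data.frame(id = c(1, 2, 3), group = c("a", "b", "a"))

# Base R vocabulary: order / merge / subset
base_sorted   <- scores[order(scores$score), ]
base_joined   <- merge(scores, groups, by = "id")  # inner join by default
base_filtered <- subset(scores, score > 75)

# dplyr vocabulary: arrange / join / filter
tidy_sorted   <- arrange(scores, score)
tidy_joined   <- inner_join(scores, groups, by = "id")
tidy_filtered <- filter(scores, score > 75)
```

The results match up to row order and row names, but in the dplyr version the function names are the same words I would use when describing the steps out loud.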

Everyone using the same lexicon for a very complex topic brings massive benefits in communication and education.

  1. I doubt this is an issue for people who only pull and clean data using SQL.