Usage shares of programming languages in economics research

29 Dec 2023

My Shiny app Finding Economics Articles with Data now contains over 8000 economics articles with replication packages. You can use it here: https://ejd.econ.mathematik.uni-ulm.de

Some of the data on articles and file types in the reproduction packages can be downloaded as a zipped SQLite database from my server (see the “About” page in the app for the link). Let us use the database to take a look at the usage shares of different programming languages.

The following code builds our data set by merging two tables from the database.

library(RSQLite)
library(dbmisc)
library(dplyr)

# Open data base using schemas as defined in my dbmisc
# package
db = dbConnect(RSQLite::SQLite(),"articles.sqlite")

articles = dbGet(db,"article")
fs = dbGet(db,"files_summary") 
fs = fs %>% 
  left_join(select(articles, year, journ, id), by="id")
head(fs)

id	file_type	num_files	mb	is_code	is_data	year	journ
aejapp_10_4_5	csv	9	6.49858	0	1	2018	aejapp
aejapp_10_4_5	do	19	0.169755	1	0	2018	aejapp
aejapp_10_4_5	dta	207	19918.231	0	1	2018	aejapp
aejpol_10_4_8	csv	1	2.110033	0	1	2018	aejpol
aejpol_10_4_8	do	18	0.118644	1	0	2018	aejpol
aejpol_10_4_8	gz	1	4294.9673	0	0	2018	aejpol

The data frame fs contains, for each article's reproduction package, one row per common data or code file type, with the number of files of that type.
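The counts further below rely on fs having at most one row per package and file type. A minimal sketch of how one could check that property, using a small made-up stand-in for fs (the values in fs_toy are hypothetical; only the column names come from the table above):

```r
library(dplyr)

# Toy stand-in for fs (hypothetical values, same key columns as above)
fs_toy = tibble(
  id        = c("a_1", "a_1", "a_2", "a_2"),
  file_type = c("do",  "dta", "do",  "csv"),
  num_files = c(3, 10, 1, 2)
)

# Count rows per (id, file_type); any n > 1 would be a duplicate key
dupes = fs_toy %>%
  count(id, file_type) %>%
  filter(n > 1)

# TRUE if every package has at most one row per file type
nrow(dupes) == 0
```

Replacing fs_toy by fs would run the same check on the real data.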

Let us look at the total number of reproduction packages and then compute the share of packages that contain at least one file of a given programming language. I am aware that not everybody would call e.g. Stata a programming language; feel free to substitute your preferred term, such as scripting language or statistical software.

n_art = n_distinct(fs$id)
n_art

## [1] 8262

fs %>% 
  group_by(file_type) %>%
  summarize(
    count = n(),
    share=round((count / n_art)*100,1)
  ) %>%
  # note that all file extensions are stored in lower case
  filter(file_type %in% c("do","r","py","jl","m","java","c","cpp","nb","f90","f95", "sas","mod","js","g","gms","ztt")) %>%
  arrange(desc(share))

file_type	count	share
do	5915	71.6
m	2023	24.5
r	808	9.8
sas	349	4.2
py	341	4.1
mod	198	2.4
f90	188	2.3
nb	116	1.4
c	105	1.3
ztt	104	1.3
cpp	66	0.8
jl	39	0.5
java	33	0.4
g	28	0.3
gms	19	0.2
js	18	0.2
f95	7	0.1

By a wide margin, the most used software is Stata, whose .do scripts can be found in 71.6% of reproduction packages. Matlab follows with 24.5%. The most popular open source language is R with 9.8%. After one more proprietary product, SAS, Python follows as the second most used open source language with 4.1%. If you wonder why the shares add up to more than 100%: some reproduction packages simply use more than one language.
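To see how multi-language packages push the sum of shares above 100%, one could count the distinct languages per package. The following is only a sketch on made-up data; fs_toy and the short language list langs are assumptions for illustration, not part of the database:

```r
library(dplyr)

# Toy stand-in for fs (hypothetical packages and file types)
fs_toy = tibble(
  id        = c("a_1", "a_1", "a_1", "a_2", "a_3", "a_3"),
  file_type = c("do",  "m",   "dta", "do",  "r",   "py")
)
langs = c("do", "m", "r", "py")

# Number of distinct languages in each package
# (packages without any code file in langs drop out here)
n_lang = fs_toy %>%
  filter(file_type %in% langs) %>%
  group_by(id) %>%
  summarize(n_lang = n_distinct(file_type))

# Share of these packages that use more than one language
multi_share = mean(n_lang$n_lang > 1)
multi_share
```

In this toy example two of the three packages combine two languages, so the per-language shares necessarily sum to more than 100%.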

Let us take a look at the development over time for Stata, Matlab, R and Python.

year_dat = fs %>%
  filter(year >= 2010) %>%
  group_by(year) %>%
  mutate(n_art_year = n_distinct(id)) %>%
  group_by(year, file_type) %>%
  summarize(
    count = n(),
    share=count / first(n_art_year),
    # Compute approximate 95% CI of proportion
    se = sqrt(share*(1-share)/first(n_art_year)),
    ci_up = share + 1.96*se,
    ci_low = share - 1.96*se
  ) %>%
  filter(file_type %in% c("do","r","py","m")) %>%
  arrange(year,desc(share))  

library(ggplot2)
ggplot(year_dat, aes(x=year, y=share,ymin=ci_low, ymax=ci_up, color=file_type)) +
  facet_wrap(~file_type) +
  geom_ribbon(fill="#000000", colour = NA, alpha=0.1) +
  geom_line() +
  theme_bw()

The usage shares of Stata and Matlab stay relatively constant over time. In contrast, R usage increases substantially, from 1.4% in 2010 to over 20% in 2023. Python usage also increases, from 0.4% in 2010 to almost 10% in 2023.

So open source software is becoming more popular in academic economics research: growth rates are large, but absolute usage levels are still substantially below Stata's.
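The confidence bands in the yearly plot use the normal approximation share ± 1.96 * se with se = sqrt(share * (1 - share) / n). For small shares the lower bound of this interval can become negative. The helper below is a sketch of the same formula; the function name wald_ci and the clamping to [0, 1] are my additions for illustration and are not used in the code above:

```r
# Normal-approximation (Wald) interval for a proportion.
# count: number of packages using a language, n: total packages.
wald_ci = function(count, n, z = 1.96) {
  share = count / n
  se = sqrt(share * (1 - share) / n)
  data.frame(
    share  = share,
    # Clamping to [0, 1] is an assumption added here,
    # the plots above use the unclamped bounds
    ci_low = pmax(0, share - z * se),
    ci_up  = pmin(1, share + z * se)
  )
}

# Hypothetical counts: 2 of 100 and 50 of 100 packages
res = wald_ci(count = c(2, 50), n = c(100, 100))
res
```

For shares close to 0 or 1 and few packages per year, other intervals (e.g. the Wilson interval) tend to behave better than this approximation.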

Note that the representation of journals is not balanced across years in our database. E.g. the first reproduction package from Management Science in our database is from 2019. To check whether the growth of R usage can also be found within journals, let us look at the development of its usage share within journals:

year_journ_dat = fs %>%
  filter(year >= 2010) %>%
  group_by(year, journ) %>%
  mutate(n_art = n_distinct(id)) %>%
  group_by(year, journ, file_type) %>%
  summarize(
    count = n(),
    share=count / first(n_art),
    # Compute approximate 95% CI of proportion
    se = sqrt(share*(1-share)/first(n_art)),
    ci_up = share + 1.96*se,
    ci_low = share - 1.96*se
  )
ggplot(year_journ_dat %>% filter(file_type=="r"),
  aes(x=year, y=share,ymin=ci_low, ymax=ci_up)) +
  facet_wrap(~ journ, scales = "free_y") +
  geom_ribbon(fill="#000000", colour = NA, alpha=0.1) +
  geom_line() +
  coord_cartesian(ylim = c(0, 0.4)) +
  ylab("") +
  ggtitle("Share of replication packages using R")+
  theme_bw()

We see a substantial increase in R usage in most journals. Finally, let us take a similar look at the time trends of Stata usage within journals.

ggplot(year_journ_dat %>% filter(file_type=="do"),
  aes(x=year, y=share,ymin=ci_low, ymax=ci_up)) +
  facet_wrap(~ journ, scales = "free_y") +
  geom_ribbon(fill="#000000", colour = NA, alpha=0.1) +
  geom_line() +
  coord_cartesian(ylim = c(0, 1)) +
  ylab("") +
  ggtitle("Share of replication packages using Stata")+
  theme_bw()
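Since journals enter the database in different years, it can help to look at the covered period per journal before comparing trends. A minimal sketch on made-up data; the ids, journal codes and years in fs_toy are hypothetical, only the column names id, journ and year are taken from fs:

```r
library(dplyr)

# Toy stand-in for fs (hypothetical ids, journals and years)
fs_toy = tibble(
  id    = c("aer_1", "aer_1", "aer_2", "ms_1", "ms_2"),
  journ = c("aer",   "aer",   "aer",   "ms",   "ms"),
  year  = c(2010,    2010,    2015,    2019,   2021)
)

# First and last year with a package, and number of packages, per journal
coverage = fs_toy %>%
  group_by(journ) %>%
  summarize(
    first_year = min(year),
    last_year  = max(year),
    n_packages = n_distinct(id)
  )
coverage
```

Applied to fs instead of fs_toy, such a table would show which journals only contribute to the later years of the pooled time series.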
