My Shiny app Finding Economics Articles with Data now contains over 8000 economics articles with replication packages. You can use it here: https://ejd.econ.mathematik.uni-ulm.de
Some of the data on articles and file types in the reproduction packages can be downloaded as a zipped SQLite database from my server (see the “About” page in the app for the link). Let us use the database to take a look at the usage shares of different programming languages.
The following code builds our data set by merging two tables from the database.
library(RSQLite)
library(dbmisc)
library(dplyr)
# Open data base using schemas as defined in my dbmisc
# package
db = dbConnect(RSQLite::SQLite(),"articles.sqlite")
articles = dbGet(db,"article")
fs = dbGet(db,"files_summary")
fs = fs %>%
  left_join(select(articles, year, journ, id), by="id")
head(fs)
id | file_type | num_files | mb | is_code | is_data | year | journ |
---|---|---|---|---|---|---|---|
aejapp_10_4_5 | csv | 9 | 6.49858 | 0 | 1 | 2018 | aejapp |
aejapp_10_4_5 | do | 19 | 0.169755 | 1 | 0 | 2018 | aejapp |
aejapp_10_4_5 | dta | 207 | 19918.231 | 0 | 1 | 2018 | aejapp |
aejpol_10_4_8 | csv | 1 | 2.110033 | 0 | 1 | 2018 | aejpol |
aejpol_10_4_8 | do | 18 | 0.118644 | 1 | 0 | 2018 | aejpol |
aejpol_10_4_8 | gz | 1 | 4294.9673 | 0 | 0 | 2018 | aejpol |
The data frame fs contains one row per file type found in an article's reproduction package, with the number of files of that type (num_files), their size in MB (mb), and flags indicating whether the type counts as code or data.
Let us look at the total number of reproduction packages and then compute the share of packages that contain at least one file of a given programming language. (I am aware that not everybody would call, e.g., Stata a programming language. Feel free to substitute your preferred term, such as scripting language or statistical software.)
n_art = n_distinct(fs$id)
n_art
## [1] 8262
fs %>%
  group_by(file_type) %>%
  summarize(
    count = n(),
    share = round((count / n_art)*100, 1)
  ) %>%
  # note that all file extensions are stored in lower case
  filter(file_type %in% c("do","r","py","jl","m","java","c","cpp","nb","f90","f95", "sas","mod","js","g","gms","ztt")) %>%
  arrange(desc(share))
file_type | count | share |
---|---|---|
do | 5915 | 71.6 |
m | 2023 | 24.5 |
r | 808 | 9.8 |
sas | 349 | 4.2 |
py | 341 | 4.1 |
mod | 198 | 2.4 |
f90 | 188 | 2.3 |
nb | 116 | 1.4 |
c | 105 | 1.3 |
ztt | 104 | 1.3 |
cpp | 66 | 0.8 |
jl | 39 | 0.5 |
java | 33 | 0.4 |
g | 28 | 0.3 |
gms | 19 | 0.2 |
js | 18 | 0.2 |
f95 | 7 | 0.1 |
The most used software by a wide margin is Stata, whose .do scripts can be found in 71.6% of reproduction packages. Matlab follows with 24.5%. The most popular open source language is R with 9.8%. After one more proprietary software, SAS, Python follows as the second most used open source language with 4.1%. If you wonder why the shares add up to more than 100%: some reproduction packages simply use more than one language.
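To make the last point concrete, here is a minimal sketch that counts how many of the four main languages each package uses. It runs on a small made-up data frame `toy_fs` (a hypothetical stand-in with the same layout as fs), so the numbers are purely illustrative; replacing `toy_fs` by fs should give the corresponding figures for the real data.

```r
library(dplyr)

# Hypothetical toy data with the same layout as fs:
# one row per article id and file type (all values are made up)
toy_fs = data.frame(
  id = c("a1", "a1", "a2", "a3", "a3", "a3"),
  file_type = c("do", "m", "do", "r", "py", "csv")
)

code_types = c("do", "r", "py", "m")

# Number of distinct code file types per reproduction package.
# Packages without any of these file types drop out here.
langs_per_pkg = toy_fs %>%
  filter(file_type %in% code_types) %>%
  group_by(id) %>%
  summarize(n_lang = n_distinct(file_type), .groups = "drop")

# Share of these packages that mix two or more of the languages
multi_share = mean(langs_per_pkg$n_lang >= 2)
multi_share
```

Note that the denominator in this sketch is the number of packages with at least one Stata, R, Python or Matlab file, not all packages.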
Let us take a look at the development over time for Stata, Matlab, R and Python.
year_dat = fs %>%
  filter(year >= 2010) %>%
  # Number of reproduction packages per year (denominator of the shares)
  group_by(year) %>%
  mutate(n_art_year = n_distinct(id)) %>%
  group_by(year, file_type) %>%
  summarize(
    count = n(),
    share = count / first(n_art_year),
    # Compute approximate 95% CI of proportion
    se = sqrt(share*(1-share)/first(n_art_year)),
    ci_up = share + 1.96*se,
    ci_low = share - 1.96*se,
    # Return an ungrouped data frame
    .groups = "drop"
  ) %>%
  filter(file_type %in% c("do","r","py","m")) %>%
  arrange(year, desc(share))
library(ggplot2)
ggplot(year_dat, aes(x=year, y=share, ymin=ci_low, ymax=ci_up, color=file_type)) +
  facet_wrap(~file_type) +
  geom_ribbon(fill="#000000", colour = NA, alpha=0.1) +
  geom_line() +
  theme_bw()
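The shaded bands in this plot are the approximate 95% confidence intervals computed above. They use the standard normal approximation for a proportion, where $\hat{p}$ corresponds to `share` and $n$ to `n_art_year`:

```latex
\hat{p} \pm 1.96 \cdot \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}
```

As a purely hypothetical illustration, a share of $\hat{p} = 0.1$ among $n = 400$ packages gives a standard error of $\sqrt{0.1 \cdot 0.9 / 400} = 0.015$ and an interval of roughly $[0.071, 0.129]$. For shares close to zero and small $n$, this approximation is rough and the lower bound can fall below zero.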
The usage shares of Stata and Matlab stay relatively constant over time. In contrast, R usage increases substantially, from 1.4% in 2010 to over 20% in 2023. Python usage also increases, from 0.4% in 2010 to almost 10% in 2023.
So open source software is becoming more popular in academic economic research: growth rates are large, but absolute usage levels are still substantially below those of Stata.
Note that the representation of journals is not balanced across years in our database. For example, the first reproduction package from Management Science in our database is from 2019. To check whether the growth of R usage can also be found within journals, let us look at the development of its usage share within journals:
year_journ_dat = fs %>%
  filter(year >= 2010) %>%
  # Number of reproduction packages per year and journal
  group_by(year, journ) %>%
  mutate(n_art = n_distinct(id)) %>%
  group_by(year, journ, file_type) %>%
  summarize(
    count = n(),
    share = count / first(n_art),
    # Compute approximate 95% CI of proportion
    se = sqrt(share*(1-share)/first(n_art)),
    ci_up = share + 1.96*se,
    ci_low = share - 1.96*se,
    # Return an ungrouped data frame
    .groups = "drop"
  )
ggplot(year_journ_dat %>% filter(file_type=="r"),
       aes(x=year, y=share, ymin=ci_low, ymax=ci_up)) +
  facet_wrap(~ journ, scales = "free_y") +
  geom_ribbon(fill="#000000", colour = NA, alpha=0.1) +
  geom_line() +
  coord_cartesian(ylim = c(0, 0.4)) +
  ylab("") +
  ggtitle("Share of replication packages using R") +
  theme_bw()
We see a substantial increase in R usage in most journals. Finally, let us take a similar look at the time trends of Stata usage within journals.
ggplot(year_journ_dat %>% filter(file_type=="do"),
       aes(x=year, y=share, ymin=ci_low, ymax=ci_up)) +
  facet_wrap(~ journ, scales = "free_y") +
  geom_ribbon(fill="#000000", colour = NA, alpha=0.1) +
  geom_line() +
  coord_cartesian(ylim = c(0, 1)) +
  ylab("") +
  ggtitle("Share of replication packages using Stata") +
  theme_bw()
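Since journal coverage is not balanced across years, it can help to check the first and last year in which each journal appears before reading the within-journal panels. The following is a minimal sketch on a made-up data frame `toy_fs` with hypothetical journal codes `journ_a` and `journ_b`; it only assumes the columns id, journ and year that fs also has.

```r
library(dplyr)

# Hypothetical toy data with the same columns as fs (all values are made up)
toy_fs = data.frame(
  id    = c("a1", "a2", "a3", "b1", "b2"),
  journ = c("journ_a", "journ_a", "journ_a", "journ_b", "journ_b"),
  year  = c(2010, 2015, 2023, 2019, 2022)
)

# First and last year plus number of packages per journal
coverage = toy_fs %>%
  distinct(id, journ, year) %>%
  group_by(journ) %>%
  summarize(
    first_year = min(year),
    last_year = max(year),
    n_packages = n_distinct(id),
    .groups = "drop"
  )
coverage
```

A journal whose first year is late in the sample, like `journ_b` here, contributes only to the most recent years of the aggregate trend.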
Published on 29 Dec 2023