In my view, one of the greatest developments during the last decade in economics is that the Journals of the American Economic Association and some other leading journals require authors to upload the replication code and data sets of accepted articles.
I wrote a Shiny app that allows to search currently among more than 3000 articles that have an accessible data and code supplement. Just click here to use it:
One can perform a keyword search among the abstract and title. The screenshot shows an example:
One gets some information about the size of the data files and the used code files. I also tried to find and extract a README file from each supplement. Most README files explain whether all results can be replicated with the provided data sets or whether some results require confidential or proprietary data sets. The link allows you to look at the README without the need to download the whole data set.
The main idea is that such a search function could be helpful for teaching economics and data science. For example, my students can use the app to find an interesting topic for a Bachelor or Master Thesis in form of an interactive analysis with RTutor. You could also generate a topic list for a seminar, in which students shall replicate some key findings of a resarch article.
While the app performs well for a single user, I have not tested the performance for many simultaneous users. If it is too sluggish or you don’t get connected there are perhaps currently too many users. Then just try it out a bit later.
If you want to analyse yourself the collected data underlying the search app, you can download the zipped SQLite databases using the following links:
Main database (should suffice for most analyses):
Large database containing names and sizes of all files in the data and code supplements:
I try to update the databases regularly.
Below is an example, for a simple analysis based on that databases. First make sure that you download and extract
articles.zip into your working directory.
We first open a database connection
library(RSQLite) db = dbConnect(RSQLite::SQLite(),"articles.sqlite")
File type conversion between databases and R can sometimes be a bit tedious. For example, SQLite knows no native
logical type. For this reason, I typically use my package dbmisc when working with SQLite databases. It allows to specify a database schema as simple yaml file and has a lot of convenience function to retrieve or modify data that automatically use the provided schema. The following code sets the database schema that is provided in the package
library(dbmisc) db = set.db.schemas(db,schema.file= system.file("schema/articles.yaml", package="EconJournalData"))
Of course, for a simple analysis as ours below just using the standard function in the
DBI package without schemata suffices. But I am just used to working with the
The main information about articles is stored in the table
# Get the first 4 entries of articles as data frame dbGet(db, "article",n = 4)
|aer_108_11_1||2018||2018-11-01||aer||Firm Sorting and Agglomeration||108||11||1||https://www.aeaweb.org/articles?id=10.1257/aer.20150361||TRUE||https://www.aeaweb.org/doi/10.1257/aer.20150361.data||0.05339||MB||NA||aer_vol_108_issue_11_article_1.zip||NA||TRUE||NA||Abstract To account for the uneven distribution of economic activity in space, I propose a theory of the location choices of heterogeneous firms in a variety of sectors across cities. In equilibrium, the distribution of city sizes and the sorting patterns of firms are uniquely determined and affect aggregate TFP and welfare. I estimate the model using French firm-level data and find that nearly half of the productivity advantage of large cities is due to firm sorting, the rest coming from agglomeration economies. I quantify the general equilibrium effects of place-based policies: policies that subsidize smaller cities have negative aggregate effects.||aer/2018/aer_108_11_1/READ_ME.pdf|
|aer_108_11_2||2018||2018-11-01||aer||Near-Feasible Stable Matchings with Couples||108||11||2||https://www.aeaweb.org/articles?id=10.1257/aer.20141188||TRUE||https://www.aeaweb.org/doi/10.1257/aer.20141188.data||0.07286||MB||NA||aer_vol_108_issue_11_article_2.zip||NA||TRUE||NA||Abstract The National Resident Matching program seeks a stable matching of medical students to teaching hospitals. With couples, stable matchings need not exist. Nevertheless, for any student preferences, we show that each instance of a matching problem has a "nearby" instance with a stable matching. The nearby instance is obtained by perturbing the capacities of the hospitals. In this perturbation, aggregate capacity is never reduced and can increase by at most four. The capacity of each hospital never changes by more than two.||aer/2018/aer_108_11_2/Readme.pdf|
|aer_108_11_3||2018||2018-11-01||aer||The Costs of Patronage: Evidence from the British Empire||108||11||3||https://www.aeaweb.org/articles?id=10.1257/aer.20171339||TRUE||https://www.aeaweb.org/doi/10.1257/aer.20171339.data||0.44938||MB||NA||aer_vol_108_issue_11_article_3.zip||NA||TRUE||NA||Abstract I combine newly digitized personnel and public finance data from the British colonial administration for the period 1854-1966 to study how patronage affects the promotion and incentives of governors. Governors are more likely to be promoted to higher salaried colonies when connected to their superior during the period of patronage. Once allocated, they provide more tax exemptions, raise less revenue, and invest less. The promotion and performance gaps disappear after the abolition of patronage appointments. Patronage therefore distorts the allocation of public sector positions and reduces the incentives of favored bureaucrats to perform.||aer/2018/aer_108_11_3/Readme.pdf|
|aer_108_11_4||2018||2018-11-01||aer||The Logic of Insurgent Electoral Violence||108||11||4||https://www.aeaweb.org/articles?id=10.1257/aer.20170416||TRUE||https://www.aeaweb.org/doi/10.1257/aer.20170416.data||56||MB||NA||aer_vol_108_issue_11_article_4.zip||NA||TRUE||NA||Abstract Competitive elections are essential to establishing the political legitimacy of democratizing regimes. We argue that insurgents undermine the state's mandate through electoral violence. We study insurgent violence during elections using newly declassified microdata on the conflict in Afghanistan. Our data track insurgent activity by hour to within meters of attack locations. Our results suggest that insurgents carefully calibrate their production of violence during elections to avoid harming civilians. Leveraging a novel instrumental variables approach, we find that violence depresses voting. Collectively, the results suggest insurgents try to depress turnout while avoiding backlash from harming civilians. Counterfactual exercises provide potentially actionable insights for safeguarding at-risk elections and enhancing electoral legitimacy in emerging democracies.||aer/2018/aer_108_11_4/READ_ME.pdf|
files_summary contains information about code, data and archive files for each article
dbGet(db, "files_summary",n = 6)
Let us now analyse which share of articles uses Stata, R, Python, Matlab or Julia and how the usage has developed over time.
Since our datasets are small, we can just download the two tables and work with
dplyr in memory. Alternatively, you could use some SQL commands or work with
dplyr on the database connection.
articles = dbGet(db,"article") fs = dbGet(db,"files_summary")
Let us now compute the shares of articles that have one of the file types, we are interested in
# Number of articles with analyes data & code supplementary n_art = n_distinct(fs$id) # Count articles by file types and compute shares fs %>% group_by(file_type) %>% summarize(count = n(), share=round((count / n_art)*100,2)) %>% # note that all file extensions are stored in lower case filter(file_type %in% c("do","r","py","jl","m")) %>% arrange(desc(share))
Roughly 70% of the articles have Stata
do files and almost a quarter Matlab
m files. Using open source statistical software seems not yet very popular among economists: less than 3% of articles have R code files, Python is below 1% and only 2 articles have Julia code.
This dominance of Stata in economics never ceases to surprise me, in particular when for some reason I just happened to open the Stata do file editor and compare it with RStudio… But then, I am not an expert in writing empirical economic research papers - I just like R programming and rather passively consume empirical research. For writing empirical papers it probably is convenient that in Stata you can add a
robust cluster option to almost every type of regression in order to quickly get the economists’ standard standard errors…
For a teaching empirical economics with R the dominance of Stata is not neccessarily bad news. It means that there are a lot of studies which students can replicate in R. Such replication would be considerably less interesting if the original code of the articles would already be given in R.
Let us finish by having a look at the development over time…
sum_dat = fs %>% left_join(select(articles, year, id), by="id") %>% group_by(year) %>% mutate(n_art_year = n()) %>% group_by(year, file_type) %>% summarize( count = n(), share=round((count / first(n_art_year))*100,2) ) %>% filter(file_type %in% c("do","r","py","jl","m")) %>% arrange(year,desc(share)) head(sum_dat)
library(ggplot2) ggplot(sum_dat, aes(x=year, y=share, color=file_type)) + geom_line(size=1.5) + scale_y_log10() + theme_bw()
Well, maybe there is a little upward trend for the open source languages, but not too much seems to have happened over time so far…