Finding Economic Articles With Data

21 Feb 2019

In my view, one of the greatest developments in economics during the last decade is that the journals of the American Economic Association and some other leading journals require authors to upload the replication code and data sets of accepted articles.

I wrote a Shiny app that currently lets you search among more than 3000 articles that have an accessible data and code supplement. Just click here to use it:

https://ejd.econ.mathematik.uni-ulm.de/

(Note that I updated the link to have proper https.) One can perform a keyword search in the abstracts and titles. The screenshot shows an example:

One gets some information about the size of the data files and the code files used. I also tried to find and extract a README file from each supplement. Most README files explain whether all results can be replicated with the provided data sets or whether some results require confidential or proprietary data sets. The link allows you to look at the README without the need to download the whole data set.

The main idea is that such a search function could be helpful for teaching economics and data science. For example, my students can use the app to find an interesting topic for a Bachelor or Master thesis in the form of an interactive analysis with RTutor. You could also generate a topic list for a seminar in which students replicate some key findings of a research article.

While the app performs well for a single user, I have not tested the performance for many simultaneous users. If it is too sluggish or you don’t get connected, there are perhaps currently too many users. In that case, just try again a bit later.

If you want to analyse the collected data underlying the search app yourself, you can download the zipped SQLite databases using the following links:

Main database (should suffice for most analyses):
http://econ.mathematik.uni-ulm.de/ejd/articles.zip
Large database containing names and sizes of all files in the data and code supplements:
http://econ.mathematik.uni-ulm.de/ejd/files.zip

I try to update the databases regularly.

Below is an example of a simple analysis based on these databases. First make sure that you download and extract articles.zip into your working directory.
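If you prefer to do the download step from within R, a minimal sketch could look like this. The helper name fetch_articles_db is made up for illustration, and I assume here that the archive contains the file articles.sqlite that is opened below.

```r
# Hypothetical helper (not part of any package): download and unpack the main database
fetch_articles_db <- function(
    url = "http://econ.mathematik.uni-ulm.de/ejd/articles.zip",
    dest_dir = getwd()) {
  zip_file <- file.path(tempdir(), "articles.zip")
  # mode = "wb" keeps the binary zip file intact on Windows
  download.file(url, zip_file, mode = "wb")
  unzip(zip_file, exdir = dest_dir)
  # Assumption: the archive contains articles.sqlite
  file.path(dest_dir, "articles.sqlite")
}
```

Calling fetch_articles_db() once should then leave the extracted database in the working directory.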

We first open a database connection:

library(RSQLite)
db = dbConnect(RSQLite::SQLite(),"articles.sqlite")

Type conversion between databases and R can sometimes be a bit tedious. For example, SQLite has no native Date or logical type. For this reason, I typically use my package dbmisc when working with SQLite databases. It allows you to specify a database schema as a simple yaml file and has a lot of convenience functions to retrieve or modify data that automatically use the provided schema. The following code sets the database schema that is provided in the package EconJournalData:

library(dbmisc)
db = set.db.schemas(db,schema.file=
  system.file("schema/articles.yaml", package="EconJournalData"))

Of course, for a simple analysis like ours below, the standard functions in the DBI package without schemata would suffice. But I am just used to working with the dbmisc package.
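To illustrate what the type issue looks like with plain DBI, here is a small sketch on a made-up in-memory table (not the actual articles database). It assumes RSQLite's default settings, under which Date and logical columns are stored and returned as numbers, so that one has to convert them back by hand.

```r
library(DBI)

# Toy in-memory database; the table toy and its contents are made up
con <- dbConnect(RSQLite::SQLite(), ":memory:")
toy <- data.frame(
  id = c("a1", "a2"),
  date = as.Date(c("2018-11-01", "2018-12-01")),
  has_data = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "toy", toy)

# Plain DBI query: with default settings the Date and logical
# columns come back as numbers
raw <- dbGetQuery(con, "SELECT * FROM toy")
str(raw)

# Manual conversion back to the R types
raw$date <- as.Date(raw$date, origin = "1970-01-01")
raw$has_data <- as.logical(raw$has_data)

dbDisconnect(con)
```

This kind of manual conversion is what a schema saves you from repeating for every table.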

The main information about articles is stored in the table article:

# Get the first 4 entries of articles as data frame
dbGet(db, "article",n = 4)

id	year	date	journ	title	vol	issue	artnum	article_url	has_data	data_url	size	unit	files_txt	downloaded_file	num_authors	file_info_stored	file_info_summarized	abstract	readme_file
aer_108_11_1	2018	2018-11-01	aer	Firm Sorting and Agglomeration	108	11	1	https://www.aeaweb.org/articles?id=10.1257/aer.20150361	TRUE	https://www.aeaweb.org/doi/10.1257/aer.20150361.data	0.05339	MB	NA	aer_vol_108_issue_11_article_1.zip	NA	TRUE	NA	Abstract To account for the uneven distribution of economic activity in space, I propose a theory of the location choices of heterogeneous firms in a variety of sectors across cities. In equilibrium, the distribution of city sizes and the sorting patterns of firms are uniquely determined and affect aggregate TFP and welfare. I estimate the model using French firm-level data and find that nearly half of the productivity advantage of large cities is due to firm sorting, the rest coming from agglomeration economies. I quantify the general equilibrium effects of place-based policies: policies that subsidize smaller cities have negative aggregate effects.	aer/2018/aer_108_11_1/READ_ME.pdf
aer_108_11_2	2018	2018-11-01	aer	Near-Feasible Stable Matchings with Couples	108	11	2	https://www.aeaweb.org/articles?id=10.1257/aer.20141188	TRUE	https://www.aeaweb.org/doi/10.1257/aer.20141188.data	0.07286	MB	NA	aer_vol_108_issue_11_article_2.zip	NA	TRUE	NA	Abstract The National Resident Matching program seeks a stable matching of medical students to teaching hospitals. With couples, stable matchings need not exist. Nevertheless, for any student preferences, we show that each instance of a matching problem has a "nearby" instance with a stable matching. The nearby instance is obtained by perturbing the capacities of the hospitals. In this perturbation, aggregate capacity is never reduced and can increase by at most four. The capacity of each hospital never changes by more than two.	aer/2018/aer_108_11_2/Readme.pdf
aer_108_11_3	2018	2018-11-01	aer	The Costs of Patronage: Evidence from the British Empire	108	11	3	https://www.aeaweb.org/articles?id=10.1257/aer.20171339	TRUE	https://www.aeaweb.org/doi/10.1257/aer.20171339.data	0.44938	MB	NA	aer_vol_108_issue_11_article_3.zip	NA	TRUE	NA	Abstract I combine newly digitized personnel and public finance data from the British colonial administration for the period 1854-1966 to study how patronage affects the promotion and incentives of governors. Governors are more likely to be promoted to higher salaried colonies when connected to their superior during the period of patronage. Once allocated, they provide more tax exemptions, raise less revenue, and invest less. The promotion and performance gaps disappear after the abolition of patronage appointments. Patronage therefore distorts the allocation of public sector positions and reduces the incentives of favored bureaucrats to perform.	aer/2018/aer_108_11_3/Readme.pdf
aer_108_11_4	2018	2018-11-01	aer	The Logic of Insurgent Electoral Violence	108	11	4	https://www.aeaweb.org/articles?id=10.1257/aer.20170416	TRUE	https://www.aeaweb.org/doi/10.1257/aer.20170416.data	56	MB	NA	aer_vol_108_issue_11_article_4.zip	NA	TRUE	NA	Abstract Competitive elections are essential to establishing the political legitimacy of democratizing regimes. We argue that insurgents undermine the state's mandate through electoral violence. We study insurgent violence during elections using newly declassified microdata on the conflict in Afghanistan. Our data track insurgent activity by hour to within meters of attack locations. Our results suggest that insurgents carefully calibrate their production of violence during elections to avoid harming civilians. Leveraging a novel instrumental variables approach, we find that violence depresses voting. Collectively, the results suggest insurgents try to depress turnout while avoiding backlash from harming civilians. Counterfactual exercises provide potentially actionable insights for safeguarding at-risk elections and enhancing electoral legitimacy in emerging democracies.	aer/2018/aer_108_11_4/READ_ME.pdf

The table files_summary contains information about code, data and archive files for each article:

dbGet(db, "files_summary",n = 6)

id	file_type	num_files	mb	is_code	is_data
aejapp_1_1_10	do	9	0.009427	TRUE	FALSE
aejapp_1_1_10	dta	2	0.100694	FALSE	TRUE
aejapp_1_1_3	do	19	0.103628	TRUE	FALSE
aejapp_1_1_4	csv	1	0.024872	FALSE	TRUE
aejapp_1_1_4	dat	1	7.15491	FALSE	TRUE
aejapp_1_1_4	do	9	0.121618	TRUE	FALSE

Let us now analyse which share of articles uses Stata, R, Python, Matlab or Julia and how the usage has developed over time.

Since our datasets are small, we can just download the two tables and work with dplyr in memory. Alternatively, you could use some SQL commands or work with dplyr on the database connection.

articles = dbGet(db,"article")
fs = dbGet(db,"files_summary")

Let us now compute the shares of articles that have one of the file types we are interested in:

library(dplyr)

# Number of articles whose data & code supplement has been analysed
n_art = n_distinct(fs$id)

# Count articles by file types and compute shares
fs %>% group_by(file_type) %>%
  summarize(count = n(), share=round((count / n_art)*100,2)) %>%
  # note that all file extensions are stored in lower case
  filter(file_type %in% c("do","r","py","jl","m")) %>%
  arrange(desc(share))

file_type	count	share
do	2606	70.55
m	852	23.06
r	105	2.84
py	32	0.87
jl	2	0.05

Roughly 70% of the articles have Stata do files and almost a quarter Matlab m files. Using open source statistical software seems not yet very popular among economists: less than 3% of articles have R code files, Python is below 1% and only 2 articles have Julia code.
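As mentioned above, the same counts could also be computed with SQL directly in the database. Here is a minimal sketch on a toy in-memory table that copies the six files_summary rows shown earlier. With the real data you would run the query on db instead.

```r
library(DBI)

# Toy in-memory copy of the six files_summary rows shown earlier
con <- dbConnect(RSQLite::SQLite(), ":memory:")
fs_toy <- data.frame(
  id = c("aejapp_1_1_10", "aejapp_1_1_10", "aejapp_1_1_3",
         "aejapp_1_1_4", "aejapp_1_1_4", "aejapp_1_1_4"),
  file_type = c("do", "dta", "do", "csv", "dat", "do"),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "files_summary", fs_toy)

# Count articles per file type and divide by the number of distinct articles
res <- dbGetQuery(con, "
  SELECT file_type,
         COUNT(*) AS count,
         ROUND(100.0 * COUNT(*) /
               (SELECT COUNT(DISTINCT id) FROM files_summary), 2) AS share
  FROM files_summary
  WHERE file_type IN ('do', 'r', 'py', 'jl', 'm')
  GROUP BY file_type
  ORDER BY share DESC
")
dbDisconnect(con)
res
```

For tables of this size there is no real performance reason to prefer SQL. It just avoids loading the whole table into memory first.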

This dominance of Stata in economics never ceases to surprise me, in particular when, for some reason, I happen to open the Stata do-file editor and compare it with RStudio… But then, I am not an expert in writing empirical economic research papers - I just like R programming and rather passively consume empirical research. For writing empirical papers it probably is convenient that in Stata you can add a robust or robust cluster option to almost every type of regression in order to quickly get the economists’ standard standard errors…

For teaching empirical economics with R, the dominance of Stata is not necessarily bad news. It means that there are a lot of studies which students can replicate in R. Such a replication would be considerably less interesting if the original code of the articles were already given in R.

Let us finish by having a look at the development over time…

sum_dat = fs %>% 
  left_join(select(articles, year, id), by="id") %>%
  group_by(year) %>%
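  # n() counts the rows of fs per year (one row per article and file type),
  # so the shares below are relative to that number, not to distinct articles
  mutate(n_art_year = n()) %>%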
  group_by(year, file_type) %>%
  summarize(
    count = n(),
    share=round((count / first(n_art_year))*100,2)
  ) %>%
  filter(file_type %in% c("do","r","py","jl","m")) %>%
  arrange(year,desc(share))  
head(sum_dat)

year	file_type	count	share
2005	do	25	22.12
2005	m	10	8.85
2006	do	24	20.87
2006	m	13	11.3
2007	do	24	19.35
2007	m	16	12.9
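Note that the denominator n_art_year above counts the rows of fs per year, i.e. combinations of article and file type, which is why the yearly shares are much lower than the aggregate ones. If one wants shares of articles, as in the aggregate table, one could count distinct article ids per year instead. A sketch of that variant on made-up toy data (fs_toy and articles_toy are purely illustrative):

```r
library(dplyr)

# Made-up toy data with the same columns as used above
fs_toy <- data.frame(
  id = c("a1", "a1", "a2", "b1", "b1", "b2"),
  file_type = c("do", "dta", "do", "m", "mat", "do"),
  stringsAsFactors = FALSE
)
articles_toy <- data.frame(
  id = c("a1", "a2", "b1", "b2"),
  year = c(2005, 2005, 2006, 2006),
  stringsAsFactors = FALSE
)

sum_toy <- fs_toy %>%
  left_join(select(articles_toy, year, id), by = "id") %>%
  group_by(year) %>%
  # distinct articles per year instead of rows per year
  mutate(n_art_year = n_distinct(id)) %>%
  group_by(year, file_type) %>%
  summarize(
    count = n(),
    share = round(count / first(n_art_year) * 100, 2),
    .groups = "drop"
  ) %>%
  filter(file_type %in% c("do", "r", "py", "jl", "m")) %>%
  arrange(year, desc(share))
sum_toy
```

In the toy data both 2005 articles have do files, so the do share for 2005 is 100 percent, whereas dividing by the three rows of that year would give about 67 percent.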

library(ggplot2)
ggplot(sum_dat, aes(x=year, y=share, color=file_type)) +
  geom_line(size=1.5) + scale_y_log10() + theme_bw()

Well, maybe there is a little upward trend for the open source languages, but not too much seems to have happened over time so far…
