We cloned the project repository from GitHub, working with the latest commit (fd9822d). We created a Conda environment with the same Python version (3.7.3) as specified in the paper's Reporting Summary. As the exact PyStan version was not mentioned in the Reporting Summary, we first tried to run the code with a recent version (3+). However, the Jupyter Notebook for the statistical analysis failed to execute with that version. We then saw that one of the references in the paper cites PyStan 2.17.1.0. After we installed that version, the analysis notebook ran successfully.
In parallel, Samuel created a reproducible environment using a Docker container. The forked repository can be found at:
I have a high degree of familiarity with the Jupyter environment and the Python data science stack. I have never used PyStan before.
Samuel is actively engaged in Bayesian statistics and reproducible research.
PyStan 2.17.1.0
Python 3.7.3
Docker
Additionally, in my local environment:
NumPy 1.21.5
Pandas 1.3.5
Matplotlib 3.5.1
Seaborn 0.11.2 (for the figures in import_data.ipynb)
Jupyter Lab 3.4.2
The Jupyter Notebook that creates the figures (tables_figures.ipynb) raised a pandas AttributeError. The notebook ran successfully after we made the following two adjustments, each replacing a deletion of the index name with a call to rename_axis():
Line 911 in function subset_df() in utils.py:
# del df.index.name
df = df.rename_axis(None, axis=0)
Line 118 in function image_impact() in paper.py:
# del out.index.name
out = out.rename_axis(None, axis=0)
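The two adjustments above follow the same pattern, which can be sketched on a small example. The DataFrame below is made up for illustration and does not come from the project's data. It assumes a pandas version like the one in our environment (1.3.5), where deleting the index name is no longer allowed.

```python
import pandas as pd

# Hypothetical stand-in for the frames handled in utils.py and paper.py.
df = pd.DataFrame(
    {"value": [1, 2]},
    index=pd.Index(["a", "b"], name="country"),
)

# Original code: `del df.index.name`
# In our environment this raised an AttributeError.

# Replacement: rename_axis(None) returns a new frame whose index has no name.
cleaned = df.rename_axis(None, axis=0)

print(cleaned.index.name)  # the index name is now None
print(df.index.name)       # the original frame keeps its index name
```

Because rename_axis() returns a new object, the result has to be assigned back, as in the two adjusted lines.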
We did not encounter any errors in the two other notebooks (statistical_analysis.ipynb and import_data.ipynb). However, we did not manage to run the full statistical analysis within the allotted time of the ReproHack, as fitting the models takes a lot of computing time. The models up to and including social media use were completed without errors. The model on trusted information sources had not finished after roughly two hours on a six-core laptop (Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz). Notably, no diagnostic information was printed in my notebook while the models were being fitted.
Creating a Docker environment and forking the repository allows other users to easily recreate our reproduction attempt.
I chose the option "Partially Reproducible" because we could not run the full analysis within the given time frame. Since we did not run into any other major issues that we couldn't fix ourselves, it is still possible that the results are fully reproducible.
We could not find the exact package versions used in the analysis, either in the paper's Reporting Summary or in the README.md. This is quite crucial, as the analysis notebook doesn't run out of the box with newer versions of PyStan (notably, even the import statement has changed between versions 2 and 3). The AttributeError mentioned earlier was also likely the result of a different pandas version.
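To illustrate the import change mentioned above: PyStan 2 is imported as `pystan`, whereas PyStan 3 is imported as `stan`. The following is a minimal sketch of a check that could be placed at the top of a notebook. The function name and return labels are our own and not part of either library, and the check assumes that no unrelated module named `stan` is installed.

```python
import importlib.util


def detect_pystan_interface():
    """Report which PyStan import name is available (hypothetical helper).

    PyStan 2: `import pystan`, models built with pystan.StanModel(model_code=...)
    PyStan 3: `import stan`, models built with stan.build(program_code, data=...)
    """
    if importlib.util.find_spec("pystan") is not None:
        return "pystan2"
    if importlib.util.find_spec("stan") is not None:
        return "pystan3"
    return None


interface = detect_pystan_interface()
if interface != "pystan2":
    print("This analysis was run with PyStan 2.17.1.0; found:", interface)
```

A check like this only reports the mismatch. It does not make code written for version 2 run under version 3.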
Besides specifying the exact package versions in the paper's Reporting Summary, it would be beneficial to always include something like a Conda environment.yml file or a Dockerfile as part of the Git repository.
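As a rough sketch of how such a file could be produced from a working environment, the script below writes environment.yml content with pinned versions. The helper names and the environment name are hypothetical. It assumes that each package exposes a `__version__` attribute and that its import name matches its name on PyPI, which holds for the packages listed in this review but not in general.

```python
import importlib
import platform


def installed_version(module_name):
    """Return the module's __version__, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def environment_yml(env_name, module_names):
    """Build environment.yml text that pins Python and the given packages."""
    pins = []
    for name in module_names:
        version = installed_version(name)
        if version is not None:
            pins.append("      - {}=={}".format(name, version))

    lines = [
        "name: " + env_name,
        "dependencies:",
        "  - python=" + platform.python_version(),
    ]
    if pins:
        # Packages are pinned through pip to avoid assuming a Conda channel.
        lines += ["  - pip", "  - pip:"] + pins
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    packages = ["pystan", "numpy", "pandas", "matplotlib", "seaborn"]
    print(environment_yml("reprohack", packages))
```

For an existing Conda environment, the command `conda env export` gives a more complete listing. The sketch is only meant to show how little information is needed to pin the core dependencies.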
The study design, mathematical details, and general flow of the notebooks are well documented. The analysis notebook can be read as a self-contained summary of the paper.