Review of
"Computational identification and experimental characterization of preferred downstream positions in human core promoters"

Submitted by mberkeley  

June 24, 2024, 2:23 p.m.

Lead reviewer

mberkeley

Review Body

Reproducibility

Did you manage to reproduce it?
Partially Reproducible
Reproducibility rating
How much of the paper did you manage to reproduce?
5 / 10
Briefly describe the procedure followed/tools used to reproduce it

I downloaded and installed RStudio to replicate the figures.

I also set up a project in Renku as an independent check to ensure the difficulties I had were not due to my own system.

Briefly describe your familiarity with the procedure/tools used by the paper.

Not very familiar. I have used R in the past, but not regularly. Fortunately, the R code is easy to follow.

Which type of operating system were you working in?
Apple Operating System (macOSX)
What additional software did you need to install?

I installed RStudio (version 2024.04.2+764) and used Rscript version 3.4.0. I also had to install the corrplot package through RStudio, and then the seqLogo package by following the instructions here: https://bioconductor.org/packages/release/bioc/html/seqLogo.html

What software did you use

RStudio

Renku

What were the main challenges you ran into (if any)?

Figure 2: the code references only the dm6 matrices and not the hg19 matrices, so when I ran the fig2.R script, three of the six plots were incorrect.

Figure 5: the figure in the paper does not correspond to Fig5 in the repository. The correlation plot is the same as the one generated in Fig4, and two additional bar plots are created that do not appear to be in the paper.

In the clusteringAlgorithms directory, the driver.sh script never completes; specifically, the seq2mono.pl calls appear to hang. According to top, they use no CPU and no memory, and apart from creating an empty .dat output file, nothing appears to happen. I left the script running for several hours, both on my system and on Renku, with no result.
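To make a stalled step like this fail quickly instead of running for hours, the call can be wrapped in a timeout. The following is a minimal sketch only, not part of the repository: the helper name `run_with_timeout` is hypothetical, and I have not confirmed why seq2mono.pl stalls. The sketch closes standard input so that a program waiting for input receives end-of-file instead of blocking; whether that applies to seq2mono.pl is an assumption.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd (a list of strings) and report whether it finished in time.

    Returns ("finished", returncode) or ("timed_out", None).
    """
    try:
        done = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,  # a program waiting on stdin gets EOF at once
            capture_output=True,       # keep stdout/stderr for later inspection
            timeout=timeout_s,         # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return ("timed_out", None)
    return ("finished", done.returncode)


# Example with a stand-in command that sleeps longer than the timeout:
# run_with_timeout([sys.executable, "-c", "import time; time.sleep(30)"], 1)
```

If a call that previously hung finishes immediately when run this way, that would suggest it was waiting for input that was never supplied.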

What were the positive features of this approach?

Generally well documented repository, with helpful README files at each level.

Any other comments/suggestions on the reproducibility approach?

Test the scripts provided and check that the output corresponds to the figures in the paper.
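As a rough illustration of the simplest such check, the sketch below reports expected output files that are missing or empty after a script has run. The helper name `missing_or_empty` and the file names in the usage comment are hypothetical; the real list of outputs would come from the repository's README files. It does not compare plot contents with the paper, which would still need a visual check.

```python
from pathlib import Path


def missing_or_empty(paths):
    """Return the given paths that do not exist as files or have zero size."""
    problems = []
    for p in map(Path, paths):
        if not p.is_file() or p.stat().st_size == 0:
            problems.append(str(p))
    return problems


# Hypothetical usage after running a figure script:
# bad = missing_or_empty(["fig2.pdf", "clusters.dat"])
# if bad:
#     raise SystemExit("missing or empty outputs: " + ", ".join(bad))
```

A check of this kind would have flagged the empty .dat file left behind by the stalled seq2mono.pl calls.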

If the long clustering scripts can be run in parallel, please indicate this or provide a --threads or --cpus option.
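A minimal sketch of what such an option could look like in a small driver is given below. It assumes the individual clustering commands are independent of one another, which I have not verified, and the function names `parse_args` and `run_all` are hypothetical rather than taken from the repository.

```python
import argparse
import subprocess
from concurrent.futures import ThreadPoolExecutor


def parse_args(argv=None):
    """Parse a --threads option; defaults to running one command at a time."""
    parser = argparse.ArgumentParser(
        description="Run independent commands, optionally in parallel."
    )
    parser.add_argument(
        "--threads", type=int, default=1,
        help="number of commands to run at once (default: 1)",
    )
    return parser.parse_args(argv)


def run_all(commands, threads):
    """Run each command (a list of strings); return exit codes in input order."""
    def run_one(cmd):
        return subprocess.run(cmd).returncode

    # Threads suffice here because each worker only waits on a child process.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(run_one, commands))
```

Keeping the default at one thread preserves the current serial behaviour for users who do not pass the option.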


Documentation

Documentation rating
How well was the material documented?
9 / 10
How could the documentation be improved?

It is very good in general. The top-level README could include more information on the order in which to run the various scripts and on how the subdirectories tie together.

What do you like about the documentation?

I liked the modular approach to documentation, where each subfolder had its own README with instructions on how to reproduce just the data in the subfolder.

After attempting to reproduce, how familiar do you feel with the code and methods used in the paper?
7 / 10
Any suggestions on how the analysis could be made more transparent?

I don't use R regularly, but I could understand the R scripts. The Perl scripts are harder to understand for someone who doesn't use Perl. Perhaps explain what the Perl commands are doing.


Reusability

Reusability rating
Rate the project on reusability of the material
6 / 10
Permissive Data license included:  
Permissive Code license included:  

Any suggestions on how the project could be more reusable?


Any final comments

Nice repository structure. Clearly a lot of effort has been put into making the work reproducible. I think it needs a little more testing to iron out the remaining issues.