{ "cells": [ { "cell_type": "markdown", "id": "800b01b5-4958-463c-a808-7e6a811cc7a4", "metadata": {}, "source": [ "# Visualizing PubMed article GitHub repository software information entropy\n", "\n", "This notebook examines how software information entropy manifests in a dataset of ~10,000 GitHub repositories referenced by PubMed articles.\n", "\n", "We identify the repositories by querying the PubMed API for GitHub links within article abstracts.\n", "GitHub data about these repositories is gathered using the GitHub API.\n", "The code that performs the data extraction is in the directory [gather-pubmed-repos](https://github.com/software-gardening/almanack/tree/main/src/book/seed-bank/pubmed-github-repositories/gather-pubmed-repos/).\n", "\n", "Software entropy measurements are gathered using the notebook [software-information-entropy.ipynb](https://github.com/software-gardening/almanack/tree/main/src/book/seed-bank/pubmed-github-repositories/software-information-entropy.ipynb).\n", "\n", "We derive software information entropy using methods inspired by _Predicting faults using the complexity of code changes_ {cite:p}`hassan_predicting_2009`.\n", "Within this notebook, software information entropy is normalized across all files in a repository, using the first and latest commits.\n", "\n", "Project duration, as reported below, is the number of days between the first and latest commit dates." ] }, { "cell_type": "markdown", "id": "44ec4b0f-0260-4a79-8432-146798060cef", "metadata": {}, "source": [ "## Data Extraction\n", "\n", "This section extracts the software entropy measurements and the GitHub-derived data.\n", "We then merge them into a single table that combines PubMed, GitHub, and Almanack software entropy information for the repositories." ] }, { "cell_type": "code", "execution_count": 19, "id": "fa03047f-fcef-4f3a-a389-9aead0a7a081", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "| | Repository URL | Normalized Total Entropy | Date of First Commit | Date of Last Commit | Time of Existence (days) | PMID | article_date | title | authors | github_link | ... | GitHub Detected Languages | Date Created | Date Most Recent Commit | Duration Created to Most Recent Commit | Duration Created to Now | Duration Most Recent Commit to Now | Repository Size (KB) | GitHub Repo Archived | total lines of GitHub detected code | Primary language |\n", "
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n", "
| 0 | https://github.com/BUStools/BUSZ-format | 0.000000 | 2022-11-08 | 2022-11-18 | 10.0 | 37129540 | None | BUSZ: compressed BUS files. | Einarsson, Melsted | https://github.com/BUStools/BUSZ-format | ... | {'AGS Script': None, 'AMPL': None, 'ANTLR': No... | 2022-11-08 14:44:25+00:00 | 2022-11-18 15:22:39+00:00 | | | | 5 | False | 0 | None |\n", "
| 1 | https://github.com/pmelsted/BUSZ_paper | 0.178897 | 2023-03-27 | 2023-03-28 | 1.0 | 37129540 | None | BUSZ: compressed BUS files. | Einarsson, Melsted | https://github.com/pmelsted/BUSZ_paper | ... | {'AGS Script': None, 'AMPL': None, 'ANTLR': No... | 2023-03-27 13:37:38+00:00 | 2023-03-28 21:04:19+00:00 | | | | 78 | False | 13927 | Python |\n", "
| 2 | https://github.com/WormBase/scdefg | 0.153966 | 2021-01-29 | 2022-02-02 | 368.0 | 35814290 | 2022-03-28 | WormBase single-cell tools. | da Veiga Beltrame, Arnaboldi, Sternberg | https://github.com/WormBase/scdefg | ... | {'AGS Script': None, 'AMPL': None, 'ANTLR': No... | 2021-02-24 03:39:24+00:00 | 2022-09-04 00:33:47+00:00 | | | | 136 | False | 37039 | Python |\n", "
| 3 | https://github.com/ekg/guix-genomics | 0.176469 | 2020-01-06 | 2024-01-22 | 1476.0 | 36448683 | None | Unbiased pangenome graphs. | Garrison, Guarracino | https://github.com/ekg/guix-genomics | ... | {'AGS Script': None, 'AMPL': None, 'ANTLR': No... | 2020-01-06 21:03:53+00:00 | 2024-01-22 17:24:15+00:00 | | | | 261 | False | 68889 | Scheme |\n", "
| 4 | https://github.com/ekg/seqwish | 0.027122 | 2018-06-11 | 2023-12-09 | 2007.0 | 36448683 | None | Unbiased pangenome graphs. | Garrison, Guarracino | https://github.com/ekg/seqwish | ... | {'AGS Script': None, 'AMPL': None, 'ANTLR': No... | 2018-06-11 15:31:26+00:00 | 2024-04-07 20:25:49+00:00 | | | | 883 | False | 179753 | C++ |\n", "
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |\n", "
| 11153 | https://github.com/maxplanck-ie/parkour | 0.000449 | 2016-05-23 | 2022-08-24 | 2284.0 | 30239601 | None | Parkour LIMS: high-quality sample preparation ... | Anatskiy, Ryan, Grüning, Arrigoni, Manke, Bönisch | https://github.com/maxplanck-ie/parkour | ... | {'AGS Script': None, 'AMPL': None, 'ANTLR': No... | 2016-05-23 09:18:45+00:00 | 2022-08-24 16:08:54+00:00 | | | | 69014 | True | 121557916 | JavaScript |\n", "
| 11154 | https://github.com/linsalrob/partie | 0.000257 | 2016-09-19 | 2022-10-23 | 2224.0 | 28369246 | None | PARTIE: a partition engine to separate metagen... | Torres, Edwards, McNair | https://github.com/linsalrob/partie | ... | {'AGS Script': None, 'AMPL': None, 'ANTLR': No... | 2016-09-19 17:45:30+00:00 | 2023-02-25 21:59:03+00:00 | | | | 257411 | False | 32264 | Perl |\n", "
| 11155 | https://github.com/louzounlab/CountingIsAlmost... | 0.011558 | 2022-06-15 | 2022-11-04 | 141.0 | 36741395 | 2023-01-20 | Counting is almost all you need. | Akerman, Isakov, Levi, Psevkin, Louzoun | https://github.com/louzounlab/CountingIsAlmost... | ... | {'AGS Script': None, 'AMPL': None, 'ANTLR': No... | 2022-08-08 10:09:39+00:00 | 2022-11-04 15:00:17+00:00 | | | | 867622 | False | 118683 | Python |\n", "
| 11156 | https://github.com/sysbio-polito/NWN_CElegans_... | 0.002924 | 2021-02-12 | 2021-10-06 | 235.0 | 34765090 | 2021-10-09 | Nets-within-nets for modeling emergent pattern... | Bardini, Benso, Politano, Di Carlo | https://github.com/sysbio-polito/NWN_CElegans_... | ... | {'AGS Script': None, 'AMPL': None, 'ANTLR': No... | 2021-02-12 12:41:36+00:00 | 2021-10-06 07:47:40+00:00 | | | | 2755061 | False | 65375 | Java |\n", "
| 11157 | https://github.com/lennylv/DGCddG | 0.014386 | 2021-09-26 | 2024-06-12 | 990.0 | 37018301 | 2023-06-05 | DGCddG: Deep Graph Convolution for Predicting ... | Jiang, Quan, Li, Li, Zhou, Wu, Lyu | https://github.com/lennylv/DGCddG | ... | {'AGS Script': None, 'AMPL': None, 'ANTLR': No... | 2021-09-26 07:58:04+00:00 | 2024-06-12 13:12:07+00:00 | | | | 8046743 | False | 207621 | Python |\n", "
11158 rows × 33 columns
\n", "