Visualizing PubMed article GitHub repository software information entropy

This notebook explores how software information entropy manifests in a dataset of ~10,000 GitHub repositories linked from PubMed articles.

We identify repositories by querying the PubMed API for GitHub links within article abstracts, then gather data about those repositories using the GitHub API. The code that performs these extractions may be found under the directory: gather-pubmed-repos.

Software entropy measurements are gathered using the notebook: software-information-entropy.ipynb.

We derive software information entropy using methods inspired by "Predicting faults using the complexity of code changes" [SB1]. Within the context of this notebook, software information entropy is normalized across all files within a repository and is computed using the first and latest commits.
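As a rough illustration of the idea, the sketch below computes a normalized Shannon entropy over the lines changed per file, in the spirit of the cited paper. It is a minimal sketch under stated assumptions, not the implementation used in software-information-entropy.ipynb: the function name `normalized_entropy`, the input shape (a mapping of file path to lines changed between two commits), and the choice to normalize by the logarithm of the file count are all illustrative, and the values in this dataset may be computed differently.

```python
import math


def normalized_entropy(lines_changed):
    """Normalized Shannon entropy of changes across files (hypothetical helper).

    lines_changed maps a file path to the number of lines changed between
    two commits. The result lies between 0 and 1.
    """
    n_files = len(lines_changed)
    total = sum(lines_changed.values())
    # with fewer than two files, or no changes, there is no spread to measure
    if n_files < 2 or total == 0:
        return 0.0
    # share of all changed lines attributed to each file that changed
    shares = [count / total for count in lines_changed.values() if count > 0]
    entropy = sum(p * math.log2(1 / p) for p in shares)
    # dividing by log2(n_files) scales the maximum possible entropy to 1
    return entropy / math.log2(n_files)


# changes spread evenly across files give the highest value
print(normalized_entropy({"a.py": 10, "b.py": 10}))
# changes concentrated in a single file give the lowest value
print(normalized_entropy({"a.py": 20, "b.py": 0}))
```

Under this sketch, higher values mean that changes are spread more evenly across a repository's files.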

Project duration, as shown below, is the number of days between the first and latest commit dates.
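The duration calculation can be sketched with the standard library, using two first and latest commit date pairs taken from the table shown later in this notebook. The helper name `project_duration_days` is hypothetical, and the sketch assumes date-only inputs (it ignores the time of day of each commit).

```python
from datetime import date


def project_duration_days(first_commit, latest_commit):
    """Days between two ISO-format commit dates (hypothetical helper)."""
    first = date.fromisoformat(first_commit)
    latest = date.fromisoformat(latest_commit)
    return (latest - first).days


print(project_duration_days("2022-11-08", "2022-11-18"))
print(project_duration_days("2023-03-27", "2023-03-28"))
```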

Data Extraction

This section loads the software entropy measurements and the GitHub-derived data, then merges them into a single table that combines PubMed, GitHub, and Almanack software entropy information for each repository.

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.io as pio

# set plotly default theme
pio.templates.default = "plotly_white"

# read example data which includes pubmed github links detected from article abstracts
df = pd.merge(
    left=pd.read_parquet("repository_analysis_results"),
    right=pd.read_parquet(
        "gather-pubmed-repos/pubmed_github_links_with_github_data.parquet"
    ),
    left_on="Repository URL",
    right_on="github_link",
)
df
| | Repository URL | Normalized Total Entropy | Date of First Commit | Date of Last Commit | Time of Existence (days) | PMID | article_date | title | authors | github_link | ... | GitHub Detected Languages | Date Created | Date Most Recent Commit | Duration Created to Most Recent Commit | Duration Created to Now | Duration Most Recent Commit to Now | Repository Size (KB) | GitHub Repo Archived | total lines of GitHub detected code | Primary language |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://github.com/BUStools/BUSZ-format | 0.000000 | 2022-11-08 | 2022-11-18 | 10.0 | 37129540 | None | BUSZ: compressed BUS files. | Einarsson, Melsted | https://github.com/BUStools/BUSZ-format | ... | {'AGS Script': None, 'AMPL': None, 'ANTLR': No... | 2022-11-08 14:44:25+00:00 | 2022-11-18 15:22:39+00:00 | | | | 5 | False | 0 | None |
| 1 | https://github.com/pmelsted/BUSZ_paper | 0.178897 | 2023-03-27 | 2023-03-28 | 1.0 | 37129540 | None | BUSZ: compressed BUS files. | Einarsson, Melsted | https://github.com/pmelsted/BUSZ_paper | ... | {'AGS Script': None, 'AMPL': None, 'ANTLR': No... | 2023-03-27 13:37:38+00:00 | 2023-03-28 21:04:19+00:00 | | | | 78 | False | 13927 | Python |
| 2 | https://github.com/WormBase/scdefg | 0.153966 | 2021-01-29 | 2022-02-02 | 368.0 | 35814290 | 2022-03-28 | WormBase single-cell tools. | da Veiga Beltrame, Arnaboldi, Sternberg | https://github.com/WormBase/scdefg | ... | {'AGS Script': None, 'AMPL': None, 'ANTLR': No... | 2021-02-24 03:39:24+00:00 | 2022-09-04 00:33:47+00:00 | | | | 136 | False | 37039 | Python |
| 3 | https://github.com/ekg/guix-genomics | 0.176469 | 2020-01-06 | 2024-01-22 | 1476.0 | 36448683 | None | Unbiased pangenome graphs. | Garrison, Guarracino | https://github.com/ekg/guix-genomics | ... | {'AGS Script': None, 'AMPL': None, 'ANTLR': No... | 2020-01-06 21:03:53+00:00 | 2024-01-22 17:24:15+00:00 | | | | 261 | False | 68889 | Scheme |
| 4 | https://github.com/ekg/seqwish | 0.027122 | 2018-06-11 | 2023-12-09 | 2007.0 | 36448683 | None | Unbiased pangenome graphs. | Garrison, Guarracino | https://github.com/ekg/seqwish | ... | {'AGS Script': None, 'AMPL': None, 'ANTLR': No... | 2018-06-11 15:31:26+00:00 | 2024-04-07 20:25:49+00:00 | | | | 883 | False | 179753 | C++ |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 11153 | https://github.com/maxplanck-ie/parkour | 0.000449 | 2016-05-23 | 2022-08-24 | 2284.0 | 30239601 | None | Parkour LIMS: high-quality sample preparation ... | Anatskiy, Ryan, Grüning, Arrigoni, Manke, Bönisch | https://github.com/maxplanck-ie/parkour | ... | {'AGS Script': None, 'AMPL': None, 'ANTLR': No... | 2016-05-23 09:18:45+00:00 | 2022-08-24 16:08:54+00:00 | | | | 69014 | True | 121557916 | JavaScript |
| 11154 | https://github.com/linsalrob/partie | 0.000257 | 2016-09-19 | 2022-10-23 | 2224.0 | 28369246 | None | PARTIE: a partition engine to separate metagen... | Torres, Edwards, McNair | https://github.com/linsalrob/partie | ... | {'AGS Script': None, 'AMPL': None, 'ANTLR': No... | 2016-09-19 17:45:30+00:00 | 2023-02-25 21:59:03+00:00 | | | | 257411 | False | 32264 | Perl |
| 11155 | https://github.com/louzounlab/CountingIsAlmost... | 0.011558 | 2022-06-15 | 2022-11-04 | 141.0 | 36741395 | 2023-01-20 | Counting is almost all you need. | Akerman, Isakov, Levi, Psevkin, Louzoun | https://github.com/louzounlab/CountingIsAlmost... | ... | {'AGS Script': None, 'AMPL': None, 'ANTLR': No... | 2022-08-08 10:09:39+00:00 | 2022-11-04 15:00:17+00:00 | | | | 867622 | False | 118683 | Python |
| 11156 | https://github.com/sysbio-polito/NWN_CElegans_... | 0.002924 | 2021-02-12 | 2021-10-06 | 235.0 | 34765090 | 2021-10-09 | Nets-within-nets for modeling emergent pattern... | Bardini, Benso, Politano, Di Carlo | https://github.com/sysbio-polito/NWN_CElegans_... | ... | {'AGS Script': None, 'AMPL': None, 'ANTLR': No... | 2021-02-12 12:41:36+00:00 | 2021-10-06 07:47:40+00:00 | | | | 2755061 | False | 65375 | Java |
| 11157 | https://github.com/lennylv/DGCddG | 0.014386 | 2021-09-26 | 2024-06-12 | 990.0 | 37018301 | 2023-06-05 | DGCddG: Deep Graph Convolution for Predicting ... | Jiang, Quan, Li, Li, Zhou, Wu, Lyu | https://github.com/lennylv/DGCddG | ... | {'AGS Script': None, 'AMPL': None, 'ANTLR': No... | 2021-09-26 07:58:04+00:00 | 2024-06-12 13:12:07+00:00 | | | | 8046743 | False | 207621 | Python |

11158 rows × 33 columns
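Because `pd.merge` performs an inner join by default, a repository that is missing from either input does not appear in the merged table. One way to check for such rows is sketched below, using small made-up frames in place of the real parquet files; the example URLs and values are hypothetical.

```python
import pandas as pd

# small stand-ins for the two inputs (values are made up for illustration)
entropy = pd.DataFrame(
    {
        "Repository URL": [
            "https://github.com/example/a",
            "https://github.com/example/b",
        ],
        "Normalized Total Entropy": [0.12, 0.03],
    }
)
github = pd.DataFrame(
    {
        "github_link": [
            "https://github.com/example/a",
            "https://github.com/example/c",
        ],
        "GitHub Stars": [5, 12],
    }
)

# an outer join with indicator=True labels which input each row came from
check = pd.merge(
    left=entropy,
    right=github,
    left_on="Repository URL",
    right_on="github_link",
    how="outer",
    indicator=True,
)

# "both" rows survive an inner join; "left_only" and "right_only" rows do not
print(check["_merge"].value_counts())
```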

What languages are used within PubMed article repositories?

This section shows the 10 most common primary languages among repositories in the dataset. A repository's primary language is the language with the most lines of code in that repository.

language_grouped_data = (
    df.groupby(["Primary language"]).size().reset_index(name="Count")
)

# Create a horizontal bar chart
fig_languages = px.bar(
    language_grouped_data.sort_values(by="Count")[-10:],
    y="Primary language",
    x="Count",
    text="Count",
    orientation="h",
    width=700,
    height=400,
    title="Primary language count across all repositories",
)

fig_languages.show()
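The primary-language rule described above can be sketched as a small function over a mapping like the one in the "GitHub Detected Languages" column, where undetected languages carry `None`. The helper name `primary_language` and the line counts are hypothetical; this is an illustration of the rule, not the code that produced the column.

```python
def primary_language(detected_languages):
    """Return the language with the most lines of code (hypothetical helper)."""
    # keep only languages with a positive line count; None means not detected
    counted = {
        lang: lines for lang, lines in detected_languages.items() if lines
    }
    if not counted:
        return None
    return max(counted, key=counted.get)


print(primary_language({"AMPL": None, "Python": 13000, "Shell": 927}))
print(primary_language({"AMPL": None, "ANTLR": None}))
```

Repositories with no detected code have no primary language under this rule, and `groupby` leaves such rows out of the counts by default.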

How is software entropy different across primary languages?

This section explores how software entropy differs across primary languages. The plot is limited to repositories whose primary language is among the five most common in the dataset.

fig = px.scatter(
    df[
        df["Primary language"].isin(
            language_grouped_data.sort_values(by="Count")[-5:]["Primary language"]
        )
    ],
    x="Time of Existence (days)",
    y="Normalized Total Entropy",
    hover_data=["Repository URL"],
    width=700,
    height=400,
    title="Software entropy over time for PubMed GitHub repositories",
    marginal_x="rug",
    marginal_y="box",
    opacity=0.5,
    color="Primary language",
)

fig.update_layout(
    font=dict(size=13),
    title={"yref": "container", "y": 0.8, "yanchor": "bottom"},
    xaxis_title="Project duration (days)",
    yaxis_title="Software entropy",
)
fig.write_image("images/software-information-entropy-top-5-langs.png")
fig.show()
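A numeric summary can complement the box plots in the margin of the figure above. The sketch below computes the count and median entropy per language for the most common languages, using a small made-up frame rather than the real data; the values are hypothetical and only the column names match the dataset.

```python
import pandas as pd

# made-up values for illustration; column names match the merged table
example = pd.DataFrame(
    {
        "Primary language": ["Python", "Python", "Python", "R", "R", "Perl"],
        "Normalized Total Entropy": [0.10, 0.20, 0.30, 0.02, 0.04, 0.50],
    }
)

# the two most common languages in the made-up data
top_languages = example["Primary language"].value_counts().nlargest(2).index

# count and median entropy for each of those languages
summary = (
    example[example["Primary language"].isin(top_languages)]
    .groupby("Primary language")["Normalized Total Entropy"]
    .agg(["count", "median"])
)
print(summary)
```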

What is the relationship between GitHub Stars and Forks for repositories?

We next explore how GitHub Stars and Forks are related within the repositories.

fig = px.scatter(
    df,
    x="GitHub Stars",
    y="GitHub Forks",
    hover_data=["Repository URL"],
    width=700,
    height=400,
    title="Stars and forks for PubMed GitHub repositories",
    opacity=0.5,
    log_y=True,
    log_x=True,
)

fig.update_layout(
    font=dict(size=13),
    title={"yref": "container", "y": 0.8, "yanchor": "bottom"},
    xaxis_title="GitHub Stars",
    yaxis_title="GitHub Forks",
)
fig.write_image("images/pubmed-stars-and-forks.png")
fig.show()
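One way to put a number on the relationship shown above is a rank correlation, which is unaffected by log scaling and keeps repositories with zero stars or forks (values that a log axis cannot display). The sketch below uses made-up counts and is an optional complement to the plot, not part of the original analysis.

```python
import pandas as pd

# made-up counts for illustration; column names match the merged table
example = pd.DataFrame(
    {"GitHub Stars": [0, 3, 10, 250, 40], "GitHub Forks": [0, 1, 4, 60, 9]}
)

# Spearman rank correlation is the Pearson correlation of the ranks
rank_corr = example["GitHub Stars"].rank().corr(example["GitHub Forks"].rank())
print(rank_corr)
```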

How do GitHub Stars, software entropy, and time relate?

The next section explores how GitHub Stars, software entropy, and time relate to one another.

# log-transform star counts; zero counts become missing values
# because log(0) is undefined
df["GitHub Stars (log)"] = np.log(
    df["GitHub Stars"].where(df["GitHub Stars"] > 0)
)

fig = px.scatter(
    df.dropna(subset="GitHub Stars (log)").sort_values(by="GitHub Stars (log)"),
    x="Time of Existence (days)",
    y="Normalized Total Entropy",
    hover_data=["Repository URL"],
    width=700,
    height=400,
    title="Software entropy over time for PubMed GitHub repositories",
    marginal_x="histogram",
    marginal_y="histogram",
    opacity=0.5,
    color="GitHub Stars (log)",
)

fig.update_layout(
    font=dict(size=13),
    title={"yref": "container", "y": 0.8, "yanchor": "bottom"},
    xaxis_title="Project duration (days)",
    yaxis_title="Software entropy",
)
fig.write_image("images/software-information-entropy-gh-stars.png")
fig.show()

How do GitHub Forks, software entropy, and time relate?

The next section explores how GitHub Forks, software entropy, and time relate to one another.

# log-transform fork counts; zero counts become missing values
# because log(0) is undefined
df["GitHub Forks (log)"] = np.log(
    df["GitHub Forks"].where(df["GitHub Forks"] > 0)
)

fig = px.scatter(
    df.dropna(subset="GitHub Forks (log)").sort_values(by="GitHub Forks (log)"),
    x="Time of Existence (days)",
    y="Normalized Total Entropy",
    hover_data=["Repository URL"],
    width=700,
    height=400,
    title="Software entropy over time for PubMed GitHub repositories",
    marginal_x="histogram",
    marginal_y="histogram",
    opacity=0.5,
    color="GitHub Forks (log)",
    color_continuous_scale=px.colors.sequential.haline,
)

fig.update_layout(
    font=dict(size=13),
    title={"yref": "container", "y": 0.8, "yanchor": "bottom"},
    xaxis_title="Project duration (days)",
    yaxis_title="Software entropy",
)
fig.write_image("images/software-information-entropy-forks.png")
fig.show()

What is the relationship between GitHub Stars and Open Issues for the repositories?

Below we explore how GitHub Stars and Open Issues are related for the repositories.

# log-transform open issue counts; zero counts become missing values
# because log(0) is undefined
df["GitHub Open Issues (log)"] = np.log(
    df["GitHub Open Issues"].where(df["GitHub Open Issues"] > 0)
)
fig = px.scatter(
    df,
    x="GitHub Stars",
    y="GitHub Open Issues",
    hover_data=["Repository URL"],
    width=700,
    height=400,
    title="Stars and open issues for PubMed GitHub repositories",
    opacity=0.5,
    log_y=True,
    log_x=True,
)

fig.update_layout(
    font=dict(size=13),
    title={"yref": "container", "y": 0.8, "yanchor": "bottom"},
    xaxis_title="GitHub Stars (log)",
    yaxis_title="GitHub Open Issues (log)",
)
fig.write_image("images/pubmed-stars-and-open-issues.png")
fig.show()

How do software entropy, time, and GitHub issues relate to one another?

The next section visualizes how software entropy, time, and GitHub issues relate to one another.

# log-transform open issue counts; zero counts become missing values
# because log(0) is undefined
df["GitHub Open Issues (log)"] = np.log(
    df["GitHub Open Issues"].where(df["GitHub Open Issues"] > 0)
)

fig = px.scatter(
    df.dropna(subset="GitHub Open Issues (log)").sort_values(
        by="GitHub Open Issues (log)"
    ),
    x="Time of Existence (days)",
    y="Normalized Total Entropy",
    hover_data=["Repository URL"],
    width=700,
    height=400,
    title="Software entropy over time for PubMed GitHub repositories",
    marginal_x="histogram",
    marginal_y="histogram",
    opacity=0.5,
    color="GitHub Open Issues (log)",
    color_continuous_scale=px.colors.sequential.thermal,
)

fig.update_layout(
    font=dict(size=13),
    title={"yref": "container", "y": 0.8, "yanchor": "bottom"},
    xaxis_title="Project duration (days)",
    yaxis_title="Software entropy",
)
fig.write_image("images/software-information-entropy-open-issues.png")
fig.show()

[SB1] Ahmed E. Hassan. Predicting faults using the complexity of code changes. In 2009 IEEE 31st International Conference on Software Engineering, 78–88. Vancouver, BC, Canada, 2009. IEEE. URL: http://ieeexplore.ieee.org/document/5070510 (visited on 2024-07-15), doi:10.1109/ICSE.2009.5070510.