Undersatanding PubMed GitHub repositories with the Software Gardening Almanack#

The content below seeks to better understand a dataset of ~10,000 PubMed article GitHub repositories using the Software Gardening Almanack.

PubMed GitHub repositories are extracted using the PubMed API to query for GitHub links within article abstracts. GitHub data about these repositories is gathered using the GitHub API. The code to perform data extractions may be found under the directory: gather-pubmed-repos .

These repositories are then processed using the Software Gardening Almanack to contextualize how they compare to one another.

Data Extraction#

The following section extracts data which includes software entropy and GitHub-derived data. We merge the data to form a table which includes PubMed, GitHub, and Almanack software entropy information on the repositories.

import pathlib

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.io as pio
from IPython.display import Image, display

from almanack import process_repositories_batch

# set plotly default theme
pio.templates.default = "plotly_white"

# set almanack data directory
ALMANACK_DATA_DIR = "repository_analysis_results"
# if we don't already have a data dir, batch process almanack data
if not pathlib.Path(ALMANACK_DATA_DIR).is_dir():
    process_repositories_batch(
        # read pre-collected pubmeb github repos
        repo_urls=pd.read_parquet(
            path="gather-pubmed-repos/pubmed_github_links_with_github_data.parquet",
            columns=["github_link"],
        )
        .dropna()
        .github_link.tolist()[1:20],
        output_path=ALMANACK_DATA_DIR,
        split_batches=True,
        batch_size=500,
        max_workers=8,
        collect_dataframe=False,
        show_repo_progress=False,
        show_batch_progress=True,
    )

# read the dataset as a concenated dataframe from parquet chunks
df = pd.concat(
    [
        pd.read_parquet(pq_file)
        for pq_file in pathlib.Path(ALMANACK_DATA_DIR).glob("*.parquet")
    ],
    ignore_index=True,
    sort=False,
)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9631 entries, 0 to 9630
Data columns (total 62 columns):
 #   Column                                                    Non-Null Count  Dtype  
---  ------                                                    --------------  -----  
 0   Repository URL                                            9631 non-null   object 
 1   repo-path                                                 9631 non-null   object 
 2   repo-commits                                              9587 non-null   float64
 3   repo-file-count                                           9587 non-null   float64
 4   repo-commit-time-range                                    9631 non-null   object 
 5   repo-days-of-development                                  9587 non-null   float64
 6   repo-commits-per-day                                      9587 non-null   float64
 7   almanack-table-datetime                                   9631 non-null   object 
 8   almanack-version                                          9631 non-null   object 
 9   repo-primary-language                                     5789 non-null   object 
 10  repo-primary-license                                      3503 non-null   object 
 11  repo-doi                                                  125 non-null    object 
 12  repo-doi-publication-date                                 107 non-null    object 
 13  repo-doi-fwci                                             46 non-null     float64
 14  repo-doi-is-not-retracted                                 75 non-null     float64
 15  repo-doi-grants-count                                     84 non-null     float64
 16  repo-doi-grants                                           0 non-null      object 
 17  repo-includes-readme                                      9587 non-null   float64
 18  repo-includes-contributing                                9587 non-null   float64
 19  repo-includes-code-of-conduct                             9587 non-null   float64
 20  repo-includes-license                                     9587 non-null   float64
 21  repo-is-citable                                           9587 non-null   float64
 22  repo-default-branch-not-master                            9587 non-null   float64
 23  repo-includes-common-docs                                 9587 non-null   float64
 24  repo-unique-contributors                                  9587 non-null   float64
 25  repo-unique-contributors-past-year                        9587 non-null   float64
 26  repo-unique-contributors-past-182-days                    9587 non-null   float64
 27  repo-tags-count                                           9587 non-null   float64
 28  repo-tags-count-past-year                                 9587 non-null   float64
 29  repo-tags-count-past-182-days                             9587 non-null   float64
 30  repo-stargazers-count                                     6024 non-null   float64
 31  repo-uses-issues                                          6024 non-null   float64
 32  repo-issues-open-count                                    6024 non-null   float64
 33  repo-pull-requests-enabled                                6024 non-null   float64
 34  repo-forks-count                                          6024 non-null   float64
 35  repo-subscribers-count                                    6024 non-null   float64
 36  repo-packages-ecosystems-count                            9587 non-null   float64
 37  repo-packages-versions-count                              9587 non-null   float64
 38  repo-social-media-platforms-count                         9295 non-null   float64
 39  repo-doi-valid-format                                     90 non-null     float64
 40  repo-doi-https-resolvable                                 84 non-null     float64
 41  repo-doi-cited-by-count                                   75 non-null     float64
 42  repo-days-between-doi-publication-date-and-latest-commit  75 non-null     float64
 43  repo-gh-workflow-success-ratio                            463 non-null    float64
 44  repo-gh-workflow-succeeding-runs                          463 non-null    float64
 45  repo-gh-workflow-failing-runs                             463 non-null    float64
 46  repo-gh-workflow-queried-total                            463 non-null    float64
 47  repo-code-coverage-percent                                0 non-null      object 
 48  repo-date-of-last-coverage-run                            0 non-null      object 
 49  repo-days-between-last-coverage-run-latest-commit         0 non-null      object 
 50  repo-code-coverage-total-lines                            0 non-null      object 
 51  repo-code-coverage-executed-lines                         0 non-null      object 
 52  repo-agg-info-entropy                                     9587 non-null   float64
 53  repo-almanack-score_json                                  9631 non-null   object 
 54  repo-packages-ecosystems_json                             9631 non-null   object 
 55  repo-social-media-platforms_json                          9631 non-null   object 
 56  checks_total                                              9587 non-null   float64
 57  checks_passed                                             9587 non-null   float64
 58  checks_pct                                                9587 non-null   float64
 59  repo-social-media-platforms                               0 non-null      float64
 60  repo-doi-grants_json                                      4200 non-null   object 
 61  almanack_error                                            6000 non-null   object 
dtypes: float64(42), object(20)
memory usage: 4.6+ MB
/tmp/ipykernel_3431/2236761407.py:21: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  df = pd.concat(

How many repositories are dormant?#

Software must evolve continuously to remain stable, useful, and aligned with current needs. The following section explores how repository dormancy appears within the dataset based on the date of most recent commit. Sustained, incremental updates help prevent this stagnation and ensure that a project continues to serve its users effectively.

# cleanup: drop columns or rows whose values are all empty
df = df.dropna(axis=1, how="all")
df = df.dropna(axis=0, how="all")
# gather repo commit start and end dates
df[["repo-commit-start-date", "repo-commit-end-date"]] = df[
    "repo-commit-time-range"
].str.extract(r"\('([^']+)', '([^']+)'\)")
# turn these into datetimes for datetime calcs
df[["repo-commit-start-date", "repo-commit-end-date"]] = df[
    ["repo-commit-start-date", "repo-commit-end-date"]
].apply(pd.to_datetime)
# set a cuttoff of 1 year for dormancy
cutoff = pd.Timestamp.now() - pd.DateOffset(years=1)
df["dormant"] = df["repo-commit-end-date"] <= cutoff
# show how many dormant repos there are in this dataset
df["dormant"].value_counts()
dormant
True     8288
False    1343
Name: count, dtype: int64
df["repo-days-of-existence"] = (
    pd.Timestamp.now() - df["repo-commit-start-date"]
).dt.days
df["repo-days-inactive"] = (pd.Timestamp.now() - df["repo-commit-end-date"]).dt.days
fig = px.ecdf(
    df,
    x="repo-days-inactive",
    title="CDF of repository inactivity by last commit date",
    width=1200,
    height=700,
)
fig.write_image(
    image_file := "images/cdf_repository_inactivity_by_last_commit_date.png"
)
display(Image(filename=image_file))
../../_images/ebbece1e6b0c3585a7959640dba4b5e5de98a8b923dd03b914a60426ff72cd73.png

What languages are used within PubMed article repositories?#

The following section observes the top 10 languages which are used in repositories from the dataset. Primary language is determined as the language which has the most lines of code within a repository.

# Count dormant & active repos per language
lang_status = (
    df.groupby(["repo-primary-language", "dormant"]).size().reset_index(name="Count")
)

# Map boolean to readable labels
lang_status["Status"] = lang_status["dormant"].map(
    {
        True: "Dormant (no commits >= 1 year)",
        False: "Active (< 1 year since last commit)",
    }
)

# Total repos per language
total_counts = lang_status.groupby("repo-primary-language")["Count"].sum()

# Top 10 languages by total repo count (sorted)
top_langs = total_counts.sort_values().tail(10)

# Filter to top languages
plot_data = lang_status[
    lang_status["repo-primary-language"].isin(top_langs.index)
].copy()

# Ensure y-axis follows the sorted order by total count
lang_order = top_langs.index.tolist()
plot_data["repo-primary-language"] = pd.Categorical(
    plot_data["repo-primary-language"],
    categories=lang_order,
    ordered=True,
)

# Stacked horizontal bar chart: Active (green) + Dormant (red)
fig = px.bar(
    plot_data,
    y="repo-primary-language",
    x="Count",
    color="Status",
    orientation="h",
    color_discrete_map={
        "Active (< 1 year since last commit)": "#2ca02c",
        "Dormant (no commits >= 1 year)": "#d62728",
    },
    width=1200,
    height=700,
    title="Primary language count (active vs dormant repos)",
)

fig.update_layout(
    barmode="stack",
    xaxis_title="Repository count",
    yaxis_title="Primary language",
    legend_title="Repository status",
    yaxis=dict(categoryorder="array", categoryarray=lang_order),
)


fig.write_image(image_file := "images/primary-language-counts.png")
display(Image(filename=image_file))
../../_images/2cf8806b96180df3848d1e435561d8c988cb472d0a81d2b6a4acf16a9eaf9a9b.png

How is software entropy different across primary languages?#

The following section explores how software entropy manifests differently across different primary languages for repositories.

language_grouped_data = (
    df.groupby(["repo-primary-language"]).size().reset_index(name="Count")
)

fig = px.scatter(
    df[
        df["repo-primary-language"].isin(
            language_grouped_data
            # exclude HTML to focus on formal programming languages
            .loc[
                ~language_grouped_data["repo-primary-language"].str.contains(
                    "html", case=False, na=False
                )
            ].sort_values(by="Count")[-5:]["repo-primary-language"]
        )
    ].sort_values(by="repo-primary-language"),
    x="repo-days-of-existence",
    y="repo-agg-info-entropy",
    hover_data=["Repository URL"],
    width=1200,
    height=700,
    title="Software entropy over time for PubMed GitHub repositories",
    marginal_x="rug",
    marginal_y="box",
    opacity=0.5,
    color="repo-primary-language",
)

fig.update_layout(
    font=dict(size=13),
    title={"yref": "container", "y": 0.9, "yanchor": "bottom"},
    xaxis_title="Project duration (days)",
    yaxis_title="Software entropy",
)
fig.write_image(image_file := "images/software-information-entropy-top-5-langs.png")
display(Image(filename=image_file))
../../_images/f5a1702eeef5ec9124f01fe6d9b8e37101b7f99e0ca1218a08900888e8b9b592.png

What is the relationship between GitHub Stars and Forks for repositories?#

We next explore how GitHub Stars and Forks are related within the repositories.

fig = px.scatter(
    df,
    x="repo-stargazers-count",
    y="repo-forks-count",
    hover_data=["Repository URL"],
    width=1200,
    height=700,
    title="Stars and forks for PubMed GitHub repositories",
    opacity=0.5,
    log_y=True,
    log_x=True,
)

fig.update_layout(
    font=dict(size=13),
    title={"yref": "container", "y": 0.9, "yanchor": "bottom"},
    xaxis_title="GitHub Stars",
    yaxis_title="GitHub Forks",
)
fig.write_image(image_file := "images/pubmed-stars-and-forks.png")
display(Image(filename=image_file))
../../_images/8df89a3f97b9579692957c61995d1d735abb9a8cc130eb03631fd2cd38665265.png

How do GitHub Stars, software entropy, and time relate?#

The next section explores how GitHub Stars, software entropy, and time relate to one another.

df["GitHub Stars (log)"] = np.log(
    df["repo-stargazers-count"].apply(lambda x: x if x > 0 else None)  # avoid log(0)
)

df_plot = df.dropna(subset=["GitHub Stars (log)"])

fig = px.density_heatmap(
    df_plot,
    x="repo-days-of-existence",
    y="repo-agg-info-entropy",
    z="GitHub Stars (log)",  # aggregate value per bin
    histfunc="avg",  # mean log stars per bin (could use "sum" or "count")
    nbinsx=60,
    nbinsy=60,
    color_continuous_scale=px.colors.sequential.thermal,
    hover_data=["Repository URL"],
    width=1200,
    height=700,
    title="Software entropy over time for PubMed GitHub repositories (binned density)",
)

fig.update_layout(
    font=dict(size=13),
    title={"yref": "container", "y": 0.9, "yanchor": "bottom"},
    xaxis_title="Project duration (days)",
    yaxis_title="Software entropy",
)
fig.write_image(image_file := "images/software-information-entropy-gh-stars.png")
display(Image(filename=image_file))
../../_images/51be6e0001ec133001b397e17cfc34edd13c0f02e0dbb04efb3b8973e323523a.png

How do GitHub Forks, software entropy, and time relate?#

The next section explores how GitHub Forks, software entropy, and time relate to one another.

df["GitHub Forks (log)"] = np.log(
    df["repo-forks-count"].apply(
        # move 0's to None to avoid log(0)
        lambda x: x if x > 0 else None
    )
)

# Drop rows with missing values in the relevant columns
df_plot = df.dropna(
    subset=["GitHub Forks (log)", "repo-days-of-existence", "repo-agg-info-entropy"]
)

fig = px.density_heatmap(
    df_plot,
    x="repo-days-of-existence",
    y="repo-agg-info-entropy",
    z="GitHub Forks (log)",  # value to aggregate per bin
    histfunc="avg",  # mean log forks per bin (can use "sum" or "count")
    nbinsx=60,
    nbinsy=60,
    color_continuous_scale=px.colors.sequential.thermal,
    width=1200,
    height=700,
    title="Software entropy over time for PubMed GitHub repositories (binned density)",
)

fig.update_layout(
    font=dict(size=13),
    title={"yref": "container", "y": 0.9, "yanchor": "bottom"},
    xaxis_title="Project duration (days)",
    yaxis_title="Software entropy",
)
fig.write_image(image_file := "images/software-information-entropy-forks.png")
display(Image(filename=image_file))
../../_images/8afb4984b4eb2a7876b405944f0ea8297b8ec44365fbbcd42c99ced0d6f93daf.png

What is the relationship between GitHub Stars and Open Issues for the repositories?#

Below we explore how GitHub Stars and Open Issues are related for the repositories.

df["GitHub Open Issues (log)"] = np.log(
    df["repo-issues-open-count"].apply(
        # move 0's to None to avoid divide by 0
        lambda x: x if x > 0 else None
    )
)

fig = px.scatter(
    df,
    x="repo-stargazers-count",
    y="repo-issues-open-count",
    hover_data=["Repository URL"],
    width=1200,
    height=700,
    title="Stars and open issues for PubMed GitHub repositories",
    opacity=0.5,
    log_y=True,
    log_x=True,
)

fig.update_layout(
    font=dict(size=13),
    title={"yref": "container", "y": 0.9, "yanchor": "bottom"},
    xaxis_title="GitHub Stars (log)",
    yaxis_title="GitHub Open Issues (log)",
)
fig.write_image(image_file := "images/pubmed-stars-and-open-issues.png")
display(Image(filename=image_file))
../../_images/3f826ec7572f2110f3dd7a5ad8776cffb659596e40e252a5f6f14067937f8232.png

How do software entropy, time, and GitHub issues relate to one another?#

The next section visualizes how software entropy, time, and GitHub issues relate to one another.

df["GitHub Open Issues (log)"] = np.log(
    df["repo-issues-open-count"].apply(
        # move 0's to None to avoid log(0)
        lambda x: x if x > 0 else None
    )
)

# Drop rows with missing values in the relevant columns
df_plot = df.dropna(
    subset=[
        "GitHub Open Issues (log)",
        "repo-days-of-existence",
        "repo-agg-info-entropy",
    ]
)

fig = px.density_heatmap(
    df_plot,
    x="repo-days-of-existence",
    y="repo-agg-info-entropy",
    z="GitHub Open Issues (log)",  # value to aggregate per bin
    histfunc="avg",  # mean log open issues per bin
    nbinsx=60,
    nbinsy=60,
    color_continuous_scale=px.colors.sequential.thermal,
    width=1200,
    height=700,
    title="Software entropy over time for PubMed GitHub repositories (binned density)",
)

fig.update_layout(
    font=dict(size=13),
    title={"yref": "container", "y": 0.9, "yanchor": "bottom"},
    xaxis_title="Project duration (days)",
    yaxis_title="Software entropy",
)

fig.write_image(image_file := "images/software-information-entropy-open-issues.png")
display(Image(filename=image_file))
../../_images/ed5d85442fa12881f67b967ea320e2177bf88c84c35b82846571fe45ca246afe.png