Undersatanding PubMed GitHub repositories with the Software Gardening Almanack

Undersatanding PubMed GitHub repositories with the Software Gardening Almanack#

The content below seeks to better understand a dataset of ~10,000 PubMed article GitHub repositories using the Software Gardening Almanack.

PubMed GitHub repositories are extracted using the PubMed API to query for GitHub links within article abstracts. GitHub data about these repositories is gathered using the GitHub API. The code to perform data extractions may be found under the directory: gather-pubmed-repos .

These repositories are then processed using the Software Gardening Almanack to contextualize how they compare to one another.

Data Extraction#

The following section extracts data which includes software entropy and GitHub-derived data. We merge the data to form a table which includes PubMed, GitHub, and Almanack software entropy information on the repositories.

import pathlib

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.io as pio
from IPython.display import Image, display

from almanack import process_repositories_batch

# set plotly default theme
pio.templates.default = "plotly_white"

# set almanack data directory
ALMANACK_DATA_DIR = "repository_analysis_results"

# if we don't already have a data dir, batch process almanack data
if not pathlib.Path(ALMANACK_DATA_DIR).is_dir():
    process_repositories_batch(
        # read pre-collected pubmeb github repos
        repo_urls=pd.read_parquet(
            path="gather-pubmed-repos/pubmed_github_links_with_github_data.parquet",
            columns=["github_link"],
        )
        .dropna()
        .github_link.tolist()[1:20],
        output_path=ALMANACK_DATA_DIR,
        split_batches=True,
        batch_size=500,
        max_workers=8,
        collect_dataframe=False,
        show_repo_progress=False,
        show_batch_progress=True,
    )

# read the dataset as a concenated dataframe from parquet chunks
df = pd.concat(
    [
        pd.read_parquet(pq_file)
        for pq_file in pathlib.Path(ALMANACK_DATA_DIR).glob("*.parquet")
    ],
    ignore_index=True,
    sort=False,
)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9631 entries, 0 to 9630
Data columns (total 62 columns):
 #   Column                                                    Non-Null Count  Dtype  
---  ------                                                    --------------  -----  
 Repository URL                                            9631 non-null   object 
 repo-path                                                 9631 non-null   object 
 repo-commits                                              9587 non-null   float64
 repo-file-count                                           9587 non-null   float64
 repo-commit-time-range                                    9631 non-null   object 
 repo-days-of-development                                  9587 non-null   float64
 repo-commits-per-day                                      9587 non-null   float64
 almanack-table-datetime                                   9631 non-null   object 
 almanack-version                                          9631 non-null   object 
 repo-primary-language                                     5789 non-null   object 
repo-primary-license                                      3503 non-null   object 
repo-doi                                                  125 non-null    object 
repo-doi-publication-date                                 107 non-null    object 
repo-doi-fwci                                             46 non-null     float64
repo-doi-is-not-retracted                                 75 non-null     float64
repo-doi-grants-count                                     84 non-null     float64
repo-doi-grants                                           0 non-null      object 
repo-includes-readme                                      9587 non-null   float64
repo-includes-contributing                                9587 non-null   float64
repo-includes-code-of-conduct                             9587 non-null   float64
repo-includes-license                                     9587 non-null   float64
repo-is-citable                                           9587 non-null   float64
repo-default-branch-not-master                            9587 non-null   float64
repo-includes-common-docs                                 9587 non-null   float64
repo-unique-contributors                                  9587 non-null   float64
repo-unique-contributors-past-year                        9587 non-null   float64
repo-unique-contributors-past-182-days                    9587 non-null   float64
repo-tags-count                                           9587 non-null   float64
repo-tags-count-past-year                                 9587 non-null   float64
repo-tags-count-past-182-days                             9587 non-null   float64
repo-stargazers-count                                     6024 non-null   float64
repo-uses-issues                                          6024 non-null   float64
repo-issues-open-count                                    6024 non-null   float64
repo-pull-requests-enabled                                6024 non-null   float64
repo-forks-count                                          6024 non-null   float64
repo-subscribers-count                                    6024 non-null   float64
repo-packages-ecosystems-count                            9587 non-null   float64
repo-packages-versions-count                              9587 non-null   float64
repo-social-media-platforms-count                         9295 non-null   float64
repo-doi-valid-format                                     90 non-null     float64
repo-doi-https-resolvable                                 84 non-null     float64
repo-doi-cited-by-count                                   75 non-null     float64
repo-days-between-doi-publication-date-and-latest-commit  75 non-null     float64
repo-gh-workflow-success-ratio                            463 non-null    float64
repo-gh-workflow-succeeding-runs                          463 non-null    float64
repo-gh-workflow-failing-runs                             463 non-null    float64
repo-gh-workflow-queried-total                            463 non-null    float64
repo-code-coverage-percent                                0 non-null      object 
repo-date-of-last-coverage-run                            0 non-null      object 
repo-days-between-last-coverage-run-latest-commit         0 non-null      object 
repo-code-coverage-total-lines                            0 non-null      object 
repo-code-coverage-executed-lines                         0 non-null      object 
repo-agg-info-entropy                                     9587 non-null   float64
repo-almanack-score_json                                  9631 non-null   object 
repo-packages-ecosystems_json                             9631 non-null   object 
repo-social-media-platforms_json                          9631 non-null   object 
checks_total                                              9587 non-null   float64
checks_passed                                             9587 non-null   float64
checks_pct                                                9587 non-null   float64
repo-social-media-platforms                               0 non-null      float64
almanack_error                                            6000 non-null   object 
repo-doi-grants_json                                      4200 non-null   object 
dtypes: float64(42), object(20)
memory usage: 4.6+ MB

/tmp/ipykernel_3444/2236761407.py:21: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  df = pd.concat(

How many repositories are dormant?#

Software must evolve continuously to remain stable, useful, and aligned with current needs. The following section explores how repository dormancy appears within the dataset based on the date of most recent commit. Sustained, incremental updates help prevent this stagnation and ensure that a project continues to serve its users effectively.

# cleanup: drop columns or rows whose values are all empty
df = df.dropna(axis=1, how="all")
df = df.dropna(axis=0, how="all")

# gather repo commit start and end dates
df[["repo-commit-start-date", "repo-commit-end-date"]] = df[
    "repo-commit-time-range"
].str.extract(r"\('([^']+)', '([^']+)'\)")
# turn these into datetimes for datetime calcs
df[["repo-commit-start-date", "repo-commit-end-date"]] = df[
    ["repo-commit-start-date", "repo-commit-end-date"]
].apply(pd.to_datetime)
# set a cuttoff of 1 year for dormancy
cutoff = pd.Timestamp.now() - pd.DateOffset(years=1)
df["dormant"] = df["repo-commit-end-date"] <= cutoff
# show how many dormant repos there are in this dataset
df["dormant"].value_counts()

dormant
True     8708
False     923
Name: count, dtype: int64

df["repo-days-of-existence"] = (
    pd.Timestamp.now() - df["repo-commit-start-date"]
).dt.days
df["repo-days-inactive"] = (pd.Timestamp.now() - df["repo-commit-end-date"]).dt.days
fig = px.ecdf(
    df,
    x="repo-days-inactive",
    title="CDF of repository inactivity by last commit date",
    width=1200,
    height=700,
)
fig.write_image(
    image_file := "images/cdf_repository_inactivity_by_last_commit_date.png"
)
display(Image(filename=image_file))

../../_images/3850a47f8524b53f4e833635115df87b36bb7f12ec5caf9e99efc59fac75d981.png

What languages are used within PubMed article repositories?#

The following section observes the top 10 languages which are used in repositories from the dataset. Primary language is determined as the language which has the most lines of code within a repository.

# Count dormant & active repos per language
lang_status = (
    df.groupby(["repo-primary-language", "dormant"]).size().reset_index(name="Count")
)

# Map boolean to readable labels
lang_status["Status"] = lang_status["dormant"].map(
    {
        True: "Dormant (no commits >= 1 year)",
        False: "Active (< 1 year since last commit)",
    }
)

# Total repos per language
total_counts = lang_status.groupby("repo-primary-language")["Count"].sum()

# Top 10 languages by total repo count (sorted)
top_langs = total_counts.sort_values().tail(10)

# Filter to top languages
plot_data = lang_status[
    lang_status["repo-primary-language"].isin(top_langs.index)
].copy()

# Ensure y-axis follows the sorted order by total count
lang_order = top_langs.index.tolist()
plot_data["repo-primary-language"] = pd.Categorical(
    plot_data["repo-primary-language"],
    categories=lang_order,
    ordered=True,
)

# Stacked horizontal bar chart: Active (green) + Dormant (red)
fig = px.bar(
    plot_data,
    y="repo-primary-language",
    x="Count",
    color="Status",
    orientation="h",
    color_discrete_map={
        "Active (< 1 year since last commit)": "#2ca02c",
        "Dormant (no commits >= 1 year)": "#d62728",
    },
    width=1200,
    height=700,
    title="Primary language count (active vs dormant repos)",
)

fig.update_layout(
    barmode="stack",
    xaxis_title="Repository count",
    yaxis_title="Primary language",
    legend_title="Repository status",
    yaxis=dict(categoryorder="array", categoryarray=lang_order),
)


fig.write_image(image_file := "images/primary-language-counts.png")
display(Image(filename=image_file))

../../_images/d8b78867750bed58605abf73e639450961bd62f89dde385b7fec2d728a0eb17f.png

How is software entropy different across primary languages?#

The following section explores how software entropy manifests differently across different primary languages for repositories.

language_grouped_data = (
    df.groupby(["repo-primary-language"]).size().reset_index(name="Count")
)

fig = px.scatter(
    df[
        df["repo-primary-language"].isin(
            language_grouped_data
            # exclude HTML to focus on formal programming languages
            .loc[
                ~language_grouped_data["repo-primary-language"].str.contains(
                    "html", case=False, na=False
                )
            ].sort_values(by="Count")[-5:]["repo-primary-language"]
        )
    ].sort_values(by="repo-primary-language"),
    x="repo-days-of-existence",
    y="repo-agg-info-entropy",
    hover_data=["Repository URL"],
    width=1200,
    height=700,
    title="Software entropy over time for PubMed GitHub repositories",
    marginal_x="rug",
    marginal_y="box",
    opacity=0.5,
    color="repo-primary-language",
)

fig.update_layout(
    font=dict(size=13),
    title={"yref": "container", "y": 0.9, "yanchor": "bottom"},
    xaxis_title="Project duration (days)",
    yaxis_title="Software entropy",
)
fig.write_image(image_file := "images/software-information-entropy-top-5-langs.png")
display(Image(filename=image_file))

../../_images/146fecbe1999e0284861f50daa4c93163860162f31056c00fe8df465227943e4.png

What is the relationship between GitHub Stars and Forks for repositories?#

We next explore how GitHub Stars and Forks are related within the repositories.

fig = px.scatter(
    df,
    x="repo-stargazers-count",
    y="repo-forks-count",
    hover_data=["Repository URL"],
    width=1200,
    height=700,
    title="Stars and forks for PubMed GitHub repositories",
    opacity=0.5,
    log_y=True,
    log_x=True,
)

fig.update_layout(
    font=dict(size=13),
    title={"yref": "container", "y": 0.9, "yanchor": "bottom"},
    xaxis_title="GitHub Stars",
    yaxis_title="GitHub Forks",
)
fig.write_image(image_file := "images/pubmed-stars-and-forks.png")
display(Image(filename=image_file))

../../_images/d8e0e0989c7780f41e31264a84121eab2726c2676405c6c7389fee74b7e840b1.png

How do GitHub Stars, software entropy, and time relate?#

The next section explores how GitHub Stars, software entropy, and time relate to one another.

df["GitHub Stars (log)"] = np.log(
    df["repo-stargazers-count"].apply(lambda x: x if x > 0 else None)  # avoid log(0)
)

df_plot = df.dropna(subset=["GitHub Stars (log)"])

fig = px.density_heatmap(
    df_plot,
    x="repo-days-of-existence",
    y="repo-agg-info-entropy",
    z="GitHub Stars (log)",  # aggregate value per bin
    histfunc="avg",  # mean log stars per bin (could use "sum" or "count")
    nbinsx=60,
    nbinsy=60,
    color_continuous_scale=px.colors.sequential.thermal,
    hover_data=["Repository URL"],
    width=1200,
    height=700,
    title="Software entropy over time for PubMed GitHub repositories (binned density)",
)

fig.update_layout(
    font=dict(size=13),
    title={"yref": "container", "y": 0.9, "yanchor": "bottom"},
    xaxis_title="Project duration (days)",
    yaxis_title="Software entropy",
)
fig.write_image(image_file := "images/software-information-entropy-gh-stars.png")
display(Image(filename=image_file))

../../_images/4032e0b5932c17c7491c953a42cc5e7e3d8fd2e67b862ff3fe4ec498a243d918.png

How do GitHub Forks, software entropy, and time relate?#

The next section explores how GitHub Forks, software entropy, and time relate to one another.

df["GitHub Forks (log)"] = np.log(
    df["repo-forks-count"].apply(
        # move 0's to None to avoid log(0)
        lambda x: x if x > 0 else None
    )
)

# Drop rows with missing values in the relevant columns
df_plot = df.dropna(
    subset=["GitHub Forks (log)", "repo-days-of-existence", "repo-agg-info-entropy"]
)

fig = px.density_heatmap(
    df_plot,
    x="repo-days-of-existence",
    y="repo-agg-info-entropy",
    z="GitHub Forks (log)",  # value to aggregate per bin
    histfunc="avg",  # mean log forks per bin (can use "sum" or "count")
    nbinsx=60,
    nbinsy=60,
    color_continuous_scale=px.colors.sequential.thermal,
    width=1200,
    height=700,
    title="Software entropy over time for PubMed GitHub repositories (binned density)",
)

fig.update_layout(
    font=dict(size=13),
    title={"yref": "container", "y": 0.9, "yanchor": "bottom"},
    xaxis_title="Project duration (days)",
    yaxis_title="Software entropy",
)
fig.write_image(image_file := "images/software-information-entropy-forks.png")
display(Image(filename=image_file))

../../_images/31ca6cfcb8f8ab23b292e45c012d3352c6b372b82ae28449c96d31eeaf92f3a4.png

What is the relationship between GitHub Stars and Open Issues for the repositories?#

Below we explore how GitHub Stars and Open Issues are related for the repositories.

df["GitHub Open Issues (log)"] = np.log(
    df["repo-issues-open-count"].apply(
        # move 0's to None to avoid divide by 0
        lambda x: x if x > 0 else None
    )
)

fig = px.scatter(
    df,
    x="repo-stargazers-count",
    y="repo-issues-open-count",
    hover_data=["Repository URL"],
    width=1200,
    height=700,
    title="Stars and open issues for PubMed GitHub repositories",
    opacity=0.5,
    log_y=True,
    log_x=True,
)

fig.update_layout(
    font=dict(size=13),
    title={"yref": "container", "y": 0.9, "yanchor": "bottom"},
    xaxis_title="GitHub Stars (log)",
    yaxis_title="GitHub Open Issues (log)",
)
fig.write_image(image_file := "images/pubmed-stars-and-open-issues.png")
display(Image(filename=image_file))

../../_images/113fe45084bed099eb7965686c20ff259dba9450da8473fa5eb3a13ea4555058.png

How do software entropy, time, and GitHub issues relate to one another?#

The next section visualizes how software entropy, time, and GitHub issues relate to one another.

df["GitHub Open Issues (log)"] = np.log(
    df["repo-issues-open-count"].apply(
        # move 0's to None to avoid log(0)
        lambda x: x if x > 0 else None
    )
)

# Drop rows with missing values in the relevant columns
df_plot = df.dropna(
    subset=[
        "GitHub Open Issues (log)",
        "repo-days-of-existence",
        "repo-agg-info-entropy",
    ]
)

fig = px.density_heatmap(
    df_plot,
    x="repo-days-of-existence",
    y="repo-agg-info-entropy",
    z="GitHub Open Issues (log)",  # value to aggregate per bin
    histfunc="avg",  # mean log open issues per bin
    nbinsx=60,
    nbinsy=60,
    color_continuous_scale=px.colors.sequential.thermal,
    width=1200,
    height=700,
    title="Software entropy over time for PubMed GitHub repositories (binned density)",
)

fig.update_layout(
    font=dict(size=13),
    title={"yref": "container", "y": 0.9, "yanchor": "bottom"},
    xaxis_title="Project duration (days)",
    yaxis_title="Software entropy",
)

fig.write_image(image_file := "images/software-information-entropy-open-issues.png")
display(Image(filename=image_file))

../../_images/e1b92cbd183d76f2d3c606debdbe1d5dd0a726c6a06f1027aa2024faaa076c48.png