almanack Package API

Base#

Module for interacting with Software Gardening Almanack book content through a Python package.

almanack.book.read(chapter_name: str | None = None)[source]#

A function for reading almanack content through a package interface.

Parameters:

chapter_name (Optional[str], default None) – A string that indicates the short-hand name of a chapter from the book. Short-hand names are lower-case title names read from src/book/_toc.yml.

Returns:

None

The function prints the content of the chapter selected by its short-hand name, or shows the available chapter names through an exception.

Set up the almanack CLI through python-fire.

class almanack.cli.AlmanackCLI[source]#

Bases: object

Almanack CLI class for Google Fire

The following CLI-based commands are available (and in alignment with the methods below based on their name):

  • almanack table <repo path>: Provides a JSON data structure which includes Almanack metric data. Always returns a 0 exit.

  • almanack check <repo path>: Provides a report of failing boolean metrics that have a non-zero sustainability direction (“checks”), informing the user which checks do not pass. Returns a non-zero exit (1) if any checks are failing, otherwise 0.

batch(output_path: str | None = None, parquet_path: str | None = None, repo_urls: List[str] | None = None, column: str = 'github_link', batch_size: int = 500, max_workers: int = 16, limit: int | None = None, compression: str = 'zstd', show_repo_progress: bool = True, processor: str | None = None, executor: str = 'process', split_batches: bool = False, collect_dataframe: bool = True, show_batch_progress: bool = False, show_errors: bool = True) None[source]#

Run Almanack across many repositories defined in a parquet file or a provided list.

Example

almanack batch links.parquet results.parquet --column github_link --batch_size 1000 --max_workers 8

Parameters:
  • output_path – Optional destination parquet for aggregated results. If omitted, results are printed as JSON.

  • parquet_path – Parquet file containing repository URLs.

  • repo_urls – Optional list of repository URLs to process.

  • column – Column name holding repository URLs.

  • batch_size – Repositories per batch.

  • max_workers – Parallel workers per batch.

  • limit – Optional maximum repositories to process.

  • compression – Parquet compression codec (default zstd).

  • show_repo_progress – Print per-repository progress to stdout.

  • show_batch_progress – Print per-batch progress to stdout.

  • show_errors – Print repository-level errors to stdout.

  • processor – Optional import path to a processor function (e.g., module:function). Defaults to Almanack processor.

  • executor – Parallelism backend: “process” (default) or “thread”.

  • split_batches – If True, write one parquet file per batch inside output_path (must be a directory).

  • collect_dataframe – If False, skip retaining the combined DataFrame (avoids large in-memory data).

check(repo_path: str, ignore: List[str] | None = None, exclude_paths: List[str] | None = None, verbose: bool = False) None[source]#

Check sustainability metrics and report failures.

Parameters:
  • repo_path – The path to the repository to analyze.

  • ignore – A list of metric IDs to ignore when running the checks.

  • exclude_paths – Repository-relative paths or glob patterns to exclude.

  • verbose – If True, print extra information and enable debug logging.

table(repo_path: str, dest_path: str | None = None, ignore: List[str] | None = None, exclude_paths: List[str] | None = None, verbose: bool = False) None[source]#

Generate a table of metrics for a repository.

Parameters:
  • repo_path – The path to the repository to analyze.

  • dest_path – A path to send the output to.

  • ignore – A list of metric IDs to ignore when running the checks.

  • exclude_paths – Repository-relative paths or glob patterns to exclude.

  • verbose – If True, print extra information and enable debug logging.

almanack.cli.trigger()[source]#

Trigger the CLI to run.

This module performs git operations.

almanack.git.clone_repository(repo_url: str) Path[source]#

Clones the GitHub repository to a temporary directory.

Parameters:

repo_url (str) – The URL of the GitHub repository.

Returns:

Path to the cloned repository.

Return type:

pathlib.Path

almanack.git.count_files(tree: Tree | Blob) int[source]#

Counts all files (Blobs) within a Git tree, including files in subdirectories.

This function recursively traverses the provided tree object to count each file, represented as a pygit2.Blob, within the tree and any nested subdirectories.

Parameters:

tree (Union[pygit2.Tree, pygit2.Blob]) – The Git tree object (of type pygit2.Tree) to traverse and count files. The initial call should be made with the root tree of a commit.

Returns:

The total count of files (Blobs) within the tree, including nested files in subdirectories.

Return type:

int
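The recursive traversal described above can be illustrated without pygit2. The following is a minimal sketch, assuming nested dictionaries as a stand-in for trees and bytes as a stand-in for blobs; count_files_sketch and the sample data are hypothetical and not part of almanack.

```python
from typing import Union

# Hypothetical stand-ins: a dict plays the role of a tree, bytes the role of a blob.
Node = Union[dict, bytes]


def count_files_sketch(node: Node) -> int:
    """Count leaf files in a nested structure, descending into subdirectories."""
    if isinstance(node, bytes):
        # A blob is a single file.
        return 1
    # A tree contributes the sum of the files found in each of its entries.
    return sum(count_files_sketch(child) for child in node.values())


root = {
    "README.md": b"# demo",
    "src": {"main.py": b"print('hi')", "pkg": {"__init__.py": b""}},
}
print(count_files_sketch(root))
```

As with the documented function, the first call is made with the root of the structure and nested directories are handled by recursion.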

almanack.git.detect_encoding(blob_data: bytes) str[source]#

Detect the encoding of the given blob data using charset-normalizer.

Parameters:

blob_data (bytes) – The raw bytes of the blob to analyze.

Returns:

The best detected encoding of the blob data.

Return type:

str

Raises:

ValueError – If no encoding could be detected.

almanack.git.file_exists_in_repo(repo: Repository, expected_file_name: str, check_extension: bool = False, extensions: list[str] = ['.md', '.txt', '.rtf', ''], subdir: str | None = None) bool[source]#

Check if a file (case-insensitive and with optional extensions) exists in the latest commit of the repository.

Parameters:
  • repo (pygit2.Repository) – The repository object to search in.

  • expected_file_name (str) – The base file name to check (e.g., “readme”).

  • check_extension (bool) – Whether to check the extension of the file or not.

  • extensions (list[str]) – List of possible file extensions to check (e.g., [“.md”, “”]).

  • subdir (str, optional) – Subdirectory to check within the repository tree (case-sensitive).

Returns:

True if the file exists, False otherwise.

Return type:

bool

almanack.git.find_file(repo: Repository, filepath: str, case_insensitive: bool = False, extensions: list[str] = ['.md', '.txt', '.rtf', '.rst', '']) Object | None[source]#

Locate a file in the repository by its path.

Parameters:
  • repo (pygit2.Repository) – The repository object.

  • filepath (str) – The path to the file within the repository.

  • case_insensitive (bool) – If True, perform case-insensitive comparison.

  • extensions (list[str]) – List of possible file extensions to check (e.g., [“.md”, “”]).

Returns:

The entry of the found file, or None if no matching file is found.

Return type:

Optional[pygit2.Object]

almanack.git.get_commits(repo: Repository) List[Commit][source]#

Retrieves the list of commits from the main branch.

Parameters:

repo (pygit2.Repository) – The Git repository.

Returns:

List of commits in the repository.

Return type:

List[pygit2.Commit]

almanack.git.get_edited_files(repo: Repository, source_commit: Commit, target_commit: Commit) List[str][source]#

Finds all files that have been edited, added, or deleted between two specific commits.

Parameters:
  • repo (pygit2.Repository) – The Git repository.

  • source_commit (pygit2.Commit) – The source commit.

  • target_commit (pygit2.Commit) – The target commit.

Returns:

List of file names that have been edited, added, or deleted between the two commits.

Return type:

List[str]

almanack.git.get_loc_changed(repo_path: Path, source: str, target: str, file_names: List[str]) Dict[str, int][source]#

Finds the total number of code lines changed for each specified file between two commits.

Parameters:
  • repo_path (pathlib.Path) – The path to the git repository.

  • source (str) – The source commit hash.

  • target (str) – The target commit hash.

  • file_names (List[str]) – List of file names to calculate changes for.

Returns:

A dictionary where the key is the filename, and the value is the lines changed (added and removed).

Return type:

Dict[str, int]

almanack.git.get_most_recent_commits(repo_path: Path) tuple[str, str][source]#

Retrieves the two most recent commit hashes in the test repositories.

Parameters:

repo_path (pathlib.Path) – The path to the git repository.

Returns:

Tuple containing the source and target commit hashes.

Return type:

tuple[str, str]

almanack.git.get_remote_url(repo: Repository) str | None[source]#

Determines the remote URL of a git repository, if available. We use the upstream remote first, then origin, and finally any other remote. The upstream remote is preferred because it will be used for referential data lookups (such as GitHub issues, stars, etc.).

Parameters:

repo (pygit2.Repository) – The pygit2 repository object.

Returns:

The remote URL if found, otherwise None.

Return type:

Optional[str]
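The preference order described above (upstream, then origin, then any other remote) can be sketched as follows. This is an illustration only: pick_remote_url is a hypothetical helper that takes a plain mapping of remote names to URLs instead of a pygit2 repository.

```python
from typing import Dict, Optional


def pick_remote_url(remotes: Dict[str, str]) -> Optional[str]:
    """Choose a URL by preference: upstream, then origin, then any other remote."""
    for preferred in ("upstream", "origin"):
        if preferred in remotes:
            return remotes[preferred]
    # Fall back to the first remaining remote, or None when there are no remotes.
    return next(iter(remotes.values()), None)


print(pick_remote_url({"origin": "https://example.com/fork.git",
                       "upstream": "https://example.com/source.git"}))
```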

almanack.git.read_file(repo: Repository, entry: Object | None = None, filepath: str | None = None, case_insensitive: bool = False) str | None[source]#

Read the content of a file from the repository.

Parameters:
  • repo (pygit2.Repository) – The repository object.

  • entry (Optional[pygit2.Object]) – The entry of the file to read. If not provided, filepath must be specified.

  • filepath (Optional[str]) – The path to the file within the repository. Used if entry is not provided.

  • case_insensitive (bool) – If True, perform case-insensitive comparison when using filepath.

Returns:

The content of the file as a string, or None if the file is not found or reading fails.

Return type:

Optional[str]

almanack.git.repo_dir_exists(repo: Repository, directory_name: str) bool[source]#

Checks if a directory with the given name exists in the latest commit of the repository.

Parameters:
  • repo (pygit2.Repository) – The repository object to search in.

  • directory_name (str) – The name of the directory to look for.

Returns:

True if the directory exists, False otherwise.

Return type:

bool

almanack.git.resolve_redirects(url: str, timeout: int = 10) str[source]#

Follow HTTP redirects until the final URL is reached.

Parameters:
  • url (str) – The starting URL to check.

  • timeout (int, optional) – Timeout (in seconds) for each request, by default 10.

Returns:

The last non-redirect URL.

Return type:

str

Batch processing utilities for running Almanack across many repositories.

almanack.batch_processing._nullable_dtype(dtype: Any) Any[source]#

Map to nullable pandas dtypes so missing values keep schema.

almanack.batch_processing.process_repositories_batch(repo_urls: ~typing.Sequence[str], output_path: str | ~pathlib.Path | None = None, split_batches: bool = False, collect_dataframe: bool = True, batch_size: int = 500, max_workers: int = 16, limit: int | None = None, compression: str = 'zstd', processor: ~typing.Callable[[str], dict[str, ~typing.Any]] = <function process_repo_for_almanack>, executor_cls: ~typing.Type[~concurrent.futures._base.Executor] = <class 'concurrent.futures.process.ProcessPoolExecutor'>, show_repo_progress: bool = True, show_batch_progress: bool = False, show_errors: bool = True) DataFrame | None[source]#

Processes repositories in batches and writes results to a single parquet file.

Parameters:
  • repo_urls – Iterable of repository URLs to process.

  • output_path – Optional destination parquet path for all results.

  • split_batches – If True, write one parquet file per batch to the directory at output_path. If False (default), append all batches into a single parquet file.

  • collect_dataframe – If False, skip retaining per-batch DataFrames and return None to reduce memory.

  • batch_size – Number of repositories per batch.

  • max_workers – Maximum parallel workers for each batch.

  • limit – Optional maximum number of repositories to process.

  • compression – Parquet compression codec (default: zstd).

  • processor – Callable used to process a repository (default: process_repo_for_almanack).

  • executor_cls – Executor class used for parallelism (default: ProcessPoolExecutor).

  • show_repo_progress – Whether to print per-repository progress to stdout.

  • show_batch_progress – Whether to print per-batch progress to stdout.

  • show_errors – Whether to print repository-level errors to stdout.

Returns:

A DataFrame containing all processed results, or None if collect_dataframe is False.
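The interplay of batch_size, max_workers, and limit can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not the almanack implementation: iter_batches, run_batches, and fake_processor are hypothetical names, a thread pool is used for brevity, and results are kept in a plain list instead of being written to parquet.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterator, List, Optional, Sequence


def iter_batches(items: Sequence[str], batch_size: int,
                 limit: Optional[int] = None) -> Iterator[List[str]]:
    """Yield consecutive batches of at most batch_size items, honoring limit."""
    selected = list(items[:limit]) if limit is not None else list(items)
    for start in range(0, len(selected), batch_size):
        yield selected[start:start + batch_size]


def run_batches(repo_urls: Sequence[str], processor: Callable[[str], Dict[str, object]],
                batch_size: int = 500, max_workers: int = 16,
                limit: Optional[int] = None) -> List[Dict[str, object]]:
    """Process each batch in parallel and collect one result row per repository."""
    results: List[Dict[str, object]] = []
    for batch in iter_batches(repo_urls, batch_size, limit):
        # A fresh pool per batch caps parallelism at max_workers within each batch.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            results.extend(pool.map(processor, batch))
    return results


def fake_processor(url: str) -> Dict[str, object]:
    # Hypothetical processor: a real one would clone and analyze the repository.
    return {"repo_url": url, "name_length": len(url)}


urls = [f"https://example.com/repo-{i}" for i in range(5)]
rows = run_batches(urls, fake_processor, batch_size=2, max_workers=2, limit=4)
print(len(rows))
```

Processing batch by batch keeps the amount of in-flight work bounded, which is the same motivation the split_batches and collect_dataframe options address for memory.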

almanack.batch_processing.sanitize_for_parquet(df: DataFrame) DataFrame[source]#

Cleans a DataFrame so all columns are parquet-safe.

  • Expands dict columns into multiple fields.

  • Converts lists into JSON strings.

  • Casts generic object types into strings.

Parameters:

df – Input DataFrame with raw metrics.

Returns:

Sanitized DataFrame safe for parquet storage.

Return type:

DataFrame
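The three cleaning rules listed above can be sketched on a single row represented as a dictionary. This is an assumption-laden illustration that avoids pandas: sanitize_row is hypothetical, and the "parent.child" naming for expanded dict fields is an assumption, not confirmed by the documentation.

```python
import json
from typing import Any, Dict


def sanitize_row(row: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten one row so every value is a simple scalar or a string."""
    clean: Dict[str, Any] = {}
    for key, value in row.items():
        if isinstance(value, dict):
            # Expand a dict into one field per nested key (naming is an assumption).
            for sub_key, sub_value in sanitize_row(value).items():
                clean[f"{key}.{sub_key}"] = sub_value
        elif isinstance(value, list):
            # Lists become JSON strings.
            clean[key] = json.dumps(value)
        elif value is None or isinstance(value, (bool, int, float, str)):
            clean[key] = value
        else:
            # Any other object is cast to its string form.
            clean[key] = str(value)
    return clean


print(sanitize_row({"license": {"name": "BSD"}, "topics": ["a", "b"]}))
```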

Reporting#

This module creates entropy reports

almanack.reporting.report.pr_report(data: Dict[str, Any]) str[source]#

Returns the formatted PR-specific entropy report.

Parameters:

data (Dict[str, Any]) – Dictionary with the entropy data.

Returns:

Formatted GitHub markdown.

Return type:

str

almanack.reporting.report.repo_report(data: Dict[str, Any]) str[source]#

Returns the formatted entropy report as a string.

Parameters:

data (Dict[str, Any]) – Dictionary with the entropy data.

Returns:

Formatted entropy report.

Return type:

str

Metrics#

Data#

This module computes data for GitHub Repositories

almanack.metrics.data._get_almanack_version() str[source]#

Determines the version of almanack currently in use through either pkg_resources or dunamai.

Returns:

str

A string representing the version of almanack currently being used.

almanack.metrics.data._get_cli_entrypoints(repo: Repository) List[str][source]#

Return sorted CLI entrypoint names discovered from pyproject.toml, setup.cfg, and setup.py.

Parameters:

repo – The pygit2 Repository to read packaging files from.

Returns:

A sorted list of unique CLI command names. Returns an empty list if no entrypoints are found.

almanack.metrics.data._get_conda_python_version(content: str) str | None[source]#

Return the Python version constraint declared in a conda environment YAML string.

Scans the dependencies list for an entry that starts with python followed by a version separator (=, >, <, ~). Returns None if no Python dependency is found or the YAML cannot be parsed.

Parameters:

content – The raw string contents of a conda environment.yml file.

Returns:

The Python version string extracted from the dependency entry (for example, "3.11"), or None if no Python version is declared or the file cannot be parsed.
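The dependency scan described above can be sketched on an already-parsed dependencies list, which avoids requiring a YAML parser here. conda_python_version is a hypothetical helper, and stripping the leading "=" so that "python=3.11" yields "3.11" is an assumption based on the example in the description.

```python
import re
from typing import List, Optional

# Matches "python" followed by a version separator (=, >, <, ~) and captures the rest.
_PYTHON_DEP = re.compile(r"^python\s*([=><~].*)$")


def conda_python_version(dependencies: List[object]) -> Optional[str]:
    """Return the Python version declared in a conda dependencies list, if any."""
    for dep in dependencies:
        # Dependency lists can contain nested mappings (for example, pip); skip them.
        if not isinstance(dep, str):
            continue
        match = _PYTHON_DEP.match(dep.strip())
        if match:
            # Assumption: drop leading "=" characters, keep other operators as written.
            return match.group(1).lstrip("=").strip()
    return None


print(conda_python_version(["numpy", "python=3.11", {"pip": ["requests"]}]))
```

Requiring a separator right after "python" keeps packages such as python-dateutil from being mistaken for the interpreter itself.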

almanack.metrics.data._get_programming_extensions() frozenset[str][source]#

Return file extensions that belong to programming languages per GitHub Linguist.

Fetches and parses the Linguist languages.yml on the first call, then caches the result for the lifetime of the process. Falls back to _FALLBACK_PROGRAMMING_EXTENSIONS if the fetch or parse fails.

almanack.metrics.data._get_pyproject_python_version(content: str) str | None[source]#

Return the declared Python version constraint from a pyproject.toml string.

Checks tool.poetry.dependencies.python first (Poetry convention), then project.requires-python (PEP 621). Returns None if neither is present or if the TOML cannot be parsed.

Parameters:

content – The raw string contents of a pyproject.toml file.

Returns:

The Python version constraint string (for example, ">=3.9"), or None if no constraint is declared or the file cannot be parsed.

almanack.metrics.data._get_python_environment_data(repo: Repository) Dict[str, Any][source]#

Detect Python environment and dependency management tools in the repository.

Scans common Python packaging and environment configuration files to identify which tools are in use and what Python versions are declared. This function is intended for Python-primary repositories; call sites should guard invocation with a primary-language check.

Parameters:

repo – The pygit2 Repository to scan for environment configuration files.

Returns:

A dictionary with the following keys:

  • environment_managers: sorted list of detected environment manager names (for example, ["conda", "poetry"]), or None if none are found.

  • has_managed_environment: True if at least one environment manager is detected, False otherwise.

  • dependency_managers: sorted list of detected dependency manager names (for example, ["pip"]), or None if none are found.

  • has_declared_dependencies: True if at least one dependency manager is detected, False otherwise.

  • declared_python_versions: sorted list of declared Python version strings, or None if none are found.

almanack.metrics.data._get_repository_languages_data(remote_repo_data: Dict[str, Any], remote_url: str | None) Dict[str, int][source]#

Return repository language line counts from remote metadata or GitHub.

Parameters:
  • remote_repo_data – Metadata dict from a hosting platform or ecosyste.ms mirror. May contain a languages_lines, languages_loc, or languages key with per-language counts.

  • remote_url – Remote URL of the repository, used to fall back to the GitHub languages API when hosting metadata is unavailable.

Returns:

A dictionary mapping programming language name to an approximate line or byte count. Returns an empty dict if no language data is found.

almanack.metrics.data._get_software_description(repo: Repository, remote_repo_data: Dict[str, Any] | None, readme_exists: bool, readme_file: Object | None) str | None[source]#

Return the best available plain-text description for the repository.

The function uses the following priority order:

  1. The description field from remote hosting metadata (e.g. ecosyste.ms / GitHub).

  2. The abstract field from a CITATION.cff file in the repo root.

  3. The first non-badge paragraph of the README.

Returns the first non-empty candidate, or None if the repo doesn’t include any priority entry.

Parameters:
  • repo – The pygit2 Repository to read local files from.

  • remote_repo_data – Metadata dict from a hosting platform or ecosyste.ms mirror, used to retrieve the remote description field.

  • readme_exists – Whether a README file was detected in the repository.

  • readme_file – The pygit2 object for the README file, used to read its contents when falling back to the first paragraph.

Returns:

The best available plain-text description string, or None if no description could be found.

almanack.metrics.data._normalize_exclude_paths(exclude_paths: str | List[str] | Tuple[str, ...] | None) List[str] | None[source]#

Normalize exclude path inputs into a list of strings.

Parameters:

exclude_paths – A string, list, or tuple of exclude paths. Accepts comma-separated strings and strips surrounding quotes.

Returns:

A list of normalized exclude paths, or None if no paths are provided.
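The normalization described above can be sketched as follows. This is a minimal illustration, assuming that empty entries are dropped and that both single and double quotes are stripped; normalize_exclude_paths_sketch is a hypothetical name.

```python
from typing import List, Optional, Sequence, Union


def normalize_exclude_paths_sketch(
    exclude_paths: Union[str, Sequence[str], None],
) -> Optional[List[str]]:
    """Turn a string, list, or tuple of exclude paths into a clean list of strings."""
    if not exclude_paths:
        return None
    raw_items = [exclude_paths] if isinstance(exclude_paths, str) else list(exclude_paths)
    normalized: List[str] = []
    for item in raw_items:
        # Each item may itself be a comma-separated string.
        for part in str(item).split(","):
            cleaned = part.strip().strip("'\"").strip()
            if cleaned:
                normalized.append(cleaned)
    return normalized or None


print(normalize_exclude_paths_sketch("docs, 'tests/*'"))
```

Accepting several input shapes is useful for CLI arguments, where the same option may arrive as one quoted string or as a parsed list.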

almanack.metrics.data._parse_setup_py_console_scripts(content: str) set[str][source]#

Extract console_scripts command names from a setup.py string.

Parses the content as a Python Abstract Syntax Tree (AST) and looks for a setup() or setuptools.setup() call whose entry_points keyword argument contains a console_scripts list. Returns an empty set if the file cannot be parsed or contains no matching entries.

Parameters:

content – The raw string contents of a setup.py file.

Returns:

A set of CLI command names declared in console_scripts. Returns an empty set if none are found or the file cannot be parsed.
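The AST-based approach described above can be sketched with the standard-library ast module. This is an illustration under stated assumptions, not the almanack source: console_scripts_sketch is a hypothetical name, and only literal entry_points values are handled.

```python
import ast
from typing import Set


def console_scripts_sketch(content: str) -> Set[str]:
    """Collect console_scripts command names from setup.py source text."""
    try:
        tree = ast.parse(content)
    except SyntaxError:
        return set()
    names: Set[str] = set()
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Accept both setup(...) and setuptools.setup(...).
        is_setup = (isinstance(func, ast.Name) and func.id == "setup") or (
            isinstance(func, ast.Attribute)
            and func.attr == "setup"
            and isinstance(func.value, ast.Name)
            and func.value.id == "setuptools"
        )
        if not is_setup:
            continue
        for keyword in node.keywords:
            if keyword.arg != "entry_points":
                continue
            try:
                entry_points = ast.literal_eval(keyword.value)
            except (ValueError, TypeError):
                continue  # Non-literal entry_points cannot be evaluated safely.
            if not isinstance(entry_points, dict):
                continue
            for script in entry_points.get("console_scripts", []):
                # Each entry looks like "command = package.module:function".
                names.add(str(script).split("=", 1)[0].strip())
    return names


source = (
    "from setuptools import setup\n"
    'setup(name="demo", entry_points={"console_scripts": '
    '["demo = demo.cli:main", "demo-admin=demo.admin:run"]})\n'
)
print(sorted(console_scripts_sketch(source)))
```

Parsing the file instead of executing it means an untrusted setup.py is never run.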

almanack.metrics.data._walk_tree_measure_size_of_noncode_files(tree: Tree | Blob, repo: Repository, prefix: str = '') Dict[str, int][source]#

Recursively walk a git tree and count lines per non-code file extension.

The function iterates through each blob (file) in the tree, and counts new lines for non-code files. Specifically, the function will skip a file if the extension appears in _get_programming_extensions() or if it lives inside a .git/ directory. The function counts non-code, text-based files by newline and counts binary files based on byte length. Lastly, the function computes a sum per file extension and writes into a {extension: line_count} dict where files missing an extension use the key "<no_ext>".

Parameters:
  • tree – A pygit2 Tree or Blob to walk. Top-level callers pass the root tree; recursive calls pass subtrees or blobs.

  • repo – The pygit2 Repository used to dereference object IDs when traversing subtrees.

  • prefix – Relative file path accumulated during recursion. Defaults to an empty string for the root call.

Returns:

A dictionary mapping each non-code file extension (for example, .md, .csv) to its approximate line or byte count. Files without an extension are keyed as "<no_ext>".

almanack.metrics.data.compute_almanack_score(almanack_table: List[Dict[str, int | float | bool]]) Dict[str, int | float][source]#

Computes an Almanack score by counting boolean Almanack table metrics to provide a quick summary of software sustainability.

Parameters:

almanack_table (List[Dict[str, Union[int, float, bool]]]) – A list of dictionaries containing metrics. Each dictionary must have a “result” key with a value that is an int, float, or bool. A “sustainability_correlation” key specifies the relationship of each value to sustainability:

  • 1 (positive correlation)

  • 0 (no correlation)

  • -1 (negative correlation)

Returns:

Dictionary of length three, including the following: 1) number of Almanack boolean metrics that passed (numerator), 2) number of total Almanack boolean metrics considered (denominator), and 3) a score that represents how likely the repository will be maintained over time, based on (numerator / denominator).

Return type:

Dict[str, Union[int, float]]
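The numerator / denominator calculation can be sketched as follows. Several details are assumptions made for illustration only: the output key names, the rule that a True result passes for a positive correlation and a False result passes for a negative one, and the score of 0.0 when no checks apply. almanack_score_sketch is a hypothetical name.

```python
from typing import Dict, List, Union

Metric = Dict[str, Union[int, float, bool, str]]


def almanack_score_sketch(table: List[Metric]) -> Dict[str, Union[int, float]]:
    """Count passing boolean checks and divide by the number of checks considered."""
    numerator = 0
    denominator = 0
    for metric in table:
        result = metric.get("result")
        correlation = metric.get("sustainability_correlation", 0)
        # Only boolean results with a non-zero correlation count as checks.
        if not isinstance(result, bool) or correlation == 0:
            continue
        denominator += 1
        # Assumption: True passes for positive correlation, False for negative.
        if result == (correlation > 0):
            numerator += 1
    score = numerator / denominator if denominator else 0.0
    # Key names here are hypothetical placeholders.
    return {"passed": numerator, "total": denominator, "score": score}


table = [
    {"result": True, "sustainability_correlation": 1},
    {"result": False, "sustainability_correlation": 1},
    {"result": 12, "sustainability_correlation": 1},
]
print(almanack_score_sketch(table))
```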

almanack.metrics.data.compute_pr_data(repo_path: str, pr_branch: str, main_branch: str) Dict[str, Any][source]#

Computes entropy data for a PR compared to the main branch.

Parameters:
  • repo_path (str) – The local path to the Git repository.

  • pr_branch (str) – The branch name for the PR.

  • main_branch (str) – The branch name for the main branch.

Returns:

A dictionary containing the following key-value pairs:
  • “pr_branch”: The PR branch being analyzed.

  • “main_branch”: The main branch being compared.

  • “total_entropy_introduced”: The total entropy introduced by the PR.

  • “number_of_files_changed”: The number of files changed in the PR.

  • “entropy_per_file”: A dictionary of entropy values for each changed file.

  • “commits”: A tuple containing the most recent commits on the PR and main branches.

Return type:

dict

almanack.metrics.data.compute_repo_data(repo_path: str, exclude_paths: List[str] | None = None, required_metric_names: set[str] | None = None) Dict[str, Any][source]#

Compute comprehensive data for a GitHub repository.

Parameters:
  • repo_path – The local path to the Git repository.

  • exclude_paths – Repository-relative paths or glob patterns to exclude from checks.

  • required_metric_names – Optional set of metric names that will be used by downstream reporting. If provided, expensive data-collection blocks are skipped unless one of their dependent metrics is requested.

Returns:

A dictionary containing data key-pairs.

almanack.metrics.data.days_of_development(repo: Repository) float[source]#
Parameters:

repo (pygit2.Repository) – The pygit2 repository object to analyze.

Returns:

The average number of commits per day over the period of time.

Return type:

float

almanack.metrics.data.gather_failed_almanack_metric_checks(repo_path: str, ignore: List[str] | None = None, exclude_paths: List[str] | None = None) List[Dict[str, Any]][source]#

Gather checks on the repository metrics and return a list of failed checks.

Parameters:
  • repo_path – The file path to the repository which will have metrics calculated and includes boolean checks.

  • ignore – A list of metric IDs to ignore when running the checks.

  • exclude_paths – Repository-relative paths or glob patterns to exclude from checks.

Returns:

A list of dictionaries containing the metrics and their associated results. Each dictionary includes the name, id, and guidance on how to fix each failed check. The dictionary also includes data about the almanack score for use in summarizing the results.

almanack.metrics.data.get_github_build_metrics(repo_url: str, branch: str = 'main', max_runs: int = 100, github_api_endpoint: str = 'https://api.github.com/repos') dict[source]#

Fetches the success ratio of the latest GitHub Actions build runs for a specified branch.

Parameters:
  • repo_url (str) – The full URL of the repository (e.g., ‘software-gardening/almanack’).

  • branch (str) – The branch to filter for the workflow runs (default: “main”).

  • max_runs (int) – The maximum number of latest workflow runs to analyze.

  • github_api_endpoint (str) – Base API endpoint for GitHub repositories.

Returns:

The success ratio and details of the analyzed workflow runs.

Return type:

dict

almanack.metrics.data.get_table(repo_path: str, ignore: List[str] | None = None, exclude_paths: List[str] | None = None) List[Dict[str, Any]][source]#

Gather metrics on a repository and return the results in a structured format.

This function reads a metrics table from a predefined YAML file, computes relevant data from the specified repository, and associates the computed results with the metrics defined in the metrics table. If an error occurs during data computation, an exception is raised.

Parameters:
  • repo_path – The file path to the repository for which the Almanack runs metrics.

  • ignore – A list of metric IDs to ignore when running the checks.

  • exclude_paths – Repository-relative paths or glob patterns to exclude from checks.

Returns:

A list of dictionaries containing the metrics and their associated results. Each dictionary includes the original metrics data along with the computed result under the key “result”.

Raises:

ReferenceError – If there is an error encountered while processing the data, providing context in the error message.

almanack.metrics.data.is_conda_environment_yaml(content: str) bool[source]#

Return True if content looks like a conda environment YAML file.

A conda environment file is a YAML document whose top-level value is a mapping that contains at least one of the keys name, channels, or dependencies. This heuristic avoids false positives from arbitrary YAML files that happen to share the same filename.

Parameters:

content – The raw string contents of a YAML file to inspect.

Returns:

True if the content matches the conda environment YAML heuristic, False otherwise.
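The heuristic above can be sketched on an already-parsed YAML value, which keeps the example free of third-party parsers. looks_like_conda_environment is a hypothetical helper; the real function takes the raw string content and parses it first.

```python
from typing import Any

# Keys whose presence suggests a conda environment file.
_CONDA_KEYS = {"name", "channels", "dependencies"}


def looks_like_conda_environment(parsed: Any) -> bool:
    """Return True when the top-level value is a mapping with a conda-style key."""
    return isinstance(parsed, dict) and any(key in parsed for key in _CONDA_KEYS)


print(looks_like_conda_environment({"name": "demo", "dependencies": ["python=3.11"]}))
```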

almanack.metrics.data.measure_coverage(repo: Repository, primary_language: str | None) Dict[str, Any] | None[source]#

Measures code coverage for a given repository.

Parameters:
  • repo (pygit2.Repository) – The pygit2 repository object to analyze.

  • primary_language (Optional[str]) – The primary programming language of the repository.

Returns:

Code coverage data or an empty dictionary if unable to find code coverage data.

Return type:

Optional[dict[str,Any]]

almanack.metrics.data.parse_python_coverage_data(repo: Repository) Dict[str, Any] | None[source]#

Parses coverage.py data from recognized formats such as JSON, XML, or LCOV. See here for more information: https://coverage.readthedocs.io/en/latest/cmd.html#cmd-report

Parameters:

repo (pygit2.Repository) – The pygit2 repository object containing code.

Returns:

A dictionary with standardized code coverage data or an empty dict if no data is found.

Return type:

Optional[Dict[str, Any]]

almanack.metrics.data.process_repo_for_almanack(repo_url: str, exclude_paths: List[str] | None = None) Dict[str, Any][source]#

Process a GitHub repository URL into a flat dictionary of Almanack metrics.

Parameters:
  • repo_url – The GitHub repository URL.

  • exclude_paths – Repository-relative paths or glob patterns to exclude from checks.

Returns:

Flattened metrics for the repository, including sustainability checks. If the processing fails, returns an error entry with the repository URL.

almanack.metrics.data.process_repo_for_analysis(repo_url: str) Tuple[float | None, str | None, str | None, int | None][source]#

Processes GitHub repository URLs to calculate entropy and other metadata. This is used to prepare data for analysis, particularly for the seedbank notebook that processes PUBMED repositories.

Parameters:

repo_url (str) – The URL of the GitHub repository.

Returns:

A tuple containing the normalized total entropy, the date of the first commit, the date of the most recent commit, and the total time of existence in days.

Return type:

tuple

almanack.metrics.data.table_to_wide(table_rows: list[dict]) Dict[str, Any][source]#

Transpose the Almanack table (name->result), compute a checks summary, and flatten nested values. repo-file-info-entropy and repo-file-history-complexity-decay are omitted from the wide output because they are large, file-level metrics that are out of scope for this flattened summary representation.

Parameters:

table_rows (list[dict]) – The Almanack metrics table as a list of dictionaries, each containing metric metadata and a “result” field.

Returns:

A flattened dictionary mapping metric names to their results, including computed summary fields:

  • “checks_total”: total number of sustainability-related checks

  • “checks_passed”: number of checks passed

  • “checks_pct”: percentage of checks passed

Return type:

dict

Remote#

This module focuses on remote API requests and related aspects.

almanack.metrics.remote.get_api_data(api_endpoint: str = 'https://repos.ecosyste.ms/api/v1/repositories/lookup', params: Dict[str, str] | None = None) dict[source]#

Get data from an API based on the remote URL, with retry logic for GitHub rate limiting.

Parameters:
  • api_endpoint (str) – The HTTP API endpoint to use for the request.

  • params (Optional[Dict[str, str]]) – Additional query parameters to include in the GET request.

Returns:

The JSON response from the API as a dictionary.

Return type:

dict

Raises:

requests.RequestException – If the API call fails for reasons other than rate limiting.

almanack.metrics.remote.request_with_backoff(method: str, url: str, *, headers: Dict[str, str] | None = None, params: Dict[str, str] | None = None, timeout: int = 30, allow_redirects: bool | None = None, max_retries: int = 5, base_backoff: float = 1.0, backoff_multiplier: float = 2.0, retry_statuses: Set[int] | None = None) Response | None[source]#

Perform an HTTP request with retry using a backoff for transient failures.

Parameters:
  • method (str) – The HTTP method to use (e.g., “GET”, “HEAD”).

  • url (str) – The URL to request.

  • headers (Optional[Dict[str, str]]) – Optional HTTP headers.

  • params (Optional[Dict[str, str]]) – Optional query parameters.

  • timeout (int) – Request timeout in seconds.

  • allow_redirects (Optional[bool]) – Whether to follow redirects.

  • max_retries (int) – Maximum number of attempts before giving up.

  • base_backoff (float) – Base backoff duration in seconds.

  • backoff_multiplier (float) – Multiplier for exponential backoff growth.

  • retry_statuses (Optional[Set[int]]) – HTTP status codes to retry.

Returns:

The response on success, or None on failure.

Return type:

Optional[requests.Response]
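The retry behavior controlled by max_retries, base_backoff, and backoff_multiplier can be sketched without making network calls. This is a simplified illustration: call_with_backoff is hypothetical, the send callable stands in for the HTTP request and returns only a status code, and the default retry status set is an assumption.

```python
import time
from typing import Callable, List, Optional, Set


def call_with_backoff(
    send: Callable[[], int],
    max_retries: int = 5,
    base_backoff: float = 1.0,
    backoff_multiplier: float = 2.0,
    retry_statuses: Optional[Set[int]] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Call send() until it returns a non-retryable status or attempts run out."""
    # Assumed default: common transient HTTP statuses.
    retry_statuses = retry_statuses or {429, 500, 502, 503, 504}
    delay = base_backoff
    for attempt in range(max_retries):
        status = send()
        if status not in retry_statuses:
            return status
        if attempt < max_retries - 1:
            # Wait, then grow the delay exponentially for the next attempt.
            sleep(delay)
            delay *= backoff_multiplier
    return None


# Demonstration with recorded delays in place of real sleeping.
statuses = iter([503, 429, 200])
delays: List[float] = []
print(call_with_backoff(lambda: next(statuses), sleep=delays.append), delays)
```

Injecting the sleep function keeps the sketch testable; with the defaults shown, the waits grow as 1, 2, 4, and 8 seconds.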

Entropy#

Calculate software change entropy and decay-weighted history complexity.

class almanack.metrics.entropy.calculate_entropy.HistoryComplexityConfig(decay_factor: float = 10.0, quiet_time_seconds: int = 3600)[source]#

Bases: object

Configuration values for the decay-weighted history complexity metric.

decay_factor#

Time scale (in hours) for exponential down-weighting of older burst periods. Larger values make older periods lose influence more slowly; smaller values make them fade faster.

Type:

float

quiet_time_seconds#

Maximum allowed time gap between adjacent commit events in the same burst period. If the gap is larger than this threshold, a new burst period starts.

Type:

int

decay_factor: float = 10.0#
quiet_time_seconds: int = 3600#
almanack.metrics.entropy.calculate_entropy._collect_period_file_changes(repo: Repository, source_commit: Commit, target_commit: Commit, tracked_files: set[str], quiet_time_seconds: int = 3600) list[tuple[int, dict[str, int]]][source]#

Collect per-period tracked-file change totals separated by quiet windows.

This function walks commits from target_commit backwards to source_commit, extracts tracked-file line changes from each commit, and then groups these commit events into burst periods. A commit event is one commit timestamp plus tracked-file counts of lines added and deleted. A new period starts when the elapsed time between consecutive commit events is greater than quiet_time_seconds.

Parameters:
  • repo – Open repository used to read commit history and diffs.

  • source_commit – Commit that marks the start boundary (exclusive).

  • target_commit – Commit that marks the end boundary (inclusive).

  • tracked_files – File paths that should contribute to change totals.

  • quiet_time_seconds – Threshold separating consecutive burst periods.

Returns:

Time-ordered list of periods. Each item contains the period end time (Unix timestamp) and file-level counts of lines added plus deleted for that period.

almanack.metrics.entropy.calculate_entropy._group_commit_events_by_quiet_window(commit_events: list[tuple[int, dict[str, int]]], quiet_time_seconds: int) list[tuple[int, dict[str, int]]][source]#

Group commit events into burst periods separated by quiet windows.

Parameters:
  • commit_events – Sequence of commit events, where each event is (event_time, file_changes). event_time is a commit timestamp, and file_changes maps tracked file paths to counts of lines added plus deleted from that commit.

  • quiet_time_seconds – Threshold that starts a new period when exceeded.

Returns:

Time-ordered periods with end timestamp and aggregated file changes.
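
The grouping rule described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the private helper itself: the function name `group_by_quiet_window` is hypothetical, events are assumed to be sorted in ascending time order, and each period is stamped with the timestamp of its last event.

```python
from collections import Counter


def group_by_quiet_window(
    commit_events: list[tuple[int, dict[str, int]]],
    quiet_time_seconds: int,
) -> list[tuple[int, dict[str, int]]]:
    """Merge commit events into periods, splitting where the gap exceeds the threshold."""
    periods: list[tuple[int, dict[str, int]]] = []
    current: Counter = Counter()
    previous_time = None
    # Sort by timestamp so gaps are measured between consecutive events.
    for event_time, file_changes in sorted(commit_events, key=lambda e: e[0]):
        if previous_time is not None and event_time - previous_time > quiet_time_seconds:
            # Close the finished period, stamped with its last event time.
            periods.append((previous_time, dict(current)))
            current = Counter()
        # Add this commit's per-file changed-line counts to the open period.
        current.update(file_changes)
        previous_time = event_time
    if previous_time is not None:
        periods.append((previous_time, dict(current)))
    return periods
```

With a one-hour threshold, two commits 100 seconds apart fall into one period, while a commit several hours later starts a new one.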

almanack.metrics.entropy.calculate_entropy._resolve_commit(repo: Repository, commit_ref: str | Commit) Commit[source]#

Convert a commit input into a concrete pygit2.Commit.

Parameters:
  • repo – Open repository used to look up commit reference strings.

  • commit_ref – Existing commit object or rev-parse compatible commit string.

Returns:

Commit object for commit_ref.

almanack.metrics.entropy.calculate_entropy.calculate_aggregate_entropy(repo_path: Path, source_commit: str | Commit, target_commit: str | Commit, file_names: List[str]) float[source]#

Calculate mean normalized entropy across the provided files.

Parameters:
  • repo_path – Path to the local Git repository.

  • source_commit – Commit object or reference string for range start.

  • target_commit – Commit object or reference string for range end.

  • file_names – Repository-relative file paths to include in the calculation.

Returns:

Mean of per-file normalized entropy values, or 0.0 for no files.

References

Hassan, A. E. (2009). Predicting faults using the complexity of code changes. 2009 IEEE 31st International Conference on Software Engineering, 78-88. https://doi.org/10.1109/ICSE.2009.5070510

almanack.metrics.entropy.calculate_entropy.calculate_aggregate_history_complexity_with_decay(repo_path: Path, source_commit: str | Commit, target_commit: str | Commit, file_names: List[str], config: HistoryComplexityConfig = HistoryComplexityConfig(decay_factor=10.0, quiet_time_seconds=3600)) float[source]#

Calculate the mean decay-weighted history complexity across files.

Parameters:
  • repo_path – Path to the local Git repository.

  • source_commit – Commit object or reference string for range start.

  • target_commit – Commit object or reference string for range end.

  • file_names – Repository-relative file paths to include in the mean.

  • config – Decay and period-grouping configuration.

Returns:

Mean file-level history complexity score, or 0.0 for no files.

References

Hassan, A. E. (2009). Predicting faults using the complexity of code changes. 2009 IEEE 31st International Conference on Software Engineering, 78-88. https://doi.org/10.1109/ICSE.2009.5070510

almanack.metrics.entropy.calculate_entropy.calculate_history_complexity_with_decay(repo_path: Path, source_commit: str | Commit, target_commit: str | Commit, file_names: list[str], config: HistoryComplexityConfig = HistoryComplexityConfig(decay_factor=10.0, quiet_time_seconds=3600)) dict[str, float][source]#

Calculate decay-weighted history complexity for each tracked file.

For each burst period, this computes each file’s Shannon entropy contribution -(p_i * log2(p_i)) from file-level changed-line probabilities, then applies exponential decay by period age.

Parameters:
  • repo_path – Path to the local Git repository.

  • source_commit – Commit object or reference string for range start.

  • target_commit – Commit object or reference string for range end.

  • file_names – Repository-relative file paths to score.

  • config – Decay and period-grouping configuration.

Returns:

Mapping of file path to decay-weighted history complexity score.

Raises:

ValueError – If config.decay_factor is less than or equal to zero.

References

Hassan, A. E. (2009). Predicting faults using the complexity of code changes. 2009 IEEE 31st International Conference on Software Engineering, 78-88. https://doi.org/10.1109/ICSE.2009.5070510
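
A minimal sketch of the per-period calculation described above, working on already-grouped periods rather than a Git repository. The function name, the `reference_time` parameter, and the weighting formula `exp(-age_in_hours / decay_factor)` are assumptions chosen to match the description of decay_factor as a time scale in hours; the library's exact formula may differ.

```python
import math


def history_complexity_with_decay(
    periods: list[tuple[int, dict[str, int]]],
    reference_time: int,
    decay_factor: float = 10.0,
) -> dict[str, float]:
    """Sum decay-weighted entropy contributions per file across burst periods."""
    if decay_factor <= 0:
        raise ValueError("decay_factor must be greater than zero")
    scores: dict[str, float] = {}
    for period_end, file_changes in periods:
        total = sum(file_changes.values())
        if total == 0:
            continue
        # Assumed weighting: older periods decay as exp(-age_in_hours / decay_factor).
        age_hours = (reference_time - period_end) / 3600
        weight = math.exp(-age_hours / decay_factor)
        for path, changed in file_changes.items():
            if changed == 0:
                continue
            # Share of this period's changed lines that belong to this file.
            p = changed / total
            scores[path] = scores.get(path, 0.0) + weight * -(p * math.log2(p))
    return scores
```

Under these assumptions, a period that ended ten hours before `reference_time` with `decay_factor=10.0` contributes with weight `exp(-1)`, roughly 37% of a period that just ended.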

almanack.metrics.entropy.calculate_entropy.calculate_normalized_entropy(repo_path: Path, source_commit: str | Commit, target_commit: str | Commit, file_names: list[str]) dict[str, float][source]#

Calculate per-file normalized Shannon entropy for changed lines.

The function computes entropy using per-file change probabilities between source_commit and target_commit, where each probability is: changed_lines_in_file / total_changed_lines_across_files.

Parameters:
  • repo_path – Path to the local Git repository.

  • source_commit – Commit object or reference string for range start.

  • target_commit – Commit object or reference string for range end.

  • file_names – Repository-relative file paths to include in the calculation.

Returns:

Mapping of file path to that file’s entropy contribution.

References

Hassan, A. E. (2009). Predicting faults using the complexity of code changes. 2009 IEEE 31st International Conference on Software Engineering, 78-88. https://doi.org/10.1109/ICSE.2009.5070510
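
The probability definition above can be illustrated with plain changed-line counts. This sketch is hedged: the function names are hypothetical, and dividing by log2 of the number of files is an assumed normalization in the spirit of Hassan (2009), not a confirmed detail of the library. The second function mirrors the mean described for calculate_aggregate_entropy.

```python
import math


def entropy_contributions(changed_lines: dict[str, int]) -> dict[str, float]:
    """Return each file's normalized entropy contribution -(p * log2(p))."""
    total = sum(changed_lines.values())
    file_count = len(changed_lines)
    # Assumed normalization: divide by log2(number of files) when there are several.
    scale = math.log2(file_count) if file_count > 1 else 1.0
    result: dict[str, float] = {}
    for path, changed in changed_lines.items():
        p = changed / total if total else 0.0
        # Files with no changed lines contribute nothing (log2(0) is undefined).
        result[path] = -(p * math.log2(p)) / scale if p > 0 else 0.0
    return result


def mean_entropy(contributions: dict[str, float]) -> float:
    """Mean of the per-file values, or 0.0 when no files are given."""
    return sum(contributions.values()) / len(contributions) if contributions else 0.0
```

For two files with equal changed-line counts, each probability is 0.5, so each contribution is 0.5 and the mean is 0.5 under this normalization.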

This module processes GitHub data.

almanack.metrics.entropy.processing_repositories.process_pr_entropy(repo_path: str, pr_branch: str, main_branch: str) str[source]#

Processes GitHub PR data to calculate a report comparing the PR branch to the main branch.

Parameters:
  • repo_path (str) – The local path to the Git repository.

  • pr_branch (str) – The branch name of the PR.

  • main_branch (str) – The branch name for the main branch.

Returns:

A JSON string containing report data.

Return type:

str

Raises:

FileNotFoundError – If the specified directory does not contain a valid Git repository.

almanack.metrics.entropy.processing_repositories.process_repo_entropy(repo_path: str) str[source]#

Processes GitHub repository data to calculate a report.

Parameters:

repo_path (str) – The local path to the Git repository.

Returns:

A JSON string containing the repository data and entropy metrics.

Return type:

str

Raises:

FileNotFoundError – If the specified directory does not contain a valid Git repository.

Garden Lattice#

Understanding#

This module focuses on the Almanack’s Garden Lattice materials which encompass aspects of human understanding.

almanack.metrics.garden_lattice.understanding.includes_common_docs(repo: Repository) bool[source]#

Check whether the repo includes common documentation files and directories associated with building docsites.

Parameters:

repo (pygit2.Repository) – The repository object.

Returns:

True if any common documentation files are found, False otherwise.

Return type:

bool

Connectedness#

Module for Almanack metrics covering human connection and engagement in software projects, such as contributor activity and collaboration frequency.

almanack.metrics.garden_lattice.connectedness._build_openalex_doi_metrics(openalex_result: Dict[str, Any]) Dict[str, Any][source]#

Build DOI-linked OpenAlex metrics for storage in citation results.

almanack.metrics.garden_lattice.connectedness._extract_doi_from_citation_data(citation_data: Dict[str, Any]) str | None[source]#

Extract a DOI value from parsed CITATION.cff payload.

almanack.metrics.garden_lattice.connectedness._extract_funder_key(funder: Any) str | None[source]#

Extract a stable funder key from known OpenAlex funder shapes.

almanack.metrics.garden_lattice.connectedness._funding_amount_to_usd(amount: Any, currency: str | None) float | None[source]#

Convert a funding amount to USD; assume USD when currency is unavailable.

almanack.metrics.garden_lattice.connectedness._get_currency_converter() CurrencyConverter[source]#

Return a cached currency converter for funding amount normalization.

almanack.metrics.garden_lattice.connectedness._get_work_funding_records(work_data: Dict[str, Any]) List[Dict[str, Any]][source]#

Return funding records from OpenAlex work payloads.

Supports both modern awards and legacy grants fields. In Almanack outputs, these are normalized as “funding records”.

almanack.metrics.garden_lattice.connectedness._summarize_openalex_funding(work_data: Dict[str, Any]) Dict[str, Any][source]#

Summarize OpenAlex funding payloads for one work record.

almanack.metrics.garden_lattice.connectedness.count_unique_contributors(repo: Repository, since: datetime | None = None) int[source]#

Counts the number of unique contributors to a repository.

If a since datetime is provided, counts contributors who made commits after the specified datetime. Otherwise, counts all contributors.

Parameters:
  • repo (pygit2.Repository) – The repository to analyze.

  • since (Optional[datetime]) – The cutoff datetime. Only contributions after this datetime are counted. If None, all contributions are considered.

Returns:

The number of unique contributors.

Return type:

int
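
The since filter can be sketched without a repository by treating each commit as an (author email, commit time) pair. This is an illustration only: the function name is hypothetical, identifying contributors by author email is an assumption, and "after" is read here as strictly later than the cutoff.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional


def count_unique_authors(
    commits: Iterable[tuple[str, datetime]],
    since: Optional[datetime] = None,
) -> int:
    """Count distinct author emails, optionally only for commits after `since`."""
    # A set keeps one entry per email no matter how many commits an author made.
    return len({email for email, when in commits if since is None or when > since})


history = [
    ("ada@example.org", datetime(2023, 1, 5, tzinfo=timezone.utc)),
    ("ada@example.org", datetime(2024, 3, 1, tzinfo=timezone.utc)),
    ("lin@example.org", datetime(2022, 7, 9, tzinfo=timezone.utc)),
]
```

Counting `history` with no cutoff covers both authors, while a cutoff at the start of 2024 leaves only the author who committed after it.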

almanack.metrics.garden_lattice.connectedness.default_branch_is_not_master(repo: Repository) bool[source]#

Checks whether the default branch of the specified repository is something other than “master”.

Parameters:

repo (Repository) – A pygit2.Repository object representing the Git repository.

Returns:

True if the default branch is not “master”, False otherwise.

Return type:

bool

Analyzes README.md content to identify social media links.

Parameters:

readme_content (str) – The content of the README.md file as a string.

Returns:

A dictionary containing social media details discovered in the README.md content.

Return type:

Dict[str, List[str]]

almanack.metrics.garden_lattice.connectedness.find_doi_citation_data(repo: Repository) Dict[str, Any][source]#

Find and validate DOI information from a CITATION.cff file in a repository.

This function searches for a CITATION.cff file in the provided repository, extracts the DOI (if available), validates its format, checks its resolvability via HTTP, and performs an exact DOI lookup on the OpenAlex API.

Parameters:

repo (pygit2.Repository) – The repository object to search for the CITATION.cff file.

Returns:

A dictionary containing DOI-related information and metadata.

Return type:

Dict[str, Any]

almanack.metrics.garden_lattice.connectedness.find_openalex_citing_works_funding(openalex_work_id: str | None, max_references: int = 25) Dict[str, Any][source]#

Find funding signals from OpenAlex works that cite the repository work.

In this context, “sampled” means the subset of citing works returned by this query, limited by max_references and sorted by cited_by_count. OpenAlex awards (and legacy grants) are normalized as “funding records”.

Parameters:
  • openalex_work_id – OpenAlex work identifier for the project’s DOI-linked work.

  • max_references – Maximum number of citing works to query from OpenAlex.

Returns:

Dictionary with sampled citing-work funding aggregates and references.

almanack.metrics.garden_lattice.connectedness.find_software_mentions_openalex(repo: Repository, remote_url: str | None, max_references: int = 10) Dict[str, Any][source]#

Find OpenAlex works that mention a repository by software/project name.

Parameters:
  • repo – Repository used for software name discovery when no remote URL is available.

  • remote_url – Remote repository URL used to derive the project name.

  • max_references – Maximum number of matching works to include in results.

Returns:

Dictionary containing the query name, aggregate mention count, and a minimal list of matching works from OpenAlex.

almanack.metrics.garden_lattice.connectedness.is_citable(repo: Repository) bool[source]#

Check if the given repository is citable.

A repository is considered citable if it contains a CITATION.cff or CITATION.bib file, or if the README.md file contains a citation section indicated by “## Citation” or “## Citing”.

Parameters:

repo (pygit2.Repository) – The repository to check for citation files.

Returns:

True if the repository is citable, False otherwise.

Return type:

bool
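
The citability rule above can be sketched against a simple mapping of file names to text instead of a pygit2 repository. The function name and the `files` mapping are hypothetical, and matching the headings case-sensitively at the start of a line is an assumption about how the README check behaves.

```python
import re


def looks_citable(files: dict[str, str]) -> bool:
    """Apply the documented rule to a {file name: file content} mapping."""
    # A dedicated citation file is sufficient on its own.
    if "CITATION.cff" in files or "CITATION.bib" in files:
        return True
    # Otherwise look for a "## Citation" or "## Citing" heading in the README.
    readme = files.get("README.md", "")
    return re.search(r"^## (Citation|Citing)", readme, flags=re.MULTILINE) is not None
```

A README containing a line such as `## Citing this work` would satisfy the check in this sketch even without a CITATION file.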

Practicality#

This module focuses on the Almanack’s Garden Lattice materials which involve how people can apply software in practice.

almanack.metrics.garden_lattice.practicality.count_repo_tags(repo: Repository, since: datetime | None = None) int[source]#

Counts the number of tags in a pygit2 repository.

If a since datetime is provided, counts only tags associated with commits made after the specified datetime. Otherwise, counts all tags in the repository.

Parameters:
  • repo (pygit2.Repository) – The repository to analyze.

  • since (Optional[datetime]) – The cutoff datetime. Only tags for commits after this datetime are counted. If None, all tags are counted.

Returns:

The number of tags in the repository that meet the criteria.

Return type:

int

almanack.metrics.garden_lattice.practicality.get_ecosystems_package_metrics(repo_url: str) Dict[str, Any][source]#

Fetches package data from the ecosyste.ms API and calculates metrics about the number of unique ecosystems, total version counts, and the list of ecosystem names.

Parameters:

repo_url (str) – The repository URL of the package to query.

Returns:

A dictionary containing information about packages related to the repository.

Return type:

Dict[str, Any]
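
The three metrics named above (unique ecosystems, total versions, ecosystem names) can be derived from a list of package records roughly as follows. This sketch does not call the ecosyste.ms API: the function name, the input field names `ecosystem` and `versions_count`, and the output keys are assumptions used for illustration.

```python
from typing import Any


def summarize_packages(packages: list[dict[str, Any]]) -> dict[str, Any]:
    """Aggregate ecosystem and version counts from package records."""
    # Deduplicate ecosystem names and sort them for a stable result.
    names = sorted({pkg["ecosystem"] for pkg in packages if pkg.get("ecosystem")})
    # Treat a missing or null version count as zero.
    versions = sum(int(pkg.get("versions_count") or 0) for pkg in packages)
    return {
        "ecosystems_count": len(names),
        "versions_count": versions,
        "ecosystems_names": names,
    }


sample = [
    {"ecosystem": "pypi", "versions_count": 12},
    {"ecosystem": "conda", "versions_count": 3},
    {"ecosystem": "pypi", "versions_count": None},
]
```

Sorting the names keeps the output deterministic, which makes the result easier to compare across runs.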