chhsdata/chhsdata-datahub-pkg-repo-python

datahub-pkg-repo-python

Approved Python packages for the CDII DataHub Databricks platform.

This repository is the formal governance gate for all third-party Python packages used on DataHub clusters. A package must complete the vetting process and be merged into this repo before it may be installed on any cluster.

For the equivalent R package repo, see: chhsdata/datahub-pkg-repo-r


Repository Structure

datahub-pkg-repo-python/
├── approved-packages.yml          # Registry of all approved packages
├── packages/
│   └── <package-name>/
│       └── <version>/
│           ├── *.whl              # Linux x86_64 binary wheel (install this on Databricks)
│           ├── *.tar.gz           # Source distribution (governance artifact)
│           ├── checksum.md        # SHA256 checksums verified against PyPI
│           ├── scan-results.md    # pip-audit CVE scan output
│           ├── license-review.md  # License compatibility review
│           └── test_import.py     # Import test — validates install on a live cluster
└── sample-notebooks/              # Reference notebooks showing approved package usage
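
As a rough illustration of the layout above, a reviewer could check that a version directory contains every expected artifact before approving a pull request. This is only a sketch: the helper name `missing_artifacts` is hypothetical and not part of this repository, and the patterns are copied from the directory tree.

```python
from pathlib import Path

# Glob patterns copied from the directory tree above.
REQUIRED_ARTIFACTS = (
    "*.whl",
    "*.tar.gz",
    "checksum.md",
    "scan-results.md",
    "license-review.md",
    "test_import.py",
)


def missing_artifacts(version_dir: Path) -> list[str]:
    """Return every required pattern that has no matching file in version_dir."""
    return [
        pattern
        for pattern in REQUIRED_ARTIFACTS
        if not any(version_dir.glob(pattern))
    ]


# Hypothetical usage from the repository root:
# missing_artifacts(Path("packages/uuid-utils/0.10.0"))
```

An empty list means all six artifacts are present. The check says nothing about their contents.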

Requesting a New Package

Open a pull request with the following artifacts. The PR will not be merged until all steps are complete and branch protection checks pass.

Vetting checklist (all steps required):

  1. Source & Maintenance Review — Confirm the package is published on PyPI, identify the author, check release frequency and GitHub activity.

  2. License Review — Confirm the license is permissive and compatible with California state government use (MIT, BSD-2/3-Clause, Apache-2.0, PSF). Copyleft licenses (GPL, LGPL, AGPL) require legal review before approval.

  3. Download & Checksum Verification — Download the Linux x86_64 .whl and .tar.gz source distribution from PyPI. Verify SHA256 checksums against the PyPI JSON API. Save results to checksum.md.

  4. CVE Scan — Run pip-audit against a requirements.txt pinned to the exact version. Save full output to scan-results.md. Any findings must be resolved or formally accepted before merge.

  5. Write test_import.py — A minimal script that imports the package and confirms basic functionality. Run this on a live cluster after installing the wheel to validate the package works as expected.

  6. Update approved-packages.yml — Add an entry with all fields populated (version, license, checksums, scan date, Jira story).
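
Step 3 of the checklist could be scripted roughly as follows. This sketch assumes the PyPI JSON API endpoint `https://pypi.org/pypi/<name>/<version>/json`, whose `urls` list carries a `digests.sha256` value for each release file. The function names are illustrative and not part of this repository's tooling.

```python
import hashlib
import json
import urllib.request
from pathlib import Path


def fetch_release_json(name: str, version: str) -> dict:
    """Fetch release metadata from the PyPI JSON API (requires network access)."""
    url = f"https://pypi.org/pypi/{name}/{version}/json"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def pypi_sha256(release: dict, filename: str) -> str:
    """Look up the SHA256 digest that PyPI publishes for one release file."""
    for entry in release["urls"]:
        if entry["filename"] == filename:
            return entry["digests"]["sha256"]
    raise KeyError(f"{filename} is not listed for this release")


def local_sha256(path: Path) -> str:
    """Hash a downloaded file in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, release: dict) -> bool:
    """True when the local file's digest matches the one PyPI publishes."""
    return local_sha256(path) == pypi_sha256(release, path.name)


# Hypothetical usage (needs network access and the downloaded wheel on disk):
# release = fetch_release_json("uuid-utils", "0.10.0")
# verify_download(Path("<downloaded wheel>"), release)
```

Run the same comparison for both the .whl and the .tar.gz, and record the digests in the checksum file.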

Full process documentation: Confluence — CDII Package Governance


Installing an Approved Package on Databricks

Approved .whl files are stored in this repository. Use the method that matches your workspace type.

Method A — Cluster Libraries UI (Unity Catalog workspaces)

Unity Catalog workspaces do not expose DBFS in the same way as legacy workspaces. Use the cluster Libraries UI instead:

  1. Go to Compute in the left sidebar and select your cluster
  2. Click the Libraries tab → Install new → Upload
  3. Upload the .whl file from this repository: packages/uuid-utils/0.10.0/uuid_utils-0.10.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  4. Wait for status to show Installed

Method B — DBFS path via %pip (legacy workspaces)

Step 1 — Upload the wheel to DBFS

# Upload via Databricks CLI (run from your local machine)
databricks fs cp packages/uuid-utils/0.10.0/uuid_utils-0.10.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl \
  dbfs:/FileStore/packages/

Step 2 — Install in your notebook using %pip

# In a Databricks notebook cell. %pip installs a notebook-scoped library:
# it applies to this notebook's session, not to every notebook on the cluster.
%pip install /dbfs/FileStore/packages/uuid_utils-0.10.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

Verify the installation

import uuid_utils
print(uuid_utils.__version__)  # should print: 0.10.0
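
The same check can be made to fail loudly instead of relying on someone reading the printed value, which is also a reasonable starting point for a test_import.py. The sketch below assumes the package exposes `__version__`, as the snippet above does. The helper name `check_import` is hypothetical.

```python
import importlib


def check_import(module_name: str, expected_version: str) -> str:
    """Import a module and raise if its __version__ is not the approved one."""
    module = importlib.import_module(module_name)
    found = getattr(module, "__version__", None)
    if found != expected_version:
        raise RuntimeError(
            f"{module_name}: expected version {expected_version}, found {found}"
        )
    return found


# On a cluster with the approved wheel installed:
# check_import("uuid_utils", "0.10.0")
```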

Notes:

  • Always use the exact .whl filename from approved-packages.yml — never install from PyPI directly (pip install uuid-utils is not permitted).
  • The wheel filenames encode the target platform. All wheels in this repo are built for Linux x86_64 (the Databricks cluster OS). Do not substitute macOS or Windows wheels.
  • Version pinning is mandatory. Never use >= or omit the version.
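
To illustrate how a wheel filename encodes its target platform, the sketch below splits a filename into the fields defined by the standard wheel naming scheme (distribution, version, optional build tag, Python tag, ABI tag, platform tag) and checks that every platform tag is a Linux x86_64 tag. The names `WheelTags`, `parse_wheel_filename`, and `is_linux_x86_64` are illustrative only.

```python
from typing import NamedTuple


class WheelTags(NamedTuple):
    distribution: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelTags:
    """Split a wheel filename into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 6:
        # An optional build tag sits between the version and the Python tag.
        del parts[2]
    if len(parts) != 5:
        raise ValueError(f"unexpected wheel filename layout: {filename}")
    return WheelTags(*parts)


def is_linux_x86_64(tags: WheelTags) -> bool:
    """True when every '.'-joined platform tag targets Linux on x86_64."""
    return all(
        "linux" in tag and tag.endswith("x86_64")
        for tag in tags.platform_tag.split(".")
    )
```

For the approved uuid-utils wheel, the platform field is `manylinux_2_17_x86_64.manylinux2014_x86_64`, so the check passes. A macOS or Windows wheel would fail it. Pure-Python wheels use the platform tag `any` and would need a separate rule if such packages are ever approved.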

Using approved-packages.yml

approved-packages.yml is the authoritative registry of all approved packages. Before installing any package, confirm it is listed here with a matching version.

# Example entry
- name: uuid-utils
  version: "0.10.0"
  license: BSD-3-Clause
  whl_sha256: 263b2589111c61decdd74a762e8f850c9e4386fb78d2cf7cb4dfc537054cda1b
  cve_scan: PASS
  cve_scan_date: "2026-05-18"
  jira_story: HUB-1630
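
A registry entry like the one above could be sanity-checked before merge, and looked up before install, along these lines. This sketch assumes the file has already been parsed into a list of dictionaries (for example with a YAML library, which is not shown here) and that the field names match the example entry. Both helper names are hypothetical.

```python
import re

# Field names taken from the example entry above.
REQUIRED_FIELDS = (
    "name",
    "version",
    "license",
    "whl_sha256",
    "cve_scan",
    "cve_scan_date",
    "jira_story",
)


def entry_problems(entry: dict) -> list[str]:
    """List missing fields and a malformed checksum for one registry entry."""
    problems = [
        f"missing field: {field}"
        for field in REQUIRED_FIELDS
        if not entry.get(field)
    ]
    sha = str(entry.get("whl_sha256") or "")
    if sha and not re.fullmatch(r"[0-9a-f]{64}", sha):
        problems.append("whl_sha256 is not 64 lowercase hex characters")
    return problems


def find_approved(registry: list[dict], name: str, version: str):
    """Return the entry matching name and exact version, or None."""
    for entry in registry:
        if entry.get("name") == name and str(entry.get("version")) == version:
            return entry
    return None
```

A result of `None` from `find_approved` means the requested name and version pair is not in the registry and must not be installed.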

To verify a wheel file you already have on disk matches the approved checksum:

shasum -a 256 uuid_utils-0.10.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
# On Linux, sha256sum <file> produces the same digest
# Compare output against whl_sha256 in approved-packages.yml

Approved Packages

Package      Version   License        Approved     Jira
uuid-utils   0.10.0    BSD-3-Clause   2026-05-18   HUB-1630

Governance Documentation

Full policy, process, and rationale: Confluence — CDII Package Governance

Maintainer: CDII DataHub Platform Team
