Scan the QR code or go to https://github.com/AstroAI-CfA/tutorial-lsdb to open this repository.
Demos prepared for the AstroAI workshop, held June 15-18, 2026, in Cambridge, MA.
The notebooks showcase working with HATS-partitioned survey catalogs via LSDB (docs) and time-domain analysis with nested-pandas (docs).
More information at this link.
- Konstantin (Kostya) Malanchev: malanchev@cmu.edu
- GitHub issues for LSDB: https://github.com/astronomy-commons/lsdb/issues
- LSST Discovery Alliance Slack channel
  - Feel free to use the #lincc-frameworks-lsdb channel on LSST-DA Slack for any questions, bugs, or problems!
  - For Rubin-specific questions, please also check the community forum at https://community.lsst.org/ and the Rubin Observatory DP1 documentation page for LSDB.
- "Contact us" documentation page for LSDB: https://docs.lsdb.io/en/latest/contact.html
- Slide deck
- LSDB (Main page) (LSDB catalogs) (on GitHub) (on ReadTheDocs)
- HATS (on GitHub) (on ReadTheDocs)
- nested-pandas (on GitHub) (on ReadTheDocs)
LSDB is a Python framework for working with large (and not so large) astronomical catalogs. Its main strength is the ability to work with catalogs that are too large to fit in memory, by using lazy loading and Dask for parallel processing. It also provides a convenient interface for cross-matching catalogs, and for working with time-domain and spectral data.
HATS (Hierarchical Astronomical Tiling System) is a storage format for astronomical catalogs that is designed to enable efficient bulk processing of large catalogs and to optimize cross-matching, spatial queries, and table column selection.
Each Parquet file has the same structure and corresponds to a single HEALPix tile on the sky. For example, here is the sky map for the HATS Gaia DR3 catalog: each rectangle corresponds to a single Parquet file, and the color shows the mean parallax.
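To get a feel for how large a tile is, recall that HEALPix starts from 12 base pixels and that each increase in order splits every pixel into four equal-area children. The sketch below is plain Python; the function names are illustrative and are not part of the HATS or LSDB API.

```python
import math

# Area of the full sky in square degrees (about 41253).
FULL_SKY_SQ_DEG = 4 * math.pi * (180 / math.pi) ** 2


def healpix_npix(order: int) -> int:
    """Number of HEALPix pixels at a given order (nside = 2**order)."""
    return 12 * 4**order


def healpix_pixel_area_sq_deg(order: int) -> float:
    """Area of one (equal-area) HEALPix pixel at a given order."""
    return FULL_SKY_SQ_DEG / healpix_npix(order)


# Print order, pixel count, and pixel area for the first few orders.
for order in range(8):
    print(order, healpix_npix(order), round(healpix_pixel_area_sq_deg(order), 3))
```

Higher orders give smaller tiles, so a single catalog can use small tiles in crowded regions and large tiles in sparse ones.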
You can run these notebooks on your local machine, but we recommend using a remote environment, such as Google Colab, a science platform like the Rubin Science Platform, or an HPC cluster, because the data downloads are large and the workshop WiFi may have limited bandwidth.
You need a Google account to use Colab. We recommend using no more than two Dask workers in the default Colab environment because of its limited resources. You can also use Colab Pro for more resources, but that is not required to run the notebooks.
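One way to respect the two-worker recommendation is to cap the worker count when creating the Dask client. This is a minimal sketch, not the setup used in the notebooks: the helper names `small_cluster_kwargs` and `start_client` are hypothetical, and one thread per worker is an assumption chosen to keep resource use low.

```python
import os


def small_cluster_kwargs(max_workers: int = 2) -> dict:
    """Keyword arguments for dask.distributed.Client sized for a small machine."""
    n_cpus = os.cpu_count() or 1
    # Never ask for more workers than CPUs, and never more than max_workers.
    return {"n_workers": max(1, min(max_workers, n_cpus)), "threads_per_worker": 1}


def start_client(max_workers: int = 2):
    """Start a local Dask cluster with a capped number of workers."""
    # Imported here so the helper above can be used without Dask installed.
    from dask.distributed import Client

    return Client(**small_cluster_kwargs(max_workers))


print(small_cluster_kwargs())
```

In a notebook you would call `client = start_client()` once near the top and `client.close()` when finished.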
You can run any notebook in Colab by clicking on the "Open in Colab" badge at the top of each notebook. This will open the notebook in Colab, where you can run it as usual. Please remember to uncomment the first code cell to install LSDB.
Before you start, save a personal copy to your Google Drive so your changes are not lost when the session ends. Open the command palette (the Commands button in the top-right toolbar, or Ctrl+Shift+P / Cmd+Shift+P), type Drive, and select Save a copy in Drive.
Make sure that you have access to the Rubin Science Platform and follow the instructions at lsdb.io/dp1.
For a complete guide to setting up an RSP account and getting custom versions of LSDB available in your notebooks, we've put together a system guide that you might find useful.
Warning
Note that you have to use the LATEST WEEKLY version of the Rubin Science Pipelines on Rubin Science Platform.
In this notebook, we will learn how to:
- Import the Dask client
- Load object and source catalogs (lazily)
- Show HATS partitioning with ZTF objects and sources
- Save the results of a science workflow to disk
Color-magnitude diagram of the h and χ Persei double cluster.
In this notebook, we will learn how to:
- Perform crossmatching with existing LSDB catalogs
- Stream the results of LSDB operations instead of computing them all at once
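For readers new to crossmatching, the toy example below shows what a match means: for each object in one list, find the nearest object in another list within a given angular radius. It is a brute-force illustration in plain Python with made-up coordinates, not LSDB's implementation or API.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees, using the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2)
    sin_dra = math.sin((ra2 - ra1) / 2)
    h = sin_ddec**2 + math.cos(dec1) * math.cos(dec2) * sin_dra**2
    return math.degrees(2 * math.asin(math.sqrt(min(1.0, h))))


def crossmatch_nearest(left, right, radius_arcsec=1.0):
    """For each (ra, dec) in left, return the index of the nearest object
    in right within radius_arcsec, or None if there is no such object."""
    radius_deg = radius_arcsec / 3600
    matches = []
    for ra, dec in left:
        best_idx, best_sep = None, radius_deg
        for j, (ra2, dec2) in enumerate(right):
            sep = angular_separation_deg(ra, dec, ra2, dec2)
            if sep <= best_sep:
                best_idx, best_sep = j, sep
        matches.append(best_idx)
    return matches


# Made-up coordinates in degrees, for illustration only.
left = [(10.0, 20.0), (150.0, -30.0)]
right = [(150.0001, -30.0), (10.0, 20.0002), (200.0, 5.0)]
print(crossmatch_nearest(left, right, radius_arcsec=1.0))
```

The brute-force loop compares every pair, which is only practical for tiny lists; spatially partitioned catalogs exist precisely to avoid that cost.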
Train a small neural network to predict stellar temperature from Gaia colors.
In this notebook, we will learn:
- What nested pandas is
- How to do basic operations on timeseries
- How to find periodic variables with a Lomb–Scargle periodogram
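As background for the last item, here is a minimal sketch of the classical (unnormalized) Lomb-Scargle periodogram in plain Python, applied to a simulated, irregularly sampled sinusoid. The data, noise level, and frequency grid are assumptions chosen for illustration; the notebook itself may rely on a library implementation instead.

```python
import math
import random


def lomb_scargle_power(times, values, frequency):
    """Classical Lomb-Scargle power at one frequency (cycles per time unit)."""
    mean = sum(values) / len(values)
    omega = 2 * math.pi * frequency
    # Time offset tau makes the sine and cosine terms orthogonal.
    s2 = sum(math.sin(2 * omega * t) for t in times)
    c2 = sum(math.cos(2 * omega * t) for t in times)
    tau = math.atan2(s2, c2) / (2 * omega)
    yc = ys = cc = ss = 0.0
    for t, y in zip(times, values):
        c = math.cos(omega * (t - tau))
        s = math.sin(omega * (t - tau))
        yc += (y - mean) * c
        ys += (y - mean) * s
        cc += c * c
        ss += s * s
    return 0.5 * (yc * yc / cc + ys * ys / ss)


# Simulated light curve: 200 irregular epochs, true period 2.5, Gaussian noise.
rng = random.Random(0)
true_period = 2.5
times = sorted(rng.uniform(0, 100) for _ in range(200))
values = [math.sin(2 * math.pi * t / true_period) + rng.gauss(0, 0.3) for t in times]

# Evaluate the periodogram on a frequency grid and pick the highest peak.
frequencies = [0.05 + 0.002 * i for i in range(476)]
powers = [lomb_scargle_power(times, values, f) for f in frequencies]
best_frequency = frequencies[powers.index(max(powers))]
best_period = 1 / best_frequency
print("recovered period:", best_period)
```

The frequency grid step should be a fraction of the inverse time baseline (here 1/100), otherwise a narrow peak can fall between grid points.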
Find a transient by fitting the Bazin function to ZTF light curves.
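For reference, the Bazin function (commonly attributed to Bazin et al. 2009) models a transient's flux as an exponential decline modulated by a sigmoid-like rise. The sketch below only evaluates the model; parameter names and example values are illustrative, and the fitting step, typically done with a least-squares optimizer, is not shown.

```python
import math


def bazin(t, amplitude, t0, tau_rise, tau_fall, baseline=0.0):
    """Bazin flux model: exponential fall divided by a sigmoid-like rise term."""
    x = t - t0
    return amplitude * math.exp(-x / tau_fall) / (1 + math.exp(-x / tau_rise)) + baseline


def bazin_peak_time(t0, tau_rise, tau_fall):
    """Time of maximum flux; requires tau_fall > 2 * tau_rise for a peak after t0."""
    return t0 + tau_rise * math.log(tau_fall / tau_rise - 1)


# Illustrative parameters: fast rise (3 days), slow decline (20 days).
params = dict(amplitude=100.0, t0=0.0, tau_rise=3.0, tau_fall=20.0)
t_peak = bazin_peak_time(params["t0"], params["tau_rise"], params["tau_fall"])
for t in range(-30, 101, 10):
    print(t, round(bazin(t, **params), 3))
print("peak time:", round(t_peak, 3))
```

Note that `amplitude` is not the peak flux: at `t = t0` the model equals `amplitude / 2 + baseline`, and the maximum occurs slightly later.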
Similarity search with Astromer2 light-curve embeddings.
In this notebook, we will learn how to:
- Build a multimodal dataset by cross-matching Gaia light curves with APOGEE spectra
- Split a catalog into train/validation/test sets and stream training batches
- Export the result to disk in different ML-ready formats: HATS, Lance, and PyTorch tensors
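One common way to make a train/validation/test split reproducible is to hash a stable object identifier and assign the split from the hash value, so the same object always lands in the same set regardless of row order or partitioning. The sketch below illustrates that idea in plain Python; it is an assumption about a reasonable approach, not necessarily how the notebook splits its catalog, and the function name, salt, and fractions are hypothetical.

```python
import hashlib
from collections import Counter


def assign_split(object_id, fractions=(0.8, 0.1, 0.1), salt="split-v1"):
    """Deterministically map an object ID to 'train', 'validation', or 'test'."""
    digest = hashlib.sha256(f"{salt}:{object_id}".encode()).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    train, validation, _ = fractions
    if u < train:
        return "train"
    if u < train + validation:
        return "validation"
    return "test"


# Check the resulting proportions on 10,000 made-up integer IDs.
counts = Counter(assign_split(i) for i in range(10_000))
print(counts)
```

Changing the salt produces a different but equally reproducible split, which is handy for cross-validation.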
Bring your own model and train it!
This project is supported by Schmidt Sciences.
This project is based upon work supported by the National Science Foundation under Grant No. AST-2003196.
This project acknowledges support from the DIRAC Institute in the Department of Astronomy at the University of Washington. The DIRAC Institute is supported through generous gifts from the Charles and Lisa Simonyi Fund for Arts and Sciences, and the Washington Research Foundation.





