CZII CryoET Object Identification

Nov 6, 2024

The CZII CryoET Object Identification Challenge, hosted on Kaggle from November 2024 to February 2025, was a machine learning competition focused on automating the detection of protein complexes in cryo-electron tomography (cryoET) data. It addressed a critical bottleneck in structural biology research by challenging participants to develop algorithms that accurately identify and annotate multiple types of protein complexes in 3D tomographic volumes.

Challenge Design

The competition was structured around a carefully constructed “phantom” dataset containing five target protein types:

80S Ribosomes
Virus-like Particles (VLPs)
Thyroglobulin (THG)
Beta-galactosidase
Apoferritin

Each protein presented a different level of difficulty for detection due to variations in size, shape, and contrast. We designed the competition metric as a weighted F-beta score with beta=4, a setting that penalizes missed particles (low recall) far more heavily than false detections (low precision). The per-protein scores were then combined with additional weight given to the more challenging proteins (THG and Beta-galactosidase).
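To make the metric concrete, here is a minimal sketch of a weighted F-beta score computed from per-class detection counts. It uses the standard F-beta definition; the specific weight values (2 for THG and Beta-galactosidase, 1 for the rest) and the helper names are assumptions for illustration, not a copy of the official scoring code. How predictions are matched to annotations to obtain the counts is outside this sketch.

```python
def f_beta(tp, fp, fn, beta=4.0):
    """Standard F-beta score from true-positive, false-positive, and false-negative counts."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


# Hypothetical weights: the harder proteins count double (assumed values).
WEIGHTS = {
    "ribosome": 1,
    "virus-like-particle": 1,
    "thyroglobulin": 2,
    "beta-galactosidase": 2,
    "apo-ferritin": 1,
}


def weighted_f_beta(counts, weights=WEIGHTS, beta=4.0):
    """Weighted average of per-class F-beta scores.

    `counts` maps a class name to a (tp, fp, fn) tuple.
    """
    total_weight = sum(weights[name] for name in counts)
    weighted_sum = sum(
        weights[name] * f_beta(*counts[name], beta=beta) for name in counts
    )
    return weighted_sum / total_weight


# With beta=4, missing particles costs far more than over-predicting:
print(round(f_beta(tp=5, fp=5, fn=0), 3))  # perfect recall, 50% precision -> 0.944
print(round(f_beta(tp=5, fp=0, fn=5), 3))  # perfect precision, 50% recall -> 0.515
```

The two printed values show why a recall-heavy metric rewards models that over-predict slightly rather than miss true particles.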

Technical Infrastructure

Our team developed several open-source machine learning tools to support the competition:

Copick: A cross-platform API for working with cryoET datasets, providing standardized access to tomograms and annotations.
CellCanvas: A tool for segmentation and visualization of tomograms, allowing for interactive labeling.
DeepFindET: An adapted version of DeepFinder using a Residual U-Net architecture for multi-class particle detection.
Copick-torch: A PyTorch integration for working with cryoET data through custom dataset handling.
DenoisET: An implementation of Noise2Noise for improving tomogram contrast.

These tools formed a cohesive ecosystem that enabled participants to focus on algorithm development rather than data handling.

Data Preparation

The competition dataset included hundreds of high-quality tomograms with thousands of annotated particles across five protein classes. Creating this “ground truth” dataset required developing a multi-stage workflow that combined:

Template matching for initial candidate selection
Machine learning-based segmentation
Manual curation with specialized tools
Multiple rounds of 2D and 3D classification

This comprehensive approach provided participants with a reliable benchmark for evaluating their solutions.
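As an illustration of the first stage, the following is a toy sketch of template matching by normalized cross-correlation on a synthetic 3D volume. It is not the pipeline used to build the competition dataset: the Gaussian "particle", the volume size, and the function names are all assumptions, and a practical cryoET template matcher would also search over rotations and account for the missing wedge.

```python
import numpy as np


def gaussian_blob(size=5, sigma=1.0):
    """Build a small 3D Gaussian blob to stand in for a particle template."""
    ax = np.arange(size) - (size - 1) / 2
    zz, yy, xx = np.meshgrid(ax, ax, ax, indexing="ij")
    return np.exp(-(zz**2 + yy**2 + xx**2) / (2 * sigma**2))


def normalized_cross_correlation(volume, template):
    """Slide `template` over `volume` and return a map of NCC scores in [-1, 1]."""
    tz, ty, tx = template.shape
    t = template - template.mean()
    t_norm = np.sqrt((t**2).sum())
    out_shape = tuple(v - s + 1 for v, s in zip(volume.shape, template.shape))
    scores = np.zeros(out_shape)
    for z in range(out_shape[0]):
        for y in range(out_shape[1]):
            for x in range(out_shape[2]):
                patch = volume[z:z + tz, y:y + ty, x:x + tx]
                p = patch - patch.mean()
                denom = np.sqrt((p**2).sum()) * t_norm
                scores[z, y, x] = (p * t).sum() / denom if denom > 0 else 0.0
    return scores


# Synthetic test volume: weak Gaussian noise plus one embedded particle.
rng = np.random.default_rng(0)
template = gaussian_blob()
volume = rng.normal(0.0, 0.02, size=(12, 12, 12))
volume[3:8, 5:10, 4:9] += template  # particle corner placed at (z, y, x) = (3, 5, 4)

scores = normalized_cross_correlation(volume, template)
peak = tuple(int(i) for i in np.unravel_index(np.argmax(scores), scores.shape))
print("best match at corner:", peak)
```

High-scoring positions from a search like this serve only as candidates; the later stages (segmentation, manual curation, and classification) are what remove false positives.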

Integration with CryoET Data Portal

The competition leveraged the CZ CryoET Data Portal (cryoetdataportal.czscience.com) as a platform for data distribution and supplemental training resources. This integration allowed participants to:

Access standardized tomograms in the OME-Zarr format
Use additional annotated datasets for model training
Explore tomograms through web-based visualization
Apply their solutions to the broader corpus of public cryoET data

Educational Materials

The competition included extensive educational components to lower the barrier to entry:

Example notebooks for popular model architectures (U-Net, TomoTwin)
Tutorials for working with the copick ecosystem
Documentation on tomogram preprocessing and interpretation
Example workflows for training and inference

Impact

This challenge has significant implications for the field of structural biology:

Accelerating protein complex annotation in cryoET data
Enabling systematic analysis of cellular architecture
Developing algorithms that generalize across diverse biological samples
Building an open-source software ecosystem for cryoET data analysis

The solutions developed through this competition will help scientists better understand cellular processes and potentially accelerate biomedical discoveries.

Kagglers pointed out that the competition had “an incredible level of stability across the public and private leaderboards,” which we attribute to the significant amount of testing we did with the tools described in our NeurIPS paper (Harrington et al., 2024).

References

Harrington, K.I., Zhao, Z., Schwartz, J., Kandel, S., Ermel, U., Paraan, M., Potter, C. and Carragher, B., 2024. Open-source Tools for CryoET Particle Picking Machine Learning Competitions. bioRxiv, pp.2024-11.
Peck, A., Yu, Y., Schwartz, J., Cheng, A., Ermel, U.H., Kandel, S., Kimanius, D., Montabana, E., Serwas, D., Siems, H., Wang, F., Zhao, Z., Zheng, S., Haury, M., Agard, D., Potter, C., Carragher, B., Harrington, K.* and Paraan, M.*, 2024. Annotating CryoET Volumes: A Machine Learning Challenge. bioRxiv. Available at: https://doi.org/10.1101/2024.11.04.621686. * - co-corresponding author
Ermel, U., Cheng, A., Ni, J.X., Gadling, J., Venkatakrishnan, M., Evans, K., Asuncion, J., Sweet, A., Pourroy, J., Wang, Z.S., Khandwala, K., Nelson, B., McCarthy, D., Wang, E.M., Agarwal, R. and Carragher, B., 2024. A data portal for providing standardized annotations for cryo-electron tomography. Nature Methods, 21, pp.2200–2202. doi: 10.1038/s41592-024-02145-8.

Last updated on Nov 6, 2024