CZII CryoET Object Identification

The CZII CryoET Object Identification Challenge, hosted on Kaggle from November 2024 to February 2025, was a machine learning competition focused on automating the detection of protein complexes in cryo-electron tomography (cryoET) data. The competition addressed a critical bottleneck in structural biology research: participants developed algorithms to accurately identify and annotate multiple types of protein complexes in 3D tomographic volumes.
More on the Kaggle page
Challenge Design
The competition was structured around a carefully constructed “phantom” dataset containing five target protein types:
- 80S Ribosomes
- Virus-like Particles (VLPs)
- Thyroglobulin (THG)
- Beta-galactosidase
- Apoferritin
The proteins differed in how difficult they were to detect because of variations in size, shape, and contrast. The competition metric was a weighted F-beta score (beta=4) that prioritized recall over precision, with additional weight given to the more challenging proteins (THG and Beta-galactosidase).
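To make the effect of beta=4 concrete, the following is a minimal sketch of an F-beta computation from true-positive, false-positive, and false-negative counts. It is not the official scoring code: the function names, the class weights (2 for THG and Beta-galactosidase, 1 otherwise), and the choice to pool weighted counts before computing a single score are all assumptions made for illustration.

```python
def f_beta(tp, fp, fn, beta=4.0):
    """F-beta from raw counts; beta > 1 penalizes misses (fn) more than false alarms (fp)."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def weighted_f_beta(counts, weights, beta=4.0):
    """Pool per-class (tp, fp, fn) counts using class weights, then score once.

    ASSUMPTION: pooling weighted counts is one plausible aggregation;
    the competition's exact aggregation is not stated in this text.
    """
    tp = sum(weights.get(name, 1.0) * c[0] for name, c in counts.items())
    fp = sum(weights.get(name, 1.0) * c[1] for name, c in counts.items())
    fn = sum(weights.get(name, 1.0) * c[2] for name, c in counts.items())
    return f_beta(tp, fp, fn, beta)


# HYPOTHETICAL weights: the text only says the harder classes get more weight.
EXAMPLE_WEIGHTS = {
    "ribosome": 1.0,
    "vlp": 1.0,
    "thyroglobulin": 2.0,
    "beta-galactosidase": 2.0,
    "apoferritin": 1.0,
}

# Same number of correct picks, but errors distributed differently:
many_false_alarms = f_beta(tp=80, fp=40, fn=20)  # about 0.79
many_misses = f_beta(tp=80, fp=20, fn=40)        # about 0.67
```

With beta=4, each missed particle costs sixteen times as much as a false positive in the denominator, so a picker that over-predicts scores higher than one that misses the same number of particles.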
Technical Infrastructure
Several open-source software tools were developed to support the competition:
- Copick: A cross-platform API for working with cryoET datasets, providing standardized access to tomograms and annotations.
- CellCanvas: A tool for segmentation and visualization of tomograms, allowing for interactive labeling.
- DeepFindET: An adapted version of DeepFinder using a Residual U-Net architecture for multi-class particle detection.
- Copick-torch: A PyTorch integration for working with cryoET data through custom dataset handling.
- DenoisET: An implementation of Noise2Noise for improving tomogram contrast.
These tools formed a cohesive ecosystem that enabled participants to focus on algorithm development rather than data handling.
Data Preparation
The competition dataset included hundreds of high-quality tomograms with thousands of annotated particles across five protein classes. Creating this “ground truth” dataset required developing a multi-stage workflow that combined:
- Template matching for initial candidate selection
- Machine learning-based segmentation
- Manual curation with specialized tools
- Multiple rounds of 2D and 3D classification
This comprehensive approach provided participants with a reliable benchmark for evaluating their solutions.
Integration with CryoET Data Portal
The competition leveraged the CZ CryoET Data Portal (cryoetdataportal.czscience.com) as a platform for data distribution and supplemental training resources. This integration allowed participants to:
- Access standardized tomograms in the OME-Zarr format
- Use additional annotated datasets for model training
- Explore tomograms through web-based visualization
- Apply their solutions to the broader corpus of public cryoET data
Educational Materials
The competition included extensive educational components to lower the barrier to entry:
- Example notebooks for popular model architectures (U-Net, TomoTwin)
- Tutorials for working with the copick ecosystem
- Documentation on tomogram preprocessing and interpretation
- Example workflows for training and inference
Impact
This challenge has significant implications for the field of structural biology:
- Accelerating protein complex annotation in cryoET data
- Enabling systematic analysis of cellular architecture
- Developing algorithms that generalize across diverse biological samples
- Building an open-source software ecosystem for cryoET data analysis
The solutions developed through this competition will help scientists better understand cellular processes and potentially accelerate biomedical discoveries.
Kagglers pointed out that the competition had “an incredible level of stability across the public and private leaderboards,” which we attribute to the extensive testing we did with the tools described in our NeurIPS paper (Harrington et al., 2024).
References
- Harrington, K.I., Zhao, Z., Schwartz, J., Kandel, S., Ermel, U., Paraan, M., Potter, C. and Carragher, B., 2024. Open-source Tools for CryoET Particle Picking Machine Learning Competitions. bioRxiv.
- Peck, A., Yu, Y., Schwartz, J., Cheng, A., Ermel, U.H., Kandel, S., Kimanius, D., Montabana, E., Serwas, D., Siems, H., Wang, F., Zhao, Z., Zheng, S., Haury, M., Agard, D., Potter, C., Carragher, B., Harrington, K.* and Paraan, M.*, 2024. Annotating CryoET Volumes: A Machine Learning Challenge. bioRxiv. doi: 10.1101/2024.11.04.621686. (* co-corresponding author)
- Ermel, U., Cheng, A., Ni, J.X., Gadling, J., Venkatakrishnan, M., Evans, K., Asuncion, J., Sweet, A., Pourroy, J., Wang, Z.S., Khandwala, K., Nelson, B., McCarthy, D., Wang, E.M., Agarwal, R. and Carragher, B., 2024. A data portal for providing standardized annotations for cryo-electron tomography. Nature Methods, 21, pp.2200–2202. doi: 10.1038/s41592-024-02145-8.