CZII CryoET Object Identification

The CZII CryoET Object Identification Challenge, hosted on Kaggle from November 2024 to February 2025, was a machine learning competition focused on automating the detection of protein complexes in cryo-electron tomography (cryoET) data. The phantom dataset behind the competition is described in Nature Methods (Peck et al., 2025). The challenge addressed a critical bottleneck in structural biology research: accurately identifying and annotating multiple types of protein complexes in 3D tomographic volumes.
More on the Kaggle page
Challenge Design
The competition was structured around a carefully constructed “phantom” dataset containing five target protein types:
- 80S Ribosomes
- Virus-like Particles (VLPs)
- Thyroglobulin (THG)
- Beta-galactosidase
- Apoferritin
Each protein presented a different level of detection difficulty because of variations in size, shape, and contrast. We designed the competition metric as a weighted F-beta score with beta = 4, which weights recall far more heavily than precision, so a missed particle costs more than a false positive. The more challenging proteins (THG and beta-galactosidase) received additional weight.
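To make the effect of beta = 4 concrete, the sketch below computes a per-class F-beta score from true positive, false positive, and false negative counts, then combines classes with a weighted mean. The specific class weights (1 for the easier targets, 2 for THG and beta-galactosidase), the weighted-mean aggregation, and all counts are illustrative assumptions; the description above only states that the harder proteins received more weight.

```python
def fbeta(tp: int, fp: int, fn: int, beta: float = 4.0) -> float:
    """F-beta from detection counts:
    (1 + b^2) * TP / ((1 + b^2) * TP + b^2 * FN + FP)
    """
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def weighted_fbeta(counts: dict, weights: dict, beta: float = 4.0) -> float:
    """Weighted mean of per-class F-beta scores (an assumed aggregation)."""
    total = sum(weights[name] for name in counts)
    return sum(weights[name] * fbeta(*counts[name], beta=beta) for name in counts) / total


# Hypothetical weights: the harder classes count double.
WEIGHTS = {"ribosome": 1, "vlp": 1, "apoferritin": 1, "thg": 2, "beta_galactosidase": 2}

# Made-up (tp, fp, fn) counts for two picking strategies on one class.
over_picker = fbeta(tp=9, fp=9, fn=1)   # many false positives, few misses
under_picker = fbeta(tp=5, fp=0, fn=5)  # no false positives, many misses
print(f"over-picking:  {over_picker:.3f}")
print(f"under-picking: {under_picker:.3f}")
```

With these made-up counts, the over-picking strategy scores about 0.86 and the under-picking one about 0.52, even though the latter has perfect precision. That asymmetry is what a recall-weighted metric is meant to produce.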
Technical Infrastructure
Our team developed several open-source machine learning tools to support the competition:
- Copick: A cross-platform API for working with cryoET datasets, providing standardized access to tomograms and annotations.
- CellCanvas: A tool for segmentation and visualization of tomograms, allowing for interactive labeling.
- DeepFindET: An adapted version of DeepFinder using a Residual U-Net architecture for multi-class particle detection.
- Copick-torch: A PyTorch integration for working with cryoET data through custom dataset handling.
- DenoisET: An implementation of Noise2Noise for improving tomogram contrast.
These tools formed a cohesive ecosystem that enabled participants to focus on algorithm development rather than data handling.
Data Preparation
The competition dataset included hundreds of high-quality tomograms with thousands of annotated particles across five protein classes. Creating this “ground truth” dataset required developing a multi-stage workflow that combined:
- Template matching for initial candidate selection
- Machine learning-based segmentation
- Manual curation with specialized tools
- Multiple rounds of 2D and 3D classification
This comprehensive approach provided participants with a reliable benchmark for evaluating their solutions.
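As an illustration of the first stage, template matching can be sketched as a cross-correlation between a tomogram and a reference shape, with peaks in the score volume marking candidate particle positions. The example below is a minimal NumPy sketch on synthetic data, not the pipeline used for the competition: the spherical template, volume size, noise level, and function names are all assumptions, and production tools additionally search over rotations and normalize scores locally.

```python
import numpy as np


def cross_correlate(volume: np.ndarray, template: np.ndarray) -> np.ndarray:
    """Circular cross-correlation of a 3D volume with a smaller template via FFT.

    The value at index (z, y, x) scores how well the template matches when its
    corner voxel sits at that position.
    """
    v = volume - volume.mean()
    t = template - template.mean()
    padded = np.zeros_like(v)
    padded[tuple(slice(0, s) for s in t.shape)] = t
    return np.fft.ifftn(np.fft.fftn(v) * np.conj(np.fft.fftn(padded))).real


def sphere(box: int, radius: float) -> np.ndarray:
    """Binary sphere centered in a cubic box, standing in for a particle template."""
    grid = np.indices((box, box, box)) - (box - 1) / 2
    return (np.sum(grid ** 2, axis=0) <= radius ** 2).astype(float)


rng = np.random.default_rng(0)
template = sphere(box=8, radius=3)
volume = rng.normal(0.0, 0.2, size=(32, 32, 32))  # synthetic noisy "tomogram"
volume[10:18, 12:20, 14:22] += template           # plant one particle

scores = cross_correlate(volume, template)
peak = tuple(int(i) for i in np.unravel_index(np.argmax(scores), scores.shape))
print(peak)  # corner offset of the best match
```

In a multi-stage workflow like the one listed above, peaks from such a score volume would only be initial candidates, to be filtered by the later segmentation, curation, and classification steps.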
Integration with CryoET Data Portal
The competition leveraged the CZ CryoET Data Portal (cryoetdataportal.czscience.com) as a platform for data distribution and supplemental training resources. This integration allowed participants to:
- Access standardized tomograms in the OME-Zarr format
- Use additional annotated datasets for model training
- Explore tomograms through web-based visualization
- Apply their solutions to the broader corpus of public cryoET data
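OME-Zarr stores a tomogram as a pyramid of resolution levels, each described by a scale transform in the group's `multiscales` metadata. The sketch below parses such metadata with the standard library only, to show how a voxel size can be read per level and how a level might be chosen for training. The metadata contents (three levels, the scale values, the unit) are invented for illustration, and the helper names are hypothetical; reading the actual arrays would require a Zarr reader, which is not shown.

```python
import json

# Hypothetical OME-Zarr group attributes in the NGFF "multiscales" layout.
# The scale values are invented for illustration.
ZATTRS = json.loads("""
{
  "multiscales": [{
    "axes": [
      {"name": "z", "type": "space", "unit": "angstrom"},
      {"name": "y", "type": "space", "unit": "angstrom"},
      {"name": "x", "type": "space", "unit": "angstrom"}
    ],
    "datasets": [
      {"path": "0", "coordinateTransformations": [{"type": "scale", "scale": [10.0, 10.0, 10.0]}]},
      {"path": "1", "coordinateTransformations": [{"type": "scale", "scale": [20.0, 20.0, 20.0]}]},
      {"path": "2", "coordinateTransformations": [{"type": "scale", "scale": [40.0, 40.0, 40.0]}]}
    ]
  }]
}
""")


def voxel_sizes(attrs: dict) -> dict:
    """Map each resolution level's array path to its (z, y, x) voxel size."""
    sizes = {}
    for dataset in attrs["multiscales"][0]["datasets"]:
        for transform in dataset["coordinateTransformations"]:
            if transform["type"] == "scale":
                sizes[dataset["path"]] = tuple(transform["scale"])
    return sizes


def coarsest_level_at_most(attrs: dict, max_voxel_size: float) -> str:
    """Pick the lowest-resolution level whose voxels are no larger than max_voxel_size."""
    candidates = {p: s for p, s in voxel_sizes(attrs).items() if max(s) <= max_voxel_size}
    return max(candidates, key=lambda p: max(candidates[p]))


print(voxel_sizes(ZATTRS))
print(coarsest_level_at_most(ZATTRS, max_voxel_size=25.0))
```

Choosing the coarsest level that still resolves the target particle keeps memory use down, which matters for the smaller targets where too much downsampling would erase the signal.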
Educational Materials
The competition included extensive educational components to lower the barrier to entry:
- Example notebooks for popular model architectures (U-Net, TomoTwin)
- Tutorials for working with the copick ecosystem
- Documentation on tomogram preprocessing and interpretation
- Example workflows for training and inference
Impact
This challenge has significant implications for structural biology:
- Standardized Benchmark: Created a reliable dataset for training and evaluating particle detection algorithms
- Open Infrastructure: Built entirely on open-source tools, enabling community-driven innovation in cryoET analysis
- Foundation Model Development: Established data infrastructure to support training of large models on cellular imaging
- Cross-Team Collaboration: Demonstrated stable evaluation metrics across diverse approaches (as noted by participants)
The solutions developed through this competition help scientists better understand cellular processes and accelerate biomedical discoveries.
References
- Peck, A., Yu, Y., Schwartz, J., Cheng, A., Ermel, U.H., Hutchings, J., Kandel, S., Kimanius, D., Montabana, E.A., Serwas, D., Siems, H., Wang, F., Zhao, Z., Zheng, S., Haury, M., Agard, D., Potter, C., Carragher, B., Harrington, K.* and Paraan, M.*, 2025. A realistic phantom dataset for benchmarking cryo-ET data annotation. Nature Methods, pp. 1-5. (* co-corresponding author)
- Harrington, K.I., Zhao, Z., Schwartz, J., Kandel, S., Ermel, U., Paraan, M., Potter, C. and Carragher, B., 2024. Open-source Tools for CryoET Particle Picking Machine Learning Competitions. bioRxiv, pp. 2024-11.