

WHO HAS CANCER? THE DATA DOES NOT LIE, BUT THE TRUTH IS HARD TO FIND
Osnat Ashur-Fabian1,2, Yuval Zehavi3, Tzipi Hornik-Lurie4, Aula Asali5, Ran Gilad-Bachrach3
1Translational Oncology Laboratory, Meir Medical Center, Kfar Saba, Israel.
2Department of Human Molecular Genetics and Biochemistry, Gray Faculty of Medical and Health Sciences, Tel Aviv University, Tel Aviv, Israel.
3Department of Biomedical Engineering, Tel Aviv University, Tel Aviv, Israel; Edmond J. Safra Center for Bioinformatics, Tel Aviv University, Tel Aviv, Israel.
4Meir Research Room, Meir Medical Center, Kfar Saba, 44281, Israel.
5Division of Gynecologic Oncology, Meir Medical Center, Kfar Saba, Israel.
Objectives: Large, detailed repositories of medical records offer an exciting opportunity to identify precursors of cancer, potentially enabling earlier detection or even prevention. The accuracy of this information, however, is critical to the integrity of such medical research. We aimed to evaluate whether integrating artificial intelligence (AI) tools, such as automatic tagging with ICD codes or the use of large language models (LLMs), can enhance the quality of clinical data.
Methods & Results: We developed a method for evaluating the quality of annotations using a structured scoring system that includes both objective and subjective parameters. We used the final annotations published in peer-reviewed journals as a reference and compared three versions of the same case: manual annotation, AI-generated annotation, and hybrid annotation (manual with AI assistance). The study focused on medical publications in the fields of oncology and internal medicine. In this talk, we will share our insights, challenges, and lessons learned from working with such datasets. We will focus on two key topics: (1) identifying and verifying patients with cancer, an ostensibly straightforward task that, despite access to electronic health records (EHRs), evolved into a complex and time-consuming investigation; and (2) leveraging LLMs to infer missing diagnostic codes, offering new possibilities for enhancing data completeness and utility.
Conclusions: Integrating AI-based tools can significantly enhance the quality and consistency of clinical annotations and improve the efficiency and accuracy of data extraction in medical studies.