Two notable healthcare innovations from “Big AI” were released recently. Both sit more on the R&D side of things, but they show that the large labs remain focused on healthcare.
MedGemma was released by the Google DeepMind team. It is an open-source LLM with text and multimodal features, covering both text-based medical question answering and medical image classification and interpretation. The 4B model (smaller) is multimodal and the 27B model (larger) is suitable for text. The model card states that MedGemma 4B was trained on deidentified chest X-rays, dermatology images, ophthalmology images, and histopathology slides. These strike me as the same data sets used in studies the DeepMind group has previously published. MedGemma 27B is an instruction-tuned model (i.e., well suited to chat-based Q&A). Google presents it as a base model that can be further trained and incorporated into workflows.
The model card closes with a limitation, stated as follows: “The outputs generated by MedGemma are not intended to directly inform clinical diagnosis, patient management decisions, treatment recommendations, or any other direct clinical practice applications.”
This is still a research model.
A second milestone was OpenAI’s release of evaluation criteria for healthcare, “HealthBench”. From their post:
Today, we’re introducing HealthBench: a new benchmark designed to better measure capabilities of AI systems for health. Built in partnership with 262 physicians who have practiced in 60 countries, HealthBench includes 5,000 realistic health conversations, each with a custom physician-created rubric to grade model responses.
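To make “rubric-graded” concrete, here is a minimal sketch of how a single response might be scored against a physician-written rubric. Everything in it is my own assumption for illustration: the `Criterion` class, the point values, the example criteria, and the choice to normalize by the maximum positive points and clip to the range 0 to 1. It is not OpenAI's code, and whether each criterion is met would in practice be judged by a grader rather than hard-coded.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric line (hypothetical structure, for illustration only)."""
    description: str
    points: int  # positive = desired behavior, negative = penalized behavior
    met: bool    # whether a grader judged the response to meet this line


def rubric_score(criteria):
    """Return earned points / max achievable points, clipped to [0, 1]."""
    # Only positive criteria count toward the achievable maximum.
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positive criterion")
    # Met negative criteria subtract from the total.
    earned = sum(c.points for c in criteria if c.met)
    return min(1.0, max(0.0, earned / max_points))


# Made-up rubric for a made-up conversation about chest pain.
example = [
    Criterion("Advises urgent evaluation for chest pain", 5, True),
    Criterion("Asks how long symptoms have lasted", 3, False),
    Criterion("Suggests a prescription dose without context", -4, False),
]

print(rubric_score(example))
```

In this made-up case the response earns 5 of 8 achievable points. A benchmark-level number would then be some aggregate of these per-conversation scores.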
To date, evaluation of LLMs in the healthcare domain has lacked rigor, focusing mostly on responses to multiple-choice questions and on classifiers for data sets like MIMIC (deidentified healthcare data). A lack of deep longitudinal healthcare data that is sufficiently cleaned to be useful for training, and also similar enough to real-world data to be generalizable, remains a big challenge for the industry. It is amazing that a data set with a relatively small N (5,000 conversations) is considered state of the art. This is probably less than one month of work for a busy clinician, but here we are.
I will have a larger post in the near future (after I’ve kicked the tires on these two) that gives my opinion on their utility and the purposes they suit.