Digital public goods for AI in Health in India
Nisheeth, CSE / CGS / CDIS @ IITK
12/17/2024
Failure mode: data leakage
- Epic Systems is one of the largest EHR vendors in the US
- It released a sepsis onset prediction model in 2017, which is deployed in hundreds of US hospitals
- A 2021 external validation study of the algorithm's performance found that:
  - It detected only 7% of sepsis cases
  - It generated sepsis predictions for about 20% of all hospitalized patients
Anatomy of a failure
Failure mode: data leakage
- The Epic sepsis model is proprietary, so nobody knew that it used whether a doctor had prescribed antibiotics as a predictive feature
- If a doctor is prescribing antibiotics, they already suspect sepsis
- So the model performed well in internal testing and pilot demonstrations on back-tested data
- But it had low PPV in real-world settings
Closed testing produces black swans
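To make the leak concrete, here is a toy back-test in Python (all data and numbers are synthetic and invented for illustration; this is not Epic's model): a feature recording whether antibiotics were ordered, which in retrospective data is nearly a proxy for the label, makes a weak model look strong.

```python
# Toy illustration of label leakage (all data synthetic; not Epic's model).
# "Antibiotics ordered" is recorded after a clinician already suspects
# sepsis, so in retrospective data it is nearly a proxy for the label.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 20_000
sepsis = rng.binomial(1, 0.05, n)              # true onset label
vitals = rng.normal(sepsis * 0.5, 1.0, n)      # weak but legitimate signal
abx = rng.binomial(1, np.where(sepsis == 1, 0.8, 0.1))  # leaky feature

X_leaky = np.column_stack([vitals, abx])
X_clean = vitals.reshape(-1, 1)

tr, te = slice(0, 10_000), slice(10_000, None)
for name, X in [("with leaked feature", X_leaky), ("vitals only", X_clean)]:
    clf = LogisticRegression().fit(X[tr], sepsis[tr])
    auc = roc_auc_score(sepsis[te], clf.predict_proba(X[te])[:, 1])
    print(f"{name}: back-tested AUC = {auc:.2f}")
# The leaky model looks far better in back-testing, but at prediction
# time the antibiotic order does not yet exist, so the advantage vanishes.
```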
Trusting AI systems can be hard
- AI systems cannot be fully evaluated using conventional software engineering quality testing
- Governance of AI systems must recognize this basic fact
- AI systems must demonstrate that they do what they claim to do in order to show they are trustworthy
- Modern AI systems can sometimes fail in unexpected ways (Raji et al., 2022)
Governing misbehaving AI systems
How could this have been avoided? Third-party testing. But that is expensive, presupposes a testing ecosystem, etc.
- Suppose a standard ABC standardizes third-party testing procedures for a class of models
- Epic can choose to follow ABC, but how do vendors know they should follow ABC? How do vendors know they have correctly followed ABC?
- A regulator can require vendors delivering sepsis prediction models to follow ABC, but how does the regulator know which models, sectors, or products should follow ABC?
How do we trust AI systems?
Make AI in India: a short story
- Meet Dr. Sharma and her friend, who aim to develop a reliable AI model for heart failure prediction
- They obtain data from their state's premier superspeciality hospital
- They obtain some BIRAC funding, and start rolling
... and scaling failure
Dr. Sharma trained a model using the data she had; here is the model's performance, tested on some unseen held-out data. [Figure: model performance on held-out data]
Classical solution: standardization and certification
- BIS could standardize third-party testing procedures for AI systems used to make healthcare decisions
- Hospital regulators could require that algorithms deployed in their hospitals meet this standard
- Some agency could certify that an algorithm meets this standard
But AI systems quality testing is very challenging
The AI quality testing trilemma
- Testing a variety of use cases centrally enhances reliability, but requires a central private data repository, restricting openness
- Letting people test independently permits openness and enables coverage, but independent testers are likely to overfit their own test sets
- Testing on publicly available benchmarks is open, and reliable for data distributions consistent with existing benchmarks, but not for use cases outside the benchmarks' coverage
Trilemma: you can get any two, but not all three
The trilemma explained
- Open datasets for model testing quickly lose statistical validity (demonstrated in the sketch below)
- Closed datasets for model testing fail to take long-tail coverage scenarios into account
- Model providers have no incentive to perform true out-of-sample validation
- Model consumers frequently have no clear understanding of the potential failure modes of complex AI models
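The first point can be demonstrated directly: if a fixed public test set is queried adaptively by many submissions, the best reported score inflates by selection alone. A minimal synthetic demonstration (all numbers invented):

```python
# Adaptive reuse of a public test set: purely random classifiers, yet the
# best reported score inflates by selection alone (numbers invented).
import numpy as np

rng = np.random.default_rng(1)
n_test, n_submissions = 1_000, 500
y = rng.integers(0, 2, n_test)                 # fixed public test labels

scores = [(rng.integers(0, 2, n_test) == y).mean()
          for _ in range(n_submissions)]       # each submission: a coin flip

print("true accuracy of every model: 0.50")
print(f"best 'leaderboard' score after {n_submissions} tries: {max(scores):.3f}")
```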
The statistical validity problem (Ioannidis, 2005)
Positive Predictive Value
The probability that a relationship reported as statistically significant is actually true
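Concretely, Ioannidis (2005) expresses PPV in terms of the study power 1 − β, the pre-study odds R that a probed relationship is true, the significance level α, and the fraction u of analyses that are biased:

\[
\mathrm{PPV} = \frac{(1-\beta)R}{R - \beta R + \alpha} \quad\text{(no bias)},
\qquad
\mathrm{PPV} = \frac{(1-\beta)R + u\beta R}{R + \alpha - \beta R + u - u\alpha + u\beta R} \quad\text{(with bias } u\text{)}
\]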
Table: PPV of research findings for various combinations of power (1 − β), ratio of true to not-true relationships (R), and bias (u). [Table omitted; see source.]
Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLOS Medicine 2(8): e124. https://doi.org/10.1371/journal.pmed.0020124
A Chance Encounter
Dr. Sharma and her friend meet Prof. Srivastava at a conference. He suggests a solution to their problem: a set of three AI DPGs addressing the AI quality testing trilemma.
Health information exchange consent management (HIECM)
- The architecture of the ABDM National Health Stack already has the concept of consent management built in
- Patients can provide consent tokens to registered medical entities to review their electronic health records held by other medical entities
- We propose to extend the concept of consent management from clinical settings to research settings:
  - Electronic unbundling of EHRs
  - Informed consent for medical data usage for research purposes
  - Provable differential privacy guarantees for anonymization (see the sketch below)
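As a flavor of what a "provable differential privacy guarantee" means here, a minimal sketch of an ε-DP count release via the Laplace mechanism; the function and parameters below are illustrative assumptions, not HIECM's actual implementation:

```python
# Sketch of an epsilon-DP count release via the Laplace mechanism.
# Illustrative only: function names and parameters are assumptions,
# not HIECM's actual implementation.
import numpy as np

def dp_count(values: np.ndarray, predicate, epsilon: float, rng) -> float:
    """Release a noisy count with epsilon-differential privacy.

    A counting query has sensitivity 1 (one patient changes the count
    by at most 1), so Laplace noise of scale 1/epsilon suffices.
    """
    true_count = int(predicate(values).sum())
    return true_count + rng.laplace(scale=1.0 / epsilon)

rng = np.random.default_rng(42)
ages = rng.integers(18, 90, size=1_000)        # toy patient ages
noisy = dp_count(ages, lambda a: a > 60, epsilon=0.5, rng=rng)
print(f"noisy count of patients over 60: {noisy:.1f}")
```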
Quality preserving databases (QPD)
A QPD is a management layer on top of a public research database that:
- Requires users to specify test characteristics
- Requires a manager to allocate the significance level for the test
- Requires users to compensate for dataset usage in the form of additional data samples, restoring statistical validity commensurate with that lost to their testing
In theory, a QPD:
- Can serve an infinite series of requests
- Satisfies fairness and stability requirements
- Controls statistical validity levels using alpha-investing (see the sketch below)
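A minimal sketch of the alpha-investing rule (Foster & Stine, 2008) that a QPD manager could use to ration significance levels: the manager holds "alpha-wealth", spends some on each test, and earns a payout on each rejection. The spend and payout policies below are illustrative choices, not the QPD's exact parameters.

```python
# Sketch of alpha-investing (Foster & Stine, 2008). Wealth decreases by
# alpha_j / (1 - alpha_j) on each non-rejection and grows by a payout on
# each rejection, controlling mFDR over an unbounded test stream.
class AlphaInvestor:
    def __init__(self, initial_wealth: float = 0.05, payout: float = 0.025):
        self.wealth = initial_wealth           # alpha-wealth left to spend
        self.payout = payout                   # wealth earned per rejection

    def next_level(self) -> float:
        # Illustrative policy: spend half the remaining wealth per test.
        return self.wealth / 2

    def test(self, p_value: float) -> bool:
        alpha_j = self.next_level()
        reject = p_value <= alpha_j
        if reject:
            self.wealth += self.payout         # discovery replenishes wealth
        else:
            self.wealth -= alpha_j / (1 - alpha_j)  # pay for the attempt
        return reject

inv = AlphaInvestor()
for p in [0.001, 0.20, 0.04, 0.03]:
    print(f"p={p:.3f} -> reject={inv.test(p)}, wealth={inv.wealth:.4f}")
```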
An overview of the framework…
- FedClient: supports data pre-processing and model training on private data
- FedServer: responsible for learning the global model and benchmarking it on held-out data
(a minimal sketch follows below)
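A minimal sketch of the client/server split, assuming the server combines client updates with standard federated averaging (FedAvg); the real platform layers benchmarking and pricing on top of this, and its exact protocol may differ:

```python
# Minimal FedClient/FedServer sketch assuming standard federated
# averaging (FedAvg). Raw data never leaves the clients.
import numpy as np

def fed_client_update(global_w, X, y, lr=0.1, epochs=5):
    """Local logistic-regression training on a client's private data."""
    w = global_w.copy()
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)       # full-batch gradient step
    return w

def fed_server_round(global_w, clients):
    """Aggregate client models, weighted by local sample counts."""
    sizes = [len(y) for _, y in clients]
    updates = [fed_client_update(global_w, X, y) for X, y in clients]
    return np.average(updates, axis=0, weights=sizes)

rng = np.random.default_rng(0)
true_w = np.array([1.5, -2.0])
clients = []
for _ in range(3):                              # e.g. three hospitals
    X = rng.normal(size=(200, 2))
    y = (rng.random(200) < 1 / (1 + np.exp(-X @ true_w))).astype(float)
    clients.append((X, y))

w = np.zeros(2)
for _ in range(10):                             # ten federated rounds
    w = fed_server_round(w, clients)
print("aggregated weights:", np.round(w, 2))    # moves toward true_w
```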
Upload and pre-process your data
- One can view the test data format on the server, then pre-process the private data as needed
1. Register as a Client
2. Select the Model
Define model config and request training
- Based on the model config and one's claims, the server will calculate the price of the training (in terms of data points), and one has to pay that price to start the training (a hypothetical pricing sketch follows below)
- Other registered clients can see this request and participate in the training
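The slide does not specify the pricing formula; purely as a placeholder, here is a hypothetical rule in which the price in data points grows with the number of benchmark queries requested and the share of the significance budget they spend:

```python
# Hypothetical pricing rule (placeholder: the actual formula is not
# specified here). Price in data points grows with the number of
# benchmark queries and the significance level spent.
import math

def training_price(num_queries: int, alpha_spent: float,
                   base_rate: int = 50) -> int:
    return math.ceil(base_rate * num_queries * (alpha_spent / 0.05))

print(training_price(num_queries=3, alpha_spent=0.02))  # -> 60 data points
```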
Okay... I'll start by preprocessing our data. Let's start training our heart failure prediction model
Overall Training Steps
After training…
- FedServer will update the benchmarks if the new model beats the previous benchmark
- This benchmarking is trusted and publicly available
- After benchmarking is done, data points equivalent to the training cost are collected from the client
*This image is not exactly from our framework; we are currently developing this feature
The model learned after training
- No data is shared between stakeholders
- Model performance improves
- Third-party testing
- Public benchmarks on a statistically robust private dataset
Solving the quality testing trilemma
Trilemma resolved with AI DPGs
- There is a central private test set, but it is open to testing while retaining statistical validity, using QPD
- Leveraging ABDM through HIECM provides rich, comprehensive data access across a nationwide context
- The OBP enables trustworthy testing and benchmarking, simultaneously serving regulatory and quality-assurance purposes
Key partners
Moving forward
- We are thinking about how to enable asymmetric federated learning on our platform
- Also adding model pipelines and datasets to the platform
- Lots of scope for digital forensics and authentication tools
- We're working with NHA to define a stage-wise deployment plan; inputs from other institutions are extremely welcome at this stage
- We need beta-testers willing to play with models and datasets on our platform
- We will be launching it as part of an AI for health hackathon during Techkriti@IITK; people interested in testing earlier are welcome to contact me