Thank you, Mr. Chair and members of the committee.
The purpose of my remarks is to offer an overview of de-identification. As someone who has worked in this area for close to 20 years in both academia and industry, perhaps this is where I can be helpful to the committee's study. I cannot comment on the specifics of the approach taken by Telus and PHAC because I do not have that information. My focus is on the state of the field and practice.
It's important to clarify terminology. Terms like anonymization, de-identification and aggregation are often used interchangeably, but they don't mean the same thing. It's more precise to talk about the risk of re-identification. The objective when sharing datasets for a secondary purpose, as is the case here, is to ensure that the risk of re-identification is very small.
There are strong precedents for the definition of a very small risk, which come from data releases by, for example, Health Canada, from guidance from the Ontario privacy commissioner, and from applications by European regulators and health departments in the U.S. Therefore, accepting a very small risk is typically not controversial, because we can rely on these precedents, which have worked quite well in practice.
If we said that the standard is zero risk, then all data would be considered identifiable or considered personal information. This would have many negative consequences for health research, public health, drug development and the data economy in general in Canada. In practice, a very small risk threshold is set, and the objective is to transform data to meet that threshold.
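To make the idea of a threshold concrete, here is a minimal sketch in Python. It assumes a simple model in which a record's re-identification risk is one over the number of records sharing its quasi-identifiers, and a hypothetical threshold of 0.05 (1 in 20); the fields, values and threshold are invented for illustration and do not describe any particular dataset or program.

```python
from collections import Counter

# Illustrative records: quasi-identifiers an attacker might plausibly
# know (year of birth, first three postal characters, sex).
records = [
    (1980, "K1A", "F"), (1980, "K1A", "F"), (1980, "K1A", "F"),
    (1975, "M5V", "M"), (1975, "M5V", "M"),
    (1990, "V6B", "F"),  # a group of size 1: the riskiest record
]

# How many records share each combination of quasi-identifiers.
group_sizes = Counter(records)

# Under a simple "prosecutor" risk model, a record's re-identification
# risk is 1 / (size of its group); the smallest group drives the maximum.
max_risk = 1 / min(group_sizes.values())

THRESHOLD = 0.05  # hypothetical "very small risk" threshold (1 in 20)
print(f"maximum record risk: {max_risk:.2f}")
print("meets threshold" if max_risk <= THRESHOLD else
      "needs further transformation")
```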
There are many kinds of transformations to reduce the risk of re-identification. For example, dates can be generalized, geographical locations can be reduced in granularity, and noise can be added to data values. We can create synthetic data, which is fake data that retains the patterns and statistical properties of the real data but for which there is no one-to-one mapping back to the original data. Other approaches that involve cryptographic schemes can also be used to allow secure data analysis. All that is to say there's a toolbox of privacy-enhancing technologies for sharing individual-level data responsibly, each with its own strengths and weaknesses.
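As an illustration of the first three transformations, the following Python sketch generalizes a date of birth to a year, coarsens a postal code to its forward sortation area, and adds random noise to a numeric value. The record layout and the noise scale are assumptions made for the example.

```python
import random

def generalize_date(date_iso: str) -> str:
    """Generalize a full date (YYYY-MM-DD) to just the year."""
    return date_iso[:4]

def coarsen_postal_code(postal_code: str) -> str:
    """Reduce geographic granularity: keep only the forward sortation
    area (first three characters) of a Canadian postal code."""
    return postal_code.replace(" ", "")[:3]

def add_noise(value: float, scale: float = 2.0) -> float:
    """Perturb a numeric value with Gaussian noise; the scale is a
    privacy/utility trade-off, and 2.0 is an arbitrary choice here."""
    return value + random.gauss(0, scale)

record = {"dob": "1980-06-15", "postal": "K1A 0B1", "weight_kg": 72.5}
transformed = {
    "birth_year": generalize_date(record["dob"]),
    "region": coarsen_postal_code(record["postal"]),
    "weight_kg": round(add_noise(record["weight_kg"]), 1),
}
print(transformed)  # e.g. {'birth_year': '1980', 'region': 'K1A', 'weight_kg': 73.9}
```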
Instead of sharing individual-level data, it's also possible to share summary statistics only. If done well, this carries a very small risk of re-identification. Because the amount of information in summary statistics is significantly reduced, they do not always meet an organization's needs. When they do, they can be a good option, and that's how we tend to define “aggregate data”.
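As a sketch of what releasing summary statistics can look like, the following Python example publishes only counts per region and suppresses small cells, a common disclosure-control practice; the cutoff of 5 and the data are illustrative assumptions.

```python
from collections import Counter

# Illustrative individual-level data: one region code per person.
regions = ["K1A", "K1A", "K1A", "K1A", "K1A", "K1A", "M5V", "M5V", "V6B"]

counts = Counter(regions)

MIN_CELL = 5  # illustrative small-cell suppression cutoff
released = {
    region: (count if count >= MIN_CELL else "<5")  # suppress small cells
    for region, count in counts.items()
}
print(released)  # {'K1A': 6, 'M5V': '<5', 'V6B': '<5'}
```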
In practice, for datasets that are not released to the public, additional security, privacy and contractual controls must be in place. The risk is managed by a combination of data transformations and these controls. There are models to provide assurance that the combination of data transformations and controls has a very small risk of re-identification overall.
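One common way such models are described, sketched below in Python, treats the overall risk as the product of the probability that an attack is attempted, which the security, privacy and contractual controls reduce, and the probability that an attempted attack succeeds, which the data transformations reduce. The numbers here are illustrative assumptions, not calibrated values.

```python
# Probability that an attack is attempted, reduced by the security,
# privacy and contractual controls in place (illustrative value).
p_attempt = 0.1

# Probability that an attempted attack succeeds, reduced by the data
# transformations applied (illustrative value).
p_success_given_attempt = 0.2

# Overall risk is the product of the two.
overall_risk = p_attempt * p_success_given_attempt

THRESHOLD = 0.05  # hypothetical "very small risk" threshold
print(f"overall risk: {overall_risk:.3f}")  # 0.020
print("acceptable" if overall_risk <= THRESHOLD else "not acceptable")
```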
There are other best practices for responsible reuse and sharing of data, such as transparency and ethics oversight. Transparency means informing individuals about the purposes for which their data are used and can involve an opt-out. Ethics means having some form of independent review of the data-processing purposes to ensure that they are not harmful, surprising, discriminatory, or just creepy. Especially for sensitive data, another approach is a white-hat attack on the data: Someone is commissioned to launch a re-identification attack to test the re-identification risk empirically. This can complement the other methods and provide additional assurance.
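To illustrate what a white-hat attack might measure, here is a toy Python sketch of a linkage attack: records are matched against external data the attacker is assumed to hold, and the fraction of records matching exactly one person gives an empirical re-identification rate. All names and fields are invented for illustration.

```python
# De-identified records: quasi-identifiers only (illustrative).
deidentified = [("1980", "K1A"), ("1980", "K1A"), ("1990", "V6B")]

# External data the attacker is assumed to hold, mapping
# quasi-identifiers to the people who match them (illustrative).
external = {("1980", "K1A"): ["Alice", "Bob"], ("1990", "V6B"): ["Carol"]}

# A record counts as re-identified if its quasi-identifiers match
# exactly one person in the external data.
reidentified = sum(
    1 for quasi in deidentified if len(external.get(quasi, [])) == 1
)
rate = reidentified / len(deidentified)
print(f"empirical re-identification rate: {rate:.2f}")  # 0.33
```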
All this is to say that we have good technical and governance models to enable the responsible reuse of datasets, and that multiple privacy-enhancing technologies, as mentioned above, are available to support data reuse.
Is everyone adopting these practices? No. One challenge is the lack of clear, pan-Canadian regulatory guidance or codes of practice for creating non-identifiable information that take into consideration both the enormous benefits of using and sharing data and the risks of not doing so. This, along with more clarity in the law, would reduce uncertainty, provide clear direction on what reasonable, acceptable approaches are, and enable organizations to be assessed or audited to demonstrate compliance. While there are some efforts under way, for example by the Canadian Anonymization Network, it may be some time before they produce results.