I will point out (again) that the second Netflix prize was cancelled because some enterprising individuals (without access to Netflix's database) were able to de-anonymize the database and identify actual people along with their movie rental habits. Netflix was sued, which is generally considered to be a Bad Thing.
Now, imagine that this happened with a dataset of medical records...
The Netflix data had exact dates. The IMDb database of reviews had exact dates. When two large databases both contain an obscure title and an exact date, and you can use that pair as a join key, finding a small number of hits isn't unexpected. (I'm not necessarily saying that the number of hits was small, merely that a small number of matches was all it took to prove that the linkage was possible.)
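To make the mechanics concrete, here is a minimal sketch of that kind of join, assuming both datasets can be reduced to (id, title, date) rows. The function name, the ids, the second title and the sample rows are all made up for illustration; this is not the method the researchers actually used.

```python
from collections import defaultdict

def link_on_title_and_date(released_rows, public_rows, max_candidates=1):
    """Link anonymized ids to public ids that share an exact (title, date) key.

    A key only counts as a link when it points at no more than
    `max_candidates` public ids, i.e. when the title/date pair is rare.
    """
    index = defaultdict(set)
    for public_id, title, date in public_rows:
        index[(title, date)].add(public_id)

    links = {}
    for anon_id, title, date in released_rows:
        candidates = index.get((title, date), set())
        if 0 < len(candidates) <= max_candidates:
            links.setdefault(anon_id, set()).update(candidates)
    return links

# Hypothetical sample data: one obscure title, one popular title.
released = [
    ("anon-17", "Pitfalls of being a Vegetarian in the Piranha Pool", "2005-04-26"),
    ("anon-42", "Some Blockbuster", "2005-04-26"),
]
public = [
    ("reviewer_a", "Pitfalls of being a Vegetarian in the Piranha Pool", "2005-04-26"),
    ("reviewer_b", "Some Blockbuster", "2005-04-26"),
    ("reviewer_c", "Some Blockbuster", "2005-04-26"),
]

links = link_on_title_and_date(released, public)
```

In this toy data only the obscure title produces a link; the popular title has two candidates on the same date and is left alone. A candidate link found this way is still only a guess until someone confirms it.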
I don't recall how the researchers in the Netflix case established that they had in fact made a correct linkage, unless they contacted the people by email and asked "Are you the person who rented 'Pitfalls of being a Vegetarian in the Piranha Pool' on April 26, 2005, and posted a review to IMDb on April 30th of that same year?"
Assume for the sake of argument that there's a place on the internet where large numbers of people have posted information about their medical conditions, and that it is so voluminous and detailed that it's comparable to the IMDb ratings database. I don't think there is such a thing, but maybe there is.
Without dates, the "dx-fingerprint" is very fuzzy.
That being the case, the "confirmation" becomes "Are you a 50-59 year old female who had the following ICD-9 diagnosis codes in a recent year .... [list of codes, which mean nothing to most people] ... and saw your PCP three times during the year, and a
(A) It's unknowable whether there is ONLY ONE person who fits the criteria,
(B) It takes a voluntary contribution from the probable matchee to "confirm" that match ... the person can say "No" or "Not Interested, Go Away" and the anonymity is preserved,
(C) Any "researchers" who attempted such linkages, after agreeing not to, would be potentially liable for the financial impact on the organizations running and sponsoring the contest, and likely liable under federal law for HIPAA violations.
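Point (A) can be illustrated with a rough sketch that counts how many records share the same coarse fingerprint. The field names (age_band, sex, dx_codes, pcp_visits) and the sample records are assumptions made up for this example, not the contest's actual schema; the ICD-9 codes are just illustrative values.

```python
from collections import Counter

def fingerprint(record):
    """Reduce a record to the coarse, date-free attributes discussed above."""
    return (
        record["age_band"],
        record["sex"],
        frozenset(record["dx_codes"]),
        record["pcp_visits"],
    )

def matching_population(records, target):
    """Count how many records in `records` share the target's fingerprint."""
    counts = Counter(fingerprint(r) for r in records)
    return counts[fingerprint(target)]

# Hypothetical sample: two members are indistinguishable on these fields.
sample = [
    {"age_band": "50-59", "sex": "F", "dx_codes": ["250.00", "401.9"], "pcp_visits": 3},
    {"age_band": "50-59", "sex": "F", "dx_codes": ["401.9", "250.00"], "pcp_visits": 3},
    {"age_band": "40-49", "sex": "M", "dx_codes": ["401.9"], "pcp_visits": 1},
]

matches_for_first = matching_population(sample, sample[0])
matches_for_third = matching_population(sample, sample[2])
```

Note that a count of one here only says the fingerprint is unique within the records you happen to hold. It says nothing about how many people outside the sample would give the same answers, which is exactly why a match cannot be confirmed from the data alone.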
So while I am aware of the Netflix contest precedent, there are differences that are ... arguably ... different.
Can you make a coherent argument why diagnosis data should be withheld, but procedure, pharmaceutical and lab data should be made available during a later phase of data release?
I think it's hard to make such a distinction.