The objective for participants in the HHP contest is to forecast how many days persons in the database will spend in the hospital in the next year. This outcome measure would seem to depend on how healthy the person is and on the person’s propensity for seeking medical attention for a perceived illness.
The data provided by the sponsors include age and gender but there is no information on family medical history, smoking history, dietary preferences, liquor consumption or exercise activity. Standard data from annual checkups such as blood test information (providing 40 or so measures), a list of drugs the person is taking and physical measurements such as height, weight and waist circumference are also missing.
Information on the individual’s past history of seeking medical help is also very limited. Some individuals seek medical attention when they feel a minor twinge, while others seek it only if they are incapacitated.
One would think that the measures described above would be readily available to medical practitioners. If the participants had access to this information, their forecasts would be more accurate. One of the objectives of the contest is to determine if modern predictive analytical techniques can make useful medical predictions. Given this, why have the sponsors organized a contest that handicaps the participants by not providing relevant data?
The sponsors have also mangled the data. Drug count and lab count are truncated, and length of stay (the outcome measure) has been converted to a non-linear numeric scale. These conversions degrade any estimate of the cost of future hospitalizations and discard the advantage of methodologies that are effective with outliers.
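To make the point about truncation concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the five lengths of stay are made up, the cap of 15 days and the flat cost of $2,000 per day are hypothetical, and the log transform is just one example of a non-linear conversion, not necessarily the one the sponsors applied.

```python
import math

# Made-up lengths of stay (days) for five patients; not contest data.
actual_days = [0, 1, 3, 12, 60]

CAP = 15               # hypothetical truncation threshold
COST_PER_DAY = 2000.0  # hypothetical flat daily cost in dollars


def truncate(days, cap=CAP):
    """Top-code each value at the cap, as a truncated field would."""
    return [min(d, cap) for d in days]


def log_scale(days):
    """One possible non-linear conversion: log(1 + days)."""
    return [math.log1p(d) for d in days]


def total_cost(days, cost_per_day=COST_PER_DAY):
    """Total cost under the flat daily-cost assumption."""
    return sum(days) * cost_per_day


capped_days = truncate(actual_days)
true_cost = total_cost(actual_days)
capped_cost = total_cost(capped_days)

print("capped days:", capped_days)
print("cost from actual days:", true_cost)
print("cost from capped days:", capped_cost)
print("log-scaled days:", [round(v, 2) for v in log_scale(actual_days)])
```

In this toy example the single 60-day stay accounts for most of the true cost, yet after truncation it is indistinguishable from a 15-day stay, so a cost estimate built on the capped values comes out far too low. The log scale has a similar effect on modeling: it shrinks the gap between a 12-day and a 60-day stay, so a method that predicts extreme cases well earns little credit for doing so.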
Much has been said about protecting the patients’ privacy, but research in other areas that use highly sensitive personal data, such as financial services, has been subject to a less draconian privacy stance. In that setting, the person’s name, address, phone number and social security number are removed from each record. In theory, an individual’s pattern of financial activity might be used to identify that person. However, the probability that a single record among several hundred thousand records could be linked to one of the several hundred million people in this country is extremely small.
A cost-benefit analysis would surely indicate that the improvements in healthcare and the reduction in healthcare costs that could result from more sophisticated medical data processing would outweigh, by many orders of magnitude, the negative impact of an occasional identification of a person in the database.