Anthony Goldbloom (Kaggle)'s image
Anthony Goldbloom (Kaggle)
Competition Admin
Kaggle Admin
Posts 382
Thanks 72
Joined 20 Jan '10 Email user
From Kaggle

Entrants are welcome to use other data to develop and test their algorithms and entries until 11:59:59 UTC on April 4, 2012 if the data are (i) freely available to all other Entrants and (ii) published (or linked) in this “External Data” Forum topic within one (1) week of an entry submission using the other data. Entrants may not use any data other than the Data Sets after 11:59:59 UTC on April 4, 2012 without prior approval.

 
Christopher Hefele's image Posts 83
Thanks 50
Joined 1 Jul '10 Email user
Anthony, I'm confused about your statement above. The rules seem to emphasize NEW data --- "Entrants may not use new external data in connection with the development of their entries after 11:59:59 UTC on April 4, 2012 without the prior written permission of Sponsor." In contrast, your post above seems to say we cannot use ANY external data after 4/4/2012 (that is, we have to back the external data out of our models to continue competing). Can you clarify / confirm what happens after 4/4/2012 in regards to using external data (or not)? Can we use data sources that have been declared in the forums up to that point? Or is all external data forbidden after 4/4/2012?
 
Jeremy Howard (Kaggle)'s image Posts 166
Thanks 58
Joined 13 Oct '10 Email user
From Kaggle
Anthony's post is incorrect - the rules are correct. I emailed Anthony yesterday to ask him to update his post, but he has been a little pre-occupied with other things! ;) External data is allowed after 4/4/2012, as long as it's been released before then.
 
factfiber's image Posts 4
Joined 5 Apr '11 Email user
Does "freely available" imply anything about the format? Should it be csv? Is, for instance, HCUPNet data (available via a query interface) allowed? I don't relish having to parse someone's post of HL7 EDI data....
 
Jeremy Howard (Kaggle)'s image Posts 166
Thanks 58
Joined 13 Oct '10 Email user
From Kaggle
That's a good question Shaunc - if there are external data sets that are only usable in practice if additional information is provided (lookup tables, parsing algorithms, etc) we will require all the information and data necessary to utilise the data set. It doesn't have to be CSV, but it does have to be a file in a format that can be utilised from standard programming and analysis tools without commercial libraries. If people have trouble getting files into a format that is readily sharable, they can contact us and we'll see if we can help. We certainly don't want technical problems to stop anyone from accessing data that could help get better answers!
 
Jeremy Howard (Kaggle)'s image Posts 166
Thanks 58
Joined 13 Oct '10 Email user
From Kaggle
Oh sorry forgot to answer re HCUP data. My understanding is that to download files from there requires purchase of a license (and it's not OK to use their online interface, unless you can and do save the results and share them on the forum, since the rules require external data be there). I'm not familiar with their licensing terms - if you are able to get a hold of data that you can legally share here and HPN can use for this purpose, please do. If you can't, but think any particular data set you're aware of would be really useful for HPN to purchase for this competition, please tell us the details and we'll pass it on to HPN to follow up - I expect they'll be happy to buy data for competitors if it helps get better solutions.
 
Uri Blass's image Posts 253
Thanks 4
Joined 5 Aug '10 Email user

I wonder how you can prove that people do not use other data to develop and test their algorithms.

People may use other data but deny it in practice, explaining the constants they use or the conditions they check as simple guesses. After all, even people who do not use other data guess, and a doctor with some experience may simply guess better based on her (or his) experience.

 
ATG's image
ATG
Posts 2
Joined 12 Dec '10 Email user

I am also curious to know what the answer to Uri's question would be, since we're not required to provide any model equation. Of course it can be made a requirement for the top performers at a later stage. If there is any other way of checking the use of external data, admin, can you please let us know?

 
Uri Blass's image Posts 253
Thanks 4
Joined 5 Aug '10 Email user

I do not use external data, but I am certainly guessing things by trial and error on the leaderboard because I have no way to test.

The problem is that we have no set of people with data from year 1 to year 4 so we have no way to test models of how to predict year 4 based on year 1,2,3.

We can try to predict year 3 based on year 1 and year 2 data but it is clearly a different problem than predicting year 4 based on years 1,2,3.

Data about different people(that we do not need to predict) for years 1,2,3,4 certainly can help.

 
cnie's image Posts 4
Joined 7 Apr '11 Email user

You can predict Y3 based on Y1 and Y2, and then predict Y4 based on Y2 and Y3.
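A minimal sketch of that idea, assuming each member is summarised by a single number per year (for example, days in hospital). The toy values, the function names, and the plain least-squares model are all illustrative assumptions, not anything supplied with the competition. The model is fitted on (Y1, Y2) -> Y3, where the answer is known, and then the same coefficients are applied to (Y2, Y3) to estimate Y4.

```python
import numpy as np


def fit_two_year_model(two_years_ago, last_year, target):
    """Least-squares fit of target ~ a*two_years_ago + b*last_year + c."""
    X = np.column_stack([two_years_ago, last_year, np.ones(len(last_year))])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef


def predict_two_year_model(coef, two_years_ago, last_year):
    """Apply fitted coefficients to a new pair of consecutive years."""
    X = np.column_stack([two_years_ago, last_year, np.ones(len(last_year))])
    return X @ coef


# Made-up per-member values for three known years (hypothetical data).
y1 = np.array([0.0, 2.0, 1.0, 5.0, 0.0, 3.0])
y2 = np.array([1.0, 2.0, 0.0, 4.0, 0.0, 2.0])
y3 = np.array([1.0, 3.0, 0.0, 5.0, 1.0, 2.0])

# Step 1: learn how two consecutive years relate to the following year.
coef = fit_two_year_model(y1, y2, y3)

# Step 2: shift the window forward one year to estimate the unknown Y4.
y4_pred = predict_two_year_model(coef, y2, y3)
print(y4_pred)
```

This only works if the relationship between consecutive years stays roughly stable, and it does not use Y1 when predicting Y4, so it is not the same problem as predicting Y4 from all three years, which is the limitation raised earlier in the thread.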

 
Justin Washtell's image Posts 48
Thanks 15
Joined 26 Aug '10 Email user

@Jeremy. I just ran a number of queries on HCUPnet and downloaded the results in spreadsheet format. At no point was I presented with any terms to accept and I can find no mention of limitations on the use or redistribution of the data on the site. What I could find was...

"You can purchase many HCUP databases to do more detailed analyses not possible through HCUPnet".

and

"Many of the databases that are featured in HCUPnet can be purchased through the HCUP Central Distributor or from the States. If you find that HCUPnet does not answer all your questions, or you need more sophisticated statistics, then you may wish to purchase the databases and do your own analysis... You will need to complete an application form and sign a data use agreement before purchasing your data."

This suggests that the aggregate data available through the site is entirely free to use in any way. Are you in a position to confirm that this is so and that we can use it in this contest? Despite its limitations, it looks extremely useful!

 
Justin Washtell's image Posts 48
Thanks 15
Joined 26 Aug '10 Email user

FYI, I have just submitted the following query to AHRQ, who manage HCUPnet:

"Hello. I cannot find on your websites any information concerning limitations on the use of the aggregate data that is freely available through the HCUPnet query interface. Can I ask you to clarify: Are there any limitations and, if so, where are these described? Or are any tables of figures produced by the free query interface essentially in the public domain?"

 
Jeremy Howard (Kaggle)'s image Posts 166
Thanks 58
Joined 13 Oct '10 Email user
From Kaggle

According to HCUPNet, "It is the responsibility of the user to contact and obtain the needed copyright permissions prior to reproducing materials in any form" (http://www.ahrq.gov/news/gdlcopyr.htm ). So I think we should wait for you to receive a reply from your query to HCUPNet - or directly contact the copyright owner of the data you wish to use.

 
Justin Washtell's image Posts 48
Thanks 15
Joined 26 Aug '10 Email user

That link would seem to apply to "clinical practice guidelines" only, which I do not think are anything to do with the databases. At any rate, I received this today, in direct response to the email which I copied above...


"Dear Mr. Washtell:

Thank you for your e-mail and interest in HCUPnet. Your e-mail was forwarded to the HCUP User Support inbox. Information obtained through HCUPnet is considered public information and no special permission is required to publish the statistics. We do, however, request that you source the information with the appropriate citation. Recommendations for citing are located on the HCUP-US Website at http://hcup-us.ahrq.gov/tech_assist/citations.jsp.

If you have any additional questions, please contact User Support at this address.

Sincerely,

HCUP User Support"


Can I take it from this, then, that I/we can build models using this data as long as I/we post a copy of the data (and/or sufficient information for other users to generate the exact same data through the HCUPnet interface) on here, along with the requisite citation of course?

I can forward the actual email I received to Kaggle/HPN if necessary.

 
Justin Washtell's image Posts 48
Thanks 15
Joined 26 Aug '10 Email user

Further...


Dear Mr. Washtell:

I am responding to your inquiry on behalf of Randie Siegel, AHRQ's associate director for publishing and electronic dissemination. You want to know about limitations to publishing research based on HCUPnet data. The tables produced by HCUPnet are in the public domain, but source citation is greatly appreciated and encouraged. As far as I can tell, no data use agreement is needed.

There is a page on the HCUP Web site that addresses the issue of publication requirements, “Requirements for Publishing Results with HCUP Data” (http://www.hcup-us.ahrq.gov/db/publishing.jsp). I see from the description page about HCUPnet, that HCUPnet is programmed to automatically abide by the privacy rules set down for using any of the HCUP databases. Otherwise, the publishing requirements page links to a page of suggested citations (http://www.hcup-us.ahrq.gov/tech_assist/citations.jsp), with a section on citing HCUPnet:

Citing HCUPnet:

First list HCUPnet, then HCUP, followed by the appropriate data years, and then AHRQ and the related Web link. Lastly, include the date of access. Consider the following example:
HCUPnet. Healthcare Cost and Utilization Project (HCUP). 2006-2009. Agency for Healthcare Research and Quality, Rockville, MD. http://hcupnet.ahrq.gov/ Accessed May 5, 2003.

If you still have questions about using HCUPnet data, contact HCUP User Support via email (hcup@ahrq.gov).

Sincerely,


 
José A. Guerrero's image Rank 19th
Posts 144
Thanks 21
Joined 27 Jan '11 Email user

http://www.heritagehealthprize.com/c/hhp/forums/t/1268/i-found-this-useful

 
S.U.T.'s image Posts 43
Thanks 7
Joined 5 Sep '11 Email user

"Prognostic Indices" -

Study behind paywall: http://jama.ama-assn.org/content/307/2/182.short

Graphic: http://www.eprognosis.org/ (Note: charlson index not mentioned)

NYT summary: http://well.blogs.nytimes.com/2012/01/19/why-doctors-cant-predict-how-long-a-patient-will-live/?ref=health

 

 

 
CreativSolutions's image Posts 4
Joined 7 Jan '12 Email user

I don't yet know if this data is useful, but I did submit an entry using it, and I want to make sure I can use the data in the future if I find it useful.

4 Attachments —
 
Jeremy Howard (Kaggle)'s image Posts 166
Thanks 58
Joined 13 Oct '10 Email user
From Kaggle

It should be fine to use that data, as long as it can be licensed to competitors to use on this comp, and to HPN to use with the final model. Can you please confirm where you got that data, and how it is licensed?

 
HangZ's image Rank 5th
Posts 8
Thanks 1
Joined 1 Nov '11 Email user

I am sharing an external data source that might be useful for this competition. It is attached with this post. It is the 2010 census data that is publicly available from the census.gov website http://www.census.gov/prod/cen2010/briefs/c2010br-03.pdf

I just contacted the census.gov, what I was told is as follows:

"Census data is "public domain", you do not need our permission to use it, copy it, publish it, or cite it."

1 Attachment —
 
_JeremyA's image Posts 23
Thanks 6
Joined 5 Apr '11 Email user

http://www.heritageprovidernetwork.com/?p=medical-groups

http://www.calhospitalcompare.org/

I'll probably use the data located on these webpages as inputs at some point, I assume this is considered 'external data'?

 

~jba

 
DavidChudzicki's image
DavidChudzicki
Kaggle Admin
Posts 418
Thanks 106
Joined 21 Nov '10 Email user
From Kaggle

JeremyA wrote:

http://www.heritageprovidernetwork.com/?p=medical-groups

http://www.calhospitalcompare.org/

I'll probably use the data located on these webpages as inputs at some point, I assume this is considered 'external data'?

 

~jba

 

Yes, anything other than the data sets provided with the competition is "external data."

 
G's image
G
Posts 1
Joined 18 Mar '11 Email user

In section "7. USE OF OTHER DATA" of the rules it states: "You may not, however, link the Data Sets to records in other external databases such that new demographic, socioeconomic or clinical information about the members in the Data Sets is gained."

Is a concise definition available for what exactly constitutes demographic, socioeconomic, and clinical information in the context of this sentence?

thanks

 
_JeremyA's image Posts 23
Thanks 6
Joined 5 Apr '11 Email user

G wrote:

In section "7. USE OF OTHER DATA" of the rules it states: "You may not, however, link the Data Sets to records in other external databases such that new demographic, socioeconomic or clinical information about the members in the Data Sets is gained."

Is a concise definition available for what exactly constitutes demographic, socioeconomic, and clinical information in the context of this sentence?

thanks

Do the two links I've provided fall under this rule? The avg LoS for California as well as the in-service provider info from the Heritage Health website certainly qualify as "new demographic, socioeconomic or clinical information about the members", just not for the purposes of 'Patient Identification/Privacy', which is what I thought the rule was geared towards...?

 

Thanks in Advance,

~jba

 
DavidChudzicki's image
DavidChudzicki
Kaggle Admin
Posts 418
Thanks 106
Joined 21 Nov '10 Email user
From Kaggle
G-- I'm sorry, but that's what we have. I think we'll just have to figure out how that applies on a case-by-case basis. JeremyA-- We'll need to have a look at that data and think about it with HHN. I'll be sure to give a response by Friday next week (March 16). Thanks, David
 
DavidChudzicki's image
DavidChudzicki
Kaggle Admin
Posts 418
Thanks 106
Joined 21 Nov '10 Email user
From Kaggle

JeremyA-- I'm sorry. I'm still trying to find out what HHN thinks of this. I'll be in touch again as soon as I can.

 
_JeremyA's image Posts 23
Thanks 6
Joined 5 Apr '11 Email user

Don't worry.  I anticipated it might elicit some difficulty.
And there's lots of time left in the competition.

~jba

 

 

 
Kno.e.sis's image Posts 4
Joined 28 Nov '11 Email user

We haven't submitted any prediction model yet to the competition but I will get to it some time. However, I'm planning to use some external data sources. Please let me know if I should be posting links to these datasets here.

Thanks a lot in advance!
Pramod.

 
DavidChudzicki's image
DavidChudzicki
Kaggle Admin
Posts 418
Thanks 106
Joined 21 Nov '10 Email user
From Kaggle

Yes, you should post links here.

 

7. USE OF OTHER DATA

Entrants may use data other than the Data Sets to develop and test their Prediction Algorithms and Entries provided that (i) such data are freely available to all other Entrants and (ii) the data and/or a link to the data are published in the "External Data" topic in the Forums section of the Website within one (1) week of the date on which an Entry that uses such data is submitted to the Website. Entrants may not use new external data in connection with the development of their Entries after 11:59:59 UTC on April 4, 2012 without the prior written permission of Sponsor. Any third-party service provider, consultant or contractor of Sponsor that received or receives data or other information in connection with work performed for or on behalf of Sponsor may not use such data or other information in connection with the Competition.

You may not, however, link the Data Sets to records in other external databases such that new demographic, socioeconomic or clinical information about the members in the Data Sets is gained. Sponsor reserves the right in its sole discretion to disqualify any Entrant who Sponsor discovers has undertaken or attempted to undertake such linking of the Data Sets.

 
Kno.e.sis's image Posts 4
Joined 28 Nov '11 Email user

Thanks a lot for the reply!

Here are some of the external data sets I'm planning to use:

Pubmed: http://www.ncbi.nlm.nih.gov/pubmed/ 

datasets on LODD : http://www.w3.org/wiki/HCLSIG/LODD/Data 

ICD9 data: http://www.cdc.gov/nchs/icd/icd9.htm 

Mortality statistics: http://www.cdc.gov/nchs/deaths.htm 

Disease ontology: http://do-wiki.nubic.northwestern.edu/index.php/Main_Page 

 

 

 

 

 
DavidChudzicki's image
DavidChudzicki
Kaggle Admin
Posts 418
Thanks 106
Joined 21 Nov '10 Email user
From Kaggle

JeremyA, I'm sorry -- I think we have to say not to use it.

-David

 
theafh's image Posts 1
Joined 11 Oct '11 Email user

Hi Kaggle Admins,

census.gov was already mentioned in this thread… I’m thinking about using other external data from that source with a socioeconomic dimension, like the data linked from this page: http://www.census.gov/hhes/www/income/income.html

Would it be ok to integrate that data in my models?

-theafh

 
DavidChudzicki's image
DavidChudzicki
Kaggle Admin
Posts 418
Thanks 106
Joined 21 Nov '10 Email user
From Kaggle

Hi Theafh,

I'll have to look into it and get back to you within a week, but I fear the answer will be the same as for JeremyA's question.

Thanks,

David

 
Becky's image Posts 1
Joined 22 Feb '12 Email user

Hi,

We are planning to leverage the following data and information which is free to the public:

http://www.dartmouthatlas.org

http://www.cdc.gov/nchs/data/nvsr/nvsr59/nvsr59_09.pdf

 ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Dataset_Documentation/NHDS/NHDS_2009_Documentation.pdf

 http://www.cdc.gov/nchs/data/nvsr/nvsr60/nvsr60_04.pdf

 http://www.statehealthfacts.org

http://www.cdc.gov/nchs/fastats/hospital.htm

www.ahrq.gov

Thanks!

 
Varun Mazumdar's image Posts 1
Joined 8 Jun '12 Email user

Hi Admins,

I just wanted to know what I need to do to get approval for the use of external data sets after the April 4th deadline. In addition, if I have compiled data via automated data mining from published and freely available journal articles, must I provide links to each article, or just provide the compiled dataset? It may be easier to provide the compiled dataset, as the number of articles used would be huge.

Cheers!

 
DavidChudzicki's image
DavidChudzicki
Kaggle Admin
Posts 418
Thanks 106
Joined 21 Nov '10 Email user
From Kaggle

As a general rule, external data won't be approved after the deadline.

 
David Gainer's image Posts 1
Joined 13 Jun '12 Email user

Hi,

I'm just starting with the contest. Did any external data sources get approved? It didn't look like it from this forum, but I wanted to be sure.

Thanks,

David

 
DavidChudzicki's image
DavidChudzicki
Kaggle Admin
Posts 418
Thanks 106
Joined 21 Nov '10 Email user
From Kaggle

David Gainer wrote:

I'm just starting with the contest. Did any external data sources get approved? It didn't look like it from this forum, but I wanted to be sure.

Prior to April 4, 2012 external data didn't need approval (as long as all of the conditions in the rules were satisfied). That's why you see people posting it here (without approval).

After that date, external data requires approval, which is unlikely to happen.

 
Mercicle's image Posts 1
Joined 27 Jul '12 Email user

David,

Can you provide a final list of specific external data sources that we can use? Some people have listed websites, which seems vague. If any data from any URL posted before the deadline can be used, then you can just verify this.

Thanks,

John

 
DavidChudzicki's image
DavidChudzicki
Kaggle Admin
Posts 418
Thanks 106
Joined 21 Nov '10 Email user
From Kaggle

If there are particular cases that aren't clear from questions & responses on the forum thread, can you ask about those specifically?

 
baller's image Posts 1
Joined 25 Aug '12 Email user

Hi David,

I have the same question as Mercicle and have read through the whole forum. I think it is a little disorganized as a means of declaring which external data people are using and what has been approved. I am sure I can pick through it and pull out what I think fits the bill. I do have a couple of questions:

1) If I do not see that a Kaggle admin has explicitly said not to use a posted source then it is fair game? This is assuming you all have actually checked the sources out at this point. Don't get me wrong, I will check them myself but I wanted to see if this assumption was correct.

2) I see there was a reply posted to theafh about the Census Bureau data that was never fully confirmed, and it was stated that it most likely cannot be used. This is a little confusing because the rules just say "You may not, however, link the Data Sets to records in other external databases such that new demographic, socioeconomic or clinical information about the members in the Data Sets is gained." But Census Bureau data is anonymous and should not give insight into demographic, socioeconomic or clinical information about an individual member. I would think this is to cover the privacy of any individuals in the data, but maybe you do mean it to cover people as a whole?

 

 
DavidChudzicki's image
DavidChudzicki
Kaggle Admin
Posts 418
Thanks 106
Joined 21 Nov '10 Email user
From Kaggle

(1) According to Rule 7, you don't need special permission for external data, as long as you satisfy the requirements. In some cases, we've clarified that certain external data isn't allowed. If there are particular ones you're still wondering about, feel free to ask.

(2) It's a good point, but I guess the sponsor just wanted to be totally safe.

 
K-czar's image Rank 57th
Posts 12
Joined 18 Sep '12 Email user

Becky, was all of this information you listed approved? It says posted 6 months ago (not the exact date), which is right around the deadline...so I'm not sure whether it's usable or not.  I am also new to the competition, so still figuring out how things work.  Like others, I tend to think it would be nice if someone could summarize all of the approved info that made it in before the deadline... I guess someone could try to go through and compile it, then double check with others and or the admins to verify everything is approved and nothing is missing.  I might give that a shot later.

Hi,

We are planning to leverage the following data and information which is free to the public:

http://www.dartmouthatlas.org

http://www.cdc.gov/nchs/data/nvsr/nvsr59/nvsr59_09.pdf

ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Dataset_Documentation/NHDS/NHDS_2009_Documentation.pdf

http://www.cdc.gov/nchs/data/nvsr/nvsr60/nvsr60_04.pdf

http://www.statehealthfacts.org

http://www.cdc.gov/nchs/fastats/hospital.htm

www.ahrq.gov

Thanks!

 
ADP's image
ADP
Posts 12
Thanks 1
Joined 21 Aug '11 Email user

None of my submissions to date have used external data. If another competitor has requested (or stated prior to the deadline) that they have used external data, and provided the source of the data, am I free to also use that data at my discretion?

 
Sajid Z's image Posts 4
Joined 4 Feb '12 Email user

From some of the links here, it seems that people are trying to link up publicly available provider-specific and hospital-specific information with the HPN data. I have two questions:

1) Is this legal, according to the rules? I know the rules explicitly ban trying to match up patient data

2) Can anyone share how they are matching up this data, since the provider ids are masked?

Thanks

 
