translating strings to numbers

Uri Blass's image Posts 253
Thanks 4
Joined 5 Aug '10 Email user

I hate strings, and I wonder whether there is a program that simply translates all the strings in the data to integers, so that different strings get different integers (with the program treating both 0234 and 234 as the same integer, 234).

A missing value in a column can be translated to -1 (or, if the column already contains -1, to a different number that is not in the column, with the program telling me which number means a missing value).

The program should also generate files that explain the meaning of the numbers in every column (except columns that contain only numbers).

For example, for the 6th column of claims.csv it may generate a file with the following content:

claims.csc

Anesthesiology=0,Diagnostic Imaging=1,Emergency=2,...
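The encoding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not an existing tool: the helper names (is_int, encode_column, mapping_line) are made up for this example, the sample values are not taken from the real data, and strings are numbered in sorted order so that the output matches the example line above.

```python
import re


def is_int(text):
    """Return True if text is a plain base-10 integer such as "0234" or "-5"."""
    return re.fullmatch(r"-?[0-9]+", text) is not None


def encode_column(values):
    """Encode one column of strings as integers.

    Returns (codes, mapping, missing_code). The mapping is empty for
    columns whose non-empty values are all integers.
    """
    cleaned = [v.strip() for v in values]
    present = [v for v in cleaned if v != ""]
    if all(is_int(v) for v in present):
        # Numeric column: "0234" and "234" both become 234.
        mapping = {}
        used = {int(v) for v in present}
        encode = int
    else:
        # String column: number the distinct strings in sorted order.
        mapping = {s: i for i, s in enumerate(sorted(set(present)))}
        used = set(mapping.values())
        encode = mapping.__getitem__
    # Start at -1 and step down until the code is not a real value.
    missing_code = -1
    while missing_code in used:
        missing_code -= 1
    codes = [encode(v) if v != "" else missing_code for v in cleaned]
    return codes, mapping, missing_code


def mapping_line(mapping):
    """Format a mapping as "name=code,name=code,..." for the explanation file."""
    return ",".join(f"{name}={code}" for name, code in mapping.items())


# Made-up sample columns, one list per column.
ids, _, _ = encode_column(["0234", "234", "", "17"])
print(ids)  # [234, 234, -1, 17]

codes, mapping, missing = encode_column(
    ["Emergency", "Anesthesiology", "Diagnostic Imaging", ""]
)
print(codes)                  # [2, 0, 1, -1]
print(mapping_line(mapping))  # Anesthesiology=0,Diagnostic Imaging=1,Emergency=2
```

One limitation of this sketch: in a column that mixes numbers and text, every value is treated as a string, so 0234 and 234 would get different codes there.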

I think it would be easier if people who participate in this competition did not need to deal with strings. Having to deal with strings is part of the reason that I have not made a submission in this contest so far.

 

 
inf2207's image Posts 9
Joined 28 Apr '11 Email user

What software are you using? With R you can accomplish this with just one command: x<-as.numeric(claims_y3[,11]), where claims_y3 is, of course, all claims made in Y3 (it's just an example from my code). Column 11 contains the PrimaryConditionGroups. The result will be a numeric vector with 1="", 2="AMI", 3="APPCHOL", etc., that is, a numeric (integer) value for each ConditionGroup.

Or directly claims_y3[,11]<-as.numeric(claims_y3[,11]) if you want the result in the same dataset.
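For readers who do not use R, here is a rough Python sketch of what that conversion does, assuming the column is stored as a factor (on a plain character column, as.numeric would give NA instead of codes). The function name factor_codes is made up for this illustration, and Python's sort order may differ from R's locale-dependent ordering for some strings.

```python
def factor_codes(values):
    """Return (codes, levels): 1-based codes over the sorted distinct values."""
    levels = sorted(set(values))
    index = {level: i + 1 for i, level in enumerate(levels)}
    return [index[v] for v in values], levels


# The empty string counts as a level of its own and sorts first.
codes, levels = factor_codes(["AMI", "", "APPCHOL", "AMI"])
print(levels)  # ['', 'AMI', 'APPCHOL']
print(codes)   # [2, 1, 3, 2]
```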

 

 
Uri Blass's image Posts 253
Thanks 4
Joined 5 Aug '10 Email user

I use C.

Unfortunately I know nothing about R.

Is it a computer language that is easy to learn fast?

 
trezza's image Posts 25
Thanks 3
Joined 5 Apr '11 Email user

awk can do it. As in:

 awk 'BEGIN{FS=",";OFS=","}{a=$1+0;$1=a; print $0; }' < claims.csv > claims1.csv

It strips the leading zeros from the first field of each line of claims.csv and writes the converted lines to claims1.csv. Note that a non-numeric first field, such as a column name in a header line, becomes 0.
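A comparable step in Python might look like the sketch below. It is not a drop-in replacement for the awk command: it only rewrites a first field made up entirely of digits and leaves anything else (such as a header line) untouched. The function name strip_leading_zeros and the sample rows are made up for this example.

```python
import csv
import io
import re


def strip_leading_zeros(lines):
    """Yield CSV rows with an all-digit first field rewritten without leading zeros."""
    for row in csv.reader(lines):
        if row and re.fullmatch(r"[0-9]+", row[0]):
            row[0] = str(int(row[0]))
        yield row


# In-memory stand-in for claims.csv; a real run would open the files instead.
source = io.StringIO("MemberID,Specialty\n0234,Emergency\n234,Surgery\n")
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(strip_leading_zeros(source))
result = out.getvalue()
print(result)
```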

sed can convert strings to whatever. For example

sed 's/Anesthesiology/0/g' claims.csv > claims1.csv

would change all occurrences of Anesthesiology to the number 0 and save the resulting file in claims1.csv.
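Because the sed pattern matches the text anywhere on the line, it would also rewrite a longer value that happens to contain the word. A field-by-field variant can be sketched in Python as follows; the function name replace_fields, the code table, and the value "Pediatric Anesthesiology" are all made up for illustration and are not taken from the data.

```python
import csv
import io


def replace_fields(lines, codes):
    """Yield CSV rows where fields that exactly match a key in codes are replaced."""
    for row in csv.reader(lines):
        yield [str(codes.get(field, field)) for field in row]


codes = {"Anesthesiology": 0, "Diagnostic Imaging": 1, "Emergency": 2}
source = io.StringIO("1,Emergency\n2,Anesthesiology\n3,Pediatric Anesthesiology\n")
rows = list(replace_fields(source, codes))
print(rows)
# [['1', '2'], ['2', '0'], ['3', 'Pediatric Anesthesiology']]
```

Only whole fields are replaced here, so the third row keeps its text, whereas a substring replacement would turn it into "Pediatric 0".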

I use C for more complex manipulations myself. 

Both of these programs are available on Unix/Linux variants, and through Cygwin if you are on Windows.

 
Zach's image Rank 31st
Posts 292
Thanks 64
Joined 2 Mar '11 Email user

Uri Blass wrote:

Is it a computer language that is easy to learn fast?

 

Yes.

 
inf2207's image Posts 9
Joined 28 Apr '11 Email user

Yes, it's easy to learn, and it's possible to include C code (sometimes useful, because loops are very slow in R). It is very good for "exploring" data, even if you don't write your prediction algorithm in it.

 

But if you use C, you might want to have a look at C# LINQ + MySQL / SQLite. That was recommended by a Kaggle admin in this forum.

 
Zach's image Rank 31st
Posts 292
Thanks 64
Joined 2 Mar '11 Email user

inf2207 wrote:

sometimes useful, because loops are very slow in R

Try using vector operations instead of loops. This will speed your code up a lot (and make it easier to parallelize down the road).

 
