Issue
I have three input data files. Each uses a different delimiter for the data contained therein. Data file one looks like this:
apples | bananas | oranges | grapes
Data file two looks like this:
quarter, dime, nickel, penny
Data file three looks like this:
horse cow pig chicken goat
(The change in the number of columns is also intentional.)
The thought I had was to count the number of non-alpha characters, and presume that the highest count was the separator character. However, the files with non-space separators also have spaces before and after the separators, so the spaces win on all three files. Here's my code:
def count_chars(s):
    # Count how often each candidate separator appears in s.
    valid_seps = [' ', '|', ',', ';', '\t']
    cnt = {}
    for c in s:
        if c in valid_seps:
            cnt[c] = cnt.get(c, 0) + 1
    return cnt

infile = 'pipe.txt'  # or 'comma.txt' or 'space.txt'
with open(infile, 'r') as f:
    records = f.read()
print(count_chars(records))
It will print a dictionary with the counts of all the acceptable characters. In each case, the space always wins, so I can't rely on that to tell me what the separator is.
But I can't think of a better way to do this.
Any suggestions?
Solution
If you're using Python, I'd suggest just calling re.split on each line with a character class that contains all of the separators you expect:
>>> import re
>>> l = "big long list of space separated words"
>>> re.split(r'[ ,|;"]+', l)
['big', 'long', 'list', 'of', 'space', 'separated', 'words']
The only issue would be if one of the files used a separator as part of the data.
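Applied to the three sample lines from the question, this might look like the sketch below. The helper name split_fields and the exact set of characters in the class are assumptions for illustration, taken from the valid_seps list in the question. The last call shows the caveat: a space inside a field is treated as a separator too.

```python
import re

# Candidate separators, assumed from the question's valid_seps list.
SEPARATORS = re.compile(r'[ ,|;\t]+')

def split_fields(line):
    """Split a line on any run of candidate separators, dropping empty fields."""
    return [field for field in SEPARATORS.split(line.strip()) if field]

print(split_fields("apples | bananas | oranges | grapes"))
print(split_fields("quarter, dime, nickel, penny"))
print(split_fields("horse cow pig chicken goat"))

# Caveat: a separator character inside the data splits the field.
print(split_fields("New York, Boston"))  # "New York" becomes two fields
```

Because a run of separators is matched as a single delimiter, the spaces around | and , disappear without any extra stripping.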
If you must identify the separator, your best bet is to count every candidate character except the space. If those characters hardly occur at all, the separator is probably the space; otherwise, it's whichever candidate occurs most often.
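A minimal sketch of that heuristic follows. The function name guess_separator and the min_per_line threshold are assumptions; "almost no occurrences" is interpreted here as fewer than one occurrence per non-empty line on average, which you may need to tune for your data.

```python
from collections import Counter

# Non-space candidates, assumed from the question's valid_seps list.
CANDIDATES = ['|', ',', ';', '\t']

def guess_separator(text, min_per_line=1):
    """Guess the separator: the most common non-space candidate, else a space."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    counts = Counter(c for c in text if c in CANDIDATES)
    if not counts:
        return ' '
    sep, n = counts.most_common(1)[0]
    # Too few occurrences to be a real separator (hypothetical threshold):
    # fall back to the space.
    if n < min_per_line * len(lines):
        return ' '
    return sep

print(repr(guess_separator("apples | bananas | oranges | grapes\n")))
print(repr(guess_separator("quarter, dime, nickel, penny\n")))
print(repr(guess_separator("horse cow pig chicken goat\n")))
```

Once the separator is known, splitting each line on it and calling strip() on every field removes the surrounding spaces.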
Unfortunately, there's really no way to be sure. You may have space separated data filled with commas, or you may have | separated data filled with semicolons. It may not always work.
Answered By - JoshD Answer Checked By - Timothy Miller (PHPFixing Admin)