Ubuntu: Remove quotes around integers in a csv file



Question:

In a large (>1 GB) CSV file I have something like

"34432", "name", "0", "very long description"  

but instead of that I'd like to have

34432, "name", 0, "very long description".  

I was looking at sed, but this task is beyond my abilities.

Any advice on how to achieve this?


Solution:1

Using perl:

perl -ne 's/"(\d+)"/$1/g; print' file.csv > new_file.txt  

All the work is done by s/"(\d+)"/$1/g where

  • s/patternA/patternB/ is used to replace patternA with patternB
  • then perl looks for one or more digits (\d+) surrounded by double quotes
  • the parentheses around (\d+) capture the digit(s) so they can be reused in the replacement via Perl's special variable $1
  • the trailing g makes the substitution global, so every quoted number on the line is handled (a quick demo on the sample line from the question follows below)
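
As a quick sanity check, piping the sample line from the question through the same command should produce the desired output:

$ echo '"34432", "name", "0", "very long description"' | perl -ne 's/"(\d+)"/$1/g; print'
34432, "name", 0, "very long description"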


Solution:2

A GNU sed regex that should work for this case is

sed -r 's/"([0-9]+)"/\1/g'      

Without -r (basic regular expression syntax), you need to escape the grouping parentheses and the + quantifier (note that \+ is itself a GNU extension; strictly POSIX sed would use \{1,\} instead)

sed 's/"\([0-9]\+\)"/\1/g'  

With some versions of sed, including GNU sed, you can perform the substitution in place with the -i option, e.g.

sed -ri 's/"([0-9]+)"/\1/g' file.csv  

You could also use the POSIX character class [[:digit:]] in place of the range [0-9].
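
For example, with GNU sed the [[:digit:]] version of the command would look like

sed -r 's/"([[:digit:]]+)"/\1/g' file.csv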


Solution:3

Your description of the problem is not very specific. I am assuming you want to remove the double quotes around the 1st and 3rd fields only. If so, any of these should work:

  1. sed

    sed -r 's/^"([^"]+)"(\s*,\s*[^,]+)\s*,\s*"([^"]+)"/\1\2, \3/' file.csv  

    Explanation

    The -r enables extended regular expressions, allowing us to use parentheses to capture patterns without escaping them. We match a quote at the beginning of the line (^"), followed by one or more non-quote characters ([^"]+), then the closing quote. Next comes 0 or more spaces, a comma, and 0 or more spaces again (\s*,\s*), then a stretch of non-commas up to the next comma, which is the 2nd field. Finally, we match 0 or more spaces, a comma, more optional spaces and the quoted 3rd field, and replace the whole match with the 1st captured pattern (\1), then the 2nd (\2), a comma, a space and the 3rd (\3).

  2. Perl

    perl -pe 's/^"([^"]+)"(\s*,\s*[^,]+)\s*,\s*"([^"]+)"/$1$2, $3/; ' file.csv  

    Explanation

    The -p means print every line after applying the script passed with -e. The script itself is basically the same regex as in the sed above; only here, the captured patterns are referenced as $1, $2 and $3 instead of \1, \2 and \3.

  3. awk

    awk -F, -v OFS="," '{gsub("\"","",$1);gsub("\"","",$3)}1' file.csv

    Explanation

    The -F sets the input field separator to ,. OFS is the output field separator, which is also set to , so that the lines are printed back correctly. Each gsub replaces every " with nothing; since we only run it on the 1st ($1) and 3rd ($3) fields, the quotes are removed from those fields alone. The final 1 is just awk shorthand for "print the line". (A quick check of the sed and awk variants on the sample line from the question follows this list.)
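
As a quick check (with GNU sed and awk), running the sed and awk variants on the sample line from the question should both give the desired output:

$ echo '"34432", "name", "0", "very long description"' | sed -r 's/^"([^"]+)"(\s*,\s*[^,]+)\s*,\s*"([^"]+)"/\1\2, \3/'
34432, "name", 0, "very long description"
$ echo '"34432", "name", "0", "very long description"' | awk -F, -v OFS="," '{gsub("\"","",$1);gsub("\"","",$3)}1'
34432, "name", 0, "very long description"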


Solution:4

Python solution

The small script below takes a file name as a command-line argument, iterates over each line in that file, and splits each line into a list of items using , as the separator. Each entry is then unquoted and checked for being a numeric string; if it is numeric, it is left unquoted, otherwise the quotes are put back.

#!/usr/bin/env python
import sys

with open(sys.argv[1]) as fp:
    for line in fp:
        new_vals = []
        vals = line.strip().split(',')
        for val in vals:
            val = val.strip().rstrip().replace('"','')
            if not val.isdigit():
                val = '"' + val + '"'
            new_vals.append(val)
        print(",".join(new_vals))

Test run:

$ cat input.txt
"34432", "name", "0", "very long description"
"1234", "othe name" , "42", "another description"
$ ./unquote_integers.py input.txt
34432,"name",0,"very long description"
1234,"othe name",42,"another description"

Additional notes:

It was asked in the comments why the script removes the double quotes around each item before checking whether the item is a numeric string. The main reason is that keeping the double quotes would make an item like "123" evaluate to False, i.e. non-numeric, so we need to evaluate what's inside the quotes somehow. There is an alternative approach that takes a slice of each value (val[1:-1]), but it isn't any better than using .replace() from the beginning. It does make the code shorter, but in this case the shortness of the script is irrelevant: our goal is to make the code work, not to golf it.

Here's the alternative solution using slices:

#!/usr/bin/env python
import sys

with open(sys.argv[1]) as fp:
    for line in fp:
        new_vals = []
        vals = line.strip().split(',')
        for val in vals:
            val = val.strip().rstrip() # remove extra spaces
            val = val.replace('"','') if val[1:-1].isdigit() else val
            new_vals.append(val)
        print(",".join(new_vals))
