Ubuntu: sed/awk - remove all tags except only two tags and plain texts



Question:

Here is a sample from my text file:

<w:r><w:t>  <w:r w:rsidR="00D171FD">  <w:t></w:t>  </w:r><w:r>  <w:t xml:space="preserve">  This is a sample text </w:t>  </w:r>  <w:highlight w:val="green"/>  <w:r w:rsidR="00D171FD">  <w:color w:val="FF0000"/>  <w:t>  Sample text</w:t>  </w:r>  

The problem is that I need both the pure text and the following tags only:
color w:val="FF0000"
highlight w:val="green"

How can this be done?


Solution 1:

The command line below works as long as the following expressions never appear in the plain text itself:

  1. <w:
  2. />

The command line is:

cat Myfile.txt  | grep -E "color w:val=|highlight w:val=" | sed s/"<w:"/""/g | sed s/"\/>"/""/g  

Explanation:

  • grep -E, --extended-regexp PATTERN
    Interpret PATTERN as an extended regular expression
  • | inside the quoted PATTERN of grep is a logical OR (alternation)
  • | outside the quotes is the shell pipe, which feeds the output of one command to the next
  • sed s/"<w:"/""/g substitutes globally (everywhere) "<w:" with empty string ""
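
To see the pieces working together, here is a small run of the same pipeline on input that has one tag per line. The function name extract_tags and the three input lines are only illustrative, and the two sed calls are merged into one with -e, which should behave the same way.

```shell
#!/bin/sh
# Hypothetical helper wrapping the pipeline from this answer; reads stdin.
# grep keeps only lines that mention one of the two tags,
# sed then deletes the literal strings "<w:" and "/>".
extract_tags() {
  grep -E 'color w:val=|highlight w:val=' | sed -e 's/<w://g' -e 's#/>##g'
}

printf '%s\n' \
  '<w:color w:val="FF0000"/>' \
  '<w:t>Sample text</w:t>' \
  '<w:highlight w:val="green"/>' | extract_tags
# prints:
# color w:val="FF0000"
# highlight w:val="green"
```

Note that the middle line is dropped completely, because grep selects or rejects whole lines.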

Note: the sed part can be written in many other, more compact ways. I think this form is didactic, and it can be reused in a wide range of cases where one expression has to be substituted with another.

This is the output:

highlight w:val="green"    color w:val="FF0000"    
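
If the whole document sits on a single line, as in the sample from the question, grep passes that entire line through, and sed only removes the literal <w: and /> strings from it. A minimal sketch of an alternative for that case follows. It assumes sed -E is available (it is with GNU sed on Ubuntu) and that the two wanted tags always appear in the self-closing form <w:color w:val="..."/> and <w:highlight w:val="..."/>. The function name keep_tags is hypothetical. The idea is to unwrap the two wanted tags first, then delete every tag that still has angle brackets, then tidy up the whitespace.

```shell
#!/bin/sh
# Hypothetical helper; reads stdin and writes the cleaned text to stdout.
# 1st expression: unwrap the two wanted tags, e.g. <w:color w:val="X"/> becomes color w:val="X"
# 2nd expression: delete every remaining <...> tag
# 3rd expression: squeeze runs of whitespace into a single space
# 4th and 5th:    trim one leading and one trailing space
keep_tags() {
  sed -E \
    -e 's#<w:((color|highlight) w:val="[^"]*")[[:space:]]*/>#\1#g' \
    -e 's#<[^>]*>##g' \
    -e 's#[[:space:]]+# #g' \
    -e 's#^ ##' -e 's# $##'
}

sample='<w:r><w:t>  <w:highlight w:val="green"/>  <w:color w:val="FF0000"/>  <w:t>  Sample text</w:t>  </w:r>'
printf '%s\n' "$sample" | keep_tags
# prints: highlight w:val="green" color w:val="FF0000" Sample text
```

For a file, the same function can be used as keep_tags < Myfile.txt. Like the grep version, this is plain text matching and not an XML parser, so it will misbehave if a literal < or > appears inside the text or inside an attribute value.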
