Tutorial: Tool to find duplicate sections in a text (XML) file?



Question:

I have an XML file, and I want to find nodes that have duplicate CDATA. Are there any tools that exist that can help me do this?

I'd be fine with a tool that does this generally for text documents.


Solution:1

Here is a first attempt, written in Python and using only the standard library. You can improve it in many ways (trimming leading and trailing whitespace, computing a hash of the text to reduce memory requirements, better display of the elements with their line numbers, etc.):

import sys
import xml.etree.ElementTree as ElementTree

def print_elem(element):
    return "<%s>" % element.tag

if len(sys.argv) != 2:
    print("Usage: %s filename" % sys.argv[0], file=sys.stderr)
    sys.exit(1)
filename = sys.argv[1]

tree = ElementTree.parse(filename)
root = tree.getroot()
chunks = {}
# Group every descendant element by its text content
for element in root.findall('.//*'):
    if element.text in chunks:
        chunks[element.text].append(element)
    else:
        chunks[element.text] = [element]
for text in chunks:
    if len(chunks[text]) > 1:
        print("\"%s\" is a duplicate: found in %s" %
              (text, list(map(print_elem, chunks[text]))))

If you give it this XML file:

<foo>
  <bar>Hop</bar><quiz>Gaw</quiz>
  <sub>
    <und>Hop</und>
  </sub>
</foo>

it will output:

"Hop" is a duplicate: found in ['<bar>', '<und>']  
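The improvements mentioned above (stripping whitespace, hashing the text to bound memory use) could be sketched as follows; the function name and the choice of SHA-256 from hashlib are illustrative assumptions, not part of the original answer:

```python
import hashlib
import xml.etree.ElementTree as ElementTree

def find_duplicates(xml_text):
    """Group elements by a hash of their stripped text content.

    Returns a dict mapping each duplicated text to the tags containing it.
    Hashing the (possibly large) CDATA keeps dictionary keys small; the
    full text is kept once per group only for reporting.
    """
    root = ElementTree.fromstring(xml_text)
    chunks = {}
    for element in root.iter():
        text = (element.text or "").strip()
        if not text:
            continue  # ignore empty and whitespace-only nodes
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        chunks.setdefault(key, (text, []))[1].append("<%s>" % element.tag)
    return {text: tags for text, tags in chunks.values() if len(tags) > 1}
```

Run against the sample document above, `find_duplicates(...)` would return `{"Hop": ["<bar>", "<und>"]}`.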


Solution:2

I've never heard of anything like that, but it might be an interesting task to write such a program based on a dictionary coder, as used in archivers.


Solution:3

The description of the problem is too general.

Could you, please, provide a specific example: the source XML document and the wanted result?

Cheers,

Dimitre Novatchev


Solution:4

Not easily. My first thought is XSLT, but it's hard to implement. You'd have to go through each node and then do an XPath select on every node with the same data. That would find them, but you'd also end up processing all of the nodes with the same data again later (i.e., there's no way to keep track of which node data you've already processed and ignore it). You could do it in a general-purpose programming language, but that's outside my experience.


Solution:5

You could write a simple C# app that uses LINQ to read all the nodes twice, as two separate sequences, then find all values that are equal.
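The same two-pass, compare-every-pair idea can be sketched in Python (rather than C#) with itertools.combinations; the function name is a hypothetical, and note the cost is quadratic in the number of nodes:

```python
import itertools
import xml.etree.ElementTree as ElementTree

def equal_value_pairs(xml_text):
    """Compare every pair of elements, as a two-pass self-join would,
    and return the tag pairs whose text content matches."""
    elements = list(ElementTree.fromstring(xml_text).iter())
    pairs = []
    for a, b in itertools.combinations(elements, 2):
        # Skip elements with no text; None == None would be a false match
        if a.text is not None and a.text == b.text:
            pairs.append((a.tag, b.tag))
    return pairs
```

On the sample document from Solution 1, this would report the `bar`/`und` pair.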


Solution:6

A very similar question (asked a year after this one) has some answers with very good tools for diffing chunks within the same file, including Atomiq.

