Tutorial :how to find a loop in the file system?



Question:

how to find a loop in the file system in Linux? i am indexing all files for making search fast(O(1))... i am using c programming language to implement by using the library functions in dir.h.... I can scan through entire file system but it goes in a loop if there is loop in the filesystem(example loop mount)... how to find the loop in file system.. i have seen updatedb command reporting when there is loop in file system... i don't understand the logic... can anyone pls help find solution for this?


Solution:1

I found this interesting comment here about finding loops in a DAG:

Steinar H. Gunderson wrote:

On Thu, 26 Feb 2004 00:28:32 +0100, Orlondow wrote:

...also reproduced in the Cormen-Leiserson-Rivest, IIC. Which is easiest to find.

Yes, I actually have Cormen et al, but it never struck me to look up "strongly connected components" when I wanted cycle detection. Thanks, I'll have a look at it. :-)

To find a cycle in a directed graph (you don't care which cycle) as long as there exists one, you don't need to go overboard with SCC. Plain old depth first search DFS (in the same chapter of CLRS) is enough.

So, roughly speaking, while you are traversing the directory tree, create a DAG which represents the structure of the tree with the data at the node refering to the inode of the file. Then, you just check to make sure you don't visit a node more than once.


Solution:2

The general way to prevent re-scanning nodes in a graph is to mark nodes as you pass them, and then ignore nodes that are marked. This isn't terribly practical if you don't want to modify the graph you are scanning over, so you need a way to mark the nodes externally. The easiest way I can think of to do this under linux would be to store a the device/inode for each directory you visit. Then when you look at a directory, first check that you haven't already seen any directories with the same device/inode. This not only handles cycles, but also trees that merge back into each other.

To get the device/inode number take a look at the stat/fstat functions and the st_dev and st_ino members of the stat struct.

For storing the data, you're probably wanting to look at a hash-table or a binary tree.


Solution:3

Btw. You don't need to search loop in the file system.

You are indexing whole disk. So you don't need to follow symbolic links as each file must be accessible in normal way (without symlinks). You just need to check points of mount if some disk is mounted more than one time just ignore the rest points of mount.


Solution:4

Perhaps I'm being a bit dim here but aren't the two ways you could create a cycle:

  • by creating a symbolic link
  • by mounting something twice

To deal with these, you can get a list of the mounts before you start indexing and ifnore all but the first of the same fs, and you can ignore links as you come across them in the indexing process.


Solution:5

Simple way. Just do a depth-first tree-walk of the directory tree, keeping a stack of the nodes above you as you go. At each node you visit, if that node is already in the stack, you have a cycle.

 // here's a stack of nodes  node stack[1000];    walk(node, level){      if (node in stack[0..level-1]) then there is a cycle      else          stack[level] = node          for each subnode x of node              walk(x, level+1)  }  


Solution:6

As others have said, there is no such thing as a loop in a file system if you realize that the path is part of a file name, unless its a cyclical symbolic link.

For instance, if you bootstrap some distro (lets say Debian) to a loop device, or even to a directory, and do this ON a Debian machine, you have now duplicated a lot of things.

For instance, lets say you are running Debian Lenny, and bootstrap a minimal copy of it to /lenny.

/lenny/usr/* is going to be the same as /usr/*. There is no 'cheap' way of avoiding this.

Since you're already calling a stat() on each node (I'm assuming you're using ftw() / ftw64() , you may as well:

  • Have ftw()'s callback insert the name of the node into an array, that has structure members that can store a hash of the file which is unlikely to collide. md5 isn't going to cut it for this.
  • Update a hash table based on that digest and the file name (not path).

That's in no danger of speeding up your scan, but it will significantly cut down on time it takes to search.

If you use threads correctly and set affinity, the hashing and indexing can happen on one core while the other is i/o bound (when multiple cores are available).

However, 'just' checking for duplicate mounts is not going to be a cure, plus I'm sure your program would want to return the locations of all files named 'foo', even if there are four identical copies to mention.


Solution:7

This is more typically referred to as a "cycle." So you want to implement "cycle detection." There are many ways to do this; I don't know if this is for home work or not but one simple, not necessarily the most optimal, method is through pointer chasing.


Note:If u also have question or solution just comment us below or mail us on toontricks1994@gmail.com
Previous
Next Post »