Tutorial: How to list a directory with 2 million files in Java without an "out of memory" exception



Question:

I have to deal with a directory of about 2 million XML files that need to be processed.

I've already solved the processing part, distributing the work between machines and threads using queues, and everything works fine.

But now the big problem is the bottleneck of reading the directory with the 2 million files in order to fill the queues incrementally.

I've tried using the File.listFiles() method, but it throws a java.lang.OutOfMemoryError: Java heap space. Any ideas?


Solution:1

First of all, is there any possibility of using Java 7? There you have FileVisitor and Files.walkFileTree, which should probably work within your memory constraints.

Otherwise, the only way I can think of is to use File.listFiles(FileFilter filter) with a filter that always returns false (ensuring that the full array of files is never kept in memory), but that catches the files to be processed along the way, and perhaps puts them in a producer/consumer queue or writes the file-names to disk for later traversal.
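
A minimal sketch of that idea, assuming consumer threads are already draining the queue (the class name, queue name, and capacity are illustrative, not from the question):

import java.io.File;
import java.io.FileFilter;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class EnqueueingFilter implements FileFilter {
    // Bounded, so the listing thread blocks instead of outrunning the consumers.
    private final BlockingQueue<File> workQueue = new ArrayBlockingQueue<File>(10000);

    public BlockingQueue<File> getQueue() {
        return workQueue;
    }

    @Override
    public boolean accept(File file) {
        try {
            workQueue.put(file); // hand the file straight to a consumer thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return false; // never accept, so listFiles() accumulates no big array
    }
}

You would pass an instance to dir.listFiles(filter) and simply ignore the (empty) array it returns.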

Alternatively, if you control the names of the files, or if they are named in some nice way, you could process the files in chunks using a filter that accepts filenames of the form file0000000 to file0001000, then file0001000 to file0002000, and so on.

If the names are not nice like this, you could try filtering them based on the hash code of the file name, which should be fairly evenly distributed over the set of integers.
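
A sketch of the hash-based chunking (the chunk count is arbitrary, and the caveat in the update below applies here too, since list(filter) still builds the full name array internally):

File dir = new File("/path/with/lots/of/files"); // illustrative path
final int numChunks = 100; // arbitrary number of passes
for (int i = 0; i < numChunks; i++) {
    final int chunk = i;
    String[] part = dir.list(new FilenameFilter() {
        public boolean accept(File d, String name) {
            // mask the sign bit so the modulus is never negative
            return (name.hashCode() & 0x7fffffff) % numChunks == chunk;
        }
    });
    // process this chunk before listing the next one
}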


Update: Sigh. Probably won't work. Just had a look at the listFiles implementation:

public File[] listFiles(FilenameFilter filter) {
    String ss[] = list();
    if (ss == null) return null;
    ArrayList v = new ArrayList();
    for (int i = 0 ; i < ss.length ; i++) {
        if ((filter == null) || filter.accept(this, ss[i])) {
            v.add(new File(ss[i], this));
        }
    }
    return (File[])(v.toArray(new File[v.size()]));
}

so it will probably fail at the first line anyway... Sort of disappointing. I believe your best option is to put the files in different directories.

Btw, could you give an example of a file name? Are they "guessable"? Like

for (int i = 0; i < 100000; i++)
    tryToOpen(String.format("file%05d", i));


Solution:2

If Java 7 is not an option, this hack will work (for UNIX):

Process process = Runtime.getRuntime().exec(new String[]{"ls", "-f", "/path"});
BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()));
String line;
while (null != (line = reader.readLine())) {
    if (line.startsWith("."))
        continue;
    System.out.println(line);
}

The -f parameter will speed it up (from man ls):

-f     do not sort, enable -aU, disable -lst  


Solution:3

Use File.list() instead of File.listFiles() - the String objects it returns consume less memory than the File objects, and (more importantly, depending on the location of the directory) they don't contain the full path name.

Then, construct File objects as needed when processing the result.
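
A minimal sketch of this (processFile is a hypothetical placeholder for your own handling):

File dir = new File("/path/with/lots/of/files");
String[] names = dir.list(); // plain names: no per-entry File objects, no full paths
if (names != null) {
    for (String name : names) {
        File f = new File(dir, name); // build each File lazily, one at a time
        processFile(f);               // hypothetical processing hook
        // f becomes garbage once the iteration moves on
    }
}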

However, this will not work for arbitrarily large directories either. It's an overall better idea to organize your files in a hierarchy of directories so that no single directory has more than a few thousand entries.


Solution:4

If you can use Java 7, this can be done this way and you won't have those out-of-memory problems.

Path path = FileSystems.getDefault().getPath("C:\\path\\with\\lots\\of\\files");
Files.walkFileTree(path, new FileVisitor<Path>() {
    @Override
    public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) throws IOException {
        return FileVisitResult.CONTINUE;
    }

    @Override
    public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
        // here you have the files to process
        System.out.println(file);
        return FileVisitResult.CONTINUE;
    }

    @Override
    public FileVisitResult visitFileFailed(Path file, IOException exc) throws IOException {
        return FileVisitResult.TERMINATE;
    }

    @Override
    public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
        return FileVisitResult.CONTINUE;
    }
});


Solution:5

You can do that with the Apache Commons IO FileUtils library. No memory problem; I checked with VisualVM.

Iterator<File> it = FileUtils.iterateFiles(folder, null, true);
while (it.hasNext()) {
    File fileEntry = it.next(); // no cast needed, the iterator is typed
    // process fileEntry here
}

Hope that helps.


Solution:6

Since you're on Windows, it seems you could simply use ProcessBuilder to start something like "cmd /c dir /b target_directory", capture its output, and route it into a file. You can then process that file one line at a time, reading the file names out and dealing with them.
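
A sketch of that approach (the directory and list-file names are placeholders); letting cmd do the redirection means the whole listing never sits in the Java heap:

Process p = new ProcessBuilder(
        "cmd", "/c", "dir /b target_directory > filelist.txt").start();
p.waitFor();

BufferedReader in = new BufferedReader(new FileReader("filelist.txt"));
String name;
while ((name = in.readLine()) != null) {
    // each line is one bare file name; deal with it here
}
in.close();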

Better late than never? ;)


Solution:7

As a first step you could try increasing the memory of your JVM, e.g. by passing -Xmx1024m.


Solution:8

Why do you store 2 million files in the same directory anyway? I can imagine it slows down access terribly at the OS level already.

I would definitely want to have them divided into subdirectories (e.g. by date/time of creation) before processing. But if that is not possible for some reason, could it be done during processing? E.g. move 1000 files queued for Process1 into Directory1, another 1000 files for Process2 into Directory2, and so on. Then each process/thread sees only the (limited number of) files portioned for it.
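
One way to sketch the during-processing variant, using Java 7's DirectoryStream (see the later answers); the bucket naming and size are illustrative:

Path source = Paths.get("/path/with/lots/of/files");
int perBucket = 1000; // illustrative bucket size
int count = 0;
try (DirectoryStream<Path> stream = Files.newDirectoryStream(source)) {
    for (Path file : stream) {
        if (Files.isDirectory(file)) continue; // skip the buckets themselves
        Path bucket = source.resolve("bucket" + (count / perBucket));
        Files.createDirectories(bucket);
        // DirectoryStream is weakly consistent, so moving entries
        // mid-iteration is tolerated (updates may or may not be reflected)
        Files.move(file, bucket.resolve(file.getFileName()));
        count++;
    }
}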


Solution:9

Please post the full stack trace of the OOM exception to identify where the bottleneck is, as well as a short, complete Java program showing the behaviour you see.

Most likely you are collecting all two million entries in memory, and they don't fit. Can you increase the heap space?


Solution:10

If the file names follow certain rules, you can use File.list(filter) instead of File.listFiles to get manageable portions of the file listing.
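
For example, if the names embed a counter, each call can pull one manageable slice (the prefix rule here is purely illustrative):

File dir = new File("/path/with/lots/of/files");
String[] slice = dir.list(new FilenameFilter() {
    public boolean accept(File d, String name) {
        return name.startsWith("file000"); // one slice of the naming scheme
    }
});
// repeat with the next prefix for the next slice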


Solution:11

As a first approach you might try tweaking some JVM memory settings, e.g. increasing the heap size as suggested, or even using the -XX:+AggressiveHeap option. Given the large number of files, this may not help, in which case I would suggest working around the problem: create several files with filenames in each, say 500k filenames per file, and read from them.
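
A sketch of the consuming side, assuming the list files were produced beforehand (for instance with something like ls -f | split -l 500000); the part count and file names are illustrative:

int numParts = 4; // illustrative; one list file per part
for (int part = 0; part < numParts; part++) {
    BufferedReader in = new BufferedReader(new FileReader("names-part" + part));
    String name;
    while ((name = in.readLine()) != null) {
        // only one file name is in memory at a time; queue it for a worker
    }
    in.close();
}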


Solution:12

I faced the same problem when I developed a malware-scanning application. My solution was to execute a shell command to list all the files; it's faster than recursive methods that browse folder by folder.

See more about the shell ls command here: http://adbshell.com/commands/adb-shell-ls

Process process = Runtime.getRuntime().exec("ls -R /");
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(process.getInputStream()));
String line;
while ((line = bufferedReader.readLine()) != null) {
    // With ls -R, each line is a file name, a directory header ("/dir:"), or blank;
    // filter as needed and collect the real file paths here.
}


Solution:13

This also requires Java 7, but it's simpler than the Files.walkFileTree answer if you just want to list the contents of a directory and not walk the whole tree:

Path dir = Paths.get("/some/directory");
try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
    for (Path path : stream) {
        handleFile(path.toFile());
    }
} catch (IOException e) {
    handleException(e);
}

The implementation of DirectoryStream is platform-specific and never calls File.list or anything like it, instead using the Unix or Windows system calls that iterate over a directory one entry at a time.


Solution:14

You could use listFiles with a special FilenameFilter. The first time the FilenameFilter is passed to listFiles, it accepts the first 1000 files and then records them as visited.

The next time it is passed to listFiles, it ignores the first 1000 visited files and returns the next 1000, and so on until the listing is complete.
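
A sketch of such a stateful filter (the class name and batch size are illustrative). Two caveats: the visited set itself eventually holds all 2 million names, and every listFiles call still rescans the entire directory:

import java.io.File;
import java.io.FilenameFilter;
import java.util.HashSet;
import java.util.Set;

public class BatchFilter implements FilenameFilter {
    private static final int BATCH_SIZE = 1000;
    private final Set<String> visited = new HashSet<String>();
    private int acceptedThisPass;

    public void reset() { // call before each listFiles() pass
        acceptedThisPass = 0;
    }

    @Override
    public boolean accept(File dir, String name) {
        if (acceptedThisPass >= BATCH_SIZE || !visited.add(name)) {
            return false; // batch already full, or name handed out earlier
        }
        acceptedThisPass++;
        return true;
    }
}

Call reset(), then listFiles(filter), repeatedly until the returned array comes back empty.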


Solution:15

Try this; it works for me, but I didn't have that many documents...

File dir = new File("directory");
String[] children = dir.list();
if (children == null) {
    // Either dir does not exist or is not a directory
    System.out.print("Directory doesn't exist\n");
} else {
    for (int i = 0; i < children.length; i++) {
        // Get filename of file or directory
        String filename = children[i];
    }
}
