Tutorial: fread() on 6 GB file fails



Question:

Ok, I have been reading up on fread() (which returns a size_t) and saw several posts about problems others have had with large files, but I am still having trouble. This function takes a file pointer and a long long int. The long long value comes from main, where I use another function to get the actual file size, which is 6448619520 bytes.

char *getBuffer(FILE *fptr, long long size) {
    char *bfr;
    size_t result;
    printf("size of file in allocate buffer:  %lld\n", size);
    //size here is 6448619520
    bfr = (char*) malloc(sizeof(char) * size);
    if (bfr == NULL) {
        printf("Error, malloc failed..\n");
        exit(EXIT_FAILURE);
    }
    //positions fptr to offset location which is 0 here.
    fseek(fptr, 0, SEEK_SET);
    //read the entire input file into bfr
    result = fread(bfr, sizeof(char), size, fptr);
    printf("result = %lld\n",  (long long) result);
    if(result != size)
    {
        printf("File failed to read\n");
        exit(5);
    }
    return (bfr);
}

I have tested it on files of around 1-2 GB and it works fine. However, when I test it on a 6 GB file, nothing is read into the buffer. Ignore the other output and focus on the "result =" line; the issue lies with reading the data into bfr. Here are some of the results I get.

Run #1 against a file that is 735844352 bytes (700+ MB):

root@redbox:/data/projects/C/stubs/# ./testrun -x 45004E00 -i /data/Helix2008R1.iso

Image file is /data/Helix2008R1.iso
hex string = 45004E00
Total size of file: 735844352
size of file in get buffer: 735844352
result = 735844352

** Begin parsing the command line hex value: 45004E00
Total number of bytes in hex string: 4

Results of hex string search:
Hex string 45004E00 was found at byte location: 37441
Hex string 45004E00 was found at byte location: 524768
....

Run #2 against a 6 GB file:

root@redbox:/data/projects/C/stubs/# ./testrun -x BF1B0650 -i /data/images/sixgbimage.img

Image file is /data/images/sixgbimage.img
hex string = BF1B0650
Total size of file: 6448619520
size of file in allocate buffer: 6448619520
result = 0
File failed to read

I am still not sure why it is failing with large files and not smaller ones. Is it a >4 GB issue? I am using the following:

/* Support Large File Use */
#define _LARGEFILE_SOURCE 1
#define _LARGEFILE64_SOURCE 1
#define _FILE_OFFSET_BITS   64

BTW, I am using an ubuntu 9.10 box (2.6.x kernel). tia.
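The question does not show how the file size is obtained, so the following is only a sketch of one way to do it on a POSIX system, using fseeko() and ftello(). The helper name file_size is hypothetical. Note that the large-file macros widen off_t (file offsets); they do not change the width of size_t, which is what malloc() and fread() take.

```c
/* Must appear before any #include so that off_t is 64 bits wide. */
#ifndef _FILE_OFFSET_BITS
#define _FILE_OFFSET_BITS 64
#endif

#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper (not from the original post): returns the size of
 * an open file in bytes, or -1 on error. The stream position is restored
 * before returning. */
long long file_size(FILE *fp)
{
    off_t saved, end;

    saved = ftello(fp);
    if (saved == (off_t)-1)
        return -1;
    if (fseeko(fp, 0, SEEK_END) != 0)
        return -1;
    end = ftello(fp);
    if (fseeko(fp, saved, SEEK_SET) != 0)
        return -1;
    return (long long)end;
}
```

With a 64-bit off_t this can report sizes above 4 GB even in a 32-bit process, which matches the "Total size of file: 6448619520" line in the output above.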


Solution:1

If you're just going to be reading through the file, not modifying it, I suggest using mmap(2) instead of fread(3). This should be much more efficient, though I haven't tried it on huge files. You'll need to change my very simplistic found/not found to report offsets if that is what you would rather have, but I'm not sure what you want the pointer for. :)

#define _GNU_SOURCE
#include <string.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char* argv[]) {
    char *base, *found;
    off_t len;
    struct stat sb;
    int ret;
    int fd;
    /* Compared in host byte order: on a little-endian machine this
       matches the byte sequence 00 4E 00 45 in the file. */
    unsigned int needle = 0x45004E00;

    if (argc < 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }

    ret = stat(argv[1], &sb);
    if (ret) {
        perror("stat");
        return 1;
    }

    len = sb.st_size;

    fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    base = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) {   /* mmap signals failure with MAP_FAILED, not NULL */
        perror("mmap");
        return 1;
    }

    found = memmem(base, len, &needle, sizeof(unsigned int));
    if (found)
        printf("Found %X at %p\n", needle, found);
    else
        printf("Not found\n");
    return 0;
}

Some tests:

$ ./mmap ./mmap
Found 45004E00 at 0x7f8c4c13a6c0
$ ./mmap /etc/passwd
Not found
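To report offsets instead of a single found/not found, one option is to loop over the buffer and print the position of every match. The sketch below is an illustration, not part of the original answer: the helper name report_offsets is hypothetical, and it uses memchr() and memcmp() from standard C rather than the GNU-specific memmem(). Passing the needle as an explicit byte array also avoids depending on the machine's byte order.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: prints the offset of every occurrence of needle in
 * hay and returns how many were found. Works on any in-memory buffer,
 * including one returned by mmap. */
size_t report_offsets(const unsigned char *hay, size_t hay_len,
                      const unsigned char *needle, size_t needle_len)
{
    size_t count = 0;
    size_t pos = 0;

    if (needle_len == 0 || needle_len > hay_len)
        return 0;

    while (pos + needle_len <= hay_len) {
        /* Jump to the next byte equal to the first needle byte, looking
         * only at positions where a full match could still fit. */
        const unsigned char *p = memchr(hay + pos, needle[0],
                                        hay_len - needle_len - pos + 1);
        if (p == NULL)
            break;
        pos = (size_t)(p - hay);
        if (memcmp(p, needle, needle_len) == 0) {
            printf("found at offset %zu\n", pos);
            count++;
        }
        pos++;
    }
    return count;
}
```

With the mapping from the program above, a call could look like report_offsets((const unsigned char *)base, (size_t)len, bytes, 4), where bytes holds {0x45, 0x00, 0x4E, 0x00}.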


Solution:2

If this is a 32 bit process, as you say, then size_t is 32 bit and you simply cannot store more than 4GB in your process's address space (actually, in practice, a bit less than 3GB). In this line here:

bfr = (char*) malloc(sizeof(char) * size);  

The result of the multiplication will be reduced modulo SIZE_MAX + 1, which means it'll only try and allocate around 2GB. Similarly, the same thing happens to the size parameter in this line:

result = fread(bfr, sizeof(char), size, fptr);  

If you wish to work with large files in a 32 bit process, you have to work on only a part of them at a time (eg. read the first 100 MB, process that, read the next 100 MB, ...). You can't read the entire file in one go - there just isn't enough memory available to your process to do that.
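The "around 2GB" figure can be checked with a few lines of C. This sketch simulates a 32-bit size_t with uint32_t so that it behaves the same on any platform; the function names are made up for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Simulates converting a byte count to a 32-bit size_t: the value is
 * reduced modulo 2^32 (that is, SIZE_MAX + 1 when size_t is 32 bits). */
uint32_t as_32bit_size(long long size)
{
    return (uint32_t)size;
}

/* Prints the requested size next to what a 32-bit size_t would hold. */
void show_truncation(long long size)
{
    printf("requested %lld bytes, 32-bit size_t holds %lu\n",
           size, (unsigned long)as_32bit_size(size));
}
```

For the file in the question, 6448619520 - 4294967296 = 2153652224, so both malloc() and fread() would be asked for roughly 2 GB rather than 6 GB. The 735844352-byte file fits in 32 bits and passes through unchanged.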


Solution:3

When fread fails, it sets errno to indicate why it failed. What is the value of errno after the call to fread that returns zero?
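A minimal sketch of how that check could look. The wrapper name read_and_report is hypothetical. It relies on POSIX behaviour for errno: the C standard itself only guarantees that the stream's error indicator is set, which is why the code tests ferror() and feof() first.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical wrapper around fread that explains a short read.
 * Returns the number of bytes actually read. */
size_t read_and_report(void *buf, size_t want, FILE *fp)
{
    size_t got;

    errno = 0;
    got = fread(buf, 1, want, fp);
    if (got < want) {
        if (ferror(fp))
            fprintf(stderr, "fread error after %zu bytes: %s\n",
                    got, strerror(errno));
        else if (feof(fp))
            fprintf(stderr, "end of file after %zu of %zu bytes\n",
                    got, want);
    }
    return got;
}
```

If neither indicator is set and the result is still 0, the requested count itself is worth printing, since a count of 0 also makes fread() return 0.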

Update: Are you required to read the entire file in one fell swoop? What happens if you read in the file, say, 512MB at a time?

According to your comment above, you are using a 32-bit OS. In that case, you will be unable to handle 6 GB at a time (for one, size_t won't be able to hold that large of a number). You should, however, be able to read in and process the file in smaller chunks.

I would argue that reading a 6GB file into memory is probably not the best solution to your problem even on a 64-bit OS. What exactly are you trying to accomplish that is requiring you to buffer a 6GB file? There's probably a better way to approach the problem.


Solution:4

Have you verified that malloc and fread are actually taking in the right type of parameters? You may want to compile with the -Wall option (plus -Wconversion, which flags implicit narrowing conversions) and check if your 64-bit values are actually being truncated. In this case, malloc won't report an error but would end up allocating far less than what you had asked for.
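Besides compiler warnings, the truncation can be caught at run time by comparing the requested size against SIZE_MAX before converting it. This is only a sketch; alloc_checked is a hypothetical name, not something from the original code.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocates `size` bytes only if the value survives
 * the conversion to size_t unchanged; returns NULL otherwise. */
void *alloc_checked(long long size)
{
    if (size <= 0)
        return NULL;
    if ((unsigned long long)size > SIZE_MAX)
        return NULL;   /* would be silently truncated on this platform */
    return malloc((size_t)size);
}
```

On a platform with a 32-bit size_t, a request for 6448619520 bytes is rejected here instead of quietly becoming a smaller allocation. On a 64-bit platform the request goes through to malloc(), which may still fail for lack of memory.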


Solution:5

After taking everyone's advice, I broke the 6GB file up into 4K chunks, parsed the hex bytes, and was able to get the byte locations, which will help me later when I pull out the MBR from a VMFS partition that has been dd imaged. Here is the quick and dirty way of reading it chunk by chunk:

#define DEFAULT_BLOCKSIZE 4096
...

while((bytes_read = fread(chunk, sizeof(unsigned char), sizeof(chunk), fptr)) > 0) {
    chunkptr = chunk;
    for(z = 0; z < bytes_read; z++) {
        if (*chunkptr == pattern_buffer[current_search]) {
            current_search++;
            if (current_search > (counter - 1)) {
                current_search = 0;
                printf("Hex string %s was found at starting byte location:  %lld\n",
                       hexstring, (long long int) (offsetctr-1));
                matches++;
            }
        } else {
            current_search = 0;
        }
        chunkptr++;
        //printf("[%lld]: %02X\n", offsetctr, chunk[z] & 0xff);
        offsetctr++;
    }
    master_counter += bytes_read;
}

...

and here were the results I got...

root@redbox:~/workspace/bytelocator/Debug# ./bytelocator -x BF1B0650 -i /data/images/sixgbimage.img

Total size of /data/images/sixgbimage.img file:  6448619520 bytes
Parsing the hex string now: BF1B0650

Hex string BF1B0650 was found at starting byte location:  18
Hex string BF1B0650 was found at starting byte location:  193885738
Hex string BF1B0650 was found at starting byte location:  194514442
Hex string BF1B0650 was found at starting byte location:  525033370
Hex string BF1B0650 was found at starting byte location:  1696715251
Hex string BF1B0650 was found at starting byte location:  1774337550
Hex string BF1B0650 was found at starting byte location:  2758859834
Hex string BF1B0650 was found at starting byte location:  3484416018
Hex string BF1B0650 was found at starting byte location:  3909721614
Hex string BF1B0650 was found at starting byte location:  3999533674
Hex string BF1B0650 was found at starting byte location:  4018701866
Hex string BF1B0650 was found at starting byte location:  4077977098
Hex string BF1B0650 was found at starting byte location:  4098838010

Quick stats:
================
Number of bytes that have been read:  6448619520
Number of signature matches found:  13
Total number of bytes in hex string:  4
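One caveat with the byte-by-byte loop above: on a mismatch it resets current_search to 0 without re-testing the current byte, so an input such as BF BF 1B 06 50 would be missed. The following is a sketch of an alternative, not the poster's code. It keeps the 4096-byte block size but carries the last few bytes of each chunk over to the next one, then compares with memcmp(), so matches that straddle a chunk boundary are still found. The names search_stream and CHUNK_SIZE are hypothetical.

```c
#include <stdio.h>
#include <string.h>

#define CHUNK_SIZE 4096  /* same block size as DEFAULT_BLOCKSIZE above */

/* Hypothetical helper: prints the file offset of every occurrence of pat
 * (plen bytes) in fp and returns the number of matches, or -1 if plen is
 * unusable. Reads the stream from its current position to the end. */
long long search_stream(FILE *fp, const unsigned char *pat, size_t plen)
{
    unsigned char buf[CHUNK_SIZE * 2];
    size_t carry = 0;        /* bytes kept from the previous chunk */
    long long base = 0;      /* offset of buf[0] relative to the start */
    long long matches = 0;
    size_t got;

    if (plen == 0 || plen > CHUNK_SIZE)
        return -1;

    while ((got = fread(buf + carry, 1, CHUNK_SIZE, fp)) > 0) {
        size_t avail = carry + got;
        size_t i;

        /* Test every position where a full pattern still fits. */
        for (i = 0; i + plen <= avail; i++) {
            if (memcmp(buf + i, pat, plen) == 0) {
                printf("match at offset %lld\n", base + (long long)i);
                matches++;
            }
        }

        /* Keep the last plen-1 bytes: a match may start there and
         * finish in the next chunk. */
        carry = (avail >= plen - 1) ? plen - 1 : avail;
        memmove(buf, buf + avail - carry, carry);
        base += (long long)(avail - carry);
    }
    return matches;
}
```

Because only one small buffer is live at a time, this approach works the same way in a 32-bit process regardless of the file size, as long as the offsets are kept in a 64-bit type.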
