Ubuntu: Where in the Linux file system can I see the files of Hadoop HDFS?



Question:

I am a data analyst from a non-CS background (not a hardcore systems programmer) working on Linux. While doing analysis with Hadoop, a question came to mind: / is the root under which all files on the system exist in a hierarchy. In a Hadoop environment there is a special file system called HDFS, which is meant to store the huge files processed by Hadoop's programming frameworks.

hadoop fs -put localfile.txt

Even so, such files should be accessible somewhere under /. So where can I see them using Linux commands such as cat, less, or more, without prefixing hadoop fs?

And if I unfortunately hit an error in the Hadoop/HDFS environment, how can I access my data, which is still residing on my Linux machine?


Solution:1

You cannot browse HDFS directly from the terminal using cat or similar commands. HDFS is a logical file system and does not map directly onto the Unix file system. You need an HDFS client, and your Hadoop cluster must be running. When you browse HDFS, you get the directory structure from the namenode and the actual data from the datanodes.

Although you cannot browse it directly, the data is there, stored by the datanode daemon. Its path is specified by the dfs.data.dir property in hdfs-site.xml (named dfs.datanode.data.dir in Hadoop 2 and later).

The directory structure is stored by the namenode daemon, and its path is specified by the dfs.name.dir property in hdfs-site.xml (named dfs.namenode.name.dir in Hadoop 2 and later).
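
To check which directories a given node actually uses, one option is to read the property straight out of hdfs-site.xml. The following is a minimal sketch, not an official Hadoop tool: the function name get_hadoop_property is made up for illustration, and it assumes the usual <name>/<value> layout of Hadoop configuration files, with no XML comments and no line breaks inside the tags.

```shell
# Print the <value> of a named property from a Hadoop XML config file.
# Usage: get_hadoop_property /path/to/hdfs-site.xml dfs.data.dir
# Hypothetical helper; assumes <name>...</name> and <value>...</value>
# each fit on a single line.
get_hadoop_property() {
    awk -v name="$2" '
        # Remember the most recent property name seen.
        /<name>/ {
            n = $0
            sub(/.*<name>[ \t]*/, "", n)
            sub(/[ \t]*<\/name>.*/, "", n)
            current = n
        }
        # Print the value that belongs to the requested name, then stop.
        /<value>/ && current == name {
            v = $0
            sub(/.*<value>[ \t]*/, "", v)
            sub(/[ \t]*<\/value>.*/, "", v)
            print v
            exit
        }
    ' "$1"
}
```

For example, assuming HADOOP_CONF_DIR points at your configuration directory, get_hadoop_property "$HADOOP_CONF_DIR/hdfs-site.xml" dfs.data.dir prints the configured datanode directory if the property is set in that file. On a machine with the Hadoop client installed, hdfs getconf -confKey dfs.datanode.data.dir should report the effective value, including defaults.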


Solution:2

Hadoop stores its data locally as blocks on each datanode. The location is configurable through the dfs.data.dir property in hdfs-site.xml.

In many setups it is something like

$HADOOP_HOME/data/dfs/data/hadoop-${user.name}/current  
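
Once you know the data directory, the block files inside it are ordinary Linux files. As a rough sketch (the helper name list_block_files is hypothetical, and the blk_ prefix with .meta checksum companions follows the naming seen on typical datanodes), they can be listed like this:

```shell
# List HDFS block data files under a datanode data directory,
# skipping the .meta checksum files stored alongside them.
# Usage: list_block_files /path/to/dfs/data
list_block_files() {
    find "$1" -type f -name 'blk_*' ! -name '*.meta' | sort
}
```

The names alone do not tell you which HDFS file a block belongs to; that mapping is kept by the namenode.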


Solution:3

You can list all the files managed by Hadoop with the following command:

hdfs dfs -ls /

Run from a Linux terminal, this lists the HDFS / directory. In each line of output, the first column shows the file's permissions, followed by the replication factor, the owner, the group, the size in bytes, the modification date and time, and finally the path of the file.
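
If you want to pick individual columns out of that listing in a script, awk is usually enough. This is only a sketch: the function name ls_owner_and_path is invented here, it assumes the usual layout of hdfs dfs -ls output (permissions, replication, owner, group, size, date, time, path), and it will mis-split paths that contain spaces.

```shell
# Read `hdfs dfs -ls` output on stdin and print "owner<TAB>path" per entry.
# Hypothetical helper for illustration.
ls_owner_and_path() {
    # Entries have 8 whitespace-separated fields; the "Found N items"
    # header line has fewer and is skipped.
    awk 'NF >= 8 { printf "%s\t%s\n", $3, $8 }'
}
```

A typical call would be hdfs dfs -ls / | ls_owner_and_path.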


Solution:4

In fact, you can cat the contents of your file using:

hdfs dfs -cat /user/test/somefile.txt  

In Hadoop, the namenode holds all the information about files: file names, metadata, directories, permissions, the blocks that make up each file, and the block locations. If the namenode fails and its metadata is lost, you effectively lose the files, because you no longer know which blocks form which file, even though all the content is still on the datanodes.

Since files are stored as blocks in Hadoop, if you know the block IDs and the datanodes that hold them, you can view their content directly. Here we are assuming the files are text files.

Finally, HDFS can be exported through an NFS gateway, so an HDFS directory can be mounted like a local NFS share. This way you can access HDFS without using any HDFS-specific commands.


Solution:5

You can use the hdfs fsck utility to find the name of the block, and then locate it manually in the local filesystem:

$ echo "Hello world" >> test.txt
$ hdfs dfs -put test.txt /tmp/
$ hdfs fsck /tmp/test.txt -files -blocks
/tmp/test.txt 12 bytes, 1 block(s):  OK
0. BP-1186293916-10.25.5.169-1427746975858:blk_1075191146_1451047 len=12 repl=1

Note the blk_.... string. Use that to locate the file:

$ find /hadoop/hdfs/data/current/BP-1186293916-10.25.5.169-1427746975858/current/finalized -name 'blk_1075191146*'
/hadoop/hdfs/data/current/BP-1186293916-10.25.5.169-1427746975858/current/finalized/subdir22/subdir29/blk_1075191146_1451047.meta
/hadoop/hdfs/data/current/BP-1186293916-10.25.5.169-1427746975858/current/finalized/subdir22/subdir29/blk_1075191146

$ cat /hadoop/hdfs/data/current/BP-1186293916-10.25.5.169-1427746975858/current/finalized/subdir22/subdir29/blk_1075191146
Hello world
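
The lookup can also be scripted. As a hedged sketch, the helper below (extract_block_ids is a name invented for this example) pulls the block IDs out of fsck output so they can be fed to find. It assumes block names of the form blk_<id>_<generation stamp>, as in the fsck output shown earlier in this answer.

```shell
# Read `hdfs fsck <path> -files -blocks` output on stdin and print each
# block ID (blk_<id>) once, with the trailing generation stamp removed.
# Hypothetical helper for illustration.
extract_block_ids() {
    grep -oE 'blk_-?[0-9]+_[0-9]+' | sed 's/_[0-9]*$//' | sort -u
}
```

For the file in this answer, hdfs fsck /tmp/test.txt -files -blocks | extract_block_ids would print blk_1075191146, and each ID can then be passed to find with -name "<id>*" under the datanode data directory.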


