Ubuntu: Wget from webpage in html to many text files



Question:

To download all the page one level below superSite.com I do:

wget -r -l1 http://superSite.com

But these pages are saved in .html format. How can I save them in .txt format instead? (I need to parse some of the numerical content of these pages, so I don't mind losing the banners/images.)


Solution:1

If you want to parse your downloaded HTML files, you can filter them through something like html2text (you have to install the 'html2text' package first).

This might be helpful if you want to strip the formatting from the .html documents; note, though, that parsing the original .html files or the converted .txt files is essentially the same amount of work.
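The conversion can be scripted over everything wget downloaded. A minimal sketch, assuming the download landed in a directory named superSite.com (wget -r creates a directory named after the host) and preferring html2text when it is installed, with a crude sed tag-stripper as a fallback:

```shell
#!/bin/sh
# Convert every downloaded .html file to a .txt sibling.
# "superSite.com" is the directory created by wget -r (assumption).
find superSite.com -name '*.html' | while read -r f; do
  out="${f%.html}.txt"
  if command -v html2text >/dev/null 2>&1; then
    # Preferred: html2text renders the markup as readable plain text.
    html2text "$f" > "$out"
  else
    # Fallback: strip tags with sed (loses entities and layout).
    sed 's/<[^>]*>//g' "$f" > "$out"
  fi
done
```

The sed fallback is deliberately rough; for real pages html2text handles entities, tables, and line wrapping far better.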


Solution:2

.html files are text files. The file extension makes absolutely no difference to the content. An HTML file is simply HTML markup stored as plain text, which the browser then parses to render what the HTML describes.

If you want to view it as text, open the HTML file in a dedicated text editor. Or, from your file browser, select "Open as", "Open with", or similar and choose an editor.
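Since the goal is to parse numerical content, you may not need a conversion step at all: grep can pull numbers straight out of the saved .html files. A sketch, assuming the wget run created a superSite.com directory; the number pattern here is a generic placeholder that you would adjust to match the figures you actually need:

```shell
# Extract every integer or decimal number from the downloaded pages.
# -r recurse, -h omit filenames, -o print only the matched text,
# -E extended regex. The pattern is an assumption; tailor it.
grep -rhoE '[0-9]+([.][0-9]+)?' superSite.com/
```

This treats the HTML purely as text, which is exactly the point of Solution 2.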

