Tutorial :I need to parse an HTML formatted country list into SQL inserts. Is there an easier way to do this?



Question:

There is about 2000 lines of this, so manually would probably take more work than to figure out a way to do ths programatically. It only needs to work once so I'm not concerned with performance or anything.

<tr><td>Canada (CA)</td><td>Alberta (AB)</td></tr>  <tr><td>Canada (CA)</td><td>British Columbia (BC)</td></tr>  <tr><td>Canada (CA)</td><td>Manitoba (MB)</td></tr>  

Basically its formatted like this, and I need to divide it into 4 parts, Country Name, Country Abbreviation, Division Name and Division Abbreviation.

In keeping with my complete lack of efficiency I was planning just to do a string.Replace on the HTML tags after I broke them up and then just finding the index of the opening brackets and grabbing the space delimited strings that are remaining. Then I realized I have no way of keeping track of which is the country and which is the division, as well as figuring out how to group them by country.

So is there a better way to do this? Or better yet, an easier way to populate a database with Country and Provinces/States? I looked around SO and the only readily available databases I can find dont provide the full name of the countries or the provinces/states or use IPs instead of geographic names.


Solution:1

  1. Paste it into a spreadsheet. Some spreadsheets will parse the HTML table for you.

  2. Save it as a .CSV file and process it that way. Or. Add a column to the spreadsheet that says something like the following:

    ="INSERT INTO COUNTRY(CODE,NAME) VALUES=('" & A1 & "','" & B1 & "');"

Then you have a column of INSERT statements that you can cut, paste and execute.


Edit

Be sure to include the <table> tag when pasting into a spreadsheet.

<table><tr><th>country</th><th>name></th></tr>  <tr><td>Canada (CA)</td><td>Alberta (AB)</td></tr>  <tr><td>Canada (CA)</td><td>British Columbia (BC)</td></tr>  <tr><td>Canada (CA)</td><td>Manitoba (MB)</td></tr>  </table>  

Processing a CSV file requires almost no parsing. It's got quotes and commas. Much easier to live with than XML/HTML.


Solution:2

/<tr><td>([^\s]+)\s\(([^\)])\)<\/td><td>([^\s]+)\s\(([^\)])\)<\/td><\/tr>/  

Then you should have 4 captures with the 4 pieces of data from any PCRE engine :)

Alternatively, something like http://jacksleight.com/assets/blog/really-shiny/scripts/table-extractor.txt provides more completeness.


Solution:3

Sounds like a problem easily solved by a Regex.


Solution:4

I recently learned that if you open a url from Excel it will try and parse out the table data.


Solution:5

If you are able to see this table in the browser (Internet explorer), you can select the entire table, right click & "Export to Microsoft Excel"

That should help you get data into separate columns, I guess.


Solution:6

do you have to do this programatically? If not, may i suggest just copying and pasting the table (from the browser) onto MS Excel and then clearing all formats? This way tou get a nice table that can then be imported into your database without problem.

just a suggestion... hth


Solution:7

An assembly exists for .Net called System.Xml; you can just reference the assembly and convert your HTML document to a System.Xml.XmlDocument, you can easily pinpoint the HTML node that contains your required data, and use the use the children nodes to add into your data. This requires little string parsing on your part.


Solution:8

Load the HTML data as XElements, use LINQ to grab the values you need, and then INSERT.


Solution:9

Blowing my own trumpet here but my FOSS tool CSVfix will do it with a combination of the read_xml and sql_insert commands.


Note:If u also have question or solution just comment us below or mail us on toontricks1994@gmail.com
Previous
Next Post »