If you have spent any time doing web development, or updating a blog or website through some sort of Content Management System, you have likely come across the problem of converting MS Word files into HTML code. It seems like it should be a simple operation, since Word does include a “Save As Webpage…” option, but if you take a look at the HTML it generates you will be disappointed to see what a mess it is.

Cleaning up Word-junked content before using it online is very important for code compliance and decent, consistent display. Sure, the simplest way to strip out Word garbage is to copy and paste the text from Word into a basic text editor, then copy and paste it from the text editor into your email, blog, or CMS interface. The only problem is that this strips out ALL formatting, which you will then need to painstakingly recreate for your online publishing. If you have long formatted documents, this quickly becomes tedious and error-prone.

The other option is to seek out a “cleaning” or conversion utility, which either takes a regular Word Doc and converts it to compliant HTML, or takes a “Save As Webpage…” Word-generated HTML file and strips out the Word-only HTML crap. In general, I have found that these tools do a decent job of generating clean code that still includes the basic formatting tags that are necessary for proper display.
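To make “Word-only HTML” a bit more concrete, here is a minimal Python sketch of the kind of stripping these utilities do. It is not how any of the tools below actually work; the function name `strip_word_markup` is my own, and the regular expressions assume the junk looks like typical “Save As Webpage…” output (conditional comments, `<o:p>` tags, `class=Mso...` attributes, inline styles, and `<span>` wrappers). Regexes are a blunt instrument for HTML, so treat this as an illustration rather than a production cleaner.

```python
import re


def strip_word_markup(html: str) -> str:
    """Remove common Word-only markup while keeping basic formatting tags."""
    # Drop conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    html = re.sub(r"<!--\[if.*?<!\[endif\]-->", "", html, flags=re.S | re.I)
    # Drop any remaining ordinary HTML comments
    html = re.sub(r"<!--.*?-->", "", html, flags=re.S)
    # Drop namespaced Office tags such as <o:p> and </o:p>
    html = re.sub(r"</?\w+:\w+[^>]*>", "", html)
    # Drop class="Mso..." attributes, quoted or unquoted
    html = re.sub(r"\s+class=(\"Mso[^\"]*\"|'Mso[^']*'|Mso\w+)", "", html, flags=re.I)
    # Drop quoted inline style attributes
    html = re.sub(r"\s+style=(\"[^\"]*\"|'[^']*')", "", html, flags=re.I)
    # Unwrap <span> tags, which carry nothing once their attributes are gone
    html = re.sub(r"</?span[^>]*>", "", html, flags=re.I)
    return html


sample = (
    "<p class=MsoNormal style='margin-left:.5in'>"
    "<b><span style='color:red'>Hello</span></b><o:p></o:p></p>"
)
print(strip_word_markup(sample))  # <p><b>Hello</b></p>
```

Notice that the `<p>` and `<b>` tags survive, which is exactly the “basic formatting” you want to keep.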

As a web developer who has been dealing with this issue for over a decade, I have certainly tried many solutions and have yet to find my “holy grail”. The main problem I have found with conversion/cleanup programs is that they aren’t smart enough to convert Word-styled bulleted lists into properly formatted <ul>/<li> code. Believe me, the utility that can do THAT will be the winner in my book.
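For the curious, here is a rough Python sketch of what that list conversion could look like. It rests on an assumption about the input: that Word saves each bullet as a `<p>` whose style attribute contains `mso-list:`, with the bullet glyph wrapped in `<![if !supportLists]>...<![endif]>`. That matches the Word-generated HTML I have seen, but your version of Word may differ. The name `word_lists_to_ul` is hypothetical, and the sketch only handles flat bulleted lists; it ignores nesting levels and cannot tell numbered lists from bulleted ones.

```python
import re

# One Word "list paragraph": a <p> whose attributes mention mso-list
LIST_PARA = re.compile(r"<p\b[^>]*mso-list:[^>]*>(.*?)</p>", re.S | re.I)
# A run of consecutive list paragraphs separated only by whitespace
LIST_RUN = re.compile(r"(?:<p\b[^>]*mso-list:[^>]*>.*?</p>\s*)+", re.S | re.I)
# The fake bullet glyph that Word wraps in a conditional block
BULLET = re.compile(r"<!\[if !supportLists\]>.*?<!\[endif\]>", re.S | re.I)


def word_lists_to_ul(html: str) -> str:
    """Turn each run of Word list paragraphs into a single <ul> of <li> items."""

    def convert_run(run: re.Match) -> str:
        items = []
        for body in LIST_PARA.findall(run.group(0)):
            body = BULLET.sub("", body)  # remove the fake bullet glyph
            items.append("<li>" + body.strip() + "</li>")
        return "<ul>\n" + "\n".join(items) + "\n</ul>\n"

    return LIST_RUN.sub(convert_run, html)


sample = (
    "<p class=MsoNormal>Intro</p>\n"
    "<p class=MsoListParagraph style='mso-list:l0 level1 lfo1'>"
    "<![if !supportLists]><span>&#183;</span><![endif]>Apples</p>\n"
    "<p class=MsoListParagraph style='mso-list:l0 level1 lfo1'>"
    "<![if !supportLists]><span>&#183;</span><![endif]>Pears</p>\n"
    "<p class=MsoNormal>Outro</p>"
)
print(word_lists_to_ul(sample))
```

The two list paragraphs come out as one `<ul>` with two `<li>` items, while the surrounding paragraphs are left alone. You would run this before any step that strips style attributes, since the `mso-list` marker is the only thing identifying the list items.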

So, here are a handful of options for your Word-to-HTML projects.

Online Utilities

Recommended

Textism.com Word HTML Cleaner
http://www.textism.com/wordcleaner/
COST: Word files up to 20 KB are free; larger files require an inexpensive subscription (€5 - €20)
HOW-TO: Save a Word document ‘as Web Page…’ to your hard drive, then upload to the website
NOTES: Does a good job, but doesn't fix converted lists.

WordOff
http://wordoff.org/
COST: free
HOW-TO: Save a Word document ‘as Web Page…’ to your hard drive, then open it in Notepad and copy & paste the HTML into the form on the website
NOTES: Does a good job, but doesn't fix converted lists.

Not Recommended

HTML Tidy Online
http://infohound.net/tidy/
COST: free
HOW-TO: Save a Word document ‘as Web Page…’ to your hard drive, then upload it to the website, or paste in some HTML from the saved Word doc
NOTES: For the "Tidy Settings" check "Clean" and "Word 2000" for best results. Doesn't remove Word styles (class="MsoBodyText", etc.) and doesn't fix converted lists.

Microsoft Word 2000 HTML Mess Cleaner
http://www.algotech.dk/word-html-cleaner-input.htm
COST: free
HOW-TO: Save a Word document ‘as Web Page…’ to your hard drive, then open it in Notepad and copy & paste the HTML into the form on the website
NOTES: Converts paragraphs using <BR> tags, which isn't ideal.

Desktop Installed Programs

Somewhat Recommended

Firefox Add-on: Html Validator
https://addons.mozilla.org/en-US/firefox/addon/249
COST: free
HOW-TO: Save a Word document ‘as Web Page…’ to your hard drive, then open it with Firefox. Go to Edit > View Source..., click the "Clean up this page..." button
NOTES: Requires that you have the Firefox web browser installed. Doesn't remove Word styles (class="MsoBodyText", etc.) and doesn't fix converted lists.

Zapadoo Word Cleaner
http://www.zapadoo.com/wordcleaner/
COST: $99
HOW-TO: Drag-n-Drop or open Word Docs into the program, choose the appropriate conversion template and click a button
NOTES: Can convert many documents at once and is very full-featured, including the ability to customize your own "templates" for cleaning, though I was disappointed that the included templates don’t handle lists the way I want. I haven’t been able to configure a custom one to my standards, even after spending quite some time on it.

RTF to XHTML Converter
http://rtftohtml.com/
COST: $34.50 (€29)
HOW-TO: In Word, save as RTF file, browse to it in the program, set an output file path, click "Convert" button
NOTES: This program did properly convert lists to <li> tags, but it also added all sorts of extra <div> and <span> tags with useless style info. There aren't any options to exclude this sort of formatting, which would have made this program a winner. Unfortunately, it just doesn't strip out enough junk.

WordHTML CV
http://www.technoriversoft.com/wordtohtmlconverter.html
COST: free
HOW-TO: Drag-n-Drop your Word Doc onto the program window
NOTES: Doesn't remove Word styles (class="MsoBodyText", etc.) and doesn't fix converted lists.

Not Recommended

Web Code Converter
http://www.web-code-converter.com/
COST: $19.95
NOTES: I couldn't test this, since it opened with an error message. Re-installing didn’t help.

Atrise ToHTML
http://www.atrise.com/to-html/
COST: $25
HOW-TO: Drag & Drop your Word Doc onto the little program window
NOTES: Easy to use, but not recommended because it strips out ALL formatting, leaving only paragraph breaks. I would expect more functionality for $25.

Word2html LT
http://www.wordcnv.com/word2html-lt.html
COST: €40
HOW-TO: Browse to your file, click Open.
NOTES: Even though their website claims "Full support of bullets and numbered lists", I found that this wasn't the case. No <li> in sight. I was also unimpressed with its inability to figure out heading tags.

Convert Doc
http://www.softinterface.com/Convert-Doc/Features/Convert-DOC-To-HTML.htm
COST: free, as far as I could tell
HOW-TO: Browse to your file, set some options, click Convert.
NOTES: Unfortunately, it doesn't seem to do much differently than Word's own "Save As HTML" option. If you have other file conversion needs, though (PDFs, etc.), you might find this a useful program.

WordToWeb 2.5
http://www.solutionsoft.com/w2w.htm
COST: $299
HOW-TO: Uses a Wizard-like interface to browse to your file, set a gazillion options and finally Convert.
NOTES: This has a lot of options to create webpages from your Word docs, but as far as I can tell, it does a terrible job of cleaning the HTML it produces. If anything, it seems to ADD extra junk.

If you have a favorite, feel free to post a link in the comments.