Cleaning Documents for HTML

Ways of keeping HTML code clean when working with Microsoft Word and other Office documents

There are many ways of building websites and people of many different skill levels are doing so every single day. One of the problems encountered by web designers is the junk code that gets introduced when copying and pasting text from Microsoft Word documents, other webpages and emails. Basically, whenever you copy and paste something, you are not only copying the text you can see, but also the underlying instructions or code that tells it how to display.

If you use a professional WYSIWYG (What You See Is What You Get) editor like Adobe Dreamweaver, it will usually clean out the junk from other programs, and you can choose how it handles incoming styles and formatting. Sometimes, however, even Dreamweaver will bring in something from Microsoft Word that you don't want. For example, if your writer has mistakenly applied Heading 1 to the entire document and then used Word's formatting options to change the sizes manually, the whole document will come into Dreamweaver as Heading 1.

If you use WordPress or a content management system with a similar ‘Paste From Word’ option, you can paste text that keeps important formatting like bold, italics, links and tables but leaves out complex unwanted code and irrelevant styles.

If you are using an older content management system (e.g. certain versions of Joomla) or email marketing software (such as Stream Send), however, you may need to copy and paste from documents without bringing across unwanted code. One quick way of doing so is to paste your text into Notepad first, then copy and paste it from there. On a PC, this strips it back to text only. On a Mac, the equivalent program (TextEdit) opens in rich text mode by default and keeps background formatting, so switch it to plain text first (Format > Make Plain Text). The downside of this method is that plain text removes all the formatting from your original document or web page, including bold, italics, links and tables.
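If you would rather script the Notepad trick than do it by hand, the idea can be sketched in a few lines of Python. This is only a minimal sketch using the standard library: the names `TextOnly` and `to_plain_text` are made up for illustration, and it assumes you already have the pasted content as an HTML string. Like Notepad, it throws away every tag, so all formatting is lost.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects visible text and discards every tag and attribute."""

    # Elements whose contents are never visible text.
    SKIP = {"style", "script", "title", "xml"}
    # Elements that should end with a line break in the plain text.
    BLOCK = {"p", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def to_plain_text(html):
    """Return the text of an HTML fragment, one line per block element."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of whitespace and drop empty lines.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


pasted = '<p class="MsoNormal"><b>Hello</b> <span style="color:red">world</span></p>'
print(to_plain_text(pasted))
```

As with Notepad, you would then need to reapply any bold, links or headings yourself.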

If you have one large document, or many documents, and you don’t want to reformat each one in HTML, you can use any of the following methods to clean the code without losing formatting, links and so on:

Use Dreamweaver

As I implied above, the best method when building websites or email newsletters is to use professional WYSIWYG software like Adobe Dreamweaver. Even if you’re not using Dreamweaver to store a complete copy of your website and upload changes to your web host, you can still incorporate it in your workflow and save the HTML files to a logical place on your hard drive. For example, if you are sending email newsletters using email software such as Stream Send, you could write them in Dreamweaver using a template you (or I) have set up. You then save the file as HTML and, when you are ready to send out the email, copy and paste from the code view of Dreamweaver into the code view of your email marketing software.

Use WordPress

Even if you’re not using WordPress to create a blog, you can create a dummy post so that you can use its ‘Paste From Word’ feature. Then copy and paste the code from code view into your actual website content management system or email marketing software. WordPress has some funny ways of dealing with line breaks that come through from Word, so be sure to have a careful look through your document after doing this.

Use OpenOffice

Copying and pasting from Word into OpenOffice, and from there into your website CMS or email marketing software, will usually produce better results than going straight from Word.

Use a Word Cleaner Online

If you do a search online, you will probably find a variety of free tools for converting Word documents to clean HTML. These will have varying effectiveness depending on your document. Try Word2CleanHTML, for example.
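To give a feel for what a cleaner of this kind does, here is a rough Python sketch using only the standard library. It is not how Word2CleanHTML or any particular tool works; it simply assumes a short list of tags worth keeping, keeps only the `href` attribute on links, and drops everything else. The names `WordCleaner` and `clean_word_html` and the contents of the `KEEP` list are my own assumptions for illustration.

```python
from html import escape
from html.parser import HTMLParser


class WordCleaner(HTMLParser):
    """Keeps a small set of formatting tags and drops all other markup."""

    # Assumed whitelist: adjust to suit your own site or newsletter.
    KEEP = {"p", "br", "b", "strong", "i", "em", "a", "ul", "ol", "li",
            "table", "tr", "td", "th", "h1", "h2", "h3"}
    KEEP_ATTRS = {"a": {"href"}}
    # Elements whose contents are dropped along with the tag itself.
    DROP_CONTENT = {"style", "script", "title", "xml"}
    VOID = {"br"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.drop_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.DROP_CONTENT:
            self.drop_depth += 1
            return
        if self.drop_depth or tag not in self.KEEP:
            return  # unknown tag (span, font, o:p ...): drop tag, keep text
        allowed = self.KEEP_ATTRS.get(tag, set())
        kept = ""
        for name, value in attrs:
            if name in allowed:
                kept += ' {}="{}"'.format(name, escape(value or "", quote=True))
        self.out.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if tag in self.DROP_CONTENT:
            self.drop_depth = max(0, self.drop_depth - 1)
        elif not self.drop_depth and tag in self.KEEP and tag not in self.VOID:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        if not self.drop_depth:
            self.out.append(escape(data, quote=False))


def clean_word_html(html):
    """Return the HTML with only whitelisted tags and attributes left."""
    cleaner = WordCleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.out)


dirty = ('<p class="MsoNormal" style="mso-fareast-font-family:Calibri">'
         '<span style="font-size:14pt"><b>Bold</b> and '
         '<a href="https://example.com" target="_blank">a link</a></span>'
         '<o:p></o:p></p>')
print(clean_word_html(dirty))
```

A whitelist is used here rather than a list of things to remove, because Word can produce far more kinds of junk than you could reasonably list in advance.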

In all cases, to be sure what is going on, look through the resulting code for junk and unwanted styles. If you don't know HTML, this will be difficult. Remember, the final web page or email may look fine on your computer, but if the code is full of junk it may render poorly on other people’s machines. This will often occur between PCs and Macs or because of things like copying and pasting symbols or using non-Internet-friendly fonts. It could also result in a higgledy-piggledy design because you are not sticking to one Cascading Style Sheet and are introducing foreign styles by copying and pasting from other websites or documents.
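If reading through the code by eye is difficult, a small script can at least flag the most common signs of Word junk. The following Python sketch is a rough check under my own assumptions, not a full validator: the patterns in `JUNK_PATTERNS` are simply typical markers found in Word output (such as `class="MsoNormal"` and `mso-` style properties), and the function name `find_junk` is made up for illustration.

```python
import re

# Typical signs of pasted Word markup (an illustrative, incomplete list).
JUNK_PATTERNS = {
    "Word class names": re.compile(r'class\s*=\s*["\']?Mso', re.I),
    "mso- style properties": re.compile(r"mso-[a-z-]+\s*:", re.I),
    "Office namespace tags": re.compile(r"</?[ovw]:[a-z]+", re.I),
    "conditional comments": re.compile(r"<!--\[if", re.I),
    "font tags": re.compile(r"<font\b", re.I),
    "inline styles": re.compile(r"\bstyle\s*=", re.I),
}


def find_junk(html):
    """Return {label: count} for each kind of junk found at least once."""
    report = {}
    for label, pattern in JUNK_PATTERNS.items():
        hits = len(pattern.findall(html))
        if hits:
            report[label] = hits
    return report


sample = ('<p class=MsoNormal style="mso-bidi-font-size:12pt">'
          '<font face="Arial">Hi</font><o:p></o:p></p>')
for label, count in find_junk(sample).items():
    print(label, count)
```

An empty result does not prove the code is clean, but any hits are a good sign that the content needs another pass through one of the methods above.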

For more advice on this and analysis of your particular situation, please feel free to contact me.

 


Please email
Amanda@GreensladeCreations.com
or call 0403 124 533
to discuss your needs.

 

For details of all of Amanda's services, including how she can train your staff to maintain your website, email newsletter, image collection and more, please download this Greenslade Creations brochure.

 
