Stripping Styles and Deprecated HTML Elements and Attributes
One of the most frustrating and tedious tasks for web designers is stripping out improper characters and deprecated HTML elements and attributes. Improper characters include the curly apostrophes, curly quotation marks, and em dashes that come along with text pasted from Word or Pages documents. Deprecated HTML elements and attributes include <font>, target="", and align="".
The most common scenario involves getting a Word document from a client or reengineering the HTML of an old-school website. Usually a web designer has to run a series of Find & Replace operations by hand. When there are many of them, they eat up valuable time that could be spent designing or reengineering the HTML.
To streamline this unavoidable process, I asked Scott to write me a script that I could use in BBEdit (or TextWrangler) that will strip all unnecessary information from HTML documents. For me, that meant stripping out all inline styles, classes, deprecated HTML elements and attributes, and replacing improper characters with proper ones.
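To make the markup part of that clean-up concrete, here is a rough sketch in Python. It is not Scott's script; the tag and attribute lists, and the function name strip_markup, are assumptions chosen for illustration. It uses regular expressions rather than a real HTML parser, so it suits quick clean-up of simple markup, not arbitrary HTML.

```python
import re

# Hypothetical lists for illustration; adjust them to your own needs.
DEPRECATED_TAGS = ("font", "center")
STRIPPED_ATTRS = ("style", "class", "align", "target")

# Matches opening and closing forms of the deprecated tags, e.g. <font ...> and </font>.
TAG_TO_DROP = re.compile(
    r"</?(?:%s)\b[^>]*>" % "|".join(DEPRECATED_TAGS), re.IGNORECASE
)

# Matches one unwanted attribute with a double-quoted, single-quoted, or bare value.
ATTR_TO_DROP = re.compile(
    r"""\s+(?:%s)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""" % "|".join(STRIPPED_ATTRS),
    re.IGNORECASE,
)

# Matches any tag, so attributes are only removed inside tags, never in body text.
ANY_TAG = re.compile(r"<[^>]+>")


def strip_markup(html: str) -> str:
    """Drop deprecated tags (keeping their contents) and unwanted attributes."""
    html = TAG_TO_DROP.sub("", html)
    return ANY_TAG.sub(lambda m: ATTR_TO_DROP.sub("", m.group(0)), html)
```

For example, strip_markup('<p class="intro"><font face="Arial">Hi</font></p>') returns '<p>Hi</p>'.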
Scott was able to create a Perl script that strips the HTML down to its bare essentials – the way God intended. Now, all I have to do is open the HTML document in BBEdit, and then run the Unix Filter script. Here’s a download link for the script, and instructions on how to install and use it.
- Download HTML Stripper. You will want to right-click this link to save it.
- Copy the file to /Users/[home]/Library/Application Support/BBEdit/Unix Support/Unix Filters
- There are two ways to run the script in BBEdit:
- From BBEdit’s menu, click on Window – Palettes – Unix Filters. The Unix Filters palette will appear with HTML Stripper as one of the options. Open the HTML document that you would like to strip, click on HTML Stripper in the palette, and then click on the Run button.
- Open the HTML document that you would like to strip. From BBEdit’s menu, click on Unix Filters – HTML Stripper.
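For readers curious about the general shape of such a filter, the following is a minimal Python sketch, not the Perl script offered above. It assumes the editor pipes the document text to the script on standard input and replaces the text with whatever the script prints; some editor versions may instead hand the script a temporary file path, in which case the script would need to read that file. The replacement table and the names fix_characters and run_filter are likewise assumptions.

```python
#!/usr/bin/env python3
import sys

# Typographic characters commonly pasted from word processors, mapped to
# plain ASCII. Which replacement counts as "proper" is an assumption here;
# HTML entities would work just as well.
REPLACEMENTS = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / curly apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # horizontal ellipsis
}
TABLE = str.maketrans(REPLACEMENTS)


def fix_characters(text: str) -> str:
    """Replace typographic punctuation with plain ASCII equivalents."""
    return text.translate(TABLE)


def run_filter(in_stream, out_stream) -> None:
    """Read all text from in_stream, clean it, and write it to out_stream."""
    out_stream.write(fix_characters(in_stream.read()))


if __name__ == "__main__":
    run_filter(sys.stdin, sys.stdout)
```

Keeping the clean-up in its own function means the same logic can be tried from a terminal, for example by piping a file into the script, before installing it as a filter.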
Update (05/18/2006) – The HTML Stripper is now available for TextMate