Stripping Styles and Deprecated HTML Elements and Attributes
One of the most frustrating and tedious tasks for web designers is stripping out improper characters and deprecated HTML elements and attributes. Improper characters include the apostrophes, quotation marks, and em-dashes that are created from Word or Pages documents. Deprecated HTML elements and attributes include <font>, target="", and align="".
The most common scenario involves getting a Word document from a client or reengineering the HTML of an old-school website. Usually a web designer has to do a series of Find & Replaces. If there’s a lot to Find & Replace, it ends up using valuable time that could be spent designing or reengineering the HTML.
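That series of Find & Replace passes can be collapsed into a single lookup table. Here is a minimal sketch in Python, not Scott's script: the name `normalize_punctuation` is made up for illustration, and the choice of replacements (straight quotes, `&ndash;`, `&mdash;`) is an assumption about what counts as a "proper" character.

```python
# Hypothetical mapping: the post does not list the exact replacements,
# so these targets are assumptions you may want to change.
SMART_PUNCTUATION = {
    "\u2018": "'",        # left single quotation mark
    "\u2019": "'",        # right single quotation mark / curly apostrophe
    "\u201c": '"',        # left double quotation mark
    "\u201d": '"',        # right double quotation mark
    "\u2013": "&ndash;",  # en dash
    "\u2014": "&mdash;",  # em dash
    "\u2026": "...",      # horizontal ellipsis
}

# str.maketrans accepts single-character keys mapped to replacement strings.
_TABLE = str.maketrans(SMART_PUNCTUATION)


def normalize_punctuation(text: str) -> str:
    """Replace word-processor punctuation in one pass over the text."""
    return text.translate(_TABLE)
```

Because every replacement lives in one dictionary, adding another character is a one-line change rather than another Find & Replace pass.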
To streamline this unavoidable process, I asked Scott to write me a script that I could use in BBEdit (or TextWrangler) that will strip all unnecessary information from HTML documents. For me, that meant stripping out all inline styles, classes, deprecated HTML elements and attributes, and replacing improper characters with proper ones.
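To make that wish list concrete, here is a rough sketch of the markup side of the cleanup in Python. It is not the Perl script described in this post. The function name `strip_markup` and the two tuples are hypothetical, and the tuples only contain the items named above plus `<center>` as an assumed extra. Regular expressions are a blunt tool for HTML, so treat this as a starting point for tidy, well-quoted markup only.

```python
import re

# Assumptions: the post names <font>, target, align, inline styles and classes.
# <center> is added here as an illustrative extra; extend both tuples as needed.
DEPRECATED_TAGS = ("font", "center")
STRIPPED_ATTRS = ("style", "class", "align", "target")

# Opening or closing deprecated tags. The \b stops "font" from matching
# a longer tag name that merely starts with the same letters.
_TAG_RE = re.compile(r"</?(?:%s)\b[^>]*>" % "|".join(DEPRECATED_TAGS), re.IGNORECASE)

# One unwanted attribute with a double-quoted, single-quoted, or bare value.
_ATTR_RE = re.compile(
    r"""\s+(?:%s)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""" % "|".join(STRIPPED_ATTRS),
    re.IGNORECASE,
)

# Any tag, so attribute removal never touches ordinary text content.
_ANY_TAG_RE = re.compile(r"<[^>]+>")


def strip_markup(html: str) -> str:
    """Drop deprecated tags (keeping their contents) and unwanted attributes."""
    html = _TAG_RE.sub("", html)
    return _ANY_TAG_RE.sub(lambda m: _ATTR_RE.sub("", m.group(0)), html)
```

Calling `strip_markup('<p class="x"><font face="Arial">Hi</font></p>')` leaves just `<p>Hi</p>`. One known limitation of this sketch: an attribute value that itself contains `>` will confuse the tag matcher.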
Scott was able to create a Perl script that strips the HTML down to its bare essentials – the way God intended. Now, all I have to do is open the HTML document in BBEdit, and then run the Unix Filter script. Here’s a download link for the script, and instructions on how to install and use it.
- Download HTML Stripper (you will want to right-click this link to save it).
- Copy the file to /Users/[home]/Library/Application Support/BBEdit/Unix Support/Unix Filters
- There are two ways to run the script in BBEdit:
  - From BBEdit’s menu, click on Window > Palettes > Unix Filters. The Unix Filters palette will appear with HTML Stripper as one of the options. Open the HTML document that you would like to strip, click on HTML Stripper in the palette, and then click on the Run button.
  - Open the HTML document that you would like to strip. From BBEdit’s menu, click on Unix Filters > HTML Stripper.
Update (05/18/2006) – The HTML Stripper is now available for TextMate
YOU RULE!!! Thanks!!
I’ve got something funky going on. I just ran a test with this:
<h1><a href="/"><img src="/images/logo.png" id="logo" alt="Some Logo" border="0" align="left"></a></h1>
I end up with:
<h1><a href="/"><em></a></h1>
Any ideas? I’m running this through TextWrangler 2.1 on OS X.4.4.
This would be tooooo sweet if it ends up working for me. As much as I’ve come to appreciate the finer points of find & replace, this seems more logical and HTMLTidy is laborious to figure out. If only I knew more perl or even some regex I could have more valuable input …
Allan, I’ll check it out and get it fixed. I’ll repost it here once I get it figured out.
Okay, Scott fixed your problem. The download link has been updated with the latest script.
Radical. All is well with the cosmos
Thanks again!!
[...] The point to all of this rambling is that while surfing my feeds last night, I saw a headline that made my heart stop. Stripping Styles and Deprecated HTML Elements and Attributes. Could it be true? Had someone come up with the simple RegEx script I had always known could/should exist? YES!!!! The fine folks at Sitening, Jon Henshaw and Scott Holdren, have come through with an extremely zen Perl script that easily plugs right into TextWrangler and provides the functionality that I’ve always wanted. I can now, with a single command, banish all deprecated (note: don’t confuse “deprecated” with “depreciated”) elements from any page that I’m currently working on. Wonderful. Simple. Elegant. Jon and Scott are my heroes. Thanks guys. [...]
Heck yeah! That script saved me twenty minutes at work yesterday. I even got to go to lunch twenty minutes early.
[...] Earlier this year, Jon posted a Perl script that Scott wrote which strips out deprecated HTML tags and showed how to run this as a plugin for BBEdit. Slowly over the last few months I’ve convinced Jon to use TextMate as his text editor of choice. The one thing holding him back from fully migrating away from BBEdit was this plugin. Luckily, implementing it in TextMate is dead simple. Here’s how: [...]
Nice tool, Jon. Too bad it only works on a Mac or a Unix platform.
For those of us on PCs and Windows, HTML Tidy does the same job nicely.