Narayan Kamath | 70dce01 | 2013-10-21 12:26:25 +0100 | 1 | TagSoup - Just Keep On Truckin' |
| 2 | |
| 3 | Introduction |
| 4 | |
| 5 | This is the home page of TagSoup, a SAX-compliant parser written in |
| 6 | Java that, instead of parsing well-formed or valid XML, parses HTML as |
| 7 | it is found in the wild: [1]poor, nasty and brutish, though quite often |
| 8 | far from short. TagSoup is designed for people who have to process this |
| 9 | stuff using some semblance of a rational application design. By |
| 10 | providing a SAX interface, it allows standard XML tools to be applied |
| 11 | to even the worst HTML. TagSoup also includes a command-line processor |
| 12 | that reads HTML files and can generate either clean HTML or well-formed |
| 13 | XML that is a close approximation to XHTML. |
| 14 | |
| 15 | This is also the README file packaged with TagSoup. |
| 16 | |
| 17 | TagSoup is free and Open Source software. As of version 1.2, it is |
| 18 | licensed under the [2]Apache License, Version 2.0, which allows |
| 19 | proprietary re-use as well as use with GPL 3.0 or GPL 2.0-or-later |
| 20 | projects. (If anyone needs a GPL 2.0 license for a GPL 2.0-only |
| 21 | project, feel free to ask.) |
| 22 | |
| 23 | Warning: TagSoup will not build on stock Java 5.x or 6.x! |
| 24 | |
| 25 | Due to a bug in the versions of Xalan shipped with Java 5.x and 6.x, |
| 26 | TagSoup will not build out of the box. You need to retrieve [3]Saxon |
| 27 | 6.5.5, which does not have the bug. Unpack the zipfile in an empty |
| 28 | directory and copy the saxon.jar and saxon-xml-apis.jar files to |
| 29 | $ANT_HOME/lib. The Ant build process for TagSoup will then notice that |
| 30 | Saxon is available and use it instead. |
| 31 | |
| 32 | TagSoup 1.2 released |
| 33 | |
| 34 | There are a great many changes, most of them fixes for long-standing |
| 35 | bugs, in this release. Only the most important are listed here; for the |
| 36 | rest, see the CHANGES file in the source distribution. Very special |
| 37 | thanks to Jojo Dijamco, whose intensive efforts at debugging made this |
| 38 | release a usable upgrade rather than a useless mass of undetected bugs. |
| 39 | * As noted above, I have changed the license to Apache 2.0. |
| 40 | * The default content model for bogons (unknown elements) is now ANY |
| 41 | rather than EMPTY. This is a breaking change, which I have done |
| 42 | only because there was so much demand for it. It can be undone on |
| 43 | the command line with the --emptybogons switch, or programmatically |
| 44 | with parser.setFeature(Parser.bogonsEmptyFeature, true). |
| 45 | * The processing of entity references in attribute values has finally |
| 46 | been fixed to do what browsers do. That is, a reference is only |
| 47 | recognized if it is properly terminated by a semicolon; otherwise |
| 48 | it is treated as plain text. This means that URIs like |
| 49 | foo?cdown=32&cup=42 are no longer seen as containing an instance of |
| 50 | the ∪ character (whose name happens to be cup). |
| 51 | * Several new switches have been added: |
| 52 | + --doctype-system and --doctype-public force a DOCTYPE |
| 53 | declaration to be output and allow setting the system and |
| 54 | public identifiers. |
| 55 | + --standalone and --version allow control of the XML |
| 56 | declaration that is output. (Note that TagSoup's XML output is |
| 57 | always version 1.0, even if you use --version=1.1.) |
| 58 | + --norootbogons causes unknown elements not to be allowed as |
| 59 | the document root element. Instead, they are made children of |
| 60 | the default root element (the html element for HTML). |
| 61 | * The TagSoup core now supports character entities with values above |
| 62 | U+FFFF. As a consequence, the HTML schema now supports all 2,210 |
| 63 | standard character entities from the [4]2007-12-14 draft of XML |
| 64 | Entity Definitions for Characters, except the 94 which require more |
| 65 | than one Unicode character to represent. |
| 66 | * The SAX events startPrefixMapping and endPrefixMapping are now |
| 67 | being reported for all cases of foreign elements and attributes. |
| 68 | * All bugs around newline processing on Windows should now be gone. |
| 69 | * A number of content models have been loosened to allow elements to |
| 70 | appear in new and non-standard (but commonly found) places. In |
| 71 | particular, tables are now allowed inside paragraphs, against the |
| 72 | letter of the W3C specification. |
| 73 | * Since the span element is intended for fine control of appearance |
| 74 | using CSS, it should never have been a restartable element. This |
| 75 | very long-standing bug has now been fixed. |
| 76 | * The following non-standard elements are now at least partly |
| 77 | supported: bgsound, blink, canvas, comment, listing, marquee, nobr, |
| 78 | rbc, rb, rp, rtc, rt, ruby, wbr, xmp. |
| 79 | * In HTML output mode, boolean attributes like checked are now output |
| 80 | as such, rather than in XML style as checked="checked". |
| 81 | * Runs of < characters such as << and <<< are now handled correctly |
| 82 | in text rather than being transformed into extremely bogus |
| 83 | start-tags. |
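
The semicolon rule for entity references in attribute values, described in the list above, can be illustrated with a small stand-alone sketch. This is not TagSoup's own code: the class name, the method name, and the three-entry entity table are assumptions made for the example, and numeric character references are left out.

```java
import java.util.HashMap;
import java.util.Map;

final class AttrEntitySketch {
    // Tiny illustrative table (an assumption); the HTML schema defines far more.
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("amp", "&");
        ENTITIES.put("lt", "<");
        ENTITIES.put("cup", "\u222A");
    }

    /** Expands &name; only when the name is known and ends with a semicolon. */
    static String expand(String value) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < value.length()) {
            char c = value.charAt(i);
            if (c == '&') {
                // Scan the candidate entity name after the ampersand.
                int j = i + 1;
                while (j < value.length() && Character.isLetterOrDigit(value.charAt(j))) {
                    j++;
                }
                boolean terminated = j < value.length() && value.charAt(j) == ';';
                String name = value.substring(i + 1, j);
                if (terminated && ENTITIES.containsKey(name)) {
                    out.append(ENTITIES.get(name));
                    i = j + 1;          // skip past the semicolon
                    continue;
                }
            }
            out.append(c);              // plain text, including an unterminated '&'
            i++;
        }
        return out.toString();
    }
}
```

With this rule, a URI such as foo?cdown=32&cup=42 passes through unchanged, while a&cup;b has its reference replaced.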
| 84 | |
| 85 | [5]Download the TagSoup 1.2 jar file here. It's about 87K long. |
| 86 | [6]Download the full TagSoup 1.2 source here. If you don't have zip, |
| 87 | you can use jar to unpack it. |
| 88 | [7]Download the current CHANGES file here. |
| 89 | |
| 90 | TagSoup 1.1 released |
| 91 | |
| 92 | TagSoup 1.1 adds Tatu Saloranta's JAXP support for TagSoup. To use |
| 93 | TagSoup within the JAXP framework (which is not something I necessarily |
| 94 | recommend, but it is part of the Java XML platform), you can create a |
| 95 | SAXParser by calling |
| 96 | org.ccil.cowan.tagsoup.jaxp.SAXParserImpl.newInstance(). You can also |
| 97 | set the system property javax.xml.parsers.SAXParserFactory to |
| 98 | org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl, but be aware that doing |
| 99 | this will cause all JAXP-based XML parsing to go through TagSoup, which |
| 100 | is a Bad Thing if your application also reads XML documents. |
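
One way to avoid the global system property is to name the factory class explicitly through the standard two-argument SAXParserFactory.newInstance method (available since Java 6). The sketch below assumes that approach; the class and method names ScopedJaxpSketch and factoryFor are made up for the example, and only the factory class name comes from the text above.

```java
import javax.xml.parsers.SAXParserFactory;

final class ScopedJaxpSketch {
    // Factory class name as given in the README text.
    static final String TAGSOUP_FACTORY = "org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl";

    /**
     * Loads the named factory for this caller only, leaving the JVM-wide
     * javax.xml.parsers.SAXParserFactory property untouched.
     * Throws FactoryConfigurationError if the class cannot be loaded.
     */
    static SAXParserFactory factoryFor(String className) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        return SAXParserFactory.newInstance(className, loader);
    }
}
```

With the TagSoup jar on the class path, ScopedJaxpSketch.factoryFor(ScopedJaxpSketch.TAGSOUP_FACTORY).newSAXParser() should yield a TagSoup-backed SAXParser, while other JAXP users in the same application keep the default XML parser.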
| 101 | |
| 102 | What TagSoup does |
| 103 | |
| 104 | TagSoup is designed as a parser, not a whole application; it isn't |
| 105 | intended to permanently clean up bad HTML, as [8]HTML Tidy does, only |
| 106 | to parse it on the fly. Therefore, it does not convert presentation |
| 107 | HTML to CSS or anything similar. It does guarantee well-structured |
| 108 | results: tags will wind up properly nested, default attributes will |
| 109 | appear appropriately, and so on. |
| 110 | |
| 111 | The semantics of TagSoup are as far as practical those of actual HTML |
| 112 | browsers. In particular, never, never will it throw any sort of syntax |
| 113 | error: the TagSoup motto is [9]"Just Keep On Truckin'". But there's |
| 114 | much, much more. For example, if the first tag is LI, it will supply |
| 115 | the application with enclosing HTML, BODY, and UL tags. Why UL? Because |
| 116 | that's what browsers assume in this situation. For the same reason, |
| 117 | overlapping tags are correctly restarted whenever possible. Text like |
| 118 | This is <B>bold, <I>bold italic, </b>italic, </i>normal text |
| 119 | |
| 120 | gets correctly rewritten as: |
| 121 | This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text. |
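
The restart behaviour shown in the example above can be sketched with a simple stack of open elements. This is a simplified illustration, not TagSoup's actual algorithm: it treats every element as restartable, reopens elements immediately after the closing tag, and works on a pre-split token list. The class name RestartSketch and the method fix are assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;

final class RestartSketch {
    /** Tokens are either text, a start-tag like "<b>", or an end-tag like "</b>". */
    static String fix(List<String> tokens) {
        Deque<String> open = new ArrayDeque<>();   // innermost open element on top
        StringBuilder out = new StringBuilder();
        for (String tok : tokens) {
            if (tok.startsWith("</")) {
                String name = tok.substring(2, tok.length() - 1).toLowerCase(Locale.ROOT);
                if (!open.contains(name)) {
                    continue;                      // stray end-tag: drop it
                }
                List<String> restart = new ArrayList<>();
                // Close everything nested inside the element being ended.
                while (!open.peek().equals(name)) {
                    String inner = open.pop();
                    out.append("</").append(inner).append('>');
                    restart.add(0, inner);         // keep outermost-first order
                }
                open.pop();
                out.append("</").append(name).append('>');
                // Restart the elements that were closed early.
                for (String r : restart) {
                    open.push(r);
                    out.append('<').append(r).append('>');
                }
            } else if (tok.startsWith("<")) {
                String name = tok.substring(1, tok.length() - 1).toLowerCase(Locale.ROOT);
                open.push(name);
                out.append('<').append(name).append('>');
            } else {
                out.append(tok);
            }
        }
        // Close anything still open at end of input.
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }
}
```

Fed the tokens of the overlapping example, the sketch closes the i element before the b end-tag and reopens it straight afterwards, which yields the rewritten form shown above.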
| 122 | |
| 123 | By intention, TagSoup is small and fast. It does not depend on the |
| 124 | existence of any framework other than SAX, and should be able to work |
| 125 | with any framework that can accept SAX parsers. In particular, [10]XOM |
| 126 | is known to work. |
| 127 | |
| 128 | You can replace the low-level HTML scanner with one based on Sean |
| 129 | McGrath's [11]PYX format (very close to James Clark's ESIS format). You |
| 130 | can also supply an AutoDetector that peeks at the incoming byte stream |
| 131 | and guesses a character encoding for it. Otherwise, the platform |
| 132 | default is used. If you need an autodetector of character sets, |
| 133 | consider trying to adapt the [12]Mozilla one; if you succeed, let me |
| 134 | know. |
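
As a rough idea of what the "peeking" step of an autodetector might look like, the sketch below inspects only a leading byte-order mark and otherwise falls back to the platform default. It is a stand-alone illustration that does not implement TagSoup's AutoDetector interface; the class name BomSniffSketch and both method names are assumptions, and real detection (such as the Mozilla detector mentioned above) looks at far more than a BOM.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomSniffSketch {
    /** Returns the charset implied by a BOM in the first n bytes, or null if none. */
    static Charset sniff(byte[] head, int n) {
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    /** Peeks at up to three bytes, skips any BOM, and wraps the rest in a Reader. */
    static Reader autoDetectingReader(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < 3) {
            int r = pb.read(head, n, 3 - n);
            if (r < 0) {
                break;                  // stream shorter than three bytes
            }
            n += r;
        }
        Charset cs = sniff(head, n);
        int bom = (cs == null) ? 0 : (cs.equals(StandardCharsets.UTF_8) ? 3 : 2);
        if (n > bom) {
            pb.unread(head, bom, n - bom);   // push back everything after the BOM
        }
        return new InputStreamReader(pb, cs != null ? cs : Charset.defaultCharset());
    }
}
```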
| 135 | |
| 136 | Note: TagSoup in Java 1.1 |
| 137 | |
| 138 | If you go through the TagSoup source and replace all references to |
| 139 | HashMap with Hashtable and recompile, TagSoup will work fine in Java |
| 140 | 1.1 VMs. Thanks to Thorbjørn Vinne for this discovery. |
| 141 | |
| 142 | The TSaxon XSLT-for-HTML processor |
| 143 | |
| 144 | [13]I am also distributing [14]TSaxon, a repackaging of version 6.5.5 |
| 145 | of Michael Kay's Saxon XSLT version 1.0 implementation that includes |
| 146 | TagSoup. TSaxon is a drop-in replacement for Saxon, and can be used to |
| 147 | process either HTML or XML documents with XSLT stylesheets. |
| 148 | |
| 149 | TagSoup as a stand-alone program |
| 150 | |
| 151 | It is possible to run TagSoup as a program by saying java -jar |
| 152 | tagsoup-1.2.jar [option ...] [file ...]. Files mentioned on the command |
| 153 | line will be parsed individually. If no files are specified, the |
| 154 | standard input is read. |
| 155 | |
| 156 | The following options are understood: |
| 157 | |
| 158 | --files |
| 159 | Output into individual files, with html extensions changed to |
| 160 | xhtml. Otherwise, all output is sent to the standard output. |
| 161 | |
| 162 | --html |
| 163 | Output is in clean HTML: the XML declaration is suppressed, as |
| 164 | are end-tags for the known empty elements. |
| 165 | |
| 166 | --omit-xml-declaration |
| 167 | The XML declaration is suppressed. |
| 168 | |
| 169 | --method=html |
| 170 | End-tags for the known empty HTML elements are suppressed. |
| 171 | |
| 172 | --doctype-system=systemid |
| 173 | Forces the output of a DOCTYPE declaration with the specified |
| 174 | systemid. |
| 175 | |
| 176 | --doctype-public=publicid |
| 177 | Forces the output of a DOCTYPE declaration with the specified |
| 178 | publicid. |
| 179 | |
| 180 | --version=version |
| 181 | Sets the version string in the XML declaration. |
| 182 | |
| 183 | --standalone=[yes|no] |
| 184 | Sets the standalone declaration to yes or no. |
| 185 | |
| 186 | --pyx |
| 187 | Output is in PYX format. |
| 188 | |
| 189 | --pyxin |
| 190 | Input is in PYXoid format (need not be well-formed). |
| 191 | |
| 192 | --nons |
| 193 | Namespaces are suppressed. Normally, all elements are in the |
| 194 | XHTML 1.x namespace, and all attributes are in no namespace. |
| 195 | |
| 196 | --nobogons |
| 197 | Bogons (unknown elements) are suppressed. |
| 198 | |
| 199 | --nodefaults |
| 200 | Default attribute values are suppressed. |
| 201 | |
| 202 | --nocolons |
| 203 | Explicit colons in element and attribute names are changed to |
| 204 | underscores. |
| 205 | |
| 206 | --norestart |
| 207 | Normally restartable elements are not restarted. |
| 208 | |
| 209 | --ignorable |
| 210 | Whitespace in elements with element-only content is output. |
| 211 | |
| 212 | --emptybogons |
| 213 | Bogons are given a content model of EMPTY rather than ANY. |
| 214 | |
| 215 | --any |
| 216 | Bogons are given a content model of ANY rather than EMPTY |
| 217 | (default). |
| 218 | |
| 219 | --norootbogons |
| 220 | Don't allow bogons to be root elements; make them subordinate to |
| 221 | the root. |
| 222 | |
| 223 | --lexical |
| 224 | Pass through HTML comments and DOCTYPE declarations. Has no |
| 225 | effect when output is in PYX format. |
| 226 | |
| 227 | --reuse |
| 228 | Reuse a single instance of TagSoup parser throughout. Normally, |
| 229 | a new one is instantiated for each input file. |
| 230 | |
| 231 | --nocdata |
| 232 | Change the content models of the script and style elements to |
| 233 | treat them as ordinary #PCDATA (text-only) elements, as in |
| 234 | XHTML, rather than with the special CDATA content model. |
| 235 | |
| 236 | --encoding=encoding |
| 237 | Specify the input encoding. The default is the Java platform |
| 238 | default. |
| 239 | |
| 240 | --output-encoding=encoding |
| 241 | Specify the output encoding. The default is the Java platform |
| 242 | default. |
| 243 | |
| 244 | --help |
| 245 | Print help. |
| 246 | |
| 247 | --version |
| 248 | Print the version number. |
| 249 | |
| 250 | SAX features and properties |
| 251 | |
| 252 | TagSoup supports the following SAX features in addition to the standard |
| 253 | ones: |
| 254 | |
| 255 | http://www.ccil.org/~cowan/tagsoup/features/ignore-bogons |
| 256 | A value of "true" indicates that the parser will ignore unknown |
| 257 | elements. |
| 258 | |
| 259 | http://www.ccil.org/~cowan/tagsoup/features/bogons-empty |
| 260 | A value of "true" indicates that the parser will give unknown |
| 261 | elements a content model of EMPTY; a value of "false", a content |
| 262 | model of ANY. |
| 263 | |
| 264 | http://www.ccil.org/~cowan/tagsoup/features/root-bogons |
| 265 | A value of "true" indicates that the parser will allow unknown |
| 266 | elements to be the root of the output document. |
| 267 | |
| 268 | http://www.ccil.org/~cowan/tagsoup/features/default-attributes |
| 269 | A value of "true" indicates that the parser will return default |
| 270 | attribute values for missing attributes that have default |
| 271 | values. |
| 272 | |
| 273 | http://www.ccil.org/~cowan/tagsoup/features/translate-colons |
| 274 | A value of "true" indicates that the parser will translate |
| 275 | colons into underscores in names. |
| 276 | |
| 277 | http://www.ccil.org/~cowan/tagsoup/features/restart-elements |
| 278 | A value of "true" indicates that the parser will attempt to |
| 279 | restart the restartable elements. |
| 280 | |
| 281 | http://www.ccil.org/~cowan/tagsoup/features/ignorable-whitespace |
| 282 | A value of "true" indicates that the parser will transmit |
| 283 | whitespace in element-only content via the SAX |
| 284 | ignorableWhitespace callback. Normally this is not done, because |
| 285 | HTML is an SGML application and SGML suppresses such whitespace. |
| 286 | |
| 287 | http://www.ccil.org/~cowan/tagsoup/features/cdata-elements |
| 288 | A value of "true" indicates that the parser will process the |
| 289 | script and style elements (or any elements with type='cdata' in |
| 290 | the TSSL schema) as SGML CDATA elements (that is, no markup is |
| 291 | recognized except the matching end-tag). |
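
The feature URIs listed above are set through the standard SAX XMLReader.setFeature call. The helper below is a minimal sketch that depends only on org.xml.sax, so it accepts any XMLReader; the class name TagSoupFeatureSketch, the method apply, and the idea of passing short feature names are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

final class TagSoupFeatureSketch {
    // Common prefix of the TagSoup feature URIs listed above.
    static final String BASE = "http://www.ccil.org/~cowan/tagsoup/features/";

    /**
     * Applies settings keyed by short name (for example "bogons-empty").
     * Returns the short names that the reader did not recognize or support.
     */
    static List<String> apply(XMLReader reader, Map<String, Boolean> settings) {
        List<String> rejected = new ArrayList<>();
        for (Map.Entry<String, Boolean> e : settings.entrySet()) {
            try {
                reader.setFeature(BASE + e.getKey(), e.getValue());
            } catch (SAXException ex) {
                // Covers SAXNotRecognizedException and SAXNotSupportedException.
                rejected.add(e.getKey());
            }
        }
        return rejected;
    }
}
```

With TagSoup on the class path, an instance of org.ccil.cowan.tagsoup.Parser can be passed as the reader. A plain XML parser would be expected to reject these URIs, in which case they come back in the returned list.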
| 292 | |
| 293 | TagSoup supports the following SAX properties in addition to the |
| 294 | standard ones: |
| 295 | |
| 296 | http://www.ccil.org/~cowan/tagsoup/properties/scanner |
| 297 | Specifies the Scanner object this parser uses. |
| 298 | |
| 299 | http://www.ccil.org/~cowan/tagsoup/properties/schema |
| 300 | Specifies the Schema object this parser uses. |
| 301 | |
| 302 | http://www.ccil.org/~cowan/tagsoup/properties/auto-detector |
| 303 | Specifies the AutoDetector (for encoding detection) this parser |
| 304 | uses. |
| 305 | |
| 306 | More information |
| 307 | |
| 308 | I gave a presentation (a nocturne, so it's not on the schedule) at |
| 309 | [15]Extreme Markup Languages 2004 about TagSoup, updated from the one |
| 310 | presented in 2002 at the New York City XML SIG and at XML 2002. This is |
| 311 | the main high-level documentation about how TagSoup works. Formats: |
| 312 | [16]OpenDocument [17]Powerpoint [18]PDF. |
| 313 | |
| 314 | I also had people add [19]"evil" HTML to a large poster so that I could |
| 315 | [20]clean it up; View Source is probably more useful than ordinary |
| 316 | browsing. The original instructions were: |
| 317 | |
| 318 | SOUPE DE BALISES (BE EVIL)! |
| 319 | Ecritez une balise ouvrante (sans attributs) |
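| 320 | ou fermante HTML ici, s.v.p. [In English: "Tag soup (be evil)! Please write an HTML start-tag (without attributes) or end-tag here."] |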
| 321 | |
| 322 | There is a [21]tagsoup-friends mailing list hosted at [22]Yahoo Groups. |
| 323 | You can [23]join via the Web, or by sending a blank email to |
| 324 | [24]tagsoup-friends-subscribe@yahoogroups.com. The [25]archives are |
| 325 | open to all. |
| 326 | |
| 327 | Online TagSoup processing for publicly accessible HTML documents is now |
| 328 | [26]available courtesy of Leigh Dodds. |
| 329 | |
| 330 | References |
| 331 | |
| 332 | 1. http://oregonstate.edu/instruct/phl302/texts/hobbes/leviathan-c.html |
| 333 | 2. http://opensource.org/licenses/apache2.0.php |
| 334 | 3. http://prdownloads.sourceforge.net/saxon/saxon6-5-5.zip |
| 335 | 4. http://www.w3.org/TR/2007/WD-xml-entity-names-20071214 |
| 336 | 5. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup-1.2.jar |
| 337 | 6. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup-1.2-src.zip |
| 338 | 7. http://home.ccil.org/~cowan/XML/tagsoup/CHANGES |
| 339 | 8. http://tidy.sf.net/ |
| 340 | 9. http://www.crumbmuseum.com/truckin.html |
| 341 | 10. http://www.cafeconleche.org/XOM |
| 342 | 11. http://gnosis.cx/publish/programming/xml_matters_17.html |
| 343 | 12. http://jchardet.sourceforge.net/ |
| 344 | 13. http://www.ccil.org/~cowan |
| 345 | 14. http://home.ccil.org/~cowan/XML/tagsoup/tsaxon |
| 346 | 15. http://www.extrememarkup.com/extreme/2004 |
| 347 | 16. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.odp |
| 348 | 17. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.ppt |
| 349 | 18. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.pdf |
| 350 | 19. http://home.ccil.org/~cowan/XML/tagsoup/extreme.html |
| 351 | 20. http://home.ccil.org/~cowan/XML/tagsoup/extreme.xhtml |
| 352 | 21. http://groups.yahoo.com/group/tagsoup-friends |
| 353 | 22. http://groups.yahoo.com/ |
| 354 | 23. http://groups.yahoo.com/group/tagsoup-friends/join |
| 355 | 24. mailto:tagsoup-friends-subscribe@yahoogroups.com |
| 356 | 25. http://groups.yahoo.com/group/tagsoup-friends/messages |
| 357 | 26. http://xmlarmyknife.org/docs/xhtml/tagsoup/ |