html2xhtml(1)                                                  html2xhtml(1)



NAME
       html2xhtml - Converts HTML files to XHTML

SYNTAX
       html2xhtml [ filename ] [ options ]


DESCRIPTION
       Html2xhtml is a command-line tool that converts HTML files to XHTML
       files. The path of the HTML input file can be provided as a
       command-line argument. If it is not provided, the input is read from
       standard input.

       Html2xhtml always tries to generate valid XHTML files. It is able to
       correct many common errors in input HTML files without loss of
       information. However, for some errors, html2xhtml may discard some
       information in order to generate valid XHTML output. This can be
       avoided with the -e option, which allows html2xhtml to generate
       non-valid output in these cases.

       Html2xhtml can generate XHTML output that complies with one of the
       following document types: XHTML 1.0 (Transitional, Strict and
       Frameset), XHTML 1.1, XHTML Basic, XHTML Mobile Profile and XHTML
       Print.
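
       The two input modes described above can be sketched as follows. This
       is a minimal illustration, not a reference invocation: the temporary
       directory, the file names and the sample HTML content are assumptions
       made for the example, and the commands are skipped if html2xhtml is
       not installed.

```shell
# Create a small HTML input file to convert (hypothetical sample content).
tmpdir=$(mktemp -d)
printf '<html><head><title>Demo</title></head><body><p>Hello<br>world</body></html>\n' > "$tmpdir/input.html"

if command -v html2xhtml >/dev/null 2>&1; then
    # Input given as a file name; the XHTML goes to standard output.
    html2xhtml "$tmpdir/input.html"

    # Input read from standard input; the XHTML is written to a file with -o.
    html2xhtml -o "$tmpdir/output.xhtml" < "$tmpdir/input.html"
else
    echo "html2xhtml is not installed; skipping conversion" >&2
fi
```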

OPTIONS
       The command line options/arguments are:

       filename            Read the HTML input from filename (optional argu‐
                           ment). If this argument is not provided, the HTML
                           input is read from standard input.

       -o filename         Output XHTML file. The file is overwritten if  it
                           exists. If not provided, the output is written to
                           standard output.

       -e                  Instructs the program to propagate input chunks
                           to the output even if it is unable to adapt them
                           to the output XHTML doctype. With this option,
                           the XHTML output may be non-valid. Without it,
                           some input data may be removed from the output
                           in some rare cases.

       -t output-doctype   Doctype of the output XHTML file. If not
                           specified, the program automatically selects
                           either XHTML 1.0 Transitional or XHTML 1.0
                           Frameset, depending on the input. Currently
                           available doctypes are:
                            o transitional   XHTML 1.0 Transitional
                            o frameset       XHTML 1.0 Frameset
                            o strict         XHTML 1.0 Strict
                            o 1.1            XHTML 1.1
                            o basic-1.0      XHTML Basic 1.0
                            o basic-1.1      XHTML Basic 1.1
                            o mp             XHTML Mobile Profile
                            o print-1.0      XHTML Print 1.0

       --ics input_charset Character  set of the input document. This option
                           overrides the default input character set  detec‐
                           tion mechanism.

       --ocs output_charset
                           Character  set  for the output XHTML document. If
                           this option is not present, the character set  of
                           the input is used as default.

       --lcs               Dump  the list of available character set aliases
                           and exit html2xhtml.  No conversion is  performed
                           when this option is present.

       -l line_length      Number of characters per line. The default value
                           is 80. It must be greater than or equal to 40;
                           otherwise the parameter is ignored.

       -b tab_length       Tab  length in number of characters. It must be a
                           number between 0 and 16, otherwise the  parameter
                           is  ignored.   Use  0 to avoid indentation in the
                           output.

       --preserve-space-comments
                           Use this option to preserve  white  spaces,  tabs
                           and  ends of lines in HTML comments. The default,
                           if not provided, is to rearrange spacing.

       --no-protect-cdata  Enclose CDATA sections in "script" and "style"
                           following the XHTML 1.0 specification (using
                           "<![CDATA[" and "]]>"). This might be
                           incompatible with some browsers. The default in
                           this version is to enclose CDATA sections using
                           "//<![CDATA[" and "//]]>", because major
                           browsers handle this form properly.

       --compact-block-elements
                           Do not write white space or line breaks between
                           the start tag of a block element and the start
                           tag of its first enclosed inline element (or
                           character data), nor between the end tag of its
                           last enclosed inline element (or character data)
                           and the end tag of the block element. By
                           default, if this option is not set, a newline
                           character and indentation are written between
                           them.

       --compact-empty-elm-tags
                           Do not write a space before the slash in empty
                           element tags (i.e. write "<br/>" instead of the
                           default "<br />"). Note that although both
                           notations are correct in XML, the XHTML 1.0
                           standard recommends the latter to improve
                           compatibility with old browsers.

       --empty-elm-tags-always
                           By default, empty element tags are written only
                           for elements declared as empty in the DTD. With
                           this option, any element that has no content is
                           written with the empty element tag, even if it
                           is not declared as empty in the DTD. This option
                           may cause problems when the XHTML document is
                           opened by browsers in HTML (tag soup) mode.

       --dos-eol           Write the output XHTML file with DOS-style
                           (CRLF) end of line, instead of the default
                           UNIX-style end of line. Both end of line styles
                           are allowed by the XML recommendation.

       --help              Show a brief help message and exit.

       --version           Show the version number and exit.
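
       As an illustration of how the options above combine, the following
       sketch converts one file several times with different settings. Only
       options documented in this page are used; the temporary directory,
       the file names and the sample HTML are assumptions made for the
       example, and the commands are skipped if html2xhtml is not installed.

```shell
# Hypothetical sample input for the conversions below.
tmpdir=$(mktemp -d)
printf '<html><head><title>Demo</title></head><body><p align="center">Hello<br>world</p></body></html>\n' > "$tmpdir/page.html"

if command -v html2xhtml >/dev/null 2>&1; then
    # Request XHTML 1.0 Strict instead of the automatically selected doctype.
    html2xhtml "$tmpdir/page.html" -t strict -o "$tmpdir/strict.xhtml"

    # Same doctype, but -e keeps input that cannot be adapted to it,
    # so the result may be non-valid.
    html2xhtml "$tmpdir/page.html" -t strict -e -o "$tmpdir/strict-e.xhtml"

    # Lines of up to 120 characters and a tab length of 4.
    html2xhtml "$tmpdir/page.html" -l 120 -b 4 -o "$tmpdir/wide.xhtml"

    # No indentation (-b 0) and compact "<br/>" style empty element tags.
    html2xhtml "$tmpdir/page.html" -b 0 --compact-empty-elm-tags \
        -o "$tmpdir/compact.xhtml"
else
    echo "html2xhtml is not installed; skipping conversion" >&2
fi
```

       Writing each result to its own file with -o makes it easy to compare
       the outputs with a tool such as diff.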


NOTE ON CHARACTER SETS
       Since version 1.1.2, html2xhtml is able to parse and write  HTML  and
       XHTML  documents  using  the most popular character sets / encodings.
       It is also able to read the input document using  a  given  character
       set and generate an output that uses a different character set. The
       iconv implementation in the GNU C library is used for that purpose.

       Any IANA-registered character set that is supported by the iconv
       library may be used. When naming a character set, any IANA-approved
       alias for it may be used. The full list of aliases recognised by
       html2xhtml can be obtained with the --lcs command-line option.

       If  the  character  set  of  the  input  document  is  not specified,
       html2xhtml tries to guess it automatically.  If the character set  of
       the  output  document  is not specified, html2xhtml writes the output
       using the same character set as the input document.
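
       A character set conversion along these lines might look as follows.
       This is a sketch under assumptions: the file names and sample content
       are made up for the example, and the alias spellings ISO-8859-1 and
       UTF-8 are the IANA names, which should be checked against the --lcs
       listing of the installed version.

```shell
tmpdir=$(mktemp -d)
# Write a Latin-1 encoded input: \351 is the byte for e-acute in ISO-8859-1.
printf '<html><head><title>Caf\351</title></head><body><p>Caf\351</p></body></html>\n' > "$tmpdir/latin1.html"

if command -v html2xhtml >/dev/null 2>&1; then
    # Show the beginning of the list of recognised character set aliases.
    html2xhtml --lcs | head -n 20

    # State the input character set explicitly and write the output as UTF-8.
    html2xhtml "$tmpdir/latin1.html" --ics ISO-8859-1 --ocs UTF-8 \
        -o "$tmpdir/utf8.xhtml"
else
    echo "html2xhtml is not installed; skipping conversion" >&2
fi
```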


NOTE ON END OF LINE CHARACTERS
       By default, the UNIX-style one-byte (LF) end of line is used. It can
       be changed to DOS-style CRLF end of line with the --dos-eol
       command-line option.

       However, when the program is compiled in the  MinGW  environment  and
       the  output  is  sent to standard output, the output is automatically
       converted by the environment to CRLF  by  default.  Do  not  use  the
       --dos-eol  command-line option in that situation.  When the output is
       sent to a file with the -o command-line  option,  the  output  is  as
       expected  (UNIX-style  by  default),  and the --dos-eol option may be
       used.
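
       The file-based case can be sketched as follows. The file names and
       sample content are assumptions made for the example, od is used only
       to make the line ends visible, and the commands are skipped if
       html2xhtml is not installed.

```shell
tmpdir=$(mktemp -d)
printf '<html><head><title>Demo</title></head><body><p>Hello</p></body></html>\n' > "$tmpdir/input.html"

if command -v html2xhtml >/dev/null 2>&1; then
    # Write to a file with -o and request DOS-style line ends explicitly.
    html2xhtml "$tmpdir/input.html" --dos-eol -o "$tmpdir/output-crlf.xhtml"

    # Dump the first bytes; CRLF line ends appear as \r \n in the od output.
    od -c "$tmpdir/output-crlf.xhtml" | head -n 5
else
    echo "html2xhtml is not installed; skipping conversion" >&2
fi
```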


ACKNOWLEDGMENTS
       Program developer up to current version:
       Jesus Arias Fisteus <jaf@it.uc3m.es>

       The first working version of this program was developed as
       a Master's thesis at the University of Vigo (Spain) [http://www.uvigo.es],
       advised by:

       Rebeca Diaz Redondo
       Ana Fernandez Vilas

       Copyright 2000-2001 by Jesus Arias Fisteus, Rebeca Diaz Redondo, Ana
       Fernandez Vilas.
       Copyright 2002-2009 by Jesus Arias Fisteus






