HTML: Hyper Text Markup Language Module

HTML is a widely used Markup Language that, while very similar to XML, differs enough to have its own specific libraries.

To use the bindings from this module:

(import :std/markup/html)

If you are interested in HTML templates for web development, have a look at our Template Attribute Language (TAL), which uses this parser and printer.

HTML Parser and Printer

Element, aka Tag Types

"There are six different kinds of elements: void elements, the template element, raw text elements, escapable raw text elements, foreign elements, and normal elements."

While HTML and XML are friends, there are some elements in HTML that cannot be expressed in XML. Knowing which elements they are is important for both parsing and printing.

Void: current-html-void-tags and html-void-tag?

Void elements

area, base, br, col, embed, hr, img, input, link, meta, source, track, wbr

Void elements can't have any contents (since there's no end tag, no content can be put between the start tag and the end tag).

--https://html.spec.whatwg.org/multipage/syntax.html#void-elements

The void tags are stored in the parameter current-html-void-tags. It holds more tags than the spec quoted above lists, because there is more than one spec and version and we try to be complete.
```
> (current-html-void-tags)
(area base br col command embed hr img input keygen
 link meta param source track wbr)
```
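The html-void-tag? procedure tests whether a tag is void. It is case-insensitive, as HTML is meant to be. A match returns the tail of the tag list starting at the matching tag, which counts as true; anything else returns #f.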
```
> (html-void-tag? 'InPut)
(input keygen link meta param source track wbr)
> (html-void-tag? 'InPuter)
#f
```
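Because the void tags live in a parameter, they can presumably be extended for a dynamic extent with parameterize. The following is a minimal sketch under that assumption: `custom-widget` is a made-up tag name and `void-with-extra?` is a hypothetical helper, neither is part of the module, and the sketch assumes html-void-tag? consults the parameter at call time.

```scheme
(import :std/markup/html)

;; Hypothetical helper: test a tag against the void tags plus one extra,
;; made-up tag name (custom-widget), without changing the global default.
(def (void-with-extra? tag)
  (parameterize ((current-html-void-tags
                  (cons 'custom-widget (current-html-void-tags))))
    ;; html-void-tag? returns a list tail on a match, so coerce to a boolean.
    (and (html-void-tag? tag) #t)))

(void-with-extra? 'br)            ; br is in the default list
(void-with-extra? 'custom-widget) ; relies on the assumption stated above
```

Outside the parameterize form the parameter keeps its original value, so other parsing code is unaffected.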
Raw Text: current-html-raw-tags and html-raw-tag?

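Raw text elements

script, style

--https://html.spec.whatwg.org/multipage/syntax.html#raw-text-elements

The contents of these elements are not HTML and are not escaped. The raw tags are stored in the parameter current-html-raw-tags, and html-raw-tag? tests a tag against them, again case-insensitively.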
```
> (current-html-raw-tags)
(script style xmp)
> (html-raw-tag? 'ScRipt)
(script style xmp)
> (html-raw-tag? 'html)
#f
```

Reading

html-parser is a permissive HTML parser that offers the scalable interface described in Oleg Kiselyov's SSAX parser, along with simple convenience utilities. It handles invalid HTML by inserting "virtual" start and end tags as needed to maintain the proper tree structure. A major goal of this parser is bug-for-bug compatibility with the way common web browsers parse HTML.

html->sxml

(def (html->sxml
      port-or-string
      start: (start (pgetq start: default-html->sxml-plist))
      end: (end (pgetq end: default-html->sxml-plist))
      decl: (decl (pgetq decl: default-html->sxml-plist))
      process: (process (pgetq process: default-html->sxml-plist))
      comment: (comment (pgetq comment: default-html->sxml-plist))
      text: (text (pgetq text: default-html->sxml-plist))
      bodyless: (bodyless (current-html-void-tags))
      literals: (literals (current-html-raw-tags)))
  ...)

Returns the SXML representation of the document from port-or-string, using the default or provided parsing options.
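As a rough illustration of the two accepted input kinds, the sketch below parses once from a string and once from an input port. The sample markup is invented for the example, and the second document deliberately leaves its li elements unclosed to exercise the permissive parsing described above. The printed SXML is not shown here because its exact shape depends on the default callbacks.

```scheme
(import :std/markup/html)

;; Parse directly from a string.
(def doc (html->sxml "<p>Hello <b>world</b></p>"))

;; Parse from an input port; the li elements are never closed explicitly.
(def doc-from-port
  (html->sxml (open-input-string "<ul><li>one<li>two</ul>")))

(write doc) (newline)
(write doc-from-port) (newline)
```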

default-html->sxml-plist

This is where the default parsing options come from.

(def default-html->sxml-plist
  [start: (lambda (tag attrs seed virtual?) '())
   end:   (lambda (tag attrs parent-seed seed virtual?)
	    `((,tag ,@(if (pair? attrs)
			`((@ ,@attrs) ,@(reverse seed))
			(reverse seed)))
	      ,@parent-seed))
   decl:    (lambda (tag attrs seed) `((*DECL* ,tag ,@attrs) ,@seed))
   process: (lambda (attrs seed) `((*PI* ,@attrs) ,@seed))
   comment: (lambda (text seed) `((*COMMENT* ,text) ,@seed))
   text:    (lambda (text seed) (cons text seed))])
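Any of these defaults can be replaced for a single call through the matching keyword of html->sxml. The sketch below swaps in a comment: callback with the same (text seed) signature that returns the seed unchanged, so comments should leave no node in the result. `drop-comment` is a name made up for this example.

```scheme
(import :std/markup/html)

;; Hypothetical replacement for the default comment: callback.
;; Returning the seed untouched means nothing is added for the comment.
(def (drop-comment text seed) seed)

(def doc
  (html->sxml "<p>Visible<!-- hidden note --></p>"
              comment: drop-comment))

(write doc) (newline)
```

The other callbacks still come from default-html->sxml-plist, so the rest of the tree is built as usual.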

html-strip

(html-strip port-or-string)

Returns a string representation of the document from port-or-string with all tags removed. No whitespace reduction or other rendering is done.

> (html-strip "<h1>This is a title.</h1>\n\n<p>This is the summary of things</p>")
"This is a title.\n\nThis is the summary of things"

make-html-parser

(make-html-parser start: #f end: #f text: #f
		comment: #f decl: #f process: #f
		entity: #f entities: *default-entities*
		tag-levels: *tag-levels*
		unnestables: *unnestables*
		bodyless:  (current-html-void-tags)
		literals:  (current-html-raw-tags)
		terminators: *terminators*)

Returns a procedure of two arguments, an initial seed and an optional input port, which parses the HTML document from the port using the callbacks given as keyword arguments.

The following callbacks are recognized:

start: tag attrs seed virtual?
    fdown in foldts, called when a start-tag is encountered.
  tag :=         tag name
  attrs :=       tag attributes as an alist
  seed :=        current seed value
  virtual? :=    #t if this start tag was inserted to fix the HTML tree

end: tag attrs parent-seed seed virtual?
    fup in foldts, called when an end-tag is encountered.
  tag :=         tag name
  attrs :=       tag attributes of the corresponding start tag
  parent-seed := parent seed value (i.e. seed passed to the start tag)
  seed :=        current seed value
  virtual? :=    #t if this end tag was inserted to fix the HTML tree

text: text seed
    fhere in foldts, called when any text is encountered.  May be
    called multiple times between a start and end tag, so you need
    to string-append yourself if desired.
  text :=        entity-decoded text
  seed :=        current seed value

comment: text seed
    fhere on comment data

decl: name attrs seed
    fhere on declaration data

process: list seed
    fhere on process-instruction data

In addition, entity mappings may be overridden with the entities: keyword.
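To show how the seed threads through the callbacks, here is a minimal sketch of a parser that collects the text chunks of a document into a flat list. `collect-text` and `html-text-chunks` are names invented for this example, and the sketch assumes that callbacks left unspecified fall back to harmless defaults.

```scheme
(import :std/markup/html)

;; A parser whose seed is a list of text chunks, newest first.
(def collect-text
  (make-html-parser
   ;; start: hand the current seed down to the element's children.
   start: (lambda (tag attrs seed virtual?) seed)
   ;; end: keep what the children accumulated, ignoring parent-seed.
   end:   (lambda (tag attrs parent-seed seed virtual?) seed)
   ;; text: push each entity-decoded chunk onto the seed.
   text:  (lambda (text seed) (cons text seed))))

;; Hypothetical convenience wrapper: parse a string and return the
;; chunks in document order.
(def (html-text-chunks str)
  (reverse (collect-text '() (open-input-string str))))

(write (html-text-chunks "<p>Hello <b>world</b></p>"))
(newline)
```

Since the text: callback may fire several times between a start and end tag, the chunks are left separate here; joining them is up to the caller.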

Writing

sxml->html

(sxml->html sxml (port #f))

Converts sxml to its HTML representation and writes it to the passed port.

If the port is #f, or not provided, return a string.
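The two calling conventions can be sketched as follows. The SXML tree is invented for the example, and it is an assumption that plain (tag child ...) forms like these are what sxml->html expects, based on the shape the default html->sxml callbacks build.

```scheme
(import :std/markup/html)

;; A small, made-up SXML tree.
(def page
  '(div (p "Fish & chips")
        (br)))

;; No port: the HTML comes back as a string.
(def html (sxml->html page))

;; With a port: the HTML is written to that port instead.
(sxml->html page (current-output-port))
(newline)
```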

html-escape

(html-escape str (port #f) escapes: (esc #f))

Escapes str for HTML and either returns the result or writes it to the port. By default it replaces the characters <>&"' with the corresponding HTML entities.

If the port is #f, or not provided, return a string.

If other escapes are wanted, pass an alist of character and replacement string pairs with the escapes: keyword. If it is #f, the defaults in html-character-escapes are used.

> html-character-escapes
((#\< . "&lt;")
 (#\> . "&gt;")
 (#\& . "&amp;")
 (#\" . "&quot;")
 (#\' . "&#39;"))
> (html-escape "< ' >")
"&lt; &#39; &gt;"
> (html-escape "< ' >" escapes: '((#\< . "Less Than")))
"Less Than ' >"

html-character-escapes

These are the characters that are escaped when writing HTML.

> html-character-escapes
((#\< . "&lt;")
 (#\> . "&gt;")
 (#\& . "&amp;")
 (#\" . "&quot;")
 (#\' . "&#39;"))
