HTML Primer
Syntax
- HTML and XML syntax
- HTML can be written using the HTML syntax or the XML syntax. The two are roughly equivalent, and will usually look nearly identical. The major differences are that the XML syntax allows namespaces (the HTML syntax doesn't - it handles namespaces implicitly), the XML syntax can't have unquoted attributes or unclosed elements, and XML can have self-closed elements like <div /> (HTML has a handful of predefined "void elements" which are self-closed, but no general facility for self-closing elements). I recommend using the HTML syntax when writing HTML.
- Escaping
- HTML uses a few magic characters in its syntax that you need to be careful about when writing pages, and doubly careful about when including user-submitted content in a page. The contents of an element should have all instances of "<" and "&" replaced with "&lt;" and "&amp;". The contents of an attribute should have all instances of "&" and your quoting character escaped, using "&amp;" and either "&quot;" (for double quotes) or "&#39;" (for single quotes).
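As a small illustration of the escaping rules, here is a sketch of a paragraph and two links whose text and attribute values contain the magic characters. The URLs and wording are made up for the example.

```html
<!-- Element contents: "<" and "&" are written as &lt; and &amp;
     (escaping ">" as &gt; is optional, but harmless) -->
<p>Use &lt;em&gt; for emphasis &amp; &lt;strong&gt; for importance.</p>

<!-- Double-quoted attribute: escape "&" and the double quote -->
<a href="/search?q=cats&amp;sort=new" title="The &quot;new&quot; cats">Cats</a>

<!-- Single-quoted attribute: escape "&" and the single quote -->
<a href='/news' title='Today&#39;s news'>News</a>
```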
Language Structure
- The doctype
- The doctype is an artifact of HTML's history as an SGML language. It doesn't have any effect, other than triggering standards-mode in all browsers. The doctype for HTML is "<!DOCTYPE html>", and it must appear as the very first thing on your page.
- The XML declaration
- If you're using the XML syntax, do not use the doctype. Instead, use the XML declaration, which will look like "<?xml version='1.0' encoding='utf-8'?>".
- <html>, <head>, <body>
- <html> is the root element of an HTML document. It can only have two children, a <head> which contains metadata about the page, and a <body> which contains the visible content of the page. All of these tags are optional - the HTML parser will insert them automatically if they're missing.
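Putting the pieces together, a minimal page in the HTML syntax might look like the sketch below. The title and paragraph text are placeholders.

```html
<!DOCTYPE html>
<html>
  <head>
    <title>Page title goes here</title>
  </head>
  <body>
    <p>Visible content goes here.</p>
  </body>
</html>
```

Because the <html>, <head>, and <body> tags are optional, the parser should build the same structure from this shorter version:

```html
<!DOCTYPE html>
<title>Page title goes here</title>
<p>Visible content goes here.</p>
```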
Metadata Elements
- <title>
- This gives the title of your page. It's required! Search engines also use this as the name of the page in search results.
- <meta>, <link>
- These provide name/value pairs for the page. <meta> is for plain-text data - specify the name with @name or @http-equiv attributes, and the value with the @content attribute. <link> is for pointing to external resources as the data - specify the name with the @rel attribute and the value with the @href attribute. These are both void elements - they don't have end tags. Here are some common instances of these elements you'll see or use:
<meta charset=utf-8>
- An exception to the syntax I described above, this sets the charset of the page. You should always include this in your page (assuming your pages are encoded as utf-8, which they should be), as without a charset the browser falls back to a locale-dependent default encoding, typically Windows-1252.
<meta http-equiv=content-type content=text/html;charset=utf-8>
- An older way of specifying the charset.
<meta name=description content='A description of this page'>
- This provides the page description that search engines use when displaying the page in search results. Without this, search results will just include a scrap of content from near the searched-for terms.
<link rel=stylesheet href="mystylesheet.css">
- This pulls in the external CSS stylesheet and applies it to the page.
<link rel=canonical href="basepage.html">
- When there are multiple ways to get to the same page (such as irrelevant query parameters, or a self-hosted link shortener), this specifies what the "canonical" URL for the page is, so search engines can collect all the links together and treat them as pointing to the same thing.
- <script>, <style>
- These let you provide Javascript and CSS inline with the page, rather than as external resources. Just put your JS or CSS between the tags. Note that many tutorials will recommend putting @type or @language on these; that's completely unnecessary voodoo, as the browser will automatically treat them as JS and CSS respectively. <script> is also used to refer to external scripts - if you have a @src attribute, the contents of the element are ignored and the external script is fetched instead. (This is, unfortunately, completely different to how you refer to external CSS. Blame the browser developers of the past.)
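Here is a sketch of a <head> that uses the metadata elements described above. "mystylesheet.css" and "basepage.html" come from the examples earlier; "myscript.js" and the title are hypothetical names chosen for illustration.

```html
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>My Example Page</title>
    <meta name="description" content="A description of this page">
    <link rel="stylesheet" href="mystylesheet.css">
    <link rel="canonical" href="basepage.html">

    <!-- Inline CSS; no @type needed -->
    <style>
      body { font-family: sans-serif; }
    </style>

    <!-- External script: anything between the tags would be ignored -->
    <script src="myscript.js"></script>

    <!-- Inline script; no @type or @language needed -->
    <script>
      console.log("Hello from an inline script");
    </script>
  </head>
  <body>
    <p>Hello!</p>
  </body>
</html>
```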
Page Organization Elements
The page organization elements let you tag the various sections of your page with semantic meaning. This helps you directly (the different tagnames make the page easier to read and make styling easier), and it helps applications like screen readers navigate the page.
- <section>, <article>, <nav>, <aside>
- These elements define "sections" of your page. Use these if it would be appropriate to include the section in an outline or Table of Contents for your page. <section> is the workhorse generic sectioner. Use <article> instead if the section defines an "independent" section of the page, something that would be appropriate to view on its own (you don't need an <article> around the content of your page - <body> carries the same semantic!). <nav> is for the "main navigation" - don't use it for all your links, just the ones that are most useful for getting around your site - the ones you'd point people to if they asked how to navigate the site. Finally, <aside> is for sections that are irrelevant to the main content, things like a blogroll on a blog (completely unrelated to the content of the page, which is some blog post), or a pullquote in an article (repeats something that's already said in the article).
- <h1> through <h6>
- These define headings for your sections. <h1> is the most important heading, <h2> is below that, etc. Some elements, such as the sectioning elements described above, "scope" the headings within themselves, so that an <h1> within a <section> defines the top-level heading for that section, rather than for the entire document.
- <header>, <footer>
- These wrap the headers and footers of a section. Note the difference between a "header" and a "heading" - <header> contains <h1-6>, it doesn't replace them.
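As a rough sketch, a blog post page might be organized like this. The site name, links, and text are invented, and the heading levels are chosen by rank so the outline reads the same whether or not a given tool scopes headings by section.

```html
<body>
  <header>
    <h1>My Blog</h1>
    <!-- Main navigation only, not every link on the page -->
    <nav>
      <a href="/">Home</a>
      <a href="/archive">Archive</a>
    </nav>
  </header>

  <!-- The post could stand on its own, so it's an <article> -->
  <article>
    <header>
      <h2>Post title</h2>
    </header>
    <section>
      <h3>First part</h3>
      <p>Some text for the first part of the post.</p>
    </section>
    <aside>
      <p>A pullquote repeating something from the post.</p>
    </aside>
    <footer>
      <p>Posted by the author.</p>
    </footer>
  </article>

  <!-- Unrelated to the post itself -->
  <aside>
    <h2>Blogroll</h2>
    <p><a href="https://example.com/">Another blog</a></p>
  </aside>

  <footer>
    <p>Site footer.</p>
  </footer>
</body>
```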
Content-Grouping Elements
- <p>
- The workhorse of elements, <p> defines a paragraph. Just wrap its start and end tags around your paragraph.
- <ol>, <ul>, <li> and <dl>, <dt>, <dd>
- These define the three kinds of lists. <ol> is an Ordered List (each item is numbered), while <ul> is an Unordered List (each item is bulleted). Both of these only accept <li>s (List Items) as children - the content of each item goes inside the <li>. <dl> is for a Description List, or a list of name/value pairs. It instead takes <dt> (Description Term) for each name and <dd> (Description Details) for each value.
- <blockquote>
- This is for containing quotations from other sources. It can take a @cite attribute which contains a url to the resource the quote is pulled from. <blockquote> is also a sectioning root, so headings in your quote won't mess up the outline of the rest of your page.
- <pre>
- This contains "preformatted" text. In other words, it exactly preserves the whitespace you put in (HTML normally collapses runs of whitespace into a single space character). This is useful for code examples, and other things where linebreaks or spacing are important to the meaning of the content, like poetry.
- <figure>, <figcaption>
- These are used to specify a chunk of content, like a picture, code sample, or table, that is referenced from the main content but doesn't have a particular location that it needs to be in. In books and magazines, these would correspond to examples and callouts. <figcaption> can be a child of the <figure>, and provides a caption for the rest of the content.
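The content-grouping elements above can be sketched together as follows. All of the text and the @cite URL are placeholders.

```html
<p>A paragraph of text.</p>

<ol>
  <li>First step</li>
  <li>Second step</li>
</ol>

<ul>
  <li>Apples</li>
  <li>Oranges</li>
</ul>

<dl>
  <dt>HTML</dt>
  <dd>The markup language for web pages.</dd>
  <dt>CSS</dt>
  <dd>The styling language for web pages.</dd>
</dl>

<blockquote cite="https://example.com/source">
  <p>A quotation from another source.</p>
</blockquote>

<figure>
  <!-- Whitespace and linebreaks inside <pre> are kept as written -->
  <pre>for (var i = 0; i &lt; 3; i++) {
  console.log(i);
}</pre>
  <figcaption>A short loop, shown with its indentation preserved.</figcaption>
</figure>
```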
Text-level elements
- <a>
- Use <a> to make text into a link. The @href attribute takes a url that the user should be navigated to when they click the link.
- <em>, <strong>, <i>, <b>
- These impart various semantics to your text. Use <em> to indicate emphasis and <strong> to indicate importance. <i> and <b> are used to italicize or bold text where the style conveys some semantic meaning that isn't otherwise covered by an HTML element, like italicizing species names, or bolding the first letter of each word in a phrase to explain an abbreviation, like I did for the lists earlier. Do not use <em> just to italicize things, and don't use <i> to add italics purely for stylistic purposes (use CSS to do that); same advice applies to <strong> and <b>.
- <ins>, <del>, <s>
- <ins> and <del> indicate insertions or deletions in a document, so you can track changes inline when, for example, copyediting a document. <s> has a similar default styling to <del> - the text is "struck-through" - but it's meant for things like the "sarcastic strike-through", when one wishes to purposely indicate that one really means the opposite of what one has written, similar to how "^H^H^H" is sometimes used in plaintext to pretend that something was deleted while leaving it in for rhetorical purposes.
- <small>
- Use this for "small print" - text that isn't less important, but is purposely less emphasized. As usual, don't use it just to make text smaller for stylistic purposes.
- <time>
- This is for marking up dates and times in a machine-readable way. The exact date/time can be given in the @datetime attribute. You don't need to mark up every date or time in your page with <time> - only do so when you actually have a need to make something machine-readable, such as if you need a script to understand what the exact time being indicated is (for example, so it can find more messages posted at the same time), but you'd like to present a friendly, more human-readable version to the user (like "4pm").
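A few sentences marked up with the text-level elements might look like this sketch. The link target, the date, and the wording are invented for the example.

```html
<p>
  Read the <a href="https://example.com/guide">guide</a> first.
  You <em>really</em> should.
  <strong>Do not skip the safety section.</strong>
</p>

<!-- <i> for a species name: the italics carry meaning, but not emphasis -->
<p>The house cat (<i>Felis catus</i>) is a common pet.</p>

<!-- Tracked edit versus rhetorical strike-through -->
<p>The meeting is on <del>Tuesday</del> <ins>Wednesday</ins>.</p>
<p>This plan is <s>reckless</s> bold.</p>

<p><small>Terms and conditions apply.</small></p>

<!-- Friendly text for people, exact value for scripts -->
<p>Posted at <time datetime="2011-05-14T16:00">4pm</time>.</p>
```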
Media Elements
- <img>
- The oldest media element, <img> displays an img, err, image. Point the @src attribute to the url of the image, and fill in the @alt attribute with a textual equivalent of the image (useful for the blind and robots like Google's crawler).
- <canvas>
- This represents a JavaScript-created image. You can use it for static image manipulation (draw an <img> into the canvas, then manipulate the pixels from script), totally script-generated images (use the various drawing commands available on the element), or even complex animated things like games (redraw the canvas from scratch every frame).
- <video>, <source>
- Basic video player. Just point the @src at the url of a video. Some other useful attributes are @controls (adds play/pause/etc controls to the video automatically; otherwise, you have to make them yourself), @loop (automatically restarts from the beginning when the video ends), and @autoplay (starts playing the video automatically). Not all browsers support the same video codecs, so you may need to supply multiple versions of the video - in this case, rather than using @src on the <video> element, give it some <source> children and set @src on them, along with @type indicating the MIME-type of the video. The browser will automatically choose the first one it understands. You can also put arbitrary fallback content into <video>, after all the <source>s, which will be ignored by browsers that support <video> - it's just for showing something to older browsers, as they won't understand the element and will just show the contents instead.
- <audio>, <source>
- Basic audio player. This is basically identical to <video> in terms of what attributes and contents it allows, it's just intended to play audio instead of video, and so has a somewhat different default interface.
- <track>
- This is used to provide lyrics, transcriptions, subtitles, and similar text-based equivalents of the <audio> and <video> elements. Just add it as a child of the <audio> or <video> and point its @src to a WebVTT file (the format originally called WebSRT) providing the text. Work is ongoing on this element, the API it exposes, and the file format.
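The media elements can be sketched together like this. Every file name is hypothetical, and the track file is assumed to be in WebVTT, the name the WebSRT format was later given.

```html
<img src="cat.jpg" alt="A grey cat asleep on a windowsill">

<video controls loop>
  <!-- The browser picks the first <source> it can play -->
  <source src="movie.webm" type="video/webm">
  <source src="movie.mp4" type="video/mp4">
  <track src="subtitles.vtt" kind="subtitles" srclang="en" label="English">
  <!-- Fallback, shown only by browsers that don't understand <video> -->
  <p>Your browser can't play this video. <a href="movie.mp4">Download it</a> instead.</p>
</video>

<audio src="song.mp3" controls></audio>

<canvas id="drawing" width="200" height="100"></canvas>
<script>
  // A totally script-generated image: one yellow rectangle
  var ctx = document.getElementById("drawing").getContext("2d");
  ctx.fillStyle = "#ffff00";
  ctx.fillRect(10, 10, 80, 50);
</script>
```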
Form Controls
- <form>
- This allows the page to submit data back to the server. It can contain any markup, but the form inputs inside of it are given special treatment, with their data extracted and packaged when the form is submitted. The @action attribute points to the url that the data will be sent to, and the @method attribute specifies how the data will be packaged.
method="get"
- (the default) makes the form send its data as query parameters on the url, and should be used when the form request doesn't have side effects (that is, if the form is basically just a complicated link, so that refreshing the page or sharing the url with the query parameters won't cause bad effects). For example, a search form should use GET, because searching multiple times for the same data is fine.
method="post"
- sends the data instead in a non-visible way as part of the HTTP request, and should be used when the data has side effects, like creating or deleting an item from a database.
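To make the difference concrete, here is a sketch of one form of each kind. The @action urls and field names are hypothetical; the <input> elements used here are described in the next entry.

```html
<!-- No side effects: submitting navigates to /search?q=whatever-was-typed -->
<form action="/search" method="get">
  <input type="text" name="q">
  <input type="submit" value="Search">
</form>

<!-- Has side effects: the data travels in the request body instead -->
<form action="/items/delete" method="post">
  <input type="hidden" name="id" value="42">
  <input type="submit" value="Delete item">
</form>
```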
- <input>
- This element allows the user to interact with the page, and is used to construct the form's dataset when the form is submitted. All inputs are void elements, so they don't have an end tag and can't have any contents. Each input contributes a name/value pair to the form's data, with the name coming from the input's @name attribute and the value coming from the input's current value. There are many different types of inputs, each presenting a different user interface, which can be selected by setting the @type attribute on the input. These are the "classic" inputs which have been around since HTML4:
type="text"
- A basic text input that the user can type in. You can pre-fill the value by setting the @value attribute, or supply a "placeholder" (a hint that is displayed only when the input is empty and the user isn't trying to type into it) by setting the @placeholder attribute.
type="password"
- This is identical to type="text", except that it masks the visible display of what the user typed by replacing each character with "*" or similar. The actual data sent to the server is completely unmasked and not encrypted or anything, though.
type="checkbox"
- A checkbox, usually represented as a square, which can be checked or unchecked by the user. You can make it checked by default by setting the boolean @checked attribute. If the element is checked, it contributes its name and value to the form's dataset as normal, but if it's unchecked it's instead completely ignored.
type="radio"
- Radio buttons are identical to checkboxes, except that they're grouped together with other radio buttons such that only one radio in the group is allowed to be checked at a time. The grouping is defined by the @name attribute - all the radio buttons with the same @name are in the same group, and checking a radio button automatically unchecks any other radio button in the same group.
type="file"
- A file submit input that allows the user to select a file and then submit it to the server. Note that file inputs only work if the form has method="post" and enctype="multipart/form-data" (this last bit is kinda weird, and only ever necessary if you're doing a file input).
type="submit"
- This creates a submit button. Pressing it automatically submits the form it's associated with. The submit button isn't required to have a @name attribute, but if it does, it's included in the data as normal. You can change the displayed text on the button by setting the @value attribute. (This means that the submit button will send its displayed text as its value, which isn't always ideal. You may instead want to use the <button> element, described further down.) You can put multiple submit buttons in a single form; only the one actually used to submit the form contributes its value to the form's dataset; the rest are ignored.
type="button"
- This is identical to type="submit", except that pressing it doesn't submit the form, or do anything at all, at least by default. It's designed to just give you a button that you can attach javascript to, so you can run script when the user presses the button.
type="hidden"
- This is a special type of input that is used solely to "smuggle in" extra values to the form's dataset. It's not displayed at all to the user, so its value has to be set either by the server or by script in the page.
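The classic input types can be sketched in a single form like the one below. The @action url, the field names, and the values are all made up for the example.

```html
<form action="/signup" method="post" enctype="multipart/form-data">
  <!-- Never shown to the user, but still submitted -->
  <input type="hidden" name="source" value="primer-example">

  <input type="text" name="username" placeholder="Pick a username">
  <input type="password" name="password">

  <!-- Contributes newsletter=yes only while checked -->
  <input type="checkbox" name="newsletter" value="yes" checked> Send me news

  <!-- Same @name puts both radios in one group -->
  <input type="radio" name="plan" value="free" checked> Free
  <input type="radio" name="plan" value="paid"> Paid

  <!-- Needs method="post" and the multipart enctype on the form -->
  <input type="file" name="avatar">

  <!-- Does nothing on its own; script is attached by hand -->
  <input type="button" value="Preview" onclick="alert('Preview pressed')">

  <!-- Submits action=Sign up along with the rest of the data -->
  <input type="submit" name="action" value="Sign up">
</form>
```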
HTML5 introduces several new input types. These aren't fully supported across all browsers yet, but they should be in a year or two, and for now there are various "polyfill"/"shim" JS libraries that will fill in functional copies on the older browsers. Most of these new inputs have a specialized user interface, so the actual value that they contribute to the form's dataset may be substantially different than what is actually shown to the user. They are:
type="date"
- A datepicker, usually presented as a calendar. There is also a type="time" that presents a timepicker, and a type="datetime-local" that combines the two, plus type="week" and type="month" for getting week/year and month/year combos. The value contributed to the form's dataset is a date like "2011-05-14", or a time like "23:59:30", or something similar for the other types.
type="color"
- A color picker, like what you'd see in a drawing program. The value contributed to the form's dataset is a 6-digit lowercase CSS hex color, like "#ffff00" for yellow.
type="range"
- A range slider, like what you see for setting the volume on your computer. You can give it @min, @max, and @step attributes to control what values it can take; otherwise it defaults to [0,100] with a step of 1.
type="number"
- This is like a text input, but only lets you enter numbers into it. It may also give you + and - buttons to easily increment/decrement the value. It takes @min, @max, and @step, just like type="range", and has the same default step of 1, but it has no default minimum or maximum.
type="tel", "url", and "email"
- These are all specialized types of text inputs that are meant to accept, respectively, telephone numbers, urls, and email addresses. They may affect how the input is displayed (for example, on a mobile phone with a software keyboard, clicking in an email input may bring up a specialized keyboard with easy access to "." and "@"), but it should generally look similar to a plain text input. They also have an effect on validation, which is described below.
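Here is a sketch of the newer input types in one form. The @action url, names, and starting values are hypothetical; a browser without support for a given type should simply treat it as a text input.

```html
<form action="/booking" method="get">
  <!-- Submitted as a date string such as 2011-05-14 -->
  <input type="date" name="day" value="2011-05-14">
  <input type="time" name="start" value="16:00">

  <!-- Submitted as a 6-digit hex color -->
  <input type="color" name="theme" value="#ffff00">

  <input type="range" name="volume" min="0" max="10" step="1" value="5">
  <input type="number" name="guests" min="1" max="8" step="1" value="2">

  <input type="tel" name="phone">
  <input type="url" name="homepage">
  <input type="email" name="email">

  <input type="submit" value="Book">
</form>
```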
- <select>, <option>, <optgroup>
- The <select> element is another type of form input that lets the user pick one choice out of several, and is usually presented as a dropdown list. Conceptually, it's similar to a group of radio buttons, which also give the user a group of options and allow them to pick only one, but it's contained in a single element instead of potentially being spread around the document. It can only contain <option> and <optgroup> children, which represent the choices. Each <option>, then, can contain only text, which is displayed to the user. If the user selects a particular option, the <select> contributes either the text content of that <option> to the form's dataset, or the @value attribute on the <option> if it's present, so you can show one value (for example, the name of a state like "Texas") but submit another value (for example, the state abbreviation like "TX"). <optgroup> lets you group options together - give it a @label attribute which'll be displayed as the name of the group, and any <option>s which are children of it will be grouped together visually.
- <button>
- <button> is a more powerful version of <input type="submit"> or <input type="button">. Rather than using the @value attribute both to display and to submit, <button> displays whatever its contents are, and submits its @value attribute. It still takes a @type attribute as well, which can be either "submit" or "button", with the same meanings as they had in <input>.
- <textarea>
- This is the beefy version of the text input, designed for multiple lines of text. Unlike <input>, it has both a start and end tag, and rather than having a @value attribute, you prefill a <textarea> by giving it text as its contents. It can take a @wrap attribute to determine how to treat the automatic line-wrapping when it submits the value. With wrap="soft", only explicit linebreaks added by the user are included in its contribution to the form's dataset; with wrap="hard", it also includes linebreaks based on how it was displayed, so the value will break in the same way that it did in the textarea itself.
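The three elements above might be combined as in this sketch. The @action url, field names, and option groupings are invented; @cols is included because hard wrapping needs a column width to wrap at.

```html
<form action="/profile" method="post">
  <!-- Shows "Texas" but submits state=TX -->
  <select name="state">
    <optgroup label="South">
      <option value="TX">Texas</option>
      <option value="FL">Florida</option>
    </optgroup>
    <optgroup label="West">
      <option value="CA">California</option>
    </optgroup>
  </select>

  <!-- Prefilled through its contents, not a @value attribute -->
  <textarea name="bio" wrap="hard" cols="40" rows="4">Tell us about yourself.</textarea>

  <!-- Displays its contents, submits action=save -->
  <button type="submit" name="action" value="save"><strong>Save</strong> profile</button>

  <!-- Does nothing until script is attached -->
  <button type="button">Preview</button>
</form>
```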
- Form Validation
- Another HTML5 feature (not yet supported everywhere, but can be faked with any of several JS shims) is automatic form validation. Some types of inputs have implicit validation requirements: email and url inputs must be valid emails and urls, while the value of a number input must not be less than the min, more than the max, or incompatible with the step. You can also supply explicit validation requirements with attributes: the @required attribute requires the input's value to be something other than the empty string, while the @pattern attribute takes a regular expression and requires the input's value to match it. Validity is checked both continuously and when you try to submit the form, with different results - the continuous checking only affects the value of the input's "validity.valid" property and whether or not it matches the CSS :invalid pseudoclass, while the submit checking actually stops submission and automatically shows error messages to the user for the invalid inputs.
- @placeholder
- This lets you supply a hint as to what type of value is expected for the input. The value of the attribute is displayed in the input as long as the input's value is the empty string. You can also use the @title attribute for longer hints, which is displayed when the user hovers over the input. This shouldn't be used to actually describe the input; for that, use the <label> element.
- <label>
- This is for describing the input. There are two ways to associate it with an input - if you give it a @for attribute that contains the ID of an input, it's associated with that input; otherwise, it's automatically associated with the first input inside of itself. Labels have a pretty important special feature - clicking a label magically transfers the click to the associated input. This is really useful when trying to click on tiny checkboxes or radio buttons, so you should always add a label to your inputs.
- <fieldset>, <legend>
- These elements group form inputs inside of a form. Wrap your <fieldset> around all the inputs and other content that's thematically related. A <fieldset> can also have a label if you give it a <legend> element as its first child.
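Validation, hints, labels, and grouping can be sketched together as follows. The @action url, field names, and the five-digit pattern are assumptions made for the example.

```html
<!-- This rule would normally live in the page's <head> -->
<style>
  input:invalid { outline: 2px solid red; }
</style>

<form action="/register" method="post">
  <fieldset>
    <legend>Contact details</legend>

    <!-- Associated through @for and the input's ID -->
    <label for="email">Email address</label>
    <input type="email" id="email" name="email" required placeholder="you@example.com">

    <!-- Associated by wrapping the input -->
    <label>
      Postal code
      <input type="text" name="zip" pattern="[0-9]{5}" title="Five digits, like 77001">
    </label>

    <!-- Clicking the text toggles the checkbox -->
    <label>
      <input type="checkbox" name="terms" value="yes" required>
      I accept the terms
    </label>
  </fieldset>

  <input type="submit" value="Register">
</form>
```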
Embedding Elements
- <iframe>
- This lets you embed another webpage into your webpage. Just point the @src attribute at the url of the page you want to embed. Note that CSS and javascript won't percolate down into the embedded page or up into the embedder by default; they're scoped to their own document. However, if the embedded document is on the same origin as the embedder (same scheme, domain, and port), javascript in either document can manually reach into the other document and then act like normal, as if it was running in the other document in the first place. To go from embedder to embedded, grab the "contentDocument" property off of the iframe; to go in the opposite direction, grab the "document" property off of your own window's "parent" (that is, "window.parent.document").
- <object>
- <object> is a multi-purpose embedding element. It was envisioned as a generic solution to the embedding problem, so that you wouldn't need individual elements like <iframe>, <img>, or <video>. As it turns out, this isn't a very good idea, as all of these types of embeds want to expose different APIs, and it's clumsy to handle that with a single element. Now, <object> is mainly used to embed Flash. For basic usages, point the @data attribute at the url of what you want to embed. You can also supply a @type attribute to let the browser know what the type of the embed is; for example, to embed Flash, you can just say type="application/x-shockwave-flash", and the browser will automatically embed Flash if the user has it installed. You can pass additional data to the embed through child <param> elements, which take @name and @value attributes. As well, just like <video> and <audio>, you can put extra content inside of the <object> as fallback. Every browser understands the <object> element, but if it doesn't understand what you're trying to embed (for example, if the user doesn't have Flash installed), the browser will show the fallback instead.
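A sketch of both embedding elements follows. "child.html" and "movie.swf" are hypothetical files, the "quality" parameter is only an illustrative name/value pair, and the script assumes the embedded page is served from the same origin as the embedder.

```html
<iframe id="child" src="child.html" width="400" height="200"></iframe>
<script>
  var frame = document.getElementById("child");
  frame.addEventListener("load", function () {
    // Embedder to embedded: only allowed for same-origin pages
    var innerDoc = frame.contentDocument;
    console.log(innerDoc.title);
  });
  // One way to go the other direction, from a script inside child.html:
  //   var outerDoc = window.parent.document;
</script>

<object data="movie.swf" type="application/x-shockwave-flash">
  <param name="quality" value="high">
  <!-- Fallback, shown when the browser can't display the embed -->
  <p>Flash isn't available. <a href="movie.swf">Download the movie</a> instead.</p>
</object>
```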
Tables
- <table>
- This represents tabular data - anything that's organized into rows and columns. Note that <table> is defined as a row-major table - the data is first organized into rows and then cells. You can define columns, but they don't define content; they exist for styling purposes only.
Before CSS became as powerful as it is today, tables were commonly used as layout aids, as most page layouts can be sliced up into a grid. Don't do this - it's made screen-readers very hard to write, and it massively bloats your page code and makes it very difficult to maintain.
- <thead>, <tbody>, <tfoot>
- These organize the table into sections. A <thead> should contain column headings for the table, and is always displayed at the top of the table, even if the actual location in the source code is lower down. <tfoot> is identical, but gets placed at the bottom of the table automatically; use it when the table is long enough that having the column headings at the bottom as well would be useful. <tbody> is for the meat of the table, where the actual data lives. All of these elements are restricted to containing only <tr>s.
- <tr>, <td>, <th>
- These represent rows and cells of the table. The <tr> element just wraps the cells to organize them into rows; it's nothing special, and is restricted to containing only <td>s and <th>s. <td> represents an individual cell. It can contain arbitrary content, finally. If a table cell should span several rows or columns, use the @rowspan or @colspan attributes with an integer value. @colspan makes the cell take up additional columns to the right, while @rowspan makes it take up additional rows down. Use @rowspan with caution, as it can make it difficult to read the source code for the table. A row may appear to have only two cells, but a cell in a previous row may be spanning down, actually making it three or more columns wide. <th> is identical to <td>, but it represents a table heading cell. Additionally, it can take a @scope attribute, which indicates whether the cell is a heading for the row (if you give it a "row" value) or for the column (if you give it a "col" value).
- <col>, <colgroup>
- These elements must be direct children of the <table> element if they're used. They don't contain anything; they're used solely as styling hooks for the columns. You can set the four CSS properties 'width', 'border', 'background', and 'visibility' on <col> elements, and they'll apply to the cells in that column. Similar to table cells, you can add a @span attribute to <col> to style multiple columns the same way. <colgroup> just groups <col>s together.
- <caption>
- This labels the table, similar to the <figcaption> element for <figure>. It must be a direct child of <table>, but it can contain arbitrary content.
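The table elements fit together roughly as in this sketch. The regions and numbers are placeholder data, not real figures.

```html
<table>
  <caption>Quarterly sales (placeholder numbers)</caption>

  <!-- Styling hooks only: one column for labels, two for data -->
  <colgroup>
    <col>
    <col span="2">
  </colgroup>

  <thead>
    <tr>
      <th scope="col">Region</th>
      <th scope="col">Q1</th>
      <th scope="col">Q2</th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <th scope="row">North</th>
      <td>10</td>
      <td>12</td>
    </tr>
    <tr>
      <th scope="row">South</th>
      <!-- One cell covering both data columns -->
      <td colspan="2">No data</td>
    </tr>
  </tbody>

  <!-- Column headings repeated at the bottom of the table -->
  <tfoot>
    <tr>
      <th scope="col">Region</th>
      <th scope="col">Q1</th>
      <th scope="col">Q2</th>
    </tr>
  </tfoot>
</table>
```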