How to Format HTML in Rust

Formatting HTML in Rust means parsing markup and writing it back out in a consistent, readable layout, which makes generated or minified HTML easier to review and maintain. In this guide, we will explore how to format HTML in Rust using the html5ever crate, the HTML parser developed as part of the Servo project. We work at the tokenizer level: read a stream of tags and text, then emit each one on its own line.

Quick Example

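use html5ever::tendril::StrTendril;
use html5ever::tokenizer::{
    BufferQueue, Tag, TagKind, Token, TokenSink, TokenSinkResult, Tokenizer, TokenizerOpts,
};

// The tokenizer pushes tokens into a sink; this sink builds the output string.
struct Formatter {
    formatted_html: String,
}

impl TokenSink for Formatter {
    type Handle = ();

    fn process_token(&mut self, token: Token, _line_number: u64) -> TokenSinkResult<()> {
        match token {
            Token::TagToken(Tag { kind: TagKind::StartTag, name, attrs, .. }) => {
                self.formatted_html.push_str(&format!("<{}", name));
                for attr in attrs {
                    // attr.name is a qualified name; .local is the plain attribute name.
                    self.formatted_html
                        .push_str(&format!(" {}=\"{}\"", attr.name.local, attr.value));
                }
                self.formatted_html.push_str(">\n");
            }
            Token::TagToken(Tag { kind: TagKind::EndTag, name, .. }) => {
                self.formatted_html.push_str(&format!("</{}>\n", name));
            }
            // Text, comments, doctypes, and parse errors are skipped here.
            _ => {}
        }
        TokenSinkResult::Continue
    }
}

fn format_html(html: &str) -> String {
    let sink = Formatter { formatted_html: String::new() };
    let mut tokenizer = Tokenizer::new(sink, TokenizerOpts::default());

    // Input is fed to the tokenizer through a queue of StrTendril buffers.
    let mut input = BufferQueue::new();
    input.push_back(StrTendril::from(html));
    let _ = tokenizer.feed(&mut input);
    tokenizer.end();

    // Take the finished string out of the sink.
    std::mem::take(&mut tokenizer.sink.formatted_html)
}

fn main() {
    let html = "<html><body><h1>Hello World!</h1></body></html>";
    let formatted_html = format_html(html);
    println!("{}", formatted_html);
}

To use this code, add the following dependency to your Cargo.toml file:

[dependencies]
html5ever = "0.25.1"

Then, run cargo build to download and compile the dependency. The code targets the 0.25 release pinned above; newer html5ever releases may differ in some tokenizer signatures.

Step-by-Step Breakdown

Let's walk through the code step by step:

  1. We import StrTendril and the tokenizer types from the html5ever crate.
  2. We define a Formatter struct that holds the output string and implement the TokenSink trait for it. The tokenizer does not return an iterator; it delivers each token by calling process_token on the sink.
  3. Inside process_token, we use a match statement to handle different token types.
  4. For start tags, we append the tag name and each attribute as name="value" to the formatted_html string.
  5. For end tags, we append the closing tag to the formatted_html string.
  6. We ignore other token types (e.g., text, comments) and return TokenSinkResult::Continue so tokenizing proceeds.
  7. In format_html, we create a Tokenizer with the sink and default options, wrap the input in a BufferQueue, and call feed followed by end.
  8. Finally, we take the formatted HTML string out of the sink and return it.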
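
The quick example puts every tag on its own line but does not indent nested elements. As a rough illustration of how depth-based indentation could be layered on top, here is a minimal sketch that runs with the standard library only. The Event enum and the indent_events function are hypothetical stand-ins for tokenizer output, not html5ever types.

```rust
/// Hypothetical stand-in for tokenizer output; not an html5ever type.
enum Event<'a> {
    Open(&'a str),
    Close(&'a str),
    Text(&'a str),
}

/// Writes one event per line, indented two spaces per nesting level.
fn indent_events(events: &[Event<'_>]) -> String {
    let mut out = String::new();
    let mut depth: usize = 0;
    for event in events {
        match event {
            Event::Open(name) => {
                out.push_str(&"  ".repeat(depth));
                out.push_str(&format!("<{}>\n", name));
                depth += 1;
            }
            Event::Close(name) => {
                // saturating_sub keeps a stray closing tag from underflowing.
                depth = depth.saturating_sub(1);
                out.push_str(&"  ".repeat(depth));
                out.push_str(&format!("</{}>\n", name));
            }
            Event::Text(text) => {
                out.push_str(&"  ".repeat(depth));
                out.push_str(text);
                out.push('\n');
            }
        }
    }
    out
}

fn main() {
    let events = [
        Event::Open("body"),
        Event::Open("h1"),
        Event::Text("Hello World!"),
        Event::Close("h1"),
        Event::Close("body"),
    ];
    print!("{}", indent_events(&events));
}
```

This sketch assumes every opening tag has a matching closing tag. Void elements such as br or img never close, so a real formatter would need to skip the depth increment for them.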

Handling Edge Cases

Empty/null input

fn format_html(html: &str) -> String {
    if html.is_empty() {
        return String::new();
    }
    // ...
}

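A &str can never be null in Rust, so the only case to handle here is the empty string, for which we return an empty result right away. If the input may be absent altogether, model that as Option<&str> at the call site.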
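
If the caller may have no document at all, the usual Rust way to express "null" is Option. The following is a minimal sketch under that assumption; format_optional and its format parameter are hypothetical names, and the parameter stands in for whatever formatter you use.

```rust
/// Treats a missing or whitespace-only document as empty output.
/// `format` is a placeholder for the real formatting function.
fn format_optional(html: Option<&str>, format: impl Fn(&str) -> String) -> String {
    match html {
        Some(text) if !text.trim().is_empty() => format(text),
        _ => String::new(),
    }
}

fn main() {
    // A trivial stand-in formatter, used only to make the sketch runnable.
    let passthrough = |s: &str| s.to_string();
    println!("{:?}", format_optional(None, passthrough));
    println!("{:?}", format_optional(Some("<p>hi</p>"), passthrough));
}
```

Whether whitespace-only input should count as empty is a design choice; this sketch assumes it should.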

Invalid input

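// html5ever does not return an error for malformed markup. Like a browser, it
// recovers and reports each problem to the sink as a Token::ParseError.
// Assumes the sink struct has an extra field: errors: Vec<String>.
fn process_token(&mut self, token: Token, line_number: u64) -> TokenSinkResult<()> {
    match token {
        Token::ParseError(message) => {
            self.errors.push(format!("line {}: {}", line_number, message));
        }
        // ... handle tags as before ...
        _ => {}
    }
    TokenSinkResult::Continue
}

Because html5ever follows the HTML5 error-recovery rules, tokenizing never fails outright. To reject invalid input, record the ParseError tokens in the token sink and have format_html return a Result<String, String> that is Err when any were collected.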
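
Once parse error messages have been collected, turning them into a Result is plain Rust. This is a minimal sketch with a hypothetical helper named finish; its errors parameter stands for whatever list of messages your code gathered while tokenizing.

```rust
/// Returns the formatted output only if no parse errors were recorded.
/// `errors` is an assumed list of messages gathered during tokenizing.
fn finish(output: String, errors: Vec<String>) -> Result<String, String> {
    if errors.is_empty() {
        Ok(output)
    } else {
        Err(format!("{} parse error(s): {}", errors.len(), errors.join("; ")))
    }
}

fn main() {
    match finish("<p>\n</p>\n".to_string(), vec![]) {
        Ok(html) => print!("{}", html),
        Err(message) => eprintln!("{}", message),
    }
}
```

Failing on any parse error is the strict choice. A lenient formatter could instead return the output alongside the list of warnings.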

Large input

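fn format_html(html: &str) -> String {
    // Reserve the output buffer once. Twice the input length is a rough
    // allowance for the added newlines, not a measured figure.
    let mut formatted_html = String::with_capacity(html.len() * 2);
    // ... tokenize and append to formatted_html ...
    formatted_html
}

To handle large input, we pre-allocate the output buffer with a capacity of twice the input length, so the string does not have to be reallocated repeatedly as it grows. The factor of two is a heuristic; adjust it if your formatter adds more than newlines.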
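
To see the effect of pre-allocation, the sketch below fills a reserved buffer and checks whether its capacity changed along the way. The function name fill_preallocated is hypothetical, and the line-by-line copy is only a stand-in for real formatting work.

```rust
/// Copies `input` line by line into a buffer reserved at 2x its length and
/// reports whether the buffer had to grow while being filled.
fn fill_preallocated(input: &str) -> (String, bool) {
    let mut out = String::with_capacity(input.len() * 2);
    let reserved = out.capacity();
    for line in input.lines() {
        out.push_str(line);
        out.push('\n');
    }
    let grew = out.capacity() != reserved;
    (out, grew)
}

fn main() {
    let html = "<p>hi</p>\n".repeat(1000);
    let (out, grew) = fill_preallocated(&html);
    println!("len = {}, capacity = {}, grew = {}", out.len(), out.capacity(), grew);
}
```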

Unicode/special characters

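// Inside the TokenSink implementation
fn process_token(&mut self, token: Token, _line_number: u64) -> TokenSinkResult<()> {
    match token {
        Token::TagToken(Tag { kind: TagKind::StartTag, name, attrs, .. }) => {
            self.formatted_html.push_str(&format!("<{}", name));
            for attr in attrs {
                self.formatted_html
                    .push_str(&format!(" {}=\"{}\"", attr.name.local, attr.value));
            }
            self.formatted_html.push_str(">\n");
        }
        Token::TagToken(Tag { kind: TagKind::EndTag, name, .. }) => {
            self.formatted_html.push_str(&format!("</{}>\n", name));
        }
        // Text arrives as CharacterTokens; StrTendril dereferences to str.
        Token::CharacterTokens(text) => {
            self.formatted_html.push_str(&text);
        }
        _ => {}
    }
    TokenSinkResult::Continue
}

Rust strings and StrTendril are both UTF-8, so non-ASCII text needs no special handling; it only has to be kept. The quick example drops text entirely, so here we add a CharacterTokens arm that appends text content to the formatted_html string. Note that the tokenizer may deliver one run of text as several CharacterTokens and decodes character references such as &amp;, so escape &, <, and > again if the output must remain valid HTML.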
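
Text that has passed through an HTML tokenizer generally arrives with character references already decoded, so writing it back verbatim can produce invalid markup. The sketch below shows one way to re-escape text content before appending it; escape_text is a hypothetical helper name, and it covers text content only, not attribute values.

```rust
/// Escapes the characters that are significant in HTML text content.
/// Non-ASCII characters pass through unchanged.
fn escape_text(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for ch in text.chars() {
        match ch {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            _ => out.push(ch),
        }
    }
    out
}

fn main() {
    println!("{}", escape_text("1 < 2 & café"));
}
```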

Common Mistakes

1. Not handling errors

// Wrong: the catch-all arm silently discards Token::ParseError
match token {
    // ... tag arms ...
    _ => {}
}

// Corrected: record parse errors so the caller can act on them
// (errors is an assumed Vec<String> field on the sink)
match token {
    Token::ParseError(message) => self.errors.push(message.into_owned()),
    // ... tag arms ...
    _ => {}
}

html5ever recovers from malformed markup instead of failing, so ignoring ParseError tokens means invalid input is formatted without any warning.

2. Not handling edge cases

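// Wrong
fn format_html(html: &str) -> String {
    // ... build the tokenizer and format, with no input checks ...
}

// Corrected
fn format_html(html: &str) -> String {
    if html.is_empty() {
        return String::new();
    }
    // ... build the tokenizer and format as usual ...
}

Decide up front what empty, whitespace-only, and very large input should produce, and check for those cases explicitly. Leaving them to chance makes the formatter's behavior hard to predict and hard to test.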

3. Using String instead of StrTendril

// Wrong: BufferQueue holds StrTendril buffers, so a &str is rejected at compile time
let mut input = BufferQueue::new();
input.push_back(html);

// Corrected: convert the &str into a StrTendril first
let mut input = BufferQueue::new();
input.push_back(StrTendril::from(html));

html5ever takes its input as StrTendril buffers. Passing a &str or String directly is a type error, so convert once at the boundary and avoid converting back and forth inside the token handler.

Performance Tips

  1. Use StrTendril for tokenizer input: StrTendril is the reference-counted UTF-8 buffer type that html5ever works with. Clones and subslices can share the underlying buffer instead of copying it.
  2. Pre-allocate string buffers: String::with_capacity reserves the output buffer once, which avoids repeated reallocation as the formatted output grows.
  3. Avoid unnecessary cloning and temporary strings: every format! call allocates a new String, so write into the output buffer directly where you can, and take ownership of tokens instead of cloning them.
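
One way to cut down on temporary allocations when building output is to write into the existing String with the write! macro instead of creating a new String with format! for every piece. A minimal sketch follows; push_start_tag and its parameters are hypothetical, and attributes are passed as plain string pairs to keep the example free of external crates.

```rust
use std::fmt::Write;

/// Appends `<name key="value" ...>` and a newline to `out` without
/// allocating intermediate strings.
fn push_start_tag(out: &mut String, name: &str, attrs: &[(&str, &str)]) {
    // Writing into a String cannot fail, so the Result is safe to ignore.
    let _ = write!(out, "<{}", name);
    for (key, value) in attrs {
        let _ = write!(out, " {}=\"{}\"", key, value);
    }
    out.push_str(">\n");
}

fn main() {
    let mut out = String::new();
    push_start_tag(&mut out, "a", &[("href", "/docs"), ("class", "link")]);
    print!("{}", out);
}
```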

FAQ

Q: What is the html5ever crate?

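The html5ever crate is an HTML parser and serializer for Rust, developed as part of the Servo project. It provides a tokenizer and a tree builder that follow the HTML5 parsing rules.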

Q: How do I install the html5ever crate?

Add the following dependency to your Cargo.toml file: html5ever = "0.25.1"

Q: What is the difference between String and StrTendril?

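String is the standard owned UTF-8 string. StrTendril, from the tendril crate that html5ever re-exports, is a reference-counted UTF-8 buffer whose clones and subslices can share storage, which suits a parser that hands out many small pieces of its input.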

Q: How do I handle errors in the format_html function?

html5ever reports malformed markup as ParseError tokens rather than returning an error. Collect those tokens in your token sink and have format_html return a Result that is Err when any were recorded.

Q: How do I handle large input in the format_html function?

Pre-allocate a string buffer with a capacity twice the size of the input HTML.
