Html2Markdown (html2markdown v0.3.1)
Convert HTML documents to clean, readable Markdown.
Html2Markdown intelligently extracts content from HTML while filtering out navigation, advertisements, and other non-content elements. It's designed for web scraping, content migration, and any scenario where you need to convert HTML to Markdown.
Basic Usage
iex> Html2Markdown.convert("<h1>Hello</h1><p>World</p>")
"# Hello\n\nWorld"
Configuration
The library supports extensive configuration through the second parameter:
Html2Markdown.convert(html, %{
  navigation_classes: ["nav", "menu", "sidebar"],
  non_content_tags: ["script", "style", "iframe"],
  markdown_flavor: :basic,
  normalize_whitespace: true
})
Features
- Smart filtering - Automatically removes common non-content elements
- HTML5 support - Handles modern semantic elements
- Table conversion - Converts HTML tables to Markdown tables
- Entity decoding - Automatically handled by Floki
- Whitespace normalization - Optional cleanup of excessive whitespace
- Configurable - Customize filtering behavior to your needs
Examples
Web Scraping
# Extract article content from a web page
{:ok, %{body: html}} = HTTPoison.get("https://example.com/article")
content = Html2Markdown.convert(html, %{
  navigation_classes: ["header", "footer", "nav", "sidebar"],
  normalize_whitespace: true
})
Content Migration
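# Convert WordPress posts to Markdown
# File.write!/2 takes the path first, so the piped Markdown is passed as the second argument
post_html
|> Html2Markdown.convert()
|> then(&File.write!("post.md", &1))
Email Processing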
# Clean up HTML emails
email_body
|> Html2Markdown.convert(%{
  non_content_tags: ["style", "meta", "link"],
  navigation_classes: ["unsubscribe", "footer"]
})
Supported HTML Elements
Text Elements
- Headings: <h1> through <h6>
- Paragraphs: <p>
- Emphasis: <em>, <i> → *italic*
- Strong: <strong>, <b> → **bold**
- Strikethrough: <del> → ~~strikethrough~~
- Code: <code> → `code`
- Preformatted: <pre> → code blocks
Lists
- Unordered lists: <ul>, <li> → - item
- Ordered lists: <ol>, <li> → 1. item
- Definition lists: <dl>, <dt>, <dd>
Links and Media
- Links: <a href="..."> → [text](url)
- Images: <img> → ![alt](src)
- Picture: <picture> with fallback to <img>
Tables
Full support for HTML tables with automatic header detection:
<table>
  <tr><th>Name</th><th>Value</th></tr>
  <tr><td>Elixir</td><td>1.15</td></tr>
</table>
Converts to:
| Name | Value |
| --- | --- |
| Elixir | 1.15 |
HTML5 Elements
- <details>/<summary> - Collapsible sections
- <mark> - Highlighted text (GFM: ==marked==)
- <abbr title="..."> - Abbreviations with expansion
- <cite> - Citations in italics
- <q cite="..."> - Inline quotes with optional citation
- <time datetime="..."> - Time with preserved datetime
- <video> - Converted to a Markdown link
Entity Handling
HTML entities are automatically decoded by Floki:
- &amp; → &
- &lt; → <
- &gt; → >
- &nbsp; → non-breaking space
- &#123; → {
- &laquo; → «
Summary
Functions
convert/1 - Converts the content from an HTML document to Markdown, removing non-content sections and tags.
convert/2 - Converts the content from an HTML document to Markdown with custom options.
Types
Functions
@spec convert(html_content()) :: markdown_content()
Converts the content from an HTML document to Markdown, removing non-content sections and tags.
Uses default options for conversion. To customize behavior, use convert/2.
@spec convert(html_content(), conversion_options()) :: markdown_content()
@spec convert(any(), any()) :: {:error, String.t()}
Converts the content from an HTML document to Markdown with custom options.
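The second @spec above indicates that convert/2 may return {:error, message} rather than Markdown when it is given arguments it cannot handle. A minimal sketch of handling both return shapes follows. The SafeConvert module and its to_markdown/2 function are hypothetical helpers, not part of Html2Markdown, and the Mix.install/1 version requirement is an assumption based on the version shown in the page title.

```elixir
# Assumption: the package is published on Hex as :html2markdown (per the page title).
Mix.install([{:html2markdown, "~> 0.3.1"}])

defmodule SafeConvert do
  # Hypothetical wrapper, not part of Html2Markdown.
  # Turns the two documented return shapes of convert/2 into tagged tuples.
  def to_markdown(html, opts) do
    case Html2Markdown.convert(html, opts) do
      {:error, reason} -> {:error, reason}
      markdown -> {:ok, markdown}
    end
  end
end

result = SafeConvert.to_markdown("<p>Hello</p>", %{navigation_classes: ["custom-nav"]})

case result do
  {:ok, markdown} -> IO.puts(markdown)
  {:error, reason} -> IO.puts("conversion failed: " <> reason)
end
```

Tagging the success case keeps a caller from mistaking an error tuple for Markdown when the result is piped onward, for example into a file write.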
Options
- :navigation_classes - List of CSS classes used to identify navigation elements to remove. Defaults to ["footer", "menu", "nav", "sidebar", "aside"]
- :non_content_tags - List of HTML tags to filter out during conversion. Defaults to common non-content tags like script, style, form, etc.
- :markdown_flavor - Markdown flavor to use. Currently only :basic is supported. Defaults to :basic (future enhancement for :gfm, :commonmark)
- :normalize_whitespace - Whether to normalize whitespace. When enabled, multiple spaces/tabs are converted to single spaces and leading/trailing whitespace is trimmed. Whitespace in code blocks and inline code is always preserved. Defaults to true
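One way to see what :normalize_whitespace does is to convert the same input with the option on and off and compare the results. The sketch below does that with a made-up HTML string. The Mix.install/1 version requirement is an assumption based on the version in the page title, and no particular output is claimed.

```elixir
# Assumption: the package is published on Hex as :html2markdown (per the page title).
Mix.install([{:html2markdown, "~> 0.3.1"}])

# Made-up input: a run of spaces inside a paragraph, plus a <pre> block.
html = "<p>Hello     world</p><pre>  keep   this  </pre>"

normalized = Html2Markdown.convert(html, %{normalize_whitespace: true})
raw = Html2Markdown.convert(html, %{normalize_whitespace: false})

# Print both results for comparison; the exact strings are not asserted here.
IO.inspect(normalized, label: "normalize_whitespace: true")
IO.inspect(raw, label: "normalize_whitespace: false")
```

According to the option description, the paragraph text should differ between the two calls while the content of the <pre> block should be preserved in both.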
Examples
iex> Html2Markdown.convert("<p>Hello</p>", %{navigation_classes: ["custom-nav"]})
"Hello"