Comments Page - Show HN: HTML-to-Markdown – convert entire websites to Markdown with Golang/CLI

« Back Show HN: HTML-to-Markdown – convert entire websites to Markdown with Golang/CLIgithub.comSubmitted by JohannesKauf 4 days ago

miki123211 4 days ago
If you need this sort of thing in any other language, there's a free, no-auth, no-api-key-required, no-strings-attached API that can do this at https://jina.ai/reader/
You just fetch a URL like `https://r.jina.ai/https://www.asimov.press/p/mitochondria`, and get a markdown document for the "inner" URL.
I've actually used this and it's not perfect, there are websites (mostly those behind Cloudflare and other such proxies) that it can't handle, but it does 90% of the job, and is an one-liner in most languages with a decent HTTP requests library.
- petercooper 3 days ago
  I use this too and, not to detract from your enthusiasm, it's not exactly no-strings-attached. There's a token limit on free use and you can't use it for any commercial purposes. Luckily the pricing for unrestricted use is reasonable though at 2 cents per million tokens.
  People will also want to note that it's LLM-powered which has pros and cons. One pro being that you can download and run their model yourself for non commercial use cases: https://huggingface.co/jinaai/reader-lm-1.5b
- JohannesKauf 4 days ago
  Thanks, Jina actually looks quite nice for use in LLMs.
  I also provide a REST API [1] that you can use for free (within limits). However you have get an API Key by registering with Github (see reason below).
  ---
  The demo was previously hosted on Vercel. Someone misused the demo and send ~5 million requests per day. And would not stop — which quickly brought me over the bandwidth limits of Vercel. And bandwidth is really really expensive!
  So that is the reason for requiring API Keys and hosting it on a VPS… Lessons learned!
  [1] https://html-to-markdown.com/api
  emptiestplace 3 days ago
  Seems pretty risky to not implement rate limits either way.
  JohannesKauf 3 days ago
  The problem was: Doing rate limiting on the application level was not enough. Once the request hit my backend the incoming bandwidth was already consumed — and I was charged for it.
  I contacted Vercel's Support to block that specific IP address but unfortunately they weren't helpful.
  emptiestplace 3 days ago
  So you're probably still vulnerable to this even with the key requirement, but they stopped once you removed the incentive? Did you notice what they were scraping?
  JohannesKauf 3 days ago
  Sorry, I mixed up a few topics here:
  - Moved everything to a VPS - way better value for money. Extra TB of traffic only costs €1-10 with Hetzner/DigitalOcean compared to 400€ with Vercel's old pricing.
  - Put Cloudflare in front - gives me an extra layer of control (if I ever need it)
  - Built a proper REST API - now there's an official way to use the converter programmatically
  - Made email registration mandatory for API keys - lets me reach out before having to block anyone
  That other server was probably running a scraper and then converting the html-websites to markdown. After about 2 weeks they noticed that I was just returning garbage and it stopped :)
  emptiestplace 3 days ago
  Ah! Makes sense now, thanks for sharing.
  I've had good success with Cloudflare's free-tier features for rate limiting. If you haven't tried it, it only takes a couple minutes to enable and should be pretty set-and-forget for your API.
NotACracker 4 days ago
Pandoc
http://www.cantoni.org/2019/01/27/converting-html-markdown-u...
- bbor 3 days ago
  For clarity: I'm a pandoc diehard (especially because it's written by a philosopher!) but it intentionally doesn't approach this level of functionality, AFAIK.
plaidwombat 4 days ago
Great work. I thank you for it. I've used your library for a few years in a Lambda function which takes a URL and converts it to Markdown for storage in S3. I hooked it into every "bookmark" app I use as a webhook so I save a Markdown copy of everything I bookmark, which makes it very handy for importing into Obsidian.
- JohannesKauf 4 days ago
  Oh very nice to hear, thank you very much!
  That’s actually a great idea!
  I personally use Raindrop for bookmarking articles. But I can’t find stuff in the search.
  The other day “Embeddings are underrated” was on HN. That would actually be a good approach for finding stuff later on. Using webhooks, converting to markdown, generating embedding and then enjoying a better search. You just gave me the idea for my next weekend project :-)
rty32 4 days ago
Nice! And glad to see it's MIT licensed.
I wonder if it is feasible to use this as a replacement for p2k, instapaper etc for the purpose of reading on Kindle. One annoyance with these services is that the rendering is way off -- h elements not showing up as headers, elements missing randomly, source code incorrectly rendered in 10 different ways. Some are better than others, but generally they are disappointing. (Yet they expect you to pay a subscription fee.) If this is an actively maintained project that welcomes contribution, I could test it out with various articles and report/fix issues. Although I wonder how much work there will be for handling edge cases of all the websites out there.
- JohannesKauf 4 days ago
  There are two parts to it:
  1) convert html to markdown
  This is what my library specifically addresses, and I believe it handles this task robustly. There was a lot of testing involved. For example, I used the CommonCrawl Dataset to automatically catch edge cases.
  2) Identify article content
  This is the more challenging aspect. You need to identify and extract the main content while removing peripheral elements (navigation bars, sidebars, ads, etc.)
  For example, the top of the markdown document will have lots of links from the navbar otherwise.
  Mozilla's "Readability" project (and its various ports) is the most used solution in this space. However, it relies on heuristic rules that need adjustments to work on every website.
  ---
  The html-to-markdown project in combination with some heuristic would be great match! There is actually a comment below [1] about this topic. Feel free to contact me if you start this project, would be happy to help!
  [1] https://news.ycombinator.com/item?id=42094012
  dleeftink 3 days ago
  I'm working on a Textify API that collates elements based on the visible/running flow of text elements. It's not quite there yet, but is able to get the running content of HTML pages quite consistently. You can check it out here:
  [0]: https://github.com/dleeftink/plainmark
cpursley 4 days ago
This is really nice, especially for feeding LLMs web page data (they generally understand markdown well).
I built something similar for the Elixir world but it’s much more limited (I might borrow some of your ideas):
https://github.com/agoodway/html2markdown
- JohannesKauf 4 days ago
  > built something similar for the Elixir
  We interact with the web so much that it’s worth having such a library in every language... Great that you took the time and wrote one for the Elixir community!
  Feel free to contact me if you want to ping-pong some ideas!
  > feeding LLMs web page data
  Exactly, that one use case that got quite popular. There is also the feature of keeping specific HTML tags (e.g. <article> and <footer>) to give the LLM a bit more context about the page.
- jaggirs 4 days ago
  Why not just give the html to the llm?
  zexodus 4 days ago
  Context size limits are usually the reason. Most websites I want to scrape end up being over 200K tokens. Tokenization for HTML isn't optimal because symbols like '<', '>', '/', etc. end up being separate tokens, whereas whole words can be one token if we're talking about plain text.
  Possible approaches include transforming the text to MD or minimizing the HTML (e.g., removing script tags, comments, etc.).
  dtjohnnyb 4 days ago
  I was trying to do this recently for Web page summarization. As said below the token sizes would end up over the context length, so I trimmed the html to fit just to see what would happen. I found that the LLM was able to extract information, but it very commonly would start trying to continue the html blocks that had been left open in the trimmed input. Presumably this is due to instruction tuning on coding tasks
  I'd love to figure out a way to do it though, it seems to me that there's a bunch of rich description of the website in the html
  kgeist 4 days ago
  I remember there was a paper which found that LLMs understand HTML pretty well, you don't need additional preprocessing. The downside is that HTML produces more tokens than Markdown.
  simonw 4 days ago
  Right: the token savings can be enormous here.
  Use https://tools.simonwillison.net/jina-reader to fetch the https://news.ycombinator.com/ homepage as Markdown and paste it into https://tools.simonwillison.net/claude-token-counter - 1550 tokens.
  Same thing as HTML: 13367 tokens.
throwup238 4 days ago
This is probably out of scope for your tool but it’d be nice to have built in n-gram deduplication where the tool strips any identical content from the header and footer, like navigation, when pointed at a few of these markdown files.
- JohannesKauf 3 days ago
  My final university project was about a clean-up-approach on the HTML nodes before sending it to the html-to-markdown converter. But that was extremely difficult and dependent on some heuristics that had to be tweaked.
  Your idea of comparing multiple pages would be a great approach. It would be amazing if you build something like this! This would enable so many more use cases... For example a better “send to kindle” (see other comment from rty32 [1]).
  [1] https://news.ycombinator.com/item?id=42093964
jot 3 days ago
This is great!
If you also want to grab an accurate screenshot with the markdown of a webpage you can get both with Urlbox.
We have a couple of free tools that use this feature:
https://screenshotof.com https://url2text.com
paradite 4 days ago
I have been using these two:
https://farnots.github.io/RedditToMarkdown/
https://urltomarkdown.com/
Incredibly useful for leveraging LLMs and building AI apps.
ssousa666 3 days ago
I have been looking for a similar lib to use in a Kotlin/Spring app - any recommendations? My specific use-case does not need to support sanitizing during the HTML -> MD conversion, as the HTML doc strings that I will be converting are sanitized during the extraction phase (using JSoup).
- ktosobcy 3 days ago
  If it doesn't have to be in Kotlin there is flexmark:
  * https://github.com/vsch/flexmark-java/tree/master/flexmark-h... * https://github.com/vsch/flexmark-java
sureglymop 4 days ago
Reminds me of Aaron Swartz' html2text that I think serves the same purpose: http://www.aaronsw.com/2002/html2text/
- stevekemp 4 days ago
  Same idea I guess, but it's Aaron's has been broken for years - and probably for the best because it didn't stop people specifying things like "file:////etc/passwd" as the URL to export to markdown.
juliuskiesian 3 days ago
One of the pain points of using this kind of tools is handling syntax highlighted code blocks. How does html-to-markdown perform in such scenarios?
- JohannesKauf 3 days ago
  Yeah good point, that's actually difficult. They use many `<span>` html tags to color individual words and syntax.
  But I wrote logic to handle that. It probably needs to be adapted at some point, but works surprisingly well. Have a look at the testdata files ("code.in.html" and "code.out.md" files [1]).
  Feel free to give it a try & let me know if you notice any edge cases!
  [1] https://github.com/JohannesKaufmann/html-to-markdown/blob/ma...
vergessenmir 3 days ago
is there a plugin to convert the ast to json? Similar to the mistune package in python. I'm using this as part of a rag ingestion pipeline and working with markdown ast provides a flatter structure than raw html
lollobomb 3 days ago
This is nice, I tried a plugin for pandoc in the past but didn't really work well.
Savageman 4 days ago
I remember a long time ago I used Pandoc for this.
Fresh tools and more choice is very welcome, thanks for your work!
- JohannesKauf 4 days ago
  Pandoc is amazing. Especially because they support so many formats.
  And their html to markdown converter is (in my opinion) still the best right now.
  But html-to-markdown is getting close. My goal is to cover more edge cases than pandoc...
oezi 4 days ago
Does it also include logic to download JS-driven sites properly or is this out of scope?
- simonw 4 days ago
  It doesn't. For that you would need to execute a full headless browser first, extract the HTML (document.body.innerHTML after the page has finished loading can work) and process the result.
  If you're already running a headless browser you may as well run the conversion in JavaScript though - I use this recipe pretty often with my shot-scraper tool: https://shot-scraper.datasette.io/en/stable/javascript.html#... - adding https://github.com/mixmark-io/turndown to the mix will get you Markdown conversion as well.
- jot 3 days ago
  We do that with Urlbox’s markdown feature: https://urlbox.com/extracting-text
- JohannesKauf 4 days ago
  That is unfortunately out of scope. I like the philosophy of doing one thing really well.
  But nowadays—with Playwright and Puppeteer—there are great choices for Browser automation.
- bni 3 days ago
  I used https://github.com/mozilla/readability for this
inhumantsar 3 days ago
i've made some modest contributions to Mozilla's Readability library and didn't see anything like their heuristics in this.
are you using a separate library for that or did I miss something in this?
- inhumantsar 3 days ago
  oops, refreshed the page and saw other comments addressing this! nevermind!
yayoohooyahoo 4 days ago
Turndown works quite well too: https://github.com/mixmark-io/turndown
hello_computer 3 days ago
This is honorable work. Thank you.
linhns 3 days ago
Very neat tool. Well done!
lakomen 3 days ago
Why?