Show HN: Robust LLM Extractor for Websites in TypeScript

(github.com)

15 points | by andrew_zhong 2 hours ago

8 comments

sheept
20 minutes ago
> LLMs return malformed JSON more often than you'd expect, especially with nested arrays and complex schemas. One bad bracket and your pipeline crashes.
This might be one reason why Claude Code uses XML for tool calling: repeating the tag name in the closing bracket helps it keep track of where it is during inference, so it is less error prone.
dmos62
6 minutes ago
What's your experience with not getting blocked by anti-bot systems? I see you've custom patches for that.
Flux159
54 minutes ago
This looks pretty interesting! I haven't used it yet, but looked through the code a bit, it looks like it uses turndown to convert the html to markdown first, then it passes that to the LLM so assuming that's a huge reduction in tokens by preprocessing. Do you have any data on how often this can cause issues? ie tables or other information being lost?
Then langchain and structured schemas for the output along w/ a specific system prompt for the LLM. Do you know which open source models work best or do you just use gemini in production?
Also, looking at the docs, Gemini 2.5 flash is getting deprecated by June 17th https://ai.google.dev/gemini-api/docs/deprecations#gemini-2.... (I keep getting emails from Google about it), so might want to update that to Gemini 3 Flash in the examples.
plastic041
1 hour ago
> Avoid detection with built-in anti-bot patches and proxy configuration for reliable web scraping.
And it doesn't care about robots.txt.
[-]
- andrew_zhong
  2 minutes ago
  Good point. The anti-bot patches here (via Patchright) are about preventing the browser from being detected as automated — things like CDP leak fixes so Cloudflare doesn't block you mid-session. It's not about bypassing access restrictions.
  Our main use case is retail price monitoring — comparing publicly listed product prices across e-commerce sites, which is pretty standard in the industry. But fair point, we should make that clearer in the README.
zx8080
35 minutes ago
Robots.txt anyone?
[-]
- andrew_zhong
  2 minutes ago
  Good point. The anti-bot patches here (via Patchright) are about preventing the browser from being detected as automated — things like CDP leak fixes so Cloudflare doesn't block you mid-session. It's not about bypassing access restrictions.
  Our main use case is retail price monitoring — comparing publicly listed product prices across e-commerce sites, which is pretty standard in the industry. But fair point, we should make that clearer in the README.
johnwhitman
3 minutes ago
[dead]
Remi_Etien
52 minutes ago
[dead]
gautamborad
1 hour ago
[dead]