I run a small startup called SEOJuice, where I need to crawl a lot of pages all the time, and I can say that the biggest issue with crawling is the blocking part and how much you need to invest to circumvent Cloudflare and similar, just to get access to any website at all. Bandwidth and storage are the smallest cost factors.
Even though, in my case, users add their own domains, it still took me quite a bit of time to reach a ~99% success rate crawling a website, with a mix of residential proxies, captcha solvers, rotating user-agents and stealth Chrome binaries; otherwise I would get a 403 immediately with no HTML served.
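For a rough idea of what that plumbing looks like, here's a minimal sketch. The proxy endpoints and user-agent strings are placeholders, not what I actually run, and a real setup layers captcha solving and a stealth-browser fallback on top of this:

    import random
    import requests

    # Placeholder pools; in practice these come from a residential proxy
    # provider and a maintained list of real browser user-agent strings.
    PROXIES = [
        "http://user:pass@proxy1.example.com:8000",
        "http://user:pass@proxy2.example.com:8000",
    ]
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
        "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    ]

    def fetch(url, attempts=3):
        """Rotate proxy and user-agent on every attempt; give up after repeated 403s."""
        for _ in range(attempts):
            proxy = random.choice(PROXIES)
            resp = requests.get(
                url,
                headers={"User-Agent": random.choice(USER_AGENTS)},
                proxies={"http": proxy, "https": proxy},
                timeout=15,
            )
            if resp.status_code != 403:
                return resp
        return None  # blocked on every attempt; escalate to a headless stealth browser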
Can't your users just whitelist your IPs?
They're mostly non-technical/marketing people, but yes, that would be a solution. I try to solve the issue "behind the scenes" so for them it "just works", but that means building all of these extra measures.
> the biggest issue with crawling is the blocking part and how much you need to invest to circumvent Cloudflare and similar … mix of residential proxies, captcha solvers, rotating user-agents, stealth chrome binaries
I would like to register my hatred and contempt for what you do. I sincerely hope you suffer drastic consequences for your antisocial behavior.
Please elaborate: why exactly is it antisocial? Because Cloudflare decides who can or can't access a user's website? When they specifically signed up for my service.
It intentionally circumvents the explicit desires of those who own the websites being exploited. It is nonconsensual. It says “fuck you, yes” to a clearly-communicated “please no”.
OP literally said that users add their domains, meaning they are explicitly ASKING OP to scrape their websites.
Users sign up for my service.
You employ residential proxies. As such, you enable and exploit the ongoing destruction of the Internet commons. Enjoy the money!
> spinning disks have been replaced by NVMe solid state drives with near-RAM I/O bandwidth
Am I missing something here? Even Optane is an order of magnitude slower than RAM.
Yes, under ideal conditions, SSDs can have very fast linear reads, but IOPS / latency have barely improved in recent years. And that's what really makes a difference.
Of course, compared to spinning disks, they are much faster, but the comparison to RAM seems wrong.
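Rough latency ladder, as I understand it (ballpark figures, they vary a lot by device): DRAM ~100 ns, Optane SSD ~10 µs, NVMe flash ~50-100 µs, spinning disk ~5-10 ms. So flash is still a couple of orders of magnitude away from DRAM on random access, even when sequential bandwidth looks impressive.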
In fact, for applications like AI, even using system RAM is often considered too slow, simply because of the distance to the GPU, so VRAM needs to be used. That's how latency-sensitive some applications have become.
>for applications like AI, even using system RAM is often considered too slow, simply because of the distance to the GPU
That's not why. It's because RAM has a narrower bus than VRAM. If it was a matter of distance it'd just have greater latency, but that would still give you tons of bandwidth to play with.
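Rough bandwidth numbers, generation-dependent so treat them as order-of-magnitude: dual-channel DDR5 system RAM is ~80-100 GB/s, a PCIe 5.0 x16 link to the GPU is ~64 GB/s, GDDR6X on a consumer card is ~1 TB/s, and HBM3 on a datacenter GPU is ~3 TB/s. The GPU's local memory has 10-50x the bandwidth of system RAM, and anything that has to cross PCIe is throttled well below even the RAM number.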
> because redis began to hit 120 ops/sec and I’d read that any more would cause issues
Suspicious. I don’t think I’ve ever read anything that says redis taps out below tens of thousands of ops…
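If anyone wants to sanity-check that on their own box, the redis-benchmark tool ships with Redis:

    redis-benchmark -t set,get -n 100000 -q

Even a laptop will typically report on the order of 100k plain GET/SET ops/sec over loopback, so a 120 ops/sec ceiling doesn't pass the smell test.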
Nice work, but I feel like AWS isn't required for this. There are small hosting companies with specialized servers (a shared 50 Gbit connection for under $10); you could probably do this for under $100 with some optimization.
I did some crawling on Hetzner back in the day. They monitor traffic and make sure you don't automate retrieval of publicly available data. They send you an email telling you they're concerned because you got the IP blacklisted. Funny thing is: they own the blacklist they refer to.
This. AWS is like a cash furnace, only really usable for VC backed efforts with more money than sense.
Well, the most important part seems to be glossed over, and that's the IP addresses. Many websites simply block (or want to block) anything that's not Google and not a "real user".
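And spoofing the Googlebot user-agent doesn't get you past that, because sites can verify Googlebot the way Google documents it: reverse DNS on the client IP, check the domain, then a forward lookup that must resolve back to the same IP. A rough Python sketch of that check:

    import socket

    def is_real_googlebot(ip):
        """Reverse DNS, check the domain, then confirm with a forward lookup."""
        try:
            host = socket.gethostbyaddr(ip)[0]  # e.g. crawl-66-249-66-1.googlebot.com
            if not host.endswith((".googlebot.com", ".google.com")):
                return False
            return socket.gethostbyname(host) == ip  # forward lookup must match
        except OSError:
            return False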
When I read this, I realize how small Google makes the Internet.
There was a time when being able to do this meant you were on the path to becoming a (m)(b)illionaire. Still is, I think.
I was able to get 35k req/sec on a single node with Rust (custom HTTP stack, custom HTML parser, custom queue, custom KV database) with obsessive optimization. It's possible to scrape a Bing-sized index (say 100B docs) each month with only 10 nodes, for under $15k.
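Back-of-envelope, assuming sustained throughput and ignoring retries and politeness delays: 35,000 req/s × 86,400 s × 30 days ≈ 90B requests/month per node, so 10 nodes is roughly 900B/month of raw capacity; 100B docs only uses a fraction of that, and the rest is headroom for failures, recrawls and rate limiting.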
Thought about making it public but probably no one would use it.
please do
Yes! Please do!