You want to scrape some data from the web. How hard can it be? A few HTTP requests, parse some HTML, done.
Then reality hits. Your IP gets blocked after 100 requests. Cloudflare throws a challenge page. The site requires login. Your "simple scraper" needs infrastructure you didn't plan for.
This post breaks down what production web scraping actually requires. Not an ad for any service - just an honest look at the problem space.
1. The Proxy Problem
The moment you send more than a handful of requests to any serious website, you'll get blocked. Your IP address is your identity, and websites track it.
The IP Hierarchy
Not all proxies are created equal. There's a clear hierarchy:
| Proxy Type | Description | Detection Risk | Cost |
|---|---|---|---|
| Data Center IPs | IPs from cloud providers (AWS, GCP, etc.) | Very High | $ |
| Residential IPs | IPs from real ISP customers | Medium | $$ |
| ISP/Static Residential | Static IPs from ISPs, residential-like | Low | $$$ |
| Mobile IPs | IPs from cellular networks (4G/5G) | Very Low | $$$$ |
Data Center IPs are cheap but essentially useless for any serious scraping. Websites maintain blocklists of known data center IP ranges. One request and you're flagged.
Residential IPs come from real home internet connections. They look like normal users. The catch: they're shared across many scrapers, rotate frequently, and quality varies wildly.
ISP/Static Residential are the sweet spot for many use cases. Static, legitimate residential addresses. Expensive but reliable.
Mobile IPs are the gold standard. Cellular carriers use CGNAT (Carrier-Grade NAT), meaning thousands of legitimate users share the same IP. Blocking a mobile IP means blocking real customers. Sites are extremely hesitant to do this.
The Pool Problem
You don't just need one proxy - you need thousands, organized into a pool with:
- Geographic distribution (many sites geo-restrict)
- Rotation logic (avoid hitting same IP twice in a row)
- Health monitoring (dead proxies need removal)
- Session stickiness (some sites require consistent IP during a session)
Building this yourself means:
- Sourcing proxy providers (multiple, for redundancy)
- Building rotation infrastructure
- Monitoring and replacing dead proxies
- Handling authentication and bandwidth billing
Most teams underestimate this. It's not a one-time setup - it's ongoing operations.
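To make that concrete, here's a minimal sketch of the rotation and health-tracking pieces. The proxy URLs, failure threshold, and retry behavior are placeholders, not a production design:

```python
import random

import requests

class ProxyPool:
    """Minimal rotating proxy pool with naive health tracking (illustrative only)."""

    def __init__(self, proxies, max_failures=3):
        # proxies: list of URLs like "http://user:pass@host:port"
        self.failures = {p: 0 for p in proxies}
        self.max_failures = max_failures
        self.last_used = None

    def pick(self):
        alive = [p for p, f in self.failures.items() if f < self.max_failures]
        if not alive:
            raise RuntimeError("proxy pool exhausted")
        # Avoid hitting the same IP twice in a row when possible.
        candidates = [p for p in alive if p != self.last_used] or alive
        self.last_used = random.choice(candidates)
        return self.last_used

    def fetch(self, url):
        proxy = self.pick()
        try:
            resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=15)
            resp.raise_for_status()
            self.failures[proxy] = 0   # healthy: reset its failure count
            return resp
        except requests.RequestException:
            self.failures[proxy] += 1  # sick: quarantined once it hits max_failures
            raise
```

Session stickiness and geographic targeting bolt onto the same structure: a session-to-proxy map and per-proxy region tags.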
2. The Anti-Bot Arms Race
Having good proxies is just the entry ticket. Modern websites deploy sophisticated anti-bot systems.
Cloudflare and Friends
Cloudflare, Akamai, PerimeterX, DataDome - these services protect millions of websites. They analyze:
- TLS fingerprints: Your HTTP client's TLS handshake reveals what software you're using
- JavaScript challenges: Browser automation? They can tell
- Behavioral analysis: Mouse movements, scroll patterns, timing
- Cookie/session tracking: Inconsistent sessions get flagged
- Request patterns: Too fast? Too regular? Blocked
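The TLS point in the list above is the most mechanical one to demonstrate: a stock `requests` call announces a Python TLS stack in its handshake, which fingerprinting systems recognize instantly. Libraries such as curl_cffi exist specifically to impersonate a real browser's handshake. A rough sketch, with the caveat that the available impersonation labels depend on the installed version:

```python
# Plain requests: the TLS ClientHello looks like a Python HTTP stack,
# which anti-bot vendors fingerprint (JA3-style hashes and similar).
import requests
print(requests.get("https://example.com", timeout=15).status_code)

# curl_cffi (pip install curl_cffi) mimics a real browser's TLS handshake.
# "chrome" is a common impersonation alias; exact labels vary by release.
from curl_cffi import requests as curl_requests
resp = curl_requests.get("https://example.com", impersonate="chrome", timeout=15)
print(resp.status_code)
```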
CAPTCHA Hell
When behavioral detection fails, CAPTCHAs appear:
- reCAPTCHA v2: The classic checkbox
- reCAPTCHA v3: Invisible, score-based
- hCaptcha: reCAPTCHA alternative
- Custom challenges: Image selection, puzzle solving
Solving CAPTCHAs at scale requires one of:
- Human solving services (slow, expensive)
- Machine learning models (unreliable, arms race)
- Avoiding detection in the first place (best option)
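In practice, that last option mostly means looking unremarkable: browser-like headers, jittered pacing, and backing off the moment a challenge appears instead of retrying harder. A rough sketch, where the header values and thresholds are illustrative rather than magic numbers:

```python
import random
import time

import requests

HEADERS = {
    # Illustrative browser-like headers; real deployments rotate them.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}

def polite_get(url, session=None):
    session = session or requests.Session()
    time.sleep(random.uniform(2.0, 6.0))  # jittered delay, never a fixed interval
    resp = session.get(url, headers=HEADERS, timeout=20)
    # Crude challenge detection: a 403/429 or a CAPTCHA marker means back off
    # and change identity (proxy, fingerprint), not retry the same way harder.
    if resp.status_code in (403, 429) or "captcha" in resp.text.lower():
        raise RuntimeError(f"likely challenged on {url}")
    return resp
```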
The Unblocker Concept
This is where "unblocker" services come in. They handle:
- TLS fingerprint spoofing
- JavaScript rendering
- CAPTCHA solving
- Cookie management
- Retry logic with different configurations
You send a URL, they return the content. Simple API, complex infrastructure behind it.
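The calling pattern usually looks something like the snippet below. The endpoint, parameters, and auth scheme are hypothetical stand-ins, not any specific vendor's API - each provider has its own flavor of the same idea:

```python
import requests

API_KEY = "YOUR_API_KEY"  # provider-issued credential (placeholder)

def fetch_via_unblocker(target_url):
    # Hypothetical unblocker endpoint: you hand over the target URL and the
    # service deals with proxies, fingerprints, JS rendering, CAPTCHAs, retries.
    resp = requests.post(
        "https://api.example-unblocker.invalid/v1/fetch",  # placeholder endpoint
        json={"url": target_url, "render_js": True},
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["html"]  # response shape is also vendor-specific
```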
3. Hosted Scrapers: The Lambda Model
Even with proxies and anti-bot bypass solved, you still need:
- Servers to run your scraper code
- Job queuing and scheduling
- Result storage
- Error handling and retries
- Monitoring and alerting
The Traditional Approach
Run EC2 instances, deploy your scraper, manage a queue (SQS, Redis), store results in PostgreSQL or S3. You're now maintaining:
- Server infrastructure
- Database operations
- Job orchestration
- Scraper code updates
- Scaling logic
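Stripped to its skeleton, that DIY loop is a worker pulling jobs from a queue, fetching, and storing results. In the sketch below a Redis list stands in for SQS and the retry logic is deliberately naive - the queue names and storage are placeholders:

```python
import json

import redis
import requests

r = redis.Redis(host="localhost", port=6379)

def worker():
    while True:
        _, raw = r.blpop("scrape:jobs")  # block until a job arrives (placeholder queue name)
        job = json.loads(raw)
        try:
            resp = requests.get(job["url"], timeout=30)
            resp.raise_for_status()
            # Production code would write to S3/Postgres; a Redis list stands in here.
            r.rpush("scrape:results", json.dumps({"url": job["url"], "body": resp.text}))
        except requests.RequestException:
            # Naive retry: requeue with an attempt counter, give up after 3 tries.
            job["attempts"] = job.get("attempts", 0) + 1
            if job["attempts"] < 3:
                r.rpush("scrape:jobs", json.dumps(job))

if __name__ == "__main__":
    worker()
```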
The Hosted Alternative
Services like Brightdata and Apify offer a "serverless scraper" model:
- Write your scraper logic (JavaScript/Python)
- Upload to their platform
- Trigger on-demand or schedule
- Results delivered to your storage
Similar to AWS Lambda - you write the function, they handle execution. Benefits:
- No infrastructure to maintain
- Pay per execution
- Built-in proxy integration
- Automatic retries and error handling
Trade-off: vendor lock-in and less control over the execution environment.
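With a hosted platform, your code typically shrinks to something shaped like the handler below. The entry-point signature and the `context` helpers are a generic sketch of the pattern, not any particular vendor's SDK:

```python
# Generic shape of a hosted-scraper entry point (not a specific vendor's SDK).
# The platform supplies the job input, a proxy-aware fetch helper, and result storage.

def handler(event, context):
    """event: job input, e.g. {"url": ...}; context: platform-provided helpers (hypothetical)."""
    url = event["url"]
    html = context.fetch(url)  # hypothetical proxy/unblocker-backed fetch
    title = None
    if "<title>" in html:
        title = html.split("<title>", 1)[1].split("</title>", 1)[0]
    context.push_result({"url": url, "title": title})  # hypothetical result sink
```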
4. The Login Problem
Public data is one thing. But many valuable data sources require authentication:
- Social media (Facebook, X/Twitter, LinkedIn)
- E-commerce (order history, pricing after login)
- SaaS platforms (dashboards, reports)
Why Login Scraping is Hard
- Session management: Maintain cookies across requests
- 2FA/MFA: Many sites require additional verification
- Rate limiting per account: Even with valid login, aggressive scraping gets accounts banned
- Terms of Service: Often explicitly prohibit automation
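The session-management piece, at minimum, means logging in once, persisting the cookies, and reusing them until they expire. A minimal `requests`-based sketch - the login URL and form fields are placeholders, and real sites add CSRF tokens and 2FA on top:

```python
import pickle
from pathlib import Path

import requests

COOKIE_FILE = Path("session_cookies.pkl")

def get_session(username, password):
    session = requests.Session()
    if COOKIE_FILE.exists():
        # Reuse a previously established session while its cookies are still valid.
        session.cookies.update(pickle.loads(COOKIE_FILE.read_bytes()))
        return session
    # Placeholder login flow; real sites add CSRF tokens, 2FA, and redirects on top.
    session.post(
        "https://example.com/login",
        data={"username": username, "password": password},
        timeout=30,
    )
    COOKIE_FILE.write_bytes(pickle.dumps(session.cookies))
    return session
```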
The Dataset Approach
Some scraping services maintain pre-crawled datasets. For example, Brightdata claims 45M+ records for Facebook data.
How does this work? A few possibilities:
- They've already crawled with authenticated accounts
- They aggregate data from multiple sources
- They use APIs where available (official or unofficial)
When you query their "Facebook dataset," you're not scraping Facebook in real-time. You're querying their cached database.
This raises questions:
- How fresh is the data?
- What's the coverage?
- Is this compliant with the source site's ToS?
For well-known sites, this approach often makes more sense than building custom scrapers. The data already exists - why reinvent the wheel?
The Edge Case: Login-Required Non-Famous Sites
Here's where things get complicated. If you need to scrape a niche B2B platform or internal tool that:
- Requires login
- Has no pre-existing datasets
- Has custom anti-bot measures
You're back to building custom infrastructure:
- Secure credential management
- Session handling
- Custom anti-detection logic
- Monitoring for account bans
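Even the last two items usually start small: credentials pulled from the environment rather than source control, and a probe that notices when an "authenticated" request bounces back to a login or block page. A rough sketch - the URL patterns and alert hook are placeholders:

```python
import os

# Credentials live in the environment (or a secrets manager), never in the repo.
# These placeholder names would feed the login flow sketched earlier.
USERNAME = os.environ.get("SCRAPER_USER", "")
PASSWORD = os.environ.get("SCRAPER_PASS", "")

BAN_SIGNALS = ("/login", "/challenge", "account suspended")  # illustrative patterns

def alert(message):
    # Stand-in for Slack, PagerDuty, email, etc.
    print(f"[ALERT] {message}")

def account_healthy(session, probe_url):
    """session: a logged-in requests.Session; probe_url: a page that requires auth."""
    resp = session.get(probe_url, timeout=30, allow_redirects=True)
    text = resp.text.lower()
    banned = resp.status_code in (401, 403) or any(
        sig in resp.url.lower() or sig in text for sig in BAN_SIGNALS
    )
    if banned:
        alert(f"account looks banned or challenged at {probe_url}")
    return not banned
```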
No SaaS solution covers every site. For truly custom needs, you still need engineering work.
5. Compliance: The Invisible Constraint
Technical capability isn't everything. Legal and ethical constraints matter.
What You Can't Scrape
Most scraping services explicitly exclude certain sites:
- Apple.com (strict legal enforcement)
- Government sites (legal restrictions)
- Sites with explicit scraping bans in ToS
- Personal data without consent (GDPR/CCPA)
Brightdata, for example, maintains a blocklist. Try to scrape apple.com and you'll get rejected.
The Risk Calculus
Even if technically possible, consider:
- Legal risk: The Computer Fraud and Abuse Act (CFAA) in the US, similar laws elsewhere
- Business risk: Cease and desist letters, lawsuits
- Ethical risk: Scraping personal data at scale has consequences
The hiQ v. LinkedIn case established some legal precedent for scraping public data, but the law remains murky. When in doubt, consult legal counsel.
6. Build vs Buy: The Real Calculus
So should you build scraping infrastructure or buy from a SaaS provider?
Build When:
- You have a small, specific target (one or two sites)
- Sites have minimal anti-bot protection
- You need full control over execution
- You're cost-sensitive at very high scale
- Compliance requires owning the infrastructure
Buy When:
- You're targeting multiple sites
- Sites have sophisticated anti-bot measures
- You need to move fast
- Infrastructure isn't your core competency
- Pre-existing datasets cover your needs
The Hybrid Approach
Many teams end up with a hybrid:
- SaaS for difficult sites (Cloudflare-protected, login-required)
- Custom scrapers for simple targets
- Pre-crawled datasets for common data sources
7. Conclusion
Web scraping looks simple until you try to do it at scale. The real infrastructure includes:
| Layer | DIY Complexity | SaaS Solution |
|---|---|---|
| Proxy Pool | High - sourcing, rotation, monitoring | Included |
| Anti-Bot Bypass | Very High - TLS, JS, CAPTCHAs | Unblocker API |
| Execution Environment | Medium - servers, queues, storage | Hosted Scraper |
| Login Handling | High - sessions, 2FA, account management | Datasets (partial) |
| Compliance | Ongoing - legal review, blocklists | Built-in restrictions |
The complexity multiplies across layers. Each solved problem reveals the next.
SaaS scraping services exist because this problem is genuinely hard. They've invested years in infrastructure most teams can't justify building. Whether their pricing makes sense depends on your scale and needs.
For most teams, the answer isn't "build everything" or "buy everything" - it's understanding which layers need custom solutions and which can be outsourced.
The web scraping landscape keeps evolving. Anti-bot systems get smarter. Scrapers adapt. The arms race continues. What works today may not work tomorrow - factor that into your build vs buy decision.