Web Scraping with Python: A Practical and Ethical Guide

June 19, 2026 Tech & Tools

Web scraping is one of the most practically useful data skills. It is also one where the legal and ethical dimensions matter. Here is how to do it well and responsibly.

The Technical Basics

Python’s web scraping ecosystem rests on three libraries: requests (an HTTP library that downloads the raw HTML), BeautifulSoup4 (parses HTML and extracts data from it), and lxml (a faster HTML/XML parser that BeautifulSoup can use as a backend). For JavaScript-heavy sites that don’t serve content in the initial HTML, use Playwright (a Python API for driving a headless browser, actively maintained by Microsoft and generally more reliable than Selenium) or Selenium (older and more complex, but well documented). A simple scraping example: `import requests; from bs4 import BeautifulSoup; r = requests.get(url); soup = BeautifulSoup(r.text, 'lxml'); titles = soup.select('h2.article-title')`. Specific elements are targeted with CSS selectors (`soup.select()`) or XPath (through lxml, since BeautifulSoup itself does not support XPath); browser DevTools (right-click → Inspect) shows the page structure you need to write the selectors.
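The one-liner above can be expanded into a small, reusable sketch. The function names `fetch_html` and `extract_titles`, the 10-second timeout, and the sample HTML are illustrative assumptions, not part of any library. The sketch also swaps in Python’s built-in `html.parser` so that it runs without lxml installed; pass `'lxml'` instead if you have it.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on HTTP error statuses."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def extract_titles(html, selector="h2.article-title"):
    """Return the stripped text of every element matching the CSS selector."""
    # "html.parser" ships with Python; use "lxml" here if lxml is installed.
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.select(selector)]


# Hypothetical page fragment, used so the parsing step can be tried offline.
SAMPLE_HTML = """
<html><body>
  <h2 class="article-title"> First post </h2>
  <h2 class="other">Not a match</h2>
  <h2 class="article-title">Second post</h2>
</body></html>
"""

if __name__ == "__main__":
    print(extract_titles(SAMPLE_HTML))
```

Keeping the download and the parsing in separate functions means the selector logic can be tested against saved HTML without sending any requests to the site.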

Handling Anti-Scraping Measures

Most websites implement measures to detect and block scrapers: rate limiting (too many requests too fast triggers a block), User-Agent checking (blocking requests that lack a browser User-Agent header), IP blocking (after too many requests from one IP), and bot detection (Cloudflare, reCAPTCHA). Responsible counter-measures: add delays between requests (for example `time.sleep(random.uniform(1, 3))`; a randomised delay is less detectable than a fixed one), set a realistic User-Agent header (`headers={'User-Agent': 'Mozilla/5.0 …'}`), implement retry logic with exponential backoff (urllib3’s `Retry` class mounted through a requests `HTTPAdapter`, or the tenacity library), and rotate IP addresses using a proxy pool if permitted. The difference between responsible and irresponsible scraping is primarily rate: a scraper that sends one request every 2–3 seconds typically places less load on a server than a person browsing the site, while one that sends 100 requests per second is, in effect, a denial-of-service attack.
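The delay and backoff ideas above can be sketched with the standard library alone. The helper names (`backoff_delay`, `polite_pause`, `fetch_with_retry`), the 60-second cap, and the four-attempt limit are assumptions chosen for illustration. In practice `fetch` would be a thin wrapper around `requests.get(...)` followed by `raise_for_status()`.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-based): base * 2**attempt, capped."""
    return min(cap, base * (2 ** attempt))


def polite_pause(low=1.0, high=3.0, sleep=time.sleep):
    """Sleep for a random interval between requests and return the delay used."""
    delay = random.uniform(low, high)
    sleep(delay)
    return delay


def fetch_with_retry(fetch, url, max_attempts=4, base=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff whenever it raises.

    `fetch` is any callable that raises on failure. Catching every Exception
    keeps the sketch short; real code should retry only on transient errors.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt, base))
```

Passing `sleep` as a parameter is a design choice made here so the waiting behaviour can be checked in tests without actually pausing. Call `polite_pause()` between successive pages and `fetch_with_retry` around each individual request.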

The Ethical and Legal Framework

robots.txt: every website can publish a robots.txt file (at domain/robots.txt) that lists the paths automated clients are asked not to access. You are not legally required to follow robots.txt in most jurisdictions, but ignoring it is considered unethical and can be used as evidence of bad intent in legal proceedings. Terms of service: most sites’ ToS prohibit scraping. This creates legal risk, particularly in the US (Computer Fraud and Abuse Act) and EU (GDPR for personal data). The hiQ v. LinkedIn ruling (US 9th Circuit, 2022) held that scraping publicly available data likely does not violate the CFAA, but the area remains legally unsettled. Personal data (names, emails, contact details): scraping and storing personal data is a GDPR concern in the EU. You need a lawful basis for the processing (such as legitimate interests), and a data protection impact assessment may be required where the processing is high-risk. The practical risk assessment: large-scale commercial scraping of a direct competitor’s data carries significant legal risk; scraping publicly available, non-personal data for personal research or academic purposes is generally low-risk.
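Checking robots.txt before fetching can be automated with Python’s standard `urllib.robotparser` module. The sketch below parses an inline sample so it runs offline; the helper names, the sample rules, and the `my-research-bot` user agent are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, url, user_agent="*"):
    """Return True if robots.txt permits `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(SAMPLE_ROBOTS_TXT)
    print(is_allowed(rules, "https://example.com/articles/1", "my-research-bot"))
    print(is_allowed(rules, "https://example.com/private/page", "my-research-bot"))
```

Against a live site, you would instead call `set_url()` with the site’s robots.txt address followed by `read()` on a `RobotFileParser`, then use `can_fetch()` in the same way before each request.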

Alternatives to Scraping

Before scraping, check whether the site has an official API. Many services that appear to require scraping (Twitter/X, Reddit, Google, LinkedIn) have APIs that provide structured data access. The API may be rate-limited or paid, but the data is cleaner, the legal position is clearer, and your code won’t break when the site is redesigned. Common sources of pre-scraped or structured data include Common Crawl (petabytes of web content, freely available), government open data portals (data.gov, data.europa.eu), and academic datasets. For more specific needs, check whether an established data provider already offers the data before building a scraper.

Link: https://www.sunqi.org/web-scraping-python-ethical-guide.html

Copyright belongs to the author. Do not reproduce without permission.