How to Build a Reliable URL Getter in Python (Step-by-Step)

Secure URL Getter: Protecting Against SSRF and Malicious Redirects

Building a URL getter (a component that fetches remote resources by URL) is common in web apps, bots, and integrations. But accepting arbitrary URLs from users or downstream systems creates a serious attack surface, most notably Server-Side Request Forgery (SSRF) and malicious redirects. This article explains the practical threats and gives concrete, implementable defenses you can add to a URL getter to make it safe.

Threats at a glance

  • SSRF: Attacker-submitted URL causes your server to request internal-only resources (metadata endpoints, internal services, databases), exposing sensitive data or enabling pivoting.
  • Malicious redirects: URL points to a redirect chain that eventually reaches a local/internal address or an adversary-controlled server; naive getters follow them and reach dangerous targets or leak requests.
  • Open-proxy abuse: Attackers use your getter as a proxy to reach hosts that are blocked to them or to obscure the origin of their requests.
  • Payload/response attacks: Large responses, slowloris-style servers, or content that triggers downstream vulnerabilities (e.g., HTML with embedded scripts if you render it).

Security goals

  • Prevent access to internal/private network addresses.
  • Limit where redirects can take the request.
  • Restrict which protocols and ports are allowed.
  • Control resource usage (timeout, size).
  • Log and alert suspicious attempts.

Defensive measures (practical, prioritized)

  1. Input validation and allowed-scheme policy
  • Allow only safe schemes: typically http and https. Reject file:, gopher:, ftp:, data:, and other unusual schemes.
  • Reject URLs lacking a hostname (e.g., file paths, relative URLs).
  • Normalize and parse URLs robustly (use a vetted URL-parsing library).
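A minimal sketch of the scheme and hostname checks above, using the standard library's urllib.parse. The names validate_url, ALLOWED_SCHEMES, and DEFAULT_PORTS are illustrative assumptions, not part of any library.

```python
from urllib.parse import urlsplit

# Assumed policy: only plain web schemes may be fetched.
ALLOWED_SCHEMES = {"http", "https"}
DEFAULT_PORTS = {"http": 80, "https": 443}


def validate_url(url: str) -> tuple:
    """Return (scheme, host, port), or raise ValueError if the URL violates policy."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"scheme not allowed: {scheme!r}")
    host = parts.hostname  # lowercased by urlsplit; None for relative URLs and paths
    if not host:
        raise ValueError("URL has no hostname")
    port = parts.port  # raises ValueError itself if the port is malformed
    if port is None:
        port = DEFAULT_PORTS[scheme]
    return scheme, host, port
```

Returning the parsed pieces rather than the raw string means later checks (DNS resolution, port policy) work on the same values the parser saw.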
  2. Resolve and validate host before connecting
  • Perform DNS resolution and inspect resolved IP(s) before opening a socket.
  • Deny requests if any resolved IP is in a private or reserved range (examples below).
  • Consider using getaddrinfo and checking all returned addresses (IPv4 and IPv6).
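The resolve-then-inspect step could look roughly like the sketch below. It assumes the standard socket and ipaddress modules; resolve_all and check_host are hypothetical helper names, and the built-in is_global property stands in for a real range policy.

```python
import ipaddress
import socket


def resolve_all(host: str, port: int) -> list:
    """Return every distinct IP address getaddrinfo reports for host."""
    # May raise socket.gaierror if the name cannot be resolved.
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        # sockaddr[0] is the address text; drop any IPv6 zone suffix ("%eth0").
        ip = ipaddress.ip_address(sockaddr[0].split("%", 1)[0])
        if ip not in addresses:
            addresses.append(ip)
    return addresses


def check_host(host: str, port: int) -> list:
    """Resolve host; raise ValueError if ANY address is not globally routable."""
    addresses = resolve_all(host, port)
    if not addresses:
        raise ValueError(f"{host} did not resolve to any address")
    for ip in addresses:
        if not ip.is_global:
            raise ValueError(f"{host} resolves to non-public address {ip}")
    return addresses
```

Rejecting the host when any single address fails, rather than picking a "good" one, avoids relying on which record the HTTP client happens to use.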
  3. Block private, link-local and special IP ranges
  • Deny ranges such as:
    • IPv4 private: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
    • IPv4 Link-local: 169.254.0.0/16
    • Loopback: 127.0.0.0/8
    • Multicast: 224.0.0.0/4
    • Reserved and documentation ranges, e.g. 0.0.0.0/8, 192.0.2.0/24, 198.51.100.0/24, 203.0.113.0/24, 240.0.0.0/4.
    • IPv6 equivalents: ::1/128 (loopback), fc00::/7 (ULA), fe80::/10 (link-local).
  • Also block IPs belonging to cloud provider metadata services (e.g., 169.254.169.254) explicitly if applicable.
  • Maintain and update the list periodically.
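One way to express such a deny list is with the standard ipaddress module, as in the sketch below. DENIED_NETWORKS and is_denied are assumed names, and the list contains the ranges named above plus a few reserved blocks (0.0.0.0/8, 240.0.0.0/4, ::/128, ff00::/8) added here as assumptions; it is not exhaustive.

```python
import ipaddress

# Illustrative deny list; extend it to match your own policy.
DENIED_NETWORKS = [ipaddress.ip_network(n) for n in (
    "0.0.0.0/8",        # "this network" (assumption, not in the list above)
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",  # IPv4 private
    "169.254.0.0/16",   # link-local, includes 169.254.169.254 metadata
    "127.0.0.0/8",      # loopback
    "224.0.0.0/4",      # multicast
    "240.0.0.0/4",      # reserved (assumption)
    "::/128", "::1/128",  # IPv6 unspecified and loopback
    "fc00::/7",         # IPv6 unique local
    "fe80::/10",        # IPv6 link-local
    "ff00::/8",         # IPv6 multicast (assumption)
)]


def is_denied(address: str) -> bool:
    """Return True if the address falls inside any denied network."""
    ip = ipaddress.ip_address(address)
    # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so it is judged as IPv4.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return any(ip.version == net.version and ip in net for net in DENIED_NETWORKS)
```

The IPv4-mapped unwrapping matters because an address such as ::ffff:127.0.0.1 would otherwise slip past the IPv4 rules.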
  4. Avoid DNS rebinding and on-the-fly resolution tricks
  • Re-resolve DNS before following each redirect and enforce the same IP validation on every hop.
  • Optionally, refuse redirects that lead to a different host or IP address family.
  • Consider pinning the connection to the IP address you validated, so that a second lookup at connect time cannot return a different address; do this only if it is safe for your use case.
  5. Strict redirect handling
  • Limit maximum number of redirects (e.g., 3).
  • Only follow redirects for allowed schemes.
  • Disallow redirects to hosts whose resolved addresses fail the IP checks.
  • Prefer handling redirects at the application level (inspect Location header) rather than letting the HTTP client follow them automatically.
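The application-level redirect loop described above might be sketched as follows. Everything here is an assumption for illustration: fetch_once is a caller-supplied function that performs a single request without following redirects and returns (status, headers, body), with headers as a dict keyed by "Location"; check_url is a caller-supplied validator that raises on a disallowed scheme or address.

```python
from urllib.parse import urljoin

MAX_REDIRECTS = 3  # example limit from the list above
REDIRECT_CODES = {301, 302, 303, 307, 308}


class RedirectError(Exception):
    """Raised when a redirect chain is malformed or too long."""


def follow_redirects(url, fetch_once, check_url, max_redirects=MAX_REDIRECTS):
    """Fetch url, validating every hop before it is requested."""
    for _ in range(max_redirects + 1):
        check_url(url)  # same scheme/IP policy on the first URL and each hop
        status, headers, body = fetch_once(url)
        if status not in REDIRECT_CODES:
            return url, status, body
        location = headers.get("Location")
        if not location:
            raise RedirectError("redirect without Location header")
        url = urljoin(url, location)  # resolve relative redirect targets
    raise RedirectError(f"more than {max_redirects} redirects")
```

Because check_url runs at the top of every iteration, a redirect to an internal address is rejected before any request is sent to it.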
  6. Enforce host allowlist or denylist (defensive layering)
  • Maintain an allowlist of domains you trust (recommended for high-risk use cases).
  • If allowlist is not feasible, use a denylist of known risky domains and cloud metadata addresses.
  • Use wildcard/pattern matching carefully; prefer exact matches.
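To illustrate why exact matches are preferred, here is a small sketch of an allowlist check. The host names and the ALLOWED_HOSTS and ALLOWED_PARENT_DOMAINS sets are made-up examples; the subdomain rule is one possible way to do careful pattern matching, not the only one.

```python
# Hypothetical policy data for illustration only.
ALLOWED_HOSTS = frozenset({"api.example.com", "static.example.com"})
ALLOWED_PARENT_DOMAINS = frozenset({"example.org"})  # any subdomain is allowed


def host_is_allowed(host: str) -> bool:
    """Exact match first; otherwise allow only true subdomains of a parent."""
    host = host.rstrip(".").lower()  # normalize case and a trailing root dot
    if host in ALLOWED_HOSTS:
        return True
    # Require the leading dot so "evilexample.org" does not match "example.org".
    return any(host.endswith("." + parent) for parent in ALLOWED_PARENT_DOMAINS)
```

A naive substring or plain suffix test would accept hosts such as api.example.com.evil.test or evilexample.org, which is exactly the mistake the leading-dot comparison avoids.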
  7. Network-level isolation
  • Run the getter in a sandboxed environment: separate container, restricted network namespace, or dedicated proxy server.
  • Use egress firewall rules to block all private ranges and only permit outbound to approved public IP ranges.
  • Consider a dedicated outbound proxy that enforces policies centrally.
  8. Timeouts, size limits, and connection caps
  • Set connection and total request timeouts (e.g., connect timeout 3s, overall 10s).
  • Limit response body size (e.g., 5–10 MB) and stream content to disk if needed.
  • Limit concurrent outbound requests per user and enforce global rate limits.
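A size cap and an overall deadline can be enforced while streaming the body, as in this sketch. It works on any file-like object with a read() method; read_capped and ResponseTooLarge are assumed names, and the 5 MB default mirrors the example figure above.

```python
import time


class ResponseTooLarge(Exception):
    """Raised when the body exceeds the configured size cap."""


def read_capped(stream, max_bytes=5 * 1024 * 1024, deadline=None, chunk_size=64 * 1024):
    """Read stream in chunks, enforcing a byte cap and an optional deadline.

    deadline is an absolute time.monotonic() value, or None for no limit.
    """
    buf = bytearray()
    while True:
        if deadline is not None and time.monotonic() > deadline:
            raise TimeoutError("overall deadline exceeded")
        chunk = stream.read(chunk_size)
        if not chunk:  # end of body
            return bytes(buf)
        buf += chunk
        if len(buf) > max_bytes:
            raise ResponseTooLarge(f"body exceeds {max_bytes} bytes")
```

With the standard library, the stream could be the response object from urllib.request.urlopen(url, timeout=3), called with deadline=time.monotonic() + 10. The timeout argument there applies to individual socket operations rather than the whole request, which is why the sketch checks a separate deadline between chunks.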
  9. Validate response content and headers
  • Check Content-Type and Content-Length; reject or treat suspicious content types cautiously.
  • Sanitize or avoid rendering fetched HTML; if you must render it, apply further isolation and content-security measures.
  • Scan responses for known malware patterns if relevant.
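A header check along these lines can run before the body is read. This is only a sketch: ALLOWED_TYPES is a made-up example policy, check_response_headers is a hypothetical helper, and headers is assumed to be a plain dict of header names to string values.

```python
# Example policy only; choose the media types your application really needs.
ALLOWED_TYPES = {"application/json", "text/plain", "image/png", "image/jpeg"}


def check_response_headers(headers: dict, max_bytes: int = 5 * 1024 * 1024) -> str:
    """Return the media type, or raise ValueError if the headers look unsafe."""
    lowered = {name.lower(): value for name, value in headers.items()}
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = lowered.get("content-type", "").split(";", 1)[0].strip().lower()
    if media_type not in ALLOWED_TYPES:
        raise ValueError(f"content type not allowed: {media_type!r}")
    raw_length = lowered.get("content-length")
    if raw_length is not None:
        text = raw_length.strip()
        if not (text.isascii() and text.isdigit()):
            raise ValueError("malformed Content-Length")
        if int(text) > max_bytes:
            raise ValueError("declared body is too large")
    return media_type
```

Content-Length is only a declaration by the server, so this check complements a streaming size cap rather than replacing it.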
  10. Authentication & credentials handling
  • Never automatically include internal credentials or