Post

Cookie State Management for Long-Running Scraping Jobs

Cookie State Management for Long-Running Scraping Jobs

Long-running scrapers live and die by how well they manage cookies. A scraper that runs for hours or days will inevitably face expired authentication tokens, rotated session identifiers, and restarts caused by crashes or deployments. If your cookie state evaporates every time one of these events occurs, your scraper wastes time re-authenticating, loses its place in a crawl, and risks triggering rate limits or account locks from repeated logins. Proper cookie and session management means your scraper can survive restarts, detect when authentication has lapsed, and renew sessions automatically without human intervention. This post covers every technique you need to build that resilience into your Python scrapers using requests.Session, http.cookiejar, and browser automation tools.

A long-running scraper goes through a predictable cycle with cookies. It authenticates, receives session cookies, uses them for a potentially unbounded number of requests, and eventually those cookies expire or get invalidated. The scraper must detect this and re-authenticate to continue.

flowchart TD
    A[Scraper Starts] --> B{Saved Cookies<br>on Disk?}
    B -->|Yes| C[Load Cookies<br>from Disk]
    B -->|No| D[Authenticate]
    C --> E{Cookies Still<br>Valid?}
    E -->|Yes| F[Make Requests<br>with Cookies]
    E -->|No| D
    D --> G[Receive Session<br>Cookies]
    G --> H[Save Cookies<br>to Disk]
    H --> F
    F --> I{Response OK?}
    I -->|200| J[Process Data]
    J --> K[Continue Scraping]
    K --> F
    I -->|401 / 403| L[Cookies Expired]
    L --> D
    I -->|Crash / Restart| B

The key insight is that cookie persistence and expiration detection form a loop. Every successful authentication saves state. Every failed request checks whether re-authentication is needed. Every restart loads state from disk instead of starting from scratch.

The requests.Session object is the foundation of cookie management in Python scrapers. It maintains a CookieJar across requests, automatically sending and receiving cookies just like a browser would.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
import requests

session = requests.Session()

# Login -- the session automatically captures Set-Cookie headers
login_payload = {"username": "scraper_user", "password": "s3cret"}
resp = session.post("https://example.com/login", data=login_payload)

# Subsequent requests include the session cookies automatically
data_resp = session.get("https://example.com/api/data?page=1")
print(data_resp.status_code)  # 200, because session cookies are sent

# Inspect the cookies the session is holding
for cookie in session.cookies:
    print(f"{cookie.name} = {cookie.value} (domain={cookie.domain}, expires={cookie.expires})")

Without a Session, each call to requests.get() or requests.post() is stateless. The Set-Cookie headers from the login response would be discarded, and the next request would arrive at the server without any session identifier. The Session object solves this by maintaining a RequestsCookieJar internally, which is a subclass of http.cookiejar.CookieJar.

You can also pre-load cookies into a session manually:

1
2
session.cookies.set("session_id", "abc123", domain="example.com", path="/")
session.cookies.set("csrf_token", "xyz789", domain="example.com", path="/")

This is useful when you have cookies from another source, like a browser or a previous scraper run. For the basics of managing cookies across requests, including how requests.Session handles Set-Cookie headers, our introductory guide covers the fundamentals.

Persisting Cookies to Disk

A session that only lives in memory is useless across restarts. You need to serialize cookies to disk so the scraper can pick up where it left off.

JSON Serialization

JSON is human-readable and easy to debug. The trade-off is that you lose some cookie metadata that http.cookiejar tracks internally, but for most scraping use cases the essential fields are enough.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
import json
import time
from pathlib import Path

COOKIE_FILE = Path("cookies.json")

def save_cookies_json(session, filepath=COOKIE_FILE):
    """Save session cookies to a JSON file."""
    cookies = []
    for cookie in session.cookies:
        cookies.append({
            "name": cookie.name,
            "value": cookie.value,
            "domain": cookie.domain,
            "path": cookie.path,
            "expires": cookie.expires,
            "secure": cookie.secure,
            "rest": {"HttpOnly": cookie.has_nonstandard_attr("HttpOnly")},
        })
    filepath.write_text(json.dumps(cookies, indent=2))

def load_cookies_json(session, filepath=COOKIE_FILE):
    """Load cookies from a JSON file into a session."""
    if not filepath.exists():
        return False
    cookies = json.loads(filepath.read_text())
    for c in cookies:
        session.cookies.set(
            c["name"],
            c["value"],
            domain=c["domain"],
            path=c["path"],
        )
    return True

Using http.cookiejar for Native Persistence

The http.cookiejar module in the standard library includes MozillaCookieJar and LWPCookieJar, both of which support saving and loading cookies to disk in standard formats.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
import http.cookiejar
import requests

def create_persistent_session(cookie_path="cookies.txt"):
    """Create a requests session backed by a persistent MozillaCookieJar."""
    cookie_jar = http.cookiejar.MozillaCookieJar(cookie_path)

    # Load existing cookies if the file exists
    try:
        cookie_jar.load(ignore_discard=True, ignore_expires=True)
        print(f"Loaded {len(cookie_jar)} cookies from {cookie_path}")
    except FileNotFoundError:
        print("No existing cookie file, starting fresh")

    session = requests.Session()
    session.cookies = cookie_jar
    return session

def save_session_cookies(session, cookie_path="cookies.txt"):
    """Persist current session cookies to disk."""
    session.cookies.save(ignore_discard=True, ignore_expires=True)
    print(f"Saved {len(session.cookies)} cookies to {cookie_path}")

The ignore_discard=True flag saves session cookies (those without an explicit expiry) that would normally be discarded when the “browser” closes. The ignore_expires=True flag saves cookies even if they have already expired, which can be useful for debugging.

Pickle Serialization

Pickle preserves the full CookieJar object graph, but it is not human-readable and comes with the usual pickle security caveats – only load pickle files you trust.

1
2
3
4
5
6
7
8
9
10
11
12
13
import pickle

def save_cookies_pickle(session, filepath="cookies.pkl"):
    with open(filepath, "wb") as f:
        pickle.dump(session.cookies, f)

def load_cookies_pickle(session, filepath="cookies.pkl"):
    try:
        with open(filepath, "rb") as f:
            session.cookies = pickle.load(f)
        return True
    except FileNotFoundError:
        return False

For most scraping projects, JSON serialization strikes the right balance. It is debuggable, portable, and does not carry the security baggage of pickle.

Cookies expire. Session cookies vanish when the process ends. Persistent cookies have an expires timestamp after which they should not be sent. And servers can invalidate cookies at any time without the expiry changing. A robust scraper checks for expiration both proactively and reactively.

Proactive: Checking Expiry Timestamps

1
2
3
4
5
6
7
8
9
10
11
12
13
14
import time

def has_valid_cookies(session, required_cookie="session_id"):
    """Check whether the session holds a non-expired required cookie."""
    for cookie in session.cookies:
        if cookie.name == required_cookie:
            if cookie.expires is None:
                return True  # Session cookie -- valid as long as process is alive
            if cookie.expires > time.time():
                return True
            print(f"Cookie '{cookie.name}' expired at {cookie.expires}")
            return False
    print(f"Cookie '{required_cookie}' not found in jar")
    return False

Reactive: Catching 401 and 403 Responses

Proactive checks are not enough. A server might revoke a session at any time. The scraper must treat 401 Unauthorized and 403 Forbidden responses as signals that re-authentication is needed.

1
2
3
4
5
6
7
8
9
10
11
12
REAUTH_STATUS_CODES = {401, 403}

def needs_reauth(response):
    """Determine if a response indicates expired authentication."""
    if response.status_code in REAUTH_STATUS_CODES:
        return True

    # Some sites redirect to a login page instead of returning 401/403
    if response.status_code == 200 and "/login" in response.url:
        return True

    return False
Cookies are small, but they carry the weight of authentication.
Cookies are small, but they carry the weight of authentication. Photo by hello aesthe / Pexels

Auto-Renewal: Detecting Expired Auth and Re-Authenticating

Combining proactive and reactive detection with automatic re-authentication creates a self-healing scraper. The pattern is straightforward: wrap every request in logic that checks the response and retries after re-authenticating if needed.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
import requests
import time
import logging

logger = logging.getLogger(__name__)

class AuthenticatedSession:
    def __init__(self, login_url, credentials, max_retries=2):
        self.login_url = login_url
        self.credentials = credentials
        self.max_retries = max_retries
        self.session = requests.Session()

    def authenticate(self):
        """Perform login and capture session cookies."""
        logger.info("Authenticating at %s", self.login_url)
        resp = self.session.post(self.login_url, data=self.credentials)
        resp.raise_for_status()

        if not self.session.cookies:
            raise RuntimeError("Authentication succeeded but no cookies were set")

        logger.info("Authentication successful, got %d cookies", len(self.session.cookies))

    def request(self, method, url, **kwargs):
        """Make an authenticated request with auto-renewal on auth failure."""
        for attempt in range(self.max_retries + 1):
            resp = self.session.request(method, url, **kwargs)

            if not needs_reauth(resp):
                return resp

            logger.warning(
                "Auth expired (status=%d, url=%s), re-authenticating (attempt %d/%d)",
                resp.status_code, resp.url, attempt + 1, self.max_retries,
            )
            self.authenticate()

        raise RuntimeError(f"Failed to authenticate after {self.max_retries} retries")

    def get(self, url, **kwargs):
        return self.request("GET", url, **kwargs)

    def post(self, url, **kwargs):
        return self.request("POST", url, **kwargs)

This pattern keeps the calling code clean. The scraper just calls session.get(url) and the session handles re-authentication transparently.

Building a CookieManager Class

Bringing persistence and renewal together into a single class gives you a reusable component for any long-running scraper.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
import json
import time
import logging
import requests
import http.cookiejar
from pathlib import Path

logger = logging.getLogger(__name__)


class CookieManager:
    """Manages cookie persistence, expiration detection, and auto-renewal."""

    def __init__(
        self,
        cookie_file="cookies.json",
        login_url=None,
        credentials=None,
        required_cookie="session_id",
        renewal_buffer_seconds=300,
    ):
        self.cookie_file = Path(cookie_file)
        self.login_url = login_url
        self.credentials = credentials
        self.required_cookie = required_cookie
        self.renewal_buffer = renewal_buffer_seconds
        self.session = requests.Session()
        self._load()

    def _load(self):
        """Load cookies from disk if available."""
        if not self.cookie_file.exists():
            logger.info("No cookie file found at %s", self.cookie_file)
            return

        try:
            data = json.loads(self.cookie_file.read_text())
            for c in data:
                self.session.cookies.set(
                    c["name"], c["value"],
                    domain=c.get("domain", ""),
                    path=c.get("path", "/"),
                )
            logger.info("Loaded %d cookies from %s", len(data), self.cookie_file)
        except (json.JSONDecodeError, KeyError) as exc:
            logger.warning("Failed to load cookies: %s", exc)

    def save(self):
        """Persist current cookies to disk."""
        cookies = []
        for cookie in self.session.cookies:
            cookies.append({
                "name": cookie.name,
                "value": cookie.value,
                "domain": cookie.domain,
                "path": cookie.path,
                "expires": cookie.expires,
                "secure": cookie.secure,
            })
        self.cookie_file.write_text(json.dumps(cookies, indent=2))
        logger.info("Saved %d cookies to %s", len(cookies), self.cookie_file)

    def is_valid(self):
        """Check if required cookies exist and are not expired."""
        for cookie in self.session.cookies:
            if cookie.name == self.required_cookie:
                if cookie.expires is None:
                    return True
                remaining = cookie.expires - time.time()
                if remaining > self.renewal_buffer:
                    return True
                logger.info(
                    "Cookie '%s' expires in %.0f seconds (buffer=%d)",
                    cookie.name, remaining, self.renewal_buffer,
                )
                return False
        return False

    def authenticate(self):
        """Perform login and save new cookies."""
        if not self.login_url or not self.credentials:
            raise RuntimeError("Cannot authenticate: login_url or credentials not set")

        logger.info("Authenticating at %s", self.login_url)
        resp = self.session.post(self.login_url, data=self.credentials)
        resp.raise_for_status()
        self.save()

    def ensure_valid(self):
        """Ensure the session has valid cookies, re-authenticating if needed."""
        if not self.is_valid():
            self.authenticate()

    def request(self, method, url, **kwargs):
        """Make a request with automatic cookie management."""
        self.ensure_valid()
        resp = self.session.request(method, url, **kwargs)

        if resp.status_code in {401, 403}:
            logger.warning("Got %d from %s, re-authenticating", resp.status_code, url)
            self.authenticate()
            resp = self.session.request(method, url, **kwargs)

        return resp

    def get(self, url, **kwargs):
        return self.request("GET", url, **kwargs)

    def post(self, url, **kwargs):
        return self.request("POST", url, **kwargs)

Usage is straightforward:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
manager = CookieManager(
    cookie_file="scraper_cookies.json",
    login_url="https://example.com/login",
    credentials={"username": "scraper", "password": "s3cret"},
    required_cookie="session_id",
    renewal_buffer_seconds=600,  # renew 10 minutes before expiry
)

for page in range(1, 1001):
    resp = manager.get(f"https://example.com/api/products?page={page}")
    products = resp.json()
    process_products(products)

    # Save cookies periodically so a crash does not lose state
    if page % 50 == 0:
        manager.save()

Browser Automation Cookies: Saving Playwright and Selenium State

When scraping sites that require a full browser – heavy JavaScript, anti-bot checks, complex login and form automation flows – you need to persist browser-level cookies between runs.

Playwright Storage State

Playwright has built-in support for saving and restoring the entire browser context state, including cookies and localStorage.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
from playwright.sync_api import sync_playwright
import json
from pathlib import Path

STORAGE_FILE = "playwright_state.json"

def save_playwright_state(context, filepath=STORAGE_FILE):
    """Save browser context state including cookies and localStorage."""
    state = context.storage_state()
    Path(filepath).write_text(json.dumps(state, indent=2))

def scrape_with_playwright():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)

        # Try to restore previous state
        if Path(STORAGE_FILE).exists():
            context = browser.new_context(storage_state=STORAGE_FILE)
            print("Restored previous browser state")
        else:
            context = browser.new_context()
            page = context.new_page()

            # Perform login
            page.goto("https://example.com/login")
            page.fill("#username", "scraper_user")
            page.fill("#password", "s3cret")
            page.click("#login-button")
            page.wait_for_url("**/dashboard**")

            # Save state after login
            save_playwright_state(context)
            print("Saved browser state after login")

        # Now scrape with authenticated context
        page = context.new_page()
        page.goto("https://example.com/data")
        data = page.content()

        # Save state periodically
        save_playwright_state(context)
        browser.close()

Selenium does not have a built-in storage state mechanism, so you have to extract and restore cookies manually.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import json
from pathlib import Path

SELENIUM_COOKIE_FILE = "selenium_cookies.json"

def save_selenium_cookies(driver, filepath=SELENIUM_COOKIE_FILE):
    """Save Selenium cookies to JSON."""
    cookies = driver.get_cookies()
    Path(filepath).write_text(json.dumps(cookies, indent=2))

def load_selenium_cookies(driver, url, filepath=SELENIUM_COOKIE_FILE):
    """Load cookies into a Selenium driver."""
    if not Path(filepath).exists():
        return False

    # Must navigate to the domain first before setting cookies
    driver.get(url)
    cookies = json.loads(Path(filepath).read_text())

    for cookie in cookies:
        # Remove keys that Selenium does not accept in add_cookie
        cookie.pop("sameSite", None)
        try:
            driver.add_cookie(cookie)
        except Exception as e:
            print(f"Skipped cookie {cookie.get('name')}: {e}")

    driver.refresh()
    return True

The important detail with Selenium is that you must navigate to the target domain before calling add_cookie(). Selenium enforces domain scoping – you cannot set a cookie for example.com while the driver is on about:blank.

Managing cookies well means fewer blocked requests and more reliable scraping.
Managing cookies well means fewer blocked requests and more reliable scraping. Photo by Anastasia Shuraeva / Pexels

A scraper that targets multiple sites needs separate cookie management for each domain. Mixing cookies across domains causes subtle bugs and can leak session tokens to the wrong server.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
class MultiSiteCookieManager:
    """Manage separate cookie state for multiple target domains."""

    def __init__(self, cookie_dir="cookies"):
        self.cookie_dir = Path(cookie_dir)
        self.cookie_dir.mkdir(exist_ok=True)
        self.managers = {}

    def get_manager(self, domain, login_url=None, credentials=None):
        """Get or create a CookieManager for a specific domain."""
        if domain not in self.managers:
            cookie_file = self.cookie_dir / f"{domain.replace('.', '_')}.json"
            self.managers[domain] = CookieManager(
                cookie_file=str(cookie_file),
                login_url=login_url,
                credentials=credentials,
            )
        return self.managers[domain]

    def save_all(self):
        """Persist cookies for all managed domains."""
        for domain, manager in self.managers.items():
            manager.save()

    def get(self, url, **kwargs):
        """Route a request through the correct domain-specific manager."""
        from urllib.parse import urlparse
        domain = urlparse(url).netloc
        manager = self.get_manager(domain)
        return manager.get(url, **kwargs)

Each domain gets its own CookieManager instance, its own cookie file on disk, and its own authentication credentials. The get() method parses the URL to route requests through the correct manager automatically.

Thread Safety: Sharing Cookies Across Concurrent Scrapers

When running concurrent scraping threads or async tasks, the cookie jar becomes a shared mutable resource. Without synchronization, you get race conditions – two threads might detect an expired cookie and both try to re-authenticate simultaneously, stomping on each other’s session.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
import threading
import requests
import logging

logger = logging.getLogger(__name__)


class ThreadSafeCookieManager(CookieManager):
    """A CookieManager that is safe to share across threads."""

    def __init__(self, **kwargs):
        self._lock = threading.Lock()
        super().__init__(**kwargs)

    def authenticate(self):
        with self._lock:
            # Double-check: another thread may have already re-authenticated
            if self.is_valid():
                logger.info("Another thread already refreshed cookies")
                return
            super().authenticate()

    def save(self):
        with self._lock:
            super().save()

    def request(self, method, url, **kwargs):
        # ensure_valid and the actual request must be atomic
        with self._lock:
            self.ensure_valid()

        # The request itself can run without holding the lock
        resp = self.session.request(method, url, **kwargs)

        if resp.status_code in {401, 403}:
            self.authenticate()
            resp = self.session.request(method, url, **kwargs)

        return resp

The double-check pattern inside authenticate() is critical. When multiple threads detect a 401 simultaneously, only the first one should actually log in. The others should see that the cookies are now valid and skip the redundant authentication.

sequenceDiagram
    participant T1 as Thread 1
    participant T2 as Thread 2
    participant CM as CookieManager
    participant S as Server

    T1->>S: GET /data (expired cookie)
    T2->>S: GET /data (expired cookie)
    S-->>T1: 401 Unauthorized
    S-->>T2: 401 Unauthorized
    T1->>CM: authenticate() -- acquires lock
    T2->>CM: authenticate() -- blocks on lock
    CM->>S: POST /login
    S-->>CM: 200 + new cookies
    T1->>CM: releases lock
    T2->>CM: acquires lock, checks is_valid()
    Note over T2,CM: Cookies already refreshed,<br>skips login
    T2->>CM: releases lock
    T1->>S: GET /data (new cookie)
    T2->>S: GET /data (new cookie)
    S-->>T1: 200 OK
    S-->>T2: 200 OK

For asyncio-based scrapers, replace threading.Lock with asyncio.Lock and use aiohttp.ClientSession instead of requests.Session.

Common Issues

Secure Cookies Over HTTP

Cookies marked with the Secure flag are only sent over HTTPS connections. If your scraper is connecting over plain HTTP – for example through a local proxy for debugging – secure cookies will silently disappear from requests.

1
2
3
4
# Check if you are accidentally dropping secure cookies
for cookie in session.cookies:
    if cookie.secure:
        print(f"WARNING: '{cookie.name}' is Secure-only, requires HTTPS")

Domain Scoping

A cookie set for .example.com is sent to www.example.com, api.example.com, and any other subdomain. A cookie set for www.example.com (without the leading dot) is only sent to that exact host. If your scraper authenticates on www.example.com but fetches data from api.example.com, the session cookie might not be included. Iterate over session.cookies and compare each cookie’s domain against the target URL’s hostname to diagnose missing cookies.

Path Matching

A cookie with path=/api is only sent for requests to URLs starting with /api. If the login endpoint sets a cookie with a restrictive path, requests to other paths will not include it. Check cookie.path for any cookie that is not set to "/".

Browsers limit cookies to about 4 KB each and roughly 50 cookies per domain. If a site sets a very large number of cookies, some may be silently dropped. The requests library does not enforce these limits, but upstream proxies or load balancers might.

Putting It All Together

The CookieManager class above already handles every scenario a long-running scraper needs: it loads cookies on startup, detects expiration before and during requests, re-authenticates automatically, and saves state after each login. To build a production scraper, instantiate the manager, add a signal handler to call manager.save() on shutdown, and wrap your main loop around manager.get(). Add the MultiSiteCookieManager when you target multiple domains. Wrap it in ThreadSafeCookieManager when you add concurrency. Swap in Playwright’s storage_state when a full browser is required. The core loop – load, validate, use, renew, save – stays the same regardless of the underlying HTTP client.

The pattern scales to more complex setups, including keeping logins alive across days-long runs. The RENEWAL_BUFFER in the CookieManager ensures re-authentication happens before the cookie actually expires, avoiding the brief window where a request goes out with a cookie that expires between sending the request and receiving the response.

Contact Arman for Complex Problems
This post is licensed under CC BY 4.0 by the author.