Score:0

How to configure a forward proxy to keep a historical mirror of the websites accessed?

I'm scraping information regarding civil servants' calendars. This is all public, text-only information. I'd like to keep a copy of the raw HTML files I'm scraping for historical purposes, and also in case there's a bug and I need to re-run the scrapers.

This sounds like a good use case for a forward proxy such as Squid or Apache Traffic Server. However, I couldn't find a way in their docs to do both of the following:

  • Keep a permanent history of the cached pages
  • Access old versions of the cached pages (think Wayback Machine)

Does anyone know if this is possible? I could mirror the pages using wget or httrack, but a forward cache seems like a better solution because the caching is driven by the scraper itself.

Thanks!

Score:0
  • If the site is available over plain HTTP, this can be done quite simply with Squid plus a small script: the script follows the Squid access log and stores the content of each logged URL somewhere, for example by fetching it with plain old wget.
  • If the site is available only over HTTPS, it is much trickier:
    • In the simple case it is impossible to see what is being accessed. The client opens a CONNECT tunnel through the proxy, so the proxy is only aware of the host it connects to, not the full URL.
    • It is possible to create a so-called transparent (TLS-intercepting) proxy setup, but this requires DNS configuration and a few TLS certificates that the browsers (or, here, the scraper) must trust, or one common CA that signs them.
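
The plain-HTTP approach above could be sketched roughly as follows, with Python standing in for wget. This is only an illustration under several assumptions: the log is in Squid's default native `access.log` format, and the function names and the `host/url-hash/timestamp.html` directory layout are made up for this example. Note that it re-fetches each URL, so the stored copy may differ slightly from what the scraper actually received.

```python
import hashlib
import time
from pathlib import Path
from urllib.parse import urlsplit
from urllib.request import urlopen


def parse_squid_line(line):
    """Return (timestamp, method, url) from a native-format log line, or None."""
    # Native format: time elapsed client code/status bytes method URL ...
    fields = line.split()
    if len(fields) < 7:
        return None
    try:
        timestamp = float(fields[0])
    except ValueError:
        return None
    return timestamp, fields[5], fields[6]


def snapshot_path(root, url, timestamp):
    """Build a per-URL, per-timestamp path (hypothetical layout)."""
    host = urlsplit(url).netloc.replace(":", "_")
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime(timestamp))
    return Path(root) / host / digest / f"{stamp}.html"


def store_snapshot(root, url, timestamp, body):
    """Write one snapshot; older snapshots of the same URL are kept."""
    path = snapshot_path(root, url, timestamp)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(body)
    return path


def archive_from_log(log_lines, root, fetch=None):
    """Store a timestamped copy of every plain-HTTP GET found in log_lines."""
    if fetch is None:
        # Default fetcher: re-download the URL, as wget would.
        fetch = lambda url: urlopen(url, timeout=30).read()
    saved = []
    for line in log_lines:
        parsed = parse_squid_line(line)
        if parsed is None:
            continue
        timestamp, method, url = parsed
        # CONNECT tunnels (HTTPS) only expose host:port, so they are skipped.
        if method != "GET" or not url.startswith("http://"):
            continue
        saved.append(store_snapshot(root, url, timestamp, fetch(url)))
    return saved
```

It could be run periodically as `archive_from_log(open(log_path), archive_root)`, where both paths depend on the local setup. Because every snapshot gets its own timestamped file under a directory derived from the URL, listing that directory gives the history of a page.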