wget: Edit broken links while crawling before visiting them

Situation:
I want to mirror an old website hosted at https://example.com/website/. The site uses absolute links to http://www.example.com/website/.

Problem:
For whatever reason, wget cannot reach https://www.example.com (the www. subdomain): the connection just times out. I have no idea why, since it works fine in the browser. curl cannot reach it either.

Possible solutions:

  • Have wget rewrite the links before following them while it's still crawling.
  • Make wget work with the www. subdomain.

To try to make www. work, I already set an Accept header and a Firefox user agent: --header="Accept: text/html" --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0" but that did not work.

So I somehow need to rewrite the links on that website while crawling.
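
As an illustration of the kind of rewrite meant here, a minimal Python sketch follows. It is not a wget feature; it assumes a custom crawler that extracts each link itself and normalizes it before fetching. The names `HOST_REWRITES` and `rewrite_link` are hypothetical, and forcing the scheme to https is an assumption based on the site being served over https.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical mapping for this question: unreachable host -> reachable host.
HOST_REWRITES = {"www.example.com": "example.com"}


def rewrite_link(url: str) -> str:
    """Swap the host per HOST_REWRITES and force https; leave other URLs alone."""
    parts = urlsplit(url)
    host = parts.hostname  # lowercased by urlsplit, None for relative links
    # Leave relative links, unknown hosts, and URLs with an explicit port untouched.
    if host is None or host not in HOST_REWRITES or parts.port is not None:
        return url
    return urlunsplit(
        ("https", HOST_REWRITES[host], parts.path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    print(rewrite_link("http://www.example.com/website/page.html"))
```

A crawler would call `rewrite_link` on every extracted href before queueing it, so the www. host is never contacted.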

Not possible with pure wget. Find out why it times out.
So are the links to `https:` or `http:` URLs? You are talking about both.
I have no idea how to find out why www. does not work. The wget/curl debug output gives no hint. The links are to http:, but that does not really matter since HSTS enforces https:. The server works fine over https, including on the www. subdomain. If I run the same wget command from my home PC, it downloads everything as expected. In my question I run wget from my server, but it is not an IP block either, because the non-www. host works (I usually crawl `https://example.com/site/` without issues).
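
One way to narrow this down is to test each address the hostname resolves to separately. The sketch below is only a diagnostic aid: it assumes the timeout might be specific to one address or address family (for example IPv6 versus IPv4), which nothing in this thread confirms. The function name `probe` is hypothetical.

```python
import socket


def probe(host: str, port: int = 443, timeout: float = 5.0):
    """Try a TCP connect to every address `host` resolves to.

    Returns a list of (family_label, address, outcome) tuples.
    """
    results = []
    for family, socktype, proto, _canon, sockaddr in socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    ):
        label = "IPv6" if family == socket.AF_INET6 else "IPv4"
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            outcome = "ok"
        except OSError as exc:  # includes timeouts and refused connections
            outcome = f"failed: {exc}"
        results.append((label, sockaddr[0], outcome))
    return results
```

Running `probe("www.example.com")` and `probe("example.com")` on the server and comparing the results would show whether the two names resolve to different addresses and whether only some of them time out.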