Score:1

Website blocks my requests from a Linux Ubuntu server

us flag

I'm a Java engineer with zero DevOps experience. Lately I have been playing around with an Ubuntu Linux server for the first time, running my Selenium project in Docker, and I ran into this problem:

I am trying to scrape HTML from a website, but my requests are blocked and I get a 403 Forbidden response. I tried to curl the same website and got the same response.

Furthermore, I only get blocked on my Linux machine. Everything works in my local dev environment with the same Docker image, which is why I think it's the "server's fault".

Any ideas what my Linux server is missing here? Maybe I don't have some sort of certificate, or I have a CORS problem? What can I try? (For learning purposes only.)

curl call here

in flag
Pass the web browser, your curl command, and your Java app through a proxy like mitmproxy and check the requests, especially the headers. I am sure you will see the differences that cause the web server to send different responses.
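Once the requests have been captured, the header comparison suggested above can be done by eye or with a few lines of Java. The following is only a minimal sketch: the class name `HeaderDiff`, the method `diff`, and the sample header values are hypothetical, and it assumes the headers from each capture have already been copied into plain maps.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class HeaderDiff {
    /** Returns header name -> "browser value | client value" for every header that differs. */
    public static Map<String, String> diff(Map<String, String> browser, Map<String, String> client) {
        // HTTP header names are case-insensitive, so compare them that way.
        Map<String, String> a = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        Map<String, String> b = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        a.putAll(browser);
        b.putAll(client);

        Set<String> names = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        names.addAll(a.keySet());
        names.addAll(b.keySet());

        Map<String, String> result = new LinkedHashMap<>();
        for (String name : names) {
            String left = a.getOrDefault(name, "<missing>");
            String right = b.getOrDefault(name, "<missing>");
            if (!left.equals(right)) {
                result.put(name, left + " | " + right);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample values; replace with the headers seen in the proxy.
        Map<String, String> browser = Map.of(
            "User-Agent", "Mozilla/5.0 (example browser UA)",
            "Accept-Language", "en-US,en;q=0.9",
            "Accept", "*/*");
        Map<String, String> curl = Map.of(
            "user-agent", "curl/8.0.0",
            "accept", "*/*");
        diff(browser, curl).forEach((k, v) -> System.out.println(k + ": " + v));
    }
}
```

Headers that are missing on one side show up as `<missing>`, which is often the quickest hint about what the server treats as non-browser traffic. Note that headers are not the only signal a site can use; the source IP range of a hosted server may matter as well.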
cn flag
Bob
Not really on topic for ServerFault; getting Selenium and curl commands to work is more of a StackOverflow question. But most likely the site tries to detect scrapers and uses mechanisms like cookies and sessions to identify real interactive users/browsers.
us flag
@Bob I would say it's ServerFault, because it works on my local machine with the same Docker image.
us flag
@Robert appreciate your suggestion, I'm going to investigate and update this question.
in flag
Just being the server's fault doesn't make it on topic for ServerFault. If this is your server you are trying to scrape, provide your server configuration and log files and we can try to help you. If this is not your server, it's off topic here. And in that case, I'd stop doing what you are doing. Right now you are just getting a 403; the next notice might be from a lawyer.
us flag
As I mentioned, I'm a total noob at this, and I can provide any config files you think could help. Basically, at this point, I don't know what I don't know. I had no idea this could be illegal, but I don't think a few calls a day could lead to these consequences; I don't have a server running and spamming calls. I'm definitely more cautious now and will do my research about this too. I would also like to mention that my main purpose is to learn through practice, and I don't have any goal here other than understanding "how I'm being recognized and blocked". Thanks
Score:1
cn flag

I believe you're getting rate-limited or blocked by the website. If I run the same curl command from my laptop, I get the webpage back.

Remember to respect robots.txt if you're doing web scraping.
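As a rough illustration of what checking robots.txt can look like, here is a simplified Java sketch. The class `RobotsCheck` and its methods are hypothetical names, and the parser is deliberately incomplete: it only reads `Disallow` prefixes from groups whose `User-agent` is `*`, and ignores `Allow` rules, wildcards, and groups that list several user agents. A real crawler should use a full robots.txt implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class RobotsCheck {
    /** Collects Disallow path prefixes from groups that apply to all agents ("*"). */
    public static List<String> disallowedForAll(String robotsTxt) {
        List<String> rules = new ArrayList<>();
        boolean inWildcardGroup = false;
        for (String raw : robotsTxt.split("\\R")) {
            String line = raw.replaceAll("#.*", "").trim(); // strip comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                inWildcardGroup = value.equals("*");
            } else if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                rules.add(value);
            }
        }
        return rules;
    }

    /** True if no collected Disallow prefix matches the start of the path. */
    public static boolean isAllowed(String robotsTxt, String path) {
        for (String prefix : disallowedForAll(robotsTxt)) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

With the file contents fetched separately, a call such as `RobotsCheck.isAllowed(robotsTxt, "/some/page")` gives a first indication of whether the site asks crawlers to stay away from that path.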

us flag
I did not know about robots.txt; great find, thanks. I had no idea about rate limiting either, but I don't think that's the case here, because the very first call after deployment was already blocked.