
How do we prevent accidental Graylog denial of service problems without multiple Graylog instances?


Our original problem

Last year we had a problem where a rogue piece of software on one server spammed our central Graylog Server with so many messages that it caused problems for other applications.

The main symptom was that older, useful messages from other applications were purged earlier than normal, because the index filled up with useless messages from the rogue application.

My suggested fix was to give each application its own index, so no application could starve any other application of log storage space. This would not need any changes to the applications themselves, only changes inside Graylog. Nothing was done, however, as a new Kubernetes-based Graylog solution was being planned.
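For illustration, the per-application index idea boils down to routing each message to an index chosen from a source field (in Graylog terms, streams mapped to separate index sets, each with its own retention). A minimal Python sketch of that routing decision; the `application_name` field and index naming scheme are assumptions, not a Graylog API:

```python
# Hypothetical sketch: pick an index per application, so one noisy
# application can only ever fill (and rotate) its own index.
DEFAULT_INDEX = "graylog_default"

def index_for(message: dict) -> str:
    """Return the index name a message should be written to.

    `application_name` is an assumed GELF field; adapt to your schema.
    Messages without it fall back to a shared default index.
    """
    app = message.get("application_name")
    return f"graylog_{app}" if app else DEFAULT_INDEX
```

In real Graylog this mapping is done declaratively (stream rules or pipeline rules routing into per-application index sets), but the effect is the same: retention is partitioned per application instead of shared.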

The solution we were offered

Fast forward to today: our replacement Graylog system is now being commissioned.

Initially we were told that every application would have its own independent Graylog stack (load balancer, GELF endpoints, Graylog nodes, Elasticsearch cluster) and Graylog front-end website.

The problem is that there are complex relationships between applications, and having to visit different Graylog web servers for different applications' logs (graylog-application1.site for logs from application1 and graylog-application2.site for logs from application2, rather than just going to graylog.site) was going to make cross-application searches really difficult.

The revised solution

After this was pointed out, a revised solution was proposed: group applications by how likely they are to need to be searched together. We now expect to be given separate Graylog servers per group (application-group-a.site for applications 1 and 3, application-group-b.site for applications 2, 4 and 5, etc.).

I wonder, though, whether this is either necessary or sufficient.

It will make many of the likely cross-application searches easy. However, some of the hardest support problems to solve are those that cross less obvious application boundaries, and those searches will no longer be as easy (and may in fact be impossible if you don't know which applications are involved).

People have argued that separate indices on a single central Graylog server do not provide sufficient isolation between application groups. They want to be sure that message ingress from one application can never interfere with message ingress from another, so they want complete isolation between groups.

My problem with this is that it wouldn't help when an application within a group goes rogue and spams that group's Graylog server. If we can find a solution that prevents denial of service within a group, we would also have a solution for a single centralised Graylog service.
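One way to get that within-group (and therefore whole-service) isolation is per-application ingest quotas, e.g. a token bucket per source, so a rogue application exhausts only its own quota and other applications' messages keep flowing. A minimal Python sketch of the idea; the class, rates, and `accept()` helper are illustrative assumptions, not a Graylog feature:

```python
import time

class TokenBucket:
    """Token bucket: allows `rate` messages/sec with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now=time.monotonic):
        self.rate = rate          # refill rate, tokens per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start full
        self.now = now            # injectable clock, for testing
        self.last = now()

    def allow(self) -> bool:
        """Consume one token if available; otherwise reject the message."""
        t = self.now()
        self.tokens = min(self.capacity, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per application: a rogue application can only drain its own
# quota, so other applications' ingest is unaffected.
buckets: dict = {}

def accept(app: str, rate: float = 100, capacity: float = 200) -> bool:
    bucket = buckets.setdefault(app, TokenBucket(rate, capacity))
    return bucket.allow()
```

In practice this throttling could live in the load balancer, in a log shipper in front of the GELF endpoints, or (partially) via per-application index sets with their own rotation and retention; the point is that quota enforcement, not physical separation, is what provides the isolation.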

I would argue that scaling a single central service horizontally, with more load-balanced Graylog nodes, more Elasticsearch nodes, more GELF endpoints, etc., would be a better solution than running dozens of Graylog servers.

Questions

  • Would separate Graylog servers actually provide the level of isolation (denial of service mitigation) that people seem to want, when it is all hosted on the same Kubernetes cluster?

  • Can we provide a similar or better level of isolation with a single central Graylog server than with separate per-group Graylog servers?

  • Do other organisations use Graylog in this way, with many front-end websites, or would a single central website to access all logs be expected?

Essentially I'm looking either to convince myself that I'm worrying about nothing and that this solution is common, or for arguments to convince people that what we are considering is contrary to best practice and we really shouldn't be doing it.

I would really like to find a solution that works for everyone, but it seems to me at the moment that we are rather throwing the baby out with the bath water with our currently proposed solution.

HBruijn:
IMHO the main problem you tackle with any log aggregation solution is that you will get all your events in one place to facilitate searching and event correlation. Creating any number `X >1` unique and separate log aggregation environments, one for each application (almost) completely defeats that purpose. Rather than getting a single pane of glass you will still need to look in `X-1` too many places to get to a complete picture.
That is exactly the point I keep trying to make @HBruijn, but people seem to think they need the isolation for spurious "what if another rogue applications spams the logging system" reasons. Hence this question.
By the way, if anyone thinks there is a better Stack Exchange for this question than here, I would appreciate the suggestion.

