Riak Developer Guidance

The “Client Round Robin Anti-Pattern”

One of the features that is often available in Riak Client software (including the CorrguatedIron .NET Client, the riak-js client and others) is the ability to send requests to the Riak Cluster through a round robin style approach. What this means is each IP, of each node within the Riak Cluster is entered into a config file for the client. The client then goes through that list to send off requests to read, write or delete data in the database.

The client being responsible and knowledgeable about the data tier of the application in an architecture is an immediate red flag! The concept around SoC (Separation of Concerns) dictates that

“SoC is a principle for separating a computer program into distinct sections, such that each section addresses a separate concern.

Having the client provide a network tier layer to round robin communication with the database leaves us in a scenario that should be separated into individual concerns. Below is some basic guidance on eliminating this SoC issue.

  • Client ONLY sends and receives communication: The client, especially in the situation with a distributed system like Riak should only be dealing with sending and receiving information from the cluster or a facade that provides an interface for that cluster.
  • Another layer should deal with the network communication and division of nodes and node communication. Ideally, in the case or Riak, and most distributed systems this should be dealt with at the network device layer (router).
  • The network device (router) layer would ideally be able to have (through software likely) a way to automate the failure, inclusion or exclusion of nodes with the cluster system. If a node goes down, the network device should handle the immediate cessation of communication with that node from all clients, routing the communication accordingly to an active node.
  • The node itself needs to maintain a continual information state available to the network. Ideally the network state would identify any addition or removal of a node and if possible the immediate failure of a node. Of course it isn’t always possible to be informed of a failure, but the first line of defense should start within the cluster itself among the nodes.

The Anti-Pattern

Having the client handle all of these parts of the functional architecture leads to a number of problems, not merely that the guidance of the SoC concept is broken. With the client attempting to track and be aware of the individual nodes in the cluster, it sets the client with a huge responsibility.

Take for instance the riak-js client. If a node goes down the client will need to be aware of which node has gone down. For a few seconds (yes, you have to wait entire seconds at this level) the node will be gone and the client won’t know it is down. The client would just have to reasonably wait. When the communication times out, the client would then have to have the responsibility of marking that particular node as down. At this point the client must track which node it is in some type of data repository local to the client. The client must also set a time or some way to identify when the node comes back up. Several questions start to come up such as;

  • Does the client do an arbitrary test to determine when the node comes back up?
  • When the node comes back up is it considered alive or damaged?
  • How would the client manage the IP (or identifier) of the node that has gone down?
  • How long would the client store that the node is down?

The list of questions can get long pretty quick, thus the bad karma of not following a good practice around separating your concerns appropriately! One has to be careful, a god class might be right around the corner otherwise! That’s it for this quick journey into some distributed database usage guidelines. Until next, happy data sciencing.  ;)

11 thoughts on “Riak Developer Guidance

    • You can use an IP address straight or you could use a cname style reference, so that you don’t have to worry about changing IP addresses. The later being the generally preferred approach.

      • There’s a tradeoff, however: using DNS introduces another point of failure, with the benefit of making infrastructure changes easier.

        In most scenarios, DNS being broken will lead to random failures anyway, but in a large enough environment partial DNS failures aren’t nearly as uncommon as one would like.

  1. I’m inclined to agree that forcing the client to be aware of cluster changes is a bad idea. If nothing else, I think it’s safe to say that software like haproxy is savvier at responding to changing network conditions than any Riak client code is likely to be.

    However, as with DNS in another comment thread, using a load balancer does introduce another point of failure, so ultimately you flips a coin and takes your chances.

    • Valid point and in large part it is a trade off. I absolutely agree with that and understand the implications. In my entry I primarily am drawing the point that if one wants clean SoC then they’ll need to toss the client responsibilities and put them in another concern elsewhere in the architecture.

      Also it seems that this is a standard situation of “it depends”. However every scenario needs a DNS server somewhere, so that isn’t going away. I’m merely advocating that the DNS & HaProxy and whatever tools are used for what they’re good at and the client does what it is good at, which is be a client and communicate back and forth.

      …and thanks for your input John, keep rocking the Riak!

      • I have used load balancing routers and proxies to manage cluster availability as well and for most cases I think it is the preferable method. The load balancing routers I have worked with were made redundant without much fuss, but this was within the same data center. In one case where I had a cluster that spanned two data centers (which should be unnecessary in most cases) I used a third party DNS service that offered service monitoring with inclusion/exclusion management, which also worked reasonably well. The third party DNS service was redundant and we never suffered from outages, so it was essentially as reliable as our internet connections. The advantage of the load balancing router was that the fail-over latency was much lower than with a DNS approach. I agree that SoC is a Good Thing even within distributed systems, but the biggest troubles with the “Client Round Robin Anti-Pattern” that you pointed out are the monitoring and inclusion/exclusion management concern. Assuming that DNS is reliable enough, allowing a client to chose which server to use based on a DNS entry is not as troublesome. In any case though, acute intermittent failures will always happen. So with user facing services, having a pleasant and intuitive means for them to eventually achieve their goal in spite of partial system failure can be just as important as the back-end systems design choices you make. Target a pleasant and intuitive UX with explicit SLAs and failure as a feature, then chose an architecture that meets those requirements with the least amount of fuss.

  2. I went through a similar thing recently w/ a Rabbit cluster. Ended up going the load balancer route. It can maintain all the state of who is up or down for the clients, and the cluster itself can handle all it’s clusterosity. A load balancer is a single point of failure, but you *can* make them redundant if you want to go that extra mile.

    It just felt dirty to make the client library have to have this really complicated state logic in it. Seems like that should be a level up or down from “talk to this thing” client library responsibility.

    Probably not the hugest deal in the world though if you’re not experiencing any pain from it.

  3. Some of my comment is going to reiterate some of what other people have said. But anways….

    (First I should say, I’m not distinguishing between round-robin and weighted-round-robin. I’m assuming you probably aren’t either. But, just to be explicit, I’ll just use “round-robin” to mean *either* and *both*. So, getting to it….)

    If I were on the end of designing the *server end* of things, I’d want to allow for the flexibility to allow any of the round-robin arrangements.

    Let the clients decide if they want to manually do round-robin themselves, do the round-robin in DNS, have some part of the server system do the round-robin, etc.

    *Just because I’m not creative enough to think of ways that clients may want to use this stuff, doesn’t mean there aren’t good reasons for them to want them.*

    So, my preference is for the server support the different options.

    Although, out-of-the-box, you’d want the default set up for the server to be what’s “best” for most people. While also having the goal of making it as easy as possible for a novice to get up and running.

    Of course, if I’m dealing with the ops site of things, I’m probably going to want to push the round-robin stuff away from the client application. It would be better if ops could change the server setup without requiring a code change and and code push to production.

    From an ops point of view, I’d want to separate the two. (And would be a fan of SoC.)

    Another thing to consider in monitoring.

    For example, how is monitoring of the utilization of the server nodes going to happen?

    If the round-robin was done in the client application, then I could send messages to StatsD from there (pretty easily). Could monitor errors with this too.

    How do you do this with DNS round-robin?! You’d probably need to add a bit more complexity to your client application. You’d have to resolve the domain name yourself (to get the IP address) and then tell StatsD what IP address (and thus what node) you got. (Suppose you could write your own DNS server too, and put the StatsD messaging in there.)

    With a server end round-robin strategy the client might be SOL if the server software doesn’t allow for ways to hook in StatsD monitoring (or whatever monitoring system the ops team wants to use).

    (Note, you can switch out StatsD and switch in whatever monitoring tool you like. I’m just saying StatsD because I use it.)

    So, to sum it up… I think your SoC line of thinking is along the lines of what I’d do too. BUT, some others might have some odd set up or some non-typical needs, so if possible to support alternatives (without it causing a code-maintenance nightmare for you) then that’s probably better.

  4. (Disclaimer: I designed the new Java client for Riak)

    The one thing you’re somewhat missing is that regardless of design principles, people (e.g. paying customers) want it in the client.

    The current Java client for Riak was written not to offer load balancing in any way, then due to user demand it was bolted on top. Now it’s the worst of both worlds; a feature people want that doesn’t work well. We end up telling people to “Use HAProxy” which in many cases flies about as well as a lead balloon. Idealism meets reality.

    In a perfect world? Sure. Buy yourself a big ‘ol load balancer from F5 (Oh wait, you’ll need two for redundancy) and let it do the heavy lifting. In the real world, people often don’t want to have to build/buy the space shuttle right out of the gate. Or they may never want to if they can avoid it / don’t have a need.

    The new 2.0 version of the Java client? It’ll do node selection/management for you if you’d like. It’s also separated into it’s own class and is injectable meaning an end user could even write their own. And of course if someone wants to forgo it and simply use an external LB … they can do that too.

    In the end, I can’t find much of a compelling argument to not offer something our users have asked for due to a design principle; “Sorry, can’t do that … because … reasons”.

    I’d include a car analogy about how rear wheel drive is better than front due to separation of concerns yet everyone thinks front wheel is better because for them it’s easier to drive … but I won’t :-D

    • Hey Roach. Thanks for commenting. I totally agree that if a client wants something, especially if they want to pay for it in some way, then by all means provide. In that same vein, I’ve seen enterprises buy projects that they damn well knew would turn into a hodge podge of bad design decisions and anti-patterns – often going even further and turning into software death marches. So even though I’d totally provide a company the tools to shoot themselves, I’d still advise them against doing it. ;)

      Also, I’d add a comment, if I were designing the clients myself. I’d have made the same decisions – solely because of my above statement and the rare occasion that it actually makes sense in some wierd way to not have clear separation of concerns. I’ve seen it be a completely valid reason to tear apart sperations in agency work. The application needs built, there is no real maintenance ever done, and in that situation a company NEEDS to have a highly scalable solution and having the client manage the round-robin balancing is often the fastest way to get something working.

      But then of course… agencies aren’t exactly known for building long standing, long life software. It gets done, it gets shipped, then it gets replaced. Since this article was about design patterns, principles, etc – thus my point of calling a client having the intimate knowledge and doing the round-robin for an app as an anti-pattern.

      In an agency environment, it’d likely just be a pattern. :D

      • Yeah, I mean … in the end, the “get something working now” either has merit/value or it’s a highway to hell. I’ve worked at a place where they pretended to “be agile” except it was really “Produce crap then move on to the next shiny” – apparently that whole “iteration” thing got lost somewhere.

        That said, the other argument for the client to have this logic is client-side routing in a distributed system.

        This is something we are looking to explore next year. It’s also something that some other distributed databases do in their clients; if the client can hash then route to the appropriate node you eliminate a hop, reduce latency and node workload. You can also automatically react to nodes being added/removed if the cluster publishes that info to the client.

        The notion of having “smart clients” is not without merit. Yes, once again there’s a lot of things to deal with (vnode migration, etc) but it’s something that a generic load balancer simply can’t do.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s