Tuesday, April 11, 2017

Majordodo and fine-grained resource allocation

Majordodo is a distributed resource manager whose primary purpose is to manage thousands of concurrent tasks across a cluster of Java Virtual Machines.
It has been designed with multi-tenancy as its primary goal: it provides a sophisticated facility for allocating resources by distributing tasks all over the cluster.

A typical Majordodo cluster is made of these major components:
- A ZooKeeper ensemble which is used for service discovery and leader election
- A BookKeeper ensemble which provides a distributed write-ahead commit log
- A set of Brokers, which actually allocate resources and handle the state of each task
- A set of Workers, which run the tasks and share their local resources
- A set of clients, which submit tasks to the Brokers
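
As a concrete illustration of what a client hands over to a Broker, here is a minimal, self-contained sketch in Java; the Task class and its fields are made up for this example and are not the real Majordodo data model or client API.

// Illustrative sketch of what a task submission carries in a Majordodo-style
// cluster. The Task type below is a placeholder, not the real Majordodo model.
public final class Task {
    final String taskType;   // e.g. "pdf-rendering", tells Workers which executor to use
    final String userId;     // the tenant submitting the task (drives multi-tenancy limits)
    final String parameters; // opaque payload handed to the executor on the Worker

    Task(String taskType, String userId, String parameters) {
        this.taskType = taskType;
        this.userId = userId;
        this.parameters = parameters;
    }

    public static void main(String[] args) {
        // Conceptually a client-side submission just hands this descriptor to any
        // Broker; the Broker logs it on BookKeeper and schedules it on a Worker.
        Task task = new Task("pdf-rendering", "tenant-42", "{\"documentId\": 12345}");
        System.out.println("Would submit '" + task.taskType + "' for " + task.userId);
    }
}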



Resources

Usually a Majordodo cluster will run many different types of tasks (CPU-intensive, RAM-bound, with temporary local disk usage requirements) on very heterogeneous machines: slow/fast disks, more/less RAM, many/few CPUs, different network connectivity and so on.

Apart from these host-specific resources, we usually have to deal with datasources: for instance, on each JVM we have a single pool of connections per datasource, and on each SQL database (PostgreSQL) server there is a global limit on concurrent connections to be respected; the same applies to HBase regions.

The main idea is to allocate tasks so as to leverage most of the resources on each specific machine while respecting global and local limits on local and remote resource usage.
All of this has to be seen in the light of multi-tenancy: every 'logical' user of the system has a limited quantity of resources to use in a given time interval. This way you can have users with guaranteed resources and you can be sure that a single user will not saturate the cluster (or a single resource) with its own tasks.
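
To make this concrete, here is a small self-contained sketch, not Majordodo code, of the kind of bookkeeping a Broker has to perform: each resource carries a maximum number of concurrent tasks, and a task may only start when every resource it touches is below its limit.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of per-resource limits: the resource names and limits
// below are invented examples, not Majordodo classes or configuration.
public class ResourceLimits {

    private final Map<String, Integer> maxPerResource = new HashMap<>();
    private final Map<String, Integer> inUse = new HashMap<>();

    public void configure(String resource, int maxConcurrentTasks) {
        maxPerResource.put(resource, maxConcurrentTasks);
    }

    // Returns true and books the resources if the task can start right now.
    public synchronized boolean tryAcquire(List<String> resourcesForTask) {
        for (String r : resourcesForTask) {
            int limit = maxPerResource.getOrDefault(r, Integer.MAX_VALUE);
            if (inUse.getOrDefault(r, 0) >= limit) {
                return false; // one of the resources is saturated, keep the task queued
            }
        }
        for (String r : resourcesForTask) {
            inUse.merge(r, 1, Integer::sum);
        }
        return true;
    }

    public synchronized void release(List<String> resourcesForTask) {
        for (String r : resourcesForTask) {
            inUse.merge(r, -1, Integer::sum);
        }
    }

    public static void main(String[] args) {
        ResourceLimits limits = new ResourceLimits();
        limits.configure("postgresql-main", 50);  // global connection budget for the DBMS
        limits.configure("hbase-region-7", 10);   // per-region limit

        boolean started = limits.tryAcquire(List.of("postgresql-main", "hbase-region-7"));
        System.out.println("Task can start: " + started);
    }
}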

Configuring real-time resource usage constraints with Majordodo

Each user of Majordodo is usually bound to a set of resources, for instance its tasks will work on a specific database or SAN. At any time this binding can change, due to other 'external' resource managers that decide to fail over a DBMS to other machines, or SLA managers that request to route tasks of some kind of user to 'faster' machines.

The primary settings for Majordodo configuration are:
- the user-resource mapping function on the brokers
- the configuration of each worker

You have to provide a function which tells the Broker which resources a task is going to use if it gets executed right now: this way, even if the actual resources dedicated to a user change after the submission of a task, we can still count on their correct allocation.
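
The sketch below illustrates the idea with a plain Java function; the names and the BiFunction shape are assumptions made for this example, the only point being that the mapping is evaluated at scheduling time rather than at submission time.

import java.util.List;
import java.util.function.BiFunction;

// Sketch of a user -> resources mapping function evaluated by the Broker at
// scheduling time. Names are illustrative, not the Majordodo API.
public class UserResourceMapping {

    public static void main(String[] args) {
        // The function answers: "which resources would this task use if it
        // started right now?". Because it is evaluated at execution time, a
        // DBMS failover or an SLA-driven re-routing is picked up automatically.
        BiFunction<String, String, List<String>> resourcesFor = (userId, taskType) -> {
            String database = currentDatabaseFor(userId); // looked up now, not at submit time
            return List.of(database, "san-volume-" + userId);
        };

        System.out.println(resourcesFor.apply("tenant-42", "pdf-rendering"));
    }

    private static String currentDatabaseFor(String userId) {
        // In a real deployment this would consult whatever directory tracks
        // which database instance currently serves the tenant.
        return "postgresql-replica-2";
    }
}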

On each Worker you are going to configure the following (a sketch follows the list):
- the maximum number of concurrent tasks overall
- the maximum number of concurrent tasks for each task type
- a priority for each group of users
- a description of the local resources and the maximum number of tasks using them
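
Here is the sketch mentioned above, a self-contained illustration of those four knobs; the field names and values are invented for the example and are not the real Majordodo configuration keys.

import java.util.Map;

// Illustrative Worker-side configuration: every name and number below is an
// example value, not an actual Majordodo setting.
public class WorkerConfigSketch {

    int maxConcurrentTasks = 20;                      // overall cap for this JVM
    Map<String, Integer> maxTasksPerType = Map.of(
            "pdf-rendering", 5,                       // CPU heavy, keep it low
            "mail-dispatch", 15);                     // mostly I/O bound
    Map<String, Integer> userGroupPriorities = Map.of(
            "premium", 10,                            // premium tenants win ties
            "standard", 1);
    Map<String, Integer> maxTasksPerLocalResource = Map.of(
            "local-scratch-disk", 3,                  // temporary disk space is scarce here
            "gpu", 1);

    public static void main(String[] args) {
        WorkerConfigSketch config = new WorkerConfigSketch();
        System.out.println("Overall cap: " + config.maxConcurrentTasks);
    }
}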

The Worker-side configuration is also dynamic and can be changed at runtime without a restart (which would in turn require failing over all the tasks running on that worker).

One note: all of the dynamic reconfiguration features are currently available when using "Embedded Majordodo", that is when you run the Broker and the Worker inside JVMs of your own which supply all of the callbacks to the system.
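
As a purely illustrative sketch of that style, assume a hypothetical EmbeddedWorker that reads its limits through a callback supplied by the hosting JVM; changing what the callback returns reconfigures the worker at runtime, without a restart.

import java.util.function.Supplier;

// Conceptual sketch of the "embedded" style: the hosting JVM supplies a callback
// that the Worker consults. EmbeddedWorker is a stand-in, not the Majordodo class.
public class EmbeddedReconfigurationSketch {

    static class EmbeddedWorker {
        private final Supplier<Integer> maxConcurrentTasks;

        EmbeddedWorker(Supplier<Integer> maxConcurrentTasks) {
            this.maxConcurrentTasks = maxConcurrentTasks;
        }

        int currentCap() {
            // Re-evaluated every time the Worker decides whether to accept a task,
            // so a new value takes effect without failing over running tasks.
            return maxConcurrentTasks.get();
        }
    }

    public static void main(String[] args) {
        final int[] cap = {20};
        EmbeddedWorker worker = new EmbeddedWorker(() -> cap[0]);
        System.out.println("Cap: " + worker.currentCap()); // 20
        cap[0] = 5;                                        // "reconfigure" at runtime
        System.out.println("Cap: " + worker.currentCap()); // 5
    }
}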