elastisys cloud pool REST API¶

All elastisys cloud pool endpoints are required to publish the REST API described below.

The primary purpose of the cloud pool API is to serve as a bridge between an autoscaler and a certain cloud provider, allowing the autoscaler to operate in a cloud-neutral manner. As such, it focuses on primitives for managing a dynamic collection of machines.

All API methods assume a Content-Type of application/json.

The REST API should be made available over secure HTTP (HTTPS).

Terminology and machine state model¶

Cloud providers differ in how they refer to the computational resources they provide. Some common terms are instances, servers and VMs/virtual machines.

The cloud pool API strives to be as cloud-neutral as possible and simply refers to the computational resources being managed as machines. The logical group of machines that a cloud pool manages is referred to as its machine pool.

machine state¶

A cloud pool needs to be able to report the execution state of its machine pool members in a cloud-neutral manner (see Get machine pool). Since cloud providers differ quite a lot in the state models they use, the cloud pool needs to map the cloud-native state of the machine to one of the machine states supported by the cloud pool API. These states are described in the machine state table below.

Machine state	Description
`REQUESTED`	The machine has been requested from the underlying infrastructure and the request is pending fulfillment.
`REJECTED`	The machine request was rejected by the underlying infrastructure.
`PENDING`	The machine is in the process of being launched.
`RUNNING`	The machine is launched. However, the boot process may not yet have completed and the machine may not be operational (the machine’s service state may provide more detailed state information).
`TERMINATING`	The machine is in the process of being stopped/shut down.
`TERMINATED`	The machine has been stopped/shut down.

The diagram below illustrates the state transitions that describe the lifecycle of a machine.

The PENDING and RUNNING states are said to be the started machine states. Machines in a started state are executing. However, just because a machine is executing doesn’t necessarily mean that it is doing useful work. For example, it may have failed to properly boot, it may have crashed or encountered a fatal bug.

So the machine state is the execution state of the machine, as reported by the cloud API, which really only tells us if a particular pool member is started or not. To be able to reason about the health of a pool member, each machine’s metadata carries two additional state fields – the membership status and the service state.

These states are intended to be set by external means, such as by a human operator or an external health monitoring service. A cloud pool is required to be ready to receive state updates these fields (see Set membership status and Set service state) for the machines its pool and to include those states on subsequent queries about the pool members (Get machine pool).

membership status¶

The membership status is used to indicate to the cloud pool that a certain machine needs to be given special treatment. The membership status can, for example, be set to protect a machine from being terminated (by setting its evictability) or to mark a machine as being in need of replacement (by setting its activity flag). This allows us, for example, to isolate a failed machine for further inspection and to provision a replacement to sustain sufficient capacity. It also allows us to have “blessed”/”seed” pool members that may not be terminated. See the Set membership status method for more deatils.

The active and evictable fields of the membership status can be combined according to the table below to produce four main membership states:

	active	not active
evictable	default	disposable
not evictable	blessed	awaiting service

default: a machine that is both an active and evictable group member.

blessed: a machine that is a permanent pool member that cannot be evicted. This can, for example, be used to include reserved machine instances in the pool.

awaiting service: a machine that is in need of service. The machine is to be replaced and should be kept alive for troubleshooting.

disposable: a machine that is non-functioning and should be replaced and terminated.

At any time, the active size of the cloud pool should be interpreted as the number of allocated machines that have not been marked with an inactive membership status. That is, all machines in one of the machine states REQUESTED, PENDING, or RUNNING and not having a membership status with active set to false.

service state¶

There are cases where we need to be able to reason about the operational state of the service running on the machine. For example, we may not want to register a running machine to a load balancer until it is fully initialized and ready to accept requests, and we may want to unregister unhealthy machines. To this end, a cloud pool may include a service state for a machine. Whereas the machine state should be viewed as the execution state of the machine, the service state should be viewed as the operational health of the service running on the machine. Service states have no semantic implications to the cloud pool. They should be regarded as informational “marker states” that may be used by third party services (such as a load balancer).

The range of permissible service states are as follows:

Service state	Description
`BOOTING`	The service is being bootstrapped and may not (yet) be operational.
`IN_SERVICE`	The service is operational and ready to accept work (health checks pass).
`UNHEALTHY`	The service is not functioning properly (health checks fail).
`OUT_OF_SERVICE`	The service is unhealthy and has been taken out of service for troubleshooting and/or repair.
`UNKNOWN`	The service state of the machine cannot be (or has not yet been) determined.

See the Set service state method for more deatils.

Failure handling¶

Things fail in distributed systems. At different scales (network partitions, data center outage, cloud provider API errors, cloud pool misconfigurations, cloud pool bugs, etc) and on different time-scales (lasting for a single call, a few minutes to several hours). Since we live in a failure-prone world, cloud pool implementers are encouraged to temporarily mask cloud API failures to avoid prematurely panicking and propagating errors to upstream components.

In practice, a combination of measures can be taken to make a cloud pool less sensitive to cloud API errors:

API calls can be retried a few times on failure (for example, with an exponential back-off algorithm). For example, cloud provider API calls could be retried a few times to guard against one-off API errors or short-lived error conditions.

The cloud pool can be written to operate in an eventually consistent manner, periodically retrieving state data from the cloud API and storing it locally, responding to clients with the most recently observed state rather than synchronously trying to re-fetch the state from the cloud API on every invocation. The cloud pool should strive to provide “sufficiently up-to-date” data. At times when the cloud API is unreachable, it may suffice to serve somewhat dated data for a limited time, until fresh data can be retrieved again. When serving cached data, it is important to timestamp-mark the data so that the receiver of the data can determine if the data is fresh enough.

Some operations are more suitable than others for this type of semantics (Get machine pool, Get pool size, Set desired size). Operations that are less suitable for this type of delayed semantics includes operations that act on individual machines (Terminate machine, Attach machine, Detach machine, Set service state, Set membership status). For such operations, it makes more sense to immediately try to “write through” to the cloud API and respond with an error on failure.

Operations¶

Set configuration¶

Method: POST /config

Description: Sets a new configuration for the cloud pool.

The configuration is a JSON document whose appearance depends on the particular cloud pool implementation.

This operation will not change the cloud pool’s started state – if the cloud pool had been started (see Start) it will remain started, and if it was in a stopped state it will remain stopped.

Input: A JSON document with a configuration that follows the schema of the particular cloud pool implementation.

Output:

on success: HTTP response code 200.

on invalid input (for example, if the cloud pool fails to validate the configuration): HTTP response code 400 (Bad Request) with an Error response message.

cloud API errors: HTTP response code 502 (Bad Gateway) with an Error response message

other errors: HTTP response code 500 (Internal Server Error) with an Error response message

Get configuration¶

Method: GET /config

Description: Retrieves the configuration currently set for the cloud pool (if any).

The configuration is a JSON document whose appearance depends on the particular cloud pool implementation.

Input: None

Output:

HTTP response code 200 with a configuration JSON document on success.

HTTP response code 404 (Not Found) if no configuration has been set.

On error: HTTP response code 500 (Internal Server Error) with an Error response message

Start¶

Method: POST /start

Description: Starts the cloud pool.

This will set the cloud pool in an activated state where it will start to accept requests to query or modify the machine pool.

If the cloud pool has not been configured (see Set configuration) the method will fail. If the cloud pool is already started this is a no-op.

Input: None

Output:

HTTP response code 200 on success.

HTTP response code 400 (Bad Request) with an Error response message on an attempt to start an unconfigured cloud pool.

HTTP response code 500 (Internal Server Error) with an Error response message on error.

Stop¶

Method: POST /stop

Description: Stops the cloud pool.

A stopped cloud pool is in a passivated state and will not accept any requests to query or modify the machine pool.

If the cloud pool is already in a stopped state this is a no-op.

Input: None

Output:

HTTP response code 200 on success.

HTTP response code 500 (Internal Server Error) with an Error response message on error.

Get status¶

Method: GET /status

Description: Retrieves the execution status for the cloud pool.

Input: None

Output:

HTTP response code 200 on success with a Status message.

HTTP response code 500 (Internal Server Error) with an Error response message on error.

Get machine pool¶

Method: GET /pool

Description: Retrieves the current machine pool members.

Note that the returned machines may be in any machine state (REQUESTED, RUNNING, TERMINATED, etc).

The membership status of a started machine determines if it is to be considered an active member of the pool.The active size of the machine pool should be interpreted as the number of allocated machines (in any of the non-terminal machine states REQUESTED, PENDING or RUNNING that have not been marked with an inactive membership status.

The service state should be set to UNKNOWN for all machine instances for which no service state has been reported (see Set service state).

Similarly, the membership status should be set to the default (active, evictable) status for all machine instances for which no membership status has been reported (see Set membership status).

Input: None

Output:

On success: HTTP response code 200 with a Machine pool message

On cloud API errors: HTTP response code 502 (Bad Gateway) with an Error response message

On other errors: HTTP response code 500 (Internal Server Error) with an Error response message

Get pool size¶

Method: GET /pool/size

Description: Returns the current size of the machine pool – both in terms of the desired size and the actual size (as these may differ at any time).

Input: None

Output:

On success: HTTP response code 200 with a Pool size message

On cloud API errors: HTTP response code 502 (Bad Gateway) with an Error response message

On other errors: HTTP response code 500 (Internal Server Error) with an Error response message

Set desired size¶

Method: POST /pool/size

Description: Sets the desired number of machines in the machine pool. This method is asynchronous and returns immediately after updating the desired size. There may be a delay before the changes take effect and are reflected in the machine pool.

Note: the cloud pool should take measures to ensure that requested machines are recognized as pool members. The specific mechanism to mark group members, which may depend on the features offered by the particular cloud API, is left to the implementation but could, for example, make use of tags.

Input: The desired number of machine instances in the pool as a Set desired size message.

Output:

On success: HTTP response code 200 without message content.

On error:

on illegal input: code 400 with an Error response message

otherwise: HTTP response code 500 (Internal Server Error) with an Error response message

Terminate machine¶

Method: POST /pool/terminate

Description: Terminates a particular machine pool member. The caller can control if a replacement machine is to be provisioned via the decrementDesiredSize parameter.

Note: a machine that is protected from removal by a membership status with evictable: false can not be terminated.

Input: A Terminate machine message.

Output:

On success: HTTP response code 200 without message content.

On error:

on illegal input: code 400 with an Error response message

if the machine is not a pool member: code 404 with an Error response message

on cloud API errors: HTTP response code 502 (Bad Gateway) with an Error response message

On other errors: HTTP response code 500 (Internal Server Error) with an Error response message

Set membership status¶

Method: POST /pool/membershipStatus

Description: Sets the membership status of a given pool member.

The membership status for a machine can be set to protect the machine from being terminated (by setting its evictability status) and/or to mark a machine as being in need of replacement by flagging it as an inactive pool member.

The specific mechanism to mark group members, which may depend on the features offered by the particular cloud API, is left to the implementation but could, for example, make use of tags.

Input: A Set membership status message.

Output:

On success: HTTP response code 200 without message content.

On error:

on illegal input: code 400 with an Error response message

if the machine is not a pool member: code 404 with an Error response message

on cloud API errors: HTTP response code 502 (Bad Gateway) with an Error response message

On other errors: HTTP response code 500 (Internal Server Error) with an Error response message

Set service state¶

Method: POST /pool/serviceState

Description: Sets the service state of a given machine pool member.

Setting the service state does not have any functional implications on the pool member, but should be seen as way to supply operational information about the service running on the machine to third-party services (such as load balancers).

The specific mechanism to mark group members, which may depend on the features offered by the particular cloud API, is left to the implementation but could, for example, make use of tags.

Input: A Set service state message.

Output:

On success: HTTP response code 200 without message content.

On error:

on illegal input: code 400 with an Error response message

if the machine is not a pool member: code 404 with an Error response message

on cloud API errors: HTTP response code 502 (Bad Gateway) with an Error response message

On other errors: HTTP response code 500 (Internal Server Error) with an Error response message

Detach machine¶

Method: POST /pool/detach

Description: Removes a particular machine pool member from the pool without terminating it.

The machine keeps running but is no longer considered a pool member and, therefore, needs to be managed independently. The caller can control if a replacement machine is to be provisioned via the decrementDesiredSize parameter.

Note: a machine that is protected from removal by a membership status with evictable: false can not be detached.

Input: A Detach machine message.

Output:

On success: HTTP response code 200 without message content.

On error:

on illegal input: code 400 with an Error response message

if the machine is not a pool member: code 404 with an Error response message

on cloud API errors: HTTP response code 502 (Bad Gateway) with an Error response message

On other errors: HTTP response code 500 (Internal Server Error) with an Error response message

Attach machine¶

Method: POST /pool/attach

Description: Attaches an already running machine to the machine pool, growing the pool with a new member. This operation implies that the desired size of the group is incremented by one.

Input: An Attach machine message.

Output:

On success: HTTP response code 200 without message content.

On error:

on illegal input: code 400 with an Error response message

if the machine does not exist: code 404 with an Error response message

on cloud API errors: HTTP response code 502 (Bad Gateway) with an Error response message

On other errors: HTTP response code 500 (Internal Server Error) with an Error response message

Messages¶

Status message¶

Description

A message used to report the state of the cloud pool.

The status message has the following schema:

{
  "started": <boolean>,
  "configured": <boolean>
}

Sample document:

{
  "started": true,
  "configured": true
}

Set desired size message¶

Description	A message used to request that the machine pool be resized to a desired number of machine instances.
Schema	`{ "desiredSize": <number> }`

Sample document:

{ "desiredSize": 3 }

States that we want three machine instances in the pool.

Error response message¶

Description	Contains further details (in addition to the HTTP response code) on server-side errors.
Schema	`{ "message": <string>, "detail": <string> }`

The message is a human-readable error message intended for presentation, whereas the detail attribute holds error details (such as a stack trace).

This is a sample error message:

{
  "message": "failure to process pool get request",
  "detail": "... long stacktrace ..."
}

Machine pool message¶

Description

Describes the current status of the monitored machine pool.

The machine pool schema has the following structure:

{
  "timestamp": <iso-8601 datetime>,
  "machines": [ <machine> ... ]
}

The timestamp is the time at which the pool observation was made. Note that in case the cloud pool serves locally cached data, this field may be used by the client to determine if the data is fresh enough to be acted upon.

Every <machine> entry is a json object with the following structure:

{
  "id":               <string>,
  "machineState":     <machine state>,
  "membershipStatus": {"active": bool, "evictable": bool},
  "serviceState":     <service state>,
  "cloudProvider":    <string>,
  "region":           <string>,
  "machineSize":      <string>,
  "launchTime":       <iso-8601 datetime>,
  "requestTime":      <iso-8601 datetime>,
  "publicIps":        [<ip-address>, ...],
  "privateIps":       [<ip-address>, ...],
  "metadata":         <jsonobject>
}