Writing operable code

Code does weird things when exposed to “the real world.” Users are unpredictable. Networks are unreliable. Things go wrong. Production software must keep working. Writing operable code helps you deal with the unforeseen. It also makes life easier for IT operations specialists. Look up “IT operations” if you want more detail on that role.

Operable code has built-in protection, diagnostics, and controls.

Protect your system by programming defensively with safe and resilient coding practices.
You also need to be able to see what’s going on so you can diagnose failures. Expose logging, metrics, and call trace information for easy diagnostics.
Finally, you need to control systems without rewriting code.
An operable system has configuration parameters and system tools.

Defensive Programming

Defensive code fails less often, and when it does, it is more likely to recover. Make your code safe and resilient.

Safe code takes advantage of compile-time validation to avoid runtime failures.
Resilient code uses exception-handling best practices and handles failures gracefully.

Apply these defensive programming tips to make your code safe and resilient.

Avoid Null Values: The null object pattern uses objects in lieu of null values. An example of this pattern is a search method that returns an empty list instead of null when no objects are found.
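The search example above might look like the following minimal sketch. The User class, the _USERS data, and find_users are hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str


# Hypothetical in-memory data used only for this sketch.
_USERS = [User("alice"), User("alan"), User("bob")]


def find_users(prefix: str) -> list[User]:
    """Return every user whose name starts with prefix; never returns None."""
    return [user for user in _USERS if user.name.startswith(prefix)]


# Callers can iterate without a None check, even when nothing matches.
for user in find_users("z"):
    print(user.name)
```

Because the "nothing found" case is an ordinary empty list, callers cannot forget a null check.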

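Make Variables Immutable: Immutable variables cannot be changed once they are set, which prevents unexpected modifications. As a bonus, using immutable variables makes parallel programming simpler.

Use Type Hinting and Static Type Checking: Type hints document code, and static type checkers catch bugs before the code runs.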
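One way to get immutability in Python is a frozen dataclass. The sketch below assumes a hypothetical Endpoint type; the host names are made up.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Endpoint:
    host: str
    port: int


primary = Endpoint(host="db.internal", port=5432)

try:
    primary.port = 6543  # type: ignore[misc]  # rejected at runtime
except FrozenInstanceError:
    print("Endpoint is immutable")

# To "change" a value, build a new object and leave the original untouched.
replica = replace(primary, host="db-replica.internal")
```

Since primary can never change, it can be shared between threads without locking.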

Validate Inputs:

Reject bad input as early as possible.
Be as constrained as possible.
Validate that all numbers are in appropriate ranges: if a parameter should be greater than zero, ensure that it is; if a parameter is an IP address, check that it’s a valid IP.
Look out for the vulnerabilities listed here: https://owasp.org/www-project-top-ten/
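The range and IP checks described above could be sketched as follows. The function name validate_listener and its parameters are hypothetical; only the standard-library ipaddress module is assumed.

```python
import ipaddress


def validate_listener(host: str, port: int, max_connections: int) -> None:
    """Raise ValueError as soon as any argument is out of bounds."""
    try:
        ipaddress.ip_address(host)  # accepts IPv4 and IPv6 literals
    except ValueError as exc:
        raise ValueError(f"host is not a valid IP address: {host!r}") from exc
    if not 1 <= port <= 65535:
        raise ValueError(f"port must be in 1..65535, got {port}")
    if max_connections <= 0:
        raise ValueError(
            f"max_connections must be greater than zero, got {max_connections}"
        )
```

Calling this at the top of the function that receives the input rejects bad values before they can reach deeper layers.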

Use Exceptions: Don’t use special return values (null, 0, –1, and so on) to signal an error.

Be Precise with Exceptions:

Use built-in exceptions when possible and avoid creating generic exceptions.
Use exceptions for failures, not to control application logic.

Throw Exceptions Early, Catch Exceptions Late:

Throwing early means raising exceptions as close to the error as possible so developers can quickly find the relevant code.
Catching exceptions late means propagating exceptions up the call stack until you reach the level of the program that is capable of handling the exception.
When calling code that might throw exceptions, either handle them completely or propagate them up the stack.
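A small sketch of throwing early and catching late, using only the built-in ValueError. The three functions are hypothetical layers of an imagined request handler.

```python
def parse_age(raw: str) -> int:
    age = int(raw)  # int() itself raises ValueError for non-numeric text
    if age < 0:
        # Throw early: fail right where the bad value is detected.
        raise ValueError(f"age must not be negative, got {age}")
    return age


def average_age(raw_values: list[str]) -> float:
    if not raw_values:
        raise ValueError("at least one age is required")
    # No try/except here: this layer cannot fix bad input, so it propagates.
    ages = [parse_age(raw) for raw in raw_values]
    return sum(ages) / len(ages)


def handle_request(raw_values: list[str]) -> str:
    # Catch late: only the request layer knows how to report the failure.
    try:
        return f"average age: {average_age(raw_values):.1f}"
    except ValueError as exc:
        return f"bad request: {exc}"
```

The middle layer stays free of error handling it could not do well anyway, and the error message still points at the exact value that was wrong.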

Retry Intelligently:

Retry with backoff and random jitter (to prevent the thundering herd problem).
It is better to let the application crash when it encounters an error it was not designed to handle; this is called failing fast.
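A minimal sketch of exponential backoff with jitter. The function name, its defaults, and the choice to retry only ConnectionError are assumptions made for illustration; any other exception is deliberately left to propagate so the caller fails fast.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying ConnectionError with capped, jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the failure
            # The ceiling doubles on each attempt, up to max_delay.
            ceiling = min(max_delay, base_delay * 2**attempt)
            # A random delay below the ceiling spreads clients out in time.
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

The sleep parameter exists only so tests can inject a fake and avoid real waiting.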

Write Idempotent Systems:

The best way to deal with retries is to build idempotent systems.
An idempotent operation is one that can be applied multiple times and still yield the same outcome.
Adding a value to a set is idempotent.
Remote APIs can be made idempotent by allowing clients to supply a unique ID for each request.
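The request-ID approach might be sketched like this. PaymentService and its in-memory dictionary are hypothetical; a real service would keep the seen IDs in durable storage.

```python
import uuid


class PaymentService:
    """Toy service that remembers the result of each request ID."""

    def __init__(self) -> None:
        self._results: dict[str, str] = {}
        self.charges_made = 0

    def charge(self, request_id: str, amount_cents: int) -> str:
        # A repeated request ID returns the stored result instead of charging again.
        if request_id in self._results:
            return self._results[request_id]
        self.charges_made += 1
        receipt = f"receipt-{self.charges_made}"
        self._results[request_id] = receipt
        return receipt


service = PaymentService()
request_id = str(uuid.uuid4())  # the client generates the ID once per logical request
first = service.charge(request_id, 1999)
second = service.charge(request_id, 1999)  # a retry of the same request
```

The retry returns the original receipt and the customer is charged once, so a client that never saw the first response can safely send the request again.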

Clean Up Resources: Release memory, data structures, network sockets, and file handles that you no longer need.
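In Python, context managers are a common way to guarantee cleanup. The sketch below uses a hypothetical ConnectionPool whose lease method always returns the connection, even when the caller's code raises.

```python
from contextlib import contextmanager
from typing import Iterator


class ConnectionPool:
    """Toy pool that only tracks how many connections are free."""

    def __init__(self, size: int) -> None:
        self.available = size

    @contextmanager
    def lease(self) -> Iterator[str]:
        if self.available == 0:
            raise RuntimeError("pool exhausted")
        self.available -= 1
        try:
            yield "connection"
        finally:
            # Runs on normal exit and on exceptions, so nothing leaks.
            self.available += 1


pool = ConnectionPool(size=1)
with pool.lease() as connection:
    print("using", connection)
```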

Logging

Use a logging framework to make your code easier to operate and debug. Set log levels so your operators can control your application’s log volume. Keep logs atomic, fast, and secure.

Apply these tips to make logging efficient.

Use Log Levels: While log levels are not completely standard, the following levels are common:

TRACE This is an extremely fine level of detail that only gets turned on for specific packages or classes. This is rarely used outside of development. If you need line-by-line logs or data structure dumps, this level is for you. If you find yourself using TRACE frequently, you should consider using a debugger to step through code instead.
DEBUG This is used when the message will be useful during a production issue but not during normal operations. Don’t use debug-level logging so much that the output is unusable when debugging; save that for TRACE.
INFO This is nice-to-have information about the state of the application but not indicative of any problems. Application state messages like “Service started” and “Listening on port 5050” go here. INFO is the default log level. Don’t emit frivolous logs with INFO; “just in case” logging goes into TRACE or DEBUG. INFO logging should tell us something useful during normal operations.
WARN These are messages about potentially problematic situations. A resource nearing its capacity merits a warning. Whenever you log a warning, there should be a concrete action you want the person seeing the message to take. If the warning is not actionable, log it to INFO.
ERROR These messages indicate that an error that needs attention is occurring. An unwritable database usually merits an ERROR log. ERROR logs should be detailed enough to diagnose problems. Log explicit details, including relevant stack traces and the resulting actions the software is performing.
FATAL These are the “last gasp” log messages. If the program encounters a condition so severe that it must exit immediately, a message about the cause of the problem can be logged at the FATAL level. Include relevant context about the program’s state; locations of recovery or diagnostic-related data should be logged.

Keep Logs Atomic:

If information is useful only when coupled with other data, log it all atomically in one message.
Avoid newlines in log messages; many log aggregators treat each new line as a separate message.
If log messages can’t be output atomically, include a unique ID in the messages so they can be stitched together later.
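A minimal sketch of an atomic log message using the standard logging module. The logger name and the field names are hypothetical.

```python
import logging

logger = logging.getLogger("orders")


def log_order_failure(order_id: str, customer_id: str, reason: str) -> None:
    # One message carries every related field. %r prints the repr of reason,
    # which escapes embedded newlines so the entry stays on a single line.
    logger.error(
        "order failed order_id=%s customer_id=%s reason=%r",
        order_id,
        customer_id,
        reason,
    )
```

The order_id field also serves as the unique ID that lets related messages be stitched together later.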

Keep Logs Fast:

Use parameterized logging and asynchronous appenders to keep logging fast.
Log frameworks provide mechanisms to delay string concatenation until it’s actually needed.
Beware that changing log verbosity and configuration can eliminate race conditions and bugs because it slows down the application. If you enable verbose logging to debug an issue and discover a bug disappears, the logging change itself might be the reason.
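With Python's standard logging module, parameterized logging looks like the sketch below. The logger name and the expensive_dump helper are hypothetical.

```python
import logging

logger = logging.getLogger("inventory")


def expensive_dump(items: set[str]) -> str:
    """Stand-in for a costly serialization step."""
    return ", ".join(sorted(items))


def log_items(items: set[str]) -> None:
    # The format string and arguments are passed separately, so the message
    # is only built if a handler actually emits the record.
    logger.debug("processing %d items", len(items))
    # Arguments are still evaluated eagerly, so guard the costly ones.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("items: %s", expensive_dump(items))
```

For asynchronous output, the standard library also provides logging.handlers.QueueHandler and QueueListener, which move handler work off the calling thread.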

Don’t Log Sensitive Data:

Most frameworks support rule-based string replacement and redaction; configure them, but do not rely on them as your only defense.
Be paranoid; logging sensitive data can create security risks and violate privacy regulations.
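As one illustration of rule-based redaction, a logging.Filter can mask values before any handler sees them. The regular expression below is a deliberately crude, hypothetical rule for card-like numbers; as noted above, a filter like this is a safety net, not a substitute for keeping sensitive data out of log calls.

```python
import logging
import re

# Hypothetical rule: treat any run of 13 to 16 digits as a card number.
_CARD_LIKE = re.compile(r"\b\d{13,16}\b")


class RedactCardNumbers(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        # Render the final message, mask matches, and store the result back.
        record.msg = _CARD_LIKE.sub("[REDACTED]", record.getMessage())
        record.args = ()
        return True  # keep the record, now redacted


logger = logging.getLogger("payments")
logger.addFilter(RedactCardNumbers())
```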

Metrics

Instrument your application with metrics to see what it is doing. Metrics are the numerical equivalent of logs; they measure application behavior. How long did a query take? How many elements are in a queue? How much data was written to disk?

There are three common metric types: counters, gauges, and histograms:

Counters measure the number of times an event happens.
Gauges are point-in-time measurements that can go up or down; think of a speedometer.
Histograms break events into ranges based on their magnitude.
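To make the three types concrete, here is a toy sketch in plain Python. It is for illustration only, and the class and bucket boundaries are hypothetical; real applications should use a standard metrics library rather than code like this.

```python
from bisect import bisect_left
from collections import Counter


class ToyMetrics:
    BUCKET_UPPER_BOUNDS_MS = (10, 100, 1000)

    def __init__(self) -> None:
        self.requests_total = 0  # counter: only ever increases
        self.queue_depth = 0  # gauge: current value, can go up or down
        self.latency_buckets = Counter()  # histogram: events per latency range

    def record_request(self, latency_ms: float) -> None:
        self.requests_total += 1
        bounds = self.BUCKET_UPPER_BOUNDS_MS
        index = bisect_left(bounds, latency_ms)  # first bound >= latency_ms
        if index < len(bounds):
            label = f"<={bounds[index]}ms"
        else:
            label = f">{bounds[-1]}ms"
        self.latency_buckets[label] += 1

    def set_queue_depth(self, depth: int) -> None:
        self.queue_depth = depth
```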

Application metrics are aggregated into centralized observability systems like Datadog, LogicMonitor, or Prometheus.

Observability is a concept from control theory that defines how easy it is to determine the state of a system by looking at its outputs. As a developer, it is your job to ensure that important metrics are exposed to observability systems.

Autoscaling is common in environments that provide dynamic resource allocation. Autoscalers use metrics from monitoring systems to add or remove capacity so the service keeps meeting its promised service level objectives (SLOs).

Apply these tips to make metrics efficient.

Use Standard Metrics Libraries: While counters, gauges, and histograms are pretty easy to calculate, don’t roll your own metrics library. Nonstandard libraries are a maintenance nightmare. In Python, for example, use a StatsD client library.

Measure Everything:

Measure all of the following data structures, operations, and behaviors:
- Resource pools
- Caches
- Data structures
- CPU-intensive operations
- I/O-intensive operations
- Data size
- Exceptions and errors
- Remote requests and responses
Use gauges to measure the size of resource pools.

Traces

Developers all know about stack traces, but there’s a less familiar kind of trace: a distributed call trace. A single call to a frontend API might result in hundreds of downstream RPC calls to different services. Distributed call traces stitch all of these downstream calls together into one graph. Use a distributed tracing tool so you can follow a request from your code through every service it touches.

Configuration

Applications and services should expose settings that allow developers or site reliability engineers (SREs) to configure runtime behavior. Applying configuration best practices will make your code easier to run.

Don’t get too creative: use a standard configuration format, provide sensible defaults, validate configuration inputs, and avoid dynamic configuration when possible.

Configuration can be expressed in many ways:

Files in plain, human-readable formats such as INI, JSON, or YAML
Environment variables
Command-line flags
A custom domain-specific language (DSL)
The language the application is written in

Apply these tips to make configuration management efficient.

Don’t Get Creative with Configuration: A static configuration file in a single standard format (like YAML) is ideal. Dynamic configuration is usually not worth the added complexity.

Log and Validate All Configuration: Log all (nonsecret) configuration immediately upon startup to show what the application is seeing. Validate every value as soon as it is loaded.

Provide Defaults: Set good defaults so your application will work well for most users out of the box.
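Defaults, validation, and startup logging can be combined in one loader. The sketch below reads JSON to stay within the standard library; the keys, defaults, and the load_config function are hypothetical.

```python
import json
import logging

logger = logging.getLogger("config")

# Hypothetical settings with defaults that work out of the box.
DEFAULTS = {"port": 8080, "timeout_seconds": 10, "api_token": ""}
SECRET_KEYS = {"api_token"}


def load_config(text: str) -> dict:
    overrides = json.loads(text)
    if not isinstance(overrides, dict):
        raise ValueError("configuration must be a JSON object")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown configuration keys: {sorted(unknown)}")
    config = {**DEFAULTS, **overrides}
    if not isinstance(config["port"], int) or not 1 <= config["port"] <= 65535:
        raise ValueError(f"port must be an integer in 1..65535, got {config['port']!r}")
    timeout = config["timeout_seconds"]
    if not isinstance(timeout, (int, float)) or timeout <= 0:
        raise ValueError(f"timeout_seconds must be positive, got {timeout!r}")
    # Log the effective values at startup, masking anything secret.
    visible = {
        key: "***" if key in SECRET_KEYS else value
        for key, value in sorted(config.items())
    }
    logger.info("effective configuration: %s", visible)
    return config
```

Rejecting unknown keys catches typos such as "prot" that would otherwise silently fall back to the default.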

Group Related Configuration: Use a standard format like YAML that allows for nesting. Grouping related properties makes configuration easier to organize and maintain. Rather than defining timeout_duration=10 and timeout_units=second, use timeout=10s or timeout: { duration: 10, units: second }.

Treat Configuration as Code: To keep configuration changes safe, configuration should be version controlled, reviewed, tested, built, and published. Keep configuration in a VCS like Git so you have a history of changes.

Keep Configuration Files Clean: Delete unused configuration, use standard formatting and spacing, and don’t blindly copy configuration from other files.

Don’t Edit Deployed Configuration.

Tools

Operable systems come with tools that help operators run the application. Operators might need to bulk-load data, run a recovery, reset database state, trigger a leadership election, or shift a partition assignment from one machine to another. Systems should come with tools to help operators deal with common operations.

Tool writing is collaborative. In some cases, you will be expected to write and supply operations tools. Organizations with strong SRE teams might also write tools for your systems. Regardless, work with your operations team to understand what they need.

Note

Sources: