GraphQL Observability

Questions you need to be able to answer

Aug 19, 2022

The book Observability Engineering by Charity Majors, Liz Fong-Jones, and George Miranda (highly recommended) starts with an explanation of what “Observability” is, and how to know if your software has it. They propose a small litmus test for determining the answer. The test is basically a set of questions you should be able to answer about the state and inner workings of your app.

Combined with Jenn’s recent tweet about wanting more resources on what kind of metrics GraphQL servers should collect to help debugging, I thought it would be pretty cool to have a list of “questions you must be able to answer about your GraphQL server”. It’s really hard to be precise without knowing the architecture you’re working on, so the questions are probably the best we can do for now! I’ll add some general tips at the end. This list is meant to sit on top of the typical observability requirements of an HTTP server / computerz in general.

Questions

How much time does your GraphQL engine spend in the different phases? (e.g. parsing, validating, analyzing, middleware, execution)
Can you tell the percentage of operations that are erroring out?
What about queries with “errors as data”?
When errors are elevated, are you able to tell which operations are failing the most often?
Can you tell which clients are sending the most operations? The most operations that fail?
When you’ve identified an operation that fails often, can you tell why it is failing? Can you tell when it started failing?
For a particular operation, can you identify which client or user sent it?
The overall response time for a GraphQL API doesn’t mean much on its own. Can you tell the response time (percentiles) for particular operations?
If an operation is slow or times out, can you identify in which areas it spends the most time? (GraphQL engine phases, resolver/business logic code, external calls)
Which operations time out the most often?
Which services were called during the execution of the query? How much time was spent in each of them?
If your GraphQL server resolves some fields using a database, are you able to tell which SQL / query set is executed during a GraphQL operation? How much time did it spend fetching data?
How much time is spent inside the GraphQL engine vs doing external calls? (DB queries, service-calls, etc)
After the query has been executed, how much time was spent serializing the response?
Would you be able to tell if some client operations had a performance regression, before the clients have to let you know?
If you are using dataloaders (which you should!), can you tell how well they perform? Which are the slowest?
Can you tell which fields are the slowest? (Often it is best to answer this one through dataloaders, since fields may just enqueue work to be done later, and appear very fast when in reality they cause a slow request). Adding a span / metrics for every resolver provides little value and probably quite a bit of overhead, so focusing on loaders / where the logic is is probably your best bet.
Are you able to tell how much time was spent in different key parts of your business logic during a query?
If you’re sampling, can you selectively sample some problematic operations / users?
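
To make a few of these questions concrete, here is a minimal sketch of the kind of bookkeeping that lets you answer the per-operation ones: time per phase, error rate, and latency percentiles. Everything in it is an assumption for illustration. `OperationMetrics` is a hypothetical in-memory stand-in for whatever metrics backend you actually use, and the operation and client names are made up. In a real server you would emit these as tagged metrics or spans instead of keeping them in a dict.

```python
import math
import time
from collections import defaultdict
from contextlib import contextmanager


class OperationMetrics:
    """Hypothetical in-memory stand-in for a real metrics backend."""

    def __init__(self):
        self.durations = defaultdict(list)  # (operation, phase) -> seconds
        self.requests = defaultdict(int)    # (operation, client_id) -> count
        self.failures = defaultdict(int)    # (operation, client_id) -> count

    @contextmanager
    def phase(self, operation, phase):
        """Time one phase (parse, validate, execute, ...) of one operation."""
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.durations[(operation, phase)].append(elapsed)

    def record_result(self, operation, client_id, failed):
        """Count a finished operation, tagged by operation and client."""
        self.requests[(operation, client_id)] += 1
        if failed:
            self.failures[(operation, client_id)] += 1

    def error_rate(self, operation):
        """Fraction of recorded runs of this operation that failed."""
        total = sum(n for (op, _), n in self.requests.items() if op == operation)
        failed = sum(n for (op, _), n in self.failures.items() if op == operation)
        return failed / total if total else 0.0

    def percentile(self, operation, phase, pct):
        """Nearest-rank percentile of the recorded durations, or None."""
        samples = sorted(self.durations.get((operation, phase), []))
        if not samples:
            return None
        rank = max(1, math.ceil(pct / 100 * len(samples)))
        return samples[rank - 1]


metrics = OperationMetrics()
with metrics.phase("GetUser", "parse"):
    pass  # the real parse call would go here
metrics.record_result("GetUser", client_id="web", failed=False)
metrics.record_result("GetUser", client_id="ios", failed=True)
print(metrics.error_rate("GetUser"))  # 0.5
```

The important design choice is the keys: every measurement is tagged with the operation (and the client where it matters), which is what makes "which operation" and "which client" questions answerable later.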

Some Tips

Tracing is awesome for answering a lot of questions like what parts of your code are called, how long they take, which services and databases are called, etc.
In my experience, performance issues are very, very often related to underlying data stores and services. You don’t have to log every DB query on every request, but it’s very useful to be able to tell (sometimes with a magic header for internal employees only) what DB queries result from a GraphQL operation, and how much time they took. This can help you figure out things like a missing index or inefficient data access. You can return those in the “extensions” field of the GraphQL response, which means you could use GraphiQL with a custom header to debug slow queries. Depending on your tracing setup you may already have this for free.
Generate unique IDs for operations through hashing. Use these IDs to tag your metrics/spans so that you can answer per-operation questions. Having unique operation names is also useful.
Always require a client ID / user ID when executing operations. This helps you answer questions related to specific clients.
Make sure the different parts of the GraphQL execution are traced. Parsing and validation are sometimes culprits in large queries.
From Andy Ingram: Be careful of CPU time especially in single threaded environments. e.g. if JSON serialization is a bottleneck for one query, all other active queries for that process will be held up by it. Even if I/O itself can happen async, you can’t process the response until the CPU is free again.
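
Two of the tips above (hashing operations into IDs, and returning DB queries in “extensions” behind a magic header) can be sketched in a few lines. This is only an illustration under stated assumptions: the header name `x-graphql-debug`, the `QueryLog` class, the `build_response` helper, and the `operationId` / `dbQueries` keys are all hypothetical names I made up. The whitespace normalization is also a simplification; a real implementation would more likely hash a normalized form of the parsed query.

```python
import hashlib
import re

DEBUG_HEADER = "x-graphql-debug"  # hypothetical "magic header" name


def operation_id(query: str) -> str:
    """Stable ID for an operation, derived by hashing its text."""
    # Collapse whitespace so formatting differences do not change the ID.
    # Simplification: a real setup would likely normalize the parsed AST.
    normalized = re.sub(r"\s+", " ", query).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


class QueryLog:
    """Hypothetical per-request collector for DB queries and their timings."""

    def __init__(self):
        self.entries = []

    def record(self, sql, duration_ms):
        self.entries.append({"sql": sql, "durationMs": duration_ms})


def build_response(data, query, headers, query_log, is_internal_user):
    """Attach debug info under "extensions" only for internal debug requests."""
    response = {"data": data}
    if is_internal_user and headers.get(DEBUG_HEADER) == "1":
        response["extensions"] = {
            "operationId": operation_id(query),
            "dbQueries": query_log.entries,
        }
    return response


log = QueryLog()
log.record("SELECT id, name FROM users WHERE id = 1", 3.2)
response = build_response(
    data={"me": {"id": "1"}},
    query="query Me { me { id } }",
    headers={DEBUG_HEADER: "1"},
    query_log=log,
    is_internal_user=True,
)
print(response["extensions"]["dbQueries"])
```

The same `operation_id` value is what you would use to tag metrics and spans, so the debug output and your dashboards talk about the same operation.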

That’s it! I’ll try to keep this post up to date with your suggestions and as I think of more questions.