Superior Testing: Managing Production

July 30, 2019

The testing process does not magically stop after deploying an artifact. We’ve only just begun. It is impossible to check everything in a sandboxed environment. From a certain perspective, it seems like the whole runtime is the testing process. How is it possible to verify correctness without tests? Right, by running the thing. We do the same in production, just on a bigger scale and with bigger risks.

Testing in a controlled environment is trivial. After all, we see the result. If tests are in place, we see it as a report or an output in an IDE. If not, we can analyze the output ourselves, be it a UI or plain text. We don’t see things in production. Imagine the application as a pure function. A user supplies inputs (files, text, clicks, swipes) and consumes outputs (different files, UI, UX). The code is in between; it owns neither inputs nor outputs. What do we do? Introduce side effects in the form of monitoring.

Analytics

Business value is usually covered with analytics events. Otherwise, it is tricky to identify good and not-so-good spots in business-related flows. Technical metrics are important as well! There is a small trick though: it is undesirable to mix business and technical analytics in the same pool. Most of the time, product-related people do not care about technical details.

There are great tools on the market doing what we need — like Fabric Answers and Google Analytics for Firebase. Both provide dedicated analytics storage for tech-specific needs. What do we put there?
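As an illustration of what might go into a tech-only pool, here is a minimal Kotlin sketch that reports a database migration duration to Google Analytics for Firebase. The event and parameter names are assumptions made for the example, not a prescribed schema.

import android.content.Context
import android.os.Bundle
import com.google.firebase.analytics.FirebaseAnalytics

// Reports how long a local database migration took and which schema it produced.
// "database_migration", "duration_millis" and "schema_version" are illustrative names.
fun reportDatabaseMigration(context: Context, durationMillis: Long, schemaVersion: Int) {
    val parameters = Bundle().apply {
        putLong("duration_millis", durationMillis)
        putInt("schema_version", schemaVersion)
    }
    FirebaseAnalytics.getInstance(context).logEvent("database_migration", parameters)
}

Keeping events like this in a separate, tech-only project leaves the product-facing dashboards clean.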

Such characteristics help to see the bigger picture and make well-informed decisions.

Logging

For reasons unknown, a lot of developers neglect logs. Maybe they are seen as archaic tech? No idea. Well-structured logs are far more powerful than analytics. The key to that power is changing the logging mindset. Think about logs not as a wall of text (OH HAI Android Logs) but as a database.

Let’s imagine that we got a nasty crash. Having a stacktrace is helpful, but in non-obvious situations it is not trivial to understand what went wrong. It is great to have logs in this scenario. Having them as plain text is not so great. What do we do if we want to take a range from one date to another? Or look at all HTTP requests made from a device this month? Or analyze how often we get 500 HTTP errors from a specific endpoint?

Enter the ELK (Elastic) Stack. Extremely popular in the backend world, it is not so well-known in the mobile one. I’m gonna skip the general introduction since there are countless guides out there.

The idea from the client perspective is simple. Instead of logging plain text we are gonna log… JSON!

/v1/books returned 500

becomes

{
    "application_version": "1.0.2",
    "os_version": "7.1.1",
    "thread": "main",
    "http_path": "/v1/books",
    "http_request_id": "87814f00-e3a4-41a6-ba21-174f2476bc75",
    "http_request_duration": 42,
    "http_response_code": 500
}

Such entries are appended to a file, files get batch-processed and sent to the backend, where they are handled by the ELK stack. Then, using Kibana as a frontend and Lucene as a query language, we can run all sorts of analysis.
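A minimal Kotlin sketch of the client side might look like the following. It assumes Android’s built-in JSONObject and a log file that a background job later batches and ships to the backend; the StructuredLogger name and the set of common fields are illustrative.

import android.os.Build
import org.json.JSONObject
import java.io.File

// Writes one JSON document per line so the file stays trivial to batch and parse.
class StructuredLogger(
    private val logFile: File,
    private val applicationVersion: String
) {
    fun log(fields: Map<String, Any>) {
        val entry = JSONObject(fields)
            .put("application_version", applicationVersion)
            .put("os_version", Build.VERSION.RELEASE)
            .put("thread", Thread.currentThread().name)

        logFile.appendText(entry.toString() + "\n")
    }
}

A call site supplies only the event-specific fields:

logger.log(mapOf(
    "http_path" to "/v1/books",
    "http_request_duration" to 42,
    "http_response_code" to 500
))

Once the entries are indexed, Kibana queries over these fields could look like this (illustrative Lucene syntax):

http_response_code:500 AND http_path:"/v1/books"
http_request_duration:[1000 TO *] AND thread:"main"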

Those are basic examples; the possibilities are almost endless. Even better, it is possible to build custom graphs in Kibana or even create monitoring dashboards with Grafana.

This kind of approach to the logging process can save a lot of time spent on investigation and analysis. Not gonna lie — it feels great to use such powerful instruments.

Feature Flags, Alerts, Non-Fatals…

There are a lot of practices for managing production environments. The idea stays the same: there is no end to testing. Like it or not, things will go south. It is always better to have the tools to understand why it happened.
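As a closing sketch, a feature flag lookup and a non-fatal report on Android might look like this. The use of Firebase Remote Config and Fabric Crashlytics, the flag name and the function names are assumptions for illustration, not the only way to do it.

import com.crashlytics.android.Crashlytics
import com.google.firebase.remoteconfig.FirebaseRemoteConfig

// "books_search_enabled" is a hypothetical flag name. Flipping it remotely
// switches a risky code path off without shipping a new release.
fun isBooksSearchEnabled(): Boolean =
    FirebaseRemoteConfig.getInstance().getBoolean("books_search_enabled")

// A non-fatal report: the user keeps working, we still learn about the failure.
fun reportNonFatal(error: Throwable) = Crashlytics.logException(error)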