Testing in Production
Welcome to Testing in Production, my new personal blog and a place where I talk about engineering, startups, and all things tech. Who am I? My name is Khang Tran and I’m an engineering leader who has spent over two decades in Silicon Valley working mostly in startups. I’ve seen some wild things, made a lot of mistakes, and most importantly learned some valuable lessons. By sharing my experiences and perspectives, I hope I can help at least one leader or builder make the world a better place through tech. At a minimum, I hope it’s entertaining 🤣
So, why name this “Testing in Production”? A few things you should know about me:
I like to seek truth, so when someone declares “X is bad,” I don’t just take it at face value. Instead, I ask “Why?” and rarely accept “because that’s how it’s done.”
Reality is messy and life is not black and white, so I believe seeking balance and understanding nuance is the recipe for long-term success. Learning to “live in the grey,” as a mentor once told me.
I accept failure as a fact of life, so if it’s bound to happen, you’ll be most resilient if you design your life (in this context, software and systems) to deal with failure well.
I love experimentation and iteration. The tried and true gets you only so far; the rest is innovation, which only actualizes through velocity and agility.
Thus, I think this title perfectly embodies my philosophy and my leadership style: a little contrarian, a little risky, but rooted in pragmatism and reasoning. So, for my first post, let’s actually talk about testing in production!
Background
Conventional wisdom tells us production environments are sacrosanct, reserved exclusively for stable, rigorously tested software to avoid severe repercussions like data corruption, financial losses, and reputational damage. Merely mentioning “testing in production” conjures up images of an unmanaged environment where code quality is disregarded. While it’s true that prod should be stable and code thoroughly vetted, locking it down entirely isn't a foolproof solution.
I’ve seen many teams err too far on the side of caution, putting up hurdles on the path to production in the well-intentioned spirit of trading speed for quality. Outages subside not because code improves but because velocity drops and engineers ship less code. An unintended side effect is that teams spend more time in pre-production trying to prevent outages and less time hardening prod itself, so when outages do happen (and they will, even if you don’t ship code) they are ill-prepared. Then the cycle continues: more hurdles get added, spiraling the team toward a grinding halt because it has become impossible to ship.
Fortunately, the perception of testing in production has shifted significantly, driven by the maturation and widespread adoption of CI/CD and experimentation tools. With the necessary automation and infrastructure in place, teams can test with confidence. This evolution has fostered the development of specific strategies that span both infra and growth. To some, these are not new at all, but a surprising number of startups have yet to learn about them, let alone embrace them.
Canary testing involves releasing new features to a small segment of users to assess real-world impact.
Progressive rollouts similarly let you slowly ramp up exposure to a feature, so you get feedback sooner while keeping a buffer for recovery if needed (a minimal sketch follows this list).
Load testing in production evaluates application resilience under heavy traffic.
Chaos engineering proactively injects failures to identify system weaknesses.
Shadow testing mirrors production traffic to a new version to observe its behavior under real load without affecting users.
Real User Monitoring (RUM) analyzes actual user experiences to pinpoint performance and usability issues.
Feature flags allow for toggling functionality on or off without redeployment, providing immediate control.
A/B testing exposes different feature versions to distinct user groups to optimize based on performance metrics.
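To make the first two strategies concrete, here’s a minimal sketch of the deterministic bucketing that typically powers canaries and progressive rollouts. The function and feature names are illustrative, not any particular vendor’s API; hosted tools layer targeting, kill switches, and a UI on top of the same idea.

```python
import hashlib

def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """Deterministically bucket a user for a feature rollout.

    Hashing feature + user_id keeps each user's assignment stable
    across requests and independent across features.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000   # bucket in 0..9999
    return bucket < percent * 100           # percent expressed as 0..100

# Canary at 1%, then ramp: 1% -> 5% -> 25% -> 100%
for pct in (1, 5, 25, 100):
    enabled = in_rollout("user-42", "new-checkout", pct)
    print(f"{pct:>3}% rollout -> user-42 enabled: {enabled}")
```

Because the hash is deterministic, ramping from 1% to 5% keeps the original 1% of users enabled, which makes comparisons and rollbacks clean.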
What was formerly seen as YOLO’ing is now codified into sophisticated, intentional strategies to essentially “test” in production. For organizations operating at scale with complex systems (and even some that don’t), this toolkit has become practically indispensable. In essence, testing in production, when executed correctly, is now a critical component of high-performing engineering teams.
Good vs Bad
Not all “testing” in prod is safe or good. Startups tend to (necessarily) cut corners, but there are some slippery slopes that are better avoided. Here are some examples and how you can adjust.
NOTE: This only applies under standard operating conditions. If you’re battling an active outage, especially a P0/SEV1, the top priority is mitigation so anything goes.
Logging Into Prod
Behavior: Routinely opening a prod console to run commands or debug
Philosophy: Any manual human intervention in prod should pass through an automation layer. Human error is just too great a risk, so you’re best off writing a script or ideally automating it. There’s a good chance it’ll happen again, so you’ll have made it safer for everyone going forward.
Solutions:
Add an admin view where you can inspect business objects and automate common admin operations (wrapped in transactions). Many frameworks include some support for this, and if not then you can use Retool or a similar low-code operations tool. It can be a game-changer because you’ve also empowered non-engineers to pitch in on prod issues.
Add data integrity checks that run periodically. Instead of waiting for errors to occur, put all your assertions and exception handling into a script (per object or flow) and run it proactively so your system self-heals and flags issues early (see the sketch after this list).
Product analytics tools (e.g. Mixpanel, Amplitude) will let you instrument user behavior and perform some pretty advanced analysis, including viewing user activity timelines, which are great for debugging. You can even pipe the data to your data warehouse if you need more powerful querying.
Logging is your friend, particularly for new code or critical flows. To save money, you can always demote logging once code matures. Both GCP and AWS offer powerful logging tools that can be lifesavers.
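To illustrate the integrity-check idea above, here’s a minimal sketch. The order fields and rules are hypothetical stand-ins for your own data model; the shape (collect violations, alert, auto-repair only what’s provably safe) is the part that matters.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("integrity")

def check_orders(orders) -> list[str]:
    """Sweep orders and return human-readable violations.

    Collecting violations instead of raising means one bad record
    doesn't stop the rest of the sweep.
    """
    violations = []
    for order in orders:
        if order["total"] < 0:
            violations.append(f"order {order['id']}: negative total")
        if order["status"] == "paid" and not order.get("payment_id"):
            violations.append(f"order {order['id']}: paid but no payment record")
    return violations

# Run from cron or a scheduler; in real life `orders` comes from your DB.
sample = [
    {"id": 1, "total": 25.0, "status": "paid", "payment_id": "pay_123"},
    {"id": 2, "total": -5.0, "status": "paid", "payment_id": None},
]
for violation in check_orders(sample):
    # Alert first; auto-repair only what is provably safe
    # (e.g. recomputing a derived field inside a transaction).
    logger.error("integrity violation: %s", violation)
```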
Relying on Customers for QA
Behavior: Deploying code without following up with user testing, only to repeatedly redeploy a “fix” after your customers complain
Philosophy: Customers are the most expensive form of QA and they are usually not very forgiving. You should be catching errors before they reach real users, and if possible it should be automated.
Solutions:
Feature flag your new code so you can test it on yourself first. From there you can broaden it to a larger but still safe group, like your immediate team or other employees.
Progressive rollouts can automate the exposure ramp, so it might be a good idea to have this always on. At Uber, for example, new app releases always deploy to employees first before expanding to the general population. You can control this as granularly as you like.
Synthetic monitoring lets you automate UI checks, effectively automating yourself. Combine it with feature flagging and include test accounts first, as sketched below.
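Here’s a minimal sketch of that gating order: yourself and test accounts first, then employees, then a percentage ramp. The emails and domain are made up; a hosted flag tool (e.g. Statsig or LaunchDarkly) gives you the same rings with a UI and a kill switch.

```python
import hashlib

INTERNAL_TESTERS = {"me@mystartup.com", "qa-bot@mystartup.com"}  # hypothetical accounts

def feature_enabled(user_email: str, feature: str, rollout_percent: int) -> bool:
    if user_email in INTERNAL_TESTERS:         # ring 0: yourself and test accounts
        return True
    if user_email.endswith("@mystartup.com"):  # ring 1: all employees
        return True
    # ring 2: deterministic percentage ramp for everyone else
    digest = hashlib.sha256(f"{feature}:{user_email}".encode()).hexdigest()
    return int(digest[:8], 16) % 100 < rollout_percent

print(feature_enabled("me@mystartup.com", "new-onboarding", 0))   # True: tester ring
print(feature_enabled("alice@gmail.com", "new-onboarding", 10))   # ~10% of real users
```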
In the Real World
Netflix is a prime example of an organization with a mature approach to testing in production. They famously employ chaos engineering through tools like Chaos Monkey to proactively test the resilience of their systems by intentionally introducing failures. This practice validates their failure handling mechanisms under real-world conditions. Netflix also relies heavily on canary releases for software updates, gradually deploying new versions to a small percentage of users. A/B testing is integral to their product development, enabling continuous optimization of the user experience. They closely monitor metrics like play-delay during canary tests to quickly identify performance regressions. Feature flags provide granular control over feature rollouts. Netflix's approach demonstrates how deeply embedded testing in production can drive both high availability and continuous innovation at a massive scale.
There are countless more examples: Google frequently uses canary releases for features in products like the Play Store, and feature flags are extensively used in Android development for stability and controlled rollouts. Meta likewise uses canary deployments and feature flags, and has deeply integrated experimentation into its product infrastructure. These examples illustrate the widespread adoption of testing in production among leading technology companies, each adapting the strategies to its specific needs.
While the scale of these examples might seem daunting for startups, the underlying principles apply to companies of all sizes. As a startup, you can begin with simpler versions, such as using feature flags for beta users or running A/B tests on new landing pages or flows. There’s no shortage of platforms that provide this, even with free tiers and startup programs (Statsig looks promising). And I always advise startups to invest in monitoring and alerting from the outset so they can safely move fast while keeping production stable. You’re a busy startup with a million problems, so just start small and keep it simple. Even some basic instrumentation on key metrics (latency, I/O) and business flows (login, transactions) will go a long way. Finally, the core concept of iterating based on real user feedback is critical to achieving product-market fit. The more iterations you can make, the quicker you’ll get there.
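If you’re starting from zero, instrumentation can be as simple as a timing wrapper around your key flows. Below is a minimal sketch; `emit_metric` is a placeholder for whatever client you actually use (StatsD, CloudWatch, Prometheus, and so on).

```python
import time
from functools import wraps

def emit_metric(name: str, value_ms: float) -> None:
    print(f"metric {name}={value_ms:.1f}ms")  # swap in a real metrics client here

def timed(metric_name: str):
    """Decorator that reports wall-clock latency for a flow."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            finally:
                emit_metric(metric_name, (time.monotonic() - start) * 1000)
        return wrapper
    return decorator

@timed("login.latency")
def login(username: str) -> bool:
    time.sleep(0.05)  # stand-in for the real login flow
    return True

login("demo-user")
```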
The key is to adopt the foundational principles early, starting small and gradually maturing practices. In this way testing in production becomes ingrained in your team culture.
Conclusion
Software development has evolved dramatically, so you no longer need to have an allergic reaction to testing in production, because now it means something entirely different. Embrace the abundance of tools available to (safely) play in production where your actual users are; many are free and tailored to your platform of choice. In doing so, you might even achieve the holy grail of both quality and speed.
In closing, I’ll pose a challenge: if you manage a staging environment, consider dropping it. Staging environments have become standard practice, one of the pre-prod hurdles meant to increase confidence, but they have perhaps become the new “But it works on my laptop!” And for some teams, they come at a non-trivial cost, one that escalates as the team or business scales.
Ask yourself: What would it take to drop staging? What would it take to feel confident shipping code straight to production? This framing can help you understand what you hope staging does for you and what other tools or strategies could do the job better or more directly. If after this introspection you find you still absolutely need staging, then at least you’ve explored the possibilities of testing in production!