
By Jennifer Shin, Tejas Shikhare, Will Emmanuel
In 2022, a major change was made to Netflix’s iOS and Android applications: we migrated Netflix’s mobile apps to GraphQL with zero downtime, which involved a total overhaul from the client to the API layer.
Until recently, an internal API framework, Falcor, powered our mobile apps. They are now backed by Federated GraphQL, a distributed approach to APIs where domain teams can independently manage and own specific sections of the API.
Doing this safely for hundreds of millions of customers without disruption is exceptionally challenging, especially considering the many dimensions of change involved. This blog post will share broadly applicable techniques (beyond GraphQL) we used to perform this migration. The three strategies we will discuss today are AB Testing, Replay Testing, and Sticky Canaries.
Before diving into these techniques, let’s briefly examine the migration plan.
Before GraphQL: Monolithic Falcor API implemented and maintained by the API Team
Before moving to GraphQL, our API layer consisted of a monolithic server built with Falcor. A single API team maintained both the Java implementation of the Falcor framework and the API Server.
Phase 1: Created a GraphQL Shim Service on top of our existing Monolith Falcor API.
By the summer of 2020, many UI engineers were ready to move to GraphQL. Instead of embarking on a full-fledged migration from top to bottom, we created a GraphQL shim on top of our existing Falcor API. The GraphQL shim enabled client engineers to move quickly onto GraphQL, figure out client-side concerns like cache normalization, experiment with different GraphQL clients, and investigate client performance without being blocked by server-side migrations. To launch Phase 1 safely, we used AB Testing.
Phase 2: Deprecate the GraphQL Shim Service and Legacy API Monolith in favor of GraphQL services owned by the domain teams.
We didn’t want the legacy Falcor API to linger forever, so we leaned into Federated GraphQL to power a single GraphQL API with multiple GraphQL servers.
We could also swap out the implementation of a field from the GraphQL Shim to the Video API with federation directives. To launch Phase 2 safely, we used Replay Testing and Sticky Canaries.
Two key factors determined our testing strategies:
- Functional vs. non-functional requirements
- Idempotency
If we were testing functional requirements like data accuracy, and if the request was idempotent, we relied on Replay Testing. We knew we could test the same query with the same inputs and consistently expect the same results.
We couldn’t replay test GraphQL queries or mutations that requested non-idempotent fields.
And we definitely couldn’t replay test non-functional requirements like caching and logging user interaction. In such cases, we were not testing for response data but for overall behavior. So we relied on higher-level, metrics-based testing: AB Testing and Sticky Canaries.
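To make the distinction concrete, here is a minimal, hypothetical pair of GraphQL operations (the mutation, its fields, and the argument names are illustrative, not from our actual schema). The query is idempotent and safe to replay; the mutation changes state, so replaying it against production would corrupt data:

```graphql
# Idempotent: the same query with the same inputs consistently
# returns the same data, so it is a good candidate for Replay Testing.
query CertificationRating {
  video(videoId: 81496962) {
    certificationRating
  }
}

# Non-idempotent: replaying this would repeatedly modify the member's
# list, so it must be excluded from Replay Testing.
mutation AddToMyList {
  addToMyList(videoId: 81496962) {
    success
  }
}
```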
Let’s discuss the three testing strategies in more detail.
Netflix traditionally uses AB Testing to evaluate whether new product features resonate with customers. In Phase 1, we leveraged the AB testing framework to isolate a user segment into two groups totaling 1 million users. The control group’s traffic used the legacy Falcor stack, while the experiment population leveraged the new GraphQL client and was directed to the GraphQL Shim. To determine customer impact, we could compare various metrics such as error rates, latencies, and time to render.
We set up a client-side AB experiment that tested Falcor versus GraphQL and reported coarse-grained quality of experience (QoE) metrics. The AB experiment results hinted that GraphQL’s correctness was not up to par with the legacy system. We spent the next few months diving into these high-level metrics and fixing issues such as cache TTLs, flawed client assumptions, etc.
Wins
High-Level Health Metrics: AB Testing provided the assurance we needed in our overall client-side GraphQL implementation. This helped us successfully migrate 100% of the traffic on the mobile homepage canvas to GraphQL in six months.
Gotchas
Error Identification: With an AB test, we could see coarse-grained metrics which pointed to potential issues, but it was challenging to diagnose the exact issues.
The next phase in the migration was to reimplement our existing Falcor API in a GraphQL-first server (Video API Service). The Falcor API had become a logic-heavy monolith with over a decade of tech debt, so we had to ensure that the reimplemented Video API server was bug-free and identical to the already productized Shim service.
We developed a Replay Testing tool to verify that idempotent APIs were migrated correctly from the GraphQL Shim to the Video API service.
The Replay Testing framework leverages the @override directive available in GraphQL Federation. This directive tells the GraphQL Gateway to route to one GraphQL server over another. Take, for instance, the following two GraphQL schemas defined by the Shim Service and the Video Service:
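The schemas are sketched below in simplified form; the subgraph name in @override and the @key field are assumptions for illustration, not our production schema:

```graphql
# Phase 1: field defined and resolved by the GraphQL Shim subgraph
type Video @key(fields: "videoId") {
  videoId: ID!
  certificationRating: String
}

# Phase 2: the Video API subgraph redefines the same field with @override,
# telling the Gateway to resolve it here instead of in the Shim
type Video @key(fields: "videoId") {
  videoId: ID!
  certificationRating: String @override(from: "shim")
}
```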
The GraphQL Shim first defined the certificationRating field (things like Rated R or PG-13) in Phase 1. In Phase 2, we stood up the Video Service and defined the same certificationRating field marked with the @override directive. The presence of the identical field with the @override directive informed the GraphQL Gateway to route the resolution of this field to the new Video Service rather than the old Shim Service.
The Replay Tester tool samples raw traffic streams from Mantis. With these sampled events, the tool can capture a live request from production and run an identical GraphQL query against both the GraphQL Shim and the new Video API service. The tool then compares the results and outputs any differences in the response payloads.
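At its core, the comparison step looks something like this sketch in Java, with hypothetical endpoint URLs and Jackson for JSON handling; the real tool also deals with sampling, authentication, and reporting:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Replay-diff sketch: run one captured query against both services and
 *  print the flattened JSON paths where the responses disagree. */
public class ReplayDiff {
  private static final ObjectMapper MAPPER = new ObjectMapper();
  private static final HttpClient HTTP = HttpClient.newHttpClient();

  public static void main(String[] args) throws Exception {
    // A captured production request (illustrative query and endpoints).
    String body = "{\"query\":\"{ videos { tags { id displayName } } }\"}";
    JsonNode control = post("https://shim.example.com/graphql", body);
    JsonNode experiment = post("https://videoapi.example.com/graphql", body);
    diff("", control, experiment);
  }

  private static JsonNode post(String url, String body) throws Exception {
    HttpRequest req = HttpRequest.newBuilder(URI.create(url))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();
    return MAPPER.readTree(HTTP.send(req, HttpResponse.BodyHandlers.ofString()).body());
  }

  /** Walk the control payload, printing "path: (control, experiment)" on mismatch.
   *  (A complete tool would also walk fields present only in the experiment.) */
  private static void diff(String path, JsonNode control, JsonNode experiment) {
    if (control.isObject()) {
      control.fields().forEachRemaining(e ->
          diff(path + "/" + e.getKey(), e.getValue(), experiment.path(e.getKey())));
    } else if (control.isArray()) {
      for (int i = 0; i < control.size(); i++) {
        diff(path + "/" + i, control.get(i), experiment.path(i));
      }
    } else if (!control.equals(experiment)) {
      System.out.println(path + ": (" + control + ", " + experiment + ")");
    }
  }
}
```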
Note: We do not replay test Personally Identifiable Information. Replay Testing is used only for non-sensitive product features on the Netflix UI.
Once a test is completed, the engineer can view the diffs displayed as a flattened JSON node. The control value appears on the left side of the comma in parentheses and the experiment value on the right.
/data/videos/0/tags/3/id: (81496962, null)
/data/videos/0/tags/5/displayName: (Série, value: "S\303\251rie")
We captured two diffs above: the first had missing data for an ID field in the experiment, and the second had an encoding difference. We also saw differences in localization, date precision, and floating-point accuracy. The tool gave us confidence in the replicated business logic, where subscriber plans and user geographic location determine a customer’s catalog availability.
Wins
- Confidence in parity between the two GraphQL implementations
- Enabled tuning configs in cases where data was missing due to over-eager timeouts
- Tested business logic that required many (unknown) inputs and where correctness can be hard to eyeball
Gotchas
- PII and non-idempotent APIs should not be tested using Replay Tests, and it would be valuable to have a mechanism to prevent that.
- Manually constructed queries are only as good as the features the developer remembers to test. We ended up with untested fields simply because we forgot about them.
- Correctness: The idea of correctness can be confusing too. For example, is it more correct for an array to be empty or null, or is it just noise? Ultimately, we matched the existing behavior as much as possible because verifying the robustness of the client’s error handling was difficult.
Despite these shortcomings, Replay Testing was a key indicator that we had achieved functional correctness for most idempotent queries.
While Replay Testing validates the functional correctness of the new GraphQL APIs, it does not provide any performance or business metric insight, such as the overall perceived health of user interaction. Are users clicking play at the same rates? Are things loading in time before the user loses interest? Replay Testing also cannot be used for non-idempotent API validation. We reached for a Netflix tool called the Sticky Canary to build confidence.
A Sticky Canary is an infrastructure experiment where customers are assigned either to a canary or baseline host for the entire duration of an experiment. All incoming traffic is allocated to an experimental or baseline host based on the device and profile, similar to a bucket hash. The experimental host deployment serves all the customers assigned to the experiment. Watch our Chaos Engineering talk from AWS re:Invent to learn more about Sticky Canaries.
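The “sticky” part is simply that the assignment is a deterministic function of stable identifiers. Here is a minimal sketch of the idea in Java (this is not Zuul’s actual implementation; a real allocator would also salt the hash per experiment):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Sticky-assignment sketch: hashing stable identifiers into buckets means a
 *  customer gets the same cluster on every request for the experiment's duration. */
public class StickyAssignment {

  enum Cluster { BASELINE, CANARY }

  /** canaryPercent is the share of allocated traffic (0-100) sent to the canary. */
  static Cluster assign(String deviceId, String profileId, int canaryPercent) {
    CRC32 crc = new CRC32();
    crc.update((deviceId + ":" + profileId).getBytes(StandardCharsets.UTF_8));
    long bucket = crc.getValue() % 100;  // same inputs always yield the same bucket
    return bucket < canaryPercent ? Cluster.CANARY : Cluster.BASELINE;
  }

  public static void main(String[] args) {
    // The same device/profile pair always maps to the same cluster.
    System.out.println(assign("device-123", "profile-9", 5));
    System.out.println(assign("device-123", "profile-9", 5));  // identical output
  }
}
```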
For our GraphQL APIs, we used a Sticky Canary experiment to run two instances of our GraphQL gateway. The baseline gateway used the existing schema, which routes all traffic to the GraphQL Shim. The experimental gateway used the new proposed schema, which routes traffic to the latest Video API service. Zuul, our primary edge gateway, assigns traffic to either cluster based on the experiment parameters.
We then collect and analyze the performance of the two clusters. Some KPIs we monitor closely include:
- Median and tail latencies
- Error rates
- Logs
- Resource utilization: CPU, network traffic, memory, disk
- Device QoE (Quality of Experience) metrics
- Streaming health metrics
We started small, with tiny customer allocations for hour-long experiments. After validating performance, we slowly built up scope: we increased the percentage of customer allocations, introduced multi-region tests, and eventually ran 12-hour or day-long experiments. Validating along the way is essential, since Sticky Canaries impact live production traffic and are assigned persistently to a customer.
After several Sticky Canary experiments, we had assurance that Phase 2 of the migration improved all core metrics, and we could dial up GraphQL globally with confidence.
Wins
Sticky Canaries were essential to building confidence in our new GraphQL services.
- Non-Idempotent APIs: these tests are compatible with mutating or non-idempotent APIs
- Business metrics: Sticky Canaries validated that our core Netflix business metrics had improved after the migration
- System performance: insights into latency and resource utilization helped us understand how scaling profiles change after the migration
Gotchas
- Negative Customer Impact: Sticky Canaries can impact real users. We needed confidence in our new services before persistently routing some customers to them. This is partially mitigated by real-time impact detection, which will automatically cancel experiments.
- Short-lived: Sticky Canaries are meant for short-lived experiments. For longer-lived tests, a full-blown AB test should be used.
Technology is constantly changing, and we, as engineers, spend a large part of our careers performing migrations. The question is not whether we are migrating but whether we are migrating safely, with zero downtime, in a timely manner.
At Netflix, we have developed tools that build confidence in these migrations, targeted toward each specific use case being tested. We covered three such tools, AB Testing, Replay Testing, and Sticky Canaries, that we used for the GraphQL migration.
This blog post is part of our Migrating Critical Traffic series. Also, check out Migrating Critical Traffic at Scale (part 1, part 2) and Ensuring the Successful Launch of Ads.