
Notifications are a key aspect of the Slack user experience. Users rely on timely notifications of mentions and DMs to stay on top of important information. Poor notification completeness erodes the trust of all Slack users.
Notifications flow through almost all of the systems in our infrastructure. As illustrated in Figure 1 below, a notification request flows through the webapp (our application logic and web / Desktop client monorepo), job queue, push service, and several third-party services before hitting our iOS, Android, Desktop, or web clients.
Further, the decision about when and where to send a notification is also very complicated, as shown in Figure 2 below, which is from our 2017 blog post (also summarized here).
Since 2017, our notification workflow has only grown more complex with the addition of new features like Huddles and Canvas. As a result, fixing notification issues can lead to multi-day debugging sessions across multiple teams. Customer tickets related to notifications also had the lowest NPS scores and took the longest time to resolve compared to other customer issues.
Debugging notification issues within our systems was difficult because each system had a different logging pipeline and data format, making it necessary to look at data with different formats and backends. The context in which events were logged also varied across systems, prolonging any investigation. The result was a time-consuming process that took several days and required expertise in all parts of the stack just to understand what happened.
We began a project to trace the flow of notifications across our systems to address these challenges. The goal was to standardize the data format and semantics of events to make it easier to understand and debug notification data. We wanted to answer questions about a notification such as: was it sent, where was it sent, was it viewed, and did the user open it. This post documents our multi-quarter, cross-organizational journey of tracing notifications throughout Slack’s backend systems, and how we use this trace data to improve the Slack customer experience for everyone.
Notification flow
The sequence of steps by which notifications are sent and received is something we’ve dubbed the “notification flow.” The first step to improve the notification flow was to model the steps in the notification process the same way across all our clients. We also aimed to capture all events in a common data model, consistently in the same format.
We created a notification spec to understand all the events in a notification trace. This involved identifying all the events in a trace, creating an idealized funnel, and setting the context in which each event would be logged. We also had to agree on the semantics of a span and the names of the events, which was a challenging task across different platforms. The result is a notification flow (simplified for this blog post), shown in the image below.
Mapping notification flow to a trace
After we finished planning the flow of our system, we needed to pick a way to keep track of that information. We chose SlackTrace because a trace is a natural way to represent a flow, and all the parts of our system can already send information in the span event format. However, we encountered two major challenges when modeling notification flows as traces.
- 100% sampling for notification flows: Unlike backend requests, which were sampled at 1%, notification flows shouldn’t be sampled, since our customer experience (CE) team wanted 100% fidelity to answer all customer requests. In some scenarios like `@here` and `@channel`, a push notification message could be sent to hundreds of thousands of users across multiple devices, resulting in billions of spans for a single trace of a Slack message. A trace with potentially billions of spans would wreak havoc on our trace ingestion pipeline and storage backends. No sampling would also force us to trace every Slack message sent.
- Tracing notifications as a flow separate from the original message-sent trace: Currently, OpenTelemetry (OpenTracing) instrumentation tightly couples tracing to a request context. In a notification flow, this tight coupling would break, since the notification flow executes in multiple contexts and doesn’t cleanly map to a single request context. Further, mixing multiple trace contexts made implementing tracing across our code challenging.
To solve both of these challenges we decided to model each notification sent as its own trace. To tie the sender’s trace to each of the notifications sent, we used span links to causally link the spans together. Each notification was assigned a notification_id, which was used as the trace_id for the notification flow.
This approach has several advantages:
- Since SlackTrace’s instrumentation doesn’t tightly couple trace context propagation with request context propagation, modeling these flows greatly simplifies the trace instrumentation.
- Since each notification sent was its own trace, the traces were smaller and easier to store and query.
- It allowed 100% sampling for notification traces, while keeping the sender’s sampling rate at 1%.
- Span linking helped us preserve causality in the trace data.
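As a rough illustration of this modeling choice, the sketch below fans one sender request out into per-notification traces. It is not SlackTrace code: the `Span` and `SpanLink` classes, the `fan_out` and `new_id` helpers, and the ID format are all hypothetical and exist only to show how a notification_id can double as a trace_id while a span link on the trigger span preserves causality.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


def new_id() -> str:
    """Random hex identifier (illustrative; the real ID format is an assumption)."""
    return uuid.uuid4().hex


@dataclass
class SpanLink:
    """Pointer from one span to a span in another trace."""
    trace_id: str
    span_id: str


@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str = field(default_factory=new_id)
    parent_id: Optional[str] = None
    tags: Dict[str, str] = field(default_factory=dict)
    links: List[SpanLink] = field(default_factory=list)


def fan_out(request_id: str, user_ids: List[str]) -> Tuple[Span, List[Span]]:
    """Create one trigger span in the sender's trace and one trace per notification."""
    # The trigger span stays in the sender's request trace.
    trigger = Span(name="notification:trigger", trace_id=request_id)
    notifies = []
    for user_id in user_ids:
        # Each notification gets its own ID, reused as the trace ID of its flow.
        notification_id = new_id()
        notify = Span(
            name="notification:notify",
            trace_id=notification_id,
            parent_id=trigger.span_id,
            tags={"user_id": user_id},
        )
        # The link records which notification traces this request caused.
        trigger.links.append(SpanLink(notification_id, notify.span_id))
        notifies.append(notify)
    return trigger, notifies
```

With this shape, the sender's trace holds a single trigger span however many recipients there are, so it can stay under the 1% sampling policy while every small notification trace is kept.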
Different teams worked together to map each step in the notification flow to a span. The result is the table shown below.
Span name | Description | Trace ID | Parent span ID | Span tags |
notification:trigger | Determine whether the notification should be sent. | trace_id is the request ID. Span links hold the list of notification_ids sent. | | trigger_type (DM, @here, @channel), user_id, team_id, channel_id, message_ts, notification_id |
notification:notify | Notify the user on all of their clients. | trace_id is notification_id | ID of notification:trigger span. | user_id, team_id, channel_id, message_ts |
notification:sent | Notification is sent to each of the Slack clients on the user’s device. | trace_id is notification_id | ID of notification:notify span. | channel_id, platform-specific notification tags |
notification:received | Notification is received on the user’s Slack client. | trace_id is notification_id | ID of notification:sent span. | Service name is the client name, plus client tags. |
notification:opened | User opened a notification on the device. | trace_id is notification_id | ID of notification:received span. | Service name is the client name, plus client tags. |
notification:read in app | User clicked on the notification to view it in the app. The span starts right after opening and ends when the message is rendered in the channel. | trace_id is notification_id | ID of notification:opened span. | Service name is the client name, plus client tags. |
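One way to read the table is as a parent-child contract that can be checked mechanically. The sketch below encodes the "Parent span ID" column as a lookup and flags spans whose parent does not match it. The dictionary-based span shape and the `parent_violations` helper are assumptions for illustration, not part of the spec itself.

```python
from typing import Dict, List

# Expected parent span name for each span name, taken from the table above.
# notification:trigger is absent because the table lists no parent for it.
EXPECTED_PARENT: Dict[str, str] = {
    "notification:notify": "notification:trigger",
    "notification:sent": "notification:notify",
    "notification:received": "notification:sent",
    "notification:opened": "notification:received",
    "notification:read in app": "notification:opened",
}


def parent_violations(spans: List[dict]) -> List[str]:
    """Return span_ids whose parent is missing or has an unexpected name."""
    by_id = {s["span_id"]: s for s in spans}
    bad = []
    for span in spans:
        expected = EXPECTED_PARENT.get(span["name"])
        if expected is None:
            continue  # no parent requirement for this span name
        parent = by_id.get(span.get("parent_id"))
        if parent is None or parent["name"] != expected:
            bad.append(span["span_id"])
    return bad
```

Because the trigger span lives in the sender's trace, a check like this needs the trigger span passed in alongside the notification trace's own spans.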
Advantages of modeling a notification flow as a trace
Representing the notification flow as a Trace/SpanEvent has the following advantages over our existing methods.
- Consistent data format: Since all the services reported the data as a Span, the data from various backend and client systems was in the same format.
- Service name to identify source: We set the service name field to Desktop, iOS, or Android to uniquely identify the client or service that generated an event.
- Standard names for contexts: We used the span name and service name to uniquely identify an event across systems. For example, the service name for a notification:received event would be iOS, Android, or Web to accurately tag these events. Previously, the events from these three clients had different formats and were hard to query uniformly.
- Standardized timestamps and duration fields: All the events have a consistent timestamp in the same resolution and time zone as the rest of the events. If there is a duration associated with an event, we set the duration field; when reporting a one-off event, we set it to a default value of 1. This provided a single place for storing all of our duration information.
- Built-in sessions: We use the notification ID as the trace ID for the entire flow. As a result, all the events in a flow are already sessionized and there is no need to further sessionize the data. Without this, we couldn’t use the notification ID as the join key everywhere, since only some events would have a notification ID. For example, the notification triggered or notification read events wouldn’t have a notification ID in them. We can use the trace ID to tie these events together instead of using bespoke events.
- Clean, simple, and reliable instrumentation: Since a trace is sessionized, we only need to add the tags to the trace once when we model the notification flow as a trace. This also made the instrumentation code cleaner, simpler, and more reliable, since the changes were localized to small parts of the code that can be unit tested well. It also made the data easier to use, since there is only one join key instead of a bespoke join key for some subset of events.
- Flexible data model: This model is also flexible and extensible. If a client needs to add additional context, it can add additional tags to an existing span. If none of the existing spans are a good fit, it can add a new span to the trace, without altering the existing trace data or trace queries.
- No duplicate events: The SpanID in the event helped capture the uniqueness of events at the source. This reduced the number of events that were double reported and removed the need to de-dupe events in our backend again. The older method reported thrift objects without unique IDs, which led to using de-dupe jobs to identify double reporting of events.
- Span linking for tying related traces together: Linking spans across traces helps preserve causality without resorting to ad hoc data modeling.
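Several of these properties (one format, a default duration for one-off events, the trace ID as the only join key, and span IDs for de-duplication) can be seen together in a small sketch. The field names, the `span_event` and `sessionize` helpers, and the "push" service name are hypothetical, and the duration unit is left unspecified because the text does not state it.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def span_event(name: str, service: str, trace_id: str, span_id: str,
               timestamp: int, duration: int = 1) -> dict:
    """Build an event in one common shape; one-off events default to duration 1."""
    return {
        "name": name,
        "service": service,    # e.g. iOS, Android, Desktop
        "trace_id": trace_id,  # the notification_id for notification flows
        "span_id": span_id,
        "timestamp": timestamp,
        "duration": duration,
    }


def sessionize(events: Iterable[dict]) -> Dict[str, List[dict]]:
    """Group events by trace_id, dropping repeats of an already-seen span_id."""
    seen = set()
    sessions = defaultdict(list)
    for event in events:
        if event["span_id"] in seen:
            continue  # double-reported event; the span_id identifies it
        seen.add(event["span_id"])
        sessions[event["trace_id"]].append(event)
    for spans in sessions.values():
        spans.sort(key=lambda e: e["timestamp"])
    return dict(sessions)
```

Grouping on `trace_id` alone is the whole sessionization step here, which is the point of the "built-in sessions" item: no separate join key or de-dupe job is needed.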
How we use notification trace data at Slack
After several quarters of hard work by multiple teams, we were able to trace notifications end-to-end across all the Slack clients. Our traces were sent to a real-time store and our data warehouse using the trace ingestion pipeline.
Developers use the notification trace data to triage issues. Previously, tracking notification failures involved going through logs of multiple systems to understand where a notification was dropped. This process was involved and took several hours of very senior engineers’ time to understand what went on. After notification tracing, anyone could look at a trace of the notification to see precisely where the notification was sent and where in the flow it was dropped.
Our customer experience team uses trace data to triage customer issues much faster these days. We now know precisely where in the notification flow a message was dropped. Since our traces are easier to read, our CE engineers can look at a trace to learn what happened to a notification and answer a customer’s query instead of escalating it to the development team, who then had to comb through the various logs. This helped us triage our notifications much more quickly, and reduced the time to triage notification tickets for our CE team by 30%.
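The triage step described above amounts to finding the first stage of the flow that has no span in a notification's trace. A minimal sketch follows, assuming spans arrive as dictionaries with a `name` key and using the span names from the table. The `first_missing_stage` helper is hypothetical.

```python
from typing import List, Optional

# Stages of a notification trace, in flow order (names from the span table).
FLOW = [
    "notification:notify",
    "notification:sent",
    "notification:received",
    "notification:opened",
    "notification:read in app",
]


def first_missing_stage(spans: List[dict]) -> Optional[str]:
    """Return the first flow stage with no span in this trace, or None if complete."""
    present = {span["name"] for span in spans}
    for stage in FLOW:
        if stage not in present:
            return stage
    return None
```

A missing notification:received span after a notification:sent span points at delivery, whereas a missing notification:opened span may simply mean the user never tapped the notification, so the result still needs human interpretation.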
Notification analytics
Today, we ingest notification trace data into Elasticsearch/Grafana and our data warehouse.
Our iOS and Android engineers have started using this data to build Grafana dashboards and alerts to understand the performance of our clients. Typically, client engineers don’t use dashboarding tools like Grafana, but our client engineers have used them very effectively to triage and debug issues in our notification flow.
We have also ingested this data into our data warehouse, where anyone can run complex analytics on it. Initially, data scientists used this data to understand performance regressions in our clients over long periods of time.
The span event format and tracing system also had an unexpected benefit. Our data scientists used this data to build a product analytics dashboard showing funnel analytics on notification flows, to better understand notification open rates. Typically, that product analytics data would be captured by a separate set of instrumentation ingested via a different pipeline into the data warehouse. However, since we sent the trace data to the data warehouse, our data scientists can use it to compute funnel analytics and get the same insights.
An even more extraordinary outcome was that the data scientists were able to mine the trace data to identify and report bugs in the application and its instrumentation. In the two years since, notification traces have been used many times outside of the initial use case. This shows the advantage of using trace data as a single source of truth that supports multiple use cases.
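Funnel analytics over trace data reduces to counting distinct trace IDs per span name. The sketch below shows the idea in plain Python rather than a warehouse query. The three-stage `FUNNEL`, the dictionary span shape, and the `funnel_counts` and `open_rate` helpers are illustrative assumptions, not the dashboard's actual definition.

```python
from collections import defaultdict
from typing import Dict, Iterable

# Funnel stages to report, in order (span names from the table above).
FUNNEL = ["notification:sent", "notification:received", "notification:opened"]


def funnel_counts(spans: Iterable[dict]) -> Dict[str, int]:
    """Count distinct notification traces that reached each funnel stage."""
    traces_by_stage = defaultdict(set)
    for span in spans:
        # A set per stage means a notification sent to several clients counts once.
        traces_by_stage[span["name"]].add(span["trace_id"])
    return {stage: len(traces_by_stage[stage]) for stage in FUNNEL}


def open_rate(counts: Dict[str, int]) -> float:
    """Share of sent notifications that were opened (0.0 when nothing was sent)."""
    sent = counts["notification:sent"]
    return counts["notification:opened"] / sent if sent else 0.0
```

Counting traces rather than spans matters because one notification can produce several notification:sent spans, one per client.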
Conclusion
Modeling flows or funnels as a trace is a great idea, but there are some challenges. In this blog post we have shown how Slack modeled notification flows as traces, the challenges we faced, and how we overcame them through careful modeling.
Implementing notification tracing would not have been possible without decoupling trace context propagation from the request context in the SlackTrace framework. The instrumentation helped us quickly and cleanly implement tracing across multiple backend services, while avoiding the negative side effects of existing libraries, such as cluttered instrumentation and large traces. Today, we instrument several other flows in the production Slack app using the same strategy.
Modeling notification flows as trace data helped our CE team resolve notification issues 30% faster while also reducing escalations to the development team.
In addition to the original use case of debugging notification issues, notification trace data was also used to calculate funnel analytics for product analytics use cases. Modeling product analytics data as traces provides high-quality data in a consistent format across all of our complex stack. Further, the built-in sessionization of trace data simplified our analytics pipeline by eliminating additional jobs to de-dupe and sessionize the data. In the past two years, backend and frontend developers and data scientists have used the trace data as a single source of truth for multiple use cases.
The success of notification tracing has encouraged several other use cases where flows are modeled as traces at Slack. Today there are at least a dozen tracers running simultaneously in the Slack app.
Interested in taking on interesting projects, making people’s work lives easier, or optimizing some code? We’re hiring! 💼 Apply now