
At Meta, Bento is our internal Jupyter notebooks platform that is used by many internal users. Notebooks are also used extensively for creating reports and workflows (for example, performing data ETL) that need to be repeated at certain intervals. Users with such notebooks have to remember to manually run their notebooks at the required cadence – a process people might forget because it doesn't scale with the number of notebooks involved.
To address this problem, we invested in building a scheduled notebooks infrastructure that fits in seamlessly with the rest of the internal tooling available at Meta. Investing in infrastructure helps ensure that privacy is inherent in everything we build. It enables us to continue building innovative, valuable features in a privacy-safe way.
The ability to transparently answer questions about how data flows through Meta's systems, for the purposes of data privacy and regulatory compliance, differentiates our scheduled notebooks implementation from the rest of the industry.
In this post, we'll explain how we married Bento with our batch ETL pipeline framework, Dataswarm (think Apache Airflow), in a privacy- and lineage-aware manner.
The challenge of scheduled notebooks at Meta
At Meta, we're committed to maintaining confidence in production by performing static analysis on scheduled artifacts and maintaining coherent narratives around dataflows by leveraging transparent Dataswarm Operators and data annotations. Notebooks pose a particular challenge because:
- Due to dynamic code content (think table names created via f-strings, for instance), static analysis won't work, making it harder to understand data lineage.
- Since notebooks can contain arbitrary code, their execution in production is considered "opaque," as data lineage can't be determined, validated, or recorded.
- Scheduled notebooks sit on the production side of the production–development barrier. Before anything runs in production, it needs to be reviewed, and reviewing notebook code is non-trivial.
These three considerations shaped and influenced our design choices. Specifically, we restricted the notebooks that can be scheduled to those primarily performing ETL and those performing data transformations and displaying visualizations. Notebooks with any other side effects are currently out of scope and are not eligible to be scheduled.
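The first point is easy to illustrate. In a snippet like the following (a hypothetical example, not real Bento code), static analysis sees only the f-string template, so the concrete table being read – and therefore the notebook's data lineage – can't be derived from the source alone:

```python
def get_region_from_config() -> str:
    # Stand-in for a runtime lookup; the actual value isn't known statically.
    return "emea"

region = get_region_from_config()
# Static analysis sees "daily_metrics_{region}", not the concrete table name,
# which only exists once the notebook actually runs.
query = f"SELECT metric, value FROM daily_metrics_{region}"
```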
How scheduled notebooks work at Meta
There are three main components for supporting scheduled notebooks:
- The UI for setting up a schedule and creating a diff (Meta's pull request equivalent) that needs to be reviewed before the notebook and associated Dataswarm pipeline get checked into source control.
- The debugging interface once a notebook has been scheduled.
- The integration point (a custom Operator) with Meta's internal scheduler that actually runs the notebook. We call this BentoOperator.
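As a rough mental model of that third component, the Operator's job breaks down into three phases: fetch inputs, run the notebook offline, and upload the outputs. The sketch below is purely structural – Dataswarm's real Operator API is internal to Meta, and all names here are illustrative:

```python
import tempfile

class BentoOperator:
    """Structural sketch of the scheduler integration point.
    Dataswarm's real API is internal; method names are illustrative."""

    def __init__(self, notebook_path: str, purpose_policy_zone: str):
        self.notebook_path = notebook_path
        # Ties the run into the data purpose framework (see below).
        self.purpose_policy_zone = purpose_policy_zone

    def fetch_inputs(self, workdir: str) -> dict:
        # Parse custom-cell metadata and run transparent fetch operators,
        # persisting results as local CSV files on the ephemeral host.
        return {}

    def run_offline(self, workdir: str, inputs: dict) -> str:
        # Execute the notebook in a container with no network access;
        # custom cells have been rewritten to read the local CSVs.
        return f"{self.notebook_path}: rendered"

    def upload_outputs(self, workdir: str, rendered: str) -> None:
        # Upload output CSVs to the warehouse via transparent operators
        # and store the rendered notebook with its outputs.
        pass

    def execute(self) -> str:
        # The temporary directory (and its CSVs) is cleaned up on exit,
        # mirroring the garbage collection described below.
        with tempfile.TemporaryDirectory() as workdir:
            inputs = self.fetch_inputs(workdir)
            rendered = self.run_offline(workdir, inputs)
            self.upload_outputs(workdir, rendered)
        return rendered
```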
How BentoOperator works
To address the majority of the concerns highlighted above, we perform the notebook execution step in a container without access to the network. We also leverage input and output data annotations to surface the flow of data.

For ETL, we fetch data and write it out in a novel way:
- Supported notebooks perform data fetches in a structured manner via custom cells that we've built. One example is the SQL cell. When BentoOperator runs, the first step involves parsing the metadata associated with these cells, fetching the data using transparent Dataswarm Operators, and persisting it in local CSV files on the ephemeral remote hosts.
- Instances of these custom cells are then replaced with a call to pandas.read_csv() to load that data in the notebook, unlocking the ability to execute the notebook without any access to the network.
- Data writes also leverage a custom cell, which we replace with a call to pandas.DataFrame.to_csv() to persist to a local CSV file. We then process this file after the actual notebook execution is complete and upload the data to the warehouse using transparent Dataswarm Operators.
- After this step, the temporary CSV files are garbage-collected, the resulting notebook version with outputs is uploaded, and the ephemeral execution host is deallocated.
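Conceptually, the cell replacement amounts to swapping a custom cell's source for plain pandas I/O against the pre-fetched local files. A minimal sketch – the cell schema and metadata keys here are invented for illustration, not Bento's real format:

```python
def rewrite_custom_cell(cell: dict) -> dict:
    """Swap a custom cell's source for plain pandas I/O on local CSVs.
    The metadata keys used here are invented for illustration."""
    meta = cell.get("metadata", {})
    kind = meta.get("custom_cell")
    if kind == "sql_input":
        # The query result was already fetched by a transparent Dataswarm
        # Operator and persisted locally, so the cell just loads the CSV.
        cell["source"] = f"{meta['df_name']} = pandas.read_csv({meta['csv_path']!r})"
    elif kind == "data_output":
        # The write is deferred: dump to a local CSV now; the data is
        # uploaded to the warehouse after the notebook run completes.
        cell["source"] = f"{meta['df_name']}.to_csv({meta['csv_path']!r}, index=False)"
    return cell
```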


Our approach to privacy with BentoOperator
We have integrated BentoOperator with Meta's data purpose framework to ensure that data is used only for the purpose it was intended. This framework ensures that the data usage purpose is respected as data flows and transmutes across Meta's stack. As part of scheduling a notebook, a "purpose policy zone" is supplied by the user, and this serves as the integration point with the data purpose framework.
Overall user workflow
Let's now explore the workflow for scheduling a notebook:
We've exposed the scheduling entry point directly from the notebook header, so all users need to do is hit a button to get started.
The first step in the workflow is setting up some parameters that will be used to automatically generate the pipeline for the schedule.
The next step involves previewing the generated pipeline before a diff is created in Phabricator (Meta's diff review tool).
In addition to the pipeline code for running the notebook, the notebook itself is also checked into source control so it can be reviewed. The results of attempting to run the notebook in a scheduled setup are also included in the test plan.
Once the diff has been reviewed and landed, the schedule starts running the next day. In the event that the notebook execution fails for whatever reason, the schedule owner is automatically notified. We've also built a context pane extension directly in Bento to help with debugging notebook runs.
What's next for scheduled notebooks
While we've addressed the challenge of supporting scheduled notebooks in a privacy-aware manner, the notebooks in scope for scheduling are limited to those performing ETL or those performing data analysis with no other side effects. This is only a fraction of the notebooks users will eventually want to schedule. In order to increase the number of supported use cases, we'll be investing in supporting other transparent data sources in addition to the SQL cell.
We have also begun work on supporting parameterized notebooks in a scheduled setup. The idea is to support cases where, instead of checking many notebooks into source control that differ only by a few variables, we check in a single notebook and inject the differentiating parameters at runtime.
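In spirit this resembles Papermill's injected-parameters cell: at runtime, a generated code cell assigning that run's parameter values is placed ahead of the notebook's own cells. A hypothetical sketch, using a simplified stand-in for the real notebook format:

```python
def inject_parameters(cells: list, params: dict) -> list:
    """Prepend a generated cell assigning the run's parameter values,
    similar to Papermill's 'injected-parameters' cell. The cell shape
    here is a simplified stand-in for the real notebook format."""
    source = "\n".join(f"{name} = {value!r}" for name, value in params.items())
    injected = {
        "cell_type": "code",
        "metadata": {"tags": ["injected-parameters"]},
        "source": source,
    }
    # The checked-in notebook is untouched; only this run sees the values.
    return [injected] + list(cells)
```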
Lastly, we'll be working on event-based scheduling (in addition to the time-based approach we have today) so that a scheduled notebook can also wait for predefined events before running. This would include, for example, the ability to wait until all the data sources the notebook depends on have landed before notebook execution begins.
Acknowledgments
Some of the approaches we took were directly inspired by the work done on Papermill.