Tofino, data storage, and how we got to Mentat

Wed 31 May 2017 / tagged: firefox, experimental, tofino, photino, clojure, clojurescript, rust, datomic, datomish, datascript, mentat

Project Tofino was an attempt by Mozilla to develop a new web browser. I think Mark Mayo, Senior Vice President of Firefox, described Tofino best:

[W]e’re working on browser prototypes that look and feel almost nothing like the current Firefox… [we’re trying to solve] the kinds of problems people have that aren’t currently solved by anybody’s browser product.

My colleague Richard Newman and I were responsible for designing and implementing a data storage layer to back Tofino. The engineering organizations I’ve been a part of and have witnessed at Mozilla do not have a culture of lessons learned, final reports, and post-mortems. I’m a context-building, history-oriented learner, so I’d like to change that, one little bit, by providing my biased account of the forces that resulted in us proposing Mentat to meet Tofino’s data storage needs.

Data storage in Tofino

Richard and I have a lot of experience storing data for browsers (which I’ll return to), and Richard captured some of his thinking about storage in a blog post. All told, we wanted storage that:

could handle existing Firefox data like bookmarks, history, and passwords;
was easy to evolve to handle new Tofino data like page thumbnails, fulltext web content indexes, and A/B testing results;
performed well on an under-powered Windows tablet (the kind you might buy at a big-box store).

However, we also carry the scars from implementing multiple Firefox Sync clients. For those not intimately familiar with browsers and Firefox, Sync is a distributed system that co-ordinates browser data (bookmarks, history, passwords, etc.) across your devices (Desktop, Android, iOS). Sync in general is a very hard problem and Firefox Sync is a poor rendition of a sync solution [1]. Richard and I knew that Tofino would eventually evolve to become a Firefox Sync client (or grow a similar Sync solution). So we also wanted storage that:

would support an eventual Firefox Sync client.

To meet Tofino’s needs we proposed and are prototyping Project Mentat, a data store designed for embedded client applications. The initial blog post announcing Mentat (née Datomish) does a great job framing the data problems Tofino and Firefox face and how we’re trying to solve them. In a nutshell, Mentat:

is designed to be embedded into client applications;
manages a strong schema and helps you evolve that schema over time;
separates readers from writers, allowing us to optimize critical queries;
maintains a historical log, supporting three-way merges when syncing.

How we got to Mentat

Ancient history [2011-2016]

Any good origin story starts before the beginning, so in that spirit, let us start with ancient history that predates Tofino. For simplicity, I’ll refer to myself and Richard Newman as the team [2]. Our experience led us to design a storage system supporting (and eventually integrated with) the Firefox Sync implementation. We got here in stages:

Firefox for Desktop is the cautionary tale: this is what happens when storage and syncing are not integrated;
Firefox for Android is the second system that learned some architectural lessons but "didn’t go far enough"; and
Firefox for iOS gets it "mostly right".

To set the scene, Richard inherited Firefox Sync, the Firefox for Desktop implementation. The initial development was done in an add-on [3]; it used observer notifications to witness changes relevant to Sync and track those changes in memory to propagate upstream. The code landed with this architecture and the foundational decision to rely on observer notifications resulted in significant issues [4]:

not all data sources produced the notifications required for Sync;
notifications were lost at start-up, shutdown, and when Sync itself had bugs;
the notification flow is inherently non-transactional [5].

Richard and I wrote the Firefox for Android Sync implementation. Richard owned [6] Sync and, eventually, the Firefox for Android storage implementation. Sync was built as a stand-alone process from day one, factored out of the browser front-end via the Android ContentProvider storage abstraction [7]. Unfortunately, we didn’t appreciate how significant transactional syncing would be, and we made Firefox for Android’s data storage "live": the browser and Sync update the store non-transactionally — potentially at the same time — leading to subtle UI conflicts and missed changes in Sync [8].

Richard also owned the Firefox for iOS storage and Firefox Sync implementations. The two systems were designed to support each other from day one. The storage system is:

entirely transactional and robust in the face of extreme behaviours;
well factored, enabling performance improvements to key user interactions due to the single point of responsibility for data;
significantly less likely to lose or corrupt Sync data than the other implementations.

These performance improvements are most notable in the iOS top sites implementation. Top sites is the panel of most frecent [9] sites shown every time the user taps the URL bar to navigate somewhere new. The iOS top sites panel displays instantly, since the underlying store keeps a materialized view (a read-only computed cache) in memory and updates it efficiently independent of the user interface.
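
To make the materialized view idea concrete, here is a minimal sketch in Python (the iOS code itself is Swift). Everything in it is a hypothetical illustration: the class name, the method names, and especially the scoring rule, which is a plain weighted visit count and not the real frecency algorithm. The point is only that the write path maintains the cached result, so the read path never runs a query.

```python
import heapq


class TopSitesView:
    """Hypothetical in-memory materialized view of the highest-scoring sites."""

    def __init__(self, limit=3):
        self.limit = limit
        self.scores = {}  # url -> running score, updated on every write
        self.top = []     # cached, ready-to-render result

    def record_visit(self, url, weight=1.0):
        # The write path does the work: update the score, then refresh the cache.
        self.scores[url] = self.scores.get(url, 0.0) + weight
        self.top = heapq.nlargest(
            self.limit, self.scores, key=lambda u: (self.scores[u], u)
        )

    def read(self):
        # The read path copies the cached list; no query runs here.
        return list(self.top)


view = TopSitesView(limit=2)
for url in ["a.example", "b.example", "a.example",
            "c.example", "c.example", "c.example"]:
    view.record_visit(url)
top_sites = view.read()
```

This sketch recomputes the top list on every write, which is fine for illustration; a production store would presumably update the view incrementally.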

Enter Tofino [April 2016]

Tofino was pitched as an entirely new browser product. We anticipated a data footprint similar to Firefox for iOS, so we started to express the architectural lessons we had learned from the two previous storage implementations using web technologies [10]. The team started to build an Electron/Node.js desktop application. We quickly discovered (or perhaps, realized) that a client/server architecture is the best model for our web technology implementation. We started to develop a User Agent service: an always available backend that stored browsing data for the user and leveraged that data to help the user exploit the web. We settled into a familiar pattern, recognizable to anyone who has built a Web App:

UI -> transport -> UA service
UI updates locally (optimistic update)
UA service -> transport -> notifies UI to update locally (authoritative update)

where the transport was variously HTTP requests, Web Socket messages, or Electron’s IPC.
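
The pattern above can be sketched in a few lines of Python. This is an assumption-laden toy, not Tofino code: the class names, the message shape, and the URL lower-casing (included only so that the authoritative state visibly differs from the optimistic one) are all invented for illustration, and a direct method call stands in for the transport.

```python
class UserAgentService:
    """Hypothetical stand-in for the always-available backend."""

    def __init__(self):
        self.stars = set()
        self.listeners = []

    def handle(self, message):
        # Apply the change to the authoritative store ...
        if message["type"] == "star":
            self.stars.add(message["url"].lower())
        # ... then publish the authoritative state to every connected UI.
        for listener in self.listeners:
            listener(sorted(self.stars))


class UI:
    def __init__(self, service):
        self.service = service
        self.stars = []
        service.listeners.append(self.on_authoritative_update)

    def star(self, url):
        # Optimistic update: show the change before the service confirms it.
        self.stars = self.stars + [url]
        optimistic = list(self.stars)
        # A direct call stands in for the transport (HTTP, WebSocket, or IPC).
        self.service.handle({"type": "star", "url": url})
        return optimistic

    def on_authoritative_update(self, stars):
        # Authoritative update: replace local state with what the service says.
        self.stars = stars


service = UserAgentService()
ui = UI(service)
optimistic_state = ui.star("HTTPS://Example.com/")
authoritative_state = ui.stars
```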

Tofino almost immediately pivoted to product and user research. Suddenly there was much more interest in capturing event streams, and an immediate need to support rapid prototyping. We started to build a Node.js service, backed by SQLite, to store events, materialize views, and update and publish changes to clients.

Tofino product evolution [September 2016]

We quickly observed that adding data to our hand-written SQLite store required migrating the SQLite store forward rapidly. Each data type we wanted to add required a back-and-forth between the product owner, the front-end team, and the storage team. In response, we started to investigate event stores that might do this work for us. Unfortunately, most offerings were:

not embeddable (targeted the Java Virtual Machine (JVM), or were intended to scale horizontally in the cloud); or
more general than we were comfortable with (graph stores like Neo4j generally don’t query for time ranges efficiently; document stores like MongoDB generally don’t support strong schemas); or
not general enough (key-value stores like LMDB and LevelDB don’t support strong schemas and high-level queries); or
not mature enough to ship to a market of Firefox’s size (side-projects like Cayley).

Evaluating this technical landscape, Richard and I found the ideas behind Cognitect’s Datomic most compelling. Datomic:

is assertion (event) oriented;
maintains a full transaction log;
exposes an expressive, extendable schema;
models row-oriented (relational) data efficiently;
is flexible enough to model graph-oriented data.

Sadly, Datomic targets the JVM and we don’t see a path to shipping 200+ MB of VM and database to Firefox’s market. (In addition, Datomic is not open-source, making it an awkward cultural fit for Mozilla.)

Concurrently, the product and front-end teams wanted to rapidly prototype using tools like GraphQL. Our research into GraphQL suggested that performance would be poor and very difficult to address, but we see Datomic’s query and transact syntax as enabling experimentation by the front-end team similar in spirit (if not expression) to GraphQL, and also possible to make performant.

To research the ideas behind Datomic, we started to adapt Nikita Prokopov’s awesome DataScript, a Clojure{Script} Datomic-alike that transpiles to JavaScript and can run in the browser and in Node.js. DataScript is an in-memory store, not suitable for Firefox-sized workloads, and fundamentally synchronous (which is a big problem in the highly concurrent JavaScript browser environment).

Datomish [July 2016]

The work to make DataScript asynchronous, backed by a persistent store (a flat file, IndexedDB, SQLite) was close to a rewrite, but we still felt that Datomic’s model was compelling for our requirements — particularly our emphasis on experimentation and managing change. So we started to build a Clojure{Script} Datomic-alike. This short-lived prototype was named Datomish. Datomish:

re-used key pieces of DataScript’s source code;
persisted to SQLite;
translated Datomic’s Datalog queries to SQLite queries.
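
To give a flavour of that last point, here is a minimal Python sketch of translating simple Datalog patterns into a SQL self-join. It is not the Datomish or Mentat translator: the single datoms table, its column names, the :page/url and :page/title attributes, and the translate function are all assumptions made for illustration, and it handles only plain [entity attribute value] patterns.

```python
import sqlite3


def translate(find, where):
    """Translate simple (e, a, v) patterns into one SQL self-join (illustrative only)."""
    bindings = {}     # variable -> first column it was bound to
    constraints = []  # SQL conditions
    params = []       # values for the ? placeholders, in order
    for i, pattern in enumerate(where):
        alias = f"d{i}"
        for column, term in zip(("e", "a", "v"), pattern):
            ref = f"{alias}.{column}"
            if isinstance(term, str) and term.startswith("?"):
                if term in bindings:
                    # A repeated variable becomes a join condition.
                    constraints.append(f"{ref} = {bindings[term]}")
                else:
                    bindings[term] = ref
            else:
                # A constant becomes a parameterised equality test.
                constraints.append(f"{ref} = ?")
                params.append(term)
    tables = ", ".join(f"datoms AS d{i}" for i in range(len(where)))
    columns = ", ".join(bindings[var] for var in find)
    sql = f"SELECT DISTINCT {columns} FROM {tables}"
    if constraints:
        sql += " WHERE " + " AND ".join(constraints)
    return sql, params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datoms (e INTEGER, a TEXT, v, tx INTEGER)")
conn.executemany(
    "INSERT INTO datoms VALUES (?, ?, ?, ?)",
    [
        (1, ":page/url", "https://example.com/", 100),
        (1, ":page/title", "Example", 100),
        (2, ":page/url", "https://mozilla.org/", 101),
        (2, ":page/title", "Mozilla", 101),
    ],
)

# Roughly: [:find ?title
#           :where [?p :page/url "https://mozilla.org/"] [?p :page/title ?title]]
sql, params = translate(
    find=["?title"],
    where=[
        ("?p", ":page/url", "https://mozilla.org/"),
        ("?p", ":page/title", "?title"),
    ],
)
rows = conn.execute(sql, params).fetchall()
```

Once the query is ordinary SQL, SQLite’s query planner and indexes do the heavy lifting, which is what makes the performance easier to reason about.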

The main idea was to reduce the unknown performance of Datomic’s Datalog queries to the better known performance of SQLite’s SQL queries. The Datomish prototype convinced us that the SQLite translation could be done efficiently and yield performant queries against the working-sets we expect to witness in Firefox in the wild. It was flexible in the ways we wanted, and early experience suggested that the store was performant enough to back the Tofino browser and related experiments. However, the Datomish prototype was not fit for greater purposes in three regards:

the transpiled JavaScript could not be reasonably shipped in a product with Firefox’s audience;
the ClojureScript prototype suffered from emergent memory leaks due to subtle bugs with our use of ClojureScript’s persistent data structures and communicating sequential processes implementation, and exposed impedance mismatches between JavaScript and ClojureScript; and
we felt that ClojureScript and JavaScript were not the right technologies to back data storage for Android or iOS.

The end of Tofino [December 2016]

At this point, the Tofino product experiment was stopped. None of the UX experiments and prototypes were deemed worthy of future investment. The people working on Tofino joined the people working on Datomish (me and Richard) to prototype a Datomic-alike written in Rust. We hoped to:

ship in Firefox for Desktop;
be able to ship on Android and iOS;
improve on the performance and robustness of the Clojure{Script} implementation.

In response, the newly enlarged team rapidly stood up a Rust version of Datomish, which we named Project Mentat.

Redirection [April 2017]

Throughout Q1 2017, we focused on implementing the core features of Project Mentat. However, senior management redirected effort away from "ship in Firefox for Desktop" and toward two alternate goals:

prototype the new Firefox UI (Photon), using lessons learned from the existing Firefox UI (Australis) implementation expressed using React in Tofino;
re-focus on Mentat as a component of new product experiences, in the same way that Tofino had focused on new product experiences.

In response, we re-focused the people who had been working on Tofino onto the "Photino" Photon UI prototype [11].

Crucially, we continued to build Mentat as a store that:

focused on flexibility and schema evolution; and
could meet performance requirements through suitable abstractions rather than manual tinkering; and
would support a robust Sync solution (that was not necessarily Firefox Sync);

while additionally proposing an architectural split between the user interface, browser data, and web rendering platform that we believe Mozilla should invest in across all its browser offerings.

Status [May 2017]

Where does the Project Mentat codebase stand? As of May 2017 we’ve implemented basic transacting:

assertion and retraction with :db/add and :db/retract
map notation like {:db/id ... :some/attribute :some/value ...}
foundational data types like :db.type/long, :db.type/string, :db.type/keyword, etc
cardinality constraints with :db.cardinality/one and :db.cardinality/many
custom identifiers with :db/ident
schema mutation equivalent to Datomic’s :db.install/attribute
temporary identifier resolution like Datomic’s {:db/id "tempid"}
:db.unique/identity and upserts

and basic querying:

accepting a large subset of Datomic’s Datalog query language
non-trivial joining with :or and :or-join
negation with :not and :not-join
interpolating input values with :in
projecting scalar, vector, tuple, and relation results
ordering and limiting result sets
some non-trivial query pruning and type-aware optimization
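
For readers who have not met Datomic-style transacting, the following Python sketch shows roughly what assertion, retraction, and cardinality constraints mean. It is a toy under stated assumptions, not Mentat’s Rust implementation: the transact function, the (e, a, v) tuple representation, the log entry shape, and the :page/title and :page/tag attributes are all hypothetical.

```python
def transact(datoms, log, schema, tx_id, operations):
    """Apply (op, e, a, v) operations atomically to a set of (e, a, v) datoms."""
    working = set(datoms)
    entries = []
    for op, e, a, v in operations:
        if op == ":db/add":
            if schema[a] == ":db.cardinality/one":
                # A cardinality-one attribute holds one value: retract the old one.
                stale = [d for d in working if d[0] == e and d[1] == a and d[2] != v]
                for old in stale:
                    working.discard(old)
                    entries.append((old[0], old[1], old[2], tx_id, False))
            if (e, a, v) not in working:
                working.add((e, a, v))
                entries.append((e, a, v, tx_id, True))
        elif op == ":db/retract":
            if (e, a, v) in working:
                working.discard((e, a, v))
                entries.append((e, a, v, tx_id, False))
        else:
            raise ValueError(f"unknown operation {op!r}")
    # Only extend the log once every operation has been applied.
    log.extend(entries)
    return working


schema = {":page/title": ":db.cardinality/one", ":page/tag": ":db.cardinality/many"}
log = []
db = transact(set(), log, schema, 1, [
    (":db/add", 10, ":page/title", "Old title"),
    (":db/add", 10, ":page/tag", "news"),
])
db = transact(db, log, schema, 2, [
    (":db/add", 10, ":page/title", "New title"),
    (":db/add", 10, ":page/tag", "tech"),
    (":db/retract", 10, ":page/tag", "news"),
])
```

The log keeps every assertion (True) and retraction (False) with its transaction, which is the kind of history a three-way merge can consult when syncing.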

The Rust implementation is not yet as full featured as the Clojure{Script} prototype:

no support for aggregates like (count), (max), and (min) in queries;
no fulltext search with :db/fulltext true attributes in queries;
no schema registration and migration layer on top of the basic store;
no transactor loop and transaction listeners.

But we think what is implemented is robust and has a clear path to production. We anticipate it will take roughly 3 months to land Mentat into Firefox for Desktop and to back a simple store like logins or form history using the new technology. The most significant work will be managing the application life cycles and locking and concurrency; the foundational work for applying transactions and querying the store is essentially complete.

Review [May 2017]

Mentat is currently running an architectural review gauntlet to decide whether Mozilla will invest further or cancel the project. The architecture review has focused primarily on whether Mentat can address the wide-ranging short-term storage architectural failings that hold Firefox for Desktop down. Unfortunately, we didn’t expect to need to do so at this time — indeed, we were told not to! — so we’re fighting to justify our approach as I type.

We still believe that Mentat can solve some of Firefox’s storage problems right now and can evolve to solve most of the others, but we have not engaged directly with most of the concrete problems the architecture review process foresees. I hope we’ll get time to do so, but if we don’t, this blog post can serve as a Project Mentat pre-mortem.

Conclusion

I hope this narrative explains the forces that shaped the arc we followed to get to Mentat, and that not all of the problems we faced, solutions we explored, and artifacts we created are lost immediately.

Thanks to Richard Newman (@rnewman) for being the animating force behind Mentat and this work. Thanks to Joe Walker for being a great manager to work with and for. And thanks to Mark Mayo and the Tofino team for trying a new thing, regardless of the outcome. Innovator’s dilemma is a real thing.

Many thanks to Joe Walker, who provided initial feedback on this blog post before it was intended for public consumption; and to early readers and reviewers Richard Newman, Grisha Kruglov, Ralph Giles, and Francois Marier, who provided valuable feedback and compelled me to expand or rewrite many sections.

Discussion is best conducted on IRC: I’m nalexander in irc.mozilla.org/#mentat and on Slack (Mozilla Corporation only), and I’m @ncalexander on Twitter.

Changes

Wed 31 May 2017: Incorporated suggestions from Francois Marier.
Mon 29 May 2017: Incorporated suggestions from Grisha Kruglov, Ralph Giles, and Richard Newman.
Wed 24 May 2017: Initial version.

Notes

[1]

Syncing is a large and active area of academic research; the Wikipedia article for conflict-free replicated datatypes is a good place to begin.

Firefox Sync is even more disadvantaged than most systems:

to preserve Firefox users’ privacy, all data is encrypted in such a way that the cloud storage part of the system (owned and operated by Mozilla) cannot help the clients manage conflicts in the data;
the HTTP API that Firefox Sync uses to upload and download data from cloud storage does not make it easy to robustly fetch and report all data. These issues are being addressed with the introduction of atomic uploads, batch downloads that can be interrupted and resumed, etc.

However, at a higher layer, the Firefox Sync client protocol makes it very difficult to actually implement Firefox Sync. For example, every Firefox Sync client is expected to recognize when the cloud store has failed and bootstrap a new cloud store, uploading all historical data. These ancient decisions make it very difficult to implement a Firefox Sync client that can interoperate with existing Firefox Sync clients.

[2]

In addition, Brian Grinstead, Victor Porof, Jordan Santell, Joe Walker, and Emily Toop (all Mozilla Corporation employees) have contributed to Mentat. We also gratefully acknowledge feedback and support from several Rust library authors:

Markus Westerlind (combine)
John Gallagher (rusqlite)
Kevin Mehall (rust-peg)

[3]	The code that grew to become Firefox Sync was developed by the Mozilla Services team in an add-on. It was originally called Weave.

[4]	Many of these issues are now being resolved: the Firefox for Desktop Sync team has been building support for robust syncing into the underlying data stores for some time. See in particular Bug 1258127.

[5]

By transactional, I mean that a series of storage operations either all succeed and are committed in one atomic operation, or none succeed and they are all dropped. I use transactional and atomic interchangeably.

To see that an observer notification-based system is not transactional, suppose that a single user operation both bookmarks a site and visits it to fetch its current title. Firefox Sync might witness a notification saying the site was bookmarked and a second notification saying that the site was visited. Now suppose that there was a race between syncing bookmarks from a different client and the notification flow. It’s possible that the different client has already bookmarked the site. That means the first notification (the bookmarked notification) should be dropped, since the site is already bookmarked. But the second notification (the visit notification) will not be dropped, even though logically it should be dropped as well.

This particular example is artificial, but these types of scenarios become more and more prevalent as the scope of synced data expands.

See the Wikipedia article on atomicity for more.
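
As a rough sketch of the transactional alternative, the Python below uses SQLite to record a visit and a bookmark as one unit. The table layout, the bookmark_and_visit function, and the timestamp are assumptions for illustration and do not reflect any Firefox schema. Because another client’s bookmark already exists, the bookmark insert fails, and the visit recorded just before it is rolled back with it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookmarks (url TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE visits (url TEXT, visited_at INTEGER)")
# Another client's bookmark has already been synced down.
conn.execute("INSERT INTO bookmarks VALUES ('https://example.com/')")
conn.commit()


def bookmark_and_visit(conn, url, when):
    """Record the visit and the bookmark together, or record neither."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("INSERT INTO visits VALUES (?, ?)", (url, when))
            conn.execute("INSERT INTO bookmarks VALUES (?)", (url,))
        return True
    except sqlite3.IntegrityError:
        return False


applied = bookmark_and_visit(conn, "https://example.com/", 1496188800)
visit_count = conn.execute("SELECT COUNT(*) FROM visits").fetchone()[0]
```

With independent notifications there is no equivalent of the rollback: each event is handled on its own, so the visit would survive even though the operation it belonged to was dropped.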

[6]

For folks not familiar with Mozilla’s development process, different parts of the source code are split into modules. Each module has a designated owner, who stewards that part of the code base. They triage tickets, set priorities, delegate reviews, and are the ultimate decision maker for changes impacting the module. See https://wiki.mozilla.org/Modules.

[7]

Originally, Firefox for Android did not own its data store: we intended to use the Android system Browser data stores. Firefox for Android quickly outgrew the system Browser data stores and evolved its own storage. Unfortunately, the Android ContentProvider API is essentially non-transactional across separate ContentProvider instances. There was also a period where Firefox for Android supported both the external storage and the internal storage, which led us to ape the system Browser data stores rather than structure our internal storage in the ways that best supported our browser and Firefox Sync.

[8]	Most of these issues have been addressed in the browser: we have a well abstracted BrowserDB interface that transactionally updates the underlying stores. And the Sync implementation is actively evolving to update a clone of the underlying store before atomically committing changes.

[9]	A portmanteau blending frequent and recent; see https://developer.mozilla.org/en-US/docs/Mozilla/Tech/Places/Frecency_algorithm.

[10]

Why not start from the iOS storage and Sync implementation? We considered doing so. However, the iOS implementation was built in Apple’s Swift language, and Tofino had a strong technology experimentation position: Mozilla wanted to evaluate building a browser using "modern web technologies". Swift at that time had been open-sourced only 6 months earlier (December 2015) and desktop support was patchy. Rust is an area of technology investment at Mozilla; Swift is not. In addition, the Firefox for iOS implementation is not flexible enough to support the rapid prototyping phase we wanted for Tofino.

[11]	To Victor Porof, Brian Grinstead, and Joe Walker’s credit, Photino has become an active test bed for the Photon UX team.

About Nick Alexander

Mathematician. Mozillian. Runner. Master of Disguise.