Schema modeling with BTree

In this tutorial, we'll walk through some best practices for BTree schema modeling while writing a simple blogging app. We know that the blogging app will need to support querying logic by both an author and the most recent posts, but the database schema design is left to us.

Here's what we'll cover:

  • BTree schema design anti-patterns
  • A better, query-driven approach to schema design
  • Using transactionality for data consistency

Cloud storage deep dive video

Prerequisites

Why use BTree

The BTree cloud storage type behaves like a Map; it stores typed key value pairs with CRUD operations. But BTree is also sorted and data can be retrieved with range operations. This makes it an especially good choice for data that needs to be chunked or returned in a specific order, as in our blog post example.

What not to do...

If you’re used to modeling data in a relational context, you might not know where to start when creating your BTree schema, or worse, you might define a schema which is incapable of supporting your application’s use-cases.

It might be tempting to hop right into defining the data models, i.e. an Author writes many Post entities with a unique OffsetDateTime. Maybe we only need one simple, wide table—something like BTree Author [Post]:

type Author = {
  authorId : Text,
  name : Text
}

type Post = {
  title: Text,
  body : Text,
  timeStamp : OffsetDateTime,
}

antiPattern : '{Exception, Cloud} BTree Author [Post]
antiPattern = do BTree.named !Database.default "badSchema" Universal.ordering

After all, the Author could be the key to a List of Posts, and maybe we can add the equivalent of a SQL where or limit clause with ordering for some filtering logic based on time.

Here’s why that’s not ideal:

Storing data in an unbounded list in BTree is generally an anti-pattern

Say you want to edit a single blog post inside the List of Post as a transaction. Do you want the entirety of the author’s blogposts to also be locked up while the update for a single post is taking place? Generally, no.

BTree also has some hard-stop data size limits when it comes to reading and writing data, so especially prolific bloggers might eventually run into issues creating new posts!

As a rule, if your BTree encloses a list as its value, it’s a sign that you may want to key the data differently, so that your List of Posts becomes something shaped like a BTree SomeUniqueKey Post.

Query-driven schema design

Because we’re not handing off our data to something with a query plan optimizer, we need to start by defining precisely how our application will be accessing the data.

💡 Our data modeling concerns here are twofold, for database writes, the keys we use need to be unique enough to avoid inconsistencies, and for our reads, we want our business queries to work with the minimum amount of in-memory filtering.

Ideally, we also want the lowest possible factor of data duplication, but as we’ll see, the queries over our data will ultimately dictate that.

For the purposes of this example, these are the queries we want to support:

  • Get all the posts for a given author
  • Get the most recent N posts by a given author
  • Get the most recent N posts on the entire blogging website, including the author name

The only query that our earlier BTree Author [Post] would support without in-memory scans is the first one.

Compound keys for better queries

Here’s a better schema for our app: we define two BTrees with compound (tuple) keys–one for queries first scoped by an Author in some way, and a second for queries that are filtering by time information first.

For convenience, a Storage type can contain both BTree tables and the Database they’re stored in:

type Storage = {
  database : Database,
  postsByTimestamp : BTree (OffsetDateTime, Author) Post,
  postsByAuthor : BTree (Author, OffsetDateTime) Post
  }

The compound keys ensure that we’re writing unique entities since it’s unlikely the same author is composing two posts with sub-millisecond timing. The ordering of the elements in the tuple is also important, since the BTree will be ordered by the first element in the tuple and then order by the second only if the first elements are equal.

BTree (Author, OffsetDateTime) covers the first two desired queries. Getting all the posts by a given author and the latest N posts by a given author are both supported by variations of a call to the rangeClosed.prefix function. It queries the BTree for all keys within the given range, using only the first element of the compound key.

getAllPostsByAuthor : Storage -> Author -> {Remote} [Post]
getAllPostsByAuthor storage author =
  postsByAuthor = Storage.postsByAuthor storage
  streamEntities = rangeClosed.prefix postsByAuthor prefixOrdering author author
  Stream.map (at2) streamEntities |> Stream.toList!

getLatestPostsByAuthor : Storage -> Author -> Nat -> {Remote} [Post]
getLatestPostsByAuthor storage author n =
  postsByAuthor = Storage.postsByAuthor storage
  streamEntities = rangeClosed.prefix postsByAuthor prefixOrdering author author
  Stream.take n streamEntities |> Stream.map (at2) |> Stream.toList!

Likewise, to support the final query across all authors over time, we’ll use the BTree keyed by (OffsetDateTime, Author) and the same prefixQuery function.

getLatestPosts : Storage -> OffsetDateTime -> Nat -> {Remote} [Post]
getLatestPosts storage time n =
  postsByTimestamp = Storage.postsByTimestamp storage
  streamEntities = rangeClosed.prefix postsByTimestamp prefixOrdering time time
  Stream.take n streamEntities |> Stream.map (at2) |> Stream.toList!

Use transactionality to sync your data across BTrees

Keeping two BTrees around with the same content but different keys sounds like a recipe for inconsistency, but the Cloud’s durable storage layer has another mechanism for keeping data in sync and that is transactionality.

Represented by the Transaction ability in functions like BTree.write.tx, we can use this ability to describe interactions with the cloud’s storage layer that need to complete together.

In our case, when we receive a new blogpost, we need to write to both BTrees in the same transactional block. Below, the transactional block is opened by the do keyword, and run as a State interaction with the transact.random function.

writePost : Storage -> Author -> OffsetDateTime -> Post -> {Remote, State, Exception, Random} ()
writePost storage author time newPost =
  postByAuthorBTree = Storage.postsByAuthor storage
  postByTimestampBTree = Storage.postsByTimestamp storage
  db = Storage.database storage
  State.transact.random db do
      BTree.write.tx postByAuthorBTree (author, time) newPost
      BTree.write.tx postByTimestampBTree (time, author) newPost

Run a test deployment

You can exercise each of these database interactions in local tests by writing posts with the writePost function and then querying them with your various read functions. Here's an example of a test run that writes two posts and then queries for all posts by a given author:

createStorage : '{Exception, Cloud} Storage
createStorage = do
  database = Database.create "posts"
  Database.assign database (!Environment.default)
  postsByTimestamp = BTree.named database "postsByTimestamp" Universal.ordering
  postsByAuthor = BTree.named database "postsByAuthor" Universal.ordering
  Storage database postsByTimestamp postsByAuthor

testGetAuthor : '{IO, Exception} [Post]
testGetAuthor = Cloud.main.local do
  postTime = !OffsetDateTime.current
  secondTime = addDuration postTime (Duration.seconds +10)
  storage = !createStorage
  Cloud.submit (!Environment.default) do
    writePost storage (Author "Alice") postTime (Post "Hello" "World" postTime)
    writePost storage (Author "Bob") secondTime (Post "Goodbye" "Computer" secondTime)
    getAllPostsByAuthor storage (Author "Alice")

Note that the local updates and reads to the database must take place in the same Cloud.main.local block. Entering run testGetAuthor in the UCM will return a list of posts by the author "Alice", not "Bob".

Summary

Working with BTree requires slightly different thinking about your data modeling. If you’re ever stuck, remember:

  • Your BTree should very seldom look like BTree Key [ListOfValues]
  • Start by describing the queries your application will need to perform. It will help dictate what your data should be keyed by.
  • It’s not uncommon to coordinate multiple BTrees with Transactions in order to serve different access patterns for your data.

📚 Learn more about BTree via the API docs