Datatypes and schemas

As stated before, Smithy4s generates code that does not depend on any third-party library. However, we still want to use the generated code with specific serialisation technologies, such as JSON, Protocol Buffers, CBOR, MessagePack, or XML (yes ... we know).

We also want to avoid having to implement complex macros to allow for auto-derivation of these things. For starters, the reality is that maintaining macros across two different Scala versions (2 and 3) is hard work. Secondly, macros close the door to an interesting feature, namely "dynamic schematisation" that we'll describe in another chapter.

If you have 45 minutes to waste, feel free to go watch the following video where Olivier explains the rationale behind the crazy pattern we are about to describe. Otherwise, read on!

The Schema GADT

Each datatype generated by Smithy4s is accompanied by a schema value in its companion object, which contains an expression of type smithy4s.schema.Schema that captures everything needed to deconstruct/reconstruct instances of the datatype.

smithy4s.schema.Schema is a Generalised Algebraic Datatype (or GADT for short) that can be used to precisely reference all the information needed to traverse datatypes that can be expressed in Smithy. It is a bit like JVM reflection, except that it exposes higher-level information about the datatypes. It achieves this by exposing building blocks that accurately reflect what is possible to express in the Smithy language. These building blocks form a metamodel: a model for models. And, unlike JVM reflection, using schemas is type-safe.

The Schema type reflects the various ways of constructing datatypes in Smithy. It is encoded as a sealed trait, the members of which capture the following aspects of the Smithy language:

Primitives
Lists
Maps
Enumerations
Structures
Unions

For a Scala type called Foo, formulating a Schema[Foo] is equivalent to exhaustively capturing the information needed for the serialisation and deserialisation of Foo in any format (JSON, XML, ...). Indeed, for any Codec[_] construct provided by third-party libraries, it is possible to write a generic def compile(schema: Schema[A]): Codec[A] function that produces the Codec for A based on the information held by the Schema.
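To make the idea of a generic compile function concrete, here is a minimal, self-contained sketch. It is not the real smithy4s API: Schema, Visitor, Encoder and ToEncoder below are hypothetical, simplified stand-ins that cover only three building blocks (int, string, list), and the "codec" merely renders JSON-like text.

```scala
object SchemaSketch {
  // Hypothetical stand-in for a third-party codec: renders a value as JSON-like text.
  trait Encoder[A] { def encode(a: A): String }

  // One method per building block of the (tiny) metamodel.
  trait Visitor[F[_]] {
    def int: F[Int]
    def string: F[String]
    def list[A](member: F[A]): F[List[A]]
  }

  // The schema only describes the datatype; it knows nothing about Encoder.
  sealed trait Schema[A] { def compile[F[_]](v: Visitor[F]): F[A] }
  case object IntSchema extends Schema[Int] {
    def compile[F[_]](v: Visitor[F]): F[Int] = v.int
  }
  case object StringSchema extends Schema[String] {
    def compile[F[_]](v: Visitor[F]): F[String] = v.string
  }
  final case class ListSchema[A](member: Schema[A]) extends Schema[List[A]] {
    def compile[F[_]](v: Visitor[F]): F[List[A]] = v.list(member.compile(v))
  }

  // The interop layer: written once per target library, reused for every schema.
  object ToEncoder extends Visitor[Encoder] {
    val int: Encoder[Int] = new Encoder[Int] {
      def encode(a: Int): String = a.toString
    }
    val string: Encoder[String] = new Encoder[String] {
      def encode(a: String): String = "\"" + a + "\""
    }
    def list[A](member: Encoder[A]): Encoder[List[A]] = new Encoder[List[A]] {
      def encode(as: List[A]): String = as.map(member.encode).mkString("[", ",", "]")
    }
  }
}
```

With these definitions, SchemaSketch.ListSchema(SchemaSketch.IntSchema).compile(SchemaSketch.ToEncoder) yields an Encoder[List[Int]] without any code being generated specifically for lists of integers.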

Why do things this way? Why not just render Codec during code generation? The reason is that we want the generated code to be completely decoupled from any serialisation format or library, and we want the user to be able to wire that generated code in different ways without having to change anything in the build. Moreover, this approach has proven to keep the investment needed for adding interop with various libraries bounded, and it offers really good testability.

Hints

In Smithy, all shapes (and members of composite shapes) can be annotated with traits. Smithy4s generically translates these annotations to instances of the corresponding generated classes, which means that Smithy4s supports user-defined traits that it has zero knowledge of.

So if you have the following Smithy description:

namespace example

@trait
structure metadata {
  @required
  description: String
}

@metadata(description: "This is my own integer shape")
integer MyInt

When processing this Smithy model, Smithy4s renders a case class Metadata(description: String), with an associated ShapeTag[Metadata] instance, and the following expression in the companion object of MyInt:

val hints = Hints(
  Metadata("This is my own integer shape")
)

The smithy4s.Hints type is a polymorphic map that can hold shapes, keyed by ShapeTag. A ShapeTag is a uniquely identified tag that uses referential equality. Every schema can hold a Hints instance, which means that in addition to the datatype structures, Schemas also offer an accurate reflection of the trait values that annotate shapes in the smithy models.
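As a rough illustration of what "a polymorphic map keyed by tags with referential equality" means, here is a small sketch. The Tag and Hints classes below are hypothetical simplifications written for this example, not the actual smithy4s.ShapeTag and smithy4s.Hints implementations.

```scala
object HintsSketch {
  // No equals/hashCode override: two tags are equal only if they are the same instance.
  final class Tag[A](val name: String)

  final class Hints private (underlying: Map[Tag[_], Any]) {
    def add[A](tag: Tag[A], value: A): Hints = new Hints(underlying + (tag -> value))
    // The cast is safe because `add` is the only way in, and it ties the value type to the tag type.
    def get[A](tag: Tag[A]): Option[A] = underlying.get(tag).map(_.asInstanceOf[A])
  }
  object Hints { val empty: Hints = new Hints(Map.empty) }

  // Hypothetical counterparts of two generated trait classes and their tags.
  final case class Metadata(description: String)
  val metadataTag: Tag[Metadata] = new Tag[Metadata]("example#metadata")

  final case class JsonName(value: String)
  val jsonNameTag: Tag[JsonName] = new Tag[JsonName]("smithy.api#jsonName")

  val hints: Hints = Hints.empty.add(metadataTag, Metadata("This is my own integer shape"))
}
```

Looking up HintsSketch.metadataTag returns the Metadata value with its precise type, looking up HintsSketch.jsonNameTag returns None, and a freshly created tag with the same name finds nothing because it is a different instance.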

Smithy4s uses these hints to implement interpreters. For instance, the smithy.api#jsonName Smithy trait translates to a smithy.api.JsonName Scala type, which we can query from a Hints instance when implementing a Schema ~> JsonCodec transformation. This gives users a little customisability in the JSON serialisation of their datatypes.

Structures

A structure, also referred to as product, or record, is a construct that groups several values together. Typically, it translates naturally to a case class.

namespace example

structure Foo {
  @required
  a: Integer
  @length(min: 1)
  b: String
}

...and the associated, generated Scala code:

package example

import smithy4s.schema.Schema._

case class Foo(a: Int, b: Option[String] = None)

object Foo extends smithy4s.ShapeTag.Companion[Foo] {
  val id: smithy4s.ShapeId = smithy4s.ShapeId("example", "Foo")

  implicit val schema: smithy4s.Schema[Foo] = struct(
    int.required[Foo]("a", _.a),
    string.optional[Foo]("b", _.b).addHints(smithy.api.Length(Some(1), None))
  ){
    Foo.apply
  }.withId(id)
}

As you can see, the Smithy structure translates quite naturally to a Scala case class. Every member of the structure that has neither the @required trait nor a default value is rendered as an optional value defaulting to None. By default, Smithy4s sorts the fields before rendering the case class so that the required ones appear before the optional ones, a pragmatic decision that tends to improve the user experience.

In the schema, each field comes with a reference to a schema (int, string, ...), a string label, and a lambda calling the case class accessor, which retrieves the associated field value. Additionally, the constructor of the case class is also referenced in the Schema.

Typically, the accessors are needed for encoding the data, which involves destructuring it to access its individual components. The labels are there to cater to serialisation mechanisms like JSON or XML, where sub-components of a piece of data are labelled and nested under a larger block.

Conversely, the constructor is used for deserialisation, which involves reconstructing the data after all of its component values have been successfully deserialised.

Another detail is the presence of the addHints call on the field labelled b. This is due to the presence of the length trait (from the smithy.api namespace, aka the prelude) on the corresponding b member of the Smithy Foo shape.
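The respective roles of labels, accessors and the constructor can be sketched as follows. This is a simplified model, not what Smithy4s does internally: the Field class, the encode function and decodeFoo are hypothetical, and the "wire format" is just a string or a Map[String, String].

```scala
object StructSketch {
  final case class Foo(a: Int, b: Option[String] = None)

  // A field pairs a label with an accessor-based renderer (None when the value is absent).
  final case class Field[S](label: String, render: S => Option[String])

  val fooFields: List[Field[Foo]] = List(
    Field[Foo]("a", foo => Some(foo.a.toString)),
    Field[Foo]("b", foo => foo.b.map(s => "\"" + s + "\""))
  )

  // Encoding: destructure through the accessors, emit labelled entries, skip absent optional fields.
  def encode[S](fields: List[Field[S]])(s: S): String =
    fields
      .flatMap(f => f.render(s).map(v => "\"" + f.label + "\":" + v))
      .mkString("{", ",", "}")

  // Decoding: once every labelled component has been read, call the constructor.
  def decodeFoo(raw: Map[String, String]): Either[String, Foo] =
    raw
      .get("a")
      .flatMap(_.toIntOption)
      .toRight("missing or invalid required field: a")
      .map(a => Foo(a, raw.get("b")))
}
```

Here the required field a must be present for decoding to succeed, whereas a missing b simply yields None.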

Note related to `optional` and `required`

You may have noticed the required and optional methods, which create Field instances from Schemas, in order to pass them to structures.

Since 0.18, the concept of Option in Smithy4s is backed by an OptionSchema member of the Schema GADT. Having Option as a first-class citizen has some advantages, as it makes it possible to support sparse collections.

The downside is that it becomes possible to create schemas (and therefore codecs) that do not abide by round-tripping properties. Indeed, once data is on the wire, it's often impossible to distinguish Option[Option[Option[Int]]] from Option[Int]. If you need to distinguish between the presence of a null value and the absence of a value, Smithy4s provides an additional Nullable type in order to allow an extra level of nesting.
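The round-tripping problem is easy to reproduce with a toy wire format. The sketch below is purely illustrative: encode and decode are hypothetical functions in which both "absent" and "present but empty" end up as null, which is what typically happens in JSON.

```scala
object OptionSketch {
  // Both None and Some(None) are written as "null": the nesting is lost on the wire.
  def encode(value: Option[Option[Int]]): String = value match {
    case Some(Some(i)) => i.toString
    case Some(None)    => "null"
    case None          => "null"
  }

  // The decoder has to pick one interpretation of "null"; here it picks None.
  def decode(wire: String): Option[Option[Int]] =
    if (wire == "null") None else Some(Some(wire.toInt))
}
```

As a result, decode(encode(Some(None))) gives back None rather than Some(None), so the codec does not round-trip for that value.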

Unions

A union, also referred to as coproduct, or sum type, is a construct that expresses sealed polymorphism. It is the dual of a structure: where structures express that you have A AND B, unions express that you can have A OR B.

The way this is expressed in Smithy looks like this:

namespace example

union Bar {
  a: Integer
  b: String
}

This hints at the default serialisation that AWS intended for unions expressed in Smithy, namely tagged unions. Indeed, the AWS JSON-centric protocols specify that shapes like these should be serialised as objects with a single key/value entry, where the key is the tag. For instance, { "a": 1 } or { "b": "two" }. There are some very relevant technical reasons for it, and this way of encoding unions/co-products in JSON is arguably the best. It may also be familiar to Circe users, as it's the default encoding of co-products in circe-generic.

Regarding the Scala code rendered by Smithy4s for the above Smithy specification, it looks like this:

package example

import smithy4s.schema.Schema._

sealed trait Bar extends scala.Product with scala.Serializable
object Bar extends smithy4s.ShapeTag.Companion[Bar] {
  val id: smithy4s.ShapeId = smithy4s.ShapeId("example", "Bar")

  case class ACase(a: Int) extends Bar
  case class BCase(b: String) extends Bar

  object ACase {
    val hints: smithy4s.Hints = smithy4s.Hints.empty
    val schema: smithy4s.Schema[ACase] = bijection(int.addHints(hints), ACase(_), _.a)
    val alt = schema.oneOf[Bar]("a")
  }
  object BCase {
    val hints: smithy4s.Hints = smithy4s.Hints.empty
    val schema: smithy4s.Schema[BCase] = bijection(string.addHints(hints), BCase(_), _.b)
    val alt = schema.oneOf[Bar]("b")
  }

  implicit val schema: smithy4s.Schema[Bar] = union(
    ACase.alt,
    BCase.alt,
  ){
    case _: ACase => 0
    case _: BCase => 1
  }.withId(id)
}

The union is rendered as an ADT (sealed trait), the members of which are single-value case classes wrapping values of the types referenced by the union member. The Case suffix is added as a way to reduce risk of collision between the generated code and other types (especially the types being wrapped).

Each ADT member is accompanied by its own schema, which is not provided implicitly, in an effort to retain coherence in the type-class instances and avoid the situation where you'd have different behaviours during serialisation depending on whether you've up-casted a member to the ADT. Additionally, the companion object of each ADT member contains an alt value (for "alternative"), which is the union's equivalent to the structure's field.

Much like a field, an alt contains a label, and can carry hints. But unlike a field, which contains an accessor, the alt contains the function to "inject" (up-cast) the member into the union. This is useful for de-serialisation, when, after successfully de-serialising a member of a union, you need to inject it into the ADT to return the expected type.

As for the union's schema, it is somewhat similar to the structure's, in that it references all its alternatives. But instead of a constructor, it holds a dispatch function, which pattern matches against all the possible members and maps the "down-casted" value to its ordinal, making it possible to recover the corresponding alternative. This is useful for serialisation, when the behaviour of the alternatives can only be applied to values of the corresponding type: "if my ADT is an A, then I serialise the A, and add a discriminating tag to the serialised A".
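The interplay between injection (for decoding) and ordinal-based dispatch (for encoding) can be sketched like this. It is a simplified model under stated assumptions: the Alt class, its project function and the string-based wire format are hypothetical and do not mirror the real smithy4s types.

```scala
object UnionSketch {
  sealed trait Bar
  final case class ACase(a: Int) extends Bar
  final case class BCase(b: String) extends Bar

  // An alternative: a label, an injection into the union, and the member's own encode/decode logic.
  final case class Alt[M](
      label: String,
      inject: M => Bar,
      project: PartialFunction[Bar, M],
      encodeMember: M => String,
      decodeMember: String => Option[M]
  ) {
    // Serialise the member, then wrap it in a single-entry object keyed by the tag.
    def encode(bar: Bar): String = "{\"" + label + "\":" + encodeMember(project(bar)) + "}"
    // Deserialise the member, then inject (up-cast) it into the union.
    def decode(raw: String): Option[Bar] = decodeMember(raw).map(inject)
  }

  val aAlt: Alt[Int] =
    Alt[Int]("a", i => ACase(i), { case ACase(a) => a }, _.toString, _.toIntOption)
  val bAlt: Alt[String] =
    Alt[String]("b", s => BCase(s), { case BCase(b) => b }, s => "\"" + s + "\"",
      s => Some(s.stripPrefix("\"").stripSuffix("\"")))

  val alts: Vector[Alt[_]] = Vector(aAlt, bAlt)

  // Dispatch: map a value to the ordinal of its alternative.
  def ordinal(bar: Bar): Int = bar match {
    case _: ACase => 0
    case _: BCase => 1
  }

  def encode(bar: Bar): String = alts(ordinal(bar)).encode(bar)

  // The tag found on the wire selects the alternative used for decoding.
  def decode(tag: String, raw: String): Option[Bar] =
    alts.find(_.label == tag).flatMap(_.decode(raw))
}
```

Encoding ACase(1) goes through ordinal 0 and produces {"a":1}, while decoding the tag "b" with the raw value "two" (quoted) picks bAlt and injects the result as a BCase.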

Named simple shapes

Smithy allows for the creation of named shapes that reference "primitive types":

namespace example

integer MyInt

Smithy4s translates this to a Scala newtype: a zero-overhead wrapper for the underlying type (in this case, Int):

package example

object MyInt extends Newtype[Int] {
  val id: smithy4s.ShapeId = smithy4s.ShapeId("example", "MyInt")
  val hints: smithy4s.Hints = smithy4s.Hints.empty
  val underlyingSchema: smithy4s.Schema[Int] = int.withId(id).addHints(hints)
  implicit val schema: smithy4s.Schema[MyInt] = bijection(underlyingSchema, MyInt(_), (_: MyInt).value)
}

A MyInt type alias, pointing to the MyInt.Type type member, is rendered in the example package object, which makes it possible to write such code:

val myInt: MyInt = MyInt(1)
// val int: Int = myInt // doesn't compile because MyInt is not an Int at compile time.
val int: Int = myInt.value

You may have noticed that the schema value uses bijection. In addition to the GADT members listed previously, Schema also has a BijectionSchema member, which applies a bidirectional transformation to another Schema. This is useful for the case of newtypes: if we are able to derive a codec that can encode and decode Int, it should be possible to derive a codec that encodes and decodes MyInt.
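The reasoning behind bijection can be shown on a toy codec. In this sketch, Codec, intCodec and the bijection function are hypothetical stand-ins defined for the example, and MyInt is modelled as a plain case class rather than a real Smithy4s newtype.

```scala
object BijectionSketch {
  trait Codec[A] {
    def encode(a: A): String
    def decode(raw: String): Option[A]
  }

  val intCodec: Codec[Int] = new Codec[Int] {
    def encode(a: Int): String = a.toString
    def decode(raw: String): Option[Int] = raw.toIntOption
  }

  // Given a codec for A and a pair of functions A => B and B => A, we get a codec for B for free.
  def bijection[A, B](underlying: Codec[A], to: A => B, from: B => A): Codec[B] =
    new Codec[B] {
      def encode(b: B): String = underlying.encode(from(b))
      def decode(raw: String): Option[B] = underlying.decode(raw).map(to)
    }

  // Stand-in for the generated newtype.
  final case class MyInt(value: Int)

  val myIntCodec: Codec[MyInt] =
    bijection(intCodec, (i: Int) => MyInt(i), (m: MyInt) => m.value)
}
```

The wrapper leaves no trace on the wire: myIntCodec encodes MyInt(1) exactly as intCodec encodes 1.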

Collections

Smithy supports two types of collections out of the box:

list
map

NB: the "set" type was supported in Smithy 1.0, but has disappeared in Smithy 2.0 in favour of the uniqueItems trait.

Additionally, Smithy4s allows users to annotate list shapes to customise the type of collection used during code-generation.

Smithy does not support generics, therefore all collections are named. Though seemingly tedious, it makes it easier to build tooling (and probably helps languages that do not support generics). Given the following shape:

namespace example

list IntList {
  member: Integer
}

You get the following Scala code:

package example

object IntList extends Newtype[List[Int]] {
  val id: smithy4s.ShapeId = smithy4s.ShapeId("example", "IntList")
  val hints: smithy4s.Hints = smithy4s.Hints.empty
  val underlyingSchema: smithy4s.Schema[List[Int]] = list(int).withId(id).addHints(hints)
  implicit val schema: smithy4s.Schema[IntList] = bijection(underlyingSchema, IntList(_), (_: IntList).value)
}

It is really similar to named primitives. However, for pragmatic reasons, when a structure references a collection in one of its members, the Scala field gets rendered using the de-aliased type (as opposed to the newtype). The IntList newtype is generated mostly as a way to hold the hints and schemas corresponding to the Smithy IntList shape. Additionally, newtypes for collections are used by Smithy4s to render Hints values. For instance, the following Smithy code:

namespace example

@trait
list info {
  member: String
}

@info(["foo", "bar", "baz"])
structure A {}

would lead to the following code being rendered in the companion object of A:

val hints: Hints = Hints(
  example.Info(List("foo", "bar", "baz")),
)

This makes it possible to query Hints for Info using the following syntax: hints.get(example.Info)

Regarding the underlyingSchema value in the companion object of IntList, you can see that it is constructed using a list function. Conceptually, it encodes this: "if I'm able to encode or decode an A in a specific format, then I should be able to encode or decode a List[A]".

Enumerations

Smithy allows for two types of enumerations: string and integer enumerations.

Additionally, Smithy4s supports specifying whether an enumeration is open or closed. An open enumeration allows for holding unknown values, whereas a closed one is strictly limited to a set of specified values. This brings the total number of possible "flavours" of enumerations to 4, which is reified via a smithy4s.schema.EnumTag ADT that comprises 4 different cases: one for each combination of [open, closed] and [int, string].

Enumerations are typically modelled as algebraic data types. Each case of an enumeration is associated with both a String and an Int value. In the case of intEnum, the string value is the name of the case. In the case of a normal (string) enum, the integer value is the index of the case in the list.

Additionally, each enumeration case holds its own hints.

Closed enumerations

Given this Smithy code:

namespace example

intEnum Numbers {
  ONE = 1
  TWO = 2
}

The corresponding generated Scala code is:

sealed abstract class Numbers(_value: String, _name: String, _intValue: Int, _hints: Hints) extends Enumeration.Value {
  override type EnumType = Numbers
  override val value: String = _value
  override val name: String = _name
  override val intValue: Int = _intValue
  override val hints: Hints = _hints
  override def enumeration: Enumeration[EnumType] = Numbers
  @inline final def widen: Numbers = this
}
object Numbers extends Enumeration[Numbers] with ShapeTag.Companion[Numbers] {
  val id: ShapeId = ShapeId("smithy4s.example", "Numbers")

  val hints: Hints = Hints.empty

  case object ONE extends Numbers("ONE", "ONE", 1, Hints())
  case object TWO extends Numbers("TWO", "TWO", 2, Hints())

  val values: List[Numbers] = List(
    ONE,
    TWO,
  )
  val tag: EnumTag[Numbers] = EnumTag.ClosedIntEnum
  implicit val schema: Schema[Numbers] = enumeration(tag, values).withId(id).addHints(hints)
}

Open enumeration

Given this Smithy code:

namespace example

use alloy#openEnum

@openEnum
intEnum OpenNums {
  ONE = 1
  TWO = 2
}

The corresponding generated Scala code is:

package smithy4s.example

import smithy4s.Enumeration
import smithy4s.Hints
import smithy4s.Schema
import smithy4s.ShapeId
import smithy4s.ShapeTag
import smithy4s.schema.EnumTag
import smithy4s.schema.Schema.enumeration

sealed abstract class OpenNums(_value: String, _name: String, _intValue: Int, _hints: Hints) extends Enumeration.Value {
  override type EnumType = OpenNums
  override val value: String = _value
  override val name: String = _name
  override val intValue: Int = _intValue
  override val hints: Hints = _hints
  override def enumeration: Enumeration[EnumType] = OpenNums
  @inline final def widen: OpenNums = this
}
object OpenNums extends Enumeration[OpenNums] with ShapeTag.Companion[OpenNums] {
  val id: ShapeId = ShapeId("smithy4s.example", "OpenNums")

  val hints: Hints = Hints(
    alloy.OpenEnum(),
  )

  case object ONE extends OpenNums("ONE", "ONE", 1, Hints())
  case object TWO extends OpenNums("TWO", "TWO", 2, Hints())
  final case class $Unknown(int: Int) extends OpenNums("$Unknown", "$Unknown", int, Hints.empty)

  val $unknown: Int => OpenNums = $Unknown(_)

  val values: List[OpenNums] = List(
    ONE,
    TWO,
  )
  val tag: EnumTag[OpenNums] = EnumTag.OpenIntEnum($unknown)
  implicit val schema: Schema[OpenNums] = enumeration(tag, values).withId(id).addHints(hints)
}

As you can see, the main difference between the two is the presence of a final case class $Unknown ADT member in the open enumeration, which captures values that are not defined in the specification.
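The practical consequence shows up when decoding a value that the specification does not list. The sketch below is a simplified model, not the generated code: Nums, Unknown, decodeClosed and decodeOpen are hypothetical names, and the same ADT is reused for both decoders purely to keep the example short.

```scala
object EnumSketch {
  sealed abstract class Nums(val intValue: Int)
  case object One extends Nums(1)
  case object Two extends Nums(2)
  // Only meaningful for the open flavour: carries whatever value was found on the wire.
  final case class Unknown(int: Int) extends Nums(int)

  val values: List[Nums] = List(One, Two)

  // Closed: anything outside the known values is a decoding failure.
  def decodeClosed(i: Int): Either[String, Nums] =
    values.find(_.intValue == i).toRight(s"unknown enumeration value: $i")

  // Open: unknown values are preserved instead of being rejected.
  def decodeOpen(i: Int): Nums =
    values.find(_.intValue == i).getOrElse(Unknown(i))
}
```

With an open enumeration, a value such as 3 survives decoding as Unknown(3) and can be written back unchanged, which matters when the producer of the data may add cases that the consumer does not know about yet.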
