Schema Types

Hollow records are strongly typed. The structure of each type is defined by a schema. A schema will be one of the following:

  • Object: A fixed set of strongly typed fields.
  • List: An ordered collection of references to records of a specific type.
  • Set: An unordered collection of references to records of a specific type.
  • Map: A key/value mapping between references to records of a specific key type and records of a specific value type.

Schemas Define the Data Model

A Hollow dataset comprises one or more data types. The data model for a dataset is defined by the schemas describing those types.

Object Schemas

Object schemas have one or more fields. Each field is one of the following types:

  • INT: An integer value of up to 32 bits.
  • LONG: An integer value of up to 64 bits.
  • FLOAT: A 32-bit floating-point value.
  • DOUBLE: A 64-bit floating-point value.
  • BOOLEAN: true or false.
  • STRING: An array of characters.
  • BYTES: An array of bytes.
  • REFERENCE: A reference to another specific type. The referenced type must be defined by the schema.

Additionally, an Object schema may optionally specify a primary key, which can be used as a default indexing mechanism in many of the tools provided with Hollow.

On consumers, INT and LONG fields are each represented by a number of bits exactly sufficient to represent the maximum value for the field across all records. FLOAT, DOUBLE, and BOOLEAN fields are represented by 32, 64, and 2 bits, respectively. STRING and BYTES fields use a variable number of bytes for each record. REFERENCE fields encode the ordinal of referenced records, and are represented by a number of bits exactly sufficient to encode the maximum ordinal of the referenced type.

Designing for Efficiency

Try to model your data such that there aren't any outlier values for INT and LONG fields. Also, avoid FLOAT and DOUBLE fields where possible, since these field types are relatively expensive.
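The effect of an outlier can be seen with a little arithmetic. The following is a minimal sketch, not Hollow's actual encoder: it assumes non-negative values and ignores how Hollow encodes negative or null values. The class and method names are hypothetical.

```java
// Simplified model of fixed-width field sizing (an illustration, not Hollow's encoder).
final class FieldWidthSketch {

    // Bits needed to represent every value in [0, maxValue].
    static int bitsRequired(long maxValue) {
        if (maxValue < 0) {
            throw new IllegalArgumentException("this sketch only handles non-negative values");
        }
        return maxValue == 0 ? 1 : 64 - Long.numberOfLeadingZeros(maxValue);
    }

    // Every record pays for the widest value in the field.
    static long totalBits(long[] values) {
        long max = 0;
        for (long v : values) {
            max = Math.max(max, v);
        }
        return (long) bitsRequired(max) * values.length;
    }

    public static void main(String[] args) {
        long[] typical = {1995, 2003, 2011, 2020};
        long[] withOutlier = {1995, 2003, 2011, 4_000_000_000L};
        System.out.println(totalBits(typical));     // 4 records x 11 bits = 44
        System.out.println(totalBits(withOutlier)); // 4 records x 32 bits = 128
    }
}
```

A single large value widens the field for every record, which is why outliers are worth modeling away.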

List Schemas

A List schema indicates a record type which is an ordered collection of REFERENCE fields. Each record will have a variable number of references. The referenced type must be defined by the schema, and each reference encodes only the ordinal of the referenced record.

Set Schemas

A Set schema indicates a record type which is an unordered collection of REFERENCE fields. Each record will have a variable number of references, and the referenced type must be defined by the schema. Within a single set record, each reference must be unique.

References in Set records can be hashed by some specific element fields for O(1) retrieval. To enable this feature, a Set schema can define a hash key, which specifies how its elements are hashed/indexed.

Map Schemas

A Map schema indicates a record type which is an unordered collection of pairs of REFERENCE fields, used to represent a key/value mapping. Each record will have a variable number of key/value pairs. Both the key reference type and the value reference type must be defined by the schema. The key reference type does not have to be the same as the value reference type. Within a single map record, each key reference must be unique.

Entries in Map records can be hashed by some specific key fields for O(1) retrieval of the keys, values, and/or entries. To enable this feature, a Map schema can define a hash key, which specifies how its entries are hashed/indexed.

Hash Keys

Each Map and Set schema may optionally define a hash key. A hash key specifies one or more user-defined fields used to hash entries into the collection. When a hash key is defined on a Set, each set record becomes like a primary key index; records in the set can be efficiently retrieved by matching the specified hash key fields. Similarly, when a hash key is defined on a Map, each map record becomes like an index over the keys in the key/value pairs contained in the map record.
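As an analogy only, the idea of a set record acting like a primary key index can be sketched in plain Java. Hollow hashes the elements inside the record itself rather than building a separate map, so the class below (all names hypothetical) illustrates the lookup behavior, not the storage layout.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java analogy for a Set record with a hash key on actorId.
final class HashKeySketch {

    static final class Actor {
        final long actorId;
        final String actorName;

        Actor(long actorId, String actorName) {
            this.actorId = actorId;
            this.actorName = actorName;
        }
    }

    // Index the elements of one "set record" by the hash key field.
    static Map<Long, Actor> indexByActorId(List<Actor> setRecord) {
        Map<Long, Actor> index = new HashMap<>();
        for (Actor actor : setRecord) {
            index.put(actor.actorId, actor);
        }
        return index;
    }

    // Hypothetical sample data.
    static List<Actor> sampleActors() {
        return Arrays.asList(new Actor(102, "Keanu Reeves"), new Actor(104, "Carrie-Anne Moss"));
    }

    public static void main(String[] args) {
        Map<Long, Actor> index = indexByActorId(sampleActors());
        // Lookup by key field value, without scanning the elements.
        System.out.println(index.get(104L).actorName);
    }
}
```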

If using the HollowObjectMapper, hash keys will be automatically selected if an element or key type contains a single non-reference field. Additionally, if a Set or Map references Object elements with a defined primary key, then the hash key will default to the primary key of the element type.

Alternatively, hash keys can be explicitly defined using the @HollowHashKey annotation in POJOs for Set schemas by specifying one or more fields from the element type, or for Map schemas by specifying one or more fields from the key type:

public class Movie {
    long id;
    String title;
    @HollowHashKey(fields="actorId")
    Set<Actor> actors;

    ...
}

The consumers, via the generated API, will have the ability to retrieve elements from Set records by the hash key:

MovieHollow movie = api.getMovieHollow(ordinal);
HollowSet<ActorHollow> actors = movie._getActors();
ActorHollow actor = actors.findElement(104);

System.out.println("Actor with ID 104: " + actor._getActorName()._getValue());

For a Map record, consumers can likewise retrieve keys, values, and entries by the hash key. The following assumes actors is instead declared as a map from Actor to Character:

MovieHollow movie = api.getMovieHollow(ordinal);
HollowMap<ActorHollow, CharacterHollow> actors = movie._getActors();

ActorHollow actor = actors.findKey(104);
CharacterHollow character = actors.findValue(104);
// alternatively: Map.Entry<ActorHollow, CharacterHollow> entry = actors.findEntry(104);

System.out.println("Actor with ID 104: " + actor._getActorName()._getValue() + 
         " played character " + character._getCharacterName()._getValue());

We can define more than one field in the @HollowHashKey declaration if our key spans multiple fields. Each field may be multiple levels deep in a hierarchical data model, expressed via dot notation (e.g. actorName.value).

Circular References

Circular references are not allowed in Hollow. A type may not reference itself, either directly or transitively.
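One way to check a data model for this constraint before publishing is a depth-first search over the type graph. The following is a minimal sketch under the assumption that the model is available as a map from each type name to the type names it references; the class and its helper methods are hypothetical and not part of Hollow.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Detects direct or transitive self-references in a type graph.
final class SchemaCycleCheck {

    static boolean hasCycle(Map<String, List<String>> typeGraph) {
        Set<String> onPath = new HashSet<>();
        Set<String> finished = new HashSet<>();
        for (String type : typeGraph.keySet()) {
            if (visit(type, typeGraph, onPath, finished)) {
                return true;
            }
        }
        return false;
    }

    private static boolean visit(String type, Map<String, List<String>> graph,
                                 Set<String> onPath, Set<String> finished) {
        if (finished.contains(type)) {
            return false; // already fully explored, no cycle through here
        }
        if (!onPath.add(type)) {
            return true; // type is already on the current path: a cycle
        }
        for (String referenced : graph.getOrDefault(type, Collections.<String>emptyList())) {
            if (visit(referenced, graph, onPath, finished)) {
                return true;
            }
        }
        onPath.remove(type);
        finished.add(type);
        return false;
    }

    // A Movie/Actor style model with no cycles.
    static Map<String, List<String>> movieModel() {
        Map<String, List<String>> graph = new HashMap<>();
        graph.put("Movie", Arrays.asList("ListOfActor", "String"));
        graph.put("ListOfActor", Arrays.asList("Actor"));
        graph.put("Actor", Arrays.asList("String"));
        return graph;
    }

    // The same model with a hypothetical Actor -> Movie back-reference.
    static Map<String, List<String>> movieModelWithBackReference() {
        Map<String, List<String>> graph = movieModel();
        graph.put("Actor", Arrays.asList("String", "Movie"));
        return graph;
    }

    public static void main(String[] args) {
        System.out.println(hasCycle(movieModel()));                  // false
        System.out.println(hasCycle(movieModelWithBackReference())); // true
    }
}
```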

Inlined vs Referenced Fields

While modeling data, a choice sometimes must be made whether to define an Object field as a non-reference type (e.g. STRING), or as a REFERENCE to a separate type which has a single STRING field. Consider the following type:

Award {
    String awardName;
    long movieId;
    long actorId;
}

In this case, imagine awardName is something like “Best Supporting Actress”. Over the years, many such awards will be given, so we’ll have a lot of records which share that value. If we use a STRING field, then that character string will be repeated for every such award record. However, if we reference a separate record type, all such awards will reference the same record with that value. If the awardName values have a lot of repetition, then this can result in significant savings.

Deduplication

Record deduplication happens automatically at the record granularity in Hollow. Try to model your data such that when there is a lot of repetition in records, the repetitive fields are encapsulated into their own types.
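The effect of record-level deduplication can be illustrated with a small sketch. This is a simplified model, assuming records of a single-string type are identified by their value; Hollow's real ordinal assignment is more involved, and the class name is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified model: identical records collapse to one ordinal.
final class DedupSketch {

    private final Map<String, Integer> ordinals = new LinkedHashMap<>();

    // Returns the ordinal for the value, assigning a new one only on first sight.
    int add(String value) {
        Integer existing = ordinals.get(value);
        if (existing != null) {
            return existing;
        }
        int next = ordinals.size();
        ordinals.put(value, next);
        return next;
    }

    int recordCount() {
        return ordinals.size();
    }

    public static void main(String[] args) {
        DedupSketch awardNames = new DedupSketch();
        System.out.println(awardNames.add("Best Supporting Actress")); // 0
        System.out.println(awardNames.add("Best Picture"));            // 1
        System.out.println(awardNames.add("Best Supporting Actress")); // 0 again
        System.out.println(awardNames.recordCount());                  // 2
    }
}
```

Three award records end up pointing at only two stored name records, which is the saving the text describes.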

To consider the opposite case, let’s examine the Actor type:

Actor {
    long id;
    String actorName;
}

The actorName is unlikely to be repeated often. If we referenced a separate record type here, we would have to retain roughly the same number of character strings, plus the references to those records. We therefore save space by using a STRING field instead of a reference to a separate type.

Reference Costs

A REFERENCE field isn't free, and therefore we shouldn't necessarily try to encapsulate fields inside their own record types where we won't benefit from deduplication. These fields should instead be inlined.
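The tradeoff can be estimated with rough arithmetic. The sketch below uses a deliberately simplified cost model: it counts only string bytes and reference bits, ignores any other per-record overhead, and assumes ordinals run from 0 to one less than the number of distinct values. The record counts and string lengths are hypothetical.

```java
// Rough cost model for inlined vs referenced string fields (illustrative only).
final class InlineVsReferenceCost {

    static int bitsRequired(long maxValue) {
        return maxValue <= 0 ? 1 : 64 - Long.numberOfLeadingZeros(maxValue);
    }

    // Inlined: every record stores its own copy of the string.
    static long inlinedBytes(long records, int avgStringBytes) {
        return records * avgStringBytes;
    }

    // Referenced: each distinct string is stored once, plus one reference per record.
    static long referencedBytes(long records, long distinctValues, int avgStringBytes) {
        long stringBytes = distinctValues * avgStringBytes;
        long referenceBits = records * bitsRequired(distinctValues - 1);
        return stringBytes + (referenceBits + 7) / 8;
    }

    public static void main(String[] args) {
        // Award names: 1,000,000 records, 50 distinct values, about 20 bytes each.
        System.out.println(inlinedBytes(1_000_000, 20));            // 20000000
        System.out.println(referencedBytes(1_000_000, 50, 20));     // 751000

        // Actor names: 1,000,000 records, 990,000 distinct values, about 15 bytes each.
        System.out.println(inlinedBytes(1_000_000, 15));            // 15000000
        System.out.println(referencedBytes(1_000_000, 990_000, 15)); // 17350000
    }
}
```

Under these assumptions the reference wins by a wide margin for award names and loses for actor names, matching the guidance above.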

We refer to fields which are defined with native Hollow types as inlined fields, and fields which are defined as references to types with a single field as referenced fields.

For maximum efficiency, referenced fields sometimes need to be namespaced: fields with like values share one type, while referenced fields with the same underlying type but unrelated values elsewhere in the data model use different types. For example, consider our Award type again, but this time, we’ll reference a type called AwardName, instead of String:

Award {
    AwardName awardName;
    long movieId;
    long actorId;
}


AwardName {
    String value;
}

Other types in our data model which reference award names can reuse the AwardName type. Other referenced string fields in our data model, which are unrelated to award names, should use different types corresponding to the semantics of their values.

Namespacing fields appropriately saves space because references to types with a lower cardinality use fewer bits than references to types with a higher cardinality. The reason for this can be gleaned from the In-Memory Data Layout topic underneath the Advanced Topics section.

Namespacing Reduces Reference Costs

Using an appropriately namespaced type reduces the heap footprint of REFERENCE fields.
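To make the cardinality effect concrete, the following sketch computes the width of a reference from the number of records in the referenced type. It assumes ordinals are assigned densely from 0, and the record counts are hypothetical.

```java
// Reference width as a function of the referenced type's cardinality (illustrative).
final class ReferenceWidthSketch {

    // Bits needed to encode the maximum ordinal of a type with the given record count.
    static int bitsPerReference(long recordsInReferencedType) {
        long maxOrdinal = Math.max(0L, recordsInReferencedType - 1);
        return maxOrdinal == 0 ? 1 : 64 - Long.numberOfLeadingZeros(maxOrdinal);
    }

    public static void main(String[] args) {
        // A dedicated AwardName type with 50 distinct names.
        System.out.println(bitsPerReference(50));        // 6
        // A single shared String type holding 2,000,000 unrelated strings.
        System.out.println(bitsPerReference(2_000_000)); // 21
    }
}
```

With these numbers, every Award record's reference shrinks from 21 bits to 6 when award names get their own type.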

Grouping Associated Fields

Referencing fields can save space because the same field values do not have to be repeated for every record in which they occur. Similarly, we can group fields which have covarying values, and pull these out from larger objects as their own types. For example, imagine we started with a Movie record which included the following fields:

Movie {
    long id;
    String title;
    String maturityRating;
    String advisories;
}

We might notice that the maturityRating and advisories fields vary together, and are often repeated across many Movies. We can pull out a separate type for these fields:

Movie {
    long id;
    String title;
    MaturityRating maturityRating;
}

MaturityRating {
    String rating;
    String advisories;
}

We could have referenced these fields separately. If we had done so, each Movie record, of which there are probably many, would have had to contain two separate references for these fields. Instead, by recognizing that these fields were associated and pulling them together, space is saved because each Movie record now only contains one reference for this data.

Maintaining Backwards Compatibility

A data model will evolve over time. The following operations will not impact the interoperability between existing clients and new data:

  • Adding a new type
  • Removing an existing type
  • Adding a new field to an existing type
  • Removing an existing field from an existing type

When adding new fields or types, existing generated client APIs will ignore the new fields, and all of the fields which existed at the time of API generation will still be visible using the same methods. When removing fields, existing generated client APIs will see null values if the methods corresponding to the removed fields are called. When removing types, existing generated client APIs will see removed types as having no records.

It is not backwards compatible to change the type of an existing field. The client behavior when calling a method corresponding to a field with a changed type is undefined.

Backwards compatibility often has a lot to do with the use case and semantics of the data. Hollow will always behave in the stated way for evolving data models, but it’s possible that consumers require a field which starts returning null once it gets removed. For this reason, additional caution should be exercised when removing types and fields.

Movie/Actor Example

Let's examine the Movie / Actor data model from our Getting Started Guide:

public class Movie {
    long id;
    String title;
    int releaseYear;
    List<Actor> actors;

    public Movie(long id, String title, int year, List<Actor> actors) {
        this.id = id;
        this.title = title;
        this.releaseYear = year;
        this.actors = actors;
    }
}

public class Actor {
    long actorId;
    String actorName;

    public Actor(long actorId, String actorName) {
        this.actorId = actorId;
        this.actorName = actorName;
    }
}

There are four type schemas in this data model: Movie, ListOfActor, Actor, and String.

Upon observing the Movie and Actor classes, the HollowObjectMapper will add a type for each into its HollowWriteStateEngine, each with an Object schema type. The List<Actor> field in Movie yields the ListOfActor type, which has a List schema whose elements reference Actor. Both classes also have a String field. Just as a String in Java is a reference to a separate object, the HollowObjectMapper adds an Object schema for a type named String and makes these fields references to it.

Inlined Strings in the HollowObjectMapper

To define an inlined String with the HollowObjectMapper, declare the field as char[] instead of String.
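As a small illustration, here is a variant of the Actor class from the example above with its name declared as char[]. The class name InlinedActor is hypothetical, and the sketch shows only the POJO shape, not the mapper call.

```java
// Variant of Actor whose name is declared as char[] so it is inlined rather than referenced.
final class InlinedActor {
    long actorId;
    char[] actorName; // char[] instead of String, per the guidance above

    InlinedActor(long actorId, String actorName) {
        this.actorId = actorId;
        this.actorName = actorName.toCharArray();
    }

    public static void main(String[] args) {
        InlinedActor actor = new InlinedActor(104, "Carrie-Anne Moss");
        System.out.println(new String(actor.actorName));
    }
}
```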