This section describes the schemas used in Hollow and how your data model maps to them.
A Hollow data model is a set of schemas, which are usually defined by the POJOs used on the producer to populate the data. This section will use POJOs as examples, but there are other ways to define schemas -- for example you could ingest a text file and use the schema parser.
Schemas Define the Data Model
A hollow dataset is comprised of one or more data types. The data model for a dataset is defined by the schemas describing those types.
Each POJO class you define will result in an
Object schema, which is a fixed set of strongly typed fields. The fields will be based on the member variables in the class. For example, the class
Movie will define an
Object schema with three fields:
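For instance, a sketch of such a class (the exact field names are illustrative):

```java
import java.util.Set;

// A Movie POJO: each member variable becomes a field in the Movie Object schema.
public class Movie {
    long id;            // a LONG field
    String title;       // a REFERENCE field
    Set<Actor> actors;  // a REFERENCE field

    Movie(long id, String title, Set<Actor> actors) {
        this.id = id;
        this.title = title;
        this.actors = actors;
    }
}

class Actor {
    String actorName;

    Actor(String actorName) {
        this.actorName = actorName;
    }
}
```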
Each schema has a type name. The name of the type will default to the simple name of your POJO -- in this case Movie. Each schema field has a field name, which will default to the same name as the field in the POJO -- in this case id, title, and actors. Each field also has a field type, which in this case is LONG, REFERENCE, and REFERENCE, respectively. Each REFERENCE field also indicates the referenced type, which for our reference fields above default to String and SetOfActor.
The possible field types are:
INT: An integer value up to 32 bits
LONG: An integer value up to 64 bits
FLOAT: A 32-bit floating-point value
DOUBLE: A 64-bit floating-point value
STRING: An array of characters
BYTES: An array of bytes
REFERENCE: A reference to another specific type. The referenced type must be defined by the schema.
Notice that since the reference type is defined by the schema, data models must be strongly typed. Each reference in your data model must point to a specific concrete implementation. References to interfaces, abstract classes, or
java.lang.Object are not supported.
Object schemas may specify a primary key. This is accomplished by using the
@HollowPrimaryKey annotation and specifying the fields.
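A minimal sketch, assuming a Movie type keyed by its id field:

```java
import com.netflix.hollow.core.write.objectmapper.HollowPrimaryKey;

// Records of this type are uniquely identified by the id field.
@HollowPrimaryKey(fields="id")
public class Movie {
    long id;
    String title;

    Movie(long id, String title) {
        this.id = id;
        this.title = title;
    }
}
```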
When defined in the schema, primary keys are a part of your data model and drive useful functionality and default configuration in the hollow explorer, hollow history, and diff UIs. They also provide a shortcut when creating a primary key index.
Primary keys defined in the schema follow the same convention as primary keys defined for indexes. They consist of one or more field paths, which will auto-expand if they terminate in a REFERENCE field.
Null values are not supported
Primary key fields cannot have null values; this is unsupported because it has not been needed. Be mindful when populating primary key fields: a null value will result in an exception similar to the one below.
```
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Attempting to read null value as int
	at com.netflix.hollow.core.write.HollowBlobWriter.writeSnapshot(HollowBlobWriter.java:69)
	at com.netflix.hollow.api.producer.fs.HollowFilesystemBlobStager$FilesystemBlob.write(HollowFilesystemBlobStager.java:117)
```
Inlined vs Referenced Fields
We can inline some fields in our POJOs so that they are no longer
REFERENCE fields, but instead encode their data directly in each record:
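For example, with Hollow's @HollowInline annotation (the Movie fields here are illustrative):

```java
import com.netflix.hollow.core.write.objectmapper.HollowInline;

public class Movie {
    long id;

    @HollowInline      // title is encoded directly in each record as a STRING field
    String title;

    Movie(long id, String title) {
        this.id = id;
        this.title = title;
    }
}
```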
While modeling data, we choose whether or not to inline a field for efficiency. Consider the following type:
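Something like the following Award type (the exact fields are illustrative):

```java
public class Award {
    long id;
    String awardName;  // by default, a REFERENCE field to the String type
    long year;

    Award(long id, String awardName, long year) {
        this.id = id;
        this.awardName = awardName;
        this.year = year;
    }
}
```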
In this case, imagine
awardName is something like “Best Supporting Actress”. Over the years, many such awards will be given, so we’ll have a lot of records which share that value. If we use an inlined
STRING field, then the value "Best Supporting Actress" will be repeated for every such award record. However, if we reference a separate record type, all such awards will reference the same child record with that value. If the
awardName values have a lot of repetition, then this can result in a significant savings.
Deduplication happens automatically at the record granularity in Hollow. Try to model your data such that when there is a lot of repetition in records, the repetitive fields are encapsulated into their own types.
To consider the opposite case, let's examine the following Actor type:
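A sketch of such a type (assumed for illustration):

```java
// Each Actor record holds a name which is rarely, if ever, duplicated.
public class Actor {
    String actorName;

    Actor(String actorName) {
        this.actorName = actorName;
    }
}
```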
An actorName value is unlikely to be repeated often. If we reference a separate record type here, we have to retain roughly the same number of unique character strings, plus the references to those records. Consequently, we end up saving space by using an inlined
STRING field instead of a reference to a separate type.
A REFERENCE field isn't free, and therefore we shouldn't necessarily try to encapsulate fields inside their own record types where we won't benefit from deduplication. These fields should instead be inlined.
We refer to fields which are defined with native Hollow types as inlined fields, and fields which are defined as references to types with a single field as referenced fields.
Namespaced Record Type Names
For maximum efficiency, referenced types should sometimes be namespaced, so that fields with like values reference the same record type while fields of the same primitive type elsewhere in the data model use different record types. For example, consider our
Award type again, but this time, we’ll reference a type called
AwardName, instead of
String. We can explicitly name the type of a field with the @HollowTypeName annotation:
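A sketch of the field-level annotation (the Award fields are illustrative):

```java
import com.netflix.hollow.core.write.objectmapper.HollowTypeName;

public class Award {
    long id;

    @HollowTypeName(name="AwardName")  // awardName references the AwardName type, not String
    String awardName;

    long year;
}
```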
Other types in our data model which reference award names can reuse the
AwardName type. Other referenced string fields in our data model, which are unrelated to award names, should use different types corresponding to the semantics of their values.
Namespacing fields saves space because references to types with a lower cardinality use fewer bits than references to types with a higher cardinality. The reason for this can be gleaned from the In-Memory Data Layout topic underneath the Advanced Topics section.
Namespacing fields is also useful if some consumers don't need the contents of a specific referenced field. If a type is namespaced, it can be selectively filtered, whereas if it is grouped with other fields which are needed by all consumers, then it cannot be selected for filtering.
Namespacing Reduces Reference Costs
Using an appropriately namespaced type reduces the heap footprint cost of each reference to that type.
Changing default type names
The @HollowTypeName annotation can also be used at the class level to select a default type name for a class other than its simple name. Custom type names should begin with an upper-case character to avoid naming ambiguity in the generated API, although Hollow does not enforce this, for backwards compatibility reasons.
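A sketch of class-level usage (the class name here is hypothetical):

```java
import com.netflix.hollow.core.write.objectmapper.HollowTypeName;

// Records of this class get the type name AwardName rather than
// the simple class name AwardNameValue.
@HollowTypeName(name="AwardName")
public class AwardNameValue {
    String value;
}
```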
Grouping Associated Fields
Referencing fields can save space because the same field values do not have to be repeated for every record in which they occur. Similarly, we can group fields which have covarying values, and pull these out from larger objects as their own types. For example, imagine we started with a
Movie type which included the following fields:
We might notice that fields such as advisories vary together, and are often repeated across many
Movie records. We can pull out a separate type for these fields:
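A hypothetical sketch, assuming a certification code and its advisories are the covarying fields:

```java
import java.util.List;

public class Movie {
    long id;
    String title;
    Certification certification;  // a single reference covers both grouped fields
}

// The covarying fields, pulled out into their own type so that many
// Movie records can share one Certification record.
class Certification {
    String code;              // e.g. "PG-13"
    List<String> advisories;  // e.g. ["Violence", "Language"]
}
```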
We could have referenced these fields separately. If we had done so, each
Movie record, of which there are probably many, would have had to contain two separate references for these fields. Instead, by recognizing that these fields were associated and pulling them together, space is saved because each
Movie record now only contains one reference for this data.
A transient field is ignored and will not be included in an Object schema. A transient field is a field declared with the transient Java keyword or annotated with the @HollowTransient annotation. The latter may be used for cases when the use of the transient Java keyword has unwanted side effects, such as when the POJOs defining the data model are also consumed by tools other than Hollow, for which the field is not transient.
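A sketch showing both mechanisms (field names are illustrative):

```java
import com.netflix.hollow.core.write.objectmapper.HollowTransient;

public class Movie {
    long id;

    transient int localCacheHint;  // excluded via the transient keyword

    @HollowTransient
    String debugNotes;             // excluded via the annotation, leaving Java
                                   // serialization behavior untouched
}
```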
You can define
List schemas by adding a member variable of type
List in your data model. For example:
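For instance (the element type Actor is illustrative):

```java
import java.util.List;

public class Movie {
    long id;
    List<Actor> actors;  // defines a List schema with element type Actor
}

class Actor {
    String actorName;
}
```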
The List must explicitly define its parameterized element type. The default type name of the above List schema will be ListOfActor. A List schema indicates a record type which is an ordered collection of REFERENCE fields. Each record will have a variable number of references. The referenced element type must be defined by the schema, and each reference in a record encodes only the ordinal of the referenced record.
You can define
Set schemas by adding a member variable of type
Set in your data model. The
Set must explicitly define its parameterized element type. For example:
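For instance (the element type Actor is illustrative):

```java
import java.util.Set;

public class Movie {
    long id;
    Set<Actor> actors;  // defines a Set schema with element type Actor
}

class Actor {
    String actorName;
}
```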
A Set schema indicates a record type which is an unordered collection of
REFERENCE fields. Each record will have a variable number of references, and the referenced type must be defined by the schema. Within a single set record, each reference must be unique.
Set records can be hashed by some specific element fields for O(1) retrieval. In order to enable this feature, a
Set schema will define an optional hash key, which defines how its elements are hashed/indexed.
You can define
Map schemas by adding a member variable of type
Map in your data model. The
Map must explicitly define its parameterized key and value types. For example:
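For instance (the key and value types here are illustrative):

```java
import java.util.Map;

public class Movie {
    long id;
    Map<String, Award> awards;  // a Map schema keyed by String with Award values
}

class Award {
    long id;
}
```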
A Map schema indicates a record type which is an unordered collection of pairs of
REFERENCE fields, used to represent a key/value mapping. Each record will have a variable number of key/value pairs. Both the key reference type and the value reference type must be defined by the schema. The key reference type does not have to be the same as the value reference type. Within a single map record, each key reference must be unique.
Map records can be hashed by some specific key fields for O(1) retrieval of the keys, values, and/or entries. In order to enable this feature, a
Map schema will define an optional hash key, which defines how its entries are hashed/indexed.
A Set or Map schema may optionally define a hash key. A hash key specifies one or more user-defined fields used to hash entries into the collection. When a hash key is defined on a
Set, each set record becomes like a primary key index; records in the set can be efficiently retrieved by matching the specified hash key fields. Similarly, when a hash key is defined on a
Map, each map record becomes like an index over the keys in the key/value pairs contained in the map record.
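As a sketch using Hollow's @HollowHashKey annotation (the actorId field is an assumed example):

```java
import java.util.Set;
import com.netflix.hollow.core.write.objectmapper.HollowHashKey;

public class Movie {
    long id;

    @HollowHashKey(fields="actorId")  // elements of this set are hashed by actorId
    Set<Actor> cast;
}

class Actor {
    long actorId;
    String actorName;
}
```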
See Hash Keys for a detailed discussion of hash keys.
Circular references are not allowed in Hollow. A type may not reference itself, either directly or transitively.
Object Memory Layout
INT and LONG fields are each represented by a number of bits exactly sufficient to represent the maximum value for the field across all records.
FLOAT, DOUBLE, and BOOLEAN fields are represented by 32, 64, and 2 bits, respectively.
STRING and BYTES fields use a variable number of bytes for each record.
REFERENCE fields encode the ordinal of referenced records, and are represented by a number of bits exactly sufficient to encode the maximum ordinal of the referenced type. See In-memory Data Layout for more details.
Avoid Outlier Values
Try to model your data such that there aren't any outlier values for INT and LONG fields: since these fields use exactly enough bits to represent the field's maximum value, a single outlier increases the size of every record. Also, avoid FLOAT and DOUBLE fields where possible, since these field types are relatively expensive.
Maintaining Backwards Compatibility
A data model will evolve over time. The following operations will not impact the interoperability between existing clients and new data:
- Adding a new type
- Removing an existing type
- Adding a new field to an existing type
- Removing an existing field from an existing type
When adding new fields or types, existing generated client APIs will ignore the new fields, and all of the fields which existed at the time of API generation will still be visible using the same methods. When removing fields, existing generated client APIs will see null values if the methods corresponding to the removed fields are called. When removing types, existing generated client APIs will see removed types as having no records.
It is not backwards compatible to change the type of an existing field. The client behavior when calling a method corresponding to a field with a changed type is undefined.
It is not backwards compatible to change the primary key or hash key for any type.
Beyond the specification of Hollow itself, backwards compatibility often has a lot to do with the use case and semantics of the data. Hollow will always behave in the stated way for evolving data models, but it’s possible that consumers require a field which starts returning null once it gets removed. For this reason, additional caution should be exercised when removing types and fields.
Backwards-incompatible data remodeling
Every so often, it may be required or desirable to make changes to the data model which are incompatible with prior versions. In this case, an older producer, which produces the older data model, should run in parallel with the newer producer, producing the newer incompatible data model. Each producer should write its blobs to a different namespace, so that older consumers can read from the old data model, and newer consumers can read from the newer data model. Once all consumers are upgraded and reading from the newer data model, the older producer can be decommissioned.