In order to wield Hollow effectively for your organization, you need only implement four interfaces to integrate with your infrastructure: a Publisher, a BlobRetriever, an Announcer, and an AnnouncementWatcher.
Once you've implemented these four interfaces, Hollow can be used in many different contexts in your organization. You'll never again have to write code to ship JSON or CSV data from one machine to another -- and better yet, you'll gain visibility and insights into previously opaque and hard-to-debug datasets.
Think local first
The following sections describe how to plug Hollow into your infrastructure. Now is the time to think about how you will debug your data later. Consider making your implementations of these interfaces easily allow for (securely!) retrieving data from any environment, including production, right down to your local development box.
If you take this step, you'll be giving yourself immense power to glean insight into your data and debug production issues. Imagine it's 10am, and you suspect some issue surrounding some particular data was present at 4am this morning. You can open up Eclipse or IntelliJ and write a main method which -- with a few lines of code -- pulls down the data exactly as it existed on your production instances at that time. You can write code against it to explore specific scenarios or feed it into the explorer to confirm or deny your suspicion in seconds.
Storing the Blobs
Blobs are published to a file store which is accessible by consumers. From this blob store, consumers must be able to query for and retrieve blobs in the following ways:
- Snapshots: Must be queryable based on the state identifier. If a blob store is queried for a snapshot with an identifier which does not exist, the snapshot with the greatest identifier prior to the queried identifier should be retrieved.
- Deltas: Must be queryable based on the state identifier to which a delta should be applied (i.e. the delta's from state identifier).
- Reverse Deltas: Must be queryable based on the state identifier to which a reverse delta should be applied (i.e. the reverse delta's from state identifier).
The Publisher and BlobRetriever are opposite sides of your blob store (writer and reader, respectively).
A Publisher implementation need only define a single method:
public void publish(HollowProducer.Blob blob);
Each Blob passed to your Publisher should be published somewhere for retrieval by consumers. The blob's data is retrieved for publishing by calling either newInputStream() or getFile(). The blob will be a snapshot, a delta, or a reverse delta -- the type can be determined by calling getType(). The blob should be indexed for later retrieval as indicated above -- snapshots by the result of getToVersion(), and deltas and reverse deltas by the result of getFromVersion(). Note that you will need to be able to distinguish between a snapshot, a delta, and a reverse delta with the same version number.
Choosing a blob store
You can publish blobs anywhere -- S3, an FTP server, an NFS, etc -- so long as that selected blob store can scale to the necessary volume of concurrent consumer requests.
Note that if your announcement mechanism is instantaneous, all consumers will attempt to retrieve the blob files simultaneously.
Blobs must be overwriteable
Your Publisher implementation must allow blobs to be overwritten. If an attempt is made to publish a blob to be indexed by a state identifier for which a corresponding artifact already exists, it must overwrite the previously published artifact. This happens routinely -- for example, if a data state fails after publishing for any reason (e.g. validation fails), then the producer will automatically roll back the state and a delta will be re-published with the same from version.
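The indexing and overwrite rules above can be sketched with a small filesystem-backed store. This is only an illustration under stated assumptions, not Hollow's own publisher: SketchBlobType, SketchBlobStore, and the file naming scheme are hypothetical names invented here. A real Publisher would read the type, index version, and data from the HollowProducer.Blob it receives and hand them to something like this.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Locale;

/** Hypothetical stand-in for the blob type reported by getType(); not a Hollow type. */
enum SketchBlobType { SNAPSHOT, DELTA, REVERSE_DELTA }

/** Hypothetical filesystem-backed store illustrating the indexing and overwrite rules. */
final class SketchBlobStore {
    private final Path dir;

    SketchBlobStore(Path dir) {
        try {
            this.dir = Files.createDirectories(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** The type is part of the name, so equal version numbers never collide across types. */
    static String fileName(SketchBlobType type, long indexVersion) {
        return type.name().toLowerCase(Locale.ROOT) + "-" + indexVersion;
    }

    /**
     * Stores a blob under its index version: the to-version for a snapshot,
     * the from-version for a delta or reverse delta. An existing file is replaced.
     */
    Path publish(SketchBlobType type, long indexVersion, InputStream data) {
        Path target = dir.resolve(fileName(type, indexVersion));
        try {
            Files.copy(data, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target;
    }
}
```

Publishing a second delta with the same from version simply replaces the first file, which matches the rollback scenario described above.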
The BlobRetriever is the other side of the blob store equation. Your implementation must define three methods:
public HollowConsumer.Blob retrieveSnapshotBlob(long desiredVersion);
public HollowConsumer.Blob retrieveDeltaBlob(long currentVersion);
public HollowConsumer.Blob retrieveReverseDeltaBlob(long currentVersion);
The Blob you return will be a custom implementation for your blob store which extends HollowConsumer.Blob and implements the getInputStream() method.
Your retrieveSnapshotBlob(long desiredVersion) implementation should return the blob which exactly matches the specified desiredVersion if it exists. If no such version exists, then the greatest available version which is less than the specified desiredVersion should be returned. If no such match exists, return null.
Your retrieveDeltaBlob(long currentVersion) and retrieveReverseDeltaBlob(long currentVersion) implementations should each return the blob whose from version exactly matches the specified currentVersion. If no such match exists, return null.
Scanning for snapshots
If an exact match for the requested snapshot doesn't exist, you'll need to scan the available versions for the closest match prior to the requested version. For this reason, if you have a large number of consumers, it makes sense to index your available snapshot versions so this operation is fast.
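One way to make that lookup fast is to keep the known snapshot versions in a sorted set, where "exact match, else greatest lower version" is a single floor query. The sketch below assumes the versions are held in memory. SketchSnapshotIndex is a hypothetical name, and how the set is populated (for example by listing your blob store) is left out.

```java
import java.util.concurrent.ConcurrentSkipListSet;

/** Hypothetical in-memory index of the snapshot versions available in a blob store. */
final class SketchSnapshotIndex {
    private final ConcurrentSkipListSet<Long> versions = new ConcurrentSkipListSet<>();

    /** Records that a snapshot for this version is available. */
    void add(long version) {
        versions.add(version);
    }

    /**
     * Returns the requested version if a snapshot exists for it, otherwise the
     * greatest available version below it, or null when no such snapshot exists.
     */
    Long resolve(long desiredVersion) {
        return versions.floor(desiredVersion);
    }
}
```

A retrieveSnapshotBlob implementation could call resolve(desiredVersion), return null when the result is null, and otherwise fetch the snapshot stored under the resolved version.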
Announcing the State
Once the necessary transitions to bring clients up to date have been written to the blob store, the availability of the state must be announced to clients. This simply means that a centralized location must be maintained and updated by the producer which indicates the version of the currently available state.
When this announced state is updated, usually it is desirable to have consumers realize this update as quickly as possible. This can be accomplished either via a push notification to all consumers, or via frequent polling by consumers.
The Announcer and AnnouncementWatcher are opposite sides of your announcement mechanism (writer and reader, respectively).
An Announcer implementation need only implement a single method:
public void announce(long stateVersion);
The stateVersion passed to your Announcer should be immediately communicated to your consumers. You can use either a 'push' mechanism or a frequent 'polling' mechanism to minimize the time between when a producer announces a version and when all consumers receive that announcement.
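As a rough illustration of the polling variant, the centralized location can be as simple as one small file holding the current version. The sketch below assumes a shared filesystem path that the producer can write and consumers can read. SketchFileAnnouncement and its method readAnnouncedVersion are hypothetical names, not part of Hollow.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.OptionalLong;

/** Hypothetical file-based announcement location shared by producer and consumers. */
final class SketchFileAnnouncement {
    private final Path file;

    SketchFileAnnouncement(Path file) {
        this.file = file;
    }

    /** Producer side: write to a temporary sibling, then move it over the announcement file. */
    void announce(long stateVersion) {
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        try {
            Files.write(tmp, Long.toString(stateVersion).getBytes(StandardCharsets.UTF_8));
            Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Consumer side: what a poll would read; empty if nothing has been announced yet. */
    OptionalLong readAnnouncedVersion() {
        try {
            String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim();
            return OptionalLong.of(Long.parseLong(text));
        } catch (NoSuchFileException e) {
            return OptionalLong.empty();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Writing to a temporary file first is a design choice in this sketch so that a poller is unlikely to observe a half-written version number.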
An AnnouncementWatcher implementation must implement two methods:
public long getLatestVersion();
public void subscribeToUpdates(HollowConsumer consumer);
When your AnnouncementWatcher is initialized, you should immediately set up your selected announcement mechanism -- either subscribe to your push notifications or set up a thread to poll for updates.
Implementations should maintain a list of subscribed HollowConsumers, and each time subscribeToUpdates(HollowConsumer consumer) is called, you should add the provided HollowConsumer to your list. When the announced version changes, call triggerAsyncRefresh() on each subscribed consumer.
Whether or not any HollowConsumers are subscribed, implementations should return the latest announced version each time getLatestVersion() is called.
HollowConsumer subscribes itself
Each HollowConsumer will automatically call subscribeToUpdates() with itself for an AnnouncementWatcher with which it is initialized.
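The subscriber bookkeeping described above can be sketched without any Hollow dependency. In this illustration, SketchRefreshable is a hypothetical stand-in for the one HollowConsumer method the watcher needs, and onAnnouncement is an assumed hook that your push handler or polling thread would call. The Long.MIN_VALUE starting value is just this sketch's marker for "nothing announced yet".

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for the consumer's refresh hook; not a Hollow type. */
interface SketchRefreshable {
    void triggerAsyncRefresh();
}

/** Hypothetical watcher core: tracks the latest version and notifies subscribers on change. */
final class SketchAnnouncementWatcher {
    private final List<SketchRefreshable> subscribers = new CopyOnWriteArrayList<>();
    private final AtomicLong latestVersion = new AtomicLong(Long.MIN_VALUE);

    /** Always answers with the latest known version, subscribed consumers or not. */
    long getLatestVersion() {
        return latestVersion.get();
    }

    /** Adds the consumer to the list that is notified when the version changes. */
    void subscribeToUpdates(SketchRefreshable consumer) {
        subscribers.add(consumer);
    }

    /** Assumed entry point, invoked by a push notification handler or a polling thread. */
    void onAnnouncement(long announcedVersion) {
        long previous = latestVersion.getAndSet(announcedVersion);
        if (previous != announcedVersion) {
            for (SketchRefreshable subscriber : subscribers) {
                subscriber.triggerAsyncRefresh();
            }
        }
    }
}
```

Subscribers are only notified when the version actually changes, so a poller that reads the same version repeatedly does not trigger redundant refreshes.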
Mistakes happen. What's important is that we can recover from them quickly. If you accidentally publish bad data, you should be able to revert those changes quickly. If you give your AnnouncementWatcher implementation an alternate location to read the announcement from, which overrides the announcement from the producer, then you can use this to quickly force clients to go back to any arbitrary state in the past. We call setting a state version in this alternate location pinning the consumers.
Implementing a pinning mechanism is extremely useful and highly recommended. You can operationally reverse data issues immediately upon discovery, so that symptoms go away while you diagnose exactly what went wrong. This can save an enormous amount of stress and money.
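The override rule itself is small: if a pin is set, report it, otherwise report the announced version. A minimal sketch follows, assuming both locations can be read through simple suppliers. SketchPinnableVersionSource is a hypothetical name, and where the pin is stored is up to your infrastructure.

```java
import java.util.OptionalLong;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical version source in which a pinned version, when present, wins. */
final class SketchPinnableVersionSource {
    private final LongSupplier announcedVersion;
    private final Supplier<OptionalLong> pinnedVersion;

    SketchPinnableVersionSource(LongSupplier announcedVersion, Supplier<OptionalLong> pinnedVersion) {
        this.announcedVersion = announcedVersion;
        this.pinnedVersion = pinnedVersion;
    }

    /** Returns the pinned version if one is set, otherwise the announced version. */
    long getLatestVersion() {
        OptionalLong pin = pinnedVersion.get();
        return pin.isPresent() ? pin.getAsLong() : announcedVersion.getAsLong();
    }
}
```

An AnnouncementWatcher could delegate its getLatestVersion() to logic like this and treat a change to the pin the same way it treats a new announcement.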
If you've pinned consumers due to a data issue, it's probably not desirable to simply 'unpin' them after the root cause is addressed. Instead, restart the producer and instruct it to restore from the pinned state. It should then produce a delta which skips over all of the bad states. Only unpin after the delta from the pinned version to a bad version is overwritten with a delta from the pinned version to the good version.
Different use cases within your organization may want to reuse the same infrastructure integration. You may want your AnnouncementWatcher to allow for multiple blob namespaces, one for each use case.