High-Performance Queuing Systems (how the 3 Vs change the picture)

Abstract

Queuing systems provide three main benefits to the developer.

  • Guaranteed delivery
  • Transacted re-queue if some failure occurs
  • First In First Out order

When dealing with high-performance systems, and queued messages are arriving at speed and are varied, only one of these is certain: guaranteed delivery.

Messages arrive too rapidly to hold them in the queue until failure is no longer an option. Multiple systems de-queuing the messages open the possibility of messages being processed out of order.

This post explores the options available to the developer to ensure messages that must be delivered are delivered, that they are processed as they should be, and that it’s possible to determine what happened.

Basic Queuing Applications

One of the main use cases for queuing applications is to allow Application A to send a message without having to worry that Application B, the recipient, is around to receive it. The queue management infrastructure ensures that when Application B is ready to pick up the message, or messages, they are available and there is no requirement that Application A be up and running when the message, or messages, are delivered.

Loosely-coupled, sender/receiver-agnostic, resilient messaging infrastructure.

Many applications can be built around this infrastructure. The queuing system gives the interconnected applications stability and resilience. That the sending application is done with the message once it has been enqueued allows that application to exit or return to whatever work it had been doing:

  • interacting with end-users
  • processing transactions as they arrive
  • monitoring some Internet of Things (IoT) device(s)

This frees up system resources and can contribute to the real, or perceived, responsiveness of the overall system.

Large-scale e-commerce applications benefit from queuing application approaches.

High-Performance Queuing Applications

But what happens when Application A is not only still around when the messages are picked up, but is only one of several Application A instances, which are sending messages without pause? In like manner, multiple Application B instances may be deployed to cope with the volume and variety of the inbound messages and need to replicate themselves to ensure queues don’t get backed up when the system is under stress.

There are three such scenarios in high-performance queuing applications and the complexity of the receiving applications rises with each.

Scenario 1: Only new data

In this scenario multiple Application A instances are sending new data only. In an IoT application new sensor data may be streaming in from a variety of devices almost continuously. These data are typically simple (spot temperature values, pipeline flow rate past a sensor, start or stop for each of a bank of elevators in a large office building.) In this situation the only concerns are whether there are enough Application B instances to absorb the messages and persist them, and whether later aggregation of the data can determine in which order the messages arrived.

The answer to the first of these questions is a configuration issue. The queuing infrastructure should be configurable in such a way as to allow changing the number of Application B instances to meet the message volume demand. Preferably this configuration should be elastic, scaling up the number of instances as the message volume rises, and throttling back the number of instances as the volume declines.

The answer to the second question requires that Application A encode a sequence number in the message, or the queuing infrastructure attach a date/time stamp to the message as the message is being enqueued. Such a date/time stamp should be of sufficient granularity to identify the order in which messages from a given IoT device were enqueued.

The sequence number has the advantage of being monotonic, removing concerns about granularity. The data/time stamp can serve an additional function of capturing the time from enqueue to dequeue, which may suggest the health of the system. This is valuable for any custom monitoring solutions put in place.

Application B is simply charged with dequeuing messages, possibly adding a dequeue date/time stamp, and writing the resulting data to the persistence layer.

Because the data persistence layer is append-only, the factors for choice of storage mechanism hinge on low write latency and replication. The former to ensure data is persisted with minimal delay. The latter to prevent single-point-of-failure issues. Replication can also have the advantage of speeding up later consumption and aggregation of the data.

Scenario 2: Data distribution hub scenario with optional Publish/Subscribe (Pub/Sub)

In this scenario the Application B instances are charged with the full Insert/Amend/Delete suite of operations. Application A instances insert, amend, delete records in local transaction databases and emit the record details with the record manipulation action, as a message. Application B instances must consume the messages and persist the final state of changed records. Finally, the Application B instances may alert topic subscribers that changes have occurred.

As in Scenario 1, it must be possible for Application B instances to identify the order in which to apply updates to the Publisher data store. Were the rate of message arrival slow, the first-in first-out nature of the queuing infrastructure would be enough. In a high-performance queuing scenario, the messages must contain a means of ordering themselves. This ordering comes into play to address two needs:

Application B dequeues messages in batches. With a sufficiently high transaction rate in Application A, multiple instances of the same record might exist in one batch. Application B will only be interested in persisting the last version of any record. The ordering information within the messages allows Application B instances to select, within primary record identifiers, which instance of each record to persist.

Multiple Application B instances will dequeue messages in batches. It’s possible, given the asynchronicity inherent in network transmissions and parallel processing, that different versions of a given record may be distributed across multiple instances of Application B. Variety in message structure, complexity, length will introduce its own variability in processing time for messages. Application B instances must only persist records if the version they have is later than the version already stored. The ordering information within the record serves this purpose.

With the need to identify records for update or deletion operations, besides the simple store operation, the persistence layer should be a database management system that can rapidly identify records for Amend/Delete operations. Subscribers to changed topics also must identify records related to changed topics quickly. Some form of relational, aggregate, or hybrid relational/memory-mapped database management system should be selected.

The parallel and asynchronous nature of the Application B instances ensures database collisions as different instances of Application B attempt to write to the database. At the least, Application B should only insert a record if a record with that key identifier does not exist yet. Also, a record should only be updated if its sequence number/timestamp is greater than that of the record already persisted. Finally, Application B must use retry logic patterns to recover from database collision, yet avoid infinite retry loops. To avoid these the developer must implement circuit-breaker patterns.

The sequence number/timestamp must be persisted by both Application A and Application B instances. This ensures it’s possible to verify that the data in Application B managed systems match those in Application A managed systems, the system of record.

Logging must be implemented in Application B to record the disposition of all messages delivered to it. As only the latest versions of messages are persisted to the data management layer, the fate of the intermediate versions must be known for auditing and validation purposes.

Pub/Sub for Scenario 2 can be implemented in many ways. There are Pub/Sub infrastructures available that provide subscribe and notification APIs that can be employed. On persisting new data Application B instances simply need to call the notification API specifying the topic just updated and leave the rest to the Pub/Sub infrastructure.

Otherwise, using a queuing infrastructure, Application B instances could enqueue notifications to a set of queues monitored by subscribers to updated topics. Again, once the notification has been sent the work of Application B is done.

Scenario 3: Full Duplex with Writeback

All of scenario 2 with round-trip update to the initiator.

There may be Application C instances—the subscribers in Scenario 2—that enrich the received data through user interaction and/or other data sources, yet want to return that enriched data to the original source.

In this scenario little changes in the main Application B dequeing application. It’s possible this scenario tightens the service level agreement under which Application B operates, particularly in terms of speed. Care must be taken in provisioning enough hardware and network bandwidth to accommodate such a tightening.

Application C instances will work against a persistence layer that is, in the ideal case, not the one Application B instances update. This is to control the volume of writes to a data layer in a high-performance queuing application. There exist in-memory, or in-memory/disk-based hybrid, systems that might accommodate the differing transaction patterns of Application B, Subscribers to a Pub/Sub system, and Application C instances. These data systems could obviate the need for an entirely different data store. The entire pipeline in these systems must be monitored to identify and eliminate choke points as they arise.

Application C instances updating local copies of the data ensures quick response to Application C users. It must be possible to undo such updates should the return of the data back to the Application A source fail. In like manner, should the source data have been updated before the Application C request completes, it must be possible to signal to Application C this has happened so it may undo its update to the local copy.

BI–Not Just a Pretty Face

We’ve all seen the demo. A salesperson showing off the latest tool for “Self-Service BI.”

“Just drag, drop, plug in your data, and there…a dashboard with all your measures, metrics, KPIs available. Refreshed at the touch of a button!”

Ten minutes (on a slow day) and we all stare at the pretty pictures, the gauges, the graphs, the large/small/colored circles scattered across a map showing us…well, everything we might want to know.

What isn’t shown in the ten minute demo (for there needs to be time to talk about licenses, seats, the great new things coming down the path) are the hours spent crafting the datasets that exist to permit the fluid drag and drop that was just demoed.

And there’s where the work is. Without the plumbing that pipes(!) the data into reservoirs, that shapes the data into structures easily retrieved…without queries to summarize and categorize…to turn measures to metrics, to measure metrics against targets for KPIs…then we can “drag and drop” all day and achieve little more than pretty pictures.

Without question the “Visual Display of Information” is important, and the art to creating visually effective dashboards is no less valuable or difficult than deep knowledge of CSS is to creating functional and appealing applications.

But they are all for naught without the hard data-related work beneath.

Go With The Flow

Back in the day…programming was easy.

Yeah. There I just lost most of my audience! No, it wasn’t easy. It’s not easy now. Back there there were fewer, and less powerful, tools. Today there are lots of tools, frameworks, ways-of-doing-things.

Wasn’t easy then. Differently not-easy now.

Some aspects were easy (easier.) A concept like “data at rest” is one such. Back then you wrote data. Inserted it. Updated it. It stayed that way. Heck. I can even remember when data was written out to tape. That *really* was data at rest. You wanted to update it? You processed your transaction tape against your master tape, producing a new master.

Functional programming with invariant data! But I digress.

In modern data-centric organizations, organizations where data is how they move through the marketplace, data is not “at rest.” It is flowing through systems.

All. The. Time.

Applications that persist data must be aware the data won’t stay put. It will flow on to the next stage in the process, on to the next process, loop back through systems, and be different the next time the application retrieves it.

Data architects need to ensure the data is aware updates have occurred, when, where.

Application architects have to be aware that multiple, interrelated, distributed, fragments of code can and will touch data as it moves through a…system…not simply a program.

We’re no longer developing programs that act upon data, that persist data. What we’re working on, working towards, are ecosystems through which data flows.

Code Generation—Consistency

“…consistency is the hobgoblin of little minds…” So said Ralph Waldo Emerson.

In a previous post, In Case of Emergency…, I spoke of code generation as a “force multiplier” and that hasn’t changed. There still remains more code to be written than can be written in the time needed by our ever-changing technical landscape.

But a client reminded me of another, likely more important, reason: the topic of the butchered quote at the start of this post.

Most often, when the Emerson is quote is used, you get the above. Delivered in a distain-filled tone. The “A foolish…” modifier at the start of the quote being omitted to emphasize the point the person doing the quoting is making.

Consistency, as if this were a bad thing. It’s not a bad thing for code.

The client reminded me that a good reason for generated code is that the result will be consistent. That it’s the best way to ensure that it is.

Consistent means predictable, means supportable, means maintainable. This is why we have architecture patterns, standards, enterprise-level coding templates.

One of the best ways to ensure consistency is to manufacture the code, not write it.

I’ll be speaking at SQL Saturday, New York on May 19th on the topic of Code Generation. Hope to see you there.

In Case of Emergency…

gear-wheel-310906_1280In the distant past that represents the early part of my career articles were written about how IT, software development in particular, might be the last place automation took hold. Acres of column inches talked of how everything else was being automated by computers, but software developers felt their work was too complex and resisted the march of progress that was sweeping away everyone else’s jobs.

Even so, there were companies building Computer-Aided Software Engineering (CASE) tools. There were even classifications dividing such tools into lowerCASE and upperCASE. I’ll leave it to the historians (or perhaps paleontologists) among us to provide the definitions for these classifications.

Speeding forward to the near-past and the present, we find automation taken for granted, or perhaps not even thought of at all.

The ‘scaffolding’ provided by Microsoft’s .NET framework that adds substantial support code for object-relational mapping, the Javascript frameworks (nodeJS, angular—version 1 or 2, etc.) that inject code beneath the tokens supplied by programmers, the database queries generated by analytical & dashboarding tools, even by the humble Excel spreadsheets used by so many, all attest to how automation has helped augment human interaction with computers and the striving to get them to do what is wanted.

The title of this entry is borrowed from a McKnight Consulting presentation I saw some years ago, a cartoon which showed a white-coated individual in a glass case with an “In Case of Emergency Break Glass” sign on the wall beside it.

The context was data science and the thrust of the visual was “Automate everything you can. Use humans where necessary.”

In the era of high-speed data, high-volume data, highly-varied data, this remains true. When faced with processing millions of messages an hour, a minute, a…oh dear…second, code is needed. Lots of it. Simple in-line code.

More than a decade and a half ago I was making the case for metadata-driven data movement applications. My thought was a few data flows that would configure themselves at runtime, taking this path or that, depending on the metadata.

A young woman on my team offered a better way. Using the same metadata she generated specific jobs for each type of data flow. Simple. Fast. No runtime decisions to get in the way. New metadata? Just re-run the generation process. Simple.

I’ve learned my lesson. The tools I use have changed. XML & XSLT for code generation (code is really just text, right?, so XSLT was a good fit) have given way to template code with embedded keywords and Python scripts that know how to apply metadata to those templates.

However you do it, if you’re faced with code that needs to process data at volume & speed, take the time upfront to determine what it is you need and see where you can use code generation as a “force multiplier”—a term I despise because of its use by a CEO I once had the misfortune to work for.

But it applies here.

Data Refinery

In an earlier post I wrote about data lakes vs. data reservoirs. An effort to change the thinking by changing the words.

Continuing  with the theme of applying different metaphors, this post is about the idea of a data refinery. The metaphor is not new. It’s already “out there.”

Earlier this year I put together a data strategy for a company.

They have a fire-hose of data coming at them from multiple sources. They need to deliver to their customers valid and useful information siphoned off from the fire-hose. They also have a requirement, in a more lenient time frame, to deliver enriched datasets to those same customers.

Finally they need to use the enriched data, in aggregate form, to make better decisions about their own business.

Differing needs, differing levels of enrichment. Wash, rinse, repeat.

The focus of this entry is not the siphoning off of interesting data in near real time. That’s its own entry for some other time.

This is about the dataset enrichment.

So you let your fire-hose pour into your data reservoir. Then what?

Traditional approaches run jobs across the dataset to extract information from it and ship said information to your customers. They’re almost ETL jobs, just without the E part. They perform some transformations on the data, pull the enriched set out, and “load” it to the waiting customer.

In addition, analysts run some form of wide-ranging queries (however they might be formulated/encoded) to extract interesting patterns, answers to hypotheses.

My belief is that this is old thinking. The lake (sorry reservoir) is seen as a source and operated on to produce packaged results.

Newer thinking regards the systems around such a reservoir as forming a data refinery, and the processes applied to elements in the reservoir as data refinement. Continuous data refinement.

The enriched data remains in the reservoir and contributes to the next level of enrichment. Extracts are performed whenever a customer requires a dataset. The customer gets whatever the current state of enrichment provides. Analysts work with the enriched datasets or on aggregated sets of the enriched data.

The enrichment process is continuous. A framework needs to be put in place to support this, and logging should be performed to ensure repeatability. The framework needs to be extensible, because the enrichment process itself must be capable of enrichment over time. In another post I’ll lay out what are the elements that belong in such a framework.

The refinery is where the company’s data truly lives. It’s where the information to support the company’s business may be found.

 

Don’t just warehouse your data…

Often the goal of enterprise data initiatives is an Enterprise Data Warehouse. Whether a single all-encompassing data structure or a federation of smaller data marts, the goal remains the same. And a laudable goal, to be sure. “A single version of the Truth.” “A basis for analytics and reporting.” “Self-service Business Intelligence.”

There is little wrong with these goals…apart from their scope. The end result isn’t going to be an organization with data at its core, driving every aspect of the business.

Warehousing your data is analogous to maintaining a warehouse of products for sale. Standing orders can be met. Logistics knows where to find the products and deliver them to regular customers.

Orders come in and can either be met with existing inventory or additional product can be manufactured and delivered to customers. Reactive. Some intelligence can be built around it so seasonality can be determined and more product manufactured and warehoused to meet such demands.

Simply warehousing your data puts it in one place where regularly-scheduled reports can find it. Analysts looking into new analyses can look to the warehouse to see what data might be used to address their needs. Even some predictive analysis can be done to determine new market trends and, thus, open new markets, or new products for existing markets.

None of the above is bad and will keep business intelligence consultants and data architects employed doing useful work. And such solutions, well-implemented, don’t harm businesses.

Having data at the core of your business, flowing through each process, informing each decision point is a much larger goal. A goal that makes for a business that reacts quickly, can anticipate trends, can deliver a more personal, and effective, pre-sales and support experience to its customers.

This means knowing the data as it arrives into the organization, at the time and point of arrival, and being able to make decisions, change approaches, based on what’s determined right then.

This means using all available data, that generated by business processes and that available “out there.”

That means being able to blend data from all the systems in use within an business: packaged, home-grown, cloud-based, and using such a blend for decisions at the time the decisions need to be made.

These decisions should be made in “business time.” So some decisions are near-instantaneous (which ad to show to this customer right now–and not the same one we’ve been showing for six months because a link got clicked at some point in the past.) Others can wait minutes or more. Re-route a delivery because of bad weather at one hub or another. Others, longer. Launch a targeted ad campaign when the first snow falls.

The above implies a continual examination and review of data as it flows through an organization.

There needs to be a great deal of automation in the process because people aren’t so good at millions of simultaneous decisions all day every day.

There needs to be a lot of data. Questions about whether thirty or sixty days of data should be readily available have no place in an organization that is determined to have data at its core. Again, automation. People have trouble conceiving of petabytes of data. And traditional warehouses won’t hold them.

There needs to be plumbing throughout an organization’s systems to bring data where it’s needed by the time it’s needed.

You don’t need a data warehouse. You need a data refinery.