A Closer Look at How LOC’s Data Processes are Invoked — the “Data Source Connector Pattern”

Leveraging Kubernetes and Message Queue to Build a Universal Broker for Containerized Tasks

Alan Wang
FST Network

--


Author’s note: the technical details described in this article are based on an older version of FST Network’s LOC. Since LOC is still actively evolving to meet our customers’ needs and to stay competitive, please refer to our official blog and documentation for the latest updates.

In one of our previous posts, we discussed how FST Network’s data governance platform, LOC — Logic Operating Centre — automatically creates pods or containers with users’ code. Here’s a quick recap:

  • When a user deploys a data process, a custom resource definition (CRD) and its controller code are generated so that Kubernetes can start up a new pod or container as the data process.
  • All data processes are based on the same base image, which contains a Deno JavaScript/TypeScript runtime. The actual user code is first uploaded to cloud storage, then “injected” into the Deno runtime once the data process is ready. This is faster than building new pod images from scratch.
  • The data process is now ready in the Kubernetes environment and can be invoked via a “data source”.

We’ve also introduced what a data process is:

  • A data process consists of several special functions, including at least one generic logic and exactly one aggregator logic.
  • During execution, the data process can access a series of utility functions called “agents” to interact with internal or external resources, such as session storage, the event store, HTTP servers and databases. A “Result” agent sends back anything the user wants to receive from the data process after the execution is completed.
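To make the idea concrete, here is a minimal sketch of a generic logic and an aggregator logic sharing a session store through a context object. All interface and function names below are illustrative assumptions, not LOC’s actual API:

```typescript
// Hypothetical sketch of a data process: one generic logic plus one
// aggregator logic, communicating through shared session storage.
// These interface names are assumptions for illustration only.

interface SessionAgent {
  get(key: string): unknown;
  set(key: string, value: unknown): void;
}

interface ResultAgent {
  finalize(value: unknown): void;
}

interface DataContext {
  payload: unknown;
  agents: { session: SessionAgent; result: ResultAgent };
}

// A generic logic: read the payload and stash a value in the session.
async function genericLogic(ctx: DataContext): Promise<void> {
  const input = ctx.payload as { name: string };
  ctx.agents.session.set("greeting", `Hello, ${input.name}!`);
}

// The aggregator logic: read the session and emit the final result
// through the "Result" agent.
async function aggregatorLogic(ctx: DataContext): Promise<void> {
  ctx.agents.result.finalize({
    message: ctx.agents.session.get("greeting"),
  });
}
```

In a real data process, the platform would construct the context object and run these functions for you; the sketch only shows the shape of the hand-off between logic functions.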

Deployment is Not the End

However, we have not yet discussed how exactly a data process is triggered in LOC’s Kubernetes cluster, nor how LOC handles users’ requests to invoke one or more data processes. You may end up with dozens — even hundreds — of data processes running at the same time, and you certainly want to be sure you can call the right ones when you need them. Does it take extra effort to link the data sources to the data processes?

The answer is, fortunately, no.

The execution aspect is just as crucial as the auto-deployment and Kubernetes pod operations for LOC to deliver its service to our customers. It’s the combustion chamber and spark plugs of the LOC engine. If you are familiar with Kubernetes, it plays a role somewhat like the control plane or master node for all the other pods.

So, today we’ll take a closer look at this topic and give you a clearer idea of what happens when you invoke a data process.

Data Source, Execution and Pulsar MQ

Kubernetes’ etcd database also serves as local storage for data processes. Unlike session storage, local data can persist across multiple executions. The event store service is connected to an Elasticsearch database for storing and processing events.

Note: data sources are going to be officially renamed as triggers in the newer version of LOC. But we’ll keep the old terminology in this article for the time being.

There are three types of data sources in LOC: HTTP API routes, message queues, and the scheduler.

  • API routes are custom paths set in LOC’s API router, which can be invoked by HTTP clients with GET or POST requests.
  • Message queues are, literally, queues for messages — LOC can receive messages from MQ tools like Apache Kafka.
  • The scheduler is a built-in LOC service that invokes data processes automatically at certain times, dates or intervals.
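For the first kind, invoking a data process is just an ordinary HTTP call. The sketch below builds such a request; the base URL, route path and payload shape are made up for illustration, since the real paths are whatever you configure in LOC’s API router:

```typescript
// Hypothetical example of triggering a data process through an API route
// data source. The URL and payload below are assumptions, not real LOC paths.

function buildTriggerRequest(
  baseUrl: string,
  route: string,
  payload: unknown,
): Request {
  return new Request(`${baseUrl}${route}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });
}

// An HTTP client would then send it, e.g.:
// const res = await fetch(
//   buildTriggerRequest("https://loc.example.com", "/api/greet", { name: "Ada" }),
// );
```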

A message queue is a software engineering component used for communication between processes or between threads within the same process. Message queues provide an asynchronous communication protocol in which the sender and receiver of messages don’t need to interact at the same time — messages are held in queue until the recipient retrieves them.

Message queues are used within operating systems or applications as a way for programs to communicate with one another. They may also be used to pass messages between computer systems.

— Techopedia, Message Queue
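The decoupling the quote describes can be illustrated with a toy in-memory queue: the sender enqueues and returns immediately, and the receiver picks messages up whenever it’s ready. Real brokers like Pulsar or Kafka add persistence, topics and delivery guarantees on top of this basic idea:

```typescript
// A minimal in-memory message queue, purely to illustrate asynchronous,
// decoupled communication. Not how Pulsar is implemented.

class MessageQueue<T> {
  private messages: T[] = [];
  private waiting: ((msg: T) => void)[] = [];

  // The sender enqueues and returns at once; no receiver needs to be ready.
  send(msg: T): void {
    const receiver = this.waiting.shift();
    if (receiver) receiver(msg);
    else this.messages.push(msg);
  }

  // The receiver takes the next message, waiting if the queue is empty.
  receive(): Promise<T> {
    const msg = this.messages.shift();
    if (msg !== undefined) return Promise.resolve(msg);
    return new Promise((resolve) => this.waiting.push(resolve));
  }
}
```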


When one of the data sources triggers an execution — which contains one or more “tasks” (data processes) — it is handled by a LOC service called Execution. The Execution service creates a message (which may carry one or more tasks, an additional payload, etc.) and sends it to Pulsar, an open source message queue tool.

From there, the messages wait their turn until LOC has the resources to read them and execute the data processes.
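As a rough mental model, one execution fans out into one task message per target data process. The field names below are assumptions for illustration, not LOC’s actual wire format:

```typescript
// A sketch of what an execution message might carry. The shape is a
// hypothetical illustration, not LOC's real message schema.

interface TaskMessage {
  executionId: string;
  taskId: string;
  dataProcessId: string; // which data process should pick this up
  payload: unknown;      // data from the triggering data source
}

// The Execution service would build one message per task and publish each
// one to the topic its target data process subscribes to.
function buildTaskMessages(
  executionId: string,
  tasks: { dataProcessId: string; payload: unknown }[],
): TaskMessage[] {
  return tasks.map((task, index) => ({
    executionId,
    taskId: `${executionId}-${index}`,
    dataProcessId: task.dataProcessId,
    payload: task.payload,
  }));
}
```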

OpenFaaS, an open source FaaS framework, has a similar approach called the Event-Connector Pattern, although you’ll have to set up the OpenFaaS gateway to expose functions as accessible services.

In LOC, neither users nor developers need to do anything extra, and you’ll see why in a minute. But we can similarly call this arrangement of data sources, the Execution service, Pulsar MQ and data processes the “Data Source Connector Pattern”. After all, it connects data sources to data processes in a unified way.

After execution, the result is written into Pulsar and then sent back to the user, unless the process is invoked asynchronously (in which case the result has to be retrieved later).

Task Consumer, Executor and Finalizer

The data process is now a custom pod/container running in Kubernetes. It executes itself with an MQ message as the trigger, and returns a message in turn.

So how does Pulsar trigger the right data processes? Well, it doesn’t.

Instead, data processes listen to the Pulsar message queue and execute themselves when a related task message appears. (More precisely, each data process subscribes to a topic meant for it, and is notified when a message with that topic arrives.) This allows data sources to send messages targeting multiple data processes without knowing exactly where they are, which lowers the coupling between Pulsar and the custom pods.
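The topic-based dispatch can be sketched with a toy publish/subscribe broker: the publisher never knows which (or how many) data processes will react, and each process simply subscribes to the topic meant for it. Pulsar provides this mechanism for real; the code below is only a conceptual sketch:

```typescript
// A toy pub/sub broker illustrating topic-based dispatch. The names are
// illustrative; Pulsar handles subscriptions and delivery in practice.

type Handler = (message: unknown) => void;

class TopicBroker {
  private subscribers = new Map<string, Handler[]>();

  // A data process registers interest in the topic meant for it.
  subscribe(topic: string, handler: Handler): void {
    const list = this.subscribers.get(topic) ?? [];
    list.push(handler);
    this.subscribers.set(topic, list);
  }

  // Deliver the message to every subscriber of the topic; the publisher
  // needs no knowledge of where the subscribers run.
  publish(topic: string, message: unknown): void {
    for (const handler of this.subscribers.get(topic) ?? []) {
      handler(message);
    }
  }
}
```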

To understand how a data process can do this, we’ll have to go even deeper into the technical details. From the execution aspect, a data process consists of three parts, which work somewhat like a relay race:

  • Task Consumer: listens to the Pulsar message queue and receives “Data Process Task” messages.
  • Task Executor: takes the “Data Process Task” message from the Consumer, executes all the logic code one by one, then prepares the Finalizer.
  • Task Finalizer: outputs the result as a “Data Process Task Result” message to Pulsar (as a notification that the execution is completed).
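The relay between the three stages can be sketched as below. All names and shapes are illustrative assumptions; the real worker consumes from and publishes to Pulsar rather than plain arrays:

```typescript
// A sketch of the three-stage "relay race" inside an Execution Worker.
// Arrays stand in for Pulsar topics; the types are made up for illustration.

type Logic = (ctx: { session: Map<string, unknown> }) => void;

interface Task { taskId: string; logics: Logic[] }
interface TaskResult { taskId: string; session: Map<string, unknown> }

// Task Consumer: pick the next task message off the queue.
function consume(queue: Task[]): Task | undefined {
  return queue.shift();
}

// Task Executor: run every logic function one by one against a shared context.
function execute(task: Task): TaskResult {
  const ctx = { session: new Map<string, unknown>() };
  for (const logic of task.logics) logic(ctx);
  return { taskId: task.taskId, session: ctx.session };
}

// Task Finalizer: publish the result message back (here: to an out array).
function finalize(result: TaskResult, out: TaskResult[]): void {
  out.push(result);
}
```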

Together, the three parts are also called an Execution Worker.

The user’s actual result from the data process, if there is any, is written into a Redis database in Kubernetes (not shown in the illustration above for simplicity). LOC’s API router and MQ service in turn listen to Pulsar, then retrieve the result from Redis for the user.
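The hand-off through the result store can be sketched like this: the worker persists the result under its task id, then the router, once Pulsar signals completion, reads it back by the same id. The class and method names are assumptions for illustration; LOC uses Redis for the actual store:

```typescript
// A sketch of the result hand-off between worker and API router, with a
// Map standing in for Redis. Names here are hypothetical.

class ResultStore {
  private store = new Map<string, unknown>();

  // Worker side: persist the result for later pickup.
  write(taskId: string, result: unknown): void {
    this.store.set(taskId, result);
  }

  // Router / MQ-service side: once Pulsar signals completion, read it back.
  read(taskId: string): unknown {
    return this.store.get(taskId);
  }
}
```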

The Task Executor runs the user’s logic functions one by one in the Deno runtime (or other runtimes in the future, for example WebAssembly). This is what we’ve already seen in some of our other articles.

These functions are effectively isolated from each other, but share the same session storage, a Data Context object (with the payload, agent interfaces, etc. — the ctx object you’ve seen in our previous article) and some Rust-based agent backends during execution.

Again, this “runtime inside runtime” pattern may seem weird, but it actually avoids the cold-start issue found in other FaaS platforms. You can check out our article about LFaaS and Logic Injection for more details.

Key Takeaways


With LOC’s MQ-based execution design, a user can deploy data processes and then invoke any of them quite easily with API routes, message queues or the LOC scheduler, without any additional setup. This results in a smooth, reliable data pipeline operation on the user’s end.

Pulsar plays the role of broker or connector between data sources and data processes. And once a data process receives the right message through its subscription, it executes the user’s logic code and returns the execution result.

This article was written with help from FST Network’s dev team.

For more info about Logic Operating Centre (LOC) or demonstration request, please contact support@fstk.io.
