Apache Druid: An Overview

Published On: 4 January 2023

Introduction:-

Apache Druid is a high-performance, real-time analytics database. It is designed for workflows where fast queries and fast ingest matter. Druid excels at instant data visibility, ad-hoc queries, operational analytics, and handling high concurrency.

Apache Druid is purpose-built to deliver high performance at low cost for a set of use cases that are becoming increasingly common, known as operational analytics. Operational analytics, also called “continuous analytics”, involves analysing real-time data to make real-time decisions. It is “analytics on the fly”: business signals are processed in real time and then fed to decision makers, who make appropriate adjustments to optimise the business activity.

When your use case has the characteristics that make it optimal for Druid, Druid’s data engine can return extremely fast results, enabling the fast decision-making that operational analytics requires.

Druid vs Relational Databases:-

Druid is not a relational database, but some concepts are transferable. Rather than tables, we have datasources. As with relational databases, these are logical groupings of data that are represented as columns. Unlike relational databases, there is no concept of joins. As such, we need to ensure that whichever columns we want to filter or group by are included in each datasource, as in the sketch below.
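For example, a minimal (and entirely hypothetical) dataSchema for an orders datasource declares every column we later want to filter or group by as a dimension up front; all names here are illustrative:

    {
      "dataSchema": {
        "dataSource": "orders",
        "timestampSpec": { "column": "order_time", "format": "iso" },
        "dimensionsSpec": {
          "dimensions": ["country", "store_id", "product_category", "customer_tier"]
        },
        "metricsSpec": [
          { "type": "count", "name": "order_count" },
          { "type": "doubleSum", "name": "revenue", "fieldName": "order_amount" }
        ],
        "granularitySpec": { "segmentGranularity": "day", "queryGranularity": "hour" }
      }
    }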
Apache Druid was designed to let users keep having a conversation with their data, just as they are used to having with an RDBMS backing a data-driven application, while also enabling massive scalability beyond the limits of a traditional RDBMS.
Druid scales both horizontally and vertically: as your concurrency demands and the size of your data grow, you can add more servers to the cluster or increase the size of the existing servers.

Druid Services:-


Druid has several types of services:

  • Coordinator The Druid Coordinator process is primarily responsible for segment management and distribution. More specifically, it communicates with Historical processes to load or drop segments based on configuration. The Coordinator is responsible for loading new segments, dropping outdated segments, ensuring that segments are “replicated” (that is, loaded on multiple different Historical nodes) the configured number of times, and moving (“balancing”) segments between Historical nodes to keep them evenly loaded.
  • Overlord  The Overlord process is responsible for accepting tasks, coordinating task distribution, creating locks around tasks, and returning statuses to callers. The Overlord can be configured to run in one of two modes, local or remote (local being the default). In local mode, the Overlord is also responsible for creating Peons to execute tasks, and all MiddleManager and Peon configurations must be provided to it. Local mode is typically used for simple workflows. In remote mode, the Overlord and MiddleManagers run as separate processes, and you can run each on a different server. This mode is recommended if you intend to use the indexing service as the single endpoint for all Druid indexing.
  • Broker The Broker is the process that routes queries when you run a distributed cluster. It understands the metadata published to ZooKeeper about which segments exist on which processes and routes queries so that they hit the right processes. It also merges the result sets from all of the individual processes. On start-up, Historical processes announce themselves and the segments they are serving in ZooKeeper.
  • Router The Apache Druid Router process can be used to route queries to different Broker processes. By default, the Router selects Brokers based on how rules are set up. For example, if one month of recent data is loaded into a “hot” cluster, queries that fall within the recent month can be routed to a dedicated set of Brokers, while queries outside this range are routed to another set. This setup provides query isolation, so that queries for more important data are not impacted by queries for less important data.
  • Historical  Each Historical process maintains a constant connection to ZooKeeper and watches a configurable set of ZooKeeper paths for new segment information. Historical processes do not communicate directly with each other or with the Coordinator, but instead rely on ZooKeeper for coordination.
  • MiddleManager    The MiddleManager process is a worker process that executes submitted tasks. MiddleManagers forward tasks to Peons that run in separate JVMs. The reason for separate JVMs per task is resource and log isolation. Each Peon is capable of running only one task at a time; a MiddleManager, however, may have multiple Peons. A few representative calls against these services’ HTTP APIs are sketched below.
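To make the division of labour concrete, here is a minimal sketch that talks to three of these services over HTTP. It assumes a local cluster on the default ports (Coordinator on 8081, Overlord on 8090, Broker on 8082); the hostnames, ports, and the toy query are illustrative assumptions, not something from our project:

    import json
    import urllib.request

    def get_json(url):
        # GET a URL and parse the JSON response body.
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)

    # Coordinator (segment management): how much of the published data
    # the Historicals have finished loading, per datasource.
    print(get_json("http://localhost:8081/druid/coordinator/v1/loadstatus"))

    # Overlord (task management): ingestion tasks currently running.
    tasks = get_json("http://localhost:8090/druid/indexer/v1/runningTasks")
    print([task["id"] for task in tasks])

    # Broker (queries): submit a Druid SQL query and print the result rows.
    request = urllib.request.Request(
        "http://localhost:8082/druid/v2/sql",
        data=json.dumps({"query": "SELECT 1 + 1 AS two"}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        print(json.load(resp))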

Querying:-

Druid supports two query languages: Druid SQL and native queries. Under the hood, Druid SQL queries are converted into native queries. Native queries are submitted as JSON to a REST endpoint and are the primary mechanism we use.
In our project, we used Metabase to generate complex queries for particular datasources.

Some examples of those queries are:-

This is a basic SQL query (the wikipedia datasource and its columns come from the Druid tutorials and stand in for our project’s datasources):
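    SELECT channel, COUNT(*) AS edits
    FROM wikipedia
    WHERE __time >= TIMESTAMP '2016-06-27 00:00:00'
      AND __time <  TIMESTAMP '2016-06-28 00:00:00'
    GROUP BY channel
    ORDER BY edits DESC
    LIMIT 5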

This is the JSON version of the above query. A GROUP BY with ORDER BY … LIMIT on a single dimension is planned as a native topN query, roughly:
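    {
      "queryType": "topN",
      "dataSource": "wikipedia",
      "intervals": ["2016-06-27/2016-06-28"],
      "granularity": "all",
      "dimension": "channel",
      "metric": "edits",
      "threshold": 5,
      "aggregations": [
        { "type": "count", "name": "edits" }
      ]
    }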

Limitations:-

So far, we have faced two limitations:-

  • No windowed functionality, such as a rolling average. You will have to implement it yourself in your API layer (see the sketch after this list).
  • Not possible to join data. But if you really have this use case, you’re probably doing something wrong.
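For the first limitation, a minimal sketch of the client-side workaround might look like the following; the query, datasource, Broker address, and window size are all illustrative assumptions:

    import json
    import urllib.request
    from collections import deque

    # Illustrative Druid SQL: hourly edit counts from the tutorial datasource.
    QUERY = """
    SELECT TIME_FLOOR(__time, 'PT1H') AS hour, COUNT(*) AS edits
    FROM wikipedia
    GROUP BY 1
    ORDER BY 1
    """

    request = urllib.request.Request(
        "http://localhost:8082/druid/v2/sql",  # assumed Broker address
        data=json.dumps({"query": QUERY}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        rows = json.load(resp)

    # The rolling three-hour average lives in our own API code, since Druid
    # gives us only the per-hour aggregates.
    window = deque(maxlen=3)
    for row in rows:
        window.append(row["edits"])
        row["rolling_avg_edits"] = sum(window) / len(window)

    print(rows[:3])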

When should one use Druid?

Druid is best used when you have a large amount of incoming data that doesn’t require updates. It’s also helpful when your data has a time component and you need low-latency queries.

For these reasons, Druid is commonly used to power graphical user interfaces (GUIs) for analytics applications. It also fits neatly into the back end of highly concurrent APIs that need fast aggregations.

When not to use Druid?

Because of its focus on real-time analytics of incoming data, Druid is less useful when you’re looking to process purely historical data, though it can be, and is, used for specific types of historical data. Druid excels at handling streaming inserts, but not streaming updates.

So what database should you use in place of Druid for these sorts of applications? Well, that’s a question and answer for another Monday.

Where to Learn From:-

The official Apache Druid documentation (https://druid.apache.org/) is a good place to start.


That’s all for this blog