Elasticsearch¶
General Elasticsearch Cluster Information¶
To avoid the excessive, useless network traffic generated when the cluster reallocates shards across cluster nodes after an Elasticsearch instance restarts, NetEye employs systemd post-start and pre-stop scripts that automatically enable and disable shard allocation on the current node whenever the Elasticsearch service is started or stopped by systemctl.
Note
Starting a stopped Elasticsearch instance enables shard allocation globally for the entire cluster. Therefore, if more than one Elasticsearch instance is down, shards will be reallocated in order to prevent data loss.
Therefore, best practice is to:
- Never keep an Elasticsearch instance stopped on purpose. Stop it only for maintenance reasons (e.g. for restarting the server) and start it up again as soon as possible.
- Restart or stop/start one Elasticsearch node at a time. If something bad happens and multiple Elasticsearch nodes go down, then start them all up again together.
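To inspect the shard-allocation setting currently in effect (for instance, during maintenance), you can query the cluster settings API and look for cluster.routing.allocation.enable in the output. This is a read-only sketch that uses the es_curl.sh helper script described later in this section:

/usr/share/neteye/elasticsearch/scripts/es_curl.sh -XGET 'https://elasticsearch.neteyelocal:9200/_cluster/settings?pretty'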
Elastic-only Nodes¶
From NetEye 4.9 it is possible to install Elastic-only nodes in order to improve Elasticsearch performance by adding more resources and processing abilities to the cluster.
For more information on Single Purpose nodes, please check out Cluster Architecture.
To create an Elastic-only node, add an entry of type ElasticOnlyNodes to the file /etc/neteye-cluster, as in the following example. The syntax is the same as for a standard Node.
{ "ElasticOnlyNodes": [
{
"addr" : "192.168.1.3",
"hostname" : "my-neteye-03",
"hostname_ext" : "my-neteye-03.example.com"
}
]
}
Voting-only Nodes¶
From NetEye 4.16 it is possible to install Voting-only nodes in order to add a node with a single purpose: to provide quorum. If the NetEye Elastic Stack module is installed, this node also provides voting-only functionality to the Elasticsearch cluster.
This functionality is achieved by configuring the node as a voting-only master-eligible node, i.e. by specifying the variable ES_NODE_ROLES="master, voting_only" in the sysconfig file /neteye/local/elasticsearch/conf/sysconfig/elasticsearch-voting-only.
A Voting-only node is defined in /etc/neteye-cluster as in the following example:
{ "VotingOnlyNode": {
"addr" : "192.168.1.3",
"hostname" : "my-neteye-03",
"hostname_ext" : "my-neteye-03.example.com",
"id" : 3
}
}
Please note that VotingOnlyNode is a JSON object and not an array, because a NetEye cluster can have only a single Voting-only node.
Design and Configuration¶
With NetEye 4 we recommend that you use at least 3 nodes to form an Elasticsearch cluster. If you nevertheless decide to set up a 2-node cluster, we recommend consulting a Würth Phoenix NetEye Solution Architect, who can fully explain the risks in your specific environment and help you develop strategies to mitigate them.
The Elasticsearch coordination subsystem is in charge of choosing which nodes can form a quorum (note that all NetEye cluster nodes are master-eligible by default). If Log Manager is installed, the neteye install script will properly set seed_hosts and initial_master_nodes according to Elasticsearch’s recommendations, and no manual intervention is required.
neteye install will set two options to configure cluster discovery:
discovery.seed_hosts: ["host1", "host2", "host3"]
cluster.initial_master_nodes: ["node1"]
Please note that the value for initial_master_nodes will be set only on the first installed node of the cluster (it is optional on the other nodes and, if set, it must be the same for all nodes in the cluster). The seed_hosts option will be set on all cluster nodes, including Elastic-only nodes, and will have the same value on all nodes.
Elasticsearch reverse proxy¶
Starting with NetEye 4.13, NGINX has been added to NetEye. NGINX acts as a reverse proxy: it exposes a single endpoint and acts as a load balancer, distributing incoming requests across all nodes and, in this case, across all Elasticsearch instances. This solution improves the overall performance and reliability of the cluster.
The Elasticsearch endpoint is reachable at the URI https://elasticsearch.neteyelocal:9200/. Please note that this is the same port used before, so no additional change is required; the old certificates used for Elasticsearch are still valid with the new configuration.
All connected Elastic Stack services, like Kibana, Logstash and Filebeat, have been updated to reflect this improvement and to take advantage of the new load-balancing feature.
Elasticsearch Configuration¶
Elasticsearch settings need to be added to its configuration files at run-time.
Starting from the NetEye 4.16 release, the configuration files for Elasticsearch are no longer modified by neteye install; run-time configuration is instead supported via the environment.
The default values for NetEye are stored in the /neteye/local/elasticsearch/conf/sysconfig/elasticsearch file. They can be overridden by creating the /neteye/local/elasticsearch/conf/sysconfig/elasticsearch-user-customization file and specifying the new values there.
After restarting Elasticsearch, the new settings are loaded at run-time, overriding the default ones.
Elasticsearch temporary directory¶
NetEye uses the /neteye/local/elasticsearch/data/tmp directory as the temporary storage for Elasticsearch. It is essential to ensure that this directory resides on a filesystem that does not have the noexec mount option enabled. If the default location cannot meet this requirement, the directory can be changed by setting the ES_TMPDIR environment variable in the user customization file.
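For example, a user customization file moving the temporary directory to /data/elasticsearch/tmp (an illustrative path, assuming it resides on a filesystem mounted without noexec) would look like this:

# /neteye/local/elasticsearch/conf/sysconfig/elasticsearch-user-customization
# Illustrative override: place the Elasticsearch temporary directory on a
# filesystem that is not mounted with the noexec option.
ES_TMPDIR=/data/elasticsearch/tmp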
Elasticsearch Backup and Restore¶
Elasticsearch provides snapshot functionality, which is great for backups because snapshots can be restored relatively quickly.
The main features of Elasticsearch snapshots are:
- They are incremental
- They can store either individual indices or an entire cluster
- They can be stored in a remote repository such as a shared file system
The destination for snapshots must be a shared file system mounted on each Elasticsearch node.
For further details see the Official Elasticsearch snapshot documentation.
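As an illustration, a shared-filesystem snapshot repository could be registered through the Elasticsearch snapshot API, here invoked via the es_curl.sh helper script described later in this section. The repository name my_backup and the mount point /mnt/es-snapshots are hypothetical, and the path must also be listed in the path.repo setting of every node:

/usr/share/neteye/elasticsearch/scripts/es_curl.sh -XPUT -H 'content-type: application/json' https://elasticsearch.neteyelocal:9200/_snapshot/my_backup -d '{
  "type": "fs",
  "settings": {
    "location": "/mnt/es-snapshots"
  }
}'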
Plugins¶
Plugins extend the core functionality of Elasticsearch. They range from adding custom mapping types, custom analyzers, native scripts, custom discovery and more.
Plugins can come from different sources: the official ones created or at least maintained by Elastic, community-sourced plugins from other users, and plugins that you provide.
Core plugins are part of the Elasticsearch project and are delivered at the same time as Elasticsearch. Their version number always matches the version number of Elasticsearch itself.
Community contributed plugins are external to the Elasticsearch project. They are provided by individual developers or private companies and have their own licenses as well as their own versioning system.
Plugins contain JAR files, but may also contain scripts and config files, and must be installed on every node in the cluster.
Run the following command to source sysconfig variables:
. /usr/share/neteye/elasticsearch/scripts/es_autosetup_functions.sh; source_elasticsearch_sysconfig
Now, an actual plugin command can be run:
ES_PATH_CONF=${ES_PATH_CONF} /usr/share/elasticsearch/bin/elasticsearch-plugin install [plugin_name]
This command will install the version of the plugin that matches your Elasticsearch version. Upon every neteye update / neteye upgrade the plugins will be updated to the latest available version.
A plugin can also be installed from a custom location, either from your local file system or from an HTTP URL, by specifying its path or URL. Please consult the official installation guide for more details on the various plugin installation methods.
After installation, each node must be restarted before the plugin becomes visible.
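For instance, installing the official ICU analysis plugin (used here purely as an illustrative example of a core plugin) and then restarting the node would look like this:

# Source the sysconfig variables, then install the plugin and restart the node
. /usr/share/neteye/elasticsearch/scripts/es_autosetup_functions.sh; source_elasticsearch_sysconfig
ES_PATH_CONF=${ES_PATH_CONF} /usr/share/elasticsearch/bin/elasticsearch-plugin install analysis-icu
systemctl restart elasticsearch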
Some of the official plugins are always provided with the service, and can be enabled per deployment.
Note
When running neteye update / neteye upgrade for deployments with community contributed plugins installed, the latter must be manually removed from all nodes before running the procedure, and re-installed after the procedure has successfully completed. This prevents neteye update / neteye upgrade from failing due to not being able to automatically re-install a plugin from a custom source.
Check out the official Elasticsearch guide to find more information on plugin management options.
Elasticsearch security helper tool¶
The secure communication provided by the X-Pack Security requires additional parameters such as authentication certificates to interact with the Elastic Stack APIs. We have developed a few helper tools, based on curl, to simplify your interaction with the APIs.
The Elasticsearch helper script lets you omit all the authentication parameters for the admin user, which would otherwise be required.
Location: /usr/share/neteye/elasticsearch/scripts/es_curl.sh
The NetEye helper script can be used instead if you only need read permission for the fields @timestamp and host on the Logstash index entries. This script is used by NetEye for self-monitoring activities.
Location: /usr/share/neteye/elasticsearch/scripts/es_neteye_curl.sh
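For example, the cluster health API can be queried without specifying any authentication parameters (a minimal read-only sketch against the standard Elasticsearch health endpoint):

/usr/share/neteye/elasticsearch/scripts/es_curl.sh -XGET 'https://elasticsearch.neteyelocal:9200/_cluster/health?pretty'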
Elasticsearch System indices on Single Nodes¶
Elasticsearch System indices are a type of indices meant to be used only internally by the components of the Elastic Stack. They may be associated with some System template which specifies their settings.
Some System templates may require System indices to have at least one replica, which in simple terms means that the index must be replicated on more than one node (note that this may also apply to non-System templates and indices). While having replicas is straightforward on a NetEye Cluster, on Single Node installations this is not possible, which causes these indices to go into a yellow state due to the so-called unassigned shards warning.
Since on Single Node installations this warning cannot be solved by assigning shards to other nodes, the solution to the problem is to tell Elasticsearch that non-replicated indices should be allowed.
NetEye applies this solution by default during the neteye install on the templates and indices which have this problem. However, some templates and indices are created only when some features of the Elastic Stack are triggered by the user and must be manually fixed by the Elastic Stack administrator.
To facilitate the administrator's job, though, NetEye provides a set of scripts which help fix the problem by setting the option index.auto_expand_replicas to 0-1 on the templates, allowing the indices to have zero replicas on Single Node installations.
Depending on whether the problematic index is managed by Fleet or not, you may use the appropriate script.
Index Templates Managed by Fleet¶
To fix index templates managed by Fleet (for example those that are automatically created when installing a new Elastic integration) you can use the following script that will resolve this issue on all the Fleet-managed index templates and already created associated indexes.
neteye# python3 /usr/share/neteye/elasticsearch/scripts/configurator/fix_fleet_integrations_autoexpand_replicas.py
Index Templates Not Managed by Fleet¶
Suppose that, after the creation and execution of a Security rule, you identified that the Elasticsearch template that causes the index to have the number of replicas set to one (or more) is named .items-default. You can then call the script as follows:
neteye# bash /usr/share/neteye/elasticsearch/scripts/elasticsearch_set_autoexpand_replicas_to_index_templates_and_indexes.sh ".items-default"
Note
The script works only with the composable index templates introduced in Elasticsearch 7.8 and does not support the Legacy index templates.
Moreover, the script supports the update of multiple Index Templates at once. To perform such operation simply pass the multiple Index Templates names as arguments, like this:
neteye# bash /usr/share/neteye/elasticsearch/scripts/elasticsearch_set_autoexpand_replicas_to_index_templates_and_indexes.sh <index_template_name_1> <index_template_name_2> <index_template_name_3>
Elasticsearch Performance tuning¶
This guide summarises the configuration optimisations that allow you to boost Elastic Stack performance on NetEye 4. Applying these suggestions proves very useful, especially on larger Elastic deployments.
Elasticsearch JVM¶
In Elasticsearch, the default options for the JVM are specified in the /neteye/local/elasticsearch/conf/jvm.options file. Please note that this file must not be modified, since it will be overwritten at each update.
If you would like to specify or override some options, a new .options file should be created in the /neteye/local/elasticsearch/conf/jvm.options.d/ folder, containing the desired options, one per line. Please note that the JVM processes the options files in lexicographic order.
For example, we can set the encoding used by Java for reading and saving files to UTF-8 by creating a /neteye/local/elasticsearch/conf/jvm.options.d/01_custom_jvm.options file with the following content:
-Dfile.encoding=UTF-8
For more information about the available JVM options and their syntax, please refer to the official documentation.
Elasticsearch Database¶
Swapping
Swapping is very bad for performance and node stability and should be avoided at all costs: it can cause garbage collections to last for minutes instead of milliseconds, cause nodes to respond slowly, or even disconnect them from the cluster. In a resilient distributed system, it proves more effective to let the operating system kill the node than to allow swapping.
Moreover, Elasticsearch performs poorly when the system is swapping memory to disk. Therefore, it is vitally important to the health of your node that no memory of the JVM is ever swapped out to disk. The following steps allow you to achieve this goal.
Configure swappiness. Ensure that the sysctl value vm.swappiness is set to 1. This reduces the kernel’s tendency to swap and should not lead to swapping under normal circumstances, while still allowing the whole system to swap in emergency conditions. Execute the following commands on each Elastic node and make the change persistent:

sysctl vm.swappiness=1
echo "vm.swappiness=1" > /etc/sysctl.d/zzz-swappiness.conf
sysctl -p
Memory locking. Another best practice on Elastic nodes is to use the mlockall option to lock the process address space into RAM, preventing any Elasticsearch memory from being swapped out. Set the bootstrap.memory_lock setting to true, so that Elasticsearch will lock the process address space into RAM.

Uncomment or add this line in the /neteye/local/elasticsearch/conf/elasticsearch.yml file:

bootstrap.memory_lock: true

Edit the limit of system resources in the Service section by creating the new file /etc/systemd/system/elasticsearch.service.d/neteye-limits.conf with the following content:

[Service]
LimitMEMLOCK=infinity

Reload systemd and restart the service:

systemctl daemon-reload
systemctl restart elasticsearch

After starting Elasticsearch, you can see whether this setting was applied successfully by checking the value of mlockall in the output of this request:

/usr/share/neteye/elasticsearch/scripts/es_curl.sh -XGET 'https://elasticsearch.neteyelocal:9200/_nodes?filter_path=**.mlockall&pretty'
Increase file descriptors
Check whether the number of file descriptors suffices by using the command lsof -p <elastic-pid> | wc -l on each node. By default, the limit on NetEye is 65,535.
To increase the default value, create the file /etc/systemd/system/elasticsearch.service.d/neteye-open-file-limit.conf with content such as:
[Service]
LimitNOFILE=100000
For more information, see the official documentation.
DNS cache settings
By default, Elasticsearch runs with a security manager in place, which implies that the JVM defaults to caching positive hostname resolutions indefinitely and defaults to caching negative hostname resolutions for ten seconds. Elasticsearch overrides this behavior with default values to cache positive lookups for 60 seconds, and to cache negative lookups for 10 seconds.
These values should be suitable for most environments, including environments where DNS resolutions vary with time. If not, you can edit the values es.networkaddress.cache.ttl and es.networkaddress.cache.negative.ttl in the JVM options drop-in folder /neteye/local/elasticsearch/conf/jvm.options.d/.
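For example, a drop-in file shortening both caches (the file name and the TTL values, in seconds, are illustrative) could contain:

# /neteye/local/elasticsearch/conf/jvm.options.d/02_dns_cache.options
# Illustrative TTLs: cache positive lookups for 10 s, negative ones for 5 s.
-Des.networkaddress.cache.ttl=10
-Des.networkaddress.cache.negative.ttl=5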
Prevent Data from growing¶
Data can grow very fast, consuming too much CPU, RAM and disk. If a system is over 50% CPU or disk usage, action should be taken; this is indicated by the Icinga checks, which help you take the right decisions.
Limit data retention
Index Lifecycle Management (ILM) is designed to manage data retention by automating the lifecycle of indices in Elasticsearch. It allows you to define policies that automatically manage indices based on their age and performance needs, including actions like rollover, shrinking, force merging, and deletion. This helps optimize storage costs and enforce data retention policies.
Learn more about configuring a lifecycle policy with an appropriate retention in official documentation.
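As a minimal sketch, an ILM policy that rolls indices over and deletes them 30 days after rollover could be created through the ILM API via the es_curl.sh helper script (the policy name logs-30d-retention and all durations are hypothetical):

/usr/share/neteye/elasticsearch/scripts/es_curl.sh -XPUT -H 'content-type: application/json' https://elasticsearch.neteyelocal:9200/_ilm/policy/logs-30d-retention -d '{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "7d", "max_primary_shard_size": "50gb" }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}'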
Save disk space
You can use a Time series data stream (TSDS) to store metrics data more efficiently. Metrics data stored in a TSDS may use up to 70% less disk space than a regular data stream. The exact impact will vary per data set. Learn more about when to use a TSDS in the official documentation.
Use logsdb index mode. Logsdb index mode significantly reduces storage needs at the cost of slightly more CPU during ingest. After enabling logsdb index mode for your data sources, you may need to adjust the cluster sizing in response to the new CPU and storage needs. logsdb mode is very easy to implement: just add the following parameter to the index template:

{
  "index": {
    "mode": "logsdb"
  }
}
To learn more about how logsdb index mode optimizes CPU and storage usage, check the blog on Elasticsearch’s newly specialized logsdb index mode.
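For instance, a composable index template enabling logsdb mode for a data stream could look like the following sketch, again via the es_curl.sh helper (the template name my-logs-template and the index pattern are hypothetical):

/usr/share/neteye/elasticsearch/scripts/es_curl.sh -XPUT -H 'content-type: application/json' https://elasticsearch.neteyelocal:9200/_index_template/my-logs-template -d '{
  "index_patterns": ["my-logs-*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index": { "mode": "logsdb" }
    }
  }
}'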
CPU and Data Ingestion¶
It might happen that events are stored late or not stored at all.
Important
If data ingestion via Elastic Agents is performed by means of a network listener and you are facing performance problems, use TCP in order to be able to see the existing performance problems: UDP throws away packets by design (in this case), so losses would otherwise go unnoticed.
To provide better latency or better throughput, you should set the predefined values in the output definition of the agent policy. If setting the predefined values as shown above is not enough, try more advanced settings.
In order to receive data fast enough, change the advanced YAML configuration in a new output definition to define the batch size, more workers, etc. Besides the settings for ssl.certificate_authorities, apply the following settings on big machines:

bulk_max_size: 1600
worker: 8
queue.mem.events: 100000
queue.mem.flush.min_events: 5000
queue.mem.flush.timeout: 10
compression_level: 1
idle_connection_timeout: 15
Make sure you find the best values matching your particular needs.
Note
The “Performance tuning” setting just above the advanced YAML configuration within the output definition will be set to custom if you add such values to the advanced YAML configuration.
Memory / RAM settings¶
Keep Satellite accessible
In some environments, Elastic Agent integrations can unexpectedly consume excessive memory for different reasons. When this happens, the OoM (Out of Memory) killer may be invoked, terminating the Elastic Agent service and usually disrupting data ingestion.
To avoid the Agent being terminated when exceeding available memory, follow the tips described on our blog.
Choose your license
If you’re using Elastic under the Enterprise license, licensing is based on GB of RAM rather than on the number of nodes. Hence, you should consider the type of license you’re using and, in the case of an Enterprise license, make sure there is enough memory to be distributed properly between the nodes, depending on your infrastructure.
Elasticsearch Index Recovery Settings¶
During the process of index recovery on a NetEye cluster, the default settings of Elasticsearch may not be optimal.
The default limit is set to the following values:
{
  "max_concurrent_operations": "1",
  "max_bytes_per_sec": "40mb"
}
which means that the reallocation of indices may appear to be slow when a node leaves the Elasticsearch cluster. The default settings may under-utilize the internal cluster network bandwidth, so it is recommended to update them dynamically using the cluster update settings API call:
/usr/share/neteye/elasticsearch/scripts/es_curl.sh -XPUT -H 'content-type: application/json' https://elasticsearch.neteyelocal:9200/_cluster/settings -d '{
"persistent" : {
"indices.recovery.max_bytes_per_sec" : "100mb",
"indices.recovery.max_concurrent_operations": 2
}
}'
This dynamic setting will apply the same limit on every node in the Cluster.
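To verify the values currently in effect, you can query the cluster settings API (a read-only check):

/usr/share/neteye/elasticsearch/scripts/es_curl.sh -XGET 'https://elasticsearch.neteyelocal:9200/_cluster/settings?include_defaults=true&filter_path=*.indices.recovery.*&pretty'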
The NetEye Elastic Stack module requires a 10 Gb/s private connection for the cluster, with all the nodes having the same capabilities.
When updating the settings, the max_bytes_per_sec value should be set to at most 50% of the private network bandwidth if the Operative nodes are also Elastic Data nodes. In case the Operative nodes are not Data nodes (i.e. all the data is stored on Elastic Data-only nodes), the value can go up to 95% of the bandwidth. The max_concurrent_operations value is recommended to be set to 2, and can be increased (e.g. to 4) for larger clusters.
You can find more details in the official documentation.