Monitoring#

You can debug system issues in many different ways.

Each service maintains its own log file, which helps with finding the source of an error.

Additionally, systemd automatically restarts services that fail completely. The following sections also explain a few basic operating system commands that can be used to check whether the base system is running smoothly.

Log files#

The majority of the log files can be found under the /var/log/ directory. The following table shows the relevant log files and their usage.

Log File	Service	OS	Additional Notes
`/var/log/squirro/SERVICE_NAME/SERVICE_NAME.log`	Squirro services	All	Detailed log file about each service.
`/var/log/squirro/SERVICE_NAME/stdout.log`	Squirro services	All	Messages sent to the standard output stream of the service and not logged in the main service log file.
`/var/log/squirro/SERVICE_NAME/stderr.log`	Squirro services	All	Messages sent to the standard error stream of the service (typically error messages) and not logged in the main service log file. Might contain useful information when a service is unable to boot up.
`/var/log/squirro/SERVICE_NAME/nginx-access.log`	Nginx (Squirro services)	All	Every request to the web services is recorded in this log file in a line-by-line format.
`/var/log/squirro/SERVICE_NAME/nginx-error.log`	Nginx (Squirro services)	All	Records errors on the HTTP level. When a service is stopped, errors may show up here indicating that the service is not reachable.
`/var/log/squirro/update-cluster-node.log` `/var/log/squirro/update-storage-node.log`		All	Update log for Squirro Cluster/Storage node.
`/var/log/messages`		All	General system log. Serious system failures will be recorded here.
`/var/log/elasticsearch/ES_CLUSTER_NAME*.log`	Elasticsearch	All	`ES_CLUSTER_NAME.log`: records cluster information and major failures. `ES_CLUSTER_NAME__index_indexing_slowlog.log`: contains logs about the indexing performed by the system. `ES_CLUSTER_NAME__index_search_slowlog.log`: contains logs about the queries sent to the system.
`/var/log/redis/*.log`	Redis	All
`/var/log/mysqld.log`	MySQL	RHEL
`/var/log/mariadb/mariadb.log`	MariaDB	RHEL
`/var/log/cron`	Cron
`/var/log/secure` `/var/log/audit/audit.log`	System	RHEL	Used to debug connection issues

Additionally, the following directories contain job log files:

Log File	Service	OS	Additional Notes
`/var/lib/squirro/datasource/job_logs/*.log`	datasource (sqdatasourced)	All	Contains rotated log files for the created data sources. Any logs produced during the initial phase of creating the source and loading data into the system (the dataloader logs) are found here. For the subsequent transformation of the data in the pipeline, the ingester logs are relevant.
`/var/log/squirro/machinelearning/job_logs/*.log`	machinelearning (sqmachinelearningd)	All	Contains log files for the machinelearning jobs that run on the server. Each machinelearning job uses its own log file and any output during its execution is logged there (for example, output during the training of a model).

The log level can be changed for each service. Such changes can be made within /etc/squirro/ in the ini file corresponding to each service.

For any of the services, the following can be added to the ini files to adjust the log level:

[logger_root]
level = INFO
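To review which log level is currently configured for each service, the ini files can be scanned for the `[logger_root]` section. The following is a minimal sketch, not an official Squirro tool: the function name `show_log_levels` is hypothetical, and it assumes that the service configuration files are named `*.ini` and sit directly in `/etc/squirro/`.

```shell
# Hypothetical helper: print the [logger_root] level set in each service ini file.
# The first argument is the configuration directory (assumed default: /etc/squirro).
show_log_levels() {
    local conf_dir="${1:-/etc/squirro}"
    local ini
    for ini in "$conf_dir"/*.ini; do
        [ -f "$ini" ] || continue
        # Track the current section and print "level" lines only inside [logger_root].
        awk -v file="$ini" '
            /^\[/ { in_root = ($0 ~ /^\[logger_root\]/) }
            in_root && /^[ \t]*level[ \t]*=/ { print file ": " $0 }
        ' "$ini"
    done
}
```

Calling `show_log_levels` without arguments scans `/etc/squirro`. Services whose ini file has no `[logger_root]` override produce no output.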

RHEL Service monitoring with systemctl#

With Red Hat Enterprise Linux (RHEL), we rely on systemd to control and manage the Squirro services.

To list all services, use the following command:

systemctl list-units --type service --all

If you want to inspect a single service you can use:

systemctl status SERVICE_NAME

This last command also returns some fundamental information about the service (current status, PID, …). If you call it with root permissions, you also receive the last lines of the logs.

Should you wish to restart a particular service, the following command can be run:

systemctl restart SERVICE_NAME

It is important to reiterate that when Squirro services go down, the systemd daemon automatically attempts to restart them. Should a service still be inactive, the server administrator should inspect the logs belonging to that service. These log files consist of:

/var/log/squirro/SERVICE_NAME/SERVICE_NAME.log
/var/log/squirro/SERVICE_NAME/stderr.log
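The inspection of these two files can be wrapped in a small helper. This is only a sketch under stated assumptions: the function name `show_service_logs` and its default of 20 lines are hypothetical choices, while the log paths are the ones listed above.

```shell
# Hypothetical helper: print the last lines of the main and stderr logs of one service.
# Usage: show_service_logs SERVICE_NAME [LINES] [LOG_ROOT]
show_service_logs() {
    local service="$1"
    local lines="${2:-20}"
    local log_root="${3:-/var/log/squirro}"
    local log
    for log in "$log_root/$service/$service.log" "$log_root/$service/stderr.log"; do
        if [ -r "$log" ]; then
            echo "==> $log <=="
            tail -n "$lines" "$log"
        else
            # Keep going so a missing file does not hide the other one.
            echo "==> $log (not readable) <=="
        fi
    done
}
```

It can be combined with systemd so that the logs are shown only when the service is not running, for example `systemctl is-active --quiet SERVICE_NAME || show_service_logs SERVICE_NAME`.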

Monitoring Services from Web Interface#

Within Squirro, server administrators are also able to inspect the status of the current services from the web interface.

This feature is available as a plugin within the Server space.

System commands#

The Squirro services are standard Unix daemons. Standard Linux utilities can be used to debug any issues that may arise.

Processor usage#

The current processor usage can be consulted with two standard commands: uptime and top.

uptime#

Next to some uptime information, the uptime command outputs the load average for the past 1, 5 and 15 minutes. The load average is a simple metric showing how many processes had to wait for processing. It should usually be close to or below 1.0. A value above 5.0 indicates quite a high load, and values beyond that are unusual.

When the load average is high, the top command will usually show the processes that are generating the load. But when the CPU usage shown by top is low despite a high load average, that may indicate issues with I/O, such as disk performance.
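The threshold check described above can be scripted, for example as part of a periodic health check. The sketch below is an illustration only: the function name `load_is_high` is hypothetical, and it assumes a Linux system where the 1-minute load average is the first field of `/proc/loadavg`.

```shell
# Hypothetical helper: succeed (exit status 0) when the load average exceeds a threshold.
# Usage: load_is_high [THRESHOLD] [LOAD]
# LOAD defaults to the 1-minute value from /proc/loadavg (Linux only).
load_is_high() {
    local threshold="${1:-5.0}"
    local load="${2:-$(cut -d ' ' -f 1 /proc/loadavg)}"
    # awk performs the floating-point comparison; "+ 0" forces a numeric comparison.
    awk -v load="$load" -v max="$threshold" 'BEGIN { exit !(load + 0 > max + 0) }'
}
```

A typical call would be `load_is_high 5.0 && echo "load is high"`. The second argument exists mainly so that the check can be tried with a known value.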

top#

The command top shows a list of all processes on the system, sorted by current CPU usage. Pressing M on the keyboard (upper case, so use Shift+m) will sort the list by memory usage.

Memory usage#

Memory usage of individual processes can be debugged with the top command above. To see memory usage of the system as a whole, use free.

free#

The free command outputs statistics on how much RAM the system is using. The most useful values to consider are the used and free figures in the “-/+ buffers/cache” row. They show how much memory the system is committed to using and cannot easily free. Newer versions of free no longer print this row; there, the “available” column gives a comparable estimate.

By default, free outputs all values in kibibytes. Calling it with the -m parameter (free -m) outputs all values in mebibytes instead.

When free memory is very low, the system may be running into issues with memory usage. In some cases, the kernel may need to kill processes to make space. Those instances are recorded in the standard system log /var/log/messages and show up as lines such as “Out of memory: kill process 23123”.
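To look for such out-of-memory events, the system log can be searched directly. This is a minimal sketch: the function name `find_oom_kills` is hypothetical, and reading `/var/log/messages` may require root permissions.

```shell
# Hypothetical helper: list kernel out-of-memory events recorded in the system log.
# Usage: find_oom_kills [LOG_FILE]   (defaults to /var/log/messages)
find_oom_kills() {
    local log_file="${1:-/var/log/messages}"
    # -i makes the match case-insensitive.
    grep -i 'out of memory' "$log_file"
}
```

The function prints every matching line and returns a non-zero exit status when no such event is found, which makes it usable in conditions such as `if find_oom_kills > /dev/null; then ...`.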

Disk usage#

A full disk will prevent the system from working. The df command can help with finding those issues.

df#

Use the df command to see a list of all partitions and their disk usage. The column “Use%” will show the usage in percentage. Anything above 95% is considered full and will usually hinder the system from working well.

When you are experiencing full disks, consider enlarging the corresponding disk, or contact Squirro Support for ways to remove extra data.
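The 95% rule of thumb can be applied automatically by filtering the df output. The following is a sketch, not a Squirro-provided tool: the function name `filter_full_partitions` is hypothetical, and it expects the POSIX output format produced by `df -P`, in which the fifth column is the usage percentage and the sixth is the mount point. Mount points containing spaces are not handled.

```shell
# Hypothetical helper: read `df -P` output on stdin and print the mount points
# whose usage is above a threshold, together with their usage.
# Usage: df -P | filter_full_partitions [THRESHOLD_PERCENT]
filter_full_partitions() {
    local threshold="${1:-95}"
    awk -v max="$threshold" '
        NR > 1 {                      # skip the header line
            use = $5
            sub(/%$/, "", use)        # "97%" -> "97"
            if (use + 0 > max + 0) print $6 " " $5
        }
    '
}
```

Running `df -P | filter_full_partitions 95` prints nothing when no partition is above the threshold.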

Following log files#

tail#

A lot of information is captured in log files. These files can be followed with the tail command, specifically by using its -f parameter to follow all updates on a file.

For example:

tail -f /var/log/squirro/topic/topic.log

This shows a real-time view of what is written into the topic service log file.

tail also accepts multiple file names or even wildcards. So all Squirro service log files can be monitored as follows:

tail -f /var/log/squirro/*/*.log

grep#

The grep command searches files for occurrences of a specific text. For example, if Squirro is reporting errors, but you are unsure where they might be coming from, the following command helps pin down the responsible service:

grep ERROR /var/log/squirro/*/*.log

This will output a list of all Squirro log files that contain the text “ERROR” together with the lines that contain this text.
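When many services log errors, a per-file count is often quicker to read than the full list of matching lines. The sketch below shows one way to get it. The function name `count_errors_per_log` is hypothetical, and the `-H` option is supported by GNU and BSD grep but is not part of POSIX.

```shell
# Hypothetical helper: count the lines containing "ERROR" in each service log,
# highest count first.
# Usage: count_errors_per_log [LOG_ROOT]   (defaults to /var/log/squirro)
count_errors_per_log() {
    local log_root="${1:-/var/log/squirro}"
    # -c prints a count per file; -H forces the file name prefix even for a single file.
    # sort orders the "file:count" lines numerically by the count, in descending order.
    grep -cH ERROR "$log_root"/*/*.log | sort -t ':' -k 2,2nr
}
```

The service at the top of the output is usually the best starting point for a closer look with tail or grep.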

Squirro Logs#

All the Squirro logs are stored in /var/log/squirro. Each service has its own directory that can be queried as follows:

tail -f /var/log/squirro/SERVICE/*.log

For instance, if we are interested in the topic service:

tail -f /var/log/squirro/topic/*.log

To check the logs of all services:

tail -f /var/log/squirro/*/*.log

Squirro Log Utilities#

For convenience, Squirro provides shell functions that combine common log monitoring operations.

squirro_tail_errors#

The squirro_tail_errors function provides a convenient way to monitor all Squirro service logs and filter for warnings and errors only. This is particularly useful for troubleshooting issues across all services.

squirro_tail_errors

This command is equivalent to:

tail -f /var/log/squirro/*/*.log /var/log/squirro/ingester/processor_*/*.log | grep -E 'WARNING|ERROR'

The function monitors log files from:

All Squirro service directories under /var/log/squirro/*/
All ingester processor directories under /var/log/squirro/ingester/processor_*/

Only log entries containing WARNING or ERROR are displayed, making it easier to spot issues without being overwhelmed by informational messages.

squirro_tail_logs#

For complete log monitoring without filtering, use:

squirro_tail_logs

This shows all log entries from all Squirro services in real-time.

Note

These functions are available when using the Squirro shell aliases, which are typically loaded automatically on Squirro servers. If the functions are not available, you can load them manually by running:

source /tools/packaging/squirro-cluster-config/profile.d.squirro-aliases.sh

To make them available automatically in future shell sessions, add this line to your shell profile (for example, ~/.bashrc or ~/.profile).

The ingester service#

Due to its complexity, the ingester service has a different log structure. To do its job, the service manages a set of processes (named processor_X, where X is in 1, 2, …, N). Each process maintains its own log in its own directory. The easiest way to debug is to merge their content with the following command:

tail -f /var/log/squirro/ingester/processor_*/*.log