2017-12-25

Monitoring Satellite 5 with PCP (Performance Co-Pilot)

During some performance testing we did, I used PCP to monitor basic stats about Red Hat Satellite 5 (this could be applied to Spacewalk as well). I was not able to make it fully sufficient, but maybe somebody can fix and enhance it. I have taken a lot from lzap. First of all, install PCP (the PostgreSQL and Apache PMDAs live in the RHEL Optional repo as of now; on CentOS 7 they seem to be directly in the base repo):
subscription-manager repos --enable rhel-6-server-optional-rpms
yum -y install pcp pcp-pmda-postgresql pcp-pmda-apache
subscription-manager repos --disable rhel-6-server-optional-rpms
Now enable and start the services:
chkconfig pmcd on
chkconfig pmlogger on
service pmcd restart
service pmlogger restart
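To quickly check that the collector is up and answering, pcp prints a summary of the installation and pminfo can fetch a live value of one of the default metrics:
pcp
pminfo -f kernel.all.load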
Install the PostgreSQL and Apache monitoring plugins:
cd /var/lib/pcp/pmdas/postgresql
./Install   # select "c(ollector)" when it asks
cd /var/lib/pcp/pmdas/apache
echo -e "<Location /server-status>\n  SetHandler server-status\n  Allow from all\n</Location>\nExtendedStatus On" >>/etc/httpd/conf/httpd.conf
service httpd restart
./Install
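To verify that mod_status responds and that the Apache metrics made it into PCP (assuming Apache listens on localhost:80), something like this should do:
curl -s http://localhost/server-status?auto
pminfo -f apache.busy_servers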
# Configure hot proc
cat >/var/lib/pcp/pmdas/proc/hotproc.conf <<EOF
> #pmdahotproc
> Version 1.0
> fname == "java" || fname == "httpd"
> EOF
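I believe pmcd has to be restarted (or the proc PMDA reinstalled) for the new hotproc.conf to be picked up; afterwards the hotproc metrics used below should be visible:
service pmcd restart
pminfo -f hotproc.memory.rss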
And because I have a Graphite/Grafana setup available, I was pumping selected metrics there (from RHEL 6, which uses SysV init):
# tail -n 1 /etc/rc.local
pcp2graphite --graphite-host carbon.example.com --prefix "pcp-jhutar." --host localhost - kernel.all.load mem.util.used mem.util.swapCached filesys.full network.interface.out.bytes network.interface.in.bytes disk.dm.read disk.dm.write apache.requests_per_sec apache.bytes_per_sec apache.busy_servers apache.idle_servers postgresql.stat.all_tables.idx_scan postgresql.stat.all_tables.seq_scan postgresql.stat.database.tup_inserted postgresql.stat.database.tup_returned postgresql.stat.database.tup_deleted postgresql.stat.database.tup_fetched postgresql.stat.database.tup_updated filesys.full hotproc.memory.rss &
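To check that metrics can actually reach Graphite, it can help to first push a test value to carbon by hand (2003 is the default plaintext receiver port; the prefix matches the one used above):
# echo "pcp-jhutar.test 1 $( date +%s )" | nc -w 1 carbon.example.com 2003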

Problems I had with this

For reasons I have not investigated closely, after some time the PostgreSQL data were no longer visible in Grafana. I was also unable to get the hotproc data into Grafana. I also experimented with PCP's emulation of Graphite and its Grafana, but PCP's Graphite lacks filters, which makes it hard to use and not practical for anything beyond simple stats.

2017-12-22

"Error: Too many open files" when inside Docker container

Does not work: various ulimit settings for the daemon

We have a container built from this Dockerfile, running on RHEL 7 with an oldish docker-1.10.3-59.el7.x86_64. The containers are started with:

# for i in $( seq 500 ); do
      docker run -h "$( hostname -s )container$i.example.com" -d --tmpfs /tmp --tmpfs /run -v /sys/fs/cgroup:/sys/fs/cgroup:ro --ulimit nofile=10000:10000 r7perfsat
  done
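
To double-check what nofile limit the containers actually end up with, asking one of them directly should work (here just the first one from docker ps):

# docker exec "$( docker ps -q | head -n 1 )" sh -c 'ulimit -n'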

and we have set limits for the docker service on the docker host:

# cat /etc/systemd/system/docker.service.d/limits.conf
[Service]
LimitNOFILE=10485760
LimitNPROC=10485760
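
For the drop-in to apply, systemd needs a reload and docker a restart; the effective limit can then be checked with systemctl show:

# systemctl daemon-reload
# systemctl restart docker
# systemctl show docker -p LimitNOFILE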

but we still saw "Too many open files" issues inside the containers. It could happen when installing a package with yum (resulting in a corrupted RPM database; rm -rf /var/lib/rpm/__db.00*; rpm --rebuilddb saved it though) and when enabling a service (our containers have systemd in them on purpose):

# systemctl restart osad
Error: Too many open files
# echo $?
0
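
Note that the exit code is 0 even though the restart failed, so scripts cannot rely on $? here; asking systemd for the unit state is more reliable:

# systemctl is-active osad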

Because I was stupid, I did not check the journal (inside the container) the moment I first spotted the failure. When I finally did, it contained:

Dec 21 10:18:54 b08-h19-r620container247.example.com journalctl[39]: Failed to create inotify watch: Too many open files
Dec 21 10:18:54 b08-h19-r620container247.example.com systemd[1]: systemd-journal-flush.service: main process exited, code=exited, status=1/FAILURE
Dec 21 10:18:54 b08-h19-r620container247.example.com systemd[1]: inotify_init1() failed: Too many open files
Dec 21 10:18:54 b08-h19-r620container247.example.com systemd[1]: inotify_init1() failed: Too many open files
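
The inotify_init1() failures suggest the limit being hit is the per-user number of inotify instances rather than plain file descriptors; a rough way to count the instances in use on the host is:

# find /proc/*/fd -lname 'anon_inode:inotify' 2>/dev/null | wc -l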

Does work: fs.inotify.max_user_instances

In the end I ran into some issue and the very last comment there had a thing I had not seen before. I ended up with:

# cat /etc/sysctl.d/40-max-user-watches.conf
fs.inotify.max_user_instances=8192
fs.inotify.max_user_watches=1048576
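
To apply it without a reboot and confirm the new values:

# sysctl -p /etc/sysctl.d/40-max-user-watches.conf
# sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches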

The defaults on a different machine are:

# sysctl -a 2>&1 | grep fs.inotify.max_user_
fs.inotify.max_user_instances = 128
fs.inotify.max_user_watches = 8192

It looks like increasing fs.inotify.max_user_instances is what helped, and our containers are now stable.