Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restore failed on OCP Power9 for Java 21 #540

Open
abdulmateen-1 opened this issue May 7, 2024 · 14 comments
Open

Restore failed on OCP Power9 for Java 21 #540

abdulmateen-1 opened this issue May 7, 2024 · 14 comments
Labels
bug InstantOn SVT Tests/Defectsby SVT

Comments

@abdulmateen-1
Copy link

abdulmateen-1 commented May 7, 2024

Java Defect: eclipse-openj9/openj9#19511

Hello team, we're on Liberty 24.0.0.5 and Java 21. We did a checkpoint on an EBC P9 VM and did a restore on an OCP P9 VM which failed failed with the error below.

Found mounted TLS certificates, generating keystore
CWWKE0964E: Restoring the checkpoint server process failed. Check the /logs/checkpoint/restore.log log to determine why the checkpoint process was not restored. The server did not launch because checkpoint restore recovery is disabled.
Warn (criu/kerndat.c:1103): $XDG_RUNTIME_DIR not set. Cannot find location for kerndat file
Warn (criu/libnetlink.c:84): Can't send request message
Warn (criu/libnetlink.c:84): Can't send request message
Warn (criu/libnetlink.c:84): Can't send request message
Warn (criu/libnetlink.c:84): Can't send request message
Warn (criu/libnetlink.c:84): Can't send request message
Warn (criu/libnetlink.c:84): Can't send request message
Warn (criu/kerndat.c:1103): $XDG_RUNTIME_DIR not set. Cannot find location for kerndat file
1023: Warn (criu/sockets.c:544): sockets: socket has dumped SO_BUF_LOCK state but kernel doesn't support SO_BUF_LOCK
1023: Warn (criu/sockets.c:544): sockets: socket has dumped SO_BUF_LOCK state but kernel doesn't support SO_BUF_LOCK
1023: Warn (criu/sockets.c:544): sockets: socket has dumped SO_BUF_LOCK state but kernel doesn't support SO_BUF_LOCK
1023: Warn (criu/sockets.c:544): sockets: socket has dumped SO_BUF_LOCK state but kernel doesn't support SO_BUF_LOCK
1023: Warn (criu/sockets.c:544): sockets: socket has dumped SO_BUF_LOCK state but kernel doesn't support SO_BUF_LOCK
pie: 1023: Error (criu/pie/restorer.c:1676): Can't restore 0x78f367cf0000 mapping with 0xfffffffffffffff3
pie: 1023: Error (criu/pie/restorer.c:2102): Restorer fail 1023
Error (criu/cr-restore.c:2547): Restoring FAILED.

The lscpu from OCP Power9

Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s):           8
NUMA node(s):        1
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
NUMA node0 CPU(s):   0-7

The result of uname -r on both machines

OCP
sh-4.4# uname -r
5.14.0-284.59.1.el9_2.ppc64le

P9VM
5.15.0-97-generic

To get more trace we set the CRIU logging level to 4 and I included the output of the log file.
ebuy-instanton-1-597864cdcd-fk8lz-app.log

@leochr
Copy link
Member

leochr commented May 8, 2024

FYI, this is being discussed in the private instanton Slack channel with Java/Semeru team. Depending on the discussion, this might get moved elsewhere: https://ibm-cloud.slack.com/archives/C03MR7EC3NG/p1715103170419049

@abdulmateen-1
Copy link
Author

We tested different Test case scenarios to properly determine the problem is from

1. Take the checkpoint and restore on a P9 image if successful, perform a restore on an OCP-P9 cluster

The restore was successful on the P9 VM machine but failed on the OCP-P9 cluster.

(00.059541) pie: 1025: 	mmap(0x7a1183240000 -> 0x7a1183250000, 0x3 0x12 398)
(00.059548) pie: 1025: 	mmap(0x7a1183250000 -> 0x7a11832b0000, 0x7 0x12 399)
(00.059554) pie: 1025: 	mmap(0x7a11832b0000 -> 0x7a11832c0000, 0x3 0x12 399)
(00.059560) pie: 1025: 	mmap(0x7a11832c0000 -> 0x7a11832d0000, 0x3 0x12 399)
(00.059567) pie: 1025: 	mmap(0x7a11832d0000 -> 0x7a11833a0000, 0x3 0x32 -1)
(00.059573) pie: 1025: 	mmap(0x7a11833a0000 -> 0x7a11833b0000, 0x0 0x32 -1)
(00.059578) pie: 1025: 	mmap(0x7a11833b0000 -> 0x7a11833f0000, 0x3 0x32 -1)
(00.059583) pie: 1025: 	mmap(0x7a11833f0000 -> 0x7a1183400000, 0x1 0x11 400)
(00.059717) pie: 1025: Error (criu/pie/restorer.c:1676): Can't restore 0x7a11833f0000 mapping with 0xfffffffffffffff3
(00.059728) pie: 1025: Error (criu/pie/restorer.c:2102): Restorer fail 1025
(00.059965) Error (criu/cr-restore.c:2547): Restoring FAILED.

restore.log
ebuy-instanton-1-76df794d95-bxf4l-app.log

2. Take the checkpoint and restore on a P10 image if sucessfuly, perform a restore on an OCP-P9 cluster

The checkpoint on the P10 machine initially failed with this error.

[ERROR   ] CWWKE0701E: bundle com.ibm.ws.tx.embeddable.jakarta:1.0.89.cl240620240507-2000 (455)[com.ibm.tx.jta.embeddable.impl.EmbeddableTMHelper(314)] : The shutdown method has thrown an exception java.lang.NullPointerException: Cannot invoke "com.ibm.tx.jta.impl.TxRecoveryAgentImpl.stop(boolean)" because "com.ibm.tx.jta.util.TxTMHelper._recoveryAgent" is null
	at com.ibm.tx.jta.util.TxTMHelper.shutdown(TxTMHelper.java:545)
	at [internal classes]
  
[AUDIT   ] CWWKE0036I: The server defaultServer stopped after 8.602 seconds.
CWWKE0962E: The server checkpoint request failed. The following output is from the CRIU /logs/checkpoint/checkpoint.log file that contains details on why the checkpoint failed.
Warn  (criu/kerndat.c:1103): $XDG_RUNTIME_DIR not set. Cannot find location for kerndat file
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/kerndat.c:1103): $XDG_RUNTIME_DIR not set. Cannot find location for kerndat file
Warn  (compel/src/lib/infect.c:133): Unable to interrupt task: 1117 (Operation not permitted)
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Error (criu/sockets.c:436): sockets: Sockets (family 1 proto 0) are not collected
Error (criu/sk-unix.c:351): unix: Unix socket 1351736 not found
Error (criu/cr-dump.c:1669): Dump files (pid: 1023) failed with -1
Error (criu/cr-dump.c:2093): Dumping FAILED.
CWWKE0963E: The server checkpoint request failed because netlink system calls were unsuccessful. If SELinux is enabled in enforcing mode, netlink system calls might be blocked by the SELinux "virt_sandbox_use_netlink" policy setting. Either disable SELinux or enable the netlink system calls with the "setsebool virt_sandbox_use_netlink 1" command.

Following the request by InstantOn team, On a regression scenario to taking checkpoint on a P10 image and restore: Try the getsebool virt_sandbox_use_netlink command, if that returns 0
do setsebool, virt_sandbox_use_netlink 1
and try the checkpoint again.

We ran the below command to get it working on a P10 machine.

getsebool virt_sandbox_use_netlink command, if that returns 0 
do setsebool virt_sandbox_use_netlink 1 and try the checkpoint again

this time the restore was successful on the P10 image, but the restore on OCP-P9 image failed but with a different error.

(00.143171) 1023 (native) is going to execute the syscall 91, required is 172
(00.143190) 1023 was trapped
(00.143200) 1023 (native) is going to execute the syscall 172, required is 172
(00.143225) 1023 was stopped
(00.143262) 1023 was trapped
(00.143274) 1023 (native) is going to execute the syscall 91, required is 91
(00.143335) 1023 was stopped
(00.143352) Run late stage hook from criu master for external devices
(00.143359) restore late stage hook for external plugin failed
(00.143367) Run late stage hook from criu master for external devices
(00.143374) restore late stage hook for external plugin failed
(00.143381) Running pre-resume scripts
(00.146642) Error (criu/cr-restore.c:2146): Unable to detach 1077: No such process
(00.146674) Error (criu/cr-restore.c:2520): Killing processes because of failure on restore.
The Network was unlocked so some data or a connection may have been lost.
(00.147040) Error (criu/cr-restore.c:2547): Restoring FAILED.

ebuy-instanton-1-76df794d95-zfl99-app.log

3. Change the application the checkpoint is performed on and restore on an OCP-P9 cluster.

This time we tried to deploy a different application, to see if the error had to do with the app. The restore was successful on the OCP-P9 cluster this time around.
ebuy-instanton-1-76df794d95-bxf4l-app.log

@mtamboli mtamboli added bug SVT Tests/Defectsby SVT labels May 13, 2024
@mtamboli
Copy link

Old checkpoint images which were taken using WLO 24.0.0.3 are getting restored fine on OCP -P 4.15 cluster for eBuy.

@abdulmateen-1
Copy link
Author

A new image was built using 24.0.0.3 driver. The restore was successful on the on-prem VM but failed on the OCP instance.
ebuy-instanton-1-7b64849466-nqnn2-app.log

@ymanton
Copy link
Contributor

ymanton commented May 16, 2024

On Power Linux platforms OpenJ9 loads a special file called /proc/powerpc/systemcfg. This is a virtual file provided by the kernel and it contains detailed info about the system. We use some of the info to provide faster time keeping functions. Loading of this file is optional; if we fail to load it we gracefully fall back.

On systems that have SELinux set to Enforcing mode access to this file is blocked. This can be seen in the audit log /var/log/audit/audit.log:

type=AVC msg=audit(1715833184.627:1597): avc:  denied  { map } for  pid=1316985 comm="java" path="/proc/powerpc/systemcfg" dev="proc" ino=4026531945 scontext=system_u:system_r:container_t:s0:c30,c433 tcontext=system_u:object_r:proc_t:s0 tclass=file permissive=0

This problem is happening because the checkpoint was taken on a system where access to this file was not blocked (Ubuntu 22.04, SELinux either not installed, disabled, or set to Permissive mode) and the restore was attempted on a system where access is blocked (OCP 4.15, SELinux set to Enforcing mode).

The older checkpoint image taken with the old 24.0.0.3 driver that can still be restored on OCP works because the JVM did not load the systemcfg file. That checkpoint was taken on a RHEL 9 system, not an Ubuntu system, and that system likely had SELinux set to Enforcing mode, which prevented us from loading systemcfg in the first place. That checkpoint can be restored on OCP without any problem because we don't try to reload systemcfg.

This is also why taking a new checkpoint on Ubuntu with the old 24.0.0.3 driver still fails. If we set SELinux to Enforcing mode on Ubuntu, or take the checkpoint on a RHEL 9 machine, it will likely restore fine on OCP. Alternatively, disabling SELinux or setting it to Permissive mode on OCP allows the Ubuntu checkpoint to restore (I tested this scenario).

My conclusion is that this is not a regression, just a combination of checkpoint host and restore host that was not previously tested. All of the previous builds should show the same problem.

Since the systemcfg file is not necessary we can probably just avoid loading it in the first place if -XX:+EnableCRIUSupport is used and provide a better experience.

@abdulmateen-1
Copy link
Author

After setting SELinux to permissive option, we did perform three scenarios to verify things worked properly

  1. The 24.0.0.5 liberty we opened the case for with checkpoint taken on a Ubuntu EBC Power9 VM. The restore was successfully completed.
  2. A new image built with 24.0.03 driver on a RHEL Power10 machine with netlink system enabled. The restore successfully completed
  3. An old image built with 24.0.0.3 that's being verified to restore successfully on an OCP Power cluster. We tried this image to verify if it's a regression issue. The image also restored successfully.

As customers wouldn't want to disable SELinux on their OCP cluster because that isn't best practice what would be the best practice here for a customer to do when building a checkpoint image.

@ymanton
Copy link
Contributor

ymanton commented May 17, 2024

As customers wouldn't want to disable SELinux on their OCP cluster because that isn't best practice what would be the best practice here for a customer to do when building a checkpoint image.

If users use SELinux Enforcing on the checkpoint side they can also keep it on the restore side. On Ubuntu SELinux is typically not used or is used in Permissive mode, so in this case my advice would be to either set up SELinux Enforcing on Ubuntu, or build checkpoint images on RHEL, where SELinux Enforcing is the default.

We can improve this particular case in future JVM releases, but this kind of issue will probably come up again for other reasons.

@mtamboli
Copy link

Thanks @ymanton! We still need to figure out why restore is failing on OCP when the images are built on P10 machine which is RHEL and seems to SELInux enabled. Do we know why those are failing? For example, 24003 images built earlier vs build now. Do you think we are disabling SELInux somehow when we took checkpoint agin on 24003 on RHEL based P10 machine?

@ymanton
Copy link
Contributor

ymanton commented Jun 4, 2024

The problem that's seen when we take a checkpoint on a P10 RHEL machine and fail to restore on OCP is actually a crash late in the restore process. The evidence can be seen in the kernel log:

[1207617.180242] Attach API wait[3883768]: bad frame in handle_rt_signal64: 00000000f27d5730 nip 00007fff90a3506c lr 00007fff8bfca234
[1207617.180248] Common-Cleaner[3883760]: bad frame in handle_rt_signal64: 00000000e778bd07 nip 00007fff90059720 lr 00007fff90ce0464
[1207617.180251] Attach API wait[3883768]: bad frame in handle_rt_signal64: 00000000f27d5730 nip 00007fff90a3506c lr 00007fff8bfca234
[1207617.185052] systemd-coredump[3883805]: Failed to get COMM: No such process

The above are Java threads that CRIU attempted to restore, but they hit errors in the kernel when CRIU attempted to run them and the process was terminated.

@abdulmateen-1
Copy link
Author

Thanks for the clarification @ymanton. Would we classify this as an application issue or a kernel issue?

@ymanton
Copy link
Contributor

ymanton commented Jun 6, 2024

Unlikely to be a kernel bug, probably a problem in CRIU and the way we use it. We've seen problems like this before and fixed or worked around them in CRIU, the JVM, or Liberty. I expect something like that here, once I figure out exactly what's going wrong and why.

@ymanton
Copy link
Contributor

ymanton commented Jul 15, 2024

The crashes seen when checkpointing on P10 and restoring on OCP (machines are P9) are caused by P10-specific libraries being loaded at checkpoint time. These libraries will cause crashes when the process is restored on a P9 machine.

This problem is already fixed in Liberty images; the checkpoint.sh script will prevent any processor-specific libraries from being loaded:

if [ -e /etc/instanton.ld.so.cache ]; then
cp /etc/instanton.ld.so.cache /etc/ld.so.cache
fi

However it looks like there is a custom checkpoint.sh being used in the testing environment that does not have this fix. The script looks like this:

bash-5.1$ cat -n /opt/ol/helpers/build/checkpoint.sh
     1  #!/bin/bash
     2
     3  echo "USING THE PATCHED checkpoint.sh"
     4
     5  # hack to bump up the pid by 2000
     6  for i in {1..1000}
     7  do
     8      pidplus.sh
     9  done
    10
    11  echo "Performing checkpoint --at=$1"
    12  /opt/ol/wlp/bin/server checkpoint defaultServer --at=$1
    13
    14  rc=$?
    15  if [ $rc -eq 0 ]; then
    16      # Find all directories in logs/ and output/ that the current user has read/write/execute permissions for
    17      # and give the same permissions to the group.
    18      find -L /logs /output -type d -readable -writable -executable -exec chmod g+rwx {} \;
    19
    20      # Find all files in logs/ and output/ that the current user has read/write permissions for
    21      # and give the same permissions to the group.
    22      find -L /logs /output -type f -readable -writable -exec chmod g+rw {} \;
    23  fi
    24
    25  exit $rc

The location of the crashes seems to be in the signal handler itself, so that's why we don't get javacore output or core files; instead the thread crashes, which causes a SIGILL, which crashes again during signal handling, which causes another SIGILL to be sent, and so on, eventually overflowing the stack and causing a crash in the kernel signal handling code.

If we update the patched checkpoint.sh to include the above code, or just get rid of it altogether if it doesn't have a purpose, this should resolve the issue.

@vijaysun-omr
Copy link

fyi @tam512 and @tjwatson since we have discussed this issue briefly on the InstantOn scrum calls

@tam512
Copy link

tam512 commented Aug 14, 2024

a serviceability defect OpenLiberty/open-liberty#29240 was opened to have a better error messages for restore on SELinux

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug InstantOn SVT Tests/Defectsby SVT
Projects
None yet
Development

No branches or pull requests

6 participants