Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segment fault occurred when node was added to executor if the callback group variable created was local #2445

Open
Chaochao0215 opened this issue Mar 5, 2024 · 13 comments
Assignees

Comments

@Chaochao0215
Copy link

Chaochao0215 commented Mar 5, 2024

ROS2

Bug report

Required Info:

  • Operating System:
    • Ubuntu 22.04
  • Installation type:
    • from source
  • Version or commit hash:
    • humble release patch 5
  • DDS implementation:
    • Fast-RTPS
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

Create a node and call the create_callback_group interface to create a callback group local variable, then add the node to the executor and spin the executor, segment fault occuered sometimes.

#include <memory>
#include <thread>
#include <string>
#include <vector>

#include "rclcpp/executors.hpp"
#include "rclcpp/node.hpp"
#include "rclcpp/timer.hpp"
#include "std_msgs/msg/string.hpp"

int main(int argc, char **argv) {
  rclcpp::init(argc, argv);

  auto node = std::make_shared<rclcpp::Node>("test_callback_group");
  auto sub = node->create_subscription<std_msgs::msg::String>(
      "/test_callback_group", 10,
      [](const std_msgs::msg::String::ConstSharedPtr) {});
  rclcpp::executors::MultiThreadedExecutor executor(rclcpp::ExecutorOptions(),
                                                    3u);
  std::thread thr;
  {
    auto group = node->create_callback_group(
        rclcpp::CallbackGroupType::MutuallyExclusive);
    executor.add_node(node);
    thr = std::thread([&executor]() { executor.spin(); });
    std::cout << "wait for executor running..." << std::endl;
  } // executor wait on dds and then destory the callback group

  while (rclcpp::ok()) {
    std::this_thread::sleep_for(std::chrono::seconds(1));
  }
  rclcpp::shutdown();

  if (thr.joinable()) {
    thr.join();
  }

  rclcpp::shutdown();
  return 0;
} 

#### Expected behavior
process runs normally
#### Actual behavior
segment fault
#### Additional information
It seems to be related to the lifetime of the condition in the callback group.
@Zard-C
Copy link
Contributor

Zard-C commented Mar 5, 2024

Hello @Chaochao0215, maybe we could build a debug type of this source code and use gdb to backtrace when it crashed

@Barry-Xu-2018
Copy link
Collaborator

I tried your code, but I cannot find segment fault.
My test environment is based on ros2 Debian package (ros-humble-rclcpp 16.0.8-1jammy.20240217.065743).

@Chaochao0215
Copy link
Author

Chaochao0215 commented Mar 5, 2024

Hello @Chaochao0215, maybe we could build a debug type of this source code and use gdb to backtrace when it crashed
Glad to receive your reply, this issue reoccuered and i pasted the backtrace information
HowHgpAxRD

@Zard-C
Copy link
Contributor

Zard-C commented Mar 5, 2024

I tried your code and I can't reproduce the segment fault.
However, in gdb backtrace it seems like a multi-thread problem,
rebuild your program with

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsanitize=thread -g")

and check if there's data race in ThreadSanitizer report.

@Chaochao0215
Copy link
Author

Chaochao0215 commented Mar 6, 2024

I tried your code and I can't reproduce the segment fault. However, in gdb backtrace it seems like a multi-thread problem, rebuild your program with

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsanitize=thread -g")

and check if there's data race in ThreadSanitizer report.

hi, I found a quick way to reproduce the problem

  1. set breakpoint 1 at line 21 of the test file
  2. set breakpoint 2 at line 25 of the test file
  3. set breakpoint 3 at " rcl_ret_t status =
    rcl_wait(&wait_set_, std::chrono::duration_caststd::chrono::nanoseconds(timeout).count());" in executor.cpp
  4. set gdb to not stop when hit breakpoints , then run the test
  5. after breakpoints 1 && 3 both hit , continue the main thread and wait to hit breakpoint 2
  6. after breakpoint 2 hit, send the message subscribed in the new shell terminal, such as 'ros2 topic pub /test_callback_group std_msgs/msg/String "data: hello"' and continue the thread interrupted at breakpoint 3 , segment fault should occurred

@Zard-C
Copy link
Contributor

Zard-C commented Mar 6, 2024

6. send the message subscribed in the new shell terminal, such as 'ros2 topic pub /test_callback_group std_msgs/msg/String "data: hello"

I missed this step.

@Chaochao0215
Copy link
Author

  1. send the message subscribed in the new shell terminal, such as 'ros2 topic pub /test_callback_group std_msgs/msg/String "data: hello"

I missed this step.

I'm sorry i didn't specify the need to publish message at the beginning, hope you reproduce it successfully.

@Chaochao0215
Copy link
Author

I tried your code, but I cannot find segment fault. My test environment is based on ros2 Debian package (ros-humble-rclcpp 16.0.8-1jammy.20240217.065743).
Thank you for your attention, it occurs with low probability and I built ROS2 from source so that it can be controlled through gdb

@fujitatomoya
Copy link
Collaborator

@Chaochao0215 it would be helpful if you could provide self-contained reproducible test.

@zxf101826
Copy link

@Chaochao0215 Did you find the cause of this problem in the end? How did you solve it?
I also encountered the same problem when using MultiThreadedExecutor, The program stack is as follows

Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000007f874cdac4 in ?? () from /opt/ros/humble/lib/librmw_cyclonedds_cpp.so
[Current thread is 1 (Thread 0x7f5f7ee8e0 (LWP 402))]
(gdb) bt
#0 0x0000007f874cdac4 in ?? () from /opt/ros/humble/lib/librmw_cyclonedds_cpp.so
#1 0x0000007f874cfd50 in rmw_wait () from /opt/ros/humble/lib/librmw_cyclonedds_cpp.so
#2 0x0000007f87f16c38 in rcl_wait () from /opt/ros/humble/lib/librcl.so
#3 0x0000007f88266850 in rclcpp::Executor::wait_for_work(std::chrono::duration<long, std::ratio<1l, 1000000000l> >) () from /opt/ros/humble/lib/librclcpp.so
#4 0x0000007f8826989c in rclcpp::Executor::get_next_executable(rclcpp::AnyExecutable&, std::chrono::duration<long, std::ratio<1l, 1000000000l> >) ()
from /opt/ros/humble/lib/librclcpp.so
#5 0x0000007f88270d38 in rclcpp::executors::MultiThreadedExecutor::run(unsigned long) () from /opt/ros/humble/lib/librclcpp.so
#6 0x0000007f87d531fc in ?? () from /lib/aarch64-linux-gnu/libstdc++.so.6
#7 0x0000007f87b1d5c8 in start_thread (arg=0x0) at ./nptl/pthread_create.c:442
#8 0x0000007f87b85d9c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:79

@Zard-C
Copy link
Contributor

Zard-C commented Jun 24, 2024

@Chaochao0215 Did you find the cause of this problem in the end? How did you solve it? I also encountered the same problem when using MultiThreadedExecutor, The program stack is as follows

Program terminated with signal SIGSEGV, Segmentation fault.

could you give a minimum demo code to regenerate this error ? I tried codes above, but it didn't crash.

@zxf101826
Copy link

@Chaochao0215 Did you find the cause of this problem in the end? How did you solve it? I also encountered the same problem when using MultiThreadedExecutor, The program stack is as follows
Program terminated with signal SIGSEGV, Segmentation fault.

could you give a minimum demo code to regenerate this error ? I tried codes above, but it didn't crash.

My code is mixed with a lot of logic, I don't know which call caused it, this crash is quite sporadic, The program only appears once every few days.

@adamlm
Copy link

adamlm commented Sep 8, 2024

This issue looks similar to ros2/rmw_fastrtps#777.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants