[Bug]: Span dispatcher does not handle WorkflowHandler.pending correctly. #16283

Open
B-Step62 opened this issue Sep 29, 2024 · 1 comment · May be fixed by #16290
Labels
bug Something isn't working triage Issue needs to be triaged/prioritized

Comments

@B-Step62
Contributor

Bug Description

Hi team, I am an MLflow maintainer and found an issue in the instrumentation mechanism for the new LlamaIndex workflow. The issue is not specific to MLflow; it also affects other observability providers such as LlamaTrace.

MLflow Tracing implements the new instrumentation mechanism (span + event handlers), and it worked perfectly for tracing LlamaIndex workflows until this change was introduced in 0.11.10. Since that async context was introduced, the span dispatcher invokes the prepare_to_exit_span handler with a still-pending WorkflowHandler. This is problematic because:

  1. Users cannot see the actual result as the span/trace output, only a pending future object.
  2. Users cannot measure the latency of the actual operation; only the Future creation time is recorded.
  3. The parent span (run) exits before the child task is scheduled, so the span handler's self.open_spans attribute is already empty when the child span starts; as a result, the span handler cannot determine the parent-child relationship.
[Screenshot 2024-09-29 at 16 31 24]

The dispatcher should await the result before calling prepare_to_exit_span. I'm not very familiar with the dispatching logic for workflows, but the issue appears to be that the dispatcher.span decorator invokes the span handler immediately with the Future object.
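As a rough sketch of what I mean (this is not the actual Dispatcher code, and the span_exit call signature below is assumed from the span-handler hooks shown in this issue): when the decorated call returns a still-pending WorkflowHandler, the wrapper could defer the span exit until the future resolves, since WorkflowHandler is an asyncio.Future subclass:

from llama_index.core.workflow.handler import WorkflowHandler


def exit_span_with_result(dispatcher, span_id, bound_args, instance, result):
    # Hypothetical helper: report the span exit with the real result.
    if isinstance(result, WorkflowHandler) and not result.done():
        # Defer the exit until the workflow actually finishes, so the span
        # output and duration reflect the real operation.
        result.add_done_callback(
            lambda fut: dispatcher.span_exit(
                id_=span_id,
                bound_args=bound_args,
                instance=instance,
                result=None if fut.cancelled() or fut.exception() else fut.result(),
            )
        )
    else:
        # Already-resolved results exit the span immediately, as today.
        dispatcher.span_exit(
            id_=span_id, bound_args=bound_args, instance=instance, result=result
        )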

I also validated that the same issue happens in LlamaTrace:

[Screenshot 2024-09-29 at 16 46 05]

(The parent-child relationship is maintained there, probably because of some difference in implementation detail, but you can see that the output and span duration are off for the root span.)

Version

0.11.10 (or later)

Steps to Reproduce

Reproduce with a dummy span handler (no dependencies)

  1. Define a dummy handler
import inspect
from typing import Any, Dict, Optional
from llama_index.core.bridge.pydantic import Field
from llama_index.core.instrumentation.span.base import BaseSpan
from llama_index.core.instrumentation.span_handlers import BaseSpanHandler
from llama_index.core.workflow.handler import WorkflowHandler


class TestSpanHandler(BaseSpanHandler[BaseSpan]):
    def new_span(
        self,
        id_: str,
        bound_args: inspect.BoundArguments,
        instance: Optional[Any] = None,
        parent_span_id: Optional[str] = None,
        tags: Optional[Dict[str, Any]] = None,
        **kwargs: Any,
    ) -> Optional[BaseSpan]:
        """Create a span."""
        pass

    def prepare_to_exit_span(
        self,
        id_: str,
        bound_args: inspect.BoundArguments,
        instance: Optional[Any] = None,
        result: Optional[Any] = None,
        **kwargs: Any,
    ) -> Any:
        """Logic for preparing to exit a span."""
        if isinstance(result, WorkflowHandler) and not result.is_done():
            print(f"prepare_to_exit_span called with pending WorkflowHandler")


    def prepare_to_drop_span(
        self,
        id_: str,
        bound_args: inspect.BoundArguments,
        instance: Optional[Any] = None,
        err: Optional[BaseException] = None,
        **kwargs: Any,
    ) -> Any:
        """Logic for preparing to drop a span."""
        pass
  2. Define a simple workflow
from llama_index.core.workflow import StartEvent, StopEvent, Workflow, step
import time

class MyWorkflow(Workflow):
    @step
    async def my_step(self, ev: StartEvent) -> StopEvent:
        time.sleep(3)
        return StopEvent(result="Hi, world!")

w = MyWorkflow(timeout=10, verbose=False)
  3. Run the workflow with the span handler
import llama_index.core.instrumentation as instrument

dispatcher = instrument.get_dispatcher()
dispatcher.add_span_handler(TestSpanHandler())

result = await w.run()
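(Note: result = await w.run() uses top-level await, which works in a notebook or IPython. In a plain Python script you can drive the same repro with asyncio.run; the main wrapper below is just illustrative.)

import asyncio

async def main():
    result = await w.run()
    print(result)

asyncio.run(main())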

Output:

prepare_to_exit_span called with pending WorkflowHandler

Reproduce with MLflow

  1. Install the dev branch: pip install git+https://github.com/B-Step62/mlflow.git@llama-workflow-trace (installing the dev branch is only needed to pick up a fix for recent async handling; the stable release works with LlamaIndex workflows earlier than 0.11.10)
  2. Enable MLflow tracing for LlamaIndex
import mlflow

mlflow.llama_index.autolog()
  3. Run the same workflow as above.

Relevant Logs/Tracebacks

No response

B-Step62 added the bug and triage labels on Sep 29, 2024

dosubot bot commented Sep 29, 2024

To address the issue where the span dispatcher in LlamaIndex version 0.11.10 invokes the prepare_to_exit_span handler with a pending WorkflowHandler, you need to ensure that the prepare_to_exit_span method correctly handles the state of the span before it is exited. This can be done by adding a check to see if the span is in a valid state to be exited.

Here is a possible solution:

  1. Modify the prepare_to_exit_span method to include a check for the span's state.
  2. Ensure that the span is only removed from open_spans if it is in a valid state to be exited.

Here is the updated code:

def span_exit(
    self,
    id_: str,
    bound_args: inspect.BoundArguments,
    instance: Optional[Any] = None,
    result: Optional[Any] = None,
    **kwargs: Any,
) -> None:
    """Logic for exiting a span."""
    span = self.prepare_to_exit_span(
        id_=id_, bound_args=bound_args, instance=instance, result=result
    )
    if span:
        with self.lock:
            if id_ in self.open_spans:
                del self.open_spans[id_]
            else:
                # Handle the case where the span is not in open_spans.
                # This could involve logging an error or taking other appropriate action.
                pass

In the prepare_to_exit_span method, you should ensure that the span is in a valid state to be exited. This might involve checking certain attributes of the span or ensuring that it has completed its intended operations.
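One way to express that check (a sketch only, not the actual LlamaIndex implementation; it mirrors the reporter's dummy handler above) is to treat an unresolved WorkflowHandler result as not ready to exit:

from llama_index.core.workflow.handler import WorkflowHandler

def result_is_ready(result) -> bool:
    # WorkflowHandler subclasses asyncio.Future, so done() tells us whether
    # the workflow has actually produced its result yet.
    if isinstance(result, WorkflowHandler):
        return result.done()
    return True

prepare_to_exit_span could then skip (or defer) finalizing the span whenever result_is_ready(result) is False.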

Here is the relevant part of the prepare_to_exit_span method from the AgentOpsSpanHandler class:

def prepare_to_exit_span(
    self,
    id_: str,
    bound_args: inspect.BoundArguments,
    instance: Optional[Any] = None,
    result: Optional[Any] = None,
    **kwargs: Any,
) -> SimpleSpan:
    self._shared_handler_state.remove_span_id(id_)
    return super().prepare_to_exit_span(id_, bound_args, instance, result, **kwargs)

Additionally, you can refer to the test cases provided in the LlamaIndex core tests to ensure that your changes are correctly handling the span lifecycle. For example, the test_dispatcher_span_drop_args and test_dispatcher_async_span_drop_args_with_instance tests show how to mock and assert the behavior of span enter, drop, and exit methods.

By adding these checks and referring to the provided test cases, you can prevent the prepare_to_exit_span handler from being invoked with a pending WorkflowHandler, thereby avoiding incorrect span/trace outputs and latency measurements [1][2][3][4].

