[Task]: Optimize Spark Runner parDo transform evaluator #32537

Open
2 of 17 tasks
twosom opened this issue Sep 23, 2024 · 3 comments · May be fixed by #32546

Comments

@twosom
Contributor

twosom commented Sep 23, 2024

What needs to happen?

When the TransformTranslator in the Apache Spark Runner evaluates a ParDo, it applies more filter operations than necessary.
The filters exist because a ParDo can have multiple outputs: for each TupleTag, a filter selects only the elements that belong to that tag.

However, the filter is also applied to a ParDo with a single output, where every element already belongs to that output, and this can hurt performance.
Therefore, we should skip the filter when evaluating a ParDo with a single output.
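
As a rough illustration of the idea, here is a minimal sketch in plain Java collections rather than Spark RDDs. It is not the actual TransformTranslator code: the class `ParDoOutputSplit`, the method `splitByTag`, and the use of `String` tags in place of `TupleTag` are all hypothetical names chosen for this example. It only shows the branching: with one output tag the elements are unwrapped directly, and the per-tag filter runs only when there are several outputs.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical stand-in for the output-splitting step; not Beam or Spark API.
class ParDoOutputSplit {

    // Splits (tag, value) pairs into one list per output tag.
    static <T> Map<String, List<T>> splitByTag(
            List<Map.Entry<String, T>> all, List<String> tags) {
        Map<String, List<T>> outputs = new LinkedHashMap<>();

        if (tags.size() == 1) {
            // Single output: every element carries the same tag,
            // so unwrap the values without running a filter pass.
            outputs.put(
                tags.get(0),
                all.stream().map(Map.Entry::getValue).collect(Collectors.toList()));
            return outputs;
        }

        // Multiple outputs: one filter pass per tag over the full collection.
        for (String tag : tags) {
            outputs.put(
                tag,
                all.stream()
                    .filter(e -> e.getKey().equals(tag))
                    .map(Map.Entry::getValue)
                    .collect(Collectors.toList()));
        }
        return outputs;
    }
}
```

In the multi-output branch the full collection is scanned once per tag, which is the cost the single-output branch avoids.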

related mail context

Issue Priority

Priority: 2 (default / most normal work should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@twosom
Contributor Author

twosom commented Sep 23, 2024

.take-issue

@tejasrok007

Can you please elaborate on what is needed here?
I understand that we can't use filter operations because they can have a performance impact,
but we would have to change the approach entirely to satisfy this requirement.
Can you guide me a little on this? I think I can complete it.

@twosom
Contributor Author

twosom commented Sep 24, 2024

Can you please elaborate on what is needed here? I understand that we can't use filter operations because they can have a performance impact, but we would have to change the approach entirely to satisfy this requirement. Can you guide me a little on this? I think I can complete it.

@tejasrok007
Thanks for your comment.
But I've already done the work and am testing it.

@twosom twosom linked a pull request Sep 24, 2024 that will close this issue
3 tasks