Feature: AggregateMonotonicity #14271

mertak-synnada · 2025-01-24T13:02:25Z

Which issue does this PR close?

Closes #.

Rationale for this change

This PR defines set-monotonicity for aggregate expressions. Some aggregate functions produce ordered results by definition (such as count, min, and max). With this PR, we add this information to the output ordering, which lets the optimizer remove some SortExecs.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

# Conflicts: # datafusion/core/src/physical_optimizer/enforce_sorting.rs # datafusion/core/src/physical_optimizer/test_utils.rs

# Conflicts: # datafusion/core/src/physical_optimizer/enforce_sorting.rs # datafusion/physical-optimizer/src/test_utils.rs

separate stubs and count_udafs

change monotonicity to return an Enum rather than Option<bool> fix indices re-add monotonicity tests

# Conflicts: # datafusion/core/tests/physical_optimizer/enforce_sorting.rs

mertak-synnada · 2025-01-24T13:04:01Z

datafusion/sqllogictest/test_files/aggregate.slt

@@ -4963,6 +4963,9 @@ false
 true
 NULL

+statement ok


These are related to #14231

In order that the tests better explain the implications of this change, can you please add a new test rather than updating the existing test (by setting this option)?

So that would mean setting the flag and running the EXPLAIN again in a separate block.

That will let the tests better illustrate any change in behavior.

2010YOUY01 · 2025-01-25T08:54:50Z

datafusion/expr/src/udaf.rs

+    /// function is monotonically increasing if its value increases as its argument grows
+    /// (as a set). Formally, `f` is a monotonically increasing set function if `f(S) >= f(T)`
+    /// whenever `S` is a superset of `T`.
+    fn monotonicity(&self, _data_type: &DataType) -> AggregateExprMonotonicity {


I recommend adding a note at the beginning of the comment: this is used for a specific optimization (is it BoundedWindowAggExec?) and can be skipped by using the default implementation.
This interface seems quite difficult to understand for a general user who only wants to add a simple UDAF.

Would it be possible to follow the existing model for ScalarUDFs here instead?

https://github.com/apache/datafusion/blob/27db82fe396f43077b5056bab4b20b084c8f6948/datafusion/expr/src/udf.rs#L753-L752

Something like this:

pub trait AggregateUDFImpl {
    ...
    /// Returns the output order of this aggregate expression given the input properties
    fn output_ordering(&self, inputs: &[ExprProperties]) -> Result<SortProperties>;
    ...
}

I think this is not possible because this property is purely related to the function's nature. It does not depend on input order or anything else, only on the relation between element-wise growth (or shrinkage) of the grouping set and the resulting values of the aggregate function. I'm renaming monotonicity to set-monotonicity.
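To make the set-monotonicity idea concrete, here is a minimal sketch. The `SetMonotonicity` enum, the `classify` helper, and the plain-slice aggregates are hypothetical names invented for illustration and are not the API added by this PR. The helper evaluates an aggregate on growing prefixes of one input, which is a single chain of supersets, so it can refute set-monotonicity but cannot prove it.

```rust
/// Hypothetical stand-in for the enum discussed in this PR; the real
/// DataFusion type may differ in name and variants.
#[derive(Debug, PartialEq, Clone, Copy)]
enum SetMonotonicity {
    Increasing,
    Decreasing,
    NotMonotonic,
}

// Toy aggregates over a non-empty slice (illustrative only).
fn count(v: &[i64]) -> i64 {
    v.len() as i64
}
fn max(v: &[i64]) -> i64 {
    *v.iter().max().expect("non-empty input")
}
fn min(v: &[i64]) -> i64 {
    *v.iter().min().expect("non-empty input")
}
fn sum(v: &[i64]) -> i64 {
    v.iter().sum()
}

/// Evaluates `agg` on the growing prefixes of `values` and reports how the
/// outputs moved along that one chain of supersets.
fn classify(agg: fn(&[i64]) -> i64, values: &[i64]) -> SetMonotonicity {
    let outputs: Vec<i64> = (1..=values.len()).map(|n| agg(&values[..n])).collect();
    if outputs.windows(2).all(|w| w[0] <= w[1]) {
        SetMonotonicity::Increasing
    } else if outputs.windows(2).all(|w| w[0] >= w[1]) {
        SetMonotonicity::Decreasing
    } else {
        SetMonotonicity::NotMonotonic
    }
}

fn main() {
    let values: [i64; 5] = [3, -1, 4, -5, 9];
    println!("count: {:?}", classify(count, &values));
    println!("max:   {:?}", classify(max, &values));
    println!("min:   {:?}", classify(min, &values));
    println!("sum:   {:?}", classify(sum, &values));
}
```

Note that the classification depends only on the aggregate and on how the set grows, not on any ordering of the input, which is the point made above. The sum of signed values is not monotonic on this input because a negative element can lower the total.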

I recommend adding a note at the beginning of the comment: this is used for a specific optimization (is it BoundedWindowAggExec?) and can be skipped by using the default implementation. This interface seems quite difficult to understand for a general user who only wants to add a simple UDAF.

We've tried to provide good documentation, and the API itself comes with a default implementation. If general users are not interested in these properties, we are not forcing them to be. Do you have further suggestions at either the code or the documentation level?

alamb

Thanks @mertak-synnada -- I like where this is headed. I am not sure about some of the plan changes and I also have some questions about the API

Thanks @2010YOUY01 for the look as well

datafusion/expr/src/udaf.rs

alamb · 2025-01-26T11:10:07Z

datafusion/expr/src/udaf.rs

+    /// function is monotonically increasing if its value increases as its argument grows
+    /// (as a set). Formally, `f` is a monotonically increasing set function if `f(S) >= f(T)`
+    /// whenever `S` is a superset of `T`.
+    fn monotonicity(&self, _data_type: &DataType) -> AggregateExprMonotonicity {


Would it be possible to follow the existing model for ScalarUDFs here instead?

https://github.com/apache/datafusion/blob/27db82fe396f43077b5056bab4b20b084c8f6948/datafusion/expr/src/udf.rs#L753-L752

Something like this:

pub trait AggregateUDFImpl {
    ...
    /// Returns the output order of this aggregate expression given the input properties
    fn output_ordering(&self, inputs: &[ExprProperties]) -> Result<SortProperties>;
    ...
}

datafusion/sqllogictest/test_files/aggregate.slt

alamb · 2025-01-26T11:13:33Z

datafusion/sqllogictest/test_files/aggregate.slt

@@ -4963,6 +4963,9 @@ false
 true
 NULL

+statement ok


In order that the tests better explain the implications of this change, can you please add a new test rather than updating the existing test (by setting this option)?

So that would mean setting the flag and running the EXPLAIN again in a separate block.

That will let the tests better illustrate any change in behavior.

datafusion/sqllogictest/test_files/aggregates_topk.slt

ozankabak

I have some minor comments, almost ready to go

datafusion/physical-expr/src/aggregate.rs

datafusion/physical-expr/src/window/aggregate.rs

datafusion/physical-expr/src/window/standard.rs

datafusion/physical-plan/src/aggregates/mod.rs

ozankabak

This LGTM and is ready to go from my perspective. @alamb, it'd be great if you can take a look. It doesn't introduce any changes to existing plans/tests unless it is a strict improvement, but I'd still prefer if you could take a final quick look.

alamb

Thanks @mertak-synnada and @ozankabak

I think I am missing something here -- the code is very nicely structured and does what the PR says it should do. However, the optimization doesn't seem to compute the same answer

datafusion/functions-aggregate/src/min_max.rs

datafusion/expr/src/udaf.rs

datafusion/functions-aggregate/src/sum.rs

datafusion/expr/src/udaf.rs

datafusion/sqllogictest/test_files/aggregate.slt

datafusion/core/tests/physical_optimizer/enforce_sorting.rs

alamb

Thanks again @ozankabak and @mertak-synnada

I am still confused about this PR -- I am sorry I am probably missing something silly

My understanding of this PR

As I understand this PR, it is optimizing queries like

select a, count(b) FROM ... GROUP BY a ORDER BY a, count(b)

By noticing that when

1. the input is already sorted by a, and
2. we use the ordering-preserving grouping (ordering_mode=Sorted),

the output is already sorted by a, count(b), and thus no SortExec is needed.

This makes total sense to me and is a great optimization ✅

My confusion -- doesn't this always hold?

What I don't understand is why this optimization relies on the specific aggregate function used (aka why is AggregateExprSetMonotonicity needed)?

It seems to me like any query like the following doesn't need an extra sort.

select a, agg(b) FROM ... GROUP BY a ORDER BY a, agg(b)

(where agg(b) is any aggregate )

My reasoning is that the GROUP BY ensures that there are no duplicates in the a column, so by definition the stream is sorted by a, <any other columns> as we know a is unique 😕
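The uniqueness argument can be checked with a small sketch. The row layout `(a, agg)` and both helper names are assumptions made for this example. It relies on the fact that Rust compares tuples lexicographically.

```rust
/// True when the keys are strictly increasing, i.e. sorted and free of
/// duplicates, which is what a sorted GROUP BY key column looks like.
fn keys_strictly_increasing(rows: &[(i32, i64)]) -> bool {
    rows.windows(2).all(|w| w[0].0 < w[1].0)
}

/// True when the rows are in lexicographic order by (a, agg).
/// Tuple comparison in Rust is lexicographic, so `<=` is sufficient.
fn sorted_by_a_then_agg(rows: &[(i32, i64)]) -> bool {
    rows.windows(2).all(|w| w[0] <= w[1])
}

fn main() {
    // Output of a grouped query: `a` is unique, the aggregate values are arbitrary.
    let grouped: [(i32, i64); 3] = [(1, 50), (2, -7), (3, 20)];
    assert!(keys_strictly_increasing(&grouped));
    assert!(sorted_by_a_then_agg(&grouped));

    // Without uniqueness the second column starts to matter.
    let ungrouped: [(i32, i64); 3] = [(1, 50), (1, -7), (2, 20)];
    assert!(!keys_strictly_increasing(&ungrouped));
    assert!(!sorted_by_a_then_agg(&ungrouped));
}
```

Whenever the leading key is strictly increasing, the tie-breaking column is never consulted, so the lexicographic ordering holds for any aggregate.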

datafusion/core/tests/physical_optimizer/enforce_sorting.rs

ozankabak · 2025-01-30T07:24:24Z

Thanks for reviewing carefully, as always, much appreciated 🚀

select a, agg(b) FROM ... GROUP BY a ORDER BY a, agg(b)

You are right that all queries of this form can be optimized independent of what agg is. The unit tests involving such queries in this PR (in aggregate.slt) should work even in the absence of the AggregateExprSetMonotonicity concept. This feature should actually be already tested by other tests (ones exercising correct handling of uniqueness constraints in equivalence properties). We will double check this, and if there are no problems, remove the redundant tests. If we discover any bugs, we will add the fix into this PR.

Now, coming back to the original aim of the PR -- the main intent behind AggregateExprSetMonotonicity is the following:

1. To make windowing queries ordering aware in case of set-monotonic window/aggregation functions. This is the immediate benefit, and you can find the tests exercising this in window.slt.
2. To open the door to incremental computations involving filters containing comparisons between an accumulated value (e.g. a COUNT) and a fixed value (or a value with a bound). In such cases, you can do efficient calculations/pruning only when you have set-monotonicity information on the aggregate function computing the accumulated value. We plan to bring such functionality to DataFusion in the upcoming months.

Does that help?
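As a rough illustration of the second point above, the sketch below evaluates a predicate of the form `COUNT(*) <= limit` incrementally. The function name and signature are hypothetical, and this is not the planned DataFusion functionality, only a toy showing why monotonicity permits an early exit.

```rust
/// Consumes rows while maintaining a running COUNT. Because COUNT can only
/// grow as more rows arrive, the predicate `COUNT(*) <= limit` can never
/// become true again once it is violated, so we stop reading early.
/// Returns the number of rows inspected and whether the predicate held.
fn count_within_limit(rows: impl Iterator<Item = i64>, limit: usize) -> (usize, bool) {
    let mut count = 0;
    for _ in rows {
        count += 1;
        if count > limit {
            // Early exit justified by set-monotonicity of COUNT.
            return (count, false);
        }
    }
    (count, true)
}

fn main() {
    // Only 4 of the one million rows need to be inspected.
    let (inspected, holds) = count_within_limit(0..1_000_000i64, 3);
    println!("inspected {inspected} rows, predicate holds: {holds}");
}
```

The same early exit would be unsound for an aggregate without a monotonicity guarantee, such as an average, because later rows could bring the value back within the bound.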

berkaysynnada · 2025-01-30T07:49:39Z

@alamb could you take a final look?

ozankabak · 2025-01-30T13:31:48Z

This now includes the optimization for single-row outputs, windowing operations with set-monotonic functions, and it lays the foundational machinery for more sophisticated optimizations based on expressions involving functions with set-monotonicity properties.

I am quite happy with the final state of this PR. Once @alamb confirms there are no concerns left, I will merge.

alamb · 2025-01-30T14:16:38Z

I will try and give it a good look later today

alamb

Thanks! I looked at the plans carefully and they look ok, but I am not sure the tests in "Set-Monotonic Window Aggregate functions can output results in order" are really testing the monotonic aggregate functions (they seem to be missing an ORDER BY).

alamb · 2025-01-30T21:44:55Z

datafusion/sqllogictest/test_files/aggregate.slt

@@ -6203,3 +6203,20 @@ physical_plan
 14)--------------PlaceholderRowExec
 15)------------ProjectionExec: expr=[1 as id, 2 as foo]
 16)--------------PlaceholderRowExec
+
+# SortExec is removed if it is coming after one-row producing AggregateExec's having an empty group by expression


alamb · 2025-01-30T21:47:19Z

datafusion/sqllogictest/test_files/window.slt

-# physical plan should contain SortExec.
+# Top level sort is pushed down through BoundedWindowAggExec as its SUM result does already satisfy the required
+# global order. The existing sort is for the second-term lexicographical ordering requirement, which is being
+# preserved also at lexicographical level during the BoundedWindowAggExec.
 query TT
 EXPLAIN SELECT c9, sum1 FROM (SELECT c9,
                       SUM(c9) OVER(ORDER BY c9 DESC) as sum1


I see -- the fact that each subsequent value in the window here has additional values added to it and Sum is increasing means the data is still sorted that way 👍
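A small sketch of this observation, assuming the summed values are non-negative and using a row-by-row running total (a simplification of the SQL default frame, which treats peer rows together). The function name is made up for this example.

```rust
/// Sorts the values in descending order (as in `ORDER BY c9 DESC`) and pairs
/// each value with the running sum up to and including that row.
fn running_sum_desc(values: &[u64]) -> Vec<(u64, u64)> {
    let mut sorted = values.to_vec();
    sorted.sort_unstable_by(|a, b| b.cmp(a));
    let mut acc = 0u64;
    sorted
        .into_iter()
        .map(|v| {
            acc += v; // every later row sums over a superset of the earlier rows
            (v, acc)
        })
        .collect()
}

fn main() {
    let rows = running_sum_desc(&[5, 3, 8]);
    // The value column is descending while the running sum is non-decreasing.
    assert!(rows.windows(2).all(|w| w[0].0 >= w[1].0));
    assert!(rows.windows(2).all(|w| w[0].1 <= w[1].1));
    println!("{rows:?}");
}
```

With non-negative inputs the running sum never decreases, so an ordering requirement on the sum column is already met by the window output.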

alamb · 2025-01-30T21:50:27Z

datafusion/sqllogictest/test_files/window.slt

+set datafusion.optimizer.prefer_existing_sort = true;
+
+query TT
+EXPLAIN SELECT c1, SUM(c9) OVER(PARTITION BY c1) as sum_c9 FROM aggregate_test_100_ordered ORDER BY c1, sum_c9;


I don't think this query should depend on the particular aggregate to avoid the sort (because it only has a PARTITION BY, not an ORDER BY clause).

The plan looks fine to me, but the comments imply this is testing something related to the set monotonic aggregate functions

Likewise for the query below with OVER()

You are right -- the output of the query repeats the same value for c9 for every c1 group regardless of the particular window/aggregation function, because the frame is the whole table. So we should be able to do this optimization irrespective of set monotonicity. However, we don't just yet (using AVG instead of SUM reveals this).

We will fix this with a follow-on PR early next week and move these tests elsewhere with that PR.

I missed that 🤦‍♂️ We need a frame [unbounded-current row] to test these monotonic functions.

I'm tracking this testing issue, and will fix it
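The difference between the two frame shapes can be sketched as follows. All function names here are hypothetical, and the aggregates work on plain slices rather than on DataFusion window state.

```rust
fn sum(v: &[i64]) -> f64 {
    v.iter().sum::<i64>() as f64
}
fn avg(v: &[i64]) -> f64 {
    sum(v) / v.len() as f64
}

/// Frame covering the whole partition: every row sees the same input set,
/// so every row gets the same output regardless of the aggregate.
fn whole_partition(values: &[i64], agg: fn(&[i64]) -> f64) -> Vec<f64> {
    vec![agg(values); values.len()]
}

/// Frame from the start of the partition to the current row: each row sees
/// a superset of the previous row's input.
fn cumulative(values: &[i64], agg: fn(&[i64]) -> f64) -> Vec<f64> {
    (1..=values.len()).map(|n| agg(&values[..n])).collect()
}

fn non_decreasing(v: &[f64]) -> bool {
    v.windows(2).all(|w| w[0] <= w[1])
}

fn main() {
    let values: [i64; 3] = [4, 1, 3];
    // Constant output: trivially ordered even for AVG.
    assert!(non_decreasing(&whole_partition(&values, avg)));
    // Growing frame with non-negative inputs: SUM stays ordered.
    assert!(non_decreasing(&cumulative(&values, sum)));
    // Growing frame with AVG: the ordering is lost.
    assert!(!non_decreasing(&cumulative(&values, avg)));
}
```

Only the growing frame distinguishes set-monotonic aggregates from the rest, which is why a test with a whole-partition frame does not exercise the new property.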

mertak-synnada added 16 commits January 16, 2025 15:21

add monotonic function definitions for aggregate expressions

a2919b6

fix benchmark results

14109e6

set prefer_existing_sort to true in sqllogictests

b3d75ba

set prefer_existing_sort to true in sqllogictests

549502e

fix typo

623e0c5

Merge branch 'refs/heads/apache_main' into feature/monotonic-sets

6a9d24e

# Conflicts: # datafusion/core/src/physical_optimizer/enforce_sorting.rs # datafusion/core/src/physical_optimizer/test_utils.rs

re-add test_utils.rs changes to the new file

53ee3de

clone input with Arc

97d8951

Merge branch 'refs/heads/apache_main' into feature/monotonic-sets

cc33031

Merge branch 'refs/heads/apache_main' into feature/monotonic-sets

41d9430

# Conflicts: # datafusion/core/src/physical_optimizer/enforce_sorting.rs # datafusion/physical-optimizer/src/test_utils.rs

inject aggr expr indices

e988dcf

separate stubs and count_udafs

remove redundant file

906245e

add Sum monotonicity

475fe2d

change monotonicity to return an Enum rather than Option<bool> fix indices re-add monotonicity tests

fix sql logic tests

57e000e

fix sql logic tests

ca57f46

Merge branch 'refs/heads/apache_main' into feature/monotonic-sets

6cf9644

# Conflicts: # datafusion/core/tests/physical_optimizer/enforce_sorting.rs

github-actions bot added the logical-expr, physical-expr, optimizer, core, sqllogictest, and functions labels Jan 24, 2025

mertak-synnada commented Jan 24, 2025

update docs

072e6ef

2010YOUY01 reviewed Jan 25, 2025

alamb reviewed Jan 26, 2025

Merge branch 'apache_main' into feature/monotonic-sets

7d62cb0

github-actions bot removed the optimizer label Jan 28, 2025

berkaysynnada added 2 commits January 28, 2025 16:20

review part 1

491aabe

fix the tests

972c56f

berkaysynnada added 2 commits January 29, 2025 15:53

Update mod.rs

29af731

remove unnecessary computations

1f02953

berkaysynnada force-pushed the feature/monotonic-sets branch from f1777ef to 1f02953 Compare January 29, 2025 13:04

berkaysynnada added 2 commits January 29, 2025 16:29

remove index calc

79dd942

Update mod.rs

247d5fe

ozankabak reviewed Jan 29, 2025

ozankabak and others added 2 commits January 29, 2025 17:26

Apply suggestions from code review

16bdac4

add slt

1875336

berkaysynnada force-pushed the feature/monotonic-sets branch from 6b90eba to 1875336 Compare January 29, 2025 14:29

ozankabak approved these changes Jan 29, 2025

alamb reviewed Jan 29, 2025

alamb changed the title ~~Feature: Monotonic Sets~~ Feature: AggregateMonotonicity Jan 29, 2025

alamb reviewed Jan 29, 2025

datafusion/core/tests/physical_optimizer/enforce_sorting.rs

alamb reviewed Jan 29, 2025

datafusion/core/tests/physical_optimizer/enforce_sorting.rs

berkaysynnada added 2 commits January 30, 2025 10:44

remove aggregate changes, tests already give expected results

ba7b94f

fix clippy

2152b7f

berkaysynnada and others added 2 commits January 30, 2025 14:56

remove one row sorts

7822613

Improve comments

5e9b2db

ozankabak force-pushed the feature/monotonic-sets branch from d7e3135 to 5e9b2db Compare January 30, 2025 12:54

Use a short name for set monotonicity

54d62d6

alamb approved these changes Jan 30, 2025

Merge branch 'main' into feature/monotonic-sets

1146811

ozankabak merged commit 48a28af into apache:main Jan 31, 2025
25 checks passed

berkaysynnada mentioned this pull request Jan 31, 2025

Feature Add Monotonic Definition synnada-ai/datafusion-upstream#59

Closed

berkaysynnada mentioned this pull request Feb 21, 2025

Window Functions Order Conservation -- Follow-up On Set Monotonicity #14813

Merged

berkaysynnada mentioned this pull request Mar 16, 2025

Blog for DataFusion 46.0.0 #15053

Closed
