
[SPARK-48710][PYTHON] Use NumPy 2.0 compatible types #47083


Closed
wants to merge 5 commits

Conversation

codesorcery
Contributor

@codesorcery codesorcery commented Jun 25, 2024

What changes were proposed in this pull request?

  • Replace NumPy types removed in NumPy 2.0 with their equivalent counterparts
  • Make tests compatible with the new __repr__ of numerical scalars

Why are the changes needed?

PySpark references some code which was removed with NumPy 2.0:

  • np.NaN was removed; it should be replaced with np.nan
  • np.string_ was removed; it is an alias for np.bytes_
  • np.float_ was removed; it is defined the same as np.double
  • np.unicode_ was removed; it is an alias for np.str_

NumPy 2.0 changed the __repr__ of numerical scalars to contain type information (e.g. np.int32(3) instead of 3). The old behavior can be enabled by setting numpy.printoptions(legacy="1.25") (or the older 1.21 and 1.13 legacy modes). There are multiple tests and doctests that rely on the old behavior.
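
As a quick illustration (not part of the diff; the variable names below are made up), the substitutions are drop-in replacements:

```
import numpy as np

# Aliases removed in NumPy 2.0 and the equivalents that exist in both 1.x and 2.x:
missing_value = np.nan    # was np.NaN
bytes_scalar = np.bytes_  # was np.string_
double_alias = np.double  # was np.float_
str_scalar = np.str_      # was np.unicode_
```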

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Tests for modules pyspark-connect, pyspark-core, pyspark-errors, pyspark-mllib, pyspark-pandas, pyspark-sql, pyspark-resource, pyspark-testing were executed in a local venv with numpy==2.0.0 installed.

Was this patch authored or co-authored using generative AI tooling?

No.

@codesorcery codesorcery changed the title Spark [SPARK-48710][PYTHON] Use NumPy 2.0 compatible types [SPARK-48710][PYTHON] Use NumPy 2.0 compatible types Jun 25, 2024
@allisonwang-db
Contributor

cc @itholic

@@ -176,7 +176,7 @@ def as_spark_type(
return None
return types.ArrayType(element_type)
# BinaryType
elif tpe in (bytes, np.character, np.bytes_, np.string_):
Contributor

@itholic itholic Jun 27, 2024


qq: why do we remove np.string_?

Contributor


Oh, nvm. I just checked the PR description.

@itholic
Contributor

itholic commented Jun 27, 2024

Let's use the default PR template:

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?

@codesorcery
Contributor Author

@itholic thanks for reviewing!
I added some small changes to the tests to make them compatible with NumPy 2.0. The PR description is updated accordingly and formatted according to the default template. I couldn't yet get the pyspark-ml tests running locally, so those tests are not yet verified against NumPy 2.0.

Maybe it makes sense to update the GitHub jobs to test with both the lowest supported (i.e. 1.21) and the latest (i.e. 2.0.0) NumPy versions?

@itholic
Contributor

itholic commented Jun 30, 2024

Oh, okay, it seems NumPy released a new major version recently (2024-06-17): Release Note.

@HyukjinKwon Maybe we should upgrade the minimum supported NumPy version to 2.0.0, as we did for Pandas?

Also cc @zhengruifeng, who has worked on a similar PR, #42944.

@HyukjinKwon
Member

I think that's too aggressive. NumPy is also used in Spark ML and many other dependent projects.

@HyukjinKwon
Member

cc @WeichenXu123

@HyukjinKwon
Member

Seems fine from a cursory look but let's make the CI happy :-).

@WeichenXu123
Contributor

CI passed!

Contributor

@zhengruifeng zhengruifeng left a comment


Have we tested these changes against NumPy 2.0?

@@ -5370,6 +5370,17 @@ def _test() -> None:
import tempfile
from pyspark.core.context import SparkContext

try:
Contributor


Shall we add a TODO that once we upgrade the minimum version to >= 2.0, we can remove this try-except and update the doctests?

Contributor Author


There are also multiple existing tests with:

try:
    # Numpy 1.14+ changed it's string format.
    numpy.set_printoptions(legacy="1.13")
except TypeError:
    pass

I'd guess these should be considered for updating before that (since the minimum supported NumPy version is currently 1.21).


os.chdir(os.environ["SPARK_HOME"])

if Version(np.__version__) >= Version("2"):
Contributor


ditto

@codesorcery
Contributor Author

There are some linter failures: https://github.com/codesorcery/spark/actions/runs/9708335267/job/26795030107

Added # type: ignore[arg-type] to the affected lines, since legacy="1.25" is only implemented in numpy>=2 and we're checking that the code path is only executed when numpy>=2 is installed.
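
For reference, a small sketch of the guarded call this refers to (mirroring the diff shown further down in this review, just isolated here):

```
import numpy as np
from packaging.version import Version

if Version(np.__version__) >= Version("2"):
    # `legacy="1.25"` only exists in numpy>=2, so type checkers running against the
    # minimum supported NumPy stubs reject it; the runtime check above keeps the call safe.
    np.set_printoptions(legacy="1.25")  # type: ignore[arg-type]
```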

@codesorcery
Contributor Author

Have we tested these changes against NumPy 2.0?

I've tested it on my local workstation, as written in the PR description. There aren't any CI jobs testing with NumPy 2.0 yet.
To make sure no calls are made to code removed in NumPy 2, we could also use ruff in dev/lint-python, since it can check for usage of NumPy 2 deprecations.
(ruff can also be used as a faster replacement for both flake8 and black, but that should be out of scope here.)

@HyukjinKwon
Member

The change seems fine to me. @codesorcery do you mind creating a PR to set the upper bound in setup.py, like numpy<2? I think the NumPy release will affect 3.5 users too.

@codesorcery
Contributor Author

@codesorcery do you mind creating a PR to set the upper bound in setup.py, like numpy<2? I think the NumPy release will affect 3.5 users too.

@HyukjinKwon you mean for branches where this PR doesn't get applied? Otherwise, most Python package managers and tools like Renovate won't allow users to update to NumPy 2 with this bound set.
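
For illustration, a rough sketch of what such an upper bound could look like in the setup.py extras (the >=1.15 lower bound is taken from the follow-up commit's title, and the extras names from its description; the real file lists more packages per extra):

```
# Illustrative excerpt only -- not the actual setup.py contents.
_numpy_bounded = "numpy>=1.15,<2"

extras_require = {
    "ml": [_numpy_bounded],
    "mllib": [_numpy_bounded],
    "sql": [_numpy_bounded],
    "pandas_on_spark": [_numpy_bounded],
    "connect": [_numpy_bounded],
}
# This dict would then be passed as setup(..., extras_require=extras_require).
```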

Maybe also of interest: there is a list tracking the compatibility status of Python libraries with NumPy 2.0 at numpy/numpy#26191

@HyukjinKwon
Member

Yes, branch-3.5.

@codesorcery
Contributor Author

codesorcery commented Jul 2, 2024

@HyukjinKwon here's the PR for branch-3.5 limiting numpy<2: #47175 (also auto-linked by GitHub above)

@jakirkham

It should be possible to write code that is compatible with both NumPy 1 and 2. That is what most projects are doing.

I would look over the migration guide. There are more suggestions in the release notes.

As already noted, ruff's NumPy 2 plugin can be a great help in migrating code.

cc @rgommers (for awareness)

HyukjinKwon pushed a commit that referenced this pull request Jul 3, 2024
…1.15,<2)

### What changes were proposed in this pull request?
 * Add a constraint for `numpy<2` to the PySpark package

### Why are the changes needed?

PySpark references some code which was removed with NumPy 2.0. Thus, if `numpy>=2` is installed, executing PySpark may fail.

#47083 updates the `master` branch to be compatible with NumPy 2. This PR adds a version bound for older releases, where it won't be applied.

### Does this PR introduce _any_ user-facing change?
NumPy will be limited to `numpy<2` when installing `pyspark` with extras `ml`, `mllib`, `sql`, `pandas_on_spark` or `connect`.

### How was this patch tested?
Via existing CI jobs.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47175 from codesorcery/SPARK-48710-numpy-upper-bound.

Authored-by: Patrick Marx <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Jul 3, 2024
…1.15,<2)

(cherry picked from commit 44eba46)
Signed-off-by: Hyukjin Kwon <[email protected]>
@HyukjinKwon
Member

Merged to master.

@rgommers

rgommers commented Jul 3, 2024

@codesorcery @HyukjinKwon I noticed that a review comment here said np.set_printoptions is used in the tests and can be updated, but this PR uses it in pyspark/core/ rather than in tests. np.set_printoptions changes global state within numpy, and doing that from within another library is a big no-no usually. Could you please consider changing this?

@codesorcery
Contributor Author

but this PR uses it in pyspark/core/ rather than in tests

@rgommers np.set_printoptions is only called inside def _test() -> None: in these modules, which set up and run the doctests. It's not called from any function that is executed when PySpark is used as a library.
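
If the global call ever became a concern anyway, one alternative (just a sketch with a hypothetical helper name, not what this PR does) would be to scope the option with numpy's printoptions context manager inside the doctest runner:

```
import doctest

import numpy as np
from packaging.version import Version


def _run_doctests_with_legacy_repr(module) -> None:
    # Only the doctests see the pre-2.0 scalar repr; the global print options
    # are restored automatically when the context manager exits.
    if Version(np.__version__) >= Version("2"):
        with np.printoptions(legacy="1.25"):
            doctest.testmod(module)
    else:
        doctest.testmod(module)
```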

@rgommers

rgommers commented Jul 3, 2024

Ah there are tests within source files, I missed that. Sorry for the noise!

gaecoli pushed a commit to gaecoli/spark that referenced this pull request Jul 10, 2024
…1.15,<2)

ericm-db pushed a commit to ericm-db/spark that referenced this pull request Jul 10, 2024
### What changes were proposed in this pull request?
 * Replace NumPy types removed in NumPy 2.0 with their equivalent counterparts
 * Make tests compatible with the new `__repr__` of numerical scalars

### Why are the changes needed?

PySpark references some code which was removed with NumPy 2.0:
 * `np.NaN` was removed; it should be replaced with `np.nan`
 * `np.string_` was removed; [it is an alias for](https://github.com/numpy/numpy/blob/v1.26.5/numpy/__init__.pyi#L3134) `np.bytes_`
 * `np.float_` was removed; [it is defined the same as](https://github.com/numpy/numpy/blob/v1.26.5/numpy/__init__.pyi#L3042-3043) `np.double`
 * `np.unicode_` was removed; [it is an alias for](https://github.com/numpy/numpy/blob/v1.26.5/numpy/__init__.pyi#L3148) `np.str_`

NumPy 2.0 changed the `__repr__` of numerical scalars to contain type information (e.g. `np.int32(3)` instead of `3`). Old behavior can be enabled by setting `numpy.printoptions(legacy="1.25")` (or the older `1.21` and `1.13` legacy modes). There are multiple tests and doctests that rely on the old behavior.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Tests for modules `pyspark-connect`, `pyspark-core`, `pyspark-errors`, `pyspark-mllib`, `pyspark-pandas`, `pyspark-sql`, `pyspark-resource`, `pyspark-testing` were executed in a local venv with `numpy==2.0.0` installed.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47083 from codesorcery/SPARK-48710.

Authored-by: Patrick Marx <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
Member

@dongjoon-hyun dongjoon-hyun left a comment


Hi, @codesorcery, @HyukjinKwon, @zhengruifeng, @itholic.

This PR seems to accidentally introduce a numpy dependency into the core/rdd module.

```
Starting test(python3): pyspark.core.rdd (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/6da9b910-0500-479c-85ef-89e4bd085853/python3__pyspark.core.rdd__oldy8rob.log)
<frozen runpy>:128: RuntimeWarning: 'pyspark.core.rdd' found in sys.modules after import of package 'pyspark.core', but prior to execution of 'pyspark.core.rdd'; this may result in unpredictable behaviour
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/dongjoon/APACHE/spark-merge/python/pyspark/core/rdd.py", line 5400, in <module>
    _test()
    ~~~~~^^
  File "/Users/dongjoon/APACHE/spark-merge/python/pyspark/core/rdd.py", line 5376, in _test
    import numpy as np
ModuleNotFoundError: No module named 'numpy'
```

The core module should not have this dependency, even in the test code.

@dongjoon-hyun
Member

If there are numpy-related test cases, can we move them out of the rdd module?

IIUC, this problem was already pointed out by @rgommers one month ago in this PR.

try:
# Numpy 2.0+ changed its string format,
# adding type information to numeric scalars.
import numpy as np
Member


It actually does a try-catch here, but I think there's some issue related to the import.

Member


Let's move this to sql and ml only for now because both modules use numpy.

Member


Thank you.

Yes, the problem is that the try-catch didn't handle ModuleNotFoundError, which causes failures like the following.

```
$ python/run-tests.py --python-executables python3 --modules pyspark-core
...
  File "/Users/dongjoon/APACHE/spark-merge/python/pyspark/core/rdd.py", line 5376, in _test
    import numpy as np
ModuleNotFoundError: No module named 'numpy'
```

Member


Let's move this to sql and ml only for now because both modules use numpy.

+1 for moving.

if Version(np.__version__) >= Version("2"):
# `legacy="1.25"` only available in `nump>=2`
np.set_printoptions(legacy="1.25") # type: ignore[arg-type]
except TypeError:
Member


Ah, yeah let's catch ImportError ....
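
A minimal sketch of what catching the import error could look like (illustrative only; the actual follow-up may differ, e.g. by also moving the block out of pyspark-core):

```
try:
    # NumPy is optional for pyspark-core, so a missing module must not fail the test run.
    import numpy as np
    from packaging.version import Version

    if Version(np.__version__) >= Version("2"):
        # Restore the pre-2.0 scalar repr that the doctests expect.
        np.set_printoptions(legacy="1.25")  # type: ignore[arg-type]
except ImportError:
    pass
```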


HyukjinKwon pushed a commit that referenced this pull request Jul 30, 2024
…ptional dependencies

### What changes were proposed in this pull request?

This is a follow-up of #47083 to recover PySpark RDD tests.

### Why are the changes needed?

`PySpark Core` test should not fail on optional dependencies.

**BEFORE**
```
$ python/run-tests.py --python-executables python3 --modules pyspark-core
...
  File "/Users/dongjoon/APACHE/spark-merge/python/pyspark/core/rdd.py", line 5376, in _test
    import numpy as np
ModuleNotFoundError: No module named 'numpy'
```

**AFTER**
```
$ python/run-tests.py --python-executables python3 --modules pyspark-core
...
Tests passed in 189 seconds

Skipped tests in pyspark.tests.test_memory_profiler with python3:
    test_assert_vanilla_mode (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_assert_vanilla_mode) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_aggregate_in_pandas (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_aggregate_in_pandas) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_clear (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_clear) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_cogroup_apply_in_arrow (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_cogroup_apply_in_arrow) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_cogroup_apply_in_pandas (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_cogroup_apply_in_pandas) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_group_apply_in_arrow (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_group_apply_in_arrow) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_group_apply_in_pandas (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_group_apply_in_pandas) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_map_in_pandas_not_supported (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_map_in_pandas_not_supported) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_pandas_udf (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_pandas_udf) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_pandas_udf_iterator_not_supported (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_pandas_udf_iterator_not_supported) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_pandas_udf_window (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_pandas_udf_window) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_udf (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_udf) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_udf_multiple_actions (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_udf_multiple_actions) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_udf_registered (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_udf_registered) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_udf_with_arrow (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_udf_with_arrow) ... skipped 'Must have memory-profiler installed.'
    test_profilers_clear (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_profilers_clear) ... skipped 'Must have memory-profiler installed.'
    test_code_map (pyspark.tests.test_memory_profiler.MemoryProfilerTests.test_code_map) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler (pyspark.tests.test_memory_profiler.MemoryProfilerTests.test_memory_profiler) ... skipped 'Must have memory-profiler installed.'
    test_profile_pandas_function_api (pyspark.tests.test_memory_profiler.MemoryProfilerTests.test_profile_pandas_function_api) ... skipped 'Must have memory-profiler installed.'
    test_profile_pandas_udf (pyspark.tests.test_memory_profiler.MemoryProfilerTests.test_profile_pandas_udf) ... skipped 'Must have memory-profiler installed.'
    test_udf_line_profiler (pyspark.tests.test_memory_profiler.MemoryProfilerTests.test_udf_line_profiler) ... skipped 'Must have memory-profiler installed.'

Skipped tests in pyspark.tests.test_rdd with python3:
    test_take_on_jrdd_with_large_rows_should_not_cause_deadlock (pyspark.tests.test_rdd.RDDTests.test_take_on_jrdd_with_large_rows_should_not_cause_deadlock) ... skipped 'NumPy or Pandas not installed'

Skipped tests in pyspark.tests.test_serializers with python3:
    test_statcounter_array (pyspark.tests.test_serializers.NumPyTests.test_statcounter_array) ... skipped 'NumPy not installed'
    test_serialize (pyspark.tests.test_serializers.SciPyTests.test_serialize) ... skipped 'SciPy not installed'

Skipped tests in pyspark.tests.test_worker with python3:
    test_memory_limit (pyspark.tests.test_worker.WorkerMemoryTest.test_memory_limit) ... skipped "Memory limit feature in Python worker is dependent on Python's 'resource' module on Linux; however, not found or not on Linux."
    test_python_segfault (pyspark.tests.test_worker.WorkerSegfaultNonDaemonTest.test_python_segfault) ... skipped 'SPARK-46130: Flaky with Python 3.12'
    test_python_segfault (pyspark.tests.test_worker.WorkerSegfaultTest.test_python_segfault) ... skipped 'SPARK-46130: Flaky with Python 3.12'
```

### Does this PR introduce _any_ user-facing change?

No. The failure happens during testing.

### How was this patch tested?

Pass the CIs and do the manual test without optional dependencies.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47526 from dongjoon-hyun/SPARK-48710.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
fusheng9399 pushed a commit to fusheng9399/spark that referenced this pull request Aug 6, 2024
…ptional dependencies

szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Aug 7, 2024
…1.15,<2)

attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
…ptional dependencies

himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024
…ptional dependencies
