
[SPARK-48710][PYTHON] Use NumPy 2.0 compatible types #47083


Closed
wants to merge 5 commits

Conversation

codesorcery
Contributor

@codesorcery codesorcery commented Jun 25, 2024

What changes were proposed in this pull request?

  • Replace NumPy types removed in NumPy 2.0 with their equivalent counterparts
  • Make tests compatible with the new __repr__ of numerical scalars

Why are the changes needed?

PySpark references some code which was removed with NumPy 2.0:

  • np.NaN was removed; it should be replaced with np.nan
  • np.string_ was removed; it is an alias for np.bytes_
  • np.float_ was removed; it is defined the same as np.double
  • np.unicode_ was removed; it is an alias for np.str_

NumPy 2.0 changed the __repr__ of numerical scalars to contain type information (e.g. np.int32(3) instead of 3). The old behavior can be enabled by setting numpy.printoptions(legacy="1.25") (or the older 1.21 and 1.13 legacy modes). There are multiple tests and doctests that rely on the old behavior.
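
As a quick illustration (not part of the diff; the variable names below are made up), the substitutions are drop-in replacements:

```
import numpy as np

# Aliases removed in NumPy 2.0 and the equivalents that exist in both 1.x and 2.x:
missing_value = np.nan    # was np.NaN
bytes_scalar = np.bytes_  # was np.string_
double_alias = np.double  # was np.float_
str_scalar = np.str_      # was np.unicode_
```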

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Tests for modules pyspark-connect, pyspark-core, pyspark-errors, pyspark-mllib, pyspark-pandas, pyspark-sql, pyspark-resource, pyspark-testing were executed in a local venv with numpy==2.0.0 installed.

Was this patch authored or co-authored using generative AI tooling?

No.

@codesorcery codesorcery changed the title Spark [SPARK-48710][PYTHON] Use NumPy 2.0 compatible types [SPARK-48710][PYTHON] Use NumPy 2.0 compatible types Jun 25, 2024
@allisonwang-db
Contributor

cc @itholic

@@ -176,7 +176,7 @@ def as_spark_type(
return None
return types.ArrayType(element_type)
# BinaryType
elif tpe in (bytes, np.character, np.bytes_, np.string_):
Contributor

@itholic itholic Jun 27, 2024


qq: why do we remove np.string_?

Contributor


Oh, nvm. I just checked the PR description.

@itholic
Contributor

itholic commented Jun 27, 2024

Let's use the default PR template:

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?

@codesorcery
Contributor Author

@itholic thanks for reviewing!
I added some small changes to the tests to make them compatible with NumPy 2.0. The PR description is updated accordingly and formatted according to the default template. I couldn't yet get the pyspark-ml tests running locally, so those tests are not yet verified against NumPy 2.0.

Maybe it makes sense to update the GitHub jobs to test with both the lowest supported (i.e. 1.21) and the latest (i.e. 2.0.0) NumPy versions?

@itholic
Contributor

itholic commented Jun 30, 2024

Oh, okay, it seems NumPy released a new major version recently (2024-06-17): Release Note.

@HyukjinKwon Maybe we should upgrade the minimum supported NumPy version to 2.0.0, as we did for Pandas?

Also cc @zhengruifeng, who has worked on a similar PR, #42944.

@HyukjinKwon
Member

I think that's too aggressive. NumPy is also used in Spark ML and many other dependent projects.

@HyukjinKwon
Member

cc @WeichenXu123

@HyukjinKwon
Member

Seems fine from a cursory look but let's make the CI happy :-).

@WeichenXu123
Contributor

CI passed!

Contributor

@zhengruifeng zhengruifeng left a comment


Have we tested these changes against NumPy 2.0?

@@ -5370,6 +5370,17 @@ def _test() -> None:
import tempfile
from pyspark.core.context import SparkContext

try:
Contributor


Shall we add a TODO that once we upgrade the minimum version to >= 2.0, we can remove this try-except and update the doctests?

Contributor Author


There are also multiple existing tests with:

try:
    # Numpy 1.14+ changed it's string format.
    numpy.set_printoptions(legacy="1.13")
except TypeError:
    pass

I'd guess these should be considered for updating before that (since the minimum supported NumPy version is currently 1.21).


os.chdir(os.environ["SPARK_HOME"])

if Version(np.__version__) >= Version("2"):
Contributor


ditto

@codesorcery
Contributor Author

There are some linter failures: https://github.com/codesorcery/spark/actions/runs/9708335267/job/26795030107

Added # type: ignore[arg-type] to the affected lines, since legacy="1.25" is only implemented in numpy>=2 and we're checking that the code path is only executed when numpy>=2 is installed.
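
For reference, a small sketch of the guarded call this refers to (mirroring the diff shown further down in this review, just isolated here):

```
import numpy as np
from packaging.version import Version

if Version(np.__version__) >= Version("2"):
    # `legacy="1.25"` only exists in numpy>=2, so type checkers running against the
    # minimum supported NumPy stubs reject it; the runtime check above keeps the call safe.
    np.set_printoptions(legacy="1.25")  # type: ignore[arg-type]
```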

@codesorcery
Contributor Author

Have we tested these changes against NumPy 2.0?

I've tested it on my local workstation, as written in the PR description. There aren't any CI jobs testing with NumPy 2.0 yet.
To make sure no calls are made to code removed in NumPy 2, we could also use ruff in dev/lint-python, since it can check for usage of NumPy 2 deprecations.
(ruff can also be used as a faster replacement for both flake8 and black, but that should be out of scope here.)

@HyukjinKwon
Member

The change seems fine to me. @codesorcery do you mind creating a PR to set the upper bound in setup.py, like numpy<2? I think the NumPy release will affect 3.5 users too.

@codesorcery
Contributor Author

@codesorcery do you mind creating a PR to set the upper bound in setup.py, like numpy<2? I think the NumPy release will affect 3.5 users too.

@HyukjinKwon you mean for branches where this PR doesn't get applied? Otherwise, most Python package managers and tools like Renovate won't allow users to update to NumPy 2 with this bound set.
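
For illustration, a rough sketch of what such an upper bound could look like in the setup.py extras (the >=1.15 lower bound is taken from the follow-up commit's title, and the extras names from its description; the real file lists more packages per extra):

```
# Illustrative excerpt only -- not the actual setup.py contents.
_numpy_bounded = "numpy>=1.15,<2"

extras_require = {
    "ml": [_numpy_bounded],
    "mllib": [_numpy_bounded],
    "sql": [_numpy_bounded],
    "pandas_on_spark": [_numpy_bounded],
    "connect": [_numpy_bounded],
}
# This dict would then be passed as setup(..., extras_require=extras_require).
```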

Maybe also of interest: there is a list tracking the compatibility status of Python libraries with NumPy 2.0 at numpy/numpy#26191

@HyukjinKwon
Member

Yes, branch-3.5.

@codesorcery
Contributor Author

codesorcery commented Jul 2, 2024

@HyukjinKwon here's the PR for branch-3.5 limiting numpy<2: #47175 (also auto-linked by GitHub above)

@jakirkham

It should be possible to write code that is compatible with both NumPy 1 and 2. That is what most projects are doing.

I would look over the migration guide. There are more suggestions in the release notes.

As already noted, ruff's NumPy 2 plugin can be a great help in migrating code.

cc @rgommers (for awareness)

HyukjinKwon pushed a commit that referenced this pull request Jul 3, 2024
…1.15,<2)

### What changes were proposed in this pull request?
 * Add a constraint for `numpy<2` to the PySpark package

### Why are the changes needed?

PySpark references some code which was removed with NumPy 2.0. Thus, if `numpy>=2` is installed, executing PySpark may fail.

#47083 updates the `master` branch to be compatible with NumPy 2. This PR adds a version bound for older releases, where it won't be applied.

### Does this PR introduce _any_ user-facing change?
NumPy will be limited to `numpy<2` when installing `pyspark` with extras `ml`, `mllib`, `sql`, `pandas_on_spark` or `connect`.

### How was this patch tested?
Via existing CI jobs.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47175 from codesorcery/SPARK-48710-numpy-upper-bound.

Authored-by: Patrick Marx <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Jul 3, 2024
…1.15,<2)

(cherry picked from commit 44eba46)
Signed-off-by: Hyukjin Kwon <[email protected]>
@HyukjinKwon
Member

Merged to master.

@rgommers

rgommers commented Jul 3, 2024

@codesorcery @HyukjinKwon I noticed that a review comment here said np.set_printoptions is used in the tests and can be updated, but this PR uses it in pyspark/core/ rather than in tests. np.set_printoptions changes global state within numpy, and doing that from within another library is a big no-no usually. Could you please consider changing this?

@codesorcery
Contributor Author

but this PR uses it in pyspark/core/ rather than in tests

@rgommers np.set_printoptions is only called inside def _test() -> None: in these modules, which set up and run the doctests. It's not called from any function that is executed when PySpark is used as a library.
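
If the global call ever became a concern anyway, one alternative (just a sketch with a hypothetical helper name, not what this PR does) would be to scope the option with numpy's printoptions context manager inside the doctest runner:

```
import doctest

import numpy as np
from packaging.version import Version


def _run_doctests_with_legacy_repr(module) -> None:
    # Only the doctests see the pre-2.0 scalar repr; the global print options
    # are restored automatically when the context manager exits.
    if Version(np.__version__) >= Version("2"):
        with np.printoptions(legacy="1.25"):
            doctest.testmod(module)
    else:
        doctest.testmod(module)
```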

@rgommers

rgommers commented Jul 3, 2024

Ah there are tests within source files, I missed that. Sorry for the noise!

gaecoli pushed a commit to gaecoli/spark that referenced this pull request Jul 10, 2024
…1.15,<2)

ericm-db pushed a commit to ericm-db/spark that referenced this pull request Jul 10, 2024
### What changes were proposed in this pull request?
 * Replace NumPy types removed in NumPy 2.0 with their equivalent counterparts
 * Make tests compatible with the new `__repr__` of numerical scalars

### Why are the changes needed?

PySpark references some code which was removed with NumPy 2.0:
 * `np.NaN` was removed; it should be replaced with `np.nan`
 * `np.string_` was removed; [it is an alias for](https://github.com/numpy/numpy/blob/v1.26.5/numpy/__init__.pyi#L3134) `np.bytes_`
 * `np.float_` was removed; [it is defined the same as](https://github.com/numpy/numpy/blob/v1.26.5/numpy/__init__.pyi#L3042-3043) `np.double`
 * `np.unicode_` was removed; [it is an alias for](https://github.com/numpy/numpy/blob/v1.26.5/numpy/__init__.pyi#L3148) `np.str_`

NumPy 2.0 changed the `__repr__` of numerical scalars to contain type information (e.g. `np.int32(3)` instead of `3`). Old behavior can be enabled by setting `numpy.printoptions(legacy="1.25")` (or the older `1.21` and `1.13` legacy modes). There are multiple tests and doctests that rely on the old behavior.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Tests for modules `pyspark-connect`, `pyspark-core`, `pyspark-errors`, `pyspark-mllib`, `pyspark-pandas`, `pyspark-sql`, `pyspark-resource`, `pyspark-testing` were executed in a local venv with `numpy==2.0.0` installed.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47083 from codesorcery/SPARK-48710.

Authored-by: Patrick Marx <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
Member

@dongjoon-hyun dongjoon-hyun left a comment


Hi, @codesorcery, @HyukjinKwon, @zhengruifeng, @itholic.

This PR seems to accidentally introduce a numpy dependency into the core/rdd module.

```
Starting test(python3): pyspark.core.rdd (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/6da9b910-0500-479c-85ef-89e4bd085853/python3__pyspark.core.rdd__oldy8rob.log)
<frozen runpy>:128: RuntimeWarning: 'pyspark.core.rdd' found in sys.modules after import of package 'pyspark.core', but prior to execution of 'pyspark.core.rdd'; this may result in unpredictable behaviour
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/dongjoon/APACHE/spark-merge/python/pyspark/core/rdd.py", line 5400, in <module>
    _test()
    ~~~~~^^
  File "/Users/dongjoon/APACHE/spark-merge/python/pyspark/core/rdd.py", line 5376, in _test
    import numpy as np
ModuleNotFoundError: No module named 'numpy'
```

The core module should not have this dependency, even in the test code.

@dongjoon-hyun
Member

If there are numpy-related test cases, can we move them out of the rdd module?

IIUC, this problem was already pointed out by @rgommers one month ago in this PR.

try:
# Numpy 2.0+ changed its string format,
# adding type information to numeric scalars.
import numpy as np
Member


It actually does a try-catch here, but I think there's some issue related to the import.

Member


Let's move this to sql and ml only for now because both modules use numpy.

Member


Thank you.

Yes, the problem is that the try-catch didn't handle ModuleNotFoundError, which causes failures like the following.

```
$ python/run-tests.py --python-executables python3 --modules pyspark-core
...
  File "/Users/dongjoon/APACHE/spark-merge/python/pyspark/core/rdd.py", line 5376, in _test
    import numpy as np
ModuleNotFoundError: No module named 'numpy'
```

Member


Let's move this to sql and ml only for now because both modules use numpy.

+1 for moving.

if Version(np.__version__) >= Version("2"):
# `legacy="1.25"` only available in `nump>=2`
np.set_printoptions(legacy="1.25") # type: ignore[arg-type]
except TypeError:
Member


Ah, yeah let's catch ImportError ....
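
A minimal sketch of what catching the import error could look like (illustrative only; the actual follow-up may differ, e.g. by also moving the block out of pyspark-core):

```
try:
    # NumPy is optional for pyspark-core, so a missing module must not fail the test run.
    import numpy as np
    from packaging.version import Version

    if Version(np.__version__) >= Version("2"):
        # Restore the pre-2.0 scalar repr that the doctests expect.
        np.set_printoptions(legacy="1.25")  # type: ignore[arg-type]
except ImportError:
    pass
```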


HyukjinKwon pushed a commit that referenced this pull request Jul 30, 2024
…ptional dependencies

### What changes were proposed in this pull request?

This is a follow-up of #47083 to recover PySpark RDD tests.

### Why are the changes needed?

`PySpark Core` test should not fail on optional dependencies.

**BEFORE**
```
$ python/run-tests.py --python-executables python3 --modules pyspark-core
...
  File "/Users/dongjoon/APACHE/spark-merge/python/pyspark/core/rdd.py", line 5376, in _test
    import numpy as np
ModuleNotFoundError: No module named 'numpy'
```

**AFTER**
```
$ python/run-tests.py --python-executables python3 --modules pyspark-core
...
Tests passed in 189 seconds

Skipped tests in pyspark.tests.test_memory_profiler with python3:
    test_assert_vanilla_mode (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_assert_vanilla_mode) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_aggregate_in_pandas (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_aggregate_in_pandas) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_clear (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_clear) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_cogroup_apply_in_arrow (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_cogroup_apply_in_arrow) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_cogroup_apply_in_pandas (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_cogroup_apply_in_pandas) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_group_apply_in_arrow (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_group_apply_in_arrow) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_group_apply_in_pandas (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_group_apply_in_pandas) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_map_in_pandas_not_supported (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_map_in_pandas_not_supported) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_pandas_udf (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_pandas_udf) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_pandas_udf_iterator_not_supported (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_pandas_udf_iterator_not_supported) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_pandas_udf_window (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_pandas_udf_window) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_udf (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_udf) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_udf_multiple_actions (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_udf_multiple_actions) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_udf_registered (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_udf_registered) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_udf_with_arrow (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_udf_with_arrow) ... skipped 'Must have memory-profiler installed.'
    test_profilers_clear (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_profilers_clear) ... skipped 'Must have memory-profiler installed.'
    test_code_map (pyspark.tests.test_memory_profiler.MemoryProfilerTests.test_code_map) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler (pyspark.tests.test_memory_profiler.MemoryProfilerTests.test_memory_profiler) ... skipped 'Must have memory-profiler installed.'
    test_profile_pandas_function_api (pyspark.tests.test_memory_profiler.MemoryProfilerTests.test_profile_pandas_function_api) ... skipped 'Must have memory-profiler installed.'
    test_profile_pandas_udf (pyspark.tests.test_memory_profiler.MemoryProfilerTests.test_profile_pandas_udf) ... skipped 'Must have memory-profiler installed.'
    test_udf_line_profiler (pyspark.tests.test_memory_profiler.MemoryProfilerTests.test_udf_line_profiler) ... skipped 'Must have memory-profiler installed.'

Skipped tests in pyspark.tests.test_rdd with python3:
    test_take_on_jrdd_with_large_rows_should_not_cause_deadlock (pyspark.tests.test_rdd.RDDTests.test_take_on_jrdd_with_large_rows_should_not_cause_deadlock) ... skipped 'NumPy or Pandas not installed'

Skipped tests in pyspark.tests.test_serializers with python3:
    test_statcounter_array (pyspark.tests.test_serializers.NumPyTests.test_statcounter_array) ... skipped 'NumPy not installed'
    test_serialize (pyspark.tests.test_serializers.SciPyTests.test_serialize) ... skipped 'SciPy not installed'

Skipped tests in pyspark.tests.test_worker with python3:
    test_memory_limit (pyspark.tests.test_worker.WorkerMemoryTest.test_memory_limit) ... skipped "Memory limit feature in Python worker is dependent on Python's 'resource' module on Linux; however, not found or not on Linux."
    test_python_segfault (pyspark.tests.test_worker.WorkerSegfaultNonDaemonTest.test_python_segfault) ... skipped 'SPARK-46130: Flaky with Python 3.12'
    test_python_segfault (pyspark.tests.test_worker.WorkerSegfaultTest.test_python_segfault) ... skipped 'SPARK-46130: Flaky with Python 3.12'
```

### Does this PR introduce _any_ user-facing change?

No. The failure happens during testing.

### How was this patch tested?

Pass the CIs and do the manual test without optional dependencies.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47526 from dongjoon-hyun/SPARK-48710.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
fusheng9399 pushed a commit to fusheng9399/spark that referenced this pull request Aug 6, 2024
…ptional dependencies

szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Aug 7, 2024
…1.15,<2)

attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
…ptional dependencies

himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024
…ptional dependencies
