ENH: Implement pandas.read_iceberg #61383


Open · wants to merge 29 commits into base: main
Conversation

datapythonista
Member

@datapythonista datapythonista added the IO Data IO issues that don't fit into a more specific label label Apr 30, 2025
@datapythonista
Member Author

There is a test failure. For some reason PyIceberg is able to find the namespace and the table when using:

pyiceberg.catalog.load_catalog(**{"uri": f"sqlite:///{path}/catalog.sqlite"})

but not when using:

pyiceberg.catalog.load_catalog("pandas_tests_catalog")

with config file ~/.pyiceberg:

catalog:
  pandas_tests_catalog:
    type: sql
    uri: sqlite:///path/catalog.sqlite

I need to research more into what the problem is, since this should work. Other than that, this should be ready to merge.

@IsaacWarren

Hi @datapythonista, I took a look at the failing tests, and from what I can tell it looks like writing the ~/.pyiceberg.yaml file causes subsequent tests to fail. I'm not really sure why, though, and I can't reproduce it locally with pyiceberg 0.9. I have a ~/.pyiceberg.yaml file containing

  catalog:
    pandas_tests_catalog:
      uri: sqlite:////tmp/iceberg_catalog/catalog.sqlite

and a test file running

from pyiceberg import catalog

catalog = catalog.load_catalog(None, **{"uri": "sqlite:////tmp/iceberg_catalog/catalog.sqlite"})
print(catalog.list_namespaces())

just fine. I did see that the line removing the ~/.pyiceberg.yaml was commented out, was there a reason for that?

@datapythonista
Member Author

Thanks for giving it a try. I just commented out that line to check locally that the file contained what I expected after running the test; I forgot to uncomment it.

Passing the uri to load_catalog worked fine locally; I only had problems when passing a catalog name to it (which requires the config file with the uri). I'll have another look. I thought I could be missing something obvious, but I guess it's something more complicated.

@IsaacWarren

Oh ok, I just tried passing the catalog name with a config file

from pyiceberg import catalog
from pyiceberg.schema import Schema

catalog = catalog.load_catalog("pandas_tests_catalog")
catalog.create_namespace_if_not_exists("default")
catalog.create_table_if_not_exists("default.test_table", schema=Schema())

and it gave me an error about there not being a default path because warehouse isn't set. Maybe try setting warehouse for pandas_tests_catalog in ~/.pyiceberg.yaml? When I added that locally, it started working.
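For reference, a config along those lines might look like this (a sketch; the warehouse path is an assumption about the local layout, not taken from the PR):

```yaml
catalog:
  pandas_tests_catalog:
    type: sql
    uri: sqlite:////tmp/iceberg_catalog/catalog.sqlite
    warehouse: file:///tmp/iceberg_catalog
```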

@datapythonista
Member Author

I tried providing warehouse, but that didn't work.

I've been debugging, and it seems the problem is that whether the catalog is loaded by a URI or by a name, the query that retrieves namespaces filters by catalog name, which will be default when the catalog is loaded with a URI and no name. I'm still having an issue after making the name consistent for our tests, but I hope I can get a fix soon.

Thanks for the help!

@datapythonista
Member Author

Tests should be fixed now. It was a bit tricky, since pyiceberg loads the config when the module is imported, not when the catalog is loaded, which meant the config file was not always loaded.
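The import-time pitfall can be illustrated without pyiceberg itself (the module and variable names below are made up for the demonstration): a module that snapshots its configuration when it is first executed will not see config written afterwards unless it is re-imported or reloaded.

```python
import os
import sys
import types

os.environ.pop("DEMO_URI", None)  # start from a clean environment

def import_demo_module():
    """Simulate importing a library that reads its config at import time."""
    mod = types.ModuleType("demo_catalog")
    # This line runs once, at "import" time, like a library reading its
    # config file when the module is first loaded.
    exec("import os\nCONFIG = os.environ.get('DEMO_URI', 'unset')", mod.__dict__)
    sys.modules["demo_catalog"] = mod
    return mod

mod = import_demo_module()                      # config is read here
os.environ["DEMO_URI"] = "sqlite:///catalog.sqlite"
print(mod.CONFIG)                               # "unset": config was set too late

mod = import_demo_module()                      # re-"import" picks it up
print(mod.CONFIG)                               # "sqlite:///catalog.sqlite"
```

This is why, in the tests, the config file has to exist before pyiceberg is first imported rather than just before load_catalog is called.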

@datapythonista
Member Author

For some reason pyiceberg used to put an upper bound on its dependencies' versions. They have stopped doing that now, but for the minimum pyiceberg version it was not possible to be compatible with all of our minimum dependency versions. I relaxed the minimum versions of fsspec, s3fs and gcsfs so the minimum-versions environment can be resolved with pyiceberg.

@mroeschke
Member

Curious if there is an open issue discussing including this new feature.

(FWIW I did like your prior IO registration PDEP that would have made this easier to externally implement)

@datapythonista
Member Author

> Curious if there is an open issue discussing including this new feature.

I think only the discussions you are aware of: bodo-ai/Bodo-Pandas-Collaboration#9 and the discussions in the calls that I know of.

I had a look at using read_sql instead of read_iceberg, and to me it feels like that API would be too difficult to use. Considering the popularity of Iceberg, and that we already have dedicated connectors for much less popular formats such as Feather, SPSS... I found this the most reasonable implementation. I'm happy to try using the code in this PR behind read_sql instead if there is interest, but it also feels like our code would be more complex, so personally I don't see an advantage.

I'm also happy to revisit PDEP-9. There weren't objections to the general idea that I remember; the main blocker was that some people weren't happy that connectors could register under any name they wanted, and I don't think there is a good solution to that. To me it's not a problem, since in the end it is the user who decides which Python dependencies are installed. In any case, to me it makes sense to move forward with this Iceberg connector, and it would surely be a good candidate to move out as a third-party connector, along with many others, if we ever implement PDEP-9.



@contextmanager
def create_catalog(catalog_name_in_pyiceberg_config=None):
Member


Does it make sense for this to be a pytest fixture? It could be parametrized over different catalog names by default, and could use the tmpdir fixture to handle the temporary directory automatically.

Member Author


That was my first implementation, but I couldn't make a fixture remove the files when the test finishes. That's why I implemented it as a context manager.

I'll check again; I didn't try with the tmpdir fixture, but it may not be easy.
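A minimal sketch of such a fixture (fixture name, config keys and paths are illustrative assumptions, not taken from the PR): tmp_path gives each test a fresh directory that pytest removes afterwards, so no manual cleanup is needed, and params covers both the "by URI" and "by name" cases.

```python
import pytest

@pytest.fixture(params=[None, "pandas_tests_catalog"])
def iceberg_catalog(request, tmp_path, monkeypatch):
    catalog_name = request.param
    uri = f"sqlite:///{tmp_path}/catalog.sqlite"
    if catalog_name is not None:
        # Write a per-test config file and point pyiceberg at it via
        # PYICEBERG_HOME, so the real ~/.pyiceberg.yaml is never touched
        # (assumes pyiceberg honors the PYICEBERG_HOME environment variable).
        (tmp_path / ".pyiceberg.yaml").write_text(
            "catalog:\n"
            f"  {catalog_name}:\n"
            "    type: sql\n"
            f"    uri: {uri}\n"
            f"    warehouse: file://{tmp_path}\n"
        )
        monkeypatch.setenv("PYICEBERG_HOME", str(tmp_path))
    yield catalog_name, uri
    # No rmtree needed: pytest cleans up tmp_path after the test.
```

Whether this sidesteps the cleanup problem depends on when pyiceberg reads its config, so it may still need the import-ordering workaround discussed above.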

scan_properties: dict[str, Any] | None = None,
) -> DataFrame:
"""
Read an Apache Iceberg table into a pandas DataFrame.
Member


Might be good to flag as experimental so if we revisit and implement the IO plugin model we can pivot this to that model a little quicker.

Member Author


Makes sense, thanks for all the feedback.
