ENH: Implement pandas.read_iceberg #61383


Open · wants to merge 29 commits into base: main
Conversation

datapythonista
Member

@datapythonista datapythonista added the IO Data IO issues that don't fit into a more specific label label Apr 30, 2025
@datapythonista
Member Author

There is a test failure. For some reason PyIceberg is able to find the namespace and the table when using:

pyiceberg.catalog.load_catalog(**{"uri": f"sqlite:///{path}/catalog.sqlite"})

but not when using:

pyiceberg.catalog.load_catalog("pandas_tests_catalog")

with config file ~/.pyiceberg:

catalog:
  pandas_tests_catalog:
    type: sql
    uri: sqlite:///path/catalog.sqlite

I need to research more into what the problem is, since this should work. Other than that, this should be ready to merge.

@IsaacWarren

Hi @datapythonista, I took a look at the failing tests, and from what I can tell it looks like writing the ~/.pyiceberg.yaml file causes subsequent tests to fail. I'm not really sure why, though, and I can't reproduce it locally with pyiceberg 0.9. I have a ~/.pyiceberg.yaml file containing

  catalog:
    pandas_tests_catalog:
      uri: sqlite:////tmp/iceberg_catalog/catalog.sqlite

and a test file running

from pyiceberg import catalog

catalog = catalog.load_catalog(None, **{"uri": "sqlite:////tmp/iceberg_catalog/catalog.sqlite"})
print(catalog.list_namespaces())

just fine. I did see that the line removing the ~/.pyiceberg.yaml was commented out, was there a reason for that?

@datapythonista
Member Author

Thanks for giving it a try. I just commented out that line to check locally that the file contained what I expected after running the test; I forgot to uncomment it.

Passing the uri to load_catalog worked fine locally; I only had problems when passing a catalog name to it (which requires the config file with the uri). I'll have another look. I thought I could be missing something obvious, but I guess it's something more complicated.

@IsaacWarren

Oh ok, I just tried passing the catalog name with a config file

from pyiceberg import catalog
from pyiceberg.schema import Schema

catalog = catalog.load_catalog("pandas_tests_catalog")
catalog.create_namespace_if_not_exists("default")
catalog.create_table_if_not_exists("default.test_table", schema=Schema())

and it gave me an error about there not being a default path because warehouse isn't set. Maybe try setting warehouse for pandas_tests_catalog in ~/.pyiceberg.yaml? When I added that locally, it started working.
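For reference, a config along those lines might look like this (a sketch; the warehouse path is an assumption about the local layout, not taken from the PR):

```yaml
catalog:
  pandas_tests_catalog:
    type: sql
    uri: sqlite:////tmp/iceberg_catalog/catalog.sqlite
    warehouse: file:///tmp/iceberg_catalog
```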

@datapythonista
Member Author

I tried providing warehouse, but that didn't work.

I've been debugging, and it seems the problem is that whether the catalog is loaded by a URI or by a name, the query that retrieves namespaces filters by catalog name, which will be default when the catalog is loaded with a URI and no name. I'm still having an issue after making the name consistent for our tests, but I hope I can get a fix soon.

Thanks for the help!

@datapythonista
Member Author

Tests should be fixed now. It was a bit tricky, since pyiceberg loads the config when the module is imported, not when the catalog is loaded, which meant the config file was not always loaded.
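The import-time pitfall can be illustrated without pyiceberg itself (the module and variable names below are made up for the demonstration): a module that snapshots its configuration when it is first executed will not see config written afterwards unless it is re-imported or reloaded.

```python
import os
import sys
import types

os.environ.pop("DEMO_URI", None)  # start from a clean environment

def import_demo_module():
    """Simulate importing a library that reads its config at import time."""
    mod = types.ModuleType("demo_catalog")
    # This line runs once, at "import" time, like a library reading its
    # config file when the module is first loaded.
    exec("import os\nCONFIG = os.environ.get('DEMO_URI', 'unset')", mod.__dict__)
    sys.modules["demo_catalog"] = mod
    return mod

mod = import_demo_module()                      # config is read here
os.environ["DEMO_URI"] = "sqlite:///catalog.sqlite"
print(mod.CONFIG)                               # "unset": config was set too late

mod = import_demo_module()                      # re-"import" picks it up
print(mod.CONFIG)                               # "sqlite:///catalog.sqlite"
```

This is why, in the tests, the config file has to exist before pyiceberg is first imported rather than just before load_catalog is called.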

@datapythonista
Member Author

For some reason pyiceberg used to put an upper bound on its dependencies' versions. They have stopped doing that now, but for the minimum pyiceberg version it was not possible to be compatible with all of our minimum dependency versions. I relaxed the minimum versions of fsspec, s3fs and gcsfs so the minimum-versions environment can be resolved with pyiceberg.

@mroeschke
Member

Curious if there is an open issue discussing including this new feature.

(FWIW I did like your prior IO registration PDEP that would have made this easier to externally implement)

@datapythonista
Member Author

> Curious if there is an open issue discussing including this new feature.

I think only the discussions you are aware of: bodo-ai/Bodo-Pandas-Collaboration#9 and the discussions in the calls that I know of.

I had a look at using read_sql instead of read_iceberg, and to me it feels like that API would be too difficult to use. Considering the popularity of Iceberg, and that we already have dedicated connectors for much less popular formats such as Feather, SPSS... I found this the most reasonable implementation. I'm happy to try using the code in this PR behind read_sql instead if there is interest, but it also feels like our code would be more complex, so personally I don't see an advantage.

I'm also happy to revisit PDEP-9. There weren't objections to the general idea that I remember; the main blocker was that some people weren't happy that connectors could register under any name they wanted, and I don't think there is a good solution to that. To me it's not a problem, since in the end it is the user who decides which Python dependencies are installed. In any case, to me it makes sense to move forward with this Iceberg connector, and it would surely be a good candidate to move out as a third-party connector, along with many others, if we ever implement PDEP-9.



@contextmanager
def create_catalog(catalog_name_in_pyiceberg_config=None):
Member


Does it make sense for this to be a pytest fixture? It could be parametrized over different catalog names by default, and could use the tmpdir fixture to handle the temporary directory automatically.

Member Author


That was my first implementation, but I couldn't make a fixture remove the files when the test finishes. That's why I implemented it as a context manager.

I'll check again; I didn't try with the tmpdir fixture, but it may not be easy.
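A minimal sketch of such a fixture (fixture name, config keys and paths are illustrative assumptions, not taken from the PR): tmp_path gives each test a fresh directory that pytest removes afterwards, so no manual cleanup is needed, and params covers both the "by URI" and "by name" cases.

```python
import pytest

@pytest.fixture(params=[None, "pandas_tests_catalog"])
def iceberg_catalog(request, tmp_path, monkeypatch):
    catalog_name = request.param
    uri = f"sqlite:///{tmp_path}/catalog.sqlite"
    if catalog_name is not None:
        # Write a per-test config file and point pyiceberg at it via
        # PYICEBERG_HOME, so the real ~/.pyiceberg.yaml is never touched
        # (assumes pyiceberg honors the PYICEBERG_HOME environment variable).
        (tmp_path / ".pyiceberg.yaml").write_text(
            "catalog:\n"
            f"  {catalog_name}:\n"
            "    type: sql\n"
            f"    uri: {uri}\n"
            f"    warehouse: file://{tmp_path}\n"
        )
        monkeypatch.setenv("PYICEBERG_HOME", str(tmp_path))
    yield catalog_name, uri
    # No rmtree needed: pytest cleans up tmp_path after the test.
```

Whether this sidesteps the cleanup problem depends on when pyiceberg reads its config, so it may still need the import-ordering workaround discussed above.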

scan_properties: dict[str, Any] | None = None,
) -> DataFrame:
"""
Read an Apache Iceberg table into a pandas DataFrame.
Member


Might be good to flag as experimental so if we revisit and implement the IO plugin model we can pivot this to that model a little quicker.

Member Author


Makes sense, thanks for all the feedback.
