-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH: Implement pandas.read_iceberg #61383
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There is a test failure, for some reason PyIceberg is able to find the namespace and the table when using: pyiceberg.catalog.load_catalog(**{"uri": f"sqlite:///{path}/catalog.sqlite"}) but not when using: pyiceberg.catalog.load_catalog("pandas_tests_catalog") with config file catalog:
pandas_tests_catalog:
type: sql
uri: sqlite:///path/catalog.sqlite I need to research more on what's the problem, since this should work. Other than that this should be ready to get merged. |
Hi @datapythonista, I took a look at the failing tests and from what I can tell it looks like writing the .pyicberg.yaml file causes subsequent tests to fail. I'm not really sure why though and I can't reproduce it locally with pyiceberg 0.9. I have a ~/.pyiceberg.yaml file containing
and a test file running from pyiceberg import catalog
catalog = catalog.load_catalog(None, **{"uri": "sqlite:////tmp/iceberg_catalog/catalog.sqlite"})
print(catalog.list_namespaces()) just fine. I did see that the line removing the ~/.pyiceberg.yaml was commented out, was there a reason for that? |
Thanks for giving it a try. I just commented that line to make sure locally that it contained what I expected after running the test, I forgot to uncomment. Passing the uri to load_catalog worked fine locally. I just had problems when passing a catalog name to it (which requires the config file with the uri). I'll have another look, I thought I could be missing something obvious, but I guess it's something more complicated. |
Oh ok, I just tried by passing the catalog name with a config file from pyiceberg import catalog
from pyiceberg.schema import Schema
catalog = catalog.load_catalog("pandas_tests_catalog")
catalog.create_namespace_if_not_exists("default")
catalog.create_table_if_not_exists("default.test_table", schema=Schema()) and it gave me an error about there not being a default path because warehouse isn't set, maybe try setting |
I tried providing warehouse, but that didn't work. I've been debugging, and seems like the problem is that whether the catalog is loaded by a URI or by a name, the query that retrieve namespaces is filtering by a catalog name. Which will be Thank for the help! |
Tests should be fixed now. It was a bit tricky, since pyiceberg loads the config when importing the module, not when loading the catalog, which made the config file to not always be loaded. |
This reverts commit e593977.
For some reason pyiceberg used to have an upper version for dependencies. They stopped doing it now, but for the minimum pyiceberg version it was not possible to be compatible with all our minimum version dependencies. I relaxed the minimum version of fsspec, s3fs and gcsfs to be able to resolve the minimum versions environment with pyiceberg. |
Curious if there is an open issue discussing including this new feature. (FWIW I did like your prior IO registration PDEP that would have made this easier to externally implement) |
I think only the discussions you are aware of, bodo-ai/Bodo-Pandas-Collaboration#9 and the discussions in the calls that I know. I had a look at using I'm also happy to revisit PDEP-9. There weren't objections to the general idea that I remember, the main blocker was that some people weren't happy that connectors could register with the name they wanted. And I don't think there is a good solution to this. To me it's not a problem, since at the end is the user who decides which Python dependencies are installed. In any case, to me it makes sense to move forward with this Iceberg, and surely this would be a good candidate to move as a third party with many others if we ever implement PDEP-9. |
|
||
|
||
@contextmanager | ||
def create_catalog(catalog_name_in_pyiceberg_config=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense for this to be a pytest fixture? It could be parametrized over different catalog names by default and could use the tmpdir
fixture to do the temporary directory stuff automatically
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was my first implementation, but I couldn't make a fixture remove the files when the test finish. That's why I implemented it as a context manager.
I'll check again, as I didn't try with the tmpdir fixture, but it may not be easy.
scan_properties: dict[str, Any] | None = None, | ||
) -> DataFrame: | ||
""" | ||
Read an Apache Iceberg table into a pandas DataFrame. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be good to flag as experimental
so if we revisit and implement the IO plugin model we can pivot this to that model a little quicker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, thanks for all the feedback.
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.