From 156d71064d6e0121662309fd476ccaaecb5bd7c9 Mon Sep 17 00:00:00 2001 From: davitbzh Date: Tue, 30 Jan 2024 18:21:52 +0100 Subject: [PATCH 1/6] add comment about java env --- docs/user_guides/fs/compute_engines.md | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/docs/user_guides/fs/compute_engines.md b/docs/user_guides/fs/compute_engines.md index a3feb9c0d..490ac43d2 100644 --- a/docs/user_guides/fs/compute_engines.md +++ b/docs/user_guides/fs/compute_engines.md @@ -4,12 +4,12 @@ In order to execute a feature pipeline to write to the Feature Store, as well as Hopsworks Feature Store APIs are built around dataframes, that means feature data is inserted into the Feature Store from a Dataframe and likewise when reading data from the Feature Store, it is returned as a Dataframe. -As such, Hopsworks supports three computational engines: +As such, Hopsworks supports four computational engines: 1. [Apache Spark](https://spark.apache.org): Spark Dataframes and Spark Structured Streaming Dataframes are supported, both from Python environments (PySpark) and from Scala environments. 2. [Pandas](https://pandas.pydata.org/): For pure Python environments without dependencies on Spark, Hopsworks supports [Pandas Dataframes](https://pandas.pydata.org/). 3. [Apache Flink](https://flink.apache.org): Flink Data Streams are currently supported as an experimental feature from Java/Scala environments. -3. [Apache Beam](https://beam.apache.org/) *experimental*: Beam Data Streams are currently supported as an experimental feature from Java/Scala environments. +4. [Apache Beam](https://beam.apache.org/) *experimental*: Beam Data Streams are currently supported as an experimental feature from Java/Scala environments. Hopsworks supports running [compute on the platform itself](../../concepts/dev/inside.md) in the form of [Jobs](../projects/jobs/pyspark_job.md) or in [Jupyter Notebooks](../projects/jupyter/python_notebook.md). 
Alternatively, you can also connect to Hopsworks using Python or Spark from [external environments](../../concepts/dev/outside.md), given that there is network connectivity. @@ -76,3 +76,8 @@ Apache Beam integration with Hopsworks feature store was only tested using Dataf
For more details head over to the [Getting Started Guide](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/integrations/java/beam).
+## Java
+It is also possible to interact with the Hopsworks feature store using pure Java environments without dependencies on Spark, Flink or Beam.
+However, this is limited to retrieval of feature vector(s) from the online Feature Store.
+
+For more details head over to the [Getting Started Guide](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/java).

From 7b69856692bfb1019eb026c2661d29202080dee8 Mon Sep 17 00:00:00 2001
From: davitbzh
Date: Tue, 14 Jan 2025 17:12:18 +0100
Subject: [PATCH 2/6] java client

---
 docs/user_guides/integrations/index.md |  1 +
 docs/user_guides/integrations/java.md  | 51 ++++++++++++++++++++++++++
 2 files changed, 52 insertions(+)
 create mode 100644 docs/user_guides/integrations/java.md

diff --git a/docs/user_guides/integrations/index.md b/docs/user_guides/integrations/index.md
index a68842daf..fb9d212f8 100644
--- a/docs/user_guides/integrations/index.md
+++ b/docs/user_guides/integrations/index.md
@@ -3,6 +3,7 @@ Hopsworks is an open platform aiming to be accessible from a variety of tools.
Learn in this section how to connect to Hopsworks from

- [Python, AWS SageMaker, Google Colab, Kubeflow](python)
+- [Java](java)
- [Databricks](databricks/networking)
- [AWS EMR](emr/emr_configuration)
- [Azure HDInsight](hdinsight)

diff --git a/docs/user_guides/integrations/java.md b/docs/user_guides/integrations/java.md
new file mode 100644
index 000000000..f9c62c7bd
--- /dev/null
+++ b/docs/user_guides/integrations/java.md
@@ -0,0 +1,51 @@
+---
+description: Documentation on how to connect to Hopsworks from a Java client.
+---
+
+# Java client
+
+This guide explains step by step how to connect to Hopsworks from a Java client.
+
+
+## Generate an API key
+
+For instructions on how to generate an API key, follow this [user guide](../projects/api_key/create_api_key.md). For the Java client to work correctly, make sure you add the following scopes to your API key:
+
+ 1. featurestore
+ 2. project
+ 3. job
+ 4. kafka
+
+## Connecting to the Feature Store
+
+You are now ready to connect to the Hopsworks Feature Store from a Java client:
+
+```Java
+//Import necessary classes
+import com.logicalclocks.hsfs.FeatureStore;
+import com.logicalclocks.hsfs.FeatureView;
+import com.logicalclocks.hsfs.HopsworksConnection;
+
+//Establish connection with Hopsworks.
+HopsworksConnection hopsworksConnection = HopsworksConnection.builder()
+  .host("my_instance")                  // DNS of your Feature Store instance
+  .port(443)                            // Port to reach your Hopsworks instance, defaults to 443
+  .project("my_project")                // Name of your Hopsworks Feature Store project
+  .apiKeyValue("api_key")               // The API key to authenticate with the feature store
+  .hostnameVerification(false)          // Disable for self-signed certificates
+  .build();
+
+//get feature store handle
+FeatureStore fs = hopsworksConnection.getFeatureStore();
+
+//get feature view handle
+FeatureView fv = fs.getFeatureView(fvName, fvVersion);
+
+// get feature vector
+List<Object> singleVector = fv.getFeatureVector(new HashMap<String, Object>() {{
+    put("id", 100);
+  }});
+```
+
+## Next Steps
+For more information on how to interact with the Hopsworks Feature Store from a Java client, follow this [tutorial](https://github.com/logicalclocks/hopsworks-tutorials/tree/java_engine/java).
\ No newline at end of file

From 503246c60160156c5e37ece9b99009ee5422566f Mon Sep 17 00:00:00 2001
From: davitbzh
Date: Tue, 21 Jan 2025 13:09:28 +0100
Subject: [PATCH 3/6] compute engine java

---
 docs/user_guides/fs/compute_engines.md | 31 +++++++++++++++++++------------
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/docs/user_guides/fs/compute_engines.md b/docs/user_guides/fs/compute_engines.md
index b8eda6c84..92d3ad92b 100644
--- a/docs/user_guides/fs/compute_engines.md
+++ b/docs/user_guides/fs/compute_engines.md
@@ -4,12 +4,16 @@ In order to execute a feature pipeline to write to the Feature Store, as well as
Hopsworks Feature Store APIs are built around dataframes, that means feature data is inserted into the Feature Store from a Dataframe and likewise when reading data from the Feature Store, it is returned as a Dataframe.

-As such, Hopsworks supports four computational engines:
+As such, Hopsworks supports five computational engines:

1. 
[Apache Spark](https://spark.apache.org): Spark Dataframes and Spark Structured Streaming Dataframes are supported, both from Python environments (PySpark) and from Scala environments. 2. [Python](https://www.python.org/): For pure Python environments without dependencies on Spark, Hopsworks supports [Pandas Dataframes](https://pandas.pydata.org/) and [Polars Dataframes](https://pola.rs/). 3. [Apache Flink](https://flink.apache.org): Flink Data Streams are currently supported as an experimental feature from Java/Scala environments. 4. [Apache Beam](https://beam.apache.org/) *experimental*: Beam Data Streams are currently supported as an experimental feature from Java/Scala environments. +<<<<<<< HEAD +======= +5. [Java](https://www.java.com): For pure Java environments without dependencies on Spark, Hopsworks supports writing using List of POJO Objects. +>>>>>>> b5aed0ee (compute engine java) Hopsworks supports running [compute on the platform itself](../../concepts/dev/inside.md) in the form of [Jobs](../projects/jobs/pyspark_job.md) or in [Jupyter Notebooks](../projects/jupyter/python_notebook.md). Alternatively, you can also connect to Hopsworks using Python or Spark from [external environments](../../concepts/dev/outside.md), given that there is network connectivity. @@ -18,17 +22,16 @@ Alternatively, you can also connect to Hopsworks using Python or Spark from [ext Hopsworks is aiming to provide functional parity between the computational engines, however, there are certain Hopsworks functionalities which are exclusive to the engines. 
-| Functionality | Method | Spark | Python | Flink | Beam | Comment | -| ----------------------------------------------------------------- | ------ | ----- | ------ | ------ | ------ | ------- | -| Feature Group Creation from dataframes | [`FeatureGroup.create_feature_group()`](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#create_feature_group) | :white_check_mark: | :white_check_mark: | - | - | Currently Flink/Beam doesn't support registering feature group metadata. Thus it needs to be pre-registered before you can write real time features computed by Flink/Beam.| -| Training Dataset Creation from dataframes | [`TrainingDataset.save()`](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/training_dataset_api/#save) | :white_check_mark: | - | - | - | Functionality was deprecated in version 3.0 | -| Data validation using Great Expectations for streaming dataframes | [`FeatureGroup.validate()`](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#validate) [`FeatureGroup.insert_stream()`](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#insert_stream) | - | - | - | - | `insert_stream` does not perform any data validation even when a expectation suite is attached. | -| Stream ingestion | [`FeatureGroup.insert_stream()`](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#insert_stream) | :white_check_mark: | - | :white_check_mark: | :white_check_mark: | Python/Pandas/Polars has currently no notion of streaming. | -| Stream ingestion | [`FeatureGroup.insert_stream()`](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#insert_stream) | :white_check_mark: | - | :white_check_mark: | :white_check_mark: | Python/Pandas/Polars has currently no notion of streaming. 
| -| Reading from Streaming Storage Connectors | [`KafkaConnector.read_stream()`](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/storage_connector_api/#read_stream) | :white_check_mark: | - | - | - | Python/Pandas/Polars has currently no notion of streaming. For Flink/Beam only write operations are supported | -| Reading training data from external storage other than S3 | [`FeatureView.get_training_data()`](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/feature_view_api/#get_training_data) | :white_check_mark: | - | - | - | Reading training data that was written to external storage using a Storage Connector other than S3 can currently not be read using HSFS APIs, instead you will have to use the storage's native client. | -| Reading External Feature Groups into Dataframe | [`ExternalFeatureGroup.read()`](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/external_feature_group_api/#read) | :white_check_mark: | - | - | - | Reading an External Feature Group directly into a Pandas/Polars Dataframe is not supported, however, you can use the [Query API](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/query_api/) to create Feature Views/Training Data containing External Feature Groups. | -| Read Queries containing External Feature Groups into Dataframe | [`Query.read()`](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/query_api/#read) | :white_check_mark: | - | - | - | Reading a Query containing an External Feature Group directly into a Pandas/Polars Dataframe is not supported, however, you can use the Query to create Feature Views/Training Data and write the data to a Storage Connector, from where you can read up the data into a Pandas/Polars Dataframe. 
| +| Functionality | Method | Spark | Python | Flink | Beam | Java | Comment | +| ----------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------ | ------------------ | ---------------------- | ------------------ | ------------------ |------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| Feature Group Creation from dataframes | [`FeatureGroup.create_feature_group()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#create_feature_group) | :white_check_mark: | :white_check_mark: | - | - | - | Currently Flink/Beam/Java doesn't support registering feature group metadata. Thus it needs to be pre-registered before you can write real time features computed by Flink/Beam. | +| Training Dataset Creation from dataframes | [`TrainingDataset.save()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/training_dataset_api/#save) | :white_check_mark: | - | - | - | - | Functionality was deprecated in version 3.0 | +| Data validation using Great Expectations for streaming dataframes | [`FeatureGroup.validate()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#validate)
[`FeatureGroup.insert_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#insert_stream) | - | - | - | - | - | `insert_stream` does not perform any data validation even when a expectation suite is attached. | +| Stream ingestion | [`FeatureGroup.insert_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#insert_stream) | :white_check_mark: | - | :white_check_mark: | :white_check_mark: | :white_check_mark: | Python/Pandas/Polars has currently no notion of streaming. | +| Reading from Streaming Storage Connectors | [`KafkaConnector.read_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/storage_connector_api/#read_stream) | :white_check_mark: | - | - | - | - | Python/Pandas/Polars has currently no notion of streaming. For Flink/Beam/Java only write operations are supported | +| Reading training data from external storage other than S3 | [`FeatureView.get_training_data()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_view_api/#get_training_data) | :white_check_mark: | - | - | - | - | Reading training data that was written to external storage using a Storage Connector other than S3 can currently not be read using HSFS APIs, instead you will have to use the storage's native client. | +| Reading External Feature Groups into Dataframe | [`ExternalFeatureGroup.read()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/external_feature_group_api/#read) | :white_check_mark: | - | - | - | - | Reading an External Feature Group directly into a Pandas/Polars Dataframe is not supported, however, you can use the [Query API](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/) to create Feature Views/Training Data containing External Feature Groups. 
| +| Read Queries containing External Feature Groups into Dataframe | [`Query.read()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/#read) | :white_check_mark: | - | - | - | - | Reading a Query containing an External Feature Group directly into a Pandas/Polars Dataframe is not supported, however, you can use the Query to create Feature Views/Training Data and write the data to a Storage Connector, from where you can read up the data into a Pandas/Polars Dataframe. | ## Python @@ -78,7 +81,11 @@ Apache Beam integration with Hopsworks feature store was only tested using Dataf For more details head over to the [Getting Started Guide](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/integrations/java/beam). ## Java +<<<<<<< HEAD It is also possible to interact to Hopsworks feature store using pure Java environments without dependencies on Spark, Flink or Beam. However, this is limited to retrieval of feature vector(s) from the online Feature Store. +======= +It is also possible to interact to Hopsworks feature store using pure Java environments without dependencies on Spark, Flink or Beam. +>>>>>>> b5aed0ee (compute engine java) For more details head over to the [Getting Started Guide](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/java). From 2e59f30fe94345064b48bb5da4ecc6ec34197392 Mon Sep 17 00:00:00 2001 From: davitbzh Date: Sun, 26 Jan 2025 19:13:25 +0100 Subject: [PATCH 4/6] fix bad merge --- docs/user_guides/fs/compute_engines.md | 8 -------- 1 file changed, 8 deletions(-) diff --git a/docs/user_guides/fs/compute_engines.md b/docs/user_guides/fs/compute_engines.md index 92d3ad92b..a0191ea36 100644 --- a/docs/user_guides/fs/compute_engines.md +++ b/docs/user_guides/fs/compute_engines.md @@ -10,10 +10,7 @@ As such, Hopsworks supports five computational engines: 2. 
[Python](https://www.python.org/): For pure Python environments without dependencies on Spark, Hopsworks supports [Pandas Dataframes](https://pandas.pydata.org/) and [Polars Dataframes](https://pola.rs/). 3. [Apache Flink](https://flink.apache.org): Flink Data Streams are currently supported as an experimental feature from Java/Scala environments. 4. [Apache Beam](https://beam.apache.org/) *experimental*: Beam Data Streams are currently supported as an experimental feature from Java/Scala environments. -<<<<<<< HEAD -======= 5. [Java](https://www.java.com): For pure Java environments without dependencies on Spark, Hopsworks supports writing using List of POJO Objects. ->>>>>>> b5aed0ee (compute engine java) Hopsworks supports running [compute on the platform itself](../../concepts/dev/inside.md) in the form of [Jobs](../projects/jobs/pyspark_job.md) or in [Jupyter Notebooks](../projects/jupyter/python_notebook.md). Alternatively, you can also connect to Hopsworks using Python or Spark from [external environments](../../concepts/dev/outside.md), given that there is network connectivity. @@ -81,11 +78,6 @@ Apache Beam integration with Hopsworks feature store was only tested using Dataf For more details head over to the [Getting Started Guide](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/integrations/java/beam). ## Java -<<<<<<< HEAD -It is also possible to interact to Hopsworks feature store using pure Java environments without dependencies on Spark, Flink or Beam. -However, this is limited to retrieval of feature vector(s) from the online Feature Store. -======= It is also possible to interact to Hopsworks feature store using pure Java environments without dependencies on Spark, Flink or Beam. ->>>>>>> b5aed0ee (compute engine java) For more details head over to the [Getting Started Guide](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/java). 
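The "writing using List of POJO Objects" that patches 3 and 4 above introduce for the pure-Java engine can be sketched in plain Java. The class name and fields below are hypothetical (in practice they would need to mirror the schema of an already registered feature group), and the actual hsfs write call is omitted:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical POJO; its fields must mirror the schema of a
// pre-registered feature group, since the Java engine cannot
// register feature group metadata itself.
public class TransactionFeatures {
    private final int id;        // primary key feature
    private final double amount; // example numeric feature

    public TransactionFeatures(int id, double amount) {
        this.id = id;
        this.amount = amount;
    }

    public int getId() { return id; }
    public double getAmount() { return amount; }

    public static void main(String[] args) {
        // Assemble the List of POJO objects that a pure-Java pipeline
        // would hand to the feature group's write API.
        List<TransactionFeatures> batch = new ArrayList<>();
        batch.add(new TransactionFeatures(100, 42.5));
        batch.add(new TransactionFeatures(101, 13.0));
        System.out.println(batch.size()); // prints 2
    }
}
```

As the functionality table in patch 3 notes, the feature group metadata must be pre-registered before real-time features computed in Java can be written to it.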
From c7cafe065a4c454afb6ed4decfaba8f32cab2fd3 Mon Sep 17 00:00:00 2001 From: davitbzh <44586065+davitbzh@users.noreply.github.com> Date: Tue, 11 Feb 2025 13:17:32 +0100 Subject: [PATCH 5/6] Update docs/user_guides/fs/compute_engines.md Co-authored-by: Ralf --- docs/user_guides/fs/compute_engines.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/user_guides/fs/compute_engines.md b/docs/user_guides/fs/compute_engines.md index a0191ea36..26e44acef 100644 --- a/docs/user_guides/fs/compute_engines.md +++ b/docs/user_guides/fs/compute_engines.md @@ -78,6 +78,6 @@ Apache Beam integration with Hopsworks feature store was only tested using Dataf For more details head over to the [Getting Started Guide](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/integrations/java/beam). ## Java -It is also possible to interact to Hopsworks feature store using pure Java environments without dependencies on Spark, Flink or Beam. +It is also possible to interact to Hopsworks feature store using pure Java environments without dependencies on Spark, Flink or Beam. For more details head over to the [Getting Started Guide](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/java). From 3dfae69b190da69324359f749626d662931c95b6 Mon Sep 17 00:00:00 2001 From: davitbzh Date: Tue, 11 Feb 2025 13:35:16 +0100 Subject: [PATCH 6/6] hsfs java --- docs/user_guides/client_installation/index.md | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/docs/user_guides/client_installation/index.md b/docs/user_guides/client_installation/index.md index c67afd4f6..c832d434f 100644 --- a/docs/user_guides/client_installation/index.md +++ b/docs/user_guides/client_installation/index.md @@ -69,6 +69,19 @@ The HSFS library is available on the Hopsworks' Maven repository. 
If you are usi
The library has different builds targeting different environments:

+### HSFS Java
+
+The `artifactId` for the HSFS Java build is `hsfs`. If you are using Maven as build tool, you can add the following dependency:
+
+```
+<dependency>
+   <groupId>com.logicalclocks</groupId>
+   <artifactId>hsfs</artifactId>
+   <version>${hsfs.version}</version>
+</dependency>
+```
+
+
### Spark

The `artifactId` for the Spark build is `hsfs-spark-spark{spark.version}`, if you are using Maven as build tool, you can add the following dependency: