Create Spark SQL DataFrame: Pass bucket name/password to DF not to Session builder

I’m using Spark and Spark Couchbase Connector version 2.1.

I want to load a bucket into a DataFrame. If I pass the bucket name and password to the session builder like this:

val spark = SparkSession.builder.master("local[*]").appName("Couchbase-app")
            .config("spark.couchbase.nodes", "127.0.0.1")
            .config("spark.couchbase.bucket.User", "123")
            .getOrCreate()

and then issue val df = spark.read.couchbase(), the DataFrame is created normally.

However, if I omit the bucket name/password from the session builder and instead pass them as options during DataFrame creation:

val df = spark.read.format("com.couchbase.spark.sql.DefaultSource").option("spark.couchbase.bucket.User","123").load()

…it fails with the common error:

com.couchbase.client.java.error.InvalidPasswordException: Passwords for bucket "default" do not match.

That suggests that the bucket name/password aren’t picked up at all.

If I pass the bucket name as an option (which is apparently only meant to be used when more than one bucket is open), like:

val df = spark.read.format("com.couchbase.spark.sql.DefaultSource").option("bucket","user").load()

then the bucket name seems to be picked up, but the password is missing:

java.lang.IllegalStateException: Not able to find bucket password for bucket User.

Question: My application design requires that the Spark session builder be created separately from the DataFrame, so any database-specific information (user, password, host, port, etc.) can only be passed as options during DataFrame creation. Is there a way to achieve this?

Hi @mnmami. You need to specify the bucket parameters in the SparkSession config, as in your first example. The .option("bucket", ...) on the DataFrame reader exists to choose between multiple buckets that were previously configured in the SparkSession config; it cannot be used to configure a bucket.
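For example, here is a sketch of the multi-bucket pattern (it assumes a hypothetical second bucket named "Orders" with password "456" exists on the cluster; the names and passwords are illustrative only). Both buckets are declared up front in the session config, and .option("bucket", ...) then selects which already-opened bucket each DataFrame reads from:

import org.apache.spark.sql.SparkSession

// Register every bucket (name -> password) in the session config.
val spark = SparkSession.builder.master("local[*]").appName("Couchbase-app")
  .config("spark.couchbase.nodes", "127.0.0.1")
  .config("spark.couchbase.bucket.User", "123")    // bucket "User", password "123"
  .config("spark.couchbase.bucket.Orders", "456")  // assumed second bucket/password
  .getOrCreate()

// With more than one bucket configured, "bucket" picks among them;
// credentials still come from the session config above.
val users = spark.read
  .format("com.couchbase.spark.sql.DefaultSource")
  .option("bucket", "User")
  .load()

val orders = spark.read
  .format("com.couchbase.spark.sql.DefaultSource")
  .option("bucket", "Orders")
  .load()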

This is because opening a connection to a Couchbase bucket is a relatively expensive operation, so it’s not something you want to repeat every time you create a DataFrame; the connector opens the configured buckets once per session and reuses them.

I hope that helps :slight_smile:
