ArrayIndexOutOfBoundsException

Good evening folks,

New to Couchbase here. I just set up a Zeppelin notebook and am trying to create a DataFrame on my bucket. I got it working on one of my buckets, but on the other I keep getting this error: Caused by: java.lang.ArrayIndexOutOfBoundsException: 10. The code and result below seem to suggest it works fine, but when I call .show() I get the array-out-of-bounds error. Has anyone else seen this issue and know how to fix it?

Code:
val df = sqlContext.sql("SELECT account from people LIMIT 10")

Result:
df: org.apache.spark.sql.DataFrame = [account: array<struct<anonCustomerIds:array<string>,consent:array<struct<isAllowed:boolean,lastUpdated:string,linkedEventId:string,type:string>>,contactProfile:struct<address:array<struct<addressLine1:string,addressLine2:string,area:string,description:string,houseNameNumber:string,isPrimary:boolean,lookupPostcode:string,postalOutcode:string,postcode:string,townCity:string>>,email:array<struct<address:string,isPrimary:boolean,lookupAddress:string,source:string>>,phone:array<struct<lookupNumber:string,number:string>>>,credentials:struct<passwordHash:string,securityQuestion:string,username:string>,currentAcceptedTerms:struct<termsActiveDate:string,termsVersion:string>,dob:struct<day:bigint,month:bigint,year:bigint>,exclusion:s

Schema…
root
|-- account: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- anonCustomerIds: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- consent: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- isAllowed: boolean (nullable = true)
| | | | |-- lastUpdated: string (nullable = true)
| | | | |-- linkedEventId: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | |-- contactProfile: struct (nullable = true)
| | | |-- address: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- addressLine1: string (nullable = true)
| | | | | |-- addressLine2: string (nullable = true)
| | | | | |-- area: string (nullable = true)
| | | | | |-- description: string (nullable = true)
| | | | | |-- houseNameNumber: string (nullable = true)
| | | | | |-- isPrimary: boolean (nullable = true)
| | | | | |-- lookupPostcode: string (nullable = true)
| | | | | |-- postalOutcode: string (nullable = true)
| | | | | |-- postcode: string (nullable = true)
| | | | | |-- townCity: string (nullable = true)
| | | |-- email: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- address: string (nullable = true)
| | | | | |-- isPrimary: boolean (nullable = true)
| | | | | |-- lookupAddress: string (nullable = true)
| | | | | |-- source: string (nullable = true)
| | | |-- phone: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- lookupNumber: string (nullable = true)
| | | | | |-- number: string (nullable = true)
| | |-- credentials: struct (nullable = true)

Hi Mark, can you confirm the versions of Zeppelin, Spark and Couchbase you are using together?

Hi Tyler,

I am using AWS EMR release emr-4.7.2 with these applications: Hive 1.0.0, Pig 0.14.0, Hue 3.7.1, Zeppelin-Sandbox 0.5.6, Spark 1.6.2.

The error seems to occur because the schema that Spark infers is missing data fields. As I understand it, schema inference takes a sample of the records and derives the schema from that sample. The issue is that I have a few million customers, and a small percentage of them have many more data fields, so when show() is run an array-out-of-bounds error is returned.
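For illustration, here is a minimal sketch of the effect, using Spark's built-in JSON source rather than the Couchbase connector (the data and the 0.01 sampling ratio are made up for the example):

rows = ['{"id": %d}' % i for i in range(10000)]
rows.append('{"id": 10000, "rareField": ["a", "b"]}')  # only 1 in 10001 records has rareField
rdd = sc.parallelize(rows)

# samplingRatio < 1.0 tells the JSON source to infer the schema from a sample,
# so rareField will usually be missing and selecting it later fails.
df = sqlContext.read.option("samplingRatio", "0.01").json(rdd)
df.printSchema()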

If I could somehow filter the data better, that would also solve my problem.
The Python code I am using is:
df1 = sqlContext.read.format("com.couchbase.spark.sql.DefaultSource").option("schemaFilter", "doctype=\"Customer\"").load()

If I could add a filter such as account[0].source = "PQF", then the connector would infer a more accurate schema. Do you know if this is possible?
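Something like this is what I have in mind (assuming schemaFilter accepts an arbitrary N1QL predicate like this, which I have not confirmed):

df1 = (sqlContext.read
       .format("com.couchbase.spark.sql.DefaultSource")
       # assumed: the extra predicate narrows the documents used for inference
       .option("schemaFilter", "doctype=\"Customer\" AND account[0].source=\"PQF\"")
       .load())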

If not, the only option would be to write my schema manually, and there are 200 fields.
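For reference, here is a partial sketch of a manual schema (assuming the connector honors a user-supplied schema via .schema(); if so, I should only need to declare the fields I actually query rather than all 200):

from pyspark.sql.types import (StructType, StructField, ArrayType,
                               StringType, BooleanType)

# Declare just the slice of "account" that the query touches.
consent_element = StructType([
    StructField("isAllowed", BooleanType(), True),
    StructField("lastUpdated", StringType(), True),
    StructField("linkedEventId", StringType(), True),
    StructField("type", StringType(), True),
])
account_element = StructType([
    StructField("anonCustomerIds", ArrayType(StringType()), True),
    StructField("consent", ArrayType(consent_element), True),
])
schema = StructType([StructField("account", ArrayType(account_element), True)])

df1 = (sqlContext.read
       .format("com.couchbase.spark.sql.DefaultSource")
       .schema(schema)  # explicit schema, so Spark skips inference
       .option("schemaFilter", "doctype=\"Customer\"")
       .load())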

thanks,
Mark

@markmiko can you please post the full exception as seen in the Spark logs? Thanks!