Spark: subset a few columns and remove null rows
I am running Spark 2.1 on Windows 10 and have fetched data from MySQL into Spark using JDBC. The table looks like this:

x     y     z
------------------
1     d1    null
null  v     ed
5     null  null
7     s     null
null  bd    null
I want to create a new Spark Dataset with just the x and y columns from the table above, and I want to keep only the rows that do not have null in either of those 2 columns. The resultant table should look like this:

x   y
--------
1   d1
7   s
I tried the following code:

val load_df = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://100.150.200.250:3306")
  .option("dbtable", "schema.table_name")
  .option("user", "uname1")
  .option("password", "pass1")
  .load()

val filter_df = load_df.select($"x".isNotNull, $"y".isNotNull).rdd

// let's print the first 5 values of filter_df
filter_df.take(5)
res0: Array[org.apache.spark.sql.Row] = Array([true,true], [false,true], [true,false], [true,true], [false,true])

As shown above, the result doesn't give me the actual values; it returns boolean values instead (true when the value is not null, false when the value is null).
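The root cause is that select projects expressions while filter keeps rows: $"x".isNotNull is a boolean Column expression, so selecting it produces a column of true/false instead of the original values. The distinction can be sketched in plain Scala, with no Spark required; the rows value below is a hypothetical in-memory stand-in for the table, using Option to model SQL NULL:

```scala
// Hypothetical in-memory stand-in for the (x, y) columns; Option models SQL NULL.
val rows: Seq[(Option[String], Option[String])] = Seq(
  (Some("1"), Some("d1")),
  (None,      Some("v")),
  (Some("5"), None),
  (Some("7"), Some("s")),
  (None,      Some("bd"))
)

// select($"x".isNotNull, $"y".isNotNull) behaves like a map: it projects
// the boolean expressions themselves, losing the original values.
val projected = rows.map { case (x, y) => (x.isDefined, y.isDefined) }

// What the question actually wants behaves like a filter: keep only the
// rows where both columns are defined, then project the values.
val filtered = rows.collect { case (Some(x), Some(y)) => (x, y) }

println(projected) // List((true,true), (false,true), (true,false), (true,true), (false,true))
println(filtered)  // List((1,d1), (7,s))
```

Note how projected reproduces the boolean output seen in the REPL above, while filtered matches the desired result table.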
Try this:

val load_df = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://100.150.200.250:3306")
  .option("dbtable", "schema.table_name")
  .option("user", "uname1")
  .option("password", "pass1")
  .load()

Now:

load_df.select($"x", $"y").filter($"x".isNotNull && $"y".isNotNull)

Note that filter("x !== null") would not work here: in a SQL expression string, comparisons against null always evaluate to null, so a row is never kept. Use the Column API as above, or the SQL form .filter("x is not null and y is not null").
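Equivalently, Spark's built-in DataFrameNaFunctions can drop rows containing nulls. A minimal sketch, assuming the same load_df as above (the JDBC options are unchanged and elided here):

```scala
// Select the two columns, then drop every row that has a null in either.
// na.drop(Seq(...)) restricts the null check to the named columns;
// with no arguments it would check all columns of the projection.
val result_df = load_df
  .select($"x", $"y")
  .na.drop(Seq("x", "y"))

result_df.show()
```

Since the projection only contains x and y, a bare .na.drop() gives the same result; the explicit column list just makes the intent obvious.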