Structured Streaming 2.1.0 stream to Parquet creates many small files
I am running Structured Streaming (2.1.0) on a 3-node YARN cluster, streaming JSON records to Parquet. The code fragment looks like this:
```scala
val query = ds.writeStream
  .format("parquet")
  .option("checkpointLocation", "/data/kafka_streaming.checkpoint")
  .start("/data/kafka_streaming.parquet")
```
I noticed that it creates thousands of small files for 1,000 records. I suspected this had to do with the frequency of the trigger, so I changed it:
```scala
import org.apache.spark.sql.streaming.ProcessingTime

val query = ds.writeStream
  .format("parquet")
  .option("checkpointLocation", "/data/kafka_streaming.checkpoint")
  .trigger(ProcessingTime("60 seconds")) // added trigger
  .start("/data/kafka_streaming.parquet")
```
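The trigger interval matters because the file sink writes at least one file per output partition per trigger, so the file count scales with trigger frequency times partition count. A minimal sketch of the other knob, reducing partitions before the sink (the `repartition` call is my addition, not part of the original query):

```scala
// A sketch, assuming `ds` is the streaming Dataset from above.
// Fewer partitions going into the sink means fewer, larger files per trigger,
// at the cost of less write parallelism.
import org.apache.spark.sql.streaming.ProcessingTime

val query = ds
  .repartition(1) // 1 is illustrative; pick a count that matches your data volume
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "/data/kafka_streaming.checkpoint")
  .trigger(ProcessingTime("60 seconds"))
  .start("/data/kafka_streaming.parquet")
```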
The difference is obvious: a much smaller number of files is created for the same number of records.
My question: is there any way to have a low-latency trigger and still keep a smaller number of larger output files?
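One common workaround (my suggestion, not something stated in the original post) is to keep the low-latency trigger and periodically compact the sink's output with a separate batch job. A hedged sketch, where the paths and the target file count are assumptions for illustration:

```scala
// Sketch of a periodic compaction job run alongside the streaming query.
import org.apache.spark.sql.SparkSession

object CompactStreamOutput {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("compact-stream-output").getOrCreate()

    spark.read
      .parquet("/data/kafka_streaming.parquet")   // the streaming sink's output
      .coalesce(8)                                // target a handful of large files
      .write
      .mode("overwrite")
      .parquet("/data/kafka_streaming.compacted") // compacted copy in a separate location

    spark.stop()
  }
}
```

Note that the compacted copy goes to a separate directory: rewriting the sink's own output in place while the stream is running would conflict with the `_spark_metadata` log the file sink maintains, so downstream readers should be pointed at the compacted location instead.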