apache spark - Structured Streaming 2.1.0 stream to Parquet creates many small files


I am running Structured Streaming (2.1.0) on a 3-node YARN cluster, streaming JSON records to Parquet. The code fragment looks like this:

    val query = ds.writeStream
      .format("parquet")
      .option("checkpointLocation", "/data/kafka_streaming.checkpoint")
      .start("/data/kafka_streaming.parquet")

I noticed that it creates thousands of small files of about 1,000 records each. I suspected this had to do with the frequency of the trigger, so I changed it:

    import org.apache.spark.sql.streaming.ProcessingTime

    val query = ds.writeStream
      .format("parquet")
      .option("checkpointLocation", "/data/kafka_streaming.checkpoint")
      .trigger(ProcessingTime("60 seconds"))
      .start("/data/kafka_streaming.parquet")

The difference is obvious: I can see a much smaller number of files created for the same number of records.

My question: is there a way to have a low-latency trigger and still keep a smaller number of larger output files?
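One common workaround (a sketch of my own, not something from the streaming API itself) is to keep the low-latency trigger for the streaming write and run a separate periodic batch job that compacts the small files into larger ones. The output path and partition count below are illustrative assumptions:

    // Hypothetical periodic compaction job (run e.g. hourly, outside the streaming query).
    // Paths and the repartition count are assumptions, not from the original question.
    import org.apache.spark.sql.SparkSession

    object CompactParquet {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("compact-parquet").getOrCreate()

        // Read the many small files produced by the streaming query...
        val df = spark.read.parquet("/data/kafka_streaming.parquet")

        // ...and rewrite them as a small number of larger files to a separate location.
        df.repartition(8)
          .write
          .mode("overwrite")
          .parquet("/data/kafka_streaming_compacted.parquet")

        spark.stop()
      }
    }

This keeps the streaming query responsive while downstream readers query the compacted location, at the cost of running a second job.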

