apache spark - Structured Streaming 2.1.0 stream to Parquet creates many small files


I am running Structured Streaming (2.1.0) on a 3-node YARN cluster, streaming JSON records to Parquet. The code fragment looks like this:

  val query = ds.writeStream
    .format("parquet")
    .option("checkpointLocation", "/data/kafka_streaming.checkpoint")
    .start("/data/kafka_streaming.parquet")

I noticed that this creates thousands of small files of 1,000 records. I suspected it had to do with the frequency of the trigger, so I changed it:

  import org.apache.spark.sql.streaming.ProcessingTime

  val query = ds.writeStream
    .format("parquet")
    .option("checkpointLocation", "/data/kafka_streaming.checkpoint")
    .trigger(ProcessingTime("60 seconds"))
    .start("/data/kafka_streaming.parquet")

The difference is obvious: a much smaller number of files is created for the same number of records. This makes sense, since each trigger produces a micro-batch, and each micro-batch writes at least one file per non-empty partition, so firing less often yields fewer, larger files.

My question: is there any way to keep a low-latency trigger and still end up with a smaller number of larger output files?
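One direction worth trying (a sketch, not a confirmed answer: the coalesce(1) call and the 60-second trigger are assumptions layered on the question's own ds and paths) is to reduce the number of partitions before the sink, so each micro-batch writes fewer files:

  import org.apache.spark.sql.streaming.ProcessingTime

  // Sketch: coalesce the streaming Dataset before writing so each
  // micro-batch produces a single file instead of one file per partition.
  // coalesce(1) trades write parallelism for fewer, larger files.
  val query = ds
    .coalesce(1)
    .writeStream
    .format("parquet")
    .option("checkpointLocation", "/data/kafka_streaming.checkpoint")
    .trigger(ProcessingTime("60 seconds"))
    .start("/data/kafka_streaming.parquet")

Even then, one file per trigger still accumulates quickly at low latencies, so a separate periodic compaction job over the output directory is a common complement to this approach.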

