Structured Streaming 2.1.0 stream to Parquet creates many small files
I am running Structured Streaming (2.1.0) on a 3-node YARN cluster, streaming JSON records to Parquet. The code fragment looks like this:
```scala
val query = ds.writeStream
  .format("parquet")
  .option("checkpointLocation", "/data/kafka_streaming.checkpoint")
  .start("/data/kafka_streaming.parquet")
```
I noticed that it creates thousands of small files for 1,000 records. I suspected this had to do with the frequency of the trigger, so I changed it:
```scala
import org.apache.spark.sql.streaming.ProcessingTime

val query = ds.writeStream
  .format("parquet")
  .option("checkpointLocation", "/data/kafka_streaming.checkpoint")
  .trigger(ProcessingTime("60 seconds")) // added trigger
  .start("/data/kafka_streaming.parquet")
```
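The trigger interval matters because the file sink writes at least one file per output partition per trigger, so the file count scales with trigger frequency times partition count. A minimal sketch of the other knob, reducing partitions before the sink (the `repartition` call is my addition, not part of the original query):

```scala
// A sketch, assuming `ds` is the streaming Dataset from above.
// Fewer partitions going into the sink means fewer, larger files per trigger,
// at the cost of less write parallelism.
import org.apache.spark.sql.streaming.ProcessingTime

val query = ds
  .repartition(1) // 1 is illustrative; pick a count that matches your data volume
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "/data/kafka_streaming.checkpoint")
  .trigger(ProcessingTime("60 seconds"))
  .start("/data/kafka_streaming.parquet")
```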
The difference is obvious: a much smaller number of files is created for the same number of records.
My question: is there any way to have a low-latency trigger and still keep a smaller number of larger output files?
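One common workaround (my suggestion, not something stated in the original post) is to keep the low-latency trigger and periodically compact the sink's output with a separate batch job. A hedged sketch, where the paths and the target file count are assumptions for illustration:

```scala
// Sketch of a periodic compaction job run alongside the streaming query.
import org.apache.spark.sql.SparkSession

object CompactStreamOutput {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("compact-stream-output").getOrCreate()

    spark.read
      .parquet("/data/kafka_streaming.parquet")   // the streaming sink's output
      .coalesce(8)                                // target a handful of large files
      .write
      .mode("overwrite")
      .parquet("/data/kafka_streaming.compacted") // compacted copy in a separate location

    spark.stop()
  }
}
```

Note that the compacted copy goes to a separate directory: rewriting the sink's own output in place while the stream is running would conflict with the `_spark_metadata` log the file sink maintains, so downstream readers should be pointed at the compacted location instead.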