amazon s3 - How to improve query performance to s3 data from Athena
I have partitioned data stored in S3 in Hive format, like this:
bucket/year=2017/month=3/date=1/filename.json
bucket/year=2017/month=3/date=2/filename1.json
bucket/year=2017/month=3/date=3/filename2.json
Every partition has around 1,000,000 records. I have created a table and partitions in Athena for this.
Now I am running this query in Athena:
select count(*) from mts_data_1 where year='2017' and month='3' and date='1'
This query is taking 1800 seconds to scan 1,000,000 records.
So my question is: how can I improve this query's performance?
I think the problem is that Athena has to read so many files from S3. 250 MB isn't much data, but 1,000,000 files is a lot of files. Athena query performance will improve dramatically if you reduce the number of files, and compressing the aggregated files will help even more. How many files do you need for one day's partition? Even at one-minute resolution, you would need fewer than 1,500 files. If the current query time is ~30 minutes, you might want to start with a lot fewer.
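The file-count estimate is just the number of minutes in a day. A quick check, assuming one aggregated file per minute and taking the 1,000,000-file figure at face value:

```python
# One aggregated file per minute of the day (an assumed target resolution).
MINUTES_PER_DAY = 24 * 60
files_per_day = MINUTES_PER_DAY

# Current file count per daily partition, as estimated in the answer.
current_files = 1_000_000

# Roughly how many times fewer objects Athena would have to open.
reduction_factor = current_files / files_per_day

print(files_per_day)            # 1440
print(round(reduction_factor))
```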
There are many options for aggregating and compressing the records:
- AWS's Kinesis Firehose is a simple way to start on this sort of problem.
- A streaming data processing tool like Apache NiFi offers a richer set of transformation, aggregation, and compression options. I've written a blog post on using Apache NiFi to stream data to S3 for Athena, covering these same issues.
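If you just want to see what the aggregation step amounts to before picking a tool, here is a minimal sketch in plain Python. It is not how Firehose or NiFi do it. It assumes the small files have already been copied to a local directory and that each one holds exactly one JSON record; the function name and paths are made up for illustration. It writes one gzip-compressed file with one JSON record per line, which is a layout Athena can read with a JSON SerDe.

```python
import gzip
import json
from pathlib import Path


def aggregate_json_files(src_dir, dest_file):
    """Merge every *.json file in src_dir into one gzip-compressed,
    newline-delimited JSON file. Returns the number of records written.

    Assumption: each input file contains exactly one JSON document.
    """
    count = 0
    with gzip.open(dest_file, "wt", encoding="utf-8") as out:
        for path in sorted(Path(src_dir).glob("*.json")):
            record = json.loads(path.read_text(encoding="utf-8"))
            # json.dumps emits a single line, so each record stays on one row.
            out.write(json.dumps(record) + "\n")
            count += 1
    return count
```

You would run this once per partition (for example, once per date= directory) and upload the resulting .json.gz file to the same partition prefix in place of the small files. Reading from and writing to S3 directly is left out here.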