Amazon S3 - How to improve query performance on S3 data from Athena

I have partitioned data stored in S3 in Hive format, like this:

bucket/year=2017/month=3/date=1/filename.json
bucket/year=2017/month=3/date=2/filename1.json
bucket/year=2017/month=3/date=3/filename2.json
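As a minimal sketch of how object keys in this layout are formed, assuming the partition columns are `year`, `month`, and `date` as in the paths above (the function name `partition_key` is hypothetical, not part of any library):

```python
def partition_key(year: int, month: int, date: int, filename: str) -> str:
    """Build a Hive-style object key: one key=value directory per partition column."""
    return f"year={year}/month={month}/date={date}/{filename}"


# Reproduces the first path shown above, minus the bucket name.
print(partition_key(2017, 3, 1, "filename.json"))
```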

Every partition has around 1,000,000 records. I have created a table and partitions in Athena for this.

Now I am running this query from Athena:

select count(*) from mts_data_1 where year='2017' and month='3' and date='1'

This query is taking 1800 seconds to scan 1,000,000 records.

So my question is: how can I improve the performance of this query?

I think the problem is that Athena has to read so many files from S3. 250 MB isn't much data, but 1,000,000 files is a lot of files. Athena query performance will improve dramatically if you reduce the number of files, and compressing the aggregated files will help even more. How many files do you need for one day's partition? Even with one-minute resolution, you'd need fewer than 1,500 files. If the current query time is ~30 minutes, you might start with a lot fewer.
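The file-count estimate above is simple arithmetic and can be checked with a short sketch. It assumes one aggregated file per time window; the helper name `files_per_day` is hypothetical.

```python
MINUTES_PER_DAY = 24 * 60  # 1,440 one-minute windows in a day


def files_per_day(window_minutes: int) -> int:
    """Number of aggregated files in one daily partition, one file per time window."""
    # Ceiling division, so a partial final window still gets its own file.
    return -(-MINUTES_PER_DAY // window_minutes)


for window in (1, 5, 60):
    print(f"{window}-minute windows: {files_per_day(window)} files per day")
```

Even the finest window considered here stays below the 1,500-file figure, and coarser windows reduce the count further.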

There are many options for aggregating and compressing your records:

AWS's Kinesis Firehose is a simple way to get started on this sort of problem.
A streaming data processing tool like Apache NiFi would offer a richer set of transformation, aggregation, and compression options. I've written a blog post on using Apache NiFi to stream data to S3 for Athena, covering these same issues.
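To illustrate the idea of aggregation plus compression, independent of Firehose or NiFi, here is a minimal sketch in plain Python. It is not how either tool works internally. It assumes the records are JSON objects and that the table can read gzip-compressed files with one JSON record per line; the function name `aggregate_records` and the sample records are hypothetical, and uploading the resulting blobs to S3 is left out.

```python
import gzip
import json
from typing import Iterable, Iterator


def aggregate_records(records: Iterable[dict], batch_size: int) -> Iterator[bytes]:
    """Yield gzip-compressed, newline-delimited JSON blobs of up to batch_size records."""
    batch = []
    for record in records:
        batch.append(json.dumps(record))
        if len(batch) == batch_size:
            yield gzip.compress(("\n".join(batch) + "\n").encode("utf-8"))
            batch = []
    if batch:  # flush the final, partially filled batch
        yield gzip.compress(("\n".join(batch) + "\n").encode("utf-8"))


# Hypothetical records standing in for the contents of many small JSON files.
records = [{"id": i, "value": i * 2} for i in range(10)]
blobs = list(aggregate_records(records, batch_size=4))
print(len(blobs), "aggregated objects instead of", len(records), "small ones")
```

Each blob would be written as a single object under the relevant partition prefix, so the number of objects per partition is controlled by `batch_size` rather than by the number of records.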
