S3 JSON query

We use Superset (a visualization tool) for analyzing a dataset by querying AWS Athena, the data being stored in AWS S3. The query time most often exceeded the threshold set by Superset, resulting in the charts on Superset not getting updated. Notwithstanding the frustration of the visualization tool not updating the charts (which our Data Analysts wanted desperately), the cost was also prohibitive, as Amazon Athena's pricing is based on the bytes scanned.

We had to step back and think about the reason for the poor query performance despite all the performance tuning, and here is where I got to know that the problem was with how the data is stored in S3. Our data in S3 was in JSON format and uncompressed. Hence, our team decided to apply compression (gzip) so as to reduce the cost of queries by storing the S3 data in a compressed format; compressed JSON drops the key-value encoding of JSON and stores keys and values in separate parallel arrays. This time it was a joy to see the data scanned by Athena reduced, along with the cost.

This solved the problem, or so I thought. Well, not completely, as it was surprising to see that Superset was still unable to load the visuals: the queries ran for the maximum time available and ended in timed-out errors. What now? The query was tuned, the data was reformatted, so why were the queries still failing to fetch results? The only way was to go back, re-check what more could be done, and understand how the data is actually laid out in S3 (as S3 is where the data gets queried from). Upon a little research, we came to know that there were numerous smaller files in S3 (more than 7000 files per partition). Could this be the problem? It could, as an AWS blog on performance tuning techniques for Athena states that "optimum file size is crucial for query performance".

Coming to the optimum file size, it was now quite clear that our individual S3 files were very small, each holding little data, which leaves the execution engine spending more time listing directories, opening and closing files, managing huge amounts of metadata, setting up to transfer results, and all the other steps involved in the process. The solution is to avoid storing smaller S3 files in huge numbers and instead have fewer files with more data in each. Not clear what this means? We simply need to combine the smaller files, reducing numerous small files to a few large ones, which makes query execution easier for Athena. Is it then safe to assume that large files improve performance? NO: files can be neither too small nor too large, and they should be of an optimum size at which performance does not degrade. So we planned to merge multiple S3 files into fewer files. How I started merging files was by downloading them locally and combining the JSON from 'n' files into a single file, roughly as sketched below.
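Here is a minimal sketch of that merge-and-compress step in Python with boto3. The bucket, prefix, and batch size are hypothetical placeholders, and it assumes the objects are newline-delimited JSON (one record per line), the layout Athena's JSON SerDe reads:

```python
import gzip
import boto3

s3 = boto3.client("s3")

BUCKET = "my-data-bucket"           # hypothetical bucket
PREFIX = "events/dt=2020-01-01/"    # hypothetical partition prefix
BATCH = 500                         # small files merged per output file

def list_keys(bucket, prefix):
    """Yield every object key under the prefix."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]

def merge_batch(keys, out_key):
    """Concatenate newline-delimited JSON objects and upload one gzipped file."""
    lines = [
        s3.get_object(Bucket=BUCKET, Key=key)["Body"].read().rstrip(b"\n")
        for key in keys
    ]
    s3.put_object(
        Bucket=BUCKET,
        Key=out_key,
        Body=gzip.compress(b"\n".join(lines) + b"\n"),  # Athena reads .json.gz as-is
        ContentEncoding="gzip",
    )

keys = list(list_keys(BUCKET, PREFIX))
for i in range(0, len(keys), BATCH):
    merge_batch(keys[i:i + BATCH], f"{PREFIX}merged/part-{i // BATCH:05d}.json.gz")
```

After a run like this, the original small files should be deleted or moved out of the table's location so Athena does not scan the same rows twice.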


Having a basic idea of the order of execution of query statements can be helpful, and we can also see detailed information on the execution plan for a query using the EXPLAIN & EXPLAIN ANALYZE statements. So, how to optimize the queries is the question; here are some ways to optimize a query (a sketch follows the list):

- Optimized use of JOIN, ORDER BY and GROUP BY clauses.
- Use of subqueries when required: writing a query without nesting can cause a massive data explosion, and a nested query is a tool for performing operations in multiple steps.
- Use of (database) partitions to filter the targeted data; this helps reduce the amount of data scanned per query.
- Avoiding multiple joins, and if that doesn't suffice, changing the database table schema in a way that requires fewer joins.
- Selecting only the required fields instead of * in the SELECT statement.
- Choosing the right aggregate and scalar functions as per the query requirement.
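To make a few of these concrete, here is a minimal sketch that submits an EXPLAIN through the Athena API (the `events` table, its `dt` partition column, the `analytics` database, and the results bucket are all hypothetical; EXPLAIN needs a recent Athena engine version). It uses an explicit column list instead of *, filters on a partition column, and keeps the aggregation simple:

```python
import boto3

athena = boto3.client("athena")

# Explicit columns and a partition predicate -- the filter on dt prunes
# whole S3 prefixes, which directly cuts the bytes Athena scans (and bills).
QUERY = """
SELECT user_id, COUNT(*) AS events
FROM events
WHERE dt = '2020-01-01'
GROUP BY user_id
ORDER BY events DESC
LIMIT 100
"""

# Prefixing the statement with EXPLAIN returns the execution plan
# instead of the rows, which helps spot unintended full scans.
resp = athena.start_query_execution(
    QueryString="EXPLAIN " + QUERY,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(resp["QueryExecutionId"])  # fetch the plan later with get_query_results
```

Swapping EXPLAIN for EXPLAIN ANALYZE (where supported) additionally reports runtime statistics such as rows and bytes processed per stage.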





