Athena 2 Cloudtrail "HIVE_BAD_DATA: Line too long"

• 3 min read
Hunter Fernandes

Software Engineer


Amazon recently announced the general availability of Athena 2, which contains a bunch of performance improvements and features.

As part of our release process, we query all of our Cloudtrail logs to ensure that no secrets were modified unexpectedly. But Cloudtrail produces hundreds of thousands of tiny JSON files, and querying them with Athena takes forever because, under the hood, Athena has to fetch every one of those files from S3. The query takes 20-30 minutes to run, which hurts the developer experience.
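To give a flavor of what this looks like, here is a rough sketch of kicking off that kind of query through the Athena API with boto3. The cloudtrail_logs table name, the Secrets Manager filter, and the results bucket are illustrative placeholders rather than our actual setup; the column names come from the standard CloudTrail table definition that AWS documents for Athena.

```python
import time

import boto3  # AWS SDK for Python

athena = boto3.client("athena")

# Column names follow the standard CloudTrail table AWS documents for Athena.
# The table name, event filter, and results bucket below are placeholders.
QUERY = """
SELECT eventtime, eventname, useridentity.arn
FROM cloudtrail_logs
WHERE eventsource = 'secretsmanager.amazonaws.com'
  AND eventname LIKE 'Update%'
"""


def run_secrets_check() -> str:
    """Start the query and block until Athena reports a terminal state."""
    execution = athena.start_query_execution(
        QueryString=QUERY,
        QueryExecutionContext={"Database": "default"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    query_id = execution["QueryExecutionId"]

    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(10)  # On the raw CloudTrail files, this loop runs for 20-30 minutes.


if __name__ == "__main__":
    print(run_secrets_check())
```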

Worse than being slow, the query frequently fails with Query exhausted resources at this scale factor. The documentation suggests this means the query used more resources than Athena planned for. You can usually make the error go away by running the same query a few more times, but you only find out it failed after waiting 15 minutes. It’s a huge waste of time.

To fix this, every night we combine all of the day's tiny Cloudtrail files into a single large file. A day's worth is about 900 MB of raw data but compresses down to only 60 MB. We then build our Athena schema over these compressed daily rollup files and query them instead, which cuts the query time down to about 3-4 minutes.
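The nightly rollup job is conceptually simple. Below is a minimal sketch of the idea, assuming each source file is a standard gzipped CloudTrail document with a Records array and that the rollup writes one JSON record per line; the bucket names and key layout are placeholders. A production job would also want to stream the data rather than buffer a whole day in memory.

```python
import gzip
import json

import boto3  # AWS SDK for Python

s3 = boto3.client("s3")

# Placeholder names -- substitute your own buckets, account, region, and date.
SOURCE_BUCKET = "my-cloudtrail-bucket"
SOURCE_PREFIX = "AWSLogs/123456789012/CloudTrail/us-east-1/2019/06/22/"
DEST_BUCKET = "my-rollup-bucket"
DEST_KEY = "rollup/dt=20190622/data.json.gz"


def rollup_day() -> None:
    """Merge one day's tiny CloudTrail files into a single gzipped rollup file."""
    records = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=SOURCE_PREFIX):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=SOURCE_BUCKET, Key=obj["Key"])["Body"].read()
            # Each CloudTrail file is a gzipped JSON document with a "Records" array.
            records.extend(json.loads(gzip.decompress(body))["Records"])

    # Emit one record per line so no individual line grows unreasonably long.
    payload = "\n".join(json.dumps(record) for record in records).encode("utf-8")
    s3.put_object(Bucket=DEST_BUCKET, Key=DEST_KEY, Body=gzip.compress(payload))


if __name__ == "__main__":
    rollup_day()
```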

This worked great on Athena 1. But on Athena 2, we started seeing errors like this:

Your query has the following error(s):
HIVE_BAD_DATA: Line too long in text file: s3://xxx/rollup/dt=20190622/data.json.gz
This query ran against the "default" database, unless qualified by the query.
Please post the error message on our forum or contact customer support with Query Id: aaa8d916-xxxx-yyyy-zzzz-000000000000.

Contrary to the error message, none of the lines in the file are too long; they are at most about 2 KB. There seems to be a bug in the AWS-provided Cloudtrail parser that treats the whole file as a single line, which then trips some hidden cap on line length.

Some sleuthing through the Presto source code (which Athena is based on) shows that there is a default maximum line length of 100 MB. So now we split the consolidated Cloudtrail log into chunks of less than 100 MB and query those instead.
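Here is a rough sketch of that splitting step, assuming the rollup file holds one JSON record per line as in the sketch above. The bucket names and chunk naming scheme are placeholders, and the size threshold simply stays under the 100 MB Presto default.

```python
import gzip

import boto3  # AWS SDK for Python

s3 = boto3.client("s3")

# Placeholder names -- substitute your own buckets, prefixes, and date.
BUCKET = "my-rollup-bucket"
SOURCE_KEY = "rollup/dt=20190622/data.json.gz"
DEST_PREFIX = "rollup-chunked/dt=20190622/"

# Keep each chunk below Presto's 100 MB default maximum line length.
CHUNK_BYTES = 100 * 1024 * 1024


def upload_chunk(lines: list[bytes], index: int) -> None:
    """Gzip a batch of record lines and upload it as one chunk object."""
    key = f"{DEST_PREFIX}data-{index:03d}.json.gz"
    s3.put_object(Bucket=BUCKET, Key=key, Body=gzip.compress(b"".join(lines)))


def split_into_chunks() -> None:
    """Split the consolidated daily file into chunks under CHUNK_BYTES of raw data."""
    body = s3.get_object(Bucket=BUCKET, Key=SOURCE_KEY)["Body"].read()
    lines = gzip.decompress(body).splitlines(keepends=True)

    chunk, size, index = [], 0, 0
    for line in lines:
        # Flush the current chunk before it would exceed the size limit.
        if chunk and size + len(line) > CHUNK_BYTES:
            upload_chunk(chunk, index)
            chunk, size, index = [], 0, index + 1
        chunk.append(line)
        size += len(line)
    if chunk:
        upload_chunk(chunk, index)


if __name__ == "__main__":
    split_into_chunks()
```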

This works out fine. But it’s a pain and a waste of time to do this.

Athena has a cap on the total number of partitions you can have in a table. We used to consume only one partition per day, but this change ups it to 9 per day, one per 100 MB chunk (and growing as our data grows). Since the cap is 20,000, we’re still well within quota.

I’m hoping that AWS will fix this bug soon. Everything about it is needlessly annoying.