From Theory to Practice: Count distinct optimization in Trino / Presto
Approximate count distinct is a powerful technique used in a variety of use cases where exact count distinct is computationally expensive or not feasible due to the size of the dataset. Here are some practical use cases where approximate count distinct can be particularly useful since EXACT answer is not the primary requirement:
Web Analytics: Counting the number of unique visitors to a website is a common task in web analytics.
Network Traffic Analysis: Network traffic analysis involves monitoring network traffic to identify potential security threats, network performance issues, or other anomalies.
Machine Learning: Machine learning algorithms often require the calculation of the number of distinct values in a dataset as a pre-processing step. However, for very large datasets, this can be computationally expensive.
Social Media Analysis: Social media platforms generate massive amounts of data, and counting the number of unique users who have interacted with a particular post or piece of content is a common task.
Data Observability / Data Quality Frameworks: Approximate count distinct can also be used in data quality checks to identify potential issues or anomalies in large datasets.
Mathematics behind probabilistic data structures
A monoid is a mathematical structure consisting of a set of values and an operation that combines two values to produce a third value, such that the operation is associative and has an identity element. Here are the main properties of a monoid:
Closure: The set of values under consideration must be closed under the binary operation. That is, if a and b are elements of the set, then their combination using the operation must also be an element of the set.
Associativity: The binary operation must be associative. That is, for any three elements a, b, and c in the set, (a * b) * c must be equal to a * (b * c), where * is the binary operation.
Identity: There exists an identity element, denoted by e, in the set such that for any element a in the set, a * e = e * a = a.
How to optimize Trino queries using approximations?
Trino/Presto provides several built-in approximate aggregate functions that allow users to estimate the result of an expensive operation without actually executing the operation on the entire dataset. Here is the documentation of the approximate functions provided by Trino: https://trino.io/docs/current/functions/aggregate.html#
Let’s Test
Setup used in the testing:
tpc-ds public dataset
Ran Spark job to download and ingest the data in s3
Ingested data is saved in Apache Iceberg format
AWS Athena to query the data
Run 1: Using count distinct query
Run 2: Using approximate aggregate function
As you can see even on smaller dataset, we can see a performance gain of ~18% percentage but by compromising some percentage of accuracy and it would improve in range of ~40-50% as the volume of table would grow.
Conclusion
In summary, using approximate counting in SQL queries can be a powerful technique for optimizing queries, improving performance, and reducing computational cost. By understanding the principles behind probabilistic data structures and the approximate functions provided by Trino, users can achieve accurate and efficient results in their data analysis and processing tasks.