The Statistics framework in DataFusion is a foundational component for query planning and execution. It provides metadata about datasets, which drives optimization decisions and influences runtime behavior. This task focuses on a comprehensive redesign of the Statistics representation: it transitions to an enum-based structure that supports multiple distribution types, offering greater flexibility and expressiveness.
Statistics_v2 Representation
Define a new statistics representation as an enum to accommodate different distribution types. The new Statistics_v2 enum will include the following variants:
Uniform: Represents data with a uniform distribution.
Uniform {
// Interval is an existing struct that implements interval arithmetic (IA).
// You will utilize and/or draw inspiration from the IA implementation in
// DataFusion extensively for the purposes of this project.
interval: Interval,
}
Exponential: Represents data with an exponential distribution.
Exponential {
rate: ScalarValue,
offset: ScalarValue,
}
Gaussian: Represents data with a Gaussian (normal) distribution.
Gaussian {
mean: ScalarValue,
variance: ScalarValue,
}
Unknown: Represents datasets with unspecified distribution characteristics. Therefore, this variant only includes basic descriptive statistics.
Unknown {
mean: ScalarValue,
median: ScalarValue,
std_dev: ScalarValue,
range: Interval, // Hard (certain) bounds, if known.
}
// The first three values take the `None` internal value when there is no estimate.
// The default value of `range`, when nothing is known, is the whole interval.
The statistics framework will fall back to the Unknown variant whenever a calculation creates a Statistics_v2 object whose distribution is not (yet) supported (i.e. anything except Uniform, Exponential, or Gaussian).
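The variants above can be put together into a single enum, sketched below under stated assumptions. To keep the sketch compilable without DataFusion, it replaces `ScalarValue` with plain `f64` (and `Option<f64>` where the description allows "no estimate") and replaces DataFusion's `Interval` with a small stand-in struct. The helpers `new_unknown` and `add_independent` are hypothetical names used only for illustration. The Gaussian arm relies on the standard fact that the sum of two independent Gaussian variables is Gaussian, with means and variances added.

```rust
// Stand-in for DataFusion's `Interval` (an assumption of this sketch):
// a closed interval over f64 with possibly infinite endpoints.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Interval {
    pub lower: f64,
    pub upper: f64,
}

impl Interval {
    /// The whole real line: the default `range` when nothing is known.
    pub const UNBOUNDED: Interval = Interval {
        lower: f64::NEG_INFINITY,
        upper: f64::INFINITY,
    };
}

// `f64` / `Option<f64>` stand in for `ScalarValue`; `None` means "no estimate".
#[allow(non_camel_case_types)]
#[derive(Debug, Clone, PartialEq)]
pub enum Statistics_v2 {
    Uniform { interval: Interval },
    Exponential { rate: f64, offset: f64 },
    Gaussian { mean: f64, variance: f64 },
    Unknown {
        mean: Option<f64>,
        median: Option<f64>,
        std_dev: Option<f64>,
        range: Interval,
    },
}

impl Statistics_v2 {
    /// Hypothetical helper: an `Unknown` value carrying no information at all.
    pub fn new_unknown() -> Self {
        Statistics_v2::Unknown {
            mean: None,
            median: None,
            std_dev: None,
            range: Interval::UNBOUNDED,
        }
    }

    /// Hypothetical helper: statistics of `X + Y` for independent `X` and `Y`.
    /// Only the Gaussian + Gaussian case stays inside a supported family;
    /// every other combination falls back to `Unknown`.
    pub fn add_independent(&self, other: &Self) -> Self {
        match (self, other) {
            (
                Statistics_v2::Gaussian { mean: m1, variance: v1 },
                Statistics_v2::Gaussian { mean: m2, variance: v2 },
            ) => Statistics_v2::Gaussian {
                mean: *m1 + *m2,
                variance: *v1 + *v2,
            },
            // A fuller implementation could still propagate a mean or a
            // range here; this sketch conservatively keeps nothing.
            _ => Statistics_v2::new_unknown(),
        }
    }
}
```

For example, adding two uniform variables does not yield a uniform variable, so `add_independent` returns the `Unknown` variant in that case rather than claiming a distribution it cannot represent.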
Implement methods to validate and ensure the internal consistency of each Statistics_v2 variant. Examples of validation rules include:
For Exponential, the rate must be positive.
For Gaussian, the variance must be non-negative.
For Unknown, ensure:
mean and median (when specified) always fall within the range.
std_dev (when specified) always remains non-negative and is consistent with range (when it exists). For details, see this post.
mean, range, and std_dev for all variants. Note that these are implicit for some variants; e.g. the mean value of a uniform distribution is the arithmetic average of its lower and upper bounds.
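The validation rules and the per-variant mean, range, and std_dev can be sketched as follows. This is a minimal illustration, not DataFusion's implementation, and it rests on several assumptions: `f64` and a small `Interval` struct stand in for `ScalarValue` and DataFusion's `Interval`; the exponential density is taken to be `rate * exp(-rate * (x - offset))` for `x >= offset`; the check on uniform bounds is an extra rule not listed above; and "consistent with range" is read as Popoviciu's inequality, which bounds the standard deviation by half the width of the range (the linked post may state a different condition). The method names `is_valid`, `mean`, `std_dev`, and `range` are hypothetical.

```rust
// Stand-in for DataFusion's `Interval` (an assumption of this sketch).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Interval {
    pub lower: f64,
    pub upper: f64,
}

impl Interval {
    pub fn contains(&self, x: f64) -> bool {
        self.lower <= x && x <= self.upper
    }
    pub fn width(&self) -> f64 {
        self.upper - self.lower
    }
}

// `f64` / `Option<f64>` stand in for `ScalarValue`; `None` means "no estimate".
#[allow(non_camel_case_types)]
#[derive(Debug, Clone, PartialEq)]
pub enum Statistics_v2 {
    Uniform { interval: Interval },
    Exponential { rate: f64, offset: f64 },
    Gaussian { mean: f64, variance: f64 },
    Unknown { mean: Option<f64>, median: Option<f64>, std_dev: Option<f64>, range: Interval },
}

use Statistics_v2::*;

impl Statistics_v2 {
    /// Checks the internal consistency rules listed above.
    pub fn is_valid(&self) -> bool {
        match self {
            // Extra assumed rule: the bounds must be ordered.
            Uniform { interval } => interval.lower <= interval.upper,
            Exponential { rate, .. } => *rate > 0.0,
            Gaussian { variance, .. } => *variance >= 0.0,
            Unknown { mean, median, std_dev, range } => {
                let in_range = |v: &Option<f64>| v.map_or(true, |x| range.contains(x));
                // Assumed reading of "consistent with range":
                // 0 <= std_dev <= width / 2 (Popoviciu's inequality).
                let std_ok = std_dev.map_or(true, |s| s >= 0.0 && s <= range.width() / 2.0);
                range.lower <= range.upper && in_range(mean) && in_range(median) && std_ok
            }
        }
    }

    /// Mean of the distribution; implicit for Uniform and Exponential.
    pub fn mean(&self) -> Option<f64> {
        match self {
            Uniform { interval } => Some((interval.lower + interval.upper) / 2.0),
            Exponential { rate, offset } => Some(*offset + 1.0 / *rate),
            Gaussian { mean, .. } => Some(*mean),
            Unknown { mean, .. } => *mean,
        }
    }

    /// Standard deviation; `None` only when an Unknown has no estimate.
    pub fn std_dev(&self) -> Option<f64> {
        match self {
            Uniform { interval } => Some(interval.width() / 12.0_f64.sqrt()),
            Exponential { rate, .. } => Some(1.0 / *rate),
            Gaussian { variance, .. } => Some(variance.sqrt()),
            Unknown { std_dev, .. } => *std_dev,
        }
    }

    /// Hard bounds on the values the variable can take.
    pub fn range(&self) -> Interval {
        match self {
            Uniform { interval } => *interval,
            Exponential { offset, .. } => Interval { lower: *offset, upper: f64::INFINITY },
            Gaussian { .. } => Interval { lower: f64::NEG_INFINITY, upper: f64::INFINITY },
            Unknown { range, .. } => *range,
        }
    }
}
```

Returning `Option<f64>` from `mean` and `std_dev` keeps one signature across all variants: the three parametric variants always produce a value, while `Unknown` simply forwards whatever estimate it holds.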