Background and Motivation

Stream processing has become a central part of big data workloads, and for good reason. However, most computation types in Datafusion currently require all intermediate data to be fully materialized. For unbounded data sources (streams), this is infeasible in terms of both memory and computation time, because the input never ends.

Streaming support will help Datafusion/Ballista establish itself as the next-generation go-to technology for query processing, and will motivate many users of conventional Java-based technologies to consider “upgrading” to Datafusion/Ballista.

Datafusion already provides many file input formats for query processing, such as Avro, CSV, and JSON. We can extend this functionality to support both unbounded streaming sources and bounded batch sources in a unified way. To do so, we need to consider:

Goals

Non-goals

Initial Proposal

Stream Sources