Background and Motivation

Streaming data processing has become a central workload in big data systems, and for good reason. However, most computation types in DataFusion currently require all intermediate data to be fully materialized. This is not feasible in terms of either memory or computation time for unbounded data sources (streams).

Streaming support will help DataFusion/Ballista establish itself as the next-generation go-to technology for query processing, and may motivate many users of conventional Java-based technologies to consider “upgrading” to DataFusion/Ballista.

DataFusion provides many file input formats for query processing, such as Avro, CSV, and JSON. We can extend this functionality to support both unbounded streaming sources and bounded batch sources in a unified way. To do so, we need to consider the following goals and non-goals.

Goals

Support mini-batch streaming execution in operators.
Provide an extensible software architecture that leaves room to improve the user experience.
Improve support for composing queries.
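
To make the first goal concrete, here is a minimal sketch of what mini-batch execution could look like for a simple aggregate. It is an illustration under stated assumptions, not DataFusion code: `Batch`, `SumCount`, and `running_sum_count` are hypothetical names, and a batch is modeled as a plain vector of integers rather than an Arrow record batch. The operator keeps only a small running state and updates it one batch at a time, so it never needs the full input in memory.

```rust
// Hypothetical sketch: none of these names come from DataFusion.

/// A mini-batch of rows; a single i64 column keeps the example small.
type Batch = Vec<i64>;

/// Running aggregate state. Its size is constant, so memory use does not
/// grow with the length of the input.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct SumCount {
    sum: i64,
    count: u64,
}

impl SumCount {
    /// Fold one mini-batch into the running state.
    fn update(&mut self, batch: &[i64]) {
        self.sum += batch.iter().sum::<i64>();
        self.count += batch.len() as u64;
    }
}

/// Consumes any source of mini-batches, bounded or unbounded, and lazily
/// yields the running aggregate after each batch.
fn running_sum_count<I>(source: I) -> impl Iterator<Item = SumCount>
where
    I: IntoIterator<Item = Batch>,
{
    let mut state = SumCount::default();
    source.into_iter().map(move |batch| {
        state.update(&batch);
        state // `SumCount` is `Copy`, so this returns a snapshot
    })
}
```

The same function accepts a bounded source such as `vec![vec![1, 2, 3], vec![4, 5]]` and an unbounded one such as `(0i64..).map(|i| vec![i, i])`. In the unbounded case the caller decides how many results to pull, for example with `.take(3)`, because the iterator is lazy.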

Non-goals

Supporting ultra-low-latency jobs (not considered right now).
Implementing complex back-pressure mechanisms between operators (not considered right now).
Providing end-to-end delivery guarantees such as exactly-once (future work in Ballista).

Initial Proposal

Stream Sources