Adding data provenance support to Apache Spark
The VLDB Journal, 27(5):595-615, October 2018.
Special issue on best papers of VLDB 2016
Matteo Interlandi, Ari Ekmekji, Kshitij Shah, Muhammad Ali Gulzar, Sai Deep Tetali, Miryung Kim, Todd Millstein, Tyson Condie
Debugging data processing logic in
data-intensive scalable computing (DISC) systems is a difficult
and time-consuming effort. Today’s DISC systems offer very little
tooling for debugging programs, and as a result, programmers spend
countless hours collecting evidence (e.g., from log files) and
performing trial-and-error debugging. To aid this effort, we built
Titian, a library that enables data provenance --
tracking data through transformations -- in Apache Spark. Data
scientists using the Titian Spark extension will be able to
quickly identify the input data at the root cause of a potential
bug or outlier result. Titian is built directly into the Spark
platform and offers data provenance support at interactive speeds
-- orders of magnitude faster than alternative solutions -- while
minimally impacting Spark job performance; observed overheads for
capturing data lineage rarely exceed 30% above the baseline job
execution time.
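
As a rough illustration of what "tracking data through transformations" means, the sketch below models provenance in plain Python. It is not Titian's API and does not use Spark: the helper names (`load`, `map_records`, `filter_records`, `reduce_by_key`, `trace_back`) and the sensor data are hypothetical, chosen only to show the idea. Each record carries the set of input line numbers that contributed to it, so an outlier in the output can be traced back to the input lines behind it.

```python
# Hypothetical toy model of data provenance: NOT Titian's API, NOT Spark.
# Every record is a (value, lineage) pair, where lineage is the set of
# input line numbers that contributed to the value.

def load(lines):
    """Tag each input line with its own index as its initial lineage."""
    return [(line, frozenset([i])) for i, line in enumerate(lines)]

def map_records(records, fn):
    """Transform values one-to-one; lineage is carried over unchanged."""
    return [(fn(value), lineage) for value, lineage in records]

def filter_records(records, pred):
    """Keep records whose value satisfies pred; lineage is unchanged."""
    return [(value, lineage) for value, lineage in records if pred(value)]

def reduce_by_key(records, fn):
    """Combine values per key and union the lineage of everything combined."""
    acc = {}
    for (key, value), lineage in records:
        if key in acc:
            old_value, old_lineage = acc[key]
            acc[key] = (fn(old_value, value), old_lineage | lineage)
        else:
            acc[key] = (value, lineage)
    return [((key, value), lineage) for key, (value, lineage) in acc.items()]

def trace_back(records, pred, inputs):
    """Return the input lines that contributed to outputs matching pred."""
    ids = set()
    for value, lineage in records:
        if pred(value):
            ids |= lineage
    return [inputs[i] for i in sorted(ids)]

# Made-up input: one reading is suspiciously large, one line is malformed.
inputs = [
    "sensor_a,21.5",
    "sensor_b,19.0",
    "sensor_a,950.0",
    "sensor_b,20.5",
    "malformed",
]

fields = map_records(load(inputs), lambda line: line.split(","))
valid = filter_records(fields, lambda parts: len(parts) == 2)
pairs = map_records(valid, lambda parts: (parts[0], float(parts[1])))
maxima = reduce_by_key(pairs, max)

# Trace the outlier maximum back to the input lines it was derived from.
suspects = trace_back(maxima, lambda kv: kv[1] > 100.0, inputs)
print(suspects)
```

The trace returns every input line that fed the outlier's key, not only the offending reading, since all of them took part in the aggregation. This toy version keeps the lineage sets in memory next to each record, which would not scale to a real DISC workload.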