Scientific workflow systems enable scientists to perform large-scale, data-intensive scientific experiments using distributed computing resources. However, existing scientific workflow systems require a large investment of time to learn, and adapting existing workflows to them is difficult.
Thus, many scientific workflows are still implemented in script-based languages (such as Python and R) due to familiarity and extensive third-party library support.
WorkflowDSL introduces the “mlwf” domain-specific language for composing scientific workflows. Using this DSL, domain experts can compose a workflow by defining its tasks, parameters, and data dependencies. Since an mlwf specification contains only the workflow composition, users can easily reuse or repurpose workflows.
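To make the idea concrete, the sketch below shows the kind of information such a specification captures. This is a hypothetical illustration in plain Python, not actual mlwf syntax; the workflow name, parameter, and task names are all invented for the example.

```python
# Hypothetical illustration (NOT actual mlwf syntax): a workflow
# composition is just tasks, parameters, and data dependencies.
workflow = {
    "name": "gene_expression_analysis",      # hypothetical workflow name
    "parameters": {"threshold": 0.05},       # user-tunable parameters
    "tasks": {
        "load":      {"inputs": [],        "outputs": ["raw"]},
        "normalize": {"inputs": ["raw"],   "outputs": ["clean"]},
        "analyze":   {"inputs": ["clean"], "outputs": ["result"]},
    },
}

def dependencies(wf):
    """Derive task-to-task dependencies from the shared data items."""
    producers = {out: name
                 for name, t in wf["tasks"].items()
                 for out in t["outputs"]}
    return {name: sorted({producers[i] for i in t["inputs"]})
            for name, t in wf["tasks"].items()}

print(dependencies(workflow))
# → {'load': [], 'normalize': ['load'], 'analyze': ['normalize']}
```

Because the specification declares only *what* data each task consumes and produces, the task-ordering shown here can be derived mechanically, which is what makes the specification reusable across execution engines.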
WorkflowDSL has a pluggable architecture that allows multiple execution engines to use the same workflow specification. It currently supports Python and R as target languages through its customizable execution engines. The selected engine generates task templates, which must then be implemented by a technical expert.
The approach presented in this work is only suitable for dataflow workflows; it does not support workflows that include cyclic operations or decision trees. Thus, an investigation of how to express control flow in a DSL and how to integrate it with the Celery engine would be beneficial for users.
WorkflowDSL also supports scalable execution through a Celery execution engine for Python-based workflows. Using the data dependencies specified by the user, WorkflowDSL can automatically parallelize workflow tasks and execute them on a cluster.
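The scheduling idea behind such dependency-driven parallelization can be sketched without Celery: tasks whose prerequisites are all finished form one "wave" and run concurrently. This is a minimal sketch of the general technique using the standard library, not the actual WorkflowDSL engine; the task graph is invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(tasks, deps):
    """tasks: name -> callable; deps: name -> list of prerequisite names.
    Runs each wave of ready tasks concurrently; returns completion order."""
    done, order = set(), []
    with ThreadPoolExecutor() as pool:
        while len(done) < len(tasks):
            # every task whose prerequisites are finished forms one wave
            ready = [n for n in tasks
                     if n not in done and all(d in done for d in deps[n])]
            for name, fut in [(n, pool.submit(tasks[n])) for n in ready]:
                fut.result()            # wait for the wave to finish
                done.add(name)
                order.append(name)
    return order

# Example DAG: b and c both depend on a and can run in parallel.
deps = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
tasks = {n: (lambda n=n: n) for n in deps}
print(run_parallel(tasks, deps))   # → ['a', 'b', 'c', 'd']
```

In a Celery-based engine the same wave structure would typically be expressed with Celery's canvas primitives (groups for concurrent waves, chains for sequencing), with workers on a cluster replacing the local thread pool.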
Data provenance can be defined as metadata that explains how certain objects came to be: who created or modified them, and when that happened. Data provenance is essential in scientific research, as it eases the process of reproducing experiments, proves that there was no tampering with data that could lead to fabrication or falsification of results, and ensures the integrity of the conducted research.
In addition, provenance can be used to troubleshoot and optimize scientific workflows, as it provides a wealth of data collected during execution, typically including input parameters, the execution environment, intermediate data, outputs, processing time, etc. It also helps avoid redundant execution: reproducing past experiments with identical input data and parameters, or with different parameters, is a feature that would benefit data scientists.
The amount of data collected depends on how fine-grained the provenance was configured to be.
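One plausible shape for such a per-task provenance record is sketched below. The field names and structure are assumptions for illustration, not the tool's actual schema; a finer-grained configuration would add fields (e.g. intermediate data checksums) to the same record.

```python
import json
import platform
import time

def make_record(task, params, inputs, outputs, started, finished):
    """Assemble a hypothetical provenance record for one task execution
    (field names are assumptions, not an actual WorkflowDSL schema)."""
    return {
        "task": task,
        "parameters": params,
        "inputs": inputs,                # identifiers of consumed data
        "outputs": outputs,              # identifiers of produced data
        "environment": {"python": platform.python_version()},
        "duration_s": round(finished - started, 3),
    }

t0 = time.time()
rec = make_record("normalize", {"threshold": 0.05},
                  ["raw"], ["clean"], t0, t0 + 1.5)
print(json.dumps(rec, indent=2))
```

Storing records like this per task is what enables the reproduction and troubleshooting uses described above: the inputs, parameters, and environment of any past run can be looked up and replayed.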
Another avenue of investigation is whether a specific provenance record can be stored on a blockchain to serve as a notary. Finally, another important aspect of provenance is interoperability. The data model presented in this work is specific to the mlwf language; it should therefore be interchangeable with other provenance data models, such as the W3C PROV standard. Thus, a tool that can convert between the implemented data model and other PROV-based data models would be a useful addition.