Apache Oozie is a workflow scheduler system for managing Apache Hadoop jobs. It combines multiple jobs, run in sequence, into one logical unit of work. Oozie is integrated with the Apache Hadoop stack, with YARN at its architectural center, and it supports Hadoop jobs for Apache MapReduce, Pig, Hive, and Sqoop.
Oozie is scalable and flexible. Jobs can be started, stopped, suspended and rerun, which makes it easy to rerun failed workflows. It is also possible to skip a specific failed node.
There are two basic types of Oozie jobs:
- Oozie workflow – Stores and runs workflows composed of Hadoop jobs, e.g., MapReduce, Pig, Hive.
- Oozie coordinator – Runs workflow jobs based on predefined schedules and availability of data.
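The coordinator idea (a schedule combined with a data-availability check) can be illustrated with a small Python sketch. This is a simplified model, not Oozie's coordinator format or API: the function `due_runs`, its parameters, and the sample dates are all hypothetical. It only reports which scheduled runs have their input data at the moment of the check, whereas a real coordinator would keep waiting for missing data.

```python
from datetime import datetime, timedelta


def due_runs(start, end, frequency, available):
    """Return the scheduled times in [start, end) whose input data is available.

    `available` is a set of datetimes for which data exists (an assumption
    made for this sketch; a real system would check storage instead).
    """
    ready = []
    nominal = start
    while nominal < end:
        if nominal in available:
            ready.append(nominal)
        nominal += frequency
    return ready


# Daily schedule over four days; data has arrived for day 1 and day 3 only.
start = datetime(2024, 1, 1)
available = {datetime(2024, 1, 1), datetime(2024, 1, 3)}
ready = due_runs(start, start + timedelta(days=4), timedelta(days=1), available)
print([t.strftime("%Y-%m-%d") for t in ready])
```

Here the runs for 2 and 4 January are scheduled but not started, because their input data is missing.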
How does Oozie work?
Oozie runs as a service in the cluster and clients submit workflow definitions for immediate or later processing.
An Oozie workflow consists of action nodes and control-flow nodes.
An action node represents a workflow task, e.g., moving files into HDFS, running a MapReduce, Pig or Hive job, importing data using Sqoop, or running a shell script or a program written in Java.
A control-flow node controls the workflow execution between actions by allowing constructs such as conditional logic, in which different branches may be followed depending on the result of an earlier action node.
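To make the distinction between the two node types concrete, here is a toy Python model of a workflow graph. It is only a sketch of the concept, not Oozie's workflow definition format or API: the dictionary layout, the function `run_workflow`, and the node names are all invented for illustration. Action nodes do work and route to an "ok" or "error" successor; a decision node does no work and only chooses the next node from earlier results.

```python
def run_workflow(nodes, start):
    """Walk the graph from `start` and return the names of the visited nodes."""
    visited = []
    results = {}  # action name -> True (succeeded) or False (failed)
    current = start
    while current is not None:
        visited.append(current)
        node = nodes[current]
        if node["type"] == "action":
            # Action node: run the task, then follow the ok or error transition.
            ok = node["run"]()
            results[current] = ok
            current = node["ok"] if ok else node["error"]
        elif node["type"] == "decision":
            # Control-flow node: pick a branch based on earlier action results.
            current = node["choose"](results)
        else:
            # "end" and "kill" nodes terminate the walk.
            current = None
    return visited


# Hypothetical workflow: import data, then decide whether to run a Hive step.
nodes = {
    "import-data": {"type": "action", "run": lambda: True,
                    "ok": "check", "error": "fail"},
    "check": {"type": "decision",
              "choose": lambda r: "run-hive" if r["import-data"] else "end"},
    "run-hive": {"type": "action", "run": lambda: True,
                 "ok": "end", "error": "fail"},
    "end": {"type": "end"},
    "fail": {"type": "kill"},
}

path = run_workflow(nodes, "import-data")
print(path)  # ['import-data', 'check', 'run-hive', 'end']
```

If the `import-data` task returned `False` instead, the walk would go straight from `import-data` to `fail`.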
Features of Oozie
- Oozie has a command-line interface and a client API that can be used to launch, control and monitor jobs from a Java application.
- Using its Web Service APIs, one can control jobs from anywhere.
- Oozie can execute jobs that are scheduled to run periodically.
- Oozie can send email notifications upon completion of jobs.
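As a rough illustration of controlling a job through the Web Service APIs, the Python sketch below builds the URL for a job-management request. It assumes the `v1` REST layout (`/v1/job/<job-id>?action=...`, sent as an HTTP PUT), which should be checked against the documentation for your Oozie version. The helper `job_action_url`, the host name, and the job ID are hypothetical, and the sketch only constructs the URL without contacting a server.

```python
from urllib.parse import quote, urlencode

# Job-management actions this sketch accepts (an assumed subset).
ACTIONS = {"start", "suspend", "resume", "kill", "rerun"}


def job_action_url(base_url, job_id, action):
    """Build the URL for a job-management request (assumed v1 layout)."""
    if action not in ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    query = urlencode({"action": action})
    return f"{base_url.rstrip('/')}/v1/job/{quote(job_id)}?{query}"


# Hypothetical server address and job ID, for illustration only.
url = job_action_url(
    "http://oozie.example.com:11000/oozie",
    "0000001-240101000000000-oozie-W",
    "suspend",
)
print(url)
```

Because the interface is plain HTTP, any client that can reach the Oozie server could issue such a request, which is what makes remote job control possible.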