Development process and methodology
Development of an Apex application starts with mapping the functional specification to operators (smaller functional building blocks), which can then be composed into a DAG to collectively provide the functionality required for the use case.
This involves identifying the data sources, formats, transformations, and sinks for the application, and finding matching operators in the Apex library (which will be covered in the next chapter). In most cases, the required connectors are available from the library, which supports frequently used sources such as files and Kafka, along with many other external systems that are part of the Apache Big Data ecosystem.
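As a rough illustration of this decomposition, the sketch below wires a source, a parser, a transformation, and a sink into a linear flow. It is plain Java and deliberately does not use the Apex operator API; `PipelineSketch`, `run`, and `example` are hypothetical names chosen for this example, and a real application would use library operators connected in a DAG instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical illustration only: plain Java, not the Apex operator API.
final class PipelineSketch {

  // Wires a source, two processing steps, and a sink into a linear flow.
  static <A, B, C> void run(Supplier<List<A>> source,
                            Function<A, B> parse,
                            Function<B, C> transform,
                            Consumer<C> sink) {
    for (A record : source.get()) {
      sink.accept(transform.apply(parse.apply(record)));
    }
  }

  // Maps a small "specification" (read lines, split fields, extract the
  // numeric column, collect it) onto the four building blocks above.
  static List<Integer> example() {
    List<Integer> out = new ArrayList<>();
    run(() -> Arrays.asList("a,1", "b,2", "c,3"),           // source
        (String line) -> line.split(","),                   // format parsing
        (String[] fields) -> Integer.parseInt(fields[1]),   // transformation
        (Integer value) -> out.add(value));                 // sink
    return out;
  }

  public static void main(String[] args) {
    System.out.println(example());
  }
}
```

Each lambda stands in for one building block, so any of them can be replaced without touching the others, which is the property the operator model relies on.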
With the comprehensive operator library and set of examples to cover frequently used I/O cases and transformations, it is often possible to assemble a preliminary end-to-end flow that covers a subset of the functionality quickly, before building out the complete business logic in detail.
Examples that show how to work with frequently used library operators and accelerate the path to an initial running application can be found at https://github.com/apache/apex-malhar/tree/master/examples.
Having a basic pipeline working early on in the target environment (or at least close to it) allows various important integration and operational requirements, such as security and access control, to be evaluated in parallel. It also establishes a baseline for iterative and parallel development, and for testing the full-featured operators. Experience from working on complex pipelines shows that having a basic pipeline early can reduce risk and provide better visibility into the progress of a larger project, especially when it has many integration points and a larger development team. Essentially, development dependencies can follow the modular structure of the DAG, allowing the full pipeline to be built up gradually and functions further downstream to be developed in parallel with mocked input when needed.
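The idea of developing a downstream function against mocked input can be sketched as follows. This is a minimal plain-Java example under stated assumptions, not Apex code: `TupleSource`, `countByKey`, and `mockSource` are hypothetical names, and the interface merely stands in for whatever contract the upstream and downstream stages agree on.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names; plain Java rather than the Apex API.
final class MockedInputSketch {

  // Contract between an upstream stage and the stage under development.
  interface TupleSource<T> {
    Iterator<T> tuples();
  }

  // Downstream logic under development: counts occurrences per key.
  static Map<String, Integer> countByKey(TupleSource<String> upstream) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    Iterator<String> it = upstream.tuples();
    while (it.hasNext()) {
      counts.merge(it.next(), 1, Integer::sum);
    }
    return counts;
  }

  // Mock standing in for an upstream stage that has not been built yet.
  static TupleSource<String> mockSource(String... values) {
    List<String> data = Arrays.asList(values);
    return data::iterator;
  }

  public static void main(String[] args) {
    System.out.println(countByKey(mockSource("a", "b", "a")));
  }
}
```

Once the real upstream stage exists, it only has to satisfy the same contract, so the downstream logic does not need to change.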
A large project broken down into a series of smaller and more manageable milestones would roughly involve the following sequence of steps:
- Writing the Java code for a new or customized operator.
- Unit testing (in IDE, no cluster environment needed).
- Integrating the operator into the DAG.
- Integration testing (testing the DAG with potentially mocked data, in IDE).
- Configuring operator properties for the target environment (connector settings, and so on).
- End-to-end testing with a realistic data set in the target environment.
- Tuning (optimizing resource utilization, configuring appropriate platform attributes such as processing locality, memory and CPU allocation, scaling and so on).
Following a similar sequence will ensure that basic functional issues are discovered early on (ideally within the IDE environment, where they are far more efficient to debug and fix) before fully packaging and deploying the pipeline to a cluster.
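To make the unit testing step concrete, the following sketch exercises a piece of operator logic entirely in-process, with no cluster involved. It is a simplified stand-in written in plain Java: `ThresholdFilter` and `ThresholdFilterCheck` are hypothetical classes, the `Consumer` only plays the role of an output port, and in a real project the check would typically live in a test framework such as JUnit rather than in a `main` method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for a custom operator; plain Java, not the Apex API.
final class ThresholdFilter {
  private final int threshold;
  private final Consumer<Integer> output; // plays the role of an output port

  ThresholdFilter(int threshold, Consumer<Integer> output) {
    this.threshold = threshold;
    this.output = output;
  }

  // Forwards only values at or above the threshold.
  void process(int value) {
    if (value >= threshold) {
      output.accept(value);
    }
  }
}

final class ThresholdFilterCheck {

  // Runs the operator logic in-process and returns what it emitted.
  static List<Integer> emittedFor(int threshold, int... inputs) {
    List<Integer> emitted = new ArrayList<>();
    ThresholdFilter filter = new ThresholdFilter(threshold, emitted::add);
    for (int value : inputs) {
      filter.process(value);
    }
    return emitted;
  }

  public static void main(String[] args) {
    List<Integer> result = emittedFor(10, 3, 10, 42);
    if (!result.equals(Arrays.asList(10, 42))) {
      throw new AssertionError("unexpected output: " + result);
    }
    System.out.println("ok");
  }
}
```

Because the output is captured in an ordinary list, a failure can be inspected directly in the debugger, which is the efficiency gain of catching functional issues before deployment.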
In subsequent sections, we will look at each of these phases in more detail.