Running the application
Package the example application by running mvn package -DskipTests in the wordcount subdirectory of the apex-samples directory. This will create the .apa application package in the target directory:
/workspace/apex-samples/wordcount/target/wordcount-1.0-SNAPSHOT.apa
Prepare some input data:
apex@7bd66492cedc:~$ wget https://raw.githubusercontent.com/apache/apex-core/master/LICENSE
apex@7bd66492cedc:~$ hdfs dfs -mkdir /tmp/wordcount
apex@7bd66492cedc:~$ hdfs dfs -put LICENSE /tmp/wordcount/LICENSE
Set up configuration for input and output (src/test/resources/properties-sandbox.xml):
<?xml version="1.0"?>
<configuration>
  <property>
    <name>apex.application.*.operator.*.attr.MEMORY_MB</name>
    <value>256</value>
  </property>
  <property>
    <name>apex.operator.input.prop.directory</name>
    <value>/tmp/wordcount</value>
  </property>
  <property>
    <name>apex.operator.output.prop.filePath</name>
    <value>/tmp/wordcount-result</value>
  </property>
  <property>
    <name>apex.operator.output.prop.outputFileName</name>
    <value>wordcountresult</value>
  </property>
  <property>
    <name>apex.operator.output.prop.maxLength</name>
    <value>500</value>
  </property>
  <property>
    <name>apex.operator.output.prop.alwaysWriteToTmp</name>
    <value>false</value>
  </property>
</configuration>
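To make the property keys less opaque, here is a small Python sketch (purely illustrative, not Apex source code) of how keys of the form apex.operator.&lt;name&gt;.prop.&lt;field&gt; select an operator by name and set one of its properties:

```python
# Illustrative sketch (NOT Apex internals): map Apex-style property keys
# of the form apex.operator.<name>.prop.<field> onto named operator configs.
def apply_properties(props, operators):
    """Apply operator properties to a dict of per-operator settings."""
    for key, value in props.items():
        parts = key.split(".")
        # Expect exactly: apex.operator.<name>.prop.<field>
        if len(parts) == 5 and parts[:2] == ["apex", "operator"] and parts[3] == "prop":
            name, field = parts[2], parts[4]
            if name in operators:
                operators[name][field] = value
    return operators

# The operator names "input" and "output" match the names used in the DAG.
operators = {"input": {}, "output": {}}
props = {
    "apex.operator.input.prop.directory": "/tmp/wordcount",
    "apex.operator.output.prop.filePath": "/tmp/wordcount-result",
}
apply_properties(props, operators)
```

The key point is that the operator name segment ties the property to the operator as it was named when the DAG was assembled, which is why the configuration file can stay separate from the application code.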
Compared to the unit test, the file locations now refer to the default file system. In the sandbox, that's HDFS. The input directory property needs to match the location where the file was previously placed (/tmp/wordcount). Now we are ready to start the CLI:
apex@7bd66492cedc:~$ apex
Apex CLI 3.6.0 12.06.2017 @ 12:28:11 UTC rev: 7cc3470 branch: 7cc3470d99488d985aa7c50c62ecf994121fdb05
apex>
Launch the application, using the application package from the workspace build directory and the configuration file with settings that are specific to this environment:
apex> launch /workspace/apex-samples/wordcount/target/wordcount-1.0-SNAPSHOT.apa -conf /workspace/apex-samples/wordcount/src/test/resources/properties-sandbox.xml
The application should now be launched, and its application id displayed:
{"appId": "application_1490404466322_0001"}
apex (application_1490404466322_0001) >
Note that the CLI also lets you specify properties directly on the launch command. There are also several commands for obtaining information about the running application, such as listing its operators and containers along with system metrics. Enter help for the available commands and options, and exit to leave the Apex CLI and return to the shell prompt. After a few seconds, the results will be available in the HDFS output directory:
apex@7bd66492cedc:~$ hdfs dfs -ls /tmp/wordcount-result/
Found 4 items
-rwxrwxrwx   1 apex supergroup       5425 2017-03-20 04:24 /tmp/wordcount-result/wordcountresult_4.0
-rwxrwxrwx   1 apex supergroup        501 2017-03-20 04:25 /tmp/wordcount-result/wordcountresult_4.1
-rwxrwxrwx   1 apex supergroup        501 2017-03-20 04:27 /tmp/wordcount-result/wordcountresult_4.2
-rwxrwxrwx   1 apex supergroup          0 2017-03-20 04:27 /tmp/wordcount-result/wordcountresult_4.3
Cat the first file to see the output (the remaining files contain empty lines, because the unique counter happens to emit empty maps even when no input was received):
apex@7bd66492cedc:~$ hdfs dfs -cat /tmp/wordcount-result/wordcountresult_4.0
{using=1, 2004=1, incurred=1, party=2, free=2, event=1, solely=1, interfaces=1, sublicense=1, Legal=4, meet=1, fee=1, conditions=9, 3=1, 2=4, 1=2, 0=3, 7=1, 6=1, 5=1, 4=1, retain=1, 9=2, ...
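The result line is the {word=count, ...} representation that Java's default Map.toString() produces. If you want to post-process the output, a small Python sketch (a hypothetical helper, not part of the example application) can parse such a line back into a dictionary:

```python
# Hypothetical post-processing helper: parse a "{word=count, word=count}" line
# (Java Map.toString() style) into a Python dict of word -> count.
def parse_counts(line):
    inner = line.strip().strip("{}")
    counts = {}
    for pair in inner.split(", "):
        if "=" in pair:
            # rpartition keeps any '=' characters that occur inside the word itself
            word, _, count = pair.rpartition("=")
            counts[word] = int(count)
    return counts

sample = "{using=1, 2004=1, conditions=9, party=2}"
print(parse_counts(sample))  # → {'using': 1, '2004': 1, 'conditions': 9, 'party': 2}
```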
All output was produced in a single streaming interval, so that the first part file contains a long line with all word counts. This is the consequence of reading from a small file. With a streaming source and continuous data, we would see multiple part files with counts continuously written.
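The rolling behavior configured via maxLength can be illustrated with a short sketch (an approximation of the idea, not the actual operator code): once a part file reaches the byte limit, the writer rolls over and subsequent records go to the next part file:

```python
# Sketch of rolling-file output (an approximation, not Apex operator code):
# records are appended to the current part until adding the next record
# would exceed max_length bytes, at which point a new part is started.
def roll_parts(records, max_length):
    parts, current, size = [], [], 0
    for rec in records:
        if size + len(rec) > max_length and current:
            parts.append(current)   # close the full part file
            current, size = [], 0   # start the next part
        current.append(rec)
        size += len(rec)
    if current:
        parts.append(current)
    return parts

# A 300-byte record, another 300-byte record, then a 100-byte record,
# with a 500-byte limit, yields two parts: [300] and [300 + 100].
records = ["x" * 300, "y" * 300, "z" * 100]
parts = roll_parts(records, 500)
```

This also explains the empty trailing part file in the listing above: a part can be opened before any data arrives for it.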
Congratulations: we have built and run our first Apex application on a Hadoop cluster. Though the functionality is simple, the exercise covered a lot of ground in terms of environment setup and the overall development process.