Saturday, January 5, 2019

Running Pentaho in a Docker container and running jobs with command-line arguments

Pentaho is a tool used in the IT industry to execute predefined tasks. It comes in two editions: an enterprise edition and the open-source Community Edition (CE). The CE version is freely available for download. There are other tools as well, such as Talend, but since I have used Pentaho, I will write about the CE version: how to use it through the GUI on your local machine, and how to run it in a Docker container.

If you want to use Pentaho on your local machine through the GUI to build and run your job:

1. Download the latest Pentaho CE version from this link and unzip it. The data-integration folder contains several subfolders and a set of scripts, including carte, pan, and kitchen, each provided as a batch (.bat) file and a shell (.sh) file.

2. In the macOS terminal, go to the data-integration directory and execute "./spoon.sh" (make sure it is executable; if it is not, run "chmod +x *.sh").

3. An interactive UI opens up and you are ready to create a job.





4. Some useful tips for creating a job:

* There are two types of tasks you can create: transformations and jobs. They behave differently.

* A transformation is a collection of small units of work called steps. Keep in mind that all the steps of a transformation run concurrently, so avoid ordering dependencies between them.

* Transformations are saved as .ktr files in an XML format and can be run from the command line using ./pan.sh -file="Myfile.ktr"

* Jobs are composed of steps that run sequentially, and a job can include any transformation or other job. They are saved as .kjb files and run with ./kitchen.sh -file="MYfile.kjb" from the command line.

* If your job or transformation requires a connection to a database, you can create one from the Database connection option in the View panel.

* You can supply the database connection details by defining them as parameters such as ${VAR1} and passing the parameter values at run time, either in Job properties -> Parameters -> Value or on the command line using -param:VAR_NAME='value', and so on. See this,
and for the command line see this
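
As a rough sketch of the command-line side of these tips, the snippet below wraps pan.sh and kitchen.sh in one helper that picks the tool from the file extension. PDI_HOME, DRY_RUN, the run_pdi function, and the parameter names are assumptions made for illustration and are not part of Pentaho; only -file= and -param:NAME=value come from the commands above. By default it only prints the command it would run.

```shell
#!/bin/sh
# Sketch only: PDI_HOME, DRY_RUN, run_pdi and the parameter names are hypothetical.
PDI_HOME="${PDI_HOME:-./data-integration}"

# Usage: run_pdi FILE [NAME=value ...]
# .ktr files go to pan.sh, .kjb files go to kitchen.sh.
run_pdi() {
    file="$1"; shift
    case "$file" in
        *.ktr) tool="pan.sh" ;;
        *.kjb) tool="kitchen.sh" ;;
        *) echo "unsupported file type: $file" >&2; return 1 ;;
    esac

    # Rewrite every NAME=value argument as -param:NAME=value.
    n=$#
    while [ "$n" -gt 0 ]; do
        set -- "$@" "-param:$1"
        shift
        n=$((n - 1))
    done

    set -- "$PDI_HOME/$tool" "-file=$file" "$@"
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"   # print the command instead of running it
    else
        "$@"
    fi
}

run_pdi Myfile.ktr DB_HOST=localhost DB_PORT=5432
run_pdi MYfile.kjb DB_HOST=localhost
```

Set DRY_RUN=0 to actually execute the scripts once PDI_HOME points at your unzipped data-integration folder.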

START PENTAHO IN A DOCKER CONTAINER

Starting Pentaho in a Docker container requires Docker to be installed as a prerequisite.

1. There are good GitHub repos for running Pentaho in a container, like this and this.

2. Start the Docker container with the image (I have used this).

3. To build the image, go to the docker folder and build it from the Dockerfile, following the instructions there.

4. The image requires the CARTE environment variables when spinning up containers (I have used master only). When the container is up, enter it using the docker exec -it container_name bash command and go to the data-integration folder inside the container. Try ./kitchen.sh and ./pan.sh

5. If you have a job or transformation file that interacts with a database and you want to supply the connection parameters as command-line arguments, you can store the connection parameters in variables such as ${VAR}, as in the following, and use variable substitution wherever required.


Then save the .ktr and .kjb files. If you open the saved files as they are, you can see that the connection parameters are stored as variables, which can be provided at run time from the command line as in the figure shown in 6.

6. Now you can run any custom job and transformation using pan and kitchen by providing command-line arguments like this.


Here I have provided the connection parameters for PostgreSQL from the command line. Because the job file interacts with two databases, I have given all of them at run time. You can also set the log level by providing -level:Basic as a command-line argument, and redirect the output to a file if you want to keep the logs.

7. What I observed is that you can provide any number of parameters at run time, from the terminal or from the GUI. In the GUI, parameters are specified in Job properties -> Parameters; on the command line, you can use -param:VAR_NAME='value' any number of times.

8. To create a job or transformation that uses variables, use ${VAR} in the steps or the job while creating it, save it as it is, and check the saved .ktr and .kjb files to confirm the values are stored as ${VAR}, so that you can pass all those variables at run time from the command line as shown in 6.
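
The steps above can be sketched as a single docker exec call that runs kitchen.sh inside the container with the connection parameters passed at run time. This is only an illustration: the container name, the in-container path, the job file, and the parameter names are hypothetical, and DRY_RUN and kitchen_in_container are helpers invented for the sketch. By default it prints the command rather than running it.

```shell
#!/bin/sh
# Sketch only: container name, path, job file and parameter names are hypothetical.
CONTAINER="${CONTAINER:-pentaho}"
PDI_DIR="${PDI_DIR:-/opt/data-integration}"

# Usage: kitchen_in_container JOB_FILE [-param:NAME=value ...]
kitchen_in_container() {
    job="$1"; shift
    set -- docker exec "$CONTAINER" "$PDI_DIR/kitchen.sh" \
        "-file=$job" "-level:Basic" "$@"
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"   # print the command instead of running it
    else
        "$@"
    fi
}

# One job talking to two databases, all connection values given at run time.
kitchen_in_container /jobs/load.kjb \
    "-param:SRC_DB_HOST=source-db" "-param:SRC_DB_PORT=5432" \
    "-param:DST_DB_HOST=target-db" "-param:DST_DB_PORT=5432"
```

With DRY_RUN=0 the call executes for real; appending a shell redirect such as > job.log 2>&1 keeps the log output in a file on the host.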


NOTE: If the database containers and Pentaho are running on the same host, then remove the -p port statement when spinning up the containers, since the containers listen on their default ports. But if you are using the Pentaho UI on your local machine, you need to publish the database containers on different local ports and build your job locally.
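
A minimal sketch of the two setups in this note, assuming the containers share a user-defined Docker network so they can reach each other by name. The network name, container names, and the Pentaho image name are placeholders, the CARTE environment variables the image needs are left out, and the run helper and DRY_RUN are invented for the sketch. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch only: network, container and image names are hypothetical.
NETWORK="etl-net"
PDI_IMAGE="my-pentaho-image"   # placeholder for the image you built

# Print each command instead of running it unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Everything in containers on one host: no -p needed, Pentaho reaches
# PostgreSQL at source-db on its default port 5432.
run docker network create "$NETWORK"
run docker run -d --name source-db --network "$NETWORK" \
    -e POSTGRES_PASSWORD=example postgres
run docker run -d --name pentaho --network "$NETWORK" "$PDI_IMAGE"

# Spoon running on the local machine: publish the database on a free
# local port (5433 here) and point the job at localhost:5433.
run docker run -d --name source-db-local -p 5433:5432 \
    -e POSTGRES_PASSWORD=example postgres
```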

















