ETL Process

Step 1: The following folders should exist on the Database server. If not, the folders should be created with appropriate write permission.

/srv/etl

/srv/etl/new
/srv/etl/done
/srv/etl/logs

Step 2: The other process that will extract data from MPAU  as a CSV file on a daily basis is expected to be placed in the /srv/etl/new folder. The filename format is recommended as: yyyy-mm-dd.csv

The columns in the CSV file must be as follows:

degree
unikey
sid
enrolmentyear
cohortyear
stage
title
lastname
givennames
gender
flexsisstatus
majorsubschool
email
mobileph

Step 3: The script file (load.php) that will process the CSV file is located in /srv/etl/script folder

Step 4: The bash script which should be placed in the crontab is /srv/etl/extract.sh this script will invoke the load.php file and the load does the followings as mentioned in the next step.

Step 5:

  • Checks if there is any file to process in /srv/etl/new folder, if there is file then check the validity of the file in terms of content (the columns, and file type).
  • If the file content is valid, the data is loaded onto tbl_etl_load table for the MDWombat Application to process
  • The data loading operation log is written in the /srv/etl/logs/ folder. The filename convention is: etl_processed_unixtimestamp (eg, etl_processed_1425530922) An example log content is cited below:

    File format, content = OK, filename =>2015-5-5.csv
    clearning tbl_etl_load ….. Successful
    Rec #:1Load SQL:
    INSERT INTO tbl_etl_load (degree, unikey, sid, enrolmentyear, cohortyear, stage, title, lastname, givennames, gender, flexsisstatus, majorsubschool, email, mobileph) VALUES (‘M.D’, ‘jcit0000’, ‘300000000’, ‘2014’, ‘2014’, ‘Stage One’, ‘MISS ‘, ‘CITIZEN1’, ‘JESSICA’, ‘Female’, ‘Continuing’, ‘Westmead’, ‘[email protected]’, ‘400000000’);
    Rec #:2Load SQL:
    INSERT INTO tbl_etl_load (degree, unikey, sid, enrolmentyear, cohortyear, stage, title, lastname, givennames, gender, flexsisstatus, majorsubschool, email, mobileph) VALUES (‘M.D’, ‘jcit0001’, ‘300000001’, ‘2014’, ‘2014’, ‘Stage One’, ‘MR ‘, ‘CITIZEN2’, ‘John’, ‘Male’, ‘Continuing’, ‘Westmead’, ‘[email protected]’, ‘400000001’);
    Rec #:3Load SQL:
    INSERT INTO tbl_etl_load (degree, unikey, sid, enrolmentyear, cohortyear, stage, title, lastname, givennames, gender, flexsisstatus, majorsubschool, email, mobileph) VALUES (‘M.D’, ‘bjan1234’, ‘123456’, ‘2014’, ‘2015’, ‘Stage two’, ‘MRS’, ‘JANE2’, ‘JANENEW’, ‘Female’, ‘Continuing’, ‘Nepean’, ‘[email protected]’, ‘468889204’);
    All Data Inserted for Processing….. Successful
    2015-5-5.csv copied to /srv/etl/done
    2015-5-5.csv succesfully deleted from /srv/etl/new
    tbl_etl_load_history record….. Successful

  • After the load the input file is copied back to /srv/etl/done folder and an entry to the tbl_etl_load_history table is made to keep the record.

Step 6: There is a Job on the MDWombat (https://md-web-uat-1.ucc.usyd.edu.au/mdwombat/index.php?r=cronjob) application that can be scheduled to run after the above step.

Step 7: After running the job in Step 6 (which basically does the data transformation of the loaded data  and then synchronisation with the existing matching records) email(s) are sent to the MDWombat Admin notifying the details of the result. A Sample email is provided as follows:

Sample Email - ETL Result Notification
Sample Email – ETL Result Notification