Development Setup
-----

### AWS CLI

You will need the AWS CLI configured on your machine (with access to [Leaf's Dev AWS Account](https://478904141323.signin.aws.amazon.com/console)) to run this code. [Here is how to get the AWS CLI installed](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html).

Note that there are two AWS accounts used at Leaf, and they do not share resources. The account linked above is the Analytics account. If you have the AWS CLI configured with both accounts, you will need to manually set the `AWS_PROFILE` environment variable to the profile linked to the Analytics account before running the OA pipeline.

### Conda Development Environment

The commands below assume you've cloned the repository locally and are executing them from the project root directory (i.e., the directory containing this `README.md` file).

Assuming you have `GNU Make` installed (it ships with macOS and many Linux distributions, but is not available in the development containers as of writing), set up a development conda environment by running:

```
make create_env
```

This will create a conda environment and install all required dependencies for local development (including Jupyter Lab). In addition, it will install `leaf_engine` as an editable dependency and set all required environment variables in your conda environment. If you encounter any issues with this setup, please open an issue.

After the environment is created, run:

```
conda activate pipeline_env
```

If you need more control over environment creation, check the Makefile to see which commands the `create_env` recipe runs.

### Installing Git LFS

[Git LFS](https://git-lfs.github.com/) is a Git extension that replaces large files with text pointers inside Git, while storing the file contents on a remote server. We use it to work with the test dataset. To install, run:

```
git lfs install
```

Your `adapt/.gitattributes` file should already point to the test data directory:

```
tests/test_data/* filter=lfs diff=lfs merge=lfs -text
```

If it doesn't, run:

```
git lfs track "tests/test_data/*"
```

Lastly, pull the contents of the tracked files:

```
git lfs pull
```

### Installing pre-commit hooks

The most important step in this setup is running:

```
pre-commit install
```

This will set up pre-commit hooks that lint and format Python code before each commit. You can run these hooks manually at any time with `pre-commit run --all-files`. The hooks enforce basic quality rules for the repo and make working with this code much easier. You only need to run the command above once, after creating (or re-creating) the environment.

### Running the pipeline

Now you should be able to run the pipeline locally with:

```
python run_etl_pipeline.py --params "./run_params/Johnson & Johnson.json"
```

You should see logs on standard output and intermediate files being cached in a newly created `./cache` folder.

Where pipeline outputs are stored is controlled via the `ANALYTICS_ENVIRONMENT` environment variable. By default, this is set to `local`, which means the outputs are stored in your local `./cache` folder. If you want to store the outputs on S3, you need to set:

```
conda env config vars set ANALYTICS_ENVIRONMENT=dev
```

The `test-analytics-data-bucket` S3 bucket in Leaf's Analytics account is currently used to store outputs.
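Note that variables set with `conda env config vars set` only take effect once the environment is reactivated; for example, to apply and confirm the setting:

```
conda activate pipeline_env        # re-activate so the new variable is picked up
conda env config vars list         # confirm ANALYTICS_ENVIRONMENT is set
```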
You can change the bucket by setting:

```
conda env config vars set ANALYTICS_DATA_BUCKET=<bucket-name>
```

Note that your IAM user (i.e., the user your AWS CLI is configured with) must have read-write access to that S3 bucket.

#### Running with Lane Level Data

In the absence of shipment-level data, the pipeline is slightly modified to return a `lanes.csv` file instead of `shipments.csv`. You can trigger this kind of run by passing any lane-level run type in the `--run-type` parameter. For example:

```
python run_etl_pipeline.py --params "./run_params/Ace Hardware.json" --run-type "LANE"
```

Only datasets that have a matching `type` in the parameter file will be used in that run. For example, using `--run-type RFP` will only use datasets with the `type` value of `RFP` and ignore all others.

#### Internal Runs

##### Type 1 - Active

To run the type 1 active run:

1. Save a CSV of the data from the `analytics.v_leaf_active` view to the "External_Data_Shippers/Leaf Internal" folder here: https://drive.google.com/drive/folders/1MowEJFl4BxBIuDmmetYfOm92iqkA7R6D
2. Create a new dataset in the `Leaf T1 Active.json` param file with a URL to the CSV file.
3. Update `filter_params.max_date` and `oa_params.batch date` to today's date.
4. Update `filter_params.min_date` to 6 months before today's date.

Then run the flow:

```
python run_etl_pipeline.py --params "./run_params/Leaf T1 Active.json" --run-type "TYPE 1 ACTIVE"
python run_adapt_pipeline.py --params "./run_params/Leaf T1 Active.json" --run-type "TYPE 1 ACTIVE"
python run_insert_pipeline.py --params "./run_params/Leaf T1 Active.json" --run-type "TYPE 1 ACTIVE"
```

##### Type 2 - Lighthouse

To run the type 2 lighthouse flow, run:

```
python run_etl_pipeline.py --params "./run_params/Leaf T2 Lighthouse.json" --run-type "TYPE 2 LIGHTHOUSE"
python run_adapt_pipeline.py --params "./run_params/Leaf T2 Lighthouse.json" --run-type "TYPE 2 LIGHTHOUSE"
python run_insert_pipeline.py --params "./run_params/Leaf T2 Lighthouse.json" --run-type "TYPE 2 LIGHTHOUSE"
```

The run params for the T2 run do not need a `label` or `url` entry, as the data is pulled from the analytics database at runtime. Placeholders are used to prevent confusion, e.g.:

```
{
    "label": "Placeholder - not needed for T2",
    "lane_end_date": "2023-02-28",
    "lane_start_date": "2021-01-01",
    "type": "lane",
    "url": "Placeholder - not needed for T2"
}
```

Additionally, `lane_end_date` and `lane_start_date` are used in the Adapt pipeline but not in ETL; they are pulled from the ETL result, `lanes.csv`.

### Debugging: missing dependencies

If you are getting errors because of missing dependencies, it might be because your local environment has become out of date with respect to the `environment.dev.yml` file. The easiest way to update your local environment without running into issues is to remove it:

```
make remove_env
```

and re-create it:

```
make create_env
```

### Updating the environment

If you need to update `environment.yml`, install the required dependency using:

```
conda install -c conda-forge <package-name>
```

Then manually add the dependency to the `environment.yml` file. You can find the package version using:

```
conda env export --no-builds | grep <package-name>
```

Similarly, if you need to add pip dependencies, list them explicitly in `requirements.txt`.
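For example, assuming you wanted to add `pyarrow` (an illustrative package name, not necessarily a dependency of this project), the workflow would look roughly like this:

```
conda install -c conda-forge pyarrow
conda env export --no-builds | grep pyarrow   # note the pinned version
```

and then copy the reported version pin for `pyarrow` into `environment.yml` by hand.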
### Getting set up with Prefect Server

**This step is not necessary for local development.**

Prefect Server is our data pipeline orchestration system that makes it easy to monitor, schedule, and execute analytics runs on remote infrastructure. Read more about it [here](https://docs.prefect.io/orchestration/server/overview.html).

This guide assumes you have the **prod VPN** on. If you don't have access to that VPN, please ask Andy/Gerrit/Engineering for access.

#### Accessing the UI

With the VPN on, you should be able to open the following link in your browser: [http://a70d100ec15dd414986850f9aedb585a-d231ecbc10f6aaf7.elb.us-east-1.amazonaws.com:8080/](http://a70d100ec15dd414986850f9aedb585a-d231ecbc10f6aaf7.elb.us-east-1.amazonaws.com:8080/) (you can bookmark this link for quick access).

On the landing page, scroll down to setup step 2 and, in the input box labelled **"Prefect Server GraphQL endpoint"**, paste the following URL: [http://a03c99403491545eaa8d9a3c266675b9-0e872f8866538cd1.elb.us-east-1.amazonaws.com:4200/](http://a03c99403491545eaa8d9a3c266675b9-0e872f8866538cd1.elb.us-east-1.amazonaws.com:4200/) and click **"Connect"** (a widget in the top right of the page should turn green). You can ignore the remaining setup steps.

From the navigation bar at the top left, expand the side menu, click **"Switch team"**, and select **"LeafAnalytics"** (if not already selected).

#### Triggering remote runs

For your local machine to talk to the remote server, you need to add the following config file at `~/.prefect/config.toml` (if you don't have a `.prefect` directory in your `$HOME`, please create one):

```
backend = "server"

[server]
host = "http://a03c99403491545eaa8d9a3c266675b9-0e872f8866538cd1.elb.us-east-1.amazonaws.com"
port = "4200"
endpoint = "${server.host}:${server.port}"

[server.ui]
host = "http://a70d100ec15dd414986850f9aedb585a-d231ecbc10f6aaf7.elb.us-east-1.amazonaws.com"
port = "8080"
host_port = "8080"
endpoint = "${server.ui.host}:${server.ui.port}"
apollo_url = "${server.host}:${server.port}/graphql"

[server.telemetry]
enabled = false

[cloud]
api = "${server.host}:${server.port}"
```

This file needs to be on the same machine/container where you execute ETL/Adapt runs. After this, you'll be able to trigger runs on the server by adding the `--remote` flag to runners that support it (e.g., `python run_etl_pipeline.py --params ./run_params/Treehouse.json --remote`). You can monitor run execution in the UI linked above (note that you need the prod VPN on for these steps to work). Run outputs will be available on `s3://test-analytics-data-bucket`.

### Developing with Docker and Docker Compose

In order to use Docker, Docker Compose, and the Dev Containers extension, you will need to run the `bin/setup_secrets.py` script, which generates a `docker.env` file in the repo root directory. This file is used by the docker compose files to set environment variables.

#### Leaf Analytics Engine Only

In the project root, the following command can be used to start the leaf engine in a docker container that uses the production database, analytics service, and PostgREST service:

```
docker compose -p leaf-engine-stack -f ./bin/docker-compose.engine.yml up
```

The `leaf-analytics-engine` repo is mounted into the container as a bind mount, so any changes made to the code are reflected immediately in the container and vice versa.

#### Leaf Analytics Engine and Local Database and Services

In order to use a local database, the [analytics-service](https://bitbucket.org/leaflogistics/analytics-service/src/production/) and [migrations](https://bitbucket.org/leaflogistics/migrations/src/production/) repos need to be cloned in the same directory as the `leaf-analytics-engine` repo.
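For example, assuming you have SSH access to Bitbucket (the clone URLs below use the standard Bitbucket SSH form; adjust them if you clone over HTTPS), you could run:

```
# from the parent directory of leaf-analytics-engine
git clone git@bitbucket.org:leaflogistics/analytics-service.git
git clone git@bitbucket.org:leaflogistics/migrations.git
```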
The directory structure should look like this:

```
➜ ls
analytics-service/ migrations/ leaf-analytics-engine/
```

Then the following command can be used to start the leaf engine, local database, analytics service, and PostgREST service as separate docker containers:

```
docker compose -p leaf-engine-stack -f ./bin/docker-compose.all.yml up
```

The database data is persisted in a docker volume called `bin_db_data`; deleting the volume deletes the local database data.

Database tools such as DBeaver and DataGrip can be used to connect to the local database. The connection details are:

* url: localhost
* port: 5432
* database: platform
* user: leuser
* password: 123456

#### Developing in VSCode with Dev Containers

A basic devcontainer file is included at `.devcontainers/devcontainer.json`, which can be used to develop in VSCode inside a docker container using the Dev Containers extension. Use the `Remote Containers: Open Folder in Container...` command to open the project in a container. By default it uses the `docker-compose.all.yml` file, which starts several services at once. If you want to use the `docker-compose.engine.yml` file instead, change the `dockerComposeFile` property in the devcontainer file.
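Finally, a couple of commands that may be useful when working with the local stack; this is a sketch assuming the connection details and volume name listed above, and that a Postgres client such as `psql` is installed on your host:

```
# connect to the local database from the host
psql -h localhost -p 5432 -U leuser -d platform

# stop the stack and (optionally) wipe the persisted database volume for a clean slate
docker compose -p leaf-engine-stack -f ./bin/docker-compose.all.yml down
docker volume rm bin_db_data
```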