Development Setup
AWS CLI
You will need the AWS CLI configured on your machine (with access to Leaf’s Dev AWS Account) to run this code. Here is how to get the AWS CLI installed.
Note that Leaf uses two AWS accounts, and they do not share resources. The AWS account linked above is the Analytics account. If your AWS CLI is configured with profiles for both accounts, you will need to set the AWS_PROFILE environment variable to the profile linked to the Analytics account before running the OA pipeline.
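For example (leaf-analytics is just an illustrative profile name; use whichever profile you configured for the Analytics account):
export AWS_PROFILE=leaf-analytics
aws sts get-caller-identity  # optional sanity check of which account the CLI is pointing at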
Conda Development Environment
The commands below assume you’ve cloned the repository locally and are executing them from the project root directory (i.e., the directory containing this README.md file).
Assuming you have GNU Make installed (it ships with macOS and many Linux distributions, but is not available in the development containers as of this writing), set up a development conda environment by running:
make create_env
This will create a conda environment and install all required dependencies for local development (including Jupyter Lab). In addition, it will install leaf_engine as an editable dependency and set all required environment variables in your conda environment. If you encounter any issues with this setup, please open an issue.
After the environment is created, run:
conda activate pipeline_env
If you need more control over environment creation, check the Makefile to see what commands are run in the create_env recipe.
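If you want to sanity-check the setup, the following should run without errors from the activated environment (this assumes the editable package is importable as leaf_engine, as described above):
conda env list
python -c "import leaf_engine; print(leaf_engine.__file__)"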
Installing Git LFS
Git LFS is a Git extension that replaces large files with text pointers inside Git, while storing the file contents on a remote server. We use it to work with the test dataset. After installing the Git LFS extension, initialize it by running:
git lfs install
Your adapt/.gitattributes file should already point to the test directory:
tests/test_data/* filter=lfs diff=lfs merge=lfs -text
If it does not, run:
git lfs track "tests/test_data/*"
Lastly, pull the contents of the tracked files:
git lfs pull
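To confirm the test data was actually downloaded (rather than left as LFS pointer files), you can list the tracked files and eyeball their sizes:
git lfs ls-files
ls -lh tests/test_data/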
Installing pre-commit hooks
The most important step in this setup is running:
pre-commit install
This will set up pre-commit hooks that lint and format Python code before each commit. You can run these pre-commit hooks manually at any time with:
pre-commit run --all-files
These hooks will enforce basic quality rules for the repo and make working with this code much easier. You only need to run the command above once, after creating (or re-creating) the environment.
Running the pipeline
Now you should be able to run the pipeline locally with:
python run_etl_pipeline.py --params "./run_params/Johnson & Johnson.json"
You should see logs on standard output and intermediate files being cached in a newly created ./cache folder.
Where pipeline outputs are stored is controlled via the ANALYTICS_ENVIRONMENT environment variable. By default, this is set to local, which means the outputs are stored in your local ./cache folder. If you want to store the outputs on S3, set:
conda env config vars set ANALYTICS_ENVIRONMENT=dev
and then re-activate the environment (conda activate pipeline_env) for the change to take effect.
The test-analytics-data-bucket S3 bucket in Leaf’s Analytics account is currently being used to store outputs. You can change the bucket by setting:
conda env config vars set ANALYTICS_DATA_BUCKET=<your-bucket-name>
Note that your IAM user (i.e., the user your AWS CLI is configured with) must have read-write access to the S3 bucket.
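A quick way to check that your credentials can reach the bucket (using the default bucket name above; adjust if you changed it) is:
aws s3 ls s3://test-analytics-data-bucket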
Running with Lane Level Data
In the absence of shipment-level data, the pipeline is slightly modified to produce a lanes.csv file instead of shipments.csv. You can trigger this behavior by passing any lane-level run type via the --run-type parameter. For example:
python run_etl_pipeline.py --params "./run_params/Ace Hardware.json" --run-type "LANE"
Only datasets that have a matching type in the parameter file will be used in that run. For example, using --run-type RFP will only use datasets with the type value of RFP and ignore all others.
Internal Runs
Type 1 - Active
To run the type 1 active run:
1. Save a CSV of the data from the analytics.v_leaf_active view to the “External_Data_Shippers/Leaf Internal” folder here: https://drive.google.com/drive/folders/1MowEJFl4BxBIuDmmetYfOm92iqkA7R6D
2. Create a new dataset in the Leaf T1 Active.json param file with a URL to the CSV file.
3. Update filter_params.max_date and oa_params.batch_date to today’s date.
4. Update filter_params.min_date to 6 months before today’s date (see the date snippet after the commands below).
5. Run the flow:
python run_etl_pipeline.py --params "./run_params/Leaf T1 Active.json" --run-type "TYPE 1 ACTIVE"
python run_adapt_pipeline.py --params "./run_params/Leaf T1 Active.json" --run-type "TYPE 1 ACTIVE"
python run_insert_pipeline.py --params "./run_params/Leaf T1 Active.json" --run-type "TYPE 1 ACTIVE"
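If you want a quick way to compute the dates for steps 3 and 4, something like the following works (GNU date on Linux, BSD date on macOS):
date +%Y-%m-%d                    # today, for max_date / batch_date
date -d "6 months ago" +%Y-%m-%d  # 6 months ago (GNU date, Linux)
date -v-6m +%Y-%m-%d              # 6 months ago (BSD date, macOS)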
Type 2 - Lighthouse
To run the type 2 lighthouse flow, run:
python run_etl_pipeline.py --params "./run_params/Leaf T2 Lighthouse.json" --run-type "TYPE 2 LIGHTHOUSE"
python run_adapt_pipeline.py --params "./run_params/Leaf T2 Lighthouse.json" --run-type "TYPE 2 LIGHTHOUSE"
python run_insert_pipeline.py --params "./run_params/Leaf T2 Lighthouse.json" --run-type "TYPE 2 LIGHTHOUSE"
The run_params for the T2 run do not need a label or url entry, since the data is pulled from the analytics DB at runtime. Placeholder values are used to avoid confusion, for example:
{
"label": "Placeholder - not needed for T2",
"lane_end_date": "2023-02-28",
"lane_start_date": "2021-01-01",
"type": "lane",
"url": "Placeholder - not needed for T2"
}
Additionally, lane_end_date and lane_start_date are used by the Adapt pipeline but not by ETL; their values are pulled from the ETL result, lanes.csv.
Debugging: missing dependencies
If you are getting errors because of missing dependencies, it might be because your local environment has become out of date with respect to the environment.dev.yml file. The easiest way to update your local environment without running into issues is by removing it:
make remove_env
And re-creating it:
make create_env
Updating the environment
If you need to update environment.yml, install the required dependencies using:
conda install -c conda-forge <dependency-name>
Then manually add the dependency to the environment.yml file. You can find the package version using:
conda env export --no-builds | grep <dependency-name>
Similarly, if you need to add pip dependencies, list them explicitly in requirements.txt.
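For pip packages, you can look up the installed version the same way before pinning it in requirements.txt, for example:
pip freeze | grep <dependency-name>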
Getting set up with Prefect Server
This step is not necessary for local development.
Prefect Server is our data pipeline orchestration system that makes it easy to monitor, schedule, and execute analytics runs on remote infrastructure. Read more about it here.
This guide assumes you have the prod VPN on. If you don’t have access to that VPN, then please ask Andy/Gerrit/Engineering for access.
Accessing the UI
With VPN on, you should be able to open the following link in your browser: http://a70d100ec15dd414986850f9aedb585a-d231ecbc10f6aaf7.elb.us-east-1.amazonaws.com:8080/ (you can bookmark this link for quick access).
On the landing page, scroll down to setup step 2. In the input box labeled “Prefect Server GraphQL endpoint”, paste the following URL: http://a03c99403491545eaa8d9a3c266675b9-0e872f8866538cd1.elb.us-east-1.amazonaws.com:4200/ and click “Connect” (a widget in the top right of the page should turn green). You can ignore the remaining setup steps.
From the navigation bar at the top left, expand the side menu, click “Switch team” and select “LeafAnalytics” (if not already selected).
Triggering remote runs
For your local machine to speak with the remote server, you need to add the following config file in ~/.prefect/config.toml (if you don’t have a .prefect directory in your $HOME, please create one):
backend = "server"
[server]
host = "http://a03c99403491545eaa8d9a3c266675b9-0e872f8866538cd1.elb.us-east-1.amazonaws.com"
port = "4200"
endpoint = "${server.host}:${server.port}"
[server.ui]
host = "http://a70d100ec15dd414986850f9aedb585a-d231ecbc10f6aaf7.elb.us-east-1.amazonaws.com"
port = "8080"
host_port = "8080"
endpoint = "${server.ui.host}:${server.ui.port}"
apollo_url = "${server.host}:${server.port}/graphql"
[server.telemetry]
enabled = false
[cloud]
api = "${server.host}:${server.port}"
This file needs to be on the same machine or container where you execute the ETL/Adapt runs.
After this, you’ll be able to trigger runs on the server by adding the --remote flag to runners that support it (e.g., python run_etl_pipeline.py --params ./run_params/Treehouse.json --remote). You can monitor run execution in the UI linked above (note that you need prod VPN on for these steps to work).
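If remote runs fail to submit, a quick reachability check of the GraphQL endpoint (with the prod VPN connected) can help narrow things down; getting any HTTP status code back means the endpoint is reachable:
curl -s -o /dev/null -w "%{http_code}\n" http://a03c99403491545eaa8d9a3c266675b9-0e872f8866538cd1.elb.us-east-1.amazonaws.com:4200/graphql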
Run outputs will be available on s3://test-analytics-data-bucket.
Developing with Docker and Docker Compose
In order to use Docker, Docker Compose, and the Dev Containers extension, you will need to run the bin/setup_secrets.py script, which generates a docker.env file in the repo root directory. This file is used by the docker compose files to set environment variables.
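For example, from the repo root (assuming the script takes no required arguments; check the script itself if it expects any):
python bin/setup_secrets.py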
Leaf Analytics Engine Only
In the project root, the following command can be used to start the leaf engine in a docker container which will use the production database, analytics service, and PostgREST service:
docker compose -p leaf-engine-stack -f ./bin/docker-compose.engine.yml up
The leaf-analytics-engine repo will be mounted as a bind mount in the container. This means that any changes made to the code will be reflected immediately in the container and vice versa.
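When you are done, the stack can be stopped and its containers removed with the matching down command:
docker compose -p leaf-engine-stack -f ./bin/docker-compose.engine.yml down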
Leaf Analytics Engine and Local Database and Services
In order to use a local database, the analytics-service and migrations repos need to be cloned into the same parent directory as the leaf-analytics-engine repo. The directory structure should look like this:
➜ ls
analytics-service/
migrations/
leaf-analytics-engine/
Then the following command can be used to start the leaf engine, local database, analytics service, and PostgREST service as separate docker containers:
docker compose -p leaf-engine-stack -f ./bin/docker-compose.all.yml up
The database data will be persisted in a docker volume called bin_db_data. Deleting the volume will delete the local database data.
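To reset the local database from scratch, stop the stack and remove that volume (this permanently deletes the local data):
docker compose -p leaf-engine-stack -f ./bin/docker-compose.all.yml down
docker volume rm bin_db_data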
Database tools such as DBeaver and Datagrip can be used to connect to the local database. The connection details are:
host: localhost
port: 5432
database: platform
user: leuser
password: 123456
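If you prefer the command line and have psql installed, the same details translate to:
psql -h localhost -p 5432 -U leuser -d platform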
Developing in VSCode with Dev Containers
A basic devcontainer file is included at .devcontainers/devcontainer.json which can be used to develop in VSCode with a docker container using the Dev Containers plugin. Use the Remote Containers: Open Folder in Container... command to open the project in a container. By default it uses the docker-compose.all.yml file which will start several services at once. If you want to use the docker-compose.engine.yml file instead, you can change the dockerComposeFile property in the devcontainer file.