1 parent c77bcae, commit ea91371. Showing 5 changed files with 130 additions and 111 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,80 +1,30 @@ | ||
# MusicBrainz - External URLs - Internet Archive Service | ||
(Sorry for such a long messy name, will update later ig) | ||
- [Proposal Doc Link](https://docs.google.com/document/d/1Bk66_HFWEA6gBbFfQzIriGGgxxbEIwN1CbVDcz7FTys/edit?usp=sharing) | ||
# <div style="text-align: center;">MusicBrainz - External URLs - Internet Archive Service</div> | ||
|
||
### Current Implementation (WIP) | ||
**<div style="text-align: center;">MusicBrainz - Internet Archive integration for preserving external URLs.</div>** | ||
|
||
We want to get URLs from `edit_data` and `edit_note` tables, and archive them in Internet Archive history. | ||
The app provides multiple command line functionalities to archive URLs from `edit_data` and `edit_note` tables: | ||
![CLI functionality](assets/cli.png) | ||
- [Proposal Doc Link](https://docs.google.com/document/d/1Bk66_HFWEA6gBbFfQzIriGGgxxbEIwN1CbVDcz7FTys/edit?usp=sharing) | ||
|
||
We create an `external_url_archiver` schema, under which we create the table, functions, and trigger required to make the service work. | ||
|
||
Following are the long-running tasks: | ||
## About | ||
|
||
1. `poller task` | ||
- Creates a `Poller` implementation which: | ||
- Gets the latest `edit_note` id and `edit_data` edit from the `internet_archive_urls` table. We start polling `edit_note` and `edit_data` from these ids. | ||
- Polls the `edit_note` and `edit_data` tables for URLs | ||
- Transforms them to the required format | ||
- Saves the output to the `internet_archive_urls` table | ||
2. `archival task` | ||
- Has 2 parts: | ||
1. `notifier` | ||
- Creates a `Notifier` implementation which: | ||
- Fetches the last unarchived URL row from the `internet_archive_urls` table, and starts notifying from this row id. | ||
- Initialises a postgres function `notify_archive_urls`, which takes the `url_id` integer value, and sends the corresponding `internet_archive_urls` row through the channel called `archive_urls`. | ||
- This runs periodically in order to archive URLs from `internet_archive_urls`. | ||
2. `listener` | ||
- Listens to the `archive_urls` channel, and makes the necessary Wayback Machine API request (the API calls are still to be made). | ||
- The listener task is currently delayed by 5 seconds, so that no matter how many URLs are passed to the channel, it receives only 1 URL per 5 seconds, in order to stay within IA rate limits. | ||
3. `retry/cleanup task` | ||
- Runs every 24 hours, and does the following: | ||
1. If the `status` of the URL archival is `success`, and the URL has been in the table for more than 24 hours, removes it from the table. | ||
2. If the URL's status is still null, which means pending, it resends the URL to the `archive_urls` channel via the `notify_archive_urls` function, so that archiving can be retried. | ||
The project is a Rust-based service that uses the Internet Archive's [Wayback Machine](https://web.archive.org/) APIs to preserve URLs present in the MusicBrainz database in the Internet Archive's history. | ||
The MusicBrainz database receives many edits every day. Each edit has an associated edit note that provides additional information about the edit. Often, these edit notes, as well as some edits, contain external links, which we want to archive in the Internet Archive. | ||
|
||
### See the app architecture [here](./docs/architecture.md) | ||
## Installation | ||
|
||
### Local setup | ||
> - Make sure musicbrainz db and the required database tables are present. | ||
> - Follow https://github.com/metabrainz/musicbrainz-docker to install the required containers and db dumps. | ||
> - Rename the `.env.example` to `.env`. | ||
> - After ensuring musicbrainz_db is running on port 5432, run the script `init_db.sh` in the `scripts` directory. | ||
> - Create a Sentry Rust project, then enter its [DSN](https://docs.sentry.io/platforms/rust/#configure) (Data Source Name) as the value of the `url` key in `config/development.toml`. | ||
> - Get the Internet Archive API accesskey and secret from [here](https://archive.org/account/s3.php) (requires sign in). Paste them into the `myaccesskey` and `mysecret` variables under `[wayback_machine_api]` in `config/development.toml`. | ||
Prerequisites: | ||
- Rust | ||
- Postgres | ||
- Docker | ||
- [musicbrainz-docker](https://github.com/metabrainz/musicbrainz-docker) local setup for musicbrainz database | ||
|
||
Follow the instructions in [INSTALL.md](docs/INSTALL.md) | ||
|
||
There are 2 methods to run the program: | ||
1. Build the project and run. | ||
- Make sure Rust is installed. | ||
- ```shell | ||
cargo build && | ||
./target/debug/mb-ia | ||
``` | ||
2. Use the Dockerfile | ||
- Note that the container has to run on the same network bridge as the MusicBrainz db. | ||
1. ```shell | ||
cargo sqlx prepare | ||
``` | ||
## App architecture | ||
|
||
2. ```shell | ||
docker-compose -f docker/docker-compose.dev.yml up --build | ||
``` | ||
To understand how the project is structured, see [ARCHITECTURE.md](docs/ARCHITECTURE.md). | ||
|
||
#### Setting up Prometheus, Grafana | ||
## Maintenance | ||
|
||
1. In your browser, go to `localhost:3000` to access Grafana. Log in using `admin` as both username and password. | ||
Refer to the [Maintenance guide](docs/MAINTENANCE.md) for guidelines and instructions on maintaining the project. | ||
|
||
![img.png](assets/grafana_login_page.png) | ||
|
||
2. Go to Dashboard. Select `mb-ia-dashboard`. | ||
|
||
![img.png](assets/mb-ia-dashboard.png) | ||
|
||
3. If the `Rust app metrics panel` shows no data, click the refresh icon in the top right corner. | ||
|
||
![img.png](assets/mb-ia-dashboard-rust-panel.png) | ||
|
||
4. To edit a panel, right-click it and select the edit option. After editing, save the generated JSON in `grafana/dashboards/metrics-dashboard.json`. | ||
|
||
![img.png](assets/working_grafana_dashboard.png) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,65 @@ | ||
### Architecture | ||
|
||
This is a high level overview of the folder structure of the project. | ||
|
||
``` | ||
. | ||
├── Cargo.toml | ||
├── config/ (contains .toml files that provide configs and various numeric values required to run the project) | ||
├── docker/ (contains Dockerfiles and docker compose configs) | ||
├── grafana (contains dashboard configs and prometheus datasources config) | ||
│ ├── dashboards/ | ||
│ └── datasources/ | ||
├── prometheus.yaml (define prometheus metric collection related configs here) | ||
├── scripts/ (various scripts that help in populating tables, schema and test data) | ||
│ └── sql/ | ||
├── src | ||
│ ├── app/ (main application where we start poller and archival tasks) | ||
│ ├── archival/ (deals with network requests to archive URLs, check status of archival, and cleanup of completed values) | ||
│ │ └── tests/ (contains unit tests for archival service) | ||
│ ├── cli/ (cli options are set here, along with the utils) | ||
│ ├── configuration/ (parsing logic for .toml configs belongs here) | ||
│ ├── lib.rs (treats the app as a library) | ||
│ ├── main.rs (entry point to the app) | ||
│ ├── metrics/ (module contains metrics, and metrics collection methods for the app) | ||
│ ├── poller/ (polling logic resides here) | ||
│ │ └── tests/ (unit tests for poller module) | ||
│ └── structs/ | ||
└── tests (contains integration tests) | ||
├── archival/ | ||
├── fixtures/ | ||
├── main.rs | ||
└── poller/ | ||
``` | ||
|
||
|
||
## Current Implementation (WIP) | ||
|
||
We want to get URLs from `edit_data` and `edit_note` tables, and archive them in Internet Archive history. | ||
The app provides multiple command line functionalities to archive URLs from `edit_data` and `edit_note` tables: | ||
![CLI functionality](../assets/cli.png) | ||
|
||
We create an `external_url_archiver` schema, under which we create the table, functions, and trigger required to make the service work. | ||
|
||
Following are the long-running tasks: | ||
|
||
1. `poller task` | ||
- Creates a `Poller` implementation which: | ||
- Gets the latest `edit_note` id and `edit_data` edit from the `internet_archive_urls` table. We start polling `edit_note` and `edit_data` from these ids. | ||
- Polls the `edit_note` and `edit_data` tables for URLs | ||
- Transforms them to the required format | ||
- Saves the output to the `internet_archive_urls` table | ||
2. `archival task` | ||
- Has 2 parts: | ||
1. `notifier` | ||
- Creates a `Notifier` implementation which: | ||
- Fetches the last unarchived URL row from the `internet_archive_urls` table, and starts notifying from this row id. | ||
- Initialises a postgres function `notify_archive_urls`, which takes the `url_id` integer value, and sends the corresponding `internet_archive_urls` row through the channel called `archive_urls`. | ||
- This runs periodically in order to archive URLs from `internet_archive_urls`. | ||
2. `listener` | ||
- Listens to the `archive_urls` channel, and makes the necessary Wayback Machine API request (the API calls are still to be made). | ||
- The listener task is currently delayed by 5 seconds, so that no matter how many URLs are passed to the channel, it receives only 1 URL per 5 seconds, in order to stay within IA rate limits. | ||
3. `retry/cleanup task` | ||
- Runs every 24 hours, and does the following: | ||
1. If the `status` of the URL archival is `success`, and the URL has been in the table for more than 24 hours, removes it from the table. | ||
2. If the URL's status is still null, which means pending, it resends the URL to the `archive_urls` channel via the `notify_archive_urls` function, so that archiving can be retried. |
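
The three tasks above can be sketched as small, dependency-free Rust functions. This is only an illustrative sketch under stated assumptions, not the project's actual code: `extract_urls`, `Pacer`, `CleanupAction`, `cleanup_action` and `RETENTION` are hypothetical names, the URL extraction is a naive whitespace split, and database access, the `archive_urls` channel and the Wayback Machine requests are left out entirely. Only the 5 second interval, the 24 hour threshold, the `success` status and the "null means pending" rule come from the description above.

```rust
use std::time::{Duration, Instant};

/// Poller step (hypothetical helper): pull http(s) URLs out of free text
/// such as an edit note. Real edit data may need proper URL parsing.
fn extract_urls(text: &str) -> Vec<String> {
    text.split_whitespace()
        .filter(|token| token.starts_with("http://") || token.starts_with("https://"))
        // Drop punctuation that commonly trails a URL in prose.
        .map(|token| {
            token
                .trim_end_matches(|c: char| matches!(c, '.' | ',' | ';' | ')'))
                .to_string()
        })
        .collect()
}

/// Listener pacing (hypothetical type): hands out one send slot per `interval`,
/// no matter how quickly URLs arrive.
struct Pacer {
    interval: Duration,
    next_slot: Option<Instant>,
}

impl Pacer {
    fn new(interval: Duration) -> Self {
        Pacer { interval, next_slot: None }
    }

    /// Reserves the next free slot and returns how long the caller
    /// should wait from `now` before making its request.
    fn reserve(&mut self, now: Instant) -> Duration {
        let slot = match self.next_slot {
            Some(slot) if slot > now => slot,
            _ => now,
        };
        self.next_slot = Some(slot + self.interval);
        slot - now
    }
}

/// What the retry/cleanup task does with one row (hypothetical enum).
#[derive(Debug, PartialEq)]
enum CleanupAction {
    Delete,
    Retry,
    Keep,
}

/// Rows archived successfully are removed once they are older than this.
const RETENTION: Duration = Duration::from_secs(24 * 60 * 60);

/// Retry/cleanup decision: `status` is `None` while the URL is still pending.
fn cleanup_action(status: Option<&str>, age: Duration) -> CleanupAction {
    match status {
        Some("success") if age > RETENTION => CleanupAction::Delete,
        None => CleanupAction::Retry,
        _ => CleanupAction::Keep,
    }
}
```

With `Pacer::new(Duration::from_secs(5))`, three URLs arriving at the same instant would be told to wait 0, 5 and 10 seconds respectively, which is one way to read the "1 URL per 5 seconds" behaviour of the listener. Keeping the cleanup rule as a pure function of status and age makes the 24 hour policy easy to unit test without a database.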
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
# Installing the app | ||
|
||
> - Make sure musicbrainz db and the required database tables are present. | ||
> - Follow https://github.com/metabrainz/musicbrainz-docker to install the required containers and db dumps. | ||
> - Rename the `.env.example` to `.env`. | ||
> - After ensuring musicbrainz_db is running on port 5432, run the script `init_db.sh` in the `scripts` directory. | ||
> - Create a Sentry Rust project, then enter its [DSN](https://docs.sentry.io/platforms/rust/#configure) (Data Source Name) as the value of the `url` key in `config/development.toml`. | ||
> - Get the Internet Archive API accesskey and secret from [here](https://archive.org/account/s3.php) (requires sign in). Paste them into the `myaccesskey` and `mysecret` variables under `[wayback_machine_api]` in `config/development.toml`. | ||
|
||
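As a rough illustration of the last two steps above, the relevant part of `config/development.toml` might look like the fragment below. The `[sentry]` table name and every value are placeholders assumed for illustration; only the `url` key, the `[wayback_machine_api]` table and its `myaccesskey` and `mysecret` variables are named in the steps above, so check the file shipped with the repository for the real layout.

```toml
# config/development.toml (fragment, placeholder values only)

# Assumption: the Sentry DSN sits under a [sentry] table.
[sentry]
url = "https://<public-key>@<sentry-host>/<project-id>"

# Credentials from https://archive.org/account/s3.php
[wayback_machine_api]
myaccesskey = "<your-access-key>"
mysecret = "<your-secret>"
```
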
There are 2 methods to run the program: | ||
1. Build the project and run. | ||
- Make sure Rust is installed. | ||
- ```shell | ||
cargo build && | ||
./target/debug/mb-ia | ||
``` | ||
2. Use the Dockerfile | ||
- Note that the container has to run on the same network bridge as the MusicBrainz db. | ||
1. ```shell | ||
cargo sqlx prepare | ||
``` | ||
|
||
2. ```shell | ||
docker-compose -f docker/docker-compose.dev.yml up --build | ||
``` | ||
|
||
## Setting up Prometheus, Grafana | ||
|
||
1. In your browser, go to `localhost:3000` to access Grafana. Log in using `admin` as both username and password. | ||
|
||
![img.png](../assets/grafana_login_page.png) | ||
|
||
2. Go to Dashboard. Select `mb-ia-dashboard`. | ||
|
||
![img.png](../assets/mb-ia-dashboard.png) | ||
|
||
3. If the `Rust app metrics panel` shows no data, click the refresh icon in the top right corner. | ||
|
||
![img.png](../assets/mb-ia-dashboard-rust-panel.png) | ||
|
||
4. To edit a panel, right-click it and select the edit option. After editing, save the generated JSON in `grafana/dashboards/metrics-dashboard.json`. | ||
|
||
![img.png](../assets/working_grafana_dashboard.png) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,7 @@ | ||
# Maintaining the project | ||
|
||
This doc provides instructions, guidelines and references for maintaining the project without running into trouble. | ||
|
||
## Schema Guidelines | ||
|
||
- The project depends on `musicbrainz_db`, so make sure all the `CREATE TABLE musicbrainz.*` statements in the `scripts/sql` scripts stay in sync with the MusicBrainz database schema. |
This file was deleted.