docs: updated docs structure
yellowHatpro committed Aug 28, 2024
1 parent c77bcae commit ea91371
Showing 5 changed files with 130 additions and 111 deletions.
84 changes: 17 additions & 67 deletions README.md
@@ -1,80 +1,30 @@
# MusicBrainz - External URLs - Internet Archive Service
(Sorry for such a long, messy name; it will be updated later.)
- [Proposal Doc Link](https://docs.google.com/document/d/1Bk66_HFWEA6gBbFfQzIriGGgxxbEIwN1CbVDcz7FTys/edit?usp=sharing)
# <div style="text-align: center;">MusicBrainz - External URLs - Internet Archive Service</div>

### Current Implementation (WIP)
**<div style="text-align: center;">MusicBrainz - Internet Archive integration for preserving external URLs.</div>**

We want to extract URLs from the `edit_data` and `edit_note` tables and archive them in the Internet Archive.
The app provides multiple command-line functionalities to archive URLs from the `edit_data` and `edit_note` tables:
![CLI functionality](assets/cli.png)
- [Proposal Doc Link](https://docs.google.com/document/d/1Bk66_HFWEA6gBbFfQzIriGGgxxbEIwN1CbVDcz7FTys/edit?usp=sharing)

We create an `external_url_archiver` schema, under which we create the required table, functions, and triggers to make the service work.

Following are the long-running tasks:
## About

1. `poller task`
   - Create a `Poller` implementation which:
     - Gets the latest `edit_note` id and `edit_data` edit from the `internet_archive_urls` table; polling of `edit_note` and `edit_data` starts from these ids.
     - Polls the `edit_note` and `edit_data` tables for URLs.
     - Transforms them into the required format.
     - Saves the output to the `internet_archive_urls` table.
   - See the poller sketch after this list.
2. `archival task`
   - Has 2 parts:
     1. `notifier`
        - Creates a `Notifier` implementation which:
          - Fetches the last unarchived URL row from the `internet_archive_urls` table and starts notifying from this row id.
          - Initialises a Postgres function `notify_archive_urls`, which takes a `url_id` integer and sends the corresponding `internet_archive_urls` row through a channel called `archive_urls`.
          - The notifier runs periodically in order to archive URLs from `internet_archive_urls`.
     2. `listener`
        - Listens to the `archive_urls` channel and makes the necessary Wayback Machine API request (the API calls are still to be implemented).
        - The listener task is currently delayed by 5 seconds, so that no matter how many URLs are pushed to the channel, it receives only one URL every 5 seconds, in order to stay within Internet Archive rate limits.
3. `retry/cleanup task`
   - Runs every 24 hours and does the following:
     1. If the `status` of the URL archival is `success` and the URL has been present in the table for more than 24 hours, removes it from the table.
     2. If the URL's status is still null (i.e. pending), it resends the URL to the `archive_urls` channel via the `notify_archive_urls` function so that it can be re-archived.
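
A minimal sketch of the poller step is shown below. It assumes `sqlx` on a `tokio` runtime, an `edit_note` table with `id` and `text` columns, and illustrative column names for `internet_archive_urls` that may not match the project's actual schema.

```rust
// A hedged sketch of one poller tick (schema and column names are assumptions;
// the real `internet_archive_urls` layout in this project may differ).
use regex::Regex;
use sqlx::{PgPool, Row};

pub async fn poll_edit_notes(pool: &PgPool, start_id: i32) -> Result<i32, sqlx::Error> {
    // Naive URL matcher; the real transformation step may be more involved.
    let url_re = Regex::new(r"https?://\S+").expect("valid regex");

    let rows = sqlx::query(
        "SELECT id, text FROM musicbrainz.edit_note WHERE id > $1 ORDER BY id LIMIT 100",
    )
    .bind(start_id)
    .fetch_all(pool)
    .await?;

    let mut last_id = start_id;
    for row in &rows {
        let id: i32 = row.get("id");
        let text: String = row.get("text");
        for url in url_re.find_iter(&text) {
            // Queue the URL for archival; the column names here are illustrative.
            sqlx::query(
                "INSERT INTO external_url_archiver.internet_archive_urls (url, from_table) \
                 VALUES ($1, 'edit_note')",
            )
            .bind(url.as_str())
            .execute(pool)
            .await?;
        }
        last_id = id;
    }
    Ok(last_id) // the next tick resumes polling from this id
}
```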
The project is a Rust-based service that uses the Internet Archive's [Wayback Machine](https://web.archive.org/) APIs to preserve URLs found in the MusicBrainz database.
The MusicBrainz database sees a lot of edits every day. Each edit has an associated edit note that provides additional information about it. These edit notes, as well as some edits, often contain external links, which we want to archive in the Internet Archive.
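
To give a feel for the kind of Wayback Machine call this involves, here is a hypothetical sketch of a single archival request using `reqwest`. The Save Page Now endpoint, the `LOW <accesskey>:<secret>` authorization header, and the response handling are assumptions to verify against the Wayback Machine API documentation; the access key and secret are the ones configured in `config/development.toml`.

```rust
// Hypothetical sketch of a single Save Page Now request; the endpoint, header
// names, and response handling are assumptions, not the project's actual code.
use reqwest::Client;

pub async fn archive_url(
    client: &Client,
    accesskey: &str,
    secret: &str,
    target: &str,
) -> Result<String, reqwest::Error> {
    let response = client
        .post("https://web.archive.org/save")
        .header("Accept", "application/json")
        .header("Authorization", format!("LOW {accesskey}:{secret}"))
        .form(&[("url", target)])
        .send()
        .await?;

    // The response body contains a job id that can later be used to check the
    // archival status of `target`.
    response.text().await
}
```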

### See the app architecture [here](./docs/architecture.md)
## Installation

### Local setup
> - Make sure the MusicBrainz database and the required tables are present.
> - Follow https://github.com/metabrainz/musicbrainz-docker to install the required containers and db dumps.
> - Rename `.env.example` to `.env`.
> - After ensuring `musicbrainz_db` is running on port 5432, run the `init_db.sh` script in the `scripts` dir.
> - In the `config/development.toml` file, create a Sentry Rust project and enter your Sentry project [DSN](https://docs.sentry.io/platforms/rust/#configure) (Data Source Name) as the value of the `url` key.
> - Get the Internet Archive API access key and secret from [here](https://archive.org/account/s3.php) (requires signing in), and paste them into the `myaccesskey` and `mysecret` variables under `[wayback_machine_api]` in `config/development.toml`.
Prerequisites:
- Rust
- Postgres
- Docker
- [musicbrainz-docker](https://github.com/metabrainz/musicbrainz-docker) local setup for the MusicBrainz database

Follow the instructions in [INSTALL.md](docs/INSTALL.md)

There are two ways to run the program:
1. Build the project and run it.
   - Make sure Rust is installed.
- ```shell
cargo build &&
./target/debug/mb-ia
```
2. Use the Dockerfile
- Note that the container has to run on the same Docker network bridge as the MusicBrainz database.
1. ```shell
cargo sqlx prepare
```
## App architecture

2. ```shell
docker-compose -f docker/docker-compose.dev.yml up --build
```
To understand how the project is structured, check [here](docs/ARCHITECTURE.md)

#### Setting up Prometheus and Grafana
## Maintenance

1. In your browser, go to `localhost:3000` to access Grafana. Log in using `admin` as both the username and password.
Refer to the [Maintenance guide](docs/MAINTENANCE.md) for guidelines and instructions on maintaining the project.

![img.png](assets/grafana_login_page.png)

2. Go to Dashboard. Select `mb-ia-dashboard`.

![img.png](assets/mb-ia-dashboard.png)

3. If the `Rust app metrics panel` shows no data, just click the refresh icon in the top-right corner.

![img.png](assets/mb-ia-dashboard-rust-panel.png)

4. To edit, right-click on the panel and select the edit option. You can edit the panel and save the generated JSON in `grafana/dashboards/metrics-dashboard.json`.

![img.png](assets/working_grafana_dashboard.png)
65 changes: 65 additions & 0 deletions docs/ARCHITECTURE.md
@@ -0,0 +1,65 @@
### Architecture

This is a high-level overview of the project's folder structure.

```
.
├── Cargo.toml
├── config/ (contains .yaml files that provide configs and the various numeric values required to run the project)
├── docker/ (contains Dockerfiles and docker compose configs)
├── grafana/ (contains dashboard configs and Prometheus datasource configs)
│   ├── dashboards/
│   └── datasources/
├── prometheus.yaml (Prometheus metric-collection configs are defined here)
├── scripts/ (various scripts that help populate tables, the schema, and test data)
│   └── sql/
├── src
│   ├── app/ (main application, where the poller and archival tasks are started)
│   ├── archival/ (deals with network requests to archive URLs, checking archival status, and cleanup of completed entries)
│   │   └── tests/ (unit tests for the archival service)
│   ├── cli/ (CLI options are set here, along with their utils)
│   ├── configuration/ (parsing logic for the .yaml configs)
│   ├── lib.rs (exposes the app as a library)
│   ├── main.rs (entry point of the app)
│   ├── metrics/ (metrics and metrics-collection methods for the app)
│   ├── poller/ (polling logic resides here)
│   │   └── tests/ (unit tests for the poller module)
│   └── structs/
└── tests (integration tests)
    ├── archival/
    ├── fixtures/
    ├── main.rs
    └── poller/
```


## Current Implementation (WIP)

We want to extract URLs from the `edit_data` and `edit_note` tables and archive them in the Internet Archive.
The app provides multiple command-line functionalities to archive URLs from the `edit_data` and `edit_note` tables:
![CLI functionality](../assets/cli.png)

We create an `external_url_archiver` schema, under which we create the required table, functions, and triggers to make the service work.

Following are the long-running tasks:

1. `poller task`
   - Create a `Poller` implementation which:
     - Gets the latest `edit_note` id and `edit_data` edit from the `internet_archive_urls` table; polling of `edit_note` and `edit_data` starts from these ids.
     - Polls the `edit_note` and `edit_data` tables for URLs.
     - Transforms them into the required format.
     - Saves the output to the `internet_archive_urls` table.
2. `archival task`
   - Has 2 parts:
     1. `notifier`
        - Creates a `Notifier` implementation which:
          - Fetches the last unarchived URL row from the `internet_archive_urls` table and starts notifying from this row id.
          - Initialises a Postgres function `notify_archive_urls`, which takes a `url_id` integer and sends the corresponding `internet_archive_urls` row through a channel called `archive_urls`.
          - The notifier runs periodically in order to archive URLs from `internet_archive_urls`.
     2. `listener`
        - Listens to the `archive_urls` channel and makes the necessary Wayback Machine API request (the API calls are still to be implemented).
        - The listener task is currently delayed by 5 seconds, so that no matter how many URLs are pushed to the channel, it receives only one URL every 5 seconds, in order to stay within Internet Archive rate limits.
        - See the listener sketch after this list.
3. `retry/cleanup task`
   - Runs every 24 hours and does the following:
     1. If the `status` of the URL archival is `success` and the URL has been present in the table for more than 24 hours, removes it from the table.
     2. If the URL's status is still null (i.e. pending), it resends the URL to the `archive_urls` channel via the `notify_archive_urls` function so that it can be re-archived.
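
As a rough illustration of the listener half of the archival task, the sketch below assumes `sqlx`'s `PgListener` on a `tokio` runtime and treats the payload emitted by `notify_archive_urls` as an opaque string; it is not the project's actual implementation.

```rust
// A minimal listener sketch (assumes sqlx's PgListener and a tokio runtime;
// the payload sent by `notify_archive_urls` is treated as an opaque string).
use sqlx::postgres::PgListener;
use sqlx::PgPool;
use std::time::Duration;

pub async fn listen_for_archive_urls(pool: &PgPool) -> Result<(), sqlx::Error> {
    let mut listener = PgListener::connect_with(pool).await?;
    listener.listen("archive_urls").await?;

    loop {
        let notification = listener.recv().await?;
        // The payload is the `internet_archive_urls` row forwarded by the
        // `notify_archive_urls` Postgres function.
        let payload = notification.payload().to_owned();
        // The Wayback Machine request for this row would be made here.
        println!("received row to archive: {payload}");

        // Throttle to one URL per 5 seconds to respect IA rate limits.
        tokio::time::sleep(Duration::from_secs(5)).await;
    }
}
```

Using Postgres `LISTEN`/`NOTIFY` this way keeps the notifier and listener decoupled: the notifier only calls `notify_archive_urls`, while the listener controls its own pace to respect the rate limit.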
44 changes: 44 additions & 0 deletions docs/INSTALL.md
@@ -0,0 +1,44 @@
# Installing the app

> - Make sure the MusicBrainz database and the required tables are present.
> - Follow https://github.com/metabrainz/musicbrainz-docker to install the required containers and db dumps.
> - Rename `.env.example` to `.env`.
> - After ensuring `musicbrainz_db` is running on port 5432, run the `init_db.sh` script in the `scripts` dir.
> - In the `config/development.toml` file, create a Sentry Rust project and enter your Sentry project [DSN](https://docs.sentry.io/platforms/rust/#configure) (Data Source Name) as the value of the `url` key.
> - Get the Internet Archive API access key and secret from [here](https://archive.org/account/s3.php) (requires signing in), and paste them into the `myaccesskey` and `mysecret` variables under `[wayback_machine_api]` in `config/development.toml`.

There are two ways to run the program:
1. Build the project and run it.
   - Make sure Rust is installed.
- ```shell
cargo build &&
./target/debug/mb-ia
```
2. Use the Dockerfile
- Note that the container has to run on the same Docker network bridge as the MusicBrainz database.
1. ```shell
cargo sqlx prepare
```

2. ```shell
docker-compose -f docker/docker-compose.dev.yml up --build
```

## Setting up Prometheus and Grafana

1. In your browser, go to `localhost:3000` to access Grafana. Log in using `admin` as both the username and password.

![img.png](../assets/grafana_login_page.png)

2. Go to Dashboard. Select `mb-ia-dashboard`.

![img.png](../assets/mb-ia-dashboard.png)

3. If the `Rust app metrics panel` shows no data, just click the refresh icon in the top-right corner.

![img.png](../assets/mb-ia-dashboard-rust-panel.png)

4. To edit, right-click on the panel and select the edit option. You can edit the panel and save the generated JSON in `grafana/dashboards/metrics-dashboard.json`.

![img.png](../assets/working_grafana_dashboard.png)
4 changes: 4 additions & 0 deletions docs/maintainance.md → docs/MAINTENANCE.md
@@ -1,3 +1,7 @@
# Maintaining the project

This doc provides instructions, guidelines, and references for maintaining the project without running into trouble.

## Schema Guidelines

- The project depends on `musicbrainz_db`, so make sure all the `CREATE TABLE musicbrainz.*` statements in the `scripts/sql` scripts stay in sync with the MusicBrainz database schema.
44 changes: 0 additions & 44 deletions docs/architecture.md

This file was deleted.
