Steps required for the scripts to run correctly:

Set the PYSPARK_PYTHON environment variable to the python3 interpreter:

export PYSPARK_PYTHON=$(which python3)

Install required modules:

pip3 install -r requirements.txt

Install Java and Scala:

apt-get install default-jdk scala

Install Spark (download the 2.3.0 tgz built for Hadoop and extract it to /usr/local/spark).
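The setup steps above can be checked before submitting a job. The following is a minimal sketch and not part of the repository: the function name `missing_prerequisites` is hypothetical, the tool names it looks for (`java`, `scala`, `spark-submit`) are assumptions, and it presumes Spark's `bin` directory has been added to `PATH`.

```python
import os
import shutil


def missing_prerequisites(env=None, which=shutil.which):
    """Return a list of problems; an empty list means the setup looks complete.

    Hypothetical helper: the checks only mirror the setup steps listed above.
    """
    env = os.environ if env is None else env
    problems = []

    # PYSPARK_PYTHON should point at the python3 interpreter used by PySpark.
    if not env.get("PYSPARK_PYTHON"):
        problems.append("PYSPARK_PYTHON is not set")

    # Assumed tool names; spark-submit is only found if Spark's bin
    # directory is on PATH.
    for tool in ("java", "scala", "spark-submit"):
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")

    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print(problem)
```

The `env` and `which` parameters exist only so the check can be exercised without touching the real environment; calling the function with no arguments inspects the current shell's settings.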

To run the scripts:

spark-submit --master spark://195.201.112.36:7077 --executor-memory=29g $(pwd)/<script>

For example, to run train_models.py with the arguments df and models:

spark-submit --master spark://195.201.112.36:7077 --executor-memory=29g $(pwd)/train_models.py df models
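The command pattern above can also be assembled programmatically, which avoids retyping the master URL and memory setting for each script. This is a sketch under stated assumptions: `build_spark_submit_command` and its parameters are hypothetical names, and the defaults simply copy the values from the commands above.

```python
import os
import shlex


def build_spark_submit_command(
    script,
    script_args=(),
    master="spark://195.201.112.36:7077",  # value copied from the commands above
    executor_memory="29g",                 # value copied from the commands above
    base_dir=None,
):
    """Return the spark-submit invocation as an argument list (hypothetical helper)."""
    # Mirror $(pwd)/<script>: resolve the script against the current directory
    # unless a base directory is given explicitly.
    base_dir = os.getcwd() if base_dir is None else base_dir
    command = [
        "spark-submit",
        "--master", master,
        f"--executor-memory={executor_memory}",
        os.path.join(base_dir, script),
    ]
    command.extend(script_args)
    return command


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run(...) to actually submit the job.
    print(shlex.join(build_spark_submit_command("train_models.py", ["df", "models"])))
```

Returning a list rather than a single string keeps arguments separate, so paths containing spaces do not need manual quoting when the list is handed to `subprocess.run`.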
