Install Java Module into Puppet
/etc/puppetlabs/code/environments/production$ sudo /opt/puppetlabs/bin/puppet module install puppetlabs/java
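To confirm the module landed in the production environment's module path, Puppet's module list can be checked (output varies by Puppet version). The hadoop and spark classes used in the manifest below presumably come from their own Forge modules and install the same way.

/etc/puppetlabs/code/environments/production$ sudo /opt/puppetlabs/bin/puppet module list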
Install the Spark module and create the following manifest at /etc/puppetlabs/code/environments/production/manifests/spark.pp. Note that this hard-codes server names; not ideal, but it's a starting point.
$master_hostname = 'spark-master.bpopp.net'

class { 'hadoop':
  realm         => '',
  hdfs_hostname => $master_hostname,
  slaves        => ['spark1.bpopp.net', 'spark2.bpopp.net'],
}

class { 'spark':
  master_hostname        => $master_hostname,
  hdfs_hostname          => $master_hostname,
  historyserver_hostname => $master_hostname,
  yarn_enable            => false,
}

node 'spark-master.bpopp.net' {
  include spark::master
  include spark::historyserver
  include hadoop::namenode
  include spark::hdfs
}

node /spark(1|2).bpopp.net/ {
  include spark::worker
  include hadoop::datanode
}

node 'client.bpopp.net' {
  include hadoop::frontend
  include spark::frontend
}
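With the manifest in place, each node picks it up on its normal agent run. To apply it immediately rather than waiting, a manual agent run can be triggered on each node (puppet agent -t is the standard command for this):

sudo /opt/puppetlabs/bin/puppet agent -t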
SSH Setup
The master must be able to SSH to the slaves without a password. The usual approach is to generate an SSH key pair on the master and copy the public key to each slave. From the master, as the spark user:
ssh-keygen -t rsa
ssh-copy-id spark@spark1.lab.bpopp.net
ssh-copy-id spark@spark2.lab.bpopp.net
ssh-copy-id spark@spark3.lab.bpopp.net
Make sure it worked by trying:
ssh spark@localhost
You shouldn't be prompted for a password.
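It's worth repeating the check from the master against each slave as well, since the first connection also records each host key (hostnames are the ones used above):

ssh spark@spark1.lab.bpopp.net exit
ssh spark@spark2.lab.bpopp.net exit
ssh spark@spark3.lab.bpopp.net exit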
Spark Config
/usr/local/spark/conf/slaves
# A Spark Worker will be started on each of the machines listed below.
spark1
spark2
spark3
#spark4
/usr/local/spark/conf/spark-env.sh
# Add the Hadoop libraries to Spark's classpath so it can talk to HDFS
export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop --config /usr/local/hadoop/etc/hadoop classpath)
# or, equivalently, without an explicit config directory:
# export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)
/usr/local/spark/conf/spark-defaults.conf
# Example:
spark.master                       spark://spark1.lab.bpopp.net:7077
#spark.driver.memory               2g
spark.executor.memory              2g
# spark.eventLog.enabled           true
# spark.eventLog.dir               hdfs://namenode:8021/directory
# spark.serializer                 org.apache.spark.serializer.KryoSerializer
# spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
Hadoop Config
/usr/local/hadoop/etc/hadoop/hdfs-site.xml
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/spark/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/spark/hdfs/dfs</value>
  </property>
</configuration>
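The data directories referenced above aren't necessarily created automatically. A quick sketch, assuming the spark user owns /home/spark on every node:

mkdir -p /home/spark/hdfs/namenode   # master only
mkdir -p /home/spark/hdfs/dfs        # each datanode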
/usr/local/hadoop/etc/hadoop/core-site.xml
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://0.0.0.0:9000</value>
  </property>
</configuration>
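On a brand-new cluster, HDFS won't start until the namenode metadata directory has been formatted once. From the master, as the spark user:

/usr/local/hadoop/bin/hdfs namenode -format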
Start the Services
ssh spark@localhost
/usr/local/spark/sbin/start-all.sh
/usr/local/hadoop/sbin/start-all.sh
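To confirm the daemons came up, jps (part of the JDK) should list Master/Worker and NameNode/DataNode processes on the appropriate nodes, and the web UIs are a quick second check. The ports below are the defaults and the hostname is the master used in spark-defaults.conf:

jps
# Spark master UI:   http://spark1.lab.bpopp.net:8080
# HDFS namenode UI:  http://spark1.lab.bpopp.net:9870   (50070 on Hadoop 2.x)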
Jupyter Config
/home/spark/.jupyter/env
PYSPARK_PYTHON=/usr/bin/python3
HADOOP_HOME=/usr/local/hadoop
SPARK_DIST_CLASSPATH=/usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoop/hdfs/lib/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*:/usr/local/hadoop/share/hadoop/mapreduce/*:/usr/local/hadoop/share/hadoop/yarn:/usr/local/hadoop/share/hadoop/yarn/lib/*:/usr/local/hadoop/share/hadoop/yarn/*
SPARK_HOME=/usr/local/spark
JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64/
PYSPARK_SUBMIT_ARGS=--master spark://spark1.lab.bpopp.net:7077 pyspark-shell
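Before wiring this file into systemd, it can be sanity-checked by exporting roughly the same variables in a shell and launching pyspark against the cluster. A sketch, assuming the paths above are valid on this host; note the quoting on PYSPARK_SUBMIT_ARGS, which an interactive shell needs but systemd's EnvironmentFile does not:

export SPARK_HOME=/usr/local/spark
export PYSPARK_PYTHON=/usr/bin/python3
export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)
export PYSPARK_SUBMIT_ARGS='--master spark://spark1.lab.bpopp.net:7077 pyspark-shell'
/usr/local/spark/bin/pyspark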
Create Jupyter Service
/lib/systemd/system/jupyter.service
[Unit]
Description=Jupyter Notebook Server

[Service]
Type=simple
PIDFile=/run/jupyter.pid
EnvironmentFile=/home/spark/.jupyter/env
# Jupyter Notebook: change PATHs as needed for your system
ExecStart=/usr/local/bin/jupyter notebook
User=spark
Group=spark
WorkingDirectory=/home/spark/work
Restart=always
RestartSec=10
#KillMode=mixed

[Install]
WantedBy=multi-user.target
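Reload systemd and enable the unit so the notebook server starts at boot:

sudo systemctl daemon-reload
sudo systemctl enable jupyter
sudo systemctl start jupyter
sudo systemctl status jupyter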