Spark/Hadoop Cluster
Revision as of 04:47, 30 January 2024
Getting Started
This assumes the Spark/Hadoop cluster was configured in a particular way. You can see the general configuration on the Foreman page; in short, Spark was installed to /usr/local/spark and Hadoop to /usr/local/hadoop.
This is a good guide for general setup of a single-node cluster.
Once everything is up and running, these URLs should be available:
Passwordless SSH from Master
To let the spark user on the master SSH to itself (for a local worker) and to the workers, passwordless SSH must be enabled. Log in as the spark user on the master server and run:
ssh-keygen -t rsa -P ""
Once the key pair has been generated, the private key will be in /home/spark/.ssh/id_rsa and the public key in /home/spark/.ssh/id_rsa.pub (by default). Append the public key to the authorized_keys file (to allow spark to ssh to itself):
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Alternatively, use ssh-copy-id, once for each host (the master itself and every worker):
ssh-copy-id -i ~/.ssh/id_rsa.pub spark@localhost
ssh-copy-id -i ~/.ssh/id_rsa.pub spark@spark2.lab.bpopp.net
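With more than a couple of workers, the same step repeats for every host. A minimal sketch of a loop that does this follows. The WORKERS list and the DRY_RUN switch are assumptions made for illustration, not part of the original setup. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: push the spark user's public key to every host in the cluster.
# WORKERS and DRY_RUN are illustrative names; adjust the host list to your cluster.
WORKERS="${WORKERS:-localhost spark2.lab.bpopp.net}"
DRY_RUN="${DRY_RUN:-1}"          # 1 = only print the commands, 0 = run them
PUBKEY="$HOME/.ssh/id_rsa.pub"

distribute_key() {
    for host in $WORKERS; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "ssh-copy-id -i $PUBKEY spark@$host"
        else
            ssh-copy-id -i "$PUBKEY" "spark@$host"
            # BatchMode makes ssh fail instead of prompting for a password,
            # so this confirms that key-based login really works.
            ssh -o BatchMode=yes "spark@$host" true \
                && echo "ok: $host" || echo "FAILED: $host"
        fi
    done
}

distribute_key
```

Setting DRY_RUN=0 in the environment makes the script actually copy the key and then test each login.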
Binding Spark to External Interface
If you want to connect to your Spark master from an external PC, you will probably need to make the following change. If you run:
lsof -i -P -n | grep LISTEN
you may notice that Spark is bound to 127.0.0.1:7077. This won't allow external connections. To fix it, make sure the /etc/hosts file maps your hostname to the machine's external IP address rather than to 127.0.0.1:
127.0.0.1 localhost
192.168.2.31 spark1.lab.bpopp.net spark1
And then in /usr/local/spark/conf/spark-env.sh, add:
export SPARK_LOCAL_IP=spark1.lab.bpopp.net
export SPARK_MASTER_HOST=spark1.lab.bpopp.net
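Before restarting Spark, it can help to confirm what the hostname actually resolves to. The following is a rough sketch, assuming a Linux host where getent is available. HOST_TO_CHECK and the function names are made up for this example.

```shell
#!/bin/sh
# Sketch: warn if a hostname resolves to a loopback address.
# HOST_TO_CHECK is an illustrative variable, not a Spark setting.
HOST_TO_CHECK="${HOST_TO_CHECK:-spark1.lab.bpopp.net}"

# Succeeds when the given address is a loopback address.
is_loopback() {
    case "$1" in
        127.*|::1) return 0 ;;
        *)         return 1 ;;
    esac
}

check_host() {
    # getent uses the system resolver, which typically reads /etc/hosts first.
    ip=$(getent hosts "$1" 2>/dev/null | awk '{print $1; exit}')
    if [ -z "$ip" ]; then
        echo "$1 does not resolve"
    elif is_loopback "$ip"; then
        echo "$1 resolves to $ip (loopback); external connections will fail"
    else
        echo "$1 resolves to $ip"
    fi
}

check_host "$HOST_TO_CHECK"
```

If the script reports a loopback address, the /etc/hosts entry for the hostname is the first thing to check.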
Hadoop 3.3 on Java 11
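Sadly, Hadoop 3.3 needs a workaround to run correctly on Java 11, which no longer bundles the javax.activation module. The problem is outlined in this Stack Overflow question: https://stackoverflow.com/questions/53562981/hadoop-hdfs-3-1-1-on-java-11-web-ui-crash-when-loading-the-file-explorer. Download a copy of the javax.activation module (https://repo1.maven.org/maven2/com/sun/activation/javax.activation/1.2.0/) and install it in $HADOOP_HOME/share/hadoop/common.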
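The download-and-install step could be scripted roughly as follows. This is a sketch under two assumptions: the jar file name follows the usual Maven artifact-version.jar convention, and HADOOP_HOME defaults to /usr/local/hadoop as on this cluster. The function is defined but not called, so nothing is downloaded until you invoke it.

```shell
#!/bin/sh
# Sketch: fetch the javax.activation jar and drop it into Hadoop's common directory.
# The jar name is assumed from Maven naming conventions; verify it in the repository listing.
ACTIVATION_VERSION="1.2.0"
ACTIVATION_JAR="javax.activation-${ACTIVATION_VERSION}.jar"
ACTIVATION_URL="https://repo1.maven.org/maven2/com/sun/activation/javax.activation/${ACTIVATION_VERSION}/${ACTIVATION_JAR}"
HADOOP_HOME="${HADOOP_HOME:-/usr/local/hadoop}"

install_activation() {
    dest="$HADOOP_HOME/share/hadoop/common"
    # -f: fail on HTTP errors, -sS: quiet but show errors, -L: follow redirects
    curl -fsSL -o "$dest/$ACTIVATION_JAR" "$ACTIVATION_URL" \
        && echo "installed $dest/$ACTIVATION_JAR"
}

# Run as a user that can write to $HADOOP_HOME:
#   install_activation
```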
Starting Spark
su spark
cd /usr/local/spark/sbin
./start-all.sh
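To confirm the daemons actually came up, one option is to look at the JVM process list from jps. The helper below is a hypothetical sketch. It assumes jps (from the JDK) is on the PATH and that the standalone daemons appear under the class names Master and Worker.

```shell
#!/bin/sh
# Sketch: report which expected JVM daemons are absent from jps-style output.
# Reads "PID ClassName" lines on stdin; daemon names are passed as arguments.
missing_daemons() {
    listing=$(cat)
    for name in "$@"; do
        # Print the name only if no line has it as the whole second column.
        printf '%s\n' "$listing" \
            | awk -v n="$name" '$2 == n { found = 1 } END { exit !found }' \
            || echo "$name"
    done
}

# On the master, list any Spark daemons that are not running.
if command -v jps >/dev/null 2>&1; then
    jps | missing_daemons Master Worker
fi
```

No output means both daemons were found. Any name printed is a daemon that did not start, and its log under /usr/local/spark/logs is the place to look next.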
Hadoop Configuration
From /usr/local/hadoop/etc/hadoop/core-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/spark/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/spark/hdfs/datanode</value>
  </property>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>
</configuration>
From /usr/local/hadoop/etc/hadoop/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Starting Hadoop
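Note that the namenode needs to be formatted before its first startup or HDFS will not start. Formatting erases any existing HDFS metadata, so do this only once, on a new cluster.
(This assumes you are still logged in as the spark user.)
hdfs namenode -format
cd /usr/local/hadoop/sbin
./start-all.sh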
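Because reformatting wipes the namenode metadata, a small guard before the format step can be useful. This sketch only reports what it finds and never formats anything itself. NAME_DIR is an illustrative variable that mirrors the dfs.namenode.name.dir value from the configuration, and the check relies on a formatted namenode directory containing current/VERSION.

```shell
#!/bin/sh
# Sketch: decide whether the namenode still needs its one-time format.
# NAME_DIR mirrors dfs.namenode.name.dir (without the file: prefix).
NAME_DIR="${NAME_DIR:-/home/spark/hdfs/namenode}"

# Succeeds when the directory has not been formatted yet.
# A formatted namenode directory contains current/VERSION.
needs_format() {
    [ ! -f "$1/current/VERSION" ]
}

if needs_format "$NAME_DIR"; then
    echo "namenode at $NAME_DIR is not formatted; run: hdfs namenode -format"
else
    echo "namenode at $NAME_DIR is already formatted; skipping format"
fi
```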