Spark on K8S (spark on kubernetes operator) FAQ

Spark on K8S (spark on kubernetes operator) environment construction and demo process (2)

Common problems in Spark Demo (2)

How to persist logs in Spark's executor/driver

Two approaches come to mind:

  1. Modify the code so that, while the driver/executor runs, its logs are shipped to a unified logging system (e.g. ES); to be studied later
  2. Use Spark's own event-log persistence, which can write to an HDFS path (this is also the common practice on YARN)
    Method 2 is used this time, with the following configuration:

sparkConf:
  "spark.eventLog.enabled": "true"
  "spark.eventLog.dir": "hdfs://hdfscluster:port/spark-test/spark-events"

If you run into kerberos-related problems, the later sections cover how to solve them.

With log persistence in place, there is still the issue that the web UI served by Spark shuts down as soon as the driver finishes, so the job's execution can no longer be inspected. The first, crude idea: add a sleep before spark.stop(), and the web UI stays up. In the kubernetes cluster you can then see the svc corresponding to the spark-pi driver. It has no nodePort, so to reach it from outside the cluster you can either add an ingress on port 80 with this svc as its backend, or point the ingress controller's default backend straight at the svc. I used the latter, but the problem is that every spark application creates a fresh svc, so there is no single place to see them all. There should also be a simpler way to reach an svc from outside the cluster via kubectl port-forward; that has not been tried yet, so it is noted as a question to revisit later.

To try: use kubectl port-forward to reach an in-cluster svc from outside the cluster
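A minimal sketch of how that could look (the svc name follows the operator's usual <app>-ui-svc convention, which is an assumption here, and the driver UI listens on 4040 by default):

# Forward a local port to the driver UI svc, then browse http://localhost:4040
kubectl port-forward svc/spark-pi-ui-svc 4040:4040

The upside would be needing no ingress or NodePort at all; the downside is that the forward is per-user and per-session rather than a shared entry point.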

How to set up a working Spark history server

It is not feasible to keep every sparkApplication around just to look at its UI, so a Spark history server has to be set up. Two ways:

  1. Build a history-server image directly from the spark v2.4.4 image and run it in kubernetes
  2. Deploy a spark history server outside the cluster

Taking option 1, the Dockerfile is as follows:

ARG SPARK_IMAGE=gcr.io/spark-operator/spark:v2.4.4

FROM ${SPARK_IMAGE}
RUN chmod 777 /opt/spark/sbin/start-history-server.sh
RUN ls -l /opt/spark/sbin/start-history-server.sh

COPY spark-daemon.sh /opt/spark/sbin/spark-daemon.sh
RUN chmod 777 /opt/spark/sbin/spark-daemon.sh
COPY run.sh /opt/run.sh
RUN chmod 777 /opt/run.sh

RUN mkdir -p /etc/hadoop/conf
RUN chmod 777 /etc/hadoop/conf

COPY core-site.xml /etc/hadoop/conf/core-site.xml
COPY hdfs-site.xml /etc/hadoop/conf/hdfs-site.xml
COPY user.keytab /etc/hadoop/conf/user.keytab
COPY krb5.conf /etc/hadoop/conf/krb5.conf

RUN chmod 777 /etc/hadoop/conf/core-site.xml
RUN chmod 777 /etc/hadoop/conf/hdfs-site.xml
RUN chmod 777 /etc/hadoop/conf/user.keytab
RUN chmod 777 /etc/hadoop/conf/krb5.conf

ENTRYPOINT ["/opt/run.sh"]
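For completeness, a build-and-run sketch under assumed names (the registry prefix stays elided, as elsewhere in this article):

docker build -t xxxxx/spark-history-server:v1.0 .
docker run -p 18080:18080 xxxxx/spark-history-server:v1.0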

After docker run, the following error appears:

ps: unrecognized option: p
BusyBox v1.29.3 (2019-01-24 07:45:07 UTC) multi-call binary.

Usage: ps [-o COL1,COL2=HEADER]

Show list of processes

        -o COL1,COL2=HEADER     Select columns for display

To locate the cause of the error, a crude approach is used: instead of executing start-history-server.sh directly when the container starts, wrap it in a hand-written script that sleeps in a loop, then use docker exec -it XXXXXX bash to get into the container and investigate:

#!/bin/bash
# Start the history server, then loop forever so the container stays up
# for docker exec debugging, dumping the server logs as we go.
sh /opt/spark/sbin/start-history-server.sh "hdfs://xxxxxxxxxx:xxxx/spark-test/spark-events"
while true
do
        cat /opt/spark/logs/*
        sleep 60
done

It turns out that spark-daemon.sh, which the startup script invokes, uses ps -p, and the BusyBox ps in this image does not support that option, so it fails immediately. The fix is to modify spark-daemon.sh, replacing the ps -p usage with plain ps, and to copy the patched script into the image at docker build time (hence the COPY spark-daemon.sh in the Dockerfile above).
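As a sketch, the substitution could be applied like this before building (a crude replacement mirroring the description above; BusyBox ps takes no pid argument, so double-check that the resulting checks still behave sensibly):

# Patch spark-daemon.sh: the BusyBox ps in this image has no -p option
sed -i 's/ps -p/ps/g' spark-daemon.sh

While editing, another problem turns up in the script: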

execute_command() {
  if [ -z ${SPARK_NO_DAEMONIZE+set} ]; then
      nohup -- "$@" >> $log 2>&1 < /dev/null &
      newpid="$!"

The -- here is nohup's end-of-options marker (everything after it is treated as the command to run, not as nohup options); presumably the BusyBox nohup in this image does not handle it. Rather than dig further, replace it outright and start the process directly, without execute_command:

  case "$mode" in
    (class)
      "${SPARK_HOME}"/bin/spark-class "$command" "$@"

Next, a kerberos problem occurs:

starting org.apache.spark.deploy.history.HistoryServer, logging to /opt/spark/logs/spark--org.apache.spark.deploy.history.HistoryServer-1-7c7f7db06bdc.out
Spark Command: /usr/lib/jvm/java-1.8-openjdk/bin/java -cp /opt/spark/conf:/opt/spark/jars/*:/etc/hadoop/conf/ -Dspark.history.ui.port=18080 -Dspark.history.fs.logDirectory=hdfs://xxxxxxx:xxxx/spark-test/spark-events -Dspark.history.kerberos.principal=ossuser/hadoop@HADOOP.COM -Dspark.history.kerberos.keytab=/etc/hadoop/conf/user.keytab -Dspark.history.kerberos.enabled=true -Xmx1g org.apache.spark.deploy.history.HistoryServer hdfs://10.120.16.127:25000/spark-test/spark-events
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/01/17 06:38:21 INFO HistoryServer: Started daemon with process name: 334@7c7f7db06bdc
20/01/17 06:38:21 INFO SignalUtils: Registered signal handler for TERM
20/01/17 06:38:21 INFO SignalUtils: Registered signal handler for HUP
20/01/17 06:38:21 INFO SignalUtils: Registered signal handler for INT
20/01/17 06:38:21 WARN HistoryServerArguments: Setting log directory through the command line is deprecated as of Spark 1.1.0. Please set this through spark.history.fs.logDirectory instead.
Exception in thread "main" java.lang.IllegalArgumentException: Can't get Kerberos realm
        at org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:65)
        at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:276)
        at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:312)
        at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:53)
        at org.apache.spark.deploy.SparkHadoopUtil$.instance$lzycompute(SparkHadoopUtil.scala:392)
        at org.apache.spark.deploy.SparkHadoopUtil$.instance(SparkHadoopUtil.scala:392)
        at org.apache.spark.deploy.SparkHadoopUtil$.get(SparkHadoopUtil.scala:413)
        at org.apache.spark.deploy.history.HistoryServer$.initSecurity(HistoryServer.scala:342)
        at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:289)
        at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.security.authentication.util.KerberosUtil.getDefaultRealm(KerberosUtil.java:84)
        at org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:63)
        ... 9 more
Caused by: KrbException: Cannot locate default realm
        at sun.security.krb5.Config.getDefaultRealm(Config.java:1029)
        ... 15 more

This is the usual kerberos checklist:

  1. Pass the kerberos authentication parameters to the spark history server; fortunately it accepts them as configuration properties
  2. Tell the JVM where krb5.conf lives
    Both are handled by setting two environment variables (setting them in the Dockerfile was found not to work, so they go into run.sh):
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.fs.logDirectory=hdfs://xxxxx:xxxx/spark-test/spark-events -Dspark.history.kerberos.principal=ossuser/hadoop@HADOOP.COM -Dspark.history.kerberos.keytab=/etc/hadoop/conf/user.keytab -Dspark.history.kerberos.enabled=true -Djava.security.krb5.conf=/etc/hadoop/conf/krb5.conf"
export HADOOP_CONF_DIR=/etc/hadoop/conf
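Putting the fragments together, run.sh presumably ends up looking roughly like this (my reconstruction from the pieces above; the HDFS address stays elided as in the original):

#!/bin/bash
# Kerberos + event-log settings for the history server JVM
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 \
  -Dspark.history.fs.logDirectory=hdfs://xxxxx:xxxx/spark-test/spark-events \
  -Dspark.history.kerberos.principal=ossuser/hadoop@HADOOP.COM \
  -Dspark.history.kerberos.keytab=/etc/hadoop/conf/user.keytab \
  -Dspark.history.kerberos.enabled=true \
  -Djava.security.krb5.conf=/etc/hadoop/conf/krb5.conf"
export HADOOP_CONF_DIR=/etc/hadoop/conf

# Start the (patched) history server, then keep the container alive
# while surfacing its logs to stdout
sh /opt/spark/sbin/start-history-server.sh
while true
do
        cat /opt/spark/logs/*
        sleep 60
done

Passing the log directory via -Dspark.history.fs.logDirectory instead of as a script argument would also silence the deprecation warning visible in the logs above.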

After rebuilding and docker run again, it starts normally:

starting org.apache.spark.deploy.history.HistoryServer, logging to /opt/spark/logs/spark--org.apache.spark.deploy.history.HistoryServer-1-spark-history-server-5ccf5dbd4d-f7f8l.out
Spark Command: /usr/lib/jvm/java-1.8-openjdk/bin/java -cp /opt/spark/conf:/opt/spark/jars/*:/etc/hadoop/conf/ -Dspark.history.ui.port=18080 -Dspark.history.fs.logDirectory=hdfs://10.120.16.127:25000/spark-test/spark-events -Dspark.history.kerberos.principal=ossuser/hadoop@HADOOP.COM -Dspark.history.kerberos.keytab=/etc/hadoop/conf/user.keytab -Dspark.history.kerberos.enabled=true -Djava.security.krb5.conf=/etc/hadoop/conf/krb5.conf -Xmx1g org.apache.spark.deploy.history.HistoryServer hdfs://10.120.16.127:25000/spark-test/spark-events
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/01/17 07:11:13 INFO HistoryServer: Started daemon with process name: 13@spark-history-server-5ccf5dbd4d-f7f8l
20/01/17 07:11:13 INFO SignalUtils: Registered signal handler for TERM
20/01/17 07:11:13 INFO SignalUtils: Registered signal handler for HUP
20/01/17 07:11:13 INFO SignalUtils: Registered signal handler for INT
20/01/17 07:11:13 WARN HistoryServerArguments: Setting log directory through the command line is deprecated as of Spark 1.1.0. Please set this through spark.history.fs.logDirectory instead.
20/01/17 07:11:14 INFO SparkHadoopUtil: Attempting to login to Kerberos using principal: ossuser/hadoop@HADOOP.COM and keytab: /etc/hadoop/conf/user.keytab
20/01/17 07:11:15 INFO UserGroupInformation: Login successful for user ossuser/hadoop@HADOOP.COM using keytab file /etc/hadoop/conf/user.keytab
20/01/17 07:11:15 INFO SecurityManager: Changing view acls to: root,ossuser
20/01/17 07:11:15 INFO SecurityManager: Changing modify acls to: root,ossuser
20/01/17 07:11:15 INFO SecurityManager: Changing view acls groups to:
20/01/17 07:11:15 INFO SecurityManager: Changing modify acls groups to:
20/01/17 07:11:15 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root, ossuser); groups with view permissions: Set(); users  with modify permissions: Set(root, ossuser); groups with modify permissions: Set()
20/01/17 07:11:15 INFO FsHistoryProvider: History server ui acls disabled; users with admin permissions: ; groups with admin permissions
20/01/17 07:11:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/01/17 07:11:15 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
20/01/17 07:11:20 INFO Utils: Successfully started service on port 18080.
20/01/17 07:11:20 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and started at http://spark-history-server-5ccf5dbd4d-f7f8l:18080
20/01/17 07:11:20 INFO FsHistoryProvider: Parsing hdfs://xxxxx:xxxx/spark-test/spark-events/spark-8cc1feeddb5f4e54b87613613752eac1 for listing data...

20/01/17 07:11:27 INFO FsHistoryProvider: Finished parsing hdfs://xxxxx:xxxx/spark-test/spark-events/spark-8cc1feeddb5f4e54b87613613752eac1
20/01/17 07:11:27 INFO FsHistoryProvider: Parsing hdfs://xxxxx:xxxx/spark-test/spark-events/spark-2b356130dc12418aa526bf56328fe840 for listing data...

The next step is to deploy this image as a Deployment, put an svc in front of it, and make it reachable from outside the cluster.

Out of laziness, the kerberos/hadoop configuration is baked into the image here. It could instead be mounted as a ConfigMap volume, the same way a sparkApplication does it, and the environment variables could likewise come from a ConfigMap or be set directly in the Deployment; a sketch follows, though the repetitive work is not actually redone here.
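A sketch of that ConfigMap-based variant, as a fragment of the Deployment's pod spec (untested here; all names are placeholders, and the keytab, being binary and sensitive, would really belong in a Secret rather than a ConfigMap):

      containers:
      - name: spark-history-server
        env:
        - name: HADOOP_CONF_DIR
          value: /etc/hadoop/conf
        volumeMounts:
        - name: hadoop-conf
          mountPath: /etc/hadoop/conf
      volumes:
      - name: hadoop-conf
        configMap:
          name: spark-history-hadoop-conf  # would hold core-site.xml, hdfs-site.xml, krb5.conf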

Deployment yaml

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    k8s-app: spark-history-server
  name: spark-history-server
spec:
  replicas: 1
  template:
    metadata:
      labels:
        k8s-app: spark-history-server
    spec:
      imagePullSecrets:
      - name: dockersecret
      containers:
      - name: spark-history-server
        image: xxxxx/spark-history-server:v1.0
        imagePullPolicy: Always
        ports:
        - containerPort: 18080
          protocol: TCP
        resources:
          requests:
            cpu: 2
            memory: 4Gi
          limits:
            cpu: 4
            memory: 8Gi

svc yaml

apiVersion: v1
kind: Service
metadata:
  labels:
    k8s-app: spark-history-server
  name: spark-history-server
spec:
  type: NodePort
  ports:
  - name: http
    port: 18080
    targetPort: 18080
    nodePort: 30118
  selector:
    k8s-app: spark-history-server

After deployment:

[root@linux100-99-81-13 spark_history]# kubectl get pod
NAME                                    READY   STATUS    RESTARTS   AGE
spark-history-server-5ccf5dbd4d-f7f8l   1/1     Running   0          178m
[root@linux100-99-81-13 spark_history]# kubectl get svc
NAME                           TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)           AGE
kubernetes                     ClusterIP   10.96.0.1       <none>        443/TCP           51d
spark-history-server           NodePort    10.111.132.22   <none>        18080:30118/TCP   169m

Through nodeIP:nodePort, the history page is now visible in the browser.
Exposing a raw NodePort is rather unruly, though. A better way is to route to the spark history server's svc through an ingress. So, build a ClusterIP-type svc:

apiVersion: v1
kind: Service
metadata:
  labels:
    k8s-app: spark-history-server
  name: spark-history-server-cluster
  namespace: default
spec:
  type: ClusterIP
  ports:
  - port: 5601
    protocol: TCP
    targetPort: 18080
  selector:
    k8s-app: spark-history-server

Then an ingress forwards URLs under the /sparkHistory path to port 5601 of the spark-history-server-cluster svc:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    ingress.kubernetes.io/rewrite-target: /
    ingress.kubernetes.io/ssl-redirect: "false"
  name: spark-history-ingress
  namespace: default
spec:
  rules:
  - http:
      paths:
      - backend:
          serviceName: spark-history-server-cluster
          servicePort: 5601
        path: /sparkHistory

However, there are still problems: the address handling needs work. Some of the history server pages request resources without the /sparkHistory prefix, so pages load incompletely. For example, this resource: http://100.99.65.73/static/history-common.js
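One avenue worth trying (not verified here): Spark exposes a spark.ui.proxyBase property that prefixes the links the UI generates, which could be appended to SPARK_HISTORY_OPTS in run.sh so that static resources are requested under the ingress path:

# Untested: make the history UI generate links under the ingress prefix
export SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.ui.proxyBase=/sparkHistory"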

Getting the ClusterIP svc to work through the ingress took some stumbling: with both the ingress and the svc built, refreshing the page always returned 503. It was eventually found that the svc configuration was wrong, so it matched no pod and had nothing to answer with. The selector should be
selector: k8s-app: spark-history-server (it was initially mis-set to spark-history-server-cluster)

At this point, the workflow of viewing historical applications through a self-built spark history server is complete. The one remaining gap is the ingress configuration, which has not been worked out yet, so NodePort access will do for now.

What does the xxxxx-webhook svc in the spark-operator namespace do

After setting up the cluster, an svc named littering-woodpecker-webhook turns up, which at first glance looks like a spark operator management interface or a REST entry point exposed by the operator:

spark-operator   littering-woodpecker-webhook   ClusterIP   10.101.113.106   <none>        443/TCP           49d

An attempt was made to point an ingress at this svc anyway, but a plain GET is evidently not a legal request to it:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    ingress.kubernetes.io/rewrite-target: /
    ingress.kubernetes.io/ssl-redirect: "false"
  name: spark-operator-ingress
  namespace: spark-operator
spec:
  rules:
  - http:
      paths:
      - backend:
          serviceName: littering-woodpecker-webhook
          servicePort: 443
        path: /sparkOperator

In the browser's F12 devtools, the response is HTTP 400 Bad Request. That fits: this svc is the operator's mutating admission webhook, which expects the Kubernetes API server to POST AdmissionReview requests to it over HTTPS, so it is not a browser-facing page at all.
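A quick in-cluster probe could confirm this without involving the ingress (sketch only; the ClusterIP comes from the svc listing above):

# A bare GET against the webhook; expect an error status, not a usable page
curl -k https://10.101.113.106:443/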
