My old Spark notes dump: https://docs.google.com/document/d/1tej2kgwqY0w6kO6kcjjGbg3brTeZnE_igmMfvd6SgyM/edit?usp=sharing
Dataproc best-practices guide
https://cloud.google.com/blog/topics/developers-practitioners/dataproc-best-practices-guide
GCS integration is provided by default on all Dataproc clusters; they recognize gs:// path prefixes out of the box
https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage
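Because the connector is preinstalled, gs:// URIs behave like any other filesystem path in the read/write API. A minimal sketch (needs a Dataproc cluster to actually run; the bucket name is a made-up placeholder):

```python
from pyspark.sql import SparkSession

# On Dataproc the GCS connector ships with the image, so no extra jars
# or config are needed to resolve gs:// paths.
spark = SparkSession.builder.appName("gcs-read-sketch").getOrCreate()

# "my-bucket" is a placeholder, not a real bucket.
df = spark.read.parquet("gs://my-bucket/events/")   # same API as hdfs:// or file://
df.write.mode("overwrite").csv("gs://my-bucket/out/")  # writes go straight to GCS
```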
Connect your local IDE debugger to the remote JVM on a worker
https://stackoverflow.com/questions/63052302/how-to-debug-a-spark-job-on-dataproc
More on remote debugging
https://medium.com/agile-lab-engineering/spark-remote-debugging-371a1a8c44a8
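The approach in both links boils down to opening a JDWP port on the executor JVMs and attaching the IDE to it (usually via an SSH tunnel to the worker). A config sketch, assuming port 5005 as the conventional debug port:

```python
from pyspark.sql import SparkSession

# Ask each executor JVM to listen for a debugger on port 5005.
# suspend=y makes the executor wait for the IDE to attach before
# running any tasks; use suspend=n to attach lazily instead.
jdwp = "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"

spark = (
    SparkSession.builder
    .appName("remote-debug-sketch")
    .config("spark.executor.extraJavaOptions", jdwp)
    .getOrCreate()
)
```

The same option on `spark.driver.extraJavaOptions` debugs the driver JVM instead.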
PySpark debugging
https://spark.apache.org/docs/3.1.3/api/python/development/debugging.html
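The linked docs cover debugging the Python side of executors by calling back to a PyCharm debug server from inside the function Spark ships to workers. A sketch of that pattern, assuming pydevd-pycharm is installed on the workers; host and port are placeholders for wherever your debug server listens:

```python
import pydevd_pycharm

def debuggable_transform(value):
    # Attach to a PyCharm debug server from the executor's Python worker;
    # "localhost"/7777 are placeholders and normally point through an
    # SSH tunnel back to your machine.
    pydevd_pycharm.settrace(
        "localhost", port=7777, stdoutToServer=True, stderrToServer=True
    )
    return value * 2  # breakpoint here now hits in the IDE
```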
Fixing data skew with salting
https://chengzhizhao.com/deep-dive-into-handling-apache-spark-data-skew
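The linked post does this in Spark SQL; the mechanics are easier to see in a minimal pure-Python sketch of the two-stage aggregation: append a random salt to each key so one hot key spreads over several partitions, aggregate per salted key, then strip the salt and re-aggregate (names here are illustrative, not from the post):

```python
import random

def salt_key(key, n_salts):
    # Spread a hot key over n_salts buckets by appending a random suffix.
    # Assumes keys don't themselves contain "_".
    return f"{key}_{random.randrange(n_salts)}"

def salted_count(keys, n_salts):
    # Stage 1: count per salted key (in Spark this is the first, skew-free shuffle).
    stage1 = {}
    for key in keys:
        sk = salt_key(key, n_salts)
        stage1[sk] = stage1.get(sk, 0) + 1
    # Stage 2: strip the salt and combine the partial counts
    # (the second, much smaller shuffle).
    stage2 = {}
    for sk, cnt in stage1.items():
        base = sk.rsplit("_", 1)[0]
        stage2[base] = stage2.get(base, 0) + cnt
    return stage2
```

The final counts are identical to an unsalted group-by; only the intermediate distribution changes.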
Spark web UI overview https://spark.apache.org/docs/latest/web-ui.html