My old Spark notes dump: https://docs.google.com/document/d/1tej2kgwqY0w6kO6kcjjGbg3brTeZnE_igmMfvd6SgyM/edit?usp=sharing
Dataproc best-practices guide
https://cloud.google.com/blog/topics/developers-practitioners/dataproc-best-practices-guide
GCS integration is provided by default on all Dataproc clusters; they recognize gs:// path prefixes out of the box
https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage
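Because the connector is preinstalled, gs:// URIs behave like any other filesystem path in the read/write API. A minimal sketch (needs a Dataproc cluster to actually run; the bucket name is a made-up placeholder):

```python
from pyspark.sql import SparkSession

# On Dataproc the GCS connector ships with the image, so no extra jars
# or config are needed to resolve gs:// paths.
spark = SparkSession.builder.appName("gcs-read-sketch").getOrCreate()

# "my-bucket" is a placeholder, not a real bucket.
df = spark.read.parquet("gs://my-bucket/events/")   # same API as hdfs:// or file://
df.write.mode("overwrite").csv("gs://my-bucket/out/")  # writes go straight to GCS
```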
Connect your local IDE debugger to the remote JVM on a worker
https://stackoverflow.com/questions/63052302/how-to-debug-a-spark-job-on-dataproc
More on remote debugging
https://medium.com/agile-lab-engineering/spark-remote-debugging-371a1a8c44a8
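The approach in both links boils down to opening a JDWP port on the executor JVMs and attaching the IDE to it (usually via an SSH tunnel to the worker). A config sketch, assuming port 5005 as the conventional debug port:

```python
from pyspark.sql import SparkSession

# Ask each executor JVM to listen for a debugger on port 5005.
# suspend=y makes the executor wait for the IDE to attach before
# running any tasks; use suspend=n to attach lazily instead.
jdwp = "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"

spark = (
    SparkSession.builder
    .appName("remote-debug-sketch")
    .config("spark.executor.extraJavaOptions", jdwp)
    .getOrCreate()
)
```

The same option on `spark.driver.extraJavaOptions` debugs the driver JVM instead.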
PySpark debugging
https://spark.apache.org/docs/3.1.3/api/python/development/debugging.html
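The linked docs cover debugging the Python side of executors by calling back to a PyCharm debug server from inside the function Spark ships to workers. A sketch of that pattern, assuming pydevd-pycharm is installed on the workers; host and port are placeholders for wherever your debug server listens:

```python
import pydevd_pycharm

def debuggable_transform(value):
    # Attach to a PyCharm debug server from the executor's Python worker;
    # "localhost"/7777 are placeholders and normally point through an
    # SSH tunnel back to your machine.
    pydevd_pycharm.settrace(
        "localhost", port=7777, stdoutToServer=True, stderrToServer=True
    )
    return value * 2  # breakpoint here now hits in the IDE
```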
Fixing data skew with salting
https://chengzhizhao.com/deep-dive-into-handling-apache-spark-data-skew
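The linked post does this in Spark SQL; the mechanics are easier to see in a minimal pure-Python sketch of the two-stage aggregation: append a random salt to each key so one hot key spreads over several partitions, aggregate per salted key, then strip the salt and re-aggregate (names here are illustrative, not from the post):

```python
import random

def salt_key(key, n_salts):
    # Spread a hot key over n_salts buckets by appending a random suffix.
    # Assumes keys don't themselves contain "_".
    return f"{key}_{random.randrange(n_salts)}"

def salted_count(keys, n_salts):
    # Stage 1: count per salted key (in Spark this is the first, skew-free shuffle).
    stage1 = {}
    for key in keys:
        sk = salt_key(key, n_salts)
        stage1[sk] = stage1.get(sk, 0) + 1
    # Stage 2: strip the salt and combine the partial counts
    # (the second, much smaller shuffle).
    stage2 = {}
    for sk, cnt in stage1.items():
        base = sk.rsplit("_", 1)[0]
        stage2[base] = stage2.get(base, 0) + cnt
    return stage2
```

The final counts are identical to an unsalted group-by; only the intermediate distribution changes.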
Spark web UI overview https://spark.apache.org/docs/latest/web-ui.html