A word of warning if you typically partition your DataFrames/RDDs/Datasets in Spark based on a Integer, Long or Timestamp keys. If you’re using Spark’s default partitioning settings (or even something similar) and your values are sufficiently large and regularly spaced out, it’s possible that you’ll see poor partitioning performance partitioning these datasets.
This can manifest itself in DataFrame “repartition” or “groupBy” commands and in the SQL “DISTRIBUTE BY” or “CLUSTER BY” keywords, among other things. It can also manifest itself in RDD “groupByKey”, “reduceByKey” commands if your number of cores evenly divides the hashCodes of your data (read on if this doesn’t exactly make sense).
If you’re interested in why this happens, read on. Otherwise feel free to jump to the end to see how to work around this.