Tag Archives: shuffle

Shuffle Free Joins in Spark SQL

As I’ve mentioned my previous post on shuffles, shuffles in Spark can be a source of huge slowdowns but for a lot of operations, such as joins, they’re necessary to do the computation. Or are they?

Yes, they are. But you can exercise some more control over your queries and ensure that they only occur once if you know you’re going to be performing the same shuffle/join over and over again. We’ll briefly explore how the Catalyst query planner takes advantage of knowledge of distributions to ensure shuffles are only performed when necessary. As you’ll see below we can actually improve performance in these cases to the point that joins can be done in linear time!

Continue reading

Apache Spark Shuffles Explained In Depth

I originally intended this to be a much longer post about memory in Spark, but I figured it would be useful to just talk about Shuffles generally so that I could brush over it in the Memory discussion and just make it a bit more digestible. Shuffles are one of the most memory/network intensive parts of most Spark jobs so it’s important to understand when they occur and what’s going on when you’re trying to improve performance.

Continue reading