Shuffle Free Joins in Spark SQL

As I mentioned in my previous post on shuffles, shuffles in Spark can be a source of huge slowdowns, but for a lot of operations, such as joins, they’re necessary to do the computation. Or are they?

Yes, they are. But if you know you’re going to be performing the same shuffle/join over and over again, you can exercise more control over your queries and ensure that the shuffle only happens once. We’ll briefly explore how the Catalyst query planner uses knowledge of data distributions to ensure shuffles are performed only when necessary. As you’ll see below, in these cases we can improve performance to the point that joins can be done in linear time!
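
To make the idea concrete before getting into Spark itself, here is a toy model in plain Python. It is not Spark code and not how Catalyst is implemented. The helper names (`hash_partition`, `local_join`, `co_partitioned_join`) and the sample data are made up for illustration, and the per-partition hash join is my own simplification. It only shows why a join needs no further data movement once both sides are partitioned the same way on the join key.

```python
from collections import defaultdict


def hash_partition(rows, num_partitions):
    """Simulate one shuffle: send each (key, value) row to partition hash(key) % n."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in rows:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def local_join(left_part, right_part):
    """Join a single pair of partitions; no row leaves its partition."""
    index = defaultdict(list)
    for key, value in left_part:          # one pass over the left side
        index[key].append(value)
    joined = []
    for key, right_value in right_part:   # one pass over the right side
        for left_value in index.get(key, ()):
            joined.append((key, left_value, right_value))
    return joined


def co_partitioned_join(left_parts, right_parts):
    """Join two datasets that already share the same partitioning on the key."""
    assert len(left_parts) == len(right_parts), "partition counts must match"
    result = []
    for left_part, right_part in zip(left_parts, right_parts):
        result.extend(local_join(left_part, right_part))
    return result


# Hypothetical sample data: (join key, payload).
users = [(1, "ann"), (2, "bob"), (3, "cat")]
orders = [(1, "book"), (3, "pen"), (3, "ink"), (4, "cup")]

# Pay for the "shuffle" once per dataset...
users_parts = hash_partition(users, 4)
orders_parts = hash_partition(orders, 4)

# ...then every later join on the same key reuses the partitions as they are.
result = sorted(co_partitioned_join(users_parts, orders_parts))
print(result)
```

Matching keys always land in the same partition index on both sides, so each partition pair can be joined independently. In this sketch each local join makes a single pass over its two inputs, so the work is roughly linear in the input size plus the number of output rows.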
