Tag Archives: spark

Poor Hash Partitioning of Timestamps, Integers and Longs in Spark

A word of warning if you typically partition your DataFrames/RDDs/Datasets in Spark based on a Integer, Long or Timestamp keys. If you’re using Spark’s default partitioning settings (or even something similar) and your values are sufficiently large and regularly spaced out, it’s possible that you’ll see poor partitioning performance partitioning these datasets.

This can manifest itself in DataFrame “repartition” or “groupBy” commands and in the SQL “DISTRIBUTE BY” or “CLUSTER BY” keywords, among other things. It can also manifest itself in RDD “groupByKey”, “reduceByKey” commands if your number of cores evenly divides the hashCodes of your data (read on if this doesn’t exactly make sense).

If you’re interested in why this happens, read on. Otherwise feel free to jump to the end to see how to work around this.

Continue reading

Shuffle Free Joins in Spark SQL

As I’ve mentioned my previous post on shuffles, shuffles in Spark can be a source of huge slowdowns but for a lot of operations, such as joins, they’re necessary to do the computation. Or are they?

Yes, they are. But you can exercise some more control over your queries and ensure that they only occur once if you know you’re going to be performing the same shuffle/join over and over again. We’ll briefly explore how the Catalyst query planner takes advantage of knowledge of distributions to ensure shuffles are only performed when necessary. As you’ll see below we can actually improve performance in these cases to the point that joins can be done in linear time!

Continue reading

Apache Spark Shuffles Explained In Depth

I originally intended this to be a much longer post about memory in Spark, but I figured it would be useful to just talk about Shuffles generally so that I could brush over it in the Memory discussion and just make it a bit more digestible. Shuffles are one of the most memory/network intensive parts of most Spark jobs so it’s important to understand when they occur and what’s going on when you’re trying to improve performance.

Continue reading

In the Code: Spark SQL Query Planning and Execution

If you’ve dug around into the Spark SQL code, whether it’s in catalyst, or the core code, you’ve probably seen tons of references to Logical and Physical plans. These are the core units of query planning, a common concept in database programming. In this post we’re going to go over what these are at a high level and then how they’re represented and used in Spark. The query planning code is especially important because it determines how your SparkSQL query gets turned into actual RDD primitive calls and has large implications on performance of applications.

This document is based on the Spark master branch as of early Februrary. It assumes that you’ve got a cursory understanding of the Spark RDD API, and some of the lower level concepts such as partitioning and shuffling (what it is, when it takes place).

For higher level coverage of what we’re going over, I would recommend the Spark SQL whitepaper that was published last year. The content it covers is pretty similar, but we’ll be diving a little deeper with examples from the code and more explanation on how execution actually takes place rather than focusing on features.

Continue reading

Writing a Spark Data Source

Note: This is being written as of early December of 2015 and currently assumes Spark 1.5.2 API. The data sources API has been out for a few versions now but it’s still stabilizing so some of this might change to be out of date.

I’m writing this guide as part of my own exploration on how to write a data source using the Spark SQL Data Sources API. We’ll start with exploration of the interfaces and features, then dive into 2 examples: Parquet and Spark JDBC. Finally we will cover an end to end example of writing your own simple data source.

Continue reading