If you’ve dug around into the Spark SQL code, whether it’s in catalyst, or the core code, you’ve probably seen tons of references to Logical and Physical plans. These are the core units of query planning, a common concept in database programming. In this post we’re going to go over what these are at a high level and then how they’re represented and used in Spark. The query planning code is especially important because it determines how your SparkSQL query gets turned into actual RDD primitive calls and has large implications on performance of applications.
This document is based on the Spark master branch as of early Februrary. It assumes that you’ve got a cursory understanding of the Spark RDD API, and some of the lower level concepts such as partitioning and shuffling (what it is, when it takes place).
For higher level coverage of what we’re going over, I would recommend the Spark SQL whitepaper that was published last year. The content it covers is pretty similar, but we’ll be diving a little deeper with examples from the code and more explanation on how execution actually takes place rather than focusing on features.