Practical Introduction to PySpark
In this practical introduction to PySpark, we will follow closely the implementation of functions from the textbook Spark: The Definitive Guide. The idea is simply to access the facilities of PySpark using a readily available dataset. In the coming sections, we will explore more advanced yet practical ways to use PySpark.
To begin, download the data file 2015-summary.csv. It can also be found in the GitHub repository: https://github.com/databricks/Spark-The-Definitive-Guide/blob/master/data/flight-data/csv/2015-summary.csv
Reading CSV into Spark
Once you have downloaded the CSV file, we can read it into a Spark DataFrame using the read attribute of the SparkSession object, together with two useful options:
- options("inferSchema", "true"): This infers the meta data directly from the CSV file
- options("header", "true"): This sets the CSV first row as column names
# reading csv to dataframe object
flight_data = spark.read.option("header", "true").option("inferSchema", "true").csv("Downloads/2015-summary.csv")
flight_data
When we call the object on its own, Spark displays only the column names and types of the DataFrame, not the data itself, since no action has been triggered yet.
printSchema()
Another useful method is printSchema(), which prints the schema of our DataFrame in a tree format. This is closely linked to the schema representation in SQL. To utilize this method, we call it on the DataFrame object.
flight_data.printSchema()
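As an aside, rather than relying on inferSchema (which costs an extra pass over the data), we can supply an explicit schema when reading. Below is a sketch, assuming the three columns of this dataset:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# explicit schema matching the flight data columns
flight_schema = StructType([
    StructField("DEST_COUNTRY_NAME", StringType(), True),
    StructField("ORIGIN_COUNTRY_NAME", StringType(), True),
    StructField("count", IntegerType(), True),
])

flight_data = spark.read.option("header", "true").schema(flight_schema).csv("Downloads/2015-summary.csv")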
Retrieving Data with head()
If you have used tools such as pandas, the command line, or R packages like the tidyverse, you may be used to the idea of using head or tail to get a quick glimpse of the data. PySpark offers the same facilities, albeit a little differently. For example, we can use the head() method to return the first n rows as a list of Row objects:
flight_data.head(3)
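The returned Row objects can be indexed by column name (or accessed as attributes); a small sketch:

first_row = flight_data.head()  # with no argument, head() returns a single Row
print(first_row["DEST_COUNTRY_NAME"], first_row["count"])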
Retrieving Data with take()
Much like the head() method, the take() method returns the first n rows of the DataFrame as a list of Row objects; for DataFrames the two are effectively interchangeable:
flight_data.take(3)
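It is worth noting that take() is an action that brings the rows back to the driver as Python objects, whereas limit() is a transformation that produces a new, smaller DataFrame; a quick sketch of the distinction:

rows = flight_data.take(3)       # action: returns a Python list of Row objects
small_df = flight_data.limit(3)  # transformation: returns a new DataFrame, evaluated lazily
small_df.show()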
Sample Data with show()
While head() and take() return Row objects, the show() method prints the rows of the DataFrame as a formatted table, which is far easier to read:
flight_data.show()
By default, the show() method prints the top 20 rows of the DataFrame. Passing in an n value will print that number of rows instead.
Vertical Display Format
Of course, there are other ways to display the data. By default, the vertical display option is set to False. However, it can be set to True to achieve the result below, where each row is printed as a vertical list of column values:
flight_data.show(n=5, vertical=True)
Truncate Display Format
Alternatively, we can choose to truncate the displayed value of each column to a specified character length.
flight_data.show(10, truncate=10)
While in our case this makes the output harder to read, we can definitely see how it would be useful for datasets with long string columns.
Count
Another fundamental action we can perform is to simply count the number of observations (rows) in the dataset.
flight_data.count()
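Since count() is an action, it can be chained after transformations. As a sketch, we can count only the flights whose destination is the United States (one of the values appearing in this dataset):

# count the rows matching a condition
flight_data.filter(flight_data["DEST_COUNTRY_NAME"] == "United States").count()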
Sort
Much like in pandas, we can chain transformation and action methods together. In the example below, we sort the data by the count column in descending order and display a sample of the sorted frame using the sort() and show() methods:
flight_data.sort("count", ascending=False).show()
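Equivalently, we could use orderBy together with the desc column function from pyspark.sql.functions; a sketch:

from pyspark.sql.functions import desc

# orderBy is an alias of sort; desc("count") sorts that column in descending order
flight_data.orderBy(desc("count")).show(5)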
Group By
Finally, let's perform a simple group-by aggregation. In this case, we sum the count of flights by their destination and return the first 5 rows of data.
(flight_data
    .groupBy("DEST_COUNTRY_NAME")
    .sum("count")
    .withColumnRenamed("sum(count)", "destination_total")
    .limit(5)
    .show())
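The same result can be written with agg and alias, which avoids the separate withColumnRenamed step; a sketch:

from pyspark.sql.functions import sum as sum_  # avoid shadowing Python's built-in sum

# alias the aggregate column directly inside the aggregation
flight_data.groupBy("DEST_COUNTRY_NAME").agg(sum_("count").alias("destination_total")).limit(5).show()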
While this operation is useful and marks the beginning of more extensive analytics, one of the problems we are already seeing is that using methods directly requires remembering the method names and their inputs. This can be convenient for interactive sessions, but oftentimes the use of declarative SQL queries, which Spark also supports, is easier to write and maintain.
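As a quick illustration of the SQL route, the sketch below registers the DataFrame as a temporary view and expresses the same aggregation with spark.sql:

# register the DataFrame as a temporary view so it can be queried with SQL
flight_data.createOrReplaceTempView("flight_data")

spark.sql("""
    SELECT DEST_COUNTRY_NAME, SUM(count) AS destination_total
    FROM flight_data
    GROUP BY DEST_COUNTRY_NAME
    LIMIT 5
""").show()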
Write to Local CSV
Finally, we can write to a local CSV using the write attribute. This method has a number of options and configurations, which we will explore more extensively later. For now, we simply write to CSV and keep the header.
# aggregating the data
summary = (flight_data
    .groupBy("DEST_COUNTRY_NAME")
    .sum("count")
    .withColumnRenamed("sum(count)", "destination_total"))

summary.write.option("header", True).csv("Downloads/summary.csv")
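One thing to be aware of: Spark writes the output as a directory named summary.csv containing one or more part files, not a single CSV file. If a single file is required and the data comfortably fits in one partition, here is a sketch using coalesce (the output path is illustrative):

# collapse to a single partition so only one part file is written
summary.coalesce(1).write.mode("overwrite").option("header", True).csv("Downloads/summary_single.csv")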
This concludes the introduction to PySpark and some of its facilities. The coming sections will explore more of PySpark's functionality, ultimately leading to building extensive pipelines.