25. JDBC Connection

In this chapter, you will learn how to use a JDBC source to read and write data in PySpark. The ideal storage layer for Spark is HDFS, but many companies are not willing to move all of their data (especially PII data) into Databricks’ HDFS, so you have to use JDBC to connect to an external database. JDBC reads and writes, however, are always tricky and confusing for beginners.

25.1. JDBC Driver

For a successful connection, you need the corresponding JDBC driver for the specific database. Here I will use the Greenplum database as an example to demonstrate how to get the correct .jar file and where to put the .jar.

25.1.1. Get the .jar file

Since Greenplum is based on PostgreSQL, you can search for ‘PostgreSQL JDBC Driver’. There is a high chance that you will reach this page: https://jdbc.postgresql.org/download.html. Then download the .jar file.

25.1.2. Put .jar in the jars folder

Now what you need to do is put the .jar file in the jars folder under your Spark installation folder. Here is my jars folder: /opt/spark/jars

[Figure: JDBC connection jars folder (_images/postgresql_driver.png)]
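If you prefer not to rely on the jars folder alone, you can also point Spark at the driver when building the session. Below is a minimal sketch; the exact jar file name (postgresql-42.2.5.jar here) is an assumption and depends on the version you downloaded.

from pyspark.sql import SparkSession

# A minimal sketch: instead of (or in addition to) copying the driver into
# /opt/spark/jars, point Spark at it when creating the session. The jar file
# name is an assumption; use whatever version you downloaded.
spark = (
    SparkSession.builder
    .appName("jdbc-demo")
    .config("spark.jars", "/opt/spark/jars/postgresql-42.2.5.jar")
    .getOrCreate()
)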

25.2. JDBC read

When Spark reads from a JDBC source in parallel, it splits the table into partitions based on a partition column, a lower bound, an upper bound and the number of partitions. The pseudocode below shows how the lower and upper bounds are turned into one WHERE clause per partition:

stride = (upper_bound / partitions_number) - (lower_bound / partitions_number)
current_value = lower_bound
partition_nr = 0
while (partition_nr < partitions_number)
  generate WHERE clause:
    partition_column IS NULL OR partition_column < current_value + stride
    if:
      partition_nr == 0
  or generate WHERE clause:
    partition_column >= current_value AND partition_column < current_value + stride
    if:
      0 < partition_nr < partitions_number - 1
  or generate WHERE clause:
    partition_column >= current_value
    if:
      partition_nr == partitions_number - 1
  current_value += stride
  partition_nr += 1



For example, with lower_bound = 0, upper_bound = 20 and partitions_number = 5:

stride = (20/5) - (0/5) = 4
SELECT * FROM my_table WHERE partition_column IS NULL OR partition_column < 4
SELECT * FROM my_table WHERE partition_column >= 4 AND partition_column < 8
SELECT * FROM my_table WHERE partition_column >= 8 AND partition_column < 12
SELECT * FROM my_table WHERE partition_column >= 12 AND partition_column < 16
SELECT * FROM my_table WHERE partition_column >= 16

As you can see, the above queries generate 5 partitions of data, containing the values (0-3), (4-7), (8-11), (12-15) and (16 and above), respectively.
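Putting this together, a partitioned JDBC read in PySpark looks roughly like the sketch below. The URL, credentials, table name and partition column are hypothetical placeholders; replace them with your own values.

# A minimal sketch of a partitioned JDBC read. Connection details, the table
# name and the partition column are hypothetical placeholders.
df = (
    spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://db_host:5432/my_db")
    .option("driver", "org.postgresql.Driver")
    .option("dbtable", "my_table")
    .option("user", "my_user")
    .option("password", "my_password")
    .option("partitionColumn", "partition_column")  # must be numeric, date or timestamp
    .option("lowerBound", 0)       # only controls the stride, it does not filter rows
    .option("upperBound", 20)
    .option("numPartitions", 5)    # number of partitions / parallel JDBC connections
    .load()
)

Note that lowerBound and upperBound are only used to compute the stride; rows outside that range are still read and end up in the first and last partitions, exactly as the generated WHERE clauses above show.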

25.3. JDBC write

TODO…
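A minimal sketch of a JDBC write, assuming the same driver and connection details as in the read example; the output table name and save mode are placeholders.

# A minimal sketch of a JDBC write; connection details and the table name are
# hypothetical placeholders.
(
    df.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://db_host:5432/my_db")
    .option("driver", "org.postgresql.Driver")
    .option("dbtable", "my_output_table")
    .option("user", "my_user")
    .option("password", "my_password")
    .mode("append")   # or "overwrite" if the target table should be replaced
    .save()
)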

25.4. JDBC temp_view

TODO…
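A minimal sketch of registering a JDBC DataFrame as a temporary view so it can be queried with Spark SQL; the view name is a hypothetical placeholder.

# A minimal sketch: register the JDBC DataFrame as a temporary view and query
# it with Spark SQL. The view name 'my_table_view' is a hypothetical placeholder.
df.createOrReplaceTempView("my_table_view")

result = spark.sql(
    "SELECT partition_column, COUNT(*) AS cnt "
    "FROM my_table_view GROUP BY partition_column"
)
result.show()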