22. PySpark Data Audit Library

PySparkAudit: PySpark Data Audit Library. A PDF version of this documentation is available for download. The pure-Python counterpart, PyAudit: Python Data Audit Library, has its own API documentation.

22.1. Install with pip

You can install PySparkAudit from PyPI (https://pypi.org/project/PySparkAudit):

pip install PySparkAudit

22.2. Install from Repo

22.2.1. Clone the Repository

git clone https://github.com/runawayhorse001/PySparkAudit.git

22.2.2. Install

cd PySparkAudit
pip install -r requirements.txt
python setup.py install

22.3. Uninstall

pip uninstall PySparkAudit

22.4. Test

22.4.1. Run test code

cd PySparkAudit/test
python test.py

test.py

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark regression example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()


# from PySparkAudit import dtypes_class, hist_plot, bar_plot, freq_items,feature_len
# from PySparkAudit import dataset_summary, rates, trend_plot

# path = '/home/feng/Desktop'

# import PySpark Audit function
from PySparkAudit import auditing

# load dataset
data = spark.read.csv(path='Heart.csv',
                      sep=',', encoding='UTF-8', comment=None, header=True, inferSchema=True)

# auditing in one function 
print(auditing(data, display=True))
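The single auditing call wraps the individual per-feature summaries. As a rough, library-independent sketch of the kind of per-column statistics it reports (count, missing rate, min/max/mean/std), the helper below mimics the content of the Numeric_summary sheet using only the standard library; the function name and result layout are illustrative, not the PySparkAudit API:

```python
import csv
import io
import math
import statistics

def numeric_summary(rows, column):
    # Illustrative per-column audit: count, missing rate, min/max/mean/std.
    # This mimics the kind of content in the Numeric_summary sheet; it is
    # NOT the PySparkAudit implementation.
    values, missing = [], 0
    for row in rows:
        cell = row.get(column)
        if cell in (None, ''):
            missing += 1
            continue
        try:
            values.append(float(cell))
        except ValueError:
            missing += 1
    n = len(rows)
    return {
        'feature': column,
        'count': len(values),
        'missing_rate': missing / n if n else math.nan,
        'min': min(values),
        'max': max(values),
        'mean': statistics.mean(values),
        'std': statistics.stdev(values) if len(values) > 1 else 0.0,
    }

# A tiny inline sample standing in for two columns of Heart.csv.
sample = "Age,Chol\n63,233\n67,\n41,204\n"
rows = list(csv.DictReader(io.StringIO(sample)))
print(numeric_summary(rows, 'Chol'))
```

PySparkAudit computes the same kind of statistics distributedly on the Spark DataFrame; this sketch only shows what the numbers mean.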

22.4.2. Audited Results

[Figure: audited results folder (_images/t_folder.png)]

The sheets in 00-audited_results.xlsx and the accompanying plot files:

  1. Dataset_summary
     [Figure: _images/t_excel1.png]
  2. Numeric_summary
     [Figure: _images/t_excel2.png]
  3. Category_summary
     [Figure: _images/t_excel3.png]
  4. Correlation_matrix
     [Figure: _images/t_excel4.png]
  5. Histograms in Histograms.pdf
     [Figure: _images/hists.png]
  6. Bar plots in Bar_plots.pdf
     [Figure: _images/bars.png]
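Behind the Category_summary sheet and the bar plots is a per-feature frequency count. A standard-library sketch of that counting step (illustrative only; the function name and layout are not the PySparkAudit API):

```python
from collections import Counter

def freq_table(rows, column, top=5):
    # Frequency counts for one categorical feature, most frequent first:
    # the kind of table that feeds a bar plot. Illustrative only.
    counts = Counter(
        row[column] for row in rows if row.get(column) not in (None, '')
    )
    return counts.most_common(top)

# Toy rows standing in for the ChestPain column of Heart.csv.
rows = [{'ChestPain': v} for v in
        ['typical', 'asymptomatic', 'asymptomatic',
         'nonanginal', 'asymptomatic']]
print(freq_table(rows, 'ChestPain'))
```

On a Spark DataFrame the same counts come from a distributed group-by; this sketch only shows the shape of the result.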

22.5. Auditing on Big Dataset

In this section, we demonstrate the auditing performance and audited results on a big data set: Spanish high-speed rail ticket pricing, available at https://www.kaggle.com/thegurus/spanish-high-speed-rail-system-ticket-pricing. This data set has 2,579,771 samples and 10 features.

From the CPU times below, you will see that most of the time is spent plotting the histograms. If your time or memory is limited, we suggest using sample_size to draw a subset of the data set for the histogram plots.

For example:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark regression example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()


# from PySparkAudit import dtypes_class, hist_plot, bar_plot, freq_items,feature_len
# from PySparkAudit import dataset_summary, rates, trend_plot

# Audited results output path
out_path = '/home/feng/Desktop'

# import PySpark Audit function
from PySparkAudit import auditing

# load dataset
# Spanish High Speed Rail tickets pricing - Renfe (~2.58M)
# https://www.kaggle.com/thegurus/spanish-high-speed-rail-system-ticket-pricing

data = spark.read.csv(path='/home/feng/Downloads/renfe.csv',
                      sep=',', encoding='UTF-8', comment=None, header=True, inferSchema=True)

# auditing in one function
auditing(data, output_dir=out_path, tracking=True)

Result:

22.5.2. Audited results folder

[Figure: audited results folder for the Renfe data set (_images/demo3_folder.png)]
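The sample_size option suggested above boils down to drawing a fixed-size random subset before plotting, so the histograms are built from a few thousand values rather than millions. A standard-library sketch of that idea (an illustrative helper, not part of PySparkAudit; on a distributed DataFrame, PySpark's own DataFrame.sample serves the same purpose):

```python
import random

def subset_for_plotting(values, sample_size, seed=42):
    # Draw at most sample_size values uniformly without replacement;
    # plotting the subset approximates the full histogram at a fraction
    # of the cost. Illustrative helper, not part of PySparkAudit.
    if len(values) <= sample_size:
        return list(values)
    rng = random.Random(seed)
    return rng.sample(values, sample_size)

population = range(2_579_771)  # same row count as the Renfe data set
subset = subset_for_plotting(population, 10_000)
print(len(subset))  # 10000
```

With 10,000 sampled values the histogram shape is close to the full data set's, while the plotting cost no longer grows with the number of rows.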