4. Auditing Demos

The following demos show how to use PySparkAudit to audit a Spark DataFrame.

4.1. Auditing function by function

If you just need a piece of the audit result, you can call the corresponding function to generate it. There are 9 basic auditing functions, 3 figure plot functions and 3 summary functions in the PySparkAudit library.

syntax

from PySparkAudit import *
  1. Basic Functions:

    1. data_types: PySparkAudit.data_types
    2. dtypes_class: PySparkAudit.dtypes_class
    3. counts: PySparkAudit.counts
    4. describe: PySparkAudit.describe
    5. percentiles: PySparkAudit.percentiles
    6. feature_len: PySparkAudit.feature_len
    7. freq_items: PySparkAudit.freq_items
    8. rates: PySparkAudit.rates
    9. corr_matrix: PySparkAudit.corr_matrix
  2. Plot Functions:

    1. hist_plot: PySparkAudit.hist_plot
    2. bar_plot: PySparkAudit.bar_plot
    3. trend_plot: PySparkAudit.trend_plot
  3. Summary Functions:

    1. dataset_summary: PySparkAudit.dataset_summary
    2. numeric_summary: PySparkAudit.numeric_summary
    3. category_summary: PySparkAudit.category_summary

For example:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark regression example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()


# import PySpark Audit functions
from PySparkAudit import data_types, hist_plot, bar_plot, freq_items, feature_len
from PySparkAudit import dataset_summary, rates
from PySparkAudit import trend_plot, auditing

# load dataset
data = spark.read.csv(path='Heart.csv',
                      sep=',', encoding='UTF-8', comment=None, header=True, inferSchema=True)

# audit function by function

# data types
print(data_types(data))

# check frequent items
print(freq_items(data))

# bar plot for categorical features
bar_plot(data, display=True)

Result:

      feature  dtypes
0         Age     int
1         Sex     int
2   ChestPain  string
3      RestBP     int
4        Chol     int
5         Fbs     int
6     RestECG     int
7       MaxHR     int
8       ExAng     int
9     Oldpeak  double
10      Slope     int
11         Ca  string
12       Thal  string
13        AHD  string
      feature                            freq_items[value, freq]
0         Age  [[58, 19], [57, 17], [54, 16], [59, 14], [52, ...
1         Sex                                [[1, 206], [0, 97]]
2   ChestPain  [[asymptomatic, 144], [nonanginal, 86], [nonty...
3      RestBP  [[120, 37], [130, 36], [140, 32], [110, 19], [...
4        Chol  [[197, 6], [234, 6], [204, 6], [254, 5], [212,...
5         Fbs                                [[0, 258], [1, 45]]
6     RestECG                       [[0, 151], [2, 148], [1, 4]]
7       MaxHR  [[162, 11], [163, 9], [160, 9], [152, 8], [132...
8       ExAng                                [[0, 204], [1, 99]]
9     Oldpeak  [[0.0, 99], [1.2, 17], [0.6, 14], [1.0, 14], [...
10      Slope                      [[1, 142], [2, 140], [3, 21]]
11         Ca     [[0, 176], [1, 65], [2, 38], [3, 20], [NA, 4]]
12       Thal  [[normal, 166], [reversable, 117], [fixed, 18]...
13        AHD                            [[No, 164], [Yes, 139]]
================================================================
The Bar plot Bar_plots.pdf was located at:
/home/feng/Dropbox/MyTutorial/PySparkAudit/test/Audited


and

_images/bars.png

4.2. Auditing in one function

For example:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark regression example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()


# from PySparkAudit import dtypes_class, hist_plot, bar_plot, freq_items,feature_len
# from PySparkAudit import dataset_summary, rates, trend_plot

# path = '/home/feng/Desktop'

# import PySpark Audit function
from PySparkAudit import auditing

# load dataset
data = spark.read.csv(path='Heart.csv',
                      sep=',', encoding='UTF-8', comment=None, header=True, inferSchema=True)

# auditing in one function 
print(auditing(data, display=True))

Result:

4.2.2. Audited results folder

_images/t_folder.png

The files in 00-audited_results.xlsx:

  1. Dataset_summary
_images/t_excel1.png
  2. Numeric_summary
_images/t_excel2.png
  3. Category_summary
_images/t_excel3.png
  4. Correlation_matrix
_images/t_excel4.png
  5. Histograms in Histograms.pdf
_images/hists.png
  6. Barplots in Bar_plots.pdf
_images/bars.png

4.3. Auditing on Big Dataset

In this section, we demonstrate the auditing performance and the audited results on a large data set: Spanish High Speed Rail ticket pricing, available at https://www.kaggle.com/thegurus/spanish-high-speed-rail-system-ticket-pricing. This data set has 2579771 samples and 10 features.

As the timings printed in section 4.3.1 show, most of the time was spent plotting the histograms. If your time and memory are limited, we suggest using sample_size to plot the histograms from a subset of the data set.
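The exact signature of the sample_size option is not shown in this section, so the sketch below takes a different route and subsamples the data with Spark's own DataFrame.sample before any plotting function is called. The helper names sample_fraction and subsample, the target of 100000 rows and the seed are assumptions chosen for illustration; they are not part of PySparkAudit.

```python
def sample_fraction(n_rows, target_rows):
    """Fraction to pass to DataFrame.sample so that roughly target_rows rows remain."""
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    # Never ask for more than the whole data set
    return min(1.0, target_rows / n_rows)


def subsample(df, target_rows, seed=42):
    """Return a random subset of a Spark DataFrame with roughly target_rows rows.

    df is expected to be a pyspark.sql.DataFrame (hypothetical helper, not a
    PySparkAudit function).
    """
    fraction = sample_fraction(df.count(), target_rows)
    # DataFrame.sample is approximate: the returned row count varies around the target
    return df.sample(fraction=fraction, seed=seed)


# The Renfe data set has 2579771 rows; keeping about 100000 of them
# needs a fraction of a little under 4%.
fraction = sample_fraction(2579771, 100000)
print(f"sampling fraction: {fraction:.4f}")
```

With a Spark session running, something like subsample(data, 100000) could then be passed to a plotting function in place of the full DataFrame. Because the sample is random, the histograms are only an approximation of the full distribution.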

For example:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark regression example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()


# from PySparkAudit import dtypes_class, hist_plot, bar_plot, freq_items,feature_len
# from PySparkAudit import dataset_summary, rates, trend_plot

# Audited results output path
out_path = '/home/feng/Desktop'

# import PySpark Audit function
from PySparkAudit import auditing

# load dataset
# Spanish High Speed Rail tickets pricing - Renfe (~2.58M)
# https://www.kaggle.com/thegurus/spanish-high-speed-rail-system-ticket-pricing

data = spark.read.csv(path='/home/feng/Downloads/renfe.csv',
                      sep=',', encoding='UTF-8', comment=None, header=True, inferSchema=True)

# auditing in one function
auditing(data, output_dir=out_path, tracking=True)

Result:

4.3.1. Print in bash

================================================================
The audited results summary audited_results.xlsx was located at:
/home/feng/Desktop/Audited
Generate data set summary took = 60.535009145736694 s
================================================================
Collecting data types.... Please be patient!
Generate counts took = 0.0016515254974365234 s
================================================================
Collecting features' counts.... Please be patient!
Generate counts took = 6.502962350845337 s
================================================================
Collecting data frame description.... Please be patient!
Generate data frame description took = 1.5562639236450195 s
================================================================
Calculating percentiles.... Please be patient!
Generate percentiles took = 19.76785445213318 s
================================================================
Calculating features' length.... Please be patient!
Generate features' length took = 4.953453540802002 s
================================================================
Calculating top 5 frequent items.... Please be patient!
Generate rates took: 4.761325359344482 s
================================================================
Calculating rates.... Please be patient!
Generate rates took: 17.201056718826294 s
Auditing numerical data took = 54.77840781211853 s
================================================================
Collecting data types.... Please be patient!
Generate counts took = 0.001623392105102539 s
================================================================
Collecting features' counts.... Please be patient!
Generate counts took = 12.59226107597351 s
================================================================
Calculating features' length.... Please be patient!
Generate features' length took = 5.332952976226807 s
================================================================
Calculating top 5 frequent items.... Please be patient!
Generate rates took: 6.832213878631592 s
================================================================
Calculating rates.... Please be patient!
Generate rates took: 23.704302072525024 s
Auditing categorical data took = 48.484763622283936 s
================================================================
The correlation matrix plot Corr.png was located at:
/home/feng/Desktop/Audited
Calculating correlation matrix... Please be patient!
Generate correlation matrix took = 19.61273431777954 s
================================================================
The Histograms plots *.png were located at:
/home/feng/Desktop/Audited/02-hist
Plotting histograms of _c0.... Please be patient!
Plotting histograms of price.... Please be patient!
Histograms plots are DONE!!!
Generate histograms plots took = 160.3421311378479 s
================================================================
The Bar plot Bar_plots.pdf was located at:
/home/feng/Desktop/Audited
Plotting barplot of origin.... Please be patient!
Plotting barplot of destination.... Please be patient!
Plotting barplot of train_type.... Please be patient!
Plotting barplot of train_class.... Please be patient!
Plotting barplot of fare.... Please be patient!
Plotting barplot of insert_date.... Please be patient!
Plotting barplot of start_date.... Please be patient!
Plotting barplot of end_date.... Please be patient!
Bar plots are DONE!!!
Generate bar plots took = 24.17994236946106 s
================================================================
The Trend plot Trend_plots.pdf was located at:
/home/feng/Desktop/Audited
Plotting trend plot of _c0.... Please be patient!
Plotting trend plot of price.... Please be patient!
Trend plots are DONE!!!
Generate trend plots took = 11.697550296783447 s
Generate all the figures took = 196.25823402404785 s
Generate all audited results took = 379.73954820632935 s
================================================================
The auditing processes are DONE!!!
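To check which step dominates the run time, the timing lines can be parsed directly. The following is a minimal sketch using only the Python standard library; it works on a subset of the lines copied from the log above, and the names LOG, TIMING and parse_timings are illustrative assumptions rather than anything provided by PySparkAudit.

```python
import re

# Timing lines copied verbatim from the log above (a subset, for brevity).
LOG = """\
Generate percentiles took = 19.76785445213318 s
Generate rates took: 17.201056718826294 s
Generate correlation matrix took = 19.61273431777954 s
Generate histograms plots took = 160.3421311378479 s
Generate bar plots took = 24.17994236946106 s
Generate trend plots took = 11.697550296783447 s
Generate all audited results took = 379.73954820632935 s
"""

# The log uses both "took = X s" and "took: X s", so accept either separator.
TIMING = re.compile(r"^Generate (.+?) took\s*[=:]\s*([0-9.]+) s$", re.MULTILINE)


def parse_timings(text):
    """Return a list of (step name, seconds) pairs found in the log text."""
    return [(name, float(seconds)) for name, seconds in TIMING.findall(text)]


timings = parse_timings(LOG)
total = dict(timings)["all audited results"]

# Compare individual steps only, leaving out the overall total.
steps = [(name, sec) for name, sec in timings if name != "all audited results"]
slowest_name, slowest_sec = max(steps, key=lambda item: item[1])
share = slowest_sec / total

print(f"slowest step: {slowest_name} ({slowest_sec:.1f} s, {share:.0%} of the total)")
```

On these lines the histogram step accounts for about 42% of the total run time, which is why subsampling before plotting gives the largest saving.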

4.3.2. Audited results folder

_images/demo3_folder.png