3. Python Data Audit Functions
3.1. Basic Functions
3.1.1. dtypes_class
PyAudit.basics.dtypes_class(df_in)
Return the names of the numerical, categorical and boolean columns in the DataFrame.
Parameters: df_in – input pandas DataFrame
Returns: numerical, categorical and bool name lists (the example below also unpacks data_types and type_class)
Author: Wenqiang Feng and Ming Chen
Email: von198@gmail.com

>>> import pandas as pd
>>> from PyAudit.basics import dtypes_class
>>> df = pd.read_csv('Heart.csv', dtype={'Sex': bool})
>>> (num_fields, cat_fields, bool_fields, data_types, type_class) = dtypes_class(df)
>>> num_fields
['Age', 'RestBP', 'Chol', 'Fbs', 'RestECG', 'MaxHR', 'ExAng', 'Oldpeak', 'Slope', 'Ca']
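The grouping that dtypes_class reports can be approximated with plain pandas. The following is a minimal sketch, not the PyAudit implementation: the helper name split_by_dtype and the small sample frame are hypothetical, and the rule that every non-numeric, non-boolean column counts as categorical is an assumption.

```python
import pandas as pd


def split_by_dtype(df):
    """Hypothetical helper: group column names by broad dtype class."""
    # select_dtypes(include="number") does not match bool columns,
    # so the numerical and boolean groups do not overlap.
    num_fields = df.select_dtypes(include="number").columns.tolist()
    bool_fields = df.select_dtypes(include="bool").columns.tolist()
    # Assumption: everything else (strings, objects) is categorical.
    known = set(num_fields) | set(bool_fields)
    cat_fields = [c for c in df.columns if c not in known]
    return num_fields, cat_fields, bool_fields


# Small made-up frame, loosely modeled on the Heart.csv column names.
df = pd.DataFrame({
    "Age": [63, 67, 41],
    "Sex": [True, False, True],
    "ChestPain": ["typical", "asymptomatic", "nontypical"],
})
num_fields, cat_fields, bool_fields = split_by_dtype(df)
print(num_fields, cat_fields, bool_fields)
```

Reading 'Sex' with dtype=bool, as the example above does, is what moves that column out of the numerical list.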
3.1.2. missing_rate
PyAudit.basics.missing_rate(df_in)
Calculate the missing rate for each feature in the DataFrame.
Parameters: df_in – input pandas DataFrame
Returns: missing rate
Author: Wenqiang Feng and Ming Chen
Email: von198@gmail.com

>>> import pandas as pd
>>> d = {'A': [1, 0, None, 3], 'B': [1, 0, 0, 0], 'C': ['a', None, 'c', 'd']}
>>> # create DataFrame
>>> df = pd.DataFrame(d)
>>> from PyAudit.basics import missing_rate
>>> missing_rate(df)
  feature  missing_rate
0       A          0.25
1       B          0.00
2       C          0.25
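The rates in the example above are the fraction of null entries per column. A minimal pandas sketch that reproduces those numbers follows; missing_rate_sketch is a hypothetical stand-in and is not taken from the PyAudit source.

```python
import pandas as pd


def missing_rate_sketch(df):
    """Hypothetical stand-in: fraction of null entries per column."""
    # isnull() marks None/NaN as True; the mean of booleans is the fraction.
    rate = df.isnull().mean()
    return rate.rename_axis("feature").reset_index(name="missing_rate")


df = pd.DataFrame({
    "A": [1, 0, None, 3],
    "B": [1, 0, 0, 0],
    "C": ["a", None, "c", "d"],
})
result = missing_rate_sketch(df)
print(result)
```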
3.1.3. zero_rate
PyAudit.basics.zero_rate(df_in)
Calculate the percentage of zero values for each feature in the DataFrame.
Parameters: df_in – input pandas DataFrame
Returns: zero rate
Author: Wenqiang Feng and Ming Chen
Email: von198@gmail.com

>>> import pandas as pd
>>> d = {'A': [1, 0, None, 3], 'B': [1, 0, 0, 0], 'C': ['a', None, 'c', 'd']}
>>> # create DataFrame
>>> df = pd.DataFrame(d)
>>> from PyAudit.basics import zero_rate
>>> zero_rate(df)
  feature  zero_rate
0       A   0.333333
1       B   0.750000
2       C   0.000000
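In the example above, column A reports 0.333333 rather than 0.25, which suggests the denominator is the number of non-null values (one zero among three non-null entries). The sketch below reproduces the sample output under that assumption; zero_rate_sketch is a hypothetical name, not the PyAudit code.

```python
import pandas as pd


def zero_rate_sketch(df):
    """Hypothetical stand-in: share of zeros among non-null values."""
    rows = []
    for col in df.columns:
        # Assumption: missing entries are excluded from the denominator.
        values = df[col].dropna().tolist()
        zeros = sum(1 for v in values if v == 0)
        rate = zeros / len(values) if values else float("nan")
        rows.append({"feature": col, "zero_rate": rate})
    return pd.DataFrame(rows)


df = pd.DataFrame({
    "A": [1, 0, None, 3],
    "B": [1, 0, 0, 0],
    "C": ["a", None, "c", "d"],
})
result = zero_rate_sketch(df)
print(result)
```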
3.1.4. feature_variance
PyAudit.basics.feature_variance(df_in)
Calculate the variance for each feature.
Parameters: df_in – input pandas DataFrame
Returns: feature variance
Author: Wenqiang Feng and Ming Chen
Email: von198@gmail.com

>>> import pandas as pd
>>> d = {'A': [1, 0, None, 3], 'B': [1, 0, 0, 0], 'C': ['a', None, 'c', 'd']}
>>> # create DataFrame
>>> df = pd.DataFrame(d)
>>> from PyAudit.basics import feature_variance
>>> feature_variance(df)
  feature  feature_variance
0       A               1.0
1       B               0.5
2       C               1.0
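The sample values above do not match the statistical variance (a string column has none, and the sample variance of A's non-null values is about 2.33). They are consistent with the ratio of distinct values to non-null values per column. The sketch below is written under that inferred reading; the ratio definition is an assumption drawn only from the example output, and distinct_ratio_sketch is a hypothetical name.

```python
import pandas as pd


def distinct_ratio_sketch(df):
    """Hypothetical stand-in: distinct values / non-null values per column."""
    # nunique() ignores nulls by default; count() counts non-null entries.
    ratio = df.nunique() / df.count()
    return ratio.rename_axis("feature").reset_index(name="feature_variance")


df = pd.DataFrame({
    "A": [1, 0, None, 3],
    "B": [1, 0, 0, 0],
    "C": ["a", None, "c", "d"],
})
result = distinct_ratio_sketch(df)
print(result)
```

Under this reading, a value close to 1.0 marks a column whose entries are almost all distinct, and a value close to 0 marks a nearly constant column.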
3.1.5. freq_items_df
PyAudit.basics.freq_items_df(df_in, top_n=3)
Find the top n values and their frequencies for each feature.
Parameters: - df_in – input pandas DataFrame
- top_n – the number of top values to return
Returns: top n values and the corresponding frequency for each feature
Author: Wenqiang Feng and Ming Chen
Email:

>>> import pandas as pd
>>> from PyAudit.basics import freq_items_df
>>> d = {
...     'num': list('1223334444'),
...     'cat': list('wxxyyyzzzz')
... }
>>> df = pd.DataFrame(d)
>>> df = df.astype({"num": int, "cat": object})
>>> print(freq_items_df(df, top_n=4))
  feature     top_items     top_freqs
0     num  [4, 3, 2, 1]  [4, 3, 2, 1]
1     cat  [z, y, x, w]  [4, 3, 2, 1]
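A table of this shape can be built from pandas value_counts, which sorts values by descending frequency. This is a minimal sketch with a hypothetical helper name, freq_items_sketch; how PyAudit orders tied values is not stated in the source, so the sample data avoids ties.

```python
import pandas as pd


def freq_items_sketch(df, top_n=3):
    """Hypothetical stand-in: top_n most frequent values per column."""
    rows = []
    for col in df.columns:
        # value_counts() returns counts sorted from most to least frequent.
        counts = df[col].value_counts().head(top_n)
        rows.append({
            "feature": col,
            "top_items": counts.index.tolist(),
            "top_freqs": counts.tolist(),
        })
    return pd.DataFrame(rows)


df = pd.DataFrame({
    "num": [1, 2, 2, 3, 3, 3, 4, 4, 4, 4],
    "cat": list("wxxyyyzzzz"),
})
result = freq_items_sketch(df, top_n=4)
print(result)
```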
3.1.6. feature_len
PyAudit.basics.feature_len(df_in)
Find the minimum and maximum length of the values in each feature.
Parameters: df_in – input pandas DataFrame
Returns: min and max length DataFrame
Author: Wenqiang Feng and Ming Chen
Email: von198@gmail.com

>>> import pandas as pd
>>> from PyAudit.basics import feature_len
>>> d = {'A': [1, 0, None, 3],
...      'B': [1, 0, 0, 0],
...      'C': ['a', None, 'c', 'd']}
>>> # create DataFrame
>>> df = pd.DataFrame(d)
>>> print(df)
     A  B     C
0  1.0  1     a
1  0.0  0  None
2  NaN  0     c
3  3.0  0     d
>>> print(feature_len(df))
  feature  min_length  max_length
0       A           3           3
1       B           1           1
2       C           1           4
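The lengths in the example above match the string form of each value: '1.0' and 'nan' both have three characters, and 'None' has four. The sketch below assumes that reading. feature_len_sketch is a hypothetical name, and the length reported for a missing entry depends on how the installed pandas version represents it.

```python
import pandas as pd


def feature_len_sketch(df):
    """Hypothetical stand-in: min/max length of str(value) per column."""
    rows = []
    for col in df.columns:
        # Assumption: missing entries are measured by their string form too.
        lengths = [len(str(v)) for v in df[col].tolist()]
        rows.append({
            "feature": col,
            "min_length": min(lengths),
            "max_length": max(lengths),
        })
    return pd.DataFrame(rows)


df = pd.DataFrame({
    "A": [1, 0, None, 3],
    "B": [1, 0, 0, 0],
    "C": ["a", None, "c", "d"],
})
result = feature_len_sketch(df)
print(result)
```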
3.1.7. correlation matrix
PyAudit.basics.corr_matrix(df_in, output_dir)
Generate the correlation matrix for the numerical columns of the DataFrame.
Parameters: - df_in – input pandas DataFrame
- output_dir – output path
Returns: correlation matrix
Author: Wenqiang Feng and Ming Chen
Email:

>>> import pandas as pd
>>> from PyAudit.basics import corr_matrix
>>> d = {'A': [1, 0, None, 3],
...      'B': [1, 0, 0, 0],
...      'C': ['a', None, 'c', 'd']}
>>> # create DataFrame
>>> df = pd.DataFrame(d)
>>> output_dir = 'output'  # placeholder; any writable output path
>>> print(corr_matrix(df, output_dir))
          A         B
A  1.000000 -0.188982
B -0.188982  1.000000
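The matrix in the example above agrees with the pairwise Pearson correlation that pandas computes over the numeric columns, where rows with a missing value are dropped for each pair. A minimal sketch follows. It does not write anything to an output path, and corr_sketch is a hypothetical name.

```python
import pandas as pd


def corr_sketch(df):
    """Hypothetical stand-in: Pearson correlation of numeric columns."""
    # Non-numeric columns such as 'C' are dropped before correlating.
    return df.select_dtypes(include="number").corr()


df = pd.DataFrame({
    "A": [1, 0, None, 3],
    "B": [1, 0, 0, 0],
    "C": ["a", None, "c", "d"],
})
corr = corr_sketch(df)
print(corr)
```

For A and B only the three rows where A is present are used, which gives -3/sqrt(252), approximately -0.188982.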
3.2. Summary Functions
3.2.1. numeric_summary
PyAudit.basics.numeric_summary(df_in, output_dir, top_n=4, deciles=False)
Generate a statistical summary for the numerical columns of the DataFrame.
Parameters: - df_in – input pandas DataFrame
- output_dir – output files directory
- top_n – the number of top items to show
- deciles – flag for percentiles style
Returns: statistical summary for numerical data
Author: Wenqiang Feng and Ming Chen
Email:

>>> import pandas as pd
>>> from PyAudit.basics import numeric_summary
>>> d = {'A': [1, 0, None, 3],
...      'B': [1, 0, 0, 0],
...      'C': ['a', None, 'c', 'd']}
>>> # create DataFrame
>>> df = pd.DataFrame(d)
>>> output_dir = 'output'  # placeholder; any writable output directory
>>> print(numeric_summary(df, output_dir))
  feature data_type  min_digits  ...  zero_rate  pos_rate  neg_rate
A       A   float64           3  ...   0.333333  0.666667       0.0
B       B     int64           3  ...   0.750000  0.250000       0.0
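The last three columns of the summary above (zero_rate, pos_rate, neg_rate) can be reproduced with elementwise comparisons on the numeric columns. This sketch covers only those three columns and assumes, based on the sample values, that rates are taken over non-null entries. sign_rates_sketch is a hypothetical name.

```python
import pandas as pd


def sign_rates_sketch(df):
    """Hypothetical stand-in: share of zero, positive and negative values."""
    num = df.select_dtypes(include="number")
    # Assumption: the denominator is the non-null count per column.
    n = num.count()
    # Comparisons against NaN evaluate to False, so nulls are never counted.
    return pd.DataFrame({
        "zero_rate": (num == 0).sum() / n,
        "pos_rate": (num > 0).sum() / n,
        "neg_rate": (num < 0).sum() / n,
    })


df = pd.DataFrame({
    "A": [1, 0, None, 3],
    "B": [1, 0, 0, 0],
    "C": ["a", None, "c", "d"],
})
rates = sign_rates_sketch(df)
print(rates)
```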
3.2.2. category_summary
PyAudit.basics.category_summary(df_in, output_dir, top_n=4)
Generate a statistical summary for the categorical columns of the DataFrame.
Parameters: - df_in – input pandas DataFrame
- output_dir – output files directory
- top_n – the number of top items to show
Returns: statistical summary for categorical data
Author: Wenqiang Feng and Ming Chen
Email:

>>> import pandas as pd
>>> from PyAudit.basics import category_summary
>>> d = {'A': [1, 0, None, 3],
...      'B': [1, 0, 0, 0],
...      'C': ['a', None, 'c', 'd']}
>>> # create DataFrame
>>> df = pd.DataFrame(d)
>>> output_dir = 'output'  # placeholder; any writable output directory
>>> print(category_summary(df, output_dir))
  feature data_type  min_digits  ... top_values  top_freqs  missing_rate
C       C    object           1  ...  [a, d, c]  [1, 1, 1]          0.25
3.3. Auditing Function
3.3.1. auditing
PyAudit.basics.auditing(df_in, output_dir, top_n=4, deciles=False)
Generate the audited results.
Parameters: - df_in – input pandas DataFrame
- output_dir – output files directory
- top_n – the number of top items to show
- deciles – flag for percentiles style
Author: Wenqiang Feng and Ming Chen
Email:

>>> import pandas as pd
>>> from PyAudit.basics import auditing
>>> d = {'A': [1, 0, None, 3],
...      'B': [1, 0, 0, 0],
...      'C': ['a', None, 'c', 'd']}
>>> # create DataFrame
>>> df = pd.DataFrame(d)
>>> path = 'output'  # placeholder; any writable output directory
>>> print(auditing(df, path))
  feature data_type  min_digits  ...  zero_rate  pos_rate  neg_rate
A       A   float64           3  ...   0.333333  0.666667       0.0
B       B     int64           3  ...   0.750000  0.250000       0.0