3. Python Data Audit Functions
3.1. Basic Functions
3.1.1. dtypes_class
PyAudit.basics.dtypes_class(df_in)
Return the names of the numerical, categorical and bool columns in the DataFrame.

Parameters:
- df_in – input pandas DataFrame
Returns:
- numerical, categorical and bool name lists (the example below also unpacks data_types and type_class from the return value)
Author: Wenqiang Feng and Ming Chen
Email: von198@gmail.com

>>> import pandas as pd
>>> from PyAudit.basics import dtypes_class
>>> df = pd.read_csv('Heart.csv', dtype={'Sex': bool})
>>> (num_fields, cat_fields, bool_fields, data_types, type_class) = dtypes_class(df)
>>> num_fields
['Age', 'RestBP', 'Chol', 'Fbs', 'RestECG', 'MaxHR', 'ExAng', 'Oldpeak', 'Slope', 'Ca']
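For readers without PyAudit or Heart.csv at hand, the dtype grouping can be approximated in plain pandas. This is a minimal sketch, not PyAudit's implementation: the three-row frame is a made-up stand-in, and treating every non-numeric, non-bool column as categorical is an assumption.

```python
import pandas as pd

# Hypothetical stand-in data; Heart.csv itself is not assumed to be available.
df = pd.DataFrame({
    "Age": [63, 67, 41],
    "Sex": [True, True, False],
    "ChestPain": ["typical", "asymptomatic", "nontypical"],
})

# Numeric dtypes (ints and floats); bool is not part of "number".
num_fields = df.select_dtypes(include="number").columns.tolist()
# Boolean columns.
bool_fields = df.select_dtypes(include="bool").columns.tolist()
# Assumption: everything that is neither numeric nor bool counts as categorical.
cat_fields = df.select_dtypes(exclude=["number", "bool"]).columns.tolist()

print(num_fields, cat_fields, bool_fields)
```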
3.1.2. missing_rate
PyAudit.basics.missing_rate(df_in)
Calculate the missing rate for each feature in the DataFrame.

Parameters:
- df_in – input pandas DataFrame
Returns:
- missing rate
Author: Wenqiang Feng and Ming Chen
Email: von198@gmail.com

>>> import pandas as pd
>>> d = {'A': [1, 0, None, 3], 'B': [1, 0, 0, 0], 'C': ['a', None, 'c', 'd']}
>>> # create DataFrame
>>> df = pd.DataFrame(d)
>>> from PyAudit.basics import missing_rate
>>> missing_rate(df)
  feature  missing_rate
0       A          0.25
1       B          0.00
2       C          0.25
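The figures in the example above match the fraction of null cells per column. A minimal plain-pandas sketch of that computation follows; it is an illustration of the idea, not PyAudit's source, and the variable name `rates` is chosen here.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 0, None, 3],
                   "B": [1, 0, 0, 0],
                   "C": ["a", None, "c", "d"]})

# isnull() marks missing cells; the mean of that boolean mask per column
# is the fraction of missing values.
rates = (df.isnull().mean()
           .rename_axis("feature")
           .reset_index(name="missing_rate"))

print(rates)
```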
3.1.3. zero_rate
PyAudit.basics.zero_rate(df_in)
Calculate the proportion of 0 values for each feature in the DataFrame.

Parameters:
- df_in – input pandas DataFrame
Returns:
- zero rate
Author: Wenqiang Feng and Ming Chen
Email: von198@gmail.com

>>> import pandas as pd
>>> d = {'A': [1, 0, None, 3], 'B': [1, 0, 0, 0], 'C': ['a', None, 'c', 'd']}
>>> # create DataFrame
>>> df = pd.DataFrame(d)
>>> from PyAudit.basics import zero_rate
>>> zero_rate(df)
  feature  zero_rate
0       A   0.333333
1       B   0.750000
2       C   0.000000
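In the example above, column A yields 0.333333 (one zero among three non-missing values), which suggests the denominator excludes missing values. The sketch below reproduces the numbers under that assumption; it is not taken from PyAudit's source.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 0, None, 3],
                   "B": [1, 0, 0, 0],
                   "C": ["a", None, "c", "d"]})

# Count cells equal to 0 in each column (strings and missing values never match).
zero_counts = df.apply(lambda col: sum(v == 0 for v in col))

# Assumption: divide by the number of non-missing values, as the A column suggests.
zero_rates = zero_counts / df.count()

print(zero_rates)
```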
3.1.4. feature_variance
PyAudit.basics.feature_variance(df_in)
Calculate the variance for each feature.

Parameters:
- df_in – input pandas DataFrame
Returns:
- feature variance
Author: Wenqiang Feng and Ming Chen
Email: von198@gmail.com

>>> import pandas as pd
>>> d = {'A': [1, 0, None, 3], 'B': [1, 0, 0, 0], 'C': ['a', None, 'c', 'd']}
>>> # create DataFrame
>>> df = pd.DataFrame(d)
>>> from PyAudit.basics import feature_variance
>>> feature_variance(df)
  feature  feature_variance
0       A               1.0
1       B               0.5
2       C               1.0
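The values in the example output (1.0, 0.5, 1.0, including one for the string column C) are not statistical variances. They are consistent with the ratio of distinct non-missing values to the non-missing count. The sketch below reproduces them under that reading, which is inferred from the output and not confirmed against PyAudit's source.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 0, None, 3],
                   "B": [1, 0, 0, 0],
                   "C": ["a", None, "c", "d"]})

# Assumed definition: distinct non-missing values / non-missing values.
# nunique() and count() both ignore missing cells by default.
variability = df.nunique() / df.count()

print(variability)
```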
3.1.5. freq_items_df
PyAudit.basics.freq_items_df(df_in, top_n=3)
Find the top n values and their frequencies for each feature.

Parameters:
- df_in – input pandas DataFrame
- top_n – the number of top values to return
Returns:
- top n values and the corresponding frequency for each feature
Author: Wenqiang Feng and Ming Chen
Email:

>>> import pandas as pd
>>> from PyAudit.basics import freq_items_df
>>> d = {
...     'num': list('1223334444'),
...     'cat': list('wxxyyyzzzz')
... }
>>> df = pd.DataFrame(d)
>>> df = df.astype({"num": int, "cat": object})
>>> print(freq_items_df(df, top_n=4))
  feature     top_items     top_freqs
0     num  [4, 3, 2, 1]  [4, 3, 2, 1]
1     cat  [z, y, x, w]  [4, 3, 2, 1]
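The same table can be approximated with `Series.value_counts`, which sorts by descending frequency. This is a minimal sketch of the technique and not PyAudit's code; the helper name `top_items_sketch` is hypothetical.

```python
import pandas as pd


def top_items_sketch(df_in, top_n=3):
    """Hypothetical helper: top_n most frequent values per column."""
    rows = []
    for col in df_in.columns:
        # value_counts() orders values from most to least frequent.
        counts = df_in[col].value_counts().head(top_n)
        rows.append({"feature": col,
                     "top_items": counts.index.tolist(),
                     "top_freqs": counts.tolist()})
    return pd.DataFrame(rows)


df = pd.DataFrame({"num": [int(c) for c in "1223334444"],
                   "cat": list("wxxyyyzzzz")})
result = top_items_sketch(df, top_n=4)
print(result)
```

Ties between equally frequent values are not ordered in any guaranteed way by this sketch; the sample data avoids ties on purpose.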
3.1.6. feature_len
PyAudit.basics.feature_len(df_in)
Find the minimum and maximum length of the values in each feature.

Parameters:
- df_in – input pandas DataFrame
Returns:
- min and max length DataFrame
Author: Wenqiang Feng and Ming Chen
Email: von198@gmail.com

>>> import pandas as pd
>>> from PyAudit.basics import feature_len
>>> d = {'A': [1, 0, None, 3],
...      'B': [1, 0, 0, 0],
...      'C': ['a', None, 'c', 'd']}
>>> # create DataFrame
>>> df = pd.DataFrame(d)
>>> print(df)
     A  B     C
0  1.0  1     a
1  0.0  0  None
2  NaN  0     c
3  3.0  0     d
>>> print(feature_len(df))
  feature  min_length  max_length
0       A           3           3
1       B           1           1
2       C           1           4
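The lengths in the example above are consistent with measuring the string form of every cell, missing ones included: `NaN` counts as 3 characters and `None` as 4. The sketch below reproduces the table under that assumption. Column C is built with an explicit object dtype so that `None` is kept as-is; that choice is made for this sketch and is not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 0, None, 3],
    "B": [1, 0, 0, 0],
    # Explicit object dtype keeps None as None (assumption of this sketch).
    "C": pd.Series(["a", None, "c", "d"], dtype=object),
})

rows = []
for col in df.columns:
    # Length of str(value) for every cell, including missing ones.
    lengths = df[col].map(lambda v: len(str(v)))
    rows.append({"feature": col,
                 "min_length": int(lengths.min()),
                 "max_length": int(lengths.max())})

length_table = pd.DataFrame(rows)
print(length_table)
```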
3.1.7. corr_matrix
PyAudit.basics.corr_matrix(df_in, output_dir)
Generate the correlation matrix for the numerical features of the DataFrame.

Parameters:
- df_in – input pandas DataFrame
- output_dir – output path
Returns:
- correlation matrix
Author: Wenqiang Feng and Ming Chen
Email:

>>> import pandas as pd
>>> from PyAudit.basics import corr_matrix
>>> d = {'A': [1, 0, None, 3],
...      'B': [1, 0, 0, 0],
...      'C': ['a', None, 'c', 'd']}
>>> # create DataFrame
>>> df = pd.DataFrame(d)
>>> output_dir = 'output'  # placeholder output directory
>>> print(corr_matrix(df, output_dir))
          A         B
A  1.000000 -0.188982
B -0.188982  1.000000
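The matrix in the example above matches the pairwise Pearson correlation of the numeric columns, with the row containing a missing A value left out of the A-B pair. A minimal plain-pandas sketch follows; it does not write anything to an output directory, so it covers only the returned matrix.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 0, None, 3],
                   "B": [1, 0, 0, 0],
                   "C": ["a", None, "c", "d"]})

# Keep numeric columns only, then compute pairwise Pearson correlation.
# DataFrame.corr() excludes missing values pair by pair.
corr = df.select_dtypes(include="number").corr()

print(corr)
```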
3.2. Summary Functions
3.2.1. numeric_summary
PyAudit.basics.numeric_summary(df_in, output_dir, top_n=4, deciles=False)
Generate a statistical summary for the numerical features of the DataFrame.

Parameters:
- df_in – input pandas DataFrame
- output_dir – output files directory
- top_n – the number of top items to show
- deciles – flag for percentiles style
Returns:
- statistical summary for numerical data
Author: Wenqiang Feng and Ming Chen
Email:

>>> import pandas as pd
>>> from PyAudit.basics import numeric_summary
>>> d = {'A': [1, 0, None, 3],
...      'B': [1, 0, 0, 0],
...      'C': ['a', None, 'c', 'd']}
>>> # create DataFrame
>>> df = pd.DataFrame(d)
>>> output_dir = 'output'  # placeholder output directory
>>> print(numeric_summary(df, output_dir))
  feature data_type  min_digits  ...  zero_rate  pos_rate  neg_rate
A       A   float64           3  ...   0.333333  0.666667       0.0
B       B     int64           3  ...   0.750000  0.250000       0.0
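Two of the summary columns shown above, pos_rate and neg_rate, can be reproduced as the share of positive and negative values among non-missing cells. The sketch below also shows one possible reading of the deciles flag, namely reporting the 10%, 20%, ..., 90% quantiles; that interpretation is an assumption, since the source describes the flag only as "percentiles style".

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 0, None, 3],
                   "B": [1, 0, 0, 0],
                   "C": ["a", None, "c", "d"]})

num = df.select_dtypes(include="number")
non_missing = num.count()

# Comparisons against NaN are False, so missing cells count as neither sign.
sign_rates = pd.DataFrame({
    "pos_rate": (num > 0).sum() / non_missing,
    "neg_rate": (num < 0).sum() / non_missing,
})

# Assumed meaning of deciles=True: quantiles at 0.1, 0.2, ..., 0.9.
decile_levels = [i / 10 for i in range(1, 10)]
decile_table = num.quantile(decile_levels)

print(sign_rates)
print(decile_table)
```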
3.2.2. category_summary
PyAudit.basics.category_summary(df_in, output_dir, top_n=4)
Generate a statistical summary for the categorical features of the DataFrame.

Parameters:
- df_in – input pandas DataFrame
- output_dir – output files directory
- top_n – the number of top items to show
Returns:
- statistical summary for categorical data
Author: Wenqiang Feng and Ming Chen
Email:

>>> import pandas as pd
>>> from PyAudit.basics import category_summary
>>> d = {'A': [1, 0, None, 3],
...      'B': [1, 0, 0, 0],
...      'C': ['a', None, 'c', 'd']}
>>> # create DataFrame
>>> df = pd.DataFrame(d)
>>> output_dir = 'output'  # placeholder output directory
>>> print(category_summary(df, output_dir))
  feature data_type  min_digits  ...  top_values  top_freqs  missing_rate
C       C    object           1  ...   [a, d, c]  [1, 1, 1]          0.25
3.3. Auditing Function
3.3.1. auditing
PyAudit.basics.auditing(df_in, output_dir, top_n=4, deciles=False)
Generate the audited results.

Parameters:
- df_in – input pandas DataFrame
- output_dir – output files directory
- top_n – the number of top items to show
- deciles – flag for percentiles style
Author: Wenqiang Feng and Ming Chen
Email:

>>> import pandas as pd
>>> from PyAudit.basics import auditing
>>> d = {'A': [1, 0, None, 3],
...      'B': [1, 0, 0, 0],
...      'C': ['a', None, 'c', 'd']}
>>> # create DataFrame
>>> df = pd.DataFrame(d)
>>> path = 'output'  # placeholder output directory
>>> print(auditing(df, path))
  feature data_type  min_digits  ...  zero_rate  pos_rate  neg_rate
A       A   float64           3  ...   0.333333  0.666667       0.0
B       B     int64           3  ...   0.750000  0.250000       0.0