7. Data Read and Ingestion with DataBase

7.1. Data Ingestion from Local to DataBase

# User Information
try:
    login = pd.read_csv(r'login.txt', header=None)
    user = login[0][0]
    pw = login[0][1]
    print('User information is ready!')
except:
    print('Login information is not available!!!')

# Database information
host = '##.###.###.##'
db_name = 'db_name'
table_name = 'table_name'

# Setup connection
conn = psycopg2.connect(host=host, database=db_name, user=user, password=pw)
cur = conn.cursor()

# Creat table in DataBase
conn.commit()
query = """
    DROP TABLE IF EXISTS {table_name};
    CREATE TABLE {table_name}
    (   id character varying(20)
      , val1 double precision
      , val2 double precision
      , val3 double precision
      , val4 text
    )
    DISTRIBUTED BY (id);
    GRANT SELECT ON TABLE {table_name} TO xxxx;
    """.format(table_name=table_name)
cur.execute(query)
conn.commit()


# load the data
df = pd.read_csv('xx.csv')

# Write dataframe to memory as csv
csv_io = io.StringIO()
df.to_csv(csv_io, sep='\t', header=True, index=False)
csv_io.seek(0)

# Copy the dataframe in memory to GP
conn.commit()
copy_sql = """
           COPY {table_name} FROM stdin WITH CSV HEADER
           DELIMITER as '\t'
           """.format(table_name=table_name)
cur.copy_expert(sql=copy_sql, file=csv_io)
conn.commit()

Note

You can also use copy_to to copy the dataframe from local memory to GP

cur.copy_to(df, table_name)

7.2. Data Read from DataBase to Local

# User information
try:
    login = pd.read_csv(r'login.txt', header=None)
    user = login[0][0]
    pw = login[0][1]
    print('User information is ready!')
except:
    print('Login information is not available!!!')

# Database information
host = '##.###.###.##'
db_name = 'db_name'
table_name = 'table_name'

# Setup connection
conn = psycopg2.connect(host=host, database=db_name, user=user, password=pw)
cur = conn.cursor()

# Read table
sql = """
      select *
      from {table_name}
      """.format(table_name=table_name)
dp = pd.read_sql(sql, conn)

7.3. Connect to various DataBases (pyodbc)

One open source python library pyodbc makes accessing ODBC databases simple. For example, it can connect with Google BigQuery, Hive from Ubuntu / Debian, Microsoft Excel, Microsoft SQL Server etc.

# set up DSN (database source name) connection
import pyodbc
conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER={server};DATABASE={db};UID={user};PWD={password}')
query_string = "SQL QUERY"

import pandas as pd
df = pd.read_sql(query_string, conn)

7.4. Hive and Impala Table Ingestion

_images/ali.png