Easiest Way to Connect Hive with Python: PyHive

In this article, we share a friendly way to connect Python to Hive, letting you query and analyze large datasets with simplicity and speed.

Connecting Python to Apache Hive is streamlined by PyHive, which enables easy interaction and SQL execution on large Hadoop datasets. It offers a simple, efficient approach to big data exploration, combining Python's versatility with Hive's robust data warehousing.

A quick introduction to Hive and PyHive

Hive

Apache Hive is a data warehousing solution built on top of Hadoop, designed for data query, analysis, and summarization. It makes large, multi-structured datasets analyzable, enabling users to execute SQL-like queries (HiveQL) on data stored in Hadoop.

PyHive

PyHive is a Python library developed to facilitate the interaction between Python applications and Hive, a data warehousing solution. Additionally, PyHive can also be used to interact with Presto, a distributed SQL query engine for big data.
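
Under the hood, both integrations follow the Python DB-API: you call connect(), grab a cursor, execute a query, and fetch the results. As a minimal sketch (the hostnames and ports below are placeholders for your own cluster), switching between Hive and Presto mostly means changing the connect call:

from pyhive import hive, presto

# Placeholders: point these at your own HiveServer2 host and Presto coordinator
hive_conn = hive.connect(host='hive-server.example.com', port=10000)
presto_conn = presto.connect(host='presto-coordinator.example.com', port=8080)

# Both objects expose the usual DB-API surface: cursor(), execute(), fetchall(), ...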

Step-by-step guide to accessing Apache Hive via Python

Python, with the PyHive library, allows developers to interact with Apache Hive programmatically.

Installing Required Libraries

Before we can connect to Hive using Python, we need to install the necessary libraries. The pyhive library can be installed using pip as follows:

pip install pyhive
# pip install pyhive
Collecting pyhive
  Downloading https:///packages/PyHive-0.7.0.tar.gz (46kB)
    100% |################################| 51kB 123kB/s 
Collecting future (from pyhive)
  Downloading https:///packages/future-0.18.3.tar.gz (840kB)
    100% |################################| 849kB 94kB/s 
Collecting python-dateutil (from pyhive)
  Downloading https:///packages/python_dateutil-2.8.2-py2.py3-none-any.whl (247kB)
    100% |################################| 256kB 2.6MB/s 
Collecting six>=1.5 (from python-dateutil->pyhive)
  Downloading https:///packages/six-1.16.0-py2.py3-none-any.whl
Installing collected packages: future, six, python-dateutil, pyhive
  Running setup.py install for future ... done
  Running setup.py install for pyhive ... done
Successfully installed future-0.18.3 pyhive-0.7.0 python-dateutil-2.8.2 six-1.16.0
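
By the way, PyHive also defines a setuptools "hive" extra that is meant to pull in the Hive-specific dependencies (thrift, sasl, thrift-sasl) in one step; whether it works out of the box depends on your PyHive version and on the system libraries described later in this article:

pip install 'pyhive[hive]'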

Additionally, you need to install thrift, sasl, and thrift-sasl; otherwise you will see the error below:

>>> from pyhive import hive
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/site-packages/pyhive/hive.py", line 18, in <module>
    from TCLIService import TCLIService
  File "/usr/lib/python2.7/site-packages/TCLIService/TCLIService.py", line 9, in <module>
    from thrift.Thrift import TType, TMessageType, TFrozenDict, TException, TApplicationException
ImportError: No module named thrift.Thrift

Install them as follows:

pip install thrift sasl thrift-sasl
# pip install thrift sasl thrift-sasl
Requirement already satisfied (use --upgrade to upgrade): thrift in /usr/lib/python2.7/site-packages
Collecting sasl
  Cache entry deserialization failed, entry ignored
  Using cached 
Collecting thrift-sasl
  Cache entry deserialization failed, entry ignored
  Using cached 
Requirement already satisfied (use --upgrade to upgrade): six>=1.7.2 in /usr/lib/python2.7/site-packages (from thrift)
Collecting pure-sasl>=0.6.2 (from thrift-sasl)
  Cache entry deserialization failed, entry ignored
  Using cached 
Installing collected packages: sasl, pure-sasl, thrift-sasl
  Running setup.py install for sasl ... done
Successfully installed pure-sasl-0.6.2 sasl-0.3.1 thrift-sasl-0.4.3
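
With those packages in place, a quick one-liner (run with whichever Python interpreter you installed the packages into) confirms that the import now succeeds; if it exits silently, the thrift and SASL dependencies are wired up correctly:

# python -c "from pyhive import hive"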

Quick note: you might run into one of the errors below while installing the above packages.

ERROR 1:

    gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -Isasl -I/usr/local/include/python3.6m -c sasl/saslwrapper.cpp -o build/temp.linux-x86_64-3.6/sasl/saslwrapper.o
    gcc: error trying to exec 'cc1plus': execvp: No such file or directory
    error: command 'gcc' failed with exit status 1
    

ERROR 2:

    In file included from sasl/saslwrapper.cpp:629:0:
    sasl/saslwrapper.h:22:23: fatal error: sasl/sasl.h: No such file or directory
     #include <sasl/sasl.h>
                           ^
    compilation terminated.
    error: command 'gcc' failed with exit status 1

Resolution:

To resolve this, we need to install the necessary system libraries (I am using RHEL; if you are on a different Linux distribution, please use the equivalent packages):

# yum install gcc-c++ python-devel.x86_64 cyrus-sasl-devel.x86_64
Installed:
  cyrus-sasl-devel.x86_64 0:2.1.26-24.el7_9                                                                              

Dependency Installed:
  cyrus-sasl.x86_64 0:2.1.26-24.el7_9                                                                                    

Updated:
  python-devel.x86_64 0:2.7.5-93.el7_9                                                                                   

Dependency Updated:
  python.x86_64 0:2.7.5-93.el7_9                           python-libs.x86_64 0:2.7.5-93.el7_9        

Connecting to Hive

After installing the necessary libraries, you can create a connection to your Hive server. Here’s a basic example of connecting to Hive, executing a query, and fetching the results:

Prerequisites:

— A working Hadoop cluster with the HiveServer2 and Hive Metastore services running

— Create a sample Hive table

create table sample (name string, learning string);
insert into table sample values('Rich','python');
insert into table sample values('JJ','java');

Code:

from pyhive import hive

host_name = "your_host_name"
port = 10000

conn = None
cursor = None

try:
    # Open a connection to HiveServer2 and create a DB-API cursor
    conn = hive.Connection(host=host_name, port=port, username='your_username', database='default')
    cursor = conn.cursor()

    # Run a query and fetch the results
    cursor.execute('SELECT * FROM sample LIMIT 10')
    for result in cursor.fetchall():
        print(result)

except Exception as e:
    print(f"An error occurred: {e}")
finally:
    # Close the cursor and connection only if they were actually opened
    try:
        if cursor is not None:
            cursor.close()
        if conn is not None:
            conn.close()
    except Exception as e:
        print(f"Error closing connection: {e}")

Replace the placeholders like your_host_name and your_username with the actual values specific to your setup.

Output:

# python
Python 2.7.5  
[GCC 4.8.5 34234 (Red Hat 5.8.5-44)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyhive import hive
>>> conn = hive.Connection(host='<hostname>',port='10000',username='test_user', database='default')
>>> cursor = conn.cursor()
>>> cursor.execute('SELECT * FROM sample LIMIT 10')
>>> for result in cursor.fetchall():
...     print(result)
... 
(u'Rich', u'python')
(u'JJ', u'java')
>>> 
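
PyHive also supports DB-API parameter binding (the pyformat style), which is safer than concatenating values into HQL strings. The sketch below reuses the cursor from the earlier example; the parameter name lang is chosen purely for illustration:

# Reusing the cursor from the example above. %(lang)s is a named
# pyformat parameter, bound from the dictionary passed to execute().
cursor.execute(
    'SELECT * FROM sample WHERE learning = %(lang)s LIMIT 10',
    {'lang': 'python'}
)
print(cursor.fetchone())   # e.g. ('Rich', 'python')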

Handling Exceptions

In the above example, error handling is managed with try/except blocks to catch any exceptions that might occur during the connection, execution of the query, or closing of the connection to the Hive server. Proper error handling is crucial for diagnosing issues and ensuring that resources like connections and cursors are always properly closed, even when an error occurs.

except Exception as e:
    print(f"An error occurred: {e}")
finally:
    try:
        cursor.close()
        conn.close()
    except Exception as e:
        print(f"Error closing connection: {e}")

Leveraging SQLAlchemy Support

PyHive also offers SQLAlchemy support, which is useful for working with high-level Object Relational Mappers (ORMs) such as the SQLAlchemy ORM, or for loading query results directly into Pandas DataFrames. This is particularly handy for complex queries and data manipulations, allowing a more Pythonic interaction with your data.
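
As a small sketch (the connection string values are placeholders for your own host, user, and database), PyHive registers a hive:// dialect with SQLAlchemy, and Pandas can read query results straight through the resulting engine:

import pandas as pd
from sqlalchemy import create_engine

# hive://<username>@<host>:<port>/<database> -- placeholder values below
engine = create_engine('hive://your_username@your_host_name:10000/default')

# Load the query result directly into a DataFrame
df = pd.read_sql('SELECT * FROM sample LIMIT 10', engine)
print(df.head())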

Conclusion

Interacting with Apache Hive through Python is made hassle-free by the pyhive library. This library allows developers to connect to Hive, execute queries, and fetch results programmatically. Always remember to handle exceptions properly to maintain the robustness of your Python applications when interacting with external systems like Hive.

By following the guidelines mentioned in this article, developers can harness the power of Python to interact with and leverage Apache Hive efficiently in their projects.

Good Luck with your Learning !!

Jerry Richard R