Data Types

Column Types

Timestamps

Timestamps are interpreted to be timezoneless and stored as an offset from the UNIX epoch.

Supported conversions:

  • Integer numeric types: Interpreted as UNIX timestamp in seconds
  • Floating point numeric types: Interpreted as UNIX timestamp in seconds with decimal precision
  • Strings: JDBC compliant java.sql.Timestamp format YYYY-MM-DD HH:MM:SS.fffffffff (9 decimal place precision)
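The conversions above can be exercised with simple casts (the literal values here are illustrative):

```sql
SELECT CAST(1500000000 AS TIMESTAMP);                        -- integer: UNIX timestamp in seconds
SELECT CAST(1500000000.345 AS TIMESTAMP);                    -- float: seconds with decimal precision
SELECT CAST('2017-07-14 02:40:00.000000001' AS TIMESTAMP);   -- JDBC-format string
```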

Dates

Date values describe a particular year/month/day, in the form YYYY-MM-DD. The range of values supported for the Date type is 0000-01-01 to 9999-12-31.

  • cast(timestamp as date): The year/month/day of the timestamp is determined, based on the local timezone, and returned as a date value.
  • cast(string as date): If the string is in the form YYYY-MM-DD, then a date value corresponding to that year/month/day is returned. If the string value does not match this format, then NULL is returned.
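A quick sketch of the two casts described above:

```sql
SELECT CAST(current_timestamp() AS DATE);   -- year/month/day in the local timezone
SELECT CAST('2017-07-14' AS DATE);          -- matches YYYY-MM-DD, returns a date
SELECT CAST('07/14/2017' AS DATE);          -- wrong format, returns NULL
```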

Union Types

The first part of a deserialized union is the tag, which tells us which member of the union is in use. In the example below, tag 0 refers to the first data_type in the definition (an int), and so on.

```sql
CREATE TABLE union_test(foo UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>);
SELECT foo FROM union_test;
```

```
{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}
```

To create a union, you have to pass this tag to the create_union UDF:

```sql
SELECT create_union(0, key), create_union(if(key<100, 0, 1), 2.0, value), create_union(1, "a", struct(2, "b")) FROM src LIMIT 2;
```

```
{0:"238"} {1:"val_238"} {1:{"col1":2,"col2":"b"}}
{0:"86"} {0:2.0} {1:{"col1":2,"col2":"b"}}
```

Literals

Floating Point Types

Decimal literals provide precise values and greater range for floating point numbers than the DOUBLE type. Decimal types are needed for use cases in which the (very close) approximation of a DOUBLE is insufficient, such as financial applications, equality and inequality checks, and rounding operations.
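For example, a DECIMAL literal can be written with the BD suffix so that it is not parsed as a DOUBLE (the table and column names are illustrative):

```sql
SELECT key FROM decimal_test WHERE key = 3.14BD;  -- exact decimal comparison
```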

Data Definition Statements

Create/Drop/Truncate Table

Managed and External Tables

  • Use managed tables when Hive should manage the lifecycle of the table, or when generating temporary tables.

  • External table files can be accessed and managed by processes outside of Hive.

If a managed table or partition is dropped, the data and metadata associated with that table or partition are deleted. If the PURGE option is not specified, the data is moved to a trash folder for a defined duration.
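A minimal sketch of the two table kinds and the PURGE option (names and the path are illustrative):

```sql
CREATE TABLE managed_t (id INT);                                 -- Hive owns data and metadata
CREATE EXTERNAL TABLE external_t (id INT) LOCATION '/data/ext';  -- data lives outside Hive's control
DROP TABLE external_t;        -- removes only the metadata; the files under /data/ext remain
DROP TABLE managed_t PURGE;   -- deletes the data immediately instead of moving it to trash
```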

CTAS

CTAS (CREATE TABLE AS SELECT) has these restrictions:

  • The target table cannot be a partitioned table.
  • The target table cannot be an external table.
  • The target table cannot be a list bucketing table.
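A CTAS statement that respects the restrictions above might look like this (table and column names are illustrative):

```sql
CREATE TABLE key_value_copy
  STORED AS ORC
  AS
SELECT key, value FROM src WHERE key > 100;
```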

Macro and Function

Create/Drop Temporary Macro

```sql
CREATE TEMPORARY MACRO macro_name([col_name col_type, ...]) expression;
CREATE TEMPORARY MACRO fixed_number() 42;
CREATE TEMPORARY MACRO string_len_plus_two (x string) length(x) + 2;
CREATE TEMPORARY MACRO simple_add (x int, y int) x + y;
DROP TEMPORARY MACRO [IF EXISTS] macro_name;
```
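Once defined, the macros above can be called like ordinary functions (the table t and its string column s are assumptions):

```sql
SELECT fixed_number();                                   -- returns 42
SELECT string_len_plus_two(s), simple_add(1, 2) FROM t;
```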

Temporary Functions

```sql
CREATE TEMPORARY FUNCTION function_name AS class_name;
DROP TEMPORARY FUNCTION [IF EXISTS] function_name;
CREATE FUNCTION [db_name.]function_name AS class_name [USING JAR|FILE|ARCHIVE 'file_uri' [, JAR|FILE|ARCHIVE 'file_uri'] ];
```
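A concrete registration might look like the following; the class name, jar path, and table are hypothetical placeholders, not a real UDF:

```sql
CREATE FUNCTION mydb.my_lower AS 'com.example.hive.udf.MyLower'
  USING JAR 'hdfs:///udfs/my_udfs.jar';
SELECT mydb.my_lower(name) FROM people;   -- people/name are illustrative
DROP FUNCTION mydb.my_lower;
```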

Data Retrieval: Queries

HAVING Clause

The ```HAVING``` clause was added to SQL because the ```WHERE``` keyword could not be used with aggregate functions. ```WHERE``` filters rows before aggregation runs; ```HAVING``` filters the groups the query returns. The two queries below are equivalent:

```sql
select col1 from t1 group by col1 having sum(col2) > 10;
select col1 from (select col1, sum(col2) as col2sum from t1 group by col1) t2 where t2.col2sum > 10;
```

LIMIT Clause

```sql
select * from customers limit 5;
-- The previous query returns 5 arbitrary customers
select * from customers order by create_date limit 5;
-- The previous query returns the first 5 customers to be created
select * from customers order by create_date limit 2,5;
-- The previous query returns the 3rd to the 7th customers to be created
```

REGEX Column Specification

```sql
set hive.support.quoted.identifiers=none;
-- The following query selects all columns except ds and hr.
SELECT `(ds|hr)?+.+` FROM sales;
```

GROUP BY

Multiple aggregations can be done at the same time; however, no two aggregations can have different DISTINCT columns.

```sql
select users.gender, count(distinct users.userid), count(*), sum(distinct users.userid) from users group by users.gender;
```

When using a group by clause, the select statement can only include columns that appear in the group by clause, plus any number of aggregation functions.

```sql
select a, sum(b) from t1 group by a;
```

hive.map.aggr controls how we do aggregations. The default is false. If it is set to true, Hive will do the first-level aggregation directly in the map task. This usually provides better efficiency, but may require more memory to run successfully.

```sql
set hive.map.aggr=true;
select count(*) from table2;
```

Order/Sort/Distribute/Cluster By

There are some limitations in the ```order by``` clause. In strict mode (i.e., ```set hive.mapred.mode=strict```), the ```order by``` clause has to be followed by a ```limit``` clause. The reason is that in order to impose a total order on all results, there has to be a single reducer sorting the final output, and if the number of rows in the output is too large, that one reducer could take a very long time to finish.

```sql
order by colName (asc|desc) (nulls first|last) (, colName (asc|desc) (nulls first|last))*
```

The difference between ```order by``` and ```sort by``` is that the former guarantees total order in the output while the latter only guarantees ordering of the rows within a reducer. If there is more than one reducer, ```sort by``` may give a partially ordered final result. The difference between ```sort by``` on a single column and ```cluster by``` is that ```cluster by``` partitions the data by that field, whereas ```sort by``` alone, when there are multiple reducers, partitions the rows randomly in order to distribute data (and load) uniformly across the reducers.

```sql
select key, value from src sort by key asc, value desc;
```

```cluster by``` is a short-cut for both ```distribute by``` and ```sort by```.
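For example, the following two queries should be equivalent:

```sql
select key, value from src cluster by key;
select key, value from src distribute by key sort by key;
```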

Operators and UDFs