Select Clause Syntax
This section shows the general syntax of an SQL statement:
SELECT [DISTINCT] expr_list [FROM [db.] table | (subquery) | table_function] [SAMPLE sample_coeff] [GLOBAL] [ANY|ALL] [INNER|LEFT|RIGHT|FULL|CROSS] [OUTER] JOIN (subquery)|table USING columns_list [PREWHERE expr] [WHERE expr] [GROUP BY expr_list] [WITH TOTALS] [HAVING expr] [ORDER BY expr_list] [LIMIT [n, ]m] [UNION ALL...] [LIMIT n BY columns] |
FROM Clause
The FROM clause specifies the source to read data from: a table, a subquery, or a JOIN clause. Subqueries must be specified in parenthesis. Unlike standard SQL, you do not need to specify a synonym after a subquery. For compatibility purposes, you can write ‘name AS’ after a subquery, but the specified name will not be used.
SAMPLE Clause
The SAMPLE clause allows for approximated query processing.
When you use the SAMPLE
clause, the query is not performed on all the data, but only on a certain fraction of the data (sample). For example, if you need to calculate statistics for a number of events, it is enough to execute the query on the 1/10 fraction of all the events and then multiply the result by 10.
Approximated query processing can be useful in these cases:
-
When you want to speed up result collection.
-
When your raw data is not accurate, so approximation does not noticeably degrade the quality of results.
The features of data sampling are these:
-
Data sampling is a deterministic mechanism. The result of the same query is always the same.
-
Sampling works consistently for different tables. For tables with a single sampling key, a sample with the same coefficient always selects the same subset of possible data. This means that you can use the sample in subqueries in the IN clause. Also, you can join samples using the JOIN clause.
-
Sampling allows reading less data from a disk. Note that you must specify the sampling key correctly.
For the SAMPLE
clause this syntax is supported::
Syntax | Description |
---|---|
SAMPLE k |
Number from 0 to 1. The query is run on a |
SAMPLE n |
|
SAMPLE k OFFSET m |
|
JOIN Clause
Join indicates a join operation in relational algebra that combines columns from one or more tables, creating a new group that can be stored in a table or be used as is. Join produces a new table by combining columns from one or multiple tables by using values common to each. Supported types of JOIN in Cytomic Orion:
-
INNER JOIN (or JOIN): Compares each row of table A with rows of table B to find all pairs of rows that satisfy the join-predicate specified in the
ON
clause. When the join-predicate is satisfied by matching non-NULL values, column values for each matched pair of rows of A and B are combined into a result row. -
LEFT JOIN (or LEFT OUTER JOIN): Always contains all rows of the "left" table (A), even if the join-condition does not find any matching row in the "right" table (B). This means that if the
ON
clause matches 0 (zero) rows in B (for a given row in A), the join will still return a row in the result (for that row)—but withNULL
in each column from B. -
RIGHT JOIN (or RIGHT OUTER JOIN): Always contains all rows of the "right" table (B), even if the join-condition does not find any matching row in the "left" table (A). This means that if the
ON
clause matches 0 (zero) rows in A (for a given row in B), the join will still return a row in the result (for that row)—but withNULL
in each column from A. -
FULL JOIN (or FULL OUTER JOIN): Combines the effect of applying both
LEFT JOIN
s andRIGHT JOIN
s Where rows in the full outer joined tables do not match, the result set will haveNULL
values for every column of the table that lacks a matching row. For those rows that do match, a single row is produced in the result set (containing columns populated from both tables). -
CROSS JOIN (or , ): Returns the Cartesian product of rows from tables in the join. In other words, it produces rows which combine each row from the first table with each row from the second table.
-
ANY or ALL modifier: If
ALL
is specified and the right table has several matching rows, the data is multiplied by the number of rows. This is the normal behavior of aJOIN
clause in standard SQL. IfANY
is specified and the right table has several matching rows, only the first one found is joined. If the right table has only one matching row, the results ofANY
andALL
are the same.
WHERE Clause
If there is a WHERE
clause, it must contain an expression with the UInt8 type. This is usually an expression with comparison and logical operators. This expression is used to filter data before all the transformations included in the statement.
PREWHERE Clause
This clause has the same meaning as the WHERE
clause. The difference is in which data is read from the table. With PREWHERE
, at first only the columns necessary for executing the clause are read. Then, the other columns are read that are needed for running the rest of the query, but only those blocks where the PREWHERE
expression is true.
PREWHERE
filters data more efficiently and allows to read a lot less data from disk for query execution.
GROUP BY Clause
This clause groups results by one or more columns. For grouping, Cytomic Orion interprets NULL
as a value.
If the WITH TOTALS
modifier is specified, another row is calculated. This row has key columns containing default values (zeros or empty lines), and columns of aggregate functions with the values calculated across all the rows (the “total” values). This extra row is only produced in JSON, TabSeparated*, and Pretty formats. In JSON formats, this row is output as a separate ‘totals’ field. In TabSeparated formats, the row comes after the main result, preceded by an empty row. In Pretty formats, the row is output as a separate table after the main result.
You can use WITH TOTALS
in subqueries, including subqueries in the JOIN clause (in this case, the respective total values are combined).
As opposed to MySQL (and conforming to standard SQL, the GROUP BY
statement does not support positional arguments. For example, GROUP BY 1, 2
is interpreted as group by constant (that is, all rows in one).
LIMIT N BY Columns Clause
This clause selects the first N
rows for each group of columns. LIMIT N BY
is not related to LIMIT
. They can both be used in the same query.. LIMIT N BY
can contain any number of columns or expressions.
HAVING Clause
This clause enables you to filter the aggregation results produced by GROUP BY
. It is similar to the WHERE
clause, but the difference is that WHERE
is performed before aggregation (GROUP BY
), while HAVING
is performed after it. You cannot use HAVING
if aggregation is not performed.
As opposed to MySQL (and conforming to standard SQL, the GROUP BY
statement does not support positional arguments.
ORDER BY Clause
The ORDER BY
clause contains a list of expressions, which can each be attributed with a DESC
(descending) or ASC
(ascending) modifier which determine the sorting direction. If the direction is not specified, ASC
is assumed. The sorting direction applies to a single expression, not to the entire list.
Rows that have identical values for the list of sorting expressions are output in an arbitrary order, which can also be non-deterministic (different each time). If the ORDER BY
clause is omitted, the order of the rows is also undefined, and might be non-deterministic as well.
COLLATE
For sorting by String values, you can specify COLLATE
to specify the alphabet you want to use. Example: ORDER BY SearchPhrase COLLATE 'tr'
- for sorting by keyword in ascending order, using the Turkish alphabet, case insensitive, assuming that strings are UTF-8 encoded.
You can specify COLLATE
or not for each expression independently. If you specify ASC
or DESC
, specify COLLATE
after it. When you usie COLLATE
, sorting is always case-insensitive.
We recommend that you use COLLATE
only for final sorting of a small number of rows, because sorting with COLLATE
is less efficient than normal sorting by bytes.
NaN and NULL Sorting Order:
-
With the NULLS FIRST modifier: First
NULL
, thenNaN
, then other values. -
With the NULLS LAST modifier: First the values, then
NaN
, thenNULL
. -
Default: The same as with the
NULLS LAST
modifier.
When floating point numbers are sorted, NaN
s are separate from the other values. Regardless of the sorting order, NaN
s come at the end. In other words, for ascending sorting they are placed as if they are larger than all the other numbers, while for descending sorting they are placed as if they are smaller than the rest.
SELECT Clause
Expressions specified in the SELECT
clause are calculated after all the operations in the clauses described above are finished. If expressions in the SELECT
clause contain aggregate functions, then the solution processes aggregate functions and the expressions used as their arguments during the GROUP BY
aggregation. These expressions work as if they apply to separate rows in the result.
DISTINCT Clause
If you specify DISTINCT
, only a single row remains out of all the sets of fully matching rows in the result. It is possible to obtain the same result by applying GROUP BY
across the same set of values as specified as SELECT
clause, without using any aggregate functions. But there are few differences from the GROUP BY
approach:
-
DISTINCT
can be applied together withGROUP BY
. -
When you omit
ORDER BY
and defineLIMIT
, the query stops running immediately after the required number of different rows has been read. -
Data blocks are output as they are processed, without waiting for the entire query to finish running.
DISTINCT
works with NULL
as if NULL
were a specific value. In other words, in the DISTINCT
results, different combinations with NULL
occur only once.
LIMIT m Clause
This clause selects the first m rows from the result.
LÍMIT n, m
selects the first m
rows from the result after skipping the first n
rows. The LIMIT m OFFSET n
syntax is equivalent. n
and m
must be non-negative integers.
If there is no ORDER BY
clause that explicitly sorts results, the choice of rows for the result might be arbitrary and non-deterministic.
UNION ALL Clause
You can use this clause to combine any number of queries. Only UNION ALL
is supported. UNION (UNION DISTINCT)
is not supported. If you need UNION DISTINCT
, you can write SELECT DISTINCT
from a subquery containing UNION ALL
. Queries that are parts of UNION ALL
can be run simultaneously, and their results can be mixed together.
The structure of results (the number and type of columns) must match for the queries. But the column names can differ. In this case, the column names for the final result will be taken from the first query. Type casting is performed for unions. For example, if two queries that are being combined have the same field with non-Nullable and Nullable types from a compatible type, the resulting UNION ALL
has a Nullable type field.
You cannot enclose queries that are parts of UNION ALL
in brackets. ORDER BY
and LIMIT
are applied to separate queries, not to the final result. If you need to apply a conversion to the final result, you can put all the queries with UNION ALL
in a subquery in the FROM
clause.
IN Operators
In this section, we cover the IN
, NOT
IN
, GLOBAL
IN
, and GLOBAL NOT IN
operators separately, because their functionality is quite rich.
The left side of the IN
operator is either a single column or a tuple. For example SELECT UserID IN (123, 456) FROM ..
.
If the left side is a single column that is in the index, and the right side is a set of constants, the system uses the index for processing the query.
The right side of the operator can be a set of constant expressions, a set of tuples, the name of a database table, or a SELECT
subquery in brackets. The subquery can specify more than one column for filtering tuples. For example:
SELECT (CounterID, UserID) IN (SELECT CounterID, UserID FROM ...) FROM ...
.
The columns to the left and right of the IN operator should have the same type.
The IN operator and the subquery can occur in any part of the query, including in aggregate functions and lambda functions.
NULL Processing
During request processing, the IN
operator assumes that the result of an operation with NULL
is always equal to 0, regardless of whether NULL
is on the right or left side of the operator. NULL
values are not included in any dataset, do not correspond to each other, and cannot be compared.
The Asterisk Symbol (*)
You can put an asterisk in any part of a query instead of an expression. When the query is analyzed, the asterisk is expanded to a list of all table columns. There are only a few cases when using an asterisk is justified:
-
When you creatie a table dump.
-
For tables that contain just a few columns.
-
For getting information about what columns are in a table. In this case, set
LÍMIT 1
. We recommend that you use theDESC TABLE
query. -
When there is strong filtration on a small number of columns, use
PREWHERE
. -
In subqueries (because columns that are not needed for the external query are excluded from subqueries).
In all other cases, we do not recommend that you use the asterisk due to performance issues.