Select Clause Syntax

This section shows the general syntax of an SQL statement:

SELECT [DISTINCT] expr_list

[FROM [db.] table | (subquery) | table_function]

[SAMPLE sample_coeff]

[PREWHERE expr]

[WHERE expr]

[GROUP BY expr_list] [WITH TOTALS]

[HAVING expr]

[ORDER BY expr_list]

[LIMIT [n, ]m]

[UNION ALL...]

[LIMIT n BY columns]

FROM Clause

The FROM clause specifies the source to read data from: a table, a subquery, or a JOIN clause. Subqueries must be specified in parenthesis. Unlike standard SQL, you do not need to specify a synonym after a subquery. For compatibility purposes, you can write ‘name AS’ after a subquery, but the specified name will not be used.

SAMPLE Clause

The SAMPLE clause allows for approximated query processing.

When you use the SAMPLE clause, the query is not performed on all the data, but only on a certain fraction of the data (sample). For example, if you need to calculate statistics for a number of events, it is enough to execute the query on the 1/10 fraction of all the events and then multiply the result by 10.

Approximated query processing can be useful in these cases:

When you want to speed up result collection.
When your raw data is not accurate, so approximation does not noticeably degrade the quality of results.

The features of data sampling are these:

Data sampling is a deterministic mechanism. The result of the same query is always the same.
Sampling works consistently for different tables. For tables with a single sampling key, a sample with the same coefficient always selects the same subset of possible data. This means that you can use the sample in subqueries in the IN clause. Also, you can join samples using the JOIN clause.
Sampling allows reading less data from a disk. Note that you must specify the sampling key correctly.

For the SAMPLE clause this syntax is supported::

Syntax	Description
SAMPLE k	Number from 0 to 1. The query is run on a `k` fraction of data. For example, `SAMPLE 0.1` runs the query on 10% of data.
SAMPLE n	`n` is a sufficiently large integer. The query is run on a sample of at least `n` rows (but not significantly more than this). For example, `SAMPLE 10000000` runs the query on a minimum of 10,000,000 rows.
SAMPLE k OFFSET m	`k` and `m` are the numbers from 0 to 1. The query is run on a sample of `k` fraction of the data. The data used for the sample is offset by `m` fraction.
Parameters supported by the SAMPLE clause

JOIN Clause

Join indicates a join operation in relational algebra that combines columns from one or more tables, creating a new group that can be stored in a table or be used as is. Join produces a new table by combining columns from one or multiple tables by using values common to each. Supported types of JOIN in Cytomic Orion:

INNER JOIN (or JOIN): Compares each row of table A with rows of table B to find all pairs of rows that satisfy the join-predicate specified in the ON clause. When the join-predicate is satisfied by matching non-NULL values, column values for each matched pair of rows of A and B are combined into a result row.
LEFT JOIN (or LEFT OUTER JOIN): Always contains all rows of the "left" table (A), even if the join-condition does not find any matching row in the "right" table (B). This means that if the ON clause matches 0 (zero) rows in B (for a given row in A), the join will still return a row in the result (for that row)—but with NULL in each column from B.
RIGHT JOIN (or RIGHT OUTER JOIN): Always contains all rows of the "right" table (B), even if the join-condition does not find any matching row in the "left" table (A). This means that if the ON clause matches 0 (zero) rows in A (for a given row in B), the join will still return a row in the result (for that row)—but with NULL in each column from A.
FULL JOIN (or FULL OUTER JOIN): Combines the effect of applying both LEFT JOINs and RIGHT JOINs Where rows in the full outer joined tables do not match, the result set will have NULL values for every column of the table that lacks a matching row. For those rows that do match, a single row is produced in the result set (containing columns populated from both tables).
CROSS JOIN (or , ): Returns the Cartesian product of rows from tables in the join. In other words, it produces rows which combine each row from the first table with each row from the second table.
ANY or ALL modifier: If ALL is specified and the right table has several matching rows, the data is multiplied by the number of rows. This is the normal behavior of a JOIN clause in standard SQL. If ANY is specified and the right table has several matching rows, only the first one found is joined. If the right table has only one matching row, the results of ANY and ALL are the same.

WHERE Clause

If there is a WHERE clause, it must contain an expression with the UInt8 type. This is usually an expression with comparison and logical operators. This expression is used to filter data before all the transformations included in the statement.

PREWHERE Clause

This clause has the same meaning as the WHERE clause. The difference is in which data is read from the table. With PREWHERE, at first only the columns necessary for executing the clause are read. Then, the other columns are read that are needed for running the rest of the query, but only those blocks where the PREWHERE expression is true.

PREWHERE filters data more efficiently and allows to read a lot less data from disk for query execution.

GROUP BY Clause

This clause groups results by one or more columns. For grouping, Cytomic Orion interprets NULL as a value.

If the WITH TOTALS modifier is specified, another row is calculated. This row has key columns containing default values (zeros or empty lines), and columns of aggregate functions with the values calculated across all the rows (the “total” values). This extra row is only produced in JSON, TabSeparated*, and Pretty formats. In JSON formats, this row is output as a separate ‘totals’ field. In TabSeparated formats, the row comes after the main result, preceded by an empty row. In Pretty formats, the row is output as a separate table after the main result.

You can use WITH TOTALS in subqueries, including subqueries in the JOIN clause (in this case, the respective total values are combined).

As opposed to MySQL (and conforming to standard SQL, the GROUP BY statement does not support positional arguments. For example, GROUP BY 1, 2 is interpreted as group by constant (that is, all rows in one).

LIMIT N BY Columns Clause

This clause selects the first N rows for each group of columns. LIMIT N BY is not related to LIMIT. They can both be used in the same query.. LIMIT N BY can contain any number of columns or expressions.

HAVING Clause

This clause enables you to filter the aggregation results produced by GROUP BY. It is similar to the WHERE clause, but the difference is that WHERE is performed before aggregation (GROUP BY), while HAVING is performed after it. You cannot use HAVING if aggregation is not performed.

As opposed to MySQL (and conforming to standard SQL, the GROUP BY statement does not support positional arguments.

ORDER BY Clause

The ORDER BY clause contains a list of expressions, which can each be attributed with a DESC (descending) or ASC (ascending) modifier which determine the sorting direction. If the direction is not specified, ASC is assumed. The sorting direction applies to a single expression, not to the entire list.

Rows that have identical values for the list of sorting expressions are output in an arbitrary order, which can also be non-deterministic (different each time). If the ORDER BY clause is omitted, the order of the rows is also undefined, and might be non-deterministic as well.

COLLATE

For sorting by String values, you can specify COLLATE to specify the alphabet you want to use. Example: ORDER BY SearchPhrase COLLATE 'tr' - for sorting by keyword in ascending order, using the Turkish alphabet, case insensitive, assuming that strings are UTF-8 encoded.

You can specify COLLATE or not for each expression independently. If you specify ASC or DESC, specify COLLATE after it. When you usie COLLATE, sorting is always case-insensitive.

We recommend that you use COLLATE only for final sorting of a small number of rows, because sorting with COLLATE is less efficient than normal sorting by bytes.

NaN and NULL Sorting Order:

With the NULLS FIRST modifier: First NULL, then NaN, then other values.
With the NULLS LAST modifier: First the values, then NaN, then NULL.
Default: The same as with the NULLS LAST modifier.

When floating point numbers are sorted, NaNs are separate from the other values. Regardless of the sorting order, NaNs come at the end. In other words, for ascending sorting they are placed as if they are larger than all the other numbers, while for descending sorting they are placed as if they are smaller than the rest.

SELECT Clause

Expressions specified in the SELECT clause are calculated after all the operations in the clauses described above are finished. If expressions in the SELECT clause contain aggregate functions, then the solution processes aggregate functions and the expressions used as their arguments during the GROUP BY aggregation. These expressions work as if they apply to separate rows in the result.

DISTINCT Clause

If you specify DISTINCT, only a single row remains out of all the sets of fully matching rows in the result. It is possible to obtain the same result by applying GROUP BY across the same set of values as specified as SELECT clause, without using any aggregate functions. But there are few differences from the GROUP BY approach:

DISTINCT can be applied together with GROUP BY.
When you omit ORDER BY and define LIMIT, the query stops running immediately after the required number of different rows has been read.
Data blocks are output as they are processed, without waiting for the entire query to finish running.

DISTINCT works with NULL as if NULL were a specific value. In other words, in the DISTINCT results, different combinations with NULL occur only once.

LIMIT m Clause

This clause selects the first m rows from the result.

LÍMIT n, m selects the first m rows from the result after skipping the first n rows. The LIMIT m OFFSET n syntax is equivalent. n and m must be non-negative integers.

If there is no ORDER BY clause that explicitly sorts results, the choice of rows for the result might be arbitrary and non-deterministic.

UNION ALL Clause

You can use this clause to combine any number of queries. Only UNION ALL is supported. UNION (UNION DISTINCT) is not supported. If you need UNION DISTINCT, you can write SELECT DISTINCT from a subquery containing UNION ALL. Queries that are parts of UNION ALL can be run simultaneously, and their results can be mixed together.

The structure of results (the number and type of columns) must match for the queries. But the column names can differ. In this case, the column names for the final result will be taken from the first query. Type casting is performed for unions. For example, if two queries that are being combined have the same field with non-Nullable and Nullable types from a compatible type, the resulting UNION ALL has a Nullable type field.

You cannot enclose queries that are parts of UNION ALL in brackets. ORDER BY and LIMIT are applied to separate queries, not to the final result. If you need to apply a conversion to the final result, you can put all the queries with UNION ALL in a subquery in the FROM clause.

IN Operators

In this section, we cover the IN, NOT IN, GLOBAL IN, and GLOBAL NOT IN operators separately, because their functionality is quite rich.

The left side of the IN operator is either a single column or a tuple. For example SELECT UserID IN (123, 456) FROM ...

If the left side is a single column that is in the index, and the right side is a set of constants, the system uses the index for processing the query.

The right side of the operator can be a set of constant expressions, a set of tuples, the name of a database table, or a SELECT subquery in brackets. The subquery can specify more than one column for filtering tuples. For example:

SELECT (CounterID, UserID) IN (SELECT CounterID, UserID FROM ...) FROM ....

The columns to the left and right of the IN operator should have the same type.

The IN operator and the subquery can occur in any part of the query, including in aggregate functions and lambda functions.

NULL Processing

During request processing, the IN operator assumes that the result of an operation with NULL is always equal to 0, regardless of whether NULL is on the right or left side of the operator. NULL values are not included in any dataset, do not correspond to each other, and cannot be compared.

The Asterisk Symbol (*)

You can put an asterisk in any part of a query instead of an expression. When the query is analyzed, the asterisk is expanded to a list of all table columns. There are only a few cases when using an asterisk is justified:

When you creatie a table dump.
For tables that contain just a few columns.
For getting information about what columns are in a table. In this case, set LÍMIT 1. We recommend that you use the DESC TABLE query.
When there is strong filtration on a small number of columns, use PREWHERE.
In subqueries (because columns that are not needed for the external query are excluded from subqueries).

In all other cases, we do not recommend that you use the asterisk due to performance issues.