Databricks: Game-Changing Built-in SQL Functions

aps08
4 min read · Dec 1, 2023

Databricks is a leading open analytics platform for building, deploying, sharing, and maintaining data, analytics, and AI solutions at enterprise scale. If you are getting started with Databricks and want to know what its SQL dialect offers beyond conventional SQL, this article is a practical guide to the extra capabilities Databricks brings to the language.

Databricks ships a rich set of built-in SQL functions designed to streamline data tasks. This article covers the essentials: (1) what built-in SQL functions are, (2) why you should use them, and (3) the top 10 game-changing built-in SQL functions in Databricks. The goal is quick, practical insight that boosts your efficiency in data pipeline development.

Keywords: Databricks, integration, functions, SQL, analytics, deployment, cluster.


What are built-in SQL functions?

Databricks built-in SQL functions extend standard SQL with capabilities tailored to the Databricks runtime environment and applicable to Delta Lake tables.

If the existing functions fall short, users have the flexibility to craft custom functions to meet their unique needs.
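
For instance, Databricks SQL lets you define a scalar function directly with CREATE FUNCTION. Here is a minimal sketch; the function name, parameters, and logic are invented for illustration:

```sql
-- Hypothetical custom function: format a full name from two values.
CREATE OR REPLACE FUNCTION to_full_name(first_name STRING, last_name STRING)
  RETURNS STRING
  RETURN concat(first_name, ' ', last_name);

-- Usage: behaves like any built-in scalar function.
SELECT to_full_name('Ada', 'Lovelace');  -- Ada Lovelace
```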

Why use built-in SQL functions?

During data pipeline development, it’s common to use Python or PySpark to manipulate data in Delta table columns. Beginners, however, often resort to inefficient approaches, such as looping over column elements. These methods slow the code down, prolong runtime, and drive up cluster usage, which in turn raises project costs. Adopting optimized practices is crucial for both performance and cost-effectiveness.

Consider a scenario where a table includes a column named array_data holding ARRAY-type data such as [“alpha”, “beta”, “gamma”]. The task is to select, or load into another table, only the rows whose array contains “beta”. How would you tackle this? A novice might loop through the data and check each row for “beta”, or split the array_data column on commas (,) and inspect each element.

The easiest and most efficient way to solve this is the built-in array_contains function.
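
In Databricks SQL, that looks like the sketch below; source_table and target_table are hypothetical names standing in for your Delta tables:

```sql
-- Load only the rows whose array_data column contains 'beta'.
INSERT INTO target_table
SELECT *
FROM source_table
WHERE array_contains(array_data, 'beta');
```

No loops and no string splitting: the predicate is evaluated directly by the engine.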


So, when you hit a problem, first check whether Databricks provides a built-in function for it. If not, consider creating a custom function. If a custom function isn’t feasible for your use case, fall back to a PySpark user-defined function (UDF).

Top 10 game-changing built-in SQL functions

Presented here are 10 Databricks built-in functions that, in my opinion, will greatly aid your development. I won’t dive into detailed explanations of their functionality in this context; you can find that information here. Note that some functions may depend on the runtime version of the cluster. A few of them are sketched in code after the list.

  1. array_[function] — used for performing operations on arrays in a table. Includes functions like array, array_append, array_compact, array_contains, and many more.
  2. zip_with — Merges the arrays in expr1 and expr2, element-wise, into a single array using func.
  3. split_part — Splits str around occurrences of delim and returns the partNum part.
  4. secret — Extracts a secret value with the given scope and key from Databricks secret service.
  5. sha — Returns a SHA-1 hash value of expr as a hex string. This function is a synonym for the sha1 function.
  6. monotonically_increasing_id — Returns monotonically increasing 64-bit integers.
  7. md5 — Returns an MD5 128-bit checksum of expr as a hex string.
  8. mask — Returns a masked version of the input str.
  9. collect_list — Returns an array consisting of all values in expr within the group. This function is a synonym for the array_agg aggregate function.
  10. convert_timezone — Converts a TIMESTAMP_NTZ to another time zone. If the input column is of TIMESTAMP, DATE, or STRING type, it is converted to TIMESTAMP_NTZ before the time zone conversion.
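
To make the list concrete, here is a small sketch of a few of these functions with made-up literals; exact output and availability depend on your Databricks runtime version:

```sql
-- zip_with: merge two arrays element-wise with a lambda.
SELECT zip_with(array(1, 2), array(10, 20), (x, y) -> x + y);  -- [11, 22]

-- split_part: take the 2nd comma-separated part.
SELECT split_part('alpha,beta,gamma', ',', 2);                 -- beta

-- mask: default masking (uppercase -> X, lowercase -> x, digit -> n).
SELECT mask('AaBb123');                                        -- XxXxnnn

-- collect_list: aggregate all values in a group into an array.
SELECT collect_list(col) FROM VALUES (1), (2), (3) AS t(col);  -- [1, 2, 3]

-- convert_timezone: shift a TIMESTAMP_NTZ between time zones.
SELECT convert_timezone('UTC', 'Asia/Kolkata',
                        TIMESTAMP_NTZ'2023-12-01 10:00:00');   -- 2023-12-01 15:30:00
```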

Bonus:

  1. Databricks also has mathematical functions such as sin, cos, tan, mean, median, mode, and percentile.
  2. Just like array_[function], Databricks also has map_[function], used for working with maps (see the sketch below).
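
A quick sketch of a few map_[function] helpers; the map literal here is made up:

```sql
-- Build a map inline and inspect it with map_[function] helpers.
SELECT map_keys(map('a', 1, 'b', 2));               -- [a, b]
SELECT map_values(map('a', 1, 'b', 2));             -- [1, 2]
SELECT map_contains_key(map('a', 1, 'b', 2), 'a');  -- true
```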

Thank you for reading this article. Of course, this information is based on my own experience. If you enjoyed it, give me a clap and feel free to connect with me on Twitter, GitHub, and LinkedIn.
