Databricks is an optimized data analytics platform based on Apache Spark. Monitoring Databricks plateform is crucial to ensure data quality, job performance, and security issues by limiting access to production workspaces.
Spark application metrics, logs, and events produced by a Databricks workspace can be customized, sent, and centralized to various monitoring platforms including Azure Monitor Logs. This tool, formerly called Log Analytics by Microsoft, is an Azure cloud service integrated into Azure Monitor that collects and stores logs from cloud and on-premises environments. It provide a mean for querying logs from data collected using a read-only query language named “Kusto”, for building “Workbooks” dashboards and setting up alerts on identified patterns.
This article focus on automating the export of Databricks logs to a Log Analytics workspace by using the Spark-monitoring library at a workspace scale.
This section is an overview of the architecture. More detailed information and the associated source code are provided further down in the article.
Spark-monitoring is a Microsoft Open Source project to export Databricks logs at a cluster level. Once downloaded, the library is locally built with Docker or Maven according to the Databricks Runtime version of the cluster to configure (Spark and Scala versions). The build of the library generates two jar files:
spark-listeners_$spark-version_$scala_version-$version: collects data from a running cluster;
spark-listenersby collecting data, connecting to a Log Analytics workspace, parsing and sending logs via Data Collector API
In the documentation, once the jars are