
Databricks is a web-based platform for working with Apache Spark. It provides automated cluster management and IPython-style notebooks. Azure Databricks is the data and AI cloud service jointly developed by Microsoft and Databricks for data analytics, data science, data engineering and machine learning.
Azure Databricks Architecture
Azure Databricks is a cloud service that lets you set up and use a cluster of Azure instances with Apache Spark installed, arranged in a master-worker layout (similar to a local Hadoop/Spark cluster). The Databricks architecture is split into two parts: the Control Plane and the Data Plane. The Control Plane contains the Databricks UX and the cluster manager. It is also home to the Databricks File System (DBFS) and to metadata about clusters, mounted files and so on. The Data Plane is located in the customer's subscription. When you create a Databricks service in Azure, four resources are created in your subscription: a virtual network, a network security group for the virtual network, an Azure Blob Storage account and a Databricks workspace.
What are Azure Databricks Clusters?
A cluster is a collection of virtual machines. In a cluster there is a driver node accompanied by one or more worker nodes, and the driver node lets us treat this group of computers as a single computer. Databricks clusters can run different types of workloads, such as ETL for data engineering, data science and machine learning.
Databricks offers two types of clusters. An All-Purpose Cluster is created manually via the GUI, the CLI or the API. A Job Cluster is created when a job that has been configured to use a job cluster starts to execute. All-Purpose Clusters are persistent: they can be terminated and restarted at any time. A Job Cluster is terminated at the end of the job, so it is no longer usable once the job has completed. All-Purpose Clusters can be shared among many users, which makes them suitable for interactive workloads, ad hoc analysis and collaborative work. Job Clusters are isolated to the job being executed, which makes them suitable for automated, repeated workloads such as running an ETL pipeline or ML workload at regular intervals.
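The difference between the two cluster types shows up in how they are requested. The following is a minimal sketch, not an official client: it only builds the JSON bodies that the Databricks REST API accepts (`clusters/create` for an all-purpose cluster, and a `new_cluster` block inside a job definition for a job cluster). The cluster name, runtime version key, VM size and notebook path are placeholder assumptions; substitute values that are valid in your workspace.

```python
import json

# Assumed placeholder values; replace with ones available in your workspace.
SPARK_VERSION = "13.3.x-scala2.12"  # a Databricks Runtime version key (assumption)
NODE_TYPE = "Standard_DS3_v2"       # an Azure VM size (assumption)


def all_purpose_cluster_spec(name, workers=2, idle_minutes=30):
    """Request body for POST /api/2.0/clusters/create (all-purpose cluster)."""
    return {
        "cluster_name": name,
        "spark_version": SPARK_VERSION,
        "node_type_id": NODE_TYPE,
        "num_workers": workers,
        # Terminate after this many idle minutes; the cluster can be restarted later.
        "autotermination_minutes": idle_minutes,
    }


def job_with_job_cluster(job_name, notebook_path, workers=2):
    """Request body for POST /api/2.1/jobs/create whose task runs on a job cluster."""
    return {
        "name": job_name,
        "tasks": [
            {
                "task_key": "main",
                "notebook_task": {"notebook_path": notebook_path},
                # new_cluster: created when the run starts, terminated when it ends.
                "new_cluster": {
                    "spark_version": SPARK_VERSION,
                    "node_type_id": NODE_TYPE,
                    "num_workers": workers,
                },
            }
        ],
    }


interactive = all_purpose_cluster_spec("adhoc-analysis")
nightly = job_with_job_cluster("nightly-etl", "/Shared/etl/load_sales")
print(json.dumps(interactive, indent=2))
print(json.dumps(nightly, indent=2))
```

A task that should reuse a running all-purpose cluster would carry an `existing_cluster_id` instead of `new_cluster`; the sketch leaves that variant out to keep the contrast clear.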
Azure Databricks Cluster Configuration
Standard | High Concurrency | Single Node |
Single User | Multiple Users | Single User |
No Process Isolation | Provides Process Isolation | No Process Isolation |
No Task Preemption | Provides Task Preemption | No Task Preemption |
Supports all DSLs | Doesn’t Support Scala | Supports all DSLs |
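One way to read the table is as a set of constraints to filter on. The sketch below encodes the rows above as plain data and returns the modes that satisfy a given requirement. The `ClusterMode` class and the `candidate_modes` helper are hypothetical names made up for this illustration; they are not part of any Databricks API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterMode:
    """One column of the comparison table (hypothetical helper type)."""
    name: str
    multi_user: bool
    process_isolation: bool
    task_preemption: bool
    supports_scala: bool


# Values copied from the table above.
MODES = [
    ClusterMode("Standard", multi_user=False, process_isolation=False,
                task_preemption=False, supports_scala=True),
    ClusterMode("High Concurrency", multi_user=True, process_isolation=True,
                task_preemption=True, supports_scala=False),
    ClusterMode("Single Node", multi_user=False, process_isolation=False,
                task_preemption=False, supports_scala=True),
]


def candidate_modes(multi_user=False, needs_scala=False):
    """Return the names of the modes that meet the stated requirements."""
    return [
        mode.name
        for mode in MODES
        if (mode.multi_user or not multi_user)
        and (mode.supports_scala or not needs_scala)
    ]


print(candidate_modes(multi_user=True))                    # ['High Concurrency']
print(candidate_modes(needs_scala=True))                   # ['Standard', 'Single Node']
print(candidate_modes(multi_user=True, needs_scala=True))  # []
```

The empty result in the last call reflects the trade-off in the table: according to it, no single mode offers both multi-user sharing and Scala support.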
What is Databricks Runtime?
Databricks Runtime is the set of core libraries that run on Databricks clusters.
Databricks Runtime | Databricks Runtime ML | Databricks Runtime Genomics | Databricks Runtime Light |
Spark | Everything from Databricks Runtime | Everything from Databricks Runtime | Runtime option only for jobs that do not require advanced features |
Scala, Java, Python, R | ML libraries: PyTorch, Keras, TensorFlow, XGBoost | Popular open source libraries: Glow, ADAM | |
Ubuntu libraries | | Popular genomic pipelines: DNASeq, RNASeq | |
GPU libraries | | | |
Delta Lake | | | |
Benefits of using Azure Databricks:
Optimized Spark Engine – Data processing with auto-scaling and Spark optimized for up to 50x performance gains.
Machine Learning – Pre-configured environments with frameworks such as PyTorch, TensorFlow and scikit-learn installed.
MLflow – Track and share experiments, reproduce runs and manage models collaboratively from a central repository.
Benefits of using Databricks vs. SSIS
Access to distributed, in-memory computing.
Cost-effective cluster management.
Scheduling can be done in the Databricks UI itself.
Data can be exposed to any cloud platform, such as AWS, Azure or GCP.
SSIS is licensed in free and paid editions, whereas Databricks follows a “pay as you go” plan.
SSIS can handle only structured data, whereas Databricks can handle both structured and unstructured data.
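The scheduling point can also be expressed outside the UI. As a rough sketch, a schedule is a `schedule` block on a job definition in the Databricks Jobs API, written as a Quartz cron expression. The helper name `with_daily_schedule`, the job name, the notebook path and the `<cluster-id>` placeholder are assumptions made for this example; the code only builds the request body and does not call any service.

```python
import copy


def with_daily_schedule(job_settings, hour, minute=0, timezone_id="UTC"):
    """Return a copy of job_settings that runs once a day at hour:minute."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    scheduled = copy.deepcopy(job_settings)  # leave the caller's dict untouched
    scheduled["schedule"] = {
        # Quartz fields: second minute hour day-of-month month day-of-week
        "quartz_cron_expression": f"0 {minute} {hour} * * ?",
        "timezone_id": timezone_id,
        "pause_status": "UNPAUSED",
    }
    return scheduled


# Placeholder job definition (names, path and cluster id are assumptions).
job = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "main",
            "notebook_task": {"notebook_path": "/Shared/etl/load_sales"},
            "existing_cluster_id": "<cluster-id>",
        }
    ],
}

scheduled_job = with_daily_schedule(job, hour=2, minute=30)
print(scheduled_job["schedule"]["quartz_cron_expression"])  # 0 30 2 * * ?
```

Returning a copy rather than mutating the input keeps the unscheduled definition reusable, for example to trigger the same job manually.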
Features of Azure Databricks:
Collaborative Notebooks – Quickly access and explore data, find and share new insights and build models collaboratively with the languages and tools of your choice.
Delta Lake – Bring data reliability and scalability to your existing data lake with an open source transactional storage layer designed for the full data lifecycle.
Integration with Azure Services – Complete your end-to-end analytics and machine learning solution with deep integration with Azure services such as Azure Data Factory, Azure Data Lake Storage, Azure Machine Learning and Power BI.
Interactive Workspaces – Easy and seamless coordination between Data Analysts, Data Scientists, Data Engineers and Business Analysts to ensure smooth collaboration.
Enterprise-Grade Security – Security provided by Microsoft Azure ensures protection of data with storage services and private workspaces.
Production Ready – Easily run, implement and monitor your heavy data-oriented jobs and job-related statistics.