Databricks Certified Associate Developer for Apache Spark 3.0 – How to pass the Python Certification Exam

Passing the Azure Databricks Apache Spark 3.0 Certification Exam for Python is no walk in the park. You will need at least a basic understanding of the following topics: Azure Databricks, Spark architecture (including Adaptive Query Execution), the Python programming language, and some experience writing SQL queries. You will also need to be able to apply the Spark DataFrame API to complete individual data manipulation tasks, including:

  • selecting, renaming and manipulating columns
  • filtering, dropping, sorting, and aggregating rows
  • joining, reading, writing and partitioning DataFrames
  • working with UDFs and Spark SQL functions

Here are some of the questions that will help you pass the exam:

What is the Spark driver?

As the part of the Spark application responsible for instantiating a SparkSession, the Spark driver has multiple roles: it communicates with the cluster manager; it requests resources (CPU, memory, etc.) from the cluster manager for Spark’s executors (JVMs); and it transforms all the Spark operations into DAG computations, schedules them, and distributes their execution as tasks across the Spark executors. Once the resources are allocated, it communicates directly with the executors. Note that the Spark driver is not horizontally scaled to increase overall processing output; there is exactly one driver per Spark application.
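
For context, here is a minimal PySpark sketch of a driver program instantiating a SparkSession (the application name and master URL are placeholders, not exam requirements):

from pyspark.sql import SparkSession

# The driver process creates the SparkSession, the entry point to Spark.
# "local[*]" is a placeholder master URL for a local test run.
spark = (SparkSession.builder
         .appName("certification-prep")   # hypothetical application name
         .master("local[*]")
         .getOrCreate())

print(spark.version)  # from here on, the driver coordinates the application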

What is the role of the Cluster manager?

The cluster manager is responsible for managing and allocating resources for the cluster of nodes on which your Spark application runs. Currently, Spark supports four cluster managers: the built-in standalone cluster manager, Apache Hadoop YARN, Apache Mesos, and Kubernetes.

What is the difference between Cluster Mode and Client Mode?

In cluster mode, the driver runs on a worker node, while in client mode the driver runs on the client machine. In cluster mode, the cluster manager is located on a node other than the client machine; from there it starts and ends executor processes on the cluster nodes as required by the Spark application running on the Spark driver. In client mode, it is the Spark driver that schedules tasks on the cluster, not the cluster manager.

 

What is the Spark executor?

A Spark executor runs on each worker node in the cluster. The executors communicate with the driver program and are responsible for executing tasks on the workers. In most deployment modes, only a single executor runs per node.

What is the responsibility of the executors in Spark?

Executors accept tasks from the driver, execute those tasks, and return results to the driver.

What are Deployment modes?

An attractive feature of Spark is its support for myriad deployment modes, enabling Spark to run in different configurations and environments. Because the cluster manager is agnostic to where it runs (as long as it can manage Spark’s executors and fulfill resource requests), Spark can be deployed in some of the most popular environments—such as Apache Hadoop YARN and Kubernetes—and can operate in different modes.

What are Spark Jobs?

During interactive sessions with Spark shells, the driver converts your Spark application into one or more Spark jobs. It then transforms each job into a DAG. This, in essence, is Spark’s execution plan, where each node within a DAG could be one or more Spark stages. A job is a sequence of stages and thus may span multiple stage boundaries.

What are Spark Stages?

As part of the DAG nodes, stages are created based on what operations can be performed serially or in parallel. Not all Spark operations can happen in a single stage, so they may be divided into multiple stages. Often stages are delineated on the operator’s computation boundaries, where they dictate data transfer among Spark executors. A stage is a group of tasks that can be executed in parallel to compute the same set of operations on potentially multiple machines.

What are Spark Tasks?

Each stage is comprised of Spark tasks (a unit of execution), which are then federated across each Spark executor; each task maps to a single core and works on a single partition of data. As such, an executor with 16 cores can have 16 or more tasks working on 16 or more partitions in parallel, making the execution of Spark’s tasks exceedingly parallel. A task is a combination of a block of data and a set of transformations that will run on a single executor. It is also the deepest level in Spark’s execution hierarchy.

The hierarchy is, from top to bottom: Job, Stage, Task.

Executors and slots facilitate the execution of tasks, but they are not directly part of the hierarchy. Executors are launched by the driver on worker nodes for the purpose of running a specific Spark application. Slots help Spark parallelize work.

What are Slots?

Slots are resources for parallelization within a Spark application. An executor can have multiple slots, which enable it to process multiple tasks in parallel. A JVM working as an executor can be considered a pool of slots for task execution. Spark does not actively create and destroy slots in accordance with the workload. Per executor, slots are made available according to how many cores per executor (property spark.executor.cores) and how many CPUs per task (property spark.task.cpus) the Spark configuration calls for. A slot can span multiple cores: if a task requires multiple cores, it has to be executed through a slot that spans multiple cores.
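
A hedged sketch of how the slot count per executor follows from those two properties (the numbers are purely illustrative):

from pyspark.sql import SparkSession

# Illustrative configuration: 4 cores per executor and 1 CPU per task
# yield 4 / 1 = 4 slots, i.e. up to 4 tasks running in parallel per executor.
spark = (SparkSession.builder
         .appName("slots-example")                 # hypothetical app name
         .config("spark.executor.cores", "4")
         .config("spark.task.cpus", "1")
         .getOrCreate())

cores_per_executor = int(spark.conf.get("spark.executor.cores"))
cpus_per_task = int(spark.conf.get("spark.task.cpus"))
print("Slots per executor:", cores_per_executor // cpus_per_task)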

What is Dynamic Partition Pruning?

Dynamic Partition Pruning provides an efficient way to selectively read data from files by skipping data that is irrelevant for the query. For example, if a query asks to consider only rows which have numbers > 12 in column purchases via a filter, Spark would only read the rows that match this criterion from the underlying files. This method works optimally when the filter is applied to a non-partitioned table (typically a small dimension table) and Spark can use it at runtime to prune the partitions of the partitioned table it is joined with.
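
A hedged sketch of the classic scenario, a partitioned fact table joined with a filtered, non-partitioned dimension table (the paths and column names are made up, and whether the optimization actually fires also depends on statistics and broadcast thresholds):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("dpp-example").getOrCreate()

# Dynamic partition pruning is enabled by default in Spark 3.0;
# the setting is shown here only for illustration.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

# Hypothetical fact table partitioned by store_id, plus a small,
# non-partitioned dimension table.
spark.range(1000).selectExpr("id % 10 AS store_id", "id AS purchases") \
    .write.mode("overwrite").partitionBy("store_id").parquet("/tmp/facts")
spark.createDataFrame([(1, "West"), (2, "East")], ["store_id", "region"]) \
    .write.mode("overwrite").parquet("/tmp/dims")

facts = spark.read.parquet("/tmp/facts")
dims = spark.read.parquet("/tmp/dims")

# The selective filter sits on the non-partitioned dimension table; at run
# time Spark derives a filter on store_id from it and reads only the
# matching partitions of the fact table.
result = facts.join(dims.filter(col("region") == "West"), "store_id")
result.explain()  # look for "dynamicpruning" in the physical plan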

What are worker nodes?

Worker nodes are machines that host the executors responsible for the execution of tasks.

What is a Shuffle? 

A shuffle is the process by which data is redistributed and compared across partitions, for example when Spark sorts, groups, or joins data. During a shuffle, data is compared across partitions because shuffling includes sorting, and sorting requires comparing data. Since, by definition, more than one partition is involved in a shuffle, the comparison happens across partition boundaries.
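
For illustration, a wide transformation such as groupBy triggers a shuffle, because rows with the same key have to be brought together from different partitions (the DataFrame and column names below are made up):

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as spark_sum

spark = SparkSession.builder.appName("shuffle-example").getOrCreate()

# Hypothetical sales data spread across several partitions.
salesDF = spark.createDataFrame(
    [("West", 10), ("East", 20), ("West", 5), ("East", 7)],
    ["region", "amount"],
)

# groupBy is a wide transformation: rows for the same region may live in
# different partitions, so Spark shuffles the data before aggregating.
totals = salesDF.groupBy("region").agg(spark_sum("amount").alias("total"))
totals.explain()  # the physical plan contains an Exchange (shuffle) step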

What are Accumulators?

In Spark, accumulators are only updated when the query that refers to them is actually executed. In other words, they are not updated if the query is not (yet) executed due to lazy evaluation. Accumulators are instantiated via the accumulator(n) method of the SparkContext, for example: counter = spark.sparkContext.accumulator(0)
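
A minimal sketch, assuming a hypothetical row-counting use case, of how the accumulator only changes once an action runs:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("accumulator-example").getOrCreate()

counter = spark.sparkContext.accumulator(0)

def count_row(row):
    # Runs on the executors; each call adds to the driver-side accumulator.
    counter.add(1)

df = spark.range(100)      # DataFrame with 100 rows, nothing executed yet
df.foreach(count_row)      # foreach is an action, so the updates happen now
print(counter.value)       # 100 - only reliable after the action has run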

What is Lazy Evaluation?

During definition, due to lazy evaluation, the job is not executed and thus certain errors, for example reading from a non-existing file, cannot be caught. To be caught, the job needs to be executed, for example through an action.
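
A hedged illustration using the RDD API, where the missing input path (a placeholder) only causes an error once an action forces execution:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval-example").getOrCreate()
sc = spark.sparkContext

# Defining the RDD succeeds even though the path does not exist.
rdd = sc.textFile("/path/that/does/not/exist")      # placeholder path
upper = rdd.map(lambda line: line.upper())          # still only a plan

# The missing file is only discovered when an action triggers execution.
try:
    upper.count()
except Exception as e:
    print("Error surfaced at action time:", type(e).__name__)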

What are RDDS?

RDDs (Resilient Distributed Datasets) are the foundation of Spark DataFrames and are immutable. As such, DataFrames are immutable, too. Any command that changes anything in the DataFrame therefore necessarily returns a copy, or a new version, of it that has the changes applied.
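
A small illustration of that immutability: withColumn does not modify the (hypothetical) DataFrame storesDF but returns a new one with the change applied:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("immutability-example").getOrCreate()

# Hypothetical DataFrame used only for illustration.
storesDF = spark.createDataFrame([(1, 1000), (2, 2500)], ["storeId", "sqft"])

# withColumn returns a NEW DataFrame; storesDF itself is unchanged.
updatedDF = storesDF.withColumn("sqft_k", col("sqft") / 1000)

print(storesDF.columns)   # ['storeId', 'sqft']
print(updatedDF.columns)  # ['storeId', 'sqft', 'sqft_k']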

What is difference between Transformations and actions?

Transformations are business logic operations that do not induce execution, while actions are execution triggers focused on returning results.
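
For illustration, a few common examples of each (the DataFrame is hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("transformations-vs-actions").getOrCreate()
storesDF = spark.createDataFrame([(1, 1000), (2, 2500)], ["storeId", "sqft"])

# Transformations: lazily build up the query plan, nothing is executed yet.
selected = storesDF.select("storeId", "sqft")
filtered = selected.filter(col("sqft") > 1500)

# Actions: trigger execution of the plan and return results to the driver.
print(filtered.count())    # number of matching rows
print(filtered.collect())  # the rows themselves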

What is the difference between coalesce() and repartition()?

repartition() always triggers a full shuffle, whereas coalesce() does not perform a full shuffle of the data. Whenever you see “full shuffle”, you know that you are not dealing with coalesce(). While coalesce() can perform a partial shuffle when required, it tries to minimize shuffle operations and therefore the amount of data that is sent between executors.
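
A hedged sketch contrasting the two (the partition counts are arbitrary):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-vs-coalesce").getOrCreate()

df = spark.range(1000).repartition(8)   # full shuffle into 8 partitions
print(df.rdd.getNumPartitions())        # 8

# coalesce only merges existing partitions, avoiding a full shuffle;
# it is cheaper but can only reduce the number of partitions.
fewer = df.coalesce(2)
print(fewer.rdd.getNumPartitions())     # 2

# Increasing the partition count again requires repartition (full shuffle).
more = fewer.repartition(16)
print(more.rdd.getNumPartitions())      # 16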

How to convert a DataFrame column from one type to another type?

col().cast()
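
For example, casting a (hypothetical) column sqft from string to integer:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("cast-example").getOrCreate()

# Hypothetical DataFrame in which sqft arrives as a string column.
storesDF = spark.createDataFrame([(1, "1000"), (2, "2500")], ["storeId", "sqft"])

# cast() returns a new column of the target type; withColumn swaps it in.
castedDF = storesDF.withColumn("sqft", col("sqft").cast("int"))
castedDF.printSchema()   # sqft is now an integer column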

How to drop rows containing missing values?

storesDF.na.drop("any") – drops rows containing at least one missing value ("any" is the default); storesDF.na.drop("all") drops only rows in which every column is missing.

How to get mean of column sqft?

storesDF.agg(mean(col("sqft")).alias("sqftMean"))

How to return number of rows in DataFrame storesDF?

storesDF.count()

How to return the sum of the values in column sqft in DataFrame storesDF grouped by distinct value in column division?

storesDF.groupBy("division").agg(sum(col("sqft")))

How to return a DataFrame containing summary statistics only for column sqft in DataFrame storesDF?

storesDF.describe("sqft")

How to sort the rows of a DataFrame?

sort() and orderBy()

How to return DataFrame transactionsDf sorted in descending order by column predError, showing missing values last?

transactionsDf.sort("predError", ascending=False) – in descending order, Spark places missing (null) values last by default.

How to return a 15 percent sample of rows from DataFrame storesDF without replacement?

storesDF.sample(fraction=0.15) – the withReplacement argument defaults to False, so this samples without replacement; passing True as the first argument would make the sampling with replacement instead.

Memory Facts

  • The default storage level for persist() on a non-streaming DataFrame/Dataset is MEMORY_AND_DISK.
  • The default storage level of DataFrame.cache() is MEMORY_AND_DISK.
  • MEMORY_ONLY means that all partitions that do not fit into memory will be recomputed when they are needed.
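
A hedged sketch of those defaults in code (the DataFrame is arbitrary):

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-level-example").getOrCreate()
df = spark.range(1_000_000)

# persist() and cache() both default to MEMORY_AND_DISK for DataFrames.
df.cache()
print(df.storageLevel)        # shows the storage level that cache() applied
df.unpersist()

# MEMORY_ONLY keeps partitions in memory only; partitions that do not fit
# are recomputed from the lineage when needed instead of spilling to disk.
df.persist(StorageLevel.MEMORY_ONLY)
print(df.storageLevel)
df.unpersist()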

Duration

Testers will have 120 minutes to complete the certification exam.

Questions

There are 60 multiple-choice questions on the certification exam. The questions will be distributed by high-level topic in the following way:

  • Apache Spark Architecture Concepts – 17% (10/60)
  • Apache Spark Architecture Applications – 11% (7/60)
  • Apache Spark DataFrame API Applications – 72% (43/60)

Cost

Each attempt of the certification exam will cost the tester $200. Testers might be subject to taxes depending on their location. Testers are able to retake the exam as many times as they would like, but they will need to pay $200 for each attempt.


Resources to prepare:

Free ebook: O’Reilly’s Learning Spark, 2nd Edition

To take the exam you will need to sign up on the Databricks Certification Exam website and download their software, which will monitor you during the exam via web camera.
