Apache Spark for Data Engineers - Advanced Optimizations

15.000 CZK

Price (without VAT)

1. 8. 2025

virtual

Apache Spark is a general-purpose computing engine that provides a unified framework for big data processing, ad hoc analytics, machine learning, graph processing, and streaming. Over the past few years, Spark has become a standard for handling these workloads, and not only within the big data ecosystem. Its popularity is also growing thanks to the high-level DataFrame API, which lets you express business logic in a concise and expressive way.

This training focuses on advanced Spark SQL topics that affect the performance of Spark jobs, such as optimizing execution plans, eliminating shuffles, choosing an optimal data distribution, and reusing computation efficiently. The goal is to learn techniques for getting the best possible performance out of Spark jobs.

The training is taught in Python, using Spark in local mode (Spark 3.2 in a Jupyter notebook).

Audience

Data engineers, scientists, and other Spark users who already have some experience with the engine and want to deepen their knowledge, learn how to optimize Spark jobs, and achieve maximal performance.

Goals

Participants will learn to:

Understand and interpret execution plans in Spark SQL
Rewrite queries to achieve better performance
Use selected internal configuration settings efficiently
Prepare data for analytical purposes using Spark
Find the bottleneck of a Spark job

Guarantor of the Training

David Vrba Ph.D.

David Vrba Ph.D. works at Socialbakers as a data scientist and Spark consultant. On a daily basis he optimizes ETL pipelines built in Spark and develops jobs that process data at a scale of up to tens of TBs. David also teaches Spark trainings and workshops, and over the last two years he has trained several teams in Spark, including data engineers, data analysts, and researchers. He also contributes to the Spark source code and is active in the community, giving public talks at conferences and meetups such as Spark + AI Summit, MLPrague, and Spark + AI Prague Meetup.

Outline

Spark SQL internals (Query Execution)

Logical planning (Catalog, Analyzer, Cache Management, Optimizer)
- Catalyst API
- Extending the optimizer
- Limiting the optimizer
Physical planning
- Query planner, strategies
- Spark plan
- Executed plan
- Understanding operators in the physical plan
Cost based optimizer
- How cost-based optimizations work
- Statistics collection
- Statistics usage

Query optimization

Shuffle elimination
- Bucketing
- Data repartition (when and how)
Optimizing joins
- Shuffle-free join
- One-side shuffle-free join
- Broadcast join vs sort-merge join
Data reuse
- Caching
- Checkpointing
- Exchange reuse

Optimization tips

Choose the appropriate number of shuffle partitions
Nondeterministic expressions
Configuration settings
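
As a rough illustration of the first tip, the number of shuffle partitions (controlled in Spark SQL by `spark.sql.shuffle.partitions`, which defaults to 200) is often derived from the amount of shuffled data. The sketch below is plain Python, not course material: the helper name `suggest_shuffle_partitions`, the 128 MiB target partition size, and the rounding to a multiple of the core count are all assumptions chosen for illustration.

```python
import math


def suggest_shuffle_partitions(shuffle_bytes,
                               target_partition_bytes=128 * 1024**2,
                               total_cores=1):
    """Hypothetical helper: estimate a shuffle partition count.

    Assumption: each partition should hold roughly target_partition_bytes,
    and the count is rounded up to a multiple of total_cores so that
    every core has work in each wave of tasks.
    """
    if shuffle_bytes <= 0:
        return total_cores
    # Partitions needed to keep each one near the target size.
    raw = math.ceil(shuffle_bytes / target_partition_bytes)
    # Round up to the next multiple of the available cores.
    return max(total_cores, math.ceil(raw / total_cores) * total_cores)


if __name__ == "__main__":
    # 50 GiB of shuffled data on 16 cores.
    print(suggest_shuffle_partitions(50 * 1024**3, total_cores=16))
```

The resulting value could then be applied with `spark.conf.set("spark.sql.shuffle.partitions", n)`; the right target size depends on the workload and is best checked against the job's actual shuffle metrics.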

Data layout

Different file formats
- Parquet vs JSON
Partitioning and bucketing
- How bucketing works
- How to ensure the proper number of files
Tables management
- Working with the Catalog API
Delta-io
- Open-source storage layer with ACID transactions
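
The idea behind the bucketing topic above can be sketched without Spark at all: each row is assigned to a bucket by hashing the bucketing key modulo the number of buckets, so two datasets bucketed the same way can be joined bucket by bucket without moving rows between buckets. The following is a simplified plain-Python model, not Spark's implementation: Spark uses a Murmur3-based hash, whereas `zlib.crc32` is used here only as a deterministic stand-in, and all function names are hypothetical.

```python
import zlib
from collections import defaultdict


def bucket_id(key, num_buckets):
    # Stand-in hash (assumption): Spark itself uses Murmur3, not CRC32.
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def bucketize(rows, num_buckets):
    """Group (key, value) rows by the bucket of their key."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_id(row[0], num_buckets)].append(row)
    return buckets


def bucketed_join(left, right, num_buckets):
    """Inner join of (key, value) rows, processed one bucket at a time.

    Equal keys always hash to the same bucket index on both sides,
    so each bucket pair can be joined independently (no "shuffle").
    """
    left_b = bucketize(left, num_buckets)
    right_b = bucketize(right, num_buckets)
    result = []
    for b in range(num_buckets):
        # Build a lookup for the right side of this bucket only.
        lookup = defaultdict(list)
        for key, value in right_b.get(b, []):
            lookup[key].append(value)
        for key, value in left_b.get(b, []):
            for other in lookup.get(key, []):
                result.append((key, value, other))
    return result


if __name__ == "__main__":
    orders = [(1, "a"), (2, "b"), (3, "c")]
    users = [(1, "x"), (3, "z"), (4, "w")]
    print(sorted(bucketed_join(orders, users, num_buckets=4)))
```

In Spark, the corresponding layout is produced when writing a table with `DataFrameWriter.bucketBy(numBuckets, col)` followed by `saveAsTable`; the model above only shows why matching bucket counts and keys on both sides make a shuffle-free join possible.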

Prerequisites

This is a follow-up to the course Apache Spark - From Simple Transformations to Highly Efficient Jobs, in which participants gain (among other things) a solid understanding of the DataFrame API and basic knowledge of Spark's internal processes.

To get the most out of this training, you should have some previous experience with Spark (for example, at the level of the aforementioned course), know the DataFrame API, and understand the basic principles of distributed computing.