Apache Spark for Data Engineers - Advanced Optimizations
Apache Spark is a general-purpose computing engine that provides a unified framework for big data processing, ad hoc analytics, machine learning, graph processing, and streaming. Over the past few years, Spark has become a standard for handling these workloads, and not just in the big data ecosystem. Its popularity is also growing thanks to the high-level DataFrame API, which allows business logic to be expressed in a very concise and expressive way.
This training focuses on advanced topics of Spark SQL that affect the resulting performance of Spark jobs, such as optimizing execution plans, eliminating shuffles, choosing an optimal distribution of data, reusing computation efficiently, and others. The goal of this training is to teach techniques for achieving maximum performance.
The training is taught using the Scala programming language in Spark local mode (Spark 2.4 with a Zeppelin notebook).
Data engineers, data scientists, and other Spark users who already have some prior experience with the engine and want to gain deeper knowledge, learn how to optimize Spark jobs, and achieve maximum performance.
Participants will learn:
- Understand and interpret execution plans in Spark SQL
- Rewrite queries in order to achieve better performance
- Use internal configuration settings efficiently
- Prepare data for analytical purposes using Spark
- Find the bottleneck of a Spark job
Guarantor of the Training
David Vrba Ph.D.
David Vrba, Ph.D., works at Socialbakers as a data scientist and Spark consultant. On a daily basis he optimizes ETL pipelines built in Spark and develops jobs that process data at scales of up to tens of TBs. David also teaches Spark trainings and workshops, and over the last two years he has trained several teams in Spark, including data engineers, data analysts, and researchers. He also contributes to the Spark source code and is active in the community, giving public talks at conferences and meetups such as Spark + AI Summit, MLPrague, and the Spark + AI Prague Meetup.
Spark SQL internals (Query Execution)
Logical planning (Catalog, Analyzer, Cache Manager, Optimizer)
- Catalyst API
- Extending the optimizer
- Limiting the optimizer
- Query planner, strategies
- Spark plan
- Executed plan
- Understanding operators in physical plan
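The plans covered above can be inspected directly from the DataFrame API. A minimal sketch in Spark local mode (the data and column names are illustrative; in Zeppelin the provided `spark` session can be used instead of building one):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("explain-demo")
  .getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b"), (2, "c")).toDF("id", "value")
val query = df.groupBy("id").count()

// With the extended flag, explain prints the parsed, analyzed, and
// optimized logical plans as well as the physical plan; without it,
// only the physical plan with its operators is shown.
query.explain(true)
```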
Cost based optimizer
- How cost based optimizations work
- Statistics collection
- Statistics usage
- Implement simple optimization rule
- Fix a query based on the information from the query plan
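As a sketch of how statistics collection looks in practice, the `ANALYZE TABLE` commands below gather table-level and column-level statistics for the cost-based optimizer (the table name `sales` is illustrative; note that the CBO is disabled by default in Spark 2.4):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("cbo-demo")
  .getOrCreate()
import spark.implicits._

// Enable the cost-based optimizer so collected statistics are used:
spark.conf.set("spark.sql.cbo.enabled", "true")

// Create a small managed table to analyze:
Seq((1, 10.0, "CZ"), (2, 5.0, "SK")).toDF("id", "price", "country")
  .write.mode("overwrite").saveAsTable("sales")

// Table-level statistics (row count, size in bytes):
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")

// Column-level statistics (distinct counts, min/max) used for
// selectivity and join-size estimates:
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS price, country")

// Inspect what was collected:
spark.sql("DESCRIBE EXTENDED sales").show(truncate = false)
```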
- Data repartition (when and how)
- Shuffle-free join
- One-side shuffle-free join
- Broadcast join vs sort-merge join
- Exchange reuse
- Choose appropriate number of shuffle partitions
- Nondeterministic expressions
- Configuration settings
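Two of the techniques listed above can be sketched briefly: tuning the number of shuffle partitions, and replacing a sort-merge join with a broadcast join via an explicit hint (table and column names are illustrative; small tables under `spark.sql.autoBroadcastJoinThreshold` may be broadcast automatically):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("join-demo")
  .getOrCreate()
import spark.implicits._

// Tune the number of shuffle partitions (default 200) to the data volume:
spark.conf.set("spark.sql.shuffle.partitions", "8")

val facts = Seq((1, 100.0), (2, 50.0), (1, 25.0)).toDF("country_id", "amount")
val dims  = Seq((1, "CZ"), (2, "SK")).toDF("country_id", "country")

// The broadcast hint ships the small side to every executor, so the
// physical plan uses BroadcastHashJoin and the big side avoids a shuffle:
val joined = facts.join(broadcast(dims), Seq("country_id"))
joined.explain()
```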
Different file formats
Parquet vs JSON
Partitioning and bucketing
- How bucketing works
- How to ensure appropriate number of files
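The two bucketing points above can be sketched as follows (table and column names are illustrative): the table is written with `bucketBy`, and repartitioning by the bucket column before the write is one way to keep the number of output files in check, since each writing task otherwise emits one file per bucket it holds data for.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("bucketing-demo")
  .getOrCreate()
import spark.implicits._

val events = Seq((1, "click"), (2, "view"), (1, "view")).toDF("user_id", "event")

val numBuckets = 4
events
  // Pre-shuffle by the bucket column so each task holds whole buckets,
  // keeping the file count close to one file per bucket:
  .repartition(numBuckets, col("user_id"))
  .write
  .mode("overwrite")
  .bucketBy(numBuckets, "user_id")
  .sortBy("user_id")
  .saveAsTable("events_bucketed")

// Later joins or aggregations on user_id can then skip the exchange:
spark.table("events_bucketed").groupBy("user_id").count().explain()
```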
Open source storage layer with ACID transactions
Prepare data for analytical queries
This is a follow-up training to the course Apache Spark - From Simple Transformations to Highly Efficient Jobs, where participants gain (among other things) a solid understanding of the DataFrame API and basic knowledge of Spark's internal processes.
To get the most out of this training, it is recommended to have some previous experience with Spark (for example, at the level of the aforementioned course), to know the DataFrame API, and to understand the basic principles of distributed computing.