Courses and certifications Open Source

Big data

Apache Spark - From Simple Transformations to Highly Efficient Jobs

25.900 CZK

Price (without VAT)

28. 7. – 29. 7. 2025

virtual

Back

Apache Spark is a general purpose computing engine that provides unified framework for big data processing, ad hoc analytics, machine learning, graph processing and streaming. For the past few years Spark is becoming standard for handling these workloads not just in the big data ecosystem. It’s popularity is growing also because of the high level DataFrame API that allows to express business logic in very high level and expressive way.

This training covers Spark from three perspectives: The first part is devoted to the programming interface of the DataFrame API which allows you to quickly build Spark applications. In the second part we focus on Spark architecture, explain how things work under the cover in the Spark SQL and execution layer and use that understanding to achieve high performance queries. In the last part we explore machine learning and graph processing techniques that Spark provides for advanced data analysis.

Audience

Data Scientists with no or little experience of Apache Spark that want to quickly learn the technology for doing ad hoc analysis or building machine learning applications
Data Engineers with some experience of Apache Spark that want to understand better the internal processes of Spark and use that knowledge in order to write high performance queries and ETL jobs

Goals

Participants will learn:

Basic concepts of Apache Spark and distributed computing
How to use DataFrame API in Spark for ETL jobs or ad hoc data analysis
How the DataFrame API works under the hood
How the optimization engine works in Spark
How is Spark application executed
How to understand query plans and use that information to optimize queries
Basic concepts of library ML Pipelines for machine learning
Basic concepts of library GraphFrames for graph processing
How to process data in (nearly) real time in Spark (Structured Streaming)
News in Spark 2.3, 2.4, 3.0

Guarantor of the Training

DAVID VRBA

David Vrba Ph.D. works at Socialbakers as a data scientist and Spark consultant. On daily basis he is optimizing ETL pipelines built in Spark and develops jobs that process data on the scale up to tens of TBs. David is also lecturing Spark trainings and workshops and during the last two years he trained in Spark several teams including data engineers, data analysts and researchers. David is also contributing to the Spark source code and is active in the community by giving public talks at conferences and meetups such as Spark + AI Summit, MLPrague or Spark + AI Prague Meetup.

Outline

Introduction to Apache Spark

High level introduction to Spark
Introduction to Spark architecture
Spark APIs: high level vs low level vs internal APIs

Structured APIs in Spark

Basic concepts of DataFrame API
DataFrame, Row, Column
Operations in SparkSQL: transformations, actions
Working with DataFrame: creating a DataFrame and basic transformations
Working with different data types (Integer, String, Date, Timestamp, Boolean)
FilteringConditions
Dealing with null valuesJoins

Lab I

Simple ETL

Advanced transformations with DataFrames

Aggregations and Window functions
User Defined Functions
Higher Order Functions and complex data types (news in Spark 2.4)

Lab II

Analyzing data using DataFrame API

Metastore and tables

Catalog API
Table creation
Saving data
Caveats to be careful about

Lab III

Saving data and working with tables

Internal processes in Spark SQL

Catalyst - Optimization engine in Spark
Logical Planning
Physical Planning

Execution Layer

Introduction to low level APIs: RDDs
Structure of Spark job (Stages, Tasks, Shuffle)
DAG SchedulerLifecycle of Spark application

Lab IV

Spark UI

Performance Tuning

Data persistence: caching, checkpointing
Bucketing & Partitioning
Most often bottlenecks in Spark applications
Optimization tips

Introduction to advanced analytics in Spark

Machine learning: basic concepts of ML Pipelines
Graph processing: basic concepts of GraphFrames library

Lab V

Machine learning and Graph processing

Structured Streaming

Basic concepts of streaming in Spark
Stateful vs stateless transformations
Event time processing
What is watermark and how to use it to close the state
Real time vs near real time processing

Prerequisites

There is no prior knowledge of Spark required to pass this training. Very basic knowledge of Python and SQL is advantage but it is not a prerequisite. The training is taught in the Jupyter notebook environment using Python programming language.

Inquire course

Reviews

16. 8. 2024

Igor Kováč

5. 12. 2023

HIgh-quality course with a professional lecturer. Igor Kováč, ČSOB

všechny recenze