Apache Spark - From Simple Transformations to Highly Efficient Jobs

23.500 CZK
7. 11. - 8. 11. 2019

Apache Spark is a general-purpose computing engine that provides a unified framework for big data processing, ad hoc analytics, machine learning, graph processing and streaming. Over the past few years Spark has become a standard for handling these workloads, and not only in the big data ecosystem. Its popularity is also growing thanks to the high-level DataFrame API, which lets you express business logic in a concise and expressive way.

This training covers Spark from three perspectives. The first part is devoted to the programming interface of the DataFrame API, which allows you to quickly build Spark applications. In the second part we focus on Spark architecture, explain how things work under the hood in Spark SQL and the execution layer, and use that understanding to write high-performance queries. In the last part we explore the machine learning and graph processing techniques that Spark provides for advanced data analysis.

Target audience



  • Data Scientists with little or no experience of Apache Spark who want to quickly learn the technology for ad hoc analysis or for building machine learning applications
  • Data Engineers with some experience of Apache Spark who want to better understand Spark's internal processes and use that knowledge to write high-performance queries and ETL jobs


Participants will learn:

  • Basic concepts of Apache Spark and distributed computing
  • How to use the DataFrame API in Spark for ETL jobs or ad hoc data analysis
  • How the DataFrame API works under the hood
  • How Spark's optimization engine works
  • What happens under the hood when you submit a query for execution
  • How a Spark application is executed
  • How to understand query plans and use that information to optimize queries
  • Advanced optimization techniques to achieve high performance
  • Basic concepts of the ML Pipelines and GraphFrames libraries
  • How to use these libraries for advanced analytics
  • Basic concepts of Structured Streaming in Spark
  • How to use three different environments for running Spark: Databricks CE, local Spark with Jupyter Notebook, and vanilla Spark on Amazon EMR
  • What's new in Spark 2.2, 2.3 and 2.4

Course guarantor


David Vrba, Ph.D., works at Socialbakers as a data scientist and Spark consultant. Over the past year he has trained several teams in Spark, including data engineers, data analysts and researchers. Besides running Spark trainings and workshops, he optimizes various ETL pipelines built in Spark, writes jobs that process data at scales of up to terabytes, and works on machine learning and predictive analytics projects.


Introduction to Apache Spark

  • High level introduction to Spark
  • Introduction to Spark architecture
  • Spark APIs: high level vs low level vs internal APIs

Structured APIs in Spark

  • Basic concepts of DataFrame API
  • DataFrame, Row, Column
  • Operations in Spark SQL: transformations and actions
  • Working with DataFrame: creating a DataFrame and basic transformations
  • Working with different data types (Integer, String, Date, Timestamp, Boolean)
  • Filtering
  • Conditions
  • Dealing with null values
  • Joins
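
As a rough illustration of the "transformations vs. actions" topic listed above, here is a minimal plain-Python sketch. It is not Spark code and does not use Spark's API: `LazyFrame` and its methods are hypothetical names chosen for this example. It only shows the general idea that transformations record steps in a plan, while an action is what actually runs them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


@dataclass(frozen=True)
class LazyFrame:
    """Hypothetical stand-in for a DataFrame (an assumption, not Spark's design)."""

    rows: Tuple[Row, ...]
    # Recorded steps; nothing is computed until collect() is called.
    plan: Tuple[Tuple[str, Callable[[Row], object]], ...] = ()

    def filter(self, predicate: Callable[[Row], bool]) -> "LazyFrame":
        # Transformation: returns a new frame with one more step in the plan.
        return LazyFrame(self.rows, self.plan + (("filter", predicate),))

    def select(self, *columns: str) -> "LazyFrame":
        # Transformation: keeps only the requested columns, again lazily.
        def project(row: Row) -> Row:
            return {c: row[c] for c in columns}

        return LazyFrame(self.rows, self.plan + (("select", project),))

    def collect(self) -> List[Row]:
        # Action: executes every recorded step in order and returns the rows.
        result = list(self.rows)
        for kind, fn in self.plan:
            if kind == "filter":
                result = [r for r in result if fn(r)]
            else:
                result = [fn(r) for r in result]
        return result
```

Chaining `filter(...)` and `select(...)` on such a frame only grows `plan`; the rows are touched when `collect()` is called.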

Lab I

  • Simple ETL

Advanced transformations with DataFrames

  • Aggregations and Window functions
  • User Defined Functions
  • Higher-order functions and complex data types (new in Spark 2.4)

Lab II

  • Analyzing data using DataFrame API

Internal processes in Spark SQL

  • Catalyst - Optimization engine in Spark
  • Logical Planning (Analyzer, Optimizer, Catalog, Cache Manager)
  • Rule based optimizations
  • Cost based optimizations
  • Physical Planning (Query Planner, Strategies, WholeStage CodeGen)

Execution Layer

  • Introduction to low level APIs: RDDs
  • Structure of a Spark job (Stages, Tasks, Shuffle)
  • DAG Scheduler
  • Lifecycle of a Spark application


Lab III

  • RDD transformations

Performance in Spark

  • Data persistence: caching, checkpointing
  • Most common bottlenecks in Spark applications
  • How to avoid shuffle
  • Bucketing & Partitioning
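
To hint at why bucketing can help avoid a shuffle, here is a small plain-Python sketch of the idea. It is a conceptual illustration only, not how Spark stores or joins data: `bucket_of`, `write_bucketed` and `join_bucketed` are hypothetical helpers, and the CRC32 hash is an arbitrary choice made for this example. The point is that if both sides are pre-split by the same hash of the join key into the same number of buckets, matching rows always sit in buckets with the same index, so each pair of buckets can be joined on its own.

```python
import zlib
from collections import defaultdict


def bucket_of(key, num_buckets):
    # Deterministic hash of the join key (CRC32 is an assumption of this sketch).
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def write_bucketed(rows, key, num_buckets):
    # Split rows into num_buckets lists according to the hash of the key column.
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(row[key], num_buckets)].append(row)
    return buckets


def join_bucketed(left_buckets, right_buckets, key):
    # Both sides must use the same bucket count, otherwise indexes do not align.
    if len(left_buckets) != len(right_buckets):
        raise ValueError("both sides need the same number of buckets")
    joined = []
    # Bucket i on the left is only ever compared with bucket i on the right,
    # so no rows have to be moved between buckets at join time.
    for left, right in zip(left_buckets, right_buckets):
        index = defaultdict(list)
        for r in right:
            index[r[key]].append(r)
        for l in left:
            for r in index.get(l[key], []):
                joined.append({**l, **r})
    return joined
```

The cost of hashing and redistributing rows is paid once, when the data is written, instead of on every join over the same key.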

Lab IV

  • Query Optimizations

Introduction to advanced analytics in Spark

  • Basic concepts of ML Pipelines (native library for machine learning)
  • Basic concepts of GraphFrames (library for graph processing)

Lab V

  • Machine learning & Graph processing

Structured Streaming

  • Basic concepts of streaming in Spark
  • Stateful vs stateless transformations
  • Event time processing
  • What a watermark is and how to use it to close state
  • Real time vs near real time processing
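
The watermark idea from the list above can be sketched in a few lines of plain Python. This is a simplified model, not the Structured Streaming API: `windowed_counts` is a hypothetical function, event times are plain integers, and the watermark is updated after every event rather than per micro-batch. The watermark is taken to be the largest event time seen so far minus an allowed delay; windows that end at or before it are finalized and removed from state, and events arriving for such windows are dropped as too late.

```python
from collections import defaultdict


def windowed_counts(event_times, window=10, delay=5):
    """Count events per tumbling window, closing state with a watermark.

    event_times: event timestamps (ints) in the order they arrive.
    Returns (closed_windows, open_windows, dropped_events).
    """
    open_windows = defaultdict(int)  # window start -> running count (state)
    closed_windows = {}              # window start -> final count
    dropped = []                     # events that arrived after their window closed
    max_event_time = float("-inf")

    for t in event_times:
        start = (t // window) * window
        watermark = max_event_time - delay
        if start + window <= watermark:
            # The window for this event is already closed: too late, drop it.
            dropped.append(t)
            continue
        open_windows[start] += 1
        # Advance the watermark and finalize every window that ends before it.
        max_event_time = max(max_event_time, t)
        watermark = max_event_time - delay
        for s in [s for s in open_windows if s + window <= watermark]:
            closed_windows[s] = open_windows.pop(s)

    return closed_windows, dict(open_windows), dropped
```

With `window=10` and `delay=5`, an event at time 3 that arrives after an event at time 12 is still counted, because the watermark (7) has not yet passed the end of the first window (10). An event at time 8 that arrives after an event at time 27 is dropped, because the watermark (22) has.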

Lab VI

  • Structured Streaming API


No prior knowledge of Spark is required to take this training. Very basic knowledge of Python and SQL is an advantage but not a prerequisite. The training is taught in the Jupyter notebook environment using the Python programming language.

Technical requirements (BYOD)

  • Jupyter notebook installed with Spark 2.2 or higher
