Apache Spark 3.0 for Data Scientists - Advanced Analytics

Price (without VAT): 15.000 CZK

16. 5. 2024, virtual
21. 6. 2024, virtual
16. 8. 2024, virtual

Apache Spark is a general-purpose computing engine that provides a unified framework for big data processing, ad hoc analytics, machine learning, graph processing, and streaming. Over the past few years, Spark has become a standard for handling these workloads, and not only in the big data ecosystem. Its popularity is also growing thanks to the high-level DataFrame API, which lets you express business logic concisely and expressively.

This training focuses on four main areas of data analytics. The first is interactive data analytics with the DataFrame API, where we will also see how Spark integrates with the popular Python library Pandas. The second is machine learning with the native ML Pipelines library, where we will look at how to train ML models and create ML prototypes. The third is deep learning and Spark's integration with deep learning frameworks such as TensorFlow and Keras. The last is graph processing with the GraphFrames library.

The training is taught in the Python programming language, using Spark in local mode with Jupyter notebooks. We will also focus on the new features of Spark 3.0.

Audience

Data analysts, scientists, and researchers who already have some experience with Spark and want to learn how to use it for advanced analytics, machine learning, deep learning, or graph processing.
All Spark users who want to see where the technology is heading in the most recent version (especially in the area of data analytics).

Goals

Participants will learn:

How to analyze data using Spark
How to train ML models and create ML prototypes in Spark
How to integrate Spark with other data science technologies such as Pandas, SciPy, TensorFlow, and Keras
State-of-the-art features of the latest version of Spark

Guarantor of the Training

DAVID VRBA

David Vrba, Ph.D., works at Socialbakers as a data scientist and Spark consultant. On a daily basis he optimizes ETL pipelines built in Spark and develops jobs that process data at scales of up to tens of TBs. David also teaches Spark trainings and workshops, and over the last two years he has trained several teams in Spark, including data engineers, data analysts, and researchers. He also contributes to the Spark source code and is active in the community, giving public talks at conferences and meetups such as Spark + AI Summit, MLPrague, and Spark + AI Prague Meetup.

Outline

Data analysis with DataFrame API

Advanced features of DataFrame API
Integration with Pandas

Lab I

Analyzing data with DataFrame API

Machine learning with ML Pipelines

Basic concepts: Transformer, Estimator, Evaluator, Pipeline
Training/saving/loading a model
Classification problems
Cluster analysis

Lab II

Training ML prototypes

Deep learning

Integration with TensorFlow and Keras
Image processing
Transfer learning

Lab III

Large-scale inference with a DL model

Graph processing with GraphFrames

Basic concepts: Vertices & Edges
Running graph algorithms

Prerequisites

This is a follow-up to the course Apache Spark - From Simple Transformations to Highly Efficient Jobs, where participants gain (among other things) a solid understanding of the DataFrame API and a basic introduction to analytics in Spark.

To get the most out of this training, we recommend having some previous experience with Spark (for example, at the level of the aforementioned course), knowing the DataFrame API, and understanding the basic principles of machine learning and data analytics in general.
