Apache Spark 3.0 for Data Scientists - Advanced Analytics
Apache Spark is a general-purpose computing engine that provides a unified framework for big data processing, ad hoc analytics, machine learning, graph processing, and streaming. Over the past few years Spark has become the standard for handling these workloads, not just in the big data ecosystem. Its popularity is also growing thanks to the high-level DataFrame API, which allows business logic to be expressed in a very concise and expressive way.
This training focuses on four main areas of data analytics. The first is interactive data analytics with the DataFrame API, where we will also see how Spark integrates with the popular Python library Pandas. The second is machine learning with the native ML Pipelines library, where we will look at how to train ML models and create ML prototypes. The next topic is deep learning and Spark's integration with deep learning frameworks such as TensorFlow and Keras. The last area is graph processing with the GraphFrames library.
The training is taught in the Python programming language, using Spark in local mode with Jupyter notebooks. We will also focus on the new features and properties of Spark 3.0.
The training is intended for:
- Data analysts, scientists, and researchers who already have some previous experience with Spark and want to learn how to use it for advanced analytics, machine learning, deep learning, or graph processing,
- All Spark users who want to see where the technology is heading in the most up-to-date version (especially in the area of data analytics).
Participants will learn:
- How to analyze data using Spark
- How to train ML models and create ML prototypes in Spark
- How to integrate Spark with other data science technologies such as Pandas, SciPy, TensorFlow, and Keras
- State-of-the-art features of the latest version of Spark
Guarantor of the Training
David Vrba Ph.D. works at Socialbakers as a data scientist and Spark consultant. On a daily basis he optimizes ETL pipelines built in Spark and develops jobs that process data at scales of up to tens of terabytes. David also teaches Spark trainings and workshops; over the last two years he has trained several teams in Spark, including data engineers, data analysts, and researchers. He contributes to the Spark source code and is active in the community, giving public talks at conferences and meetups such as Spark + AI Summit, MLPrague, and the Spark + AI Prague Meetup.
Data analysis with DataFrame API
- Advanced features of DataFrame API
- Integration with Pandas
- Analyzing data with DataFrame API
Machine learning with ML Pipelines
- Basic concepts: Transformer, Estimator, Evaluator, Pipeline
- Training/saving/loading a model
- Classification problems
- Cluster analysis
- Training ML prototypes
- Integration with TensorFlow and Keras
- Image processing
- Transfer learning
- Inference with DL models at large scale
Graph processing with GraphFrames
- Basic concepts: Vertices & Edges
- Running graph algorithms
This is a follow-up to the course Apache Spark - From Simple Transformations to Highly Efficient Jobs, where participants gain (among other things) a solid understanding of the DataFrame API and a basic introduction to analytics in Spark.
To get the most out of this training, it is recommended to have some previous experience with Spark (for example, at the level of the aforementioned course), to know the DataFrame API, and to understand the basic principles of machine learning and data analytics in general.