Apache Spark for Java Developers

What you will learn

  • Use functional-style Java to define complex data processing jobs

  • Learn the differences between the RDD and DataFrame APIs

  • Use SQL-style syntax to produce reports against Big Data sets

  • Use Machine Learning Algorithms with Big Data and SparkML

  • Connect Spark to Apache Kafka to process Streams of Big Data

  • See how Structured Streaming can be used to build pipelines with Kafka


Section 1: Introduction

Section 2: Getting Started

Section 3: Reduces on RDDs

Section 4: Mapping and Outputting

Section 5: Tuples

Section 6: PairRDDs

Section 7: FlatMaps and Filters

Section 8: Reading from Disk

Section 9: Keyword Ranking Practical

Section 10: Sorts and Coalesce

Section 11: Deploying to AWS EMR (Optional)

Section 12: Joins

Section 13: Big Data Big Exercise

Section 14: RDD Performance

Section 15: Module 2 - Chapter 1 SparkSQL Introduction

Section 16: SparkSQL Getting Started

Section 17: Datasets

Section 18: The Full SQL Syntax

Section 19: In Memory Data

Section 20: Groupings and Aggregations

Section 21: Date Formatting

Section 22: Multiple Groupings

Section 23: Ordering

Section 24: DataFrames API

Section 25: Pivot Tables

Section 26: More Aggregations

Section 27: Practical Exercise

Section 28: User Defined Functions

Section 29: SparkSQL Performance

Section 30: HashAggregation

Section 31: SparkSQL Performance vs RDDs

Section 32: Module 3 - SparkML for Machine Learning

Section 33: Linear Regression Models

Section 34: Training Data

Section 35: Model Fitting Parameters

Section 36: Feature Selection

Section 37: Non-Numeric Data

Section 38: Pipelines

Section 39: Case Study

Section 40: Logistic Regression

Section 41: Decision Trees

Section 42: K Means Clustering

Section 43: Recommender Systems

Section 44: Module 4 - Spark Streaming and Structured Streaming with Kafka

Section 45: Streaming Chapter 2 - Streaming with Apache Kafka

Course Description

Start processing Big Data with RDDs, DataFrames, SparkSQL and Machine Learning, plus real-time streaming with Kafka!


  • Java 8 is required for the course. Spark does not currently support Java 9+, and you need Java 8 for the functional lambda syntax
  • Previous knowledge of Java is assumed, but anything above the basics is explained
  • Some previous SQL experience will be useful for part of the course, but if you've never used SQL before, this will be a good first experience


Get started with the amazing Apache Spark parallel computing framework - this course is designed especially for Java Developers.

If you're new to Data Science and want to find out how massive datasets are processed in parallel, then the Java API for Spark is a great way to get started, fast.

All of the fundamentals you need to understand the main operations you can perform in Spark Core, SparkSQL and DataFrames are covered in detail, with easy-to-follow examples. You'll be able to follow along with all of the examples and run them on your own local development computer.

Included with the course is a module covering SparkML, an exciting addition to Spark that allows you to apply Machine Learning models to your Big Data! No mathematical experience is necessary!

And finally, there's a full 3-hour module covering Spark Streaming, where you will get hands-on experience integrating Spark with Apache Kafka to handle real-time big data streams. We use both the DStream and the Structured Streaming APIs.


Optionally, if you have an AWS account, you'll see how to deploy your work to a live EMR (Elastic Map Reduce) hardware cluster. If you're not familiar with AWS you can skip this video, although it is still worth watching even if you don't follow along with the coding.

You'll be going deep into the internals of Spark and you'll find out how it optimizes your execution plans. We'll be comparing the performance of RDDs vs SparkSQL, and you'll learn how to avoid the major performance pitfalls, which could save a lot of money on live projects.

Throughout the course, you'll be getting some great practice with Java 8 Lambdas - a great way to learn functional-style Java if you're new to it.
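As a rough taste of what functional-style Java looks like, here is a minimal sketch that uses only the standard Java 8 Streams API. It is not Spark code and is not taken from the course; the class name `LambdaSketch` and its methods are hypothetical, chosen for illustration. The map, reduce, flatMap and filter steps are plain-Java counterparts of the operations named in the section list above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical illustration class: plain Java 8 streams, no Spark dependency.
public class LambdaSketch {

    // Map then reduce: square each value, then add the squares together.
    public static int sumOfSquares(List<Integer> values) {
        return values.stream()
                .map(v -> v * v)            // lambda applied to every element
                .reduce(0, Integer::sum);   // method reference used as the reducer
    }

    // FlatMap, filter, then group: count how often each word appears.
    public static Map<String, Long> wordCounts(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .filter(word -> !word.isEmpty())
                // TreeMap keeps the keys sorted so the printed output is stable
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(Arrays.asList(1, 2, 3, 4)));
        // prints 30
        System.out.println(wordCounts(Arrays.asList("spark and java", "java and kafka")));
        // prints {and=2, java=2, kafka=1, spark=1}
    }
}
```

Each step is passed as a lambda or method reference rather than written as an explicit loop, which is the style the course description refers to.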

NOTE: Java 8 is required for the course. Spark does not currently support Java 9+ (we will update when this changes) and Java 8 is required for the lambda syntax.


Who this course is for:

  • Anyone who already knows Java and would like to explore Apache Spark
  • Anyone new to Data Science who wants a fast way to get started, without learning Python, Scala or R!