This repository contains a comprehensive tutorial series for learning Apache Spark with Python (PySpark). The tutorials progress from basic RDD operations to advanced machine learning pipelines.
- Python 3.6 or higher
- Apache Spark 2.0 or higher
- Java 8 or higher (required by Spark)
pip install pyspark numpy pandas
- Download Apache Spark from https://spark.apache.org/downloads.html
- Extract the archive
- Set environment variables:
export SPARK_HOME=/path/to/spark
export PATH=$PATH:$SPARK_HOME/bin
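To verify the setup, a minimal check like the following (a sketch; the app name is arbitrary) should start a local SparkSession and print the Spark version without errors:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("InstallCheck") \
    .master("local[*]") \
    .getOrCreate()

print("Spark version:", spark.version)   # e.g. 3.x.x
print(spark.range(5).count())            # should print 5
spark.stop()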
Learn the fundamentals of Resilient Distributed Datasets (RDDs):
- Creating RDDs from various sources
- Basic transformations (map, filter, flatMap, distinct)
- Basic actions (collect, count, reduce, take)
- Key-value pair operations
- Advanced transformations (union, intersection, cartesian)
- Persistence and caching
- Broadcast variables and accumulators
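As a preview of what Tutorial 1 covers, here is a minimal RDD sketch (the data is made up purely for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDBasics").master("local[*]").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize([1, 2, 3, 4, 5])                   # create an RDD from a Python list
squares = nums.map(lambda x: x * x)                      # transformation: map
evens = squares.filter(lambda x: x % 2 == 0)             # transformation: filter
print(evens.collect())                                   # action: [4, 16]
print(nums.reduce(lambda a, b: a + b))                   # action: 15

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(pairs.reduceByKey(lambda a, b: a + b).collect())   # key-value op: [('a', 4), ('b', 2)] (order may vary)
spark.stop()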
Master the DataFrame API and SQL interface:
- Creating DataFrames from different sources
- Schema inspection and basic operations
- Filtering and conditional operations
- Adding and modifying columns
- Grouping and aggregation
- SQL queries on DataFrames
- Joins and complex data types
- Window functions
- User Defined Functions (UDFs)
- Handling missing data
- Working with dates
- Performance optimization
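A compact sketch of the DataFrame and SQL style covered in Tutorial 2 (column names and rows are invented for illustration):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("DataFrameBasics").master("local[*]").getOrCreate()

df = spark.createDataFrame(
    [("Alice", "HR", 3000), ("Bob", "IT", 4000), ("Cara", "IT", 5000)],
    ["name", "dept", "salary"],
)
df.printSchema()                                          # schema inspection
df.filter(F.col("salary") > 3500).show()                  # filtering
df = df.withColumn("bonus", F.col("salary") * 0.1)        # adding a column
df.groupBy("dept").agg(F.avg("salary").alias("avg_salary")).show()   # aggregation

df.createOrReplaceTempView("employees")                   # SQL on DataFrames
spark.sql("SELECT dept, COUNT(*) AS n FROM employees GROUP BY dept").show()
spark.stop()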
Explore machine learning with RDD-based MLlib:
- Basic statistics and correlations
- K-Means clustering
- Linear regression
- Logistic regression classification
- Decision trees
- Collaborative filtering (recommendations)
- Frequent pattern mining
- Feature transformation (scaling, TF-IDF)
- Model persistence
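For the RDD-based MLlib API used in Tutorial 3, a minimal K-Means sketch (the toy 2-D points and k=2 are purely illustrative):

from pyspark.sql import SparkSession
from pyspark.mllib.clustering import KMeans

spark = SparkSession.builder.appName("MLlibKMeans").master("local[*]").getOrCreate()
sc = spark.sparkContext

points = sc.parallelize([[0.0, 0.0], [0.5, 0.5], [9.0, 9.0], [9.5, 9.5]])   # two obvious clusters
model = KMeans.train(points, k=2, maxIterations=10)

print("Cluster centers:", model.clusterCenters)
print("[0.2, 0.2] assigned to cluster", model.predict([0.2, 0.2]))
spark.stop()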
Build end-to-end machine learning pipelines:
- ML Pipeline concepts
- Feature engineering pipelines
- Classification pipelines
- Random Forest classification
- Regression pipelines
- Clustering pipelines
- Text processing pipelines
- Hyperparameter tuning with cross-validation
- Recommendation systems with ALS
- Saving and loading pipeline models
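Tutorial 4 uses the DataFrame-based pyspark.ml Pipeline API. A minimal end-to-end sketch (the features, labels, and save path are invented for illustration):

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("PipelineDemo").master("local[*]").getOrCreate()

train = spark.createDataFrame(
    [(1.0, 2.0, 0.0), (2.0, 3.0, 0.0), (8.0, 9.0, 1.0), (9.0, 10.0, 1.0)],
    ["f1", "f2", "label"],
)

assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[assembler, scaler, lr])       # feature engineering + model in one object
model = pipeline.fit(train)
model.transform(train).select("label", "prediction").show()

# model.write().overwrite().save("/tmp/pipeline_model")   # hypothetical path for persisting the fitted pipeline
spark.stop()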
# Using spark-submit
spark-submit tutorial_1_spark_rdd_basics.py
# Or using Python (if PySpark is installed via pip)
python tutorial_1_spark_rdd_basics.py
python test_all_tutorials.py
Each tutorial will:
- Print section headers for different concepts
- Show example code execution
- Display results and explanations
- Finish with a success message
Solution: Ensure Spark is installed and $SPARK_HOME/bin is in your PATH, or install PySpark with pip install pyspark
Solution: Ensure Java 8+ is installed and JAVA_HOME is set correctly
Solution: Increase driver memory:
spark-submit --driver-memory 2g tutorial_name.py
Solution: Check that:
- Spark version matches PySpark version
- No conflicting Spark installations
- Proper permissions on output directories
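A quick way to check the version pairing (assuming a pip-installed PySpark alongside a separate Spark distribution):

import pyspark
from pyspark.sql import SparkSession

print("PySpark package version:", pyspark.__version__)

spark = SparkSession.builder.appName("VersionCheck").getOrCreate()
print("Spark runtime version:  ", spark.version)   # should match the package version above
spark.stop()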
- Start with Tutorial 1: Even if you're familiar with big data, understanding RDDs is fundamental
- Run code sections individually: Each tutorial is divided into sections, so feel free to run them separately
- Experiment with parameters: Try changing values like number of clusters, regularization parameters, etc.
- Check the outputs: Each section produces output to help you understand what's happening
- Monitor Spark UI: Access at http://localhost:4040 while jobs are running
Modify the master configuration:
from pyspark.sql import SparkSession

# Change .master() from "local[*]" to your cluster's master URL
spark = SparkSession.builder \
    .appName("Tutorial") \
    .master("spark://master:7077") \
    .getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "200")
spark.conf.set("spark.default.parallelism", "100")
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
After completing these tutorials:
- Explore Spark Streaming for real-time data processing
- Learn about Spark GraphX for graph processing
- Try Delta Lake for ACID transactions
- Experiment with larger datasets
- Deploy Spark applications on cloud platforms (AWS EMR, Databricks, etc.)
Feel free to submit issues or pull requests if you find bugs or have suggestions for improvements.
These tutorials are provided as-is for educational purposes. Feel free to use and modify as needed.