
Apache Spark Tutorial Series

This repository contains a comprehensive tutorial series for learning Apache Spark with Python (PySpark). The tutorials progress from basic RDD operations to advanced machine learning pipelines.

Prerequisites

  • Python 3.6 or higher
  • Apache Spark 2.0 or higher
  • Java 8 or higher (required by Spark)

Installation

Option 1: Install PySpark via pip (Recommended)

pip install pyspark numpy pandas

Option 2: Manual Spark Installation

  1. Download Apache Spark from https://spark.apache.org/downloads.html
  2. Extract the archive
  3. Set environment variables:
export SPARK_HOME=/path/to/spark
export PATH=$PATH:$SPARK_HOME/bin

Tutorial Structure

Tutorial 1: Spark RDD Basics (tutorial_1_spark_rdd_basics.py)

Learn the fundamentals of Resilient Distributed Datasets (RDDs); a short code sketch follows the list:

  • Creating RDDs from various sources
  • Basic transformations (map, filter, flatMap, distinct)
  • Basic actions (collect, count, reduce, take)
  • Key-value pair operations
  • Advanced transformations (union, intersection, cartesian)
  • Persistence and caching
  • Broadcast variables and accumulators
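
As a rough illustration of a few of these operations (the data and variable names are illustrative, not taken from the tutorial file):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDBasicsSketch").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Transformations are lazy; actions trigger execution
numbers = sc.parallelize(range(1, 11))
evens = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(evens.collect())                      # [4, 16, 36, 64, 100]

# Key-value pair operations
words = sc.parallelize(["spark", "rdd", "spark"])
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())                     # [('spark', 2), ('rdd', 1)] (order may vary)

spark.stop()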

Tutorial 2: Spark SQL and DataFrames (tutorial_2_spark_sql_dataframes.py)

Master the DataFrame API and SQL interface; a short code sketch follows the list:

  • Creating DataFrames from different sources
  • Schema inspection and basic operations
  • Filtering and conditional operations
  • Adding and modifying columns
  • Grouping and aggregation
  • SQL queries on DataFrames
  • Joins and complex data types
  • Window functions
  • User Defined Functions (UDFs)
  • Handling missing data
  • Working with dates
  • Performance optimization
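
A rough sketch of the DataFrame and SQL interfaces (the sample data is illustrative, not taken from the tutorial file):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("DataFrameSketch").master("local[*]").getOrCreate()

df = spark.createDataFrame(
    [("Alice", "Sales", 3000), ("Bob", "Sales", 4000), ("Carol", "IT", 5000)],
    ["name", "dept", "salary"],
)

df.printSchema()                                              # schema inspection
df.filter(F.col("salary") > 3500).show()                      # filtering
df.withColumn("bonus", F.col("salary") * 0.1).show()          # adding a column
df.groupBy("dept").agg(F.avg("salary").alias("avg_salary")).show()

# The same aggregation through the SQL interface
df.createOrReplaceTempView("employees")
spark.sql("SELECT dept, AVG(salary) AS avg_salary FROM employees GROUP BY dept").show()

spark.stop()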

Tutorial 3: Spark MLlib (tutorial_3_spark_mllib.py)

Explore machine learning with the RDD-based MLlib API; a short code sketch follows the list:

  • Basic statistics and correlations
  • K-Means clustering
  • Linear regression
  • Logistic regression classification
  • Decision trees
  • Collaborative filtering (recommendations)
  • Frequent pattern mining
  • Feature transformation (scaling, TF-IDF)
  • Model persistence
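
As one hedged example of the RDD-based API, a minimal K-Means run on made-up 2-D points (not the data used in the tutorial file):

from pyspark.sql import SparkSession
from pyspark.mllib.clustering import KMeans

spark = SparkSession.builder.appName("MLlibSketch").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Two well-separated clusters of 2-D points
points = sc.parallelize([
    [0.0, 0.0], [0.5, 0.5], [1.0, 1.0],
    [9.0, 9.0], [9.5, 9.5], [10.0, 10.0],
])

model = KMeans.train(points, k=2, maxIterations=10)
print(model.clusterCenters)            # the two learned centers
print(model.predict([0.2, 0.2]))       # cluster index for a new point

spark.stop()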

Tutorial 4: Spark ML Pipeline (tutorial_4_spark_ml_pipeline.py)

Build end-to-end machine learning pipelines; a short code sketch follows the list:

  • ML Pipeline concepts
  • Feature engineering pipelines
  • Classification pipelines
  • Random Forest classification
  • Regression pipelines
  • Clustering pipelines
  • Text processing pipelines
  • Hyperparameter tuning with cross-validation
  • Recommendation systems with ALS
  • Saving and loading pipeline models
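
A minimal pipeline sketch, assuming a tiny made-up training set (illustrative only, not the stages used in the tutorial file):

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("PipelineSketch").master("local[*]").getOrCreate()

train = spark.createDataFrame(
    [(1.0, 2.0, 0.0), (2.0, 3.0, 0.0), (8.0, 9.0, 1.0), (9.0, 10.0, 1.0)],
    ["x1", "x2", "label"],
)

# Stage 1: assemble the feature columns into a single vector column
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
# Stage 2: a classifier that consumes the assembled features
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[assembler, lr])
model = pipeline.fit(train)            # fits all stages in order
model.transform(train).select("features", "label", "prediction").show()

spark.stop()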

Running the Tutorials

Run Individual Tutorials

# Using spark-submit
spark-submit tutorial_1_spark_rdd_basics.py

# Or using Python (if PySpark is installed via pip)
python tutorial_1_spark_rdd_basics.py

Run All Tutorials (Test Suite)

python test_all_tutorials.py

Tutorial Output

Each tutorial will:

  1. Print section headers for different concepts
  2. Show example code execution
  3. Display results and explanations
  4. Complete with a success message

Common Issues and Solutions

Issue: "spark-submit: command not found"

Solution: Ensure Spark is installed and $SPARK_HOME/bin is in your PATH, or use pip install pyspark

Issue: "Java gateway process exited before sending its port number"

Solution: Ensure Java 8+ is installed and JAVA_HOME is set correctly

Issue: Out of memory errors

Solution: Increase driver memory:

spark-submit --driver-memory 2g tutorial_name.py

Issue: "Py4JJavaError" exceptions

Solution: Check that:

  • Spark version matches PySpark version (a quick check is sketched below)
  • No conflicting Spark installations
  • Proper permissions on output directories
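
A quick way to compare the two versions from the first point (a small sketch, assuming PySpark was installed via pip):

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("VersionCheck").getOrCreate()
print("pyspark package:", pyspark.__version__)   # version of the pip-installed package
print("spark runtime:  ", spark.version)         # version of the Spark runtime in use
spark.stop()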

Tips for Learning

  1. Start with Tutorial 1: Even if you're familiar with big data, understanding RDDs is fundamental
  2. Run code sections individually: Each tutorial is divided into sections; feel free to run them separately
  3. Experiment with parameters: Try changing values like number of clusters, regularization parameters, etc.
  4. Check the outputs: Each section produces output to help you understand what's happening
  5. Monitor Spark UI: Access at http://localhost:4040 while jobs are running

Advanced Usage

Running on a Cluster

Modify the master configuration:

from pyspark.sql import SparkSession

# Change the master from "local[*]" to your cluster's master URL
spark = SparkSession.builder \
    .appName("Tutorial") \
    .master("spark://master:7077") \
    .getOrCreate()

Adjusting Parallelism

spark.conf.set("spark.sql.shuffle.partitions", "200")
spark.conf.set("spark.default.parallelism", "100")

Enable Adaptive Query Execution (Spark 3.0+)

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

Next Steps

After completing these tutorials:

  1. Explore Spark Streaming for real-time data processing
  2. Learn about Spark GraphX for graph processing
  3. Try Delta Lake for ACID transactions
  4. Experiment with larger datasets
  5. Deploy Spark applications on cloud platforms (AWS EMR, Databricks, etc.)

Contributing

Feel free to submit issues or pull requests if you find bugs or have suggestions for improvements.

License

These tutorials are provided as-is for educational purposes. Feel free to use and modify as needed.
