This repository contains a comprehensive tutorial series for learning Apache Spark with Python (PySpark). The tutorials progress from basic RDD operations to advanced machine learning pipelines.
- Python 3.6 or higher
- Apache Spark 2.0 or higher
- Java 8 or higher (required by Spark)
pip install pyspark numpy pandas
- Download Apache Spark from https://spark.apache.org/downloads.html
- Extract the archive
- Set environment variables:
export SPARK_HOME=/path/to/spark
export PATH=$PATH:$SPARK_HOME/bin
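To verify the setup, a minimal check like the following (a sketch; the app name is arbitrary) should start a local SparkSession and print the Spark version without errors:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("InstallCheck") \
    .master("local[*]") \
    .getOrCreate()

print("Spark version:", spark.version)   # e.g. 3.x.x
print(spark.range(5).count())            # should print 5
spark.stop()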
Learn the fundamentals of Resilient Distributed Datasets (RDDs):
- Creating RDDs from various sources
- Basic transformations (map, filter, flatMap, distinct)
- Basic actions (collect, count, reduce, take)
- Key-value pair operations
- Advanced transformations (union, intersection, cartesian)
- Persistence and caching
- Broadcast variables and accumulators
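As a preview of what Tutorial 1 covers, here is a minimal RDD sketch (the data is made up purely for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDBasics").master("local[*]").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize([1, 2, 3, 4, 5])                   # create an RDD from a Python list
squares = nums.map(lambda x: x * x)                      # transformation: map
evens = squares.filter(lambda x: x % 2 == 0)             # transformation: filter
print(evens.collect())                                   # action: [4, 16]
print(nums.reduce(lambda a, b: a + b))                   # action: 15

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(pairs.reduceByKey(lambda a, b: a + b).collect())   # key-value op: [('a', 4), ('b', 2)] (order may vary)
spark.stop()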
Master the DataFrame API and SQL interface:
- Creating DataFrames from different sources
- Schema inspection and basic operations
- Filtering and conditional operations
- Adding and modifying columns
- Grouping and aggregation
- SQL queries on DataFrames
- Joins and complex data types
- Window functions
- User Defined Functions (UDFs)
- Handling missing data
- Working with dates
- Performance optimization
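A compact sketch of the DataFrame and SQL style covered in Tutorial 2 (column names and rows are invented for illustration):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("DataFrameBasics").master("local[*]").getOrCreate()

df = spark.createDataFrame(
    [("Alice", "HR", 3000), ("Bob", "IT", 4000), ("Cara", "IT", 5000)],
    ["name", "dept", "salary"],
)
df.printSchema()                                          # schema inspection
df.filter(F.col("salary") > 3500).show()                  # filtering
df = df.withColumn("bonus", F.col("salary") * 0.1)        # adding a column
df.groupBy("dept").agg(F.avg("salary").alias("avg_salary")).show()   # aggregation

df.createOrReplaceTempView("employees")                   # SQL on DataFrames
spark.sql("SELECT dept, COUNT(*) AS n FROM employees GROUP BY dept").show()
spark.stop()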
Explore machine learning with RDD-based MLlib:
- Basic statistics and correlations
- K-Means clustering
- Linear regression
- Logistic regression classification
- Decision trees
- Collaborative filtering (recommendations)
- Frequent pattern mining
- Feature transformation (scaling, TF-IDF)
- Model persistence
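For the RDD-based MLlib API used in Tutorial 3, a minimal K-Means sketch (the toy 2-D points and k=2 are purely illustrative):

from pyspark.sql import SparkSession
from pyspark.mllib.clustering import KMeans

spark = SparkSession.builder.appName("MLlibKMeans").master("local[*]").getOrCreate()
sc = spark.sparkContext

points = sc.parallelize([[0.0, 0.0], [0.5, 0.5], [9.0, 9.0], [9.5, 9.5]])   # two obvious clusters
model = KMeans.train(points, k=2, maxIterations=10)

print("Cluster centers:", model.clusterCenters)
print("[0.2, 0.2] assigned to cluster", model.predict([0.2, 0.2]))
spark.stop()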
Build end-to-end machine learning pipelines:
- ML Pipeline concepts
- Feature engineering pipelines
- Classification pipelines
- Random Forest classification
- Regression pipelines
- Clustering pipelines
- Text processing pipelines
- Hyperparameter tuning with cross-validation
- Recommendation systems with ALS
- Saving and loading pipeline models
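Tutorial 4 uses the DataFrame-based pyspark.ml Pipeline API. A minimal end-to-end sketch (the features, labels, and save path are invented for illustration):

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("PipelineDemo").master("local[*]").getOrCreate()

train = spark.createDataFrame(
    [(1.0, 2.0, 0.0), (2.0, 3.0, 0.0), (8.0, 9.0, 1.0), (9.0, 10.0, 1.0)],
    ["f1", "f2", "label"],
)

assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[assembler, scaler, lr])       # feature engineering + model in one object
model = pipeline.fit(train)
model.transform(train).select("label", "prediction").show()

# model.write().overwrite().save("/tmp/pipeline_model")   # hypothetical path for persisting the fitted pipeline
spark.stop()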
# Using spark-submit
spark-submit tutorial_1_spark_rdd_basics.py
# Or using Python (if PySpark is installed via pip)
python tutorial_1_spark_rdd_basics.py
python test_all_tutorials.py
Each tutorial will:
- Print section headers for different concepts
- Show example code execution
- Display results and explanations
- Finish with a success message
Solution: Ensure Spark is installed and $SPARK_HOME/bin is in your PATH, or install PySpark with pip install pyspark
Solution: Ensure Java 8+ is installed and JAVA_HOME is set correctly
Solution: Increase driver memory:
spark-submit --driver-memory 2g tutorial_name.py
Solution: Check that:
- Spark version matches PySpark version
- No conflicting Spark installations
- Proper permissions on output directories
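A quick way to check the version pairing (assuming a pip-installed PySpark alongside a separate Spark distribution):

import pyspark
from pyspark.sql import SparkSession

print("PySpark package version:", pyspark.__version__)

spark = SparkSession.builder.appName("VersionCheck").getOrCreate()
print("Spark runtime version:  ", spark.version)   # should match the package version above
spark.stop()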
- Start with Tutorial 1: Even if you're familiar with big data, understanding RDDs is fundamental
- Run code sections individually: Each tutorial is divided into sections, so feel free to run them separately
- Experiment with parameters: Try changing values like number of clusters, regularization parameters, etc.
- Check the outputs: Each section produces output to help you understand what's happening
- Monitor Spark UI: Access at http://localhost:4040 while jobs are running
Modify the master configuration:
from pyspark.sql import SparkSession

# Change .master() from "local[*]" to your cluster's master URL
spark = SparkSession.builder \
    .appName("Tutorial") \
    .master("spark://master:7077") \
    .getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "200")
spark.conf.set("spark.default.parallelism", "100")
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
After completing these tutorials:
- Explore Spark Streaming for real-time data processing
- Learn about Spark GraphX for graph processing
- Try Delta Lake for ACID transactions
- Experiment with larger datasets
- Deploy Spark applications on cloud platforms (AWS EMR, Databricks, etc.)
Feel free to submit issues or pull requests if you find bugs or have suggestions for improvements.
These tutorials are provided as-is for educational purposes. Feel free to use and modify as needed.