Description
This is a complete PySpark Developer course for Data Engineers, Data Scientists, and anyone else who wants to process Big Data effectively. We will cover the topics below and more:
- Complete Curriculum for a successful PySpark Developer
- Complete Flow of Installation of PySpark
- Introduction to Spark (Why Spark was Developed, Spark Features, Spark Components)
- Understand SparkSession (a short SparkSession and DataFrame sketch follows this list)
- Spark RDD Fundamentals
- How to Create RDDs
- RDD Operations (Transformations & Actions; illustrated in the RDD sketch after this list)
- Spark Cluster Architecture – Execution, YARN, JVM Processes, DAG Scheduler, Task Scheduler
- RDD Persistence
- Spark Shared Variables (Broadcast and Accumulators)
- Spark SQL Architecture, Catalyst Optimizer, Volcano Iterator Model, Tungsten Execution Engine, Different Benchmarks
- Commonly Used Spark Functions – version, range, createDataFrame, sql, table, sparkContext, conf, read, udf, newSession, stop, catalog, etc.
- DataFrame Built-in Functions – new column, encryption, string, regexp, date, null, collection, na, math and statistics, explode, flatten, formatting, and JSON
- What are Partitions, Repartition, and Coalesce
- Repartition vs. Coalesce (see the partitioning sketch after this list)
- Extraction – CSV, text, Parquet, ORC, JSON, and Avro files, Hive, JDBC
- DataFrame Fundamentals (What is a DataFrame, DataFrame Sources, DataFrame Features, DataFrame Organization)
- DataFrame Rows, Columns and DataTypes. Practical examples.
- ETL Using DataFrame (Extraction APIs, Transformation APIs, and Loading APIs). Practical examples, including the ETL sketch after this list.
- Optimization and Management – Join Strategies, Driver Configuration, Parallelism Configuration, Executor Configuration, etc.
- HDFS Commands (Will be added shortly)
- Python Fundamentals (Will be added shortly)
- More will be added
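To give a feel for the hands-on style of the course, here are a few short sketches. First, a minimal SparkSession and DataFrame example; the app name, column names, and sample rows are illustrative only, and the code assumes a local PySpark installation.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Entry point for DataFrame and SQL functionality (assumes PySpark is installed locally)
spark = SparkSession.builder.appName("intro-example").master("local[*]").getOrCreate()

# Create a small DataFrame from in-memory data; schema and values are illustrative
df = spark.createDataFrame(
    [(1, "alice", 3000), (2, "bob", 4500)],
    ["id", "name", "salary"],
)

# A couple of commonly used built-in functions: upper() and when()/otherwise()
df = (
    df.withColumn("name_upper", F.upper(F.col("name")))
      .withColumn("band", F.when(F.col("salary") > 4000, "high").otherwise("standard"))
)

df.show()
spark.stop()
```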
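Next, a sketch of RDD creation and basic operations; the numbers are arbitrary and only meant to show that transformations are lazy while actions trigger execution.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-example").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Create an RDD from a Python collection
rdd = sc.parallelize(range(1, 11))

# Transformations (filter, map) are lazy; nothing runs until an action is called
evens_squared = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

# Actions (collect, reduce) trigger the actual computation
print(evens_squared.collect())                    # [4, 16, 36, 64, 100]
print(evens_squared.reduce(lambda a, b: a + b))   # 220

spark.stop()
```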
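A small sketch contrasting repartition and coalesce; the row count and partition numbers are arbitrary.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-example").master("local[*]").getOrCreate()

df = spark.range(0, 1_000_000)              # single-column DataFrame of ids
print(df.rdd.getNumPartitions())            # initial count depends on the local environment

# repartition() performs a full shuffle and can increase or decrease partitions
df_repart = df.repartition(8)
print(df_repart.rdd.getNumPartitions())     # 8

# coalesce() only merges existing partitions (no full shuffle), so it can only decrease them
df_coalesced = df_repart.coalesce(2)
print(df_coalesced.rdd.getNumPartitions())  # 2

spark.stop()
```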
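Finally, a minimal end-to-end ETL sketch with DataFrame APIs; the file paths and column names (order_ts, amount) are hypothetical placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-example").master("local[*]").getOrCreate()

# Extraction: read a CSV file (path is a placeholder)
orders = (
    spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv("/tmp/input/orders.csv")
)

# Transformation: aggregate order amounts per day (column names are assumed)
daily_totals = (
    orders.withColumn("order_date", F.to_date("order_ts"))
          .groupBy("order_date")
          .agg(F.sum("amount").alias("total_amount"))
)

# Loading: write the result as Parquet
daily_totals.write.mode("overwrite").parquet("/tmp/output/daily_totals")

spark.stop()
```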