Complete PySpark Developer Course

Description

This is a complete PySpark Developer course for Data Engineers, Data Scientists, and anyone else who wants to process Big Data effectively. We will cover the topics below and more:

  • Complete Curriculum for a successful PySpark Developer
  • Complete Flow of Installation of PySpark
  • Introduction to Spark (Why Spark was Developed, Spark Features, Spark Components)
  • Understand SparkSession
  • Spark RDD Fundamentals
  • How to Create RDDs
  • RDD Operations (Transformations & Actions)
  • Spark Cluster Architecture – Execution, YARN, JVM Processes, DAG Scheduler, Task Scheduler
  • RDD Persistence
  • Spark Shared Variables (Broadcast and Accumulators)
  • Spark SQL Architecture, Catalyst Optimizer, Volcano Iterator Model, Tungsten Execution Engine, Different Benchmarks
  • Commonly Used Spark Functions – version, range, createDataFrame, sql, table, sparkContext, conf, read, udf, newSession, stop, catalog, etc.
  • DataFrame Built-in functions – new column, encryption, string, regexp, date, null, collection, na, math and statistics, explode, flatten, formatting and json
  • What is Partition, Repartition and Coalesce
  • Repartition Vs Coalesce
  • Extraction – CSV file, text file, Parquet file, ORC file, JSON file, Avro file, Hive, JDBC
  • DataFrame Fundamentals (What is a DataFrame, DataFrame Sources, DataFrame Features, DataFrame Organization)
  • DataFrame Rows, Columns and DataTypes. Practical examples.
  • ETL Using DataFrame (Extraction APIs, Transformation APIs, and Loading APIs). Practical Examples.
  • Optimization and Management – Join Strategies, Driver Configuration, Parallelism Configurations, Executor Configuration, etc.
  • HDFS Commands (Will be added shortly)
  • Python Fundamentals (Will be added shortly)
  • More will be added