
Intermediate Data Engineering with Python

$1,650 USD
Duration 2 days
Course Code WA3032
Available Formats Classroom, Virtual

Overview

This Data Engineering with Python course teaches attendees how to use Apache Spark and AWS Glue to build scalable and reliable data pipelines.

Skills Gained

  • Understand the Spark platform and its architecture
  • Use the Spark Shell to create and run Spark applications
  • Work with Spark RDDs and Spark SQL DataFrames
  • Use AWS Glue to crawl, classify, and transform data
  • Build scalable and reliable data pipelines using Spark and AWS Glue

Who Can Benefit

Developers, Software Engineers, Data Scientists, and IT Architects.

Prerequisites

Participants must have practical experience coding in Python or another modern programming language. Familiarity with the AWS Management Console is desirable but not required. Students are expected to pick up new material quickly and to reinforce each topic through hands-on programming exercises (labs).

Course Details

Outline

Chapter 1. Introduction to Apache Spark

  • What is Apache Spark
  • The Spark Platform
  • Spark vs Hadoop's MapReduce (MR)
  • Common Spark Use Cases
  • Languages Supported by Spark
  • Running Spark on a Cluster
  • The Spark Application Architecture
  • The Driver Process
  • The Executor and Worker Processes
  • Spark Shell
  • Jupyter Notebook Shell Environment
  • Spark Applications
  • The spark-submit Tool
  • The spark-submit Tool Configuration
  • Interfaces with Data Storage Systems
  • Project Tungsten
  • The Resilient Distributed Dataset (RDD)
  • Datasets and DataFrames
  • Spark SQL, DataFrames, and Catalyst Optimizer
  • Spark Machine Learning Library
  • GraphX
  • Extending Spark Environment with Custom Modules and Files
  • Summary
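
As a brief preview of the topics listed above (this is not part of the courseware), the sketch below shows what a minimal self-contained PySpark application looks like: a driver program that creates a SparkSession, runs an RDD job on the executors, and stops the session. The file name and example data are made up; such a script would typically be launched with, for example, spark-submit --master local[2] wordcount_app.py.

```python
# wordcount_app.py -- illustrative sketch only, not a course lab.
# A minimal self-contained PySpark application of the kind Chapter 1
# describes: the driver builds a SparkSession, distributes work to the
# executors through RDD operations, and shuts down cleanly.
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("WordCountSketch").getOrCreate()
    sc = spark.sparkContext  # the driver's handle to the cluster

    lines = sc.parallelize(["spark is fast", "spark is scalable"])
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    print(counts.collect())

    spark.stop()
```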

Chapter 2. The Spark Shell

  • The Spark Shell
  • The Spark v.2+ Command-Line Shells
  • The Spark Shell UI
  • Spark Shell Options
  • Getting Help
  • Jupyter Notebook Shell Environment
  • Example of a Jupyter Notebook Web UI (Databricks Cloud)
  • The Spark Context (sc) and Spark Session (spark)
  • Creating a Spark Session Object in Spark Applications
  • The Shell Spark Context Object (sc)
  • The Shell Spark Session Object (spark)
  • Loading Files
  • Saving Files
  • Summary
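
To preview this chapter (not courseware), the sketch below shows the two shell entry points it covers, the SparkContext (sc) and SparkSession (spark) objects, along with loading and saving files. The file paths are hypothetical placeholders.

```python
# In the pyspark shell or a Jupyter notebook, `spark` and `sc` are
# pre-created; in a standalone script you build them yourself, as here.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ShellSketch").getOrCreate()
sc = spark.sparkContext

# Loading files: a text file as an RDD, and a CSV file as a DataFrame.
rdd = sc.textFile("data/input.txt")                        # hypothetical path
df = spark.read.csv("data/input.csv", header=True, inferSchema=True)

# Saving files: write the DataFrame back out in Parquet format.
df.write.mode("overwrite").parquet("data/output.parquet")  # hypothetical path
```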

Chapter 3. Spark RDDs

  • The Resilient Distributed Dataset (RDD)
  • Ways to Create an RDD
  • Supported Data Types
  • RDD Operations
  • RDDs are Immutable
  • Spark Actions
  • RDD Transformations
  • Other RDD Operations
  • Chaining RDD Operations
  • RDD Lineage
  • The Big Picture
  • What May Go Wrong
  • Miscellaneous Pair RDD Operations
  • RDD Caching
  • Summary
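
As an illustration of the chapter's core ideas (not a course lab), the sketch below contrasts lazy transformations with actions, and shows operation chaining and caching on a small RDD.

```python
# Minimal RDD sketch: transformations are lazy and return new, immutable
# RDDs; actions trigger execution of the accumulated lineage.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDSketch").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1, 11))

# Chained transformations: nothing runs yet.
evens_squared = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n)

# Caching keeps the computed partitions in memory for reuse.
evens_squared.cache()

# Actions trigger execution.
print(evens_squared.collect())                   # [4, 16, 36, 64, 100]
print(evens_squared.reduce(lambda a, b: a + b))  # 220
```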

Chapter 4. Introduction to Spark SQL

  • What is Spark SQL?
  • Uniform Data Access with Spark SQL
  • Hive Integration
  • Hive Interface
  • Integration with BI Tools
  • What is a DataFrame?
  • Creating a DataFrame in PySpark
  • Commonly Used DataFrame Methods and Properties in PySpark
  • Grouping and Aggregation in PySpark
  • The "DataFrame to RDD" Bridge in PySpark
  • The SQLContext Object
  • Examples of Spark SQL / DataFrame (PySpark Example)
  • Converting an RDD to a DataFrame Example
  • Example of Reading / Writing a JSON File
  • Using JDBC Sources
  • JDBC Connection Example
  • Performance, Scalability, and Fault-tolerance of Spark SQL
  • Summary
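
For a flavor of the DataFrame topics above (not courseware), the sketch below creates a DataFrame in PySpark, performs grouping and aggregation, runs a SQL query through a temporary view, and crosses the DataFrame-to-RDD bridge in both directions. The column names and sample rows are made up.

```python
# Spark SQL / DataFrame sketch with invented sample data.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SparkSQLSketch").getOrCreate()

# Creating a DataFrame from an in-memory list.
df = spark.createDataFrame(
    [("electronics", 120.0), ("books", 35.5), ("electronics", 80.0)],
    ["category", "amount"],
)

# Grouping and aggregation.
df.groupBy("category").agg(F.sum("amount").alias("total")).show()

# SQL over the same data via a temporary view.
df.createOrReplaceTempView("sales")
spark.sql("SELECT category, COUNT(*) AS n FROM sales GROUP BY category").show()

# The "DataFrame to RDD" bridge, and converting an RDD back to a DataFrame.
rdd = df.rdd
df2 = spark.createDataFrame(rdd)
```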

Chapter 5. Overview of Amazon Web Services (AWS)

  • Amazon Web Services
  • The History of AWS
  • The Initial Iteration of Moving amazon.com to AWS
  • The AWS (Simplified) Service Stack
  • Accessing AWS
  • Direct Connect
  • Shared Responsibility Model
  • Trusted Advisor
  • The AWS Distributed Architecture
  • AWS Services
  • Managed vs Unmanaged Amazon Services
  • Amazon Resource Name (ARN)
  • Compute and Networking Services
  • Elastic Compute Cloud (EC2)
  • AWS Lambda
  • Auto Scaling
  • Elastic Load Balancing (ELB)
  • Virtual Private Cloud (VPC)
  • Route53 Domain Name System
  • Elastic Beanstalk
  • Security and Identity Services
  • Identity and Access Management (IAM)
  • AWS Directory Service
  • AWS Certificate Manager
  • AWS Key Management Service (KMS)
  • Storage and Content Delivery
  • Elastic Block Storage (EBS)
  • Simple Storage Service (S3)
  • Glacier
  • CloudFront Content Delivery Service
  • Database Services
  • Relational Database Service (RDS)
  • DynamoDB
  • Amazon ElastiCache
  • Redshift
  • Messaging Services
  • Simple Queue Service (SQS)
  • Simple Notifications Service (SNS)
  • Simple Email Service (SES)
  • AWS Monitoring with CloudWatch
  • Other Services Example
  • Summary

Chapter 6. Introduction to AWS Glue

  • What is AWS Glue?
  • AWS Glue Components
  • Managing Notebooks
  • Putting it Together: The AWS Glue Environment Architecture
  • AWS Glue Main Activities
  • Additional Glue Services
  • When To Use AWS Glue?
  • Integration with other AWS Services
  • Summary
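
The chapter itself is conceptual; purely as an optional illustration (not part of the courseware), the boto3 sketch below drives the components it names: starting a crawler, inspecting the Data Catalog, and launching an ETL job. The crawler, database, and job names are hypothetical, and configured AWS credentials are assumed.

```python
# Hypothetical sketch of interacting with AWS Glue components from Python.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Run a crawler that populates the Data Catalog from data in S3.
glue.start_crawler(Name="sales-data-crawler")

# Inspect the tables the crawler classified into a catalog database.
tables = glue.get_tables(DatabaseName="sales_db")
for table in tables["TableList"]:
    print(table["Name"], table["StorageDescriptor"]["Location"])

# Kick off a Glue ETL job (the PySpark script behind it is the subject of Chapter 8).
run = glue.start_job_run(JobName="sales-etl-job")
print(run["JobRunId"])
```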

Chapter 7. Introduction to Apache Spark

  • What is Apache Spark
  • The Spark Platform
  • Uniform Data Access with Spark SQL
  • Common Spark Use Cases
  • Languages Supported by Spark
  • Running Spark on a Cluster
  • The Spark Application Architecture
  • The Driver Process
  • The Executor and Worker Processes
  • Spark Shell
  • Jupyter Notebook Shell Environment
  • Interfaces with Data Storage Systems
  • The Resilient Distributed Dataset (RDD)
  • Datasets and DataFrames
  • Data Partitioning
  • Data Partitioning Diagram
  • Summary
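
The new material here relative to Chapter 1 is data partitioning; as a brief illustration (not courseware), the sketch below inspects and changes the number of partitions behind a DataFrame.

```python
# Data partitioning sketch: inspect, repartition, and coalesce.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PartitioningSketch").getOrCreate()

df = spark.range(0, 1_000_000)            # a single-column DataFrame of ids
print(df.rdd.getNumPartitions())          # partition count chosen by Spark

repartitioned = df.repartition(8)         # full shuffle into 8 partitions
coalesced = repartitioned.coalesce(2)     # narrow merge down to 2 partitions
print(coalesced.rdd.getNumPartitions())   # 2
```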

Chapter 8. AWS Glue PySpark Extensions

  • AWS Glue and Spark
  • The DynamicFrame Object
  • The DynamicFrame API
  • The GlueContext Object
  • Glue Transforms
  • A Sample Glue PySpark Script
  • Using PySpark
  • AWS Glue PySpark SDK
  • Summary
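
As an illustration of the extensions listed above (not the course's sample script), here is a minimal Glue PySpark sketch. The database, table, and S3 path are hypothetical, and the awsglue library is only available inside a Glue job or development endpoint.

```python
# Hypothetical Glue PySpark script using GlueContext, DynamicFrame, and a
# Glue transform; runnable only in an AWS Glue environment.
from awsglue.context import GlueContext
from awsglue.transforms import ApplyMapping
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Read a DynamicFrame from a table the crawler registered in the Data Catalog.
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_sales")

# A Glue transform: rename and cast columns via ApplyMapping.
mapped = ApplyMapping.apply(
    frame=dyf,
    mappings=[("id", "string", "sale_id", "string"),
              ("amount", "string", "amount", "double")])

# The DynamicFrame-to-DataFrame bridge exposes the full Spark SQL API.
mapped.toDF().show()

# Write the result back to S3 as Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/clean-sales/"},
    format="parquet")
```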

Lab Exercises

  • Lab 1. Learning the Databricks Community Cloud Lab Environment
  • Lab 2. Data Visualization and EDA with pandas and seaborn
  • Lab 3. Correlating Cause and Effect
  • Lab 4. Learning PySpark Shell Environment
  • Lab 5. Understanding Spark DataFrames
  • Lab 6. Learning the PySpark DataFrame API
  • Lab 7. Data Repair and Normalization in PySpark
  • Lab 8. Working with Parquet File Format in PySpark and pandas
  • Lab 9. AWS Glue Overview
  • Lab 10. AWS Glue Crawlers and Classifiers
  • Lab 11. Creating an S3 Bucket for AWS Glue ETL Script Output
  • Lab 12. Creating and Working with Glue Scripts Using Dev Endpoints
  • Lab 13. Using PySpark API Directly
  • Lab 14. Understanding AWS Glue ETL Jobs
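
For a flavor of the lab work (this is not a lab solution), the Parquet round trip between PySpark and pandas practiced in Lab 8 might look roughly like the sketch below. The paths are hypothetical, and reading Parquet with pandas assumes pyarrow or fastparquet is installed.

```python
# Hypothetical Parquet round trip between PySpark and pandas.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParquetSketch").getOrCreate()

# Write a small DataFrame to Parquet from Spark.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.write.mode("overwrite").parquet("/tmp/labels.parquet")

# Read the same data back with Spark ...
spark.read.parquet("/tmp/labels.parquet").show()

# ... and with pandas (Spark writes a directory of part files,
# which the pyarrow engine reads as a single dataset).
pdf = pd.read_parquet("/tmp/labels.parquet")
print(pdf.head())
```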

Schedule

2 options available

  • May 26, 2025 - May 27, 2025 (2 days)
    Virtual | 10:00 AM - 6:00 PM EST
    Language: English
  • Jul 7, 2025 - Jul 8, 2025 (2 days)
    Virtual | 10:00 AM - 6:00 PM EST
    Language: English

FAQ

Does the course schedule include a lunch break?

Classes typically include a 1-hour lunch break around midday. However, the exact break times and duration can vary depending on the specific class. Your instructor will provide detailed information at the start of the course.

What languages are used to deliver training?

Most courses are conducted in English, unless otherwise specified. Some courses will have the word "FRENCH" marked in red beside the scheduled date(s) indicating the language of instruction.

What does GTR stand for?

GTR stands for Guaranteed to Run; if you see a course with this status, it means this event is confirmed to run. View our GTR page to see our full list of Guaranteed to Run courses.

Does Ascendient Learning deliver group training?

Yes, we provide training for groups and individuals, as well as private on-site sessions. View our group training page for more information.

What does vendor-authorized training mean?

As a vendor-authorized training partner, we offer a curriculum that our partners have vetted. We use the same course materials and facilitate the same labs as our vendor-delivered training. These courses are considered the gold standard and, as such, are priced accordingly.

Is the training too basic, or will you go deep into technology?

It depends on your requirements, your role in your company, and your depth of knowledge. The good news is that many of our learning paths let you start with the fundamentals and progress to highly specialized training.

How up-to-date are your courses and support materials?

We continuously work with our vendors to evaluate and refresh course material to reflect the latest training courses and best practices.

Are your instructors seasoned trainers who have deep knowledge of the training topic?

Ascendient Learning instructors have an average of 27 years of practical IT experience and have also served as consultants for an average of 15 years. To stay current, instructors spend at least 25 percent of their time learning new, emerging technologies and courses.

Do you provide hands-on training and exercises in an actual lab environment?

Lab access is dependent on the vendor and the type of training you sign up for. However, many of our top vendors will provide lab access to students to test and practice. The course description will specify lab access.

Will you customize the training for our company’s specific needs and goals?

We will work with you to identify training needs and areas of growth. We offer a variety of training methods, such as private group training, on-site delivery at a location of your choice, and virtual instruction. We provide courses and certifications that are aligned with your business goals.

How do I get started with certification?

Getting started on a certification pathway depends on your goals and the vendor you choose to get certified in. Many vendors offer certifications ranging from entry level to advanced that can boost your career. To get access to certification vouchers and discounts, please contact info@ascendientlearning.com.

Will I get access to content after I complete a course?

You will get access to PDFs of the course books and guides, but access to recordings and slides will depend on the vendor and the type of training you receive.

How do I request a W9 for Ascendient Learning?

View our filing status and how to request a W9.

Reviews

Easy to use and exactly what I was looking for. Value for money was exceptional.

Good Course. We covered a lot of material in a short amount of time. This course had useful labs that built upon each other.

Overall, ExitCertified is a great training provider, and the remote learning is as effective as in person.

The class was very fast paced; however, the teacher was very good at checking in on us while giving us time to complete the labs.

Instructor was great, and the course was mostly very good, except for too much focus on pricing.