A Study Guide to the Google Cloud Professional Data Engineer Certification Path
I recently studied for the Google Cloud Professional Data Engineer certification; here are some key learning outcomes you should be familiar with before heading into the exam. My study path followed the Data Engineering on Google Cloud Platform Specialisation, a five-course programme presented by Google that provides a hands-on (Qwiklabs) introduction to designing and building data processing systems on GCP.
The exam is 2 hours long with 50 multiple-choice questions, several of which are based on the example case studies. A practice exam is also available to assess your readiness.
Fundamental Services on Google Cloud Platform
Your understanding of GCP should include familiarity with the following fundamental services, categorised broadly as follows:
As the GCP ecosystem grows daily with new services being introduced, it is beneficial (though not necessary) to have an overview of the full range of GCP products/services; refer to this Google Cloud Developer’s Cheat Sheet.
As well as the individual services, you will need to understand how they fit together in an overall data pipeline architecture, depending on the characteristics of your data and workload:
- Streaming or batch
- Structured or unstructured
- OLTP or OLAP usage
- Fully managed or semi-managed service
- High throughput and low latency requirements
- GB or TB or PB storage requirements
- Read/write/update requirements
One common example is a combined streaming and batch pipeline: events are ingested through Pub/Sub, processed with Dataflow, and landed in BigQuery or Bigtable for analysis.
Machine Learning
- Know the difference between regression and classification models, RMSE vs cross-entropy, and real-valued vs categorical features.
- Have an understanding of ML terminology — label, input, example, training, evaluation, prediction, gradient descent, weights, batch size, epoch, hidden layers, neurons, features, feature engineering. Differentiate accuracy vs precision vs recall.
- Be familiar with the process of creating an ML model, from the data source to building datasets for training, validation, and testing, and the need for a benchmark/heuristic.
- Know what TensorFlow is, and the purpose of Cloud ML Engine. Have a basic understanding of the TensorFlow architecture — out-of-the-box APIs, full-control APIs, custom ML models, low-level APIs, and hardware types.
- Learn how to use the Estimator API for linear/DNN regression and classification, and know what a Wide and Deep model is. Use the API to train and evaluate models, and understand how it uses checkpointing for fault tolerance and distributed processing (see the sketch after this list). Understand how TensorBoard is used for monitoring.
- Feature engineering — why do you need it? What are feature crosses and bucketising? What makes a good or bad ML model? Where would you perform feature engineering in a data pipeline? What is a good candidate for hyperparameter tuning?
- Know Cloud AutoML, and how to use prebuilt ML APIs such as Cloud Vision, Cloud Speech-to-Text, Cloud Natural Language, Cloud Translation, and Cloud Video Intelligence.
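To make the Estimator workflow concrete, here is a minimal sketch (not taken from the course labs) of a Wide and Deep regressor built with the TF 1.x-style tf.estimator API, combining a bucketised/crossed wide side with dense deep features. The feature names, toy in-memory input function, and model directory are hypothetical placeholders.

```python
# A minimal sketch (TF 1.x-style Estimator API) of a Wide & Deep regressor.
# Feature names, the toy in-memory input_fn, and model_dir are hypothetical.
import tensorflow as tf

# Dense (real-valued) and categorical feature columns.
sqft = tf.feature_column.numeric_column("sqft")
rooms = tf.feature_column.numeric_column("rooms")
city = tf.feature_column.categorical_column_with_vocabulary_list(
    "city", ["london", "paris", "tokyo"])

# Simple feature engineering: bucketise a numeric column and cross it.
sqft_buckets = tf.feature_column.bucketized_column(
    sqft, boundaries=[500, 1000, 1500, 2000])
city_x_sqft = tf.feature_column.crossed_column(
    [sqft_buckets, "city"], hash_bucket_size=100)


def input_fn():
    # Toy in-memory data standing in for a real input pipeline.
    features = {"sqft": [800.0, 1800.0], "rooms": [2.0, 4.0],
                "city": ["london", "tokyo"]}
    labels = [150.0, 450.0]
    return tf.data.Dataset.from_tensor_slices((features, labels)).repeat().batch(2)


# Wide (linear) side takes the sparse crosses; deep (DNN) side takes dense columns.
model = tf.estimator.DNNLinearCombinedRegressor(
    model_dir="/tmp/wide_deep",  # checkpoints and TensorBoard summaries land here
    linear_feature_columns=[sqft_buckets, city_x_sqft],
    dnn_feature_columns=[sqft, rooms,
                         tf.feature_column.embedding_column(city, dimension=4)],
    dnn_hidden_units=[32, 16])

# train_and_evaluate drives training + evaluation and supports distributed runs.
tf.estimator.train_and_evaluate(
    model,
    tf.estimator.TrainSpec(input_fn=input_fn, max_steps=100),
    tf.estimator.EvalSpec(input_fn=input_fn, steps=5))
```

Pointing TensorBoard at model_dir lets you monitor the training and evaluation curves written alongside the checkpoints.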
Streaming Systems with Pub/Sub and Dataflow
- What are the three challenges of streaming: scalability and fault tolerance, latency, and instant insights as data arrives. Understand how GCP services and architecture can meet these challenges for batch and streaming data pipelines.
- Learn to use Pub/Sub — creating a topic, creating a subscription, publishing a message, and subscribing to receive a message. Understand its place in the data pipeline architecture, and how it can be used for fan-in/fan-out. Know the difference between push and pull subscriptions and how each is implemented in Pub/Sub.
- How does Dataflow support windowing — e.g. fixed, sliding, and session windows. Differentiate event time vs processing time in the context of data arrival latency, and know how to capture a timestamp at message ingestion. Know how to create a Dataflow pipeline, apply windowing, and perform aggregations (see the sketch below). Define watermarks, triggers, transformations, accumulation, and windowing.
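As a concrete illustration of the two points above, here is a minimal Apache Beam sketch of a streaming pipeline that reads from Pub/Sub, applies fixed one-minute windows, and aggregates per key. The topic path and JSON message shape are hypothetical, and a real job would add DataflowRunner pipeline options and credentials.

```python
# Minimal streaming Beam sketch: read from Pub/Sub, window, aggregate.
# The topic path and JSON message shape are hypothetical placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms import window

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (p
     | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
           topic="projects/my-project/topics/page-views")
     | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
     | "KeyByPage" >> beam.Map(lambda event: (event["page"], 1))
     # Fixed 60-second windows; sliding or session windows would go here instead.
     | "Window" >> beam.WindowInto(window.FixedWindows(60))
     | "CountPerPage" >> beam.CombinePerKey(sum)
     | "Print" >> beam.Map(print))
```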
Serverless Data Analysis with BigQuery
- How to load/export data to/from BigQuery, using the Console or API tools. Know which file formats are supported (CSV/JSON/Avro/ORC/Parquet), and which GCP services can integrate with and load data into BigQuery.
- Learn the basics of SQL queries with BigQuery, including joins on equality, joins on boolean conditions, joins with functions, and window functions. Know the advanced capabilities of the WITH clause, ARRAY_AGG, and UNNEST (exercised in the sketch after this list). What are user-defined functions (UDFs) and extended UDFs, and what are their limitations?
- Know how to troubleshoot queries with the BigQuery explain plan.
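The following is a small illustrative example of running a Standard SQL query from the Python client that exercises the WITH clause, UNNEST, and ARRAY_AGG together; the project ID and inline data are placeholders.

```python
# Illustrative only: run a Standard SQL query from Python using WITH,
# UNNEST and ARRAY_AGG over a small inline dataset.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

sql = """
WITH orders AS (
  SELECT * FROM UNNEST([
    STRUCT('alice' AS customer, ['book', 'pen'] AS items),
    STRUCT('bob' AS customer, ['laptop'] AS items)
  ])
)
SELECT
  customer,
  ARRAY_AGG(item ORDER BY item) AS items_bought,
  COUNT(*) AS item_count
FROM orders, UNNEST(items) AS item   -- correlated UNNEST flattens the array
GROUP BY customer
"""

for row in client.query(sql).result():
    print(row.customer, row.item_count, list(row.items_bought))
```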
Serverless Data Storage with Bigtable
- Differentiate the use case scenarios for Bigtable and Cloud Datastore.
- Understand the need for a good row key design in Bigtable (and what to avoid), especially with time-ranged data (see the sketch after this list).
- Have an idea of some good performance tips for using Bigtable, and for scaling its resources.
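Row key design is worth internalising because Bigtable sorts rows lexicographically by key. The sketch below (the entity names and key layout are hypothetical) shows one common pattern for time-ranged data: promote a well-distributed field ahead of the timestamp to avoid hotspotting, and reverse the timestamp so the most recent rows sort first.

```python
# Illustrative row key construction for time-series data in Bigtable.
# Keys sort lexicographically, so:
#  - never lead with a raw timestamp (all writes hit one tablet/node),
#  - promote a well-distributed field (e.g. a device ID) to the front,
#  - optionally reverse the timestamp so the newest rows sort first per device.
import time

MAX_EPOCH_SECONDS = 10**10  # any constant larger than the current epoch seconds


def row_key(device_id: str, event_epoch_seconds: int) -> bytes:
    reversed_ts = MAX_EPOCH_SECONDS - event_epoch_seconds
    # Zero-pad the reversed timestamp so the keys also sort numerically,
    # e.g. b"device#sensor-042#08254..."
    return f"device#{device_id}#{reversed_ts:011d}".encode("utf-8")


print(row_key("sensor-042", int(time.time())))
```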
Serverless Data Pipelines with Dataflow
- How to create an ETL pipeline, use ParDo (parallel do), read from and write to GCP storage services, and run a pipeline job on GCP (see the sketch after this list). Know the API options for Apache Beam with Java and Python.
- How to perform MapReduce-style operations in Dataflow using GroupByKey and Combine.
- What are side inputs and how are they used.
- What is Dataprep used for, and how is it used together with Dataflow.
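To tie the list above together, here is a minimal batch Beam sketch showing a ParDo, a GroupByKey shuffle with a combine step, and a side input passed as a dictionary. The in-memory records and the currency lookup are hypothetical; on GCP the same pipeline would use the DataflowRunner with Cloud Storage or BigQuery IO.

```python
# Minimal batch Beam sketch: ParDo, GroupByKey ("shuffle"), and a side input.
# The in-memory records and the currency lookup are hypothetical.
import apache_beam as beam


class ParseLine(beam.DoFn):
    """ParDo step: turn 'country,amount' records into (country, amount) pairs."""
    def process(self, line):
        country, amount = line.split(",")
        yield country, float(amount)


with beam.Pipeline() as p:
    sales = (p
             | "Create" >> beam.Create(["uk,10.0", "uk,5.5", "sg,7.0"])
             | "Parse" >> beam.ParDo(ParseLine()))

    # Small lookup PCollection used as a side input (materialised as a dict).
    currencies = p | "Currencies" >> beam.Create([("uk", "GBP"), ("sg", "SGD")])

    (sales
     | "GroupByCountry" >> beam.GroupByKey()              # (country, [amounts])
     | "Sum" >> beam.Map(lambda kv: (kv[0], sum(kv[1])))  # CombinePerKey(sum) is the lifted form
     | "AttachCurrency" >> beam.Map(
           lambda kv, cur: (kv[0], kv[1], cur.get(kv[0], "?")),
           cur=beam.pvalue.AsDict(currencies))
     | "Print" >> beam.Map(print))
```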
Unstructured Data with Dataproc
- General knowledge of the main Hadoop components — Hive, Spark, Pig, HDFS, MapReduce, RDDs. Know what master, worker, and preemptible worker nodes are.
- How Google Cloud Storage can be used for persistence.
- Have an idea of the general configuration options for launching and creating a Dataproc cluster on GCP (see the sketch after this list).
- How to customise clusters with initialisation scripts.
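As a rough sketch of the configuration surface (not an official sample), the google-cloud-dataproc Python client can create a cluster with standard workers, preemptible secondary workers, and an initialisation action. The project, region, machine types, and script path below are all assumptions.

```python
# Hedged sketch: create a Dataproc cluster with workers, preemptible
# (secondary) workers, and an initialisation action. Project, region,
# bucket and script paths are hypothetical placeholders.
from google.cloud import dataproc_v1

project_id, region = "my-project", "europe-west1"

cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"})

cluster = {
    "project_id": project_id,
    "cluster_name": "study-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        # Preemptible/secondary workers add cheap burst capacity.
        "secondary_worker_config": {"num_instances": 2, "is_preemptible": True},
        # Initialisation actions customise each node at startup.
        "initialization_actions": [
            {"executable_file": "gs://my-bucket/scripts/install-extras.sh"}],
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster})
print(operation.result().cluster_name)
```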
Data Analysis & Visualisation
- How Datalab is used with Python to analyse data in BigQuery and Cloud Storage, and how to manipulate data with pandas DataFrames (see the sketch after this list).
- How to query and create visualisations using Data Studio over various GCP services such as BigQuery, YouTube Analytics, etc.
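A typical notebook workflow looks like the sketch below: run a query against a BigQuery public dataset, pull the result into a pandas DataFrame, and continue the analysis with ordinary pandas calls. The project ID is a placeholder, and pandas must be installed for to_dataframe() to work.

```python
# Typical notebook workflow (Cloud Datalab / Jupyter): query BigQuery into a
# pandas DataFrame and carry on with pandas.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

df = client.query("""
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 1000
""").to_dataframe()

print(df.describe())             # summary statistics
print(df.nlargest(10, "total"))  # ten most common names
```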
Good luck to all on your journey to Google Cloud certification!
Originally posted on LinkedIn: https://www.linkedin.com/pulse/study-guide-google-cloud-professional-data-engineer-path-simon-lee/