I recently sat for the Amazon AWS Certified Big Data Specialty exam and passed it! In this article I would like to provide an outline of the topics covered and my learning path towards certification. The exam is for anyone who wishes to validate their technical skills and experience in designing and implementing big data solutions using AWS cloud services.
For anyone interested in gaining this certification, an existing AWS Associate-level certification is required, and some experience with data analysis is recommended. In my case, I had recently obtained the Certified Solutions Architect Associate and Certified Developer Associate certifications, along with at least 5 years' experience in the business intelligence and data analytics field.
The exam is 3 hours long with 65 multiple-choice questions. The questions are all scenario-based and require an understanding of how multiple services are interconnected to solve a big data problem. Refer to the exam blueprint and sample questions.
Enrol in Training Courses
A Cloud Guru’s AWS Big Data Certification course was my main source of training. The course is regularly updated to keep up with the fast-changing pace of AWS, and its topics cover the range of AWS services that are examinable. Once you have an idea of which AWS services make up the big data ecosystem, proceed to deep-dive into each service via the documentation and whitepapers, and study further via YouTube videos. Note that this course alone won’t get you through the exam, though it helps a lot.
Read the Documentation
There is a plethora of online documentation and resources on the AWS website alone. I looked at the Developer Guides for each AWS service covered in the exam. The Big Data Blog was another source of learning material; however, I only paid attention to the blog posts mentioned in the A Cloud Guru course. In hindsight, it would have been beneficial to read more of the articles there.
The whitepapers are a good complement to the online documentation and guides, with useful material particularly for understanding use cases and working through problem-solving scenarios. Below are the minimum papers one should go through:
Watch AWS Videos
Lastly, I watched numerous YouTube videos from AWS re:Invent and AWS Summit sessions, which provided customer use cases and real-world examples of big data architectures. The deep-dive videos gave me a further understanding of the AWS services, as well as newer features announced on top of the services covered by A Cloud Guru, though these newer features were not examinable. Below are some of the videos I went through:
- AWS re:Invent 2017: Advanced Design Patterns for Amazon DynamoDB (DAT403-R)
- AWS re:Invent 2015 | (DAT401) Amazon DynamoDB Deep Dive
- AWS re:Invent 2017: Analyzing Streaming Data in Real Time with Amazon Kinesis (ABD301)
- AWS re:Invent 2017: Best Practices for Data Warehousing with Amazon Redshift & Redshift Spectrum (ABD304)
- Amazon Redshift Masterclass
- Amazon EMR Masterclass
- AWS re:Invent 2016: Deep Dive: Amazon EMR Best Practices & Design Patterns (BDM401)
- AWS re:Invent 2017: Building Visualizations and Dashboards with Amazon QuickSight (ABD206)
- AWS re:Invent 2017: Deploying Business Analytics at Enterprise Scale with Amazon QuickSight (ABD311)
Overview of Exam Topics
Below is an outline of the topics that I covered and that may come up in the exam. Pay extra attention to services like Kinesis, Redshift, and EMR, and in particular how they integrate with S3.
Kinesis Streams
- KPL, KCL, Kinesis Agent, Kinesis API, Connector Library
- Sharding, Retention period, autoscaling
- Differences between Kinesis Streams and SQS
- Batching, Aggregation, Collection (see the batching sketch after this list)
- KCL checkpointing
- Monitoring and exceptions
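To make the batching and error-handling points above concrete, here is a minimal sketch, using boto3 rather than the KPL, of writing a batch of records to a stream and retrying only the records that fail (for example, due to per-shard throttling). The stream name and payloads are made up for illustration.

```python
import json
import time
import boto3

kinesis = boto3.client("kinesis")

def put_batch(stream_name, events, max_retries=3):
    records = [
        {"Data": json.dumps(e).encode("utf-8"), "PartitionKey": e["user_id"]}
        for e in events
    ]
    for attempt in range(max_retries):
        resp = kinesis.put_records(StreamName=stream_name, Records=records)
        if resp["FailedRecordCount"] == 0:
            return
        # Keep only the records that were rejected (e.g. ProvisionedThroughputExceededException)
        records = [r for r, res in zip(records, resp["Records"]) if "ErrorCode" in res]
        time.sleep(2 ** attempt)  # simple exponential backoff
    raise RuntimeError(f"{len(records)} records still failing after {max_retries} retries")

put_batch("clickstream", [{"user_id": "u1", "page": "/home"}, {"user_id": "u2", "page": "/cart"}])
```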
Kinesis Firehose
- Integration with S3/Redshift/ElasticSearch (see the sketch after this list)
- Kinesis Agent, Kinesis API
- Monitoring and exceptions
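As a quick illustration of the producer side, here is a minimal boto3 sketch of pushing a batch of records to a Firehose delivery stream that is assumed to be configured (separately) to buffer and deliver into S3 or Redshift. The delivery stream name is hypothetical.

```python
import json
import boto3

firehose = boto3.client("firehose")

response = firehose.put_record_batch(
    DeliveryStreamName="web-logs-to-s3",  # hypothetical delivery stream
    Records=[
        {"Data": (json.dumps({"path": "/home", "status": 200}) + "\n").encode("utf-8")},
        {"Data": (json.dumps({"path": "/cart", "status": 404}) + "\n").encode("utf-8")},
    ],
)
print("Failed records:", response["FailedPutCount"])
```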
IoT — general knowledge
Data Pipeline — Integration with AWS services
S3 — Integration with AWS services
Glacier — general knowledge
DynamoDB
- Integration with AWS services
- Choice of Partition/Sort Key, LSI/GSI (see the table-definition sketch after this list)
- Partitioning size
- Throttling reads/writes, and mitigations
- DynamoDB streams — general knowledge
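To ground the key-design bullets above, here is a minimal boto3 sketch of creating a table with a partition key, a sort key, and a global secondary index; the table, attribute, and index names are hypothetical.

```python
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="Orders",
    AttributeDefinitions=[
        {"AttributeName": "customer_id", "AttributeType": "S"},
        {"AttributeName": "order_date", "AttributeType": "S"},
        {"AttributeName": "order_status", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "customer_id", "KeyType": "HASH"},  # partition key
        {"AttributeName": "order_date", "KeyType": "RANGE"},  # sort key
    ],
    GlobalSecondaryIndexes=[
        {
            "IndexName": "status-date-index",
            "KeySchema": [
                {"AttributeName": "order_status", "KeyType": "HASH"},
                {"AttributeName": "order_date", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
            "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
        }
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 10, "WriteCapacityUnits": 10},
)
```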
Lambda — Integration with AWS services
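The integration pattern that comes up most often is Lambda consuming a stream. Below is a minimal sketch of a handler for a Kinesis Streams event source (DynamoDB Streams delivers a differently shaped record); the downstream action is left as a comment.

```python
import base64
import json

def handler(event, context):
    for record in event["Records"]:
        # Kinesis record data arrives base64-encoded
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # ... transform and forward, e.g. write to DynamoDB, S3, or Firehose ...
        print(payload)
    return {"processed": len(event["Records"])}
```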
EMR
- Instance types, storage and compression
- Consistent view
- S3DistCp (see the step sketch after this list)
- Resizing and autoscaling a cluster
- Hadoop ecosystem with Hive, HBase, Presto, Spark
- Spark integration with Kinesis
- File formats Text/Parquet/ORC/AVRO — general knowledge
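As an example of the S3DistCp item above, here is a minimal boto3 sketch of adding an S3DistCp step to a running cluster to copy and aggregate small files from S3 into HDFS; the cluster ID, bucket, and groupBy pattern are hypothetical.

```python
import boto3

emr = boto3.client("emr")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # hypothetical cluster ID
    Steps=[
        {
            "Name": "Copy raw logs from S3 to HDFS",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "s3-dist-cp",
                    "--src", "s3://my-log-bucket/raw/",
                    "--dest", "hdfs:///input/",
                    "--groupBy", ".*(\\d{4}-\\d{2}-\\d{2}).*",  # combine small files by date
                ],
            },
        }
    ],
)
```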
Redshift
- Node slice
- Distribution styles
- Sort Key
- Data Types
- Compression
- Constraints
- Workload Management / Queues
- Data loading techniques, encryption and compression
- Upsert (staging-table pattern; see the sketch after this list)
- Vacuum and Deep Copy
- Snapshots, Cross Region Snapshots, Restore from Snapshots
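To tie a few of these together, below is a minimal sketch of the Redshift patterns I revised most: a table with an explicit DISTKEY and SORTKEY, a compressed COPY from S3, and an upsert via a staging table. The table names, columns, S3 path, and IAM role are hypothetical, and the statements would be run through any PostgreSQL-compatible client.

```python
# DDL: distribution and sort keys chosen for joins on customer_id and date-range scans
CREATE_SALES = """
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_date   DATE,
    amount      DECIMAL(12,2)
)
DISTKEY (customer_id)
SORTKEY (sale_date);
"""

# Load: COPY gzip-compressed, pipe-delimited files from S3 using an IAM role
COPY_SALES = """
COPY sales
FROM 's3://my-data-bucket/sales/2018/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
GZIP DELIMITER '|';
"""

# Upsert: load into a staging table first, then delete matching rows and insert in one transaction
UPSERT_SALES = """
BEGIN;
DELETE FROM sales USING sales_staging
 WHERE sales.sale_id = sales_staging.sale_id;
INSERT INTO sales SELECT * FROM sales_staging;
END;
"""
```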
ElasticSearch — general knowledge
Data Visualisation
- QuickSight
- Zeppelin, Jupyter, D3.js, MicroStrategy — general knowledge
Athena / Glue — general knowledge
Machine Learning — general knowledge
Security
- Data at rest/in-transit
- SSE/CSE (see the S3 sketch after this list)
- KMS
- Private Subnet / VPC endpoints
- Redshift Security
- EMR Security
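As one concrete example of protecting data at rest, here is a minimal boto3 sketch of uploading an object to S3 with SSE-KMS; the bucket, key, and KMS key alias are hypothetical, and omitting SSEKMSKeyId would fall back to the default aws/s3 key.

```python
import boto3

s3 = boto3.client("s3")

s3.put_object(
    Bucket="my-secure-bucket",
    Key="exports/report.csv",
    Body=b"id,amount\n1,42\n",
    ServerSideEncryption="aws:kms",     # SSE-KMS for data at rest
    SSEKMSKeyId="alias/my-data-key",    # hypothetical customer-managed CMK alias
)
```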
Good luck to all on your journey to AWS certification!
Originally posted on LinkedIn: https://www.linkedin.com/pulse/my-path-aws-big-data-speciality-certification-simon-lee/