Hey guys, I am new to data engineering and am currently learning a few AWS services. I am designing a portfolio project to get some experience building ETL pipelines. My current plan is this:
1. Extract data from an API and ingest it into S3
2. Transform and clean the data with Spark on Amazon EMR (Elastic MapReduce) or AWS Glue
3. Save the cleaned data back to S3
4. Use QuickSight for dashboarding
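To make the first step concrete, here is a minimal sketch of landing raw API records in S3. It is only an illustration: the `raw/` prefix, the `year=/month=/day=` key layout, and the function names are assumptions, not part of the plan above. The upload uses boto3's `put_object`, so it needs boto3 installed and AWS credentials configured.

```python
import json
from datetime import date


def build_raw_key(source: str, run_date: date, prefix: str = "raw") -> str:
    """Build a date-partitioned S3 key (hypothetical layout) for one day's extract."""
    return (
        f"{prefix}/{source}/year={run_date:%Y}/month={run_date:%m}/"
        f"day={run_date:%d}/data.jsonl"
    )


def to_json_lines(records) -> bytes:
    """Serialize a list of dicts as newline-delimited JSON, which Spark reads directly."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records).encode("utf-8")


def upload_raw(records, bucket: str, source: str, run_date: date) -> str:
    """Write the records to S3 and return the key that was used."""
    import boto3  # third-party; imported here so the helpers above work without it

    key = build_raw_key(source, run_date)
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=to_json_lines(records))
    return key
```

Partitioning the raw keys by date is one common choice because the Spark job in the next step can then read a single day's prefix instead of scanning the whole bucket.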
My question is: why is Redshift so popular as a data warehousing solution? Could I include Redshift in my workflow, and if I did, wouldn't it be a redundant step?
I would appreciate any kind of feedback regarding my question or in general with respect to the data pipeline I designed. Thanks
It depends on how much data you have to process and how much of the processing you can offload to Hadoop. Redshift performs well, but it doesn't handle many concurrent operations, so running data transformations in Redshift may slow down your users' queries. Also, Hadoop can process many data types and file formats, while Redshift is more limited.
I am using S3 -> Redshift, and the performance is pretty good. As the previous comment says, there is a trade-off: if you don't want to block user queries, use either Redshift WLM (workload management) or EMR. With Redshift WLM your process will be throttled, whereas with EMR you will be charged for the AWS resources.
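For anyone wondering what the S3 -> Redshift step looks like in practice, it usually comes down to a `COPY` statement. Below is a rough sketch that only builds the SQL string. The table name, bucket path, and IAM role ARN are placeholders, and it assumes the cleaned data was written as Parquet. You would run the resulting statement through whatever SQL client you use to connect to Redshift.

```python
def build_copy_sql(table: str, s3_uri: str, iam_role_arn: str) -> str:
    """Return a Redshift COPY statement that loads Parquet files from an S3 prefix.

    All three arguments are caller-supplied; the values used below are placeholders.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS PARQUET;"
    )


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(
        build_copy_sql(
            table="analytics.cleaned_events",
            s3_uri="s3://my-portfolio-bucket/clean/events/",
            iam_role_arn="arn:aws:iam::123456789012:role/my-redshift-copy-role",
        )
    )
```

Keeping the heavy transformation in EMR or Glue and using Redshift only for the load and the dashboard queries is one way to avoid the contention described above.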