Videos
Context
Working at a startup as a Data Engineer for the analytics teams. We'll probably get an Architect to come in soon but I'm trying to do some preliminary research on this. Currently the pipeline we have for analytics:
backend application data (Postgres) -> ETL for analytics structuring (Matillion) -> Single Redshift prod cluster. The data from redshift is referenced by various BI platforms, and various analytics tools and SQL clients. I'll be re-building a lot of this stuff in the near future with better tools and efficiency returns. The redshift schema needs some significant re-designs which I'm doing ERD design on right now based on user acceptance criteria and scalability needs.
I would like to have a dev environment for this, so that I can "build around it" as the one thing that everything touches (ETL through analysis client). From there I'm planning on building staging environments within the client tools themselves for onboarding/performance/troubleshooting.
Questions
I imagine there's a way to use DB snapshots as a data source for the dev DB to ingest in Redshift, right? I'm guessing that I would essentially set up a dev cluster and push prod data to s3 where it could either by refreshed into dev manually or via automation, but it seems like something that AWS would have built in as a feature, not requiring Glue or any external ETL, right? (btw thinking about switching all ETLs to Glue if it does).
I'm curious about what other folks are doing out there and why. I don't usually do much on the cloud architecture side so I'm very open to any and all suggestions.
Thanks!
With my experience in working with Redshift, I can assert the following points with confidence:
Multiple schema: You should create multiple schema and create tables accordingly. When you'll scale, it'll be easier for you to pin-point where exactly the table is supposed to be. Let us say, you have 3 schema, named
production,aggregatesandrough. Now, you know that the tableproductionwill contain the tables that are not supposed to be changed (mostly OLTP data) - such asuser, order, transactionstables. Tableaggregateswill have aggregated data built over raw tables - such asnumber of orders placed per user per day per category. Finally,roughwill contain any table that doesn't hold a business logic but is required for some temporary work - let us say to check the genre of movies for a list of 1 lakh users, which is shared with you in an excel file. Simply create a table inroughschema, perform your operations and drop the table. Now you very clearly know where you'll find the tables based on whether they are raw, aggregated or simply temporary tables.Public schema: Forget it exists. Any table that is not preceded with a schema name, gets created there. A lot of clutter - no point in storing any important data there.
- Cross schema joins: There's no stopping here. You may join as many tables from as many schema as required. In fact, it is desirable you create dimension tables and join on a PK later, rather than to keep all the information in a single table.
Spend some quality time in designing the schema and underlying table structure. When you expand, it'll be easier for you to classify things better in terms of access control. Do let me know if I've missed some obvious points.
You can have multiple databases in a Redshift cluster but I would stick with one. You are correct that schemas (essentially namespaces) are a good way to divide things up. You can query across schemas but not databases.
I would avoid using the public schema as managing certain permissions there can be difficult (easier to deny someone access to public than prevent them from being able to create a table for example).
For best results if you have the time, learn about the permissions system up front. You want to create groups that have access to schemas or tables and add/remove users from groups to control what they can do. Once you have that going it becomes pretty easy to manage.
Hello,
I am building a Modern Data Platform with tools like RDS, s3, Airbyte (for the integration), Redshift (as a Datawarehouse), VPC (security), Terraform( IaC), and Lambda.
Is using Redshift as a Datawarehouse a good choice?
PS : The project is to showcase how to build a modern data platform.