Factsheet
The .md (machine description) files in the GCC source tree describe how to generate assembly for a target. GCC contains several specialized C/C++ code generators, and some of them translate the .md files into code that emits assembly.
GCC is a very complex program. The documentation of GCC MELT (an obsolete project) contains several interesting links and slides, notably referring to the Indian GCC Resource Center.
Most of the optimizations in GCC happen in the middle-end (which is mostly independent of the source language and the target system), notably in the many passes that work on the GIMPLE representations.
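One way to see the GIMPLE form that these middle-end passes work on is to ask GCC to dump it. The following is only a minimal sketch: `sum_squares` and the file name `gimple_demo.c` are made up for illustration, while `-fdump-tree-gimple` and `-fdump-tree-optimized` are documented GCC developer options.

```c
/* gimple_demo.c (hypothetical file name).
 *
 * Compile with, for example:
 *   gcc -O2 -c -fdump-tree-gimple -fdump-tree-optimized gimple_demo.c
 * GCC then writes dump files whose names end in ".gimple" and
 * ".optimized"; the exact numbering in the names varies by version.
 */

/* Hypothetical example function: a loop with a multiply-accumulate.
   In the ".gimple" dump it appears as simple three-address statements
   with explicit temporaries and gotos. */
int sum_squares(int n)
{
    int total = 0;
    for (int i = 1; i <= n; i++)
        total += i * i;
    return total;
}
```

Comparing the `.gimple` dump (before optimization) with the `.optimized` dump (after the GIMPLE passes have run) gives a rough idea of how much work the middle-end does before any target-specific code is involved.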
The GCC repo was an SVN repository when this was written; the project has since migrated to Git (in January 2020).
See also this answer, notably the pictures inside it.
The actual source code for GCC is most easily obtained from here:
https://gcc.gnu.org/svn.html
The software is available via SVN (Subversion), a version control system. An svn client is installed on many Linux/UNIX systems; if it isn't on your platform, install Subversion and then fetch the source with the following command:
svn checkout svn://gcc.gnu.org/svn/gcc/trunk SomeLocalDir
GCC is complex, and it takes significant experience to understand how it actually compiles code for different architectures.
In a nutshell, GCC has three major components: the front-end, the middle-end and the back-end. The front-end parses the source language (C, C++, Objective-C, etc.) and checks its syntax. It lowers the code to a language-independent intermediate representation (GENERIC trees, then GIMPLE), which is passed on through the middle-end to the back-end for compilation to the target environment.
The middle-end performs code analysis and optimisation, transforming the intermediate code so that the output at the end of the full process is as good as possible. Technically, optimisation can occur at any stage of the process as patterns are discovered during analysis.
The back-end lowers the code to RTL (Register Transfer Language), a low-level intermediate format close to machine instructions (not actually final executable code). Depending on the target, this "pseudo-code" is optimised for the available registers, bit-sizes, endianness, and so on, and is then written out as assembly. The final code is generated during the assembly phase, in which the assembler converts the back-end output into machine executable instructions.
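The three phases described above can be illustrated with a toy sketch. Everything in it is hypothetical: the `Insn` and `Operand` types, the function names and the pseudo-assembly syntax are invented for illustration and are not GCC's actual data structures or output. The "front end" is reduced to a hand-lowered expression, the "middle end" is a single constant-folding pass, and the "back end" prints one made-up instruction per remaining statement.

```c
#include <stdio.h>

enum { MAX_TEMPS = 16 };            /* toy limit on numbered temporaries */

typedef enum { OP_ADD, OP_MUL } Op;

/* An operand is either an integer constant or a numbered temporary. */
typedef struct { int is_const; int value; } Operand;

/* One three-address statement: t<dst> = a <op> b. */
typedef struct { Op op; int dst; Operand a, b; int dead; } Insn;

/* "Front end": hand-lowered form of (2 + 3) * x, with x living in t0. */
int lower_example(Insn *out)
{
    out[0] = (Insn){ OP_ADD, 1, { 1, 2 }, { 1, 3 }, 0 };  /* t1 = 2 + 3   */
    out[1] = (Insn){ OP_MUL, 2, { 0, 1 }, { 0, 0 }, 0 };  /* t2 = t1 * t0 */
    return 2;
}

/* "Middle end": constant propagation and folding over the statement list.
   Assumes every temporary index is below MAX_TEMPS. */
void fold_constants(Insn *code, int n)
{
    int known[MAX_TEMPS] = { 0 };   /* known[t] is 1 once t is a constant */
    int value[MAX_TEMPS] = { 0 };

    for (int i = 0; i < n; i++) {
        Operand *ops[2] = { &code[i].a, &code[i].b };
        for (int k = 0; k < 2; k++) {
            int t = ops[k]->value;
            if (!ops[k]->is_const && known[t]) {
                ops[k]->is_const = 1;       /* replace temp by its value */
                ops[k]->value = value[t];
            }
        }
        if (code[i].a.is_const && code[i].b.is_const) {
            int a = code[i].a.value, b = code[i].b.value;
            known[code[i].dst] = 1;
            value[code[i].dst] = (code[i].op == OP_ADD) ? a + b : a * b;
            code[i].dead = 1;               /* no run-time instruction needed */
        }
    }
}

/* Print one operand in the made-up assembly syntax: #5 or r0. */
void format_operand(Operand o, char *out, size_t size)
{
    if (o.is_const)
        snprintf(out, size, "#%d", o.value);
    else
        snprintf(out, size, "r%d", o.value);
}

/* "Back end": map each surviving statement to one pseudo-instruction. */
void emit_asm(const Insn *code, int n, char *buf, size_t size)
{
    size_t used = 0;
    if (size > 0)
        buf[0] = '\0';
    for (int i = 0; i < n && used < size; i++) {
        if (code[i].dead)
            continue;
        char a[16], b[16];
        format_operand(code[i].a, a, sizeof a);
        format_operand(code[i].b, b, sizeof b);
        int w = snprintf(buf + used, size - used, "%s r%d, %s, %s\n",
                         code[i].op == OP_ADD ? "add" : "mul",
                         code[i].dst, a, b);
        if (w < 0)
            break;
        used += (size_t)w;
    }
}
```

Calling `lower_example`, then `fold_constants`, then `emit_asm` on a character buffer leaves the single line `mul r2, #5, r0` in the buffer: the addition has been folded away before any "assembly" is produced. The real compiler does the same kind of thing over far richer representations and with many more passes.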
It's important to note that GCC supports many classes of target architecture, and the target is chosen through options given when the compiler is configured and built. For cross-compiling and target compiler options, try checking out this link:
https://gcc.gnu.org/install/configure.html
So there is the helpful "A Tourist's Guide to the LLVM Source Code", and I was wondering if something similar exists for GCC?
I found the repo a bit off-putting, like this mirror with hundreds of files in one folder.
I'm mostly interested in codegen, e.g., what register allocation algorithms are implemented.
Thanks!
As a starting point, see Links and Selected Readings on the GCC site. Of particular interest to you, I think, are:
- GNU C Compiler Internals
- Compilation of Functional Programming Languages using GCC -- Tail Calls by Andreas Bauer
- Porting GCC for Dunces by Hans-Peter Nilsson
If you want to develop on Windows you probably need to start from MinGW (Minimalist GNU for Windows) Compiler Suite sources (it includes GNU GDB debugger), which is a port of GCC to Windows.
For a comfortable development environment I cannot help much because I don't develop in C++. But I suppose a good IDE for C/C++ is what you need: have a look at this comparison, there are plenty of free/open source IDEs for Windows.
Update: I think ICI can also be of interest to you:
The Interactive Compilation Interface (or 'ICI' for short) is a plugin system with a high-level compiler-independent and low-level compiler-dependent API to transform current compilers into collaborative open modular interactive toolsets. The ICI framework acts as a "middleware" interface between the compiler and the user-definable plugins. It opens up and reuses the production-quality compiler infrastructure to enable program analysis and instrumentation, fine-grain program optimizations, simple prototyping of new development and research ideas while avoiding building new compilation tools from scratch. For example, it is used in MILEPOST GCC to automate compiler and architecture design and program optimizations based on statistical analysis and machine learning. It should enable universal self-tuning compilers adaptable to heterogeneous, reconfigurable, multi-core architectures ranging from supercomputers to embedded systems.
... as may the rest of the projects under the Collective TUNING umbrella.
Note: Writing "compilers are one of the most complex programs there are", as BlueRaja wrote in comments, is an overstatement: there are very simple compilers and very complex compilers. But in compiler theory (once you have studied it) there is nothing esoteric. GCC is as hard to understand as any BIG, poorly documented program out there [1]. So, rizwanhudda, don't be discouraged: start studying the available documentation and then ask the GCC developers (on the GCC IRC channel, as suggested by nvl, or on the GCC developers mailing list) to explain what is poorly (or not at all) documented.
[1] In fact, program comprehension is an active field of research.
I would suggest using the GCC IRC channel; it is meant for discussing the development of GCC.