A biomolecule composed of chains of amino acid residues.
The image displays a three-dimensional representation of the protein myoglobin, prominently featuring turquoise α-helices. This particular protein holds historical significance as the first whose structure was elucidated through X-ray crystallography. Towards the central-right portion of the coiled structure, a prosthetic group known as a heme group is visible, with an oxygen molecule (rendered in red) attached.
Proteins are large biomolecules and macromolecules that comprise one or more long chains of amino acid residues. They perform a vast array of functions within organisms, including catalyzing metabolic reactions, DNA replication, responding to stimuli, providing structure to cells and organisms, and transporting molecules across cellular boundaries and throughout the body. Proteins differ from one another primarily in their sequence of amino acids, which is dictated by the nucleotide sequence of the corresponding gene. This sequence typically determines how the protein folds into a specific 3D structure, which in turn determines its biological activity.
A linear chain of amino acid residues is called a polypeptide. A protein contains at least one long polypeptide. Shorter chains, typically comprising fewer than 20–30 residues, are rarely considered proteins and are commonly called peptides. The residues in a polypeptide are linked by peptide bonds. The sequence of amino acid residues in a protein is defined by the sequence of the gene that produces it, read according to the genetic code. The standard genetic code specifies 20 amino acids, but in certain organisms it can also include selenocysteine and, in some archaea, pyrrolysine. During or shortly after synthesis, the residues in a protein are often chemically altered by post-translational modification, which can change the protein's physical and chemical properties, folding stability, activity, and ultimately its function. Some proteins also have non-peptide groups attached, called prosthetic groups or cofactors. Proteins often work together, associating to form stable protein complexes that carry out more elaborate functions.
Once synthesized, proteins have a finite lifespan within the cell. They are eventually degraded and their constituent components recycled by the cell's sophisticated machinery through a process called protein turnover. The lifespan of a protein, often quantified by its half-life, can vary dramatically, ranging from mere minutes to several years. In mammalian cells, the average lifespan is typically around one to two days. Proteins that are abnormal or misfolded are generally degraded more rapidly, either because they are specifically targeted for destruction or due to inherent instability.
Like other biological macromolecules such as polysaccharides and nucleic acids, proteins are essential parts of organisms and participate in virtually every process within cells. Many proteins are enzymes that catalyze biochemical reactions and are vital to metabolism. Others have structural or mechanical functions, such as actin and myosin in muscle and the cytoskeletal scaffolding proteins that maintain cell shape. Proteins are also important in cell signaling, immune responses, cell adhesion, and the cell cycle. Animals need proteins in their diet to supply the essential amino acids that they cannot synthesize. Digestion breaks dietary proteins down into their constituent amino acids for use in metabolism.
History and etymology
Discovery and early studies
The scientific exploration of proteins dates back to the 18th century, with pioneers like Antoine Fourcroy and others. These early researchers often grouped these substances under the general term "albumins" or "albuminous materials," a concept also recognized in German as Eiweisskörper. For instance, gluten, a protein complex found in wheat, was first isolated and described in published research around 1747, with later studies confirming its presence in numerous plant species. By 1789, Antoine Fourcroy had identified three distinct categories of animal proteins: albumin, fibrin, and gelatin. During the late 1700s and early 1800s, plant proteins under investigation included gluten, plant albumin, gliadin, and legumin.
Proteins were first described as a distinct class of molecules by the Dutch chemist Gerardus Johannes Mulder, and the name "protein" was coined by the Swedish chemist Jöns Jacob Berzelius in 1838. Mulder's elemental analysis of common proteins found that nearly all of them had the same empirical formula, approximately C₄₀₀H₆₂₀N₁₀₀O₁₂₀P₁S₁. This led him to the erroneous conclusion that all proteins might be composed of a single type of very large molecule. Berzelius proposed the term "protein," derived from the ancient Greek word prōteios (πρώτειος), meaning "primary," "in the lead," or "standing in front." Mulder went on to identify products of protein degradation, such as the amino acid leucine, for which he determined a molecular weight close to its actual value of 131 Da.
Early nutritional scientists, such as the German physiologist Carl von Voit, held the view that protein was the most crucial nutrient for maintaining bodily structure, largely based on the prevailing belief that "flesh makes flesh." Around 1862, Karl Heinrich Ritthausen succeeded in isolating the amino acid glutamic acid. Later, Thomas Burr Osborne, working at the Connecticut Agricultural Experiment Station, compiled extensive reviews on vegetable proteins. In collaboration with Lafayette Mendel, Osborne conducted feeding experiments with laboratory rats that led to the identification of several nutritionally essential amino acids. Their findings demonstrated that diets deficient in these essential amino acids resulted in stunted growth in the rats, consistent with Liebig's law of the minimum. The final essential amino acid to be discovered was threonine, identified by William Cumming Rose.
The progress of early protein biochemistry was significantly hampered by the inherent difficulty in purifying proteins. While large quantities of protein could be obtained from sources like blood, egg whites, and keratin, isolating individual protein species in a pure form proved challenging. A notable exception occurred in the 1950s when the Armour Hot Dog Company purified a kilogram of bovine pancreatic ribonuclease A and made it freely available to the scientific community. This generous gesture propelled ribonuclease A to become a central focus of biochemical research for decades.
Polypeptides
The understanding that proteins are polypeptides, that is, chains of amino acids linked together, emerged from the work of Franz Hofmeister and Hermann Emil Fischer in 1902. The central role of proteins as enzymes was not fully appreciated until 1926, when James B. Sumner showed that the enzyme urease was in fact a protein.
Linus Pauling is credited with the groundbreaking prediction of regular protein secondary structures, based on the principle of hydrogen bonding, a concept initially proposed by William Astbury in 1933. Subsequent research by Walter Kauzmann on denaturation, building upon earlier studies by Kaj Linderstrøm-Lang, provided crucial insights into the mechanisms of protein folding and structure, particularly highlighting the role of hydrophobic interactions.
The first complete amino acid sequencing of a protein was achieved for insulin by Frederick Sanger in 1949. Sanger's precise determination of the insulin amino acid sequence provided conclusive evidence that proteins are linear polymers of amino acids, definitively refuting earlier hypotheses of branched chains, colloids, or cyclols. His remarkable achievement earned him the Nobel Prize in 1958. Later, Christian Anfinsen's studies on the oxidative folding process of ribonuclease A, for which he was awarded the Nobel Prize in 1972, firmly established the thermodynamic hypothesis of protein folding, postulating that the native, folded state of a protein represents its minimum free energy conformation.
Structure
John Kendrew is shown here with a model of myoglobin under construction.
The advent of X-ray crystallography made it possible to determine the three-dimensional structures of proteins in addition to their sequences. [25] The first protein structures to be solved were those of hemoglobin by Max Perutz and myoglobin by John Kendrew, both reported in 1958. [26] [27] Growing computing power has supported the analysis of increasingly complex proteins. For instance, in 1999, Roger Kornberg determined the highly complex structure of RNA polymerase using high-intensity X-rays from synchrotrons. [25]
More recently, cryo-electron microscopy (cryo-EM) has emerged as a powerful technique for visualizing large macromolecular assemblies. [28] Unlike X-ray crystallography, cryo-EM involves freezing protein samples and utilizing electron beams rather than X-rays. This method is less damaging to the sample, allowing scientists to gather more detailed information and analyze larger structures. [25] Concurrently, computational protein structure prediction methods, particularly for smaller protein structural domains, [29] have made significant strides, bringing researchers closer to achieving atomic-level resolution of protein structures.
As of April 2024, the Protein Data Bank holds 181,018 structures determined by X-ray crystallography, 19,809 by cryo-EM, and 12,697 by NMR spectroscopy. [30]
Classification
Protein families, Gene Ontology, and EC numbers
Proteins are primarily classified based on their amino acid sequence and three-dimensional structure. However, other classification systems are also widely employed. For enzymes, the Enzyme Commission number (EC number) system provides a standardized functional classification. Similarly, the Gene Ontology (GO) project offers a framework for classifying genes and proteins based on their biological roles, biochemical functions, and subcellular localization.
Sequence similarity serves as a crucial criterion for classifying proteins, reflecting both evolutionary relationships and functional similarities. This analysis can be performed on entire proteins or, more commonly, on individual protein domains, especially in the context of multi-domain proteins. Protein domains are particularly useful for classification as they represent distinct structural and functional units that can be combined in various ways. A study encompassing approximately 170,000 proteins revealed that about two-thirds could be assigned at least one domain, with larger proteins typically possessing a greater number of domains (for example, proteins exceeding 600 amino acids averaged over five domains). [33]
Biochemistry
Chemical structure and the peptide bond
The vast majority of proteins are constructed from linear polymers composed of a sequence of up to 20 different L-α-amino acids. With the exception of proline, all proteinogenic amino acids share a common structural core: an α-carbon atom covalently bonded to an amino group, a carboxyl group, and a variable side chain. Proline's unique cyclic side chain, which attaches to its own amino group, restricts the flexibility of the protein chain. [34] The side chains of the standard amino acids exhibit a wide array of chemical structures and properties, and it is the collective influence of all these amino acids that ultimately dictates the protein's three-dimensional conformation and its chemical reactivity. [35]
Amino acids within a polypeptide chain are joined by peptide bonds formed between the amino group of one amino acid and the carboxyl group of another. Once linked, each amino acid is called a residue, and the continuous chain of carbon, nitrogen, and oxygen atoms constitutes the main chain, or protein backbone. [36] : 19 The peptide bond has two resonance forms that give it partial double-bond character and inhibit rotation around its axis, so that the alpha carbons, the nitrogen atom, and the carbonyl group (C=O) are roughly coplanar. One consequence of this double-bond character is that proteins possess a degree of inherent rigidity. Rotation about the other two backbone dihedral angles of each residue determines the local shape of the backbone. [36] : 31 A polypeptide chain terminates with a free amino group at one end, known as the N-terminus or amino terminus, and a free carboxyl group at the other, termed the C-terminus or carboxy terminus. [37] By convention, amino acid sequences are written from the N-terminus to the C-terminus, mirroring the direction in which ribosomes synthesize proteins. [37] [38]
The terms protein, polypeptide, and peptide are not always strictly defined and can sometimes overlap in usage. "Protein" generally refers to the complete, biologically active molecule in its stable, folded conformation, while "peptide" is typically reserved for shorter amino acid oligomers that may not possess a stable three-dimensional structure. However, the distinction is often blurred, with the boundary usually lying around 20–30 amino acid residues. [39]
Proteins possess the remarkable ability to interact with a vast array of molecules and ions, including other proteins, lipids, carbohydrates, and DNA. [40] [41] [42]
Abundance in cells
A typical bacterial cell, such as E. coli or Staphylococcus aureus, is estimated to contain about 2 million protein molecules. Smaller bacteria, such as Mycoplasma or spirochetes, contain fewer, on the order of 50,000 to 1 million. Eukaryotic cells are larger and contain far more protein: yeast cells are estimated to contain about 50 million proteins, and human cells on the order of 1 to 3 billion. [43] The copy number of an individual protein ranges from a few molecules per cell to as many as 20 million. [44] Not all protein-coding genes are expressed in every cell, and the set of proteins present depends on factors such as cell type and external stimuli. For instance, of the roughly 20,000 proteins encoded by the human genome, only about 6,000 have been detected in lymphoblastoid cells. [45] The most abundant protein in nature is widely considered to be RuBisCO, an enzyme that catalyzes the incorporation of carbon dioxide into organic matter during photosynthesis. In plants, RuBisCO can constitute as much as 1% of total dry weight. [46]
Synthesis
Biosynthesis
The DNA sequence of a gene ultimately dictates the amino acid sequence of a protein.
Proteins are assembled from amino acids, with the information for their specific sequence encoded in genes. Each protein possesses a unique amino acid sequence determined by the nucleotide sequence of the gene responsible for its synthesis. The genetic code is structured as a series of three-nucleotide units called codons, with each codon specifying a particular amino acid. For example, the codon AUG (composed of adenine, uracil, and guanine) codes for the amino acid methionine. Given that DNA utilizes four distinct nucleotides, there are 64 possible codon combinations, leading to redundancy in the genetic code, where some amino acids are specified by multiple codons. [42] : 1002–42 In organisms, genes encoded in DNA are first transcribed into precursor messenger RNA (mRNA) by enzymes such as RNA polymerase. Most organisms then process this pre-mRNA (also known as a primary transcript) through various forms of post-transcriptional modification to yield mature mRNA. This mature mRNA then serves as the template for protein synthesis by the ribosome. In prokaryotes, mRNA can be translated as it is being produced or immediately after it detaches from the nucleoid. Conversely, in eukaryotes, mRNA is synthesized within the cell nucleus and subsequently transported across the nuclear membrane into the cytoplasm, where protein synthesis takes place. The rate of protein synthesis is generally higher in prokaryotes compared to eukaryotes, potentially reaching up to 20 amino acids per second. [47]
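The codon arithmetic above can be checked with a short sketch. It builds the standard genetic code in RNA form and counts how many codons specify each amino acid. The names `BASES`, `AMINO_ACIDS`, `CODON_TABLE`, and `degeneracy` are illustrative choices rather than part of any library, and the table covers only the standard code, not the variant codes used by some organisms and organelles.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"  # RNA alphabet, in the conventional codon-table order
# One-letter amino acid codes for the 64 codons of the standard genetic code,
# ordered by first, second, then third base over BASES; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map every three-base codon to the symbol at the same position in AMINO_ACIDS.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of codons that specify each amino acid (or the stop signal).
degeneracy = Counter(CODON_TABLE.values())

print(len(CODON_TABLE))    # 64 codons in total (4 ** 3)
print(CODON_TABLE["AUG"])  # M (methionine)
print(degeneracy["L"])     # 6 codons specify leucine
print(degeneracy["W"])     # 1 codon specifies tryptophan
```

Because 64 codons map onto only 20 amino acids plus the stop signal, most amino acids are necessarily specified by more than one codon, which is the redundancy described above.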
The process of synthesizing a protein from an mRNA template is termed translation. The mRNA is loaded onto the ribosome and read three nucleotides at a time: each codon is matched by base pairing to the complementary anticodon of a transfer RNA (tRNA) molecule, which carries the amino acid corresponding to the codon it recognizes. The enzyme aminoacyl tRNA synthetase "charges" tRNA molecules with the correct amino acids. The growing polypeptide is often termed the nascent chain. Proteins are always synthesized from the N-terminus to the C-terminus. [42] : 1002–42
The size of a synthesized protein is typically described by the number of amino acids it contains and its total molecular mass, usually expressed in daltons (Da) or kilodaltons (kDa). The average protein size tends to increase from Archaea to Bacteria to Eukaryotes, with corresponding average lengths of 283, 311, and 438 residues, and molecular masses of 31, 34, and 49 kDa, respectively. This increase is attributed to a greater number of protein domains that often constitute proteins in more complex organisms. [48] For instance, proteins in yeast average 466 amino acids in length and have a molecular mass of 53 kDa. [39] The largest known proteins are the titins, a crucial component of the muscle sarcomere, which possess a molecular mass of nearly 3000 kDa and extend to approximately 27,000 amino acids in length. [49]
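The lengths and masses quoted above are roughly related by the average mass of a residue. The sketch below assumes a rule-of-thumb value of about 110 Da per residue, which is an assumption brought in for illustration and not a figure from the text, and compares the resulting estimates with the reported values. The name `estimate_mass_kda` is hypothetical.

```python
# Assumed rule-of-thumb average residue mass; real proteins deviate from it
# according to their amino acid composition.
AVERAGE_RESIDUE_MASS_DA = 110.0


def estimate_mass_kda(n_residues, residue_mass_da=AVERAGE_RESIDUE_MASS_DA):
    """Estimate molecular mass in kDa from the number of residues."""
    return n_residues * residue_mass_da / 1000.0


# (length in residues, reported mass in kDa), taken from the paragraph above.
reported = {
    "Archaea (average)": (283, 31),
    "Bacteria (average)": (311, 34),
    "Eukaryotes (average)": (438, 49),
    "Yeast (average)": (466, 53),
    "Titin": (27000, 3000),
}

for name, (length, mass_kda) in reported.items():
    estimate = estimate_mass_kda(length)
    print(f"{name}: estimated {estimate:.0f} kDa, reported about {mass_kda} kDa")
```

The estimates track the reported masses to within a few percent, which is about as much as a single average value can be expected to deliver.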
Chemical synthesis
Peptide synthesis
Short proteins, or peptides, can be synthesized chemically using a variety of peptide synthesis methods. These techniques leverage organic synthesis principles, such as chemical ligation, to efficiently produce peptides. [50] Chemical synthesis offers the advantage of incorporating non-natural amino acids into polypeptide chains, for example, by attaching fluorescent probes to specific amino acid side chains. [51] While these methods are invaluable in laboratory settings for biochemistry and cell biology, they are generally less practical for large-scale commercial applications. Chemical synthesis becomes inefficient for polypeptides exceeding approximately 300 amino acids, and the resulting proteins may not readily adopt their correct native tertiary structure. Notably, most chemical synthesis methods proceed in the reverse direction of biological synthesis, from the C-terminus to the N-terminus. [52]
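One way to see why stepwise chemical synthesis becomes inefficient for long chains is to compound the yield of each coupling step. The sketch below uses a deliberately simple model in which every coupling succeeds with the same independent probability. Both the model and the 99% per-step figure are illustrative assumptions, not values from the text, and the name `overall_yield` is hypothetical. The model ignores ligation of separately made fragments, which is one way longer chains are assembled in practice.

```python
def overall_yield(n_residues, per_step_yield):
    """Fraction of chains that reach full length after n_residues - 1 couplings."""
    if n_residues < 1:
        raise ValueError("a chain needs at least one residue")
    return per_step_yield ** (n_residues - 1)


# Assumed, illustrative per-step coupling yield of 99%.
for length in (10, 50, 100, 300):
    print(f"{length} residues: {overall_yield(length, 0.99):.1%} full-length product")
```

Under these assumptions the full-length fraction falls off exponentially with chain length, so even a small loss at each step leaves very little correct product for chains of a few hundred residues.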
Structure
The crystal structure of the chaperonin, a remarkably large protein complex, is depicted. A single subunit of this complex is highlighted. Chaperonins play a critical role in assisting protein folding.
Three distinct representations of the three-dimensional structure of the enzyme triose phosphate isomerase are shown. The leftmost image provides an all-atom representation, color-coded by atom type. The middle image offers a simplified view, illustrating the backbone conformation and colored according to secondary structure elements. The rightmost image displays a solvent-accessible surface representation, colored based on residue type: acidic residues are red, basic residues are blue, polar residues are green, and nonpolar residues are white.
Protein structure levels
Most proteins spontaneously fold into unique, stable three-dimensional structures. This naturally adopted folded state is referred to as the native conformation. [36] : 36 While many proteins can reach their native state unassisted, through the chemical properties of their amino acids alone, others require molecular chaperones to fold properly. [36] : 37 Biochemists typically describe protein structure at four hierarchical levels, to which a fifth, quinary level is sometimes added: [36] : 30–34
- Primary structure: This refers to the linear amino acid sequence of the polypeptide chain. A protein is essentially a polyamide.
- Secondary structure: This describes regularly repeating local structural arrangements, primarily stabilized by hydrogen bonds between backbone atoms. The most prevalent examples include the α-helix, the β-sheet, and various types of turns. Since secondary structures are local phenomena, a single protein molecule can incorporate multiple distinct regions of secondary structure.
- Tertiary structure: This encompasses the overall three-dimensional shape of a single protein molecule, representing the spatial arrangement of its secondary structure elements relative to each other. Tertiary structure is typically stabilized by interactions that span longer distances within the polypeptide chain, most notably the formation of a hydrophobic core. Other stabilizing forces include salt bridges, hydrogen bonds, disulfide bonds, and occasionally, post-translational modifications. The term "tertiary structure" is often used interchangeably with "fold." The specific tertiary structure is fundamental to the protein's function.
- Quaternary structure: This level of structure applies to proteins composed of multiple polypeptide chains (known as protein subunits in this context). The quaternary structure describes how these subunits associate to form a functional protein complex.
- Quinary structure: This concept refers to specific surface patterns on proteins that facilitate the organization of the crowded cellular interior. Quinary structure arises from transient yet essential macromolecular interactions occurring within living cells.
It is crucial to understand that proteins are not entirely rigid entities. Beyond these defined structural levels, proteins can undergo dynamic shifts between several related conformations as they perform their functions. In the context of these functional movements, these tertiary or quaternary structures are often referred to as "conformations", and the transitions between them are termed conformational changes. Such changes are frequently triggered by the binding of a substrate molecule to an enzyme's active site, the specific region of the protein involved in chemical catalysis. In aqueous solutions, protein structures are also subject to constant fluctuations due to thermal motion and collisions with other molecules. [42] : 368–75
The molecular surface of several proteins is displayed, illustrating their relative sizes. From left to right, these are: immunoglobulin G (IgG, an antibody), hemoglobin, insulin (a hormone), adenylate kinase (an enzyme), and glutamine synthetase (also an enzyme).
Proteins can be broadly categorized into three main classes based on their general three-dimensional structures: globular proteins, fibrous proteins, and membrane proteins. Most globular proteins are soluble in aqueous environments, and many of them function as enzymes. Fibrous proteins typically serve structural roles; examples include collagen, the primary structural protein in connective tissues, and keratin, the main component of hair and nails. Membrane proteins often function as receptors or form channels that facilitate the passage of polar or charged molecules across the cell membrane. [42] : 165–85
Intramolecular hydrogen bonds in proteins that are poorly shielded from water, and that therefore promote their own dehydration, are termed dehydrons. [53]
Protein domains
Many proteins are modular structures, composed of several distinct protein domains. These domains are segments of the protein that fold independently into stable, three-dimensional units. [54] : 134 Domains typically possess specific functions, such as enzymatic activity (e.g., a kinase domain) or serving as binding modules for other molecules. [54] : 155–156
The diagram contrasts protein domains with motifs. Protein domains, exemplified by the EVH1 domain, are self-contained functional units within proteins that fold into defined three-dimensional structures. Motifs, in contrast, are typically short amino acid sequences that confer specific functions but generally lack a stable, independent three-dimensional structure. Many motifs function as binding sites for other proteins, as illustrated by the red and green bars representing such motifs within a VASP protein. [55]
Sequence motif
Short amino acid sequences within proteins often act as specific recognition sites for other proteins. [56] For instance, SH3 domains characteristically bind to short sequence motifs known as PxxP motifs (where 'P' represents the amino acid proline and 'x' represents any unspecified amino acid). While the PxxP sequence is a key feature, the surrounding amino acids can significantly influence the precise binding specificity. Numerous such motifs have been cataloged in the Eukaryotic Linear Motif (ELM) database. [57]
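A PxxP motif can be located in a sequence with a simple pattern match. The sketch below is only a first-pass scan: the sequence is made up for illustration, the function name `find_pxxp` is hypothetical, and, as noted above, real binding specificity also depends on the surrounding residues, which this pattern ignores.

```python
import re

# "P..P" = proline, any two residues, proline. The lookahead (?=...) lets
# overlapping occurrences be reported, which a plain pattern would skip.
PXXP = re.compile(r"(?=(P..P))")


def find_pxxp(sequence):
    """Return (0-based start index, matched residues) for every PxxP occurrence."""
    return [(m.start(), m.group(1)) for m in PXXP.finditer(sequence)]


# Hypothetical sequence in one-letter amino acid code.
print(find_pxxp("AAPPLPPRAA"))  # [(2, 'PPLP'), (3, 'PLPP')]
```

A hit from such a scan is only a candidate site. Curated resources such as the ELM database pair patterns of this kind with additional context.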
Cellular functions
Proteins are the primary functional agents within the cell, executing the tasks specified by the genetic information encoded in DNA. [39] With the exception of certain types of RNA, most other biological molecules are relatively inert, serving as substrates or components upon which proteins act. In the bacterium Escherichia coli, proteins constitute approximately half of the cell's dry weight, whereas other macromolecules like DNA and RNA account for only about 3% and 20%, respectively. [58] The complete set of proteins expressed by a particular cell or cell type at a given time is known as its proteome. [54] : 120
The enzyme hexokinase is depicted as a standard ball-and-stick molecular model. To the same scale, shown in the upper right corner, are two of its substrates: ATP and glucose.
The defining characteristic that enables proteins to perform such a diverse array of functions is their remarkable ability to bind specifically and tightly to other molecules. The region of a protein responsible for this binding is termed the binding site, which is often shaped like a pocket or depression on the protein's surface. This binding capability is orchestrated by the protein's tertiary structure, which defines the geometry of the binding site, and by the chemical properties of the amino acid side chains lining this site. Protein binding can be exceptionally strong and highly specific. For example, the ribonuclease inhibitor protein binds to human angiogenin with an extremely low dissociation constant (less than 10⁻¹⁵ M) yet shows no significant binding to its amphibian counterpart, onconase (> 1 M). Even subtle chemical differences, such as the addition of a single methyl group to a potential binding partner, can drastically reduce or abolish binding. A prime illustration is the aminoacyl tRNA synthetase specific for the amino acid valine, which exhibits remarkable discrimination against the structurally similar amino acid isoleucine. [59]
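To make the two dissociation constants concrete, the sketch below computes the fraction of protein occupied at equilibrium for simple one-to-one binding, using the relation fraction bound = [L] / (Kd + [L]), where [L] is the free ligand concentration. The 1 nM ligand concentration is an arbitrary illustrative choice, the quoted constants are bounds rather than exact values, and the function name `fraction_bound` is hypothetical.

```python
def fraction_bound(ligand_conc_molar, kd_molar):
    """Equilibrium occupancy for 1:1 binding at a given free ligand concentration."""
    return ligand_conc_molar / (kd_molar + ligand_conc_molar)


ligand = 1e-9  # assumed free ligand concentration of 1 nM (illustrative)

# Bounds quoted in the text: Kd < 1e-15 M for angiogenin, Kd > 1 M for onconase.
tight = fraction_bound(ligand, 1e-15)
weak = fraction_bound(ligand, 1.0)

print(f"Kd = 1e-15 M: {tight:.6f} of the protein is bound")
print(f"Kd = 1 M:     {weak:.1e} of the protein is bound")
```

At the same ligand concentration the tight binder is essentially saturated while the weak one is essentially empty, which is what "no significant binding" means in quantitative terms.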
Proteins can bind not only to small-molecule substrates but also to other proteins. When proteins bind specifically to identical molecules, they can oligomerize to form larger structures, such as fibrils. This process is common in structural proteins, where globular monomers self-associate to create rigid fibers. Protein–protein interactions are fundamental to regulating enzyme activity, controlling progression through the cell cycle, and assembling large protein complexes that carry out coordinated sets of reactions with a shared biological purpose. Proteins can also bind to, or become embedded within, cell membranes. The capacity of binding partners to induce conformational changes in proteins is a cornerstone of the complex signaling networks within cells. [42] : 830–49 The reversible nature of protein interactions, heavily influenced by the availability of different partner proteins to form functional aggregates, underscores the importance of studying these interactions to comprehend cellular function and the unique characteristics of different cell types. [60] [61]
Enzymes
The most widely recognized role of proteins in the cell is their function as enzymes, which act as biological catalysts to accelerate chemical reactions. Enzymes are typically highly specific, catalyzing only one or a limited number of reactions. They drive the majority of reactions involved in metabolism and are essential for processes like DNA replication, DNA repair, and transcription. Some enzymes modify other proteins by adding or removing chemical groups, a process known as post-translational modification. Approximately 4,000 distinct reactions are known to be catalyzed by enzymes. [62] The rate enhancement provided by enzymatic catalysis can be extraordinary, reaching as high as a 10¹⁷-fold increase over the uncatalyzed reaction in the case of orotate decarboxylase – transforming a reaction that would take 78 million years without the enzyme into one completed in just 18 milliseconds. [63]
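The figures quoted for orotate decarboxylase can be checked against each other with a line of arithmetic. The sketch below converts 78 million years to seconds and divides by 18 milliseconds. It assumes a year of 365.25 days, and the function name `rate_enhancement` is hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed Julian year


def rate_enhancement(uncatalyzed_years, catalyzed_seconds):
    """Ratio of the uncatalyzed to the catalyzed reaction time."""
    return uncatalyzed_years * SECONDS_PER_YEAR / catalyzed_seconds


enhancement = rate_enhancement(78e6, 18e-3)
print(f"{enhancement:.2e}")
```

The ratio comes out on the order of 10¹⁷, consistent with the rate enhancement stated above.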
The molecules that enzymes bind to and act upon are called substrates. Although enzymes can comprise hundreds of amino acids, typically only a small fraction of these residues directly interact with the substrate. An even smaller subset, averaging three to four residues, is directly involved in the catalytic process. [64] The specific region of the enzyme where the substrate binds and which contains the catalytic residues is known as the active site. [54] : 389
Dirigent proteins represent a class of proteins that play a crucial role in dictating the stereochemistry of compounds synthesized by other enzymes. [65]
Cell signaling and ligand binding
A ribbon diagram illustrates a mouse antibody targeting cholera, shown binding to a carbohydrate antigen.
Numerous proteins are integral to the processes of cell signaling and signal transduction. Some proteins, like insulin, are secreted extracellularly and function as signaling molecules, transmitting messages from the cell of origin to distant tissues. Others are membrane proteins that serve as receptors. Their primary role is to bind specific signaling molecules, initiating a cascade of biochemical responses within the cell. Many receptors possess a binding site exposed on the cell's exterior and an effector domain within the cell, which may possess enzymatic activity or undergo a conformational change that is recognized by other intracellular proteins. [41] : 251–81
Antibodies, key components of the adaptive immune system, are proteins designed to bind to antigens – foreign substances in the body – thereby marking them for destruction. Antibodies can be released into the extracellular environment or remain anchored to the membranes of specialized B cells known as plasma cells. Unlike enzymes, whose binding affinity for substrates is constrained by the need to perform catalysis, antibodies can exhibit extraordinarily high binding affinities for their targets. [42] : 275–50
Many ligand transport proteins are responsible for binding specific small biomolecules and delivering them to other locations within the body of a multicellular organism. These proteins must exhibit high binding affinity in environments with high ligand concentrations and efficiently release the ligand in target tissues where its concentration is low. The classic example of such a protein is hemoglobin, which transports oxygen from the lungs to various organs and tissues in all vertebrates, and has closely related counterparts across all biological kingdoms. [42] : 222–29 Lectins are a class of sugar-binding proteins with high specificity for particular sugar moieties, typically playing roles in biological recognition processes involving cells and other proteins. [66] Receptors and hormones are also examples of highly specific binding proteins.
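The requirement to bind tightly where the ligand is abundant and release it where the ligand is scarce can be illustrated with a simple saturation curve. The sketch below uses the Hill equation, an empirical model that is not part of the cited sources. The half-saturation pressure (`p50`), the Hill coefficient (`hill_n`), and the two oxygen pressures are rough textbook-style values assumed here purely for illustration.

```python
def fractional_saturation(p_o2: float, p50: float = 26.0, hill_n: float = 2.8) -> float:
    """Hill-equation estimate of the fraction of binding sites occupied.

    p_o2 and p50 are oxygen partial pressures in mmHg. The defaults are
    assumed, approximate values for hemoglobin, not measured data.
    """
    return p_o2**hill_n / (p50**hill_n + p_o2**hill_n)


lung = fractional_saturation(100.0)   # assumed oxygen pressure in the lungs
tissue = fractional_saturation(20.0)  # assumed oxygen pressure in working tissue
print(f"lungs: {lung:.2f}, tissue: {tissue:.2f}, released: {lung - tissue:.2f}")
```

Under these assumptions the protein is almost fully loaded at the high pressure and gives up well over half of its ligand at the low pressure, which is the behavior described above.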
Transmembrane proteins can function as ligand transport proteins, modulating the permeability of the cell membrane to small molecules and ions. The lipid bilayer of the membrane, with its hydrophobic core, prevents the free passage of polar or charged molecules. Membrane proteins form internal channels that allow these molecules to traverse the membrane. Many ion channel proteins are highly selective, allowing passage of only a specific ion; for instance, potassium and sodium channels can discriminate effectively between these two ions. [41] : 232–34
Structural proteins
Structural proteins provide rigidity and resilience to biological components that would otherwise be fluid. The majority of structural proteins fall into the category of fibrous proteins. For instance, collagen and elastin are fundamental constituents of connective tissue such as cartilage, while keratin is found in hard or filamentous structures like hair, nails, feathers, hooves, and certain animal shells. [42] : 178–81 Some globular proteins also play structural roles. Actin and tubulin are prime examples: they exist as soluble monomers but can polymerize to form long, rigid fibers that constitute the cytoskeleton, thereby enabling the cell to maintain its shape and volume. [54] : 490
Other proteins with significant structural functions include motor proteins such as myosin, kinesin, and dynein. These proteins possess the remarkable ability to generate mechanical forces. They are essential for the motility of single-celled organisms and play a vital role in the movement of sperm in many sexually reproducing multicellular organisms. [42] : 258–64, 272 They are responsible for the forces generated during muscle contraction and are critical for intracellular transport processes. [54] : 481, 490
Methods of study
Protein purification
To analyze proteins in vitro, they must first be isolated from other cellular components. This process typically begins with cell lysis, which involves disrupting the cell membrane to release the intracellular contents into a solution known as a crude lysate. This mixture can then be fractionated using ultracentrifugation, separating cellular components into fractions enriched in soluble proteins, membrane lipids and proteins, cellular organelles, and nucleic acids. Precipitation, often achieved through a technique called salting out, can be employed to concentrate the proteins from the lysate. Subsequent purification steps commonly involve various forms of chromatography, which separate proteins based on properties such as molecular weight, net charge, and binding affinity. [36] : 21–24 The purity of the protein can be assessed using gel electrophoresis if its molecular weight and isoelectric point are known, by spectroscopy if the protein exhibits characteristic spectral properties, or through enzyme assays if the protein possesses enzymatic activity. Additionally, proteins can be isolated based on their charge through electrofocusing. [72]
For naturally occurring proteins, multiple purification steps may be required to achieve the necessary purity for laboratory applications. To streamline this process, genetic engineering techniques are frequently employed to attach specific "tags" to proteins. These tags, often short sequences of histidine residues (termed a "His-tag"), are added to one end of the protein and do not typically affect its structure or activity. When the cell lysate containing the tagged protein is passed over a chromatography column containing nickel ions, the histidine tag binds strongly to the nickel, immobilizing the tagged protein on the column while other cellular components flow through. A variety of such tags have been developed to facilitate the purification of specific proteins from complex biological mixtures. [71]
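When working with sequence data, a terminal histidine tag can be recognized as a run of consecutive histidines (one-letter code `H`) near either end of the protein. The following is a minimal sketch: the function name, the minimum run length of six, the 20-residue search window, and the example sequences are all assumptions chosen for illustration.

```python
import re
from typing import Optional


def find_terminal_his_tag(sequence: str, min_length: int = 6, window: int = 20) -> Optional[str]:
    """Return "N" or "C" if a histidine run lies near that terminus, else None.

    min_length and window are illustrative defaults, not a standard.
    """
    run = re.compile("H{%d,}" % min_length)
    seq = sequence.upper()
    if run.search(seq[:window]):
        return "N"
    if run.search(seq[-window:]):
        return "C"
    return None


# Hypothetical sequences, invented for this example.
tagged = "MHHHHHHSSGLVPRGSMKTAYIAKQRQISFVKSHFSRQ"
untagged = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
print(find_terminal_his_tag(tagged), find_terminal_his_tag(untagged))
```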
Cellular localization
The study of proteins in vivo often focuses on their synthesis and precise localization within the cell. Although many intracellular proteins are synthesized in the cytoplasm, and membrane-bound or secreted proteins in the endoplasmic reticulum, the mechanisms that target proteins to specific organelles or cellular structures are not always fully understood. A powerful technique for investigating cellular localization involves expressing a fusion protein or chimera within the cell. This fusion protein consists of the natural protein of interest linked to a "reporter" such as green fluorescent protein (GFP). [73] The location of this fused protein within the cell can then be visualized and tracked with high precision using microscopy. [74]
Other methods for determining protein localization involve using known markers for specific cellular compartments, such as the ER, Golgi apparatus, lysosomes, vacuoles, mitochondria, chloroplasts, and the plasma membrane. By employing fluorescently tagged versions of these markers or antibodies that recognize them, it becomes easier to pinpoint the location of a protein of interest. For instance, indirect immunofluorescence allows for the colocalization of fluorescence, thereby confirming cellular localization. Fluorescent dyes are also used to label cellular compartments for similar purposes. [75]
Additional techniques include using antibodies against the protein of interest conjugated to enzymes that produce either luminescent or chromogenic signals, which can then be compared between samples to infer localization. This approach is known as immunohistochemistry. [76] Cofractionation using sucrose gradients and isopycnic centrifugation is another applicable method. While this technique does not definitively prove colocalization, it indicates a higher probability of association between the protein and a compartment of a specific density. [77]
The gold standard for confirming cellular localization is immunoelectron microscopy. This technique combines antibody labeling with traditional electron microscopy. The sample is prepared for electron microscopy, and then treated with an antibody against the protein of interest that is conjugated to an electron-dense material, typically gold. This allows for the simultaneous visualization of ultrastructural details and the precise location of the protein. [78]
Through site-directed mutagenesis, a genetic engineering technique, researchers can modify a protein's amino acid sequence, thereby altering its structure, cellular localization, and susceptibility to regulatory mechanisms. This method even permits the incorporation of unnatural amino acids into proteins using modified tRNAs, [79] opening avenues for the rational design of novel proteins with tailored properties. [80]
Proteomics
The complete set of proteins present in a cell or cell type at a specific time is referred to as its proteome. The study of these large-scale datasets defines the field of proteomics, a discipline named by analogy to the related field of genomics. Key experimental techniques employed in proteomics include 2D electrophoresis, which separates complex protein mixtures; [81] mass spectrometry, used for rapid, high-throughput identification and sequencing of proteins and peptides (often after in-gel digestion); [82] protein microarrays, which enable the detection of relative protein abundance; and two-hybrid screening, a method for systematically investigating protein–protein interactions. [83] The entire network of biologically possible protein interactions is known as the interactome. [84] The ambitious field of structural genomics aims to determine the structures of proteins representing every distinct protein fold. [85]
Structure determination
Elucidating the tertiary structure of a protein, or the quaternary structure of its complexes, provides invaluable insights into its function and potential interactions, particularly relevant for drug design. Because proteins are too small to be resolved with a conventional light microscope, specialized techniques are required for structure determination. Prominent experimental methods include X-ray crystallography and NMR spectroscopy, both capable of yielding structural information at atomic resolution. NMR experiments provide data from which interatomic distances can be estimated, allowing for the determination of possible protein conformations through distance geometry calculations. Dual polarisation interferometry is a quantitative analytical technique used to measure overall protein conformation and detect conformational changes induced by interactions or other stimuli. Circular dichroism is another laboratory method used to assess the secondary structure composition (β-sheet and α-helical content) of proteins. Cryoelectron microscopy is employed to obtain lower-resolution structural information for very large protein complexes, including assembled viruses; [41] : 340–41 a variant known as electron crystallography can, in some cases, yield high-resolution data, particularly for two-dimensional crystals of membrane proteins. [86] Experimentally determined structures are typically deposited in the Protein Data Bank (PDB), a publicly accessible database containing structural data, including the Cartesian coordinates for each atom, for tens of thousands of proteins. [87]
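The per-atom Cartesian coordinates mentioned above are stored in the legacy PDB text format as fixed-width `ATOM` records. The sketch below reads a single record by column position. The names `Atom` and `parse_atom_record` are hypothetical, the sample record and its coordinates are invented for illustration, and the newer PDBx/mmCIF format is not handled.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    name: str
    residue: str
    chain: str
    residue_number: int
    x: float
    y: float
    z: float


def parse_atom_record(line: str) -> Atom:
    """Parse one fixed-width ATOM/HETATM record (coordinates in angstroms)."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return Atom(
        name=line[12:16].strip(),          # columns 13-16
        residue=line[17:20].strip(),       # columns 18-20
        chain=line[21],                    # column 22
        residue_number=int(line[22:26]),   # columns 23-26
        x=float(line[30:38]),              # columns 31-38
        y=float(line[38:46]),              # columns 39-46
        z=float(line[46:54]),              # columns 47-54
    )


# Invented sample record, assembled field by field to keep the columns explicit.
SAMPLE_ATOM_LINE = (
    "ATOM  "        # columns 1-6: record name
    "    1"         # columns 7-11: atom serial number
    " "             # column 12: blank
    " N  "          # columns 13-16: atom name
    " "             # column 17: alternate location indicator
    "MET"           # columns 18-20: residue name
    " "             # column 21: blank
    "A"             # column 22: chain identifier
    "   1"          # columns 23-26: residue sequence number
    "    "          # columns 27-30: insertion code and blanks
    "  27.340"      # columns 31-38: x
    "  24.430"      # columns 39-46: y
    "   2.614"      # columns 47-54: z
    "  1.00"        # columns 55-60: occupancy
    "  9.67"        # columns 61-66: temperature factor
    "          "    # columns 67-76: blank
    " N"            # columns 77-78: element symbol
)

atom = parse_atom_record(SAMPLE_ATOM_LINE)
print(atom)
```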
The number of known gene sequences far exceeds the number of determined protein structures. Furthermore, the collection of solved structures is biased towards proteins that are amenable to crystallization, a prerequisite for X-ray crystallography. Globular proteins, for instance, are relatively easy to crystallize. In contrast, membrane proteins and large protein complexes are often challenging to crystallize and are consequently underrepresented in the PDB. [88] Initiatives in structural genomics aim to address these limitations by systematically determining representative structures across major protein fold classes. Protein structure prediction methods offer computational approaches to generate plausible structural models for proteins whose structures have not yet been experimentally determined. [89]
Structure prediction
In parallel with structural genomics, the field of protein structure prediction focuses on developing sophisticated mathematical models to computationally predict protein structures. [90] The most successful prediction method, known as homology modeling, relies on the availability of a known "template" structure that shares sequence similarity with the protein being modeled. The objective of structural genomics is to provide a sufficiently diverse set of solved structures to serve as templates for most remaining proteins. [91] While generating highly accurate models remains a challenge when only distantly related templates are available, it has been suggested that precise sequence alignment is the critical factor, as highly accurate models can be produced if a perfect alignment is achieved. [92] Many structure prediction methods have informed the burgeoning field of protein engineering, which has already seen the successful design of novel protein folds. [93] A significant proportion of proteins (approximately 33% in eukaryotes) contain intrinsically disordered regions that are functionally important and lack a stable three-dimensional structure. Characterizing and predicting protein disorder is therefore an essential aspect of protein structure analysis. [94]
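Whether a template is "closely" or "distantly" related is often summarized as percent sequence identity over an alignment. The sketch below computes that figure for two already-aligned sequences. It assumes gaps are written as `-` and that gapped columns are excluded from the count; other conventions for the denominator exist. The function name and both sequences are invented for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of identical residues over alignment columns without gaps."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Hypothetical target and template, already aligned.
target = "MKT-AYIAKQR"
template = "MKTLAYLAKQR"
identity = percent_identity(target, template)
print(f"{identity:.1f}% identity")
```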
In silico simulation of dynamical processes
A more computationally intensive challenge lies in predicting dynamic intermolecular interactions, such as those involved in molecular docking, [95] protein folding, protein–protein interaction, and chemical reactivity. Mathematical models designed to simulate these dynamic processes typically employ principles of molecular mechanics, particularly molecular dynamics. In this context, in silico simulations have successfully modeled the folding of small α-helical protein domains, such as the villin headpiece, [96] and the HIV accessory protein. [97] Hybrid approaches combining standard molecular dynamics with quantum mechanical calculations have also been used to explore the electronic states of rhodopsins. [98]
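At its core, a molecular dynamics simulation repeatedly computes forces from a force field and advances the atomic positions by a small time step. The toy sketch below applies the velocity Verlet scheme, a common integrator, to a single harmonic "bond" in one dimension. It uses reduced units, and every name and parameter value is an assumption chosen for illustration. It is not the method of any particular simulation package or of the studies cited above.

```python
from typing import Callable, List, Tuple

SPRING_CONSTANT = 1.0  # assumed bond stiffness, reduced units
REST_LENGTH = 1.0      # assumed equilibrium bond length, reduced units


def harmonic_force(x: float) -> float:
    """Restoring force of a harmonic bond stretched to length x."""
    return -SPRING_CONSTANT * (x - REST_LENGTH)


def total_energy(x: float, v: float, mass: float = 1.0) -> float:
    """Kinetic plus potential energy of the one-dimensional oscillator."""
    return 0.5 * mass * v * v + 0.5 * SPRING_CONSTANT * (x - REST_LENGTH) ** 2


def velocity_verlet(x: float, v: float, force: Callable[[float], float],
                    mass: float, dt: float, steps: int) -> List[Tuple[float, float]]:
    """Integrate the motion and return the (position, velocity) trajectory."""
    trajectory = [(x, v)]
    a = force(x) / mass
    for _ in range(steps):
        x = x + v * dt + 0.5 * a * dt * dt   # advance the position
        a_new = force(x) / mass              # force at the new position
        v = v + 0.5 * (a + a_new) * dt       # advance the velocity
        a = a_new
        trajectory.append((x, v))
    return trajectory


trajectory = velocity_verlet(x=1.2, v=0.0, force=harmonic_force,
                             mass=1.0, dt=0.01, steps=1000)
e_start = total_energy(*trajectory[0])
e_end = total_energy(*trajectory[-1])
print(f"relative energy drift: {abs(e_end - e_start) / e_start:.2e}")
```

Checking that the total energy stays nearly constant is a common sanity test for an integrator and time step; real protein simulations apply the same idea to many thousands of interacting atoms.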
Beyond classical molecular dynamics, quantum dynamics methods enable the simulation of proteins at atomic detail, accurately accounting for quantum mechanical effects. Examples include the multi-layer multi-configuration time-dependent Hartree method and the hierarchical equations of motion approach, which have been applied to study plant cryptochromes [99] and bacterial light-harvesting complexes, [100] respectively. Both quantum and classical mechanical simulations of biological systems are computationally demanding, necessitating the use of distributed computing initiatives like the Folding@home project. These projects leverage advances in GPU parallel processing and Monte Carlo techniques to accelerate molecular modeling. [101] [102]
Chemical analysis
Digestion
In the absence of catalysts, proteins undergo hydrolysis very slowly. [105] The breakdown of proteins into smaller peptides and amino acids, a process known as proteolysis, is a critical step in digestion, allowing these components to be absorbed in the small intestine. [106] This hydrolysis is mediated by enzymes called proteases or peptidases. Proteases, which are themselves proteins, are classified based on the specific peptide bonds they cleave and whether they act on peptide bonds at the terminus (exopeptidases) or the interior (endopeptidases) of a protein. [107] Pepsin is a prominent example of an endopeptidase found in the stomach. Following digestion in the stomach, the pancreas secretes other proteases, including trypsin and chymotrypsin, to complete the hydrolysis process. [108]
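The bond specificity of an endopeptidase can be expressed as a simple cleavage rule. As a sketch, the code below applies a commonly used simplification for trypsin: cleave after lysine (K) or arginine (R) unless the next residue is proline (P). Real digestion is less clean-cut, with missed and non-specific cleavages. The function name and the example sequence are invented for illustration.

```python
from typing import List


def tryptic_digest(sequence: str) -> List[str]:
    """Split a one-letter protein sequence using a simplified trypsin rule."""
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        is_last = i == len(sequence) - 1
        # Cleave on the C-terminal side of K or R, except before proline.
        if residue in "KR" and not is_last and sequence[i + 1] != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # remaining C-terminal peptide
    return peptides


# Hypothetical sequence; the R at position 6 is followed by P and is not cut.
fragments = tryptic_digest("MKTAYRPLKGGR")
print(fragments)
```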
Protein hydrolysis is also employed industrially to produce amino acids from abundant protein sources such as blood meal, feathers, and keratin. These materials are treated with hot hydrochloric acid, which effectively hydrolyzes the peptide bonds. [109]
Mechanical properties
The mechanical properties of proteins are remarkably diverse and often central to their biological functions, as exemplified by proteins like keratin and collagen. [110] For instance, the ability of muscle tissue to undergo repeated cycles of expansion and contraction is directly linked to the elastic characteristics of its underlying protein components. [111] [112] Beyond fibrous proteins, the conformational dynamics of enzymes [113] and the structural integrity of biological membranes, among other functions, are governed by protein mechanics. The unique mechanical properties of many proteins, coupled with their relative sustainability compared to synthetic polymers, have positioned them as attractive candidates for the development of next-generation biomaterials. [114] [115]
Young's modulus, denoted as E, is a measure of a material's stiffness, calculated as the ratio of axial stress (σ) to the resulting strain (ε). In the context of proteins, this stiffness often directly correlates with biological function. For example, collagen, found in connective tissue, bones, and cartilage, and keratin, present in nails, claws, and hair, exhibit stiffness values several orders of magnitude higher than that of elastin. [116] Elastin is believed to impart elasticity to structures like blood vessels, pulmonary tissue, and bladder tissue. [117] [118] In contrast, globular proteins, such as bovine serum albumin, which are soluble in the cytosol and often function as enzymes (requiring frequent conformational changes), possess considerably lower Young's moduli. [119] [120]
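The definition E = σ/ε can be made concrete with a short calculation, where stress is force per unit cross-sectional area and strain is the fractional change in length. The sample dimensions and load below are hypothetical and were chosen only so that the result lands near 1 MPa.

```python
def youngs_modulus(force_n: float, area_m2: float,
                   initial_length_m: float, extension_m: float) -> float:
    """Return Young's modulus in pascals from a uniaxial tensile test."""
    stress = force_n / area_m2                # sigma, in Pa
    strain = extension_m / initial_length_m   # epsilon, dimensionless
    return stress / strain


# Hypothetical test: a 1 mm^2 fiber, 10 cm long, stretched by 10 cm under 1 N.
modulus_pa = youngs_modulus(force_n=1.0, area_m2=1e-6,
                            initial_length_m=0.10, extension_m=0.10)
print(f"E = {modulus_pa / 1e6:.2f} MPa")
```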
The Young's modulus of an individual protein can be determined through molecular dynamics simulations. Using atomistic force fields, such as CHARMM or GROMOS, or coarse-grained force fields like Martini, [121] a single protein molecule can be subjected to a uniaxial stretching force, and the resulting extension is recorded to calculate the strain. [122] [123] Experimentally, techniques like atomic force microscopy can yield similar data. [124] The internal dynamics of proteins involve subtle elastic and plastic deformations influenced by viscoelastic forces, which can be probed using nano-rheology techniques. [125] These measurements suggest typical spring constants around k ≈ 100 pN/nm, corresponding to Young's moduli of E ≈ 100 MPa, and friction coefficients of γ ≈ 0.1 pN·s/nm, which translate to viscosities of η ≈ 0.01 pN·s/nm² (10⁷ times more viscous than water).
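The unit conversions behind these figures can be verified directly, since 1 pN/nm² equals 1 MPa. The sketch below converts the quoted modulus and viscosity to SI units and compares the viscosity with that of water, taken here as roughly 10⁻³ Pa·s near room temperature. That value and the helper names are assumptions introduced for illustration.

```python
PICONEWTON = 1e-12  # newtons per piconewton
NANOMETRE = 1e-9    # metres per nanometre


def pn_per_nm2_to_si(value: float) -> float:
    """Convert a quantity in pN/nm^2 (or pN*s/nm^2) to Pa (or Pa*s)."""
    return value * PICONEWTON / NANOMETRE**2


modulus_pa = pn_per_nm2_to_si(100.0)     # 100 pN/nm^2, i.e. about 100 MPa
viscosity_pa_s = pn_per_nm2_to_si(0.01)  # 0.01 pN*s/nm^2

WATER_VISCOSITY_PA_S = 1e-3  # assumed approximate viscosity of water
ratio_to_water = viscosity_pa_s / WATER_VISCOSITY_PA_S

print(f"E = {modulus_pa / 1e6:.0f} MPa")
print(f"viscosity = {viscosity_pa_s:.0e} Pa*s, about {ratio_to_water:.0e} times water")
```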
At the macroscopic level, the Young's modulus of cross-linked protein networks can be determined using conventional mechanical testing methods. The table below presents experimentally observed values for several proteins.
| Protein | Protein class | Young's modulus |
|---|---|---|
| keratin (cross-linked) | fibrous | 1.5–10 GPa [126] |
| elastin (cross-linked) | fibrous | 1 MPa [116] |
| fibrin (cross-linked) | fibrous | 1–10 MPa [116] |
| collagen (cross-linked) | fibrous | 5–7.5 GPa [116] [127] |
| resilin (cross-linked) | fibrous | 1–2 MPa [116] |
| bovine serum albumin (cross-linked) | globular | 2.5–15 kPa [119] |
| β-barrel outer membrane proteins | membrane | 20–45 GPa [128] |
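To compare entries that differ this widely, it can help to express the differences as orders of magnitude. The sketch below transcribes the ranges from the table into pascals and compares them on a logarithmic scale. Using the geometric midpoint of each range is a choice made here for illustration, not something prescribed by the cited sources.

```python
import math

# (low, high) ranges transcribed from the table above, converted to pascals.
YOUNGS_MODULUS_PA = {
    "keratin": (1.5e9, 10e9),
    "elastin": (1e6, 1e6),
    "fibrin": (1e6, 10e6),
    "collagen": (5e9, 7.5e9),
    "resilin": (1e6, 2e6),
    "bovine serum albumin": (2.5e3, 15e3),
    "beta-barrel outer membrane proteins": (20e9, 45e9),
}


def geometric_midpoint(low: float, high: float) -> float:
    """Midpoint of a range on a logarithmic scale."""
    return math.sqrt(low * high)


def orders_of_magnitude(stiffer: str, softer: str) -> float:
    """Base-10 logarithm of the ratio between two tabulated moduli."""
    a = geometric_midpoint(*YOUNGS_MODULUS_PA[stiffer])
    b = geometric_midpoint(*YOUNGS_MODULUS_PA[softer])
    return math.log10(a / b)


collagen_vs_elastin = orders_of_magnitude("collagen", "elastin")
lowest = min(low for low, _ in YOUNGS_MODULUS_PA.values())
highest = max(high for _, high in YOUNGS_MODULUS_PA.values())
table_span = math.log10(highest / lowest)

print(f"collagen vs elastin: {collagen_vs_elastin:.1f} orders of magnitude")
print(f"full table span: {table_span:.1f} orders of magnitude")
```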