Industry 4.0

Facility Layout · Assembly Line · Location Theory · Outsourcing
Topics · Definitions · Formulas
extracted from exam paper
01 Facility Layouts — Types & Definitions
Product Layout (aka Line Layout)
Machines arranged in the sequence of production operations. For high volume, low variety. Smooth one-directional flow. Low WIP, low material handling, high investment. e.g. aircraft, automobile lines
Process Layout (aka Functional Layout)
Similar machines/functions grouped together by type. For low volume, high variety. Flexible but high WIP, high material handling, complex routing. Operators need higher skill. e.g. job shops, hospitals
Group Layout (aka Cellular Manufacturing)
Products with similar processing needs grouped into part families, machines in cells. For medium variety, medium volume. Hybrid of process + product advantages. Best when products share common components. e.g. headset families
Fixed Position Layout
Product stays stationary; workers & machines come to it. Used when product is too large/heavy to move. Very high material handling. e.g. ships, bridges, construction
Layout | Volume | Variety | Mat. Handling | WIP Inventory | Investment | Flexibility
Product (Line) | High | Low | Low | Low | High | Low
Process (Functional) | Low | High | High | High | Low | High
Group (Cellular) | Medium | Medium | Medium | Medium | Medium | Medium
Fixed Position | Very Low | Very High | Very High | High | High | –
Key rule: See stages like Press shop → Body shop → Paint shop → Assembly? That's a Product Layout. Similar machines grouped together (all drills, all mills)? That's Process Layout. Product families sharing parts? Group Layout.
02 Assembly Line Balancing — Concepts & Formulas
Cycle Time (CT)
Time between successive finished units coming off the line. CT = Available Time ÷ Output. No station's total task time can exceed CT.
Bottleneck
Station with the longest task time. Limits entire line's output. Production rate = rate of the slowest station.
Min Cycle Time (Cmin)
= longest single task time. Absolute lower bound — CT cannot go below this.
Max Cycle Time (Cmax)
= sum of all task times. Upper bound — if only one station does everything.
Balance Delay (BD)
Fraction of total station time that is idle/wasted across all stations. BD = 0 means a perfectly balanced line (rarely achievable in practice).
Line Efficiency (E)
Fraction of total time productively used. E% + BD% = 100% always.
Feasibility Condition
Target is feasible only if CT_required ≥ C_min. If not, the target is physically impossible.
Precedence Diagram
Directed graph showing order constraints between tasks. A task can only start after all predecessors are done.
Formulas — Assembly Line Balancing
Cycle Time
CT = T_avail / Q
Q = required output per period
Min & Max CT
C_min = max(tᵢ)
C_max = Σtᵢ
tᵢ = individual task times
Balance Delay
BD = (N·C − Σtᵢ) / (N·C)
N = no. of workstations
Line Efficiency
E = Σtᵢ / (N·C) × 100
E% + BD% = 100% always
Min No. of Stations
N_min = ⌈ Σtᵢ / C ⌉
Always round UP
Feasibility Check
CT_req ≥ C_min → OK
If false → target impossible
Worked Example — from paper (12 tasks, Σt = 85 min)
Setup: 1 shift=8hr=480 min/day | Target=32 units/day | Longest task=12 min
────────────────────────────────────────────────────────
CT_required = 480 ÷ 32 = 15 min
C_min = max task time = 12 min  |  C_max = Σ tasks = 85 min
Feasible? 15 ≥ 12 → YES ✓
Layout: (1,2)→(3)→(4)→(5)→(6,7)→(8,9)→(10,11,12) → N=7
BD = (105 − 85)/105 = 0.19  |  Efficiency = 85/105×100 = 80.95%
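The example's figures can be checked with a few lines of Python; a minimal sketch using only the summary quantities quoted in the paper (individual task times are not needed):

```python
import math

# Worked example: 480 min/day available, target 32 units/day,
# Σt = 85 min over 12 tasks, longest task 12 min
t_avail, output = 480, 32
sum_t, max_t = 85, 12

ct = t_avail / output              # required cycle time -> 15.0 min
feasible = ct >= max_t             # 15 >= 12 -> True
n_min = math.ceil(sum_t / ct)      # theoretical minimum stations -> 6
n = 7                              # precedence constraints force 7 stations here
bd = (n * ct - sum_t) / (n * ct)   # balance delay -> (105-85)/105
eff = sum_t / (n * ct) * 100       # line efficiency (%)
print(ct, feasible, n_min, round(bd, 4), round(eff, 2))
```

Note that N_min = 6 is only a lower bound; the precedence diagram pushes the actual layout to 7 stations, which is why BD and E are computed with N = 7.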
03 Facility Location — Methods
Center of Gravity (CoG)
Weighted average of demand coordinates. x* = Σ(wᵢ·xᵢ)/Σwᵢ. Fast but approximate — not always optimal.
Euclidean Method
Minimises total weighted straight-line distance. Gives a single exact point (x*, y*), never a region. Rarely matches the CoG result.
Cross-Median Method
Optimal location via the weighted median of x & y independently. Uses Manhattan distance. Sort by coordinate → find cumulative demand crossing total/2 threshold.
Euclidean facts: Always a single point not a region. Rarely equals CoG (only in special symmetric cases). CoG = approximation. Euclidean = exact optimisation.
Cross-Median — Paper Example
A(1,2)×7, B(2,3)×1, C(3,5)×3, D(4,1)×5
Total=16, threshold=8
x-sort: x=1→cumul=7, x=2→cumul=8 → x*=2
y-sort: y=1→cumul=5, y=2→cumul=12 → y*=2
Optimal = (2, 2)
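The cross-median procedure can be sketched directly, using the paper's four demand points:

```python
# Cross-median: weighted medians of x and y independently (Manhattan distance)
points = [((1, 2), 7), ((2, 3), 1), ((3, 5), 3), ((4, 1), 5)]  # A, B, C, D

def weighted_median(values_weights):
    """Smallest value whose cumulative weight reaches half the total."""
    total = sum(w for _, w in values_weights)
    cum = 0
    for v, w in sorted(values_weights):
        cum += w
        if cum >= total / 2:
            return v

xs = [(xy[0], w) for xy, w in points]
ys = [(xy[1], w) for xy, w in points]
print(weighted_median(xs), weighted_median(ys))  # optimal location (2, 2)
```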
04 Outsourcing & Offshoring
Outsourcing
Contracting a third party to perform a function. Focus: who does the work (external). Can be domestic or international.
Offshoring
Moving a process to a different country. Focus: where the work is done. Can be own subsidiary or foreign vendor.
Captive Offshoring
Company sets up its own operations abroad — in-house but in another country. Micron US → own team in India
Offshore Outsourcing
Contract a foreign company — both offshored AND outsourced. W. Australia → TCS Pune
Domestic Outsourcing
Third party in the same country. Outsourcing but NOT offshoring. Tata Motors → TACO group (both India)
Scenario | Type | Offshoring? | Outsourcing?
Micron (US) → own team in India | Captive Offshoring | Yes | No
Micron → TSMC Taiwan (contract) | Foreign Outsourcing | No* | Yes
W. Australia → TCS Pune | Offshore Outsourcing | Yes | Yes
Tata Motors → TACO group | Domestic Outsourcing | No | Yes
Quick test: Different country? → Offshoring. Third party? → Outsourcing. Both? → Offshore Outsourcing. Same country + third party? → Domestic Outsourcing only.
05 Quick Reference
Which Layout?
1 product, high volume → Product Layout
Many custom products, low vol → Process Layout
Product families, shared parts → Group Layout
Product can't move → Fixed Position
Assembly Line Rules
Rate set by → slowest station
E% + BD% = 100% always
Valid CT: C_min ≤ CT ≤ C_max
Feasible if: CT_req ≥ C_min
All Formulas
CT = T_avail / Q
C_min = max(tᵢ)  |  C_max = Σtᵢ
N_min = ⌈ Σtᵢ / C ⌉
BD = (N·C − Σtᵢ) / (N·C)
E = Σtᵢ / (N·C) × 100
Feasible: CT_req ≥ C_min

Large Language Models

Transformers · Attention · Positional Encoding · Decoding · Training
Topics · Definitions · Formulas
extracted from exam paper
01 Transformer Architecture — Core Concepts
Token Embedding
Converts discrete tokens (words/sub-words) into dense vectors. Used in both encoder and decoder. Maps vocabulary IDs to d_model-dimensional space.
Positional Encoding
Added to token embeddings in both encoder and decoder to inject sequence order information. Transformers have no inherent sense of order without this.
Sinusoidal Encoding
Each position's encoding vector uses sine for even dimensions and cosine for odd dimensions. Both functions appear within every position's vector; the split is across dimensions, not across positions.
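The sin-even / cos-odd pattern can be sketched in NumPy; a minimal version assuming an even d_model (the 10000 base follows the standard Transformer formulation):

```python
import numpy as np

def sinusoidal_encoding(max_pos, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(max_pos)[:, None]             # (max_pos, 1)
    i = np.arange(d_model // 2)[None, :]          # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((max_pos, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe

pe = sinusoidal_encoding(max_pos=50, d_model=16)
print(pe.shape, pe[0, 0], pe[0, 1])   # (50, 16) 0.0 1.0
```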
Encoder
Processes the full input sequence in parallel using bidirectional self-attention. Produces context-rich representations. Used in seq2seq models.
Decoder
Generates output tokens auto-regressively (one at a time). Uses masked self-attention + cross-attention to encoder output. GPT-style models are decoder-only.
Look-Ahead Mask
Applied in every decoder self-attention layer (not just the first). Prevents each position from attending to future tokens. Future scores set to −∞ before softmax.
Masked Self-Attention
For token xₖ at position k, only positions 1…k are visible. Future positions are masked to −∞ so softmax assigns them zero weight.
Key architecture facts: Token embeddings → both encoder & decoder. Positional encoding → both encoder & decoder. Look-ahead mask → every decoder self-attention layer, not just the first.
02 Attention Mechanism & Multi-Head Attention
Scaled Dot-Product Attention
Attention(Q,K,V) = softmax(QKᵀ / √d_k) · V. Dividing by √d_k prevents dot products from growing too large, which would push softmax into regions with vanishing gradients.
Why divide by √d_k?
To avoid numerical issues and vanishing gradients during training. Large dot products → extreme softmax outputs → near-zero gradients → slow/failed learning.
Query (Q), Key (K), Value (V)
Each input is projected into three separate spaces. Q × Kᵀ computes compatibility scores. Softmax normalises them. V is aggregated by these weights.
Multi-Head Attention
Runs h parallel attention heads, each with its own Q/K/V projections. Outputs concatenated and projected by WO. Allows learning diverse local and global relations simultaneously.
Head Dimensionality
When the number of heads h increases while keeping d_model fixed, each head's dimension d_k = d_model/h decreases. More heads = narrower per-head attention.
Attention Score Matrix
For sequence length T with causal masking, each sequence has T×T scores per head, but only the lower-triangular entries (including the diagonal) are non-zero: T(T+1)/2 non-zero scores per head.
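A minimal single-head NumPy sketch of scaled dot-product attention with an optional look-ahead mask (no batching or Q/K/V projections, for illustration only):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """softmax(Q Kᵀ / √d_k) V, optionally with a look-ahead (causal) mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (T, T) compatibility scores
    if causal:
        T = scores.shape[0]
        mask = np.triu(np.ones((T, T), dtype=bool), k=1)
        scores[mask] = -np.inf                      # future positions -> -inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)              # row-wise softmax
    return w @ V, w

rng = np.random.default_rng(0)
T, d_k = 4, 8
Q, K, V = rng.normal(size=(3, T, d_k))
out, w = scaled_dot_product_attention(Q, K, V, causal=True)
print(np.allclose(w, np.tril(w)))   # True: weights are lower-triangular
```

With the causal flag on, exp(−∞) = 0, so each row of the weight matrix puts zero mass on future positions, exactly as the masked self-attention definition above describes.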
Formulas — Attention & Parameters
Scaled Dot-Product
softmax(QKᵀ / √d_k) · V
Head dimension
d_k = d_model / h
h = number of heads
Scalar attention scores (1 pass)
B × h × T × T
B=batch, h=heads, T=seq length
Q projection matrix size
d_model × d_k
= d_model × (d_model/h)
W_O output projection
(h × d_v) × d_model
= d_model × d_model
Non-zero scores (causal, per head)
T(T+1) / 2
Lower-triangular including diagonal
Worked Examples — from paper
Q166: B=4, T=12, h=2 → total scalar scores = 4 × 2 × 12 × 12 = 1152
Q168: d_model=512, h=8, d_k=64 → Q matrix params = 512 × 64 = 32768
────────────────────────────────────────────────────────
GPT setup (Q169-171): T=1024, B=32, d_model=768, h=8, d_v=96
Q169: next-token targets = B × (T−1) = 32 × 1023 = 32736
Q170: non-zero attn scores per head = T(T+1)/2 = 1024×1025/2 = 524800
Q171: W_O params = (h × d_v) × d_model = 768 × 768 = 589824
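These counts are plain arithmetic and can be reproduced directly (question numbers as in the paper):

```python
# Reproducing the counting questions from the paper
B, T, h = 4, 12, 2
print(B * h * T * T)                  # Q166: scalar scores per pass -> 1152

d_model, heads = 512, 8
print(d_model * (d_model // heads))   # Q168: Q projection params -> 32768

T, B, d_model, h, d_v = 1024, 32, 768, 8, 96
targets = B * (T - 1)                 # Q169: next-token targets
nonzero = T * (T + 1) // 2            # Q170: non-zero causal scores per head
wo_params = (h * d_v) * d_model       # Q171: W_O parameters
print(targets, nonzero, wo_params)    # 32736 524800 589824
```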
03 Training — Teacher Forcing & Language Modelling
Teacher Forcing
During training, the correct ground-truth token from the training data is fed as input for each next time step — regardless of what the model predicted. Speeds up and stabilises training.
Decoder Input (teacher forcing)
At time step t, the decoder input is the ground-truth token from position t−1. Example: if target = [je, suis, heureux], at t=3 the input is 'suis' (t=2 ground truth).
Causal Language Modelling
Next-token prediction. Each token in a sequence of length T generates one prediction target (except the last). For T tokens: T−1 prediction targets per sequence.
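The input/target shift behind teacher forcing and causal LM can be sketched directly; the French example follows the notes, while the `<s>` start token is an added assumption:

```python
# Teacher forcing / causal LM: inputs are tokens[:-1], targets are tokens[1:]
tokens = ["<s>", "je", "suis", "heureux"]   # "<s>" start token is an assumption
inputs = tokens[:-1]      # what the decoder sees at each step (ground truth)
targets = tokens[1:]      # what it must predict at each step
for t, (x, y) in enumerate(zip(inputs, targets), start=1):
    print(f"t={t}: input={x!r} -> predict {y!r}")
print(len(targets))       # T tokens yield T-1 prediction targets -> 3
```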
Autoregressive Inference
At inference (GPT), each new token is conditioned on all previously generated tokens. Generation is sequential — no teacher forcing. Decoding strategies choose each next token.
Batch Normalisation
Normalises each feature across the batch. For feature column: subtract batch mean, divide by batch std (ε=0). With γ=1, β=0: normalised values are z-scores across the batch dimension.
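A NumPy sketch of the z-score view of batch normalisation (γ=1, β=0, ε=0; the sample values are invented for illustration):

```python
import numpy as np

# Batch normalisation across the batch dimension, per feature column
X = np.array([[1.0, 10.0],
              [3.0, 20.0],
              [5.0, 30.0]])          # 3 samples, 2 features
mean = X.mean(axis=0)                 # per-feature batch mean
std = X.std(axis=0)                   # per-feature batch std (eps = 0)
Z = (X - mean) / std                  # z-scores across the batch
print(Z.mean(axis=0), Z.std(axis=0))  # ~0 and ~1 per feature
```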
Teacher forcing vs inference: Training uses teacher forcing (ground-truth inputs) → fast, stable. Inference is autoregressive (model's own outputs as next inputs) → error can accumulate (exposure bias).
04 Decoding Strategies
Greedy Search
At each step, picks the single highest-probability token. Always 1 beam. Fast but suboptimal — can miss better sequences. Deterministic.
Beam Search
Keeps the top-k (beam size) partial sequences at each step. Deterministic — always keeps the same top-k. Beam search with size 2 always keeps both top-2 sequences.
Exhaustive (Exact) Decoding
Evaluates all possible sequences. At step t with vocab size V there are Vᵗ candidates. Computationally infeasible for large V or t. Deterministic.
Nucleus (Top-p) Sampling
Samples from the smallest set of tokens whose cumulative probability ≥ p. Inherently non-deterministic — introduces randomness via sampling.
Top-k Sampling
Restricts choices to top-k tokens, then samples one probabilistically. Non-deterministic. Top-2 sampling picks 1 token from the top 2 by sampling, not greedily.
Savings Heuristic (TSP)
A heuristic for Travelling Salesman Problem. Tries each city as base (fulcrum) and computes a tour. Always terminates (finite cities). Not suitable for computing the optimal TSP tour — it's a heuristic.
Strategy | Beams at step 1 (V=10) | Beams at step 3 | Deterministic?
Greedy | 1 | 1 | Yes
Beam (size=3) | 3 | 3 | Yes
Exhaustive | 10 (=V) | 1000 (=V³) | Yes
Nucleus / Top-k Sampling | – | – | No (random)
Beam vs Top-k sampling: Beam search with size 2 → always keeps both top-2 candidates (deterministic). Top-2 sampling → restricts to top 2 tokens then samples one probabilistically (non-deterministic).
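The determinism difference can be seen in a toy sketch (the five-token distribution is invented for illustration):

```python
import random

# Toy next-token distribution (invented for illustration)
probs = {"the": 0.4, "a": 0.3, "cat": 0.15, "dog": 0.1, "sat": 0.05}

def greedy(p):
    """Deterministic: always the single highest-probability token."""
    return max(p, key=p.get)

def top_k_sample(p, k, rng):
    """Non-deterministic: restrict to the top-k tokens, then sample one."""
    top = sorted(p.items(), key=lambda kv: -kv[1])[:k]
    tokens, weights = zip(*top)
    return rng.choices(tokens, weights=weights)[0]

rng = random.Random(0)
print(greedy(probs))                                      # always 'the'
print({top_k_sample(probs, 2, rng) for _ in range(50)})   # subset of {'the', 'a'}
```

Greedy returns the same token every call; top-2 sampling only ever returns tokens from the top 2, but which one varies run to run.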
05 Sequences, Vocabulary & Complexity
Sequences of max length L
With vocabulary size V, total sequences of max length L = Σ V^k for k = 1 to L. This counts sequences of length 1, 2, …, L.
Exhaustive decoding growth
Going from length L to L+1 adds (V−1)·V^L additional sequences to evaluate. (Not V^L — those already existed.)
Attention score memory
Each score is a 32-bit float (4 bytes). Total memory = h × T² × 4 bytes per sequence. Memory scales linearly with h: doubling the heads doubles the memory.
Key Formulas — Sequences & Memory
Total sequences (max length L)
Σ V^k for k=1 to L
Extra seqs (L → L+1)
(V − 1) · V^L
Attention score memory (per seq)
h × T² × 4 bytes
Training targets (full batch)
B × (T − 1)
Each token except last predicts next
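A quick sketch of the two counting formulas; `extra_sequences` implements the paper's (V−1)·V^L expression as given:

```python
# Counting candidate sequences for exhaustive decoding
def total_sequences(V, L):
    """All sequences of length 1..L over a vocabulary of size V."""
    return sum(V**k for k in range(1, L + 1))

def extra_sequences(V, L):
    """Additional sequences when max length goes from L to L+1 (paper's formula)."""
    return (V - 1) * V**L

V = 10
print(total_sequences(V, 3))   # 10 + 100 + 1000 = 1110
print(extra_sequences(V, 2))   # 9 * 100 = 900
```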
Q167 — Memory comparison worked example
T=1000, float32=4 bytes, 1 MB=10⁶ bytes
Model A (h=8): 8 × 1000² × 4 = 32,000,000 bytes = 32 MB
Model B (h=16): 16 × 1000² × 4 = 64,000,000 bytes = 64 MB
% increase = (64−32)/32 × 100 = 100%
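Q167's arithmetic, reproduced (1 MB = 10⁶ bytes, as the paper specifies):

```python
# Q167: attention-score memory per sequence = h x T^2 x 4 bytes (float32)
def attn_score_bytes(h, T, bytes_per_score=4):
    return h * T * T * bytes_per_score

T = 1000
a = attn_score_bytes(8, T)        # Model A
b = attn_score_bytes(16, T)       # Model B
print(a / 1e6, b / 1e6)           # 32.0 64.0  (MB, with 1 MB = 10**6 bytes)
print((b - a) / a * 100)          # 100.0 % increase
```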
06 Quick Reference — Rules to Remember
Architecture Facts
Token embeddings → encoder AND decoder
Positional encoding → encoder AND decoder
Look-ahead mask → every decoder self-attn layer
Sinusoidal → sin=even dims, cos=odd dims
√d_k scaling → prevent vanishing gradients
Decoding Rules
Only non-deterministic → Nucleus / sampling
Beam keeps → top-k deterministically
Top-k sampling → samples 1 from top-k
Savings heuristic → always terminates, not optimal
All Formulas at a Glance
Attention = softmax(QKᵀ/√d_k)·V
d_k = d_model / h
Scores (1 pass) = B × h × T × T
Q params = d_model × d_k
W_O params = d_model × d_model
Non-zero causal scores = T(T+1)/2
Train targets = B × (T−1)
Total seqs (max L) = Σ V^k, k=1..L
Extra seqs (L→L+1) = (V−1)·V^L