01 Transformer Architecture — Core Concepts
Token Embedding
Converts discrete tokens (words/sub-words) into dense vectors. Used in both encoder and decoder. Maps vocabulary IDs to d_model-dimensional space.
Positional Encoding
Added to token embeddings in both encoder and decoder to inject sequence order information. Transformers have no inherent sense of order without this.
Sinusoidal Encoding
Each position's encoding vector uses sine for even dimensions and cosine for odd dimensions. Both functions appear within every position's vector; the sin/cos alternation is across embedding dimensions, not across positions.
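A minimal numpy sketch of this scheme (the 10000 base and interleaved sin/cos layout follow the original Transformer formulation; function and variable names are illustrative and assume an even d_model):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix of positional encodings."""
    positions = np.arange(max_len)[:, None]                 # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # even dimension indices
    angle_rates = 1.0 / np.power(10000.0, dims / d_model)   # (1, d_model/2)
    angles = positions * angle_rates                         # (max_len, d_model/2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dims -> sine
    pe[:, 1::2] = np.cos(angles)   # odd dims  -> cosine
    return pe

# Added to token embeddings before the first encoder/decoder layer:
# x = token_embeddings + sinusoidal_positional_encoding(T, d_model)[:T]
```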
Encoder
Processes the full input sequence in parallel using bidirectional self-attention. Produces context-rich representations. Used in seq2seq models.
Decoder
Generates output tokens auto-regressively (one at a time). Uses masked self-attention + cross-attention to encoder output. GPT-style models are decoder-only.
Look-Ahead Mask
Applied in every decoder self-attention layer (not just the first). Prevents each position from attending to future tokens. Future scores set to −∞ before softmax.
Masked Self-Attention
For token x_k at position k, only positions 1…k are visible. Future positions masked to −∞ so softmax assigns them zero weight.
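A small numpy sketch of how such a mask is typically built and applied to the raw score matrix before softmax (names are illustrative):

```python
import numpy as np

def look_ahead_mask(T: int) -> np.ndarray:
    """(T, T) boolean mask: True where attention is allowed (j <= i)."""
    return np.tril(np.ones((T, T), dtype=bool))

def apply_causal_mask(scores: np.ndarray) -> np.ndarray:
    """Set scores for future positions to -inf so softmax gives them zero weight."""
    T = scores.shape[-1]
    return np.where(look_ahead_mask(T), scores, -np.inf)

# Example: position k can only see positions up to and including k.
scores = np.random.randn(4, 4)
masked = apply_causal_mask(scores)
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))   # attention weights above the diagonal are exactly zero
```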
Key architecture facts: Token embeddings → both encoder & decoder. Positional encoding → both encoder & decoder. Look-ahead mask → every decoder self-attention layer, not just the first.
02Attention Mechanism & Multi-Head Attention
Scaled Dot-Product Attention
Attention(Q,K,V) = softmax(QKᵀ / √d_k) · V. Dividing by √d_k prevents dot products from growing too large, which would push softmax into regions with vanishing gradients.
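A direct numpy transcription of the formula (an illustrative sketch; shapes assume a single head with Q, K of shape (T, d_k) and V of shape (T, d_v)):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V, with an optional boolean mask (True = attend)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (T, T) compatibility scores
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                          # (T, d_v)
```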
Why divide by √d_k?
To avoid numerical issues and vanishing gradients during training. Large dot products → extreme softmax outputs → near-zero gradients → slow/failed learning.
Query (Q), Key (K), Value (V)
Each input is projected into three separate spaces. Q × Kᵀ computes compatibility scores. Softmax normalises them. V is aggregated by these weights.
Multi-Head Attention
Runs h parallel attention heads, each with its own Q/K/V projections. Outputs concatenated and projected by WO. Allows learning diverse local and global relations simultaneously.
Head Dimensionality
When the number of heads h increases while keeping d_model fixed, each head's dimension d_k = d_model / h decreases. More heads = narrower per-head attention.
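A minimal numpy sketch of the multi-head wiring described above (single sequence, no batching; random matrices stand in for learned projections, and all names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, h, rng=np.random.default_rng(0)):
    """x: (T, d_model). Returns (T, d_model)."""
    T, d_model = x.shape
    d_k = d_model // h                        # per-head width shrinks as h grows
    heads = []
    for _ in range(h):
        W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
                         for _ in range(3))
        Q, K, V = x @ W_q, x @ W_k, x @ W_v             # each (T, d_k)
        weights = softmax(Q @ K.T / np.sqrt(d_k))        # (T, T) scores per head
        heads.append(weights @ V)                        # (T, d_k)
    concat = np.concatenate(heads, axis=-1)              # (T, h * d_k) = (T, d_model)
    W_o = rng.standard_normal((h * d_k, d_model)) / np.sqrt(d_model)
    return concat @ W_o                                  # output projection W_O

out = multi_head_attention(np.random.randn(12, 512), h=8)
print(out.shape)   # (12, 512)
```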
Attention Score Matrix
For sequence length T with causal masking, each head produces T×T scores per sequence, but only the lower-triangular entries (including the diagonal) are non-zero, giving T(T+1)/2 non-zero scores per head.
Formulas — Attention & Parameters
Scaled Dot-Product
softmax(QKᵀ / √d_k) · V
Head dimension
d_k = d_model / h
h = number of heads
Scalar attention scores (1 pass)
B × h × T × T
B=batch, h=heads, T=seq length
Q projection matrix size
d_model × d_k
= d_model × (d_model/h)
W_O output projection
(h × d_v) × d_model
= d_model × d_model (when h × d_v = d_model)
Non-zero scores (causal, per head)
T(T+1) / 2
Lower-triangular including diagonal
Worked Examples — from paper
Q166: B=4, T=12, h=2 → total scalar scores = 4 × 2 × 12 × 12 = 1152
Q168: d_model=512, h=8, d_k=64 → Q matrix params = 512 × 64 = 32768
────────────────────────────────────────────────────────
GPT setup (Q169-171): T=1024, B=32, d_model=768, h=8, d_v=96
Q169: next-token targets = B × (T−1) = 32 × 1023 = 32736
Q170: non-zero attn scores per head = T(T+1)/2 = 1024×1025/2 = 524800
Q171: W_O params = (h × d_v) × d_model = 768 × 768 = 589824
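The arithmetic in Q166, Q168 and Q169-171 can be checked in a few lines (a sanity-check sketch, not part of the original questions):

```python
# Q166: total scalar attention scores in one forward pass
B, T, h = 4, 12, 2
assert B * h * T * T == 1152

# Q168: parameters in one Q projection matrix
d_model, h = 512, 8
d_k = d_model // h                       # 64
assert d_model * d_k == 32768

# GPT setup (Q169-171)
T, B, d_model, h, d_v = 1024, 32, 768, 8, 96
assert B * (T - 1) == 32736              # next-token targets per batch
assert T * (T + 1) // 2 == 524800        # non-zero causal scores per head
assert (h * d_v) * d_model == 589824     # W_O parameters (768 x 768)
```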
03 Training — Teacher Forcing & Language Modelling
Teacher Forcing
During training, the correct ground-truth token from the training data is fed as input for each next time step — regardless of what the model predicted. Speeds up and stabilises training.
Decoder Input (teacher forcing)
At time step t, the decoder input is the ground-truth token from position t−1. Example: if target = [je, suis, heureux], at t=3 the input is 'suis' (t=2 ground truth).
Causal Language Modelling
Next-token prediction. Each token in a sequence of length T generates one prediction target (except the last). For T tokens: T−1 prediction targets per sequence.
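A tiny sketch of how the shifted inputs and targets are typically built for teacher forcing and next-token prediction (the French tokens come from the example above; the '<bos>' start symbol is an assumption for illustration):

```python
target = ["je", "suis", "heureux"]          # ground-truth output sequence (T = 3)

# Teacher forcing: at step t the decoder sees the ground-truth token from t-1.
decoder_inputs = ["<bos>"] + target[:-1]    # ['<bos>', 'je', 'suis'] -> at t=3, input is 'suis'

# Causal LM view: each token except the last predicts its successor.
pairs = list(zip(target[:-1], target[1:]))  # [('je','suis'), ('suis','heureux')]
print(len(pairs))                           # T - 1 = 2 prediction targets
```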
Autoregressive Inference
At inference (GPT), each new token is conditioned on all previously generated tokens. Generation is sequential — no teacher forcing. Decoding strategies choose each next token.
Batch Normalisation
Normalises each feature across the batch. For feature column: subtract batch mean, divide by batch std (ε=0). With γ=1, β=0: normalised values are z-scores across the batch dimension.
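A numpy sketch of the normalisation described above (ε = 0, γ = 1, β = 0, i.e. plain z-scores per feature across the batch dimension):

```python
import numpy as np

def batch_norm(x: np.ndarray) -> np.ndarray:
    """x: (batch, features). Normalise each feature column across the batch."""
    mean = x.mean(axis=0, keepdims=True)    # per-feature batch mean
    std = x.std(axis=0, keepdims=True)      # per-feature batch std (eps = 0)
    return (x - mean) / std                 # gamma = 1, beta = 0 -> z-scores

x = np.array([[1.0, 10.0],
              [3.0, 30.0],
              [5.0, 50.0]])
print(batch_norm(x))   # each column now has mean 0 and std 1
```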
Teacher forcing vs inference: Training uses teacher forcing (ground-truth inputs) → fast, stable. Inference is autoregressive (model's own outputs as next inputs) → error can accumulate (exposure bias).
04 Decoding Strategies
Greedy Search
At each step, picks the single highest-probability token. Always 1 beam. Fast but suboptimal — can miss better sequences. Deterministic.
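A one-step sketch of greedy decoding over a next-token probability vector (illustrative only):

```python
import numpy as np

def greedy_step(probs: np.ndarray) -> int:
    """Pick the single highest-probability token id (deterministic, 1 'beam')."""
    return int(np.argmax(probs))

probs = np.array([0.1, 0.5, 0.3, 0.1])
print(greedy_step(probs))   # always 1
```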
Beam Search
Keeps the top-k (beam size) partial sequences at each step. Deterministic — always keeps the same top-k. Beam search with size 2 always keeps both top-2 sequences.
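A compact beam-search sketch over a toy scoring model (the `log_prob_fn` interface is an assumption for illustration; it maps a partial sequence to a vector of next-token log-probabilities):

```python
import numpy as np

def beam_search(log_prob_fn, vocab_size, beam_size, max_len, bos=0):
    """Keep the top-`beam_size` partial sequences (by total log-prob) at every step."""
    beams = [([bos], 0.0)]                    # (token ids, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            log_probs = log_prob_fn(seq)      # (vocab_size,) next-token log-probs
            for tok in range(vocab_size):
                candidates.append((seq + [tok], score + log_probs[tok]))
        # Deterministic: the same top-k candidates always survive.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams

# Toy model: fixed next-token distribution regardless of prefix.
toy = lambda seq: np.log(np.array([0.1, 0.6, 0.3]))
for seq, score in beam_search(toy, vocab_size=3, beam_size=2, max_len=3):
    print(seq, round(float(score), 3))
```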
Exhaustive (Exact) Decoding
Evaluates all possible sequences. At step t with vocab size V: V^t candidates. Computationally infeasible for large V or t. Deterministic.
Nucleus (Top-p) Sampling
Samples from the smallest set of tokens whose cumulative probability ≥ p. Inherently non-deterministic — introduces randomness via sampling.
Top-k Sampling
Restricts choices to top-k tokens, then samples one probabilistically. Non-deterministic. Top-2 sampling picks 1 token from the top 2 by sampling, not greedily.
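A numpy sketch of both sampling strategies (illustrative; the fixed `rng` seed is only for reproducibility, the strategies themselves are non-deterministic):

```python
import numpy as np

def top_k_sample(probs, k, rng):
    """Keep the k most probable tokens, renormalise, then sample one."""
    top = np.argsort(probs)[::-1][:k]
    p = probs[top] / probs[top].sum()
    return int(rng.choice(top, p=p))

def nucleus_sample(probs, p_threshold, rng):
    """Smallest set of tokens whose cumulative probability >= p, then sample one."""
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, p_threshold)) + 1   # include the crossing token
    nucleus = order[:cutoff]
    p = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=p))

rng = np.random.default_rng(0)
probs = np.array([0.5, 0.25, 0.15, 0.1])
print(top_k_sample(probs, k=2, rng=rng))        # one of tokens {0, 1}, chosen by sampling
print(nucleus_sample(probs, 0.7, rng=rng))      # nucleus = {0, 1} since 0.5 + 0.25 >= 0.7
```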
Savings Heuristic (TSP)
A heuristic for Travelling Salesman Problem. Tries each city as base (fulcrum) and computes a tour. Always terminates (finite cities). Not suitable for computing the optimal TSP tour — it's a heuristic.
| Strategy | Candidates at step 1 (V=10) | Candidates at step 3 | Deterministic? |
|---|---|---|---|
| Greedy | 1 | 1 | Yes |
| Beam (size=3) | 3 | 3 | Yes |
| Exhaustive | 10 (=V) | 1000 (=V³) | Yes |
| Nucleus / Top-k Sampling | — | — | No (random) |
Beam vs Top-k sampling: Beam search with size 2 → always keeps both top-2 candidates (deterministic). Top-2 sampling → restricts to top 2 tokens then samples one probabilistically (non-deterministic).
05 Sequences, Vocabulary & Complexity
Sequences of max length L
With vocabulary size V, total sequences of max length L = Σ_{k=1}^{L} V^k. This counts sequences of length 1, 2, …, L.
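A quick numeric instance of the sum (illustrative values, V = 10 and L = 3):

```python
V, L = 10, 3
total = sum(V**k for k in range(1, L + 1))
print(total)   # 10 + 100 + 1000 = 1110 sequences of length 1..3
```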
Exhaustive decoding growth
Going from length L to L+1 adds (V−1)·V^L additional sequences to evaluate. (Not V^L — those already existed.)
Attention score memory
Each score is a 32-bit float (4 bytes). Total memory = h × T² × 4 bytes per sequence. Memory scales linearly with h: doubling the number of heads doubles the memory.
Key Formulas — Sequences & Memory
Total sequences (max length L)
Σ V^k for k=1 to L
Extra seqs (L → L+1)
(V − 1) · V^L
Attention score memory (per seq)
h × T² × 4 bytes
Training targets (full batch)
B × (T − 1)
Each token except last predicts next
Q167 — Memory comparison worked example
T=1000, float32=4 bytes, 1 MB=10⁶ bytes
Model A (h=8): 8 × 1000² × 4 = 32,000,000 bytes = 32 MB
Model B (h=16): 16 × 1000² × 4 = 64,000,000 bytes = 64 MB
% increase = (64−32)/32 × 100 = 100%
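The same comparison as a quick script (a sanity-check sketch, using the 1 MB = 10⁶ bytes convention stated above):

```python
T, bytes_per_score = 1000, 4              # sequence length, float32 size

def attn_score_memory(h, T=T):
    """Attention-score memory per sequence, in bytes: h * T^2 * 4."""
    return h * T**2 * bytes_per_score

mem_a = attn_score_memory(h=8)            # 32,000,000 bytes
mem_b = attn_score_memory(h=16)           # 64,000,000 bytes
print(mem_a / 1e6, mem_b / 1e6)           # 32.0 MB, 64.0 MB
print((mem_b - mem_a) / mem_a * 100)      # 100.0 % increase
```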
06 Quick Reference — Rules to Remember
Architecture Facts
Token embeddings → encoder AND decoder
Positional encoding → encoder AND decoder
Look-ahead mask → every decoder self-attn layer
Sinusoidal → sin=even dims, cos=odd dims
√d_k scaling → prevent vanishing gradients
Decoding Rules
Only non-deterministic → Nucleus / sampling
Beam keeps → top-k deterministically
Top-k sampling → samples 1 from top-k
Savings heuristic → always terminates, not optimal
All Formulas at a Glance
Attention = softmax(QKᵀ/√d_k)·V
d_k = d_model / h
Scores (1 pass) = B × h × T × T
Q params = d_model × d_k
W_O params = (h × d_v) × d_model (= d_model × d_model when h × d_v = d_model)
Non-zero causal scores = T(T+1)/2
Train targets = B × (T−1)
Total seqs (max L) = Σ V^k, k=1..L
Extra seqs (L→L+1) = (V−1)·V^L