Study Notes B.S DATA SCIENCE UAF Agriculture Faisalabad

Explore study notes for the B.S Data Science program at the University of Agriculture Faisalabad (UAF): tips for success, a curriculum overview, and why UAF is the top choice for Data Science enthusiasts. The University of Agriculture Faisalabad is renowned for its cutting-edge research facilities, expert faculty members, and industry partnerships. By enrolling in the B.S Data Science program at UAF, you will have access to state-of-the-art laboratories, hands-on training, and internship opportunities with leading agricultural companies.

1.1. Defining ICT

1.2. Core Components of ICT

1.3. The Digital Transformation Paradigm

2.1. Computer Networks & The Internet

2.2. Wireless & Mobile Technologies

2.3. Cloud Computing

2.4. Cybersecurity Fundamentals

3.1. E-Business and E-Commerce

3.2. E-Learning (Technology-Enhanced Learning)

3.3. E-Governance

3.4. Telemedicine and Digital Health

3.5. Digital Finance & FinTech

4.1. Internet of Things (IoT)

4.2. Artificial Intelligence (AI) and Machine Learning (ML)

4.3. Big Data and Analytics

4.4. Social, Ethical, and Professional Issues

ICT is the foundational engine of the modern world. Its applications—from the cloud-based systems running global supply chains to the AI algorithms curating our social media feeds—are transforming every sector. A deep understanding of the core technologies (networks, cloud, security) and their application in key domains (business, education, governance, health, finance) is essential. As we move forward, the convergence of AI, IoT, and 5G will create unprecedented opportunities and challenges, making the study of ICT applications a continuously evolving and critical field.


1. Introduction to Programming

Programming is the process of designing, writing, testing, and maintaining instructions (code) that a computer can execute to perform a specific task.

  • Algorithm: A step-by-step procedure for solving a problem.

  • Flowchart: A graphical representation of an algorithm using symbols (ovals, rectangles, diamonds, etc.).

  • Programming Paradigms:

    • Procedural: Focus on functions and step-by-step execution (e.g., C).

    • Object-Oriented: Organize code around objects (e.g., C++, Java, Python).

    • Functional: Treat computation as evaluation of mathematical functions.


2. Basic Syntax and Structure

A typical program in C/C++ starts with a main function, which is the entry point.

#include <iostream>
using namespace std;

int main() {
    
    cout << "Hello, World!" << endl;
    return 0;
}

3. Variables and Data Types

A variable is a named memory location that holds a value.

Primitive Data Types

Declaration and Initialization

int age = 20;
float pi = 3.14159;
char grade = 'A';
bool isPassed = true;

Type Conversion

  • Implicit (automatic): e.g., int a = 5.5; → a becomes 5 (truncation).

  • Explicit (casting): float b = (float) 10 / 3; or float b = 10.0 / 3;
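
The two kinds of conversion above can be sketched as small functions. This is only an illustration; the function names (truncateToInt, integerDivision, realDivision) are made up for this example.

```cpp
// Implicit conversion: the fractional part is discarded when a double
// is stored in an int.
int truncateToInt(double value) {
    int result = value;   // implicit double -> int conversion
    return result;
}

// Both operands are ints, so the division is done in integers first
// and only the result is converted to double.
double integerDivision(int a, int b) {
    return a / b;
}

// Casting one operand first forces floating-point division.
double realDivision(int a, int b) {
    return static_cast<double>(a) / b;
}
```

Calling integerDivision(10, 3) gives 3.0, while realDivision(10, 3) gives roughly 3.33, which is why the cast has to be applied before the division rather than after it.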


4. Operators and Expressions

Categories

Operator Precedence (high to low)

  1. ()

  2. ++ (postfix), -- (postfix)

  3. ++ (prefix), -- (prefix), !, ~, - (unary)

  4. *, /, %

  5. +, -

  6. <<, >>

  7. <, <=, >, >=

  8. ==, !=

  9. &&

  10. ||

  11. =, +=, -=, etc.

Use parentheses to make expressions clear and avoid ambiguity.
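
As a rough illustration of how precedence changes the result, the sketch below evaluates a few expressions with and without parentheses. The function names are hypothetical and exist only for this example.

```cpp
// Multiplication binds tighter than addition: 2 + 3 * 4 is 2 + (3 * 4).
int withoutParentheses() { return 2 + 3 * 4; }

// Parentheses override the default grouping.
int withParentheses() { return (2 + 3) * 4; }

// Relational operators bind tighter than &&, and && binds tighter than ||,
// so this is parsed as: x == 0 || ((x >= 10) && (x <= 20)).
bool zeroOrInRange(int x) {
    return x == 0 || x >= 10 && x <= 20;
}
```

The first function yields 14 and the second 20. Many compilers warn about the unparenthesized && inside ||, which is a good hint to add the parentheses explicitly.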


5. Control Structures

Conditional Statements

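// if-else: choose between two paths
if (condition) {
    // runs when condition is true
} else {
    // runs when condition is false
}

// else-if ladder: test several conditions in order
if (score >= 90) {
    grade = 'A';
} else if (score >= 80) {
    grade = 'B';
} else {
    grade = 'F';
}

// switch: branch on the value of an integral expression
switch (variable) {
    case value1:
        // statements for value1
        break;
    case value2:
        // statements for value2
        break;
    default:
        // statements when no case matches
        break;
}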

Loops

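for (initialization; condition; increment) {
    // body repeats while condition is true
}

while (condition) {
    // condition is checked before each iteration
}

do {
    // body runs at least once; condition is checked afterwards
} while (condition);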

6. Functions

A function is a reusable block of code that performs a specific task.

Function Definition

return_type function_name(parameter_list) {
    // function body
    return value;   // omitted when return_type is void
}

Example

int add(int a, int b) {
    return a + b;
}

Parameter Passing

void swap(int &x, int &y) {
    int temp = x;
    x = y;
    y = temp;
}

Scope and Lifetime

  • Local variables: declared inside a block; exist only within that block.

  • Global variables: declared outside all functions; accessible throughout the program (avoid overuse).

  • Static variables: retain value between function calls.
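
The difference between the three kinds of variables can be seen in a small sketch. The names globalCounter, nextId, and localOnly are invented for this example.

```cpp
int globalCounter = 0;          // global: visible to every function in this file

int nextId() {
    static int lastId = 0;      // static: initialized once, keeps its value between calls
    ++lastId;
    return lastId;
}

int localOnly() {
    int calls = 0;              // local: re-created on every call
    ++calls;
    ++globalCounter;
    return calls;
}
```

Successive calls to nextId() return 1, 2, 3, and so on, whereas localOnly() returns 1 every time because its local variable starts from zero on each call.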

Recursion

A function that calls itself. Must have a base case to terminate.

int factorial(int n) {
    if (n <= 1) return 1;
    return n * factorial(n - 1);
}

7. Arrays and Strings

Arrays

A collection of elements of the same data type stored in contiguous memory.

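int numbers[5];               // 5 ints (uninitialized if the array is local)
int scores[] = {90, 85, 70};  // size (3) is deduced from the initializer
numbers[0] = 10;              // indices start at 0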

Strings

In C: null-terminated character arrays.

In C++: use std::string from <string> header.

#include <iostream>
#include <string>
using namespace std;

string greeting = "Hello";
cout << greeting.length();   // prints 5

8. Pointers and Dynamic Memory

Pointer Basics

A pointer is a variable that stores the memory address of another variable.

int x = 10;
int *ptr = &x;      // ptr holds the address of x
cout << *ptr;       // dereference: prints 10

Pointer Arithmetic

When you increment a pointer, it moves by the size of the data type it points to.
ptr++ advances to the next element in an array.
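
A minimal sketch of pointer arithmetic, assuming a plain int array: the helper below (sumWithPointer, a name chosen for this example) walks the array by incrementing a pointer instead of using an index.

```cpp
#include <cstddef>

// Sums `count` ints by moving a pointer through the array.
int sumWithPointer(const int* first, std::size_t count) {
    int total = 0;
    const int* end = first + count;               // one past the last element
    for (const int* p = first; p != end; ++p) {   // ++p moves ahead by sizeof(int) bytes
        total += *p;
    }
    return total;
}
```

For an array declared as int a[] = {1, 2, 3, 4}, the call sumWithPointer(a, 4) adds up all four elements.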

Dynamic Memory Allocation

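int *p = new int;   // allocate a single int on the heap
*p = 25;
delete p;           // release it

int *arr = new int[10];   // allocate an array of 10 ints
delete[] arr;             // arrays must be released with delete[]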

Always free dynamically allocated memory to avoid memory leaks.


9. Structures, Unions, and Classes

Structures (struct)

Group related variables of different types into a single unit.

struct Student {
    string name;
    int id;
    float gpa;
};

Student s1 = {"John", 12345, 3.75};
cout << s1.name;

Unions (union)

Similar to structure but all members share the same memory location (only one member can hold a value at a time).

Classes and Object-Oriented Programming (C++)

A class is a user-defined data type that contains data members and member functions.

class Rectangle {
private:
    int width, height;
public:
    Rectangle(int w, int h) : width(w), height(h) {} 
    int area() { return width * height; }
};

OOP Concepts

  • Encapsulation: Bundling data and methods; hiding internal state (private).

  • Inheritance: Deriving a new class from an existing one.

  • Polymorphism: Ability to use a common interface for different data types (function overloading, virtual functions).

  • Abstraction: Exposing only essential features.


10. File Handling

Reading from and writing to files using file streams.

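#include <fstream>
#include <iostream>
#include <string>
using namespace std;

// Writing to a file
ofstream outFile("data.txt");
outFile << "Hello, file!" << endl;
outFile.close();

// Reading from a file, line by line
ifstream inFile("data.txt");
string line;
while (getline(inFile, line)) {
    cout << line << endl;
}
inFile.close();

  • In C, the equivalent functions are fopen(), fprintf(), fscanf(), and fclose().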


11. Debugging and Best Practices

  • Use meaningful variable names.

  • Comment your code – explain why, not what.

  • Avoid magic numbers – use named constants.

  • Check for errors:

  • Use a debugger (gdb, Visual Studio debugger) to step through code.

  • Write modular code – break problems into small functions.

  • Test edge cases (empty input, negative numbers, etc.).


Summary Table of Key Concepts


These notes cover the core topics of CS-308 Programming Fundamentals. Mastery requires hands-on practice: write code, debug errors, and experiment with examples.

These notes cover the fundamental and advanced concepts of Object-Oriented Programming (OOP) typically taught in a CS-409 course. The primary language used is C++, with occasional references to Java to highlight language-specific implementations.


1. Introduction to OOP

1.1 Procedural vs Object-Oriented Programming

1.2 Benefits of OOP

  • Modularity: Each object is an independent entity with well-defined boundaries.

  • Reusability: Classes can be reused through inheritance and composition.

  • Maintainability: Encapsulation hides internal details, making changes local.

  • Scalability: OOP naturally supports large, complex systems.


2. Core OOP Concepts

2.1 Classes and Objects

  • Class: A blueprint that defines attributes (data members) and behaviors (member functions).

  • Object: An instance of a class, occupying memory at runtime.

Example in C++:

class Student {
public:
    string name;
    int rollNo;
    void display() {
        cout << name << " " << rollNo;
    }
};

int main() {
    Student s1;          
    s1.name = "Alice";
    s1.rollNo = 101;
    s1.display();
    return 0;
}

2.2 Encapsulation

Example:

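class BankAccount {
private:
    double balance = 0.0;   // hidden from outside code; starts at zero

public:
    void deposit(double amt) {
        if (amt > 0) balance += amt;   // ignore non-positive deposits
    }
    double getBalance() const {   // read-only access to the balance
        return balance;
    }
};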

2.3 Abstraction

  • Exposing only essential features and hiding implementation details.

  • Achieved using abstract classes (with pure virtual functions) and interfaces (in Java).

  • Reduces complexity and isolates impact of changes.

C++ Abstract Class:

class Shape {
public:
    virtual void draw() = 0;   
};

class Circle : public Shape {
public:
    void draw() override {
        cout << "Drawing Circle";
    }
};

2.4 Inheritance

Types:

  • Single – one base, one derived.

  • Multiple – derived from multiple bases (C++ supports; Java uses interfaces).

  • Multilevel – chain of inheritance.

  • Hierarchical – multiple derived classes from one base.

  • Hybrid – combination (e.g., multiple + multilevel).

C++ Example:

class Animal { public: void eat() { cout << "Eating"; } };
class Dog : public Animal { public: void bark() { cout << "Barking"; } };

Inheritance Access Specifiers:

  • public: public members of base remain public in derived.

  • protected: public/protected members become protected.

  • private: all become private (rarely used).

2.5 Polymorphism

Compile-time Polymorphism

  • Function Overloading: Same function name, different parameters.

  • Operator Overloading: Defining custom behavior for operators (e.g., + for complex numbers).

Runtime Polymorphism

C++ Example:

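class Base {
public:
    virtual ~Base() {}   // virtual destructor: safe to delete through a Base pointer
    virtual void show() { cout << "Base"; }
};
class Derived : public Base {
public:
    void show() override { cout << "Derived"; }
};

int main() {
    Base* ptr = new Derived();
    ptr->show();   // prints "Derived": the call is resolved at runtime
    delete ptr;    // release the object
    return 0;
}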

3. Detailed Topics

3.1 Constructors and Destructors

Order of Invocation:

Example:

class Base {
public:
    Base() { cout << "Base ctor\n"; }
    virtual ~Base() { cout << "Base dtor\n"; }   // virtual, so deleting via Base* also runs ~Derived
};
class Derived : public Base {
public:
    Derived() { cout << "Derived ctor\n"; }
    ~Derived() { cout << "Derived dtor\n"; }
};

3.2 this Pointer

  • Implicit pointer available inside non-static member functions.

  • Points to the object for which the function is called.

  • Used to resolve name conflicts and for method chaining.
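
Both uses of this mentioned above can be shown in one small class. Counter is a hypothetical class written for this example only.

```cpp
class Counter {
    int value;
public:
    explicit Counter(int value) {
        this->value = value;        // this-> distinguishes the member from the parameter
    }
    Counter& add(int amount) {
        value += amount;
        return *this;               // returning *this lets calls be chained
    }
    int get() const { return value; }
};
```

With this class, Counter c(1); c.add(2).add(3); leaves c.get() equal to 6, because each add call returns a reference to the same object.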

3.3 Static Members

3.4 Friend Functions and Classes

Example:

class A {
private:
    int secret;
    friend void showSecret(A& obj);
};
void showSecret(A& obj) { cout << obj.secret; }

3.5 Operator Overloading

  • Allows redefinition of operators for user-defined types.

  • Cannot change precedence, associativity, or arity.

  • Syntax: return_type operator op (parameters)

Example (Complex number addition):

class Complex {
    double real, imag;
public:
    Complex operator+(const Complex& other) {
        Complex temp;
        temp.real = real + other.real;
        temp.imag = imag + other.imag;
        return temp;
    }
};

3.6 Templates

Function Template:

template <typename T>
T max(T a, T b) { return (a > b) ? a : b; }

Class Template:

template <typename T>
class Stack {
    T* arr;
    int top;
public:
    void push(T val);
    T pop();
};

3.7 Exception Handling

  • Mechanism to handle runtime errors gracefully.

  • Keywords: try, catch, throw.

Example:

try {
    if (denominator == 0)
        throw "Division by zero!";
    result = numerator / denominator;
} catch (const char* msg) {
    cerr << msg << endl;
}

3.8 Standard Template Library (STL)

Iterators: Act like pointers to traverse containers.

Algorithms: sort, find, for_each, etc.
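
A short sketch of the three algorithms named above, applied to std::vector through its iterators. The wrapper function names (sortedCopy, contains, sumAll) are assumptions made for this example.

```cpp
#include <algorithm>
#include <vector>

// Returns a sorted copy of the input.
std::vector<int> sortedCopy(std::vector<int> values) {
    std::sort(values.begin(), values.end());
    return values;
}

// Returns true if `target` occurs in `values`.
bool contains(const std::vector<int>& values, int target) {
    return std::find(values.begin(), values.end(), target) != values.end();
}

// Adds up the elements by visiting each one with std::for_each.
int sumAll(const std::vector<int>& values) {
    int total = 0;
    std::for_each(values.begin(), values.end(), [&total](int v) { total += v; });
    return total;
}
```

Each algorithm takes a pair of iterators (begin, end) rather than the container itself, which is what lets the same algorithm work across different container types.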

3.9 Object Relationships

3.10 SOLID Principles

  • Single Responsibility: A class should have only one reason to change.

  • Open/Closed: Open for extension, closed for modification.

  • Liskov Substitution: Derived classes must be substitutable for base classes.

  • Interface Segregation: Many specific interfaces are better than one general interface.

  • Dependency Inversion: Depend on abstractions, not concretions.

3.11 UML Basics

  • Class Diagram: Shows classes, attributes, methods, and relationships.

    • + public, - private, # protected

    • Arrows: solid for inheritance, dashed for dependency, diamond for composition/aggregation.


4. Advanced OOP in C++

4.1 Virtual Functions & Pure Virtual

  • Virtual function: Can be overridden in derived classes; dynamic binding.

  • Pure virtual function: virtual void f() = 0; makes the class abstract.

  • Virtual Table (vtable): Mechanism used by compilers to support runtime polymorphism.

4.2 Virtual Inheritance & Diamond Problem

Example:

class A { public: int x; };
class B : virtual public A {};
class C : virtual public A {};
class D : public B, public C {};

4.3 Move Semantics & RAII

  • Move semantics (C++11): Transfer ownership of resources instead of copying; implemented using move constructor and move assignment (std::move).

  • RAII (Resource Acquisition Is Initialization): Resources (memory, file handles) are acquired in constructors and released in destructors – essential for exception safety.
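
The sketch below combines both ideas in one class: the constructor acquires memory, the destructor releases it, and the move operations transfer ownership instead of copying. Buffer is a hypothetical class for illustration, not a standard library type.

```cpp
#include <cstddef>
#include <utility>

class Buffer {
    int* data;
    std::size_t size;
public:
    explicit Buffer(std::size_t n) : data(new int[n]()), size(n) {}  // acquire
    ~Buffer() { delete[] data; }                                      // release

    Buffer(const Buffer&) = delete;             // copying is disabled in this sketch
    Buffer& operator=(const Buffer&) = delete;

    // Move constructor: take over the other buffer's memory.
    Buffer(Buffer&& other) noexcept : data(other.data), size(other.size) {
        other.data = nullptr;   // leave the source empty but safe to destroy
        other.size = 0;
    }

    // Move assignment: free our memory, then take over the other's.
    Buffer& operator=(Buffer&& other) noexcept {
        if (this != &other) {
            delete[] data;
            data = other.data;
            size = other.size;
            other.data = nullptr;
            other.size = 0;
        }
        return *this;
    }

    std::size_t length() const { return size; }
};
```

After Buffer b(std::move(a)); the memory belongs to b and a is left empty, so no element is copied and nothing is freed twice.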

4.4 Smart Pointers (C++11)

  • std::unique_ptr: Exclusive ownership, cannot be copied.

  • std::shared_ptr: Reference-counted, shared ownership.

  • std::weak_ptr: Non-owning observer, breaks cyclic references.
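
The ownership rules of the three smart pointers can be sketched as follows. Node and the helper functions are invented for this example; the smart pointer types themselves come from the <memory> header.

```cpp
#include <memory>
#include <utility>

struct Node {
    int value;
    explicit Node(int v) : value(v) {}
};

// unique_ptr: ownership moves into the parameter and out again through the return value.
std::unique_ptr<Node> passThrough(std::unique_ptr<Node> node) {
    return node;
}

// shared_ptr: making a copy increases the reference count.
long ownersAfterCopy(const std::shared_ptr<Node>& node) {
    std::shared_ptr<Node> copy = node;
    return node.use_count();
}

// weak_ptr: observes without owning; lock() yields an empty pointer once the object is gone.
bool isAlive(const std::weak_ptr<Node>& observer) {
    return observer.lock() != nullptr;
}
```

A unique_ptr has to be handed over with std::move, after which the original pointer is empty. A weak_ptr reports the object as alive only while at least one shared_ptr still owns it.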


5. OOP in Java – Brief Comparison


6. Key Points for Exams / Interviews

  • Difference between overloading and overriding:

    • Overloading: same name, different parameters; compile-time.

    • Overriding: redefining a base class virtual function in derived; runtime.

  • Why virtual destructor? Ensures destructor of derived class is called when deleting through base pointer.

  • Can constructors be virtual? No; because object type must be known at construction time.

  • Can static functions be virtual? No; static functions are not tied to an object.

  • What is object slicing? When a derived object is assigned to a base object, the derived part is “sliced off”.

  • Difference between new and malloc:

    • new calls constructor, malloc only allocates memory.

    • new is type-safe, returns exact type pointer.

  • What is RTTI? Run-Time Type Information: typeid, dynamic_cast.
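
Object slicing from the list above is easiest to see by passing the same object by value and by reference. Animal and Dog are hypothetical classes used only for this sketch.

```cpp
#include <string>

struct Animal {
    virtual ~Animal() = default;
    virtual std::string sound() const { return "..."; }
};

struct Dog : Animal {
    std::string sound() const override { return "Woof"; }
};

// Pass by value: only the Animal part is copied, so Animal::sound runs.
std::string soundByValue(Animal a) { return a.sound(); }

// Pass by reference: the dynamic type is preserved, so Dog::sound runs.
std::string soundByReference(const Animal& a) { return a.sound(); }
```

Given Dog d; the call soundByValue(d) returns "..." while soundByReference(d) returns "Woof". This is why polymorphic objects are normally passed by pointer or reference.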


7. Summary

Object Oriented Programming is a paradigm centered on data and objects, promoting modularity, reusability, and maintainability. The four pillars—encapsulation, abstraction, inheritance, polymorphism—form the foundation. Mastering advanced topics like templates, exception handling, STL, and memory management (especially in C++) is crucial for writing efficient, robust software. Understanding design principles (SOLID) and UML further enhances software design skills.

Remember: OOP is not just about language features; it’s a way of modeling real-world entities and their interactions in code.


1. Algorithm Analysis

1.1 Complexity Analysis

1.2 Common Recurrences


2. Fundamental Data Structures

2.1 Arrays

  • Contiguous memory, constant-time access by index.

  • Static arrays: fixed size.

  • Dynamic arrays (e.g., std::vector in C++): resize automatically; amortized O(1) insertion at end.

  • Operations:

2.2 Linked Lists

  • Linear collection of nodes, each pointing to the next.

  • Types: Singly, Doubly, Circular.

  • Operations:

  • Advantages: dynamic size, efficient insert/delete at known positions.

struct Node {
    int data;
    Node* next;
    Node(int x) : data(x), next(nullptr) {}
};

2.3 Stacks

  • LIFO (Last In First Out).

  • Operations: push, pop, top (peek).

  • Implementations:

  • Applications: function call stack, expression evaluation, backtracking.
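
As one small example of a stack application, the sketch below checks whether brackets in a string are balanced using std::stack. The function name isBalanced is an assumption for this example.

```cpp
#include <stack>
#include <string>

// Returns true if every bracket in `text` is closed by the matching
// bracket in the right order. Other characters are ignored.
bool isBalanced(const std::string& text) {
    std::stack<char> open;
    for (char c : text) {
        if (c == '(' || c == '[' || c == '{') {
            open.push(c);                          // remember the opening bracket
        } else if (c == ')' || c == ']' || c == '}') {
            if (open.empty()) return false;        // nothing to close
            char top = open.top();
            open.pop();
            bool matches = (top == '(' && c == ')') ||
                           (top == '[' && c == ']') ||
                           (top == '{' && c == '}');
            if (!matches) return false;
        }
    }
    return open.empty();                           // leftover openers mean imbalance
}
```

The LIFO order is what makes this work: the most recently opened bracket is always the one that must be closed next.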

2.4 Queues

  • FIFO (First In First Out).

  • Operations: enqueue, dequeue, front.

  • Implementations:

  • Variants: Deque (double-ended), Priority Queue.


3. Trees

3.1 Binary Trees

  • Each node has at most two children (left, right).

  • Traversals (recursive/iterative):

    • Preorder: root → left → right

    • Inorder: left → root → right

    • Postorder: left → right → root

    • Level-order: BFS using queue.

struct TreeNode {
    int val;
    TreeNode* left;
    TreeNode* right;
    TreeNode(int x) : val(x), left(nullptr), right(nullptr) {}
};
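
Two of the traversals listed above can be sketched on top of this node type: a recursive inorder traversal and a level-order traversal that uses a queue. The TreeNode struct is repeated so the block is self-contained; the function names are chosen for this example.

```cpp
#include <queue>
#include <vector>

struct TreeNode {
    int val;
    TreeNode* left;
    TreeNode* right;
    TreeNode(int x) : val(x), left(nullptr), right(nullptr) {}
};

// Inorder: left subtree, then the node, then the right subtree.
void inorder(const TreeNode* node, std::vector<int>& out) {
    if (!node) return;
    inorder(node->left, out);
    out.push_back(node->val);
    inorder(node->right, out);
}

// Level-order: visit nodes breadth-first using a queue.
std::vector<int> levelOrder(const TreeNode* root) {
    std::vector<int> out;
    if (!root) return out;
    std::queue<const TreeNode*> pending;
    pending.push(root);
    while (!pending.empty()) {
        const TreeNode* node = pending.front();
        pending.pop();
        out.push_back(node->val);
        if (node->left) pending.push(node->left);
        if (node->right) pending.push(node->right);
    }
    return out;
}
```

Preorder and postorder follow the same recursive pattern as inorder; only the position of the push_back call changes.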

3.2 Binary Search Trees (BST)

  • Property: left subtree values < node value < right subtree values.

  • Operations:

    • Search: O(h), where h is the tree height; O(log n) if balanced.

    • Insert: O(h)

    • Delete: O(h) (three cases: leaf, one child, two children → find inorder successor).

  • Balanced BSTs: AVL, Red‑Black trees guarantee O(log n).
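
A minimal, unbalanced BST sketch for insert and search follows. It assumes integer keys and ignores duplicates; BstNode, insert, contains, and destroy are names chosen for this example, and deletion of single keys is left out.

```cpp
struct BstNode {
    int key;
    BstNode* left;
    BstNode* right;
    explicit BstNode(int k) : key(k), left(nullptr), right(nullptr) {}
};

// Inserts `key` and returns the (possibly new) subtree root.
// Duplicate keys are ignored in this sketch.
BstNode* insert(BstNode* root, int key) {
    if (!root) return new BstNode(key);
    if (key < root->key) root->left = insert(root->left, key);
    else if (key > root->key) root->right = insert(root->right, key);
    return root;
}

// Walks down from the root, going left or right by comparison: O(h).
bool contains(const BstNode* root, int key) {
    while (root) {
        if (key == root->key) return true;
        root = (key < root->key) ? root->left : root->right;
    }
    return false;
}

// Frees the whole tree (postorder).
void destroy(BstNode* root) {
    if (!root) return;
    destroy(root->left);
    destroy(root->right);
    delete root;
}
```

Because nothing rebalances the tree, inserting keys in sorted order degrades it into a linked list with h = n, which is the case that AVL and Red-Black trees are designed to avoid.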

3.3 Heaps

  • Max-Heap: parent ≥ children.

  • Min-Heap: parent ≤ children.

  • Usually implemented as array (binary heap).

  • Operations:

    • insert: O(log n) (bubble up)

    • extractMax/extractMin: O(log n) (bubble down)

    • buildHeap from array: O(n)

  • Applications: Priority queues, heap sort.


4. Graphs

4.1 Representations

  • Adjacency Matrix: O(V²) space; O(1) edge check.

  • Adjacency List: O(V+E) space; efficient for sparse graphs.

  • Edge List: list of (u,v) pairs.

4.2 Graph Traversals

  • BFS (Breadth‑First Search): Uses queue; finds shortest path in unweighted graphs; O(V+E).

  • DFS (Depth‑First Search): Uses stack (recursion); O(V+E); used for connectivity, cycle detection.
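
The BFS shortest-path property can be sketched on an adjacency list. The function name bfsDistances and the choice of -1 for unreachable vertices are assumptions made for this example.

```cpp
#include <queue>
#include <vector>

// Returns the number of edges on a shortest path from `source` to each
// vertex, or -1 for vertices that cannot be reached.
std::vector<int> bfsDistances(const std::vector<std::vector<int>>& adj, int source) {
    std::vector<int> dist(adj.size(), -1);
    std::queue<int> frontier;
    dist[source] = 0;
    frontier.push(source);
    while (!frontier.empty()) {
        int u = frontier.front();
        frontier.pop();
        for (int v : adj[u]) {
            if (dist[v] == -1) {          // first visit is along a shortest path
                dist[v] = dist[u] + 1;
                frontier.push(v);
            }
        }
    }
    return dist;
}
```

Every vertex enters the queue at most once and every edge is examined a constant number of times, which gives the O(V+E) bound.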

4.3 Shortest Paths

  • Dijkstra’s Algorithm: Non‑negative weights; O((V+E) log V) with binary heap.

  • Bellman‑Ford: Handles negative weights; O(VE).

  • Floyd‑Warshall: All‑pairs shortest path; O(V³).
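
A sketch of Dijkstra's algorithm with a binary heap (std::priority_queue) follows. It assumes non-negative weights and a directed adjacency list; the Edge alias and the function name are chosen for this example.

```cpp
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

using Edge = std::pair<int, long long>;   // (neighbor, weight); weights assumed >= 0

// Returns shortest distances from `source`; unreachable vertices keep
// the maximum long long value as "infinity".
std::vector<long long> dijkstra(const std::vector<std::vector<Edge>>& adj, int source) {
    const long long INF = std::numeric_limits<long long>::max();
    std::vector<long long> dist(adj.size(), INF);
    using State = std::pair<long long, int>;   // (distance, vertex)
    std::priority_queue<State, std::vector<State>, std::greater<State>> heap;  // min-heap
    dist[source] = 0;
    heap.push(State(0, source));
    while (!heap.empty()) {
        State top = heap.top();
        heap.pop();
        long long d = top.first;
        int u = top.second;
        if (d > dist[u]) continue;             // stale heap entry, skip it
        for (const Edge& e : adj[u]) {
            long long candidate = d + e.second;
            if (candidate < dist[e.first]) {
                dist[e.first] = candidate;
                heap.push(State(candidate, e.first));
            }
        }
    }
    return dist;
}
```

Instead of decreasing keys inside the heap, this version pushes a new entry and skips outdated ones when they are popped, a common simplification that keeps the same O((V+E) log V) bound.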

4.4 Minimum Spanning Tree (MST)

  • Kruskal’s: Sort edges, use Union‑Find; O(E log E).

  • Prim’s: Grow tree from a start node; O((V+E) log V) with heap.

4.5 Topological Sorting


5. Sorting Algorithms

*Quick sort can be made stable with extra memory.

  • Divide & Conquer: Merge sort, Quick sort.

  • Comparison‑based lower bound: Ω(n log n).


6. Searching Algorithms

6.1 Linear Search

  • O(n) time; does not require sorted input.

  • Works on any list.

6.2 Binary Search

int binarySearch(int arr[], int left, int right, int target) {
    while (left <= right) {
        int mid = left + (right - left) / 2;
        if (arr[mid] == target) return mid;
        if (arr[mid] < target) left = mid + 1;
        else right = mid - 1;
    }
    return -1;
}

7. Hashing

  • Hash Table: key‑value store with average O(1) operations.

  • Hash Function: maps keys to indices.

  • Collision Handling:

    • Chaining: each bucket holds a linked list.

    • Open Addressing: linear probing, quadratic probing, double hashing.

  • Load Factor: α = n/m; resizing when threshold exceeded.

  • Applications: symbol tables, caching, set/map implementations.
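
To make chaining and the load factor concrete, here is a small hash map sketch with string keys and int values. The class name ChainedHashMap, the initial 8 buckets, and the 0.75 resize threshold are all assumptions for this example; in practice std::unordered_map would normally be used.

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <utility>
#include <vector>

class ChainedHashMap {
    typedef std::list<std::pair<std::string, int>> Chain;
    std::vector<Chain> buckets;
    std::size_t count;

    static std::size_t indexFor(const std::string& key, std::size_t bucketCount) {
        return std::hash<std::string>()(key) % bucketCount;   // hash function -> bucket index
    }

    // Doubles the number of buckets and redistributes all entries.
    void rehash() {
        std::vector<Chain> bigger(buckets.size() * 2);
        for (const Chain& chain : buckets)
            for (const std::pair<std::string, int>& entry : chain)
                bigger[indexFor(entry.first, bigger.size())].push_back(entry);
        buckets.swap(bigger);
    }

public:
    ChainedHashMap() : buckets(8), count(0) {}

    void put(const std::string& key, int value) {
        Chain& chain = buckets[indexFor(key, buckets.size())];
        for (std::pair<std::string, int>& entry : chain)
            if (entry.first == key) { entry.second = value; return; }   // update in place
        chain.push_back(std::make_pair(key, value));
        ++count;
        if (count > buckets.size() * 3 / 4) rehash();   // load factor above 0.75 (assumed threshold)
    }

    // Returns true and writes the value to `out` if the key is present.
    bool get(const std::string& key, int& out) const {
        const Chain& chain = buckets[indexFor(key, buckets.size())];
        for (const std::pair<std::string, int>& entry : chain)
            if (entry.first == key) { out = entry.second; return true; }
        return false;
    }

    std::size_t size() const { return count; }
};
```

Lookups stay O(1) on average only while chains remain short, which is why the table grows once the load factor passes the threshold.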


8. Advanced Topics

8.1 Tries

  • Tree structure for storing strings; each node represents a character.

  • Operations: insert, search, prefix search – O(L) where L is string length.

  • Used in autocomplete, spell checkers.
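
A compact trie sketch supporting the three operations above. It assumes words consist only of the lowercase letters 'a' to 'z'; the class and method names are chosen for this example.

```cpp
#include <array>
#include <memory>
#include <string>

class Trie {
    struct Node {
        std::array<std::unique_ptr<Node>, 26> next;   // one slot per letter
        bool isWord;
        Node() : isWord(false) {}
    };
    Node root;

    // Follows `s` from the root; returns nullptr if the path does not exist.
    const Node* walk(const std::string& s) const {
        const Node* node = &root;
        for (char c : s) {
            if (c < 'a' || c > 'z') return nullptr;
            node = node->next[c - 'a'].get();
            if (!node) return nullptr;
        }
        return node;
    }

public:
    // Assumes `word` contains only 'a'..'z'.
    void insert(const std::string& word) {
        Node* node = &root;
        for (char c : word) {
            std::unique_ptr<Node>& child = node->next[c - 'a'];
            if (!child) child.reset(new Node());
            node = child.get();
        }
        node->isWord = true;
    }

    bool search(const std::string& word) const {
        const Node* node = walk(word);
        return node != nullptr && node->isWord;
    }

    bool startsWith(const std::string& prefix) const {
        return walk(prefix) != nullptr;
    }
};
```

Each operation touches one node per character, which is where the O(L) cost comes from regardless of how many words are stored.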

8.2 Union‑Find (Disjoint Set Union)

class DSU {
    vector<int> parent, rank;
public:
    DSU(int n) {
        parent.resize(n);
        rank.resize(n, 0);
        for (int i = 0; i < n; i++) parent[i] = i;
    }
    int find(int x) {
        if (parent[x] != x) parent[x] = find(parent[x]);
        return parent[x];
    }
    void unite(int x, int y) {
        int rx = find(x), ry = find(y);
        if (rx == ry) return;
        if (rank[rx] < rank[ry]) parent[rx] = ry;
        else if (rank[rx] > rank[ry]) parent[ry] = rx;
        else { parent[ry] = rx; rank[rx]++; }
    }
};

8.3 Dynamic Programming (DP)

  • Solve problems by breaking into overlapping subproblems, storing results.

  • Top‑down (memoization) vs Bottom‑up (tabulation).

  • Classic problems: Fibonacci, Knapsack, Longest Common Subsequence, Matrix Chain Multiplication.
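
The two DP styles can be compared on Fibonacci, the simplest of the classic problems listed. Both functions assume n is non-negative; the names are chosen for this example.

```cpp
#include <cstdint>
#include <vector>

// Top-down: plain recursion plus a memo table.
// A stored 0 means "not computed yet" (valid because fib(n) > 0 for n >= 1).
std::uint64_t fibMemo(int n, std::vector<std::uint64_t>& memo) {
    if (n <= 1) return n;
    if (memo[n] != 0) return memo[n];
    memo[n] = fibMemo(n - 1, memo) + fibMemo(n - 2, memo);
    return memo[n];
}

std::uint64_t fibTopDown(int n) {
    std::vector<std::uint64_t> memo(n + 2, 0);
    return fibMemo(n, memo);
}

// Bottom-up: fill the table from the smallest subproblem upward.
std::uint64_t fibBottomUp(int n) {
    if (n <= 1) return n;
    std::vector<std::uint64_t> table(n + 1, 0);
    table[1] = 1;
    for (int i = 2; i <= n; ++i) table[i] = table[i - 1] + table[i - 2];
    return table[n];
}
```

Both versions compute each subproblem once, so they run in O(n) instead of the exponential time of naive recursion. The bottom-up form also avoids deep recursion.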

8.4 Greedy Algorithms

  • Make locally optimal choice at each step.

  • Examples: Activity selection, Huffman coding, Dijkstra’s algorithm (it always settles the closest unsettled vertex next).

8.5 Backtracking


9. Important Algorithms to Know

  • KMP (Knuth‑Morris‑Pratt): Pattern matching O(n+m) using prefix function.

  • Rabin‑Karp: Rolling hash for pattern matching.

  • Floyd’s Cycle Detection: Tortoise and hare.

  • Binary Exponentiation: pow(x,n) in O(log n).

  • Sieve of Eratosthenes: Find primes up to n in O(n log log n).
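
Two of the algorithms above are short enough to sketch directly. The modular form of binary exponentiation is shown because it is the variant most often needed in practice; powMod and primesUpTo are names chosen for this example.

```cpp
#include <cstdint>
#include <vector>

// Computes base^exp mod m by repeated squaring in O(log exp).
// Assumes m > 0 and m small enough that (m - 1)^2 fits in 64 bits.
std::uint64_t powMod(std::uint64_t base, std::uint64_t exp, std::uint64_t m) {
    std::uint64_t result = 1 % m;
    base %= m;
    while (exp > 0) {
        if (exp & 1) result = result * base % m;   // use this bit of the exponent
        base = base * base % m;                    // square for the next bit
        exp >>= 1;
    }
    return result;
}

// Sieve of Eratosthenes: returns all primes <= n.
std::vector<int> primesUpTo(int n) {
    std::vector<int> primes;
    if (n < 2) return primes;
    std::vector<bool> composite(n + 1, false);
    for (int i = 2; i <= n; ++i) {
        if (composite[i]) continue;
        primes.push_back(i);
        // Start at i*i: smaller multiples were already marked by smaller primes.
        for (long long j = 1LL * i * i; j <= n; j += i) composite[j] = true;
    }
    return primes;
}
```

For example, powMod(2, 10, 1000000007) returns 1024 after only four squarings rather than ten multiplications.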


10. Implementation Tips & Best Practices

  • Use standard library when appropriate: std::vector, std::list, std::stack, std::queue, std::priority_queue, std::map, std::unordered_map.

  • Understand trade‑offs between data structures:

  • Test edge cases: empty container, single element, duplicates, large inputs.

  • Memory management: avoid leaks; use smart pointers in C++ if needed.

  • Complexity analysis should accompany any algorithm design.


Summary Table

* If position known (e.g., head/tail).


Mastery of CS-410 Data Structures and Algorithms requires both theoretical understanding and extensive coding practice. Implement each data structure from scratch at least once, solve problems on platforms like LeetCode, and analyze complexity for every solution.

CS-408: DATABASE SYSTEMS

Module 1: Introduction to Database Systems

1.1. Basic Concepts

  • Data: Raw, unprocessed facts (e.g., “John”, “85”, “A”).

  • Information: Processed, organized, and contextualized data that has meaning (e.g., “John scored 85% which is an A grade”).

  • Database: A structured, organized collection of related data stored electronically. It is a shared, integrated, and persistent collection of data.

  • Database Management System (DBMS): A software system that enables users to define, create, maintain, and control access to the database. It acts as an intermediary between the user/application and the database.

    • Examples: MySQL, PostgreSQL, Oracle Database, Microsoft SQL Server, MongoDB, SQLite.

1.2. File System vs. DBMS

1.3. Database System Architecture

Three-Schema Architecture (ANSI-SPARC)

This architecture separates user applications from the physical database:

  1. Internal Schema (Physical Level):

    • Describes how data is physically stored on storage media.

    • Deals with file structures, indexing, compression, and storage paths.

    • Goal: Physical data independence.

  2. Conceptual Schema (Logical Level):

    • Describes the logical structure of the entire database.

    • Includes entity types, data types, relationships, and constraints.

    • Hides physical storage details.

    • Goal: Logical data independence.

  3. External Schema (View Level):

    • Describes how specific users or user groups see the data.

    • Multiple views can exist (e.g., a student view vs. an admin view).

    • Provides security by hiding sensitive data.

Data Independence:

  • Logical Data Independence: Ability to change the conceptual schema without affecting external schemas.

  • Physical Data Independence: Ability to change the internal schema without affecting the conceptual schema.

1.4. Database Users and Roles

  • Database Administrators (DBAs): Manage the system, handle security, backups, performance tuning, and user accounts.

  • Database Designers: Define the database structure (tables, relationships, constraints).

  • End Users:

    • Naive/Parametric Users: Interact through predefined applications (e.g., bank teller).

    • Sophisticated Users: Use query languages directly (e.g., analysts writing SQL).

  • Application Programmers: Develop applications that interact with the database.

1.5. Database Models Overview

  • Hierarchical Model: Tree-like structure (parent-child). Obsolete.

  • Network Model: Graph-like structure allowing multiple parent-child relationships. Obsolete.

  • Relational Model: Data organized in tables (relations). Most widely used.

  • Object-Oriented Model: Data stored as objects with attributes and methods.

  • Object-Relational Model: Hybrid of relational and object-oriented.

  • NoSQL Models: Non-relational databases (document, key-value, column-family, graph) for big data and distributed systems.


Module 2: Relational Model

2.1. Core Terminology

  • Relation: A table with rows and columns.

  • Tuple: A row in a relation (record).

  • Attribute: A column in a relation (field).

  • Domain: The set of permissible values for an attribute (e.g., integer, string, date).

  • Degree: Number of attributes in a relation.

  • Cardinality: Number of tuples in a relation.

  • Relation Schema: The logical definition of a relation: R(A1, A2, ..., An).

  • Relation Instance: The set of tuples in a relation at a particular point in time.

2.2. Relational Integrity Constraints

  1. Domain Constraints: Each attribute value must be from its defined domain.

  2. Key Constraints:

    • Superkey: An attribute or set of attributes that uniquely identifies a tuple.

    • Candidate Key: A minimal superkey (no subset is a superkey).

    • Primary Key: The candidate key chosen to uniquely identify tuples in a relation.

    • Alternate Key: Candidate keys not chosen as primary key.

  3. Entity Integrity: The primary key cannot contain NULL values.

  4. Referential Integrity (Foreign Key): A foreign key in one relation must either match a primary key value in another relation or be NULL (depending on constraints).

2.3. Relational Algebra (Formal Foundation)

Relational algebra is a procedural query language with operations that take relations as input and produce a new relation as output.


Module 3: Database Design

3.1. Entity-Relationship (ER) Modeling

ER modeling is a high-level conceptual design approach.

Components:

  • Entity: A real-world object distinguishable from others (e.g., Student, Course).

  • Attributes: Properties of an entity.

    • Simple/Composite, Single-valued/Multi-valued, Derived, Key attribute.

  • Relationships: Associations between entities.

    • Degree: Unary, Binary, Ternary.

    • Cardinality: 1:1, 1:N, M:N.

    • Participation: Total (double line) or Partial (single line).

Crow’s Foot Notation (Common):

  • | = One (mandatory)

  • O = Zero (optional)

  • || = One and only one

  • } = Many

3.2. Normalization

Normalization is the process of organizing data to minimize redundancy and dependency. It involves decomposing tables into smaller, well-structured relations.

Functional Dependency (FD): A constraint between two sets of attributes: X → Y means that the value of X uniquely determines the value of Y.
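
The definition can be made concrete with a small check over sample rows. The sketch is written in C++ to match the earlier sections and simplifies each row to a pair of strings (the X value and the Y value); Row and dependencyHolds are names invented for this example.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// A row reduced to (value of X, value of Y).
using Row = std::pair<std::string, std::string>;

// X -> Y is satisfied by these rows if no X value appears with two
// different Y values.
bool dependencyHolds(const std::vector<Row>& rows) {
    std::map<std::string, std::string> seen;   // X value -> the Y value it determines
    for (const Row& row : rows) {
        std::map<std::string, std::string>::const_iterator it = seen.find(row.first);
        if (it == seen.end()) seen[row.first] = row.second;
        else if (it->second != row.second) return false;   // same X, different Y
    }
    return true;
}
```

Note that sample data can only disprove a dependency. Whether X → Y truly holds is a rule about the schema and must come from the business requirements, not from one instance of the table.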

3.3. Database Design Process

  1. Requirements Analysis: Understand user needs and business rules.

  2. Conceptual Design: Create an ER model (independent of any DBMS).

  3. Logical Design: Map ER model to relational schema (tables, keys, constraints).

  4. Physical Design: Decide on storage structures, indexing, partitioning, etc.

  5. Implementation: Create the database using DDL and load data.

  6. Testing & Maintenance: Validate and maintain the database.


Module 4: Structured Query Language (SQL)

SQL is the standard language for relational database management.

4.1. Data Definition Language (DDL)

DDL defines the database structure.

-- Create a database and a table
CREATE DATABASE University;
CREATE TABLE Student (
    student_id INT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    age INT CHECK (age >= 18),
    major VARCHAR(50) DEFAULT 'Undeclared'
);

-- Modify an existing table
ALTER TABLE Student ADD COLUMN email VARCHAR(100) UNIQUE;
ALTER TABLE Student DROP COLUMN age;

-- Remove a table or a database entirely
DROP TABLE Student;
DROP DATABASE University;

-- Remove all rows but keep the table structure
TRUNCATE TABLE Student;

4.2. Data Manipulation Language (DML)

DML manages data within objects.

INSERT INTO Student (student_id, name, major) VALUES (1, 'Alice', 'CS');


UPDATE Student SET major = 'AI' WHERE student_id = 1;


DELETE FROM Student WHERE student_id = 1;

4.3. Data Query Language (DQL) – SELECT

SELECT name, major FROM Student;


SELECT * FROM Student WHERE major = 'CS' AND age > 20;


SELECT name, age FROM Student ORDER BY age DESC, name ASC;


SELECT DISTINCT major FROM Student;


SELECT 
    COUNT(*) AS total_students,
    AVG(age) AS avg_age,
    MAX(age) AS oldest,
    MIN(age) AS youngest
FROM Student;


SELECT major, COUNT(*) FROM Student GROUP BY major;


SELECT major, COUNT(*) 
FROM Student 
GROUP BY major 
HAVING COUNT(*) > 5;


SELECT * FROM Student WHERE name LIKE 'A%';            -- names starting with 'A'
SELECT * FROM Student WHERE name LIKE '%son';          -- names ending with 'son'
SELECT * FROM Student WHERE email LIKE '%@gmail.com';  -- Gmail addresses


SELECT * FROM Student WHERE major IN ('CS', 'AI');
SELECT * FROM Student WHERE age BETWEEN 18 AND 22;
SELECT * FROM Student WHERE email IS NULL;

4.4. Joins

Joins combine data from multiple tables based on related columns.

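-- INNER JOIN: only students that have matching enrollments and courses
SELECT s.name, c.course_name
FROM Student s
INNER JOIN Enrollment e ON s.student_id = e.student_id
INNER JOIN Course c ON e.course_id = c.course_id;

-- LEFT JOIN: all students; grade is NULL where no enrollment exists
SELECT s.name, e.grade
FROM Student s
LEFT JOIN Enrollment e ON s.student_id = e.student_id;

-- RIGHT JOIN: all rows from the right table (Student); same result as the LEFT JOIN above
SELECT s.name, e.grade
FROM Enrollment e
RIGHT JOIN Student s ON e.student_id = s.student_id;

-- FULL OUTER JOIN: all rows from both tables (not supported in MySQL)
SELECT s.name, e.grade
FROM Student s
FULL OUTER JOIN Enrollment e ON s.student_id = e.student_id;

-- Self join: pair each employee with their manager
SELECT e1.name AS employee, e2.name AS manager
FROM Employee e1
LEFT JOIN Employee e2 ON e1.manager_id = e2.employee_id;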

4.5. Subqueries (Nested Queries)

SELECT name FROM Student 
WHERE student_id IN (SELECT student_id FROM Enrollment WHERE course_id = 101);


SELECT major, avg_age
FROM (SELECT major, AVG(age) AS avg_age FROM Student GROUP BY major) AS major_stats;


SELECT name FROM Student s1
WHERE age > (SELECT AVG(age) FROM Student s2 WHERE s2.major = s1.major);

4.6. Set Operations

SELECT name FROM Student_2023
UNION
SELECT name FROM Student_2024;


SELECT name FROM Student_2023
UNION ALL
SELECT name FROM Student_2024;


SELECT name FROM Student_2023
INTERSECT
SELECT name FROM Student_2024;


SELECT name FROM Student_2023
EXCEPT
SELECT name FROM Student_2024;

4.7. Views

Views are virtual tables derived from queries.

CREATE VIEW CS_Students AS
SELECT student_id, name, email
FROM Student
WHERE major = 'CS';


SELECT * FROM CS_Students;


UPDATE CS_Students SET email = '[email protected]' WHERE student_id = 1;


DROP VIEW CS_Students;

Module 5: Transaction Management

5.1. ACID Properties

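Transactions must ensure:

  • Atomicity: All operations of a transaction take effect, or none do.

  • Consistency: A transaction moves the database from one valid state to another.

  • Isolation: Concurrent transactions do not see each other's intermediate results.

  • Durability: Once committed, changes survive subsequent failures.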

5.2. Transaction States

Active → Partially Committed → Committed
  ↓               ↓
Failed ←──────────┘
  ↓
Aborted

  • Active: Initial state, executing.

  • Partially Committed: After final statement, but before commit.

  • Failed: Cannot complete normally.

  • Aborted: Rolled back, changes undone.

  • Committed: Successfully completed, changes permanent.

5.3. Concurrency Control

Problems:

  • Lost Update: Two transactions overwrite each other’s changes.

  • Dirty Read: Reading uncommitted data that may be rolled back.

  • Non-Repeatable Read: Same query yields different results during a transaction.

  • Phantom Read: New rows appear/disappear during a transaction.

Lock-Based Protocols:

Two-Phase Locking (2PL): A protocol that ensures serializability.

  1. Growing Phase: Acquire locks, cannot release any.

  2. Shrinking Phase: Release locks, cannot acquire any.

Strict 2PL: Holds all locks until commit/abort (prevents cascading rollbacks).

5.4. Deadlock

A deadlock occurs when two or more transactions wait indefinitely for each other’s locks.

Handling Strategies:

  • Deadlock Prevention: Resource ordering, wait-die, wound-wait schemes.

  • Deadlock Detection: Wait-for graph; choose a victim to abort (rollback).
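
Deadlock detection with a wait-for graph comes down to finding a cycle in a directed graph. The sketch below is written in C++ to match the earlier sections and assumes transactions are numbered 0 to n-1; waitsFor, onCycle, and hasDeadlock are names chosen for this example, and victim selection is left out.

```cpp
#include <vector>

// waitsFor[t] lists the transactions that transaction t is waiting on.
// state: 0 = unvisited, 1 = on the current DFS path, 2 = fully explored.
bool onCycle(int t, const std::vector<std::vector<int>>& waitsFor, std::vector<int>& state) {
    state[t] = 1;
    for (int next : waitsFor[t]) {
        if (state[next] == 1) return true;   // edge back into the current path: a cycle
        if (state[next] == 0 && onCycle(next, waitsFor, state)) return true;
    }
    state[t] = 2;
    return false;
}

// Returns true if the wait-for graph contains a cycle, i.e. a deadlock.
bool hasDeadlock(const std::vector<std::vector<int>>& waitsFor) {
    std::vector<int> state(waitsFor.size(), 0);
    for (int t = 0; t < static_cast<int>(waitsFor.size()); ++t)
        if (state[t] == 0 && onCycle(t, waitsFor, state)) return true;
    return false;
}
```

If T0 waits for T1, T1 for T2, and T2 for T0, the function reports a deadlock. Removing the last edge breaks the cycle, which is the effect of aborting a victim.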

5.5. Recovery

Types of Failures:

  • Transaction failures (logical errors, system errors)

  • System crashes (power failure, OS crash)

  • Media failures (disk head crash)

Recovery Techniques:

  • Log-Based Recovery: Maintain a transaction log (write-ahead logging – WAL) before any changes are applied.

  • Checkpoints: Periodically sync database state to reduce recovery time.


Module 6: Advanced Database Topics

6.1. Indexing

Indexes speed up data retrieval at the cost of additional storage and slower writes.

  • B+ Tree Index: The most common index structure. Balanced tree with:

    • All leaves at same level.

    • Internal nodes store keys for navigation.

    • Leaf nodes contain actual data pointers.

  • Hash Index: Uses hash function for direct lookup. Best for equality searches.

  • Bitmap Index: Efficient for columns with low cardinality (e.g., gender, status).

  • Clustered Index: Determines physical order of data (only one per table).

  • Non-Clustered Index: Logical order separate from physical storage.

6.2. Query Processing and Optimization

Steps:

  1. Parsing: Syntax and semantics check.

  2. Query Optimization: Generate and evaluate multiple execution plans; choose the most efficient.

  3. Execution: Execute the chosen plan.

Optimization Techniques:

  • Selection Pushdown: Perform filtering early.

  • Join Ordering: Choose optimal join sequence.

  • Use of Indexes: Prefer index scans over full table scans when beneficial.

6.3. Distributed Databases

A database where data is stored across multiple physical locations.

Advantages: Reliability, scalability, local autonomy.
Challenges: Distributed concurrency control, distributed commit (2PC), data replication consistency.

6.4. NoSQL Databases

Designed for non-relational data, horizontal scalability, and high performance.

6.5. Data Warehousing and Business Intelligence

  • Data Warehouse: A centralized repository for analytical reporting and decision support. Typically follows ETL (Extract, Transform, Load) process.

  • Data Mart: A subset focused on a specific department or function.

  • OLTP vs. OLAP:


Conclusion

Database Systems form the backbone of modern information management. From foundational concepts like the relational model and normalization to practical skills in SQL, and from ensuring data integrity through ACID transactions to scaling with distributed and NoSQL solutions, this course covers the essential knowledge required to design, implement, and manage robust database systems. Understanding these principles is critical for any professional in software development, data science, or IT infrastructure.

CS-505 MACHINE LEARNING: STUDY NOTES

1. Introduction to Machine Learning and Fundamental Concepts

Machine Learning (ML) is a subfield of artificial intelligence that empowers computer systems to learn from data without being explicitly programmed for every rule. The core premise is to develop algorithms that can identify patterns, make decisions, and improve their performance on a specific task through experience. This process is fundamentally about function approximation: given a set of input data points, we aim to learn a function f that maps these inputs to desired outputs. The quality of this function is measured by its ability to generalize, meaning it performs accurately not only on the data it was trained on but also on new, unseen data. The course begins by categorizing learning paradigms. Supervised learning involves learning a mapping from input features to labeled outputs, where the goal is to predict the label for new inputs. This category is further divided into regression, where the output is a continuous value (e.g., predicting house prices), and classification, where the output is a discrete category (e.g., spam detection). Unsupervised learning, in contrast, deals with unlabeled data, aiming to discover inherent structures, such as grouping similar data points in clustering or reducing the number of features in dimensionality reduction. A third paradigm, Reinforcement Learning, involves an agent learning optimal actions through interactions with an environment to maximize a cumulative reward, a concept distinct from the pattern recognition focus of the first two.

A critical foundation for all machine learning is understanding the hypothesis space and the inductive bias. The hypothesis space is the set of all possible functions (models) that a learning algorithm can consider. The inductive bias is the set of assumptions the learner uses to choose one function over another from this space; without such bias, generalization from finite data is impossible. For instance, preferring a simpler line over a complex, wiggly curve to fit data points is a bias towards simplicity (Occam’s razor). The central challenge in ML is the bias-variance tradeoff. Bias is the error introduced by approximating a real-world problem, which may be complex, by a much simpler model. High bias leads to underfitting, where the model fails to capture the underlying trend of the data. Variance is the error introduced by the model’s sensitivity to small fluctuations in the training set. High variance leads to overfitting, where the model learns the noise in the training data as if it were a genuine pattern, resulting in poor performance on new data. The goal of a practitioner is to find a model that balances these two sources of error to minimize the total generalization error.


2. Supervised Learning: Regression and Classification Algorithms

Regression models predict continuous outputs. The simplest and most fundamental is Linear Regression, which assumes a linear relationship between input features and the output. The model is represented as $h_\theta(x) = \theta^T x$, where $\theta$ are the model parameters (weights). The standard learning objective is to minimize the Mean Squared Error (MSE) cost function, often solved analytically using the Normal Equation ($\theta = (X^T X)^{-1} X^T y$) or iteratively via Gradient Descent. Gradient Descent is an optimization algorithm that iteratively updates parameters in the direction of the negative gradient of the cost function. Concepts like learning rate, convergence, and the distinction between batch, stochastic, and mini-batch gradient descent are crucial for efficient optimization. To combat overfitting in linear models, regularization techniques are introduced. Ridge Regression (L2 regularization) adds a penalty proportional to the sum of the squared coefficients, shrinking them towards zero. Lasso Regression (L1 regularization) adds a penalty proportional to the sum of the absolute values of the coefficients, which can force some coefficients to become exactly zero, effectively performing feature selection.
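As a rough illustration of batch gradient descent on the MSE cost, the sketch below fits a single-feature model with an intercept in plain Python. The data, learning rate, and epoch count are arbitrary choices for this example, not recommended defaults.

```python
def fit_linear_gd(xs, ys, lr=0.05, epochs=2000):
    """Batch gradient descent for y ≈ theta0 + theta1 * x, minimizing MSE."""
    theta0, theta1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error of the current parameters on every training point.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of MSE with respect to theta0 and theta1.
        grad0 = (2.0 / n) * sum(errors)
        grad1 = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        # Step against the gradient.
        theta0 -= lr * grad0
        theta1 -= lr * grad1
    return theta0, theta1

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]   # generated from y = 1 + 2x, no noise
theta0, theta1 = fit_linear_gd(xs, ys)
```

With a learning rate that is too large the updates diverge instead of converging, which is why the learning rate is treated as a hyperparameter.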

Classification algorithms predict discrete labels. Logistic Regression is a fundamental classification algorithm despite its name. It models the probability that an instance belongs to a particular class using the logistic (sigmoid) function: $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$. The output is interpreted as the probability of the positive class, and a threshold (usually 0.5) is applied to make a final classification. The model is trained by maximizing the likelihood of the observed data, which is equivalent to minimizing the log-loss (cross-entropy) cost function. For multi-class problems, strategies like one-vs-rest (OvR) are used.
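The sigmoid hypothesis and the log-loss can be written out directly. This sketch assumes each input vector already includes a leading 1 for the bias term; the parameter values and data are made up to show how the loss responds.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(theta, x):
    """Probability of the positive class for one feature vector x."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def log_loss(theta, X, y):
    """Average cross-entropy; probabilities are clipped to avoid log(0)."""
    eps = 1e-12
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(predict_proba(theta, xi), eps), 1 - eps)
        total += -(yi * math.log(p) + (1 - yi) * math.log(1 - p))
    return total / len(X)

X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]  # first column is the bias input
y = [0, 0, 1, 1]
loss_zero = log_loss([0.0, 0.0], X, y)   # every prediction is 0.5
loss_good = log_loss([0.0, 2.0], X, y)   # parameters that separate the classes
labels = [1 if predict_proba([0.0, 2.0], xi) >= 0.5 else 0 for xi in X]
```

With all parameters at zero every prediction is 0.5 and the loss equals ln 2; parameters that separate the classes give a lower loss.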

Moving beyond linear models, k-Nearest Neighbors (k-NN) is a non-parametric, instance-based learning algorithm. It makes predictions by identifying the k training examples closest to a new data point and outputting the majority class (classification) or the average value (regression). Its performance depends heavily on the choice of the distance metric (e.g., Euclidean, Manhattan) and the value of k, with smaller k leading to more complex, high-variance models. While intuitive, k-NN suffers from the “curse of dimensionality,” where its performance degrades in high-dimensional spaces as distances become less meaningful. Decision Trees offer a highly interpretable model, learning a hierarchical structure of if-else rules from the features. Algorithms like ID3, C4.5, and CART use splitting criteria such as Gini impurity or information gain (based on entropy) to determine which feature to split on at each node. Decision trees are prone to overfitting but can be regularized by limiting the maximum depth, setting a minimum number of samples per leaf, or pruning branches.
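A minimal k-NN classifier fits in a few lines. The sketch assumes Euclidean distance and a tiny hand-made training set; `knn_predict` is an illustrative name, not a library function.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (features, label) pairs; returns the majority label of the k nearest."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
         ((6, 6), "b"), ((7, 6), "b"), ((6, 7), "b")]
near_a = knn_predict(train, (1.5, 1.5))
near_b = knn_predict(train, (6.5, 6.5))
```

Sorting the whole training set costs O(n log n) per query; practical implementations use spatial indexes or approximate search instead.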


3. Ensemble Methods and Support Vector Machines

To improve predictive performance and robustness, ensemble methods combine multiple base models. Bagging (Bootstrap Aggregating), exemplified by the Random Forest algorithm, reduces variance by training many independent models in parallel on bootstrapped subsets of the data and averaging their predictions. Random Forests add an extra layer of randomness by considering only a random subset of features for each split, further decorrelating the trees and leading to superior generalization. They also provide a useful feature importance measure. Boosting, in contrast, trains models sequentially, where each subsequent model focuses on correcting the errors made by the previous ones. AdaBoost (Adaptive Boosting) assigns higher weights to misclassified data points in each iteration. Gradient Boosting Machines (GBM) build trees sequentially, with each new tree attempting to predict the residual errors of the previous ensemble. Its optimized implementations, such as XGBoost, LightGBM, and CatBoost, are among the most powerful and widely used algorithms in competitive machine learning due to their speed and accuracy.

Support Vector Machines (SVMs) represent a powerful and theoretically well-founded approach for both linear and non-linear classification. The core idea is to find the optimal hyperplane that not only separates classes but also maximizes the margin—the distance between the hyperplane and the closest data points from each class, which are called support vectors. For linearly separable data, the optimization problem aims to find a decision boundary that maximizes this margin. For non-separable data, the soft-margin formulation introduces a hyperparameter C that controls the trade-off between maximizing the margin and minimizing the misclassification penalty. The true power of SVMs lies in the kernel trick. Instead of applying a non-linear transformation to the input features explicitly (which can be computationally prohibitive), a kernel function K(x,x′) computes the dot product between the transformed features in a higher-dimensional space directly. Common kernels include the polynomial kernel and the Radial Basis Function (RBF) kernel, which allows the SVM to create complex, non-linear decision boundaries. SVMs are particularly effective in high-dimensional spaces and are memory-efficient, as only the support vectors are needed to define the model.


4. Unsupervised Learning and Dimensionality Reduction

Unsupervised learning aims to extract patterns from unlabeled data. Clustering is the task of partitioning data into groups (clusters) such that objects within the same cluster are more similar to each other than to those in other clusters. k-Means is the most popular centroid-based algorithm. It iteratively assigns each point to the nearest centroid and then updates centroids to be the mean of assigned points. Its performance depends on the initial centroid placement and the pre-specified number of clusters k, often chosen using the elbow method or silhouette analysis. In contrast, Hierarchical Clustering builds a tree of clusters (a dendrogram) using either an agglomerative (bottom-up) or divisive (top-down) approach. It does not require pre-specifying k and can reveal hierarchical relationships, but it is computationally more intensive than k-Means. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) groups together points that are closely packed together, marking points in low-density regions as outliers. It can find arbitrarily shaped clusters and is robust to noise, with key parameters being the radius of the neighborhood (eps) and the minimum number of points to form a dense region (minPts).
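The assign-then-update loop of k-Means can be sketched in plain Python. The initial centroids are passed in explicitly here so the run is deterministic; real implementations usually pick them randomly or with k-means++.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: returns (final centroids, clusters as lists of points)."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its points.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:   # converged: no centroid moved
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, [(1.0, 1.0), (8.0, 8.0)])
```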

Dimensionality Reduction is crucial for visualizing data, reducing computational cost, and mitigating the curse of dimensionality. Principal Component Analysis (PCA) is a linear technique that finds a set of orthogonal axes (principal components) that capture the maximum variance in the data. By projecting the data onto the top k principal components, we achieve a lower-dimensional representation that preserves as much variance as possible. PCA works by performing an eigenvalue decomposition of the data’s covariance matrix. For non-linear structures, t-Distributed Stochastic Neighbor Embedding (t-SNE) is a powerful technique primarily used for visualization. It models the similarity between high-dimensional data points using a Gaussian distribution and then models the similarity between corresponding low-dimensional points using a heavy-tailed Student-t distribution. t-SNE excels at preserving local structure and revealing clusters but is stochastic and computationally expensive, making it unsuitable for general-purpose dimensionality reduction beyond visualization.


5. Model Evaluation, Validation, and Practical Considerations

The ultimate goal in ML is to create models that generalize well. Therefore, rigorous evaluation is paramount. The simplest approach is to split the data into a training set and a test set. However, this can be unreliable, especially with limited data, as performance can vary based on the specific split. Cross-validation provides a more robust estimate. k-Fold Cross-Validation involves partitioning the data into k equal-sized folds. The model is trained k times, each time using k−1 folds for training and the remaining fold for validation. The average performance across the k folds provides a more stable estimate of generalization error. For hyperparameter tuning, we use a three-way split: training, validation (for tuning), and test (for final, unbiased evaluation), or nested cross-validation.
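The fold bookkeeping behind k-fold cross-validation can be sketched as an index generator. This version does not shuffle; in practice the indices are usually shuffled (or stratified) first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, 3))
```

Each sample appears in exactly one validation fold, so every data point is used for validation once and for training k−1 times.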

To evaluate model performance, the choice of metrics is critical and task-dependent. For regression, common metrics include Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared ($R^2$), which represents the proportion of variance explained by the model. For classification, a simple accuracy score can be misleading, especially with imbalanced datasets. The confusion matrix provides a detailed breakdown of true positives, true negatives, false positives, and false negatives. From this, we derive precision (the accuracy of positive predictions), recall (the ability to find all positive instances), and the F1-score (the harmonic mean of precision and recall). The Receiver Operating Characteristic (ROC) curve plots the true positive rate against the false positive rate at various thresholds, with the Area Under the Curve (AUC) providing a threshold-independent measure of a model’s ability to discriminate between classes.
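The classification metrics follow directly from the four confusion-matrix counts. A small sketch for binary labels (1 = positive), with made-up predictions:

```python
def classification_metrics(y_true, y_pred):
    """Confusion-matrix counts and derived metrics for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "accuracy": (tp + tn) / len(y_true),
            "precision": precision, "recall": recall, "f1": f1}

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```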

Beyond algorithms and evaluation, a successful machine learning project involves significant data preparation and feature engineering. This includes handling missing data through imputation or deletion, encoding categorical variables via one-hot encoding or label encoding, and scaling features (e.g., standardization to zero mean and unit variance, or normalization to a range) which is essential for distance-based algorithms like SVM, k-NN, and for gradient descent convergence. Feature engineering—the process of creating new features from raw data using domain knowledge—can often have a greater impact on model performance than the choice of algorithm itself. Finally, understanding and mitigating overfitting is a continuous theme, addressed through simpler models, regularization, cross-validation, and the collection of more data.

Table of Contents

  1. Introduction to Algorithms

  2. Algorithm Analysis

  3. Asymptotic Notations

  4. Recurrence Relations and Master Theorem

  5. Divide and Conquer

  6. Greedy Algorithms

  7. Dynamic Programming

  8. Graph Algorithms

  9. Backtracking and Branch-and-Bound

  10. String Matching Algorithms

  11. NP-Completeness and Approximation Algorithms

  12. Advanced Topics (Optional)


1. Introduction to Algorithms

Definition: An algorithm is a finite sequence of well-defined, unambiguous steps that takes an input, processes it, and produces an output. It must terminate after a finite number of steps.

Properties:

  • Correctness: Produces the correct output for all inputs.

  • Finiteness: Terminates after a finite number of steps.

  • Definiteness: Each step is precisely defined.

  • Input/Output: Takes zero or more inputs, produces at least one output.

Importance: Algorithms are the core of computer science; they enable efficient problem solving and are the foundation of software systems.


2. Algorithm Analysis

2.1 Time Complexity

  • Measures the number of elementary operations (comparisons, assignments, arithmetic) as a function of input size n.

  • Worst-case: T(n) = maximum time over all inputs of size n.

  • Average-case: T(n) = expected time over all inputs (requires distribution assumption).

  • Best-case: Minimum time, rarely used for performance guarantees.

2.2 Space Complexity

  • Measures the amount of memory required during execution (excluding input).

  • Includes auxiliary space (temporary variables, recursion stack).

2.3 Growth Rates

Common functions ordered by increasing growth:

$1 < \log n < \sqrt{n} < n < n \log n < n^2 < n^3 < 2^n < n!$


3. Asymptotic Notations

Used to describe the behavior of functions for large n.

3.1 Big-Oh (O)

  • Upper bound: $f(n) = O(g(n))$ if $\exists\, c > 0,\ n_0 \ge 0$ such that $0 \le f(n) \le c \cdot g(n)$ for all $n \ge n_0$.

  • Examples: $2n + 5 = O(n)$; $n^2 + 3n = O(n^2)$.

3.2 Omega (Ω)

  • Lower bound: $f(n) = \Omega(g(n))$ if $\exists\, c > 0,\ n_0 \ge 0$ such that $0 \le c \cdot g(n) \le f(n)$ for all $n \ge n_0$.

  • Examples: $2n + 5 = \Omega(n)$; $n^2 = \Omega(n)$.

3.3 Theta (Θ)

  • Tight bound: f(n)=Θ(g(n)) if f(n)=O(g(n)) and f(n)=Ω(g(n)).

  • Example: $3n^2 + 2n = \Theta(n^2)$.

3.4 Little-Oh (o) and Little-Omega (ω)

  • Strict upper bound: $f(n) = o(g(n))$ if $\lim_{n \to \infty} f(n)/g(n) = 0$.

  • Strict lower bound: $f(n) = \omega(g(n))$ if $\lim_{n \to \infty} f(n)/g(n) = \infty$.

3.5 Common Complexity Classes


4. Recurrence Relations and Master Theorem

Recurrence relations define the running time of recursive algorithms.

4.1 Common Recurrence Forms

4.2 Master Theorem

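For $T(n) = a\,T(n/b) + f(n)$ where $a \ge 1$, $b > 1$, and $f(n)$ is asymptotically positive:

  1. If $f(n) = O(n^{\log_b a - \epsilon})$ for some $\epsilon > 0$, then $T(n) = \Theta(n^{\log_b a})$.

  2. If $f(n) = \Theta(n^{\log_b a})$, then $T(n) = \Theta(n^{\log_b a} \log n)$.

  3. If $f(n) = \Omega(n^{\log_b a + \epsilon})$ for some $\epsilon > 0$ and $a\,f(n/b) \le c\,f(n)$ for some $c < 1$ and sufficiently large $n$, then $T(n) = \Theta(f(n))$.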

Examples:

  • Merge sort: $T(n) = 2T(n/2) + \Theta(n)$ → case 2 → $T(n) = \Theta(n \log n)$.

  • Binary search: $T(n) = T(n/2) + \Theta(1)$ → case 2 → $T(n) = \Theta(\log n)$.

  • Strassen’s matrix multiplication: $T(n) = 7T(n/2) + \Theta(n^2)$ → case 1 → $T(n) = \Theta(n^{\log_2 7}) \approx \Theta(n^{2.81})$.
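For the common special case where the driving function is a polynomial, f(n) = Θ(n^d), choosing the case reduces to comparing d with log_b a. The helper below is a sketch for that special case only (the name `master_theorem` and its return format are made up); it does not handle driving functions with logarithmic factors.

```python
import math

def master_theorem(a, b, d):
    """Classify T(n) = a*T(n/b) + Theta(n^d); returns (case, critical exponent log_b a)."""
    critical = math.log(a, b)
    if math.isclose(d, critical):
        return 2, critical     # T(n) = Theta(n^d log n)
    if d < critical:
        return 1, critical     # T(n) = Theta(n^(log_b a))
    return 3, critical         # T(n) = Theta(n^d); regularity holds for polynomial f

merge_sort_case = master_theorem(2, 2, 1)
binary_search_case = master_theorem(1, 2, 0)
strassen_case = master_theorem(7, 2, 2)
```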

4.3 Recurrence Tree Method

  • Visualize recursion tree, sum costs at each level.

  • Useful when master theorem does not apply.


5. Divide and Conquer

Strategy:

  1. Divide problem into smaller subproblems.

  2. Conquer subproblems recursively.

  3. Combine solutions to obtain overall solution.

5.1 Merge Sort

  • Divide: Split array into two halves.

  • Conquer: Recursively sort each half.

  • Combine: Merge two sorted halves in O(n).

  • Time: $O(n \log n)$. Space: $O(n)$ auxiliary.
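The three steps map onto code almost line for line. A straightforward sketch that returns a new sorted list rather than sorting in place:

```python
def merge_sort(items):
    """Return a new sorted list using top-down merge sort."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = merge_sort(items[:mid])      # conquer left half
    right = merge_sort(items[mid:])     # conquer right half
    # Combine: merge two sorted lists in linear time.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:         # <= keeps equal elements in order (stable)
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

sorted_items = merge_sort([38, 27, 43, 3, 9, 82, 10])
```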

5.2 Quick Sort

  • Divide: Choose a pivot, partition array into elements ≤ pivot and ≥ pivot.

  • Conquer: Recursively sort subarrays.

  • Combine: Trivial (in-place).

  • Worst-case: $O(n^2)$ (unbalanced partitions). Average: $O(n \log n)$.

5.3 Strassen’s Matrix Multiplication

  • Standard multiplication: $O(n^3)$.

  • Strassen: divides matrices into 2×2 blocks, uses 7 multiplications instead of 8 → $O(n^{\log_2 7}) \approx O(n^{2.81})$.

5.4 Closest Pair of Points

  • Divide points by x-coordinate, recursively find closest pairs in halves, then check near the dividing line in O(n) after sorting by y.

  • Time: $O(n \log n)$.

5.5 Maximum Subarray Sum (Kadane’s Algorithm)

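  • Divide-and-conquer: split at the midpoint; the answer is the maximum of the best subarray in the left half, the best in the right half, and the best subarray crossing the midpoint.

  • Time: $O(n \log n)$ for divide-and-conquer; Kadane’s algorithm, a single scan that tracks the best sum ending at each position, runs in $O(n)$.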
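Kadane's linear scan is short enough to show in full. This sketch assumes a non-empty list and returns only the maximum sum, not the subarray itself.

```python
def max_subarray_sum(nums):
    """Kadane's algorithm: maximum sum of a non-empty contiguous subarray."""
    best = current = nums[0]
    for x in nums[1:]:
        # Either extend the best subarray ending at the previous element, or restart at x.
        current = max(x, current + x)
        best = max(best, current)
    return best

result = max_subarray_sum([-2, 1, -3, 4, -1, 2, 1, -5, 4])
```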


6. Greedy Algorithms

Principle: Make locally optimal choice at each step, hoping to find global optimum. Works for problems with optimal substructure and greedy choice property.

6.1 Activity Selection

  • Given n activities with start and finish times, select maximum number of non-overlapping activities.

  • Greedy: Always pick the activity with the earliest finish time.

  • Time: $O(n \log n)$ after sorting.
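The earliest-finish-time rule can be sketched as follows, assuming activities are (start, finish) pairs and that an activity may begin exactly when the previous one ends. The sample intervals are illustrative.

```python
def select_activities(activities):
    """Greedy activity selection: returns a maximum-size set of compatible activities."""
    selected = []
    last_finish = float("-inf")
    for start, finish in sorted(activities, key=lambda a: a[1]):
        if start >= last_finish:        # compatible with everything chosen so far
            selected.append((start, finish))
            last_finish = finish
    return selected

activities = [(1, 4), (3, 5), (0, 6), (5, 7), (3, 9), (5, 9),
              (6, 10), (8, 11), (8, 12), (2, 14), (12, 16)]
chosen = select_activities(activities)
```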

6.2 Huffman Coding

  • Variable-length prefix codes for data compression.

  • Greedy: Repeatedly merge two smallest frequency nodes.

  • Time: $O(n \log n)$ using min-heap.

6.3 Fractional Knapsack

  • Items with weight and value, fractional allowed.

  • Greedy: Sort by value/weight ratio, take as much as possible from highest ratio.

  • Time: $O(n \log n)$.

6.4 Minimum Spanning Tree (MST)

  • Kruskal: Add smallest edge that does not form cycle (union-find). $O(E \log V)$.

  • Prim: Grow tree from a vertex, add smallest edge connecting tree to outside. $O(E \log V)$ with binary heap.

6.5 Dijkstra’s Shortest Path

  • Single-source shortest paths on non-negative weights.

  • Greedy: Repeatedly extract the unsettled vertex with the smallest tentative distance and relax its outgoing edges.

  • Time: $O((V+E) \log V)$ with binary heap.
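A binary-heap version of Dijkstra's algorithm can be sketched with Python's heapq. The graph is assumed to be an adjacency dictionary of (neighbor, weight) lists with non-negative weights; stale heap entries are skipped rather than decreased in place.

```python
import heapq

def dijkstra(graph, source):
    """Shortest distances from source; graph maps vertex -> list of (neighbor, weight)."""
    dist = {v: float("inf") for v in graph}
    dist[source] = 0
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:                 # outdated entry: u was already settled with less
            continue
        for v, w in graph[u]:
            nd = d + w
            if nd < dist[v]:            # relax edge (u, v)
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

graph = {"A": [("B", 4), ("C", 1)], "B": [("D", 1)], "C": [("B", 2), ("D", 5)], "D": []}
distances = dijkstra(graph, "A")
```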


7. Dynamic Programming

Principle: Break problem into overlapping subproblems, store results to avoid recomputation. Used when problem has optimal substructure and overlapping subproblems.

7.1 Steps

  1. Characterize structure of optimal solution.

  2. Define state (subproblem).

  3. Formulate recurrence relation.

  4. Compute in bottom-up or top-down (memoization) manner.

  5. Extract solution.

7.2 Classical DP Problems

7.2.1 Fibonacci Numbers

  • Recurrence: F(n)=F(n−1)+F(n−2).

  • Naive recursion: $O(2^n)$; DP: $O(n)$ time, $O(1)$ space (keeping only the last two values).

7.2.2 0/1 Knapsack

  • Given weights $w_i$, values $v_i$, capacity $W$.

  • State: dp[i][c] = max value using first i items, capacity c.

  • Recurrence: $dp[i][c] = \max(dp[i-1][c],\ v_i + dp[i-1][c - w_i])$ if $c \ge w_i$; otherwise $dp[i][c] = dp[i-1][c]$.

  • Time: O(nW), Space: O(W) (optimized).
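The space-optimized form keeps a single row and iterates capacities downward so that each item is used at most once. A sketch with made-up item data:

```python
def knapsack_01(weights, values, capacity):
    """Maximum value of a subset of items whose total weight is at most capacity."""
    dp = [0] * (capacity + 1)           # dp[c] = best value achievable with capacity c
    for w, v in zip(weights, values):
        # Descending order ensures dp[c - w] still refers to the row without this item.
        for c in range(capacity, w - 1, -1):
            dp[c] = max(dp[c], v + dp[c - w])
    return dp[capacity]

best_value = knapsack_01([1, 3, 4, 5], [1, 4, 5, 7], 7)
```

The O(nW) bound is pseudo-polynomial: it depends on the numeric value of W, not on the number of bits needed to write it.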

7.2.3 Longest Common Subsequence (LCS)

  • Given strings X[1..m] and Y[1..n].

  • State: dp[i][j] = LCS length of X[1..i] and Y[1..j].

  • Recurrence:
    $dp[i][j] = \begin{cases} 0 & \text{if } i = 0 \text{ or } j = 0 \\ dp[i-1][j-1] + 1 & \text{if } X[i] = Y[j] \\ \max(dp[i-1][j],\ dp[i][j-1]) & \text{otherwise} \end{cases}$

  • Time: O(mn).
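A bottom-up table fill for the LCS length, written as a sketch. Python strings are 0-indexed, so X[i] in the recurrence corresponds to `x[i - 1]` in the code.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of strings x and y."""
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]   # row 0 and column 0 are the base cases
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

length = lcs_length("ABCBDAB", "BDCABA")
```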

7.2.4 Matrix Chain Multiplication

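  • Given matrices $A_1 \dots A_n$, where $A_i$ has dimensions $p_{i-1} \times p_i$, minimize scalar multiplications.

  • State: dp[i][j] = minimum cost to multiply $A_i \dots A_j$.

  • Recurrence: $dp[i][j] = \min_{i \le k < j} \{ dp[i][k] + dp[k+1][j] + p_{i-1}\, p_k\, p_j \}$, with $dp[i][i] = 0$.

  • Time: $O(n^3)$.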

7.2.5 Edit Distance (Levenshtein)

  • Minimum operations (insert, delete, replace) to convert string A to B.

  • Recurrence similar to LCS but with costs.

7.2.6 Rod Cutting

  • Given rod length n and price array, maximize profit by cutting.

  • Recurrence: $dp[i] = \max_{1 \le j \le i} (price[j] + dp[i-j])$, with $dp[0] = 0$.

7.3 DP vs Greedy

  • Greedy makes irrevocable decisions; DP considers all possibilities.

  • Greedy requires greedy choice property; DP works when subproblems overlap and optimal substructure holds.


8. Graph Algorithms

8.1 Graph Representations

  • Adjacency Matrix: $V \times V$ matrix, $O(V^2)$ space.

  • Adjacency List: Array of lists, O(V+E) space.

8.2 Graph Traversals

8.2.1 Breadth-First Search (BFS)

  • Uses queue; explores level by level.

  • Applications: Shortest path in unweighted graph, connectivity, bipartite check.

  • Time: O(V+E).

8.2.2 Depth-First Search (DFS)

  • Uses stack/recursion; explores deeply.

  • Applications: Topological sort, strongly connected components, cycle detection.

  • Time: O(V+E).

8.3 Minimum Spanning Tree (MST)

  • Kruskal’s (Greedy, union-find): $O(E \log V)$.

  • Prim’s (Greedy, heap): $O(E \log V)$.

8.4 Shortest Paths

8.4.1 Dijkstra’s Algorithm

  • Non-negative weights, single source.

  • Time: $O((V+E) \log V)$ with binary heap; $O(V^2)$ with array for dense graphs.

8.4.2 Bellman-Ford Algorithm

8.4.3 Floyd-Warshall Algorithm

  • All-pairs shortest paths.

  • Dynamic programming: dp[k][i][j] = shortest path using vertices 1..k as intermediates.

  • Time: $O(V^3)$, Space: $O(V^2)$.
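The O(V^2) space bound comes from updating one distance matrix in place instead of keeping a separate layer per k. A sketch assuming vertices are numbered 0..n-1 and the input is a weight matrix with infinity for missing edges:

```python
INF = float("inf")

def floyd_warshall(weights):
    """All-pairs shortest path lengths from an n x n weight matrix."""
    n = len(weights)
    dist = [row[:] for row in weights]      # copy so the input is not modified
    for k in range(n):                      # allow vertex k as an intermediate
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

weights = [
    [0, 3, 7, INF],
    [INF, 0, 1, INF],
    [INF, INF, 0, 2],
    [INF, INF, INF, 0],
]
dist = floyd_warshall(weights)
```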

8.5 Topological Sort

  • For directed acyclic graph (DAG).

  • Kahn’s algorithm (BFS-based): compute indegrees, remove nodes with indegree 0.

  • DFS-based: push node after all descendants visited.

  • Time: O(V+E).
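Kahn's algorithm can be sketched with a queue of zero-indegree vertices. The graph is assumed to be an adjacency dictionary in which every vertex appears as a key; raising ValueError on a cycle is a choice made for this example.

```python
from collections import deque

def topological_sort(graph):
    """Kahn's algorithm; graph maps vertex -> list of successors."""
    indegree = {v: 0 for v in graph}
    for successors in graph.values():
        for v in successors:
            indegree[v] += 1
    queue = deque(v for v in graph if indegree[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in graph[u]:
            indegree[v] -= 1            # "remove" edge (u, v)
            if indegree[v] == 0:
                queue.append(v)
    if len(order) != len(graph):        # some vertices never reached indegree 0
        raise ValueError("graph contains a cycle")
    return order

order = topological_sort({"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []})
```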

8.6 Strongly Connected Components (SCC)

  • Kosaraju’s Algorithm: Two DFS passes (original and transposed graph). O(V+E).

  • Tarjan’s Algorithm: Single DFS, uses low-link values. O(V+E).

8.7 Maximum Flow

  • Ford-Fulkerson: Augmenting path; time O(E⋅max flow).

  • Edmonds-Karp: BFS for augmenting path; $O(VE^2)$.

  • Dinic’s Algorithm: Level graph and blocking flow; $O(V^2 E)$ or better for unit capacities.

8.8 Bipartite Matching


9. Backtracking and Branch-and-Bound

9.1 Backtracking

  • Systematic search using recursion, pruning invalid paths.

  • N-Queens: Place queens row by row, check conflicts.

  • Hamiltonian Cycle: Try vertices, backtrack if no further extension.

  • Subset Sum: Explore inclusion/exclusion.

9.2 Branch-and-Bound

  • Optimization version of backtracking with bounding to prune suboptimal solutions.

  • Used for NP-hard problems (e.g., traveling salesman, knapsack).

  • Maintain the best solution found so far; for a minimization problem, prune any node whose lower bound is not below the best cost found.

9.3 Comparison


10. String Matching Algorithms

10.1 Naïve String Matching

  • Slide pattern over text, compare character by character.

  • Time: O((n−m+1)m).

10.2 Rabin-Karp

  • Hash pattern, then compare hash of text windows.

  • Average O(n+m), worst O(nm) due to collisions.

  • Uses rolling hash.
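The rolling hash lets each window hash be derived from the previous one in constant time. In this sketch the base and modulus are arbitrary illustrative choices, and every hash match is verified by direct comparison to rule out collisions.

```python
def rabin_karp(text, pattern, base=256, mod=1_000_000_007):
    """Return the start indices of all occurrences of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)        # weight of the leading character in a window
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    matches = []
    for i in range(n - m + 1):
        # Compare characters only when the hashes agree.
        if p_hash == t_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:
            # Roll: drop text[i], shift, and append text[i + m].
            t_hash = ((t_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return matches

positions = rabin_karp("abracadabra", "abra")
```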

10.3 Knuth-Morris-Pratt (KMP)

10.4 Boyer-Moore

  • Scan from right to left, uses bad-character and good-suffix heuristics.

  • Often sublinear in practice.

10.5 Finite Automaton


11. NP-Completeness and Approximation Algorithms

11.1 Complexity Classes

  • P: Problems solvable in polynomial time.

  • NP: Problems whose solutions can be verified in polynomial time.

  • NP-Hard: A problem H is NP-hard if every problem in NP can be reduced to H in polynomial time.

  • NP-Complete: Problems that are both NP and NP-hard.

11.2 NP-Complete Problems

  • SAT, 3-SAT, Hamiltonian Cycle, Traveling Salesman (decision), Clique, Vertex Cover, Subset Sum, etc.

11.3 Reductions

  • To prove a problem is NP-complete, reduce a known NP-complete problem to it.

  • Example: 3-SAT reduces to Clique, Clique reduces to Vertex Cover, etc.

11.4 Approximation Algorithms

  • Provide near-optimal solutions for NP-hard optimization problems.

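  • Approximation Ratio: $\max\left(\frac{ALG}{OPT}, \frac{OPT}{ALG}\right)$, where ALG is the value of the algorithm’s solution and OPT the optimal value.

  • Vertex Cover: Taking both endpoints of every edge in a maximal matching gives ratio 2.

  • TSP with triangle inequality: Christofides algorithm gives 3/2 approximation.

  • Set Cover: Greedy gives $O(\log n)$ approximation.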

11.5 Polynomial-Time Approximation Schemes (PTAS)

  • Algorithms that can achieve (1+ϵ) approximation in time polynomial in input size for fixed ϵ.

  • Fully PTAS (FPTAS) when time is polynomial in both n and 1/ϵ.


12. Advanced Topics (Optional)

12.1 Amortized Analysis

  • Average cost per operation over a sequence.

  • Aggregate analysis: total cost / number of operations.

  • Accounting method: assign credits to operations.

  • Potential method: define potential function Φ.

12.2 Randomized Algorithms

  • Use random choices to simplify or improve performance.

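  • QuickSort (randomized pivot): expected $O(n \log n)$ on every input; the $O(n^2)$ worst case remains possible but no fixed input triggers it reliably.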

  • Randomized selection (QuickSelect) O(n) expected.

  • Las Vegas: always correct, time varies; Monte Carlo: bounded error.

12.3 Parallel Algorithms

12.4 Data Structures for Algorithms

  • Heaps: priority queues.

  • Disjoint Set Union (Union-Find): with path compression and union by rank, amortized $O(\alpha(n))$ per operation, where $\alpha$ is the inverse Ackermann function (effectively constant).

  • Binary Indexed Tree (Fenwick Tree): range queries, point updates in $O(\log n)$.

  • Segment Tree: range queries and updates in $O(\log n)$.
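Of the structures listed, union-find is the one Kruskal's algorithm depends on. Below is a sketch with both optimizations; the class and method names are illustrative.

```python
class DisjointSet:
    """Union-find over elements 0..n-1 with path compression and union by rank."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        """Merge the sets of a and b; returns False if they were already joined."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.rank[ra] < self.rank[rb]:   # attach the shallower tree under the deeper
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return True

ds = DisjointSet(5)
first = ds.union(0, 1)
second = ds.union(3, 4)
repeat = ds.union(1, 0)
```

In Kruskal's algorithm, an edge (u, v) is added to the tree exactly when `union(u, v)` returns True.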


Summary

The course emphasizes:

  1. Algorithm Analysis: asymptotic notation, recurrence solving.

  2. Algorithm Design Paradigms: divide-and-conquer, greedy, dynamic programming, backtracking, branch-and-bound.

  3. Graph Algorithms: traversal, shortest paths, MST, flow.

  4. Complexity Theory: P vs NP, NP-completeness, reductions.

  5. String Algorithms: efficient pattern matching.

  6. Approximation and Randomized Algorithms.

These notes cover the foundational concepts, tools, and techniques of data science. The focus is on the data science lifecycle, exploratory data analysis (EDA), statistical inference, machine learning fundamentals, and best practices. Code examples are provided in Python using popular libraries (pandas, numpy, matplotlib, scikit-learn).


1. What is Data Science?

Data Science is an interdisciplinary field that uses scientific methods, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It combines:

  • Statistics – for inference, hypothesis testing, and modeling.

  • Computer Science – for algorithms, databases, and scalable systems.

  • Domain Expertise – to interpret results and ask meaningful questions.

1.1 The Data Science Lifecycle

A typical project follows these stages:

  1. Problem Definition – Understand business goals, formulate questions.

  2. Data Acquisition – Collect data from databases, APIs, files, web scraping.

  3. Data Preparation (Cleaning & Wrangling) – Handle missing values, outliers, formatting, integration.

  4. Exploratory Data Analysis (EDA) – Summarize main characteristics, visualizations.

  5. Modeling & Machine Learning – Build predictive or descriptive models.

  6. Evaluation & Interpretation – Assess model performance, validate with domain experts.

  7. Deployment – Integrate model into production systems.

  8. Monitoring & Maintenance – Track model drift, update as needed.


2. Data Types and Structures

2.1 Data Types

2.2 Data Structures in Python

  • NumPy arrays: Homogeneous, efficient numerical operations.

  • Pandas Series: 1D labeled array.

  • Pandas DataFrame: 2D labeled data structure (rows and columns).

  • Lists, dicts: Basic Python containers.


3. Data Preprocessing (Wrangling)

Most of the time in data science is spent preparing data.

3.1 Handling Missing Values

  • Detection: df.isnull().sum(), df.info()

  • Strategies:

  • Caveat: Understand why data is missing (MCAR, MAR, MNAR) to choose appropriate method.

3.2 Outliers

  • Detection:

    • Z‑score > 3 or < -3

    • IQR (Interquartile Range): values below Q1 – 1.5×IQR or above Q3 + 1.5×IQR

    • Visualization: box plots, scatter plots

  • Treatment:

    • Cap (winsorize), transform (log), remove, or treat separately.
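The IQR rule can be sketched with the standard library alone. Quartile conventions differ between tools (this uses the default method of `statistics.quantiles`, which may give slightly different fences than pandas); the sample values are made up.

```python
import statistics

def iqr_outliers(values, factor=1.5):
    """Return the values lying outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)   # quartile cut points
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < low or v > high]

data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
outliers = iqr_outliers(data)
```

On this sample the value 102 has a z-score below 3 because it inflates the standard deviation itself, so the z-score rule would miss it; quartile-based fences are less affected by the outlier they are trying to detect.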

3.3 Data Transformation

  • Scaling/Normalization (important for distance‑based algorithms like SVM, k‑NN):

    • Min‑Max: (x - min)/(max - min) → range [0,1]

    • Standardization (Z‑score): (x - μ)/σ → mean 0, variance 1

  • Encoding Categorical Variables:

  • Feature Engineering: Create new features from existing ones (e.g., ratios, date parts, interactions).
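Both scaling formulas can be written directly. This sketch uses the population standard deviation and assumes the input is not constant (otherwise the denominators are zero); the function names are illustrative.

```python
import statistics

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Z-score: subtract the mean and divide by the population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

values = [2.0, 4.0, 6.0, 8.0]
scaled = min_max_scale(values)
z_scores = standardize(values)
```

In a real pipeline the minimum, maximum, mean, and standard deviation should be computed on the training set only and then reused for the test set, to avoid leaking test information.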


4. Exploratory Data Analysis (EDA)

EDA is the process of summarizing and visualizing data to uncover patterns, anomalies, and hypotheses.

4.1 Summary Statistics

  • Central tendency: mean, median, mode

  • Spread: variance, standard deviation, range, IQR

  • Shape: skewness, kurtosis

4.2 Key Visualizations

4.3 Correlation and Covariance

  • Pearson correlation: measures linear relationship between two continuous variables.

  • Spearman rank correlation: monotonic relationship (non‑linear but ordered).

  • Covariance: direction of linear relationship (scale dependent).


5. Probability and Statistics for Data Science

5.1 Basic Probability

  • Probability rules: P(A∪B) = P(A) + P(B) – P(A∩B)

  • Conditional probability: P(A|B) = P(A∩B)/P(B)

  • Bayes’ theorem: P(A|B) = P(B|A)P(A)/P(B)
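A short worked example of Bayes' theorem, using made-up numbers for a diagnostic test: prevalence 1%, sensitivity 95%, false positive rate 5%. P(B) is expanded with the law of total probability.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A|B) given P(A), P(B|A), and P(B|not A)."""
    # Total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

p_disease_given_positive = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
```

Under these assumed numbers a positive result implies only about a 16% chance of disease, because true positives are outnumbered by false positives from the much larger healthy group.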

5.2 Distributions

  • Discrete: Bernoulli, Binomial, Poisson

  • Continuous: Uniform, Normal (Gaussian), Exponential

  • Central Limit Theorem: The sampling distribution of the sample mean approaches a normal distribution as sample size increases.

5.3 Hypothesis Testing

  • Null hypothesis (H₀) – default assumption; Alternative (H₁) – what we suspect.

  • p‑value: probability of observing data as extreme as, or more extreme than, the observed sample if H₀ is true.

  • Significance level (α) – typically 0.05.

  • Common tests:

5.4 Confidence Intervals


6. Machine Learning Fundamentals

6.1 Types of Learning

6.2 Train/Test Split & Cross‑Validation

  • Train/Test split: hold out a portion (e.g., 20‑30%) for evaluation.

  • k‑Fold Cross‑Validation: partition data into k folds, train on k‑1, validate on the remaining, repeat k times, average performance.

  • Stratified sampling: preserve class distribution in classification.
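
The k-fold procedure described above can be sketched as an index generator. This is a minimal version under simplifying assumptions: folds are contiguous and the data is not shuffled or stratified, and the name `k_fold_indices` is made up for this example.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds.

    Folds are contiguous; the first n_samples % k folds get one extra sample.
    """
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop

for train, validation in k_fold_indices(10, 3):
    print(len(train), validation)
```

Each sample is used for validation exactly once and for training k-1 times; the model's reported score is the average over the k validation folds.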

6.3 Common Algorithms

Regression

  • Linear Regression: models linear relationship between features and target.

  • Ridge/Lasso: regularized linear regression (L2/L1 penalty).

  • Polynomial regression: captures non‑linear relationships.

Classification

  • Logistic Regression: outputs probability of binary outcome.

  • k‑Nearest Neighbors (k‑NN): non‑parametric, based on similarity.

  • Decision Trees & Random Forests: tree‑based models, ensemble method.

  • Support Vector Machines (SVM): finds hyperplane that maximizes margin.

  • Naïve Bayes: based on Bayes’ theorem, assumes feature independence.

Unsupervised

  • k‑Means Clustering: partitions data into k clusters based on centroids.

  • Hierarchical Clustering: builds a tree of clusters (dendrogram).

  • Principal Component Analysis (PCA): reduces dimensionality while preserving variance.

6.4 Evaluation Metrics


7. Model Selection and Overfitting

  • Underfitting: model too simple, high bias.

  • Overfitting: model too complex, captures noise; high variance.

  • Bias‑Variance Tradeoff: need to balance.

  • Regularization: penalizes model complexity (L1, L2).

  • Hyperparameter Tuning: grid search, random search.


8. Data Science Tools and Libraries


9. Advanced Topics (Overview)


10. Practical Steps for a Data Science Project

  1. Understand the problem: Clarify goals with stakeholders.

  2. Collect data: Identify sources, ensure legality.

  3. Clean & preprocess: Handle missing, outliers, encoding.

  4. EDA: Visualize distributions, relationships, summarize.

  5. Feature engineering: Create relevant features.

  6. Model building: Select algorithms, split data, train.

  7. Evaluate: Use appropriate metrics, cross‑validation.

  8. Interpret: Explain results, check assumptions.

  9. Communicate: Present findings clearly with visualizations.

  10. Deploy & monitor: Integrate into systems, track performance.


11. Key Formulas to Remember

  • Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

  • Variance (sample): $s^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2$

  • Standard deviation: $s = \sqrt{s^2}$

  • Pearson correlation: $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$

  • Linear regression: $\hat{y} = \beta_0 + \beta_1 x$

  • MSE: $\frac{1}{n}\sum (y_i - \hat{y}_i)^2$

  • Accuracy: $\frac{TP + TN}{TP + TN + FP + FN}$
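
Several of these formulas can be checked with a short script. The sketch below fits the one-feature linear regression by ordinary least squares (the standard closed form, slope = Σ(x−x̄)(y−ȳ) / Σ(x−x̄)², which is not listed above) and then evaluates MSE and accuracy. All function names and sample values are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares estimates (beta0, beta1) for y_hat = beta0 + beta1 * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    beta0 = my - beta1 * mx
    return beta0, beta1

def mse(ys, predictions):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)

def accuracy(tp, tn, fp, fn):
    """Share of correct predictions in a binary confusion matrix."""
    return (tp + tn) / (tp + tn + fp + fn)

xs, ys = [1, 2, 3, 4], [3, 5, 7, 9]          # generated from y = 1 + 2x
beta0, beta1 = fit_line(xs, ys)
predictions = [beta0 + beta1 * x for x in xs]
print(beta0, beta1, mse(ys, predictions))    # 1.0 2.0 0.0
print(accuracy(tp=40, tn=45, fp=5, fn=10))   # 0.85
```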


12. Common Mistakes to Avoid

  • Not exploring data before modeling.

  • Using default parameters without understanding them.

  • Leaking data from test set into training.

  • Ignoring domain knowledge.

  • Over‑engineering features without validation.

  • Forgetting to handle class imbalance.


Summary

Data science is a process that combines domain expertise, programming, and statistical thinking to derive insights from data. Success depends on a strong foundation in data manipulation, exploratory analysis, and machine learning, as well as the ability to communicate results effectively. The field is rapidly evolving, so staying current with tools and ethical considerations is essential.


1. Introduction to Database Administration

1.1 Role of a Database Administrator (DBA)

  • DBA is responsible for the installation, configuration, maintenance, security, and performance of databases.

  • Key responsibilities:

    • Installing and upgrading DBMS software.

    • Creating and managing database instances.

    • Implementing backup and recovery strategies.

    • Monitoring performance and tuning.

    • Managing user accounts and security.

    • Capacity planning and storage management.

    • Ensuring high availability and disaster recovery.

    • Automating routine tasks.

    • Liaising with developers for schema design and query optimization.

1.2 Types of DBAs

  • System DBA: focuses on physical installation, hardware, OS, DBMS configuration.

  • Application DBA: works with developers on schema design, SQL tuning, application‑specific databases.

  • Development DBA: involved in development lifecycle, testing, deployment.

  • Data Warehouse DBA: manages ETL processes, OLAP, large‑scale reporting.

  • Cloud DBA: manages databases in cloud environments (AWS RDS, Azure SQL, etc.).

1.3 DBA Skills

  • Deep knowledge of one or more DBMS (Oracle, SQL Server, MySQL, PostgreSQL).

  • Operating system expertise (Linux, Windows).

  • Scripting (Python, PowerShell, Bash) for automation.

  • Networking, storage, and security fundamentals.

  • Understanding of high‑availability and disaster‑recovery concepts.


2. DBMS Architecture and Components

2.1 DBMS Architecture Overview

  • Single‑tier: database resides on same machine as application (rare in production).

  • Two‑tier: client application connects directly to database server.

  • Three‑tier: client → application server → database server (common for web applications).

2.2 Key Components

  • Instance: the set of memory structures and background processes that manage database data.

    • Oracle: System Global Area (SGA) + background processes (PMON, SMON, DBWR, LGWR, etc.)

    • SQL Server: memory buffers, threads.

  • Database: physical files (data files, control files, redo logs, etc.) that store data.

  • Listener: process that accepts client connections (Oracle listener, MySQL port 3306).

  • Query Processor: parses, optimizes, executes SQL statements.

  • Storage Engine: handles reading/writing data from disk.

2.3 Memory Structures

  • Buffer Cache: caches data blocks to reduce I/O.

  • Log Buffer: temporarily stores redo/transaction log records.

  • Shared Pool: caches SQL statements, execution plans, data dictionary.


3. Installation and Configuration

3.1 Pre‑Installation Planning

  • Hardware requirements: CPU, RAM, disk I/O, network.

  • Operating system version and patches.

  • Storage layout: separate disks for data, logs, backups.

  • Filesystem (ext4, XFS, NTFS) or ASM (Oracle Automatic Storage Management).

3.2 Installation Steps (General)

  1. Download DBMS software.

  2. Install binaries (using package manager or graphical installer).

  3. Set environment variables (ORACLE_HOME, PATH, etc.).

  4. Create OS user (e.g., oracle, mysql) and groups.

  5. Configure kernel parameters (shared memory, file descriptors, etc.).

  6. Run installation wizard or command‑line.

  7. Verify installation.

3.3 Post‑Installation Configuration

  • Create initial database instance (using DBCA for Oracle, mysqld --initialize for MySQL).

  • Configure network listener (tnsnames.ora, listener.ora for Oracle; my.cnf for MySQL).

  • Set up authentication (OS authentication, password file).

  • Configure startup scripts to start database automatically on server boot.

  • Apply security patches and updates.


4. Database Creation and Storage Management

4.1 Creating a Database

  • Manual using SQL commands: CREATE DATABASE (MySQL, PostgreSQL) or with scripts (Oracle).

  • Using tools: DBCA (Oracle), SQL Server Management Studio (SSMS), MySQL Workbench.

  • Parameters: database name, character set, memory settings, storage locations.

4.2 Storage Structures

4.2.1 Logical Storage

  • Tablespace (Oracle, PostgreSQL) / Filegroup (SQL Server): container for database objects.

  • Segment: object (table, index) stored in a tablespace.

  • Extent: contiguous set of blocks allocated to a segment.

  • Block: smallest unit of I/O (typically 8KB).

4.2.2 Physical Storage

  • Data files: store actual data (.dbf, .mdf, .ndf).

  • Control files: database metadata (name, checkpoint info, etc.) – critical for recovery.

  • Redo logs: record all changes for recovery (online redo logs, archived logs).

  • Undo/Temp tablespace: used for transaction rollback and temporary operations.

4.3 Managing Storage

  • Monitor tablespace usage; add data files when needed.

  • Resize or auto‑extend data files.

  • Use ASM (Oracle) or LVM for simplified storage management.

  • Implement RAID for performance and redundancy (RAID 10, RAID 5).


5. User Management and Security

5.1 User Accounts

  • Create users with passwords.

  • Assign default tablespace, temporary tablespace.

  • Grant appropriate privileges (system privileges, object privileges, roles).

5.2 Authentication Methods

  • Database authentication: username/password stored in DB.

  • OS authentication: external users validated by OS.

  • Integrated authentication: Windows Active Directory (SQL Server).

  • LDAP / SSO: centralized directory services.

5.3 Privileges and Roles

  • System privileges: ability to perform actions (CREATE TABLE, ALTER DATABASE, etc.).

  • Object privileges: access to specific objects (SELECT, INSERT, UPDATE on tables).

  • Roles: collections of privileges for easier management (e.g., DBA, CONNECT, RESOURCE).

5.4 Security Best Practices

  • Enforce strong password policies.

  • Remove default passwords and accounts (e.g., scott/tiger).

  • Principle of least privilege: grant only necessary privileges.

  • Use encryption for data at rest (Transparent Data Encryption) and in transit (SSL/TLS).

  • Regularly audit access and changes (using audit trails).

  • Apply security patches promptly.


6. Backup and Recovery

6.1 Types of Failures

  • Statement failure: syntax error, integrity constraint violation.

  • Transaction failure: deadlock, application error.

  • Process failure: background process dies (e.g., Oracle PMON).

  • Instance failure: database crashes (power outage, hardware fault).

  • Media failure: disk corruption or loss of data files.

  • User error: accidental DROP TABLE, wrong UPDATE.

6.2 Backup Strategies

  • Full backup: complete copy of database.

  • Incremental backup: changes since last backup (level 0, level 1 in Oracle).

  • Differential backup: changes since last full backup (SQL Server).

  • Cold backup: database offline; consistent copy of all files.

  • Hot backup: database online; uses archive logs for consistency.
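
The practical difference between incremental and differential backups is how many files a restore needs. The sketch below models this under a simplified assumption: one full backup followed by one daily backup of a single type. The function name and the backup labels are hypothetical.

```python
def backups_needed(days_since_full, strategy):
    """Return the backup sets required to restore to the latest daily backup.

    strategy: "incremental"  -> each daily backup holds changes since the
                                previous backup, so all of them are needed.
              "differential" -> each daily backup holds all changes since the
                                full backup, so only the latest one is needed.
    """
    chain = ["full"]
    if strategy == "incremental":
        chain += [f"incremental-day{d}" for d in range(1, days_since_full + 1)]
    elif strategy == "differential":
        if days_since_full > 0:
            chain.append(f"differential-day{days_since_full}")
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return chain

print(backups_needed(3, "incremental"))
print(backups_needed(3, "differential"))
```

Incremental backups are smaller and faster to take but lengthen the restore chain; differential backups grow each day but keep the restore to two steps.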

6.3 Recovery Concepts

  • Recovery point objective (RPO): acceptable data loss.

  • Recovery time objective (RTO): acceptable downtime.

  • Media recovery: restore from backup, apply archived redo logs.

  • Point‑in‑time recovery (PITR): recover to a specific time.

6.4 Backup Tools

  • Oracle RMAN (Recovery Manager) – comprehensive backup and recovery.

  • SQL Server backups via SSMS, T‑SQL (BACKUP DATABASE).

  • MySQL: mysqldump, mysqlbackup (MySQL Enterprise Backup), or file‑level snapshots.

  • PostgreSQL: pg_dump, pg_basebackup.

6.5 Testing Recovery

  • Regularly perform restore drills.

  • Validate backup integrity.

  • Document recovery procedures.


7. High Availability and Disaster Recovery

7.1 High Availability (HA) Solutions

  • Failover clustering: multiple servers share storage; one active, others passive (Windows Failover Cluster, Linux Pacemaker).

  • Data replication: real‑time copying to another server.

  • Oracle Data Guard: physical or logical standby for failover.

  • SQL Server Always On Availability Groups: replicas with automatic failover.

  • MySQL replication: master‑slave, group replication.

7.2 Disaster Recovery (DR)

  • Geographically separated secondary site.

  • Active‑passive or active‑active configurations.

  • Regular DR drills to ensure failover works.

7.3 Load Balancing


8. Performance Monitoring and Tuning

8.1 Key Metrics

  • CPU usage: overall server load.

  • Memory usage: buffer cache hit ratio, shared pool efficiency.

  • I/O: disk latency, throughput, number of reads/writes.

  • Network: throughput, latency.

  • Wait events: indicate bottlenecks (Oracle: v$session_wait, SQL Server: sys.dm_os_wait_stats).

8.2 Monitoring Tools

  • Native: Oracle Enterprise Manager (OEM), SQL Server Management Studio (SSMS), MySQL Workbench, pgAdmin.

  • Third‑party: SolarWinds, Nagios, Zabbix, Prometheus + Grafana.

8.3 Performance Tuning Areas

  • SQL Tuning: identify expensive queries, use execution plans, add indexes, rewrite queries.

  • Index Management: identify missing indexes, drop unused indexes, rebuild fragmented indexes.

  • Memory Tuning: adjust buffer cache, shared pool, sort area.

  • I/O Tuning: distribute data files across disks, use faster storage (SSD), optimize tablespace layout.

  • Parameter Tuning: adjust DBMS parameters (e.g., db_block_size, log_buffer, max_connections).

8.4 Common Tools for Analysis

  • EXPLAIN PLAN: shows execution plan.

  • AWR (Oracle Automatic Workload Repository): performance snapshots.

  • SQL Server Profiler / Extended Events: trace queries.

  • MySQL slow query log: log queries exceeding threshold.


9. Automation and Scheduling

9.1 Automating Routine Tasks

  • Backups: scheduled using cron (Linux), Task Scheduler (Windows), or DBMS scheduler.

  • Index maintenance: rebuild/reorganize indexes.

  • Statistics gathering: update optimizer statistics.

  • Space monitoring: alert when tablespace or disk space low.

  • Purging old data: archival and deletion.

9.2 Scheduling Tools

  • Oracle Scheduler (DBMS_SCHEDULER).

  • SQL Server Agent (jobs).

  • MySQL Event Scheduler.

  • PostgreSQL pg_cron.

  • External: cron, Ansible, Jenkins.

9.3 Scripting

  • Use shell scripts, Python, PowerShell to automate complex tasks.

  • Example: script to check all databases for corruption and send email alerts.
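
As a small illustration of scripted monitoring, the sketch below covers the space-monitoring task from Section 9.1 rather than the corruption check, since the latter is DBMS-specific. It uses the standard-library call `shutil.disk_usage`; the function names, the 85% threshold, and the checked path are assumptions, and a real script would send an email or page instead of printing.

```python
import shutil

def usage_percent(total_bytes, used_bytes):
    """Percentage of the volume that is in use."""
    return 100.0 * used_bytes / total_bytes

def check_disk(path, threshold_percent=85.0):
    """Return (alert, percent_used) for the volume containing `path`."""
    usage = shutil.disk_usage(path)      # named tuple: total, used, free (bytes)
    percent = usage_percent(usage.total, usage.used)
    return percent >= threshold_percent, percent

alert, percent = check_disk(".")         # "." stands in for a data directory
if alert:
    print(f"ALERT: disk {percent:.1f}% full")
else:
    print(f"OK: disk {percent:.1f}% full")
```

Scheduled from cron or Task Scheduler, a check like this gives early warning before a data file fails to extend.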


10. Database Maintenance

10.1 Regular Maintenance Tasks

  • Check integrity: DBCC CHECKDB (SQL Server), ANALYZE (Oracle), mysqlcheck (MySQL).

  • Update statistics: ensure optimizer uses up‑to‑date distribution data.

  • Rebuild/reorganize indexes: reduce fragmentation.

  • Archiving and purging: remove old data to keep performance.

  • Validate backups: periodic restore tests.

10.2 Patching and Upgrades

  • Apply critical patches (CPU – Critical Patch Update for Oracle).

  • Plan upgrade paths (in‑place, migration, rolling upgrade).

  • Test in staging environment before production.


11. Cloud Database Administration

11.1 Cloud Database Services

  • Amazon RDS: managed relational databases (MySQL, PostgreSQL, Oracle, SQL Server).

  • Azure SQL Database: PaaS version of SQL Server.

  • Google Cloud SQL: managed MySQL, PostgreSQL, SQL Server.

  • Amazon Aurora: cloud‑optimized, compatible with MySQL/PostgreSQL.

  • MongoDB Atlas, Redis Enterprise Cloud for NoSQL.

11.2 DBA Responsibilities in Cloud

  • Less physical management: no hardware, OS patching handled by provider.

  • Focus on: performance tuning, security configurations (VPC, IAM), backup configuration (automated snapshots), scaling (read replicas, vertical/horizontal).

  • Cost management: right‑size instances, use reserved instances, monitor storage costs.

11.3 Hybrid and Multi‑Cloud

  • On‑premises databases integrated with cloud for DR, analytics.

  • Use of replication tools (GoldenGate, AWS DMS) for data movement.


12. Tools and Utilities

12.1 Command‑Line Utilities

  • Oracle: sqlplus, rman, expdp/impdp, lsnrctl.

  • SQL Server: sqlcmd, bcp.

  • MySQL: mysql, mysqldump, mysqladmin.

  • PostgreSQL: psql, pg_dump, pg_restore.

12.2 Graphical Tools

  • Oracle Enterprise Manager (OEM)

  • SQL Server Management Studio (SSMS)

  • MySQL Workbench

  • pgAdmin

  • Azure Data Studio (cross‑platform)

12.3 Monitoring Tools

  • OEM, SSMS, MySQL Workbench, pgAdmin

  • Open‑source: Prometheus + Grafana, Zabbix, Nagios


13. Best Practices

13.1 Documentation

  • Maintain configuration documents, network diagrams, backup/recovery procedures.

  • Document password policies, user access, and change management.

13.2 Change Management

  • Use version control for scripts.

  • Follow approval process for schema changes, parameter changes.

13.3 Proactive Monitoring

  • Set up alerts for critical events (space, failed jobs, long‑running queries).

  • Regularly review logs (alert log, error logs, audit logs).

13.4 Security Hardening

  • Disable unnecessary services, restrict network access, use firewalls.

  • Encrypt backup files, test restore to validate encryption.

13.5 Capacity Planning


14. Advanced Topics

14.1 Database Cloning and Refresh

  • Create copies for development, testing using snapshots, RMAN duplicate, or export/import.

  • Mask sensitive data in non‑production environments.

14.2 Data Masking

  • Replace sensitive data with realistic but fake values.

  • Oracle Data Masking, SQL Server Dynamic Data Masking.

14.3 Auditing

  • Enable fine‑grained auditing for compliance (SOX, GDPR).

  • Audit login attempts, DDL changes, privilege grants.

14.4 Container Databases (Oracle CDB/PDB)

14.5 In‑Memory Databases

  • Oracle Database In‑Memory, SQL Server In‑Memory OLTP, MySQL HeatWave.

  • Columnar acceleration for analytics.


Summary

Database Administration and Management is a critical discipline ensuring that databases are secure, available, and performant. The DBA must master installation, configuration, backup/recovery, monitoring, tuning, and automation. With the shift to cloud, DBAs now manage hybrid environments and leverage managed services. Strong foundations in OS, networking, and scripting complement deep DBMS knowledge.

CS-507: COMPUTER ORGANIZATION AND ASSEMBLY LANGUAGE

Module 1: Introduction to Computer Organization

1.1. Computer Organization vs. Computer Architecture

1.2. Functional Units of a Computer

┌─────────────────────────────────────────────────────────┐
│                      COMPUTER SYSTEM                     │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐     │
│  │     CPU     │  │   MEMORY    │  │  INPUT/     │     │
│  │             │  │             │  │  OUTPUT     │     │
│  │ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────┐ │     │
│  │ │   ALU   │ │  │ │  Cache  │ │  │ │ Keyboard│ │     │
│  │ └─────────┘ │  │ └─────────┘ │  │ │ Monitor │ │     │
│  │ ┌─────────┐ │  │ ┌─────────┐ │  │ │ Printer │ │     │
│  │ │   CU    │ │  │ │   RAM   │ │  │ │   Disk  │ │     │
│  │ └─────────┘ │  │ └─────────┘ │  │ └─────────┘ │     │
│  │ ┌─────────┐ │  └─────────────┘  └─────────────┘     │
│  │ │Register│ │                                         │
│  │ │  File  │ │  ┌─────────────────────────────┐       │
│  │ └─────────┘ │  │         SYSTEM BUS         │       │
│  └─────────────┘  └─────────────────────────────┘       │
└─────────────────────────────────────────────────────────┘

1.3. Von Neumann Architecture

The foundational architecture for most modern computers:

Von Neumann Bottleneck: The single bus between CPU and memory limits throughput; CPU often waits for data/instructions.

1.4. Harvard Architecture

  • Separate memory and buses for instructions and data

  • Allows simultaneous fetch of instruction and data

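  • Used in many microcontrollers and DSPs; most modern CPUs apply the idea at the cache level, with separate L1 instruction and data caches (modified Harvard architecture)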

1.5. CPU Organization

CPU Components:

  • ALU (Arithmetic Logic Unit): Performs arithmetic and logical operations

  • CU (Control Unit): Decodes instructions and generates control signals

  • Register File: High-speed storage locations inside CPU

  • Program Counter (PC): Holds address of next instruction

  • Instruction Register (IR): Holds current instruction being executed

1.6. Instruction Cycle (Fetch-Decode-Execute)

        ┌──────────────────────────────────────┐
        │                                      │
        ▼                                      │
    ┌──────────┐    ┌──────────┐    ┌──────────┐
    │  FETCH   │───▶│  DECODE  │───▶│ EXECUTE  │
    │          │    │          │    │          │
    │ Get next │    │ Interpret│    │ Perform  │
    │instruction│   │operation │    │ operation│
    └──────────┘    └──────────┘    └──────────┘
                                         │
        ┌────────────────────────────────┘
        ▼
    ┌──────────┐
    │   STORE  │ (if applicable)
    │ Result   │
    └──────────┘

Module 2: Digital Logic Fundamentals

2.1. Number Systems

Conversions:

  • Binary to Decimal: Sum of (bit × 2^n)

  • Decimal to Binary: Repeated division by 2

  • Hex to Binary: Each hex digit = 4 bits
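
The three conversion rules above can be written out step by step. This sketch deliberately avoids Python's built-in `bin()` so that each rule stays visible; the function names are made up for illustration.

```python
def decimal_to_binary(n):
    """Repeated division by 2; remainders read in reverse give the bits."""
    if n == 0:
        return "0"
    bits = []
    while n > 0:
        n, remainder = divmod(n, 2)
        bits.append(str(remainder))
    return "".join(reversed(bits))

def binary_to_decimal(bits):
    """Sum of bit * 2^position, with position 0 at the rightmost bit."""
    return sum(int(b) * 2 ** i for i, b in enumerate(reversed(bits)))

def hex_to_binary(hex_digits):
    """Each hex digit expands to exactly 4 bits."""
    return "".join(format(int(d, 16), "04b") for d in hex_digits)

print(decimal_to_binary(11))       # 1011
print(binary_to_decimal("10010"))  # 18
print(hex_to_binary("2F"))         # 00101111
```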

2.2. Binary Arithmetic

Binary Addition:        Binary Subtraction:
  1011 (11)              1011 (11)
+ 0111 (7)            - 0111 (7)
-------               -------
 10010 (18)             0100 (4)

2.3. Signed Number Representations

2’s Complement is most common because it simplifies arithmetic and has no duplicate zero.
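
A short sketch shows why 2's complement simplifies arithmetic: subtraction becomes ordinary addition of the encoded negative value, with the carry out of the top bit discarded. The function names and the 8-bit width are assumptions for this example.

```python
def to_twos_complement(value, bits=8):
    """Encode a signed integer as a fixed-width 2's complement bit string."""
    if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
        raise ValueError(f"{value} does not fit in {bits} bits")
    return format(value & ((1 << bits) - 1), f"0{bits}b")

def from_twos_complement(pattern):
    """Decode a 2's complement bit string; a leading 1 means negative."""
    raw = int(pattern, 2)
    return raw - (1 << len(pattern)) if pattern[0] == "1" else raw

print(to_twos_complement(-7))              # 11111001
print(from_twos_complement("11111001"))    # -7

# 11 - 7 computed as 11 + (-7), keeping only the low 8 bits
total = (int(to_twos_complement(11), 2) + int(to_twos_complement(-7), 2)) & 0xFF
print(format(total, "08b"), total)         # 00000100 4
```

The same adder circuit therefore serves both addition and subtraction, and zero has the single encoding 00000000.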

2.4. Boolean Algebra and Logic Gates

Basic Gates:

2.5. Combinational Circuits

Circuits where output depends only on current inputs (no memory).

2.6. Sequential Circuits

Circuits with memory; output depends on current inputs and previous state.


Module 3: Processor Architecture

3.1. Instruction Set Architecture (ISA)

The interface between hardware and software.

Instruction Types:

  1. Data Transfer: MOV, LOAD, STORE

  2. Arithmetic: ADD, SUB, MUL, DIV

  3. Logical: AND, OR, XOR, NOT, SHIFT

  4. Control Transfer: JMP, CALL, RET, BRANCH

  5. Input/Output: IN, OUT

3.2. Addressing Modes

3.3. CISC vs. RISC

3.4. CPU Control Unit Design

Hardwired Control:

Microprogrammed Control:

  • Control signals stored in microcode ROM

  • Slower but easier to modify

  • Common in CISC processors

3.5. Pipelining

Basic 5-Stage Pipeline:

  1. IF (Instruction Fetch): Fetch instruction from memory

  2. ID (Instruction Decode): Decode and read registers

  3. EX (Execute): ALU operation or address calculation

  4. MEM (Memory Access): Read/write data memory

  5. WB (Write Back): Write result to register
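
To see how the five stages overlap, the sketch below builds the cycle-by-cycle schedule of an ideal pipeline, assuming no hazards or stalls and one new instruction entering per cycle. Under that assumption n instructions finish in 5 + n - 1 cycles instead of 5n. The function name is hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_schedule(n_instructions):
    """Return one dict per clock cycle mapping instruction index -> stage.

    Assumes an ideal pipeline: no hazards, no stalls.
    """
    total_cycles = len(STAGES) + n_instructions - 1
    schedule = []
    for cycle in range(total_cycles):
        row = {}
        for instr in range(n_instructions):
            stage_index = cycle - instr      # instruction i enters at cycle i
            if 0 <= stage_index < len(STAGES):
                row[instr] = STAGES[stage_index]
        schedule.append(row)
    return schedule

for cycle, row in enumerate(pipeline_schedule(4), start=1):
    print(cycle, row)
```

With 4 instructions the schedule has 8 cycles rather than 20, and in the fifth cycle four different instructions occupy four different stages at once.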

Pipeline Hazards:


Module 4: Memory Hierarchy

4.1. Memory Hierarchy Pyramid

                    ┌─────────────┐
                    │  Registers  │  < 1 ns, ~1 KB
                    ├─────────────┤
                    │  L1 Cache   │  ~1 ns, ~64 KB
                    ├─────────────┤
                    │  L2 Cache   │  ~5 ns, ~256 KB-1 MB
                    ├─────────────┤
                    │  L3 Cache   │  ~10 ns, ~2-32 MB
                    ├─────────────┤
                    │    RAM      │  ~100 ns, ~8-128 GB
                    ├─────────────┤
                    │    SSD      │  ~100 μs, ~256 GB-4 TB
                    ├─────────────┤
                    │    HDD      │  ~10 ms, ~1-20 TB
                    └─────────────┘
                    
                    Capacity ↑  Speed ↓  Cost/GB ↓

Principle of Locality:

4.2. Cache Memory

Cache Organization:

Cache Performance:

  • Hit: Data found in cache

  • Miss: Data not in cache; must fetch from lower level

  • Hit Rate = Hits / (Hits + Misses)

  • Average Access Time = Hit Time + (Miss Rate × Miss Penalty)
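
A quick calculation shows how strongly the miss rate drives the average access time. The numbers are hypothetical (1 ns hit time, 100 ns miss penalty, 950 hits out of 1000 accesses) and the function names are made up for this sketch.

```python
def hit_rate(hits, misses):
    """Hit Rate = Hits / (Hits + Misses)."""
    return hits / (hits + misses)

def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average Access Time = Hit Time + Miss Rate * Miss Penalty."""
    return hit_time + miss_rate * miss_penalty

rate = hit_rate(hits=950, misses=50)
amat_ns = average_access_time(hit_time=1.0, miss_rate=1 - rate, miss_penalty=100.0)
print(rate)                 # 0.95
print(round(amat_ns, 2))    # 6.0
```

A 5% miss rate makes the average access six times slower than a hit, which is why small improvements in hit rate matter.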

4.3. Virtual Memory

Allows programs to use more memory than physically available.

Memory Management Unit (MMU): Hardware that handles address translation.

4.4. Interrupts and DMA


Module 5: Assembly Language Programming (x86)

5.1. x86 Architecture Overview

Registers (16-bit):

32-bit Extensions: EAX, EBX, ECX, EDX, ESI, EDI, EBP, ESP, EIP
64-bit Extensions: RAX, RBX, RCX, RDX, RSI, RDI, RBP, RSP, RIP

5.2. x86 Instruction Format

[Label:]  Opcode  [Destination], [Source]  [; Comment]

Example:
start:    MOV     AX, 5           ; Load 5 into AX
          ADD     AX, BX          ; Add BX to AX
          INT     21h             ; DOS interrupt

5.3. Data Types and Directives

5.4. Basic Instructions

Data Transfer:

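MOV  AX, 1234h     ; AX = 1234h (immediate to register)
MOV  BX, AX        ; BX = AX (register to register)
MOV  [1234h], AX   ; store AX at memory address DS:1234h
MOV  AX, [SI]      ; load the word at DS:SI into AX
XCHG AX, BX        ; swap AX and BX
LEA  AX, [BX+SI]   ; AX = BX + SI (address only, no memory access)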

Arithmetic:

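ADD  AX, BX        ; AX = AX + BX
ADC  AX, BX        ; AX = AX + BX + CF (add with carry)
SUB  AX, BX        ; AX = AX - BX
SBB  AX, BX        ; AX = AX - BX - CF (subtract with borrow)
INC  AX            ; AX = AX + 1 (CF unchanged)
DEC  AX            ; AX = AX - 1 (CF unchanged)
MUL  BX            ; unsigned: DX:AX = AX * BX
IMUL BX            ; signed: DX:AX = AX * BX
DIV  BX            ; unsigned: AX = DX:AX / BX, DX = remainder
NEG  AX            ; AX = -AX (2's complement)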

Logical:

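AND  AX, BX        ; AX = AX AND BX (bitwise)
OR   AX, BX        ; AX = AX OR BX (bitwise)
XOR  AX, BX        ; AX = AX XOR BX (bitwise)
NOT  AX            ; invert every bit of AX (1's complement)
SHL  AX, 1         ; shift left by 1 (multiply by 2)
SHR  AX, 1         ; logical shift right by 1 (unsigned divide by 2)
SAR  AX, 1         ; arithmetic shift right by 1 (sign bit preserved)
ROL  AX, 1         ; rotate left by 1
ROR  AX, 1         ; rotate right by 1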

5.5. Control Flow

Conditional Jumps:

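CMP  AX, BX        ; set flags from AX - BX (result discarded)
JE   label         ; jump if equal (ZF = 1)
JNE  label         ; jump if not equal (ZF = 0)
JG   label         ; jump if greater (signed)
JGE  label         ; jump if greater or equal (signed)
JL   label         ; jump if less (signed)
JLE  label         ; jump if less or equal (signed)
JA   label         ; jump if above (unsigned)
JB   label         ; jump if below (unsigned)
JC   label         ; jump if carry (CF = 1)
JZ   label         ; jump if zero (ZF = 1, same as JE)
JNZ  label         ; jump if not zero (ZF = 0, same as JNE)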

Unconditional Jump and Loop:

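JMP  label         ; jump unconditionally
LOOP label         ; CX = CX - 1; jump if CX is not 0
CALL subroutine    ; push return address, jump to subroutine
RET                ; pop return address, resume after the CALL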

5.6. Stack Operations

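PUSH AX            ; SP = SP - 2; store AX at SS:SP
POP  BX            ; load BX from SS:SP; SP = SP + 2
PUSHF              ; push the FLAGS register
POPF               ; pop the FLAGS register
PUSHA              ; push AX, CX, DX, BX, SP, BP, SI, DI (80186 and later)
POPA               ; pop them in reverse order (saved SP is discarded)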

Stack Frame Setup:

function:
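    PUSH BP            ; save the caller's base pointer
    MOV  BP, SP        ; BP now marks the base of this frame
    SUB  SP, 4         ; reserve 4 bytes for local variables
    ; ... function body: locals at [BP-2] and [BP-4] ...
    MOV  SP, BP        ; release the local variables
    POP  BP            ; restore the caller's base pointer
    RET                ; return to the caller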

5.7. Complete Assembly Program Example



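.MODEL SMALL          ; one code segment, one data segment
.STACK 100h           ; 256-byte stack

.DATA
    num1    DB 10     ; first operand
    num2    DB 20     ; second operand
    result  DB ?      ; sum (uninitialized)
    msg     DB 'Result: $'  ; '$' terminates strings for INT 21h / AH=09h

.CODE
main PROC
    ; Point DS at the data segment
    MOV AX, @DATA
    MOV DS, AX

    ; result = num1 + num2
    MOV AL, num1
    ADD AL, num2
    MOV result, AL

    ; Print the message
    LEA DX, msg
    MOV AH, 09h       ; DOS: print string at DS:DX
    INT 21h

    ; Print the result as two decimal digits (valid for 0-99)
    MOV AL, result
    AAM               ; AH = AL / 10 (tens), AL = AL mod 10 (ones)
    MOV BX, AX        ; save digits; INT 21h / AH=02h overwrites AL
    MOV DL, BH
    ADD DL, '0'       ; tens digit to ASCII
    MOV AH, 02h       ; DOS: print character in DL
    INT 21h
    MOV DL, BL
    ADD DL, '0'       ; ones digit to ASCII
    MOV AH, 02h
    INT 21h

    ; Wait for a key press
    MOV AH, 08h       ; DOS: read character without echo
    INT 21h

    ; Return to DOS
    MOV AH, 4Ch
    INT 21h
main ENDP
END main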

Module 6: Input/Output Systems

6.1. I/O Addressing Methods

x86 I/O Instructions:

6.2. Interrupt Handling

1. Device raises interrupt
2. CPU completes current instruction
3. CPU pushes flags and return address
4. CPU looks up ISR address in interrupt vector table
5. CPU jumps to ISR
6. ISR executes, handles device
7. ISR executes IRET to return

6.3. Programmed I/O vs. Interrupt-Driven I/O vs. DMA


Module 7: Advanced Topics

7.1. Parallel Processing

7.2. Multicore Processors

  • Multiple CPU cores on a single chip

  • Shared or distributed caches

  • Cache coherence protocols (MESI, MOESI)

7.3. RISC-V Architecture (Emerging)


Conclusion

Computer Organization and Assembly Language form the bridge between hardware and software. Understanding how processors execute instructions, how memory is organized, and how to program at the assembly level provides essential insight into performance optimization, system programming, and embedded systems development. This knowledge is fundamental for computer scientists and engineers working anywhere near the hardware-software interface.


1. Introduction to AI

1.1 What is AI?

  • Acting humanly: Turing Test approach.

  • Thinking humanly: cognitive modeling.

  • Thinking rationally: logic-based reasoning.

  • Acting rationally: rational agent approach (most common in modern AI).

1.2 Intelligent Agents

  • Agent: perceives environment via sensors, acts via actuators.

  • Rationality: choose action that maximizes expected performance measure given percept sequence and knowledge.

  • PEAS description:

    • Performance measure

    • Environment

    • Actuators

    • Sensors

1.3 Types of Environments


2. Problem Solving with Search

2.1 Problem Formulation

  • State space, initial state, actions, transition model, goal test, path cost.

2.2 Uninformed (Blind) Search

  • b = branching factor, d = depth of shallowest goal, m = max depth, C* = optimal cost, ε = min edge cost.

2.3 Informed (Heuristic) Search

  • Heuristic h(n): estimated cost from node n to goal.

  • Greedy Best‑First: expands node with smallest h(n) – not optimal.

  • A* Search: f(n) = g(n) + h(n). Tree‑search A* is optimal if h is admissible (never overestimates).

    • Graph‑search A* additionally requires consistency (monotonicity) for optimality: h(n) ≤ c(n, a, n’) + h(n’).

    • Complexity: O(b^Δ), where Δ = |h* – h| is the absolute error of the heuristic.
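
A compact A* implementation makes the roles of g(n), h(n), and f(n) concrete. The sketch below assumes a 4-connected grid with unit step costs and uses the Manhattan distance as heuristic, which is both admissible and consistent in that setting. The grid, the function name `a_star`, and the cell encoding (0 = free, 1 = wall) are assumptions for illustration.

```python
import heapq

def a_star(grid, start, goal):
    """Return the cost of the shortest path from start to goal, or None."""
    def h(cell):                                   # Manhattan distance heuristic
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(h(start), 0, start)]              # entries are (f, g, cell)
    best_g = {start: 0}
    while frontier:
        f, g, cell = heapq.heappop(frontier)
        if cell == goal:
            return g
        if g > best_g.get(cell, float("inf")):     # stale queue entry
            continue
        row, col = cell
        for nr, nc in ((row + 1, col), (row - 1, col), (row, col + 1), (row, col - 1)):
            if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] == 0:
                new_g = g + 1
                if new_g < best_g.get((nr, nc), float("inf")):
                    best_g[(nr, nc)] = new_g
                    heapq.heappush(frontier, (new_g + h((nr, nc)), new_g, (nr, nc)))
    return None

grid = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
print(a_star(grid, (0, 0), (3, 0)))   # 7
```

Setting h to zero everywhere turns this into uniform-cost search, and ordering the queue by h alone turns it into greedy best-first search.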

2.4 Local Search

  • For optimization where path doesn’t matter (e.g., 8‑queens, scheduling).

  • Hill climbing, simulated annealing, genetic algorithms.


3. Adversarial Search (Games)

  • Minimax algorithm: assumes opponent plays optimally.

  • Alpha‑Beta pruning: eliminates branches that cannot influence final decision.

  • Evaluation functions: approximate state value when depth limit reached (e.g., weighted linear function).

  • Monte Carlo Tree Search (MCTS): used in AlphaGo; combines tree search with random rollouts.
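
Minimax with alpha-beta pruning fits in a short recursive function. In this sketch a game tree is represented as nested lists with numeric leaves, which is an assumption made for illustration; the optional `visited` list only records which leaves were actually evaluated so that the effect of pruning can be observed.

```python
def alphabeta(node, maximizing, alpha=float("-inf"), beta=float("inf"), visited=None):
    """Return the minimax value of `node`, skipping branches that cannot matter."""
    if isinstance(node, (int, float)):             # leaf: static evaluation
        if visited is not None:
            visited.append(node)
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta, visited))
            alpha = max(alpha, value)
            if alpha >= beta:                      # MIN will never allow this line
                break
        return value
    value = float("inf")
    for child in node:
        value = min(value, alphabeta(child, True, alpha, beta, visited))
        beta = min(beta, value)
        if alpha >= beta:                          # MAX already has a better option
            break
    return value

tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]         # MAX root, three MIN nodes
seen = []
print(alphabeta(tree, True, visited=seen))         # 3
print(seen)                                        # [3, 12, 8, 2, 14, 5, 2]
```

Leaves 4 and 6 are never evaluated: once the second MIN node finds 2, it can no longer beat the 3 that MAX already has.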


4. Constraint Satisfaction Problems (CSP)

  • Variables, domains, constraints.

  • Backtracking search: DFS with constraint checking.

  • Forward checking: keep track of remaining legal values for unassigned variables; prune when domain empty.

  • Constraint propagation: AC‑3 (Arc Consistency) enforces arc consistency in O(n²d³).

  • Heuristics:

    • MRV (minimum remaining values): choose variable with fewest legal values.

    • Degree heuristic: tie-breaker for MRV.

    • LCV (least constraining value): choose value that rules out fewest choices for neighbors.
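
Backtracking search with the MRV heuristic and forward checking can be sketched on the classic map-colouring problem for Australia. The problem instance is a common textbook example, not taken from these notes, and the degree and LCV heuristics are omitted to keep the sketch short.

```python
NEIGHBORS = {
    "WA": ["NT", "SA"], "NT": ["WA", "SA", "Q"],
    "SA": ["WA", "NT", "Q", "NSW", "V"], "Q": ["NT", "SA", "NSW"],
    "NSW": ["Q", "SA", "V"], "V": ["SA", "NSW"], "T": [],
}
COLORS = ["red", "green", "blue"]

def solve(assignment, domains):
    """Return a complete consistent assignment, or None if none exists."""
    if len(assignment) == len(NEIGHBORS):
        return assignment
    # MRV: pick the unassigned variable with the fewest remaining values
    var = min((v for v in NEIGHBORS if v not in assignment),
              key=lambda v: len(domains[v]))
    for color in domains[var]:
        # Forward checking: remove `color` from unassigned neighbours' domains
        new_domains = {v: list(d) for v, d in domains.items()}
        new_domains[var] = [color]
        consistent = True
        for n in NEIGHBORS[var]:
            if n not in assignment and color in new_domains[n]:
                new_domains[n].remove(color)
                if not new_domains[n]:             # empty domain: prune this value
                    consistent = False
                    break
        if consistent:
            result = solve({**assignment, var: color}, new_domains)
            if result is not None:
                return result
    return None                                    # triggers backtracking

solution = solve({}, {v: list(COLORS) for v in NEIGHBORS})
print(solution)
```

Because forward checking removes every assigned colour from the neighbours' domains, any value still in a domain is guaranteed to be consistent with the assignments made so far.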


5. Knowledge Representation & Reasoning

5.1 Logic

  • Propositional logic: symbols, connectives (¬, ∧, ∨, →, ↔), truth tables.

  • First‑order logic (FOL): objects, predicates, functions, quantifiers (∀, ∃).

5.2 Inference

  • Forward chaining: data‑driven, used in production systems.

  • Backward chaining: goal‑driven, used in logic programming (Prolog).

  • Resolution: refutation‑complete for FOL with unification.

  • Horn clauses: allow efficient forward/backward chaining.
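
Forward chaining over propositional Horn clauses is simple enough to sketch directly. Each rule is written as a pair (premises, conclusion); the rule base and the fact names below are hypothetical, and this naive loop rescans all rules on every pass rather than using the linear-time agenda-based algorithm.

```python
def forward_chain(rules, facts):
    """Return every fact derivable from `facts` using Horn-clause `rules`."""
    known = set(facts)
    changed = True
    while changed:                       # repeat until no rule adds anything new
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)    # all premises hold: fire the rule
                changed = True
    return known

rules = [
    (["has_fever", "has_cough"], "has_flu"),
    (["has_flu"], "needs_rest"),
    (["has_rash"], "needs_dermatologist"),
]
derived = forward_chain(rules, ["has_fever", "has_cough"])
print(sorted(derived))
```

The process is data-driven: it starts from what is known and derives everything that follows, whether or not it is relevant to a particular query. Backward chaining would instead start from a goal such as `needs_rest` and work back to the facts.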

5.3 Knowledge Engineering

  • Process of building a knowledge base: identify domain, define vocabulary, encode general knowledge, encode specific problem instances.


6. Planning

  • STRIPS representation: states as conjunctions of ground literals; actions have preconditions and effects (add/del lists).

  • Forward state‑space search: from initial state, apply actions until goal reached.

  • Backward state‑space search: from goal, regress through actions.

  • Partial‑order planning: plan steps without total ordering; flexible.

  • Planning graphs: used in GraphPlan; alternate levels of proposition and action layers.


7. Uncertainty & Probabilistic Reasoning

7.1 Probability Basics

  • Prior probability: P(A), conditional probability: P(A|B) = P(A∧B)/P(B).

  • Bayes’ rule: P(H|E) = P(E|H) P(H) / P(E).

7.2 Bayesian Networks

  • Directed acyclic graph (DAG) representing conditional independencies.

  • Each node has conditional probability table (CPT) given its parents.

  • Inference: exact (variable elimination) or approximate (rejection sampling, likelihood weighting, MCMC – Gibbs sampling).

7.3 Naive Bayes Classifier

  • Assumes features are conditionally independent given class.
    $P(C \mid F_1, \ldots, F_n) \propto P(C) \prod_{i=1}^{n} P(F_i \mid C)$

  • Simple, fast, often effective.


8. Machine Learning

8.1 Supervised Learning

  • Linear regression: minimize squared error; closed form (normal equation) or gradient descent.

  • Logistic regression: binary classification; sigmoid activation; cross‑entropy loss.

  • k‑Nearest Neighbors (k‑NN): instance‑based; decision boundary flexible; sensitive to scale.

  • Support Vector Machines (SVM): maximize margin; kernel trick for nonlinear boundaries.

  • Decision trees: split on features using entropy/information gain (ID3, C4.5).

  • Ensemble methods: Bagging (Random Forests), Boosting (AdaBoost, Gradient Boosting).

Evaluation:

  • Metrics: accuracy, precision, recall, F1, ROC‑AUC.

  • Bias‑variance trade‑off: underfitting (high bias) vs overfitting (high variance).

  • Cross‑validation.

8.2 Unsupervised Learning

  • Clustering: k‑means (iterative centroid update), hierarchical clustering (agglomerative/divisive).

  • Dimensionality reduction: Principal Component Analysis (PCA) finds orthogonal directions of max variance.

8.3 Reinforcement Learning (RL)

  • Markov Decision Process (MDP): (S, A, T, R, γ).

    • Policy: π(s) → action.

    • Value functions: Vπ(s) = expected cumulative reward from s following π; Qπ(s,a) = expected cumulative reward taking a in s then following π.

  • Bellman optimality equations (sums run over successor states s’):

    • V(s) = maxₐ Σ_{s’} T(s,a,s’) [ R(s,a,s’) + γ V(s’) ]

    • Q(s,a) = Σ_{s’} T(s,a,s’) [ R(s,a,s’) + γ maxₐ’ Q(s’,a’) ]

  • Dynamic programming: value iteration, policy iteration (require full MDP).

  • Model‑free RL:

    • Q‑learning: off‑policy, Q(s,a) ← Q(s,a) + α [ r + γ maxₐ’ Q(s’,a’) – Q(s,a) ]

    • SARSA: on‑policy.

    • Deep Q‑Networks (DQN): use neural network + experience replay + target network.

  • Policy gradients: directly optimize policy (e.g., REINFORCE, PPO).
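The Q-learning update rule above can be tried on a tiny problem. The sketch below assumes a hypothetical 5-state corridor (move left or right, reward 1 for reaching the last state) and a uniformly random behaviour policy; because Q-learning is off-policy, it can still learn the greedy policy from random exploration. The environment, constants, and names are illustrative.

```python
import random

N_STATES = 5        # states 0..4; state 4 is the terminal goal
ACTIONS = (0, 1)    # 0 = left, 1 = right
ALPHA, GAMMA = 0.5, 0.9


def step(state, action):
    """Deterministic corridor dynamics; reward 1.0 on reaching the goal."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)  # random behaviour policy
            next_state, reward, done = step(state, action)
            # Target: r + gamma * max_a' Q(s', a'); no bootstrap at terminal.
            target = reward if done else reward + GAMMA * max(q[next_state])
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
    return q


q_table = q_learning()
greedy_policy = [row.index(max(row)) for row in q_table[:-1]]
print(greedy_policy)
```

SARSA would differ only in the target: it uses the Q-value of the action actually taken in the next state instead of the maximum.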


9. Neural Networks & Deep Learning

  • Perceptron: linear classifier with step activation.

  • Multi‑layer perceptron (MLP): non‑linear activation functions (ReLU, sigmoid, tanh).

  • Backpropagation: compute gradients via chain rule; update weights with gradient descent.

  • Convolutional Neural Networks (CNN): convolution layers, pooling, spatial hierarchies; used in vision.

  • Recurrent Neural Networks (RNN): handle sequences; LSTM, GRU address vanishing gradient.

  • Transformers: self‑attention, positional encoding; backbone of modern LLMs (GPT, BERT).


10. Natural Language Processing (NLP)

  • Tokenization, stemming, lemmatization.

  • N‑gram language models: predict next word; smoothing (Laplace, Kneser‑Ney).

  • Part‑of‑Speech (POS) tagging: HMMs, CRFs.

  • Parsing: constituency (CFG, CKY) and dependency parsing.

  • Word embeddings: Word2Vec (CBOW, Skip‑gram), GloVe.

  • Large Language Models (LLMs): pre‑trained transformers (GPT, BERT, T5) fine‑tuned for tasks; prompt engineering, in‑context learning.


11. AI Ethics & Responsible AI

  • Bias and fairness: models can perpetuate or amplify societal biases.

  • Transparency and explainability: black‑box models vs interpretable models; LIME, SHAP.

  • Accountability: who is responsible for AI decisions?

  • Privacy: differential privacy, federated learning.

  • Safety and robustness: adversarial attacks, out‑of‑distribution detection.


12. Applications & Frontiers

  • Computer Vision: image classification, object detection (YOLO, Mask R‑CNN), segmentation.

  • Robotics: perception, planning, control; SLAM.

  • Autonomous systems: self‑driving cars.

  • Generative AI: GANs, VAEs, diffusion models, large language models.

  • Multi‑agent systems: cooperative/competitive agents, game theory.


Key Algorithms & Formulas Summary

Table of Contents

  1. Introduction to Big Data

  2. Big Data Ecosystem and Architecture

  3. Hadoop Distributed File System (HDFS)

  4. MapReduce Programming Model

  5. Apache Spark

  6. NoSQL Databases

  7. Data Streaming and Real-Time Processing

  8. Big Data Analytics Techniques

  9. Data Visualization for Big Data

  10. Big Data Security, Privacy, and Governance

  11. Case Studies and Applications


1. Introduction to Big Data

1.1 Definition

Big Data refers to datasets that are too large, complex, or fast-growing to be processed by traditional data processing systems. It is characterized by the 3 Vs:

  • Volume: Massive scale (terabytes to petabytes).

  • Velocity: High speed of data generation and processing (streaming, real-time).

  • Variety: Structured, semi-structured, unstructured data (text, images, logs, sensor data).

Later extensions added more Vs:

  • Veracity: Uncertainty, quality, trustworthiness of data.

  • Value: Business or scientific value derived from analysis.

  • Variability: Changes in data flow and structure.

1.2 Challenges

  • Storage and management at scale.

  • Efficient processing (parallel, distributed).

  • Data integration from heterogeneous sources.

  • Real-time analytics.

  • Privacy, security, and compliance.

1.3 Big Data Analytics Lifecycle

  1. Data Acquisition: Collection from sources (sensors, logs, databases).

  2. Data Storage: HDFS, NoSQL, cloud storage.

  3. Data Processing: Batch (MapReduce, Spark) or streaming (Kafka, Spark Streaming).

  4. Data Analysis: Machine learning, statistical models, queries.

  5. Interpretation/Visualization: Dashboards, reports.


2. Big Data Ecosystem and Architecture

2.1 Hadoop Ecosystem

Apache Hadoop is an open-source framework for distributed storage and processing.

2.2 Lambda Architecture

A robust architecture to handle both batch and real-time processing:

  • Batch Layer: Stores master dataset, precomputes views (e.g., Hadoop, Spark).

  • Speed Layer: Processes real-time data, provides incremental views (e.g., Storm, Spark Streaming).

  • Serving Layer: Merges batch and real-time results for querying.

2.3 Kappa Architecture

Simplification of Lambda: all data is treated as streams; batch is replaced by replay of streams (e.g., using Kafka and a stream processor like Flink or Spark Structured Streaming).


3. Hadoop Distributed File System (HDFS)

3.1 Architecture

  • Master-Slave structure:

    • NameNode: Manages metadata (namespace, file-to-block mapping, block locations). It is a single point of failure unless High Availability is configured with a standby NameNode.

    • DataNode: Stores actual data blocks (default 128 MB). Replicates blocks (default 3) for fault tolerance.
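As a quick back-of-the-envelope check of the defaults mentioned above (128 MB blocks, replication factor 3), the sketch below estimates how many blocks and how much raw storage a file needs. The function name and the 1000 MB file size are illustrative assumptions.

```python
import math


def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Estimate the block count and raw storage used by one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        "blocks": blocks,
        "block_replicas": blocks * replication,
        # The last block only occupies the bytes it actually holds.
        "raw_storage_mb": file_size_mb * replication,
    }


footprint = hdfs_footprint(1000)
print(footprint)
```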

3.2 Key Features

  • High Throughput rather than low latency.

  • Write-once, read-many model (append allowed).

  • Replication: Blocks are replicated across racks (rack awareness) so that data survives the failure of a node or a whole rack.

  • Heartbeats: DataNodes send periodic heartbeats to NameNode.

  • Block Report: DataNode sends list of blocks to NameNode.

3.3 HDFS Read/Write Operations

  • Write: Client requests NameNode for block locations; NameNode returns DataNodes for pipeline; client writes to first DataNode, which forwards to next, etc.

  • Read: Client contacts NameNode for block locations; reads from nearest replica.

3.4 HDFS Commands

  • hdfs dfs -put /local /hdfs

  • hdfs dfs -get /hdfs /local

  • hdfs dfs -ls, -cat, -mkdir, -chmod, etc.


4. MapReduce Programming Model

4.1 Overview

MapReduce is a programming model for processing large datasets in parallel across a cluster.

  • Map: (key1, value1) → list(key2, value2) – performs filtering, transformation.

  • Shuffle & Sort: Groups intermediate values by key and sorts the keys before they reach the reducers.

  • Reduce: (key2, list(value2)) → list(key3, value3) – aggregates, summarizes.

4.2 Execution Flow

  1. Input splits (e.g., HDFS blocks) are assigned to Map tasks.

  2. Each Map task processes its split, emits intermediate key-value pairs.

  3. Intermediate data is partitioned, sorted, and spilled to disk.

  4. Reduce tasks fetch their partitions, merge, and run the reduce function.

  5. Output is written to HDFS.

4.3 Key Components

  • JobTracker (Hadoop 1; replaced in YARN by the ResourceManager and per-application ApplicationMasters): Manages job scheduling.

  • TaskTracker (Hadoop 1; replaced in YARN by the NodeManager): Executes tasks on nodes.

  • Combiner: Mini-reduce on map output (optional) to reduce network traffic.

  • Partitioner: Determines which reducer receives which key.

4.4 Word Count Example

Map:

map(String docid, String text):
    for each word w in text:
        emit(w, 1)

Reduce:

reduce(String word, Iterable counts):
    sum = 0
    for each count in counts:
        sum += count
    emit(word, sum)
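
The pseudocode above can be simulated on a single machine to see the three phases in order. This is only a sketch of the programming model in plain Python, not Hadoop code; the two sample documents and the function names are made up.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit (word, 1) for every word in the document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle & sort: group values by key, with keys in sorted order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(word, counts):
    """Reduce: sum the counts for one word."""
    return word, sum(counts)


documents = {"d1": "big data is big", "d2": "data is everywhere"}
intermediate = [pair
                for doc_id, text in documents.items()
                for pair in map_phase(doc_id, text)]
word_counts = dict(reduce_phase(word, counts)
                   for word, counts in shuffle(intermediate).items())
print(word_counts)
```

In a real cluster the map calls run in parallel on different input splits, and the shuffle moves data across the network, which is why a combiner that pre-sums counts on the map side can save traffic.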

4.5 Limitations

  • High overhead for iterative algorithms.

  • Requires writing many Java classes.

  • Limited to batch processing; no real-time.


5. Apache Spark

5.1 Overview

Spark is a fast, in-memory cluster computing framework that extends the MapReduce model. It provides APIs in Scala, Java, Python, R.

5.2 Resilient Distributed Dataset (RDD)

  • RDD: Immutable, partitioned collection of records that can be processed in parallel.

  • Transformations: Lazy operations (e.g., map, filter, flatMap, join) that build a DAG of transformations.

  • Actions: Trigger computation (e.g., count, collect, saveAsTextFile).
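The difference between lazy transformations and actions can be illustrated without a cluster. The class below is a toy stand-in written for these notes, not the Spark API: `LazyDataset` and its methods are hypothetical names that only mimic the idea that transformations record work and actions execute it.

```python
class LazyDataset:
    """Toy immutable dataset: transformations are recorded, actions run them."""

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)

    def map(self, fn):
        # Returns a new dataset; nothing is computed yet.
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._ops + (("filter", predicate),))

    def _run(self):
        items = iter(self._source)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    def collect(self):
        """Action: execute the recorded pipeline and return a list."""
        return list(self._run())

    def count(self):
        """Action: execute the pipeline and count the results."""
        return sum(1 for _ in self._run())


calls = []


def traced_square(x):
    calls.append(x)  # record when the function actually runs
    return x * x


squares = LazyDataset(range(1, 6)).map(traced_square).filter(lambda v: v % 2 == 1)
calls_before_action = len(calls)   # still 0: only the plan exists
result = squares.collect()         # the action triggers the computation
print(calls_before_action, result)
```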

5.3 Spark Components

  • Spark Core: Basic RDD APIs, task scheduling.

  • Spark SQL: DataFrames/Datasets, SQL queries, Catalyst optimizer.

  • Spark Streaming: Micro-batch processing for real-time data.

  • MLlib: Scalable machine learning library.

  • GraphX: Graph processing (PageRank, etc.).

5.4 Spark Execution

  • Driver: Runs user’s main function, creates SparkContext.

  • Cluster Manager: Standalone, YARN, Mesos, Kubernetes.

  • Executors: Run tasks and store data.

5.5 DataFrames and Datasets

5.6 Spark vs MapReduce


6. NoSQL Databases

6.1 Motivation

Traditional RDBMS (ACID) face limitations with:

6.2 Types of NoSQL Databases

6.2.1 Key-Value Stores

  • Examples: Redis, Riak, Amazon DynamoDB.

  • Features: Simple hash table; extremely fast.

  • Use cases: Caching, session storage.

6.2.2 Column-Family Stores

  • Examples: Apache Cassandra, HBase.

  • Features: Data stored in columns (column families) rather than rows; optimized for wide tables.

  • Use cases: Time-series data, IoT.

6.2.3 Document Stores

  • Examples: MongoDB, Couchbase.

  • Features: JSON/BSON documents; flexible schema; rich querying.

  • Use cases: Content management, catalogs.

6.2.4 Graph Databases

  • Examples: Neo4j, JanusGraph.

  • Features: Nodes, edges, properties; efficient for connected data.

  • Use cases: Social networks, recommendation engines.

6.3 CAP Theorem

  • Consistency, Availability, Partition tolerance – at most two can be fully achieved in a distributed system.

  • NoSQL databases often choose AP (Cassandra) or CP (HBase).

6.4 HBase

  • Column-oriented on top of HDFS.

  • Architecture: HMaster (management), RegionServers (host regions).

  • Data model: Table → row key → column families → columns.

  • Operations: Get, Put, Scan.


7. Data Streaming and Real-Time Processing

7.1 Stream Processing Concepts

  • Event Time vs Processing Time: Time when event occurred vs when processed.

  • Windowing: Tumbling, sliding, session windows.

  • State: Maintain state across events (e.g., counts, sessions).

  • Exactly-Once Semantics: Each event affects the results exactly once, even after failures (no data loss or duplication).
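Tumbling windows based on event time can be sketched as follows. The example assumes events are (event_time_in_seconds, key) pairs and simply buckets each event by the window its timestamp falls into, so an event that arrives late is still counted in the correct window. Real engines additionally need watermarks to decide when a window is complete, which this sketch leaves out.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) using event time."""
    counts = defaultdict(int)
    for event_time, key in events:
        window_start = (event_time // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical stream; the last event arrives out of order (event time 3).
events = [(1, "a"), (4, "b"), (9, "a"), (12, "a"), (3, "a")]
window_counts = tumbling_window_counts(events, window_seconds=10)
print(window_counts)
```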

7.2 Apache Kafka

  • Distributed publish-subscribe messaging system.

  • Components:

    • Producer: Publishes messages to topics.

    • Broker: Stores partitions of topics.

    • Consumer: Subscribes to topics.

    • ZooKeeper (or KRaft): Coordination.

  • Partitioning: Topics split into partitions for parallelism.

  • Offsets: Unique ID per message within partition.
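Partitioning and offsets can be pictured with a toy in-memory topic. This is not the Kafka client API: `ToyTopic` is a hypothetical class, and CRC32 is used as a stand-in hash (Kafka's default partitioner uses a different hash function). It shows two ideas only: records with the same key land in the same partition, and an offset is the record's position within that partition.

```python
import zlib


class ToyTopic:
    """In-memory sketch of a topic split into append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key, value):
        # Same key -> same partition, so per-key ordering is preserved.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        log = self.partitions[partition]
        log.append((key, value))
        return partition, len(log) - 1  # offset = index in the partition log

    def read(self, partition, offset):
        """Return all records in a partition starting at the given offset."""
        return self.partitions[partition][offset:]


topic = ToyTopic(num_partitions=3)
first_partition, first_offset = topic.send("sensor-1", "20.5")
second_partition, second_offset = topic.send("sensor-1", "20.7")
print(first_partition, first_offset, second_offset)
```

A consumer that remembers the last offset it processed can resume from that position after a restart.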

7.3 Spark Streaming

  • Micro-batching: divides stream into small batches (e.g., 1s).

  • DStream (Discretized Stream): sequence of RDDs.

  • Operations: map, window, reduceByKeyAndWindow, etc.

  • Structured Streaming: DataFrame-based, event-time processing, sinks.

7.4 Apache Flink

  • True streaming (not micro-batching).

  • Event-time processing, state management (savepoints), exactly-once semantics.

  • APIs: DataStream API, Table API, SQL.

7.5 Comparison: Spark Streaming vs Flink


8. Big Data Analytics Techniques

8.1 Machine Learning at Scale

  • MLlib (Spark): Distributed algorithms: classification (logistic regression, SVM), regression, clustering (K-means), collaborative filtering (ALS), dimensionality reduction.

  • Mahout: Hadoop-based ML (now focusing on Spark).

  • Graph processing: PageRank, community detection.

8.2 Data Mining with Hadoop

  • Association Rules: Apriori, FP-Growth (parallelized in Mahout/MLlib).

  • Clustering: K-means, Bisecting K-means.

  • Classification: Decision trees (Random Forest), Naive Bayes.

8.3 SQL-on-Hadoop

  • Hive: HiveQL translates to MapReduce/Tez/Spark. Metastore for schema.

  • Impala: MPP SQL engine for low latency (Cloudera).

  • Presto/Trino: Distributed SQL engine for multiple data sources.

  • Spark SQL: DataFrames, Catalyst optimizer.

8.4 Text Analytics

  • Word count, TF-IDF, topic modeling (LDA in Spark).

  • Sentiment analysis at scale.

  • Named entity recognition.
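TF-IDF, mentioned above, can be computed on a small corpus as a sketch. The variant used here is an assumption (term frequency = count / document length, inverse document frequency = ln(N / document frequency)); libraries often apply smoothing or a different normalization. The three documents are invented.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dictionary per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


docs = ["spark processes big data", "hadoop stores big data", "spark streaming"]
scores = tf_idf(docs)
print(scores[1])
```

Terms that occur in fewer documents (such as "hadoop" here) score higher than terms shared across documents (such as "big").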


9. Data Visualization for Big Data

9.1 Challenges

  • Volume: sampling or aggregation needed.

  • Velocity: real-time dashboards.

  • Variety: multiple data types.

9.2 Tools

  • Tableau: Connects to Hadoop, Spark, NoSQL.

  • Power BI: Big data connectors.

  • Apache Superset: Open-source, supports large datasets.

  • Zeppelin: Notebook with visualizations, supports Spark, etc.

  • Kibana: For Elasticsearch data.

9.3 Techniques

  • Aggregation (e.g., summary statistics, precomputed cubes).

  • Sampling (if trends are sufficient).

  • Progressive visualization (streaming updates).


10. Big Data Security, Privacy, and Governance

10.1 Security Challenges

  • Distributed systems increase attack surface.

  • Data privacy regulations (GDPR, CCPA).

10.2 Security Mechanisms

  • Authentication: Kerberos (Hadoop), LDAP.

  • Authorization: Apache Ranger, Sentry for fine-grained access control.

  • Encryption: At rest (HDFS encryption zones), in transit (TLS).

  • Auditing: Logging access.

10.3 Data Governance

  • Data lineage: Tracking data origins and transformations.

  • Metadata management: Apache Atlas, Hive Metastore.

  • Data quality: Tools like Deequ (Amazon) for data profiling.


11. Case Studies and Applications

11.1 Recommendation Systems

11.2 Fraud Detection

  • Real-time streaming (Kafka + Flink) to detect anomalous transactions.

  • Machine learning models (isolation forest, logistic regression) at scale.

11.3 Internet of Things (IoT)

  • Sensor data ingested via Kafka.

  • Time-series storage in HBase/Cassandra.

  • Stream processing for real-time alerts; batch analytics for trend analysis.

11.4 Log Analytics

  • Flume/Kafka to ingest logs.

  • Elasticsearch + Kibana (with Logstash, the ELK stack) for search and visualization.

  • Spark for large-scale log analysis.


Summary

Big Data Analytics courses cover the entire lifecycle: storage (HDFS, NoSQL), processing (MapReduce, Spark, streaming), analytics (machine learning, SQL), and visualization. Emphasis is placed on distributed systems concepts, fault tolerance, scalability, and practical tools. Key takeaways:

  • Hadoop provides the foundational distributed storage and batch processing.

  • Spark has become the de facto standard for unified analytics (batch, streaming, ML, SQL).

  • NoSQL databases address specific scalability and schema flexibility needs.

  • Streaming platforms (Kafka, Flink) enable real-time analytics.

  • Security, governance, and visualization are critical for production deployments.

Familiarity with programming (Java, Scala, Python) and hands-on experience with these tools is essential for mastery.

CS-510: Artificial Neural Networks and Deep Learning

At the heart of deep learning is the artificial neuron, a mathematical unit that mimics the function of a biological neuron. A neuron receives a set of inputs x_1, x_2, …, x_n, multiplies each by a corresponding weight w_i, sums them along with a bias term b, and then passes the result through a non-linear activation function ϕ. The output is y = ϕ(Σ_{i=1}^{n} w_i x_i + b). Feedforward neural networks (or multilayer perceptrons) consist of layers of such neurons: an input layer, one or more hidden layers, and an output layer. Each layer’s output becomes the input to the next layer.

Activation functions introduce non-linearity, enabling the network to learn complex patterns. Common choices include ReLU, sigmoid, and tanh.

Example: Consider a neuron with two inputs x_1 = 0.5, x_2 = 1.0, weights w_1 = 0.2, w_2 = −0.3, bias b = 0.1, and ReLU activation. The weighted sum is z = 0.2·0.5 + (−0.3)·1.0 + 0.1 = 0.1 − 0.3 + 0.1 = −0.1. Then ReLU(−0.1) = 0. The neuron outputs 0, meaning it does not fire.
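
The same calculation can be checked in a few lines of Python. The helper names are illustrative; the numbers are the ones from the example above.

```python
def relu(z):
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)


def neuron(inputs, weights, bias, activation):
    """Return the pre-activation z and the output activation(z)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return z, activation(z)


z, y = neuron(inputs=[0.5, 1.0], weights=[0.2, -0.3], bias=0.1, activation=relu)
print(round(z, 6), y)
```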

Training a neural network involves adjusting the weights to minimize a loss function that measures the discrepancy between predictions and true targets. For regression, mean squared error is common; for classification, cross-entropy loss is typical. The workhorse algorithm for computing gradients of the loss with respect to all weights is backpropagation, which is an efficient application of the chain rule. It proceeds in two passes: a forward pass that computes the activations and the loss, and a backward pass that propagates gradients from the output back toward the input.

Example (simple backpropagation): Suppose a network with one hidden neuron (sigmoid) and one output neuron (linear). Given input x and true target t, the loss is L = ½(y − t)². The forward equations are h = σ(w_1 x + b_1) and y = w_2 h + b_2. The backward pass computes ∂L/∂y = y − t, ∂L/∂w_2 = (∂L/∂y)·h, and ∂L/∂h = (∂L/∂y)·w_2, then ∂L/∂w_1 = (∂L/∂h)·σ′(w_1 x + b_1)·x. These gradients are used to update weights via gradient descent.
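
A useful habit is to verify hand-derived gradients numerically. The sketch below implements the backward pass from the example and compares it with central finite differences. The parameter values, input, and target are arbitrary assumptions, and the bias gradients are added by the same chain-rule reasoning.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)   # hidden neuron (sigmoid)
    y = w2 * h + b2            # output neuron (linear)
    return h, y


def loss(params, x, t):
    _, y = forward(params, x)
    return 0.5 * (y - t) ** 2


def gradients(params, x, t):
    """Backward pass: gradients w.r.t. [w1, b1, w2, b2]."""
    w1, b1, w2, b2 = params
    h, y = forward(params, x)
    dL_dy = y - t
    dL_dh = dL_dy * w2
    dh_dz = h * (1.0 - h)      # sigmoid'(z) expressed via h = sigmoid(z)
    return [dL_dh * dh_dz * x, dL_dh * dh_dz, dL_dy * h, dL_dy]


def numerical_gradients(params, x, t, eps=1e-6):
    """Central finite differences, used only as a check."""
    grads = []
    for i in range(len(params)):
        plus, minus = list(params), list(params)
        plus[i] += eps
        minus[i] -= eps
        grads.append((loss(plus, x, t) - loss(minus, x, t)) / (2 * eps))
    return grads


params = [0.5, -0.1, 0.8, 0.2]   # arbitrary w1, b1, w2, b2
analytic = gradients(params, x=1.5, t=1.0)
numeric = numerical_gradients(params, x=1.5, t=1.0)
print(analytic)
print(numeric)
```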

Optimization algorithms extend basic stochastic gradient descent (SGD) to improve convergence. Momentum accumulates past gradients to smooth updates and escape local minima. Adam (Adaptive Moment Estimation) combines momentum with adaptive learning rates per parameter, making it a popular default choice. A key practical challenge is the vanishing gradient problem in deep networks: gradients become exponentially small as they propagate backward through many layers, preventing early layers from learning. ReLU activation, careful initialization (e.g., He initialization for ReLU), and batch normalization help mitigate this.

Deep networks are prone to overfitting due to their high capacity. Several techniques are used to improve generalization:

Example (Dropout in a fully connected layer): Suppose a hidden layer has 100 neurons and we use dropout with p=0.5. During each training step, each neuron has a 50% chance of being set to 0. This effectively trains a random sub-network, and across many steps, the whole network learns to be robust. In PyTorch, one line like nn.Dropout(0.5) inserts this behavior.
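
To see what dropout does to the activations, here is a plain-Python sketch of "inverted" dropout, the variant in which surviving activations are scaled by 1/(1 − p) during training so that nothing needs to change at evaluation time. The function name and the all-ones input are illustrative assumptions.

```python
import random


def dropout(activations, p, training, rng):
    """Zero each activation with probability p; rescale the survivors."""
    if not training or p == 0.0:
        return list(activations)   # identity at evaluation time
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
hidden = [1.0] * 100               # stand-in for 100 hidden activations
train_out = dropout(hidden, p=0.5, training=True, rng=rng)
eval_out = dropout(hidden, p=0.5, training=False, rng=rng)
print(sum(1 for a in train_out if a == 0.0), "of 100 activations dropped")
```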

Learning rate scheduling is also crucial: starting with a higher learning rate and gradually decreasing it (e.g., step decay, cosine annealing) can help converge to a better optimum.
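
The two schedules named above can be written as simple functions of the epoch. The default drop factor, drop interval, and minimum learning rate below are illustrative assumptions rather than recommended values.

```python
import math


def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the learning rate by `drop` every `epochs_per_drop` epochs."""
    return initial_lr * drop ** (epoch // epochs_per_drop)


def cosine_annealing(initial_lr, epoch, total_epochs, min_lr=0.0):
    """Decay smoothly from initial_lr to min_lr along half a cosine wave."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (initial_lr - min_lr) * (1.0 + cosine)


for epoch in (0, 25, 50, 100):
    print(epoch, step_decay(0.1, epoch), round(cosine_annealing(0.1, epoch, 100), 4))
```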

CNNs are specialized for grid-like data such as images. They exploit two key ideas: local connectivity (each neuron connects only to a small spatial region of the input) and parameter sharing (the same filter is applied across all spatial positions). The main building blocks are:

RNNs are designed for sequential data (time series, text, audio). They maintain a hidden state h_t that is updated at each time step based on the current input x_t and the previous hidden state: h_t = ϕ(W_xh x_t + W_hh h_{t−1} + b_h). This allows the network to exhibit temporal dynamics.
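
The recurrence can be sketched in plain Python, assuming tanh as the activation ϕ. The weight values and the short input sequence are arbitrary; in practice they would be learned, and a framework would handle the matrix operations.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One step: h_t = tanh(W_xh x_t + W_hh h_prev + b_h)."""
    from_input = matvec(W_xh, x_t)
    from_state = matvec(W_hh, h_prev)
    return [math.tanh(a + b + c) for a, b, c in zip(from_input, from_state, b_h)]


# Arbitrary weights: 2 hidden units, 3 input features.
W_xh = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]]
W_hh = [[0.2, 0.1], [-0.3, 0.4]]
b_h = [0.0, 0.1]

h = [0.0, 0.0]                      # initial hidden state
sequence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
for x_t in sequence:
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)  # same weights reused at every step
print(h)
```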

Example (character-level text generation): Train an RNN on a corpus of text, where at each step the input is a one-hot encoded character and the output is the next character. After training, the network can generate new text by sampling from the output probability distribution.

However, vanilla RNNs suffer from vanishing and exploding gradients over long sequences. Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRU) address this with gating mechanisms:

Example (LSTM for sentiment analysis): Each word in a sentence is embedded into a vector and fed sequentially into an LSTM. The final hidden state is passed to a classifier to predict sentiment (positive/negative). LSTMs are effective at capturing context like negation (“not good”) across many words.

The Transformer architecture, introduced in “Attention Is All You Need,” has revolutionized sequence modeling by replacing recurrence with self-attention. It processes entire sequences in parallel, capturing relationships between all pairs of positions. The core components are:

Example (BERT, a Transformer-based model): BERT (Bidirectional Encoder Representations from Transformers) uses the encoder stack of the Transformer. It is pre-trained on large text corpora with masked language modeling (predicting masked tokens) and next-sentence prediction. When it was introduced, it achieved state-of-the-art results on a wide range of NLP benchmarks after fine-tuning on downstream datasets. For sentiment analysis, one can add a classification head on top of BERT’s [CLS] token output and fine-tune.

Transformers have also been adapted to computer vision (Vision Transformer, ViT), where an image is split into patches treated as tokens, achieving competitive results with CNNs.
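
The self-attention operation at the core of the Transformer can be sketched as scaled dot-product attention: each query is compared with every key, the scores are scaled by the square root of the key dimension and passed through a softmax, and the resulting weights average the values. The sketch below uses plain lists and a single head, with no learned projections; the tiny inputs are invented.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention for lists of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        all_weights.append(weights)
        # Weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs, all_weights


# Two identical keys receive equal weight, so the output is the mean value.
outputs, weights = attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 1.0], [1.0, 1.0]],
    values=[[2.0], [4.0]],
)
print(outputs, weights)
```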

Autoencoders learn compressed representations of data without labels. An encoder maps input to a lower-dimensional latent space, and a decoder reconstructs the input from that representation. Variational Autoencoders (VAEs) learn a probabilistic latent space, enabling generation of new samples. Denoising autoencoders are trained to reconstruct clean inputs from corrupted ones, which helps in feature learning.

Generative Adversarial Networks (GANs) consist of a generator that creates fake samples and a discriminator that tries to distinguish real from fake. They are trained in a min-max game, leading to generators that can produce realistic images, audio, etc. For instance, StyleGAN produces high-resolution human faces that are often difficult to distinguish from real photos.

Graph Neural Networks (GNNs) extend deep learning to graph-structured data (social networks, molecules). They perform message passing: each node aggregates information from its neighbors to update its representation. Graph Convolutional Networks (GCNs) are a popular variant.

Example (GNN for molecule property prediction): Each atom is a node with features (atomic number, etc.), bonds are edges. A GCN processes the graph, and a readout function produces a fixed-size vector representing the whole molecule, which is then used to predict toxicity or solubility.
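
Message passing and readout can be illustrated on a three-atom graph. The sketch below is deliberately simplified: it uses mean aggregation over each node's neighbors plus itself and omits the learned weight matrices and non-linearities that a real GCN layer applies. The water-like molecule and its single feature (atomic number) are assumptions for the example.

```python
def message_passing_step(features, edges):
    """Replace each node's features with the mean over itself and its neighbors."""
    neighbors = {node: [node] for node in features}  # self-loop included
    for u, v in edges:                                # undirected bonds
        neighbors[u].append(v)
        neighbors[v].append(u)
    updated = {}
    for node, group in neighbors.items():
        dim = len(features[node])
        updated[node] = [sum(features[n][d] for n in group) / len(group)
                         for d in range(dim)]
    return updated


def readout(features):
    """Mean over all nodes: a fixed-size vector for the whole graph."""
    nodes = list(features.values())
    return [sum(f[d] for f in nodes) / len(nodes) for d in range(len(nodes[0]))]


# Hypothetical molecule: one oxygen bonded to two hydrogens.
atom_features = {"O": [8.0], "H1": [1.0], "H2": [1.0]}
bonds = [("O", "H1"), ("O", "H2")]
updated = message_passing_step(atom_features, bonds)
graph_vector = readout(updated)
print(updated, graph_vector)
```

After one step each hydrogen already carries information about the oxygen it is bonded to; stacking more steps spreads information further across the graph.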

Implementing deep learning models involves choosing a framework (PyTorch, TensorFlow) and managing the training pipeline. Key practical aspects:

These notes provide a comprehensive foundation for CS-510. Mastery comes from implementing these concepts, experimenting with architectures, and understanding the theoretical underpinnings that drive the practical success of deep learning.
