Normalization

ER modeling is one database design technique; normalization is another, complementary one.

Purpose of Normalization

  • Begins by examining the relationships (called functional dependencies) between attributes.
  • Uses a series of tests (called normal forms) to identify the optimal grouping of the attributes.
  • Removes repeating groups.
  • Removes unwanted functional dependencies.

Characteristics of a suitable set of relations include

  • attributes with a close logical relationship are found in the same relation;
  • minimal redundancy: each attribute is represented only once (except for the foreign keys).

Benefits of having a normalized database with well-designed relations

  • easier for users to access and maintain the data
  • less data duplication -> the data takes up less storage space -> minimizing costs
  • reducing the opportunities for data inconsistencies

How Normalization Supports Database Design

Data Redundancy

Major aim of relational database design is to group attributes into relations to minimize data redundancy.

Update Anomalies

Relations that contain redundant information may potentially suffer from update anomalies.

Insertion anomaly

  • Cannot insert a record with a null primary key.
  • May have to enter duplicate data.
  • Causes problems with missing data.
  • Cannot insert a non-key value without also supplying a primary key value.

Modification anomaly

  • Changing one attribute value may require changing the related information everywhere that attribute appears.
  • Multiple rows of the table must be changed when one value is updated.
  • An update of one attribute may have to be carried out on several tuples if the information is duplicated.

Deletion anomaly

  • Deleting one element of data may cause other data (including primary key values) to be lost.
  • Deleting the row that represents the last member of a specific group also removes all information about that group from the database.
  • For example, deleting the row for the last member of staff at a branch also removes all information about that branch from the database.

Example

Anomalies occur when a table contains facts about two or more different themes.
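
A hedged illustration (the table and sample rows are made up to match the staff/branch example used later, not taken from the original figure): a single StaffBranch table mixing staff facts with branch facts suffers from all three anomalies, because branch details must be repeated for every member of staff.

CREATE TABLE StaffBranch (
  staffNo    VARCHAR(10) PRIMARY KEY,   -- staff theme
  sName      VARCHAR(50),
  branchNo   VARCHAR(10),               -- branch theme, duplicated for every staff member
  branchAddr VARCHAR(100)
);

-- the branch address is repeated in every row for that branch (modification anomaly),
-- a branch cannot be recorded without a staff member (insertion anomaly),
-- and deleting the last staff member of a branch deletes the branch too (deletion anomaly)
INSERT INTO StaffBranch VALUES ('SL21', 'John White', 'B005', '22 Deer Rd, London');
INSERT INTO StaffBranch VALUES ('SG37', 'Ann Beech',  'B005', '22 Deer Rd, London');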

That's why we need normalization.

  • Every normalized relation has a single theme.
  • Normalization breaks up relations so that each one has a single theme.
  • The resulting relations show a high degree of data independence.

Functional Dependencies (FD)

Functional dependency (FD) is an important concept associated with normalization.

It is used to identify the primary key for a relation.

Definition of Functional dependency (FD)

  • Assume that A and B are attributes of a relation.
  • If each value of A is associated with exactly one value of B,
  • then B is functionally dependent on A (denoted A → B),
  • and A is the determinant of B.
  • The relationship must be one-to-one from each value of A to a value of B.
  • A one-to-many relationship is acceptable provided the "one" side is the primary key.

Partial Dependencies (PD)

  • If B is functionally dependent on a proper subset of a composite determinant A, the dependency is called a partial dependency.

    E.g. if {A,B} → {C} but also {A} → {C}, then {C} is partially functionally dependent on {A,B}.

    e.g. {staffNo, sName} → {branchNo}

    but also {staffNo} → {branchNo}

    Therefore {branchNo} is partially functionally dependent on {staffNo, sName}.

Characteristics of Functional Dependencies

  • There is a one-to-one relationship between the attribute(s) on the left-hand side (determinant) and those on the right-hand side of a functional dependency.
  • Holds for all time.
  • The determinant has the minimal number of attributes necessary to maintain the dependency with the attribute(s) on the right-hand side.

Transitive Dependencies

  • describes a condition where A, B, and C are attributes of a relation
    • such that if A → B and B → C, then C is transitively dependent on A via B
    • A→B, B→C, then A→C
      • e.g. orderNo and supplierNo and supplier-name
      • orderNo -> supplierNo, supplierNo-> supplier-name, Then orderNo -> supplier-name

It is important to recognize a transitive dependency, because its existence in a relation can potentially cause update anomalies.

The Process of Normalization

Normalization is a formal technique for analyzing a relation based on its primary key and the functional dependencies between the attributes of that relation. Often executed as a series of steps. Each step corresponds to a specific normal form, which has known properties.

As normalization proceeds, the relations become progressively more restricted (stronger) in format and also less vulnerable to update anomalies.

  • resulting files show a high degree of data independence.

Unnormalized Form (UNF)

Basically: There are multivalued attributes or repeating groups

A table that contains one or more repeating groups.

Example:

Normalization : 1NF to 3NF

Normalization Tool

First Normal Form (1NF)

non-keys depend on key

Removed all repeating groups

  • Remove repeating groups by entering appropriate data into the empty rows.

Basically: No multivalued attributes or repeating groups

each intersection of a row and column contains one and only one value.

Example:
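
A minimal sketch (hypothetical table, columns, and rows, since the original example figure is not shown): a customer with several phone numbers is brought into 1NF by entering the repeating phone numbers as extra rows, so that each row-and-column intersection holds exactly one value.

-- 1NF: one value per row and column; customer data is repeated for each phone number
CREATE TABLE CustomerPhone (
  custNo   INT,
  custName VARCHAR(50),
  phone    VARCHAR(20),
  PRIMARY KEY (custNo, phone)
);

INSERT INTO CustomerPhone VALUES (1, 'Mary Hill', '111-1111');
INSERT INTO CustomerPhone VALUES (1, 'Mary Hill', '222-2222');  -- repeating group entered as an extra row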

Second Normal Form (2NF)

non-keys depend on the whole key

  • Removed partial dependencies

Basically: 1NF + No partial dependencies

1NF + each non-key attribute must be fully dependent on the whole primary key

Achieved by splitting the table

To convert 1NF to 2NF:

  1. Identify Primary key for the 1NF relation
  2. Identify the functional dependencies in the relation
  3. If partial dependencies exist on the primary key, remove them by placing them in a new relation along with a copy of their determinant. (Split the table.)

Example:

1NF form: {SalespersonNo, ProdNo} -> {everything}

Remove the partial dependencies from the 1NF table and form them into new tables according to their determinants.

Note that you need to keep a copy of the primary-key attributes in the original table, then rename the original table.

Now you have 3 tables, and the design is in 2NF with no partial dependencies (a hedged SQL sketch follows).
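
A hedged SQL sketch of the split, assuming hypothetical non-key attributes (SalespersonName, DeptNo, ManagerName, ProdName, Quantity) for the 1NF relation keyed on {SalespersonNo, ProdNo}:

-- attributes that depended only on SalespersonNo (a partial dependency)
CREATE TABLE Salesperson (
  SalespersonNo   INT PRIMARY KEY,
  SalespersonName VARCHAR(50),
  DeptNo          INT,
  ManagerName     VARCHAR(50)
);

-- attributes that depended only on ProdNo (a partial dependency)
CREATE TABLE Product (
  ProdNo   INT PRIMARY KEY,
  ProdName VARCHAR(50)
);

-- the renamed original table keeps the composite key plus the fully dependent attributes
CREATE TABLE Sale (
  SalespersonNo INT,
  ProdNo        INT,
  Quantity      INT,
  PRIMARY KEY (SalespersonNo, ProdNo),
  FOREIGN KEY (SalespersonNo) REFERENCES Salesperson(SalespersonNo),
  FOREIGN KEY (ProdNo) REFERENCES Product(ProdNo)
);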

Third Normal Form (3NF)

non-keys depend on the whole key, nothing but the key

  • Removed transitive dependencies

Basically: 2NF + no transitive dependencies (split the table)

1NF + 2NF + none of the non-primary-key attributes is transitively dependent on the primary key

  • No (non-key→non-key) via transitive dependence

To convert 2NF to 3NF:

  1. Identify Primary key for the 2NF relation
  2. Identify the functional dependencies in the relation
  3. If transitive dependencies exist on the primary key, remove them by placing them in a new relation along with a copy of their determinants. (Split the table.)

Example:

Take the previous 2NF example.

We can see there is a non-key → non-key dependency in the Salesperson table:

{SalespersonNo} → {DeptNo}, {DeptNo} → {ManagerName}

Therefore {SalespersonNo} → {ManagerName}, i.e. ManagerName is transitively dependent on SalespersonNo via DeptNo.

This violates 3NF.

Move the transitively dependent attribute into a new table, keeping a copy of its determinant (DeptNo) as a foreign key in the original table.

Now you have 4 tables, and the design is in 3NF with no transitive dependencies (a hedged SQL sketch follows).
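
Continuing the same hypothetical sketch, the transitive dependency DeptNo → ManagerName is moved into a new Department table, and DeptNo stays in Salesperson as a foreign key:

-- new table for the transitive dependency DeptNo -> ManagerName
CREATE TABLE Department (
  DeptNo      INT PRIMARY KEY,
  ManagerName VARCHAR(50)
);

-- copy the existing department facts, then drop the transitively dependent column
INSERT INTO Department (DeptNo, ManagerName)
  SELECT DISTINCT DeptNo, ManagerName FROM Salesperson;

ALTER TABLE Salesperson
  DROP COLUMN ManagerName,
  ADD FOREIGN KEY (DeptNo) REFERENCES Department(DeptNo);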

Data Warehousing

To set up a data warehouse, you take all the data, make it consistent, and put it in a single location in a suitable format.

NOT IN 3NF (Big Data)

De-normalising Data

  • The data should be de-normalised to 2NF.
  • Data redundancy occurs in 2NF.
  • More storage is needed.

Why?

  • You can get the data more quickly
  • The purpose of a DataWarehouse is to provide aggregate data which is in a suitable format for decision making.

Business Intelligence (BI)

BI refers to the processes for

  • collecting and analysing data
  • with the purpose of facilitating corporate decision making.

In order to make the business better.

Data warehousing is one of the key technologies for implementing BI.

How?

BI helps enterprise users make better business decisions by:

  • Understanding the health of the organization
    • The whats and hows
  • Collaborating on the same view of data and business logic
  • Reducing the time to make decisions (automation)

Challenges in Delivery

  • Large volume of related data
  • Variety of data formats
  • Location of data
  • Data are not easily queried
  • Data are not optimized for analytical queries
  • Data are not accessible
  • Data definitions are not consistent
  • Users may not have the skills, tools, or permission to query the data

Ideally, BI is

  • Empowerment – directly usable
  • Fast - responsive
  • Timely - available
  • Accurate – trusted with quality
  • Usable – has value

Relationship between BI and DW

OLTP VS OLAP

| Online transaction processing (OLTP) | Online analytical processing (OLAP) |
| --- | --- |
| Designed to maximize transaction throughput | Designed to hold historical data for analyses |
| Data stored in normalized format | Data stored in denormalized format |
| DB size is small (around 100 MB to 100 GB) | DB size is huge (around 100 GB to a few TB) |
| Used by normal users | Used by decision makers |
| Slow query response if the amount of data is very large | Quicker query response, generates reports faster |

Data Warehousing Concepts

A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process (Inmon, 1993).

The Goal of Data Warehousing

To integrate corporate data into a single repository from which users can easily run queries, produce reports, and perform analysis (for making business decisions).

Subject-Oriented

  • warehouse is organized around the major subjects of the enterprise
  • not organized around the business operations
  • This is reflected in the need to store decision-support data rather than application-oriented data.

Integrated Data

  • integrates corporate application-oriented data from different source systems
    • often includes data that is inconsistent (different format)
  • integrated data source must be made consistent
    • to present a unified view of the data to the users

Time Variant

  • data in a data warehouse is identified with a particular time period
  • data in data warehouse provide information from a historical point of view

Non-volatile Data

  • Data in the warehouse is not normally updated in real-time (RT)
  • It is refreshed from operational systems on a regular basis
  • New data is always added as a supplement to the database, rather than a replacement.

However, emerging trend is towards real-time or near real-time DWs

Benefits of Data Warehousing

  • Potential high returns on investment (ROI)
  • Competitive advantage
    • Better business decisions
    • Reveal of untapped information
  • Increased productivity of corporate decision-makers
    • allows corporate decision makers to perform more accurate and consistent analysis

OLTP VS Data Warehousing

  • DBMS built for online transaction processing (OLTP)
    • usually not suitable for data warehousing.

OLTP is designed to maximize transaction throughput.

DW is designed to support ad hoc queries for business decision making.

Therefore:

  • OLTP is for making transactions.
  • DW is for making decisions.

|  | OLTP | Data Warehousing |
| --- | --- | --- |
| Main purpose | Support operational processing | Support analytical processing |
| Data age | Current | Historic |
| Data latency | Real-time | Depends on length of cycle for data supplements to warehouse |
| Data granularity | Detailed data | Detailed data, lightly and highly summarized data |
| Data processing | Predictable pattern | Less predictable pattern |
| Reporting | Predictable, relatively static fixed reporting | Unpredictable, dynamic reporting |
| Users | Serves large number of operational users | Serves lower number of managerial users |

Data Warehouse Architecture

Operational Data Sources

  • Main sources are online transaction processing (OLTP) databases.
  • include sources such as personal databases, spreadsheets, and web usage log files.
  • Holds current and integrated operational data for analysis

ETL Manager

ETL stands for extraction, transformation, and loading.

The ETL manager performs the extraction and loading of data into the warehouse.

  • Data for a DW must be extracted from one or more data sources
  • transformed into a form that is easy to analyze
    • consistent with data already in the warehouse
  • then finally loaded into the DW

Warehouse Manager

The warehouse manager manages data in the warehouse.

  • Analysis of data to ensure consistency
  • Transformation and merging of source data from temporary storage into data warehouse tables
  • Creation of indexes and views on base tables
  • Generation of denormalizations if necessary
  • Generation of aggregations if necessary
  • Backing-up and archiving data

Query Manager

The query manager manages user queries.

  • Directing queries to the appropriate tables and scheduling the execution of queries
  • generates query profiles to allow the warehouse manager
    • to determine which indexes and aggregations are appropriate
    • for speeding up future queries

End-User Access Tools

End-user access tools include:

  • Traditional reporting and query
  • OLAP
  • Data mining

Main purpose of DW is to support decision makers.

This is achieved through the provision of a range of access tools including:

  • reporting and querying,
  • application and development,
  • Online Analytical Processing (OLAP)
  • Data mining

ETL and Data Marts

ETL (Extraction, Transformation and Loading)

  • E - Get the Data
  • T - Make it useful
  • L - Save it to the warehouse

Extraction

  • Targets one or more data sources
    • typically include OLTP databases
    • but can also include personal databases and spreadsheets,
    • Enterprise Resource Planning (ERP) files, and web usage log files.
  • data sources are normally internal
    • but can also include external sources
    • such as the systems used by suppliers and/or customers

Transformation

  • Applies a series of rules or functions to the extracted data
    • determines how the data will be used for analysis
    • can involve transformations such as
      • data summations
      • data encoding
      • data merging
      • data splitting
      • data calculations
      • creation of surrogate keys.

Loading

Data loading involves (a short SQL sketch follows the list):

  • sorting
  • summarizing
  • consolidating
  • checking integrity
  • building indices
  • partitions (for speeding up queries).
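
A minimal MySQL sketch of the loading step, assuming hypothetical staging and warehouse tables: data is summarized and consolidated into a warehouse table, and an index is built to speed up queries.

-- consolidate and summarize staging data into the warehouse table
INSERT INTO dw_sales_summary (year, country, total_profit)
SELECT year, country, SUM(profit)
FROM staging_sales
GROUP BY year, country;

-- build an index to speed up analytical queries
CREATE INDEX idx_sales_year ON dw_sales_summary (year);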

Data Marts (Sub-sets of the DataWarehouse)

  • Don’t mess with my data.
  • Keep it simple for the user.
  • Small problems are easier to solve.

Usage of Data Marts

  • give users access to the data they need to analyze most often
  • provide data in a form that matches the collective view of the data by a group of users in a department or business application area
  • improve end-user response time due to the reduction in the volume of data to be accessed
  • provide appropriately structured data as dictated by the requirements of the end-user access tools
  • Building a data mart is simpler compared with establishing an enterprise-wide DW (EDW)
  • The cost of implementing data marts is normally less than that required to establish an EDW
  • The future users of a data mart are more easily defined and targeted to obtain support for a data mart than an enterprise-wide data warehouse project

Data Marts vs Data Warehouse

|  | Data Warehouse | Data Mart |
| --- | --- | --- |
| Scope | Corporate | Line of Business (LOB) |
| Subject | Multiple | Single subject |
| Data Sources | Many | Few |
| Size (typical) | 100 GB to TB+ | < 100 GB |
| Implementation Time | Months to years | Months |

Data Warehousing Design

Design the Dimensional Model (star schema)

The design is to provide instantaneous query results for analysts.

A dimensional model has (see the SQL sketch after this list):

  • 1 fact table
    • In the middle of the star schema is the fact table.
    • The fact table contains measures.
  • a number of dimension tables
    • Dimension tables contain textual descriptions.
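
A hedged sketch of a star schema in SQL (hypothetical table and column names): a fact table holding measures, surrounded by dimension tables holding textual descriptions.

-- dimension tables: textual descriptions
CREATE TABLE DimDate (
  dateKey INT PRIMARY KEY,
  year    INT,
  month   INT
);

CREATE TABLE DimProduct (
  productKey INT PRIMARY KEY,
  prodName   VARCHAR(50),
  category   VARCHAR(50)
);

-- fact table: measures, with foreign keys pointing at the dimensions
CREATE TABLE FactSales (
  dateKey     INT,
  productKey  INT,
  quantity    INT,
  salesAmount DECIMAL(10,2),
  PRIMARY KEY (dateKey, productKey),
  FOREIGN KEY (dateKey) REFERENCES DimDate(dateKey),
  FOREIGN KEY (productKey) REFERENCES DimProduct(productKey)
);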

OLAP Cube

An OLAP cube stores data in an optimized way to provide a quick response to various types of complex queries by using dimensions and measures.

  • Once the cube is created, we may perform queries on it instead of on the original data sources.

Example:

How to Design an OLAP Cube

  1. Identify the attributes (usually 3) -> determine the dimensions.
  2. Identify how many distinct values each attribute has -> determine the length of each dimension.

MySQL: WITH ROLLUP

The SQL WITH ROLLUP modifier adds extra rows to the output that represent higher-level summary operations; with multiple columns in the GROUP BY clause, multiple summary rows are produced.

Example:

Let's say we have a table sales.

CREATE TABLE sales
(
year INT,
country VARCHAR(20),
product VARCHAR(32),
profit INT
);

With a single column in the GROUP BY clause, the ROLLUP modifier adds one extra row that represents the higher-level summary:

SELECT year, SUM(profit) AS profit
FROM sales
GROUP BY year WITH ROLLUP;

Output:

With multiple columns in the GROUP BY clause, WITH ROLLUP adds multiple summary rows:

SELECT year, country, product, SUM(profit) AS profit
FROM sales
GROUP BY year, country, product WITH ROLLUP;

Output:

Database Security

What Is Database Security

The mechanisms that protect the database against intentional or accidental threats.

Database security encompasses:

  • hardware
  • software
  • people
  • data

Why Care About Database Security

A database represents an essential corporate resource that should be properly secured using appropriate controls.

We consider database security in relation to the following situations:

  • theft and fraud
  • loss of confidentiality (secrecy)
  • loss of privacy
  • loss of integrity
  • loss of availability

Theft and fraud

  • affect not only the database environment but also the entire organization
  • attention should focus on reducing the opportunities for this occurring

Loss of Confidentiality and Privacy

  • privacy refers to the need to protect data about individuals
  • loss of confidentiality could lead to loss of competitiveness
  • loss of privacy could lead to legal action being taken against the organization

Loss of Integrity and Availability

  • Loss of data integrity results in invalid or corrupted data (data corruption)
    • seriously affect the operation of an organization
  • Loss of availability means that the data, or the system, or both cannot be accessed
    • seriously affect an organization’s financial performance.

Threats

An organization should invest time and effort in identifying the most serious threats.

And then initiate appropriate plans and countermeasures.

Example of Threats:

Potential threats to computer systems

Data Security Best Practices

  • Principle of Least Privilege
  • Stay up-to-date on patches
  • Remove/disable unneeded default accounts
  • Firewalling/Access Control
  • Running database processes under a dedicated non-privileged account
  • Password Security

Principle of Least Privilege

Do not give accounts privileges that aren’t needed.
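
A minimal MySQL sketch (the account and database names are made up): a reporting account is given read-only access to a single database instead of broad privileges.

-- create a reporting account and grant only what it needs
CREATE USER 'report_user'@'localhost' IDENTIFIED BY 'a-strong-password';
GRANT SELECT ON salesdb.* TO 'report_user'@'localhost';
-- no INSERT, UPDATE, DELETE, or DROP privileges are granted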

Firewalling/Access Control

  • Throttling connections
    • Use firewall software like IPTables
    • Don’t let the world connect to your database server

Access Control in MySQL

To disable remote access to the MySQL server, add the following statement to the [mysqld] section of my.cnf (Linux/Mac) or my.ini (Windows):

skip-networking

Access Control to Error Log:

Ensure only “root” and “mysql” have access to the logfile <hostname>.err. The file is stored in the mysql data directory. This file contains very sensitive information such as passwords, addresses, table names, stored procedure names and code parts.

Password Security

  • Change password on a regular basis
  • Don’t use/reuse the same password
  • Categorize your password
  • Don’t use the same password for Gmail and banking
  • Avoid using words found in a dictionary

Make password more secure:

  • Make password long and complex
    • At least 8 characters
    • Use uppercase/lowercase, numbers, and special characters
  • Don't reuse passwords with minor iterations (e.g. password1, password2)
  • Avoid using personal information
  • Never store passwords in obvious place
  • Don’t go to a secure site using public PC
  • Never give password to anybody
    • Aware of social engineering

Password Security in MySQL

To rename the root account in the MySQL console (after logging in as root):

USE mysql;
UPDATE user SET user = 'admin' WHERE user = 'root';
UPDATE user SET password = PASSWORD('new password') WHERE user = 'admin';
FLUSH PRIVILEGES;

Restart the server after changing the root username and password; you can then log in as the admin user with full privileges.

The “LOCAL INFILE” command can be used to gain access to other files on the operating system, for instance “/etc/passwd” in Linux or C:\Windows\win.ini in Windows.

To disable "LOCAL INFILE”, add the following parameter to the [mysqld] section of the MySQL configuration file (my.ini):

local-infile=0

SQL Injection

SQL injection is a technique where malicious users can inject SQL commands into SQL statements via web page input.

A user input is used as part of the SQL statement.

  • SQL injection is a code injection technique that might destroy your database.
  • SQL injection is one of the most common web hacking techniques.
  • SQL injection is the placement of malicious code in SQL statements, via web page input.

Hacking with SQL Injection

  • Exploit an unsanitized field
  • Complete the single quote and closing bracket
  • Make use of the statement terminator ; to add an extra SQL statement
    • Could be catastrophic, resulting in a trapdoor
  • Make use of the comment symbols #, --, or /* */
  • Inject OR 1=1 for numeric input, since 1=1 is always true
  • Inject " or ""=" for string input, since ""="" is always true

Note:

Comment Syntax available in MySQL:

# This is a MySQL comment
-- This is a standard SQL comment
/* This is a multiline comment */

Injection with commenting

Example:

Original Code:

SELECT * from UserTbl WHERE username = 'fred' and password = 'xyz';

Where fred and xyz are the inputs.

Injection Input:

When a hacker knows the username of a user, they can put the following string into the username field.

fred'--

Injected Code:

SELECT * FROM UserTbl WHERE username ='fred'--' and password = '';

-- This query is equivalent to:
SELECT * FROM UserTbl WHERE username = 'fred';

Injection based on 1=1 is Always True

Example:

Original Code:

SELECT ProdName FROM Product WHERE Stock=12;

Where 12 is the input.

Injection Input:

12 or 1=1

Injected Code:

SELECT ProdName FROM Product WHERE Stock=12 or 1=1;

-- This query is equivalent to:
SELECT ProdName FROM Product;

Injection using the statement terminator ; to add an extra SQL statement

Example:

Original Code:

SELECT ProdName FROM Product WHERE Stock=12;

Where 12 is the input.

Injection Input:

12 or 1=1; SELECT * from Product;

Injected Code:

SELECT ProdName FROM Product WHERE Stock=12 or 1=1; SELECT * from Product;

-- This query is equivalent to:
SELECT * from Product;

Injection based on ""="" is Always True

Example:

Original Code:

SELECT * FROM Customer WHERE FirstName = "Mary" AND LastName = "Hill";

Where Mary and Hill are the inputs.

Injection Input:

" or ""="

Injected Code:

SELECT * FROM Customer WHERE FirstName = "" or ""="" AND LastName = "" or ""="";

-- This query is equivalent to:
SELECT * FROM Customer;

Preventing SQL Injection

To prevent SQL injection, developers should

  • stop writing dynamic queries (Impossible)
  • prevent user supplied input which contains malicious SQL from affecting the logic of the executed query.

Prepared statements

It is a Defense Technique.

Key point: use parameterized queries.

Use the user input as a parameter.

Prepared statements ensure that an attacker is not able to change the intent of a query, even if SQL commands are inserted by an attacker.

A small example in Java to prevent SQL injection:

String custNo = request.getParameter("CustNo");
String query = "SELECT * FROM Customer WHERE CustNo = ?"; // ? indicates the parameter
PreparedStatement pstmt = conn.prepareStatement(query);
pstmt.setString(1, custNo); // bind the first parameter to custNo
ResultSet res = pstmt.executeQuery();

The problem with SQL injection is that user input is used as part of the SQL statement. By using prepared statements, you force the user input to be handled as the content of a parameter (and not as part of the SQL command).

But if you don't use the user input as a parameter for your prepared statement, and instead build your SQL command by joining strings together, you are still vulnerable to SQL injection even when using prepared statements.
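
Parameterization is not limited to application code. As a hedged sketch reusing the Customer/CustNo names from the Java example above, MySQL's server-side prepared statements also treat the bound value purely as data:

PREPARE stmt FROM 'SELECT * FROM Customer WHERE CustNo = ?';
SET @custNo = '12345';        -- the value is bound as data, never spliced into the SQL text
EXECUTE stmt USING @custNo;
DEALLOCATE PREPARE stmt;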

Big Data

Big Data relates to data creation, storage, retrieval and analysis that is remarkable in terms of Volume, Velocity, and Variety. (3Vs of Big Data)

Big Data has multiple definitions. 3Vs, 5Vs, and even 42Vs

5Vs of Big Data

Volume

  • the incredible amounts of data generated each second
  • constantly-updated data feeds containing information
  • Terabytes to exabytes of existing data to process

Velocity

  • the speed at which vast amounts of data are being generated, collected and analyzed
  • generate massive log data in real-time
  • Streaming data, milliseconds to seconds to respond

Variety

  • the different types of data
  • not just numbers and strings, but even 3D data, audio and video
  • Structured, unstructured, text, multimedia

Veracity

  • the quality or trustworthiness of the data
  • Accuracy of data

Value

  • the worth of the data being extracted

Source of Big Data

  • Social media and networks
    • all of us are generating data
  • Scientific instruments
    • collecting all sorts of data
  • Mobile devices
    • tracking all objects all the time
  • Sensor technology and networks
    • Measuring all kinds of data

Innovation is no longer hindered by the ability to collect data but by the ability to manage, analyze, summarize, visualize, and discover knowledge from data in a timely and scalable manner.

Storing Big Data

Traditional SQL vs Unstructured

Traditional SQL Databases

Traditional SQL databases consist of relational tables with a fixed schema.

ACID Rules of Traditional SQL Databases (a transaction sketch follows the list):

  • Atomic
    • A transaction is a logical unit of work which must be either completed with all of its data modifications, or none of them is performed.
  • Consistent
    • At the end of the transaction, all data must be left in a consistent state
  • Isolated
    • Modifications of data performed by a transaction must be independent of another transaction.
  • Durable
    • When the transaction is completed, effects of the modifications performed by the transaction must be permanent in the system.
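
A minimal SQL sketch of atomicity (hypothetical Account table): both updates take effect together at COMMIT, or neither does if the transaction is rolled back.

START TRANSACTION;
UPDATE Account SET balance = balance - 100 WHERE accNo = 1;
UPDATE Account SET balance = balance + 100 WHERE accNo = 2;
COMMIT;  -- all modifications become permanent together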

Disadvantage of Traditional SQL Databases:

  • very complicated to Join tables that are distributed across multiple machines
  • Difficult to deploy a big database on a distributed network
    • because dividing a relational database across multiple machines is nontrivial.
  • Maintaining a state of data consistency can become tricky on a distributed database
  • not designed to store a huge amount of data
  • when data volume grows, reading from and writing to a relational DB while maintaining ACID compliance can be computationally expensive

Structured Database (SQL Databases)

  • can be easily organized, stored and updated in a defined data model
  • e.g. numbers/text
  • have a consistent format (schema)
  • data can be processed, searched, queried, combined (joined)
    • analyzed straightforwardly using SQL

Unstructured Databases

  • do not have a defined data model or common format
  • massive datasets consist of unstructured data

Why need it:

  • Too much data
  • growing fast data volume
  • not consistent
  • critical time

To tackle the above problems, unstructured databases and NoSQL were introduced.

NoSQL (Not only SQL)

  • Non-relational database management systems (no JOIN operation)
  • Designed for storing large-scale unstructured data across multiple servers (nodes)
  • Type of data storing may not require fixed schema
  • Used by many IT giants: Google, Facebook, LinkedIn, etc.
  • A database infrastructure well-adapted to big data, easy to scale out

We need NoSQL because of the huge data growth.

NoSQL vs RDBMS

NoSQL vs SQL

| NoSQL | SQL (RDBMS) |
| --- | --- |
| If most data is correct, everything is OK | Integrity is mission-critical |
| Data format unknown or inconsistent | Data format consistent and well-defined |
| Data will be replaced | Data is of long-term value |
| Write once, read multiple times | Frequent data updates |
| Unpredictable / exponential growth | Predictable, linear growth |
| Only programmers writing queries | Non-programmers writing queries |
| Replication | Regular backup |
| Sharding | Access through master server |
| Key-value pair storage, document store | Data stored in separate tables |
| ACID not guaranteed | Follows the ACID rules |
| Highly scalable | Not easy to scale |

5 Advantages of NoSQL

  • Elastic Scaling
  • Good for Big Data
  • Require Less Management
  • Economics
  • Flexible Data Models

Key/Value Stores

A key-value store looks like a big hash table in which each record in the table is an object identified by a unique key.

  • The system does not know anything about the data being stored.
  • records are addressed by unique keys.
  • retrieve information based on a single key.

Examples: LinkedIn’s Voldemort, Apache Cassandra

Document Stores

Similar to key-value stores but with a major difference:

  • values are documents

Document stores have flexible schema in that any number of fields can be added to a document without affecting the other documents.

Unlike key-value stores, where queries are only allowed against the keys, document stores allow queries against the contents of documents.

Example implementation: CouchDB, MongoDB

MapReduce

A programming framework for writing applications that process vast amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

  • An Algorithm or concept to process Huge amount of data in a faster way
    • Divide it, then Map and Reduce
  • MapReduce opts for scaling out rather than scaling up

    • Scale Out: Adding more low-end servers as need.

    • Scale Up: Replace the existing server by a bigger (more expensive) one when demand is high

      Scale out is much cheaper

MapReduce provides

  • Automatic parallelization and distribution
  • Fault-tolerance
  • Status and monitoring

MapReduce tasks

MapReduce applications have two separate and distinct tasks:

Map and Reduce.

  • Mapper
    • Takes input data and processes it to produce key/value pairs
    • Each mapper works on its own portion of the input files

The key/value pairs are then fed to the Reducer.

  • Reducer
    • Takes those key/value pairs and combines or aggregates them to produce the final result (typically a single value per key).

Hadoop

Hadoop is an implementation of the MapReduce framework.

  • Hadoop is open source, but Google MapReduce is not.

Hadoop is implemented in Java.

It allows programmers to write parallel programs that execute on a cluster with thousands of nodes (servers).

Hadoop consists of :

  • Hadoop Common package, which provides file system and OS level abstractions;
  • a MapReduce engine
  • the Hadoop Distributed File System (HDFS).

Hadoop Distributed File System (HDFS)

  • Files are split into 64 MB or 128 MB blocks
  • Blocks are replicated across nodes
  • Namenodes (masters) store filenames and block locations
  • Datanodes (workers) store the data blocks

Flow of Data in MapReduce

  1. Several mappers read the input from HDFS (discussed later), each taking a portion
  2. The mappers apply the map function and write the output to disk.
  3. Optionally, combiners partially reduce mapper output.
  4. The intermediate data is transferred across the network from mappers to reducers and stored to disk.
  5. Each reducer groups intermediate values together by keys
  6. The reducers read the sorted data from disk and apply the reduce function
  7. Output is written to disk

Hadoop vs RDBMS

|  | Hadoop | Traditional RDBMS |
| --- | --- | --- |
| Data size | Petabytes (10^15 bytes, or 1024^5) | Gigabytes |
| Access | Batch | Interactive and batch |
| Update | Write once, read many times | Read/write many times |
| Data Type | Unstructured (dynamic schema) | Structured (static schema) |
| Integrity | Low | High |
| Scaling | Scale out | Scale up |

Hadoop:

  • Affordable Storage / Computer
  • Structured or Unstructured
  • Resilient Auto Scalability

Relational Databases:

  • Interactive response times
  • ACID
  • Structured data
  • Cost/Scale prohibitive

NoSQL on Hadoop

  • NoSQL databases can be built on top of a Hadoop cluster.
  • HBase is a NoSQL database that runs on top of a Hadoop cluster and provides random real-time read/write access to data.
  • When you want to randomly access key-value stores, use HBase.
  • When you want to batch process a large volume of data in parallel, use Hadoop, which provides HDFS and implements the MapReduce processing framework.